Vacancy is archived. Applications are no longer accepted.

Senior Software Engineer – AI Infrastructure And Tooling

at NVIDIA

📍 Santa Clara, United States

SENIOR

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Security @ 4, Go @ 6, Kubernetes @ 6, Linux @ 6, Prometheus @ 4, DevOps @ 4, Terraform @ 6, Python @ 6, Distributed Systems @ 7, Machine Learning @ 4, AWS @ 4, SRE @ 4, API @ 6, GPU @ 4

Details

We are looking for a highly motivated expert in AI infrastructure automation and tools development to join us. As a seasoned professional with a strong passion for designing and implementing cutting-edge infrastructure solutions, you will play a key role in architecting and driving advancements in our large-scale cloud and on-premises computing clusters. We are a small, fast-moving team, and we own the production excellence of everything we develop, at every layer from the OS up to the services. Please apply if you are passionate about operational reliability, building AWS infrastructure automation and deployment tools, and working on new technologies and cloud-native applications. The solutions you propose and build will directly impact the efficiency of the NVIDIA Autonomous Vehicles development team!

Responsibilities

Applying strong programming skills and a deep understanding of distributed systems design to craft and build production-grade software.
Designing and implementing Continuous Deployment (CD) pipelines to ensure flawless and efficient software delivery.
Owning the big picture of how systems interrelate, using a broad range of tools and approaches to tackle diverse problems.

Requirements

BS or MS in CS/CE/EE or equivalent experience.
4+ years of experience developing Kubernetes-based computing platform tooling/APIs.
At least 4 years of experience building cloud automation software with Terraform, Python, and Go.
Strong AWS fundamentals, including IAM, VPC, RDS, S3, CDN, and EC2.
Expert knowledge of DevOps principles, tools, and methodologies.
Experience with Continuous Deployment (CD) pipelines.
Good understanding of Traffic Engineering solutions, including Load Balancing and Layer 7 proxies.
In-depth understanding of all layers of Internet protocols.
Operational expertise with Observability, Prometheus ecosystem, and logs ingestion at scale.
Proficiency in Linux environments.
Excellent written, verbal, and interpersonal communication skills.
Motivated team player who enjoys challenges and celebrates success.

Ways to Stand Out

Experience building sophisticated tooling and SRE automation on large GPU/CPU clusters.
Experience with Agentic AI tools for computing infrastructure management.
Experience with Artifactory Management at scale.
Good understanding of cloud and datacenter security concepts, preferably AWS.
Solid understanding of large-scale Kubernetes observability platforms.

Benefits

Base salary range: 184,000 USD - 356,500 USD, determined by location, experience, and pay of similar positions.
Eligibility for equity and benefits.
Opportunity to join NVIDIA, a leader in AI, machine learning, and data center acceleration.
Commitment to fostering diversity and being an equal opportunity employer.