Senior DevOps and SRE Engineer

at Nvidia

📍 Santa Clara, United States

USD 168,000-333,500 per year

SENIOR

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Security @ 4 Ansible @ 3 Chef @ 3 Go @ 4 Grafana @ 4 Jenkins @ 4 Kubernetes @ 4 Linux @ 4 MySQL @ 4 Prometheus @ 4 DevOps @ 4 Kibana @ 4 Terraform @ 3 Python @ 4 SQL @ 4 GitHub @ 4 GitHub Actions @ 4 NoSQL @ 4 CI/CD @ 4 Algorithms @ 7 MongoDB @ 4 Networking @ 4 SRE @ 4 Planning @ 4 Microservices @ 4 Android @ 4 Splunk @ 4 Puppet @ 3 Cassandra @ 4

Details

NVIDIA is seeking a passionate, motivated, and technical Kubernetes Architect/Engineer to join its multifaceted and fast-paced Infrastructure, Planning and Processes organization as a Principal DevOps & SRE Engineer. The role supports the design and implementation of Kubernetes solutions for the company's Cloud Platform.

The team develops and maintains sophisticated build & test environments for multiple hardware platforms including NVIDIA GPUs and Tegra Processors, across various operating systems such as Windows, Linux, and Android. Collaboration occurs with several NVIDIA Software business units including Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence, Robotics, and Autonomous Cars.

Responsibilities

Architect, design, implement, and maintain Kubernetes environments, from planning through production deployment, to support CI/CD pipelines using GitLab, Jenkins, and GitHub Actions.
Design solutions involving service discovery, networking, monitoring, logging, and scheduling in Kubernetes.
Ensure the platform is easy to use, reliable, scalable, and resistant to disruptions.
Enable developers to deliver value with high stability and security.
Participate actively in product workshops, roadmap planning, and design sessions, and lead technical demos, whiteboard sessions, and working sessions.
Defend architectural designs before the DevSecOps review board.
Develop automation to improve efficiency and productivity.
Participate in on-call support and critical issue resolution as a Site Reliability Engineer.
Engage in prototyping, crafting, and developing cloud infrastructure for NVIDIA.

Requirements

Expertise in Kubernetes with extensive experience building scalable and resilient platforms in both public and private clouds.
High proficiency in administering and configuring Kubernetes.
Programming skills in Python, Go, or similar languages.
Experience maintaining cloud infrastructure and highly available production environments.
Ability to automate processes using CI/CD tools, and familiarity with Configuration-as-Code and Infrastructure-as-Code tools such as Ansible, Puppet, Chef, and Terraform.
Strong background with CI/CD systems like GitLab, Jenkins, and GitHub Actions, and with artifact management using Artifactory.
Experience with SQL and NoSQL databases (MySQL, Elasticsearch, MongoDB, Cassandra).
Experience with customer management/onboarding, data analytics/visualization, and monitoring tools like Kibana, Grafana, Splunk, Zabbix, and Prometheus.
Over 8 years of proven experience.
Bachelor's or Master's degree in Computer Science, Software Engineering, or equivalent.

Ways to Stand Out

Solid understanding of containerization and microservices architecture.
Certifications such as Certified Kubernetes Administrator (CKA), Certified Kubernetes Security Specialist (CKS), and Certified Kubernetes Application Developer (CKAD) preferred.
Ability to thrive in a fast-paced, multi-tasking environment with evolving priorities.
Competence in analyzing complex problems and designing simple, efficient systems.
Prior experience with large-scale operations teams and improving data centers.
Strong background in computer algorithms and scaling solutions.

Competitive salaries and generous benefits are offered. NVIDIA is an equal opportunity employer committed to fostering a diverse work environment.