Senior Systems Engineer, Storage - DGX Cloud

at NVIDIA

📍 World
📍 Canada
📍 United States

USD 208,000-414,000 per year

SENIOR

✅ Remote

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. The levels mean:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose.

3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology.

7-9: expert. You can teach others and you know the pitfalls and tricks.

10: exceptional knowledge. Comprehensive understanding of and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Ansible @ 4, Chef @ 4, Docker @ 4, Go @ 4, Grafana @ 4, Kubernetes @ 4, Linux @ 4, Prometheus @ 4, Terraform @ 4, Python @ 6, Java @ 6, CI/CD @ 4, Algorithms @ 4, ArgoCD @ 4, Data Structures @ 4, Distributed Systems @ 7, Git @ 4, Helm @ 4, OpenStack @ 4, Debugging @ 7, Puppet @ 4, Observability @ 4, AI @ 4

Details

Systems Engineering focuses on building, automating, and operating platforms and tooling that deliver large-scale production systems with high efficiency, reliability, and velocity. This role combines software and systems engineering practices across infrastructure automation, containerized platforms, storage, telemetry, and observability. Systems engineers on this team deploy and operate reliable, automated platforms and build tools and services that keep storage and data infrastructure healthy and performant.

Responsibilities

Design, deploy, and operate solutions on Kubernetes for large-scale storage and data platforms, including manifests, Helm charts, and operators.
Build tools, services, and automation that improve the lifecycle of storage and data systems — from provisioning and configuration through deployment, scaling, and day-2 operations.
Develop and operate telemetry and observability for production systems: metrics, logging, tracing, dashboards, and alerting so health, availability, and latency are measurable and actionable.
Apply strong analytical troubleshooting skills to diagnose and resolve complex issues across distributed, containerized infrastructure.
Work closely with peers and partner teams to improve the lifecycle of services, from inception and design through deployment, operation, and refinement.
Scale systems sustainably through automation, infrastructure-as-code, and CI/CD, and evolve systems by pushing for changes that improve reliability and velocity.
Support services before they go live through deployment automation, capacity planning, and launch/readiness reviews.
Practice sustainable incident response and postmortems, and participate in an on-call rotation to support production systems.

Requirements

BS degree (or equivalent experience) in Computer Science or a related technical field involving coding.
12+ years of practical experience.
Hands-on experience with Kubernetes — deploying, configuring, and operating workloads and solutions on Kubernetes in production.
Experience building tools and services for storage, data, or platform infrastructure, with solid software design fundamentals (algorithms, data structures, complexity analysis) on large-scale Linux-based systems.
Experience building and operating telemetry and observability using tools such as Prometheus, InfluxDB, Grafana, and the Elastic stack.
Strong analytical troubleshooting skills with a systematic, root-cause-driven approach to identifying and resolving complex problems.
Proficiency in one or more of the following: Python, Go, or Java.
Good knowledge of infrastructure configuration management and infrastructure-as-code tools such as Ansible, Chef, Puppet, ArgoCD, Git Pipelines, and Terraform.

Ways to Stand Out

Customer-first mindset with a focus on customer satisfaction and a passion for ensuring customer success.
Experience with Git, code review, pipelines, and CI/CD.
Experience using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker.
Interest in crafting, analyzing, and fixing large-scale distributed systems, with strong debugging skills and a systematic problem-solving approach.
Experience designing storage- or data-focused tooling and automating its operation at scale.
Ability to thrive in collaborative environments, work well with a variety of teams, and adapt to different working styles.

Additional Information

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 208,000 USD - 333,500 USD for Level 5, and 256,000 USD - 414,000 USD for Level 6.
You will also be eligible for equity and benefits.
Applications for this job will be accepted at least until June 12, 2026. This posting is for an existing vacancy.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is an equal opportunity employer and is committed to fostering an inclusive work environment.