Senior Storage Production Engineer - DGX Cloud

at NVIDIA
USD 184,000-356,500 per year
SENIOR
✅ On-site

Required Skills & Competences

Security @ 4, Ansible @ 4, Chef @ 4, Go @ 4, Grafana @ 4, Kubernetes @ 4, Linux @ 4, Prometheus @ 4, Terraform @ 4, Python @ 4, Java @ 4, CI/CD @ 3, Algorithms @ 4, Data Structures @ 4, Bash @ 4, Communication @ 7, Git @ 3, Networking @ 4, OpenStack @ 4, Debugging @ 7, Puppet @ 4, Compliance @ 4

Details

Production engineering is a discipline that involves designing, building, and maintaining large-scale production systems with high efficiency and availability. It encompasses software and systems engineering practices, storage, data management, and services. Production Engineers at NVIDIA specialize in storage architecture, high-performance distributed storage, data management, systems, networking, coding, database management, capacity planning, continuous delivery and deployment, and cloud-enabling technologies such as Kubernetes, containers, and virtualization. The role focuses on ensuring reliable, scalable, high-performance storage solutions and low-latency data access for HPC and AI/ML workloads.

Responsibilities

  • Design, implement, and support large-scale storage clusters, ensuring scalability, high availability, and data integrity.
  • Develop and maintain storage monitoring, logging, and alerting systems for proactive detection and resolution of performance issues.
  • Work with AI/ML workloads to optimize storage architectures for low-latency access, efficient caching, and high-throughput performance.
  • Improve the lifecycle of storage services from inception and design to deployment, operation, and continuous optimization.
  • Support storage services before they go live via system design consulting, automation frameworks, capacity management, and launch reviews.
  • Maintain production storage infrastructure by monitoring availability, latency, and system health, leveraging predictive analytics and AI-driven automation.
  • Optimize storage efficiency using compression, deduplication, tiering strategies, and intelligent workload placement.
  • Scale storage systems sustainably using AI/ML-driven automation, policy-based tiering, and dynamic data migration techniques.
  • Ensure data security and compliance by implementing encryption, access controls, and auditing mechanisms for storage systems.
  • Participate in blameless root cause analysis and be part of an on-call rotation supporting storage and production systems.

Requirements

  • BS degree in Computer Science, Storage Systems, or a related technical field (or equivalent experience), plus 8+ years of practical experience.
  • Experience with distributed and high-performance storage solutions, including clustered and parallel file systems, distributed object storage, and enterprise-grade storage systems.
  • Solid understanding of block, file, and object storage technologies, including scalability, reliability, and performance characteristics.
  • Experience with storage networking protocols: NFS, SMB, iSCSI, S3, Fibre Channel, RDMA, NVMe over Fabrics.
  • Expertise in algorithms, data structures, complexity analysis, software design, and automating maintenance of large-scale Linux-based storage systems.
  • Experience with one or more programming languages for storage automation and tuning, such as C/C++, Java, Python, Go, Node.js, or Bash.
  • Hands-on experience with infrastructure configuration management and automation tools such as Ansible, Chef, Puppet, and Terraform.
  • Experience with observability and tracing tools such as InfluxDB, Prometheus, Grafana, and the Elastic Stack for monitoring storage system health.
  • Excellent written and oral communication skills, strong teamwork, and a commitment to producing quality work and completing tasks reliably.

Ways to stand out

  • Deep understanding of large-scale distributed storage architectures, replication strategies, and erasure coding techniques.
  • Experience in capacity planning, performance tuning, and troubleshooting high-throughput storage systems.
  • Familiarity with Git, code review, pipelines, and CI/CD for infrastructure as code.
  • Experience operating private and public cloud storage solutions based on Kubernetes, OpenStack, or hybrid cloud architectures.
  • Ability to design and implement automated storage migration, backup, and disaster recovery strategies.
  • Strong debugging skills and systematic problem-solving for complex storage issues.
  • Proven understanding of network protocols and troubleshooting related to storage performance and availability.

Compensation & Benefits

  • Base salary ranges provided by level:
    • Level 4: 184,000 USD - 287,500 USD
    • Level 5: 224,000 USD - 356,500 USD
  • You will also be eligible for equity and benefits (see NVIDIA benefits page).

Other details

  • Applications for this job will be accepted at least until August 17, 2025.
  • NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer.