Senior Storage Production Engineer - DGX Cloud

at NVIDIA
USD 168,000-333,500 per year
SENIOR
✅ On-site

Required Skills & Competences

Security @ 4, Ansible @ 4, Chef @ 4, Go @ 4, Grafana @ 4, Kubernetes @ 4, Linux @ 4, Prometheus @ 4, Terraform @ 4, Python @ 4, Java @ 4, CI/CD @ 4, Algorithms @ 4, Data Structures @ 4, Bash @ 4, Communication @ 7, Git @ 4, Networking @ 4, OpenStack @ 4, Debugging @ 7, Puppet @ 4, Compliance @ 4

Details

Production engineering is a discipline that involves designing, building, and maintaining large-scale production systems with high efficiency and availability. This role focuses on storage architecture, high-performance distributed storage, data management, systems, networking, coding, database management, capacity planning, continuous delivery and deployment, and cloud-enabling technologies like Kubernetes, containers, and virtualization. The role emphasizes automating storage operations, performance tuning, and optimizing storage for AI/ML and HPC workloads.

Responsibilities

  • Design, implement, and support large-scale storage clusters, ensuring scalability, high availability, and data integrity.
  • Develop and maintain storage monitoring, logging, and alerting systems for proactive detection and resolution of performance issues.
  • Work with AI/ML workloads to optimize storage architectures for low-latency access, efficient caching, and high-throughput performance.
  • Improve the lifecycle of storage services from design to deployment, operation, and continuous optimization, including system design consulting, automation frameworks, capacity management, and launch reviews.
  • Maintain production storage infrastructure by monitoring availability, latency, and system health, leveraging predictive analytics and AI-driven automation.
  • Optimize storage efficiency through compression, deduplication, tiering strategies, and intelligent workload placement.
  • Scale storage systems using AI/ML-driven automation, policy-based tiering, and dynamic data migration techniques.
  • Ensure data security and compliance by implementing encryption, access controls, and auditing mechanisms for storage systems.
  • Participate in on-call rotation, sustainable incident response, and blameless root cause analysis.

Requirements

  • BS degree in Computer Science, Storage Systems, or a related technical field (or equivalent experience), plus 8+ years of practical experience.
  • Experience with distributed and high-performance storage solutions, including clustered and parallel file systems, distributed object storage, and enterprise-grade storage systems.
  • Solid understanding of block, file, and object storage technologies and their scalability, reliability, and performance characteristics.
  • Experience with storage networking protocols such as NFS, SMB, iSCSI, S3, Fibre Channel, RDMA, and NVMe over Fabrics.
  • Expertise in algorithms, data structures, complexity analysis, software design, and automating maintenance of large-scale Linux-based storage systems.
  • Experience in one or more of C/C++, Java, Python, Go, Node.js, or Bash for storage automation, monitoring, and performance tuning.
  • Hands-on experience with infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform for automating storage deployments.
  • Experience with observability and tracing tools such as InfluxDB, Prometheus, Grafana, and the Elastic stack for monitoring storage system health.
  • Strong written and oral communication skills, teamwork, and a commitment to producing high-quality work.

Ways to stand out

  • Deep understanding of large-scale distributed storage architectures, replication strategies, and erasure coding techniques.
  • Experience in capacity planning, performance tuning, and troubleshooting high-throughput storage systems at scale.
  • Experience with Git, code review, CI/CD pipelines, and infrastructure-as-code practices.
  • Experience operating private and public cloud storage solutions based on Kubernetes, OpenStack, or hybrid cloud architectures.
  • Ability to design and implement automated storage migration, backup, and disaster recovery strategies.
  • Strong debugging skills and proven understanding of network protocols and architectures as they relate to storage performance and availability.

Compensation & Benefits

  • Base salary ranges (dependent on level, location, and experience):
    • Level 4: 168,000 USD - 270,250 USD
    • Level 5: 208,000 USD - 333,500 USD
  • Eligible for equity and additional benefits (link provided in original posting).

Additional details

  • Location: Santa Clara, California, United States.
  • Applications accepted at least until January 6, 2026 (as listed).
  • NVIDIA is an equal opportunity employer and supports a diverse workforce.