Senior Storage Production Engineer - DGX Cloud

at NVIDIA

📍 Santa Clara, United States

USD 148,000-287,500 per year

SENIOR

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Security, Ansible, Ceph, Chef, Go, Kubernetes, Linux, Prometheus, Ruby, Terraform, Python, Java, CI/CD, Algorithms, Data Structures, Git, Networking, OpenStack, Perl, Debugging, Puppet, Compliance, Cloud Computing

Details

The Production Engineering team designs, builds, and maintains large-scale production systems with high efficiency and availability. The work encompasses software and systems engineering practices, storage, data management, and services. Production Engineers have expertise in storage architecture, high-performance distributed storage, data management, systems, networking, coding, database management, capacity planning, continuous delivery, deployment, and open-source cloud technologies such as Kubernetes, containers, and virtualization.

Responsibilities

Design, implement, and support large-scale storage clusters ensuring scalability, high availability, and data integrity.
Develop and maintain storage monitoring, logging, and alerting systems for proactive performance issue detection and resolution.
Optimize storage architectures for AI/ML workloads focusing on low-latency access, efficient caching, and high-throughput performance.
Improve the storage service lifecycle, from design through deployment, operation, and continuous optimization.
Support storage services pre-launch via system design consulting, automation framework development, capacity management, and launch reviews.
Maintain storage infrastructure by monitoring availability, latency, and system health, using predictive analytics and AI-driven automation.
Optimize storage efficiency through compression, deduplication, tiering strategies, and intelligent workload placement.
Scale storage systems sustainably with AI/ML-driven automation, policy-based tiering, and dynamic data migration.
Ensure data security and compliance via encryption, access controls, and auditing mechanisms.
Practice sustainable incident response and blameless postmortems; participate in the on-call rotation supporting storage and production systems.

Requirements

BS degree or equivalent in Computer Science, Storage Systems, or related field; 5+ years of practical experience.
Experience with high-performance storage solutions: parallel file systems (Lustre, GPFS), distributed storage (Ceph, MinIO), enterprise-scale object storage (S3, NetApp, Pure Storage).
Solid knowledge of block, file, and object storage technologies, with an understanding of their performance characteristics and related methodologies.
Experience with storage networking protocols: NFS, SMB, iSCSI, Fibre Channel, RDMA, NVMe over Fabrics.
Expertise in algorithms, data structures, and software design; experience maintaining large-scale Linux-based storage systems.
Programming experience in C/C++, Java, Python, Go, Perl, or Ruby for storage automation and performance tuning.
Hands-on infrastructure configuration management with Ansible, Chef, Puppet, Terraform.
Experience with observability tools like InfluxDB, Prometheus, Elastic stack.

Ways to Stand Out

Deep understanding of distributed storage architectures, replication, erasure coding, capacity planning, and performance tuning.
Experience with Git, CI/CD pipelines, and infrastructure as code.
Strong debugging and problem-solving skills for complex storage issues.
Experience with private/public cloud storage solutions (Kubernetes, OpenStack, hybrid clouds).
Ability to design automated migration, backup, and disaster recovery strategies.
Collaborative, flexible, and adaptable to emerging storage technologies.

Benefits

At NVIDIA, you'll work at the forefront of innovative storage technologies powering AI, HPC, and cloud computing. You will be eligible for equity and benefits. NVIDIA fosters diversity and is an equal opportunity employer.