Required Skills & Competences
Security @ 4, Ansible @ 4, Chef @ 4, Go @ 4, Grafana @ 4, Kubernetes @ 4, Linux @ 4, Prometheus @ 4, Terraform @ 4, Python @ 4, Java @ 4, CI/CD @ 3, Algorithms @ 4, Data Structures @ 4, Bash @ 4, Communication @ 7, Git @ 3, Networking @ 4, OpenStack @ 4, Debugging @ 7, Puppet @ 4, Compliance @ 4
Details
Production engineering is a discipline that involves designing, building, and maintaining large-scale production systems with high efficiency and availability. It encompasses software and systems engineering practices, storage, data management, and services. Production Engineers at NVIDIA specialize in storage architecture, high-performance distributed storage, data management, systems, networking, coding, database management, capacity planning, continuous delivery and deployment, and cloud-enabling technologies such as Kubernetes, containers, and virtualization. The role focuses on ensuring reliable, scalable, high-performance storage solutions and low-latency data access for HPC and AI/ML workloads.
Responsibilities
- Design, implement, and support large-scale storage clusters, ensuring scalability, high availability, and data integrity.
- Develop and maintain storage monitoring, logging, and alerting systems for proactive detection and resolution of performance issues.
- Work with AI/ML workloads to optimize storage architectures for low-latency access, efficient caching, and high-throughput performance.
- Improve the lifecycle of storage services from inception and design to deployment, operation, and continuous optimization.
- Support storage services before they go live via system design consulting, automation frameworks, capacity management, and launch reviews.
- Maintain production storage infrastructure by monitoring availability, latency, and system health, leveraging predictive analytics and AI-driven automation.
- Optimize storage efficiency using compression, deduplication, tiering strategies, and intelligent workload placement.
- Scale storage systems sustainably using AI/ML-driven automation, policy-based tiering, and dynamic data migration techniques.
- Ensure data security and compliance by implementing encryption, access controls, and auditing mechanisms for storage systems.
- Participate in blameless root cause analysis and be part of an on-call rotation supporting storage and production systems.
Requirements
- BS degree in Computer Science, Storage Systems, or a related technical field (or equivalent experience), plus 8+ years of practical experience.
- Experience with distributed and high-performance storage solutions, including clustered and parallel file systems, distributed object storage, and enterprise-grade storage systems.
- Solid understanding of block, file, and object storage technologies, including scalability, reliability, and performance characteristics.
- Experience with storage access and networking protocols such as NFS, SMB, iSCSI, S3, Fibre Channel, RDMA, and NVMe over Fabrics.
- Expertise in algorithms, data structures, complexity analysis, software design, and automating maintenance of large-scale Linux-based storage systems.
- Experience with one or more of the following languages for storage automation and tuning: C/C++, Java, Python, Go, Node.js, or Bash.
- Hands-on experience with infrastructure configuration management and automation tools such as Ansible, Chef, Puppet, and Terraform.
- Experience with observability and tracing tools like InfluxDB, Prometheus, Grafana, and the Elastic stack for monitoring storage system health.
- Excellent written and oral communication skills, strong teamwork, and a commitment to producing quality work and completing tasks reliably.
Ways to stand out
- Deep understanding of large-scale distributed storage architectures, replication strategies, and erasure coding techniques.
- Experience in capacity planning, performance tuning, and troubleshooting high-throughput storage systems.
- Familiarity with Git, code review, pipelines, and CI/CD for infrastructure as code.
- Experience operating private and public cloud storage solutions based on Kubernetes, OpenStack, or hybrid cloud architectures.
- Ability to design and implement automated storage migration, backup, and disaster recovery strategies.
- Strong debugging skills and systematic problem-solving for complex storage issues.
- Proven understanding of network protocols and troubleshooting related to storage performance and availability.
Compensation & Benefits
- Base salary range by level:
  - Level 4: 184,000 USD - 287,500 USD
  - Level 5: 224,000 USD - 356,500 USD
- You will also be eligible for equity and benefits (see NVIDIA benefits page).
Other details
- Applications for this job will be accepted at least until August 17, 2025.
- NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer.