Used Tools & Technologies
Not specified
Required Skills & Competences
Security @ 4, Ansible @ 4, Chef @ 4, Go @ 4, Grafana @ 4, Kubernetes @ 4, Linux @ 4, Prometheus @ 4, Terraform @ 4, Python @ 4, Java @ 4, CI/CD @ 4, Algorithms @ 4, Data Structures @ 4, Bash @ 4, Communication @ 7, Git @ 4, Networking @ 4, OpenStack @ 4, Debugging @ 7, Puppet @ 4, Compliance @ 4
Details
Production engineering is a discipline that involves designing, building, and maintaining large-scale production systems with high efficiency and availability. This role focuses on storage architecture, high-performance distributed storage, data management, systems, networking, coding, database management, capacity planning, continuous delivery and deployment, and cloud-enabling technologies like Kubernetes, containers, and virtualization. The role emphasizes automating storage operations, performance tuning, and optimizing storage for AI/ML and HPC workloads.
Responsibilities
- Design, implement, and support large-scale storage clusters, ensuring scalability, high availability, and data integrity.
- Develop and maintain storage monitoring, logging, and alerting systems for proactive detection and resolution of performance issues.
- Work with AI/ML workloads to optimize storage architectures for low-latency access, efficient caching, and high-throughput performance.
- Improve the lifecycle of storage services from design to deployment, operation, and continuous optimization, including system design consulting, automation frameworks, capacity management, and launch reviews.
- Maintain production storage infrastructure by monitoring availability, latency, and system health, leveraging predictive analytics and AI-driven automation.
- Optimize storage efficiency through compression, deduplication, tiering strategies, and intelligent workload placement.
- Scale storage systems using AI/ML-driven automation, policy-based tiering, and dynamic data migration techniques.
- Ensure data security and compliance by implementing encryption, access controls, and auditing mechanisms for storage systems.
- Participate in on-call rotation, sustainable incident response, and blameless root cause analysis.
Requirements
- BS degree in Computer Science, Storage Systems, or a related technical field (or equivalent experience), plus 8+ years of practical experience.
- Experience with distributed and high-performance storage solutions, including clustered and parallel file systems, distributed object storage, and enterprise-grade storage systems.
- Solid understanding of block, file, and object storage technologies and their scalability, reliability, and performance characteristics.
- Experience with storage networking protocols such as NFS, SMB, iSCSI, S3, Fibre Channel, RDMA, and NVMe over Fabrics.
- Expertise in algorithms, data structures, complexity analysis, software design, and automating maintenance of large-scale Linux-based storage systems.
- Experience in one or more of C/C++, Java, Python, Go, Node.js, or Bash for storage automation, monitoring, and performance tuning.
- Hands-on experience with infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform for automating storage deployments.
- Experience with observability and tracing tools such as InfluxDB, Prometheus, Grafana, and the Elastic stack for monitoring storage system health.
- Strong written and oral communication skills, teamwork, and a commitment to producing high-quality work.
Ways to stand out
- Deep understanding of large-scale distributed storage architectures, replication strategies, and erasure coding techniques.
- Experience in capacity planning, performance tuning, and troubleshooting high-throughput storage systems at scale.
- Experience with Git, code review, CI/CD pipelines, and infrastructure-as-code practices.
- Experience operating private and public cloud storage solutions based on Kubernetes, OpenStack, or hybrid cloud architectures.
- Ability to design and implement automated storage migration, backup, and disaster recovery strategies.
- Strong debugging skills and proven understanding of network protocols and architectures as they relate to storage performance and availability.
Compensation & Benefits
- Base salary ranges (dependent on level, location, and experience):
  - Level 4: 168,000 USD - 270,250 USD
  - Level 5: 208,000 USD - 333,500 USD
- Eligible for equity and additional benefits (link provided in original posting).
Additional details
- Location: Santa Clara, California, United States.
- Application deadline (as listed): applications accepted at least until January 6, 2026.
- NVIDIA is an equal opportunity employer and supports a diverse workforce.