Used Tools & Technologies
Not specified
Required Skills & Competences
- Security @ 4
- Docker @ 6
- Kubernetes @ 4
- Distributed Systems @ 7
- Data Science @ 4
- Helm @ 6
- Networking @ 4
- GPU @ 4
Details
NVIDIA DGX Cloud provides serverless generative AI infrastructure that makes NVIDIA's AI supercomputing technology available to anyone. The DGX Cloud engineering team ensures customers receive timely, quality-assured releases. This role focuses on performance and scalability testing: identifying limitations across the Kubernetes and application stack using industry-standard tools and telemetry, and driving improvements across both infrastructure and application layers.
Responsibilities
- Analyze and optimize performance across application, middleware, runtime, and infrastructure layers — networking, storage, GPU utilization, and beyond.
- Develop tooling and metrics that provide deep observability into system performance.
- Collaborate closely with infra, platform, runtime, and product teams to identify key performance goals and drive systemic improvements.
- Lead investigations into high-impact performance regressions or scalability issues in production.
- Influence architecture and design decisions to prioritize latency, throughput, and efficiency at scale.
- Drive performance testing strategies and help define SLAs/SLOs around latency and throughput for critical systems.
Requirements
- Bachelor’s or Master’s degree in Computer Science, Data Science, or a related field — or equivalent experience.
- 5+ years of software engineering experience with a strong track record in the performance or scalability of large-scale distributed systems.
- Deeply comfortable with performance profiling tools and tracing systems.
- Ability to identify performance issues, perform root-cause analysis, and propose solutions.
- Experience optimizing performance across one or more layers of the stack (e.g., database, networking, storage, application runtime, GC tuning, Golang internals, GPU utilization).
- Contributions to observability, benchmarking, or performance-focused infrastructure at scale.
- Strong understanding of OS internals, scheduling, memory management, and I/O patterns.
- Demonstrated success navigating ambiguity and aligning stakeholders around performance goals.
- Proficient in container-based infrastructure (Docker, Kubernetes, Helm).
Ways to stand out
- Demonstrated ability to handle sophisticated technical environments while meeting security, reliability, scalability, and availability targets.
- Strong, proven knowledge of modern architectures at scale.
Compensation & Benefits
- Base salary ranges (location- and level-dependent):
  - Level 3: 144,000 USD - 230,000 USD
  - Level 4: 168,000 USD - 270,250 USD
- Eligible for equity and company benefits (see NVIDIA benefits link).
Other details
- Location: Santa Clara, CA, United States.
- Employment type: Full time.
- Applications accepted at least until September 14, 2025.
- NVIDIA is an equal opportunity employer committed to diversity and inclusion.