Required Skills & Competences
Security @ 4, Docker @ 6, Kubernetes @ 6, Distributed Systems @ 7, Data Science @ 4, Helm @ 6, Networking @ 4, GPU @ 4
Details
NVIDIA leads developments in Artificial Intelligence, High-Performance Computing (HPC) and Visualization. DGX Cloud provides a serverless generative AI infrastructure enabling NVIDIA's AI supercomputer technologies to be used by anyone. The DGX Cloud engineering team ensures customers receive timely and quality-assured releases. This role is for a Performance Engineer proficient in performance and scalability testing, identifying limitations across the Kubernetes (K8s) and application stack using industry-standard tools and telemetry. The role involves problem-solving in a distributed team setting and driving performance and scalability improvements across the stack.
Responsibilities
- Analyze and optimize performance across the application, middleware, runtime, and infrastructure layers, including networking, storage, and GPU utilization.
- Develop tooling and metrics that provide deep observability into system performance.
- Collaborate with infrastructure, platform, runtime, and product teams to identify key performance goals and drive systemic improvements.
- Lead investigations into high-impact performance regressions or scalability issues in production.
- Influence architecture and design decisions to prioritize latency, throughput, and efficiency at scale.
- Drive performance testing strategies and help define SLAs/SLOs around latency and throughput for critical systems.
Requirements
- Bachelor’s or Master’s degree in Computer Science, Data Science, or a related field (or equivalent experience).
- 5+ years in software engineering with a strong track record in performance or scalability of high-scale distributed systems.
- Deep comfort with performance profiling tools and tracing systems.
- Ability to identify performance issues, perform root cause analysis, and propose solutions.
- Experience optimizing performance across one or more layers of the stack (e.g., database, networking, storage, application runtime, GC tuning, Golang internals, GPU utilization).
- Contributions to observability, benchmarking, or performance-focused infrastructure at scale.
- Strong understanding of OS internals, scheduling, memory management, and IO patterns.
- Demonstrated ability to navigate ambiguity and align stakeholders around performance goals.
- Proficiency in container-based infrastructure (Docker, Kubernetes, Helm).
Ways to Stand Out
- Demonstrated ability to handle sophisticated technical environments while meeting or exceeding security, reliability, scalability, and availability metrics.
- Strong, proven knowledge of modern architectures at scale.
Benefits
- Competitive base salary (ranges provided below by level), eligibility for equity, and access to NVIDIA benefits.
Compensation Details
- Base salary range for Level 3: 144,000 USD - 230,000 USD.
- Base salary range for Level 4: 168,000 USD - 270,250 USD.
Additional Information
- Applications accepted at least until September 14, 2025.
- NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment.