Senior Software Engineer, Profiling Services
at NVIDIA
Santa Clara, United States
USD 184,000-356,500 per year
Required Skills & Competences
Security @ 7, Software Development @ 4, Python @ 7, Machine Learning @ 4, Leadership @ 7, Communication @ 4, Debugging @ 4, API @ 4, Technical Leadership @ 7, Design Patterns @ 4, PyTorch @ 3, CUDA @ 6, GPU @ 4
Details
Are you ready to innovate in GPU performance analysis for Machine Learning workloads? Join our Developer Tools Always-On Profiling (AON) team as a Senior Software Architect, where you will be pivotal in designing, implementing, and leading our Always-On Profiling service. This role demands deep technical expertise, a proven track record of solving ambiguous challenges, and strong technical leadership skills.
Responsibilities
- Architect and build scalable systems: drive the design and implementation of the AON profiling service's core systems, including inter-process communication (IPC), memory management, and low-overhead architectures to handle profiling data from complex multi-node, multi-process, multi-GPU, and cluster environments.
- Elevate software engineering excellence: promote high standards in software development, including design patterns, concurrency, parallelism, and advanced debugging for asynchronous systems. Ensure code quality and robust testing for a reliable profiling service.
- Lead, mentor, and innovate: guide and mentor engineers, provide impactful code reviews, and shape technical roadmaps. Identify complex technical issues within the AON project, decompose them, and craft innovative solutions.
- Architect and build high-performance platforms: transform user needs into requirements and design documents. Lead end-to-end feature development from planning and prototyping to implementation, testing, and customer evaluation. Work hands-on across user applications, drivers, performance counter libraries, and lower-level platform/hardware abstraction layers.
- Collaborate across boundaries: partner with diverse internal and external teams to integrate AON into the broader profiling and ML ecosystem.
Requirements
- BS or MS degree (or equivalent experience) in Computer Engineering, Computer Science, or a related field.
- 8+ years of substantial software development experience in C, C++, and Python.
- 10+ years in system software design, operating systems fundamentals, computer architectures, performance analysis, and delivering production-quality software.
- Strong interpersonal, verbal, and written communication skills; proven ability to build cross-organizational partnerships and lead technical teams.
- Profiling & performance tools expert: extensive knowledge of profiling technologies (sampling, tracing), overhead analysis, and diverse profiling data (CPU/GPU events, performance counters, API traces, event correlation). Familiarity with existing profiling ecosystems and their limitations is a plus.
- GPU & CUDA proficiency: in-depth knowledge of CUDA APIs, runtime, streams, kernels, and GPU architecture.
- ML ecosystem & performance analysis: familiarity with ML frameworks such as PyTorch and JAX, and experience analyzing performance for AI training/inference applications.
- Large-scale system development & debugging experience across multi-layered software systems, including user mode and kernel drivers, with the ability to contribute to and extend very large codebases.
- Proficiency in designing APIs and interfaces for profiling tools to enable integration with frameworks and custom code.
- Strong ability to simplify ill-defined problems, craft solutions, and lead teams to implement them.
Ways to stand out
- Track record of designing and implementing low-overhead profiling systems for multi-process and distributed environments.
- Deep understanding of PyTorch internals and CUDA usage (tensor memory, operations, distributed training).
- Strong ability to analyze profiling data and translate it into concrete, actionable insights for CUDA and ML frameworks like PyTorch.
- Experience translating customer needs into actionable use cases and requirements.
- Strong understanding of system security principles.
Compensation & Additional Information
- Base salary ranges: Level 4: 184,000 USD - 287,500 USD; Level 5: 224,000 USD - 356,500 USD. Your base salary will be determined based on location, experience, and pay of employees in similar positions.
- You will also be eligible for equity and benefits.
- Applications for this job will be accepted at least until December 20, 2025.
- NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment.