Senior Software Engineer, Profiling Services

at NVIDIA
USD 184,000-356,500 per year
SENIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Security @ 7, Software Development @ 4, Python @ 7, Machine Learning @ 4, Leadership @ 7, Communication @ 4, Debugging @ 4, API @ 4, Technical Leadership @ 7, Design Patterns @ 4, PyTorch @ 3, CUDA @ 6, GPU @ 4

Details

Are you ready to advance GPU performance analysis for Machine Learning workloads? Join our Developer Tools Always-On Profiling (AON) team as a Senior Software Architect, where you will play a pivotal role in designing, implementing, and leading our Always-On Profiling service. This role demands deep technical expertise, a proven track record of solving ambiguous challenges, and strong technical leadership skills.

Responsibilities

  • Architect and build scalable systems: drive the design and implementation of the AON profiling service's core systems, including inter-process communication (IPC), memory management, and low-overhead architectures to handle profiling data from complex multi-node, multi-process, multi-GPU, and cluster environments.
  • Elevate software engineering excellence: promote high standards in software development, including design patterns, concurrency, parallelism, and advanced debugging for asynchronous systems. Ensure code quality and robust testing for a reliable profiling service.
  • Lead, mentor, and innovate: guide and mentor engineers, provide impactful code reviews, and shape technical roadmaps. Identify complex technical issues within the AON project, decompose them, and craft innovative solutions.
  • Architect and build high-performance platforms: transform user needs into requirements and design documents. Lead end-to-end feature development from planning and prototyping to implementation, testing, and customer evaluation. Work hands-on across user applications, drivers, performance counter libraries, and lower-level platform/hardware abstraction layers.
  • Collaborate across boundaries: partner with diverse internal and external teams to integrate AON into the broader profiling and ML ecosystem.

Requirements

  • BS or MS degree (or equivalent experience) in Computer Engineering, Computer Science, or a related field.
  • 8+ years of substantial software development experience in C, C++, and Python.
  • 10+ years in system software design, operating systems fundamentals, computer architectures, performance analysis, and delivering production-quality software.
  • Strong interpersonal, verbal, and written communication skills; proven ability to build cross-organizational partnerships and lead technical teams.
  • Profiling & performance tools expert: extensive knowledge of profiling technologies (sampling, tracing), overhead analysis, and diverse profiling data (CPU/GPU events, performance counters, API traces, event correlation). Familiarity with existing profiling ecosystems and their limitations is a plus.
  • GPU & CUDA proficiency: in-depth knowledge of CUDA APIs, runtime, streams, kernels, and GPU architecture.
  • ML ecosystem & performance analysis: familiarity with ML frameworks such as PyTorch and JAX, and experience analyzing performance for AI training/inference applications.
  • Large-scale system development & debugging experience across multi-layered software systems, including user mode and kernel drivers, with the ability to contribute to and extend very large codebases.
  • Proficiency designing APIs and interfaces for profiling tools to enable integration with frameworks and custom code.
  • Strong ability to simplify ill-defined problems, craft solutions, and lead teams to implement them.

Ways to stand out

  • Track record designing and implementing low-overhead profiling systems for multi-process and distributed environments.
  • Deep understanding of PyTorch internals and CUDA usage (tensor memory, operations, distributed training).
  • Strong ability to analyze profiling data and translate it into concrete, actionable insights for CUDA and ML frameworks like PyTorch.
  • Experience translating customer needs into actionable use cases and requirements.
  • Strong understanding of system security principles.

Compensation & Additional Information

  • Base salary ranges: Level 4: 184,000 USD - 287,500 USD; Level 5: 224,000 USD - 356,500 USD. Your base salary will be determined based on location, experience, and pay of employees in similar positions.
  • You will also be eligible for equity and benefits.
  • Applications for this job will be accepted at least until December 20, 2025.
  • NVIDIA is an equal opportunity employer and committed to fostering a diverse work environment.