Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.
Security @ 4
Jenkins @ 3
Kubernetes @ 4
DevOps @ 4
Python @ 4
CI/CD @ 4
Communication @ 1
Debugging @ 4
Reporting @ 4
QA @ 4
System Architecture @ 7
Compliance @ 4
GPU @ 4
AI @ 4
Details
NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. Today, NVIDIA is focused on AI and next-generation computing. This role is for a Senior Test Architect on the Enterprise Software QA team working on design, construction, optimization, and testing of flagship supercomputers and data center offerings.
Responsibilities
- Define the end-to-end test strategy and own the overall test architecture and validation strategy for power features across multiple NVIDIA platforms, from pre-silicon simulation and emulation through post-silicon bring-up and production readiness.
- Develop test plans aligned with product deliverables and customer use cases; influence early design decisions to optimize testability and automation readiness.
- Architect scalable test infrastructure: design and implement modular, reusable test frameworks and automation harnesses supporting functional, integration, stress, regression, power, security, and performance testing at scale across hundreds of systems.
- Own data center power quality metrics: define KPIs (code coverage, system uptime, bug escape rate, validation completeness) and establish dashboards/reporting for data-driven decisions.
- Lead root cause analysis and debugging across firmware, software, and hardware layers; develop and document debug methodologies and tools.
- Innovate in lab automation and CI/CD: partner with DevOps and infrastructure teams to enhance test automation pipelines and integrate continuous testing into nightly and pre-merge workflows.
- Enable productization and customer readiness: validate real-world use cases, customer configurations, and production scenarios; contribute to release gates and sign-off criteria.
- Mentor and lead software QA engineers and junior test developers; promote quality, innovation, and continuous learning.
- Use AI-powered tools and copilots to accelerate test development, automate repetitive validation workflows, and streamline debug and root cause analysis.
Requirements
- B.S./M.S./Ph.D. in Electrical Engineering, Computer Engineering, Computer Science, or related field (or equivalent experience).
- 10+ years of experience in data center power enablement related to software/firmware testing, with a focus on telemetry and power efficiency across systems.
- Strong knowledge of system architecture, power shelf, baseboard management, hardware and software power features, industry power standards, system interfaces, and embedded controllers.
- Proven experience designing test frameworks and infrastructure in Python, C/C++ or similar languages.
- Expertise with platform standards for security, telemetry and manageability (NIST, DMTF, OCP). Hands-on experience with server platform, network, storage, cluster configuration and debugging.
- Background in platform telemetry and datacenter node lifecycle management/support, including CPU/GPU workloads.
- Proficiency in scripting languages such as Python.
- Expertise in administering, operating, and configuring Kubernetes and Envoy.
- Proven experience with CI/CD tools such as GitLab and Jenkins, and familiarity with the GitOps model.
- Experience with lab automation, simulation, HW-in-the-loop testing, and CI/CD pipelines.
- Strong debugging, problem-solving, and analytical skills.
- Excellent communication and collaboration skills; experience working in a globally distributed team is a plus.
Ways to Stand Out
- Experience with NVIDIA platforms (e.g., DGX, HGX, Grace Hopper systems).
- Exposure to security validation and compliance (e.g., FIPS, BMC security), or thermal/power validation.
- Prior role as a test architect or technical lead for large-scale datacenter enablement or firmware validation programs.
- Contributions to open-source testing tools or frameworks; strong knowledge of cloud-scale validation, infrastructure automation, or virtualization.
- Prior experience using AI tools to create agents, design test plans, identify test gaps, automate tests, and analyze failures.
Compensation & Benefits
- Base salary ranges: 168,000 USD - 270,250 USD (Level 4) and 200,000 USD - 322,000 USD (Level 5).
- Eligible for equity and benefits (link provided in original posting).
Additional Information
- Applications accepted at least until June 26, 2026.
- This posting is for an existing vacancy.
- NVIDIA uses AI tools in its recruiting processes.
- NVIDIA is an equal opportunity employer committed to an inclusive work environment.