Used Tools & Technologies
Machine Learning, LLM
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9: you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Ansible @ 3
Go @ 3
Grafana @ 2
Kubernetes @ 3
Linux @ 6
Prometheus @ 2
DevOps @ 6
Terraform @ 3
Python @ 3
CI/CD @ 2
Distributed Systems @ 3
MLOps @ 3
TensorFlow @ 3
Leadership @ 3
Networking @ 3
CRM @ 3
ServiceNow @ 3
API @ 3
PyTorch @ 3
Salesforce @ 3
OpenTelemetry @ 2
CUDA @ 3
GPU @ 3
Observability @ 3
AI @ 3
InfiniBand @ 3
Details
NVIDIA is seeking an NCX Engineer, AI Accelerator to join the AI Accelerator team and collaborate closely with strategic customers to implement and enhance advanced AI workloads. You will deliver hands-on technical assistance for large-scale AI deployments and distributed systems, helping customers get efficient performance from NVIDIA's AI platform across varied environments and partner platforms.
Responsibilities
- Build and deploy custom AI solutions on NCP and Neo Cloud platforms, including distributed training, inference optimization, and MLOps pipelines based on NVIDIA reference architectures.
- Act as the primary technical contact for strategic NCP customers: provide remote and on-site support, troubleshoot complex production issues, and guide partner engineering teams on NVIDIA platform guidelines.
- Deploy and manage AI workloads across DGX Cloud, NCP data centers, and major cloud service provider environments using Kubernetes, containers, and GPU scheduling systems aligned to NCP builds.
- Profile and tune large-scale training and inference workloads to reduce latency, cost, and operational risk, and implement observability and SLO/SLA monitoring.
- Implement and extend NVIDIA reference architectures on partner platforms; develop integrations with partner control planes and customer environments to ensure API, data pipeline, and enterprise software connectivity.
- Produce implementation guides, runbooks, and post-mortem documentation that codify standard methodologies for running NVIDIA AI workloads at scale on NCP platforms.
Requirements
- BS, MS, or Ph.D. in Computer Science, Computer/Electrical Engineering, or a related technical field, or equivalent experience.
- 8+ years of experience in customer-facing technical roles such as Solutions Engineering, DevOps, Site Reliability, or ML Infrastructure Engineering, ideally supporting large-scale cloud or service-provider environments.
- Strong expertise in Linux systems and distributed computing.
- Experience with Kubernetes, containers, and GPU scheduling on multi-tenant or service-provider platforms.
- Demonstrated AI/ML experience supporting large-scale training and inference workloads (LLMs, generative models, recommendation systems) in production or critical environments.
- Solid programming skills in Python and Go, with hands-on experience using frameworks such as PyTorch or TensorFlow for training and serving.
- Ability to collaborate with customer and partner engineering teams, lead complex technical investigations to root cause, and communicate architectures and recommendations to engineering and leadership audiences.
Ways to stand out
- Experience with the NVIDIA ecosystem: DGX systems, CUDA, NeMo, Triton, NIM, and NVIDIA networking technologies (InfiniBand, RoCE).
- Direct experience collaborating with NVIDIA Cloud Partners, hyperscale CSPs, or managed AI cloud platforms; implementing NVIDIA reference architectures for AI infrastructure.
- Deep familiarity with MLOps and cloud-native practices: containerization, CI/CD pipelines, observability stacks (Prometheus, Grafana, OpenTelemetry), and GitOps workflows.
- Experience with infrastructure-as-code tools (Terraform, Ansible) for repeatable deployment and configuration of GPU-accelerated clusters and NCP building blocks.
- Experience integrating AI platforms with enterprise systems (Salesforce, ServiceNow, other ITSM/CRM) to support end-to-end customer solutions and managed services.
Compensation & Benefits
- Base salary ranges (location-, level-, and experience-dependent):
  - Level 4: 184,000 USD - 287,500 USD
  - Level 5: 224,000 USD - 356,500 USD
- You will also be eligible for equity and benefits.
Applications for this job will be accepted at least until May 9, 2026. NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer committed to diversity.