Required Skills & Competences
Software Development @ 4, Grafana @ 4, Prometheus @ 4, DevOps @ 4, Python @ 4, Spark @ 4, Java @ 4, Machine Learning @ 4, Leadership @ 4, JavaScript @ 4, Data Analysis @ 4, Technical Leadership @ 4, GPU @ 4
Details
NVIDIA’s Hardware Infrastructure organization is seeking a Principal Data Platform Architect. The team serves and collaborates directly with NVIDIA’s AI, hardware, and software engineering and research teams. The role is a technical leadership position to define a vision and roadmap for distributed data platform and observability systems for large-scale AI and HPC clusters and workloads, and to guide implementation and operations worldwide.
Responsibilities
- Collaborate with AI, hardware, and software engineering and research teams to define a vision and roadmap for AI/HPC cluster observability.
- Architect and lead teams to develop, test, and deploy data collectors, pipelines, visualization, and retrieval services.
- Define data collection and retention policies to balance network bandwidth, system load, and storage capacity costs with data analysis requirements.
- Provide operational and strategic data to empower engineers and researchers to improve performance, productivity, and efficiency.
- Continuously improve quality, workloads, and processes through better observability.
Requirements
- Experience designing and building large-scale, distributed observability systems.
- Ability to collaborate with data scientists, researchers, and engineering teams to identify high-value data for collection and analysis.
- Experience turning raw data into actionable reports.
- Experience with observability platforms and tooling such as Apache Spark, Elastic/OpenSearch, Grafana, Prometheus, and other similar open-source tools.
- Technical-lead-level programming experience in Python, JavaScript, and Java.
- Thorough understanding of databases (relational and non-relational).
- Strong planning, interpersonal, and collaboration skills; adaptability to dynamic requirements.
- MS (preferred) or BS in Computer Science, Electrical Engineering, or related field, or equivalent experience.
- 15+ years of relevant experience.
Ways To Stand Out
- Background in computer science, machine learning, deep learning, open-source software, infrastructure technologies, and GPU technology.
- Prior experience in infrastructure software, production application software development, release and support methodology, and DevOps.
- Experience in the management of datacenters and large-scale distributed computing.
- Experience working with AI researchers and/or EDA developers.
- Track record of driving process improvements and measuring efficiency; passion for sharing knowledge and driving complex projects end-to-end.
Compensation & Benefits
- Base salary range: 272,000 USD - 425,500 USD (will be determined based on location, experience, and comparable pay).
- Eligible for equity and benefits (see NVIDIA benefits).
Additional Details
- Location: Santa Clara, CA, United States.
- Employment type: Full time.
- Applications for this job will be accepted at least until July 29, 2025.
- NVIDIA is an equal opportunity employer committed to fostering a diverse work environment.