Required Skills & Competences
Security @ 4, Grafana @ 4, Jenkins @ 4, Kubernetes @ 4, Prometheus @ 4, DevOps @ 4, Terraform @ 4, Python @ 7, GCP @ 4, GitHub @ 4, GitHub Actions @ 4, CI/CD @ 4, Datadog @ 4, ArgoCD @ 4, Machine Learning @ 7, AWS @ 4, Azure @ 4, Bash @ 7, Helm @ 4, Networking @ 7, SRE @ 4, Compliance @ 4
Details
At SentinelOne, we’re redefining cybersecurity by pushing the limits of what’s possible—leveraging AI-powered, data-driven innovation to stay ahead of tomorrow’s threats.
We’re seeking a Staff AI Infrastructure Engineer with deep expertise in building, automating, and managing AI infrastructure at scale. You will be instrumental in designing and maintaining the systems essential for serving and deploying AI models efficiently and securely across diverse cloud environments.
Responsibilities
- Architect, build, and maintain scalable infrastructure to host and serve AI products and models reliably.
- Automate infrastructure deployment and management using Helm, ArgoCD and Terraform.
- Manage and optimize Kubernetes clusters to support high-performance AI workloads.
- Implement and manage CI/CD pipelines utilizing GitHub Actions and Jenkins.
- Ensure infrastructure compliance with security standards including FedRAMP and related guidelines.
- Collaborate closely with AI engineering, product teams, and DevOps to meet infrastructure requirements.
- Monitor infrastructure health and performance with tools such as Prometheus, Grafana, Datadog, and Jaeger, and implement optimizations proactively.
- Drive infrastructure best practices and mentor team members to foster technical excellence.
Requirements
- Degree in Computer Science, Information Technology, or related field, or equivalent practical experience.
- 7+ years of experience managing scalable, secure, and resilient infrastructure for AI and machine learning applications.
- Deep proficiency with infrastructure-as-code and GitOps tools such as Helm, Terraform, and ArgoCD.
- Extensive hands-on experience with Kubernetes for deploying containerized workloads.
- Demonstrated experience with major cloud platforms (AWS, GCP, Azure), specifically with services for AI model hosting (for example, Azure OpenAI).
- Experience implementing and managing CI/CD pipelines (GitHub Actions, Jenkins).
- Familiarity with compliance frameworks, particularly FedRAMP, and security best practices.
- Strong scripting and automation skills using Python, Bash, or similar languages.
- Excellent problem-solving skills, creativity, and self-driven motivation.
Preferred / Exceptional
- Previous experience as a Site Reliability Engineer (SRE), particularly in AI or ML contexts.
- Experience with monitoring and logging tools (Prometheus, Grafana, Datadog, Jaeger).
- Strong networking knowledge and cloud security best practices.
- Professional certifications in Kubernetes or cloud platforms (AWS, Azure, GCP).
Benefits
- Medical, Vision, Dental, 401(k), Commuter, Health and Dependent FSA
- Unlimited PTO
- Industry-leading gender-neutral parental leave
- Paid Company Holidays
- Paid Sick Time
- Employee stock purchase program
- Disability and life insurance
- Employee assistance program
- Gym membership reimbursement
- Cell phone reimbursement
- Numerous company-sponsored events, including regular happy hours and team-building events
Compensation
Base Salary Range: $170,200–$234,600 USD (This U.S. role has a base pay range that will vary based on the location of the candidate. For some locations, a different pay range may apply and will be provided during the recruiting process.)
Other
- SentinelOne is proud to be an Equal Employment Opportunity and Affirmative Action employer.
- SentinelOne participates in the E-Verify Program for all U.S.-based roles.