Site Reliability Engineer - High Performance Computing / AI-ML

at X

📍 Austin, United States

$120,000-297,000 per year

MIDDLE SENIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Software Development @ 3, Ansible @ 6, Chef @ 6, Kubernetes @ 3, Linux @ 3, Python @ 5, Scala @ 5, Java @ 5, Distributed Systems @ 2, Bash @ 5, Communication @ 3, Networking @ 2, Puppet @ 6

Details

Are you prepared to join the X team and help build the ultimate real-time information-sharing app, revolutionizing how people connect? At X, we're on a mission to become the trusted global digital public square, committed to protecting freedom of speech and building the future of unlimited interactivity. Our goal is to empower every user to freely create and share ideas, fostering open public discourse without barriers. Join us in shaping this thrilling journey where your contribution will be invaluable to our success!

Responsibilities

  • Managing and troubleshooting large scale clusters to ensure the stability and efficiency of our platform (primarily Linux + Kubernetes)
  • Collaborating with cross-functional teams, including hardware engineers and software developers, to support and improve our infrastructure
  • Automating the provisioning and deployment of systems to enhance long-term health and scalability
  • Ensuring the robustness of our HPC environments and storage clusters
  • Writing and maintaining scripts and tools for automation and monitoring
  • Addressing system failures and performance issues, identifying root causes, and implementing preventive measures
  • Working closely with end-users to understand changing needs as our environment evolves

Requirements

  • 2+ years of professional software development experience
  • Extensive experience with Kubernetes and container orchestration
  • Proficiency in one or more object-oriented programming languages (e.g. Python, Java, C++, Scala)
  • Proficiency in scripting languages (Python, Bash, etc.)
  • Strong experience in configuration management (e.g., Puppet, Ansible, Chef)
  • Familiarity with Ethernet networking at scale and distributed systems
  • Strong troubleshooting skills and experience with HPC environments
  • Experience managing large-scale systems, ideally supporting thousands of machines
  • Working understanding of the storage systems required to support such environments
  • Experience with various GPU / accelerator architectures and ability to optimize performance on such platforms
  • Ability to think outside the box and come up with innovative solutions to complicated problems
  • Extremely committed and willing to work in a fast-paced environment
  • Excellent communication and interpersonal skills

Benefits

At X, our small but fast-paced team values innovation, creativity, and a strong commitment to our mission. As a Site Reliability Engineer, you'll have the opportunity to make a significant impact on the future of X and our aspiration to build the Everything App.