Technical Incident Manager

at Groq
USD 123,250-198,950 per year
MIDDLE
✅ Remote

Required Skills & Competences

Security @ 3, Software Development @ 6, Go @ 5, Terraform @ 5, Python @ 5, SQL @ 3, Communication @ 3, Compliance @ 3

Details

Groq delivers fast, efficient AI inference. Our LPU-based system powers GroqCloud™, giving businesses and developers the speed and scale they need. Headquartered in Silicon Valley, we are on a mission to make high-performance AI compute more accessible and affordable. When real-time AI is within reach, anything is possible. Build fast.

Mission:

Enable Groq to achieve world-class reliability that minimizes customer impact, accelerates recovery, and ensures transparent communication with our customers.

Responsibilities & Opportunities:

  • Incident Response & Command: Act as the primary Incident Commander during critical outages, coordinating rapid response, remediation efforts, and clear communication with stakeholders.
  • Facilitate Post-Incident Analysis: Help teams complete detailed incident retrospectives in a timely fashion and share key learnings with the organization.
  • Collaboration & Communication: Work closely with engineering, operations, and product teams. Ensure that insights from incidents translate into tangible action items and improvements in infrastructure and code quality.
  • Process Improvement: Identify opportunities to improve incident management processes, procedures, and tools. Collaborate with teams to implement those improvements.
  • Automation & Tooling: Develop, automate, and enhance internal workflows related to incident management and disaster recovery.
  • Security and Compliance: Uphold the highest security standards to safeguard customer data, aligning with stringent compliance protocols.

Requirements:

  • Proven track record as an Incident Commander in high-pressure, large-scale production environments.
  • Experience handling incidents related to Security, IT, Legal, and HR, in addition to software-related incidents.
  • 5+ years of software development experience with a strong emphasis on system reliability and incident management.
  • Proficiency in Python, Go, or Terraform, with the ability to develop internal tooling.
  • Experience with BigQuery and writing SQL against incident-related datasets.
  • Ability to maintain composure, think clearly, and lead effectively during high-stress situations.
  • Experience handling both customer and internal communication during incidents.

Attributes of a Groqster:

  • Humility - Egos are checked at the door.
  • Collaborative & Team Savvy - We make up the smartest person in the room, together.
  • Growth & Giver Mindset - Learn-it-all versus know-it-all; we share knowledge generously.
  • Curious & Innovative - Take a creative approach to projects, problems, and design.
  • Passion, Grit, & Boldness - No-limit thinking, fueling informed risk-taking.

Compensation:

At Groq, a competitive base salary is part of our comprehensive compensation package, which includes equity and benefits. For this role, the base salary range is $123,250 to $198,950, determined by your skills, qualifications, experience, and internal benchmarks.