Staff Engineer, Datacenter Server Lifecycle

at Anthropic

📍 New York City, United States
📍 San Francisco, United States

USD 320,000-405,000 per year

MIDDLE

✅ Hybrid

✅ Visa Sponsorship

Used Tools & Technologies

IaC

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. The levels mean:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose.

3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology.

7-9: expert. You can teach others and you know the pitfalls and tricks.

10: exceptional knowledge. Comprehensive understanding of and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Security @ 3, Go @ 5, Kubernetes @ 3, Python @ 5, GCP @ 3, Java @ 5, AWS @ 3, Communication @ 3, Networking @ 3, Planning @ 3, Rust @ 5, GPU @ 3, AI @ 3

Details

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.

Role overview

As a Staff Engineer on the Datacenter Server Lifecycle team, you will own the end-to-end operational journey of every machine in our facilities — from initial provisioning and deployment, across its working life, through maintenance and refresh, and all the way to decommissioning. This is greenfield work: you will help define processes, tooling, and operational standards that govern how we run and retire hardware at scale. A distinguishing aspect of this role is its deep intersection with security: machines handle sensitive workloads, and ensuring each machine is trusted, attested, and operating with a verified chain of integrity from hardware up is a core part of the job. You will partner closely with Infrastructure Security and Networking teams.

Responsibilities

Lead the build-out of automation to support datacenters containing tens of thousands of servers.
Define and own the end-to-end server lifecycle strategy — provisioning, deployment, operation, maintenance, refresh, and decommissioning — and maintain automation and operational procedures for common lifecycle events (hardware failures, firmware upgrades, fleet rotations).
Partner closely with Infrastructure Security to design and enforce trusted compute standards across the server lifecycle (secure provisioning through end-of-life handling).
Work with the Networking team to ensure end-to-end connectivity across all sites.
Build and maintain tooling to track machine health, configuration, and operational status across the full datacenter fleet.

Minimum qualifications

Hands-on experience with server hardware, including rack deployment, cabling, troubleshooting, and understanding failure modes at scale.
End-to-end understanding of hardware lifecycle management: asset tracking, provisioning workflows, maintenance scheduling, and decommissioning practices.
Proficiency in at least one programming language, such as Python, Rust, Go, or Java.
Working knowledge of modern cloud infrastructure, including Kubernetes, Infrastructure as Code, AWS, and GCP.
Ability to communicate clearly and build consensus with a wide range of stakeholders.
Comfort navigating ambiguity and making progress on complex, cross-functional problems.
Willingness to travel occasionally to datacenter sites across North America.

Preferred qualifications

8+ years of experience in datacenter operations, hardware infrastructure management, or a closely related discipline.
Hands-on experience with GPU or AI accelerator hardware (e.g., NVIDIA A100/H100, AMD MI300, Google TPUs, or AWS Trainium) and an understanding of their operational demands.
Familiarity with modern provisioning tooling such as coreboot, LinuxBoot, or u-root.
Experience building or contributing to datacenter automation or fleet management platforms.
Experience building and deploying server operating system distributions across thousands of hosts.
Background in large-scale capacity planning and hardware refresh strategy, ideally at a hyperscaler or large cloud provider.
Experience with trusted compute and hardware security concepts such as secure boot, TPM, hardware attestation, and firmware verification — or a strong desire to develop deep expertise in this area.

Compensation

Annual Salary: $320,000 - $405,000 USD

Logistics

Minimum education: Bachelor’s degree or equivalent combination of education, training, and/or experience.
Location-based hybrid policy: Anthropic currently expects all staff to be in one of its offices at least 25% of the time (some roles may require more office time).
Visa sponsorship: Anthropic states that it sponsors visas and will make reasonable efforts to obtain a visa for an offer recipient; it retains an immigration lawyer to help.

How we're different

Anthropic works as a cohesive team on a few large-scale research efforts and values communication and collaboration. The role is embedded in large-scale AI infrastructure work and intersects with research and security efforts.

How to apply

Applications are handled through Anthropic's careers portal. The posting asks for a resume or LinkedIn profile and includes standard application questions (location, visa needs, availability, and so on).