Site Reliability Engineer (SRE) - AI Infrastructure (San Francisco) Job at Hamilton Barnes Associates Limited, San Francisco, CA

  • Hamilton Barnes Associates Limited
  • San Francisco, CA

Job Description

Are you looking for an exciting new opportunity?

Join a stealth-mode hyperscale data center startup building a next-generation AI and cloud platform designed for startups and advanced research, powered by thousands of H100, H200, and B200 GPUs available on demand. Their platform supports everything from rapid experimentation to full-scale model training and inference, with flexible orchestration via Slurm, Kubernetes, or direct SSH access.

This is a rare opportunity to work at the intersection of hyperscale infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment. If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.

Responsibilities

  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimize resource scheduling, GPU utilization, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have

  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
  • Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Benefits

  • Equity

Salary

  • $300,000 gross per year

Job Tags

Full time, Flexible hours
