Site Reliability Engineer (SRE) - AI Infrastructure (San Francisco) Job at Hamilton Barnes Associates Limited, San Francisco, CA

  • Hamilton Barnes Associates Limited
  • San Francisco, CA

Job Description

Are you looking for an exciting new opportunity?

Join a stealth-mode hyperscale data center startup building a next-generation AI and cloud platform designed for startups and advanced research, powered by thousands of H100, H200, and B200 GPUs available on demand. Their platform supports everything from rapid experimentation to full-scale model training and inference, with flexible orchestration via Slurm, Kubernetes, or direct SSH access.

This is a rare opportunity to work at the intersection of hyperscale infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment. If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.

Responsibilities

  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimize resource scheduling, GPU utilization, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have

  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
  • Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Benefits

  • Equity

Salary

  • $300,000 gross per year

Job Tags

Full time, Flexible hours
