Sr. Site Reliability Engineer

Tiger Analytics LLC

DC

JOB DETAILS
SKILLS
Artificial Intelligence (AI), Automation, Autoscaling, Budget Management, Budgeting, Cloud Computing, Continuous Deployment/Delivery, Continuous Integration, Ecosystems, GPU (Graphics Processing Unit), GitHub, High Availability, Identify Issues, Incident Management, Incident Response, Modeling Languages, On Call, Performance Engineering, Production Systems, Python Programming/Scripting Language, Reliability Engineering, Reporting Dashboards, Resource Utilization, Root Cause Analysis, Scripting (Scripting Languages), Service Level Agreement (SLA), Software Engineering, System Architecture
LOCATION
DC
POSTED
30+ days ago

Role Overview

We are seeking a high-caliber Site Reliability Engineer (SRE) to join our Forward Engineering team. You will be the guardian of our production ecosystems, ensuring that our complex, data-driven AI platforms remain resilient, scalable, and highly performant. This role is a hybrid of software engineering and systems architecture, with a specialized focus on MLOps: bridging the gap between model development and production-grade reliability.

Key Responsibilities

  1. Reliability & Performance Engineering
  • SLA/SLO Management: Define, monitor, and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for critical AI/ML services.
  • Error Budgeting: Manage error budgets to balance the velocity of feature releases from the ML team with the stability of the production environment.
  • Scalability: Architect and manage auto-scaling strategies for Kubernetes (GKE) to handle fluctuating workloads during model training and high-volume inference.
  2. MLOps & AI Infrastructure
  • Model Serving Reliability: Ensure the high availability of Vertex AI endpoints and custom inference services.
  • GPU/TPU Optimization: Monitor and optimize compute resource utilization (accelerators) to ensure cost-efficient performance for Large Language Models (LLMs).
  • Pipeline Resilience: Support and stabilize ML pipelines (Vertex AI Pipelines/Kubeflow) to ensure seamless data flow from ingestion to model retraining.
  3. Automation & Orchestration (Eliminating "Toil")
  • Infrastructure as Code (IaC): Use Terraform or Pulumi to provision and manage consistent, version-controlled cloud environments.
  • CI/CD & GitOps: Design and optimize robust deployment pipelines for both application code and ML models using GitHub Actions, Cloud Build, or ArgoCD.
  • Task Automation: Develop custom Python or Go scripts to automate repetitive operational tasks, self-healing mechanisms, and resource cleanup.
  4. Monitoring, Alerting & Incident Response
  • Observability: Build and manage comprehensive dashboards using Prometheus, Grafana, or Google Cloud Operations Suite (Stackdriver).
  • Incident Management: Act as a primary responder in on-call rotations, leading the technical resolution of production outages.
  • Blameless Post-Mortems: Conduct deep-dive root cause analysis (RCA) to ensure systemic issues are identified and permanently remediated through code.

About the Company

Tiger Analytics LLC