Green Bay, WI · 30+ days ago
• Own the reliability, performance, and scalability of cloud infrastructure (GCP)
• Manage production and staging environments, ensuring stability and availability
• Own the monitoring and observability stack: logging, tracing, metrics, and alerting
• Drive proactive capacity planning to stay ahead of demand as the client base and platform grow
• Own data platform infrastructure (BigQuery, data pipelines) and the compute/platform layer supporting ML and AI workloads
• Drive infrastructure architecture decisions in partnership with engineering leadership
• Drive SLA/SLO definition, measurement, and reporting
• Own the incident management framework: tooling, severity definitions, post-mortem process, and escalation to business continuity activation, leadership notification, and client communication
• Own disaster recovery and business continuity planning

• Cloud infrastructure architecture and operations (GCP preferred)
• Production environment management, monitoring, and observability tooling
• Data platform infrastructure and familiarity with modern data and AI/ML operational patterns
• Network security, identity and access management, and vulnerability management
• Disaster recovery and business continuity planning
• Cloud-native architectures and infrastructure-as-code