Green Bay, WI · 30+ days ago
• Own the reliability, performance, and scalability of cloud infrastructure (GCP)
• Manage production and staging environments, ensuring stability and availability
• Own the monitoring and observability stack: logging, tracing, metrics, and alerting
• Drive proactive capacity planning to stay ahead of demand as the client base and platform grow
• Own data platform infrastructure (BigQuery, data pipelines) and the compute/platform layer supporting ML and AI workloads
• Drive infrastructure architecture decisions in partnership with engineering leadership
• Drive SLA/SLO definition, measurement, and reporting
• Own the incident management framework: tooling, severity definitions, post-mortem process, and escalation to business continuity activation, leadership notification, and client communication
• Own disaster recovery and business continuity planning

• Cloud infrastructure architecture and operations (GCP preferred)
• Production environment management, monitoring, and observability tooling
• Data platform infrastructure and familiarity with modern data and AI/ML operational patterns
• Network security, identity and access management, and vulnerability management
• Disaster recovery and business continuity planning
• Cloud-native architectures and infrastructure-as-code