What you'll do:
• Build monitoring to ensure our platform is healthy and its reliability is measurable
• Build alerting and a set of runbooks to enable faster detection and remediation of platform issues
• Debug complex issues that may span multiple components of the stack, and ensure proper fixes are implemented to prevent them from recurring
• Participate in an on-call rotation and a culture of continuous improvement through blameless postmortems
• Design and implement platform components that enable features making our customers' work possible, simpler, and more efficient
• Build Kubernetes controllers to automate operations

The Site Reliability team works with engineering teams, including ingest/data processing, mapping, labeling, triage, machine learning (detection, prediction, tracking), motion planning/control, offline simulation, and release/deployment, to provide uniform service observability and incident response.