As an SRE at OutSystems, here are your key responsibilities and duties:
Lead and onboard services and teams to the reliability tenets;
Establish and maintain Service Level Objectives (SLOs) and Service Level Agreements (SLAs);
Design and implement scalable, reliable, and secure infrastructure that follows cloud-native best practices;
Collaborate with software development teams to ensure systems are resilient (observable, fault-tolerant, recoverable, scalable) and performant;
Implement monitoring, alerting, logging, and tracing solutions to detect and respond to incidents;
Lead incident response efforts, ensuring quick resolution and minimal downtime, and conduct root cause analyses (RCAs) and post-mortems;
Automate every operational task, with a special focus on fast incident detection and recovery;
Program in Python, supported by Gen AI tooling, to accelerate development of mission-critical automation and tools (CKA, CKAD, and CKS certifications are valued);
Experience with automation and Infrastructure as Code (IaC) tools such as AWS CloudFormation, Terraform, Puppet, Chef, or Spacelift;
Experience with Python, Go, Bash/Shell scripting, or other automation tools/languages;
Familiarity with AWS services such as EC2, RDS, ELB, CloudFront, and Lambda;
Proficiency in monitoring and troubleshooting complex distributed systems;
Experience with Grafana, the ELK stack, Prometheus, or similar tools;
Strong understanding of designing resilient and fault-tolerant systems;
Expertise in debugging complex distributed systems.