Service Reliability Engineer, G&A Solutions Engineering (GSE)

Apple Inc

TX


JOB DETAILS

SKILLS

Ansible, Apple, Automation, Bash Scripting, Best Practices, Chef (Configuration Management), Computer Science, Configuration Management, Continuous Deployment/Delivery, Continuous Improvement, Continuous Integration, Customer Support/Service, Database Technology, Design For Excellence (DFX), DevOps, Distributed Computing, Documentation, Health Maintenance, ITIL (IT Infrastructure Library), Identify Issues, Incident Management, Incident Response, Java, Linux Administration, MySQL, NoSQL, On Call, Operational Improvement, Operational Strategy, Performance Analysis, PostgreSQL, Problem Solving Skills, Process Improvement, Production Systems, Programming Languages, Puppet (Configuration Management), Python Programming/Scripting Language, Reliability Engineering, Root Cause Analysis, Scripting (Scripting Languages), Splunk, Systems Reliability, Unix System Administration, Windows PowerShell

LOCATION

TX

POSTED

14 days ago

Do you have a passion for ensuring the reliability, scalability, and performance of critical services? Are you a highly motivated, expert engineer with a strong understanding of Site Reliability Engineering (SRE) principles and a desire to automate and improve processes? Join Apple's General and Administrative (G&A) Solutions Engineering team as a Service Reliability Engineer and play a vital role in supporting our global, critical production systems.

As a Service Reliability Engineer, you'll be at the forefront of maintaining the health, stability, and efficiency of our services, working with a diverse range of technologies and platforms. You will collaborate with Engineers, Data Engineers, DBAs, and network specialists to proactively identify and resolve potential issues, automate repetitive tasks, and drive continuous improvement initiatives. Your expertise will directly impact the reliability of our systems, enabling Apple to deliver innovative products and services to our customers.

Proactively monitor service performance, identify potential bottlenecks, and implement solutions to optimize efficiency and resilience

Lead incident response efforts, driving rapid resolution and conducting thorough root cause analysis (RCA)

Develop and implement automation strategies to streamline operational tasks, improve service resilience, and reduce manual intervention

Apply SRE principles to maintain highly reliable and scalable service infrastructure

Collaborate closely with development teams to ensure that new services are designed for operational excellence, incorporating best practices for monitoring, alerting, and scalability

Contribute to the creation and maintenance of comprehensive documentation, including runbooks and service level objectives (SLOs)

Participate in on-call rotations, providing 24/7 support for critical services and responding to incidents with a sense of urgency

Identify opportunities for process improvement and drive initiatives to enhance the efficiency and effectiveness of the service reliability team

Champion a culture of continuous learning and knowledge sharing within the team

Define and track key service level indicators (SLIs) and service level objectives (SLOs) to measure and improve service reliability

3+ years of experience in a Site Reliability Engineering, DevOps, or related role, supporting large-scale, enterprise-level services.

Strong proficiency in at least one programming language (e.g., Python, Java, Go) and scripting languages (e.g., Bash, PowerShell)

Experience with cloud platforms (e.g., AWS, Azure, GCP) and cloud-native technologies (e.g., Kubernetes, Docker).

Hands-on experience with monitoring and alerting tools (e.g., Prometheus, Grafana, Splunk, Datadog)

Experience in RCA of technical issues

Bachelor's degree in Computer Science or related work experience

Proven ability to troubleshoot complex issues in distributed systems

Familiarity with CI/CD pipelines and DevOps practices

Experience with database technologies (e.g., MySQL, PostgreSQL, NoSQL databases)

Knowledge of ITIL frameworks and incident management processes

Understanding of Linux/Unix system administration

Experience with configuration management tools (Ansible, Chef, Puppet)

About the Company

Apple Inc

We bring amazing people together to make amazing things happen.

We’re a diverse collection of thinkers and doers, continually reimagining what’s possible to help us all do what we love in new ways. The people who work here have reinvented entire industries with the Mac, iPhone, iPad, and Apple Watch, as well as with services, including iTunes, the App Store, Apple Music, and Apple Pay. And the same passion for innovation that goes into our products also applies to our practices — strengthening our commitment to leave the world better than we found it.

About Apple

There’s a place here for every kind of brilliant. Everyone here is an innovator, or an innovator-to-be, no matter what your team or your role. So bring your passion, courage, and original thinking and get ready to share it, because every new product, service, or feature we invent is the result of people working together to make each other’s ideas stronger. Innovation at this level depends on people who represent the variety of the human experience and inspire us with their own fresh perspectives. Together, we’ll do amazing work that can make a difference in people’s lives. Including your own. Learn more about working at Apple.

COMPANY SIZE

10,000 employees or more

INDUSTRY

Computer/IT Services

FOUNDED

1976

WEBSITE

https://www.apple.com/jobs
