Site Reliability Engineer, Physical Infrastructure

Apple Inc

Cupertino, CA

JOB DETAILS
SKILLS
Apple, Artificial Intelligence (AI), Automation, Bash Scripting, Capacity and Performance Management, Command Line, Communication Skills, Computer Science, DNS (Domain Name System), Data Analysis, DevOps, Docker, Go Programming Language (Golang), HTTP (HyperText Transport Protocol), High Availability, Identify Issues, Incident Management, Incident Response, Infrastructure Software, Leadership, Linux Administration, Machine Tool, Mentoring, Modeling Languages, Network Operations Center, Operations Planning, Performance Analysis, Performance Management, Problem Solving Skills, Python Programming/Scripting Language, Refactoring, Reliability Engineering, Scripting (Scripting Languages), Software Development, Software Engineering, Swift, Systems Administration/Management, Systems Engineering, Systems Reliability, Systems Scalability, TCP/IP (Transmission Control Protocol/Internet Protocol), Testing, Unix Shell Programming, Unix System Administration
LOCATION
Cupertino, CA
POSTED
30+ days ago

We are looking for a creative and highly motivated Site Reliability Engineer to join our team. You'll need both depth and breadth of knowledge of physical infrastructure in a large-scale distributed environment, along with experience in Unix systems administration, DevOps, and data center infrastructure. If you are passionate about solving complex problems at scale, we want to hear from you!

The Systems and Infrastructure team builds and manages world-class services and physical infrastructure that Apple software engineers worldwide use to build, test, and release Apple's software.

About Our Team: We are a team dedicated to engineering excellence, reusable design, and simplicity. We foster a supportive, growth-focused culture where we mentor each other and work together to build resilient, high-quality systems.

Ensure System Reliability: Design, build, and maintain robust, scalable, and observable systems for our core infrastructure services

Automate: Reduce operational toil by developing automation and tooling to prevent and rapidly resolve production issues

Improve Incident Response: Own and refine our incident management processes to ensure high availability

Collaborate with Engineers: Partner with development teams to create functional, high-quality solutions that support the entire workflow

Improve and Modernize Systems: Use a proactive approach to identify and eliminate technical debt to enhance long-term reliability and maintainability

What You'll Bring: We know that great talent comes from a variety of backgrounds, and we encourage you to apply even if you don't meet every single requirement. The most important thing is a deep commitment to building reliable systems and strong collaboration with team members across different organizations.

3+ years of experience as a Site Reliability Engineer, DevOps Engineer, or Systems Administrator focused on physical infrastructure in a large-scale distributed environment

Strong software development skills in a language like Swift, Go, or Python, and a high degree of comfort with shell scripting (Bash)

Hands-on experience building and managing systems with containers and container orchestration tools (Docker, Kubernetes)

Deep understanding of networking (TCP/IP, DNS, HTTP) and experience using observability tools (monitoring, logging, tracing) to diagnose complex issues

Excellent problem-solving and communication skills, with a strong sense of ownership and drive

BS/MS in Computer Science, Engineering, or a related field

Build automation tools that eliminate routine tasks. Every manual process is an opportunity to code a solution

Experience with Unix/Linux systems administration and command-line diagnostic tools

Proven experience leading initiatives to reduce technical debt, refactor systems, or improve performance and latency

Expertise in performance analysis and capacity planning for physical infrastructure

Demonstrated ability to lead incident response for high-impact outages

Familiarity with using Generative AI (GenAI) or Large Language Models (LLMs) to accelerate operational tasks, such as automating runbooks, generating scripts, or analyzing incident data

About the Company

Apple Inc

We bring amazing people together to make amazing things happen.

We’re a diverse collection of thinkers and doers, continually reimagining what’s possible to help us all do what we love in new ways. The people who work here have reinvented entire industries with the Mac, iPhone, iPad, and Apple Watch, as well as with services, including iTunes, the App Store, Apple Music, and Apple Pay. And the same passion for innovation that goes into our products also applies to our practices — strengthening our commitment to leave the world better than we found it.

About Apple

There’s a place here for every kind of brilliant. Everyone here is an innovator, or an innovator-to-be, no matter what your team or your role. So bring your passion, courage, and original thinking and get ready to share it, because every new product, service, or feature we invent is the result of people working together to make each other’s ideas stronger. Innovation at this level depends on people who represent the variety of the human experience and inspire us with their own fresh perspectives. Together, we’ll do amazing work that can make a difference in people’s lives. Including your own. Learn more about working at Apple.

COMPANY SIZE
10,000 employees or more
INDUSTRY
Computer/IT Services
FOUNDED
1976
WEBSITE
https://www.apple.com/jobs