Senior Site Reliability Engineer

The Charles Schwab Corp

Austin, TX

JOB DETAILS

SKILLS

Amazon Web Services (AWS), Analysis Skills, Application Programming Interface (API), Artificial Intelligence (AI), Automation, Best Practices, Budgeting, Cloud Architecture, Cloud Computing, Communication Skills, Computer Programming, Computer Science, Continuous Deployment/Delivery, Continuous Improvement, Continuous Integration, Cross-Functional, Customer Experience, DNS (Domain Name System), DevOps, Digital Certificates, Distributed Computing, Documentation, Finance, GCP (Google Cloud Platform), High Availability, Identify Issues, Incident Management, Java, Large-Scale Systems, Leadership, Load Balancing, Machine Tool, Mentoring, Microsoft Azure, On Call, Operational Improvement, Operations Management, Performance Analysis, Performance Tuning/Optimization, Problem Solving Skills, Process Improvement, Production Systems, Public Cloud, Python Programming/Scripting Language, Reliability Engineering, Reporting Dashboards, Root Cause Analysis, Scalable System Development, Scripting (Scripting Languages), Security Architecture, Software Development, Software Development Lifecycle (SDLC), Splunk, System Operations, Systems Maintenance, Systems Reliability, Systems Scalability, Team Lead/Manager, Technical Strategy, Telemetry, Unix Shell Programming

LOCATION

Austin, TX

POSTED

11 days ago

Your Opportunity

At Schwab, you're empowered to make an impact on your career. Here, innovative thought meets creative problem solving, helping us challenge the status quo and transform the finance industry together.

We believe in the importance of in-office collaboration and fully intend for the selected candidate for this role to work on site in the specified location(s).

As a Senior Site Reliability Engineer within the CET SAvE organization, you will play a critical leadership role advancing the reliability, scalability, and performance of Schwab's mobile and digital platforms. You will lead efforts to elevate production operations through modern Site Reliability Engineering practices, shaping how engineering teams design, build, and operate resilient systems at scale.

In this role, you will drive measurable improvements in service health and client experience by defining and executing strategies that enhance observability, automation, and system resilience. You will partner cross-functionally with engineering, architecture, infrastructure, and product teams to embed reliability, scalability, and operational excellence into the full software development lifecycle.

Success in this role requires strong problem-solving and decision-making skills, particularly in complex, high-scale distributed environments. You will influence technical direction, introduce best practices such as service level objectives and error budgets, and guide teams in reducing operational toil through automation and tooling innovation. Your leadership will ensure teams are aligned on reliability goals, respond effectively to production challenges, and continuously improve systems through learning and adaptation.

You will also play a key role in evolving operational maturity by strengthening on-call practices, enabling faster detection and resolution of issues, and fostering a culture of accountability, collaboration, and continuous improvement. This is an opportunity to shape enterprise-wide engineering standards while developing high-performing teams and advancing modern reliability engineering capabilities.

Key Responsibilities:

Production Operations & Incident Management

Respond to system alerts and production incident escalations
Lead or support incident triage, resolution, and root cause analysis
Drive and contribute to post-incident reviews and continuous improvement actions
Participate in an on-call rotation to support high-availability systems

Observability & Monitoring

Ensure comprehensive monitoring coverage and effective alerting strategies across systems
Continuously improve visibility into system performance, reliability, and health
Define and evolve observability best practices, including telemetry, dashboards, and alert thresholds

Automation & Engineering Excellence

Design and build automation solutions to reduce operational toil and improve resiliency
Develop scripts and tooling using Python and shell scripting for system maintenance and performance optimization
Contribute to CI/CD and deployment pipeline improvements
Automate processes such as service recovery, system maintenance, and certificate management

Collaboration & Partnership

Partner with development teams to understand system changes and ensure production readiness
Establish guardrails for monitoring, alerting, and escalation procedures
Embed reliability practices into the software development lifecycle

Reliability Engineering & Continuous Improvement

Proactively identify system weaknesses, risks, and performance gaps
Drive improvements in system reliability, scalability, and resilience
Implement and evolve SRE best practices (SLOs, error budgets, incident reduction strategies)

Innovation & Emerging Capabilities

Explore the use of AI and automation to improve incident detection, triage, and response
Identify opportunities to enhance response times and reduce manual intervention

Technical Influence & Mentorship

Mentor and support junior engineers in SRE best practices and automation techniques
Influence engineering teams to adopt proactive reliability and observability practices
Promote a culture of curiosity, ownership, and continuous improvement

What you have

To ensure that we fulfill our promise of "challenging the status quo," this role has specific qualifications that successful candidates should have:

Required Qualifications:

Bachelor of Science degree in Computer Science or a related field
10+ years of experience in software development and site reliability engineering, including work with cloud-native architectures and distributed systems
8+ years of experience in DevOps and/or site reliability engineering, with a focus on production operations, automation, and system reliability at scale
8+ years of experience with CI/CD pipelines, observability, and monitoring/telemetry platforms
5+ years of experience leading the implementation and scaling of reliability engineering practices such as service level objectives, monitoring strategies, incident reviews, and automation-driven improvements
Demonstrated ability to design, develop, and maintain production-grade systems, automation frameworks, and reliability tooling across the software development lifecycle
Experience supporting high-availability, distributed systems at scale
Deep experience with monitoring, observability, and incident management practices
Strong experience with automation, scripting, and operational tooling

Preferred Qualifications:

Strong programming and automation experience using languages such as Python or Java, including building scalable services and APIs
Experience with application performance monitoring tools (Splunk preferred)
Experience with Kubernetes and container orchestration platforms
Experience with infrastructure-as-code tools such as Terraform or similar technologies
Familiarity with public cloud platforms (AWS, GCP, or Azure)
Understanding of cloud infrastructure components including compute, storage, networking, load balancing, DNS, and security architectures
Proven ability to lead in fast-paced environments, influence cross-functional teams, and drive alignment on technical strategy
Strong communication skills with the ability to translate complex technical concepts to a variety of audiences

What Sets You Apart:

Proven ability to operate effectively in complex, large-scale systems with evolving documentation
Strong analytical and problem-solving skills with a proactive mindset
Curiosity and initiative to deeply understand systems and identify improvement opportunities
Ability to automate troubleshooting, enhance observability, and reduce manual intervention
Strong communication skills, especially during incidents and cross-team coordination

Applicants must be currently authorized to work in the United States on a full-time basis without employer sponsorship.

In addition to the salary range, this role is eligible for bonus or incentive opportunities.

About the Company

The Charles Schwab Corp

The Charles Schwab Corporation is a leading provider of financial services, with more than 300 offices. Through its operating subsidiaries, the company provides a full range of securities brokerage, banking, money management and financial advisory services to individual investors and independent investment advisors. Named "Highest in Investor Satisfaction with Self-Directed Services" by J.D. Power and Associates in 2009, its broker-dealer subsidiary, Charles Schwab & Co., Inc. (member SIPC), and affiliates offer a complete range of investment services and products, including an extensive selection of mutual funds; financial planning and investment advice; retirement plan and equity compensation plan services; referrals to independent fee-based investment advisors; and custodial, operational and trading support for independent, fee-based investment advisors through Schwab Advisor Services.

The Charles Schwab Bank (member FDIC) provides banking and mortgage services and products. To meet the needs of our clients, we are actively recruiting people with the desire, drive and creativity to find solutions that help meet our clients' needs; who want the chance to learn, grow with the company and explore their career opportunities; who will strive for excellence in achieving our clients' and our company's goals; who have the highest ethical standards - individuals who take pride in making a difference in people's lives.

COMPANY SIZE

1,000 to 1,499 employees

INDUSTRY

Financial Services

FOUNDED

1971

WEBSITE

http://www.aboutschwab.com/careers
