This role is ideal for someone who enjoys working directly in Azure, improving production systems, troubleshooting issues across infrastructure and application layers, and building practical monitoring and alerting solutions that help teams respond faster and operate more confidently. Wellfit is the dental industry’s fintech solution, breaking down financial barriers so patients, providers, employers, and payors can all access better care.
As a Lead Site Reliability Engineer at JPMorgan Chase within the Infrastructure & Production Management sector of Consumer & Community Banking, you hold a leadership role in your team, demonstrate strong knowledge across multiple technical domains, and advise others on the technical and business issues facing them. Take the lead in conducting resiliency design reviews, break up complex problems into digestible work for other engineers, act as a technical lead for medium to large-sized products, and provide advice and mentoring to other engineers.
This requires oversight of all routine and strategic infrastructure initiatives, including operating system upgrades, patching, EOL remediation, infrastructure changes, middleware and database activities, cloud technologies and readiness, tooling modernization, and automation at scale. You will lead ongoing improvements in automation, resilience engineering, disaster recovery readiness, and operational maturity, creating repeatable, well-engineered processes that support rapid change with minimal risk.
We build and operate a suite of platforms and applications that prevent, detect, and mitigate regulatory and reputational risk across the firm. We have access to the latest technology and to massive amounts of structured and unstructured data, and we leverage modern frameworks to build responsive and intuitive front-end and Big Data applications. We're committed to fostering and advancing diversity and inclusion in our own workplace and beyond by ensuring every individual within our firm has a number of opportunities to grow professionally and personally, from our training and development opportunities and firmwide networks to benefits, wellness and personal finance offerings and mindfulness programs.
Dallas, Texas - 30+ days ago
Core Responsibilities:
- Enterprise Architecture: Lead the design, governance, and rollout of Dynatrace observability for distributed microservices, serverless workloads, and multi-region cloud environments. This is a high-impact role designed for a technical leader with nearly a decade of specialization in Dynatrace SaaS, tasked with architecting and automating large-scale monitoring solutions across complex AWS and Azure environments.
NTT DATA recruiters will never ask for payment or banking information and will only use @nttdata.com, @nttdatafed.com and @talent.nttdataservices.com email addresses. If you would like to contact us regarding the accessibility of our website or need assistance completing the application process, please contact us at https://us.nttdata.com/en/contact-us.
As a Lead Site Reliability Engineer at JPMorgan Chase within the Corporate sector, Enterprise technology team, you hold a leadership role in your team, demonstrate strong knowledge across multiple technical domains, and advise others on the technical and business issues facing them. Take the lead in conducting resiliency design reviews, break up complex problems into digestible work for other engineers, act as a technical lead for medium to large-sized products, and provide advice and mentoring to other engineers.
Arlington, TX - 30+ days ago
Work closely with data scientists, data architects, data engineers, ETL developers, cybersecurity, network, Linux, other IT counterparts, and business partners to design and set up the environments to manage the ingested and processed datasets from the external sources, internal systems, and the data warehouse to extract features of interest. Solid experience in High Availability and distributed systems, Linux, Data and SAN Storage Networks, NAS and Networking, leveraging tools to instrument and automate proactive and eventually predictive availability solutions.
Protocol Expertise: Mastery of DNS-specific protocols including DNSSEC, DoT, and DoH, with a firm grasp of underlying transport layers (UDP/TCP) and dual-stack (IPv4/IPv6) networking. You will combine deep IP networking and DNS expertise with modern security protocols to ensure our platforms remain resilient against evolving threats and perform at the highest level for millions of users.
To find the pay range for this role based on hiring location, visit https://paylookup.t-mobile.com/paylookup?reqID=REQ356753&paradox=1. The Senior Site Reliability Engineer leverages automation, CI/CD practices, scripting, observability, and incident management expertise to improve reliability, scalability, and operational efficiency across a complex technology environment.
The ideal candidate will bring deep expertise in distributed systems, cloud-native infrastructure, SaaS application support and DevOps/SRE principles, along with strong leadership and collaboration skills to influence cross-functional engineering and Production management teams and drive continuous improvement in service reliability. This includes our commitment to being an inclusive workplace, attracting and developing exceptional talent, supporting our teammates’ physical, emotional, and financial wellness, recognizing and rewarding performance, and how we make an impact in the communities we serve.
You'll combine hands-on Azure experience with code-level debugging, observability best practices, and automation to prevent issues before they occur, drive down MTTD/MTTR, and deliver an exceptional experience for patients and providers.
- Make an Impact: Your work will directly shape the financial backbone of one of the most innovative healthcare fintech companies in the U.S.
- Work Flexibly: Hybrid model based in Dallas with 3 days/week in-office.
With offices in Toronto, San Francisco, Dallas, and Pittsburgh, Waabi is growing quickly and looking for diverse, innovative and collaborative candidates who want to impact the world in a positive way. Experience with established methodologies (e.g., 5 Whys, Fishbone diagrams, Fault Tree Analysis, or FMEA) to support the responsibility of conducting thorough root cause analyses on recurring issues.
We are hiring a Site Reliability Engineer (SRE) with Oracle DBA expertise to join our client's Data Center Engineering team. This role focuses on ensuring uptime, scalability, and resilience of mission-critical Oracle database infrastructure.
Our holistic approach to decisioning is powered by our industry-leading platform and team of experts, who help leaders make better decisions, faster - unlocking business growth and creating powerful customer connections. With clients in 50+ countries and global offices across New York City, Miami, Dallas, Dublin, London, Paris, Singapore, Shanghai, Munich, Poznan, Sydney, Melbourne, Charlottesville and Denver, we're growing fast.
The ideal candidate has deep experience operating production systems at scale, an automation-first mindset, and the ability to improve reliability through engineering practices such as SLOs/SLIs, production readiness, incident management, observability, resilience testing, and toil reduction. As part of a new SRE team supporting Autodesk GovCloud, you will have a unique opportunity to help shape how Autodesk deploys, runs, and improves production services in restricted cloud environments.
As a Site Reliability Engineer III at JPMorgan Chase within Consumer and Community Banking, you will serve as an experienced member of an agile team, focusing on designing and delivering trusted, market-leading technology products that are secure, stable, and scalable. Experience in observability such as white and black box monitoring, service level objective alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, and others.
As a Senior Lead Site Reliability Engineer at JPMorgan Chase within the Consumer and Community Banking team, you will set clear quality gates across requirements, design, secure coding, testing, releases, and post-production monitoring to ensure reliability, performance, security, and observability. Lead and participate in major incident response (including outside business hours when needed), run post-incident reviews, and drive improvements against KPIs like availability, MTTR, and change failure rate.
Experience building, using, and automating monitoring systems such as New Relic, Datadog, SignalFx, and Kibana. Hands-on experience deploying, operating, and monitoring production-grade AI/ML microservices (e.g., RAG pipelines, agentic systems) on cloud platforms like AWS Fargate/ECS. Hands-on experience building and operating distributed systems in a public cloud environment (preferably AWS), using CI/CD to deploy, manage and operate production systems, focusing on tooling and automation using tools such as Maven and Jenkins.
The Technology & Operational Risk department within the Multifamily (MF) division is seeking a Site Reliability Engineer (SRE) who will blend software engineering with IT operations to ensure the reliability, availability, scalability, and performance of key systems, services, and environments. Qualifications:
Proven expertise in designing, developing, and maintaining automation frameworks for application operations, including infrastructure provisioning, deployment pipelines, monitoring, and incident response, using tools such as Ansible, Terraform, Jenkins, and related technologies.
The AWS Site Reliability Engineer (SRE) will collaborate closely with cross-functional teams, including development, quality assurance, and operations, to ensure seamless software releases and continuous improvement of our release processes. What you will do:
Infrastructure Automation: Design, implement, and manage infrastructure as code (IaC) solutions using tools like AWS CloudFormation, Terraform or Helm Charts to automate continuous database deployment and scaling processes.
Toyota is proud to have 10+ different Business Partnering Groups across 100 different North American chapter locations that support team members’ efforts to dream, do and grow without questioning that they belong. Write and maintain scripts and automation workflows to reduce manual toil and streamline operational tasks (e.g., provisioning, configuration management, log rotation, disk cleanup, service restarts).
In this role, you will contribute to environmental and stress testing efforts, support failure analysis investigations, and help analyze test and field data to identify potential reliability risks. You'll work closely with design, manufacturing, and supplier teams to help implement design-for-reliability best practices and assist with reliability verification activities from concept through production.
Founded in 2009, we continue to be recognized for our intentional culture and tremendous growth (Best Place to Work in Fintech; Best & Brightest to Work For Nationally; and Comparably's Best Company Culture, Best Career Growth, Best Engineering Team, and Best Places to Work in Dallas, among others). The Sr Site Reliability Engineer, Release will prototype, write, maintain, and test code in multiple stages of the release process and in multiple environments in order to rapidly deliver automated solutions to our application releases.
Most recently, we were recognized as Stevie Employer of the Year 2025, SIA Best Staffing Firm to Work For 2025, Inc 5000 Best Workspaces in US (2025 & 2024) and Glassdoor's Best Places to Work (2023 & 2022)! Primary Skills: Scripting (Expert), Java (Expert), Monitoring (Expert), Cloud Platforms (Intermediate), Database Technologies (Intermediate).
AI-Driven Insights: Harness Davis AI for causal analysis and root cause identification; develop custom dashboards, alerting profiles, and auto-remediation workflows to minimize MTTR.
An important part of the Toyota family is Toyota Financial Services (TFS), the finance and insurance brand for Toyota and Lexus in North America.