This role is ideal for someone who enjoys working directly in Azure, improving production systems, troubleshooting issues across infrastructure and application layers, and building practical monitoring and alerting solutions that help teams respond faster and operate more confidently. Wellfit is the dental industry’s fintech solution, breaking down financial barriers so patients, providers, employers, and payors can all access better care.
Toyota is proud to have 10+ different Business Partnering Groups across 100 different North American chapter locations that support team members' efforts to dream, do and grow without questioning that they belong. You must have the right to work in the United States and not require Toyota support or sponsorship for immigration-related employment (e.g., H-1B, O-1, E-3, H-1B1, TN, F-1 OPT, F-1 STEM OPT, F-1 CPT, 'job flexibility benefits' (also known as I-140 or Adjustment of Status portability), etc.).
As part of our journey from traditional operations toward a mature SRE model, the Senior SRE will partner with product engineering, platform teams, and the Command Center, including Service Desk and Major Incident Command (MIC), to deliver measurable improvements in service reliability.
Deep knowledge of:
Azure: AKS, App Services, Functions, VMSS, Storage, Front Door, API Management, Load Balancers, Monitor, Log Analytics, App Insights, Key Vault, Policy, Defender.
You'll combine hands-on Azure experience with code-level debugging, observability best practices, and automation to prevent issues before they occur, drive down MTTD/MTTR, and deliver an exceptional experience for patients and providers.
- Make an Impact: Your work will directly shape the financial backbone of one of the most innovative healthcare fintech companies in the U.S.
- Work Flexibly: Hybrid model based in Dallas with 3 days/week in-office.
Write and maintain scripts and automation workflows to reduce manual toil and streamline operational tasks (e.g., provisioning, configuration management, log rotation, disk cleanup, service restarts).
GEICO's Cyber Security Engineering & Analytics, Automation (SEA) team is seeking a Staff Cyber Site Reliability Engineer (SRE) - a hands-on, engineering-minded practitioner who is passionate about building reliable, observable, and scalable systems at the intersection of security and infrastructure. Partner with Developers and Infrastructure Engineers: Work closely with software engineers and infrastructure teams to review system designs for reliability, provide feedback on deployability and operability, and ensure that what gets built can be confidently operated and maintained in production.
As a Lead Site Reliability Engineer at JPMorgan Chase within the Corporate sector, Enterprise technology team, you hold a leadership role in your team, demonstrate strong knowledge across multiple technical domains, and advise others on the technical and business issues facing them. Take lead and conduct resiliency design reviews, break up complex problems into digestible work for other engineers, act as a technical lead for medium to large-sized products, and provide advice and mentoring to other engineers.
Our holistic approach to decisioning is powered by our industry-leading platform and team of experts, who help leaders make better decisions, faster - unlocking business growth and creating powerful customer connections. With clients in 50+ countries and global offices across New York City, Miami, Dallas, Dublin, London, Paris, Singapore, Shanghai, Munich, Poznan, Sydney, Melbourne, Charlottesville and Denver, we're growing fast.
The ideal candidate has deep experience operating production systems at scale, an automation-first mindset, and the ability to improve reliability through engineering practices such as SLOs/SLIs, production readiness, incident management, observability, resilience testing, and toil reduction. As part of a new SRE team supporting Autodesk GovCloud, you will have a unique opportunity to help shape how Autodesk deploys, runs, and improves production services in restricted cloud environments.
NTT DATA recruiters will never ask for payment or banking information and will only use @nttdata.com, @nttdatafed.com and @talent.nttdataservices.com email addresses. If you would like to contact us regarding the accessibility of our website or need assistance completing the application process, please contact us at https://us.nttdata.com/en/contact-us.
Support reliability test methods including thermal cycling, thermal shock, high-temperature exposure, humidity, corrosion, pressure cycling, leak testing, coolant compatibility, mechanical fatigue, and bond/interface reliability. Lead and manage reliability test planning and execution for semiconductor packaging, liquid cold plates, T800 Thermadite, CVD diamond, thermal spreaders, embedded cooling structures, and related thermal assemblies.
Required Qualifications:
- 5+ years of hands-on experience in SRE, DevOps, or Infrastructure Engineering roles
- Strong production experience in AWS
- Required: Significant hands-on experience with Terraform in real-world environments
- Experience operating monitoring and uptime platforms such as Grafana, Pingdom, and Uptrends
- Strong Linux systems, networking, and troubleshooting skills
- Experience supporting production systems through incident response and on-call rotations
- Proficiency with GitHub and modern Git workflows
- Experience building or maintaining CI/CD pipelines with Azure DevOps
- Familiarity with ITSM and incident workflows using ServiceNow
- Strong written communication skills with experience documenting systems and processes in Confluence
- Ability to work independently in a remote or hybrid environment

Preferred Qualifications:
- Experience defining and operating against SLOs and error budgets
- Infrastructure-as-Code best practices beyond Terraform (modules, testing, CI integration)
- Experience with containers and orchestration (Docker, Kubernetes)
- Experience supporting large-scale, high-availability production systems
- Prior experience mentoring engineers or serving as a technical lead
The ideal candidate will leverage Azure-native AI services and agentic systems to reduce toil, improve incident response, and enable intelligent operations, while also driving performance testing practices to validate system resilience under load.
- Design, develop, and execute performance testing strategies for distributed systems and microservices, including load testing, stress testing, soak testing, and capacity planning.
As a Site Reliability Engineer III at JPMorgan Chase within Consumer and Community Banking, you will serve as an experienced member of an agile team, focusing on designing and delivering trusted, market-leading technology products that are secure, stable, and scalable. Experience in observability such as white and black box monitoring, service level objective alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, and others.
As a Senior Lead Site Reliability Engineer at JPMorgan Chase within the Consumer and Community Banking team, you will set clear quality gates across requirements, design, secure coding, testing, releases, and post-production monitoring to ensure reliability, performance, security, and observability. Lead and participate in major incident response (including outside business hours when needed), run post-incident reviews, and drive improvements against KPIs like availability, MTTR, and change failure rate.
Building, using, and automating monitoring systems such as New Relic, Datadog, SignalFx, and Kibana. Hands-on experience deploying, operating, and monitoring production-grade AI/ML microservices (e.g., RAG pipelines, agentic systems) on cloud platforms like AWS Fargate/ECS. Hands-on experience building and operating distributed systems in a public cloud environment (preferably AWS), using CI/CD to deploy, manage and operate production systems, focusing on tooling and automation using tools such as Maven and Jenkins.
The Technology & Operational Risk department within the Multifamily (MF) division is seeking a Site Reliability Engineer (SRE) who will blend software engineering with IT operations to ensure the reliability, availability, scalability, and performance of key systems, services, and environments.
Qualifications:
Proven expertise in designing, developing, and maintaining automation frameworks for application operations, including infrastructure provisioning, deployment pipelines, monitoring, and incident response, using tools such as Ansible, Terraform, Jenkins, and related technologies.
An important part of the Toyota family is Toyota Financial Services (TFS), the finance and insurance brand for Toyota and Lexus in North America.
With offices in Toronto, San Francisco, Dallas, and Pittsburgh, Waabi is growing quickly and looking for diverse, innovative and collaborative candidates who want to impact the world in a positive way. Experience with established methodologies (e.g., 5 Whys, Fishbone diagrams, Fault Tree Analysis, or FMEA) to support the responsibility of conducting thorough root cause analyses on recurring issues.
We are hiring a Site Reliability Engineer (SRE) with Oracle DBA expertise to join our client's Data Center Engineering team. This role focuses on ensuring uptime, scalability, and resilience of mission‑critical Oracle database infrastructure.
Arlington, TX, 30+ days ago
Work closely with data scientists, data architects, data engineers, ETL developers, cybersecurity, network, Linux, other IT counterparts, and business partners to design and set up the environments to manage the ingested and processed datasets from the external sources, internal systems, and the data warehouse to extract features of interest. Solid experience in High Availability and distributed systems, Linux, Data and SAN Storage Networks, NAS and Networking, leveraging tools to instrument and automate proactive and eventually predictive availability solutions.
Protocol Expertise: Mastery of DNS-specific protocols including DNSSEC, DoT, and DoH, with a firm grasp of underlying transport layers (UDP/TCP) and dual-stack (IPv4/IPv6) networking. You will combine deep IP networking and DNS expertise with modern security protocols to ensure our platforms remain resilient against evolving threats and perform at the highest level for millions of users.
This requires oversight of all routine and strategic infrastructure initiatives, including operating system upgrades, patching, EOL remediation, infrastructure changes, middleware and database activities, cloud technologies and readiness, tooling modernization, and automation at scale. You will lead ongoing improvements in automation, resilience engineering, disaster recovery readiness, and operational maturity, creating repeatable, well-engineered processes that support rapid change with minimal risk.
We build and operate a suite of platforms and applications that prevent, detect, and mitigate regulatory and reputational risk across the firm, have access to the latest technology and to massive amounts of structured and unstructured data, and leverage modern frameworks to build responsive and intuitive front-end and Big Data applications. We're committed to fostering and advancing diversity and inclusion in our own workplace and beyond by ensuring every individual within our firm has a number of opportunities to grow professionally and personally, from our training and development opportunities and firmwide networks to benefits, wellness and personal finance offerings and mindfulness programs.