$169,000–$338,000 Per Year
# Position Summary...
# What you'll do...
As a Distinguished AI/ML Engineer within Walmart Global Tech's Reliability Engineering Organization, you will lead the technical development of next-generation agentic AI systems and intelligent automation solutions that ensure mission-critical reliability, scalability, and operational excellence across Walmart's entire technology ecosystem. You will architect and implement cutting-edge machine learning platforms and autonomous agents that transform how we manage change and performance and how we monitor, predict, and automatically resolve issues across all Walmart systems supporting millions of associates and customers globally.
Walmart Global Tech's Reliability Engineering Organization is built with hybrid systems and software engineers who take technical ownership of change engineering, change management, performance engineering, reliability, scalability, automation, and mission-critical issues related to uptime, availability, and rapid continuous improvement across Walmart's e-commerce stores and omni-channel platforms. As a technical expert in this domain, you will drive the evolution of practices into AI-powered self-healing and autonomous systems built on modern technology stacks with intelligent change management and predictive performance optimization. You will also define and implement unified, intelligent, and operationally robust technical solutions and tools for Walmart Technology organizations across all channels and geographies.
## About the Team
The Reliability Engineering Organization at Walmart Global Tech is responsible for ensuring the reliability, availability, and performance of all systems that power the world's largest retailer. As a Fortune 1 company, our work impacts hundreds of millions of customers and associates globally, across every transaction, search, and interaction spanning Walmart's digital and physical ecosystem. We are the guardians of system reliability for Walmart's e-commerce platform, supply chain systems, in-store technology, financial services, and all critical business operations.
Our Reliability Engineering organization is at the forefront of applying advanced AI/ML technologies to reliability challenges, building autonomous systems that can predict, prevent, and resolve issues before they impact customers or business operations. Reliability Engineering is a core engineering discipline within Walmart Global Tech, working closely with all product and engineering teams across the enterprise to ensure every system meets the highest standards of reliability, scalability, and performance. We are deeply invested in building a robust, intelligent, and highly automated technology foundation that supports Walmart's mission to help people live better through innovation and operational excellence.
## What You'll Do
### AI/ML & Agentic Systems Technical Leadership
- Architect and develop advanced agentic AI systems that autonomously manage complex reliability engineering workflows, predictive failure analysis, and self-optimization across Walmart's technology ecosystem.
- Design and implement multi-agent orchestration platforms that coordinate autonomous agents for change management, capacity planning, and performance optimization across e-commerce, supply chain, and in-store systems.
- Build intelligent observability and monitoring platforms using ML-driven anomaly detection, predictive analytics, and autonomous resolution across Walmart's entire technology landscape.
- Develop self-healing infrastructure platforms that leverage AI to predict, prevent, and automatically remediate system issues before they impact customers, associates, or business operations.
### Reliability Engineering Technical Excellence
- Design, write, and build advanced tools to improve latency, availability, scalability, and change management across Walmart Technology systems, including:
  - Engineering reliability using metrics and measurements across all domains.
  - Enabling system scaling through technical solutions, automation, and process optimization.
  - Building tools and automation to prevent recurrence of failures across mission-critical services.
  - Enhancing instrumentation to create a cohesive end-to-end view of system health with particular focus on failure points.
- Architect and implement fault-tolerant systems and services across Walmart's hybrid cloud infrastructure with emphasis on autonomous recovery and intelligent failure prediction.
- Collaborate with engineering teams and leadership to reduce Mean Time to Detect (MTTD) and Mean Time to Restore (MTTR) through intelligent automation and predictive capabilities.
- Partner with service owners across e-commerce, supply chain, stores, fintech, and other domains to define SLA breach detection and change-related anomalies, ensuring systems meet SLAs while maintaining optimal performance and user experience.
- Perform complex troubleshooting and analysis of large-scale distributed systems using deep expertise in coding, algorithms, and distributed systems design.
### Strategic Technical Innovation
- Partner with engineering organizations across E-commerce, Supply Chain, Store Technology, Fintech, and Data Platforms to deliver autonomous reliability solutions using advanced machine learning, natural language processing, and computer vision.
- Drive development of MLOps and AIOps platforms that enable continuous learning, deployment, monitoring, and autonomous optimization of reliability systems.
- Innovate in agentic AI technologies for Reliability Engineering, including:
  - Large language models for automated incident response.
  - Reinforcement learning agents for capacity optimization.
  - Multi-modal AI for infrastructure monitoring.
  - Federated learning for cross-domain reliability insights.
- Implement advanced CI/CD pipelines for reliability platforms with automated validation, deployment, rollback, and built-in observability.
- Establish platform engineering excellence by building reusable reliability infrastructure, intelligent monitoring platforms, and developer productivity tools.
- Provide technical mentorship and thought leadership across Walmart Technology through code reviews, design discussions, and knowledge sharing.
## What You'll Bring
### Education & Experience
- Bachelor's or Master's degree in Engineering, Computer Science, or a related field with 12-43 years of hands-on experience in Reliability Engineering, AI/ML Engineering, or Platform Engineering.
- Proven record as a senior individual contributor influencing architecture and driving technical excellence across large organizations.
- Deep experience operating mission-critical systems with expertise in MTTD, MTTR, availability, change management, model performance, and autonomous system reliability.
### Must-Have Technical Experience
- Expert-level AI/ML engineering experience including deep learning frameworks such as TensorFlow and PyTorch and large-scale production ML deployments.
- Advanced experience with agentic AI systems including multi-agent frameworks, autonomous decision-making systems, LLM-based agents, and agent orchestration platforms.
- Comprehensive Reliability Engineering expertise including service management; Incident, Problem, and Change Management; and performance and capacity engineering for AI/ML systems.
- Expert-level cloud engineering experience (Azure, GCP, AWS) with containerization, Kubernetes, Docker, serverless architectures, and cloud-native AI services.
- Deep observability experience across distributed tracing, metrics, logs, APM, and AI-driven anomaly detection.
- Strong platform engineering background including infrastructure as code, service mesh architectures, API gateways, and self-service developer platforms.
### Preferred Technical Experience
- MLOps and model lifecycle management using platforms such as MLflow, Kubeflow, or Seldon.
- NLP and computer vision expertise for intelligent log analysis, automated incident response, and visual infrastructure monitoring.
- Edge computing and distributed systems experience for retail stores and distribution centers.
- Real-time streaming architectures (Kafka, Pulsar).
- Chaos engineering, fault injection, and performance optimization for large-scale distributed systems.
- Open-source contributions in reliability, observability, or infrastructure automation.
At Walmart, we offer competitive pay as well as performance-based bonus awards and other great benefits for a happier mind, body, and wallet. Health benefits include medical, vision, and dental coverage. Financial benefits include 401k, stock purchase, and company-paid life insurance. Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting. Other benefits include short-term and long-term disability, company discounts, Military Leave Pay, adoption and surrogacy expense reimbursement, and more.
You will also receive PTO and/or PPTO that can be used for vacation, sick leave, holidays, or other purposes. The amount you receive depends on your job classification and length of employment. It will meet or exceed the requirements of paid sick leave laws where applicable. For information about PTO, see https://one.walmart.com/notices.
Live Better U is a Walmart-paid education benefit program for full-time and part-time associates in Walmart and Sam's Club facilities. Programs range from high school completion to bachelor's degrees, including English Language Learning and short-form certificates. Tuition, books, and fees are completely paid for by Walmart.
Eligibility requirements apply to some benefits and may depend on your job classification and length of employment. Benefits are subject to change and may be subject to a specific plan or program terms. For information about benefits and eligibility, see One.Walmart.
The annual salary range for this position is $169,000.00 - $338,000.00. Additional compensation includes annual or quarterly performance bonuses. Additional compensation for certain positions may also include Stock.
# Minimum Qualifications...
Outlined below are the required minimum qualifications for this position. If none are listed, there are no minimum qualifications.
- Option 1: Bachelor's degree in computer science, computer engineering, computer information systems, software engineering, or related area and 6 years' experience in software engineering or related area.
- Option 2: 8 years' experience in software engineering or related area.
# Preferred Qualifications...
Outlined below are the optional preferred qualifications for this position. If none are listed, there are no preferred qualifications.
- Master's degree in computer science, computer engineering, computer information systems, software engineering, or related area and 4 years' experience in software engineering or related area.
- We value candidates with a background in creating inclusive digital experiences, demonstrating knowledge in implementing Web Content Accessibility Guidelines (WCAG) 2.2 AA standards, assistive technologies, and integrating digital accessibility seamlessly. The ideal candidate would have knowledge of accessibility best practices and join us as we continue to create accessible products and services following Walmart's accessibility standards and guidelines for supporting an inclusive culture.
# Primary Location...
1345 Crossman Ave, Sunnyvale, CA 94089-1114, United States of America
Walmart and its subsidiaries are committed to maintaining a drug-free workplace and have a no-tolerance policy regarding the use of illegal drugs and alcohol on the job. This policy applies to all employees and aims to create a safe and productive work environment.