What you'll do

- Ensure high availability of fulfillment systems supporting pickup and delivery
- Reduce order failures, delays, and customer-impacting issues
- Improve operational efficiency through automation and proactive monitoring
- Enable seamless coordination between systems handling orders, inventory, and last-mile delivery
- Own L2/L3 production support for online fulfillment systems, including order management, pickup, and delivery workflows
- Act as the technical escalation point for critical incidents, driving resolution with urgency and clear stakeholder communication
- Perform deep root cause analysis (RCA) and partner with engineering teams to implement permanent fixes
- Analyze production issues at code, infrastructure, and data levels to identify failure patterns and systemic gaps
- Build and enhance monitoring, alerting, and observability solutions to detect issues before customer impact
- Design and develop automation tools and scripts to reduce manual intervention and improve operational efficiency
- Support high-availability systems, ensuring minimal downtime and rapid recovery (low MTTR)
- Participate in on-call rotations and handle high-severity incidents in a fast-paced environment
- Collaborate with product, engineering, and operations teams to improve end-to-end fulfillment reliability and customer experience
- Support production releases, including code deployments, configuration changes, and validation in production and staging environments
- Maintain and continuously improve runbooks, knowledge base, and support documentation
- Drive continuous improvement initiatives to reduce incident volume and improve system resilience
- Contribute to architecture and design discussions, especially around reliability, scalability, and supportability

What you'll bring

- Strong problem-solving skills with the ability to debug complex distributed systems under pressure
- Ability to anticipate failure scenarios and proactively implement preventive solutions
- Excellent communication skills to work effectively with cross-functional and global teams
- High ownership mindset with the ability to balance support and development responsibilities
- Strong focus on customer experience, system uptime, and operational excellence
- 8-10+ years of experience in software engineering with strong exposure to production support / SRE / DevOps environments
- Proven experience supporting high-scale, customer-facing applications, preferably in eCommerce or fulfillment domains
- Hands-on expertise in incident management, RCA, and production troubleshooting
- Strong coding skills with the ability to debug and fix issues at source code level
- Experience working in agile, fast-paced, and globally distributed teams

Technical Skills & Technologies

Backend & Application Development
- Strong programming skills in Java (preferred) or Python
- Experience with Spring Boot / microservices architecture

Cloud & Infrastructure
- Cloud platforms (Microsoft Azure preferred)
- Containers (Docker) and orchestration (Kubernetes)

Data & Messaging
- Messaging systems (Kafka)
- Strong database skills (MSSQL, SQL queries, stored procedures)
- Experience with NoSQL databases (e.g., Azure Cosmos DB, key-value stores)

Monitoring & Observability
- Hands-on experience with Splunk, Dynatrace, or similar APM tools
- Expertise in logging, tracing, and alerting frameworks

Networking & Systems
- Solid understanding of TCP/IP, HTTP, DNS, load balancing, distributed systems

Automation & Tools
- Experience building automation scripts/tools to improve operational efficiency
- Familiarity with CI/CD pipelines and release management processes

Nice to Have

- Experience in Online Fulfillment / Retail / Last-Mile Delivery systems
- Knowledge of order lifecycle systems, inventory, and dispatch optimization
- Exposure to SRE principles (SLIs, SLOs, error budgets)
- Experience with chaos engineering or resilience testing

About Walmart Global Tech

Imagine working in an environment where one line of code can make life easier for hundreds of millions of people.
You will play a key role in ensuring seamless order processing, real-time inventory accuracy, and last-mile delivery execution by proactively identifying risks, resolving production issues, and building automation to prevent recurrence.