We are an Equal Opportunity Employer and do not discriminate against any employee or applicant for employment because of race, color, sex, age, national origin, religion, sexual orientation, gender identity, status as a veteran, disability, or any other federal, state, or local protected class.
Advance Auto Parts (AAP) is looking for a Site Reliability Engineer to join its Platform Engineering organization. This organization has a direct impact on Advance Auto Parts' ability to introduce business capabilities at a pace that supports the company's mission of having a “Passion for Customers. Passion for Yes.”
Primary Duties and Responsibilities include the following. Other duties may be assigned.
- The candidate must have a good understanding of web applications, eCommerce sites, browser-based UI technologies, and Java-based applications.
- We are looking for a problem solver and quick learner who loves working with new technologies and is ready to face challenges in a large, complex retail environment running the latest technology stack in an Azure and/or AWS cloud.
- Pursue continuous improvement in the visibility of production health, automation, self-healing and resiliency initiatives, security, scalability, and reliability.
- The successful candidate has strong technical knowledge; a deep understanding of the various layers of the infrastructure (system, virtualization, middleware, containers, and database); experience supporting applications and services through the software build/deploy/operate cycle; knowledge of high availability; and excellent communication skills, and is a team player in an agile environment.
- Coordinate with various teams on production support tickets and root cause analysis, and assist in the efficient resolution of production processes.
- The candidate will be responsible for monitoring all layers of the production environment and taking proactive measures to avoid downtime.
- Participate in Architecture/Code reviews.
- Define the SRE priorities and drive the team to achieve them.
- Contribute to the group's knowledge base by finding new and valuable ways to approach problems and projects.
We’re looking for individuals who possess the following qualifications:
- 3+ years of experience in technical operations, including 2 years in SRE/DevOps/Service Availability roles
- 2+ years of experience administering middleware (Tomcat, WebSphere) in large-scale production environments
- 2+ years of experience building and operating large-scale systems and applications architecture
- Experience in one or more languages: Bash/Ruby/Python/Perl and Java
- Good understanding of DNS, AD/LDAP, and TCP/IP-related technologies
- Experience working with virtualization in cloud environments
- Good understanding of system performance and monitoring
- Experience with JVM optimization
- Good understanding of or experience with one or more monitoring/metrics/logging solutions – New Relic, SolarWinds, Quantum Metrics, Azure Insights, Google Analytics, or similar solutions
- Good understanding of high availability, failover, and load balancing – F5 GTM/LTM, HAProxy, NGINX
- Experience in incident management, including playing a key role in driving or participating in high-severity calls
- Experience in root cause analysis and problem management while partnering with other product teams
- Ability to analyze, prioritize, multi-task and communicate
- Excellent written and verbal communication skills
- Good understanding of ITIL best practices
- Readiness to automate processes in various languages
- Solid understanding of object-oriented design methodologies
- Solid analytical and problem-solving skills
- Good communication skills
- Agile Methodology
- Apache Tomcat
- Applications Architecture