Build holistic visibility into SLIs, SLOs, SLAs, dependency graphs, and the past performance of software, network, and systems so that we can continue to scale without increasing operational burden or toil.
Assess the current state of the environment and drive "SWAT" initiatives in collaboration with the rest of the organization to ensure transparency, resiliency, stability, and reliability across both the application and infrastructure stacks. Future-state SWAT initiatives can range from incident analysis leveraging ML and AI, to assisting with the datacenter stability and consolidation effort, to application transformation (monolithic to microservices, PaaS, etc.).
Enable the adoption and implementation of cloud-based application reliability, resiliency, observability, and deployment best practices for production and non-production environments, including the public cloud migration of our mission-critical applications from the on-premises data centers.
Build infrastructure and drive projects that break things with the aim to improve the robustness of production systems.
Use the core Site Reliability Engineering principles of change management, monitoring, emergency response, capacity planning, and production readiness reviews to run the platform.
Step back to observe patterns and develop innovative tools and automation to minimize toil. Use those learnings to drive the best operational practices.
Monitor and report on service level objectives for a given application's services. Work with business and product owners to establish key performance indicators.
Partner with security engineers to develop plans and automation that respond aggressively and safely to new risks and vulnerabilities.
Partner with the broader Fiserv organization to build a culture of rigorously learning from incidents.
Share your knowledge by giving brown bags, tech talks, and evangelizing appropriate tech and engineering best practices.
Unblock, support, and effectively communicate across teams to achieve results.
Define roadmap and architecture based on technology and business outcomes.
4+ years of software engineering experience, including development best practices and code management
Experience with Infrastructure as Code tools (e.g. Terraform, CloudFormation)
Experience with high level programming languages (Python, Go, Java, etc.)
Experience with designing solutions for Canary and/or Blue/Green deployments
Experience designing, debugging, and running fault-tolerant, large-scale distributed systems
Experience working with public cloud platforms (e.g., AWS, Google Cloud Platform, Microsoft Azure, etc.)
Experience with creating and improving documented procedures and/or playbooks.
Knowledge of open-source configuration, orchestration, and CI/CD tools.
Knowledge of Kubernetes, PCF and/or Docker.
Deep understanding of Cloud Architecture and Operations
Strong troubleshooting and debugging skills
Experience with tools and technologies such as Prometheus, Grafana, AppDynamics, Dynatrace, Splunk, and Moogsoft is a plus.
Experience managing large numbers of diverse systems with configuration management tools such as Puppet, Chef, Ansible, or Salt.
Understanding of standard networking protocols and concepts such as HTTP, DNS, ECMP, TCP/IP, ICMP, the OSI model, subnetting, and load-balancing strategies.
Job ID: BBBH21826