Senior Site Reliability Engineer

Ellucian Company LP

VA (remote)

JOB DETAILS

SKILLS

Amazon Web Services (AWS), Analysis Skills, Artificial Intelligence (AI), Bash Scripting, Best Practices, Business Operations, Cloud Computing, Communication Skills, Community Support, Continuous Deployment/Delivery, Continuous Improvement, Continuous Integration, Cost Control, Customer Relations, Data Sets, DevOps, Distributed Computing, Diversity, Docker, Ecosystems, Fundraising, Google Cloud Platform (GCP), Health Plan, High Availability, Higher Education, Identify Issues, Incident Management, Incident Response, Infrastructure as a Service (IaaS), Metrics, Microsoft Azure, Operational Improvement, Operations Processes, Process Improvement, Production Systems, Python Programming/Scripting Language, Reliability Engineering, Reporting Dashboards, Risk Management, Root Cause Analysis, Scalable System Development, Scripting (Scripting Languages), Software as a Service (SaaS), Systems Analysis, Systems Reliability, Team Player

POSTED

27 days ago

About Ellucian

Ellucian powers innovation for higher education, partnering with approximately 3,000 customers across 50 countries, serving more than 21 million students. Ellucian's AI-powered platform, trained on the richest dataset available in higher education, drives efficiency, personalized experiences, and strengthened engagement for all students, faculty and staff. Fueled by decades of experience with a singular focus on the unique needs of learning institutions, the Ellucian platform features best-in-class SaaS capabilities and delivers insights needed now and into the future. These solutions and services span the entire student lifecycle, from data-rich tools for student recruitment, enrollment, and retention to workforce analytics, fundraising, and alumni engagement. Ellucian's innovative solutions, vast ecosystem of partners and user community of more than 45,000 provide best practices that lead to greater institutional success and better student outcomes.

About the Opportunity

We are seeking a Senior Site Reliability Engineer (SRE) to ensure the reliability, performance, and cost-efficiency of our production systems. This role requires deep expertise in DataDog for observability and will focus on DevOps practices, incident management, root cause analysis, and cost optimization across cloud infrastructure and services.

Where You Will Make an Impact

Own and improve system reliability, availability, and performance for production environments
Design, implement, and manage monitoring, alerting, and observability using DataDog (required)
Lead incident response efforts, including troubleshooting, mitigation, and post-incident reviews
Perform detailed root cause analysis (RCA) and drive permanent resolutions
Partner with engineering and DevOps teams to build scalable, resilient infrastructure
Automate operational processes to improve efficiency and reduce risk
Analyze and optimize infrastructure and application costs
Define and manage SLIs/SLOs to meet reliability targets
Continuously improve deployment, monitoring, and operational practices

What You Will Bring

5+ years of experience in Site Reliability Engineering, DevOps, or similar roles
Mandatory: Strong, hands-on expertise with DataDog (APM, logs, metrics, dashboards, alerting)
Experience with cloud platforms (AWS, Azure, or GCP)
Proficiency in DevOps practices and tools (CI/CD, Infrastructure as Code such as Terraform)
Strong troubleshooting skills and experience conducting root cause analysis in distributed systems
Experience with containers and orchestration (Docker, Kubernetes)
Scripting or programming experience (Python, Bash, or similar)
Proven ability to analyze and optimize cloud costs

Preferred Qualifications

Experience with cost management tools (e.g., AWS Cost Explorer, Azure Cost Management)
Familiarity with cloud security and compliance best practices
Experience supporting high-availability, customer-facing systems
Strong collaboration and communication skills

What Success Looks Like

Improved system reliability and reduced incident frequency
Faster incident detection and resolution (reduced MTTD and MTTR)
Effective, actionable observability driven by DataDog
Measurable cost savings and optimized infrastructure usage

What makes #Ellucianlife

Comprehensive health coverage: medical, dental, and vision
Flexible time off
Thrive Flex Lifestyle Account (LSA) that allows you to contribute towards your health, financial or learning interests
401(k) with match and BrightPlan to help you save for the future
Parental Leave
5 charitable days to support the community that supports us
Telemedicine
Wellness
Headspace Care (mental health)
Wellbeats (virtual fitness classes)
RethinkCare & Wellthy (caregiver support)
Diversity and inclusion programs which provide access to internal employee resource groups
Employee referral bonuses to encourage the addition of great new people to the team
We foster a learning culture with:
Education Assistance Program
Professional development opportunities

About the Company

Ellucian Company LP

We provide technology solutions and services that remove barriers, helping higher education institutions achieve student success. It’s been our total focus for more than 40 years. Our passion speaks for itself: today we serve 2,400 institutions and 18 million students in 40 countries around the globe. Our goal is to ensure European institutions have technology for faster, smarter growth and peak operational efficiency, all in the service of student success.

COMPANY SIZE

2,500 to 4,999 employees

INDUSTRY

Computer/IT Services

FOUNDED

1968

WEBSITE

http://www.ellucian.com/
