Operational Support Engineer

Pride Technologies LLC

Atlanta, GA

JOB DETAILS
SKILLS
Application Programming Interface (API), Artificial Intelligence (AI), Automation, Bridge Building, Business Operations, Change Control, Civil Engineering, Cloud Computing, Communication Skills, Content Delivery Network (CDN), Continuous Improvement, Corrective Action, Customer Escalations, Customer Relations, DevOps, Digital Rights Management (DRM), Distributed Computing, Event Management, Identify Issues, Incident Management, Incident Response, Knowledge Base, Machine Tool, Metrics, Microservices, On Call, Operational Improvement, Operational Support, Operations Planning, Operations Processes, Presentation/Verbal Skills, Procedure Development, Production Support, Production Systems, Reliability Engineering, Risk, Risk Analysis, Risk Management, Root Cause Analysis, Service Level Agreement (SLA), System Validation, Team Player, Technical Support, Telemetry, Time Management, Video Production, Video Streaming, Writing Skills
LOCATION
Atlanta, GA
POSTED
13 days ago

Our client, an American technology corporation, is looking to hire an Operational Support Engineer in Atlanta, GA (hybrid role).

Pay Rate Range: $60/h - $65/h on W2, depending on experience

6-month W2 contract

Role Summary

The team is responsible for the stability, availability, and operational excellence of our 24/7 live video streaming, ads, player, and real-time delivery platforms.

As an Operational Support Engineer (L2), you take end-to-end ownership of customer-impacting production incidents once they are triaged by Level 1 support. You operate directly on production systems, lead live incident resolution, and act as the operational bridge between Support, Engineering, DevOps, and customers, particularly during high-impact live events.

This is a hands-on, customer-facing role focused on incident ownership, production operations, automation, and operational scalability, not just reactive troubleshooting.

Key Responsibilities

Incident & Operational Support

  • Take ownership of escalated customer issues from Level 1 Support and drive them to resolution.
  • Troubleshoot and resolve complex, high-impact production incidents affecting live streams, VOD playback, ad insertion, DRM, and real-time WebRTC services.
  • Operate directly on production environments, making configuration changes, CDN adjustments, and corrective actions in line with established operational procedures. This includes executing mitigations and emergency changes during live incidents when customer impact requires immediate action.
  • Lead or actively contribute to live incident bridges involving customers, internal teams, and partners.
  • Provide clear, timely communication during incidents, including status updates and customer-facing explanations.

Infrastructure as Code (IaC) & Production Operations

  • Work fluently with Infrastructure as Code (IaC) to understand, troubleshoot, and safely modify production environments.
  • Leverage tools and frameworks such as:
      • Terraform
      • Helm
      • Kubernetes manifests
      • GitOps workflows
      • CI/CD and deployment pipelines
  • Use IaC as the primary mechanism for safe, auditable, and repeatable operational changes.
  • Collaborate with Engineering and DevOps to improve deployment reliability and operational safety.
  • Validate and execute infrastructure or configuration changes through codified workflows.

AI-Driven Operations & Automation

  • Leverage AI tools and automation to enhance operational efficiency and incident response.
  • Contribute to and use:
      • AI-assisted incident triage and classification
      • Automated runbook execution
      • AI-based pattern detection across incidents
      • Intelligent alert correlation and noise reduction
  • Use AI to:
      • Generate or improve incident communications
      • Accelerate troubleshooting workflows
      • Identify recurring patterns and systemic issues
  • Drive adoption of automation-first and AI-augmented operational practices.

Pre-Event Planning & Operational Readiness

  • Participate in pre-event readiness planning for critical customer events.
  • Validate system readiness through:
      • Runbook checks
      • Monitoring coverage validation
      • Risk identification and mitigation planning
  • Define and rehearse incident response strategies for high-risk scenarios.
  • Collaborate with customers and internal teams to ensure smooth event execution.

On-Call & 24/7 Operations

  • Participate in a 24/7 on-call rotation, including nights, weekends, and holidays, as part of a global support model.
  • Ensure smooth handovers between shifts and regions.
  • Respond to critical alerts within defined SLAs for stream health, player errors, and delivery infrastructure.

Root Cause Analysis (RCA) & Continuous Improvement

  • Perform or contribute to root cause analysis (RCA) for production incidents.
  • Document findings, corrective actions, and preventive measures.
  • Identify recurring issues and work with Engineering and Product teams to eliminate them permanently.
  • Contribute to and improve runbooks, operational playbooks, and knowledge bases for all OptiView products (Player, ads, live and real-time streaming).

Collaboration & Engineering Feedback Loop

  • Work closely with Engineering teams to escalate defects, validate fixes, and support production deployments.
  • Provide feedback on system observability, tooling gaps, and operational risks.
  • Act as the operational voice during post-incident reviews.

Required Skills & Experience

Technical Skills

  • 5+ years of relevant experience in operational, support, or similar customer-facing roles.
  • Proven ability to own complex problems end-to-end and operate with a high degree of autonomy.
  • Strong experience supporting production video streaming platforms, OTT services, and live systems.
  • Solid troubleshooting skills across distributed systems (APIs, microservices, cloud infrastructure).
  • Familiarity with HLS, DASH, CMAF, WebRTC, DRM, and CDN architectures.
  • Experience working with monitoring, alerting, and logs to diagnose live incidents (Grafana, Kibana/ELK, Prometheus, Loki).
  • Ability to correlate backend streaming metrics, player telemetry, and CDN signals to diagnose live customer issues end-to-end.
  • Comfort performing controlled changes in production environments.
  • Working knowledge of incident management and on-call operations.

Operational Mindset

  • Proven ability to remain calm, structured, and decisive during high-pressure incidents.
  • Strong sense of ownership and accountability for customer outcomes.
  • Excellent written and verbal communication skills, including customer-facing communication during incidents.

Russell Tobin offers eligible employees comprehensive healthcare coverage (medical, dental, and vision plans), supplemental coverage (accident insurance, critical illness insurance, and hospital indemnity), 401(k) retirement savings, life and disability insurance, an employee assistance program, legal support, auto and home insurance, pet insurance, and employee discounts with preferred vendors.

Equal Employment Opportunity

Russell Tobin is an equal opportunity employer. We do not discriminate on the basis of race, religious creed, color, national origin, ancestry, physical disability, mental disability, reproductive health decision making, medical condition, genetic information, marital status, sex, gender, gender identity, gender expression, age, sexual orientation, veteran or military status, or any other characteristic protected by applicable federal, state, or local law.

Fair Chance Employment

Russell Tobin is a Fair Chance employer. We consider all qualified applicants, including those with criminal histories, in a manner consistent with applicable state and local Fair Chance laws and ordinances, including the California Fair Chance Act and all applicable local Fair Chance ordinances.

Accommodations

We are committed to providing reasonable accommodations to applicants and employees with disabilities. If you require a reasonable accommodation to participate in the application or interview process, or to perform the essential functions of this role, please contact us.

Applicable only to San Francisco candidates: Under the San Francisco Lactation in the Workplace Ordinance, we will provide written notice of lactation accommodation rights. This notice will be given automatically upon hiring and upon any inquiry about parental leave or lactation accommodation.

About the Company

Pride Technologies LLC