Customer Job Major Incident Engineer
Job ID: 26-01197
Pay rate range: 60/hr to 65/hr
Job Description: The Problem Management Engineer is responsible for leading and maturing the Problem Management function to prevent incident recurrence, reduce operational risk, and improve service resiliency. This role owns the quality and effectiveness of root cause analysis, ensures permanent fixes are validated, and drives continuous improvement across IT services in alignment with ITIL best practices.
This position requires a strong technical background to credibly engage engineering teams, challenge root cause conclusions, and ensure solutions are durable, evidence-based, and measurable.
Problem Management Leadership
-----------------------------
Lead and oversee the end-to-end Problem Management lifecycle, including:
• Detection
• Logging
• Classification
• Investigation
• Resolution
• Validation
• Closure
Ensure problems are closed only when defined closure criteria are met, including validated resolution, preventive controls, and monitoring improvements. Prevent premature or superficial closure of problems by enforcing quality and evidence standards.
Root Cause Analysis & Technical Oversight
-----------------------------------------
Lead and validate structured Root Cause Analysis (RCA) using methodologies such as:
• 5 Whys
• Fishbone
• Fault tree analysis
Challenge assumptions and ensure true root causes are identified for major incidents and recurring issues. Review and validate the technical feasibility and effectiveness of permanent fixes.
Cross-Functional Collaboration
------------------------------
Partner closely with:
• Incident Management
• Change Management
• Resiliency/Reliability Engineering
• Service Owners
Coordinate permanent fixes through formal change processes. Work with vendors and external partners to track dependencies and ensure accountability.
Governance, Metrics & Reporting
-------------------------------
Establish and enforce Problem Management governance and quality standards. Track and report on key metrics, including:
• Overdue problems
• SLA compliance
• Recurrence trends
• Systemic risks
Provide clear, actionable updates and insights to senior leadership and executive forums.
Knowledge & Continuous Improvement
----------------------------------
Maintain and improve the Known Error Database (KEDB) and Problem-related Knowledge Articles. Identify opportunities for proactive problem management, automation, and improved monitoring and alerting. Continuously refine Problem Management processes, tools, and standards to increase effectiveness and efficiency.
Required Qualifications
-----------------------
• Strong understanding of ITIL Problem Management processes and best practices
• Proven experience leading or performing Root Cause Analysis in complex technical environments
• Technical background in infrastructure, applications, cloud, or enterprise platforms sufficient to engage and challenge engineering teams
• Hands-on experience with ServiceNow or comparable enterprise ITSM platforms
• Strong communication and stakeholder management skills, including executive-level communication
• Ability to analyze trends, identify systemic risk, and drive proactive improvements
Preferred Qualifications
------------------------
• ITIL Foundation certification or higher
• Experience in large-scale enterprise environments
• Experience supporting Major Incident or executive outage review forums
• Familiarity with automation, observability, and proactive problem management techniques
• Experience working with vendors and external service providers