Job Title: Monitoring and Observability Engineer
Duration: 12+ Months (Possible extension)
Location: Pittsburgh, PA 15258
Onsite Role (4 days a week)
Alternate Location: Lake Mary, FL 32746 or New York, NY 10286
Responsibilities:
- Seeking a skilled Cloud Monitoring and Observability Engineer (Azure) to design, implement, and optimize end-to-end monitoring and observability solutions for a mission-critical application deployed in the Azure environment.
- The ideal candidate has hands-on experience with enterprise monitoring tools such as AppDynamics, ThousandEyes, NetScout, and SolarWinds (or equivalent alternatives) and a strong background in building scalable, secure, and compliant observability stacks for cloud deployments.
- Will collaborate closely with application engineering, cloud platform, network, and security teams to ensure comprehensive coverage across application, infrastructure, and network layers.
- Design and implement end-to-end monitoring, alerting, and observability for an Azure-hosted application across application, infrastructure, network, and user experience layers.
- Configure, integrate, and maintain enterprise monitoring platforms to deliver actionable telemetry, performance baselines, and SLA/SLO tracking.
- Build dashboards, health checks, synthetic tests, and alerting workflows; optimize alert fidelity to minimize noise and improve signal-to-noise ratio.
- Establish and document telemetry standards (metrics, logs, traces), data collection strategies, and service-level indicators (SLIs) aligned to reliability objectives (SLOs).
- Integrate Azure-native services (Azure Monitor, Log Analytics, Application Insights) with enterprise tools to provide unified visibility and correlation.
- Implement network performance monitoring, path visibility, and internet/extranet testing using NPM tools (e.g., ThousandEyes, NetScout); leverage infrastructure monitoring platforms (e.g., SolarWinds) for device and service health.
- Instrument applications with APM tools (e.g., AppDynamics, Dynatrace, New Relic) for business transaction monitoring, dependency mapping, and root-cause analysis; tune anomaly detection and policy thresholds.
- Collaborate with DevOps/SRE teams to embed monitoring into CI/CD and infrastructure-as-code patterns; ensure new services adhere to observability standards.
- Define runbooks and escalation paths; support incident response and post-incident reviews with data-driven insights and remediation recommendations.
- Ensure monitoring solutions meet applicable security and compliance requirements; support audit requests with clear documentation and evidence.
- Conduct capacity and performance trend analysis; recommend optimization, right-sizing, and resilience improvements.
- Provide knowledge transfer, documentation, and training on monitoring tools, best practices, and operational workflows.
Education/Experience:
- 5+ years implementing enterprise monitoring/observability for cloud or hybrid environments, including mission-critical applications.
- Demonstrable expertise with at least one tool in each category (or equivalent), including production deployments, advanced configuration, and operational use:
  - Application Performance Monitoring (APM): AppDynamics, Dynatrace, or New Relic.
    - Experience instrumenting services for business transaction tracing, code-level diagnostics, service maps, and anomaly detection.
    - Ability to design APM dashboards and create alert policies with appropriate thresholds and baselines.
  - Network Performance Monitoring (NPM) / Digital Experience Monitoring (DEM): ThousandEyes, NetScout, or Kentik.
    - Experience with synthetic tests, path visualization, packet-level analysis, and internet/WAN performance monitoring.
    - Ability to configure endpoint agents, BGP/DNS tests, and multi-hop path monitoring for user experience correlation.
  - Infrastructure Monitoring and Event Management: SolarWinds, Microsoft SCOM, Datadog, or Prometheus/Grafana.
    - Experience monitoring servers, containers, network devices, and cloud services; creating availability and capacity dashboards.
    - Proficiency with alert routing, de-duplication, and event correlation.
- Strong Azure monitoring experience: Azure Monitor, Log Analytics (KQL), Application Insights, and integration with third-party tools.
- Solid understanding of distributed tracing, metrics, and log aggregation; familiarity with OpenTelemetry concepts and data pipelines.
- Scripting/automation skills (PowerShell, Python, or Bash) to automate monitoring configuration, agent deployment, test creation, and reporting.
- Networking fundamentals (DNS, BGP, HTTP, TLS, TCP/IP), CDN concepts, and WAN performance monitoring; ability to correlate app and network telemetry.
- Experience supporting incident response and performance troubleshooting across applications, infrastructure, and network layers.
- Excellent documentation and communication skills; collaborative mindset with engineering, operations, and security stakeholders.
Preferred:
- Background in regulated environments (financial services, government, healthcare) with compliance-aware monitoring design.
- Experience with log aggregation and SIEM/SOAR platforms (e.g., Splunk, Elastic) and integration with APM/NPM tools.
- Integration experience with ITSM platforms (e.g., ServiceNow) for incident, change, and problem management workflows.
- Familiarity with infrastructure-as-code (ARM/Bicep/Terraform) and embedding observability into IaC patterns; experience with CI/CD integration.
- Exposure to SRE practices (SLIs/SLOs, error budgets, reliability reviews) and capacity/performance planning.
- Ability to code in one or more of the following languages for instrumentation, custom telemetry, SDK integration, and tooling automation:
  - Java: Implementing OpenTelemetry SDKs/agents, custom instrumentation, and APM tagging; building synthetic test harnesses.
  - .NET (C#): Instrumenting ASP.NET services, configuring APM auto-instrumentation, writing custom exporters and health probes.
  - Python: Building automation scripts, collectors/exporters, synthetic tests, and integrating with monitoring APIs and SDKs.