Monitoring and Observability Engineer

Veterans Sourcing Group

Pittsburgh, PA

Posted: 30+ days ago
Job Title: Monitoring and Observability Engineer
Duration: 12+ Months (Possible extension)
Location: Pittsburgh, PA 15258
Onsite Role (4 days a week)
Alternate Location: Lake Mary, FL 32746 or New York, NY 10286

Responsibilities:
  • Seeking a skilled Cloud Monitoring and Observability Engineer (Azure) to design, implement, and optimize end-to-end monitoring and observability solutions for a mission-critical application deployed in the Azure environment.
  • The ideal candidate has hands-on experience with enterprise monitoring tools such as AppDynamics, ThousandEyes, NetScout, and SolarWinds (or equivalent alternatives), and a strong background in building scalable, secure, and compliant observability stacks for cloud deployments.
  • The engineer will collaborate closely with application engineering, cloud platform, network, and security teams to ensure comprehensive coverage across application, infrastructure, and network layers.
  • Design and implement end-to-end monitoring, alerting, and observability for an Azure-hosted application across application, infrastructure, network, and user experience layers.
  • Configure, integrate, and maintain enterprise monitoring platforms to deliver actionable telemetry, performance baselines, and SLA/SLO tracking.
  • Build dashboards, health checks, synthetic tests, and alerting workflows; tune alert fidelity to reduce noise and improve the signal-to-noise ratio.
  • Establish and document telemetry standards (metrics, logs, traces), data collection strategies, and service-level indicators (SLIs) aligned to reliability objectives (SLOs).
  • Integrate Azure-native services (Azure Monitor, Log Analytics, Application Insights) with enterprise tools to provide unified visibility and correlation.
  • Implement network performance monitoring, path visibility, and internet/extranet testing using NPM tools (e.g., ThousandEyes, NetScout); leverage infrastructure monitoring platforms (e.g., SolarWinds) for device and service health.
  • Instrument applications with APM tools (e.g., AppDynamics, Dynatrace, New Relic) for business transaction monitoring, dependency mapping, and root-cause analysis; tune anomaly detection and policy thresholds.
  • Collaborate with DevOps/SRE teams to embed monitoring into CI/CD and infrastructure-as-code patterns; ensure new services adhere to observability standards.
  • Define runbooks and escalation paths; support incident response and post-incident reviews with data-driven insights and remediation recommendations.
  • Ensure monitoring solutions meet applicable security and compliance requirements; support audit requests with clear documentation and evidence.
  • Conduct capacity and performance trend analysis; recommend optimization, right-sizing, and resilience improvements.
  • Provide knowledge transfer, documentation, and training on monitoring tools, best practices, and operational workflows.
Education/Experience:
  • 5+ years implementing enterprise monitoring/observability for cloud or hybrid environments, including mission-critical applications.
  • Demonstrable expertise with at least one tool in each category (or equivalent), including production deployments, advanced configuration, and operational use:
    • Application Performance Monitoring (APM): AppDynamics, Dynatrace, or New Relic.
      • Experience instrumenting services for business transaction tracing, code-level diagnostics, service maps, and anomaly detection.
      • Ability to design APM dashboards and create alert policies with appropriate thresholds and baselines.
    • Network Performance Monitoring (NPM) / Digital Experience Monitoring (DEM): ThousandEyes, NetScout, or Kentik.
      • Experience with synthetic tests, path visualization, packet-level analysis, and internet/WAN performance monitoring.
      • Ability to configure endpoint agents, BGP/DNS tests, and multi-hop path monitoring for user experience correlation.
    • Infrastructure Monitoring and Event Management: SolarWinds, Microsoft SCOM, Datadog, or Prometheus/Grafana.
      • Experience monitoring servers, containers, network devices, and cloud services; creating availability and capacity dashboards.
      • Proficiency with alert routing, de-duplication, and event correlation.
  • Strong Azure monitoring experience: Azure Monitor, Log Analytics (KQL), Application Insights, and integration with third-party tools.
  • Solid understanding of distributed tracing, metrics, and log aggregation; familiarity with OpenTelemetry concepts and data pipelines.
  • Scripting/automation skills (PowerShell, Python, or Bash) to automate monitoring configuration, agent deployment, test creation, and reporting.
  • Networking fundamentals (DNS, BGP, HTTP, TLS, TCP/IP), CDN concepts, and WAN performance monitoring; ability to correlate app and network telemetry.
  • Experience supporting incident response and performance troubleshooting across applications, infrastructure, and network layers.
  • Excellent documentation and communication skills; collaborative mindset with engineering, operations, and security stakeholders.
Preferred:
  • Background in regulated environments (financial services, government, healthcare) with compliance-aware monitoring design.
  • Experience with log aggregation and SIEM/SOAR platforms (e.g., Splunk, Elastic) and integration with APM/NPM tools.
  • Integration experience with ITSM platforms (e.g., ServiceNow) for incident, change, and problem management workflows.
  • Familiarity with infrastructure-as-code (ARM/Bicep/Terraform) and embedding observability into IaC patterns; experience with CI/CD integration.
  • Exposure to SRE practices (SLIs/SLOs, error budgets, reliability reviews) and capacity/performance planning.
  • Ability to code in one or more of the following languages for instrumentation, custom telemetry, SDK integration, and tooling automation:
    • Java: Implementing OpenTelemetry SDKs/agents, custom instrumentation, and APM tagging; building synthetic test harnesses.
    • .NET (C#): Instrumenting ASP.NET services, configuring APM auto-instrumentation, writing custom exporters and health probes.
    • Python: Building automation scripts, collectors/exporters, synthetic tests, and integrating with monitoring APIs and SDKs.
