Monitoring and Observability Engineer

Veterans Sourcing Group

Pittsburgh, PA

JOB DETAILS
SKILLS
ARM (Azure Resource Manager), Analysis Skills, Application Hosting, Application Integration, Application Programming Interface (API), Automation, BGP, Bash Scripting, Best Practices, Budgeting, Capacity and Performance Management, Cloud Computing, Communication Skills, Content Delivery Network (CDN), Continuous Deployment/Delivery, Continuous Integration, DNS (Domain Name System), Data Collection, Data Management, DevOps, Documentation, Documentation Standards, Event Management, Financial Services, Government, HTTP (Hypertext Transfer Protocol), Healthcare, Home Automation, Hybrid Cloud, IT Service Management (ITSM), Incident Management, Incident Response, Instrumentation, Java, Knowledge Transfer, Machine Tool, Metrics, Microsoft .NET, Microsoft ASP.NET (Active Server Pages), Microsoft C# (C Sharp), Microsoft System Center Operations Manager (SCOM), Microsoft Azure, Network Monitoring, Network Performance/Analysis, Operations Security (OPSEC), Performance Analysis, Policy Development, Python Programming/Scripting Language, Regulatory Compliance, Reporting Dashboards, Reporting Skills, Right-Sizing, Root Cause Analysis, SSL-TLS (Secure Sockets Layer - Transport Layer Security), Scalable System Development, Scripting (Scripting Languages), Security Information and Event Management (SIEM), Service Level Agreement (SLA), ServiceNow, Signal-to-noise Ratio (SNR), Software Administration, Software Engineering, Splunk, TCP/IP (Transmission Control Protocol/Internet Protocol), Team Player, Telemetry, Test Harness, Test Plan/Schedule, Transaction Processing/Management, Trend Analysis, User Interface/Experience (UI/UX), Wide Area Network (WAN), Windows PowerShell
POSTED
30+ days ago
Job Title: Monitoring and Observability Engineer
Duration: 12+ Months (Possible extension)
Location: Pittsburgh, PA 15258
Onsite Role (4 days a week)
Alternate Location: Lake Mary, FL 32746 or New York, NY 10286

Responsibilities:
  • Seeking a skilled Cloud Monitoring and Observability Engineer (Azure) to design, implement, and optimize end-to-end monitoring and observability solutions for a mission-critical application deployed in Azure.
  • The ideal candidate has hands-on experience with enterprise monitoring tools such as AppDynamics, ThousandEyes, NetScout, and SolarWinds (or equivalent alternatives) and a strong background in building scalable, secure, and compliant observability stacks for cloud deployments.
  • The engineer will collaborate closely with application engineering, cloud platform, network, and security teams to ensure comprehensive coverage across application, infrastructure, and network layers.
  • Design and implement end-to-end monitoring, alerting, and observability for an Azure-hosted application across application, infrastructure, network, and user experience layers.
  • Configure, integrate, and maintain enterprise monitoring platforms to deliver actionable telemetry, performance baselines, and SLA/SLO tracking.
  • Build dashboards, health checks, synthetic tests, and alerting workflows; optimize alert fidelity to minimize noise and improve signal-to-noise ratio.
  • Establish and document telemetry standards (metrics, logs, traces), data collection strategies, and service-level indicators (SLIs) aligned to reliability objectives (SLOs).
  • Integrate Azure-native services (Azure Monitor, Log Analytics, Application Insights) with enterprise tools to provide unified visibility and correlation.
  • Implement network performance monitoring, path visibility, and internet/extranet testing using NPM tools (e.g., ThousandEyes, NetScout); leverage infrastructure monitoring platforms (e.g., SolarWinds) for device and service health.
  • Instrument applications with APM tools (e.g., AppDynamics, Dynatrace, New Relic) for business transaction monitoring, dependency mapping, and root-cause analysis; tune anomaly detection and policy thresholds.
  • Collaborate with DevOps/SRE teams to embed monitoring into CI/CD and infrastructure-as-code patterns; ensure new services adhere to observability standards.
  • Define runbooks and escalation paths; support incident response and post-incident reviews with data-driven insights and remediation recommendations.
  • Ensure monitoring solutions meet applicable security and compliance requirements; support audit requests with clear documentation and evidence.
  • Conduct capacity and performance trend analysis; recommend optimization, right-sizing, and resilience improvements.
  • Provide knowledge transfer, documentation, and training on monitoring tools, best practices, and operational workflows.
Education/Experience:
  • 5+ years implementing enterprise monitoring/observability for cloud or hybrid environments, including mission-critical applications.
  • Demonstrable expertise with at least one tool in each category (or equivalent), including production deployments, advanced configuration, and operational use:
  • Application Performance Monitoring (APM): AppDynamics, Dynatrace, or New Relic.
    • Experience instrumenting services for business transaction tracing, code-level diagnostics, service maps, and anomaly detection.
    • Ability to design APM dashboards and create alert policies with appropriate thresholds and baselines.
  • Network Performance Monitoring (NPM) / Digital Experience Monitoring (DEM): ThousandEyes, NetScout, or Kentik.
    • Experience with synthetic tests, path visualization, packet-level analysis, and internet/WAN performance monitoring.
    • Ability to configure endpoint agents, BGP/DNS tests, and multi-hop path monitoring for user experience correlation.
  • Infrastructure Monitoring and Event Management: SolarWinds, Microsoft SCOM, Datadog, or Prometheus/Grafana.
    • Experience monitoring servers, containers, network devices, and cloud services; creating availability and capacity dashboards.
    • Proficiency with alert routing, de-duplication, and event correlation.
  • Strong Azure monitoring experience: Azure Monitor, Log Analytics (KQL), Application Insights, and integration with third-party tools.
  • Solid understanding of distributed tracing, metrics, and log aggregation; familiarity with OpenTelemetry concepts and data pipelines.
  • Scripting/automation skills (PowerShell, Python, or Bash) to automate monitoring configuration, agent deployment, test creation, and reporting.
  • Networking fundamentals (DNS, BGP, HTTP, TLS, TCP/IP), CDN concepts, and WAN performance monitoring; ability to correlate app and network telemetry.
  • Experience supporting incident response and performance troubleshooting across applications, infrastructure, and network layers.
  • Excellent documentation and communication skills; collaborative mindset with engineering, operations, and security stakeholders.
Preferred:
  • Background in regulated environments (financial services, government, healthcare) with compliance-aware monitoring design.
  • Experience with log aggregation and SIEM/SOAR platforms (e.g., Splunk, Elastic) and integration with APM/NPM tools.
  • Integration experience with ITSM platforms (e.g., ServiceNow) for incident, change, and problem management workflows.
  • Familiarity with infrastructure-as-code (ARM/Bicep/Terraform) and embedding observability into IaC patterns; experience with CI/CD integration.
  • Exposure to SRE practices (SLIs/SLOs, error budgets, reliability reviews) and capacity/performance planning.
  • Ability to code in one or more of the following languages for instrumentation, custom telemetry, SDK integration, and tooling automation:
    • Java: Implementing OpenTelemetry SDKs/agents, custom instrumentation, and APM tagging; building synthetic test harnesses.
    • .NET (C#): Instrumenting ASP.NET services, configuring APM auto-instrumentation, writing custom exporters and health probes.
    • Python: Building automation scripts, collectors/exporters, synthetic tests, and integrating with monitoring APIs and SDKs.
