Data Center Site Manager

Nebius Group NV

New Jersey, NJ


The Role

The Data Center Site Manager owns end-to-end reliability, safety, capacity, and performance for one of our flagship U.S. sites. You'll lead a high-performing, multi-disciplinary operations team and partner tightly with Design, Build, Network, Security, Capacity Planning, and the DC orgs to deliver world-class availability and cost efficiency.

Your responsibilities will include:

Own the site 24/7: deliver continuous availability across power, cooling, structured cabling, network, security, and DCIM, meeting or beating global SLAs.

Build and lead the team: hire, mentor, and develop managers/technicians; run staffing models, shift coverage, and on-call rotations that scale.

Be the incident commander: lead major events end-to-end, including triage, communications, executive briefings, RCA, and durable corrective actions.

Drive reliability engineering: implement RCM, predictive maintenance, QA/QC, 5S, and Lean/continuous improvement to cut MTTR and raise MTBF.

Deliver capacity on time: plan and execute expansions/retrofits; commission MEP systems with Design/Construction; achieve flawless change control (MOP/SOP/EOP).

Scale tooling & automation: mature DCIM/BMS/EPMS, monitoring/alerting, work management (Jira/ServiceNow), knowledge base (Confluence), and light scripting/SQL for telemetry and workflow automation.

Run a metrics-first operation: publish dashboards and KPIs (availability, PUE, MTBF/MTTR, work compliance, safety) and use them to drive decisions.

Partner across functions: work with Cloud/Compute, Network, Security, and Capacity Planning to optimize performance, cost, and resiliency across the fleet.

Manage vendors & colos: own contracts, SLAs, and execution for rack deliveries, PDUs, fiber/copper, and lifecycle PMs; validate colo topology and compliance.

Raise the safety bar: enforce a zero-injury EHS culture; conduct drills/audits for life safety, physical security, and data protection.

Forecast and budget: build data-backed plans for power, spares, headcount, and projects; track OpEx/CapEx with rigor.

We expect you to have:

Associate's degree or trade certification in Electrical/Mechanical/Industrial Engineering (or equivalent experience).

10+ years in electrical/mechanical/HVAC/controls within industrial/commercial settings, with 5+ years specifically in data center or mission-critical facilities.

Team leadership experience in 24/7 sites (managing leads/techs, vendors, and on-call operations).

Deep, hands-on knowledge of UPS/generators/switchgear, chillers/CRAC/CRAH, fire detection/suppression, BMS/EPMS/DCIM, and structured cabling (copper & fiber).

Proven strength in incident management, RCA/Corrective Actions, change management, and vendor/contract oversight.

Data-driven mindset with the ability to forecast resources and make analytics-backed decisions (Excel; SQL/scripting a plus).

Excellent written/verbal communication with comfort presenting to executives and guiding field teams during live events.

Ability to travel up to ~30% and support after-hours escalations when needed.

It would be a bonus if you have:

Bachelor's degree in Electrical/Mechanical/Industrial Engineering, Engineering Management, or Reliability Engineering.

Hyperscale/colo experience with reliability-centered maintenance, predictive analytics, and Lean/Six Sigma practices.

Familiarity with Linux fundamentals, network equipment installation/troubleshooting, and fiber optics testing.

Experience with Jira, Confluence, ServiceNow (or similar); strong SOP/MOP/EOP authorship.

Certifications such as CDCP, DCM, PMP, OSHA-30, ITIL, or Uptime-aligned credentials.

Key Employee Benefits

Health insurance: 100% company-paid medical, dental, and vision coverage for employees and families.

401(k) plan: up to 4% company match with immediate vesting.

Parental leave: 20 weeks paid for primary caregivers, 12 weeks for secondary caregivers.

Remote work reimbursement: up to $85/month for mobile and internet.

Disability & life insurance: company-paid short-term disability, long-term disability, and life insurance coverage.

Compensation

We offer competitive salaries ranging from $90k to $140k base, plus quarterly performance bonuses.

Join Nebius Today!
