Senior HPC Cluster Engineer

NVIDIA Corp

Austin, TX

JOB DETAILS

SKILLS

Ansible, Artificial Intelligence (AI), Automation, Autonomous Driving Systems, Bash Scripting, Benchmarking, Broadband, CUDA (Compute Unified Device Architecture), CentOS, Communication Skills, Computer Networks, Computer Science, Computer Systems, Configuration Management, Continuous Improvement, Corrective Action, Cross-Functional, Debugging Tools, Distributed Computing, Docker, Ecosystems, Electrical Engineering, Electronic Design Automation, Energy Efficiency, GPU (Graphics Processing Unit), Gaming, Large-Scale Systems, Linux Distributions, Linux Operating System, MPI, Machine Tool, Metrics, Performance Analysis, Performance Tuning/Optimization, Problem Solving Skills, Python Programming/Scripting Language, Red Hat Linux Operating System, Return on Capital Employed (ROCE), Root Cause Analysis, Scalable System Development, Scientific Research, Systems Analysis, Team Player, Technical Leadership, Technical Strategy, Ubuntu

LOCATION

Austin, TX

POSTED

30+ days ago

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It's a unique legacy of innovation that's fueled by great technology and amazing people. Today, we're tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what's never been done before takes vision, innovation, and the world's best talent. As an NVIDIAN, you'll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.

We are seeking a highly skilled and experienced HPC Cluster Engineer to design, deploy, and operate GPU Compute Clusters for EDA (Electronic Design Automation) and high-performance computing workloads used across multiple teams and projects. Join our engineering team and collaborate with researchers and infrastructure teams to ensure our GPU clusters are highly performant, scalable and reliable.

Responsibilities:

Develop and enhance our ecosystem around GPU-accelerated computing including developing scalable automation solutions.
Continuously improve infrastructure provisioning, management, observability and day-to-day operation through automation.
Provide technical leadership and strategic guidance for managing large-scale HPC systems, including the deployment of compute, networking, and storage.
Foster strong customer and multi-functional partnerships to ensure consistent cluster support and rapidly adapt to evolving user needs.
Support our researchers to run their EDA workloads including performance analysis and optimizations.
Conduct root cause analysis and suggest corrective action. Proactively find and fix issues before they occur.
Build innovative tooling to accelerate researchers' velocity, debugging and software performance at scale.

Requirements:

Bachelor's degree in Computer Science, Electrical Engineering or related field or equivalent experience.
Minimum of 5 years of proven experience crafting and operating large-scale compute infrastructure, including cluster configuration management tools such as BCM or Ansible.
Experience with AI/HPC job schedulers and orchestrators, such as Slurm, LSF, PBS or K8s.
Applied experience with AI/HPC workflows that use MPI and NCCL.
Proficient in using Linux, including Rocky/CentOS/RHEL and/or Ubuntu Linux distributions.
A solid understanding of container technologies such as Enroot and Docker.
Proficiency in Python and Bash.
Experience analyzing and tuning performance for a variety of EDA workloads.
Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions.
Excellent communication and collaboration skills, with the ability to work effectively with various teams and individuals.
Passion for continual learning and staying ahead of new technologies and effective approaches in the HPC infrastructure fields.

Ways to Stand Out from the Crowd:

Background with NVIDIA GPUs, CUDA Programming, NCCL and MLPerf benchmarking.
Experience supporting EDA workloads and tools.
Familiarity with High-Speed Networking pertaining to HPC including InfiniBand, RDMA and RoCE.
Understanding of fast, distributed storage systems such as Lustre and GPFS for AI/HPC workloads.
Familiarity with metrics collection and visualization at scale with Prometheus, OpenSearch and Grafana.

About NVIDIA:

NVIDIA is building the most groundbreaking and powerful compute platforms for the world to use. It's because of our work that scientists, researchers and engineers can advance their ideas. At its core, our visual computing technology not only enables an amazing computing experience, but it is also energy efficient! We pioneered a supercharged form of computing loved by the most demanding computer users in the world: scientists, designers, artists, and gamers. It's not just technology though! It is our people, some of the brightest in the world, and our diverse company culture that make NVIDIA one of the most fun, innovative and dynamic places to work in the world! At the center of NVIDIA's culture are our core values, such as innovation, excellence, determination and team, that guide us to be the best we can be.

Compensation and Benefits:

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 152,000 USD - 241,500 USD for Level 3, and 184,000 USD - 287,500 USD for Level 4. You will also be eligible for equity and benefits.

Application and Hiring Process:

Applications for this job will be accepted at least until March 15, 2026. This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes. NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About the Company

NVIDIA Corp

Visualize your future . . . We Do
NVIDIA is the world leader in graphics processing technologies, creating innovative, industry-changing products for computing, consumer electronics, and mobile devices. NVIDIA products are transforming visually-rich applications such as video games, film production, broadcasting, industrial design, space exploration, and medical imaging. We invest in our people and our technologies, support and fund industry research around the world, and consistently deliver high-quality products. NVIDIA's culture promotes and inspires a team of world-class employees to be at the top of their game. We've created an environment where talents are recognized and collaboration is valued. Our employees are shaping the world of tomorrow. . . today. We invite you to explore the opportunities available at NVIDIA to see what your future may hold.

COMPANY SIZE

10,000 employees or more

INDUSTRY

Computer Software

FOUNDED

1993

WEBSITE

http://www.nvidia.com
