Lead Systems Engineer (HPC)

Princeton University

Princeton, NJ

JOB DETAILS
SKILLS
Artificial Intelligence (AI), Automation, Best Practices, Calendar Management, Computer Services, Computer Software, Computer Systems, Configuration Management, Data Clustering, Data Management, Data Storage, Educational Administration, File Systems, Hardware Administration, Identify Issues, Information Technology & Information Systems, Investment Strategy, Leadership, Linux Operating System, Mentoring, Network Support, Problem Solving Skills, Programming Tools, Research Administration, Research Skills, Scripting (Scripting Languages), Standards Development, Systems Administration/Management, Systems Analysis, Systems Engineering, Team Player, Technical Leadership, Technical Research, Technical Strategy, Technical Support, Testing, Time Management, Trend Analysis, User Documentation, Vendor/Supplier Evaluation
LOCATION
Princeton, NJ
POSTED
3 days ago
Overview

The Lead Systems Engineer for High Performance Computing (HPC) and Artificial Intelligence (AI) works as part of the Advanced Systems team within Research Computing, which supports the hardware and system-level software on the University's centralized high-performance computing systems and other research computing systems. The Lead Systems Engineer is responsible for engaging with faculty, researchers, vendors, and other information technology (IT) staff to specify, design, install, and administer research computing systems while also providing insight into trends and technologies supporting the advancement of AI research. The Lead Systems Engineer is also expected to be attuned to trends in computational research and will be asked to evaluate, pilot, and implement systems that advance Princeton's HPC and AI technologies and enhance Research Computing services.

The Lead Systems Engineer serves as an expert for HPC and AI hardware and software and helps researchers troubleshoot system-level problems with software, data, and job submission. This position requires working closely with colleagues at all levels of technical understanding in the Office of Information Technology (OIT) and University academic departments to provide timely and creative support for research computing. The Lead Systems Engineer is required to work well both on teams and independently, and will be asked to lead initiatives within Advanced Systems, requiring only general supervision. On-call rotation is a mandatory facet of this role, requiring infrequent off-hour and weekend duty.

Responsibilities

Operations:
Design, maintain, troubleshoot, and refine advanced HPC/AI cluster infrastructure, including high-performance interconnects, cluster schedulers, and configuration management across research systems.
Partner with colleagues in Advanced Data and Storage Management to align designs for scratch filesystems and data management with cluster designs.
Develop data-transfer pathways and networks to support AI-driven computing workloads.
Establish and maintain best practices for cluster management and usage to support AI-driven workloads.
Develop documentation for users and technical staff that can be used by the larger community.
Develop, enhance, and expand monitoring infrastructure and related protocols for research computing systems.
Plan and implement scheduled maintenance of operations, including during off hours.
Perform other tasks as assigned.

Technical Leadership:
Define and drive the institutional technical strategy for advanced AI and data-intensive HPC.
Bring creativity, foresight, and mature professional judgment to anticipating and solving novel and complex problems, determining project objectives and requirements, and developing standards and governance for all research computing platforms.
Leveraging expertise in AI technologies, identify, evaluate, and pilot researcher-facing systems that enable the acceleration of research using AI.
Lead the implementation and expand adoption of modern, automation-driven infrastructure and cluster management practices.
Promote institution-wide collaboration as the community expert advising and working with faculty, researchers, and vendors on emerging trends and challenges in AI-enabled research computing.
Cultivate a collaborative, knowledge-sharing environment by providing technical mentorship to systems specialists and analysts and by sharing designs and operational expertise across data systems and HPC/AI infrastructure.
Contribute to the strategic vision for HPC/AI systems, and advise senior leadership and stakeholders on strategic investments, risks, and opportunities related to research infrastructure.

Troubleshooting and Problem Resolution:
Monitor HPC clusters, networks, and storage systems for abnormalities, and resolve issues.
Analyze and solve problems in Linux and HPC/AI computing environments with software, data, and job submissions.
Use scripting and programming tools to troubleshoot issues.

About the Company

Princeton University