Inference Optimization Intern – Performance Modeling

Institute of Foundation Models

Sunnyvale, California

JOB DETAILS
SKILLS
Analysis Skills, Architectural Services, Artificial Intelligence (AI), C++ Programming Language, CUDA (Compute Unified Device Architecture), Communication Skills, Computer Engineering, Computer Programming, Computer Science, Electrical Engineering, GPU (Graphics Processing Unit), Kernel Programming, Large-Scale Systems, Machine Learning, Memory Hardware, Metrics, Performance Analysis, Performance Engineering, Performance Management, Performance Modeling, Performance Tuning/Optimization, Presentation/Verbal Skills, Python Programming/Scripting Language, Scalable System Development, Software Design, Systems Engineering, Writing Skills
About the Institute of Foundation Models
 
The Institute of Foundation Models is dedicated to advancing the science and engineering of large-scale AI systems. Our researchers and engineers develop cutting-edge foundation models while pushing the limits of high-performance computing and efficient AI inference. By combining deep expertise in machine learning, systems engineering, and hardware optimization, we build scalable AI solutions that drive scientific discovery and real-world impact.
As part of the team, interns work alongside world-class researchers and performance engineers to optimize the execution of large-scale foundation models on next-generation NVIDIA GPU architectures. This internship provides hands-on experience in low-level GPU performance analysis, kernel optimization, and hardware-aware inference acceleration.

Key Responsibilities

This intensive internship offers the opportunity to contribute to the development of a simulator and profiling framework for foundation model inference on NVIDIA GPUs.
Responsibilities include:
  • Develop analytical performance models for GPU kernels and inference workloads.
  • Build and validate a simulator to estimate theoretical hardware performance limits.
  • Compare measured kernel performance against architectural peak throughput.
  • Identify performance bottlenecks in compute, memory, communication, and scheduling.
  • Analyze GPU execution using NVIDIA Nsight Systems and Nsight Compute.
  • Investigate PTX and SASS code generation to understand low-level execution behavior.
  • Collaborate with researchers and engineers to optimize inference kernels for transformer-based models.
  • Evaluate utilization of Tensor Cores, memory bandwidth, caches, and instruction pipelines.
  • Design profiling methodologies for Hopper and Blackwell architectures.
  • Document findings and provide actionable recommendations for performance improvements.

Academic Qualifications

Currently pursuing a degree in Computer Science, Computer Engineering, Electrical Engineering, Artificial Intelligence, High-Performance Computing, or a related quantitative discipline.

Preferred Qualifications

  • Experience with CUDA programming and GPU kernel development.
  • Understanding of NVIDIA GPU architecture and memory hierarchy.
  • Familiarity with performance profiling tools such as Nsight Systems and Nsight Compute.
  • Knowledge of PTX, SASS, and low-level GPU execution.
  • Experience optimizing CUDA kernels for throughput and latency.
  • Understanding of roofline analysis, performance modeling, and hardware utilization metrics.
  • Experience with deep learning frameworks such as PyTorch or TensorFlow.
  • Strong programming skills in C++, CUDA, and Python.

Desired Skills

  • Performance engineering mindset.
  • Strong analytical and debugging abilities.
  • Interest in AI systems, inference optimization, and hardware-software co-design.
  • Ability to work independently on research and engineering challenges.
  • Excellent written and verbal communication skills.

About the Company

Institute of Foundation Models