Senior Research Scientist/Engineer - AI Infrastructure

Beijing ByteDance Technology Co Ltd

San Jose, CA

JOB DETAILS
SKILLS
ASIC (Application Specific Integrated Circuit), Artificial Intelligence (AI), Automation System Development, Autoscaling, Benchmarking, Best Practices, C++ Programming Language, CUDA (Compute Unified Device Architecture), Cloud Computing, Coaching, Communication Skills, Computer Programming, Computer Science, Computer Skills, Conferences, Data Management, Data Processing, Data Science, Database Extract Transform and Load (ETL), Energy Efficiency, FPGA, GPU (Graphics Processing Unit), Go Programming Language (Golang), High Availability, JAX, Kernel Programming, Machine Learning, Mentoring, Network Operations Center, Open Source, Patents, Performance Tuning/Optimization, Problem Solving Skills, Publications, Python Programming/Scripting Language, Reliability Engineering, Research Skills, Resource Management, Scientific Research, Service-Oriented Architecture (fka Distributed Object Architecture), Software Engineering, Systems Engineering, Systems Reliability, Telemetry, Training Data Sets, Virtual Machine (VM)
LOCATION
San Jose, CA
POSTED
30+ days ago

We are seeking a highly skilled and motivated AI Infrastructure Engineer to join our dynamic team. In this role, you will be responsible for designing, building, deploying, and maintaining the robust and scalable infrastructure that powers our cutting-edge artificial intelligence (AI) and machine learning (ML) initiatives. You will work closely with our AI/ML researchers, data scientists, and software engineers to create an efficient, high-performance environment for training, inference, and data processing. Your expertise will be critical in enabling the next generation of AI-driven products and services.

Responsibilities

• Lead end-to-end design of scalable, reliable AI infrastructure (AI accelerators, compute clusters, storage, networking) for training and serving large ML workloads.
• Define and implement service-oriented, containerized architectures (Kubernetes, VM frameworks, unikernels) optimized for ML performance and security.

• Profile and optimize every layer of the ML stack: ML compiler, GPU/TPU scheduling, NCCL/RDMA networking, data preprocessing, and training/inference frameworks.
• Develop low-overhead telemetry and benchmarking frameworks to identify and eliminate bottlenecks in distributed training and serving.

• Build and operate large-scale deployment and orchestration systems that auto-scale across multiple data centers (on-premises and cloud).
• Champion fault tolerance, high availability, and cost efficiency through smart resource management and workload placement.

• Architect and implement robust ETL and data ingestion pipelines (Spark/Beam/Dask/Flume) tailored for petabyte-scale ML datasets.
• Integrate experiment management and workflow orchestration tools (Airflow, Kubeflow, Metaflow) to streamline the path from research to production.

Collaboration & Mentorship

• Partner with ML researchers to translate prototype requirements into production-grade systems.
• Mentor and coach engineers on best practices in performance tuning, systems design, and reliability engineering.

Required Qualifications

  • Master's degree (PhD preferred) in Computer Science, Engineering, or a related technical field.
  • 5+ years in roles focused on infrastructure or systems engineering, including at least 2 years on ML/AI infrastructure.
  • Strong programming skills in Python, C++, Go, or Rust for systems development and automation.
  • Ability to design end-to-end systems that balance performance, reliability, security, and cost.
  • Excellent communication skills, with the ability to bridge research and production teams.
  • Strong problem-solving aptitude and a drive to push the state of the art in ML infrastructure.

Preferred Qualifications

  • Hands-on experience with ML training frameworks (PyTorch, TensorFlow, JAX) at scale.
  • Knowledge of hardware-level optimization: CUDA, ROCm, kernel bypass, FPGA/ASIC integration.
  • Experience with heterogeneous computing for AI, big data, and HPC.
  • Open-source contributions or patents in the ML systems space.
  • Publications at ML or systems conferences such as MLSys, ICML, ICLR, KDD, or NeurIPS.

About the Company

Beijing ByteDance Technology Co Ltd