Member of Technical Staff (MTS) - Multimodal Foundation Models
Deeproute.ai
Fremont, CA
JOB DETAILS
SKILLS
Analysis Skills, Artificial Intelligence (AI), Autonomous Driving Systems, Computer Science, Computer Vision, Distributed Computing, Engineering, Failure Analysis, Large-Scale Systems, Machine Learning, Memory Hardware, Modeling Languages, Production Systems, Robotics, Scalable System Development, Systems Scalability, Technical Research
POSTED
29 days ago
Focus
Multimodal Foundation Models · Representation Learning · Method Innovation
We are looking for strong technical builders and researchers whose understanding of foundation models and representation learning goes beyond applying existing frameworks.
Ideal candidates should have:
Strong experimental rigor
Solid systems and modeling intuition
Hands-on engineering ability
Interest in scalable multimodal AI systems for real-world autonomy
We value people who can bridge research and production, and who care about robustness, scalability, efficiency, and practical deployment in large-scale autonomous driving systems.
Responsibilities
1. Large-Scale Foundation Model Pretraining
Develop scalable pretraining pipelines for large-scale multimodal driving data
Design and optimize training strategies for:
Vision-language-action models
Video foundation models
Long-context temporal modeling
Multimodal representation alignment
Improve:
Training stability
Data efficiency
Scaling efficiency
Representation robustness
Work on distributed training systems and large-scale model optimization using frameworks such as:
PyTorch Distributed
DeepSpeed
Megatron-LM
2. Representation Learning & Method Innovation
Design and improve self-supervised and multimodal learning methods for real-world autonomous driving systems
Conduct architecture-level research on:
Vision Transformers (ViT)
Video / temporal architectures
Multimodal fusion and alignment
Embedding and retrieval systems
Long-context and memory-efficient architectures
Explore and improve:
Pretraining objectives
Loss functions
Training paradigms
Generalization and robustness
Analyze model behavior through:
Rigorous ablation studies
Failure case analysis
Representation probing and evaluation
3. Efficient Foundation Models & Scalable Deployment
Improve the efficiency, scalability, and deployability of large multimodal foundation models for real-world autonomous driving systems
Work on areas such as:
Model quantization
Knowledge distillation
Efficient attention mechanisms
Sparse architectures and Mixture-of-Experts (MoE)
Long-context and memory-efficient modeling
Inference acceleration and serving optimization
Training and inference system efficiency
Optimize model throughput, latency, memory usage, and deployment performance for large-scale production environments
Requirements
MS or PhD in:
Computer Vision
Machine Learning
Robotics
Computer Science
Related fields
Strong understanding of:
Foundation models
Self-supervised learning
Representation learning
Multimodal learning
Large-scale pretraining
Hands-on experience with methods such as:
CLIP
DINO / DINOv2
MAE
Contrastive learning
Masked modeling
MoE or scalable transformer architectures
Experience with one or more of the following is highly valued:
Video foundation models
Long-context modeling
Retrieval systems
Efficient inference
Distributed training
Model compression and deployment optimization
Strong publication record in top-tier venues is preferred: