Responsibilities

* Design and develop model architectures for perception, reasoning, and action across multimodal inputs (e.g., vision, language, proprioception)
* Build models that learn structured representations of the world, including objects, dynamics, and interactions
* Advance multimodal learning approaches, including fusion, alignment, and cross-modal reasoning
* Improve model capabilities in areas such as generalization, robustness, and long-horizon reasoning
* Work across the model lifecycle, from initial research and prototyping to training and deployment
* Collaborate closely with pretraining, video, generative, RL, and robot learning teams to integrate modeling advances into the full autonomy stack
* Design experiments and evaluation frameworks to understand model behavior and guide iteration
* Contribute to the development of new modeling paradigms for embodied AI systems

Requirements

* Experience designing and training deep learning models for vision, language, or multimodal systems
* Strong understanding of modern model architectures (e.g., transformers and related approaches)
* Experience improving model performance through architectural innovation and experimentation
* Proficiency in Python and deep learning frameworks such as PyTorch
* Strong experimental rigor and ability to iterate on model design and performance
* Solid software engineering skills and ability to build reliable, maintainable systems
* Ability to operate independently and drive ambiguous, high-impact technical problems

Bonus Qualifications

* Experience with multimodal models (vision-language or vision-language-action systems)
* Background in representation learning, world models, or structured prediction
* Experience working on frontier models at companies such as OpenAI, Google DeepMind, Anthropic, Meta, or xAI
* Familiarity with embodied AI, robotics, or real-world ML systems
* Experience with large-scale training or distributed systems
* Publication record in machine learning, computer vision, NLP, or multimodal AI
The pay offered for this position may vary based on several individual factors, including job-related knowledge, skills, and experience.

This role focuses on developing new modeling approaches across vision, language, and action, spanning representation learning, multimodal fusion, and model capabilities that directly impact robot intelligence.