Machine Learning Safety: Evaluation Research Engineer

Apple Inc

Cupertino, CA

JOB DETAILS
SKILLS
Analysis Skills, Artificial Intelligence (AI), Automation, Benchmarking, Calibration, Communication Skills, Concrete, Content Structure, Corrective Action, Cross-Functional, Data Analysis, Data Management, Data Quality, Data Sets, Develop Methodologies, Documentation, Experiment Design, Information Science, Internationalization, Linguistics, Localization, Machine Learning, Machine Tool, Market Analysis, Metrics, Multicultural, Multilingual, Organizational Skills, Policy Development, Policy Evaluation, Presentation/Verbal Skills, Product Engineering, Product Safety, Product Support, Product Testing, Project Tracking, Quality Monitoring, Reliability Analysis, Reporting Dashboards, Risk, Risk Analysis, Safety Compliance, Safety Systems, Safety Training, Social Sciences, Standards Development, Stress Testing, Structured Analysis, Survey Design, Target Marketing, Taxonomies, Team Player, Technology Analysis, Training Data Sets, Writing Skills
LOCATION
Cupertino, CA
POSTED
30+ days ago

This role supports the design and development of safety evaluation methodologies for generative and agentic AI features that enable users across the globe to interact with our media products and services.

You will play an impactful role: shaping responsible AI and safety policies, evaluating fidelity to product safety requirements, creating risk assessments and taxonomies, curating exemplar safety evaluation datasets, and ensuring that evaluation frameworks are culturally and linguistically grounded.

An ideal candidate has a strong understanding of issues in responsible AI and AI and society, as well as technology evaluation design principles and practices, and brings experience designing evaluations to support policies and/or product requirements, classification systems, and annotation and/or study participant guidelines.

Taxonomy Development: Design, refine, and maintain safety-relevant taxonomies that capture risk categories, content types, and policy distinctions, achieved through collaborations with subject matter experts who bring knowledge across languages and cultural contexts. You will work collaboratively to ensure taxonomies are comprehensive, internally consistent, and actionable for downstream evaluation work.

Policy-to-Data Translation: Develop and validate exemplar sets that illustrate taxonomy categories, edge cases, and boundary conditions. Collaborate with language and cultural experts to ensure exemplars are culturally appropriate and representative across target markets. Partner with policy, product, and engineering teams to translate responsible AI policies and guidelines into concrete data requirements, annotation schemas, and evaluation criteria that can be operationalized across markets. Develop and maintain synthetic data generation pipelines to augment evaluation coverage, stress-test safety boundaries, and support evaluation in low-resource languages. Ensure synthetic data is diverse, representative, and validated against human-generated benchmarks.

Automated Judge Development: Shape the development, training and fine-tuning, and validation of automated judge models that can reliably score AI system outputs for safety and policy compliance. Develop calibration and agreement metrics to ensure judges meet human-parity benchmarks. Design and implement validation frameworks to assess the accuracy, reliability, and consistency of automated evaluation systems. Develop methods to detect drift, bias, and failure modes in automated judges across markets.

Scalable Analysis & Reporting Automation: Create automated pipelines for analysis and reporting that reduce manual effort, increase reproducibility, and enable rapid cross-market safety assessments. Build tooling that integrates with existing dashboards and reporting workflows.

Documentation & Communication: Produce clear, detailed documentation artifacts. Present findings and recommendations to cross-functional stakeholders including engineering, product, compliance, and policy teams.

Canonical Guideline Development: Author and maintain canonical evaluation guidelines that standardize task definitions, rating criteria, and edge-case handling. These assets will be adapted to scale across languages and markets, with the support of multi-lingual and operations experts. You will ensure guidelines are clear, complete, and adaptable.

Evaluation Design & Execution: Design and run pilot evaluations to validate task setups, identify guideline ambiguities, calibrate annotator understanding, and surface issues before full-scale deployment. Manage evaluation instruments. Analyze pilot results and iterate on guidelines and configurations accordingly.

Monitoring & Data Quality: Develop and implement monitoring frameworks to track evaluation progress, annotator performance, inter-rater agreement, and data quality in real time. Flag anomalies and implement corrective actions to maintain data integrity across markets.

4+ years of experience in an applied research setting related to evaluation design, AI ethics, Responsible AI, AI safety, computational social science, content analysis, or a closely related field.

Strong understanding of taxonomy design, classification systems, and annotation methodology.

Experience developing evaluation guidelines and exemplar sets for human annotation or labeling tasks.

Demonstrated ability to collaborate with subject matter experts (e.g., linguists, cultural consultants, multi-lingual annotators) to inform research design.

Able to work independently to drive outcomes among cross-functional teams, with minimal direction.

Organized, highly attentive to detail, and manages time well.

Excellent written and oral communication skills.

Experience working in industry.

Advanced degree (MS/PhD) in Linguistics, Information Science, Computational Social Science, or a related socio-technical field.

Experience designing evaluation frameworks for multilingual or cross-cultural contexts.

Familiarity with responsible AI, AI safety, or content moderation policy frameworks.

Experience with experimental design methodologies, inter-rater reliability analysis, and annotation quality assessment methods.

Prior experience working with localization, internationalization, or language service teams.

Experience with survey design, AI policy development, and/or structured content analysis methodologies.

About the Company

Apple Inc

We bring amazing people together to make amazing things happen.

We’re a diverse collection of thinkers and doers, continually reimagining what’s possible to help us all do what we love in new ways. The people who work here have reinvented entire industries with the Mac, iPhone, iPad, and Apple Watch, as well as with services, including iTunes, the App Store, Apple Music, and Apple Pay. And the same passion for innovation that goes into our products also applies to our practices — strengthening our commitment to leave the world better than we found it.

About Apple

There’s a place here for every kind of brilliant. Everyone here is an innovator, or an innovator-to-be, no matter what your team or your role. So bring your passion, courage, and original thinking and get ready to share it, because every new product, service, or feature we invent is the result of people working together to make each other’s ideas stronger. Innovation at this level depends on people who represent the variety of the human experience and inspire us with their own fresh perspectives. Together, we’ll do amazing work that can make a difference in people’s lives. Including your own. Learn more about working at Apple.

COMPANY SIZE
10,000 employees or more
INDUSTRY
Computer/IT Services
FOUNDED
1976
WEBSITE
https://www.apple.com/jobs