ML Evaluation & Insights Engineer - ASE
Apple·Singapore
Role details
Summary
Imagine what you could do here. At Apple, great new ideas become extraordinary products, services, and customer experiences quickly. Bring passion and dedication to your work, and there’s no telling what you could accomplish.
Apple Services Engineering (ASE) powers many AI features across App Store, Music, Video, and more. We build deeply personal products with the goal of representing users around the globe authentically. We continuously aim to avoid perpetuating systemic biases and to maintain safe and trustworthy experiences across our AI tools and models.
Role Overview
Our team, part of Apple Services Engineering, is seeking an ML Research Engineer to lead the design and ongoing development of automated benchmarking methodologies. You will investigate the behavior of media-related agents, craft rigorous evaluation frameworks, and establish scientific standards for assessing feature quality. You will build scalable evaluation techniques that enable engineers to assess candidate models and product features for optimal performance, and you will drive the generation of benchmark datasets and evaluation methodologies for model and application outputs at scale, translating insights into actionable engineering and product improvements. This role blends deep technical expertise with strong analytical judgment to develop tools for assessing and improving the behavior of advanced AI/ML models. You will collaborate cross-functionally with Engineering, Project Management, Product, Safety, and Editorial teams to ensure AI experiences are reliable, safe, and aligned with human expectations. The successful candidate will work proactively, both independently and with others, on a wide range of projects, partnering with ML and data scientists, software developers, project managers, and other teams at Apple to translate requirements into scalable, reliable, and efficient evaluation frameworks.
Responsibilities
- Lead the design and ongoing development of automated benchmarking methodologies for AI/ML systems.
- Investigate the behavior of media-related agents and craft rigorous evaluation frameworks and techniques.
- Establish scientific standards for assessing feature quality and support scalable evaluation techniques.
- Generate benchmark datasets and evaluation methodologies for model and application outputs at scale.
- Collaborate cross-functionally with Engineering, Product, Safety, Editorial, and Project Management teams to ensure reliability, safety, and alignment with human expectations.
- Translate requirements into scalable, reliable, and efficient evaluation frameworks with ML and data science teams.
- Communicate insights and results to stakeholders, and contribute to tooling that improves engineering and product decisions.
Minimum Qualifications
- Advanced degree (MS or PhD) in Computer Science, Software Engineering, or equivalent research/work experience.
- At least one year of work experience, either as a postdoc or in industry.
- Strong research background in empirical evaluation, experimental design, or benchmarking.
- Proficiency in Python (pandas, NumPy, Jupyter, PyTorch, etc.).
- Deep familiarity with software engineering workflows and developer tools.
- Experience working with or evaluating AI/ML models, preferably LLMs or program synthesis systems.
- Strong analytical and communication skills, including the ability to write clear reports.
Technical Skills
- Experience working with large datasets, annotation tools, and model evaluation pipelines.
- Familiarity with evaluations specific to responsible AI and safety, hallucination detection, and/or model alignment concerns.
- Ability to design taxonomies, categorization schemes, and structured labeling frameworks.
- Analytical strength: ability to interpret unstructured data (text, transcripts, user sessions) and derive meaningful insights.
- Communication: ability to synthesize qualitative and quantitative insights into actionable guidance; ability to communicate complex architectures and systems to stakeholders.
Education & Background
- Education in Data Science, Linguistics, Cognitive Science, HCI, Psychology, Social Science, or a related field.
- Fluency in English and in one of the following: Korean, Chinese, Japanese, French, Spanish, Portuguese, Hindi, or Tamil.
Preferred Qualifications
- Publications in AI/ML evaluation or related fields.
- Experience with automated testing frameworks.
- Experience constructing human-in-the-loop or multi-turn evaluation setups.
- Intermediate or advanced proficiency in Swift.
- Familiarity with RAG systems, reinforcement learning, agentic architectures, and model fine-tuning.
- Expertise in designing annotation guidelines and validation instruments and techniques.
- Background in human factors, social science, and/or safety assessment methodologies.