aitrainer.work - AI Training Jobs Platform

Human Baseliner for Open-Ended ML Research Tasks

Mercor Remote

Education: Any

Pay Rate: $82.5/task

Listed: Today


This position is hosted on an external talent platform. Please only apply for this position if it fits your skills and interests.


About this Role

We are hiring experienced machine learning engineers and researchers to serve as human baseliners for evaluations of open-ended machine learning research tasks. These evaluations measure how well AI agents perform on realistic AI R&D problems. To interpret agent performance, we also need strong human reference points: skilled practitioners attempting the same tasks under the same time and compute constraints. As a baseliner, you will complete self-contained ML research tasks in a sandboxed environment, working independently with your preferred tools and workflow. Your performance will be used as a benchmark against which frontier-model agents are evaluated.


What You’ll Do

  • Attempt open-ended machine learning research tasks under a fixed time and compute budget (work trial)
  • Work independently in a sandboxed Linux environment with internet access
  • Use your preferred tooling, including IDEs and AI coding assistants such as Cursor, Claude Code, and ChatGPT
  • Record your full working session via screen recording
  • Complete a short pre-task and post-task questionnaire
  • Submit your final work product, screen recording, and completed questionnaires

After the work trial, you will be hired for a longer commitment.

Commitment

  • Minimum 20 hours per week if selected
  • More availability is strongly preferred

Requirements

Candidates must meet all of the following:

  • 3+ years of machine learning experience
      • Time spent in a PhD program counts toward this requirement
      • Undergraduate and master’s experience does not count
  • Attended a top-100 university or worked at FAANG or a comparable company
  • Experience with at least one major ML framework such as PyTorch, JAX, or TensorFlow
  • Deep, hands-on expertise in at least one of the focus areas below:
      • Pretraining under tight data and compute budgets
      • PPO, reward shaping, custom gym / gymnasium environments, and throughput tuning
      • Full fine-tuning, LoRA, QLoRA, DPO, RLHF, RLAIF, and distillation
      • Large-scale corpus filtering, deduplication, subsampling, and benchmark contamination avoidance
      • Architecture design under strict parameter-count or size constraints
      • Modifying pretrained architectures, including attention patterns, pooling heads, or training objectives
      • Contrastive training for embedding or retrieval models
      • Generative vision or video modeling
      • Multilingual or low-resource language experience
      • Image or video data pipelines at scale
      • Experience balancing competing model objectives such as safety and capability
      • Prior work as an ML evaluator, red-teamer, or baseliner

Required Domain Expertise

Candidates must have strong practical experience in at least one of the following:

  • Pretraining: training transformer language models from scratch
  • Reinforcement learning: training agents in custom or existing environments
  • Post-training: fine-tuning and aligning LLMs
  • Dataset curation: building and cleaning large text corpora for LLM training
  • Model architecture: designing and modifying neural network architectures

Logistics (work trial requirements)

  • One baseline attempt per contractor per task (each task may only be attempted once by a given contractor)
  • All work is confidential and covered by NDA
  • Compute and environment are provided; no personal GPU is required

We consider all qualified applicants without regard to legally protected characteristics and provide reasonable accommodations upon request.

Requirements

  • Must be eligible to work remotely
  • Fluent proficiency in English (Written & Verbal)
  • Reliable high-speed internet connection
  • Bachelor's degree or equivalent professional experience
  • Demonstrated expertise in Data Science

Compensation Analysis

Work from anywhere, at any time. This fully remote position ($82.5/task) removes geographic barriers, allowing you to earn US-competitive rates regardless of your local market. It can also serve as a stepping stone toward a career in the data labeling and AI training ecosystem.


Frequently Asked Questions

Is this for freelancers or full-time employees?

Both. Mercor tries to match you with clients who want long-term contractors. Unlike other platforms where you log in and grab small tasks, Mercor matches you with one company for a steady role (e.g., 'Python Tutor for 3 months').

I'm not comfortable on camera. Can I still apply?

No. The application requires a video interview with an AI avatar. The AI asks you questions about your resume, and the video is shared with potential clients to prove your communication skills.

Does it cost money to join?

No. You should never pay to join these platforms. Mercor makes money by charging the client a fee on top of your hourly rate.

What does the work actually look like?

It is practical, hands-on data work. You might be recording short videos, categorizing images, rating text responses, or analyzing data. The tasks are designed to be short and distinct—typically 5-60 minutes per task.

How flexible is the schedule?

Extremely. This is true "log in and work" flexibility. You can usually work for 20 minutes or 4 hours depending on your availability. There are rarely minimum hour requirements, making it ideal for side income.

Is there an interview?

Usually, no. Hiring for these roles is almost entirely based on passing an automated assessment or "qualification" task. If you pass the test, you get access to the work.

How soon will I start?

Important: Mercor is a talent marketplace, not a task queue. Applying puts you in a pool of candidates. You will only start working when a specific client (like a major AI lab) selects your profile. This matching process can take weeks.