Senior Software Engineer — AI Evaluation & Benchmarks
Alignerr • Remote • Posted 22 days ago
Education: Any
Pay Rate: $90/task
What You'll Do
- Design and implement coding benchmarks used to evaluate frontier AI models across real-world programming tasks
- Build and maintain scalable data pipelines for AI evaluation workflows
- Analyze AI-generated code for correctness, reliability, and edge-case failures
- Create structured evaluation scenarios that rigorously test reasoning, debugging, and code quality
- Work with large code repositories and multi-language environments
- Collaborate on systems that improve how AI models understand and generate software
- Provide detailed technical feedback on model performance and failure patterns
- Contribute to the design of evaluation frameworks that set industry standards
About the Role
What if the code you write could determine how smart the next generation of AI truly is? We're looking for experienced Software Engineers to design and build the coding benchmarks and data pipelines used to evaluate frontier AI models: the systems that decide whether an AI can actually reason, debug, and write production-quality software.
This is high-impact, technically demanding work at the intersection of software engineering and AI research. You'll work with large codebases, multiple programming languages, and scalable infrastructure to create evaluation systems that push the boundaries of what AI can do.
This is a fully remote contract role. If you thrive in fast-paced engineering environments and want your work to directly shape the trajectory of AI, this is the role.
- Organization: Alignerr
- Type: Hourly Contract
- Location: Remote
- Contract Length: 3 Months
- Commitment: Full-time availability preferred
Who You Are
- 4+ years of professional software engineering experience — this is non-negotiable
- Experience working at a high-growth tech company or top-tier software organization
- Expert proficiency in Python — you write clean, performant, well-tested Python code
- Hands-on experience with code repositories and working in large, complex codebases
- Proven experience designing and implementing LLM coding benchmarks and data pipelines
- Track record of working in high-performance engineering environments with large-scale products or platforms
- Strong command of version control systems (Git) and modern development workflows
- Native or bilingual-level proficiency in English, with strong written communication skills
- Self-directed, technically rigorous, and comfortable operating with autonomy
What Makes a Perfect Match
Candidates with these additional qualifications have the highest chance of success:
- Senior or Lead-level engineering profiles with a history of technical ownership
- Bachelor's or Master's degree in Computer Science, Machine Learning, or a related field — or equivalent professional experience
- Proficiency in one or more additional languages: JavaScript, Go, C++, or other relevant languages
- Experience with CI/CD pipelines and writing robust unit tests (pytest, Mocha, JUnit)
- Background in security engineering or significant open-source contributions
- Familiarity with AI/ML evaluation methodologies or model benchmarking
Why Join Us
- Work on cutting-edge AI evaluation projects alongside world-class research teams
- Fully remote — work from anywhere with a reliable internet connection
- Your benchmarks directly influence how the most advanced AI systems in the world are measured and improved
- Freelance autonomy with meaningful, high-stakes engineering work
- Collaborate with a global community of elite engineers and researchers
- Potential for contract extension and ongoing engagement as new evaluation challenges emerge
Requirements
- Fluent proficiency in English (Written & Verbal)
- Reliable high-speed internet connection
- Bachelor's degree or equivalent professional experience
- Demonstrated expertise in Software Engineering
Eligible Languages
Fluent proficiency in English
Frequently Asked Questions
What is the assessment actually like?
Notoriously strict. Alignerr uses TestGorilla for role-specific timed tests — a blank coding environment for engineers, rigorous grammar and fact-checking for writers. There is almost no hand-holding. The critical catch: this is essentially a one-shot process. Fail or abandon the assessment, and you are typically locked out of that role permanently with no option to retake.
How quickly can I start earning after I pass?
Not immediately. Even after passing the assessment and completing identity verification (via Persona) and billing setup (via Deel), you may sit in a waiting pool for weeks or months. You only start earning when a project matching your specific skills launches and you are officially assigned. Do not plan around Alignerr income until you are actively on a project.
Is there a community?
Yes — and it is one of Alignerr's genuine strengths. Once assigned to a project, you are added to Slack channels where you can ask questions, get rubric clarifications from admins, and talk to other AI trainers. This is rare in AI training and makes a real difference when guidelines are ambiguous or change mid-project.
What does the work actually look like?
It is practical, hands-on data work. You might be recording short videos, categorizing images, rating text responses, or analyzing data. The tasks are designed to be short and distinct—typically 5-60 minutes per task.
How flexible is the schedule?
Extremely. This is true "log in and work" flexibility. You can usually work for 20 minutes or 4 hours depending on your availability. There are rarely minimum hour requirements, making it ideal for side income.
Is there an interview?
Usually, no. Hiring for these roles is almost entirely based on passing an automated assessment or "qualification" task. If you pass the test, you get access to the work.
What is the barrier to entry?
Alignerr is known for difficult technical assessments. You must pass a timed test in your specific domain (e.g., Python, Physics, or Language) before you are eligible for any paid projects.