Can AI Handle Real Human Tasks?

The first benchmark testing LLMs on everyday tasks like solving Wordle or booking flights

100+ Real Tasks
15+ LLMs
1000+ Test Runs

Real-World Tasks

From solving Wordle to booking flights, we test AI on tasks humans do daily

Verifiable Results

Clear success criteria and automated verification for each task
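
As a rough illustration of what an automated check could look like, here is a minimal sketch for the Wordle task. The function names, the six-guess limit, and the "G/Y/." feedback encoding are assumptions made for this example; they are not taken from the benchmark's actual harness.

```python
from collections import Counter


def wordle_feedback(guess: str, answer: str) -> str:
    """Return per-letter feedback: G = right spot, Y = wrong spot, . = absent."""
    feedback = ["."] * len(answer)
    remaining = Counter()
    # First pass: mark exact matches and count the unmatched answer letters.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            feedback[i] = "G"
        else:
            remaining[a] += 1
    # Second pass: mark misplaced letters, consuming the remaining counts
    # so that duplicate letters are not over-credited.
    for i, g in enumerate(guess):
        if feedback[i] == "." and remaining[g] > 0:
            feedback[i] = "Y"
            remaining[g] -= 1
    return "".join(feedback)


def verify_wordle_run(guesses, answer, max_guesses=6):
    """Assumed success criterion: the answer is reached within max_guesses
    well-formed guesses. Any malformed guess fails the run."""
    for guess in guesses[:max_guesses]:
        if len(guess) != len(answer) or not guess.isalpha():
            return False
        if guess.lower() == answer.lower():
            return True
    return False
```

In a setup like this, the feedback function would drive the game loop shown to the model, while the verifier alone decides pass or fail, so the result does not depend on anyone reading the transcript.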

Community Driven

Submit your own tasks and help expand the benchmark

Latest Insights

Best Overall: Claude 3.5 Sonnet (85% success rate)
Most Cost-Effective: DeepSeek v3 ($0.08 per task)
Hardest Task: Flight Booking (65% success rate)