Can AI Handle Real Human Tasks?
The first benchmark that tests LLMs on everyday tasks such as solving Wordle or booking flights
100+
Real Tasks
15+
LLMs
1000+
Test Runs
Real-World Tasks
From solving Wordle to booking flights, we test AI on tasks humans do daily
Verifiable Results
Clear success criteria and automated verification for each task
Community Driven
Submit your own tasks and help expand the benchmark
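As an illustration of what "clear success criteria and automated verification" could look like for a task such as Wordle, here is a minimal sketch. It is not the benchmark's actual verifier: the function names, the G/Y/- feedback encoding, and the six-guess limit are all assumptions made for this example.

```python
from collections import Counter


def wordle_feedback(guess: str, answer: str) -> str:
    """Score a guess: G = right letter, right spot; Y = right letter, wrong spot; - = absent.

    Hypothetical helper, assumed for illustration only.
    """
    feedback = ["-"] * len(guess)
    # Answer letters that were not matched exactly remain available for Y marks.
    remaining = Counter(a for g, a in zip(guess, answer) if g != a)
    # First pass: mark exact matches.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            feedback[i] = "G"
    # Second pass: mark misplaced letters, consuming each answer letter at most once.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g != a and remaining[g] > 0:
            feedback[i] = "Y"
            remaining[g] -= 1
    return "".join(feedback)


def verify_wordle_run(guesses: list[str], answer: str, max_guesses: int = 6) -> bool:
    """Assumed success criterion: the answer appears within the first max_guesses guesses."""
    return answer in guesses[:max_guesses]
```

A harness along these lines would feed `wordle_feedback` back to the model after each guess and call `verify_wordle_run` once at the end, so that a run passes or fails without human judgment.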
Latest Insights
Best Overall
Claude 3.5 Sonnet
85% Success Rate
Most Cost-Effective
DeepSeek v3
$0.08 per task
Hardest Task
Flight Booking
65% Success Rate