Can AI Handle Real Human Tasks?

The first benchmark testing LLMs on everyday tasks like solving Wordle or booking flights

100+ Real Tasks

15+ LLMs

1000+ Test Runs

From solving Wordle to booking flights, we test AI on tasks humans do daily

Clear success criteria and automated verification for each task

Submit your own tasks and help expand the benchmark
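As a rough illustration of what "clear success criteria and automated verification" could look like for a task such as Wordle, here is a minimal sketch. Everything in it is hypothetical: the `Task` class, `make_wordle_verifier`, `success_rate`, and the six-guess limit are assumptions made for the example, not the benchmark's actual task format or submission API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    """Hypothetical task record: a name plus an automated pass/fail check."""
    name: str
    verify: Callable[[List[str]], bool]


def make_wordle_verifier(secret: str, max_guesses: int = 6) -> Callable[[List[str]], bool]:
    """Build a verifier that passes only if the secret word appears
    among the first `max_guesses` guesses (assumed success criterion)."""
    target = secret.strip().lower()

    def verify(guesses: List[str]) -> bool:
        attempts = [g.strip().lower() for g in guesses[:max_guesses]]
        return target in attempts

    return verify


def success_rate(results: List[bool]) -> float:
    """Fraction of runs that passed verification; 0.0 when there are no runs."""
    return sum(results) / len(results) if results else 0.0


# Example usage with made-up model outputs.
wordle = Task(name="wordle", verify=make_wordle_verifier("crane"))
runs = [
    ["slate", "crane"],           # solved on the second guess
    ["audio", "stern", "plumb"],  # never found the word
]
results = [wordle.verify(guesses) for guesses in runs]
print(wordle.name, success_rate(results))
```

Keeping the verifier a plain function of the model's output means the same check can be rerun automatically across many models and test runs, which is what makes a per-task success rate comparable.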

Latest Insights

Best Overall: Claude 3.5 Sonnet, 85% success rate

Most Cost-Effective: DeepSeek v3, $0.08 per task

Hardest Task: Flight Booking, 65% success rate