Top 10 LLMs by current Elo / Bradley–Terry scores from LMArena human-preference battles.
| # | Model | Elo Score |
|---|---|---|
| 1 | Anthropic | 1502 |
| 2 | | 1500 |
| 3 | Anthropic | 1498 |
| 4 | Anthropic | 1492 |
| 5 | Meta | 1489 |
| 6 | | 1488 |
| 7 | | 1486 |
| 8 | | 1481 |
| 9 | | 1480 |
| 10 | | 1480 |
LMArena (formerly LMSYS) Chatbot Arena is the de-facto gold standard for human-preference LLM evaluation. Real users vote on blind side-by-side answers; the platform fits a Bradley–Terry model to those votes and reports the result on an Elo-style scale, which produces the rankings you see in the snapshot above.
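To make the voting-to-ranking step concrete, here is a minimal sketch of fitting Bradley–Terry strengths from pairwise battle outcomes and mapping them to an Elo-like scale. It is not LMArena's actual pipeline: it uses a simple iterative (minorization-maximization) update as a stand-in, the battle data and model names are invented, and the 400-point scale and 1000-point anchor are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def fit_bradley_terry(battles, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Illustrative MM update, not LMArena's production code. Assumes every
    model has at least one win and no model battles itself.
    """
    models = sorted({m for pair in battles for m in pair})
    wins = defaultdict(int)         # total wins per model
    pair_counts = defaultdict(int)  # battles per unordered pair of models
    for winner, loser in battles:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for pair, n in pair_counts.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strength[m] + strength[other])
            updated[m] = wins[m] / denom
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}  # normalize to sum to 1
    return strength


def to_elo_scale(strength, scale=400.0, anchor=1000.0):
    """Convert strengths to Elo-like ratings centered on `anchor` (assumed values)."""
    logs = {m: math.log10(s) for m, s in strength.items()}
    mean_log = sum(logs.values()) / len(logs)
    return {m: anchor + scale * (v - mean_log) for m, v in logs.items()}


# Hypothetical votes: each tuple is (winner, loser) from one blind battle.
battles = (
    [("model_a", "model_b")] * 6 + [("model_b", "model_a")] * 4
    + [("model_b", "model_c")] * 6 + [("model_c", "model_b")] * 4
    + [("model_a", "model_c")] * 7 + [("model_c", "model_a")] * 3
)
ratings = to_elo_scale(fit_bradley_terry(battles))
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.0f}")
```

With only 30 invented votes the resulting numbers mean little; the point is that the ranking depends on who beat whom, not just on raw win counts.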
The Text leaderboard captures general-chat quality. Companion leaderboards (WebDev, Vision, Coding) track domain-specific strength; if you need a model for a specific job, check the relevant sub-arena on the official site rather than defaulting to the overall top.
A few points worth knowing: Elo gaps under ~10 points are not always meaningful, "thinking" variants generally score higher but cost more latency and tokens, and newly-added models can swing rapidly before vote counts stabilize.
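As a rough illustration of why a gap of about 10 points carries little weight, the sketch below converts a rating difference into an expected head-to-head win rate using the standard Elo formula. The base-10, 400-point scale is an assumption about how the leaderboard scores are scaled.

```python
def expected_win_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Standard Elo expected score for A against B (base 10, assumed scale of 400)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ranks 1 and 4 in the snapshot table: 1502 vs 1492, a 10-point gap.
p = expected_win_probability(1502, 1492)
print(f"{p:.3f}")  # prints 0.514
```

Under that assumption, the higher-rated model would be preferred in only about 51.4% of head-to-head votes, close enough to a coin flip that the ordering can easily change as more votes arrive.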
Best when you want the source leaderboard directly and need the most current rankings without any intermediary summary.
Useful if you want a more product-facing view of model availability, pricing, and ecosystem adoption alongside rankings.
Helpful when you want benchmark-heavy comparisons rather than crowd preference and chat-style pairwise voting.
Check the "Coding" or "Hard Prompts" category leaderboards specifically if you are looking for a model to handle complex logic or software development.
Participate in "Side-by-side" battles to contribute to the Elo rankings while testing your specific edge-case prompts against two anonymous models.
Monitor the "Style Control" and "Long Context" updates to see which models excel at following strict formatting or handling massive documents.