AI Benchmarks
Compare the world's leading LLMs across key performance indicators, including reasoning, coding, math, and language understanding.
Overall Performance Rankings
1. GPT-4 Turbo (OpenAI): 94.2% average score
2. Claude 3 Opus (Anthropic): 92.7% average score
3. Gemini 1.5 Pro (Google): 90.1% average score
4. Claude 3.5 Sonnet (Anthropic): 88.5% average score
5. Llama 3.1 405B (Meta): 85.3% average score
Performance by Category
- Coding (HumanEval): code generation accuracy
- Math (MATH): mathematical reasoning
- Language (MMLU): general knowledge & reasoning
- Speed: tokens per second
Context Window Sizes
- GPT-4 Turbo: 128K tokens
- Claude 3 Opus: 200K tokens
- Gemini 1.5 Pro: 2M tokens
- Claude 3.5 Sonnet: 200K tokens
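These limits can be kept in a small lookup table for pre-flight checks before sending a request. The sketch below is illustrative only: the model keys and the fits_in_context helper are hypothetical names, and real prompt token counts should come from each provider's tokenizer.

```python
# Context window limits from the list above, in tokens.
# The model keys and this helper are illustrative, not a provider API.
CONTEXT_LIMITS = {
    "gpt-4-turbo": 128_000,
    "claude-3-opus": 200_000,
    "gemini-1.5-pro": 2_000_000,
    "claude-3.5-sonnet": 200_000,
}

def fits_in_context(model: str, prompt_tokens: int, reserved_output: int = 1024) -> bool:
    """Return True if the prompt plus reserved output tokens fit in the model's window."""
    return prompt_tokens + reserved_output <= CONTEXT_LIMITS[model]

# A 150K-token prompt overflows GPT-4 Turbo's 128K window but fits Claude 3 Opus.
print(fits_in_context("gpt-4-turbo", 150_000))    # False
print(fits_in_context("claude-3-opus", 150_000))  # True
```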
Benchmark Methodology
Our benchmarks aggregate scores from industry-standard evaluation datasets, including HumanEval (code generation), MATH (mathematical reasoning), MMLU (multitask language understanding), and internal speed tests. All models were tested with default parameters, except temperature, which was fixed at 0.0 for consistency.
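As a rough illustration of how an aggregate figure like the Average Score above could be produced, the sketch below takes an unweighted mean over per-benchmark scores. The benchmark names match the datasets listed here, but the example scores and the equal weighting are placeholder assumptions, not the actual values or weights used.

```python
# Minimal sketch: combine per-benchmark scores into a single "Average Score".
# The scores below are placeholders, and equal weighting is an assumption.
def average_score(scores: dict[str, float]) -> float:
    """Unweighted mean of per-benchmark percentage scores."""
    return sum(scores.values()) / len(scores)

example_scores = {
    "HumanEval": 90.0,  # coding, placeholder value
    "MATH": 85.0,       # math, placeholder value
    "MMLU": 88.0,       # language, placeholder value
}

print(f"Average Score: {average_score(example_scores):.1f}%")  # 87.7%
```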
Last Updated: December 2024
Test Environment: Standard API calls, temperature 0.0
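For concreteness, a single request in that test environment might look like the sketch below. It uses the openai Python client as one illustrative provider; the prompt and model choice are stand-ins, not the actual harness used for these results.

```python
# Illustrative only: one way to issue a standard API call at temperature 0.0.
# Uses the openai Python client as an example provider; the prompt is arbitrary.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    temperature=0.0,  # greedy-like decoding for repeatable benchmark runs
)

print(response.choices[0].message.content)
```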