AI Benchmarks
Compare the world's leading LLMs across key performance indicators, including reasoning, coding, math, and language understanding.
Overall Performance Rankings
1. GPT-4 Turbo (OpenAI): 94.2% average score
2. Claude 3 Opus (Anthropic): 92.7% average score
3. Gemini 1.5 Pro (Google): 90.1% average score
4. Claude 3.5 Sonnet (Anthropic): 88.5% average score
5. Llama 3.1 405B (Meta): 85.3% average score
Performance by Category
- Coding (HumanEval): code generation accuracy
- Math (MATH): mathematical reasoning
- Language (MMLU): general knowledge & reasoning
- Speed: tokens per second
Context Window Sizes
- GPT-4 Turbo: 128K tokens
- Claude 3 Opus: 200K tokens
- Gemini 1.5 Pro: 2M tokens
- Claude 3.5 Sonnet: 200K tokens
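These limits can be kept in a small lookup table for pre-flight checks before sending a request. The sketch below is illustrative only: the model keys and the fits_in_context helper are hypothetical names, and real prompt token counts should come from each provider's tokenizer.

```python
# Context window limits from the list above, in tokens.
# The model keys and this helper are illustrative, not a provider API.
CONTEXT_LIMITS = {
    "gpt-4-turbo": 128_000,
    "claude-3-opus": 200_000,
    "gemini-1.5-pro": 2_000_000,
    "claude-3.5-sonnet": 200_000,
}

def fits_in_context(model: str, prompt_tokens: int, reserved_output: int = 1024) -> bool:
    """Return True if the prompt plus reserved output tokens fit in the model's window."""
    return prompt_tokens + reserved_output <= CONTEXT_LIMITS[model]

# A 150K-token prompt overflows GPT-4 Turbo's 128K window but fits Claude 3 Opus.
print(fits_in_context("gpt-4-turbo", 150_000))    # False
print(fits_in_context("claude-3-opus", 150_000))  # True
```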
Benchmark Methodology
Our benchmarks aggregate scores from industry-standard evaluation datasets, including HumanEval (code generation), MATH (mathematical reasoning), MMLU (multitask language understanding), and internal speed tests. All models were tested with default parameters, except temperature, which was fixed at 0.0 for consistency.
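As a rough illustration of how an aggregate figure like the Average Score above could be produced, the sketch below takes an unweighted mean over per-benchmark scores. The benchmark names match the datasets listed here, but the example scores and the equal weighting are placeholder assumptions, not the actual values or weights used.

```python
# Minimal sketch: combine per-benchmark scores into a single "Average Score".
# The scores below are placeholders, and equal weighting is an assumption.
def average_score(scores: dict[str, float]) -> float:
    """Unweighted mean of per-benchmark percentage scores."""
    return sum(scores.values()) / len(scores)

example_scores = {
    "HumanEval": 90.0,  # coding, placeholder value
    "MATH": 85.0,       # math, placeholder value
    "MMLU": 88.0,       # language, placeholder value
}

print(f"Average Score: {average_score(example_scores):.1f}%")  # 87.7%
```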
Last Updated: December 2024
Test Environment: Standard API calls, temperature 0.0
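For concreteness, a single request in that test environment might look like the sketch below. It uses the openai Python client as one illustrative provider; the prompt and model choice are stand-ins, not the actual harness used for these results.

```python
# Illustrative only: one way to issue a standard API call at temperature 0.0.
# Uses the openai Python client as an example provider; the prompt is arbitrary.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    temperature=0.0,  # greedy-like decoding for repeatable benchmark runs
)

print(response.choices[0].message.content)
```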