Compass Academic Leaderboard (Full Version)

The CompassAcademic leaderboard currently focuses on evaluating the comprehensive reasoning abilities of LLMs.

  • The datasets selected so far include General Knowledge Reasoning (MMLU-Pro/GPQA-Diamond), Logical Reasoning (BBH), Mathematical Reasoning (MATH-500, AIME), Code Completion (LiveCodeBench, HumanEval), and Instruction Following (IFEval).
  • The evaluation currently targets chat models, with updates featuring the latest community models at irregular intervals.
  • Prompts and reproduction scripts can be found in OpenCompass: A Toolkit for Evaluation of LLMs.
| Index | Model Name | Release Time | Parameters | OpenSource | IFEval | BBH | GPQA_diamond | Math-500 | AIME2024 | MMLU-Pro | LiveCodeBench | HumanEval | DROP | HellaSwag | MUSR | KorBench | CMMLU | MMLU | BigCodeBench |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | Llama4-Maverick-17B-128E-Instruct | 2024/10/22 | 456B | OpenSource | 83.55 | 80.92 | 58.08 | 76.80 | 10.00 | 75.09 | 36.25 | 89.02 | 89.37 | 88.46 | 68.55 | 55.12 | 83.86 | 88.92 | 22.97 |
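The leaderboard does not state how (or whether) the per-benchmark scores are aggregated into a single ranking score. As a minimal sketch, assuming a plain unweighted mean over all benchmarks, the row above could be summarized like this (the `mean_score` helper is hypothetical, not part of OpenCompass):

```python
# Per-benchmark scores for the Llama4-Maverick-17B-128E-Instruct row above.
scores = {
    "IFEval": 83.55, "BBH": 80.92, "GPQA_diamond": 58.08, "Math-500": 76.80,
    "AIME2024": 10.00, "MMLU-Pro": 75.09, "LiveCodeBench": 36.25,
    "HumanEval": 89.02, "DROP": 89.37, "HellaSwag": 88.46, "MUSR": 68.55,
    "KorBench": 55.12, "CMMLU": 83.86, "MMLU": 88.92, "BigCodeBench": 22.97,
}

def mean_score(benchmarks: dict) -> float:
    """Unweighted mean over all benchmark scores, rounded to 2 decimals.

    Illustrative only: the actual leaderboard may weight benchmarks
    differently or not aggregate them at all.
    """
    return round(sum(benchmarks.values()) / len(benchmarks), 2)

print(mean_score(scores))  # average across the 15 benchmarks
```

Note the pull of a single outlier: the low AIME2024 score (10.00) drags an unweighted mean well below most of the individual results, which is one reason leaderboards often report per-benchmark scores rather than a single average.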