Day 99

Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation

Research questions. The paper asks whether current quantitative AI benchmarks are trustworthy for evaluating model performance, capability, safety, and risk. It focuses on both technical flaws and broader sociotechnical problems in benchmark design and use.

Methodology. This is an interdisciplinary meta-review of about 110 studies published over the last 10 years. The authors synthesize problems across dataset creation, documentation, contamination, construct validity, benchmark gaming, and evaluation incentives.

Findings. The review finds that benchmarks often suffer from bias, poor documentation, contamination, weak signal-to-noise separation, construct validity problems, and incentives that reward leaderboard gains over real-world relevance. It also argues that one-time, text-focused evaluations fail to capture increasingly multimodal and interactive AI systems.
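Illustration. To make the contamination problem concrete, here is a minimal sketch of one common heuristic: measuring how many word n-grams of a benchmark item also appear in a training document. This is not a method taken from the review; the function names `ngrams` and `overlap_ratio`, the whitespace tokenization, and the default `n=3` are all assumptions chosen for illustration.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text (whitespace tokenization, an assumption)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item, corpus_doc, n=3):
    """Fraction of the benchmark item's n-grams that also occur in corpus_doc.

    Hypothetical helper: a high ratio hints that the item may have leaked
    into training data, but it is a heuristic, not proof of contamination.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        # Item is shorter than n words, so there is nothing to compare.
        return 0.0
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)
```

A ratio near 1.0 means the item appears almost verbatim in the document, while a ratio near 0.0 means little surface overlap. Paraphrased or translated leaks would slip past a check like this, which is one reason contamination is hard to rule out.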

Why it matters. Benchmarks increasingly shape AI development and regulation, yet they can create misplaced confidence in a model's safety or capability. For sycophancy and alignment research, the review is a reminder that evaluation design can distort what researchers believe models are actually doing.
