Peak Performance or Just Noise?
Statistical Comparison of Blind Challenge Entries
by Jon Swain

Benchmarks have become ubiquitous in machine learning, from the MNIST (Modified National Institute of Standards and Technology) dataset for computer vision to the SWE-bench leaderboard for LLM coding tasks. Benchmarks are standardized datasets or tasks used to evaluate models, comparing metrics