1–5 Sept 2025
ETH Zurich
Europe/Zurich timezone

MLCommons Science Benchmarks

Not scheduled
1h
HIT G floor (gallery)

Speaker

Ben Hawks (Fermi National Accelerator Lab)

Description

Benchmarks are a cornerstone of modern machine learning practice, providing standardized evaluations that enable reproducibility, comparison, and scientific progress. Yet, as AI systems, particularly deep learning models, become increasingly dynamic, traditional static benchmarking approaches are losing their relevance. Models rapidly evolve in architecture, scale, and capability; datasets shift; and deployment contexts continuously change, creating a moving target for evaluation. Without adaptive benchmarking frameworks, both scientific assessment and real-world deployment risk becoming misaligned with actual system behavior.

Drawing on our experience from MLCommons, educational initiatives, and government programs such as the DOE’s Trillion Parameter Consortium, we identify key barriers that hinder the broader adoption and utility of benchmarking in AI. These include substantial resource demands, limited access to specialized hardware, lack of expertise in benchmark design, and uncertainty among practitioners about how to relate benchmark results to their own application domains. Moreover, current benchmarks often emphasize peak performance on leadership-class hardware, offering limited guidance for more diverse, real-world deployment scenarios.

We argue that benchmarking itself must become dynamic in order to incorporate evolving models, updated data, and heterogeneous computational platforms while maintaining transparency, reproducibility, and interpretability. Democratizing this process requires not only technical innovation, but also systematic educational efforts spanning undergraduate to professional levels to develop sustained expertise in benchmark design and use. Finally, benchmarks should be framed and communicated to support application-relevant comparisons, enabling both developers and users to make informed, context-sensitive decisions. Advancing dynamic and inclusive benchmarking practices will be essential to ensure that evaluation keeps pace with the evolving AI landscape and supports responsible, reproducible, and accessible AI deployment.

Authors

Ben Hawks (Fermi National Accelerator Lab)
Gregor von Laszewski (University of Virginia)
Marco Colombo
Nhan Tran (Fermi National Accelerator Lab. (US))
