Efficient Benchmarking via Bias-Bounded Subset Selection

Bibliographic Details
Published in: IEEE transactions on pattern analysis and machine intelligence. - 1979. - PP(2025), dated 12 Aug.
Main author: Zhuang, Yan (Author)
Other authors: Yu, Junhao; Liu, Qi; Sun, Yuxuan; Li, Jiatong; Huang, Zhenya; Chen, Enhong
Format: Online article
Language: English
Published: 2025
Access to parent work: IEEE transactions on pattern analysis and machine intelligence
Subjects: Journal Article
Description
Summary: Evaluating AI systems, particularly large models, is an essential yet computationally expensive task. The use of extensive benchmarks often leads to substantial computational and human costs that may even exceed those of pretraining. Efficient AI model evaluation focuses on estimating the model's score on the full benchmark based on its responses to a smaller subset. Various empirical selection methods have been proposed to identify valuable subsets within these benchmarks. In this paper, we formally define and approximate the subset selection problem inherent in efficient evaluation. We prove that this problem amounts to optimizing a submodular function and that a unified subset can be identified using a simple greedy algorithm. Importantly, this approach is the first to provide theoretical guarantees of bias control and generalizability in score estimation. Using language models as a case study, experimental results across 11 different benchmarks validate its superiority in estimating model scores and maintaining ranking consistency. It can achieve accurate score estimation using no more than 30% of the full benchmark, thus facilitating efficient and sparse benchmark design.
Description: Date Revised 12.08.2025
published: Print-Electronic
Citation Status Publisher
ISSN: 1939-3539
DOI: 10.1109/TPAMI.2025.3598031
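
Illustrative note: the summary states that subset selection can be cast as optimizing a submodular function and solved with a simple greedy algorithm. The record does not give the paper's actual objective, so the sketch below substitutes a standard facility-location objective as a stand-in; the function names, the toy similarity matrix, and the budget are all assumptions made for illustration, not the authors' method. For monotone submodular objectives under a cardinality budget, greedy selection of this kind is known to reach at least a (1 - 1/e) fraction of the optimal value.

```python
import numpy as np


def coverage(similarity, selected):
    """Facility-location value: each item is credited with its best match in `selected`."""
    if not selected:
        return 0.0
    return float(similarity[:, selected].max(axis=1).sum())


def greedy_facility_location(similarity, budget):
    """Greedily pick `budget` items that maximize the coverage objective.

    `similarity` is an (n, n) array of non-negative item-to-item similarities
    (a hypothetical input; how it would be built from model responses is not
    specified in the record).
    """
    n = similarity.shape[0]
    selected = []
    best_cover = np.zeros(n)  # best similarity of each item to the current subset
    for _ in range(budget):
        # Marginal gain of adding candidate j, computed for all j at once.
        gains = np.maximum(similarity, best_cover[:, None]).sum(axis=0) - best_cover.sum()
        if selected:
            gains[selected] = -np.inf  # never pick the same item twice
        j = int(np.argmax(gains))
        selected.append(j)
        best_cover = np.maximum(best_cover, similarity[:, j])
    return selected


# Toy benchmark of six items in three groups; items in the same group are
# treated as fully similar (1.0) and items in different groups as dissimilar (0.0).
labels = np.array([0, 0, 1, 1, 1, 2])
sim = (labels[:, None] == labels[None, :]).astype(float)
picked = greedy_facility_location(sim, budget=3)
```

In this toy case the greedy pass picks one representative per group, starting with the largest group, so three of the six items cover the whole set. A score estimate for a new model would then be formed from its results on the picked items only; how those results are weighted is a further design choice that the record does not describe.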