Abstract:
Large language models are commonly evaluated against popular benchmarks such as HELM, MMLU, and BIG-bench, all of which rely on a single prompt template per task. I will begin by presenting our recent large-scale statistical analysis of more than 250M samples, showing that minimal prompt paraphrases lead to drastic changes in both the absolute performance and the relative ranking of different LLMs. These results call into question many recent empirical observations about the strengths and weaknesses of LLMs. I will then discuss desiderata for more meaningful evaluation in NLP, leading to our formulation of diverse metrics tailored to different use cases, and conclude with a proposal for a probabilistic benchmarking approach for modern LLMs.