PhD defence by Ruixiang Cui
Title
Evaluating Language Models: From Multilingual Compositional Semantic Parsing to Human-level Complex Reasoning
Abstract
The pursuit of creating machines that can think and communicate like humans has led to the birth of artificial intelligence. Natural language processing, a subfield concerned with developing models that can process and generate human language, is at the heart of this effort. Recent advances in large language models (LLMs), trained on massive text corpora with neural architectures, have led to remarkable conversational capabilities. However, as these models grow in size and complexity, evaluating their ability to generalize across a wide range of languages, tasks, and linguistic phenomena becomes increasingly essential. While classic language model benchmarks have helped compare and develop models, many of them have already reached saturation, with near-perfect scores achieved. This thesis explores various aspects of language model evaluation, reflecting the evolving landscape of model development and evaluation in the era of LLMs. It investigates the strengths and weaknesses of current language models, particularly their ability to generalize compositionally, a fundamental aspect of human language that allows novel expressions to be created by combining known building blocks.
The main contributions of this thesis are as follows: (1) We develop a method to migrate a question-answering dataset from one knowledge base to another and extend it to diverse languages and domains. The resulting benchmark, Multilingual Compositional Wikidata Questions (MCWQ), reveals that language models lack the ability to generalize compositionally across languages. (2) We identify generalized quantifiers, i.e., words like "all", "some", and "most", as a significant challenge for natural language understanding and develop a benchmark to test this specific reasoning ability. (3) We investigate how language models reason with the word "respectively" in various learning settings and demonstrate the challenge they face in generalizing to the long tail of linguistic constructions. (4) We introduce AGIEval, a bilingual benchmark comprising high-standard official exams, which uncovers the limitations of LLMs in human-level reasoning tasks. By pushing the boundaries of how we evaluate LLMs, this thesis provides valuable insights into their strengths and weaknesses. Ultimately, we argue that true language understanding requires more than good performance on existing tests: it requires the ability to generalize and adapt to new challenges, just as humans do.
Supervisors
Principal Supervisor Anders Østerskov Søgaard
Co-Supervisor Daniel Hershcovich
Assessment Committee
Associate Professor Desmond Elliott, Department of Computer Science
Tenured Assistant Professor Ellie Pavlick, Brown University
Research Scientist Adina Williams, FAIR (Meta)
IT responsible: Daniel Hershcovich
For an electronic copy of the thesis, please visit the PhD Programme page.