Talk by Tiago Pimentel

Join this talk via Zoom


Title

Theory vs. Practice: How much does Tokenisation Impact Language Models?

Abstract

Tokenisers are the foundation on which most modern language models are built. They transform human-readable raw text – represented as sequences of characters – into the sequences of tokens that our models process. Despite their importance, much remains unknown about tokenisation, from both a theoretical and an empirical perspective. In this talk, I will present some recent results about tokenisation, highlighting its importance for both language modelling and psycholinguistics research. I will first show that the problem of finding optimal tokenisers is NP-complete, justifying the widespread use of heuristic algorithms for their selection. I will then show how distributions over characters or over words can be properly recovered from language models, which instead provide distributions over tokens. In doing so, it will become clear that, in theory, if language models were perfectly optimised to match a data-generating distribution, tokenisation choices would not change these distributions. Finally, I will show that, in practice, tokenisation choices do change these distributions – in fact, tokenisers have a substantial impact on language models’ outputs.
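The recovery step mentioned in the abstract – turning a distribution over tokens into one over character strings – amounts to summing the probabilities of all token sequences that decode to a given string. The sketch below illustrates the idea with a toy unigram token model; the vocabulary, probabilities, and function names are my own illustration, not the talk's actual setup:

```python
# Toy token-level "language model": unigram probabilities over a tiny
# vocabulary, ending every sequence with <eos>. Purely illustrative.
vocab = {"a": 0.3, "b": 0.2, "ab": 0.4, "<eos>": 0.1}

def char_string_prob(s):
    """Recover P(character string s) by marginalising over all tokenisations.

    Dynamic programming over prefixes: dp[i] holds the total probability
    (under the toy unigram model, before the <eos> factor) of all token
    sequences that decode exactly to s[:i].
    """
    dp = [0.0] * (len(s) + 1)
    dp[0] = 1.0
    for i in range(1, len(s) + 1):
        for j in range(i):
            tok = s[j:i]
            if tok in vocab and tok != "<eos>":
                dp[i] += dp[j] * vocab[tok]
    return dp[len(s)] * vocab["<eos>"]

# "ab" can be tokenised as ["ab"] or ["a", "b"]; both tokenisations
# contribute: 0.4*0.1 + 0.3*0.2*0.1 = 0.046.
print(char_string_prob("ab"))
```

Note that no single tokenisation captures the string's probability on its own; only the sum over all of them matches the character-level distribution, which is why naively scoring one canonical tokenisation can misestimate string probabilities.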

Speaker

Tiago is a Postdoc at ETH Zurich, where he works with Thomas Hofmann. He is mainly interested in understanding how humans and language models process text, focusing on how to formalise the methods used to study these topics. Towards this goal, he uses tools from information theory, causality, statistics, and natural language processing.