Transfer Learning for Computational Content Analysis

Publication: Book/anthology/thesis/report › Ph.D. thesis › Research

Mareike Hartmann

Content analysis is a research technique concerned with discovering trends, patterns and differences in artifacts of human communication. It requires reading and coding data according to annotation guidelines, which is a labor-intensive process. In the age of mass communication, huge amounts of content are produced every day. Analyzing this content with respect to the social phenomena it captures is of interest to researchers in many fields. However, manual coding is impractical for such large amounts of data, and automating the coding step could speed up the process significantly.

Supervised machine learning is a promising approach in this direction: models learn from human annotations and generalize to unseen data, making the coding of large amounts of content more feasible. However, labeled datasets are expensive to create. On the one hand, this leads to small training datasets. On the other hand, it makes models that generalize across datasets from different domains and languages especially valuable. Transfer learning is a machine learning method that enables such knowledge transfer between data from different distributions, leveraging as much data as possible while keeping additional annotation effort low.

This thesis investigates the use of transfer learning for automated content coding. In the first part of the work, we apply transfer learning directly to content coding tasks. We investigate how these methods can improve performance on the tasks and show that transfer learning can overcome the problem of scarce training data by leveraging additional resources. The second part of the work focuses on methods that enable knowledge transfer between languages. Such methods rely on word representations that capture meaning across languages. Unsupervised methods for learning such representations are attractive but unstable, and we investigate the causes of these instabilities.

Original language	English

Publisher	Department of Computer Science, Faculty of Science, University of Copenhagen
Status	Published - 2019

Datalogisk Institut