MSc Defence by Magnus Stavngaard and August Sørensen
Authorship Verification - Deep Learning Based Methods for Authorship Verification
In this thesis we investigate authorship verification of texts produced by secondary school students. Given a set of texts written by one author, authorship verification (or ghostwriter detection) is the process of determining whether a text of unknown authorship was written by that author. We work with the Danish company MaCom, which provides a dataset of assignments written by secondary school students. We focus on deep neural networks for performing authorship verification. We first implement two baseline methods representing classic machine learning solutions to the authorship verification problem. We then present three networks that solve the same problem:
- A convolutional neural network working on the character level of the texts,
- A recurrent neural network working on the sentence level of the texts, and
- A convolutional neural network working on both the character and the word level of the texts.
Classic machine learning methods for authorship verification rely on manually chosen feature configurations, whereas the networks we implement extract features directly from raw text data. The networks outperform both baseline methods on accuracy and accusation error. On a dataset in which 50% of the assignments are ghostwritten, we achieve an accuracy of 86.5%.
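To make the verification task concrete, the following minimal Python sketch shows what a classic approach with one manually chosen feature could look like. It is not one of the baselines from the thesis. The feature (character trigram counts), the cosine similarity score, the threshold of 0.5, and the names `char_ngrams`, `cosine` and `verify` are all assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text is too short to yield any n-gram
    return dot / (norm_a * norm_b)


def verify(known_texts, unknown_text, threshold=0.5, n=3):
    """Return (same_author, score) for one verification problem.

    known_texts are texts by the claimed author; unknown_text is the
    text under investigation. The threshold is an arbitrary placeholder
    (an assumption) and would have to be tuned on labelled data.
    """
    profile = Counter()
    for text in known_texts:
        profile.update(char_ngrams(text, n))
    score = cosine(profile, char_ngrams(unknown_text, n))
    return score >= threshold, score
```

Calling `verify(known_texts, unknown_text)` returns a decision together with the similarity score behind it. A learned model, such as the networks described here, replaces the hand-picked trigram feature with features extracted from the raw text.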
Our methods are intended as a supporting tool for secondary school teachers in detecting ghostwritten assignments. They can give teachers feedback on why the networks reach a decision, and they can point to specific parts of an assignment that might be ghostwritten.
Jacob Nordfalk (DTU)