MSc Defences

Computer Science

Title

Banking the unbanked: Future-proofing the least developed countries as they go from cash to online-payment 

Abstract

Banking is a necessity for everyone: it is a key factor in reducing poverty and a focal point for many organizations around the world. Unfortunately, 1.7 billion people remain unbanked. We take an example-driven approach to explore why this is the case, where this is the case, and how we can bank the people of these countries. We introduce a banking model based on M-Pesa that circumvents some of the complications of the M-Pesa model.
In these regions, cash is king. As the digital divide lessens, we implement two systems based around this model: one for the current generation, based on the technology already available, and one for future generations, based on technology that will become available. We find that converting from a static agent model to a dynamic one can bring multiple benefits: the distance to banks is reduced, fees might be reduced, new job opportunities are created, and lack of identification might no longer be a limiting factor.

Time and place

17 June at 14:00

Online

Supervisor(s)  

Fritz Henglein, Søren Terp Hørlück Jessen

External examiner(s)  

Mads Rosendahl

 

 

Title

Using Graph Neural Networks To Learn Node Embeddings For Spatial Transcriptomics Neighborhood Graphs

Abstract

Recently, spatial transcriptomics methods have emerged and become more accessible. However, the number of computational methods that make use of the spatial information is limited. Existing machine learning methods either do not incorporate spatial aspects or work on regular structures. My aim with this thesis is to present a machine learning approach that makes use of the true strength of the spatial transcriptomics technology: spatiality. By turning spatial data into neighborhood graphs, we abstract the spatial information and make it possible to work with Graph Neural Networks. With these, we learn how to aggregate spot information with neighboring spot information and use these aggregations for machine learning predictions. To facilitate this process, I provide a user-friendly pipeline that assists with the graph construction, model creation and tuning, and the extraction of the node embeddings, that is, the aggregated spot information. I compare the results with a benchmark model that does not factor in spatial information, to set the method against neighborhood-agnostic approaches.
I found that our approach outperforms other machine learning methods that don’t factor in spatial information by 7% in prediction accuracy in a supervised machine learning task classifying multiple annotated brain regions within a mouse brain atlas, reaching an overall score of 79.01%. Furthermore, I present how the node embeddings serve downstream data analysis tasks like clustering and anomaly detection. Applying my method to another use case, detecting Alzheimer’s-diseased brain tissue spots, shows that our approach works across different datasets and use cases.

Time and place

21 June at 15:00

Online

Supervisor(s)  

Anders Krogh, Tune Pers, Petar Todorov

External examiner(s)  

Jes Frellsen

Bioinformatics

 

 

 

Title

Mining the literature to detect connections between lifestyle and diseases

Abstract

Background and Methodology: Text mining is a flexible technique that can be applied to various tasks in the biomedical field. The association between diseases and genes is well established in the literature, and as such it has been extensively mined and stored in dedicated databases. However, another factor related to the onset and development of diseases, lifestyle, is still hidden in the vast sea of texts, and there is no dedicated database with this information integrated.
In this thesis, I fine-tuned the BioBERT natural language processing model to identify lifestyle factors, thereby extending a prototype lifestyle factors ontology. After completing the expansion, I used the JensenLab dictionary-based tagger to extract Disease-Lifestyle associations from PubMed. Tagger, an efficient dictionary-based text mining software, is used both to identify lifestyle factors and diseases in text and to find the associations between them by considering their co-occurrences within and between sentences.
Results: After fine-tuning the pre-trained BioBERT model, the model’s prediction accuracy for the named entity recognition task was 94.61%. This model was used to predict whether Wikipedia titles with over 1000 matches in PubMed are also lifestyle factors. After assigning proper thresholds for inclusion and extensive manual annotation, 447 new terms from Wikipedia titles were added to the prototype ontology of lifestyle factors. Finally, running Tagger yielded 501,952 pairs of Disease-Lifestyle associations, of which 50,997 were of high or very high confidence.
Conclusion: This project enriched the lifestyle factors ontology and detected associations between diseases and lifestyle factors. The manual inspection of results suggests, to a certain extent, that when the confidence level is high, the Disease-Lifestyle associations found through text mining are credible, but further testing is needed to avoid false positives.

Time and place

22 June at 09:00

Panum, Room 6.2.09

Supervisor(s)  

Lars Juhl Jensen, Aikaterini Despoina Nastou, Anders Krogh

External examiner(s)  

Jes Frellsen

 

 

Title

Using machine learning as a weapon to fight scientific fraud by detecting paper-mill publications 

Abstract

With the rapid development of society and economy, the increasingly serious problem of scientific fraud has attracted public attention. The shadowy companies that fabricate papers in bulk, the so-called paper mills, are gradually being noticed. In this thesis, different machine learning-based methods were implemented to detect paper mill publications. Some known paper mills were collected, and the biggest one, called the Tadpole paper mill, is the one mainly used. Through the application of named entity recognition from text mining, all papers mentioning non-coding RNA in the Tadpole paper mill were used as the input data to train supervised machine learning methods, namely support vector machine, logistic regression, multinomial naive Bayes, stochastic gradient descent, passive aggressive classifier, random forest and XGBoost. Text was vectorized using the TF-IDF approach, and after hyperparameter optimization, the trained classifiers were applied to other paper mills and papers from 2021 for prediction.
Almost all classifiers achieved good performance, with F1-scores of approximately 90%, proving that they can learn from the specific fraud style rather than the theme. From the prediction results, the classifier shows the ability to identify only fake papers belonging to the paper mill it was trained on, and it does not have journal bias even if the paper mill publications concentrate on some specific journals. In addition, the paper mills seem to have fraud templates or patterns. Given their preference for combining non-coding RNA and disease as main contents, relationship extraction was used to obtain papers mentioning such pairs for association analysis. After scoring for confidence, the results show that fake papers mainly focus on under-studied pairs. Such fake studies linking ncRNA to disease represent a significant threat to science, because they will pollute under-investigated fields and thereby mislead further research.
In conclusion, paper mills may already have seriously damaged the research ecosystem and will definitely continue to do so, while it is probable that machine learning classifiers working together with image duplication detection could better detect fraud and protect scientific integrity.

Time and place

22 June at 10:30

Panum, Room 6.2.09

Supervisor(s)  

Anders Krogh, Lars Juhl Jensen, Aikaterini Despoina Nastou

External examiner(s)  

Jes Frellsen

Physics

 

 

Statistics

 

 

Health and Informatics