Automatic Boolean Query Generation for Biomedical Information Retrieval

Master's Defense by Brian Søborg Mathiasen

Title: "Automatic Boolean Query Generation for Biomedical Information Retrieval"

Time: Tuesday, May 15 2012, 14.00-15.15

Place: Njalsgade 126-128, Building 24, 5th floor, room 62

Supervisors: Ole Nørgaard Frandsen (Department of Public Health), Jakob Grue Simonsen (Department of Computer Science), University of Copenhagen.

Censor: Troels Andreasen, University of Roskilde.


I investigate the topic of automatically generating Boolean queries for biomedical information retrieval. A prototype composed of modules for document classification, terminology extraction, and methods for Boolean query generation has been implemented. Document classification experiments show that we can decrease CPU time by utilizing feature vectors composed of terms extracted using Natural Language Processing (NLP) versus using trigrams of characters. The decrease in CPU time by using extracted terms as feature vectors is statistically significant (p < 0.05), with no significant loss of precision or recall.
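To illustrate why term-based feature vectors can be cheaper to process than character trigrams, the following minimal sketch builds both representations for one sentence. It is not the prototype's code: the function names, the whitespace tokenizer, and the hand-picked vocabulary (a stand-in for NLP-based term extraction) are all assumptions made for illustration.

```python
from collections import Counter


def char_trigrams(text):
    """Count overlapping character trigrams in the lowercased text."""
    s = text.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def term_features(text, vocabulary):
    """Count tokens that appear in a given term vocabulary.

    The vocabulary is a hypothetical stand-in for terms that an
    NLP-based extraction step would produce.
    """
    tokens = text.lower().split()
    return Counter(t for t in tokens if t in vocabulary)


doc = "Aspirin reduces the risk of stroke"
vocab = {"aspirin", "stroke"}  # assumed output of term extraction

tri = char_trigrams(doc)
terms = term_features(doc, vocab)

# The trigram vector has far more dimensions than the term vector.
print(len(tri), len(terms))
```

Even for a single short sentence the trigram representation has many more features than the term representation, which is one plausible reason a classifier would need less CPU time on the latter.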

Term-Frequency Inverse Document Frequency (TFIDF) and Decision Tree models are implemented as methods for automatic Boolean query generation. To achieve this goal, Natural Language Processing is employed for identifying descriptive terms, and ad hoc approaches are used to combine the resulting descriptors into functioning queries. Three experiments are conducted in different environments for assessing various performance measures on the generated Boolean queries, and the results for each query generation method are reported and evaluated. No statistically significant difference between the proposed methods could be established.
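As a rough illustration of the TFIDF-based approach, the sketch below scores the terms of one document against a small corpus and joins the top-ranked terms with a Boolean operator. The abstract describes the combination step only as ad hoc, so the top-k selection, the single-operator join, the function names, and the toy corpus are assumptions rather than the thesis's actual method.

```python
import math
from collections import Counter


def tfidf_scores(doc_tokens, corpus):
    """Return a TF-IDF score for each distinct term in doc_tokens."""
    n = len(corpus)
    tf = Counter(doc_tokens)
    scores = {}
    for term, count in tf.items():
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for d in corpus if term in d)
        idf = math.log(n / df) if df else 0.0
        scores[term] = (count / len(doc_tokens)) * idf
    return scores


def boolean_query(doc_tokens, corpus, k=2, operator="AND"):
    """Join the k highest-scoring terms with one Boolean operator.

    Ties are broken alphabetically so the result is deterministic.
    """
    scores = tfidf_scores(doc_tokens, corpus)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return f" {operator} ".join(ranked[:k])


# Toy corpus of pre-tokenized documents (hypothetical data).
corpus = [
    ["aspirin", "stroke", "risk", "patients"],
    ["statin", "cholesterol", "risk", "patients"],
    ["insulin", "diabetes", "patients"],
]

query = boolean_query(corpus[0], corpus, k=2)
print(query)  # aspirin AND stroke
```

Terms that occur in every document, such as "patients" here, receive a score of zero and fall to the bottom of the ranking, so the generated query keeps only terms that distinguish the target document from the rest of the corpus.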