Part of Speech n-Grams for Information Retrieval

Research output: Book/Report, Ph.D. thesis, Research

The increasing availability of information on the World Wide Web (Web) and the need to access relevant aspects of this information provide an important impetus for the development of automatic intelligent Information Retrieval (IR) technology. IR systems convert human-authored language into representations that can be processed by computers, with the aim of providing humans with access to knowledge. Specifically, IR applications locate and quantify informative content in data, and make statistical decisions on the topical similarity, or relevance, between different items of data. The wide popularity of IR applications in recent decades has driven intensive research and development into theoretical models of information and relevance, and their implementation into usable applications, such as commercial search engines. The majority of IR systems today typically rely on statistical manipulations of individual lexical frequencies (i.e., single word counts) to estimate the relevance of a document to a user request, on the assumption that such lexical statistics can be sufficiently representative of informative content. Such estimations implicitly assume that words occur independently of each other, and as such ignore the compositional semantics of language. This assumption, however, is not entirely true, and can cause several problems, such as ambiguity in understanding textual information, misinterpreting or falsifying the original informative intent, and limiting the semantic scope of text. These problems can hinder the accurate estimation of relevance between texts, and hence harm the performance of an IR application. This thesis investigates the use of non-lexical statistics by IR models, with the goal of enhancing the estimation of relevance between a document and a user request. These non-lexical statistics consist of part of speech information. The parts of speech are the grammatical classes of words (e.g., noun, verb).
Part of speech statistics are modelled in the form of part of speech (POS) n-grams, which are contiguous sequences of parts of speech extracted from text. The distribution of POS n-grams in language is statistically analysed. It is shown that there exists a relationship between the frequency and informative content of POS n-grams. Based on this, different applications of POS n-grams to IR technology are described and evaluated with state-of-the-art systems. Experimental results show that POS n-grams can assist the retrieval process.
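
To illustrate the definition of a POS n-gram given above, the following minimal Python sketch extracts contiguous part of speech sequences from an already tagged sentence and counts how often each occurs. It is not taken from the thesis: the function name `pos_ngrams`, the toy sentence, and the coarse tag labels (DET, ADJ, NOUN, VERB) are assumptions made purely for illustration, and in practice the tags would come from an automatic part of speech tagger.

```python
from collections import Counter


def pos_ngrams(tags, n):
    """Return every contiguous length-n sequence of POS tags as a tuple.

    Hypothetical helper for illustration; returns an empty list when the
    tag sequence is shorter than n.
    """
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


# Hand-tagged toy sentence; the tag set is an illustrative assumption.
tagged = [
    ("the", "DET"), ("quick", "ADJ"), ("fox", "NOUN"), ("chased", "VERB"),
    ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN"),
]

# Discard the words and keep only the grammatical classes.
tags = [tag for _, tag in tagged]

# Frequency of each POS trigram in the sentence.
trigram_counts = Counter(pos_ngrams(tags, 3))

for ngram, count in trigram_counts.most_common():
    print(" ".join(ngram), count)
```

Counting over a large collection rather than a single sentence would give the kind of frequency distribution that the abstract relates to informative content; how those counts are then used for retrieval is not specified here.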
Original language: English
Number of pages: 217
Publication status: Published - 2007
Externally published: Yes

ID: 38257407