Mixed monolingual homepage finding in 34 languages: the role of language script and search domain
Research output: Contribution to journal › Journal article › Research › peer-review
Standard
Mixed monolingual homepage finding in 34 languages: the role of language script and search domain. / Blanco, Roi; Lioma, Christina.
In: Information Retrieval, Vol. 12, No. 3, 2009, p. 324-351.
RIS
TY - JOUR
T1 - Mixed monolingual homepage finding in 34 languages
T2 - the role of language script and search domain
AU - Blanco, Roi
AU - Lioma, Christina
PY - 2009
Y1 - 2009
AB - The information that is available or sought on the World Wide Web (Web) is increasingly multilingual. Information Retrieval systems, such as the freely available search engines on the Web, need to provide fair and equal access to this information, regardless of the language in which a query is written or where the query is posted from. In this work, we ask two questions: How do existing state-of-the-art search engines deal with languages written in different alphabets (scripts)? Do local language-based search domains actually facilitate access to information? We conduct a thorough study on the effect of multilingual queries for homepage finding, where the aim of the retrieval system is to return only one document, namely the homepage described in the query. We evaluate the effect of multilingual queries on retrieval performance with regard to (i) the alphabet in which the queries are written (e.g., Latin, Russian, Arabic), and (ii) the language domain where the queries are posted (e.g., google.com, google.fr). We query four major freely available search engines with 764 queries in 34 different languages, and look for the correct homepage in the top retrieved results. In order to have fair multilingual experimental settings, we use an ontology that is comparable across languages and also representative of realistic Web searches: football premier leagues in different countries; the official team name represents our query, and the official team homepage represents the document to be retrieved.
A series of thorough experiments involving over 10,000 runs, with queries both in their correct script and in Latin characters, and also using both global-domain and local-domain searches, reveals that queries issued in the correct script of a language are more likely to be found and ranked in the top 3, while queries in non-Latin-script languages that are nevertheless issued in Latin script are less likely to be found; also, queries issued to the correct local domain of a search engine, e.g., French queries to yahoo.fr, are likely to have better retrieval performance than queries issued to the global domain of a search engine. To our knowledge, this is the first Web retrieval study that uses such a wide range of languages.
U2 - 10.1007/s10791-008-9082-8
DO - 10.1007/s10791-008-9082-8
M3 - Journal article
AN - SCOPUS:64749100663
VL - 12
SP - 324
EP - 351
JO - Information Retrieval
JF - Information Retrieval
SN - 1386-4564
IS - 3
ER -
ID: 49502428