Data Mining and Information Extraction for CiteSeerX and Friends

Talk by Prof. C. Lee Giles, Pennsylvania State University, USA. The talk runs from 14.30 to 15.30.


Cyberinfrastructure, or e-science, has become crucial in many areas of science where data access often defines scientific progress. Open source (OS) systems have greatly facilitated the design and implementation of supporting cyberinfrastructure, permitting the design of specialized integrated search engines and digital libraries. These offer many opportunities for domain-relevant information and knowledge extraction, such as citation extraction, automated indexing and ranking, chemical formulae search, and table indexing. We describe the open source SeerSuite architecture, a modular, extensible system built on successful OS projects such as Lucene/Solr, and discuss issues in building domain-specific enterprise search and cyberinfrastructure for the sciences and academia. Because of the large amount of information crawled and/or searched, there are many problems of scale in information extraction and data mining, such as author and entity disambiguation and data extraction and ranking. We highlight application domains with examples from computer science (CiteSeerX) and chemistry (ChemXSeer), along with related problem areas. Because such enterprise systems require unique information extraction approaches, several different machine learning methods, such as conditional random fields, support vector machines, mutual-information-based feature selection, and sequence mining, are critical for performance. We draw lessons for other e-science and cyberinfrastructure systems in terms of design, implementation and research, and discuss future directions, systems and research.


C. Lee Giles is the David Reese Professor at the College of Information Sciences and Technology at the Pennsylvania State University, USA. He is also graduate college Professor of Computer Science and Engineering, courtesy Professor of Supply Chain and Information Systems, and Director of the Intelligent Systems Research Laboratory. He has twice received the IBM Distinguished Faculty Award. He directs the Next Generation CiteSeer project, CiteSeerX, and codirects the ChemXSeer project at Penn State. His current research interests are in intelligent information processing systems. He was one of the creators of the novel metasearch engines Inquirus and Inquirus2. He was also one of the creators of the popular computer and information science search engine CiteSeer, an autonomous citation indexing search engine and digital library, now hosted at the College of Information Sciences and Technology at Penn State University. He also created the niche search engines eBizSearch, a search engine for e-business documents, and SMEALSearch, a search engine and digital library for academic business documents. He is very interested in cyberinfrastructure for science and the academy and is currently codeveloping a portal and search tool for environmental chemistry, ChemXSeer. He prototyped a novel search engine for archaeology, ArchSeer, and also developed a new search engine for robots.txt, BotSeer, that indexed over 2 million robots.txt files. Currently, he is working on collaboration networks, CollabSeer, and citation recommendation, RefSeer. He has more than 300 publications with over 18,000 citations (one of the top 100 h-indexes in Computer Science). Several of his papers have won or been nominated for best paper awards and have been reprinted in edited collections.
He has served or is currently serving on the editorial boards of IEEE Intelligent Systems, IEEE Transactions on Knowledge and Data Engineering, Machine Learning Journal, Computational Intelligence and Applications, IEEE Transactions on Neural Networks, Journal of Computational Intelligence in Finance, Journal of Parallel and Distributed Computing, Neural Networks, Neural Computation, and Academic Press. He is a Fellow of the ACM, a Fellow of the IEEE and a Fellow of the International Neural Network Society, and a member of AAAI and AAAS.

Scientific Host: Christina Lioma, DIKU

The talk is open to all interested parties - admission is free. There will be cake in the kitchen at 24.05.49 after the talk, where you can meet the speaker.