Mining News Sites to Create Special Domain News Collections

Authors: David B. Bracewell, Fuji Ren, Shingo Kuroiwa


We present a method for creating special domain collections from news sites. The method requires only a single sample article as a seed. It needs no prior corpus statistics and is applicable to multiple languages. We examine various similarity measures and the creation of document collections for English and Japanese. The main contributions are as follows. First, the algorithm can build special domain collections from as little as one sample document. Second, unlike other algorithms, it does not require a second "general" corpus to compute statistics. Third, in our testing, the algorithm outperformed others in creating collections made up of highly relevant articles.
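The seed-based selection described above can be illustrated with a minimal sketch. The abstract does not say which similarity measures were examined, so cosine similarity over raw term counts is an assumption here, as are the function names (`term_counts`, `cosine`, `build_collection`), the crude English-only tokenizer, and the threshold of 0.3. It is not the authors' algorithm, only an example of filtering candidate articles against one seed without statistics from a second corpus.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its word tokens (a crude, English-only tokenizer)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    # Counter returns 0 for terms missing from b, so only shared terms contribute.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_collection(seed, candidates, threshold=0.3):
    """Keep the candidates whose similarity to the seed meets the threshold.

    The threshold value is an illustrative assumption, not taken from the paper.
    """
    seed_vec = term_counts(seed)
    return [doc for doc in candidates
            if cosine(seed_vec, term_counts(doc)) >= threshold]


# Hypothetical sample data: one seed article and two candidate articles.
seed = "The central bank raised interest rates to curb inflation."
candidates = [
    "Inflation fears push the central bank to raise interest rates again.",
    "The home team won the championship after a dramatic final match.",
]
collection = build_collection(seed, candidates)
```

With this sample data, only the first candidate is kept: it shares most of its terms with the seed, while the second shares only a function word. Only the seed and the candidate itself are consulted, which mirrors the claim that no "general" corpus is required. A Japanese pipeline would need a morphological analyzer in place of the regular-expression tokenizer.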

Keywords: Information Retrieval, News, Special Domain Collections


