Web Search Engine Based Naming Procedure for Independent Topic

Authors: Takahiro Nishigaki, Takashi Onoda

Abstract:

In recent years, the volume of document data has grown with the spread of the Internet, and many methods have been studied for extracting topics from large document collections. We proposed Independent Topic Analysis (ITA) to extract mutually independent topics from large document data such as newspaper articles. ITA extracts these independent topics from the document data by using Independent Component Analysis, and it represents each topic as a set of words. However, such a set of words can be quite different from the topic a user imagines. For example, the five words with the highest independence for one topic are: Topic 1 = {"scor", "game", "lead", "quarter", "rebound"}. Topic 1 can be considered to represent the topic "SPORTS", but this name has to be attached by the user, because ITA cannot name topics. Therefore, in this research, we propose a method that uses a web search engine to turn the word sets produced by ITA into topic names that are easy for people to understand. Specifically, we submit a set of topical words as a search query and take the title of the web page returned in the search result as the topic name. We also apply the proposed method to several data sets and verify its effectiveness.
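
The naming step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `SearchFn` interface, the function names `build_query` and `name_topic`, the choice of the first non-empty title, and the stand-in `fake_search` with its canned title are all hypothetical. A real search client would be substituted for `fake_search`.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical search interface (an assumption, not a real library API):
# takes a query string and returns page titles in rank order.
SearchFn = Callable[[str], List[str]]


def build_query(topic_words: Sequence[str], top_n: int = 5) -> str:
    """Join the top-n topic words into one space-separated query string."""
    return " ".join(topic_words[:top_n])


def name_topic(
    topic_words: Sequence[str], search: SearchFn, top_n: int = 5
) -> Optional[str]:
    """Return the first non-empty result title as the topic name.

    Picking the first non-empty title is an assumption of this sketch;
    the abstract only says the title of the search result is used.
    """
    for title in search(build_query(topic_words, top_n)):
        cleaned = title.strip()
        if cleaned:
            return cleaned
    return None  # no usable title was returned


def fake_search(query: str) -> List[str]:
    """Stand-in for a real web search; the canned titles are made up."""
    canned = {
        "scor game lead quarter rebound": ["  ", "Basketball box scores"],
    }
    return canned.get(query, [])


# Topic 1 from the abstract: the five words with the highest independence.
topic1 = ["scor", "game", "lead", "quarter", "rebound"]
print(name_topic(topic1, fake_search))
```

Passing the search function as a parameter keeps the naming logic independent of any particular search engine or client library, so the same code can be exercised offline with canned results.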

Keywords: web search engine, topic extraction, independent topic analysis, topic naming

