Online Topic Model for Broadcasting Contents Using Semantic Correlation Information
Authors: Chang-Uk Kwak, Sun-Joong Kim, Seong-Bae Park, Sang-Jo Lee
Abstract:
This paper proposes a method for learning topics from broadcasting content. Two kinds of text relate to broadcasting content. One is the broadcasting script, a series of texts comprising directions and dialogue. The other is blog posts, which contain relatively abstract descriptions, stories, and diverse information about the content. Although both kinds of text cover similar broadcasting content, the words used in blog posts differ from those used in scripts. When unseen words appear, a method is needed to reflect them in the existing topics. In this paper, we introduce a semantic vocabulary expansion method for this purpose. We expand the topics of the broadcasting script by incorporating words from blog posts: each blog-post word is added to the topics with which it is most semantically correlated. We use word2vec to measure the semantic correlation between blog-post words and script topics. The topic vocabularies are updated, and posterior inference is then performed to rearrange the topics. In experiments, we verified that the proposed method discovers more salient topics for broadcasting content.
Keywords: Broadcasting script analysis, topic expansion, semantic correlation analysis, word2vec.
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1110948
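The vocabulary expansion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy three-dimensional vectors stand in for trained word2vec embeddings, the correlation score (mean cosine similarity to a topic's words) and the `threshold` cutoff are assumptions made for illustration, and the function names are hypothetical. The posterior inference that rearranges the topics afterwards is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def topic_correlation(word, topic_words, embeddings):
    """Assumed score: mean cosine similarity between a word and a topic's words."""
    sims = [cosine(embeddings[word], embeddings[t])
            for t in topic_words if t in embeddings]
    return sum(sims) / len(sims) if sims else 0.0


def expand_topics(topics, unseen_words, embeddings, threshold=0.5):
    """Add each unseen word to its most correlated topic.

    `threshold` is an illustrative assumption, used here to keep
    unrelated words out of every topic.
    """
    expanded = {name: list(words) for name, words in topics.items()}
    for word in unseen_words:
        if word not in embeddings:
            continue  # no vector available, so no correlation can be computed
        scores = {name: topic_correlation(word, words, embeddings)
                  for name, words in topics.items()}
        best = max(scores, key=scores.get)
        if scores[best] >= threshold and word not in expanded[best]:
            expanded[best].append(word)
    return expanded


# Toy vectors standing in for word2vec embeddings (hypothetical values).
embeddings = {
    "detective": [0.9, 0.1, 0.0],
    "murder": [0.8, 0.2, 0.1],
    "kiss": [0.1, 0.9, 0.0],
    "wedding": [0.0, 0.8, 0.2],
    "suspect": [0.85, 0.15, 0.05],
    "romance": [0.05, 0.95, 0.1],
    "weather": [0.0, 0.0, 1.0],
}

# Topics learned from the script, and words seen only in blog posts.
script_topics = {"crime": ["detective", "murder"], "love": ["kiss", "wedding"]}
blog_words = ["suspect", "romance", "weather", "unknown"]

expanded = expand_topics(script_topics, blog_words, embeddings)
print(expanded)
```

In this toy run, "suspect" joins the crime topic and "romance" joins the love topic, while "weather" falls below the threshold and "unknown" has no vector, so neither is added. With real data, the vectors would come from a word2vec model trained on a suitable corpus.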