Improving Topic Quality of Scripts by Using Scene Similarity Based Word Co-Occurrence
Authors: Yunseok Noh, Chang-Uk Kwak, Sun-Joong Kim, Seong-Bae Park
Abstract:
Scripts are one of the basic text resources for understanding broadcasting contents, and topic modeling is a method for obtaining a summary of broadcasting contents from their scripts. Scripts generally describe contents through directions and speeches, and they provide scene segments that can be regarded as semantic units. A script can therefore be topic-modeled by treating each scene segment as a document. However, because scene segments consist mainly of speeches, relatively few word co-occurrences are observed within them, which inevitably degrades the quality of topics learned by statistical methods. To tackle this problem, we propose a method that improves topic quality by adding word co-occurrence information obtained from scene similarities. The main idea is that knowing two or more texts are topically related is useful for learning high-quality topics, and, in turn, more accurate topical representations make it easier to determine whether two texts are related. In this paper, we regard two scene segments as related if their topical similarity is high enough, and we treat words as co-occurring if they appear together in topically related scene segments. By iteratively inferring topics and determining semantically neighboring scene segments, we obtain a topic space that represents the broadcasting contents well. In the experiments, we show that the proposed method generates higher-quality topics from Korean drama scripts than the baselines.
Keywords: Broadcasting contents, generalized Pólya urn model, scripts, text similarity, topic model.
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1110840
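The abstract describes an iterative loop: infer topics over scene segments, find topically related scenes, treat words from those neighbors as additional co-occurrences, and re-infer topics. Below is a minimal Python sketch of that loop, assuming scikit-learn is available. It is not the authors' generalized Pólya urn sampler; it only approximates the idea by augmenting the scene-term count matrix with damped counts from topically similar scenes before refitting a standard LDA. The toy scene texts, the similarity threshold, and the damping factor are illustrative assumptions.

```python
# Sketch of the iterative idea: topics -> scene similarity -> extra co-occurrence -> topics.
# NOT the paper's generalized Polya urn model; counts from similar scenes are simply added.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

# Toy scene segments (one string per scene); real input would be script scenes.
scenes = [
    "detective asks witness about the missing painting",
    "witness describes the gallery and the stolen painting",
    "family eats dinner and argues about money",
    "mother worries about money while cooking dinner",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(scenes).toarray().astype(float)  # scene-term counts

n_topics, n_iters, sim_threshold, boost = 2, 3, 0.7, 0.5      # illustrative settings
X_aug = X.copy()
for _ in range(n_iters):
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    theta = lda.fit_transform(X_aug)            # per-scene topic distributions
    sim = cosine_similarity(theta)              # scene-scene topical similarity
    np.fill_diagonal(sim, 0.0)

    # Rebuild augmented counts: each scene also "observes" a damped copy of the
    # words of every scene whose topical similarity exceeds the threshold.
    X_aug = X.copy()
    for i in range(len(scenes)):
        for j in np.where(sim[i] >= sim_threshold)[0]:
            X_aug[i] += boost * X[j]

vocab = np.array(vectorizer.get_feature_names_out())
for k, comp in enumerate(lda.components_):
    print(f"topic {k}:", ", ".join(vocab[comp.argsort()[::-1][:5]]))
```

In this sketch the extra counts play the role that topically related scene segments play in the paper: they supply co-occurrence evidence that is too sparse within a single speech-heavy scene.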