Web Data Scraping Technology Using Term Frequency Inverse Document Frequency to Enhance the Big Data Quality on Sentiment Analysis
Authors: Sangita Pokhrel, Nalinda Somasiri, Rebecca Jeyavadhanam, Swathi Ganesan

Abstract:

Tourism is a booming industry with substantial potential for global wealth and employment. Social media sites generate vast amounts of data every day, creating numerous opportunities to bring more insights to decision-makers. Integrating big data technology into the tourism industry allows companies to determine where their customers have been and what they like. Businesses, such as those managing visitor centres or hotels, can then use this information, and tourists can get a clear idea of places before visiting. From a technical perspective, we apply natural language processing to analyse the sentiment features of online reviews from tourists, and we supply an enhanced long short-term memory (LSTM) framework for sentiment feature extraction from travel reviews. To evaluate the effectiveness of our methodology experimentally, we constructed a web review database using a crawler and web scraping techniques. The review sentences were first classified with the VADER and RoBERTa models to obtain the polarity of each review. We then studied feature extraction methods, namely Count Vectorization and Term Frequency-Inverse Document Frequency (TF-IDF) Vectorization, and implemented a Convolutional Neural Network (CNN) classifier for sentiment analysis to decide whether a tourist's attitude towards a destination is positive, negative, or neutral, based on the review text posted online. After pre-processing and cleaning the dataset, the CNN algorithm achieved an accuracy of 96.12% for positive and negative sentiment analysis.

Keywords: Count vectorization, Convolutional Neural Network, Crawler, data technology, Long Short-Term Memory, LSTM, Web Scraping, sentiment analysis.
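
As a rough illustration of the two feature extraction schemes named in the abstract, the sketch below builds count vectors and TF-IDF vectors for a few made-up travel reviews. It is not the authors' implementation: the tokenizer, the sample reviews, the function names, and the unsmoothed weighting tf(t, d) * log(N / df(t)) are all assumptions chosen for illustration, and library implementations often use smoothed or normalised variants.

```python
import math
from collections import Counter

PUNCT = ".,!?;:\"'()"

def tokenize(text):
    # Hypothetical tokenizer: lowercase, split on whitespace, strip edge punctuation.
    tokens = (t.strip(PUNCT).lower() for t in text.split())
    return [t for t in tokens if t]

def count_vectors(docs):
    # Count vectorization: one row per document, one column per vocabulary term.
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    counts = [Counter(doc) for doc in tokenized]
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

def tfidf_vectors(docs):
    # TF-IDF: relative term frequency scaled by log(N / document frequency).
    vocab, counts = count_vectors(docs)
    n_docs = len(docs)
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([(c / total) * w if total else 0.0
                       for c, w in zip(row, idf)])
    return vocab, matrix

# Made-up reviews, standing in for scraped review text.
reviews = [
    "The beach was beautiful and the hotel was clean",
    "The hotel was dirty and the staff was rude",
    "Beautiful views, friendly staff",
]

vocab, counts = count_vectors(reviews)
_, tfidf = tfidf_vectors(reviews)

for term in ("the", "beach"):
    j = vocab.index(term)
    print(term, [row[j] for row in counts], [round(row[j], 4) for row in tfidf])
```

Under this weighting, a word such as "the" that occurs in most reviews receives a lower weight than a word such as "beach" that occurs in only one, even though its raw count is higher. Either matrix could then be passed to a classifier.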
