Search results for: text mining analysis
9181 Elimination of Redundant Links in Web Pages– Mathematical Approach
Authors: G. Poonkuzhali, K.Thiagarajan, K.Sarukesi
Abstract:
With the enormous growth on the web, users get easily lost in the rich hyper structure. Thus developing user friendly and automated tools for providing relevant information without any redundant links to the users to cater to their needs is the primary task for the website owners. Most of the existing web mining algorithms have concentrated on finding frequent patterns while neglecting the less frequent one that are likely to contain the outlying data such as noise, irrelevant and redundant data. This paper proposes new algorithm for mining the web content by detecting the redundant links from the web documents using set theoretical(classical mathematics) such as subset, union, intersection etc,. Then the redundant links is removed from the original web content to get the required information by the user..Keywords: Web documents, Web content mining, redundantlink, outliers, set theory.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 20139180 Providing Medical Information in Braille: Research and Development of Automatic Braille Translation Program for Japanese “eBraille“
Authors: Aki Sugano, Mika Ohta, Mineko Ikegami, Kenji Miura, Sayo Tsukamoto, Akihiro Ichinose, Toshiko Ohshima, Eiichi Maeda, Masako Matsuura, Yutaka Takao
Abstract:
Along with the advances in medicine, providing medical information to individual patient is becoming more important. In Japan such information via Braille is hardly provided to blind and partially sighted people. Thus we are researching and developing a Web-based automatic translation program “eBraille" to translate Japanese text into Japanese Braille. First we analyzed the Japanese transcription rules to implement them on our program. We then added medical words to the dictionary of the program to improve its translation accuracy for medical text. Finally we examined the efficacy of statistical learning models (SLMs) for further increase of word segmentation accuracy in braille translation. As a result, eBraille had the highest translation accuracy in the comparison with other translation programs, improved the accuracy for medical text and is utilized to make hospital brochures in braille for outpatients and inpatients.
Keywords: Automatic Braille translation, Medical text, Partially sighted people.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 16009179 Object Identification with Color, Texture, and Object-Correlation in CBIR System
Authors: Awais Adnan, Muhammad Nawaz, Sajid Anwar, Tamleek Ali, Muhammad Ali
Abstract:
Needs of an efficient information retrieval in recent years in increased more then ever because of the frequent use of digital information in our life. We see a lot of work in the area of textual information but in multimedia information, we cannot find much progress. In text based information, new technology of data mining and data marts are now in working that were started from the basic concept of database some where in 1960. In image search and especially in image identification, computerized system at very initial stages. Even in the area of image search we cannot see much progress as in the case of text based search techniques. One main reason for this is the wide spread roots of image search where many area like artificial intelligence, statistics, image processing, pattern recognition play their role. Even human psychology and perception and cultural diversity also have their share for the design of a good and efficient image recognition and retrieval system. A new object based search technique is presented in this paper where object in the image are identified on the basis of their geometrical shapes and other features like color and texture where object-co-relation augments this search process. To be more focused on objects identification, simple images are selected for the work to reduce the role of segmentation in overall process however same technique can also be applied for other images.Keywords: Object correlation, Geometrical shape, Color, texture, features, contents.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 20279178 Edit Distance Algorithm to Increase Storage Efficiency of Javanese Corpora
Authors: Aji P. Wibawa, Andrew Nafalski, Neil Murray, Wayan F. Mahmudy
Abstract:
Since the one-to-one word translator does not have the facility to translate pragmatic aspects of Javanese, the parallel text alignment model described uses a phrase pair combination. The algorithm aligns the parallel text automatically from the beginning to the end of each sentence. Even though the results of the phrase pair combination outperform the previous algorithm, it is still inefficient. Recording all possible combinations consume more space in the database and time consuming. The original algorithm is modified by applying the edit distance coefficient to improve the data-storage efficiency. As a result, the data-storage consumption is 90% reduced as well as its learning period (42s).Keywords: edit distance coefficient, Javanese, parallel text alignment, phrase pair combination
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 17289177 Redesigning Business Processes: A Method Based on Simulation and Process Mining Techniques
Authors: Zahra Mohammadnazari, Fateme Rostambeygi, Fatemeh Dehrouyeh, Hwang Ki-Soon, Amir Aghsami
Abstract:
Corporations have always prioritized efforts to examine and improve processes. Various metrics, such as the cost and time required to implement the process and can be specified in this regard. Process improvement can be defined as an improvement of these indicators. This is accomplished by looking at prospective adjustments to the current executive process model or the resources allotted to it. Research has been conducted in this paper to the improve the procurement process and aims to explore assessment prospects in the project using a combination of process mining and simulation (benefiting from Play-In and Play-Out methodologies). To run the simulation, we will need to complete the control flow diagram, institution settings, resource settings, and activity settings. The process of mining event logs yields the process control flow. However, both the entry of institutions and the distribution of resources must be modeled. The rate of admission of institutions and the distribution of time for the implementation of activities will be determined in the next step.
Keywords: Business reengineering, Petri net, process-based simulation, process mining.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 4829176 Actionable Rules: Issues and New Directions
Authors: Harleen Kaur
Abstract:
Knowledge Discovery in Databases (KDD) is the process of extracting previously unknown, hidden and interesting patterns from a huge amount of data stored in databases. Data mining is a stage of the KDD process that aims at selecting and applying a particular data mining algorithm to extract an interesting and useful knowledge. It is highly expected that data mining methods will find interesting patterns according to some measures, from databases. It is of vital importance to define good measures of interestingness that would allow the system to discover only the useful patterns. Measures of interestingness are divided into objective and subjective measures. Objective measures are those that depend only on the structure of a pattern and which can be quantified by using statistical methods. While, subjective measures depend only on the subjectivity and understandability of the user who examine the patterns. These subjective measures are further divided into actionable, unexpected and novel. The key issues that faces data mining community is how to make actions on the basis of discovered knowledge. For a pattern to be actionable, the user subjectivity is captured by providing his/her background knowledge about domain. Here, we consider the actionability of the discovered knowledge as a measure of interestingness and raise important issues which need to be addressed to discover actionable knowledge.
Keywords: Data Mining Community, Knowledge Discovery inDatabases (KDD), Interestingness, Subjective Measures, Actionability.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 19419175 Discovering Complex Regularities: from Tree to Semi-Lattice Classifications
Authors: A. Faro, D. Giordano, F. Maiorana
Abstract:
Data mining uses a variety of techniques each of which is useful for some particular task. It is important to have a deep understanding of each technique and be able to perform sophisticated analysis. In this article we describe a tool built to simulate a variation of the Kohonen network to perform unsupervised clustering and support the entire data mining process up to results visualization. A graphical representation helps the user to find out a strategy to optimize classification by adding, moving or delete a neuron in order to change the number of classes. The tool is able to automatically suggest a strategy to optimize the number of classes optimization, but also support both tree classifications and semi-lattice organizations of the classes to give to the users the possibility of passing from one class to the ones with which it has some aspects in common. Examples of using tree and semi-lattice classifications are given to illustrate advantages and problems. The tool is applied to classify macroeconomic data that report the most developed countries- import and export. It is possible to classify the countries based on their economic behaviour and use the tool to characterize the commercial behaviour of a country in a selected class from the analysis of positive and negative features that contribute to classes formation. Possible interrelationships between the classes and their meaning are also discussed.Keywords: Unsupervised classification, Kohonen networks, macroeconomics, Visual data mining, Cluster interpretation.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 15419174 An Efficient Approach to Mining Frequent Itemsets on Data Streams
Authors: Sara Ansari, Mohammad Hadi Sadreddini
Abstract:
The increasing importance of data stream arising in a wide range of advanced applications has led to the extensive study of mining frequent patterns. Mining data streams poses many new challenges amongst which are the one-scan nature, the unbounded memory requirement and the high arrival rate of data streams. In this paper, we propose a new approach for mining itemsets on data stream. Our approach SFIDS has been developed based on FIDS algorithm. The main attempts were to keep some advantages of the previous approach and resolve some of its drawbacks, and consequently to improve run time and memory consumption. Our approach has the following advantages: using a data structure similar to lattice for keeping frequent itemsets, separating regions from each other with deleting common nodes that results in a decrease in search space, memory consumption and run time; and Finally, considering CPU constraint, with increasing arrival rate of data that result in overloading system, SFIDS automatically detect this situation and discard some of unprocessing data. We guarantee that error of results is bounded to user pre-specified threshold, based on a probability technique. Final results show that SFIDS algorithm could attain about 50% run time improvement than FIDS approach.Keywords: Data stream, frequent itemset, stream mining.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 14189173 Evaluation of the Urban Regeneration Project: Land Use Transformation and SNS Big Data Analysis
Authors: Ju-Young Kim, Tae-Heon Moon, Jung-Hun Cho
Abstract:
Urban regeneration projects have been actively promoted in Korea. In particular, Jeonju Hanok Village is evaluated as one of representative cases in terms of utilizing local cultural heritage sits in the urban regeneration project. However, recently, there has been a growing concern in this area, due to the ‘gentrification’, caused by the excessive commercialization and surging tourists. This trend was changing land and building use and resulted in the loss of identity of the region. In this regard, this study analyzed the land use transformation between 2010 and 2016 to identify the commercialization trend in Jeonju Hanok Village. In addition, it conducted SNS big data analysis on Jeonju Hanok Village from February 14th, 2016 to March 31st, 2016 to identify visitors’ awareness of the village. The study results demonstrate that rapid commercialization was underway, unlikely the initial intention, so that planners and officials in city government should reconsider the project direction and rebuild deliberate management strategies. This study is meaningful in that it analyzed the land use transformation and SNS big data to identify the current situation in urban regeneration area. Furthermore, it is expected that the study results will contribute to the vitalization of regeneration area.
Keywords: Land use, SNS, text mining, urban regeneration.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 12159172 A Comprehensive Review on Different Mixed Data Clustering Ensemble Methods
Authors: S. Sarumathi, N. Shanthi, S. Vidhya, M. Sharmila
Abstract:
An extensive amount of work has been done in data clustering research under the unsupervised learning technique in Data Mining during the past two decades. Moreover, several approaches and methods have been emerged focusing on clustering diverse data types, features of cluster models and similarity rates of clusters. However, none of the single clustering algorithm exemplifies its best nature in extracting efficient clusters. Consequently, in order to rectify this issue, a new challenging technique called Cluster Ensemble method was bloomed. This new approach tends to be the alternative method for the cluster analysis problem. The main objective of the Cluster Ensemble is to aggregate the diverse clustering solutions in such a way to attain accuracy and also to improve the eminence the individual clustering algorithms. Due to the massive and rapid development of new methods in the globe of data mining, it is highly mandatory to scrutinize a vital analysis of existing techniques and the future novelty. This paper shows the comparative analysis of different cluster ensemble methods along with their methodologies and salient features. Henceforth this unambiguous analysis will be very useful for the society of clustering experts and also helps in deciding the most appropriate one to resolve the problem in hand.
Keywords: Clustering, Cluster Ensemble Methods, Coassociation matrix, Consensus Function, Median Partition.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 21049171 Lead and Cadmium Spatial Pattern and Risk Assessment around Coal Mine in Hyrcanian Forest, North Iran
Authors: Mahsa Tavakoli, Seyed Mohammad Hojjati, Yahya Kooch
Abstract:
In this study, the effect of coal mining activities on lead and cadmium concentrations and distribution in soil was investigated in Hyrcanian forest, North Iran. 16 plots (20×20 m2) were established by systematic-randomly (60×60 m2) in an area of 4 ha (200×200 m2-mine entrance placed at center). An area adjacent to the mine was not affected by the mining activity; considered as the controlled area. In order to investigate soil lead and cadmium concentration, one sample was taken from the 0-10 cm in each plot. To study the spatial pattern of soil properties and lead and cadmium concentrations in the mining area, an area of 80×80m2 (the mine as the center) was considered and 80 soil samples were systematic-randomly taken (10 m intervals). Geostatistical analysis was performed via Kriging method and GS+ software (version 5.1). In order to estimate the impact of coal mining activities on soil quality, pollution index was measured. Lead and cadmium concentrations were significantly higher in mine area (Pb: 10.97±0.30, Cd: 184.47±6.26 mg.kg-1) in comparison to control area (Pb: 9.42±0.17, Cd: 131.71±15.77 mg.kg-1). The mean values of the PI index indicate that Pb (1.16) and Cd (1.77) presented slightly polluted. Results of the NIPI index showed that Pb (1.44) and Cd (2.52) presented slight pollution and moderate pollution respectively. Results of variography and kriging method showed that it is possible to prepare interpolation maps of lead and cadmium around the mining areas in Hyrcanian forest. According to results of pollution and risk assessments, forest soil was contaminated by heavy metals (lead and cadmium); therefore, using reclamation and remediation techniques in these areas is necessary.
Keywords: Traditional coal mining, heavy metals, pollution indicators, geostatistics, caspian forest.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 10519170 The Negative Effect of Traditional Loops Style on the Performance of Algorithms
Authors: Mahmoud Moh'd Mhashi
Abstract:
A new algorithm called Character-Comparison to Character-Access (CCCA) is developed to test the effect of both: 1) converting character-comparison and number-comparison into character-access and 2) the starting point of checking on the performance of the checking operation in string searching. An experiment is performed using both English text and DNA text with different sizes. The results are compared with five algorithms, namely, Naive, BM, Inf_Suf_Pref, Raita, and Cycle. With the CCCA algorithm, the results suggest that the evaluation criteria of the average number of total comparisons are improved up to 35%. Furthermore, the results suggest that the clock time required by the other algorithms is improved in range from 22.13% to 42.33% by the new CCCA algorithm.
Keywords: Pattern matching, string searching, charactercomparison, character-access, text type, and checking
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 12689169 A Study of the Variability of Very Low Resolution Characters and the Feasibility of Their Discrimination Using Geometrical Features
Authors: Farshideh Einsele, Rolf Ingold
Abstract:
Current OCR technology does not allow to accurately recognizing small text images, such as those found in web images. Our goal is to investigate new approaches to recognize very low resolution text images containing antialiased character shapes. This paper presents a preliminary study on the variability of such characters and the feasibility to discriminate them by using geometrical features. In a first stage we analyze the distribution of these features. In a second stage we present a study on the discriminative power for recognizing isolated characters, using various rendering methods and font properties. Finally we present interesting results of our evaluation tests leading to our conclusion and future focus.Keywords: World Wide Web, document analysis, pattern recognition, Optical Character Recognition.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 13709168 Customer Need Type Classification Model using Data Mining Techniques for Recommender Systems
Authors: Kyoung-jae Kim
Abstract:
Recommender systems are usually regarded as an important marketing tool in the e-commerce. They use important information about users to facilitate accurate recommendation. The information includes user context such as location, time and interest for personalization of mobile users. We can easily collect information about location and time because mobile devices communicate with the base station of the service provider. However, information about user interest can-t be easily collected because user interest can not be captured automatically without user-s approval process. User interest usually represented as a need. In this study, we classify needs into two types according to prior research. This study investigates the usefulness of data mining techniques for classifying user need type for recommendation systems. We employ several data mining techniques including artificial neural networks, decision trees, case-based reasoning, and multivariate discriminant analysis. Experimental results show that CHAID algorithm outperforms other models for classifying user need type. This study performs McNemar test to examine the statistical significance of the differences of classification results. The results of McNemar test also show that CHAID performs better than the other models with statistical significance.Keywords: Customer need type, Data mining techniques, Recommender system, Personalization, Mobile user.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 21459167 Information Gain Ratio Based Clustering for Investigation of Environmental Parameters Effects on Human Mental Performance
Authors: H. Mehdi, Kh. S. Karimov, A. A. Kavokin
Abstract:
Methods of clustering which were developed in the data mining theory can be successfully applied to the investigation of different kinds of dependencies between the conditions of environment and human activities. It is known, that environmental parameters such as temperature, relative humidity, atmospheric pressure and illumination have significant effects on the human mental performance. To investigate these parameters effect, data mining technique of clustering using entropy and Information Gain Ratio (IGR) K(Y/X) = (H(X)–H(Y/X))/H(Y) is used, where H(Y)=-ΣPi ln(Pi). This technique allows adjusting the boundaries of clusters. It is shown that the information gain ratio (IGR) grows monotonically and simultaneously with degree of connectivity between two variables. This approach has some preferences if compared, for example, with correlation analysis due to relatively smaller sensitivity to shape of functional dependencies. Variant of an algorithm to implement the proposed method with some analysis of above problem of environmental effects is also presented. It was shown that proposed method converges with finite number of steps.Keywords: Clustering, Correlation analysis, EnvironmentalParameters, Information Gain Ratio, Mental Performance.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 18239166 Development of Innovative Islamic Web Applications
Authors: Farrukh Shahzad
Abstract:
The rich Islamic resources related to religious text, Islamic sciences, and history are widely available in print and in electronic format online. However, most of these works are only available in Arabic language. In this research, an attempt is made to utilize these resources to create interactive web applications in Arabic, English and other languages. The system utilizes the Pattern Recognition, Knowledge Management, Data Mining, Information Retrieval and Management, Indexing, storage and data-analysis techniques to parse, store, convert and manage the information from authentic Arabic resources. These interactive web Apps provide smart multi-lingual search, tree based search, on-demand information matching and linking. In this paper, we provide details of application architecture, design, implementation and technologies employed. We also presented the summary of web applications already developed. We have also included some screen shots from the corresponding web sites. These web applications provide an Innovative On-line Learning Systems (eLearning and computer based education).Keywords: Islamic resources, Muslim scholars, hadith, narrators, history, fiqh.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 13029165 A Novel Approach to Improve Users Search Goal in Web Usage Mining
Authors: R. Lokeshkumar, P. Sengottuvelan
Abstract:
Web mining is to discover and extract useful Information. Different users may have different search goals when they search by giving queries and submitting it to a search engine. The inference and analysis of user search goals can be very useful for providing an experience result for a user search query. In this project, we propose a novel approach to infer user search goals by analyzing search web logs. First, we propose a novel approach to infer user search goals by analyzing search engine query logs, the feedback sessions are constructed from user click-through logs and it efficiently reflect the information needed for users. Second we propose a preprocessing technique to clean the unnecessary data’s from web log file (feedback session). Third we propose a technique to generate pseudo-documents to representation of feedback sessions for clustering. Finally we implement k-medoids clustering algorithm to discover different user search goals and to provide a more optimal result for a search query based on feedback sessions for the user.Keywords: Data Preprocessing, Session Identification, Web log mining, Web Personalization.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 20229164 Techniques with Statistics for Web Page Watermarking
Authors: Mohamed Lahcen BenSaad, Sun XingMing
Abstract:
Information hiding, especially watermarking is a promising technique for the protection of intellectual property rights. This technology is mainly advanced for multimedia but the same has not been done for text. Web pages, like other documents, need a protection against piracy. In this paper, some techniques are proposed to show how to hide information in web pages using some features of the markup language used to describe these pages. Most of the techniques proposed here use the white space to hide information or some varieties of the language in representing elements. Experiments on a very small page and analysis of five thousands web pages show that these techniques have a wide bandwidth available for information hiding, and they might form a solid base to develop a robust algorithm for web page watermarking.Keywords: Digital Watermarking, Information Hiding, Markup Language, Text watermarking, Software Watermarking.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 17939163 Spatio-Temporal Data Mining with Association Rules for Lake Van
Authors: T. Aydin, M. F. Alaeddinoglu
Abstract:
People, throughout the history, have made estimates and inferences about the future by using their past experiences. Developing information technologies and the improvements in the database management systems make it possible to extract useful information from knowledge in hand for the strategic decisions. Therefore, different methods have been developed. Data mining by association rules learning is one of such methods. Apriori algorithm, one of the well-known association rules learning algorithms, is not commonly used in spatio-temporal data sets. However, it is possible to embed time and space features into the data sets and make Apriori algorithm a suitable data mining technique for learning spatiotemporal association rules. Lake Van, the largest lake of Turkey, is a closed basin. This feature causes the volume of the lake to increase or decrease as a result of change in water amount it holds. In this study, evaporation, humidity, lake altitude, amount of rainfall and temperature parameters recorded in Lake Van region throughout the years are used by the Apriori algorithm and a spatio-temporal data mining application is developed to identify overflows and newlyformed soil regions (underflows) occurring in the coastal parts of Lake Van. Identifying possible reasons of overflows and underflows may be used to alert the experts to take precautions and make the necessary investments.Keywords: Apriori algorithm, association rules, data mining, spatio-temporal data.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 14059162 Spatial Data Mining by Decision Trees
Authors: S. Oujdi, H. Belbachir
Abstract:
Existing methods of data mining cannot be applied on spatial data because they require spatial specificity consideration, as spatial relationships. This paper focuses on the classification with decision trees, which are one of the data mining techniques. We propose an extension of the C4.5 algorithm for spatial data, based on two different approaches Join materialization and Querying on the fly the different tables. Similar works have been done on these two main approaches, the first - Join materialization - favors the processing time in spite of memory space, whereas the second - Querying on the fly different tables- promotes memory space despite of the processing time. The modified C4.5 algorithm requires three entries tables: a target table, a neighbor table, and a spatial index join that contains the possible spatial relationship among the objects in the target table and those in the neighbor table. Thus, the proposed algorithms are applied to a spatial data pattern in the accidentology domain. A comparative study of our approach with other works of classification by spatial decision trees will be detailed.
Keywords: C4.5 Algorithm, Decision trees, S-CART, Spatial data mining.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 29869161 A Hybrid Approach for Thread Recommendation in MOOC Forums
Authors: Ahmad. A. Kardan, Amir Narimani, Foozhan Ataiefard
Abstract:
Recommender Systems have been developed to provide contents and services compatible to users based on their behaviors and interests. Due to information overload in online discussion forums and users diverse interests, recommending relative topics and threads is considered to be helpful for improving the ease of forum usage. In order to lead learners to find relevant information in educational forums, recommendations are even more needed. We present a hybrid thread recommender system for MOOC forums by applying social network analysis and association rule mining techniques. Initial results indicate that the proposed recommender system performs comparatively well with regard to limited available data from users' previous posts in the forum.Keywords: Association rule mining, hybrid recommender system, massive open online courses, MOOCs, social network analysis.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 12639160 A BERT-Based Model for Financial Social Media Sentiment Analysis
Authors: Josiel Delgadillo, Johnson Kinyua, Charles Mutigwe
Abstract:
The purpose of sentiment analysis is to determine the sentiment strength (e.g., positive, negative, neutral) from a textual source for good decision-making. Natural Language Processing (NLP) in domains such as financial markets requires knowledge of domain ontology, and pre-trained language models, such as BERT, have made significant breakthroughs in various NLP tasks by training on large-scale un-labeled generic corpora such as Wikipedia. However, sentiment analysis is a strong domain-dependent task. The rapid growth of social media has given users a platform to share their experiences and views about products, services, and processes, including financial markets. StockTwits and Twitter are social networks that allow the public to express their sentiments in real time. Hence, leveraging the success of unsupervised pre-training and a large amount of financial text available on social media platforms could potentially benefit a wide range of financial applications. This work is focused on sentiment analysis using social media text on platforms such as StockTwits and Twitter. To meet this need, SkyBERT, a domain-specific language model pre-trained and fine-tuned on financial corpora, has been developed. The results show that SkyBERT outperforms current state-of-the-art models in financial sentiment analysis. Extensive experimental results demonstrate the effectiveness and robustness of SkyBERT.
Keywords: BERT, financial markets, Twitter, sentiment analysis.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 7169159 Conceptualization of the Attractive Work Environment and Organizational Activity for Humans in Future Deep Mines
Authors: M. A. Sanda, B. Johansson, J. Johansson
Abstract:
The purpose of this paper is to conceptualize a futureoriented human work environment and organizational activity in deep mines that entails a vision of good and safe workplace. Futureoriented technological challenges and mental images required for modern work organization design were appraised. It is argued that an intelligent-deep-mine covering the entire value chain, including environmental issues and with work organization that supports good working and social conditions towards increased human productivity could be designed. With such intelligent system and work organization in place, the mining industry could be seen as a place where cooperation, skills development and gender equality are key components. By this perspective, both the youth and women might view mining activity as an attractive job and the work environment as a safe, and this could go a long way in breaking the unequal gender balance that exists in most mines today. Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 16539158 Mining Educational Data to Support Students’ Major Selection
Authors: Kunyanuth Kularbphettong, Cholticha Tongsiri
Abstract:
This paper aims to create the model for student in choosing an emphasized track of student majoring in computer science at Suan Sunandha Rajabhat University. The objective of this research is to develop the suggested system using data mining technique to analyze knowledge and conduct decision rules. Such relationships can be used to demonstrate the reasonableness of student choosing a track as well as to support his/her decision and the system is verified by experts in the field. The sampling is from student of computer science based on the system and the questionnaire to see the satisfaction. The system result is found to be satisfactory by both experts and student as well.
Keywords: Data mining technique, the decision support system, knowledge and decision rules.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 32849157 Generating Frequent Patterns through Intersection between Transactions
Authors: M. Jamali, F. Taghiyareh
Abstract:
The problem of frequent itemset mining is considered in this paper. One new technique proposed to generate frequent patterns in large databases without time-consuming candidate generation. This technique is based on focusing on transaction instead of concentrating on itemset. This algorithm based on take intersection between one transaction and others transaction and the maximum shared items between transactions computed instead of creating itemset and computing their frequency. With applying real life transactions and some consumption is taken from real life data, the significant efficiency acquire from databases in generation association rules mining.Keywords: Association rules, data mining, frequent patterns, shared itemset.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 14029156 Text Summarization for Oil and Gas Drilling Topic
Authors: Y. Y. Chen, O. M. Foong, S. P. Yong, Kurniawan Iwan
Abstract:
Information sharing and gathering are important in the rapid advancement era of technology. The existence of WWW has caused rapid growth of information explosion. Readers are overloaded with too many lengthy text documents in which they are more interested in shorter versions. Oil and gas industry could not escape from this predicament. In this paper, we develop an Automated Text Summarization System known as AutoTextSumm to extract the salient points of oil and gas drilling articles by incorporating statistical approach, keywords identification, synonym words and sentence-s position. In this study, we have conducted interviews with Petroleum Engineering experts and English Language experts to identify the list of most commonly used keywords in the oil and gas drilling domain. The system performance of AutoTextSumm is evaluated using the formulae of precision, recall and F-score. Based on the experimental results, AutoTextSumm has produced satisfactory performance with F-score of 0.81.
Keywords: Keyword's probability, synonym sets.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 17309155 The Main Principles of Text-to-Speech Synthesis System
Authors: K.R. Aida–Zade, C. Ardil, A.M. Sharifova
Abstract:
In this paper, the main principles of text-to-speech synthesis system are presented. Associated problems which arise when developing speech synthesis system are described. Used approaches and their application in the speech synthesis systems for Azerbaijani language are shown.
Keywords: synthesis of Azerbaijani language, morphemes, phonemes, sounds, sentence, speech synthesizer, intonation, accent, pronunciation.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 56529154 Improving Academic Performance Prediction using Voting Technique in Data Mining
Authors: Ikmal Hisyam Mohamad Paris, Lilly Suriani Affendey, Norwati Mustapha
Abstract:
In this paper we compare the accuracy of data mining methods to classifying students in order to predicting student-s class grade. These predictions are more useful for identifying weak students and assisting management to take remedial measures at early stages to produce excellent graduate that will graduate at least with second class upper. Firstly we examine single classifiers accuracy on our data set and choose the best one and then ensembles it with a weak classifier to produce simple voting method. We present results show that combining different classifiers outperformed other single classifiers for predicting student performance.Keywords: Classification, Data Mining, Prediction, Combination of Multiple Classifiers.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 27539153 WebGD: A CORBA-based Document Classification and Retrieval System on the Web
Authors: Fuyang Peng, Bo Deng, Chao Qi, Mou Zhan
Abstract:
This paper presents the design and implementation of the WebGD, a CORBA-based document classification and retrieval system on Internet. The WebGD makes use of such techniques as Web, CORBA, Java, NLP, fuzzy technique, knowledge-based processing and database technology. Unified classification and retrieval model, classifying and retrieving with one reasoning engine and flexible working mode configuration are some of its main features. The architecture of WebGD, the unified classification and retrieval model, the components of the WebGD server and the fuzzy inference engine are discussed in this paper in detail.Keywords: Text Mining, document classification, knowledgeprocessing, fuzzy logic, Web, CORBA
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 18479152 Classification of Political Affiliations by Reduced Number of Features
Authors: Vesile Evrim, Aliyu Awwal
Abstract:
By the evolvement in technology, the way of expressing opinions switched direction to the digital world. The domain of politics, as one of the hottest topics of opinion mining research, merged together with the behavior analysis for affiliation determination in texts, which constitutes the subject of this paper. This study aims to classify the text in news/blogs either as Republican or Democrat with the minimum number of features. As an initial set, 68 features which 64 were constituted by Linguistic Inquiry and Word Count (LIWC) features were tested against 14 benchmark classification algorithms. In the later experiments, the dimensions of the feature vector reduced based on the 7 feature selection algorithms. The results show that the “Decision Tree”, “Rule Induction” and “M5 Rule” classifiers when used with “SVM” and “IGR” feature selection algorithms performed the best up to 82.5% accuracy on a given dataset. Further tests on a single feature and the linguistic based feature sets showed the similar results. The feature “Function”, as an aggregate feature of the linguistic category, was found as the most differentiating feature among the 68 features with the accuracy of 81% in classifying articles either as Republican or Democrat.Keywords: Politics, machine learning, feature selection, LIWC.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2365