Search results for: naive Bayes classifier
487 An Integrated Lightweight Naïve Bayes Based Webpage Classification Service for Smartphone Browsers
Authors: Mayank Gupta, Siba Prasad Samal, Vasu Kakkirala
Abstract:
The internet world and its priorities have changed considerably in the last decade. Browsing on smartphones has increased manifold and is set to grow much further. Users spend considerable time browsing different websites, which gives a great deal of insight into their preferences. Instead of presenting plain information, classifying different aspects of browsing such as Bookmarks, History, and Download Manager into useful categories would improve and enhance the user’s experience. Most classification solutions are server side, which involves maintaining servers and other heavy resources. Server-side classification has security constraints and may miss contextual data during classification. On-device classification solves many such problems, but the challenge is to achieve classification accuracy under resource constraints. On-device classification can be much more useful for personalization, reduced dependency on cloud connectivity, and better privacy and security. This approach provides more relevant results than current standalone solutions because it uses content rendered by the browser, which is customized by the content provider based on the user’s profile. This paper proposes a Naive Bayes based lightweight classification engine targeted at resource-constrained devices. Our solution integrates with the web browser, which in turn triggers the classification algorithm. Whenever a user browses a webpage, this solution extracts DOM tree data from the browser’s rendering engine. This DOM data is dynamic, contextual, and secure data that cannot be replicated. The proposal extracts different features of the webpage and runs them through an algorithm to classify the page into multiple categories. A Naive Bayes based engine is chosen in this solution for its inherent advantages in using limited resources compared to other classification algorithms such as Support Vector Machines and Neural Networks. Naive Bayes classification requires a small memory footprint and less computation, making it suitable for a smartphone environment.
This solution has a feature to partition the model into multiple chunks, which reduces memory usage compared with loading the complete model. Classification of webpages through the integrated engine is faster, more relevant, and more energy efficient than other standalone on-device solutions. This classification engine has been tested on Samsung Z3 Tizen hardware. The engine is integrated into the Tizen Browser, which uses the Chromium rendering engine. For this solution, an extensive dataset is sourced from dmoztools.net and cleaned. The cleaned dataset has 227.5K webpages, which are divided into 8 generic categories ('education', 'games', 'health', 'entertainment', 'news', 'shopping', 'sports', 'travel'). Our browser-integrated solution has resulted in 15% less memory usage (due to the partition method) and 24% less power consumption in comparison with a standalone solution. This solution used 70% of the dataset for training the data model and the remaining 30% for testing. An average accuracy of ~96.3% is achieved across the above-mentioned 8 categories. This engine can be further extended to suggest dynamic tags and to use the classification for different use cases to enhance the browsing experience.
Keywords: chromium, lightweight engine, mobile computing, Naive Bayes, Tizen, web browser, webpage classification
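As a rough illustration of the kind of classifier the abstract above describes, the following is a minimal multinomial Naive Bayes sketch with add-one (Laplace) smoothing. It is not the authors' engine: the function names `train_nb` and `classify_nb`, the token lists, and the two-category toy data are all hypothetical, and the DOM feature extraction and model partitioning mentioned in the abstract are not modeled.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """Count class and word frequencies from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def classify_nb(model, tokens):
    """Return the label with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior from class frequency in the training data.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen in training
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training data, two of the eight categories only.
toy_docs = [
    (["football", "match", "score"], "sports"),
    (["team", "win", "match"], "sports"),
    (["hotel", "flight", "beach"], "travel"),
    (["flight", "ticket", "tour"], "travel"),
]
toy_model = train_nb(toy_docs)
print(classify_nb(toy_model, ["match", "score"]))
```

The model is just a few count tables, which is one reason Naive Bayes is often considered a fit for memory-constrained devices.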
Procedia PDF Downloads 163
486 Classification of Potential Biomarkers in Breast Cancer Using Artificial Intelligence Algorithms and Anthropometric Datasets
Authors: Aref Aasi, Sahar Ebrahimi Bajgani, Erfan Aasi
Abstract:
Breast cancer (BC) continues to be the most frequent cancer in females and causes the highest number of cancer-related deaths in women worldwide. Inspired by recent advances in studying the relationship between different patient attributes and features and the disease, in this paper we investigate different classification methods for better diagnosis of BC in the early stages. In this regard, datasets from the University Hospital Centre of Coimbra were chosen, and different machine learning (ML)-based and neural network (NN) classifiers have been studied. For this purpose, we have selected favorable features among the nine provided attributes from the clinical dataset by using a random forest algorithm. This dataset consists of both healthy controls and BC patients, and it was noted that glucose, BMI, resistin, and age have the most importance, in that order. Moreover, we have analyzed these features with various ML-based classifier methods, including Decision Tree (DT), K-Nearest Neighbors (KNN), eXtreme Gradient Boosting (XGBoost), Logistic Regression (LR), Naive Bayes (NB), and Support Vector Machine (SVM), along with the NN-based Multi-Layer Perceptron (MLP) classifier. The results revealed that among the different techniques, the SVM and MLP classifiers have the highest accuracy, at 96% and 92%, respectively. These results indicate that the adopted procedure could be used effectively for the classification of cancer cells, and they encourage further experimental investigations with more collected data for other types of cancers.
Keywords: breast cancer, diagnosis, machine learning, biomarker classification, neural network
Procedia PDF Downloads 133
485 Charting Sentiments with Naive Bayes and Logistic Regression
Authors: Jummalla Aashrith, N. L. Shiva Sai, K. Bhavya Sri
Abstract:
The swift progress of web technology has not only amassed a vast reservoir of internet data but also triggered a substantial surge in data generation. The internet has metamorphosed into one of the dynamic hubs for online education, idea dissemination, and opinion-sharing. Notably, the widely utilized social networking platform Twitter is experiencing considerable expansion, providing users with the ability to share viewpoints, participate in discussions spanning diverse communities, and broadcast messages on a global scale. The upswing in online engagement has sparked significant interest in subjectivity analysis, particularly when it comes to Twitter data. This research is committed to delving into sentiment analysis, focusing specifically on the realm of Twitter. It aims to offer valuable insights into deciphering information within tweets, where opinions manifest in a highly unstructured and diverse manner, spanning a spectrum from positivity to negativity, occasionally punctuated by neutral expressions. Within this document, we offer a comprehensive exploration and comparative assessment of modern approaches to opinion mining. Employing a range of machine learning algorithms such as Naive Bayes and Logistic Regression, our investigation plunges into the domain of Twitter data streams. We delve into overarching challenges and applications inherent in the realm of subjectivity analysis over Twitter.
Keywords: machine learning, sentiment analysis, visualisation, python
Procedia PDF Downloads 54
484 Incorporating Information Gain in Regular Expressions Based Classifiers
Authors: Rosa L. Figueroa, Christopher A. Flores, Qing Zeng-Treitler
Abstract:
A regular expression consists of a sequence of characters that describes a text pattern. Usually, in clinical research, regular expressions are manually created by programmers together with domain experts. Lately, there have been several efforts to investigate how to generate them automatically. This article presents a text classification algorithm based on regexes. The algorithm, named REX, was designed and then implemented as a simplified method to create regexes to classify Spanish text automatically. In order to classify ambiguous cases, such as when multiple labels are assigned to a testing example, REX includes an information gain method. Two sets of data were used to evaluate the algorithm’s effectiveness in clinical text classification tasks. The results indicate that the regular expression based classifier proposed in this work performs statistically better regarding accuracy and F-measure than Support Vector Machine and Naïve Bayes for both datasets.
Keywords: information gain, regular expressions, smith-waterman algorithm, text classification
Procedia PDF Downloads 319
483 Detecting Cyberbullying, Spam and Bot Behavior and Fake News in Social Media Accounts Using Machine Learning
Authors: M. D. D. Chathurangi, M. G. K. Nayanathara, K. M. H. M. M. Gunapala, G. M. R. G. Dayananda, Kavinga Yapa Abeywardena, Deemantha Siriwardana
Abstract:
Due to the growing popularity of social media platforms at present, there are various concerns, mostly cyberbullying, spam, bot accounts, and the spread of incorrect information. To develop a risk score calculation system as a thorough method for deciphering and exposing unethical social media profiles, this research explores the algorithms that are, to the best of our knowledge, most suitable for detecting the mentioned concerns. Multiple models, such as Naïve Bayes, CNN, KNN, Stochastic Gradient Descent, Gradient Boosting Classifier, etc., were examined, and the best results were taken into the development of the risk score system. For cyberbullying, the Logistic Regression algorithm achieved an accuracy of 84.9%, while the spam-detecting MLP model gained 98.02% accuracy. For identifying bot accounts, the Random Forest algorithm obtained 91.06% accuracy, and 84% accuracy was acquired for fake news detection using SVM.
Keywords: cyberbullying, spam behavior, bot accounts, fake news, machine learning
Procedia PDF Downloads 35
482 Classification of Red, Green and Blue Values from Face Images Using k-NN Classifier to Predict the Skin or Non-Skin
Authors: Kemal Polat
Abstract:
In this study, the presence of skin has been estimated by using RGB values obtained from the camera and the k-nearest neighbor (k-NN) classifier. The dataset used in this study has an unbalanced distribution and a linearly non-separable structure. This problem can also be called a big data problem. The Skin dataset was taken from the UCI machine learning repository. As the classifier, we have used the k-NN method to handle this big data problem. The k value of the k-NN classifier was set to 1. To train and test the k-NN classifier, a 50-50% training-testing partition has been used. As the performance metrics, TP rate, FP rate, precision, recall, F-measure, and AUC values have been used to evaluate the performance of the k-NN classifier. The obtained results are as follows: 0.999, 0.001, 0.999, 0.999, 0.999, and 1.00. As can be seen from the obtained results, this proposed method could be used to predict whether the image is skin or not.
Keywords: k-NN classifier, skin or non-skin classification, RGB values, classification
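The 1-nearest-neighbor rule and some of the metrics listed in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's pipeline: the function names `predict_1nn` and `binary_metrics` are hypothetical, the RGB triples are invented toy values rather than rows from the UCI Skin dataset, and AUC is not computed.

```python
from math import dist


def predict_1nn(train_points, train_labels, query):
    """Label a query with the label of its closest training point (k = 1)."""
    nearest = min(
        range(len(train_points)),
        key=lambda i: dist(train_points[i], query),
    )
    return train_labels[nearest]


def binary_metrics(y_true, y_pred, positive=1):
    """TP rate (recall), FP rate, precision, and F-measure for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    tp_rate = tp / (tp + fn) if tp + fn else 0.0
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + tp_rate
    f_measure = 2 * precision * tp_rate / denom if denom else 0.0
    return {
        "tp_rate": tp_rate,
        "fp_rate": fp_rate,
        "precision": precision,
        "f_measure": f_measure,
    }


# Hypothetical RGB samples: 1 = skin, 0 = non-skin.
toy_points = [(200, 150, 120), (210, 160, 130), (30, 40, 200), (20, 200, 30)]
toy_labels = [1, 1, 0, 0]
print(predict_1nn(toy_points, toy_labels, (205, 155, 125)))
print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 0]))
```

With k = 1 there is no training step beyond storing the points, so prediction cost grows with the size of the stored set.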
Procedia PDF Downloads 246
481 Exploring the Role of Data Mining in Crime Classification: A Systematic Literature Review
Authors: Faisal Muhibuddin, Ani Dijah Rahajoe
Abstract:
This in-depth exploration, through a systematic literature review, scrutinizes the nuanced role of data mining in the classification of criminal activities. The research focuses on investigating various methodological aspects and recent developments in leveraging data mining techniques to enhance the effectiveness and precision of crime categorization. Commencing with an exposition of the foundational concepts of crime classification and its evolutionary dynamics, this study details the paradigm shift from conventional methods towards approaches supported by data mining, addressing the challenges and complexities inherent in the modern crime landscape. Specifically, the research delves into various data mining techniques, including K-means clustering, Naïve Bayes, K-nearest neighbour, and clustering methods. A comprehensive review of the strengths and limitations of each technique provides insights into their respective contributions to improving crime classification models. The integration of diverse data sources takes centre stage in this research. A detailed analysis explores how the amalgamation of structured data (such as criminal records) and unstructured data (such as social media) can offer a holistic understanding of crime, enriching classification models with more profound insights. Furthermore, the study explores the temporal implications in crime classification, emphasizing the significance of considering temporal factors to comprehend long-term trends and seasonality. The availability of real-time data is also elucidated as a crucial element in enhancing responsiveness and accuracy in crime classification.
Keywords: data mining, classification algorithm, naïve bayes, k-means clustering, k-nearest neighbor, crime, data analysis, systematic literature review
Procedia PDF Downloads 61
480 An Estimating Parameter of the Mean in Normal Distribution by Maximum Likelihood, Bayes, and Markov Chain Monte Carlo Methods
Authors: Autcha Araveeporn
Abstract:
This paper compares the estimation of the mean of a normal distribution by Maximum Likelihood (ML), Bayes, and Markov Chain Monte Carlo (MCMC) methods. The ML estimator is the average of the data, the Bayes estimator is derived from the prior distribution, and the MCMC estimator is approximated by Gibbs sampling from the posterior distribution. After these methods estimate the parameter, hypothesis testing is used to check the robustness of the estimators. Data are simulated from a normal distribution with a true mean of 2 and variances of 4, 9, and 16, with sample sizes set to 10, 20, 30, and 50. From the results, it can be seen that the ML and MCMC estimates are perceivably different from the true parameter when the sample size is 10 and 20 with variance 16. Furthermore, the Bayes estimator, estimated from a prior distribution with mean 1 and variance 12, showed a significant difference in mean with variance 9 at sample sizes 10 and 20.
Keywords: Bayes method, Markov chain Monte Carlo method, maximum likelihood method, normal distribution
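Two of the estimators named in the abstract above can be sketched in a few lines. This is an illustration under assumptions the abstract does not state: the prior on the mean is taken to be normal and the data variance is treated as known, which gives the standard conjugate posterior mean. The function names and the three-point sample are hypothetical, and the Gibbs sampling (MCMC) estimator is not covered.

```python
from statistics import fmean


def mle_mean(data):
    """Maximum likelihood estimate of a normal mean: the sample average."""
    return fmean(data)


def bayes_posterior_mean(data, sigma2, prior_mean, prior_var):
    """Posterior mean of mu under a N(prior_mean, prior_var) prior.

    Assumes the data variance sigma2 is known (conjugate normal model).
    """
    n = len(data)
    prior_precision = 1.0 / prior_var
    data_precision = n / sigma2
    weighted = prior_precision * prior_mean + data_precision * fmean(data)
    return weighted / (prior_precision + data_precision)


# Hypothetical toy sample; prior mean 1 and prior variance 12 follow the abstract.
sample = [1.0, 2.0, 3.0]
print(mle_mean(sample))
print(bayes_posterior_mean(sample, sigma2=4.0, prior_mean=1.0, prior_var=12.0))
```

The posterior mean is a precision-weighted average of the prior mean and the sample mean, so with small samples and large data variance the prior pulls the estimate further from the sample average.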
Procedia PDF Downloads 355
479 Use of Fractal Geometry in Machine Learning
Authors: Fuad M. Alkoot
Abstract:
The main component of a machine learning system is the classifier. Classifiers are mathematical models that can perform classification tasks for a specific application area. Additionally, many classifiers are combined using any of the available methods to reduce the classifier error rate. The benefits gained from the combination of multiple classifier designs have motivated the development of diverse approaches to multiple classifiers. We aim to investigate using fractal geometry to develop an improved classifier combiner. Initially, we experiment with measuring the fractal dimension of data and use the results in the development of a combiner strategy.
Keywords: fractal geometry, machine learning, classifier, fractal dimension
Procedia PDF Downloads 213
478 Evaluation of Gesture-Based Password: User Behavioral Features Using Machine Learning Algorithms
Authors: Lakshmidevi Sreeramareddy, Komalpreet Kaur, Nane Pothier
Abstract:
Graphical-based passwords have existed for decades. Their major advantage is that they are easier to remember than an alphanumeric password. However, their disadvantage (especially for recognition-based passwords) is the smaller password space, making them more vulnerable to brute force attacks. Graphical passwords are also highly susceptible to the shoulder-surfing effect. The gesture-based password method that we developed is a grid-free, template-free method. In this study, we evaluated the gesture-based passwords for usability and vulnerability. The results of the study are significant. We developed a gesture-based password application for data collection. Two modes of data collection were used: Creation mode and Replication mode. In creation mode (Session 1), users were asked to create six different passwords and reenter each password five times. In replication mode, users saw a password image created by some other user for a fixed duration of time. Three timer durations, 5 seconds (Session 2), 10 seconds (Session 3), and 15 seconds (Session 4), were used to mimic the shoulder-surfing attack. After the timer expired, the password image was removed, and users were asked to replicate the password. A total of 74, 57, 50, and 44 users participated in Session 1, Session 2, Session 3, and Session 4, respectively. In this study, machine learning algorithms have been applied to determine whether the person is a genuine user or an imposter based on the password entered. Five different machine learning algorithms were deployed to compare the performance in user authentication: Decision Trees, Linear Discriminant Analysis, Naive Bayes Classifier, Support Vector Machines (SVMs) with Gaussian Radial Basis Kernel function, and K-Nearest Neighbor. Gesture-based password features vary from one entry to the next, so it is difficult to distinguish between a creator and an intruder for authentication.
For each password entered by the user, four features were extracted: password score, password length, password speed, and password size. All four features were normalized before being fed to a classifier. Three different classifiers were trained using data from all four sessions. Classifiers A, B, and C were trained and tested using data from the password creation session and the password replication with a timer of 5 seconds, 10 seconds, and 15 seconds, respectively. The classification accuracies for Classifier A using five ML algorithms are 72.5%, 71.3%, 71.9%, 74.4%, and 72.9%, respectively. The classification accuracies for Classifier B using five ML algorithms are 69.7%, 67.9%, 70.2%, 73.8%, and 71.2%, respectively. The classification accuracies for Classifier C using five ML algorithms are 68.1%, 64.9%, 68.4%, 71.5%, and 69.8%, respectively. SVMs with Gaussian Radial Basis Kernel outperform other ML algorithms for gesture-based password authentication. Results confirm that the shorter the duration of the shoulder-surfing attack, the higher the authentication accuracy. In conclusion, behavioral features extracted from the gesture-based passwords lead to less vulnerable user authentication.
Keywords: authentication, gesture-based passwords, machine learning algorithms, shoulder-surfing attacks, usability
Procedia PDF Downloads 102
477 Early Stage Suicide Ideation Detection Using Supervised Machine Learning and Neural Network Classifier
Authors: Devendra Kr Tayal, Vrinda Gupta, Aastha Bansal, Khushi Singh, Sristi Sharma, Hunny Gaur
Abstract:
In today's world, suicide is a serious problem. In order to save lives, early suicide attempt detection and prevention should be addressed. A good number of at-risk people use social media platforms to talk about their issues or seek information on related matters. Twitter and Reddit are two of the most common platforms that are used for expressing oneself. Extensive research has already been done in this field. Through supervised classification techniques like Naïve Bayes, Bernoulli Naïve Bayes, and Multi-Layer Perceptron on a Reddit dataset, we demonstrate the early recognition of suicidal ideation. We also performed a comparative analysis of these approaches and used accuracy, recall score, F1 score, and precision score for analysis.
Keywords: machine learning, suicide ideation detection, supervised classification, natural language processing
Procedia PDF Downloads 89
476 A Comparative Analysis of Classification Models with Wrapper-Based Feature Selection for Predicting Student Academic Performance
Authors: Abdullah Al Farwan, Ya Zhang
Abstract:
In today’s educational arena, it is critical to understand educational data and be able to evaluate important aspects, particularly data on student achievement. Educational Data Mining (EDM) is a research area that focuses on uncovering patterns and information in data from educational institutions. Teachers, if they are able to predict their students' class performance, can use this information to improve their teaching abilities. It has evolved into valuable knowledge that can be used for a wide range of objectives; for example, a strategic plan can be used to generate high-quality education. Based on previous data, this paper recommends employing data mining techniques to forecast students' final grades. In this study, five data mining methods, Decision Tree, JRip, Naive Bayes, Multi-layer Perceptron, and Random Forest with wrapper feature selection, were used on two datasets relating to Portuguese language and mathematics classes. The results showed the effectiveness of using data mining learning methodologies in predicting student academic success. The classification accuracy achieved with selected algorithms lies in the range of 80-94%. Among all the selected classification algorithms, the lowest accuracy is achieved by the Multi-layer Perceptron algorithm, which is close to 70.45%, and the highest accuracy is achieved by the Random Forest algorithm, which is close to 94.10%. This proposed work can assist educational administrators in identifying poorly performing students at an early stage and perhaps implementing motivational interventions to improve their academic success and prevent educational dropout.
Keywords: classification algorithms, decision tree, feature selection, multi-layer perceptron, Naïve Bayes, random forest, students’ academic performance
Procedia PDF Downloads 165
475 Detection and Classification of Myocardial Infarction Using New Extracted Features from Standard 12-Lead ECG Signals
Authors: Naser Safdarian, Nader Jafarnia Dabanloo
Abstract:
In this paper, we used four features, i.e., Q-wave integral, QRS complex integral, T-wave integral, and total integral, extracted from normal and patient ECG signals for the detection and localization of myocardial infarction (MI) in the left ventricle of the heart. Our research focused on the detection and localization of MI in standard ECG. We use the Q-wave integral and T-wave integral because these features play an important role in the detection of MI. We used pattern recognition methods such as the Artificial Neural Network (ANN) to detect and localize the MI, because these methods have good accuracy for the classification of normal and abnormal signals. We used one type of Radial Basis Function (RBF) network called the Probabilistic Neural Network (PNN) because of its nonlinearity property, and used other classifiers such as k-Nearest Neighbors (KNN), Multilayer Perceptron (MLP), and Naive Bayes classification. We used the PhysioNet database as our training and test data. We reached over 80% accuracy on test data for localization and over 95% for detection of MI. The main advantages of our method are its simplicity and good accuracy. We can also improve classification accuracy by adding more features to this method. A simple method based on only four features extracted from standard ECG is presented, which has good accuracy in MI localization.
Keywords: ECG signal processing, myocardial infarction, features extraction, pattern recognition
Procedia PDF Downloads 453
474 Advancements in Predicting Diabetes Biomarkers: A Machine Learning Epigenetic Approach
Authors: James Ladzekpo
Abstract:
Background: The urgent need to identify new pharmacological targets for diabetes treatment and prevention has been amplified by the disease's extensive impact on individuals and healthcare systems. A deeper insight into the biological underpinnings of diabetes is crucial for the creation of therapeutic strategies aimed at these biological processes. Current predictive models based on genetic variations fall short of accurately forecasting diabetes. Objectives: Our study aims to pinpoint key epigenetic factors that predispose individuals to diabetes. These factors will inform the development of an advanced predictive model that estimates diabetes risk from genetic profiles, utilizing state-of-the-art statistical and data mining methods. Methodology: We have implemented a recursive feature elimination with cross-validation using the support vector machine (SVM) approach for refined feature selection. Building on this, we developed six machine learning models, including logistic regression, k-Nearest Neighbors (k-NN), Naive Bayes, Random Forest, Gradient Boosting, and Multilayer Perceptron Neural Network, to evaluate their performance. Findings: The Gradient Boosting Classifier excelled, achieving a median recall of 92.17% and outstanding metrics such as area under the receiver operating characteristics curve (AUC) with a median of 68%, alongside median accuracy and precision scores of 76%. Through our machine learning analysis, we identified 31 genes significantly associated with diabetes traits, highlighting their potential as biomarkers and targets for diabetes management strategies. Conclusion: Particularly noteworthy were the Gradient Boosting Classifier and Multilayer Perceptron Neural Network, which demonstrated potential in diabetes outcome prediction. 
We recommend future investigations to incorporate larger cohorts and a wider array of predictive variables to enhance the models' predictive capabilities.
Keywords: diabetes, machine learning, prediction, biomarkers
Procedia PDF Downloads 53
473 Speaker Recognition Using LIRA Neural Networks
Authors: Nestor A. Garcia Fragoso, Tetyana Baydyk, Ernst Kussul
Abstract:
This article contains information from our investigation in the field of voice recognition. For this purpose, we created a voice database that contains different phrases in two languages, English and Spanish, for men and women. As a classifier, the LIRA (Limited Receptive Area) grayscale neural classifier was selected. The LIRA grayscale neural classifier was developed for image recognition tasks and demonstrated good results. Therefore, we decided to develop a recognition system using this classifier for voice recognition. From a specific set of speakers, we can recognize the speaker’s voice. For this purpose, the system uses spectrograms of the voice signals as input, extracts the characteristics, and identifies the speaker. The results are described and analyzed in this article. The classifier can be used for speaker identification in security systems or smart buildings for different types of intelligent devices.
Keywords: extreme learning, LIRA neural classifier, speaker identification, voice recognition
Procedia PDF Downloads 176
472 A Decision Support System to Detect the Lumbar Disc Disease on the Basis of Clinical MRI
Authors: Yavuz Unal, Kemal Polat, H. Erdinc Kocer
Abstract:
In this study, a decision support system comprising three stages has been proposed to detect the disc abnormalities of the lumbar region. In the first stage, named feature extraction, T2-weighted sagittal and axial Magnetic Resonance Images (MRI) were taken from 55 people, and then 27 appearance and shape features were acquired from both sagittal and transverse images. In the second stage, named the feature weighting process, k-means clustering based feature weighting (KMCBFW), proposed by Gunes et al., was applied. Finally, in the third stage, named the classification process, classifier algorithms including multi-layer perceptron (MLP neural network), support vector machine (SVM), Naïve Bayes, and decision tree have been used to classify whether the subject has lumbar disc disease or not. In order to test the performance of the proposed method, the classification accuracy (%), sensitivity, specificity, precision, recall, f-measure, kappa value, and computation times have been used. The best hybrid model for detecting lumbar disc disease based on both sagittal and axial MR images is the combination of k-means clustering based feature weighting and decision tree.
Keywords: lumbar disc abnormality, lumbar MRI, lumbar spine, hybrid models, hybrid features, k-means clustering based feature weighting
Procedia PDF Downloads 517
471 A Machine Learning Model for Predicting Students’ Academic Performance in Higher Institutions
Authors: Emmanuel Osaze Oshoiribhor, Adetokunbo MacGregor John-Otumu
Abstract:
There has been a need in recent years to predict student academic achievement prior to graduation. This is to assist them in improving their grades, especially for those who have struggled in the past. The purpose of this research is to use supervised learning techniques to create a model that predicts student academic progress. Many scholars have developed models that predict student academic achievement based on characteristics including smoking, demography, culture, social media, parent educational background, parent finances, and family background, to mention a few. These features, as well as the models used, could have misclassified the students in terms of their academic achievement. As a prerequisite to predicting whether the student will perform well in the future on related courses, this model is built using a logistic regression classifier with basic features such as the previous semester's course score, class attendance, class participation, and the total number of course materials or resources the student is able to cover per semester. With 96.7 percent accuracy, the model outperformed other classifiers such as Naive Bayes, Support Vector Machine (SVM), Decision Tree, Random Forest, and AdaBoost. This model is offered as a desktop application with user-friendly interfaces for forecasting student academic progress for both teachers and students. As a result, both students and professors are encouraged to use this technique to predict outcomes better.
Keywords: artificial intelligence, ML, logistic regression, performance, prediction
Procedia PDF Downloads 108
470 Landslide Susceptibility Mapping Using Soft Computing in Amhara Saint
Authors: Semachew M. Kassa, Africa M Geremew, Tezera F. Azmatch, Nandyala Darga Kumar
Abstract:
Frequency ratio (FR) and analytical hierarchy process (AHP) methods are developed based on past landslide failure points to identify the landslide susceptibility mapping because landslides can seriously harm both the environment and society. However, it is still difficult to select the most efficient method and correctly identify the main driving factors for particular regions. In this study, we used fourteen landslide conditioning factors (LCFs) and five soft computing algorithms, including Random Forest (RF), Support Vector Machine (SVM), Logistic Regression (LR), Artificial Neural Network (ANN), and Naïve Bayes (NB), to predict the landslide susceptibility at 12.5 m spatial scale. The performance of the RF (F1-score: 0.88, AUC: 0.94), ANN (F1-score: 0.85, AUC: 0.92), and SVM (F1-score: 0.82, AUC: 0.86) methods was significantly better than the LR (F1-score: 0.75, AUC: 0.76) and NB (F1-score: 0.73, AUC: 0.75) method, according to the classification results based on inventory landslide points. The findings also showed that around 35% of the study region was made up of places with high and very high landslide risk (susceptibility greater than 0.5). The very high-risk locations were primarily found in the western and southeastern regions, and all five models showed good agreement and similar geographic distribution patterns in landslide susceptibility. The towns with the highest landslide risk include Amhara Saint Town's western part, the Northern part, and St. Gebreal Church villages, with mean susceptibility values greater than 0.5. However, rainfall, distance to road, and slope were typically among the top leading factors for most villages. The primary contributing factors to landslide vulnerability were slightly varied for the five models. Decision-makers and policy planners can use the information from our study to make informed decisions and establish policies. 
It also suggests that various places should take different safeguards to reduce or prevent serious damage from landslide events.
Keywords: artificial neural network, logistic regression, landslide susceptibility, naïve Bayes, random forest, support vector machine
Procedia PDF Downloads 79
469 Logistic Regression Based Model for Predicting Students’ Academic Performance in Higher Institutions
Authors: Emmanuel Osaze Oshoiribhor, Adetokunbo MacGregor John-Otumu
Abstract:
In recent years, there has been a desire to forecast student academic achievement prior to graduation. This is to help students improve their grades, particularly those with poor performance. The goal of this study is to employ supervised learning techniques to construct a predictive model for student academic achievement. Many academics have already constructed models that predict student academic achievement based on factors such as smoking, demography, culture, social media, parent educational background, parent finances, and family background, to name a few. These features and the models employed may not have correctly classified the students in terms of their academic performance. The model in this study is built using a logistic regression classifier with basic features such as the previous semester's course score, class attendance, class participation, and the total number of course materials or resources the student is able to cover per semester as a prerequisite to predict whether the student will perform well in the future on related courses. The model outperformed other classifiers such as Naive Bayes, Support Vector Machine (SVM), Decision Tree, Random Forest, and AdaBoost, returning a 96.7% accuracy. This model is available as a desktop application, allowing both instructors and students to benefit from user-friendly interfaces for predicting student academic achievement. As a result, it is recommended that both students and professors use this tool to better forecast outcomes.
Keywords: artificial intelligence, ML, logistic regression, performance, prediction
Procedia PDF Downloads 96
468 Artificial Intelligence Assisted Sentiment Analysis of Hotel Reviews Using Topic Modeling
Authors: Sushma Ghogale
Abstract:
With a surge in user-generated content, feedback, and reviews on the internet, it has become possible and important to know consumers' opinions about products and services. This data is important for both potential customers and businesses providing the services. Data from social media is attracting significant attention and has become the most prominent channel for expressing unregulated opinion. Prospective customers look for reviews from experienced customers before deciding to buy a product or service. Several websites provide a platform for users to post their feedback for the provider and potential customers. However, the biggest challenge in analyzing such data is extracting latent features and providing term-level analysis of the data. This paper proposes an approach that uses topic modeling to classify the reviews into topics and conducts sentiment analysis to mine the opinions. This approach can analyse and classify latent topics mentioned by reviewers on business sites, review sites, or social media, using topic modeling to identify the importance of each topic. It is followed by sentiment analysis to assess the satisfaction level of each topic. This approach classifies hotel reviews using multiple machine learning techniques and compares different classifiers to mine the opinions of user reviews through sentiment analysis. The experiment concludes that the Multinomial Naïve Bayes classifier produces higher accuracy than the other classifiers.
Keywords: latent Dirichlet allocation, topic modeling, text classification, sentiment analysis
Procedia PDF Downloads 96
467 Measuring Multi-Class Linear Classifier for Image Classification
Authors: Fatma Susilawati Mohamad, Azizah Abdul Manaf, Fadhillah Ahmad, Zarina Mohamad, Wan Suryani Wan Awang
Abstract:
A simple and robust multi-class linear classifier is proposed and implemented. For a pair of classes, the linear boundary is a collection of segments of hyperplanes created as perpendicular bisectors of line segments linking centroids of the classes or parts of classes. Nearest Neighbor and Linear Discriminant Analysis are compared in the experiments to see the performance of each classifier in discriminating the ripeness of oil palm. This paper proposes a multi-class linear classifier using Linear Discriminant Analysis (LDA) for image identification. The results show that LDA is capable of separating multi-class features for ripeness identification.
Keywords: multi-class, linear classifier, nearest neighbor, linear discriminant analysis
Procedia PDF Downloads 536
466 An Automatic Bayesian Classification System for File Format Selection
Authors: Roman Graf, Sergiu Gordea, Heather M. Ryan
Abstract:
This paper presents an approach for the classification of an unstructured format description for identification of file formats. The main contribution of this work is the employment of data mining techniques to support file format selection with just the unstructured text description that comprises the most important format features for a particular organisation. Subsequently, the file format identification method employs a file format classifier and associated configurations to support digital preservation experts with an estimation of the required file format. Our goal is to make use of a format specification knowledge base aggregated from different Web sources in order to select a file format for a particular institution. Using the naive Bayes method, the decision support system recommends to an expert the file format for the expert's institution. The proposed methods facilitate the selection of file format and the quality of a digital preservation process. The presented approach is meant to facilitate decision making for the preservation of digital content in libraries and archives using domain expert knowledge and specifications of file formats. To facilitate decision-making, the aggregated information about the file formats is presented as a file format vocabulary that comprises the most common terms that are characteristic for all researched formats. The goal is to suggest a particular file format based on this vocabulary for analysis by an expert. The sample file format calculation and the calculation results, including probabilities, are presented in the evaluation section.
Keywords: data mining, digital libraries, digital preservation, file format
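As an illustration of how a naive Bayes recommendation over a term vocabulary can produce per-format probabilities, here is a minimal sketch in Python. It is not the authors' system: the training descriptions, the format labels, and the helper names `train` and `posteriors` are invented for this example, and add-one (Laplace) smoothing is an assumed design choice.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """Count documents per class and word occurrences per class.

    samples: list of (description_text, format_label) pairs (toy data here).
    """
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        tokens = text.lower().split()
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def posteriors(model, text):
    """Return a dict of normalised posterior probabilities, one per class."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    log_scores = {}
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        # Log prior of the class.
        score = math.log(class_docs[label] / total_docs)
        for tok in text.lower().split():
            if tok not in vocab:
                continue  # words outside the vocabulary are ignored
            # Add-one smoothed likelihood of the word given the class.
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    # Normalise in log space to avoid underflow.
    top = max(log_scores.values())
    norm = sum(math.exp(s - top) for s in log_scores.values())
    return {label: math.exp(s - top) / norm for label, s in log_scores.items()}

# Hypothetical descriptions, made up purely for illustration.
samples = [
    ("lossless raster image compression", "TIFF"),
    ("lossy raster image photo", "JPEG"),
    ("page layout text document archival", "PDF/A"),
]
model = train(samples)
probs = posteriors(model, "archival text document")
best = max(probs, key=probs.get)
print(best, probs)
```

The expert would see the full probability dictionary rather than only the top label, which matches the abstract's description of presenting results "including probabilities" for expert review.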
Procedia PDF Downloads 497
465 Optimization of Hate Speech and Abusive Language Detection on Indonesian-language Twitter using Genetic Algorithms
Authors: Rikson Gultom
Abstract:
Hate speech and abusive language on social media are difficult to detect. Usually they are detected only after they have gone viral in cyberspace, by which point it is too late for prevention. An early detection system with fairly good accuracy is needed to reduce conflicts in society caused by social media postings that attack individuals, groups, and governments in Indonesia. The purpose of this study is to find, among several machine learning methods, an early detection model for Twitter that has high accuracy. In this study, the Support Vector Machine (SVM), Naïve Bayes (NB), and Random Forest Decision Tree (RFDT) methods were compared with the Support Vector Machine with Genetic Algorithm (SVM-GA), Naïve Bayes with Genetic Algorithm (NB-GA), and Random Forest Decision Tree with Genetic Algorithm (RFDT-GA). The study produced a comparison table for the accuracy of the hate speech and abusive language detection models, presented it as a graph of the accuracy of the six algorithms developed on the Indonesian-language Twitter dataset, and identified the model with the highest accuracy.
Keywords: abusive language, hate speech, machine learning, optimization, social media
Procedia PDF Downloads 126
464 Evaluation of Machine Learning Algorithms and Ensemble Methods for Prediction of Students’ Graduation
Authors: Soha A. Bahanshal, Vaibhav Verdhan, Bayong Kim
Abstract:
Graduation rates at six-year colleges are becoming a more essential indicator for incoming fresh students and for university rankings. Predicting student graduation is extremely beneficial to schools and has a huge potential for targeted intervention. It is important for educational institutions since it enables the development of strategic plans that will assist or improve students' performance in graduating on time (GOT). Machine learning techniques offer a first step and a helping hand in extracting useful information from these data and gaining insights into the prediction of students' progress and performance. Data analysis and visualization techniques are applied to understand and interpret the data. The data used for the analysis contains students who graduated in 6 years in the academic year 2017-2018 for science majors. This analysis can be used to predict the graduation of students in the next academic year. Different predictive models such as logistic regression, decision trees, support vector machines, Random Forest, Naïve Bayes, and KNeighborsClassifier are applied to predict whether a student will graduate. These classifiers were evaluated with 5-fold cross-validation. The performance of these classifiers was compared based on accuracy. The results indicated that the ensemble classifier achieves better accuracy, about 91.12%. This GOT prediction model would hopefully be useful to university administration and academics in developing measures for assisting and boosting students' academic performance and ensuring they graduate on time.
Keywords: prediction, decision trees, machine learning, support vector machine, ensemble model, student graduation, graduate on time (GOT)
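For readers unfamiliar with evaluating classifiers over five folds, the splitting step can be sketched as follows. This is a generic illustration, not the authors' code: the function name `k_fold_indices` is hypothetical, and the sketch assumes contiguous, unshuffled folds (real studies usually shuffle or stratify first).

```python
def k_fold_indices(n_samples, k=5):
    """Split indices 0..n_samples-1 into k contiguous folds.

    Returns a list of (train_indices, test_indices) pairs, one per fold.
    The first (n_samples % k) folds receive one extra sample.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    folds = []
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds

folds = k_fold_indices(12, k=5)
for train_idx, test_idx in folds:
    # Each classifier would be fitted on train_idx and scored on test_idx;
    # the reported accuracy is then the mean over the k test folds.
    print(len(train_idx), test_idx)
```

Every sample is used for testing exactly once, so the averaged accuracy reflects performance on data each model did not see during fitting.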
Procedia PDF Downloads 70
463 A Comparative Analysis of Global Minimum Variance and Naïve Portfolios: Performance across Stock Market Indices and Selected Economic Regimes Using Various Risk-Return Metrics
Authors: Lynmar M. Didal, Ramises G. Manzano Jr., Jacque Bon-Isaac C. Aboy
Abstract:
This study analyzes the performance of global minimum variance (GMV) and naive portfolios across different economic periods, using monthly stock returns from the Philippine Stock Exchange Index (PSEI), S&P 500, and Dow Jones Industrial Average (DOW). The performance is evaluated through the Sharpe ratio, Sortino ratio, Jensen’s Alpha, Treynor ratio, and Information ratio. Additionally, the study investigates the impact of short selling on portfolio performance. Six time periods are defined for analysis, encompassing events such as the global financial crisis (GFC) and the COVID-19 pandemic. Findings indicate that the naive portfolio generally outperforms the GMV portfolio in the S&P 500, signifying higher returns with increased volatility. Conversely, in the PSEI and DOW, the GMV portfolio shows more efficient risk-adjusted returns. Short selling significantly impacts the GMV portfolio during mid-GFC and mid-COVID periods. The study offers insights for investors, suggesting the naive portfolio for higher risk tolerance and the GMV portfolio as a conservative alternative.
Keywords: portfolio performance, global minimum variance, naïve portfolio, risk-adjusted metrics, short-selling
Procedia PDF Downloads 93
462 Bayes Estimation of Parameters of Binomial Type Rayleigh Class Software Reliability Growth Model using Non-informative Priors
Authors: Rajesh Singh, Kailash Kale
Abstract:
In this paper, the Binomial process type occurrence of software failures is considered, and failure intensity is characterized by a one-parameter Rayleigh class Software Reliability Growth Model (SRGM). The proposed SRGM is a mathematical function of two parameters, namely the total number of failures, η₀, and the scale parameter, η₁. It is assumed that very little or no information is available about these parameters. Considering non-informative priors for both, the Bayes estimators for η₀ and η₁ are obtained under the squared error loss function. The proposed Bayes estimators are compared with their corresponding maximum likelihood estimators on the basis of risk efficiencies obtained by Monte Carlo simulation. It is concluded that both proposed Bayes estimators, of the total number of failures and of the scale parameter, perform well for a proper choice of execution time.
Keywords: binomial process, non-informative prior, maximum likelihood estimator (MLE), Rayleigh class, software reliability growth model (SRGM)
Procedia PDF Downloads 387
461 A Bayesian Classification System for Facilitating an Institutional Risk Profile Definition
Authors: Roman Graf, Sergiu Gordea, Heather M. Ryan
Abstract:
This paper presents an approach for easy creation and classification of institutional risk profiles supporting endangerment analysis of file formats. The main contribution of this work is the employment of data mining techniques to support the setup of the most important risk factors. Subsequently, risk profiles employ a risk factor classifier and associated configurations to support digital preservation experts with a semi-automatic estimation of the endangerment group for file format risk profiles. Our goal is to make use of an expert knowledge base, acquired through a digital preservation survey, in order to detect preservation risks for a particular institution. Another contribution is support for visualisation of risk factors for a required dimension for analysis. Using the naive Bayes method, the decision support system recommends to an expert the matching risk profile group for the previously selected institutional risk profile. The proposed methods improve the visibility of risk factor values and the quality of a digital preservation process. The presented approach is designed to facilitate decision making for the preservation of digital content in libraries and archives using domain expert knowledge and values of file format risk profiles. To facilitate decision-making, the aggregated information about the risk factors is presented as a multidimensional vector. The goal is to visualise particular dimensions of this vector for analysis by an expert and to define its profile group. The sample risk profile calculation and the visualisation of some risk factor dimensions are presented in the evaluation section.
Keywords: linked open data, information integration, digital libraries, data mining
Procedia PDF Downloads 422
460 Tongue Image Retrieval Based Using Machine Learning
Authors: Ahmad FAROOQ, Xinfeng Zhang, Fahad Sabah, Raheem Sarwar
Abstract:
In Traditional Chinese Medicine (TCM), tongue diagnosis is a vital inspection tool. In this study, we explore the potential of machine learning in tongue diagnosis. The study begins with the cataloguing of the various classifications and characteristics of the human tongue. We infer 24 kinds of tongues from the material and coating of the tongue, and we identify 21 attributes of the tongue. The next step is to apply machine learning methods to the tongue dataset. We use the Weka machine learning platform to conduct the experiment for performance analysis. The 457 instances of the tongue dataset are used to test the performance of five different machine learning methods, including SVM, Random Forests, Decision Trees, and Naive Bayes. Based on accuracy and Area Under the ROC Curve (AUC), the Support Vector Machine algorithm was shown to be the most effective for tongue diagnosis.
Keywords: medical imaging, image retrieval, machine learning, tongue
Procedia PDF Downloads 79
459 Prediction of MicroRNA-Target Gene by Machine Learning Algorithms in Lung Cancer Study
Authors: Nilubon Kurubanjerdjit, Nattakarn Iam-On, Ka-Lok Ng
Abstract:
MicroRNAs are small non-coding RNAs found in many different species. They play crucial roles in cancer-related biological processes such as apoptosis and proliferation. The identification of microRNA-target genes can be an essential first step towards revealing the role of microRNAs in various cancer types. In this paper, we predict miRNA-target genes for lung cancer by integrating prediction scores from the miRanda and PITA algorithms, used as a feature vector of miRNA-target interaction. Then, machine-learning algorithms were implemented to make a final prediction. The approach developed in this study should be of value for future studies into understanding the role of miRNAs in the molecular mechanisms enabling lung cancer formation.
Keywords: microRNA, miRNAs, lung cancer, machine learning, Naïve Bayes, SVM
Procedia PDF Downloads 397
458 Data Mining Model for Predicting the Status of HIV Patients during Drug Regimen Change
Authors: Ermias A. Tegegn, Million Meshesha
Abstract:
Human Immunodeficiency Virus and Acquired Immunodeficiency Syndrome (HIV/AIDS) is a major cause of death in most African countries. Ethiopia is one of the seriously affected countries in sub-Saharan Africa. Previously in Ethiopia, having HIV/AIDS was almost equivalent to a death sentence. With the introduction of Antiretroviral Therapy (ART), HIV/AIDS has become a chronic but manageable disease. The study focused on a data mining technique to predict the future living status of HIV/AIDS patients at the time of drug regimen change, when patients develop toxicity to their current ART drug combination. The data is taken from the University of Gondar Hospital ART program database. A hybrid methodology is followed to explore the application of data mining on the ART program dataset. Data cleaning, handling of missing values, and data transformation were used for preprocessing the data. WEKA 3.7.9 data mining tools, classification algorithms, and expertise are utilized as means to address the research problem. By using four different classification algorithms (J48 classifier, PART rule induction, Naïve Bayes, and neural network) and adjusting their parameters, thirty-two models were built on the pre-processed University of Gondar ART program dataset. The performances of the models were evaluated using the standard metrics of accuracy, precision, recall, and F-measure. The most effective model for predicting the status of HIV patients with drug regimen substitution is the pruned J48 decision tree, with a classification accuracy of 98.01%. This study extracts interesting attributes such as Ever taking Cotrim, Ever taking TbRx, CD4 count, Age, Weight, and Gender to predict the status of drug regimen substitution. The outcome of this study can be used as an assistive tool to help clinicians make more appropriate drug regimen substitutions.
Future research directions are proposed toward an applicable system in the area of the study.
Keywords: HIV drug regimen, data mining, hybrid methodology, predictive model
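The four evaluation metrics named in this abstract (accuracy, precision, recall, and F-measure) can be computed from predicted and true labels as in the following sketch. The labels below are invented toy values, not the study's data, and the function name `binary_metrics` is hypothetical; the sketch assumes a binary outcome with one class treated as positive.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F-measure for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }

# Toy labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Reporting precision and recall alongside accuracy matters for clinical datasets, where one outcome class is often much rarer than the other and accuracy alone can look high for a model that misses most cases of the rare class.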
Procedia PDF Downloads 141