Search results for: supervised classification
1811 The Classification Performance in Parametric and Nonparametric Discriminant Analysis for a Class- Unbalanced Data of Diabetes Risk Groups
Authors: Lily Ingsrisawang, Tasanee Nacharoen
Abstract:
Introduction: The problems of unbalanced data sets generally appear in real world applications. Due to unequal class distribution, many research papers found that the performance of existing classifier tends to be biased towards the majority class. The k -nearest neighbors’ nonparametric discriminant analysis is one method that was proposed for classifying unbalanced classes with good performance. Hence, the methods of discriminant analysis are of interest to us in investigating misclassification error rates for class-imbalanced data of three diabetes risk groups. Objective: The purpose of this study was to compare the classification performance between parametric discriminant analysis and nonparametric discriminant analysis in a three-class classification application of class-imbalanced data of diabetes risk groups. Methods: Data from a healthy project for 599 staffs in a government hospital in Bangkok were obtained for the classification problem. The staffs were diagnosed into one of three diabetes risk groups: non-risk (90%), risk (5%), and diabetic (5%). The original data along with the variables; diabetes risk group, age, gender, cholesterol, and BMI was analyzed and bootstrapped up to 50 and 100 samples, 599 observations per sample, for additional estimation of misclassification error rate. Each data set was explored for the departure of multivariate normality and the equality of covariance matrices of the three risk groups. Both the original data and the bootstrap samples show non-normality and unequal covariance matrices. The parametric linear discriminant function, quadratic discriminant function, and the nonparametric k-nearest neighbors’ discriminant function were performed over 50 and 100 bootstrap samples and applied to the original data. In finding the optimal classification rule, the choices of prior probabilities were set up for both equal proportions (0.33: 0.33: 0.33) and unequal proportions with three choices of (0.90:0.05:0.05), (0.80: 0.10: 0.10) or (0.70, 0.15, 0.15). Results: The results from 50 and 100 bootstrap samples indicated that the k-nearest neighbors approach when k = 3 or k = 4 and the prior probabilities of {non-risk:risk:diabetic} as {0.90:0.05:0.05} or {0.80:0.10:0.10} gave the smallest error rate of misclassification. Conclusion: The k-nearest neighbors approach would be suggested for classifying a three-class-imbalanced data of diabetes risk groups.Keywords: error rate, bootstrap, diabetes risk groups, k-nearest neighbors
Procedia PDF Downloads 4351810 Predictive Spectral Lithological Mapping, Geomorphology and Geospatial Correlation of Structural Lineaments in Bornu Basin, Northeast Nigeria
Authors: Aminu Abdullahi Isyaku
Abstract:
Semi-arid Bornu basin in northeast Nigeria is characterised with flat topography, thick cover sediments and lack of continuous bedrock outcrops discernible for field geology. This paper presents the methodology for the characterisation of neotectonic surface structures and surface lithology in the north-eastern Bornu basin in northeast Nigeria as an alternative approach to field geological mapping using free multispectral Landsat 7 ETM+, SRTM DEM and ASAR Earth Observation datasets. Spectral lithological mapping herein developed utilised spectral discrimination of the surface features identified on Landsat 7 ETM+ images to infer on the lithology using four steps including; computations of band combination images; band ratio images; supervised image classification and inferences of the lithological compositions. Two complementary approaches to lineament mapping are carried out in this study involving manual digitization and automatic lineament extraction to validate the structural lineaments extracted from the Landsat 7 ETM+ image mosaic covering the study. A comparison between the mapped surface lineaments and lineament zones show good geospatial correlation and identified the predominant NE-SW and NW-SE structural trends in the basin. Topographic profiles across different parts of the Bama Beach Ridge palaeoshorelines in the basin appear to show different elevations across the feature. It is determined that most of the drainage systems in the northeastern Bornu basin are structurally controlled with drainage lines terminating against the paleo-lake border and emptying into the Lake Chad mainly arising from the extensive topographic high-stand Bama Beach Ridge palaeoshoreline.Keywords: Bornu Basin, lineaments, spectral lithology, tectonics
Procedia PDF Downloads 1391809 Sentiment Analysis of Fake Health News Using Naive Bayes Classification Models
Authors: Danielle Shackley, Yetunde Folajimi
Abstract:
As more people turn to the internet seeking health-related information, there is more risk of finding false, inaccurate, or dangerous information. Sentiment analysis is a natural language processing technique that assigns polarity scores to text, ranging from positive, neutral, and negative. In this research, we evaluate the weight of a sentiment analysis feature added to fake health news classification models. The dataset consists of existing reliably labeled health article headlines that were supplemented with health information collected about COVID-19 from social media sources. We started with data preprocessing and tested out various vectorization methods such as Count and TFIDF vectorization. We implemented 3 Naive Bayes classifier models, including Bernoulli, Multinomial, and Complement. To test the weight of the sentiment analysis feature on the dataset, we created benchmark Naive Bayes classification models without sentiment analysis, and those same models were reproduced, and the feature was added. We evaluated using the precision and accuracy scores. The Bernoulli initial model performed with 90% precision and 75.2% accuracy, while the model supplemented with sentiment labels performed with 90.4% precision and stayed constant at 75.2% accuracy. Our results show that the addition of sentiment analysis did not improve model precision by a wide margin; while there was no evidence of improvement in accuracy, we had a 1.9% improvement margin of the precision score with the Complement model. Future expansion of this work could include replicating the experiment process and substituting the Naive Bayes for a deep learning neural network model.Keywords: sentiment analysis, Naive Bayes model, natural language processing, topic analysis, fake health news classification model
Procedia PDF Downloads 971808 Monitoring Deforestation Using Remote Sensing And GIS
Authors: Tejaswi Agarwal, Amritansh Agarwal
Abstract:
Forest ecosystem plays very important role in the global carbon cycle. It stores about 80% of all above ground and 40% of all below ground terrestrial organic carbon. There is much interest in the extent of tropical forests and their rates of deforestation for two reasons: greenhouse gas contributions and the impact of profoundly negative biodiversity. Deforestation has many ecological, social and economic consequences, one of which is the loss of biological diversity. The rapid deployment of remote sensing (RS) satellites and development of RS analysis techniques in the past three decades have provided a reliable, effective, and practical way to characterize terrestrial ecosystem properties. Global estimates of tropical deforestation vary widely and range from 50,000 to 170,000km2 /yr Recent FAO tropical deforestation estimates for 1990–1995 cite 116,756km2 / yr globally. Remote Sensing can prove to be a very useful tool in monitoring of forests and associated deforestation to a sufficient level of accuracy without the need of physically surveying the forest areas as many of them are physically inaccessible. The methodology for the assessment of forest cover using digital image processing (ERDAS) has been followed. The satellite data for the study was procured from Indian institute of remote Sensing (IIRS), Dehradoon in the digital format. While procuring the satellite data, care was taken to ensure that the data was cloud free and did not belong to dry and leafless season. The Normalized Difference Vegetation Index (NDVI) has been used as a numerical indicator of the reduction in ground biomass. NDVI = (near I.R - Red)/ (near I.R + Red). After calculating the NDVI variations and associated mean, we have analysed the change in ground biomass. Through this paper, we have tried to indicate the rate of deforestation over a given period of time by comparing the forest cover at different time intervals. With the help of remote sensing and GIS techniques, it is clearly shown that the total forest cover is continuously degrading and transforming into various land use/land cover category.Keywords: remote sensing, deforestation, supervised classification, NDVI, change detection
Procedia PDF Downloads 12041807 NDVI as a Measure of Change in Forest Biomass
Authors: Amritansh Agarwal, Tejaswi Agarwal
Abstract:
Forest ecosystem plays very important role in the global carbon cycle. It stores about 80% of all above ground and 40% of all below ground terrestrial organic carbon. There is much interest in the extent of tropical forests and their rates of deforestation for two reasons: greenhouse gas contributions and the impact of profoundly negative biodiversity. Deforestation has many ecological, social and economic consequences, one of which is the loss of biological diversity. The rapid deployment of remote sensing (RS) satellites and development of RS analysis techniques in the past three decades have provided a reliable, effective, and practical way to characterize terrestrial ecosystem properties. Global estimates of tropical deforestation vary widely and range from 50,000 to 170,000 km2 /yr Recent FAO tropical deforestation estimates for 1990–1995 cite 116,756km2 / yr globally. Remote Sensing can prove to be a very useful tool in monitoring of forests and associated deforestation to a sufficient level of accuracy without the need of physically surveying the forest areas as many of them are physically inaccessible. The methodology for the assessment of forest cover using digital image processing (ERDAS) has been followed. The satellite data for the study was procured from USGS website in the digital format. While procuring the satellite data, care was taken to ensure that the data was cloud and aerosol free by making using of FLAASH atmospheric correction technique. The Normalized Difference Vegetation Index (NDVI) has been used as a numerical indicator of the reduction in ground biomass. NDVI = (near I.R - Red)/ (near I.R + Red). After calculating the NDVI variations and associated mean we have analysed the change in ground biomass. Through this paper we have tried to indicate the rate of deforestation over a given period of time by comparing the forest cover at different time intervals. With the help of remote sensing and GIS techniques it is clearly shows that the total forest cover is continuously degrading and transforming into various land use/land cover category.Keywords: remote sensing, deforestation, supervised classification, NDVI change detection
Procedia PDF Downloads 4021806 Multiscale Simulation of Absolute Permeability in Carbonate Samples Using 3D X-Ray Micro Computed Tomography Images Textures
Authors: M. S. Jouini, A. Al-Sumaiti, M. Tembely, K. Rahimov
Abstract:
Characterizing rock properties of carbonate reservoirs is highly challenging because of rock heterogeneities revealed at several length scales. In the last two decades, the Digital Rock Physics (DRP) approach was implemented successfully in sandstone rocks reservoirs in order to understand rock properties behaviour at the pore scale. This approach uses 3D X-ray Microtomography images to characterize pore network and also simulate rock properties from these images. Even though, DRP is able to predict realistic rock properties results in sandstone reservoirs it is still suffering from a lack of clear workflow in carbonate rocks. The main challenge is the integration of properties simulated at different scales in order to obtain the effective rock property of core plugs. In this paper, we propose several approaches to characterize absolute permeability in some carbonate core plugs samples using multi-scale numerical simulation workflow. In this study, we propose a procedure to simulate porosity and absolute permeability of a carbonate rock sample using textures of Micro-Computed Tomography images. First, we discretize X-Ray Micro-CT image into a regular grid. Then, we use a textural parametric model to classify each cell of the grid using supervised classification. The main parameters are first and second order statistics such as mean, variance, range and autocorrelations computed from sub-bands obtained after wavelet decomposition. Furthermore, we fill permeability property in each cell using two strategies based on numerical simulation values obtained locally on subsets. Finally, we simulate numerically the effective permeability using Darcy’s law simulator. Results obtained for studied carbonate sample shows good agreement with the experimental property.Keywords: multiscale modeling, permeability, texture, micro-tomography images
Procedia PDF Downloads 1831805 Predicting Loss of Containment in Surface Pipeline using Computational Fluid Dynamics and Supervised Machine Learning Model to Improve Process Safety in Oil and Gas Operations
Authors: Muhammmad Riandhy Anindika Yudhy, Harry Patria, Ramadhani Santoso
Abstract:
Loss of containment is the primary hazard that process safety management is concerned within the oil and gas industry. Escalation to more serious consequences all begins with the loss of containment, starting with oil and gas release from leakage or spillage from primary containment resulting in pool fire, jet fire and even explosion when reacted with various ignition sources in the operations. Therefore, the heart of process safety management is avoiding loss of containment and mitigating its impact through the implementation of safeguards. The most effective safeguard for the case is an early detection system to alert Operations to take action prior to a potential case of loss of containment. The detection system value increases when applied to a long surface pipeline that is naturally difficult to monitor at all times and is exposed to multiple causes of loss of containment, from natural corrosion to illegal tapping. Based on prior researches and studies, detecting loss of containment accurately in the surface pipeline is difficult. The trade-off between cost-effectiveness and high accuracy has been the main issue when selecting the traditional detection method. The current best-performing method, Real-Time Transient Model (RTTM), requires analysis of closely positioned pressure, flow and temperature (PVT) points in the pipeline to be accurate. Having multiple adjacent PVT sensors along the pipeline is expensive, hence generally not a viable alternative from an economic standpoint.A conceptual approach to combine mathematical modeling using computational fluid dynamics and a supervised machine learning model has shown promising results to predict leakage in the pipeline. Mathematical modeling is used to generate simulation data where this data is used to train the leak detection and localization models. Mathematical models and simulation software have also been shown to provide comparable results with experimental data with very high levels of accuracy. While the supervised machine learning model requires a large training dataset for the development of accurate models, mathematical modeling has been shown to be able to generate the required datasets to justify the application of data analytics for the development of model-based leak detection systems for petroleum pipelines. This paper presents a review of key leak detection strategies for oil and gas pipelines, with a specific focus on crude oil applications, and presents the opportunities for the use of data analytics tools and mathematical modeling for the development of robust real-time leak detection and localization system for surface pipelines. A case study is also presented.Keywords: pipeline, leakage, detection, AI
Procedia PDF Downloads 1911804 Using the Smith-Waterman Algorithm to Extract Features in the Classification of Obesity Status
Authors: Rosa Figueroa, Christopher Flores
Abstract:
Text categorization is the problem of assigning a new document to a set of predetermined categories, on the basis of a training set of free-text data that contains documents whose category membership is known. To train a classification model, it is necessary to extract characteristics in the form of tokens that facilitate the learning and classification process. In text categorization, the feature extraction process involves the use of word sequences also known as N-grams. In general, it is expected that documents belonging to the same category share similar features. The Smith-Waterman (SW) algorithm is a dynamic programming algorithm that performs a local sequence alignment in order to determine similar regions between two strings or protein sequences. This work explores the use of SW algorithm as an alternative to feature extraction in text categorization. The dataset used for this purpose, contains 2,610 annotated documents with the classes Obese/Non-Obese. This dataset was represented in a matrix form using the Bag of Word approach. The score selected to represent the occurrence of the tokens in each document was the term frequency-inverse document frequency (TF-IDF). In order to extract features for classification, four experiments were conducted: the first experiment used SW to extract features, the second one used unigrams (single word), the third one used bigrams (two word sequence) and the last experiment used a combination of unigrams and bigrams to extract features for classification. To test the effectiveness of the extracted feature set for the four experiments, a Support Vector Machine (SVM) classifier was tuned using 20% of the dataset. The remaining 80% of the dataset together with 5-Fold Cross Validation were used to evaluate and compare the performance of the four experiments of feature extraction. Results from the tuning process suggest that SW performs better than the N-gram based feature extraction. These results were confirmed by using the remaining 80% of the dataset, where SW performed the best (accuracy = 97.10%, weighted average F-measure = 97.07%). The second best was obtained by the combination of unigrams-bigrams (accuracy = 96.04, weighted average F-measure = 95.97) closely followed by the bigrams (accuracy = 94.56%, weighted average F-measure = 94.46%) and finally unigrams (accuracy = 92.96%, weighted average F-measure = 92.90%).Keywords: comorbidities, machine learning, obesity, Smith-Waterman algorithm
Procedia PDF Downloads 2971803 Multi-Layer Perceptron and Radial Basis Function Neural Network Models for Classification of Diabetic Retinopathy Disease Using Video-Oculography Signals
Authors: Ceren Kaya, Okan Erkaymaz, Orhan Ayar, Mahmut Özer
Abstract:
Diabetes Mellitus (Diabetes) is a disease based on insulin hormone disorders and causes high blood glucose. Clinical findings determine that diabetes can be diagnosed by electrophysiological signals obtained from the vital organs. 'Diabetic Retinopathy' is one of the most common eye diseases resulting on diabetes and it is the leading cause of vision loss due to structural alteration of the retinal layer vessels. In this study, features of horizontal and vertical Video-Oculography (VOG) signals have been used to classify non-proliferative and proliferative diabetic retinopathy disease. Twenty-five features are acquired by using discrete wavelet transform with VOG signals which are taken from 21 subjects. Two models, based on multi-layer perceptron and radial basis function, are recommended in the diagnosis of Diabetic Retinopathy. The proposed models also can detect level of the disease. We show comparative classification performance of the proposed models. Our results show that proposed the RBF model (100%) results in better classification performance than the MLP model (94%).Keywords: diabetic retinopathy, discrete wavelet transform, multi-layer perceptron, radial basis function, video-oculography (VOG)
Procedia PDF Downloads 2591802 Spectrogram Pre-Processing to Improve Isotopic Identification to Discriminate Gamma and Neutrons Sources
Authors: Mustafa Alhamdi
Abstract:
Industrial application to classify gamma rays and neutron events is investigated in this study using deep machine learning. The identification using a convolutional neural network and recursive neural network showed a significant improvement in predication accuracy in a variety of applications. The ability to identify the isotope type and activity from spectral information depends on feature extraction methods, followed by classification. The features extracted from the spectrum profiles try to find patterns and relationships to present the actual spectrum energy in low dimensional space. Increasing the level of separation between classes in feature space improves the possibility to enhance classification accuracy. The nonlinear nature to extract features by neural network contains a variety of transformation and mathematical optimization, while principal component analysis depends on linear transformations to extract features and subsequently improve the classification accuracy. In this paper, the isotope spectrum information has been preprocessed by finding the frequencies components relative to time and using them as a training dataset. Fourier transform implementation to extract frequencies component has been optimized by a suitable windowing function. Training and validation samples of different isotope profiles interacted with CdTe crystal have been simulated using Geant4. The readout electronic noise has been simulated by optimizing the mean and variance of normal distribution. Ensemble learning by combing voting of many models managed to improve the classification accuracy of neural networks. The ability to discriminate gamma and neutron events in a single predication approach using deep machine learning has shown high accuracy using deep learning. The paper findings show the ability to improve the classification accuracy by applying the spectrogram preprocessing stage to the gamma and neutron spectrums of different isotopes. Tuning deep machine learning models by hyperparameter optimization of neural network models enhanced the separation in the latent space and provided the ability to extend the number of detected isotopes in the training database. Ensemble learning contributed significantly to improve the final prediction.Keywords: machine learning, nuclear physics, Monte Carlo simulation, noise estimation, feature extraction, classification
Procedia PDF Downloads 1501801 6D Posture Estimation of Road Vehicles from Color Images
Authors: Yoshimoto Kurihara, Tad Gonsalves
Abstract:
Currently, in the field of object posture estimation, there is research on estimating the position and angle of an object by storing a 3D model of the object to be estimated in advance in a computer and matching it with the model. However, in this research, we have succeeded in creating a module that is much simpler, smaller in scale, and faster in operation. Our 6D pose estimation model consists of two different networks – a classification network and a regression network. From a single RGB image, the trained model estimates the class of the object in the image, the coordinates of the object, and its rotation angle in 3D space. In addition, we compared the estimation accuracy of each camera position, i.e., the angle from which the object was captured. The highest accuracy was recorded when the camera position was 75°, the accuracy of the classification was about 87.3%, and that of regression was about 98.9%.Keywords: 6D posture estimation, image recognition, deep learning, AlexNet
Procedia PDF Downloads 1551800 The Relationship between Representational Conflicts, Generalization, and Encoding Requirements in an Instance Memory Network
Authors: Mathew Wakefield, Matthew Mitchell, Lisa Wise, Christopher McCarthy
Abstract:
The properties of memory representations in artificial neural networks have cognitive implications. Distributed representations that encode instances as a pattern of activity across layers of nodes afford memory compression and enforce the selection of a single point in instance space. These encoding schemes also appear to distort the representational space, as well as trading off the ability to validate that input information is within the bounds of past experience. In contrast, a localist representation which encodes some meaningful information into individual nodes in a network layer affords less memory compression while retaining the integrity of the representational space. This allows the validity of an input to be determined. The validity (or familiarity) of input along with the capacity of localist representation for multiple instance selections affords a memory sampling approach that dynamically balances the bias-variance trade-off. When the input is familiar, bias may be high by referring only to the most similar instances in memory. When the input is less familiar, variance can be increased by referring to more instances that capture a broader range of features. Using this approach in a localist instance memory network, an experiment demonstrates a relationship between representational conflict, generalization performance, and memorization demand. Relatively small sampling ranges produce the best performance on a classic machine learning dataset of visual objects. Combining memory validity with conflict detection produces a reliable confidence judgement that can separate responses with high and low error rates. Confidence can also be used to signal the need for supervisory input. Using this judgement, the need for supervised learning as well as memory encoding can be substantially reduced with only a trivial detriment to classification performance.Keywords: artificial neural networks, representation, memory, conflict monitoring, confidence
Procedia PDF Downloads 1281799 Hyper Parameter Optimization of Deep Convolutional Neural Networks for Pavement Distress Classification
Authors: Oumaima Khlifati, Khadija Baba
Abstract:
Pavement distress is the main factor responsible for the deterioration of road structure durability, damage vehicles, and driver comfort. Transportation agencies spend a high proportion of their funds on pavement monitoring and maintenance. The auscultation of pavement distress was based on the manual survey, which was extremely time consuming, labor intensive, and required domain expertise. Therefore, the automatic distress detection is needed to reduce the cost of manual inspection and avoid more serious damage by implementing the appropriate remediation actions at the right time. Inspired by recent deep learning applications, this paper proposes an algorithm for automatic road distress detection and classification using on the Deep Convolutional Neural Network (DCNN). In this study, the types of pavement distress are classified as transverse or longitudinal cracking, alligator, pothole, and intact pavement. The dataset used in this work is composed of public asphalt pavement images. In order to learn the structure of the different type of distress, the DCNN models are trained and tested as a multi-label classification task. In addition, to get the highest accuracy for our model, we adjust the structural optimization hyper parameters such as the number of convolutions and max pooling, filers, size of filters, loss functions, activation functions, and optimizer and fine-tuning hyper parameters that conclude batch size and learning rate. The optimization of the model is executed by checking all feasible combinations and selecting the best performing one. The model, after being optimized, performance metrics is calculated, which describe the training and validation accuracies, precision, recall, and F1 score.Keywords: distress pavement, hyperparameters, automatic classification, deep learning
Procedia PDF Downloads 931798 The Asymmetric Proximal Support Vector Machine Based on Multitask Learning for Classification
Authors: Qing Wu, Fei-Yan Li, Heng-Chang Zhang
Abstract:
Multitask learning support vector machines (SVMs) have recently attracted increasing research attention. Given several related tasks, the single-task learning methods trains each task separately and ignore the inner cross-relationship among tasks. However, multitask learning can capture the correlation information among tasks and achieve better performance by training all tasks simultaneously. In addition, the asymmetric squared loss function can better improve the generalization ability of the models on the most asymmetric distributed data. In this paper, we first make two assumptions on the relatedness among tasks and propose two multitask learning proximal support vector machine algorithms, named MTL-a-PSVM and EMTL-a-PSVM, respectively. MTL-a-PSVM seeks a trade-off between the maximum expectile distance for each task model and the closeness of each task model to the general model. As an extension of the MTL-a-PSVM, EMTL-a-PSVM can select appropriate kernel functions for shared information and private information. Besides, two corresponding special cases named MTL-PSVM and EMTLPSVM are proposed by analyzing the asymmetric squared loss function, which can be easily implemented by solving linear systems. Experimental analysis of three classification datasets demonstrates the effectiveness and superiority of our proposed multitask learning algorithms.Keywords: multitask learning, asymmetric squared loss, EMTL-a-PSVM, classification
Procedia PDF Downloads 1341797 Classification of Generative Adversarial Network Generated Multivariate Time Series Data Featuring Transformer-Based Deep Learning Architecture
Authors: Thrivikraman Aswathi, S. Advaith
Abstract:
As there can be cases where the use of real data is somehow limited, such as when it is hard to get access to a large volume of real data, we need to go for synthetic data generation. This produces high-quality synthetic data while maintaining the statistical properties of a specific dataset. In the present work, a generative adversarial network (GAN) is trained to produce multivariate time series (MTS) data since the MTS is now being gathered more often in various real-world systems. Furthermore, the GAN-generated MTS data is fed into a transformer-based deep learning architecture that carries out the data categorization into predefined classes. Further, the model is evaluated across various distinct domains by generating corresponding MTS data.Keywords: GAN, transformer, classification, multivariate time series
Procedia PDF Downloads 1301796 Blame Classification through N-Grams in E-Commerce Customer Reviews
Authors: Subhadeep Mandal, Sujoy Bhattacharya, Pabitra Mitra, Diya Guha Roy, Seema Bhattacharya
Abstract:
E-commerce firms allow customers to evaluate and review the things they buy as a positive or bad experience. The e-commerce transaction processes are made up of a variety of diverse organizations and activities that operate independently but are connected together to complete the transaction (from placing an order to the goods reaching the client). After a negative shopping experience, clients frequently disregard the critical assessment of these businesses and submit their feedback on an all-over basis, which benefits certain enterprises but is tedious for others. In this article, we solely dealt with negative reviews and attempted to distinguish between negative reviews where the e-commerce firm is explicitly blamed by customers for a bad purchasing experience and other negative reviews.Keywords: e-commerce, online shopping, customer reviews, customer behaviour, text analytics, n-grams classification
Procedia PDF Downloads 2571795 Rapid Soil Classification Using Computer Vision with Electrical Resistivity and Soil Strength
Authors: Eugene Y. J. Aw, J. W. Koh, S. H. Chew, K. E. Chua, P. L. Goh, Grace H. B. Foo, M. L. Leong
Abstract:
This paper presents the evaluation of various soil testing methods such as the four-probe soil electrical resistivity method and cone penetration test (CPT) that can complement a newly developed novel rapid soil classification scheme using computer vision, to improve the accuracy and productivity of on-site classification of excavated soil. In Singapore, excavated soils from the local construction industry are transported to Staging Grounds (SGs) to be reused as fill material for land reclamation. Excavated soils are mainly categorized into two groups (“Good Earth” and “Soft Clay”) based on particle size distribution (PSD) and water content (w) from soil investigation reports and on-site visual survey, such that proper treatment and usage can be exercised. However, this process is time-consuming and labor-intensive. Thus, a rapid classification method is needed at the SGs. Four-probe soil electrical resistivity and CPT were evaluated for their feasibility as suitable additions to the computer vision system to further develop this innovative non-destructive and instantaneous classification method. The computer vision technique comprises soil image acquisition using an industrial-grade camera; image processing and analysis via calculation of Grey Level Co-occurrence Matrix (GLCM) textural parameters; and decision-making using an Artificial Neural Network (ANN). It was found from the previous study that the ANN model coupled with ρ can classify soils into “Good Earth” and “Soft Clay” in less than a minute, with an accuracy of 85% based on selected representative soil images. To further improve the technique, the following three items were targeted to be added onto the computer vision scheme: the apparent electrical resistivity of soil (ρ) measured using a set of four probes arranged in Wenner’s array, the soil strength measured using a modified mini cone penetrometer, and w measured using a set of time-domain reflectometry (TDR) probes. Laboratory proof-of-concept was conducted through a series of seven tests with three types of soils – “Good Earth”, “Soft Clay,” and a mix of the two. Validation was performed against the PSD and w of each soil type obtained from conventional laboratory tests. The results show that ρ, w and CPT measurements can be collectively analyzed to classify soils into “Good Earth” or “Soft Clay” and are feasible as complementing methods to the computer vision system.Keywords: computer vision technique, cone penetration test, electrical resistivity, rapid and non-destructive, soil classification
Procedia PDF Downloads 2391794 Benchmarking Bert-Based Low-Resource Language: Case Uzbek NLP Models
Authors: Jamshid Qodirov, Sirojiddin Komolov, Ravilov Mirahmad, Olimjon Mirzayev
Abstract:
Nowadays, natural language processing tools play a crucial role in our daily lives, including various techniques with text processing. There are very advanced models in modern languages, such as English, Russian etc. But, in some languages, such as Uzbek, the NLP models have been developed recently. Thus, there are only a few NLP models in Uzbek language. Moreover, there is no such work that could show which Uzbek NLP model behaves in different situations and when to use them. This work tries to close this gap and compares the Uzbek NLP models existing as of the time this article was written. The authors try to compare the NLP models in two different scenarios: sentiment analysis and sentence similarity, which are the implementations of the two most common problems in the industry: classification and similarity. Another outcome from this work is two datasets for classification and sentence similarity in Uzbek language that we generated ourselves and can be useful in both industry and academia as well.Keywords: NLP, benchmak, bert, vectorization
Procedia PDF Downloads 541793 Comparative Analysis of Change in Vegetation in Four Districts of Punjab through Satellite Imagery, Land Use Statistics and Machine Learning
Authors: Mirza Waseem Abbas, Syed Danish Raza
Abstract:
For many countries agriculture is still the major force driving the economy and a critically important socioeconomic sector, despite exceptional industrial development across the globe. In countries like Pakistan, this sector is considered the backbone of the economy, and most of the economic decision making revolves around agricultural outputs and data. Timely and accurate facts and figures about this vital sector hold immense significance and have serious implications for the long-term development of the economy. Therefore, any significant improvements in the statistics and other forms of data regarding agriculture sector are considered important by all policymakers. This is especially true for decision making for the betterment of crops and the agriculture sector in general. Provincial and federal agricultural departments collect data for all cash and non-cash crops and the sector, in general, every year. Traditional data collection for such a large sector i.e. agriculture, being time-consuming, prone to human error and labor-intensive, is slowly but gradually being replaced by remote sensing techniques. For this study, remotely sensed data were used for change detection (machine learning, supervised & unsupervised classification) to assess the increase or decrease in area under agriculture over the last fifteen years due to urbanization. Detailed Landsat Images for the selected agricultural districts were acquired for the year 2000 and compared to images of the same area acquired for the year 2016. Observed differences validated through detailed analysis of the areas show that there was a considerable decrease in vegetation during the last fifteen years in four major agricultural districts of the Punjab province due to urbanization (housing societies).Keywords: change detection, area estimation, machine learning, urbanization, remote sensing
Procedia PDF Downloads 2491792 Transformer-Driven Multi-Category Classification for an Automated Academic Strand Recommendation Framework
Authors: Ma Cecilia Siva
Abstract:
This study introduces a Bidirectional Encoder Representations from Transformers (BERT)-based machine learning model aimed at improving educational counseling by automating the process of recommending academic strands for students. The framework is designed to streamline and enhance the strand selection process by analyzing students' profiles and suggesting suitable academic paths based on their interests, strengths, and goals. Data was gathered from a sample of 200 grade 10 students, which included personal essays and survey responses relevant to strand alignment. After thorough preprocessing, the text data was tokenized, label-encoded, and input into a fine-tuned BERT model set up for multi-label classification. The model was optimized for balanced accuracy and computational efficiency, featuring a multi-category classification layer with sigmoid activation for independent strand predictions. Performance metrics showed an F1 score of 88%, indicating a well-balanced model with precision at 80% and recall at 100%, demonstrating its effectiveness in providing reliable recommendations while reducing irrelevant strand suggestions. To facilitate practical use, the final deployment phase created a recommendation framework that processes new student data through the trained model and generates personalized academic strand suggestions. This automated recommendation system presents a scalable solution for academic guidance, potentially enhancing student satisfaction and alignment with educational objectives. The study's findings indicate that expanding the data set, integrating additional features, and refining the model iteratively could improve the framework's accuracy and broaden its applicability in various educational contexts.Keywords: tokenized, sigmoid activation, transformer, multi category classification
Procedia PDF Downloads 91791 Improve Student Performance Prediction Using Majority Vote Ensemble Model for Higher Education
Authors: Wade Ghribi, Abdelmoty M. Ahmed, Ahmed Said Badawy, Belgacem Bouallegue
Abstract:
In higher education institutions, the most pressing priority is to improve student performance and retention. Large volumes of student data are used in Educational Data Mining techniques to find new hidden information from students' learning behavior, particularly to uncover the early symptom of at-risk pupils. On the other hand, data with noise, outliers, and irrelevant information may provide incorrect conclusions. By identifying features of students' data that have the potential to improve performance prediction results, comparing and identifying the most appropriate ensemble learning technique after preprocessing the data, and optimizing the hyperparameters, this paper aims to develop a reliable students' performance prediction model for Higher Education Institutions. Data was gathered from two different systems: a student information system and an e-learning system for undergraduate students in the College of Computer Science of a Saudi Arabian State University. The cases of 4413 students were used in this article. The process includes data collection, data integration, data preprocessing (such as cleaning, normalization, and transformation), feature selection, pattern extraction, and, finally, model optimization and assessment. Random Forest, Bagging, Stacking, Majority Vote, and two types of Boosting techniques, AdaBoost and XGBoost, are ensemble learning approaches, whereas Decision Tree, Support Vector Machine, and Artificial Neural Network are supervised learning techniques. Hyperparameters for ensemble learning systems will be fine-tuned to provide enhanced performance and optimal output. The findings imply that combining features of students' behavior from e-learning and students' information systems using Majority Vote produced better outcomes than the other ensemble techniques.Keywords: educational data mining, student performance prediction, e-learning, classification, ensemble learning, higher education
Procedia PDF Downloads 1081790 Effect of Coaching Related Incompetency to Stand Trial on Symptom Validity Test: Robustness, Sensitivity, and Specificity
Authors: Natthawut Arin
Abstract:
In forensic contexts, competency to stand trial assessments are the most common referrals. The defendants may attempt to endorse psychopathology symptoms and feign incompetent. Coaching, which can be teaching them test-taking strategies to avoid detection of psychopathological symptoms feigning. Recently, the Symptom Validity Testings (SVTs) were created to detect feigning. Moreover, the works of the literature showed that the effects of coaching on SVTs may be more robust to the effects of coaching. Thai Symptom Validity Test (SVT-Th) was designed as SVTs which demonstrated adequate psychometric properties and ability to classify between feigners and honest responders. Thus, the current study to examine the utility as the robustness of SVT-Th in the detection of feigned psychopathology. Participants consisted of 120 were recruited from undergraduate courses in psychology, randomly assigned to one of three groups. The SVT-Th was administered to those three scenario-experimental groups: (a) Uncoached group were asked to respond honestly (n=40), (b) Symptom-coached without warning group were asked to feign psychiatric symptoms to gain incompetency to stand trial (n=40), while (c) Test-coached with warning group were asked to feign psychiatric symptoms to avoid test detection but being incompetency to stand trial (n=40). Group differences were analyzed using one-way ANOVAs. The result revealed an uncoached group (M = 4.23, SD.= 5.20) had significantly lower SVT-Th mean scores than those both coached groups (M =185.00, SD.= 72.88 and M = 132.10, SD.= 54.06, respectively). Classification rates were calculated to determine the classification accuracy. Result indicated that SVT-Th had overall classification accuracy rates of 96.67% with acceptable of 95% sensitivity and 100% specificity rates. Overall, the results of the present study indicate that the SVT-Th yielded high adequate indices of accuracy and these findings suggest that the SVT-Th is robustness against coaching.Keywords: incompetency to stand trial, coaching, robustness, classification accuracy
Procedia PDF Downloads 1381789 Determining Optimal Number of Trees in Random Forests
Authors: Songul Cinaroglu
Abstract:
Background: Random Forest is an efficient, multi-class machine learning method using for classification, regression and other tasks. This method is operating by constructing each tree using different bootstrap sample of the data. Determining the number of trees in random forests is an open question in the literature for studies about improving classification performance of random forests. Aim: The aim of this study is to analyze whether there is an optimal number of trees in Random Forests and how performance of Random Forests differ according to increase in number of trees using sample health data sets in R programme. Method: In this study we analyzed the performance of Random Forests as the number of trees grows and doubling the number of trees at every iteration using “random forest” package in R programme. For determining minimum and optimal number of trees we performed Mc Nemar test and Area Under ROC Curve respectively. Results: At the end of the analysis it was found that as the number of trees grows, it does not always means that the performance of the forest is better than forests which have fever trees. In other words larger number of trees only increases computational costs but not increases performance results. Conclusion: Despite general practice in using random forests is to generate large number of trees for having high performance results, this study shows that increasing number of trees doesn’t always improves performance. Future studies can compare different kinds of data sets and different performance measures to test whether Random Forest performance results change as number of trees increase or not.Keywords: classification methods, decision trees, number of trees, random forest
Procedia PDF Downloads 3951788 Spectral Mixture Model Applied to Cannabis Parcel Determination
Authors: Levent Basayigit, Sinan Demir, Yusuf Ucar, Burhan Kara
Abstract:
Many research projects require accurate delineation of the different land cover type of the agricultural area. Especially it is critically important for the definition of specific plants like cannabis. However, the complexity of vegetation stands structure, abundant vegetation species, and the smooth transition between different seconder section stages make vegetation classification difficult when using traditional approaches such as the maximum likelihood classifier. Most of the time, classification distinguishes only between trees/annual or grain. It has been difficult to accurately determine the cannabis mixed with other plants. In this paper, a mixed distribution models approach is applied to classify pure and mix cannabis parcels using Worldview-2 imagery in the Lakes region of Turkey. Five different land use types (i.e. sunflower, maize, bare soil, and cannabis) were identified in the image. A constrained Gaussian mixture discriminant analysis (GMDA) was used to unmix the image. In the study, 255 reflectance ratios derived from spectral signatures of seven bands (Blue-Green-Yellow-Red-Rededge-NIR1-NIR2) were randomly arranged as 80% for training and 20% for test data. Gaussian mixed distribution model approach is proved to be an effective and convenient way to combine very high spatial resolution imagery for distinguishing cannabis vegetation. Based on the overall accuracies of the classification, the Gaussian mixed distribution model was found to be very successful to achieve image classification tasks. This approach is sensitive to capture the illegal cannabis planting areas in the large plain. This approach can also be used for monitoring and determination with spectral reflections in illegal cannabis planting areas.Keywords: Gaussian mixture discriminant analysis, spectral mixture model, Worldview-2, land parcels
Procedia PDF Downloads 1971787 The Spatial Classification of China near Sea for Marine Biodiversity Conservation Based on Bio-Geographical Factors
Abstract:
Global biodiversity continues to decline as a result of global climate change and various human activities, such as habitat destruction, pollution, introduction of alien species and overfishing. Although there are connections between global marine organisms more or less, it is better to have clear geographical boundaries in order to facilitate the assessment and management of different biogeographical zones. And so area based management tools (ABMT) are considered as the most effective means for the conservation and sustainable use of marine biodiversity. On a large scale, the geographical gap (or barrier) is the main factor to influence the connectivity, diffusion, ecological and evolutionary process of marine organisms, which results in different distribution patterns. On a small scale, these factors include geographical location, geology, and geomorphology, water depth, current, temperature, salinity, etc. Therefore, the analysis on geographic and environmental factors is of great significance in the study of biodiversity characteristics. This paper summarizes the marine spatial classification and ABMTs used in coastal area, open oceans and deep sea. And analysis principles and methods of marine spatial classification based on biogeographic related factors, and take China Near Sea (CNS) area as case study, and select key biogeographic related factors, carry out marine spatial classification at biological region scale, ecological regionals scale and biogeographical scale. The research shows that CNS is divided into 5 biological regions by climate and geographical differences, the Yellow Sea, the Bohai Sea, the East China Sea, the Taiwan Straits, and the South China Sea. And the bioregions are then divided into 12 ecological regions according to the typical ecological and administrative factors, and finally the eco-regions are divided into 98 biogeographical units according to the benthic substrate types, depth, coastal types, water temperature, and salinity, given the integrity of biological and ecological process, the area of the biogeographical units is not less than 1,000 km². This research is of great use to the coastal management and biodiversity conservation for local and central government, and provide important scientific support for future spatial planning and management of coastal waters and sustainable use of marine biodiversity.Keywords: spatial classification, marine biodiversity, bio-geographical, conservation
Procedia PDF Downloads 1521786 Classifying Blog Texts Based on the Psycholinguistic Features of the Texts
Authors: Hyung Jun Ahn
Abstract:
With the growing importance of social media, it is imperative to analyze it to understand the users. Users share useful information and their experience through social media, where much of what is shared is in the form of texts. This study focused on blogs and aimed to test whether the psycho-linguistic characteristics of blog texts vary with the subject or the type of experience of the texts. For this goal, blog texts about four different types of experience, Go, skiing, reading, and musical were collected through the search API of the Tistory blog service. The analysis of the texts showed that various psycholinguistic characteristics of the texts are different across the four categories of the texts. Moreover, the machine learning experiment using the characteristics for automatic text classification showed significant performance. Specifically, the ensemble method, based on functional tree and bagging appeared to be most effective in classification.Keywords: blog, social media, text analysis, psycholinguistics
Procedia PDF Downloads 2791785 Evolving Convolutional Filter Using Genetic Algorithm for Image Classification
Authors: Rujia Chen, Ajit Narayanan
Abstract:
Convolutional neural networks (CNN), as typically applied in deep learning, use layer-wise backpropagation (BP) to construct filters and kernels for feature extraction. Such filters are 2D or 3D groups of weights for constructing feature maps at subsequent layers of the CNN and are shared across the entire input. BP as a gradient descent algorithm has well-known problems of getting stuck at local optima. The use of genetic algorithms (GAs) for evolving weights between layers of standard artificial neural networks (ANNs) is a well-established area of neuroevolution. In particular, the use of crossover techniques when optimizing weights can help to overcome problems of local optima. However, the application of GAs for evolving the weights of filters and kernels in CNNs is not yet an established area of neuroevolution. In this paper, a GA-based filter development algorithm is proposed. The results of the proof-of-concept experiments described in this paper show the proposed GA algorithm can find filter weights through evolutionary techniques rather than BP learning. For some simple classification tasks like geometric shape recognition, the proposed algorithm can achieve 100% accuracy. The results for MNIST classification, while not as good as possible through standard filter learning through BP, show that filter and kernel evolution warrants further investigation as a new subarea of neuroevolution for deep architectures.Keywords: neuroevolution, convolutional neural network, genetic algorithm, filters, kernels
Procedia PDF Downloads 1861784 Spermiogram Values of Fertile Men in Malatya Region
Authors: Aliseydi Bozkurt, Ugur Yılmaz
Abstract:
Objective: It was aimed to evaluate the current status of semen parameters in fertile males with one or more children and whose wife having a pregnancy for the last 1-12 months in Malatya region. Methods: Sperm samples were obtained from 131 voluntary fertile men. In each analysis, sperm volume (ml), number of sperm (sperm/ml), sperm motility and sperm viscosity were examined with Makler device. Classification was made according to World Health Organization (WHO) criteria. Results: Mean ejaculate volume ranged from 1.5 ml to 5.5 ml, sperm count ranged from 27 to 180 million/ml and motility ranged from 35 to 90%. Sperm motility was found to be on average; 69.9% in A, 7.6% in B, 8.7% in C, 13.3% in D category. Conclusion: The mean spermiogram values of fertile males in Malatya region were found to be similar to those in fertile males determined by the WHO. This study has a regional classification value in terms of spermiogram values.Keywords: fertile men, infertility, spermiogram, sperm motility
Procedia PDF Downloads 3521783 Classification Using Worldview-2 Imagery of Giant Panda Habitat in Wolong, Sichuan Province, China
Authors: Yunwei Tang, Linhai Jing, Hui Li, Qingjie Liu, Xiuxia Li, Qi Yan, Haifeng Ding
Abstract:
The giant panda (Ailuropoda melanoleuca) is an endangered species, mainly live in central China, where bamboos act as the main food source of wild giant pandas. Knowledge of spatial distribution of bamboos therefore becomes important for identifying the habitat of giant pandas. There have been ongoing studies for mapping bamboos and other tree species using remote sensing. WorldView-2 (WV-2) is the first high resolution commercial satellite with eight Multi-Spectral (MS) bands. Recent studies demonstrated that WV-2 imagery has a high potential in classification of tree species. The advanced classification techniques are important for utilising high spatial resolution imagery. It is generally agreed that object-based image analysis is a more desirable method than pixel-based analysis in processing high spatial resolution remotely sensed data. Classifiers that use spatial information combined with spectral information are known as contextual classifiers. It is suggested that contextual classifiers can achieve greater accuracy than non-contextual classifiers. Thus, spatial correlation can be incorporated into classifiers to improve classification results. The study area is located at Wuyipeng area in Wolong, Sichuan Province. The complex environment makes it difficult for information extraction since bamboos are sparsely distributed, mixed with brushes, and covered by other trees. Extensive fieldworks in Wuyingpeng were carried out twice. The first one was on 11th June, 2014, aiming at sampling feature locations for geometric correction and collecting training samples for classification. The second fieldwork was on 11th September, 2014, for the purposes of testing the classification results. In this study, spectral separability analysis was first performed to select appropriate MS bands for classification. Also, the reflectance analysis provided information for expanding sample points under the circumstance of knowing only a few. Then, a spatially weighted object-based k-nearest neighbour (k-NN) classifier was applied to the selected MS bands to identify seven land cover types (bamboo, conifer, broadleaf, mixed forest, brush, bare land, and shadow), accounting for spatial correlation within classes using geostatistical modelling. The spatially weighted k-NN method was compared with three alternatives: the traditional k-NN classifier, the Support Vector Machine (SVM) method and the Classification and Regression Tree (CART). Through field validation, it was proved that the classification result obtained using the spatially weighted k-NN method has the highest overall classification accuracy (77.61%) and Kappa coefficient (0.729); the producer’s accuracy and user’s accuracy achieve 81.25% and 95.12% for the bamboo class, respectively, also higher than the other methods. Photos of tree crowns were taken at sample locations using a fisheye camera, so the canopy density could be estimated. It is found that it is difficult to identify bamboo in the areas with a large canopy density (over 0.70); it is possible to extract bamboos in the areas with a median canopy density (from 0.2 to 0.7) and in a sparse forest (canopy density is less than 0.2). In summary, this study explores the ability of WV-2 imagery for bamboo extraction in a mountainous region in Sichuan. The study successfully identified the bamboo distribution, providing supporting knowledge for assessing the habitats of giant pandas.Keywords: bamboo mapping, classification, geostatistics, k-NN, worldview-2
Procedia PDF Downloads 3131782 Automatic Motion Trajectory Analysis for Dual Human Interaction Using Video Sequences
Authors: Yuan-Hsiang Chang, Pin-Chi Lin, Li-Der Jeng
Abstract:
Advance in techniques of image and video processing has enabled the development of intelligent video surveillance systems. This study was aimed to automatically detect moving human objects and to analyze events of dual human interaction in a surveillance scene. Our system was developed in four major steps: image preprocessing, human object detection, human object tracking, and motion trajectory analysis. The adaptive background subtraction and image processing techniques were used to detect and track moving human objects. To solve the occlusion problem during the interaction, the Kalman filter was used to retain a complete trajectory for each human object. Finally, the motion trajectory analysis was developed to distinguish between the interaction and non-interaction events based on derivatives of trajectories related to the speed of the moving objects. Using a database of 60 video sequences, our system could achieve the classification accuracy of 80% in interaction events and 95% in non-interaction events, respectively. In summary, we have explored the idea to investigate a system for the automatic classification of events for interaction and non-interaction events using surveillance cameras. Ultimately, this system could be incorporated in an intelligent surveillance system for the detection and/or classification of abnormal or criminal events (e.g., theft, snatch, fighting, etc.).Keywords: motion detection, motion tracking, trajectory analysis, video surveillance
Procedia PDF Downloads 548