Search results for: K-nearest neighbor classifier
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 379

Search results for: K-nearest neighbor classifier

19 Localization of Geospatial Events and Hoax Prediction in the UFO Database

Authors: Harish Krishnamurthy, Anna Lafontant, Ren Yi

Abstract:

Unidentified Flying Objects (UFOs) have been an interesting topic for most enthusiasts and hence people all over the United States report such findings online at the National UFO Report Center (NUFORC). Some of these reports are a hoax and among those that seem legitimate, our task is not to establish that these events confirm that they indeed are events related to flying objects from aliens in outer space. Rather, we intend to identify if the report was a hoax as was identified by the UFO database team with their existing curation criterion. However, the database provides a wealth of information that can be exploited to provide various analyses and insights such as social reporting, identifying real-time spatial events and much more. We perform analysis to localize these time-series geospatial events and correlate with known real-time events. This paper does not confirm any legitimacy of alien activity, but rather attempts to gather information from likely legitimate reports of UFOs by studying the online reports. These events happen in geospatial clusters and also are time-based. We look at cluster density and data visualization to search the space of various cluster realizations to decide best probable clusters that provide us information about the proximity of such activity. A random forest classifier is also presented that is used to identify true events and hoax events, using the best possible features available such as region, week, time-period and duration. Lastly, we show the performance of the scheme on various days and correlate with real-time events where one of the UFO reports strongly correlates to a missile test conducted in the United States.

Keywords: Time-series clustering, feature extraction, hoax prediction, geospatial events.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 808
18 Automatic Staging and Subtype Determination for Non-Small Cell Lung Carcinoma Using PET Image Texture Analysis

Authors: Seyhan Karaçavuş, Bülent Yılmaz, Ömer Kayaaltı, Semra İçer, Arzu Taşdemir, Oğuzhan Ayyıldız, Kübra Eset, Eser Kaya

Abstract:

In this study, our goal was to perform tumor staging and subtype determination automatically using different texture analysis approaches for a very common cancer type, i.e., non-small cell lung carcinoma (NSCLC). Especially, we introduced a texture analysis approach, called Law’s texture filter, to be used in this context for the first time. The 18F-FDG PET images of 42 patients with NSCLC were evaluated. The number of patients for each tumor stage, i.e., I-II, III or IV, was 14. The patients had ~45% adenocarcinoma (ADC) and ~55% squamous cell carcinoma (SqCCs). MATLAB technical computing language was employed in the extraction of 51 features by using first order statistics (FOS), gray-level co-occurrence matrix (GLCM), gray-level run-length matrix (GLRLM), and Laws’ texture filters. The feature selection method employed was the sequential forward selection (SFS). Selected textural features were used in the automatic classification by k-nearest neighbors (k-NN) and support vector machines (SVM). In the automatic classification of tumor stage, the accuracy was approximately 59.5% with k-NN classifier (k=3) and 69% with SVM (with one versus one paradigm), using 5 features. In the automatic classification of tumor subtype, the accuracy was around 92.7% with SVM one vs. one. Texture analysis of FDG-PET images might be used, in addition to metabolic parameters as an objective tool to assess tumor histopathological characteristics and in automatic classification of tumor stage and subtype.

Keywords: Cancer stage, cancer cell type, non-small cell lung carcinoma, PET, texture analysis.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 929
17 Markov Random Field-Based Segmentation Algorithm for Detection of Land Cover Changes Using Uninhabited Aerial Vehicle Synthetic Aperture Radar Polarimetric Images

Authors: Mehrnoosh Omati, Mahmod Reza Sahebi

Abstract:

The information on land use/land cover changing plays an essential role for environmental assessment, planning and management in regional development. Remotely sensed imagery is widely used for providing information in many change detection applications. Polarimetric Synthetic aperture radar (PolSAR) image, with the discrimination capability between different scattering mechanisms, is a powerful tool for environmental monitoring applications. This paper proposes a new boundary-based segmentation algorithm as a fundamental step for land cover change detection. In this method, first, two PolSAR images are segmented using integration of marker-controlled watershed algorithm and coupled Markov random field (MRF). Then, object-based classification is performed to determine changed/no changed image objects. Compared with pixel-based support vector machine (SVM) classifier, this novel segmentation algorithm significantly reduces the speckle effect in PolSAR images and improves the accuracy of binary classification in object-based level. The experimental results on Uninhabited Aerial Vehicle Synthetic Aperture Radar (UAVSAR) polarimetric images show a 3% and 6% improvement in overall accuracy and kappa coefficient, respectively. Also, the proposed method can correctly distinguish homogeneous image parcels.

Keywords: Coupled Markov random field, environment, object-based analysis, Polarimetric SAR images.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 821
16 Identification of Spam Keywords Using Hierarchical Category in C2C E-commerce

Authors: Shao Bo Cheng, Yong-Jin Han, Se Young Park, Seong-Bae Park

Abstract:

Consumer-to-Consumer (C2C) E-commerce has been growing at a very high speed in recent years. Since identical or nearly-same kinds of products compete one another by relying on keyword search in C2C E-commerce, some sellers describe their products with spam keywords that are popular but are not related to their products. Though such products get more chances to be retrieved and selected by consumers than those without spam keywords, the spam keywords mislead the consumers and waste their time. This problem has been reported in many commercial services like ebay and taobao, but there have been little research to solve this problem. As a solution to this problem, this paper proposes a method to classify whether keywords of a product are spam or not. The proposed method assumes that a keyword for a given product is more reliable if the keyword is observed commonly in specifications of products which are the same or the same kind as the given product. This is because that a hierarchical category of a product in general determined precisely by a seller of the product and so is the specification of the product. Since higher layers of the hierarchical category represent more general kinds of products, a reliable degree is differently determined according to the layers. Hence, reliable degrees from different layers of a hierarchical category become features for keywords and they are used together with features only from specifications for classification of the keywords. Support Vector Machines are adopted as a basic classifier using the features, since it is powerful, and widely used in many classification tasks. In the experiments, the proposed method is evaluated with a golden standard dataset from Yi-han-wang, a Chinese C2C E-commerce, and is compared with a baseline method that does not consider the hierarchical category. The experimental results show that the proposed method outperforms the baseline in F1-measure, which proves that spam keywords are effectively identified by a hierarchical category in C2C E-commerce.

Keywords: Spam Keyword, E-commerce, keyword features, spam filtering.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2461
15 Automatic Removal of Ocular Artifacts using JADE Algorithm and Neural Network

Authors: V Krishnaveni, S Jayaraman, A Gunasekaran, K Ramadoss

Abstract:

The ElectroEncephaloGram (EEG) is useful for clinical diagnosis and biomedical research. EEG signals often contain strong ElectroOculoGram (EOG) artifacts produced by eye movements and eye blinks especially in EEG recorded from frontal channels. These artifacts obscure the underlying brain activity, making its visual or automated inspection difficult. The goal of ocular artifact removal is to remove ocular artifacts from the recorded EEG, leaving the underlying background signals due to brain activity. In recent times, Independent Component Analysis (ICA) algorithms have demonstrated superior potential in obtaining the least dependent source components. In this paper, the independent components are obtained by using the JADE algorithm (best separating algorithm) and are classified into either artifact component or neural component. Neural Network is used for the classification of the obtained independent components. Neural Network requires input features that exactly represent the true character of the input signals so that the neural network could classify the signals based on those key characters that differentiate between various signals. In this work, Auto Regressive (AR) coefficients are used as the input features for classification. Two neural network approaches are used to learn classification rules from EEG data. First, a Polynomial Neural Network (PNN) trained by GMDH (Group Method of Data Handling) algorithm is used and secondly, feed-forward neural network classifier trained by a standard back-propagation algorithm is used for classification and the results show that JADE-FNN performs better than JADEPNN.

Keywords: Auto Regressive (AR) Coefficients, Feed Forward Neural Network (FNN), Joint Approximation Diagonalisation of Eigen matrices (JADE) Algorithm, Polynomial Neural Network (PNN).

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1834
14 Monte Carlo and Biophysics Analysis in a Criminal Trial

Authors: Luca Indovina, Carmela Coppola, Carlo Altucci, Riccardo Barberi, Rocco Romano

Abstract:

In this paper a real court case, held in Italy at the Court of Nola, in which a correct physical description, conducted with both a Monte Carlo and biophysical analysis, would have been sufficient to arrive at conclusions confirmed by documentary evidence, is considered. This will be an example of how forensic physics can be useful in confirming documentary evidence in order to reach hardly questionable conclusions. This was a libel trial in which the defendant, Mr. DS (Defendant for Slander), had falsely accused one of his neighbors, Mr. OP (Offended Person), of having caused him some damages. The damages would have been caused by an external plaster piece that would have detached from the neighbor’s property and would have hit Mr DS while he was in his garden, much more than a meter far away from the facade of the building from which the plaster piece would have detached. In the trial, Mr. DS claimed to have suffered a scratch on his forehead, but he never showed the plaster that had hit him, nor was able to tell from where the plaster would have arrived. Furthermore, Mr. DS presented a medical certificate with a diagnosis of contusion of the cerebral cortex. On the contrary, the images of Mr. OP’s security cameras do not show any movement in the garden of Mr. DS in a long interval of time (about 2 hours) around the time of the alleged accident, nor do they show any people entering or coming out from the house of Mr. DS in the same interval of time. Biophysical analysis shows that both the diagnosis of the medical certificate and the wound declared by the defendant, already in conflict with each other, are not compatible with the fall of external plaster pieces too small to be found. The wind was at a level 1 of the Beaufort scale, that is, unable to raise even dust (level 4 of the Beaufort scale). Therefore, the motion of the plaster pieces can be described as a projectile motion, whereas collisions with the building cornice can be treated using Newtons law of coefficients of restitution. Numerous numerical Monte Carlo simulations show that the pieces of plaster would not have been able to reach even the garden of Mr. DS, let alone a distance over 1.30 meters. Results agree with the documentary evidence (images of Mr. OP’s security cameras) that Mr. DS could not have been hit by plaster pieces coming from Mr. OP’s property.

Keywords: Biophysical analysis, Monte Carlo simulations, Newton’s law of restitution, projectile motion.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 533
13 Iris Recognition Based On the Low Order Norms of Gradient Components

Authors: Iman A. Saad, Loay E. George

Abstract:

Iris pattern is an important biological feature of human body; it becomes very hot topic in both research and practical applications. In this paper, an algorithm is proposed for iris recognition and a simple, efficient and fast method is introduced to extract a set of discriminatory features using first order gradient operator applied on grayscale images. The gradient based features are robust, up to certain extents, against the variations may occur in contrast or brightness of iris image samples; the variations are mostly occur due lightening differences and camera changes. At first, the iris region is located, after that it is remapped to a rectangular area of size 360x60 pixels. Also, a new method is proposed for detecting eyelash and eyelid points; it depends on making image statistical analysis, to mark the eyelash and eyelid as a noise points. In order to cover the features localization (variation), the rectangular iris image is partitioned into N overlapped sub-images (blocks); then from each block a set of different average directional gradient densities values is calculated to be used as texture features vector. The applied gradient operators are taken along the horizontal, vertical and diagonal directions. The low order norms of gradient components were used to establish the feature vector. Euclidean distance based classifier was used as a matching metric for determining the degree of similarity between the features vector extracted from the tested iris image and template features vectors stored in the database. Experimental tests were performed using 2639 iris images from CASIA V4-Interival database, the attained recognition accuracy has reached up to 99.92%.

Keywords: Iris recognition, contrast stretching, gradient features, texture features, Euclidean metric.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1917
12 Sentiment Analysis of Fake Health News Using Naive Bayes Classification Models

Authors: Danielle Shackley, Yetunde Folajimi

Abstract:

As more people turn to the internet seeking health related information, there is more risk of finding false, inaccurate, or dangerous information. Sentiment analysis is a natural language processing technique that assigns polarity scores of text, ranging from positive, neutral and negative. In this research, we evaluate the weight of a sentiment analysis feature added to fake health news classification models. The dataset consists of existing reliably labeled health article headlines that were supplemented with health information collected about COVID-19 from social media sources. We started with data preprocessing, tested out various vectorization methods such as Count and TFIDF vectorization. We implemented 3 Naive Bayes classifier models, including Bernoulli, Multinomial and Complement. To test the weight of the sentiment analysis feature on the dataset, we created benchmark Naive Bayes classification models without sentiment analysis, and those same models were reproduced and the feature was added. We evaluated using the precision and accuracy scores. The Bernoulli initial model performed with 90% precision and 75.2% accuracy, while the model supplemented with sentiment labels performed with 90.4% precision and stayed constant at 75.2% accuracy. Our results show that the addition of sentiment analysis did not improve model precision by a wide margin; while there was no evidence of improvement in accuracy, we had a 1.9% improvement margin of the precision score with the Complement model. Future expansion of this work could include replicating the experiment process, and substituting the Naive Bayes for a deep learning neural network model.

Keywords: Sentiment analysis, Naive Bayes model, natural language processing, topic analysis, fake health news classification model.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 380
11 C-LNRD: A Cross-Layered Neighbor Route Discovery for Effective Packet Communication in Wireless Sensor Network

Authors: K. Kalaikumar, E. Baburaj

Abstract:

One of the problems to be addressed in wireless sensor networks is the issues related to cross layer communication. Cross layer architecture shares the information across the layer, ensuring Quality of Services (QoS). With this shared information, MAC protocol adapts effective functionality maintenance such as route selection on changeable sensor network environment. However, time slot assignment and neighbour route selection time duration for cross layer have not been carried out. The time varying physical layer communication over cross layer causes high traffic load in the sensor network. Though, the traffic load was reduced using cross layer optimization procedure, the computational cost is high. To improve communication efficacy in the sensor network, a self-determined time slot based Cross-Layered Neighbour Route Discovery (C-LNRD) method is presented in this paper. In the presented work, the initial process is to discover the route in the sensor network using Dynamic Source Routing based Medium Access Control (MAC) sub layers. This process considers MAC layer operation with dynamic route neighbour table discovery. Then, the discovered route path for packet communication employs Broad Route Distributed Time Slot Assignment method on Cross-Layered Sensor Network system. Broad Route means time slotting on varying length of the route paths. During packet communication in this sensor network, transmission of packets is adjusted over the different time with varying ranges for controlling the traffic rate. Finally, Rayleigh fading model is developed in C-LNRD to identify the performance of the sensor network communication structure. The main task of Rayleigh Fading is to measure the power level of each communication under MAC sub layer. The minimized power level helps to easily reduce the computational cost of packet communication in the sensor network. Experiments are conducted on factors such as power factor, on packet communication, neighbour route discovery time, and information (i.e., packet) propagation speed.

Keywords: Medium access control, neighbour route discovery, wireless sensor network, Rayleigh fading, distributed time slot assignment

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 729
10 Lexical Based Method for Opinion Detection on Tripadvisor Collection

Authors: Faiza Belbachir, Thibault Schienhinski

Abstract:

The massive development of online social networks allows users to post and share their opinions on various topics. With this huge volume of opinion, it is interesting to extract and interpret these information for different domains, e.g., product and service benchmarking, politic, system of recommendation. This is why opinion detection is one of the most important research tasks. It consists on differentiating between opinion data and factual data. The difficulty of this task is to determine an approach which returns opinionated document. Generally, there are two approaches used for opinion detection i.e. Lexical based approaches and Machine Learning based approaches. In Lexical based approaches, a dictionary of sentimental words is used, words are associated with weights. The opinion score of document is derived by the occurrence of words from this dictionary. In Machine learning approaches, usually a classifier is trained using a set of annotated document containing sentiment, and features such as n-grams of words, part-of-speech tags, and logical forms. Majority of these works are based on documents text to determine opinion score but dont take into account if these texts are really correct. Thus, it is interesting to exploit other information to improve opinion detection. In our work, we will develop a new way to consider the opinion score. We introduce the notion of trust score. We determine opinionated documents but also if these opinions are really trustable information in relation with topics. For that we use lexical SentiWordNet to calculate opinion and trust scores, we compute different features about users like (numbers of their comments, numbers of their useful comments, Average useful review). After that, we combine opinion score and trust score to obtain a final score. We applied our method to detect trust opinions in TRIPADVISOR collection. Our experimental results report that the combination between opinion score and trust score improves opinion detection.

Keywords: Tripadvisor, Opinion detection, SentiWordNet, trust score.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 693
9 A Bayesian Classification System for Facilitating an Institutional Risk Profile Definition

Authors: Roman Graf, Sergiu Gordea, Heather M. Ryan

Abstract:

This paper presents an approach for easy creation and classification of institutional risk profiles supporting endangerment analysis of file formats. The main contribution of this work is the employment of data mining techniques to support set up of the most important risk factors. Subsequently, risk profiles employ risk factors classifier and associated configurations to support digital preservation experts with a semi-automatic estimation of endangerment group for file format risk profiles. Our goal is to make use of an expert knowledge base, accuired through a digital preservation survey in order to detect preservation risks for a particular institution. Another contribution is support for visualisation of risk factors for a requried dimension for analysis. Using the naive Bayes method, the decision support system recommends to an expert the matching risk profile group for the previously selected institutional risk profile. The proposed methods improve the visibility of risk factor values and the quality of a digital preservation process. The presented approach is designed to facilitate decision making for the preservation of digital content in libraries and archives using domain expert knowledge and values of file format risk profiles. To facilitate decision-making, the aggregated information about the risk factors is presented as a multidimensional vector. The goal is to visualise particular dimensions of this vector for analysis by an expert and to define its profile group. The sample risk profile calculation and the visualisation of some risk factor dimensions is presented in the evaluation section.

Keywords: linked open data, information integration, digital libraries, data mining.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 675
8 Combination of Different Classifiers for Cardiac Arrhythmia Recognition

Authors: M. R. Homaeinezhad, E. Tavakkoli, M. Habibi, S. A. Atyabi, A. Ghaffari

Abstract:

This paper describes a new supervised fusion (hybrid) electrocardiogram (ECG) classification solution consisting of a new QRS complex geometrical feature extraction as well as a new version of the learning vector quantization (LVQ) classification algorithm aimed for overcoming the stability-plasticity dilemma. Toward this objective, after detection and delineation of the major events of ECG signal via an appropriate algorithm, each QRS region and also its corresponding discrete wavelet transform (DWT) are supposed as virtual images and each of them is divided into eight polar sectors. Then, the curve length of each excerpted segment is calculated and is used as the element of the feature space. To increase the robustness of the proposed classification algorithm versus noise, artifacts and arrhythmic outliers, a fusion structure consisting of five different classifiers namely as Support Vector Machine (SVM), Modified Learning Vector Quantization (MLVQ) and three Multi Layer Perceptron-Back Propagation (MLP–BP) neural networks with different topologies were designed and implemented. The new proposed algorithm was applied to all 48 MIT–BIH Arrhythmia Database records (within–record analysis) and the discrimination power of the classifier in isolation of different beat types of each record was assessed and as the result, the average accuracy value Acc=98.51% was obtained. Also, the proposed method was applied to 6 number of arrhythmias (Normal, LBBB, RBBB, PVC, APB, PB) belonging to 20 different records of the aforementioned database (between– record analysis) and the average value of Acc=95.6% was achieved. To evaluate performance quality of the new proposed hybrid learning machine, the obtained results were compared with similar peer– reviewed studies in this area.

Keywords: Feature Extraction, Curve Length Method, SupportVector Machine, Learning Vector Quantization, Multi Layer Perceptron, Fusion (Hybrid) Classification, Arrhythmia Classification, Supervised Learning Machine.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2183
7 Spectral Mixture Model Applied to Cannabis Parcel Determination

Authors: Levent Basayigit, Sinan Demir, Yusuf Ucar, Burhan Kara

Abstract:

Many research projects require accurate delineation of the different land cover type of the agricultural area. Especially it is critically important for the definition of specific plants like cannabis. However, the complexity of vegetation stands structure, abundant vegetation species, and the smooth transition between different seconder section stages make vegetation classification difficult when using traditional approaches such as the maximum likelihood classifier. Most of the time, classification distinguishes only between trees/annual or grain. It has been difficult to accurately determine the cannabis mixed with other plants. In this paper, a mixed distribution models approach is applied to classify pure and mix cannabis parcels using Worldview-2 imagery in the Lakes region of Turkey. Five different land use types (i.e. sunflower, maize, bare soil, and cannabis) were identified in the image. A constrained Gaussian mixture discriminant analysis (GMDA) was used to unmix the image. In the study, 255 reflectance ratios derived from spectral signatures of seven bands (Blue-Green-Yellow-Red-Rededge-NIR1-NIR2) were randomly arranged as 80% for training and 20% for test data. Gaussian mixed distribution model approach is proved to be an effective and convenient way to combine very high spatial resolution imagery for distinguishing cannabis vegetation. Based on the overall accuracies of the classification, the Gaussian mixed distribution model was found to be very successful to achieve image classification tasks. This approach is sensitive to capture the illegal cannabis planting areas in the large plain. This approach can also be used for monitoring and determination with spectral reflections in illegal cannabis planting areas.

Keywords: Gaussian mixture discriminant analysis, spectral mixture model, World View-2, land parcels.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 747
6 Web Data Scraping Technology Using Term Frequency Inverse Document Frequency to Enhance the Big Data Quality on Sentiment Analysis

Authors: Sangita Pokhrel, Nalinda Somasiri, Rebecca Jeyavadhanam, Swathi Ganesan

Abstract:

Tourism is a booming industry with huge future potential for global wealth and employment. There are countless data generated over social media sites every day, creating numerous opportunities to bring more insights to decision-makers. The integration of big data technology into the tourism industry will allow companies to conclude where their customers have been and what they like. This information can then be used by businesses, such as those in charge of managing visitor centres or hotels, etc., and the tourist can get a clear idea of places before visiting. The technical perspective of natural language is processed by analysing the sentiment features of online reviews from tourists, and we then supply an enhanced long short-term memory (LSTM) framework for sentiment feature extraction of travel reviews. We have constructed a web review database using a crawler and web scraping technique for experimental validation to evaluate the effectiveness of our methodology. The text form of sentences was first classified through VADER and RoBERTa model to get the polarity of the reviews. In this paper, we have conducted study methods for feature extraction, such as Count Vectorization and Term Frequency – Inverse Document Frequency (TFIDF) Vectorization and implemented Convolutional Neural Network (CNN) classifier algorithm for the sentiment analysis to decide if the tourist’s attitude towards the destinations is positive, negative, or simply neutral based on the review text that they posted online. The results demonstrated that from the CNN algorithm, after pre-processing and cleaning the dataset, we received an accuracy of 96.12% for the positive and negative sentiment analysis.

Keywords: Counter vectorization, Convolutional Neural Network, Crawler, data technology, Long Short-Term Memory, LSTM, Web Scraping, sentiment analysis.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 111
5 A Proposed Optimized and Efficient Intrusion Detection System for Wireless Sensor Network

Authors: Abdulaziz Alsadhan, Naveed Khan

Abstract:

In recent years intrusions on computer network are the major security threat. Hence, it is important to impede such intrusions. The hindrance of such intrusions entirely relies on its detection, which is primary concern of any security tool like Intrusion detection system (IDS). Therefore, it is imperative to accurately detect network attack. Numerous intrusion detection techniques are available but the main issue is their performance. The performance of IDS can be improved by increasing the accurate detection rate and reducing false positive. The existing intrusion detection techniques have the limitation of usage of raw dataset for classification. The classifier may get jumble due to redundancy, which results incorrect classification. To minimize this problem, Principle component analysis (PCA), Linear Discriminant Analysis (LDA) and Local Binary Pattern (LBP) can be applied to transform raw features into principle features space and select the features based on their sensitivity. Eigen values can be used to determine the sensitivity. To further classify, the selected features greedy search, back elimination, and Particle Swarm Optimization (PSO) can be used to obtain a subset of features with optimal sensitivity and highest discriminatory power. This optimal feature subset is used to perform classification. For classification purpose, Support Vector Machine (SVM) and Multilayer Perceptron (MLP) are used due to its proven ability in classification. The Knowledge Discovery and Data mining (KDD’99) cup dataset was considered as a benchmark for evaluating security detection mechanisms. The proposed approach can provide an optimal intrusion detection mechanism that outperforms the existing approaches and has the capability to minimize the number of features and maximize the detection rates.

Keywords: Particle Swarm Optimization (PSO), Principle component analysis (PCA), Linear Discriminant Analysis (LDA), Local Binary Pattern (LBP), Support Vector Machine (SVM), Multilayer Perceptron (MLP).

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2707
4 An Intelligent Combined Method Based on Power Spectral Density, Decision Trees and Fuzzy Logic for Hydraulic Pumps Fault Diagnosis

Authors: Kaveh Mollazade, Hojat Ahmadi, Mahmoud Omid, Reza Alimardani

Abstract:

Recently, the issue of machine condition monitoring and fault diagnosis as a part of maintenance system became global due to the potential advantages to be gained from reduced maintenance costs, improved productivity and increased machine availability. The aim of this work is to investigate the effectiveness of a new fault diagnosis method based on power spectral density (PSD) of vibration signals in combination with decision trees and fuzzy inference system (FIS). To this end, a series of studies was conducted on an external gear hydraulic pump. After a test under normal condition, a number of different machine defect conditions were introduced for three working levels of pump speed (1000, 1500, and 2000 rpm), corresponding to (i) Journal-bearing with inner face wear (BIFW), (ii) Gear with tooth face wear (GTFW), and (iii) Journal-bearing with inner face wear plus Gear with tooth face wear (B&GW). The features of PSD values of vibration signal were extracted using descriptive statistical parameters. J48 algorithm is used as a feature selection procedure to select pertinent features from data set. The output of J48 algorithm was employed to produce the crisp if-then rule and membership function sets. The structure of FIS classifier was then defined based on the crisp sets. In order to evaluate the proposed PSD-J48-FIS model, the data sets obtained from vibration signals of the pump were used. Results showed that the total classification accuracy for 1000, 1500, and 2000 rpm conditions were 96.42%, 100%, and 96.42% respectively. The results indicate that the combined PSD-J48-FIS model has the potential for fault diagnosis of hydraulic pumps.

Keywords: Power Spectral Density, Machine ConditionMonitoring, Hydraulic Pump, Fuzzy Logic.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2636
3 Model-Driven and Data-Driven Approaches for Crop Yield Prediction: Analysis and Comparison

Authors: Xiangtuo Chen, Paul-Henry Cournéde

Abstract:

Crop yield prediction is a paramount issue in agriculture. The main idea of this paper is to find out efficient way to predict the yield of corn based meteorological records. The prediction models used in this paper can be classified into model-driven approaches and data-driven approaches, according to the different modeling methodologies. The model-driven approaches are based on crop mechanistic modeling. They describe crop growth in interaction with their environment as dynamical systems. But the calibration process of the dynamic system comes up with much difficulty, because it turns out to be a multidimensional non-convex optimization problem. An original contribution of this paper is to propose a statistical methodology, Multi-Scenarios Parameters Estimation (MSPE), for the parametrization of potentially complex mechanistic models from a new type of datasets (climatic data, final yield in many situations). It is tested with CORNFLO, a crop model for maize growth. On the other hand, the data-driven approach for yield prediction is free of the complex biophysical process. But it has some strict requirements about the dataset. A second contribution of the paper is the comparison of these model-driven methods with classical data-driven methods. For this purpose, we consider two classes of regression methods, methods derived from linear regression (Ridge and Lasso Regression, Principal Components Regression or Partial Least Squares Regression) and machine learning methods (Random Forest, k-Nearest Neighbor, Artificial Neural Network and SVM regression). The dataset consists of 720 records of corn yield at county scale provided by the United States Department of Agriculture (USDA) and the associated climatic data. A 5-folds cross-validation process and two accuracy metrics: root mean square error of prediction(RMSEP), mean absolute error of prediction(MAEP) were used to evaluate the crop prediction capacity. The results show that among the data-driven approaches, Random Forest is the most robust and generally achieves the best prediction error (MAEP 4.27%). It also outperforms our model-driven approach (MAEP 6.11%). However, the method to calibrate the mechanistic model from dataset easy to access offers several side-perspectives. The mechanistic model can potentially help to underline the stresses suffered by the crop or to identify the biological parameters of interest for breeding purposes. For this reason, an interesting perspective is to combine these two types of approaches.

Keywords: Crop yield prediction, crop model, sensitivity analysis, paramater estimation, particle swarm optimization, random forest.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1131
2 Modeling Engagement with Multimodal Multisensor Data: The Continuous Performance Test as an Objective Tool to Track Flow

Authors: Mohammad H. Taheri, David J. Brown, Nasser Sherkat

Abstract:

Engagement is one of the most important factors in determining successful outcomes and deep learning in students. Existing approaches to detect student engagement involve periodic human observations that are subject to inter-rater reliability. Our solution uses real-time multimodal multisensor data labeled by objective performance outcomes to infer the engagement of students. The study involves four students with a combined diagnosis of cerebral palsy and a learning disability who took part in a 3-month trial over 59 sessions. Multimodal multisensor data were collected while they participated in a continuous performance test. Eye gaze, electroencephalogram, body pose, and interaction data were used to create a model of student engagement through objective labeling from the continuous performance test outcomes. In order to achieve this, a type of continuous performance test is introduced, the Seek-X type. Nine features were extracted including high-level handpicked compound features. Using leave-one-out cross-validation, a series of different machine learning approaches were evaluated. Overall, the random forest classification approach achieved the best classification results. Using random forest, 93.3% classification for engagement and 42.9% accuracy for disengagement were achieved. We compared these results to outcomes from different models: AdaBoost, decision tree, k-Nearest Neighbor, naïve Bayes, neural network, and support vector machine. We showed that using a multisensor approach achieved higher accuracy than using features from any reduced set of sensors. We found that using high-level handpicked features can improve the classification accuracy in every sensor mode. Our approach is robust to both sensor fallout and occlusions. The single most important sensor feature to the classification of engagement and distraction was shown to be eye gaze. It has been shown that we can accurately predict the level of engagement of students with learning disabilities in a real-time approach that is not subject to inter-rater reliability, human observation or reliant on a single mode of sensor input. This will help teachers design interventions for a heterogeneous group of students, where teachers cannot possibly attend to each of their individual needs. Our approach can be used to identify those with the greatest learning challenges so that all students are supported to reach their full potential.

Keywords: Affective computing in education, affect detection, continuous performance test, engagement, flow, HCI, interaction, learning disabilities, machine learning, multimodal, multisensor, physiological sensors, Signal Detection Theory, student engagement.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1206
1 Towards End-To-End Disease Prediction from Raw Metagenomic Data

Authors: Maxence Queyrel, Edi Prifti, Alexandre Templier, Jean-Daniel Zucker

Abstract:

Analysis of the human microbiome using metagenomic sequencing data has demonstrated high ability in discriminating various human diseases. Raw metagenomic sequencing data require multiple complex and computationally heavy bioinformatics steps prior to data analysis. Such data contain millions of short sequences read from the fragmented DNA sequences and stored as fastq files. Conventional processing pipelines consist in multiple steps including quality control, filtering, alignment of sequences against genomic catalogs (genes, species, taxonomic levels, functional pathways, etc.). These pipelines are complex to use, time consuming and rely on a large number of parameters that often provide variability and impact the estimation of the microbiome elements. Training Deep Neural Networks directly from raw sequencing data is a promising approach to bypass some of the challenges associated with mainstream bioinformatics pipelines. Most of these methods use the concept of word and sentence embeddings that create a meaningful and numerical representation of DNA sequences, while extracting features and reducing the dimensionality of the data. In this paper we present an end-to-end approach that classifies patients into disease groups directly from raw metagenomic reads: metagenome2vec. This approach is composed of four steps (i) generating a vocabulary of k-mers and learning their numerical embeddings; (ii) learning DNA sequence (read) embeddings; (iii) identifying the genome from which the sequence is most likely to come and (iv) training a multiple instance learning classifier which predicts the phenotype based on the vector representation of the raw data. An attention mechanism is applied in the network so that the model can be interpreted, assigning a weight to the influence of the prediction for each genome. Using two public real-life data-sets as well a simulated one, we demonstrated that this original approach reaches high performance, comparable with the state-of-the-art methods applied directly on processed data though mainstream bioinformatics workflows. These results are encouraging for this proof of concept work. We believe that with further dedication, the DNN models have the potential to surpass mainstream bioinformatics workflows in disease classification tasks.

Keywords: Metagenomics, phenotype prediction, deep learning, embeddings, multiple instance learning.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 834