Search results for: MNIST dataset
188 Analysis of Diverse Cluster Ensemble Techniques
Authors: S. Sarumathi, N. Shanthi, P. Ranjetha
Abstract:
Data mining is the process of discovering interesting patterns in large amounts of data. Clustering is one of the most important supporting processes for accessing data faster. Clustering identifies similarity between data objects according to their characteristics and groups associated objects into clusters. A cluster ensemble combines various runs of different clustering algorithms to obtain a general partition of the original dataset, aiming to consolidate the outcomes of a collection of individual clusterings. The performance of clustering ensembles is mainly affected by two principal factors: diversity and quality. This paper presents an overview of different cluster ensemble algorithms and the methods they use to improve diversity and quality, provides a comparative analysis of different cluster ensembles, and summarizes various cluster ensemble methods. This analysis should be useful to clustering experts and help in choosing the most appropriate method for the problem at hand.
Keywords: Cluster Ensemble, Consensus Function, CSPA, Diversity, HGPA, MCLA.
Downloads: 1841
187 The Use of Classifiers in Image Analysis of Oil Wells Profiling Process and the Automatic Identification of Events
Authors: Jaqueline M. R. Vieira
Abstract:
Different strategies and tools are available in the oil and gas industry for detecting and analyzing tension and possible fractures in borehole walls. Most of these techniques are based on manual observation of the captured borehole images. While this strategy may be feasible and convenient with small images and little data, it becomes difficult and prone to errors when large image databases must be processed. Moreover, the patterns may differ across the image area, depending on many characteristics (drilling strategy, rock components, rock strength, etc.). In this work we propose the inclusion of data-mining classification strategies in order to create a knowledge database of the segmented curves. After some time spent using the system and manually marking the parts of borehole images that correspond to tension regions and breakout areas, these classifiers allow the system to suggest new candidate regions automatically, with higher accuracy. We suggest the use of different classifier methods in order to achieve different knowledge dataset configurations.
Keywords: Brazil, classifiers, data-mining, Image Segmentation, oil well visualization.
Downloads: 2544
186 Behavioral Signature Generation using Shadow Honeypot
Authors: Maros Barabas, Michal Drozd, Petr Hanacek
Abstract:
A novel behavioral detection framework is proposed to detect zero-day buffer overflow vulnerabilities (based on network behavioral signatures) using zero-day exploits, instead of the signature-based or anomaly-based detection solutions currently available for IDPS techniques. First, we present the detection model, which uses a shadow honeypot. Our system is used for the online processing of network attacks and for generating a behavior detection profile. The detection profile is a dataset of 112 types of metrics describing the exact behavior of malware in the network. In this paper we present examples of generating behavioral signatures for two attacks: a buffer overflow exploit on an FTP server and the well-known Conficker worm. We demonstrate the visualization of important aspects by showing the differences between valid behavior and the attacks. Based on these metrics we can detect attacks with a very high probability of success; the detection process is, however, very expensive.
Keywords: behavioral signatures, metrics, network, security design
Downloads: 2053
185 A New Technique for Solar Activity Forecasting Using Recurrent Elman Networks
Authors: Salvatore Marra, Francesco C. Morabito
Abstract:
In this paper we present an efficient approach for the prediction of two sunspot-related time series, namely the Yearly Sunspot Number and the IR5 Index, that are commonly used for monitoring solar activity. The method is based on partially recurrent Elman networks and can be divided into three main steps. The first consists of a "de-rectification" of the time series under study in order to obtain a new time series whose appearance, similar to a sum of sinusoids, can be modelled by our neural networks much better than the original dataset. After that, we normalize the de-rectified data so that they have zero mean and unit standard deviation. Finally, we train an Elman network with only one input, a recurrent hidden layer and one output using a back-propagation algorithm with variable learning rate and momentum. The results show the efficiency of this approach, which, although very simple, can perform better than most existing solar activity forecasting methods.
Keywords: Elman neural networks, sunspot, solar activity, time series prediction.
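The normalization step mentioned in the abstract above (rescaling a series to zero mean and unit standard deviation) can be illustrated with a minimal sketch. The function name `standardize`, the use of the population standard deviation, and the sample values are assumptions made for illustration; the abstract does not say which variant the authors used, and the numbers are not sunspot data.

```python
from statistics import mean, pstdev


def standardize(series):
    """Rescale a series so that it has zero mean and unit standard deviation."""
    mu = mean(series)
    # Assumption: population standard deviation; a sample estimate
    # (statistics.stdev) would be an equally plausible choice.
    sigma = pstdev(series)
    if sigma == 0:
        raise ValueError("a constant series cannot be standardized")
    return [(x - mu) / sigma for x in series]


# Hypothetical values, for illustration only.
example = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(example)
```

The rescaled series keeps the shape of the original and only changes its offset and scale, which is typically why such a step precedes neural network training.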
Downloads: 1854
184 Mining Frequent Patterns with Functional Programming
Authors: Nittaya Kerdprasop, Kittisak Kerdprasop
Abstract:
Frequent patterns are patterns, such as sets of features or items, that appear frequently in data. Finding such frequent patterns has become an important data mining task because it reveals associations, correlations, and many other interesting relationships hidden in a dataset. Most of the proposed frequent pattern mining algorithms have been implemented in imperative programming languages such as C, C++, and Java. The imperative paradigm is significantly inefficient when the itemset is large and the frequent pattern is long. We suggest a high-level declarative style of programming using a functional language. Our supposition is that frequent pattern discovery can be implemented efficiently and concisely in a functional paradigm, since pattern matching is a fundamental feature supported by most functional languages. Our frequent pattern mining implementation in Haskell confirms our hypothesis about the conciseness of the program. Performance studies on speed and memory usage support our intuition about the efficiency of the functional language.
Keywords: Association, frequent pattern mining, functional programming, pattern matching.
Downloads: 2135
183 Single-Camera Basketball Tracker through Pose and Semantic Feature Fusion
Authors: Adrià Arbués-Sangüesa, Coloma Ballester, Gloria Haro
Abstract:
Tracking sports players is a widely challenging scenario, especially in single-feed videos recorded in tight courts, where cluttering and occlusions cannot be avoided. This paper presents an analysis of several geometric and semantic visual features to detect and track basketball players. An ablation study is carried out and then used to show that a robust tracker can be built with Deep Learning features, without the need to extract contextual ones, such as proximity or color similarity, or to apply camera stabilization techniques. The presented tracker consists of: (1) a detection step, which uses a pretrained deep learning model to estimate the players' pose, followed by (2) a tracking step, which leverages pose and semantic information from the output of a convolutional layer in a VGG network. Its performance is analyzed in terms of MOTA over a basketball dataset with more than 10k instances.
Keywords: Basketball, deep learning, feature extraction, single-camera, tracking.
Downloads: 698
182 Software Maintenance Severity Prediction with Soft Computing Approach
Authors: E. Ardil, Erdem Uçar, Parvinder S. Sandhu
Abstract:
Since the majority of faults are found in a few modules, there is a need to investigate the modules that are affected more severely than others, and proper maintenance needs to be done on time, especially for critical applications. In this paper, we have explored different predictor models on NASA's public domain defect dataset coded in the Perl programming language. Different machine learning algorithms belonging to the different learner categories of the WEKA project, including a Mamdani-based Fuzzy Inference System and a Neuro-fuzzy based system, have been evaluated for modeling maintenance severity or the impact of fault severity. The results are recorded in terms of Accuracy, Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). The results show that the Neuro-fuzzy based model provides relatively better prediction accuracy than the other models and hence can be used for maintenance severity prediction of software.
Keywords: Software Metrics, Fuzzy, Neuro-Fuzzy, Software Faults, Accuracy, MAE, RMSE.
Downloads: 1581
181 Differences in Innovative Orientation of the Entrepreneurially Active Adults: The Case of Croatia
Authors: Nataša Šarlija, Sanja Pfeifer
Abstract:
This study analyzes the innovative orientation of Croatian entrepreneurs. Innovative orientation is represented by the perceived extent to which an entrepreneur's product, service or technology is new and no other businesses offer the same product. The sample is extracted from the GEM Croatia Adult Population Survey dataset for the years 2003-2013. We apply descriptive statistics, the t-test, the Chi-square test and logistic regression. Findings indicate that innovative orientation varies with personal, firm, meso and macro level variables, and between different stages in the entrepreneurship process. Significant predictors are the occupation of the entrepreneur, the size of the firm and export aspiration, for both early stage and established entrepreneurs. In addition, fear of failure, expecting to start a new business and seeing an entrepreneurial career as a desirable choice are predictors of innovative orientation among early stage entrepreneurs.
Keywords: Multilevel determinants of the innovative orientation, Croatian early stage entrepreneurs, established businesses, GEM evidence.
Downloads: 1947
180 Concepts Extraction from Discharge Notes using Association Rule Mining
Authors: Basak Oguz Yolcular
Abstract:
A large amount of valuable information is available in plain text clinical reports. New techniques and technologies are applied to extract information from these reports. In this study, we developed a domain-based software system to transform 600 Otorhinolaryngology discharge notes into a structured form for extracting clinical data. In order to decrease the system processing time, discharge notes were transformed into a data table after preprocessing. Several word lists were constituted to identify common sections in the discharge notes, including patient history, age, problems, and diagnosis. The N-gram method was used for discovering term co-occurrences within each section. Using this method, a dataset of concept candidates was generated for the validation step, and then the Predictive Apriori algorithm for Association Rule Mining (ARM) was applied to validate candidate concepts.
Keywords: association rule mining, otorhinolaryngology, predictive apriori, text mining
Downloads: 1614
179 Inverse Problem Methodology for the Measurement of the Electromagnetic Parameters Using MLP Neural Network
Authors: T. Hacib, M. R. Mekideche, N. Ferkha
Abstract:
This paper presents an approach based on a supervised feed-forward neural network, namely the multilayer perceptron (MLP), and the finite element method (FEM) to solve the inverse problem of parameter identification. The approach is used to identify unknown parameters of ferromagnetic materials. The methodology consists of simulating a large number of parameters in a material under test using the FEM. Variations in both the relative magnetic permeability and the electrical conductivity of the material under test are considered. The obtained results are then used to generate a set of vectors for training the MLP neural network. Finally, the trained neural network is used to evaluate a group of new materials, simulated by the FEM but not belonging to the original dataset. Noisy data, added to the probe measurements, are used to enhance the robustness of the method. The results demonstrate the efficiency of the proposed approach and encourage future work on this subject.
Keywords: Inverse problem, MLP neural network, parameters identification, FEM.
Downloads: 1764
178 One-Class Support Vector Machines for Protein-Protein Interactions Prediction
Authors: Hany Alashwal, Safaai Deris, Razib M. Othman
Abstract:
Predicting protein-protein interactions represents a key step in understanding protein functions. This is because proteins usually work in the context of other proteins and rarely function alone. Machine learning techniques have been applied to predict protein-protein interactions. However, most of these techniques address this problem as a binary classification problem. Although it is easy to get a dataset of interacting proteins as positive examples, there are no experimentally confirmed non-interacting proteins to be considered as negative examples. Therefore, in this paper we treat this problem as a one-class classification problem using one-class support vector machines (SVM). Using only positive examples (interacting protein pairs) in the training phase, the one-class SVM achieves an accuracy of about 80%. These results imply that protein-protein interactions can be predicted using a one-class classifier with accuracy comparable to that of binary classifiers that use artificially constructed negative examples.
Keywords: Bioinformatics, Protein-protein interactions, One-Class Support Vector Machines
Downloads: 1989
177 The Response Relation between Climate Change and NDVI over the Qinghai-Tibet plateau
Authors: Shen Weishou, Ji Di, Zhang Hui, Yan Shouguang, Li Haidong, Lin Naifeng
Abstract:
Based on a long-term NDVI vegetation index dataset and meteorological data from 68 meteorological stations on the Qinghai-Tibet plateau, NDVI and its relations with major climate factors were analyzed. The results show the following: 1) The linear trends of temperature on the Qinghai-Tibet plateau indicate that the temperature generally increased, and that it rose faster in the last 20 years. 2) The most significant NDVI increase occurred in the eastern and southern plateau, whereas the western and northern plateau show a decreasing trend. 3) There is a significant positive linear correlation between NDVI and temperature and a negative correlation between NDVI and mean wind speed. However, no significant statistical relationship was found between NDVI and relative humidity, precipitation or sunshine duration. 4) The changes in NDVI for the plateau are driven by temperature-precipitation, but for the desert and forest areas, the relation changes to precipitation-temperature-wind velocity and wind velocity-temperature-precipitation.
Keywords: Qinghai-Tibet plateau, NDVI, climate warming.
Downloads: 2218
176 Financial Literacy Testing: Results of Conducted Research and Introduction of a Project
Authors: J. Nesleha, H. Florianova
Abstract:
The goal of the study is to provide the results of a study devoted to financial literacy in the Czech Republic and to introduce a project related to financial education in the Czech Republic. Financial education has become an important part of education in the country, yet it is still neglected at the lowest level of formal education, primary schools. The project is based on an investigation of financial literacy in primary schools in the Czech Republic. Consequently, the authors aim to formulate possible amendments related to this type of education. The dataset obtained is intended to be used for analysis concerning financial education in the Czech Republic. Among the methods used, the most important is regression analysis, which is used to identify predictors of different levels of financial literacy. Furthermore, a comparison of different groups is planned, for which t-tests are intended to be used. The study also employs descriptive statistics to introduce the basic relationships in the data file.
Keywords: Czech Republic, financial education, financial literacy, primary school, regression analysis.
Downloads: 855
175 Localisation of Anatomical Soft Tissue Landmarks of the Head in CT Images
Authors: M. Ovinis, D. Kerr, K. Bouazza-Marouf, M. Vloeberghs
Abstract:
In this paper, algorithms for the automatic localisation of two anatomical soft tissue landmarks of the head, the medial canthus (inner corner of the eye) and the tragus (a small, pointed, cartilaginous flap of the ear), in CT images are described. These landmarks are to be used as a basis for an automated image-to-patient registration system we are developing. The landmarks are localised on a surface model extracted from CT images, based on surface curvature and a rule-based system that incorporates prior knowledge of the landmark characteristics. The approach was tested on a dataset of near-isotropic CT images of 95 patients. The position of the automatically localised landmarks was compared to the position of the manually localised landmarks. The average difference was 1.5 mm and 0.8 mm for the medial canthus and tragus, with a maximum difference of 4.5 mm and 2.6 mm respectively. The medial canthus and tragus can be automatically localised in CT images, with performance comparable to manual localisation.
Keywords: Anatomical soft tissue landmarks, automatic localisation, Computed Tomography (CT)
Downloads: 1844
174 A Context-Sensitive Algorithm for Media Similarity Search
Authors: Guang-Ho Cha
Abstract:
This paper presents a context-sensitive media similarity search algorithm. One of the central problems in media search is the semantic gap between the low-level features computed automatically from media data and the human interpretation of them. This is because the notion of similarity is usually based on high-level abstraction, while the low-level features sometimes do not reflect human perception. Many media search algorithms have used the Minkowski metric to measure similarity between image pairs. However, such functions cannot adequately capture the characteristics of the human visual system or the nonlinear relationships in the contextual information given by images in a collection. Our search algorithm tackles this problem by employing a similarity measure and a ranking strategy that reflect the nonlinearity of human perception and the contextual information in a dataset. Similarity search in an image database based on this contextual information shows encouraging experimental results.
Keywords: Context-sensitive search, image search, media search, similarity ranking, similarity search.
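For reference, the Minkowski metric that the abstract above names as the common baseline can be sketched as follows. This illustrates only that baseline, not the context-sensitive measure the paper proposes; the function name `minkowski_distance` and the example vectors are assumptions made for illustration.

```python
def minkowski_distance(x, y, p=2.0):
    """Minkowski distance of order p between two equal-length feature vectors.

    p = 1 gives the Manhattan distance and p = 2 the Euclidean distance.
    """
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1 for a valid metric")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Hypothetical low-level feature vectors, for illustration only.
euclidean = minkowski_distance([0.0, 0.0], [3.0, 4.0], p=2)
manhattan = minkowski_distance([0.0, 0.0], [3.0, 4.0], p=1)
```

Because the distance depends only on the two vectors being compared, it cannot take into account the other images in a collection, which is the limitation the abstract points to.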
Downloads: 639
173 Breast Cancer Prediction Using Score-Level Fusion of Machine Learning and Deep Learning Models
Authors: [email protected]
Abstract:
Breast cancer is one of the most common cancer types in women. Early prediction of breast cancer helps physicians detect cancer in its early stages. Big cancer data need a very powerful tool for analysis and for extracting predictions. Machine learning and deep learning are two of the most efficient tools for predicting cancer based on textual data. In this study, we developed a fusion model that combines machine learning and deep learning models. To obtain the final prediction, Long Short-Term Memory (LSTM), ensemble learning with hyperparameter optimization, and score-level fusion are used. Experiments are done on the Breast Cancer Surveillance Consortium (BCSC) dataset after balancing and grouping the class categories. Five different training scenarios are used, and the tests show that the designed fusion model improved the performance by 3.3% compared to the individual models.
Keywords: Machine learning, Deep learning, cancer prediction, breast cancer, LSTM, Score-Level Fusion.
Downloads: 402
172 Optimal Multilayer Perceptron Structure For Classification of HIV Sub-Type Viruses
Authors: Zeyneb Kurt, Oguzhan Yavuz
Abstract:
The features of the HIV genome vary widely because it is highly heterogeneous. Hence, the infection ability of the virus changes with different chemokine receptors. R5 and X4 HIV viruses use the CCR5 and CXCR4 coreceptors respectively, while R5X4 viruses can utilize both coreceptors. Recently, in bioinformatics, the classification of R5X4 viruses using the coreceptors of the HIV genome has been studied. The aim of this study is to develop the optimal Multilayer Perceptron (MLP) for high classification accuracy of HIV sub-type viruses. To accomplish this, the number of units in the hidden layer was incremented one by one, from one to a particular number. The statistical data of R5X4, R5 and X4 viruses were preprocessed by signal processing methods. Accessible residues of these virus sequences were extracted and modeled by an Auto-Regressive (AR) model, because the dimensions of the residues are large and differ from each other. Finally, the pre-processed dataset was used to evolve MLPs with various numbers of hidden units to determine R5X4 viruses. Furthermore, ROC analysis was used to determine the optimal MLP structure.
Keywords: Multilayer Perceptron, Auto-Regressive Model, HIV, ROC Analysis
Downloads: 1440
171 Investments Attractiveness via Combinatorial Optimization Ranking
Authors: Ivan C. Mustakerov, Daniela I. Borissova
Abstract:
The paper proposes an approach to ranking a set of potential countries to invest in, taking into account the investor's point of view about the importance of different economic indicators. For this purpose, a ranking algorithm that contributes to rational decision making is proposed. The described algorithm is based on combinatorial optimization modeling and the repeated solution of multi-criteria tasks. The final result is a list of countries ranked with respect to investor preferences about the importance of economic indicators for investment attractiveness. Different scenarios are simulated, corresponding to different investor preferences. A numerical example with a real dataset of indicators is solved. The numerical testing shows the applicability of the described algorithm. The proposed approach can be used with any set of indicators as ranking criteria reflecting different investor points of view.
Keywords: Combinatorial optimization modeling, economics investment attractiveness, economics ranking algorithm, multi-criteria problems.
Downloads: 2107
170 Margin-Based Feed-Forward Neural Network Classifiers
Authors: Han Xiao, Xiaoyan Zhu
Abstract:
The Margin-Based Principle was proposed long ago, and it has been proved that this principle can reduce the structural risk and improve performance in both theoretical and practical aspects. Meanwhile, the feed-forward neural network is a traditional classifier that is currently very popular in its deeper architectures. However, the training algorithm of feed-forward neural networks is derived from the Widrow-Hoff Principle, which minimizes the squared error. In this paper, we propose a new training algorithm for feed-forward neural networks based on the Margin-Based Principle, which can effectively improve the accuracy and generalization ability of neural network classifiers with fewer labelled samples and a flexible network. We have conducted experiments on four UCI open datasets and achieved good results, as expected. In conclusion, our model can handle sparsely labelled and high-dimensional datasets with high accuracy, while modifying an old ANN method to use our method is easy and requires almost no work.
Keywords: Max-Margin Principle, Feed-Forward Neural Network, Classifier.
Downloads: 1725
169 Collaborative and Content-based Recommender System for Social Bookmarking Website
Authors: Cheng-Lung Huang, Cheng-Wei Lin
Abstract:
This study proposes a new recommender system based on collaborative folksonomy. The purpose of the proposed system is to recommend Internet resources (such as books, articles, documents, pictures, audio and video) to users. The proposed method includes four steps: creating the user profile based on tags, grouping similar users into clusters using agglomerative hierarchical clustering, finding similar resources based on the user's past collections using content-based filtering, and recommending similar items to the target user. This study examines the system's performance on a dataset collected from "del.icio.us", a well-known social bookmarking website. Experimental results show that the proposed hybrid recommender system, which combines tag-based collaborative filtering and content-based filtering, is promising and effective for folksonomy-based bookmarking websites.
Keywords: Collaborative recommendation, Folksonomy, Social tagging
Downloads: 2248
168 A Tree Based Association Rule Approach for XML Data with Semantic Integration
Authors: D. Sasikala, K. Premalatha
Abstract:
The use of eXtensible Markup Language (XML) in web, business and scientific databases has led to the development of methods, techniques and systems to manage and analyze XML data. Semi-structured documents suffer from heterogeneity and high dimensionality. XML structure and content mining represent a convergence of research in semi-structured data and text mining. As the information available on the internet grows drastically, extracting knowledge from XML documents becomes a harder task. Documents are often so large that the data set returned as the answer to a query may be too big to convey the required information. To improve query answering, a Semantic Tree Based Association Rule (STAR) mining method is proposed. This method provides intentional information by considering the structure, the content and the semantics of the content. The method is applied to the Reuters dataset, and the results show that the proposed method performs well.
Keywords: Semi-structured Document, Tree based Association Rule (TAR), Semantic Association Rule Mining.
Downloads: 2352
167 Modified Hybrid Genetic Algorithm-Based Artificial Neural Network Application on Wall Shear Stress Prediction
Authors: Zohreh Sheikh Khozani, Wan Hanna Melini Wan Mohtar, Mojtaba Porhemmat
Abstract:
Prediction of wall shear stress in a rectangular channel with non-homogeneous roughness distribution was studied. Estimation of shear stress is an important subject in hydraulic engineering, since it affects the flow structure directly. In this study, the Genetic Algorithm Artificial (GAA) neural network is introduced as a hybrid methodology that combines the Artificial Neural Network (ANN) with a modified Genetic Algorithm (GA). This GAA method was employed to predict the wall shear stress. Various input combinations and transfer functions were considered to find the most appropriate GAA model. The results show that the proposed GAA method could predict the wall shear stress of open channels with high accuracy, with a Root Mean Square Error (RMSE) of 0.064 on the test dataset. Thus, using GAA provides an accurate, practical and simple-to-use equation.
Keywords: Artificial neural network, genetic algorithm, genetic programming, rectangular channel, shear stress.
Downloads: 670
166 Apoptosis Inspired Intrusion Detection System
Authors: R. Sridevi, G. Jagajothi
Abstract:
Artificial Immune Systems (AIS), inspired by the human immune system, are algorithms and mechanisms that act as self-adaptive and self-learning classifiers, capable of recognizing and classifying through learning, long-term memory and association. Unlike other techniques inspired by human systems, such as genetic algorithms and neural networks, AIS includes a range of algorithms modeled on different immune mechanisms of the body. In this paper, a mechanism of the human immune system based on apoptosis is adopted to build an Intrusion Detection System (IDS) to protect computer networks. Features are selected from network traffic using the Fisher Score. Based on the selected features, each record/connection is classified as either an attack or normal traffic by the proposed methodology. Simulation results demonstrate that the proposed AIS based on apoptosis performs better than existing AIS for intrusion detection.
Keywords: Apoptosis, Artificial Immune System (AIS), Fisher Score, KDD dataset, Network intrusion detection.
Downloads: 2191
165 Feature Selection and Predictive Modeling of Housing Data Using Random Forest
Authors: Bharatendra Rai
Abstract:
Predictive data analysis and modeling with machine learning techniques become challenging in the presence of too many explanatory variables or features. Too many features are known not only to slow algorithms down, but also to decrease model prediction accuracy. This study involves a housing dataset with 79 quantitative and qualitative features that describe various aspects people consider when buying a new house. The Boruta algorithm, which supports feature selection using a wrapper approach built around random forest, is used in this study. This feature selection process leads to 49 confirmed features, which are then used for developing predictive random forest models. The study also explores five different data partitioning ratios, and their impact on model accuracy is captured using the coefficient of determination (R-squared) and the root mean square error (RMSE).
Keywords: Housing data, feature selection, random forest, Boruta algorithm, root mean square error.
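The two accuracy measures named in the abstract above, the coefficient of determination and the root mean square error, can be computed as in the following minimal sketch. The function names and the sample values are assumptions made for illustration; they are not taken from the study or its housing data.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(actual)
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted values, for illustration only.
observed = [1.0, 2.0, 3.0]
constant_guess = [2.0, 2.0, 2.0]  # always predicts the mean of the observed values
error = rmse(observed, constant_guess)
fit = r_squared(observed, constant_guess)
```

A model that always predicts the mean of the observed values has an R-squared of zero, which makes it a convenient baseline when comparing partitioning ratios.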
Downloads: 1715
164 An Efficient Graph Query Algorithm Based on Important Vertices and Decision Features
Authors: Xiantong Li, Jianzhong Li
Abstract:
Graphs have become increasingly important in modeling complicated structures and schemaless data such as proteins, chemical compounds, and XML documents. Given a graph query, it is desirable to retrieve graphs quickly from a large database via graph-based indices. Unlike existing methods, our approach, called VFM (Vertex to Frequent Feature Mapping), makes use of vertices and decision features as the basic indexing features. VFM constructs two mappings between vertices and frequent features to answer graph queries. The VFM approach not only provides an elegant solution to the graph indexing problem, but also demonstrates how database indexing and query processing can benefit from data mining, especially frequent pattern mining. The results show that the proposed method not only avoids enumerating the subgraphs of the query graph, but also effectively reduces the number of subgraph isomorphism tests between the query graph and the graphs in the candidate answer set during the verification stage.
Keywords: Decision Feature, Frequent Feature, Graph Dataset, Graph Query
Downloads 1871
163 Aliveness Detection of Fingerprints using Multiple Static Features
Authors: Heeseung Choi, Raechoong Kang, Kyungtaek Choi, Jaihie Kim
Abstract:
Fake finger submission attacks are a major problem in fingerprint recognition systems. In this paper, we introduce an aliveness detection method based on multiple static features, which are derived from a single fingerprint image. The static features comprise individual pore spacing, residual noise and several first-order statistics. Specifically, a correlation filter is adopted to address individual pore spacing. The multiple static features reflect the physiological and statistical characteristics of live and fake fingerprints. The classification is made by calculating a liveness score from each feature and fusing the scores through a classifier. On our dataset, we compare nine classifiers, and the best classification rate of 85% is attained using a Reduced Multivariate Polynomial classifier. Our approach is faster and more convenient for aliveness checks in field applications.
Keywords: Aliveness detection, Fingerprint recognition, individual pore spacing, multiple static features, residual noise.
Downloads 1925
162 Experiments on Element and Document Statistics for XML Retrieval
Authors: Mohamed Ben Aouicha, Mohamed Tmar, Mohand Boughanem, Mohamed Abid
Abstract:
This paper presents an information retrieval model for XML documents based on tree matching. Queries and documents are represented by extended trees. An extended tree is built from the original tree by adding weighted virtual links between each node and its indirect descendants, so that each descendant can be reached directly. Therefore only one level separates each node from its indirect descendants. This allows the user query and the document to be compared flexibly while respecting the structural constraints of the query. The content of each node is very important in deciding whether a document element is relevant or not, so the content should be taken into account in the retrieval process. We separate the structure-based and the content-based retrieval processes. The content-based score of each node is commonly based on the well-known Tf × Idf criterion. In this paper, we compare this criterion with another one that we call Tf × Ief. The comparison is based on experiments on a dataset provided by INEX, to show the effectiveness of our approach on the one hand and of both weighting functions on the other.
Keywords: XML retrieval, INEX, Tf × Idf, Tf × Ief
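The difference between the two weighting criteria in the abstract above can be illustrated with a small sketch. The abstract names Tf × Ief without defining it; the code assumes Ief is an inverse element frequency, computed like Idf but counting XML elements instead of whole documents. That definition, the toy collection, and the function names are assumptions made for illustration only.

```python
import math
from collections import Counter

# Toy collection: each document is a list of elements (XML nodes),
# and each element is a list of tokens. Purely illustrative data.
docs = {
    "d1": [["xml", "retrieval", "model"], ["tree", "matching"]],
    "d2": [["xml", "tree"], ["xml", "query"], ["structure"]],
    "d3": [["retrieval", "evaluation"]],
}


def idf(term):
    """log(N_docs / number of documents containing the term)."""
    df = sum(1 for elems in docs.values() if any(term in e for e in elems))
    return math.log(len(docs) / df) if df else 0.0


def ief(term):
    """Assumed definition: log(N_elements / number of elements containing the term)."""
    all_elems = [e for elems in docs.values() for e in elems]
    ef = sum(1 for e in all_elems if term in e)
    return math.log(len(all_elems) / ef) if ef else 0.0


def score(element, query, weight):
    """Sum of tf * weight(term) over the query terms, for one element."""
    tf = Counter(element)
    return sum(tf[t] * weight(t) for t in query)


query = ["xml", "tree"]
for doc_id, elems in docs.items():
    for i, elem in enumerate(elems):
        tf_idf = score(elem, query, idf)
        tf_ief = score(elem, query, ief)
        print(f"{doc_id}[{i}]  tf*idf={tf_idf:.3f}  tf*ief={tf_ief:.3f}")
```

Under this assumed definition, a term that is repeated across many elements of the same document is penalized by Ief but not by Idf, which is one plausible reason to compare the two at the element level.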
Downloads 1336
161 A Comparison between Artificial Neural Network Prediction Models for Coronal Hole Related High-Speed Streams
Authors: Rehab Abdulmajed, Amr Hamada, Ahmed Elsaid, Hisashi Hayakawa, Ayman Mahrous
Abstract:
Solar emissions have a strong impact on the Earth’s magnetic field, and the prediction of solar events is of high interest. Various techniques have been used to predict the solar wind, including mathematical models, MHD models and neural network (NN) models. This study investigates coronal hole (CH) derived high-speed streams (HSSs) and their correlation with the CH area, and creates a neural network model to predict the HSSs. Two different algorithms were used to compare different models and find the one that best simulates the HSSs. The dataset consists of CH synoptic maps for Carrington rotations 1601 to 2185, together with Omni-data set solar wind speed averaged over the Carrington rotations; it covers Solar Cycles (SC) 21, 22, 23, and most of 24.
Keywords: Artificial Neural Network, ANN, Coronal Hole Area, Feed-Forward neural network models, solar High-Speed Streams, HSSs.
Downloads 130
160 A Bayesian Hierarchical 13COBT to Correct Estimates Associated with a Delayed Gastric Emptying
Authors: Leslie J.C. Bluck, Sarah J. Jackson, Georgios Vlasakakis, Adrian Mander
Abstract:
The use of a Bayesian Hierarchical Model (BHM) to interpret breath measurements obtained during a 13C Octanoic Breath Test (13COBT) is demonstrated. The statistical analysis was implemented using WinBUGS, a commercially available computer package for Bayesian inference. A hierarchical setting was adopted in which poorly defined parameters associated with delayed Gastric Emptying (GE) were able to "borrow" strength from global distributions. This proved to be a sufficient tool for correcting the model failures and data inconsistencies apparent in conventional analyses employing a non-linear least squares (NLS) technique. Direct comparison of two parameters describing gastric emptying (t_lag, the lag phase, and t_1/2, the half-emptying time) revealed a strong correlation between the two methods. Despite our large dataset (n = 164), Bayesian modeling was fast and provided a successful fit for all subjects. In contrast, NLS failed to return acceptable estimates in cases where GE was delayed.
Keywords: Bayesian hierarchical analysis, 13COBT, Gastric emptying, WinBUGS.
Downloads 1455
159 GeNS: a Biological Data Integration Platform
Authors: Joel Arrais, João E. Pereira, João Fernandes, José Luís Oliveira
Abstract:
The scientific achievements coming from molecular biology depend greatly on the capability of computational applications to analyze laboratory results. A comprehensive analysis of an experiment typically requires studying the obtained dataset together with data available in several distinct public databases. Nevertheless, developing centralized access to these distributed databases raises a set of challenges, such as: what is the best integration strategy, how to resolve nomenclature clashes, how to handle data that overlaps between databases, and how to deal with huge datasets. In this paper we present GeNS, a system that uses a simple yet innovative approach to address several biological data integration issues. Compared with existing systems, the main advantages of GeNS are its ease of maintenance and its coverage and scalability, in terms of the number of supported databases and data types. To support our claims we present the current use of GeNS in two concrete applications. GeNS currently contains more than 140 million biological relations, and it can be publicly downloaded or remotely accessed through SOAP web services.
Keywords: Data integration, biological databases
Downloads 1632