Search results for: Distributed Data Mining
7973 Arabic Light Stemmer for Better Search Accuracy
Authors: Sahar Khedr, Dina Sayed, Ayman Hanafy
Abstract:
Arabic is one of the most ancient and critical languages in the world. It has more than 250 million native speakers, and more than twenty countries have Arabic as one of their official languages. In the past decade, we have witnessed a rapid evolution in smart devices, social networks and the technology sector, which has led to a need for tools and libraries that properly handle the Arabic language in different domains. Stemming is one of the most crucial linguistic fundamentals. It is used in many applications, especially in the information extraction and text mining fields. The motivation behind this work is to enhance the Arabic light stemmer to serve the data mining industry and to make it available to the open source community. The presented implementation enhances the Arabic light stemmer by extending an existing algorithm with a new set of rules and patterns accompanied by an adjusted procedure. The study shows a significant gain in search accuracy, with an average improvement of 10% over previous work.
Keywords: Arabic data mining, Arabic Information extraction, Arabic Light stemmer, Arabic stemmer.
Downloads 1496
7972 Content Based Sampling over Transactional Data Streams
Authors: Mansour Tarafdar, Mohammad Saniee Abade
Abstract:
This paper investigates the problem of sampling from transactional data streams. We introduce CFISDS, a content based sampling algorithm that works on a landmark window model of data streams and preserves a more informed sample in the sample space. The algorithm, which is based on closed frequent itemset mining, first initiates a concept lattice using initial data and then updates the lattice structure using an incremental mechanism. The incremental mechanism inserts, updates and deletes nodes in the concept lattice in a batch manner. The algorithm extracts the final samples on user demand. Experimental results show the accuracy of CFISDS on synthetic and real datasets, although CFISDS is not faster than existing sampling algorithms such as Z and DSS.
Keywords: Sampling, data streams, closed frequent item set mining.
Downloads 1709
7971 Using Data Mining Techniques for Estimating Minimum, Maximum and Average Daily Temperature Values
Authors: S. Kotsiantis, A. Kostoulas, S. Lykoudis, A. Argiriou, K. Menagias
Abstract:
Estimates of temperature values at a specific time of day, from daytime and daily profiles, are needed for a number of environmental, ecological, agricultural and technical applications, ranging from natural hazard assessment and crop growth forecasting to the design of solar energy systems. The scope of this research is to investigate the efficiency of data mining techniques in estimating minimum, maximum and mean temperature values. To this end, a number of experiments have been conducted with well-known regression algorithms using temperature data from the city of Patras in Greece. The performance of these algorithms has been evaluated using standard statistical indicators, such as Correlation Coefficient, Root Mean Squared Error, etc.
Keywords: regression algorithms, supervised machine learning.
Downloads 3418
7970 Data Preprocessing for Supervised Leaning
Authors: S. B. Kotsiantis, D. Kanellopoulos, P. E. Pintelas
Abstract:
Many factors affect the success of Machine Learning (ML) on a given task. The representation and quality of the instance data is first and foremost. If there is much irrelevant and redundant information present, or noisy and unreliable data, then knowledge discovery during the training phase is more difficult. It is well known that data preparation and filtering steps take a considerable amount of processing time in ML problems. Data pre-processing includes data cleaning, normalization, transformation, feature extraction and selection, etc. The product of data pre-processing is the final training set. It would be nice if a single sequence of data pre-processing algorithms had the best performance for each data set, but this is not the case. Thus, we present the most well-known algorithms for each step of data pre-processing so that one can achieve the best performance for a given data set.
Keywords: Data mining, feature selection, data cleaning.
Downloads 6091
7969 A New DIDS Design Based on a Combination Feature Selection Approach
Authors: Adel Sabry Eesa, Adnan Mohsin Abdulazeez Brifcani, Zeynep Orman
Abstract:
Feature selection has been used in many fields such as classification, data mining and object recognition, and has proven to be effective for removing irrelevant and redundant features from the original dataset. In this paper, a new design of a distributed intrusion detection system is presented, using a combination feature selection model based on the Bees algorithm and a decision tree. The Bees algorithm is used as the search strategy to find the optimal subset of features, whereas the decision tree is used to judge the selected features. Both the produced features and the generated rules are used by a Decision Making Mobile Agent to decide whether or not there is an attack in the networks. The Decision Making Mobile Agent migrates through the networks, moving from one node to another; if it finds that there is an attack on one of the nodes, it alerts the user through the User Interface Agent or takes some action through the Action Mobile Agent. The KDD Cup 99 dataset is used to test the effectiveness of the proposed system. The results show that even if only four features are used, the proposed system performs better than when all 41 features are used.
Keywords: Distributed intrusion detection system, mobile agent, feature selection, Bees Algorithm, decision tree.
Downloads 1939
7968 Data Mining on the Router Logs for Statistical Application Classification
Authors: M. Rahmati, S.M. Mirzababaei
Abstract:
With the advance of information technology in the new era, the use of the Internet to access data resources has steadily increased, and huge amounts of data have become accessible in various forms. Network providers and agencies seek to prevent electronic attacks that may be harmful or may be related to terrorist applications. This has led the authorities to undertake a variety of methods to protect particular regions from harmful data. One of the most important approaches is to use a firewall in the network facilities. The main objective of firewalls is to stop the transfer of suspicious packets in several ways. However, because of their blind packet stopping, high processing power requirements and expensive prices, some providers are reluctant to use firewalls. In this paper we propose a method to find a discriminant function that distinguishes between usual packets and harmful ones by statistical processing of the network router logs. By discriminating these data, an administrator may take appropriate action against the user. This method is very fast and can be used simply alongside the Internet routers.
Keywords: Data Mining, Firewall, Optimization, Packet classification, Statistical Pattern Recognition.
Downloads 1655
7967 A Delay-Tolerant Distributed Query Processing Architecture for Mobile Environment
Authors: T.P. Andamuthu, Dr. P. Balasubramanie
Abstract:
Intermittent connectivity modifies the “always on” network assumption made by all distributed query processing systems. In modern-day systems, the absence of network connectivity is considered a fault. It might not be feasible to transmit all the data accumulated since the last upload right away over the available connection. It is possible that vital information may be delayed excessively when less important information takes its place. Owing to the restricted and uneven bandwidth, it is vital that the mobile nodes make the most advantageous use of the connectivity when it arrives. Hence, in order to select the data that needs to be transmitted first, some sort of data prioritization is essential. This paper proposes a continuous query processing system for intermittently connected mobile networks that comprises a delay-tolerant continuous query processor distributed across the mobile hosts. In addition, a mechanism for prioritizing query results has been designed that guarantees enhanced accuracy and reduced delay. Extensive simulation results illustrate that our architecture reduces client power consumption and increases query efficiency.
Keywords: Broadcast, Location, Mobile host, Mobility, Query.
Downloads 1450
7966 Extraction of Data from Web Pages: A Vision Based Approach
Authors: P. S. Hiremath, Siddu P. Algur
Abstract:
With the explosive growth of information sources available on the World Wide Web, it has become increasingly difficult to identify the relevant pieces of information, since web pages are often cluttered with irrelevant content such as advertisements, navigation panels and copyright notices surrounding the main content of the web page. Hence, tools for the mining of data regions, data records and data items need to be developed in order to provide value-added services. Currently available automatic techniques to mine data regions from web pages are still unsatisfactory because of their poor performance and tag dependence. In this paper, a novel method to extract data items from web pages automatically is proposed. It comprises two steps: (1) identification and extraction of the data regions based on visual clues; (2) identification of data records and extraction of data items from a data region. For step 1, a novel and more effective method is proposed, which finds the data regions formed by all types of tags using visual clues. For step 2, a more effective method, namely Extraction of Data Items from web Pages (EDIP), is adopted to mine data items. The EDIP technique is a list-based approach in which the list is a linear data structure. The proposed technique is able to mine non-contiguous data records and can correctly identify data regions, irrespective of the type of tag in which they are bound. Our experimental results show that the proposed technique performs better than the existing techniques.
Keywords: Web data records, web data regions, web mining.
Downloads 1901
7965 Performance Evaluation of Distributed and Co-Located MIMO LTE Physical Layer Using Wireless Open-Access Research Platform
Authors: Ishak Suleiman, Ahmad Kamsani Samingan, Yeoh Chun Yeow, Abdul Aziz Bin Abdul Rahman
Abstract:
In this paper, we evaluate the benefits of distributed 4x4 MIMO LTE downlink systems compared to that of the co-located 4x4 MIMO LTE downlink system. The performance evaluation was carried out experimentally by using Wireless Open-Access Research Platform (WARP), where the comparison between the 4x4 MIMO LTE transmission downlink system in distributed and co-located techniques was examined. The measured Error Vector Magnitude (EVM) results showed that the distributed technique achieved better system performance compared to the co-located arrangement.
Keywords: Multiple-input-multiple-output, MIMO, distributed MIMO, co-located MIMO, LTE.
Downloads 1850
7964 Data Placement in Heterogeneous Storage of Short Videos
Authors: W. Jaipahkdee, C. Srinilta
Abstract:
The overall service performance of an I/O-intensive system depends mainly on the workload on its storage system. In a heterogeneous storage environment, where storage elements from different vendors with different capacity and performance are put together, workload should be distributed according to storage capability. This paper addresses the data placement issue in a short video sharing website. The workload contributed by a video is estimated from the number of views and the life span of existing videos in the same category. An experiment was conducted on 42,000 video titles over six weeks. The results showed that the proposed algorithm distributed workload and maintained balance better than round robin and random algorithms.
Keywords: data placement, heterogeneous storage system, YouTube, short videos
Downloads 1488
7963 A General Framework for Knowledge Discovery Using High Performance Machine Learning Algorithms
Authors: S. Nandagopalan, N. Pradeep
Abstract:
The aim of this paper is to propose a general framework for storing, analyzing, and extracting knowledge from two-dimensional echocardiographic images, color Doppler images, non-medical images, and general data sets. A number of high performance data mining algorithms have been used to carry out this task. Our framework encompasses four layers, namely physical storage, object identification, knowledge discovery, and user level. Techniques such as an active contour model to identify the cardiac chambers, pixel classification to segment the color Doppler echo image, a universal model for image retrieval, a Bayesian method for classification, and parallel algorithms for image segmentation were employed. Using the feature vector database that has been efficiently constructed, one can perform various data mining tasks such as clustering and classification with efficient algorithms, along with image mining given a query image. All these facilities are included in the framework, which is supported by a state-of-the-art user interface (UI). The algorithms were tested with actual patient data and the Coral image database, and the results show that their performance is better than the results reported already.
Keywords: Active Contour, Bayesian, Echocardiographic image, Feature vector.
Downloads 1713
7962 A Fuzzy Dynamic Load Balancing Algorithm for Homogenous Distributed Systems
Authors: Ali M. Alakeel
Abstract:
Load balancing in distributed computer systems is the process of redistributing the work load among processors in the system to improve system performance. Most previous research on using fuzzy logic for load balancing has concentrated only on using fuzzy logic concepts to describe processor load and task execution length. The responsibility for the fuzzy-based load balancing process itself, however, has not been discussed, and in most reported work it is assumed to be performed in a distributed fashion by all nodes in the network. This paper proposes a new fuzzy dynamic load balancing algorithm for homogenous distributed systems. The proposed algorithm utilizes fuzzy logic in dealing with inaccurate load information, making load distribution decisions, and maintaining overall system stability. In terms of control, we propose a new approach that specifies how, when, and by which node the load balancing is implemented. Our approach is called Centralized-But-Distributed (CBD).
Keywords: Dynamic load balancing, fuzzy logic, distributed systems, algorithm.
Downloads 2455
7961 Linguistic Summarization of Structured Patent Data
Authors: E. Y. Igde, S. Aydogan, F. E. Boran, D. Akay
Abstract:
Patent data have an increasingly important role in economic growth, innovation, technical advantages and business strategies, and even in competition between countries. Analyzing patent data is crucial since patents cover a large part of all the technological information in the world. In this paper, we have used the linguistic summarization technique to test the validity of hypotheses related to patent data stated in the literature.
Keywords: Data mining, fuzzy sets, linguistic summarization, patent data.
Downloads 1217
7960 On the Continuous Service of Distributed e-Learning System
Authors: Kazunari Meguro, Shinichi Motomura, Takao Kawamura, Kazunori Sugahara
Abstract:
In this paper, a backup and recovery technique is presented for Peer to Peer applications, such as a distributed asynchronous Web-Based Training system that we have previously proposed. In order to improve the scalability and robustness of this system, all contents and functions are realized on mobile agents. These agents are distributed to computers, and they can be obtained using a Peer to Peer network based on a modified Content-Addressable Network. In the proposed system, the service as a whole does not become unavailable even if some computers break down, but contents disappear when an agent disappears. As a solution to this issue, backups of agents are distributed to computers. If a failure of a computer is detected, other computers continue the service using backups of the agents that belonged to that computer.
Keywords: Distributed Multimedia Systems, e-Learning, P2P, Mobile Agent
Downloads 1559
7959 Reliability Evaluation of Distribution System Considering Distributed Generation
Authors: Raju Kaduru, Narsaiah Srinivas Gondlala
Abstract:
This paper presents an analytical approach for evaluating distribution system reliability indices in the presence of distributed generation (DG). Distributed generation is modeled and the distribution system reliability indices are evaluated using the frequency duration technique. The model implementation and case studies are discussed. The results show the effect of DG location on the distribution reliability indices. In this respect, the impact of DG on the distribution system is investigated using the IEEE Roy Billinton test system (RBTS2), including feeder 1. The approach is therefore intended to help distribution system planners with DG resource placement.
Keywords: Distributed Generation, DG Location, Distribution System, Reliability Indices.
Downloads 2191
7958 Performance Comparison of Particle Swarm Optimization with Traditional Clustering Algorithms used in Self-Organizing Map
Authors: Anurag Sharma, Christian W. Omlin
Abstract:
Self-organizing map (SOM) is a well known data reduction technique used in data mining. It can reveal structure in data sets through data visualization that is otherwise hard to detect from raw data alone. However, interpretation through visual inspection is prone to errors and can be very tedious. There are several techniques for the automatic detection of clusters of code vectors found by SOM, but they generally do not take into account the distribution of code vectors; this may lead to unsatisfactory clustering and poor definition of cluster boundaries, particularly where the density of data points is low. In this paper, we propose the use of an adaptive heuristic particle swarm optimization (PSO) algorithm for finding cluster boundaries directly from the code vectors obtained from SOM. The application of our method to several standard data sets demonstrates its feasibility. PSO algorithm utilizes a so-called U-matrix of SOM to determine cluster boundaries; the results of this novel automatic method compare very favorably to boundary detection through traditional algorithms namely k-means and hierarchical based approach which are normally used to interpret the output of SOM.
Keywords: cluster boundaries, clustering, code vectors, data mining, particle swarm optimization, self-organizing maps, U-matrix.
Downloads 1909
7957 Discovery of Time Series Event Patterns based on Time Constraints from Textual Data
Authors: Shigeaki Sakurai, Ken Ueno, Ryohei Orihara
Abstract:
This paper proposes a method that discovers time series event patterns from textual data with time information. The patterns are composed of sequences of events and each event is extracted from the textual data, where an event is characteristic content included in the textual data such as a company name, an action, and an impression of a customer. The method introduces 7 types of time constraints based on the analysis of the textual data. The method also evaluates these constraints when the frequency of a time series event pattern is calculated. We can flexibly define the time constraints for interesting combinations of events and can discover valid time series event patterns which satisfy these conditions. The paper applies the method to daily business reports collected by a sales force automation system and verifies its effectiveness through numerical experiments.
Keywords: Text mining, sequential mining, time constraints, daily business reports.
Downloads 1488
7956 Mining Image Features in an Automatic Two-Dimensional Shape Recognition System
Authors: R. A. Salam, M.A. Rodrigues
Abstract:
The number of features required to represent an image can be very large. Using all available features to recognize objects can suffer from the curse of dimensionality. Feature selection and extraction is the pre-processing step of image mining. The main issues in analyzing images are the effective identification of features and their extraction. The mining problem addressed here is the grouping of features for different shapes. Experiments have been conducted using the shape outline as the features. Shape outline readings are put through a normalization and dimensionality reduction process using an eigenvector based method to produce a new set of readings. After this pre-processing step, data are grouped by shape. Through statistical analysis of these readings, together with peak measures, a robust classification and recognition process is achieved. Tests showed that the suggested methods are able to automatically recognize objects through their shapes. Finally, experiments also demonstrate the system's invariance to rotation, translation, scale, reflection and a small degree of distortion.
Keywords: Image mining, feature selection, shape recognition, peak measures.
Downloads 1458
7955 A Practical Distributed String Matching Algorithm Architecture and Implementation
Authors: Bi Kun, Gu Nai-jie, Tu Kun, Liu Xiao-hu, Liu Gang
Abstract:
Traditional parallel single string matching algorithms are always based on the PRAM computation model. Those algorithms concentrate on cost-optimal design and theoretical speed. Based on the distributed string matching algorithm proposed by CHEN, a practical distributed string matching algorithm architecture is proposed in this paper. An improved single string matching algorithm based on a variant of the Boyer-Moore algorithm is also presented. We implement our algorithm on the above architecture, and the experiments show that it is practical and efficient on a distributed memory machine. Its computational complexity is O(n/p + m), where n is the length of the text, m is the length of the pattern, and p is the number of processors.
Keywords: Boyer-Moore algorithm, distributed algorithm, parallel string matching, string matching.
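The O(n/p + m) bound in the abstract above is consistent with the usual way of partitioning a text for parallel matching: each of the p workers receives about n/p characters plus an overlap of m - 1 characters, so that a match straddling a chunk boundary is still found exactly once. The following Python sketch illustrates only that partitioning idea. It is not the authors' architecture or their Boyer-Moore variant; it substitutes the simpler Boyer-Moore-Horspool rule, runs the chunks sequentially rather than on separate processors, and the function names are hypothetical.

```python
def horspool_search(text, pattern):
    """Return all start indices of pattern in text (Boyer-Moore-Horspool)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Bad-character table: distance from the last occurrence of each
    # character (excluding the final pattern position) to the pattern end.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    hits, pos = [], 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Shift by the table entry for the text character aligned with
        # the last pattern position; unknown characters shift by m.
        pos += shift.get(text[pos + m - 1], m)
    return hits


def chunked_search(text, pattern, p):
    """Search p overlapping chunks; each chunk stands in for one processor."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n or p < 1:
        return []
    size = -(-n // p)  # ceiling of n / p
    hits = []
    for k in range(p):
        start = k * size
        if start >= n:
            break
        # Extend the chunk by m - 1 characters so boundary matches are kept,
        # while matches starting in the next chunk are left to that chunk.
        end = min(n, start + size + m - 1)
        hits.extend(start + i for i in horspool_search(text[start:end], pattern))
    return hits
```

For example, `chunked_search("abracadabra abracadabra", "abra", 3)` scans three chunks of at most 11 characters each and reports the same positions as a single pass over the whole text.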
Downloads 2189
7954 A Supervised Learning Data Mining Approach for Object Recognition and Classification in High Resolution Satellite Data
Authors: Mais Nijim, Rama Devi Chennuboyina, Waseem Al Aqqad
Abstract:
Advances in spatial and spectral resolution of satellite images have led to tremendous growth in large image databases. The data we acquire through satellites, radars, and sensors contain important geographical information that can be used for remote sensing applications such as region planning and disaster management. Spatial data classification and object recognition are important tasks for many applications. However, classifying objects and identifying them manually from images is a difficult task. Object recognition is often considered a classification problem, so this task can be performed using machine-learning techniques. Although many machine-learning algorithms exist, the classification is done using supervised classifiers such as Support Vector Machines (SVM), as the area of interest is known. We propose a classification method that considers neighboring pixels in a region for feature extraction and evaluates classifications according to neighboring classes for semantic interpretation of the region of interest (ROI). A dataset has been created for training and testing purposes; we generated the attributes by considering pixel intensity values and mean values of reflectance. We demonstrate the benefits of using knowledge discovery and data-mining techniques, which can be applied to image data for accurate information extraction and classification from high spatial resolution remote sensing imagery.
Keywords: Remote sensing, object recognition, classification, data mining, waterbody identification, feature extraction.
Downloads 2053
7953 A Study of Soil Heavy Metal Pollution in the Manganese Mining in Drama, Greece
Authors: A. Argiri, A. Molla, Tzouvalekas, E. Skoufogianni, N. Danalatos
Abstract:
The release of heavy metals into the environment has increased over the last years. In this study, 25 soil samples (0-15 cm) from the fields near the mining area in the Drama region were selected. The samples were analyzed in the laboratory for their physicochemical properties and for the content of seven “pseudo-total” heavy metals, namely Pb, Zn, Cd, Cr, Cu, Ni, and Mn. The total metal concentrations (Pb, Zn, Cd, Cr, Cu, Ni and Mn) in digests were determined using an atomic absorption spectrophotometer. According to the results, the mean concentrations of the listed heavy metals in the 25 soil samples are Cd 1.1 mg/kg, Cr 15 mg/kg, Cu 21.7 mg/kg, Ni 30.1 mg/kg, Pb 50.8 mg/kg, Zn 99.5 mg/kg and Mn 815.3 mg/kg. The results show that the heavy metals remain in the soil even though mining ceased many years ago.
Keywords: Greece, heavy metals, mining, pollution
Downloads 582
7952 Association of Smoking with Chest Radiographic and Lung Function Findings in Retired Bauxite Mining Workers
Authors: L. R. Ferreira, R. C. G. Bianchi, L. C.R. Ferreira, C. M. Galhardi, E. P. Baciuk, L. H. Oliveira
Abstract:
Inhalation hazards are associated with potentially injurious exposure and increased risk for lung diseases within the bauxite mining industry, especially for smelter workers. Smoking is related to decreased lung function and leads to chronic lung diseases. The objective of this study was to evaluate whether smoking is related to functional and radiographic respiratory changes in retired bauxite mining workers. Methods: This was a retrospective and cross-sectional study involving the analysis of database information of 140 retired bauxite mining workers from Poços de Caldas-MG evaluated at the Worker’s Health Reference Center and at the Social Security Brazilian National Institute, from July 1st, 2015 until June 30th, 2016. The workers were divided into three groups: non-smokers (n = 47), ex-smokers (n = 46), and smokers (n = 47). The data included age, gender, spirometry results, and the presence or absence of pulmonary pleural and/or parenchymal changes in chest radiographs. The Chi-Squared test was used (p < 0.05). Results: In the smokers’ group, 83% of spirometry tests and 64% of chest x-rays were altered. In the non-smokers’ group, 19% of spirometry tests and 13% of chest x-rays were altered. In the ex-smokers’ group, 35% of spirometry tests and 30% of chest x-rays were altered. Most of the results were statistically significant. Results demonstrated a significant difference between the smokers’ and non-smokers’ groups in regard to spirometric and radiographic pulmonary alterations. The ex-smokers’ and non-smokers’ groups demonstrated better results than the smokers’ group in relation to altered spirometry and radiograph findings. These data may contribute to planning strategies to enhance smoking cessation programs within the bauxite mining industry.
Keywords: Bauxite mining, spirometry, chest radiography, smoking.
Downloads 701
7951 Towards Clustering of Web-based Document Structures
Authors: Matthias Dehmer, Frank Emmert Streib, Jürgen Kilian, Andreas Zulauf
Abstract:
Methods for organizing web data into groups in order to analyze web-based hypertext data and facilitate data availability are very important given the number of documents available online. The task of clustering web-based document structures therefore has many applications, e.g., improving information retrieval on the web, better understanding of user navigation behavior, improving the servicing of web users' requests, and increasing web information accessibility. In this paper we investigate a new approach for clustering web-based hypertexts on the basis of their graph structures. The hypertexts are represented as so-called generalized trees, which are more general than usual directed rooted trees, e.g., DOM-Trees. As an important preprocessing step, we measure the structural similarity between the generalized trees on the basis of a similarity measure d. Then, we apply agglomerative clustering to the obtained similarity matrix in order to create clusters of hypertext graph patterns representing navigation structures. In the present paper we run our approach on a data set of hypertext structures and obtain good results in Web Structure Mining. Furthermore, we outline the application of our approach in Web Usage Mining as future work.
Keywords: Clustering methods, graph-based patterns, graph similarity, hypertext structures, web structure mining
Downloads 1506
7950 Improved FP-growth Algorithm with Multiple Minimum Supports Using Maximum Constraints
Authors: Elsayeda M. Elgaml, Dina M. Ibrahim, Elsayed A. Sallam
Abstract:
Association rule mining is one of the most important fields of data mining and knowledge discovery. In this paper, we propose an efficient multiple support frequent pattern growth algorithm, which we call “MSFP-growth”, that enhances the FP-growth algorithm by adding an infrequent child node pruning step with multiple minimum supports using maximum constraints. The algorithm is implemented and compared with other common algorithms: Apriori with multiple minimum supports using maximum constraints, and FP-growth. The experimental results show that the rules mined by the proposed algorithm are interesting and that our algorithm achieves better performance than the other algorithms without sacrificing accuracy.
Keywords: Association Rules, FP-growth, Multiple minimum supports, Weka Tool
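The abstract above does not spell out the "maximum constraints" rule. As it is commonly described for mining with multiple minimum supports, each item has its own minimum support and an itemset must reach the largest of the minimum supports of its items. The Python sketch below illustrates only that thresholding rule under this assumption, using a brute-force enumeration rather than an FP-tree; it is not the MSFP-growth algorithm, and the function name, the `max_size` cap and the toy data are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_sup, max_size=3):
    """Enumerate itemsets whose support meets the maximum-constraint threshold.

    transactions: list of sets of items.
    min_sup: dict mapping each item to its own minimum support (a fraction).
    """
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    result = {}
    for k in range(1, max_size + 1):
        for cand in combinations(items, k):
            # Support: fraction of transactions containing every item.
            support = sum(1 for t in transactions if set(cand) <= t) / n
            # Assumed maximum constraint: the itemset threshold is the
            # largest per-item minimum support among its members.
            threshold = max(min_sup[item] for item in cand)
            if support >= threshold:
                result[cand] = support
    return result
```

With transactions `[{"a","b"}, {"a","c"}, {"a","b","c"}, {"b"}]` and minimum supports of 0.5 for `a` and `b` and 0.25 for `c`, the pair `("a", "c")` has support 0.5 and passes its threshold of 0.5, while `("b", "c")` has support 0.25 and fails the same threshold.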
Downloads 3318
7949 Mining Association Rules from Unstructured Documents
Authors: Hany Mahgoub
Abstract:
This paper presents a system called EART (Extract Association Rules from Text) for discovering association rules from collections of unstructured documents. The EART system treats text only, not images or figures. EART discovers association rules amongst the keywords labeling the collection of textual documents. The main characteristic of EART is that the system integrates XML technology (to transform unstructured documents into structured documents) with an Information Retrieval scheme (TF-IDF) and a Data Mining technique for association rule extraction. EART depends on word features to extract association rules. It consists of four phases: structure phase, index phase, text mining phase and visualization phase. Our work depends on the analysis of the keywords in the extracted association rules through the co-occurrence of the keywords in one sentence in the original text and the existence of the keywords in one sentence without co-occurrence. Experiments were applied to a collection of scientific documents selected from MEDLINE that are related to the outbreak of the H5N1 avian influenza virus.
Keywords: Association rules, information retrieval, knowledge discovery in text, text mining.
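TF-IDF, the weighting scheme named in the abstract above, scores a term highly when it is frequent in one document but rare across the collection. The Python sketch below shows one common variant (relative term frequency multiplied by the natural log of the inverse document frequency). The abstract does not say which variant EART uses, so the formula, the function name and the toy documents are assumptions for illustration only.

```python
from collections import Counter
from math import log


def tf_idf(docs):
    """Return one {term: score} dict per document.

    docs: list of non-empty token lists, one list per document.
    """
    n = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / len(doc)) * log(n / df[term])
            for term, count in tf.items()
        })
    return scores
```

Under this variant a term that occurs in every document scores zero, which is why such terms would be poor candidates for keyword-level association rules.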
Downloads: 2442
7948 Quantification of GHGs Emissions from Electricity and Diesel Fuel Consumption in Basalt Mining Industry in Thailand
Authors: S. Kittipongvises, A. Dubsok
Abstract:
The mineral and mining industry is necessary for countries to have an adequate and reliable supply of materials to meet their socio-economic development needs. Despite its importance, the environmental impacts of mineral exploration are significant. This study aimed to investigate and quantify the GHG emissions from both electricity and diesel vehicle fuel consumption in basalt mining in Thailand. Plant A, located in the northeastern region of Thailand, was selected as a case study. Results indicated that total GHG emissions from basalt mining and operation (Plant A) were approximately 2,501,086 kgCO2e and 1,997,412 kgCO2e in 2014 and 2015, respectively. The estimated carbon intensity ranged from 1.824 kgCO2e to 2.284 kgCO2e per ton of rock product. Scope 1 (direct emissions) was the dominant driver of total GHG emissions compared to scope 2 (indirect emissions): transport-related combustion of diesel fuel generated the highest share of GHG emissions (65%), compared to emissions from purchased electricity (35%). Some of the potential implications for mining entities are also presented.
Keywords: Basalt mining, diesel fuel, electricity, GHGs emissions, Thailand.
Downloads: 1055
7947 Improving Classification Accuracy with Discretization on Datasets Including Continuous Valued Features
Authors: Mehmet Hacibeyoglu, Ahmet Arslan, Sirzat Kahramanli
Abstract:
This study analyzes the effect of discretization on the classification of datasets that include continuous valued features. Six datasets from UCI containing continuous valued features are discretized with an entropy-based discretization method. The performance of the datasets with original features is compared with that of the datasets with discretized features using the k-nearest neighbors, Naive Bayes, C4.5 and CN2 data mining classification algorithms. As a result, the classification accuracies of the six datasets are improved by 1.71% to 12.31% on average.
Keywords: Data mining classification algorithms, entropy-based discretization method
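The abstract does not state which entropy-based variant was used, so the following is only a sketch of the core idea behind such methods: choosing a single cut point on a continuous feature that minimizes the size-weighted class entropy of the two resulting bins. Recursive splitting and a stopping criterion (for example an MDL-based one) are omitted; the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Return (cut_point, weighted_entropy) for the best binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_entropy, best_point = float("inf"), None
    for i in range(1, n):
        # Only consider boundaries between distinct feature values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if weighted < best_entropy:
            # Place the cut midway between the two neighboring values.
            best_entropy = weighted
            best_point = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_point, best_entropy

# Hypothetical toy feature with two well-separated classes.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["low", "low", "low", "high", "high", "high"]
cut, h = best_cut(values, labels)
```

On this toy feature the chosen cut falls midway between 3.0 and 10.0, where both bins are pure and the weighted entropy is zero.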
Downloads: 2461
7946 Cirrhosis Mortality Prediction as Classification Using Frequent Subgraph Mining
Authors: Abdolghani Ebrahimi, Diego Klabjan, Chenxi Ge, Daniela Ladner, Parker Stride
Abstract:
In this work, we use machine learning and data analysis techniques to predict the one-year mortality of cirrhotic patients. Data from 2,322 patients with liver cirrhosis were collected at a single medical center. Different machine learning models are applied to predict one-year mortality. A comprehensive feature space including demographic information, comorbidity, clinical procedures and laboratory tests is analyzed. A temporal pattern mining technique called Frequent Subgraph Mining (FSM) is used. The Model for End-Stage Liver Disease (MELD) prediction of mortality is used as a comparator. All of our models statistically significantly outperform the MELD-score model and show an average 10% improvement in the area under the curve (AUC). The FSM technique itself does not improve the model significantly, but FSM, together with a machine learning technique called an ensemble, further improves model performance. With the abundance of data available in healthcare through electronic health records (EHR), existing predictive models can be refined to identify and treat patients at risk of higher mortality. However, due to the sparsity of the temporal information needed by FSM, the FSM model does not yield significant improvements. Our work applies modern machine learning algorithms and data analysis methods to predicting the one-year mortality of cirrhotic patients and builds a model that predicts one-year mortality significantly more accurately than the MELD score. We have also tested the potential of FSM and provided a new perspective on the importance of clinical features.
Keywords: machine learning, liver cirrhosis, subgraph mining, supervised learning
Downloads: 449
7945 Using Data Mining Techniques for Finding Cardiac Outlier Patients
Authors: Farhan Ismaeel Dakheel, Raoof Smko, K. Negrat, Abdelsalam Almarimi
Abstract:
In this paper, we use data mining techniques to identify outlier patients who use large amounts of drugs over a long period of time. Any healthcare or health insurance system must deal with the quantities of drugs used by chronic disease patients. In the Kingdom of Bahrain, about 20% of the health budget is spent on medications. Managers of healthcare systems do not have enough information about how chronic disease patients use drugs, whether there is any misuse, or whether there are outlier patients. In this work, carried out in cooperation with the information department of the Bahrain Defence Force hospital, we select the data for cardiac patients in the period from 1/1/2008 to 31/12/2008 as the data for the model. We used three techniques to analyze drug utilization for cardiac patients. First we applied a clustering technique, then measured clustering validity, and finally applied a decision tree as the classification algorithm. The clustering partitions the 1603 patients, who received 15,806 prescriptions during this period, into three groups according to drug utilization, where 23 patients (2.59%) who received 1316 prescriptions (8.32%) are classified as outliers. The classification algorithm shows that average drug utilization, age, and gender of the patient can be considered the main predictive factors in the induced model.
Keywords: Data Mining, Clustering, Classification, Drug Utilization.
Downloads: 1898
7944 Distributed Denial of Service Attacks in Mobile Adhoc Networks
Authors: Gurjinder Kaur, Yogesh Chaba, V. K. Jain
Abstract:
The aim of this paper is to explore the security issues that significantly affect the performance of Mobile Adhoc Networks (MANETs) and limit the services provided to their intended users. MANETs are more vulnerable to Distributed Denial of Service (DDoS) attacks because of properties such as a shared medium and dynamic topologies. A DDoS attack is a coordinated attempt by malicious users to flood the victim network with a large amount of data so that the victim network's resources are exhausted, degrading network performance. This paper highlights the effects of different types of DDoS attacks in MANETs and categorizes them according to their behavior.
Keywords: Distributed Denial, Mobile Adhoc Networks
Downloads: 2435