Search results for: model based clustering
37813 Spatial-Temporal Clustering Characteristics of Dengue in the Northern Region of Sri Lanka, 2010-2013
Authors: Sumiko Anno, Keiji Imaoka, Takeo Tadono, Tamotsu Igarashi, Subramaniam Sivaganesh, Selvam Kannathasan, Vaithehi Kumaran, Sinnathamby Noble Surendran
Abstract:
Dengue outbreaks are affected by biological, ecological, socio-economic and demographic factors that vary over time and space. These factors have been examined separately and still require systematic clarification. The present study aimed to investigate the spatial-temporal clustering relationships between these factors and dengue outbreaks in the northern region of Sri Lanka. Remote sensing (RS) data gathered from multiple satellites were used to develop an index comprising rainfall, humidity and temperature data. RS data gathered by ALOS/AVNIR-2 were used to detect urbanization, and a digital land cover map was used to extract land cover information. Other data on relevant factors and dengue outbreaks were collected through institutions and extant databases. The analyzed RS data and databases were integrated into geographic information systems, enabling temporal analysis, spatial statistical analysis and space-time clustering analysis. Our results showed that an increase in the number of combined ecological, socio-economic and demographic factors that are above average or present contributes to significantly high rates of space-time dengue clusters.
Keywords: ALOS/AVNIR-2, dengue, space-time clustering analysis, Sri Lanka
Procedia PDF Downloads 476
37812 Bridge Members Segmentation Algorithm of Terrestrial Laser Scanner Point Clouds Using Fuzzy Clustering Method
Authors: Donghwan Lee, Gichun Cha, Jooyoung Park, Junkyeong Kim, Seunghee Park
Abstract:
3D shape models of existing structures are required for many purposes, such as safety and operation management. Traditional 3D modeling methods are based on manual or semi-automatic reconstruction from close-range images, which is expensive and time-consuming. The Terrestrial Laser Scanner (TLS) is a common survey technique for measuring a 3D shape model quickly and accurately. TLS is used at construction sites and in cultural heritage management. However, processing a TLS point cloud has many limitations because the raw point cloud is a massive volume of data, so the capability to carry out useful analyses on unstructured 3D points is also limited. Thus, segmentation becomes an essential step whenever grouping of points with common attributes is required. In this paper, a member segmentation algorithm is presented to separate a raw point cloud that includes only 3D coordinates. The paper presents a clustering approach based on a fuzzy method for this objective. Fuzzy C-Means (FCM) is reviewed and used in combination with a similarity-driven cluster merging method. It is applied to a point cloud acquired with a Leica Scan Station C10/C5 at the test bed. The test bed was a bridge that connects the 1st and 2nd engineering buildings at Sungkyunkwan University in Korea. It is about 32 m long and 2 m wide and is used as a pedestrian bridge between the two buildings. The 3D point cloud of the test bed was constructed from TLS measurements, and the segmentation algorithm divided these data by member. Experimental analyses of the results from the proposed unsupervised segmentation process are shown to be promising. Because the point cloud is segmented, the configuration of each member can be managed.
Keywords: fuzzy c-means (FCM), point cloud, segmentation, terrestrial laser scanner (TLS)
Procedia PDF Downloads 234
37811 Model Driven Architecture Methodologies: A Review
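As an illustration of the Fuzzy C-Means step mentioned in this abstract, the following is a minimal sketch in plain Python, not the authors' implementation. The function name `fuzzy_c_means`, the fuzzifier `m=2.0`, the random initialization and the toy coordinates are all assumptions made for the example, and the similarity-driven cluster merging step described in the abstract is not shown.

```python
import math
import random


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Textbook Fuzzy C-Means on a list of coordinate tuples (illustrative only)."""
    rng = random.Random(seed)
    # Assumption: initial centres are drawn at random from the data.
    centers = [list(p) for p in rng.sample(points, n_clusters)]
    dim = len(points[0])
    power = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(n_iter):
        # Membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2 / (m - 1)).
        memberships = []
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            if min(dists) == 0.0:
                # The point sits exactly on a centre: full membership there.
                hit = dists.index(0.0)
                row = [1.0 if i == hit else 0.0 for i in range(n_clusters)]
            else:
                row = [1.0 / sum((d_i / d_j) ** power for d_j in dists)
                       for d_i in dists]
            memberships.append(row)
        # Centre update: mean of the points weighted by u_ik^m.
        new_centers = []
        for i in range(n_clusters):
            weights = [row[i] ** m for row in memberships]
            total = sum(weights)
            new_centers.append(
                [sum(w * p[d] for w, p in zip(weights, points)) / total
                 for d in range(dim)]
            )
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships


# Hypothetical 3D coordinates standing in for two structural members.
cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
         (5.0, 5.0, 5.0), (5.1, 5.0, 5.0), (5.0, 5.1, 5.0)]
centers, memberships = fuzzy_c_means(cloud, n_clusters=2)
hard_labels = [row.index(max(row)) for row in memberships]
```

Each membership row sums to 1, so a point near a boundary between members keeps a share in both clusters; taking the largest membership per point, as in `hard_labels`, is one simple way to turn the fuzzy result into a segmentation.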
Authors: Arslan Murtaza
Abstract:
Model Driven Architecture (MDA) is a technique presented by the OMG (Object Management Group) for software development in which different models are proposed and converted into code. The main idea is to specify a task using a PIM (Platform Independent Model), transform it into a PSM (Platform Specific Model), and then convert it into code. This review paper describes some challenges and issues faced in MDA, the types and transformation of models (e.g. CIM, PIM and PSM), and the evaluation of MDA-based methodologies.
Keywords: OMG, model driven architecture (MDA), computation independent model (CIM), platform independent model (PIM), platform specific model (PSM), MDA-based methodologies
Procedia PDF Downloads 458
37810 Classification of Regional Innovation Types and Region-Based Innovation Policies
Authors: Seongho Han, Dongkwan Kim
Abstract:
The focus of regional innovation policies is shifting from the central government to local governments. The central government demands that regions enforce autonomous and responsible regional innovation policies and that regional governments seek innovation policies fit for regional characteristics. However, the central government and local governments have not yet reached a conclusion on what innovation policies are appropriate for regional circumstances. In particular, even when each local government tries to find regional innovation strategies based on the needs of its region, its strategies turn out to be similar to those of other regions. This outcome is inefficient not only at the national level but also at the regional level. Existing research on regional innovation types points out that there are remarkable differences in the types or characteristics of innovation among the regions of a nation. It also implies that policies enforced without regard to such differences would not produce the expected innovation output. This means that it is undesirable to enforce regional innovation policies under a single standard. Given this problem, this research aims to identify the characteristics of and differences in innovation types among the regions of Korea and suggests appropriate policy implications by classifying them. To this end, regions were classified in consideration of the various indicators of innovation suggested by existing related research, and policies were illustrated based on these characteristics and differences. The research used recent data, mainly from 2012, and applied clustering analysis based on multiple factor analysis as its methodology.
Supplementary research on dynamically analyzing the stability of regional innovation types, establishing systematic indicators based on regional innovation theory, and developing additional indicators is necessary in the future.
Keywords: regional innovation policy, regional innovation type, region-based innovation, multiple factor analysis, clustering analysis
Procedia PDF Downloads 475
37809 Integrating Data Mining with Case-Based Reasoning for Diagnosing Sorghum Anthracnose
Authors: Mariamawit T. Belete
Abstract:
Cereal production and marketing are the means of livelihood for millions of households in Ethiopia. However, cereal production is constrained by technical and socio-economic factors. Among the technical factors, cereal crop diseases are the major contributors to low yield. The aim of this research is to integrate data mining with a knowledge-based system for sorghum anthracnose diagnosis that assists agriculture experts and development agents in making timely decisions. The anthracnose diagnosis system gathers information from the Melkassa agricultural research center and attempts to score anthracnose on a severity scale. Empirical research is designed for data exploration, modeling, and confirmatory procedures for testing hypotheses and predictions in order to draw sound conclusions. WEKA (Waikato Environment for Knowledge Analysis) was employed for the modeling. Knowledge-based systems follow a variety of approaches depending on the knowledge representation method; case-based reasoning (CBR) is one of the popular approaches. CBR is a problem-solving strategy that uses previous cases to solve new problems. The system utilizes hidden knowledge extracted by clustering algorithms, specifically K-means clustering, from a sampled anthracnose dataset. Clustered cases with their centroid values are mapped to jCOLIBRI, and the integrator application is then created using NetBeans with JDK 8.0.2. The important parts of a case-based reasoning model are retrieve, the similarity-measuring stage; reuse, which allows the domain expert to adapt the retrieved case solution to the current case; revise, which tests the solution; and retain, which stores the confirmed solution in the case base for future use. The system was evaluated for both system performance and user acceptance. Seven test cases were used to test the prototype.
Experimental results show that the system achieves average precision and recall values of 70% and 83%, respectively. User acceptance testing was also performed with five domain experts, and an average acceptance of 83% was achieved. Although the results of this study are promising, further investigation of hybrid approaches, such as rule-based reasoning, and of a pictorial retrieval process is recommended.
Keywords: sorghum anthracnose, data mining, case based reasoning, integration
Procedia PDF Downloads 81
37808 A Local Tensor Clustering Algorithm to Annotate Uncharacterized Genes with Many Biological Networks
Authors: Paul Shize Li, Frank Alber
Abstract:
A fundamental task of clinical genomics is to unravel the functions of genes and their associations with disorders. Although experimental biology has worked to discover and elucidate the molecular mechanisms of individual genes over the past decades, about 40% of human genes still have unknown functions, not to mention the diseases they may be related to. For biologists who are interested in a particular gene with unknown functions, a powerful computational method tailored to inferring the functions and disease relevance of uncharacterized genes is strongly needed. Studies have shown that genes strongly linked to each other in multiple biological networks are more likely to have similar functions. This indicates that the densely connected subgraphs in multiple biological networks are useful in the functional and phenotypic annotation of uncharacterized genes. Therefore, in this work, we have developed an integrative network approach to identify frequent local clusters, which are defined as densely connected subgraphs that occur frequently in multiple biological networks and contain the query gene, which has few or no disease or function annotations. This is a local clustering algorithm that models multiple biological networks sharing the same gene set as a three-dimensional matrix, the so-called tensor, and employs a tensor-based optimization method to find the frequent local clusters efficiently. Specifically, massive public gene expression data sets that comprehensively cover dynamic, physiological, and environmental conditions are used to generate hundreds of gene co-expression networks. By integrating these gene co-expression networks, the proposed method can be applied to a given uncharacterized gene of interest to a biologist to identify the frequent local clusters that contain this gene. Finally, those frequent local clusters are used for function and disease annotation of the uncharacterized gene.
This local tensor clustering algorithm outperformed the competing tensor-based algorithm in both module discovery and running time. We also demonstrated the use of the proposed method on real data comprising hundreds of gene co-expression data sets and showed that it can comprehensively characterize the query gene. Therefore, this study provides a new tool for annotating uncharacterized genes and has great potential to assist clinical genomic diagnostics.
Keywords: local tensor clustering, query gene, gene co-expression network, gene annotation
Procedia PDF Downloads 168
37807 Design of a Fuzzy Luenberger Observer for Fault Nonlinear System
Authors: Mounir Bekaik, Messaoud Ramdani
Abstract:
We present in this work a new stabilization technique for nonlinear systems with faults. The approach we adopt focuses on a fuzzy Luenberger observer. The T-S approximation of the nonlinear observer is based on the fuzzy C-Means clustering algorithm, which is used to find local linear subsystems. The MOESP identification approach was applied to design an empirical model describing the state variables of the subsystems. The gain of the observer is obtained by minimizing the estimation error through a Lyapunov-Krasovskii functional and an LMI approach. We consider a three-tank hydraulic system as an illustrative example.
Keywords: nonlinear system, fuzzy, faults, TS, Lyapunov-Krasovskii, observer
Procedia PDF Downloads 331
37806 Power Aware Modified I-LEACH Protocol Using Fuzzy IF Then Rules
Authors: Gagandeep Singh, Navdeep Singh
Abstract:
Because sensor nodes have limited battery capacity, energy efficiency is the main constraint in wireless sensor networks (WSNs). Therefore, the main focus of the present work is to find ways to minimize energy consumption and thereby extend the network stability period and lifetime. Many researchers have proposed different kinds of protocols to enhance the network lifetime further. This paper evaluates issues that have been neglected in the field of WSNs. WSNs are composed of multiple unattended, ultra-small, limited-power sensor nodes. Sensor nodes are deployed randomly in the area of interest. Sensor nodes have limited processing, wireless communication and power resource capabilities. Sensor nodes send sensed data to the sink or Base Station (BS). I-LEACH provides an adaptive clustering mechanism that deals with energy conservation very efficiently. The paper concludes with the shortcomings of various adaptive clustering based WSN protocols.
Keywords: WSN, I-Leach, MATLAB, sensor
Procedia PDF Downloads 275
37805 A Concept of Data Mining with XML Document
Authors: Akshay Agrawal, Anand K. Srivastava
Abstract:
The increasing amount of XML datasets available to casual users increases the need to investigate techniques for extracting knowledge from these data. Data mining is widely applied in the database research area in order to extract frequent correlations of values from both structured and semi-structured datasets. The increasing availability of heterogeneous XML sources has raised a number of issues concerning how to represent and manage these semi-structured data. In recent years, due to the importance of managing these resources and extracting knowledge from them, many methods have been proposed to represent and cluster them in different ways.
Keywords: XML, similarity measure, clustering, cluster quality, semantic clustering
Procedia PDF Downloads 379
37804 RAPDAC: Role Centric Attribute Based Policy Driven Access Control Model
Authors: Jamil Ahmed
Abstract:
Access control models aim to decide whether a user should be denied or granted access to the user's requested activity. Various access control models have been established and proposed. The most prominent of these include role-based, attribute-based and policy-based access control models, as well as the role-centric attribute-based access control model. In this paper, a novel access control model is presented, called the "Role centric Attribute based Policy Driven Access Control (RAPDAC) model". RAPDAC incorporates the concept of "policy" into the "role centric attribute based access control model". It leverages the concept of "policy" by precisely combining the evaluation of conditions, attributes, permissions and roles in order to authorize access. This approach allows the "access control policy" of a real-time application to be captured in a well-defined manner. The RAPDAC model allows access decisions to be made at a much finer granularity, as illustrated by the case study of a real-time library information system.
Keywords: authorization, access control model, role based access control, attribute based access control
Procedia PDF Downloads 159
37803 Analysis of Ozone Episodes in the Forest and Vegetation Areas with Using HYSPLIT Model: A Case Study of the North-West Side of Biga Peninsula, Turkey
Authors: Deniz Sari, Selahattin İncecik, Nesimi Ozkurt
Abstract:
Surface ozone, named one of the most critical pollutants of the 21st century, threatens human health, forests and vegetation. In rural areas in particular, surface ozone has significant effects on agricultural production and trees. In this study, in order to understand surface ozone levels in rural areas, we focus on the north-western side of the Biga Peninsula, which is covered by mountainous and forested terrain. Ozone concentrations were measured for the first time in this rural area, with passive sampling at 10 sites and two online monitoring stations, between 2013 and 2015. The AOT40 (Accumulated hourly O3 concentrations Over a Threshold of 40 ppb) cumulative index was calculated from the daytime hourly O3 measurements (08:00–20:00) exceeding the threshold of 40 ppb, over three months (May, June and July) for agricultural crops and over six months (April to September) for forest trees. AOT40 is defined by EU Directive 2008/50/EC to evaluate whether ozone pollution is a risk for vegetation and is calculated from the hourly ozone concentrations recorded by monitoring systems. In the present study, we performed trajectory analysis with the Hybrid Single-Particle Lagrangian Integrated Trajectory (HYSPLIT) model to trace the long-range transport sources contributing to the high ozone levels in the region. The ozone episodes observed between 2013 and 2015 were analysed using the HYSPLIT model developed by NOAA-ARL. In addition, cluster analysis was used to identify homogeneous groups of air mass transport patterns by grouping trajectories with similar air mass movement. Backward trajectories produced for the three years by the HYSPLIT model were assigned to different clusters according to their speed and direction using a k-means clustering algorithm. According to the cluster analysis results, northerly flows into the study area cause high ozone levels in the region.
The results show that the ozone values in the study area are above the critical levels for forest and vegetation set by EU Directive 2008/50/EC.
Keywords: AOT40, Biga Peninsula, HYSPLIT, surface ozone
Procedia PDF Downloads 255
37802 An Adaptive Oversampling Technique for Imbalanced Datasets
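The AOT40 index described in this abstract can be sketched in a few lines of Python. This is a hedged illustration of the definition given in the text (sum of hourly exceedances over 40 ppb in the 08:00–20:00 window over a season), not the authors' processing chain. The function name `aot40`, the sample readings and the choice to treat each timestamp as the start of its hour are assumptions.

```python
from datetime import datetime

THRESHOLD_PPB = 40.0


def aot40(hourly, months, start_hour=8, end_hour=20):
    """Accumulate ozone exceedance over 40 ppb, in ppb·hours.

    hourly: iterable of (datetime, ozone_ppb) pairs, one per hour.
    months: set of month numbers to accumulate over, e.g. {5, 6, 7}.
    Assumption: a timestamp marks the start of the hour it describes.
    """
    total = 0.0
    for stamp, ppb in hourly:
        in_season = stamp.month in months
        in_daylight = start_hour <= stamp.hour < end_hour
        if in_season and in_daylight and ppb > THRESHOLD_PPB:
            total += ppb - THRESHOLD_PPB
    return total


# Hypothetical readings, not measurements from the study.
sample = [
    (datetime(2014, 6, 1, 7), 55.0),   # before 08:00, ignored
    (datetime(2014, 6, 1, 9), 38.0),   # below the threshold, adds nothing
    (datetime(2014, 6, 1, 13), 62.5),  # adds 22.5
    (datetime(2014, 6, 1, 15), 47.0),  # adds 7.0
    (datetime(2014, 8, 1, 13), 70.0),  # August: counts for forests only
]
crops = aot40(sample, months={5, 6, 7})             # May to July
forests = aot40(sample, months={4, 5, 6, 7, 8, 9})  # April to September
```

Only the part of each reading above 40 ppb is accumulated, so an hour at 62.5 ppb contributes 22.5 ppb·hours rather than 62.5.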
Authors: Shaukat Ali Shahee, Usha Ananthakumar
Abstract:
A data set exhibits the class imbalance problem when one class has very few examples compared to the other class; this is also referred to as between-class imbalance. Traditional classifiers fail to classify minority class examples correctly because of their bias towards the majority class. Apart from between-class imbalance, imbalance within classes, where classes are composed of different numbers of sub-clusters that themselves contain different numbers of examples, also degrades classifier performance. Many methods have previously been proposed for handling the imbalanced dataset problem. These methods can be classified into four categories: data preprocessing, algorithm-based methods, cost-based methods and ensembles of classifiers. Data preprocessing techniques have shown great potential as they attempt to improve the data distribution rather than the classifier. They handle class imbalance either by increasing the minority class examples or by decreasing the majority class examples. Decreasing the majority class examples leads to loss of information, and when the minority class is absolutely rare, removing majority class examples is generally not recommended. Existing methods for handling class imbalance do not address between-class imbalance and within-class imbalance simultaneously. In this paper, we propose a method that handles both simultaneously for binary classification problems. Removing between-class and within-class imbalance simultaneously eliminates the bias of the classifier towards bigger sub-clusters by minimizing the domination of bigger sub-clusters in the total error. The proposed method uses model-based clustering to detect the presence of sub-clusters or sub-concepts in the dataset. The number of examples oversampled among the sub-clusters is determined based on the complexity of the sub-clusters.
The method also takes into consideration the scatter of the data in the feature space and adaptively copes with unseen test data using the Lowner-John ellipsoid to increase the accuracy of the classifier. In this study, a neural network is used because it is a classifier in which the total error is minimized, and removing between-class and within-class imbalance simultaneously helps the classifier give equal weight to all sub-clusters irrespective of class. The proposed method is validated on 9 publicly available data sets and compared with three existing oversampling techniques that rely on the spatial location of minority class examples in the Euclidean feature space. The experimental results show the proposed method to be statistically significantly superior to the other methods in terms of various accuracy measures. Thus the proposed method can serve as a good alternative for problem domains such as credit scoring, customer churn prediction and financial distress, which typically involve imbalanced data sets.
Keywords: classification, imbalanced dataset, Lowner-John ellipsoid, model based clustering, oversampling
Procedia PDF Downloads 418
37801 Machine Learning Analysis of Eating Disorders Risk, Physical Activity and Psychological Factors in Adolescents: A Community Sample Study
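To make the idea of model-based clustering concrete, here is a minimal sketch that fits a two-component Gaussian mixture to one-dimensional data with expectation-maximization. The abstract does not state which mixture model, dimensionality or software the authors used, so the function name `fit_two_gaussians`, the fixed number of components, the initialization and the toy data are all assumptions made for illustration.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_gaussians(data, n_iter=50, min_var=1e-6):
    """EM for a 1-D mixture of two Gaussians (illustrative sketch)."""
    n = len(data)
    grand_mean = sum(data) / n
    grand_var = max(sum((x - grand_mean) ** 2 for x in data) / n, min_var)
    # Assumption: start the two means at the extremes of the data.
    weights = [0.5, 0.5]
    means = [min(data), max(data)]
    variances = [grand_var, grand_var]
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, mu, v)
                    for w, mu, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(spread, min_var)  # floor avoids a collapsed component
    return weights, means, variances, resp


# Hypothetical feature values forming two sub-clusters within one class.
values = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
weights, means, variances, resp = fit_two_gaussians(values)
sub_cluster = [r.index(max(r)) for r in resp]
```

In a model-based setting the number of components is usually chosen by comparing fitted models, for example with an information criterion; that selection step is omitted here to keep the sketch short.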
Authors: Marc Toutain, Pascale Leconte, Antoine Gauthier
Abstract:
Introduction: Eating Disorders (ED), such as anorexia, bulimia, and binge eating, are psychiatric illnesses that mostly affect young people. The main symptoms concern eating (restriction, excessive food intake) and weight control behaviors (laxatives, vomiting). Psychological comorbidities (depression, executive function disorders, etc.) and problematic behaviors toward physical activity (PA) are commonly associated with ED. Knowledge of ED risk factors is still lacking, and more community sample studies are needed to improve prevention and early detection. To our knowledge, studies are needed that specifically investigate the link between ED risk level, PA, and psychological risk factors in a community sample of adolescents. The aim of this study is to assess the relation between ED risk level, exercise (type, frequency, and motivations for engaging in exercise), and psychological factors based on the Jacobi risk factors model. We hypothesize that a high risk of ED will be associated with the practice of high caloric cost PA, motivations oriented to weight and shape control, and psychological disturbances. Method: An online survey intended for students was sent to several middle schools and colleges in northwest France. This survey combined several questionnaires: the Eating Attitude Test-26 assessing ED risk; the Exercise Motivation Inventory–2 assessing motivations toward PA; the Hospital Anxiety and Depression Scale assessing anxiety and depression; the Contour Drawing Rating Scale and the Body Esteem Scale assessing body dissatisfaction; the Rosenberg Self-esteem Scale assessing self-esteem; the Exercise Dependence Scale-Revised assessing PA dependence; the Multidimensional Assessment of Interoceptive Awareness assessing interoceptive awareness; and the Frost Multidimensional Perfectionism Scale assessing perfectionism.
Machine learning analysis will be performed in order to form groups with a tree-based model clustering method, extract risk profile(s) with a bootstrap method comparison, and predict ED risk with a prediction method based on a decision tree model. Expected results: 1044 complete records have already been collected, and the survey will be closed at the end of May 2022. Records will be analyzed with a clustering method and a bootstrap method in order to reveal risk profile(s). Furthermore, a decision tree prediction method will be applied to extract an accurate predictive model of ED risk. This analysis will confirm typical main risk factors and will give more data on presumed strong risk factors such as exercise motivations and interoceptive deficit. It will also shed light on particular risk profiles with a strong level of proof, contribute greatly to improving the early detection of ED, and support a better understanding of ED risk factors.
Keywords: eating disorders, risk factors, physical activity, machine learning
Procedia PDF Downloads 83
37800 Research on the Risks of Railroad Receiving and Dispatching Trains Operators: Natural Language Processing Risk Text Mining
Authors: Yangze Lan, Ruihua Xv, Feng Zhou, Yijia Shan, Longhao Zhang, Qinghui Xv
Abstract:
Receiving and dispatching trains is an important part of railroad organization, yet the risk evaluation of operating personnel is still reflected only by scores, without further analysis of wrong answers and operating accidents. With natural language processing (NLP) technology, this study extracts the keywords and key phrases of 40 relevant risk events about receiving and dispatching trains and reclassifies the risk events into 8 categories, such as train approach and signal risks, dispatching command risks, and so on. Based on the historical risk data of personnel, the K-Means clustering method is used to classify the risk level of personnel. The results indicate that high-risk operating personnel need strengthened training in train receiving and dispatching operations for essential trains and abnormal situations.
Keywords: receiving and dispatching trains, natural language processing, risk evaluation, K-means clustering
Procedia PDF Downloads 91
37799 The Use of Appeals in Green Printed Advertisements: A Case of Product Orientation and Organizational Image Orientation Ads
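The K-Means step in this abstract, grouping personnel into risk levels from historical risk data, can be sketched as follows. The abstract does not describe the features or settings used, so the two features per person (an error rate and an incident count), the three risk levels, the deterministic initialization and the function name `k_means` are hypothetical choices for illustration only.

```python
import math


def k_means(points, k, n_iter=100):
    """Plain Lloyd's algorithm on a list of equal-length tuples (illustrative)."""
    # Assumption: deterministic start, spreading centres across the sorted data.
    ordered = sorted(points)
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centre.
        labels = [min(range(k), key=lambda i: math.dist(p, centers[i]))
                  for p in points]
        # Update step: each centre moves to the mean of its members.
        new_centers = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster in place
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Hypothetical per-person history: (share of wrong answers, number of incidents).
history = [(0.05, 0), (0.08, 1), (0.30, 3), (0.35, 4), (0.70, 9), (0.75, 10)]
centers, labels = k_means(history, k=3)
```

With raw values like these, the incident count dominates the distance because its scale is larger; standardizing the features first is a common precaution when the columns have different units.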
Authors: Chutima Ruanguttamanun
Abstract:
Despite the relatively large number of studies that have examined the use of appeals in advertisements, research on the use of appeals in green advertisements is still underdeveloped and needs further investigation, as appeals are a tool that marketers use to create memorable ads. In this study, content analysis was employed to examine the nature of green advertising appeals and to match the appeals with the green advertisements. Two different types of green print advertising, product orientation and organizational image orientation, were used. Thirty highly educated participants with different backgrounds were asked individually to identify three appeals out of thirty-four given appeals in forty real green advertisements. To analyze participant responses and to group them based on common appeals, two-step K-means clustering was used. The clustering solution indicates that eye-catching graphics and imaginative appeals are highly notable in both types of green ads. Depressed, meaningful and sad appeals are found to be highly used in organizational image orientation ads, whereas corporate image, informative and natural appeals are found to be essential for product orientation ads.
Keywords: advertising appeals, green marketing, green advertisement, printed advertisement
Procedia PDF Downloads 277
37798 Logistic Model Tree and Expectation-Maximization for Pollen Recognition and Grouping
Authors: Endrick Barnacin, Jean-Luc Henry, Jack Molinié, Jimmy Nagau, Hélène Delatte, Gérard Lebreton
Abstract:
Palynology is a field of interest for many disciplines. It has multiple applications such as chronological dating, climatology, allergy treatment, and even honey characterization. Unfortunately, the analysis of a pollen slide is a complicated and time-consuming task that requires the intervention of experts in the field, who are becoming increasingly rare due to economic and social conditions. The automation of this task is therefore a necessity. Pollen slide analysis is mainly a visual process, as it is carried out with the naked eye. For this reason, a primary method for automating palynology is digital image processing. This method has the lowest cost and relatively good accuracy in pollen retrieval. In this work, we propose a system combining recognition and grouping of pollen. It uses a Logistic Model Tree to classify pollen already known to the proposed system while detecting any unknown species. The unknown pollen species are then divided using a cluster-based approach. Success rates for the recognition of known species have been achieved, and automated clustering seems to be a promising approach.
Keywords: pollen recognition, logistic model tree, expectation-maximization, local binary pattern
Procedia PDF Downloads 182
37797 Towards a Measurement-Based E-Government Portals Maturity Model
Authors: Abdoullah Fath-Allah, Laila Cheikhi, Rafa E. Al-Qutaish, Ali Idri
Abstract:
The emerging concept of e-government transforms the way in which citizens deal with their governments. Citizens can use the intended services online anytime and anywhere. This brings great benefits for both governments (fewer officers are needed) and citizens (more flexibility and time saving). Building a maturity model to assess e-government portals is therefore desirable to help improve such portals. This paper aims to propose an e-government maturity model based on measuring the presence of best practices. The main benefit of such a maturity model is that it provides a way to rank an e-government portal based on the best practices it uses and gives a set of recommendations for moving to the next stage of the maturity model.
Keywords: best practices, e-government portal, maturity model, quality model
Procedia PDF Downloads 338
37796 Dow Polyols near Infrared Chemometric Model Reduction Based on Clustering: Reducing Thirty Global Hydroxyl Number (OH) Models to Less Than Five
Authors: Wendy Flory, Kazi Czarnecki, Matthijs Mercy, Mark Joswiak, Mary Beth Seasholtz
Abstract:
Polyurethane Materials are present in a wide range of industrial segments such as Furniture, Building and Construction, Composites, Automotive, Electronics, and more. Dow is one of the leaders for the manufacture of the two main raw materials, Isocyanates and Polyols used to produce polyurethane products. Dow is also a key player for the manufacture of Polyurethane Systems/Formulations designed for targeted applications. In 1990, the first analytical chemometric models were developed and deployed for use in the Dow QC labs of the polyols business for the quantification of OH, water, cloud point, and viscosity. Over the years many models have been added; there are now over 140 models for quantification and hundreds for product identification, too many to be reasonable for support. There are 29 global models alone for the quantification of OH across > 70 products at many sites. An attempt was made to consolidate these into a single model. While the consolidated model proved good statistics across the entire range of OH, several products had a bias by ASTM E1655 with individual product validation. This project summary will show the strategy for global model updates for OH, to reduce the number of models for quantification from over 140 to 5 or less using chemometric methods. In order to gain an understanding of the best product groupings, we identify clusters by reducing spectra to a few dimensions via Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP). Results from these cluster analyses and a separate validation set allowed dow to reduce the number of models for predicting OH from 29 to 3 without loss of accuracy.Keywords: hydroxyl, global model, model maintenance, near infrared, polyol
Procedia PDF Downloads 135
37795 The Analyzer: Clustering Based System for Improving Business Productivity by Analyzing User Profiles to Enhance Human Computer Interaction
Authors: Dona Shaini Abhilasha Nanayakkara, Kurugamage Jude Pravinda Gregory Perera
Abstract:
E-commerce platforms have revolutionized the shopping experience, offering convenient ways for consumers to make purchases. To improve interactions with customers and optimize marketing strategies, it is essential for businesses to understand user behavior, preferences, and needs on these platforms. This paper focuses on recommending that businesses customize interactions with users based on their behavioral patterns, leveraging data-driven analysis and machine learning techniques. Businesses can improve engagement and boost the adoption of e-commerce platforms by aligning behavioral patterns with user goals of usability and satisfaction. We propose The Analyzer, a clustering-based system designed to enhance business productivity by analyzing user profiles and improving human-computer interaction. The Analyzer integrates with business applications, collecting relevant data points based on users' natural interactions without additional burdens such as questionnaires or surveys. It defines five key user analytics as features for its dataset, which are easily captured through users' interactions with e-commerce platforms. This research presents a study demonstrating the successful separation of users into specific groups based on the five key analytics considered by The Analyzer. With the assistance of domain experts, customized business rules can be attached to each group, enabling The Analyzer to influence business applications and provide an enhanced personalized user experience. The outcomes are evaluated quantitatively and qualitatively, demonstrating that utilizing The Analyzer’s capabilities can optimize business outcomes, enhance customer satisfaction, and drive sustainable growth. The findings of this research contribute to the advancement of personalized interactions in e-commerce platforms. By leveraging user behavioral patterns and analyzing both new and existing users, businesses can effectively tailor their interactions to improve customer satisfaction and loyalty and ultimately drive sales.
Keywords: data clustering, data standardization, dimensionality reduction, human computer interaction, user profiling
Procedia PDF Downloads 72
37794 Automatic Detection of Traffic Stop Locations Using GPS Data
Authors: Areej Salaymeh, Loren Schwiebert, Stephen Remias, Jonathan Waddell
Abstract:
Extracting information from new data sources has emerged as a crucial task in many traffic planning processes, such as identifying traffic patterns, route planning, traffic forecasting, and locating infrastructure improvements. Given the advanced technologies used to collect Global Positioning System (GPS) data from dedicated GPS devices, GPS-equipped phones, and navigation tools, intelligent data analysis methodologies are necessary to mine this raw data. In this research, an automatic detection framework is proposed to help identify and classify the locations of stopped GPS waypoints into two main categories: signalized intersections or highway congestion. The Delaunay triangulation is used to perform this assessment in the clustering phase. While most of the existing clustering algorithms need assumptions about the data distribution, the effectiveness of the Delaunay triangulation relies on triangulating geographical data points without such assumptions. Our proposed method starts by cleaning noise from the data and normalizing it. Next, the framework identifies stoppage points by calculating the traveled distance. The last step is to use clustering to form groups of waypoints for signalized traffic and highway congestion. A binary classifier is then applied to distinguish highway congestion from signalized stop points. The binary classifier uses the length of the cluster to find congestion. The proposed framework shows high accuracy for identifying the stop positions and congestion points in around 99.2% of trials. We show that it is possible, using limited GPS data, to distinguish the two categories with high accuracy.
Keywords: Delaunay triangulation, clustering, intelligent transportation systems, GPS data
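The stoppage-detection step described above (flagging waypoints by traveled distance) can be illustrated with a minimal sketch. It assumes waypoints are (latitude, longitude) pairs in degrees sampled at a regular interval; the function names and the 2 m threshold `max_move_m` are hypothetical choices for illustration and do not come from the paper.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def stopped_indices(waypoints, max_move_m=2.0):
    """Return indices of waypoints that moved at most max_move_m since the previous fix.

    max_move_m is an assumed threshold, not a value taken from the paper.
    """
    stops = []
    for i in range(1, len(waypoints)):
        if haversine_m(*waypoints[i - 1], *waypoints[i]) <= max_move_m:
            stops.append(i)
    return stops
```

The flagged waypoints would then be passed to the clustering phase; the Delaunay-based clustering itself is not reproduced here.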
Procedia PDF Downloads 275
37793 Verification & Validation of Map Reduce Program Model for Parallel K-Mediod Algorithm on Hadoop Cluster
Authors: Trapti Sharma, Devesh Kumar Srivastava
Abstract:
This paper is an analysis study of a MapReduce implementation, and it also verifies and validates the MapReduce solution model for the parallel K-Mediod algorithm on a Hadoop cluster. MapReduce is a programming model that enables the processing of huge amounts of data in parallel on a large number of devices. It is especially well suited to constant or moderately changing sets of data since the implementation point of a position is usually high. MapReduce has gradually become the framework of choice for “big data”. The MapReduce model allows systematic and fast processing of large-scale data with a cluster of compute nodes. One of the primary concerns in Hadoop is how to minimize the completion length (i.e., makespan) of a set of MapReduce jobs. In this paper, we have verified and validated various MapReduce applications such as wordcount, grep, terasort and the parallel K-Mediod clustering algorithm. We have found that as the number of nodes increases, the completion time decreases.
Keywords: hadoop, mapreduce, k-mediod, validation, verification
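To make the MapReduce programming model mentioned above concrete, here is a minimal single-process sketch of the wordcount application. It only imitates the map, shuffle and reduce stages in plain Python; it is not Hadoop code, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of text splits."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle(mapped))
```

On a real cluster the map calls run on different nodes and the shuffle moves data across the network, which is where adding nodes can shorten the completion time.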
Procedia PDF Downloads 369
37792 Dissimilarity-Based Coloring for Symbolic and Multivariate Data Visualization
Authors: K. Umbleja, M. Ichino, H. Yaguchi
Abstract:
In this paper, we propose a coloring method for multivariate data visualization using parallel coordinates, based on dissimilarity and tree structure information gathered during hierarchical clustering. The proposed method is an extension of proximity-based coloring, which suffers from a few undesired side effects if the hierarchical tree structure is not a balanced tree. We describe the algorithm for assigning colors based on dissimilarity information, show the application of the proposed method on three commonly used datasets, and compare the results with proximity-based coloring. We found our proposed method to be especially beneficial for symbolic data visualization, where many individual objects have already been aggregated into a single symbolic object.
Keywords: data visualization, dissimilarity-based coloring, proximity-based coloring, symbolic data
Procedia PDF Downloads 170
37791 Spatial Pattern and Predictors of Malaria in Ethiopia: Application of Auto Logistics Spatial Regression
Authors: Melkamu A. Zeru, Yamral M. Warkaw, Aweke A. Mitku, Muluwerk Ayele
Abstract:
Introduction: Malaria is a severe health threat in the world, mainly in Africa. It is a major cause of health problems, and the risk of morbidity and mortality associated with malaria cases is characterized by spatial variations across the country. This study aimed to investigate the spatial patterns and predictors of malaria distribution in Ethiopia. Methods: A weighted sample of 15,239 individuals with rapid diagnosis tests was obtained from the Central Statistical Agency and the Ethiopia malaria indicator survey of 2015. Global Moran's I and Moran scatter plots were used in determining the distribution of malaria cases, whereas the local Moran's I statistic was used in identifying exposed areas. In data manipulation, machine learning was used for variable reduction, and the statistical software R, Stata, and Python were used for data management and analysis. The auto logistics spatial binary regression model was used to investigate the predictors of malaria. Results: The final auto logistics regression model reported that male clients had significantly higher odds of malaria than female clients [AOR=2.401, 95% CI: (2.125 - 2.713)]. The distribution of malaria across the regions was different. The highest incidence of malaria was found in Gambela [AOR=52.55, 95% CI: (40.54 - 68.12)], followed by Beneshangul [AOR=34.95, 95% CI: (27.159 - 44.963)]. Similarly, individuals in Amhara [AOR=0.243, 95% CI: (0.195 - 0.303)], Oromiya [AOR=0.197, 95% CI: (0.158 - 0.244)], Dire Dawa [AOR=0.064, 95% CI: (0.049 - 0.082)], Addis Ababa [AOR=0.057, 95% CI: (0.044 - 0.075)], Somali [AOR=0.077, 95% CI: (0.059 - 0.097)], SNNPR [OR=0.329, 95% CI: (0.261 - 0.413)] and Harari [AOR=0.256, 95% CI: (0.201 - 0.325)] had a lower incidence of malaria as compared with Tigray. Furthermore, for a one-meter increase in altitude, the odds of a positive rapid diagnostic test (RDT) decrease by 1.6% [AOR=0.984, 95% CI: (0.984 - 0.984)]. The use of a shared toilet facility was found as a protective factor for malaria in Ethiopia [AOR=1.671, 95% CI: (1.504 - 1.854)]. The spatial autocorrelation variable changes the constant from AOR = 0.471 for logistic regression to AOR = 0.164 for auto logistics regression. Conclusions: This study found that the incidence of malaria in Ethiopia had a spatial pattern that is associated with socio-economic, demographic, and geographic risk factors. Spatial clustering of malaria cases occurred in all regions, and the risk of clustering was different across the regions. The risk of malaria was found to be higher for those who live in soil floor-type houses as compared to those who live in cement or ceramics floor-type houses. Similarly, households with thatched, metal and thin, and other roof-type houses have a higher risk of malaria than those with ceramic tile roofs. Moreover, using a protected anti-mosquito net reduced the risk of malaria incidence.
Keywords: malaria, Ethiopia, auto logistics, spatial model, spatial clustering
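The global Moran's I statistic named in the methods can be sketched in a few lines. This is the textbook formula, I = (n / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2, applied to an assumed symmetric adjacency matrix; the function name and the toy inputs are hypothetical and are not the survey data.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square spatial weight matrix.

    weights[i][j] is the spatial weight between areas i and j (0 if not neighbours).
    Values near +1 suggest clustering, near -1 dispersion, near 0 no spatial pattern.
    """
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    weight_total = sum(sum(row) for row in weights)
    cross_products = sum(
        weights[i][j] * deviations[i] * deviations[j]
        for i in range(n)
        for j in range(n)
    )
    variance_sum = sum(d * d for d in deviations)
    return (n / weight_total) * (cross_products / variance_sum)
```

For four areas arranged in a line, case rates that sit next to similar rates give a positive value, while strictly alternating rates give a negative one.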
Procedia PDF Downloads 34
37790 Automatic Detection of Proliferative Cells in Immunohistochemically Images of Meningioma Using Fuzzy C-Means Clustering and HSV Color Space
Authors: Vahid Anari, Mina Bakhshi
Abstract:
Visual search and identification of immunohistochemically stained tissue of meningioma is performed manually in pathologic laboratories to detect and diagnose the cancer type of meningioma. This task is very tedious and time-consuming. Moreover, because of the complex nature of cells, it remains a challenging task to segment cells from their background and analyze them automatically. In this paper, we develop and test a computerized scheme that can automatically identify cells in microscopic images of meningioma and classify them into positive (proliferative) and negative (normal) cells. A dataset of 150 images is used to test the scheme. The scheme uses the Fuzzy C-means algorithm as a color clustering method based on the perceptually uniform hue, saturation, value (HSV) color space. Since the cells are distinguishable by the human eye, the accuracy and stability of the algorithm are quantitatively compared through application to a wide variety of real images.
Keywords: positive cell, color segmentation, HSV color space, immunohistochemistry, meningioma, thresholding, fuzzy c-means
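As a rough illustration of the Fuzzy C-means step, the sketch below performs the two standard FCM updates (memberships, then centers) on one-dimensional values such as a single color channel. It is a simplification under stated assumptions: it ignores the circular nature of hue, uses the common fuzzifier m = 2, and its function names are hypothetical rather than taken from the paper.

```python
def fcm_memberships(values, centers, m=2.0):
    """Membership of each value in each cluster: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))."""
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for x in values:
        distances = [abs(x - c) for c in centers]
        if 0.0 in distances:
            # The value coincides with a center: give it full membership there.
            hits = [1.0 if d == 0.0 else 0.0 for d in distances]
            row = [h / sum(hits) for h in hits]
        else:
            row = [
                1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
                for d_i in distances
            ]
        memberships.append(row)
    return memberships


def fcm_centers(values, memberships, m=2.0):
    """New cluster centers as the membership-weighted mean of the values."""
    centers = []
    for j in range(len(memberships[0])):
        weights = [row[j] ** m for row in memberships]
        centers.append(sum(w * x for w, x in zip(weights, values)) / sum(weights))
    return centers
```

Alternating the two functions until the centers stop moving gives the usual FCM iteration; a pixel can then be labeled by its largest membership.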
Procedia PDF Downloads 210
37789 Empirical Study of Partitions Similarity Measures
Authors: Abdelkrim Alfalah, Lahcen Ouarbya, John Howroyd
Abstract:
This paper investigates and compares the performance of four existing distances and similarity measures between partitions. The partition measures considered are Rand Index (RI), Adjusted Rand Index (ARI), Variation of Information (VI), and Normalised Variation of Information (NVI). This work investigates the ability of these partition measures to capture three predefined intuitions: the variation within randomly generated partitions, the sensitivity to small perturbations, and finally the independence from the dataset scale. It has been shown that the Adjusted Rand Index performed well overall with regard to these three intuitions.
Keywords: clustering, comparing partitions, similarity measure, partition distance, partition metric, similarity between partitions, clustering comparison.
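Two of the measures compared above, the Rand Index and the Adjusted Rand Index, can be computed directly from their standard definitions. The sketch below assumes each partition is given as a list of cluster labels of equal length with at least two items; the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import comb


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which the two partitions agree (together or apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


def adjusted_rand_index(labels_a, labels_b):
    """Rand Index corrected for chance using the pair-counting contingency table."""
    n = len(labels_a)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        # Degenerate case, e.g. both partitions put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (maximum - expected)
```

Both measures ignore the label names themselves, so relabeled but identical partitions score 1, while the ARI can go negative for partitions that agree less than expected by chance.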
Procedia PDF Downloads 202
37788 Unsupervised Part-of-Speech Tagging for Amharic Using K-Means Clustering
Authors: Zelalem Fantahun
Abstract:
Part-of-speech tagging is the process of assigning a part-of-speech or other lexical class marker to each word in naturally occurring text. Part-of-speech tagging is a fundamental task in almost all natural language processing. In natural language processing, providing a large amount of manually annotated data is a knowledge acquisition bottleneck. Since Amharic is an under-resourced language, the availability of a tagged corpus is a bottleneck for natural language processing, especially for POS tagging. A promising direction to tackle this problem is to provide a system that does not require manually tagged data. In unsupervised learning, the learner is not provided with classifications. Unsupervised algorithms seek out similarity between pieces of data in order to determine whether they can be characterized as forming a group. This paper describes the development of an unsupervised part-of-speech tagger using K-Means clustering for the Amharic language, since a large amount of data is produced in day-to-day activities. The development of the tagger follows four phases. First, the unlabeled data (raw text) is divided into 10 folds and tokenized; at this level, the raw text is chunked into sentences and then into words. The second phase is feature extraction, which includes the word frequency and the syntactic and morphological features of a word. The third phase is clustering. Among different clustering algorithms, K-means is selected and implemented in this study to bring groups of similar words together. The fourth phase is mapping, in which each cluster is examined carefully and the most common tag is assigned to the group. This study identifies two features that are capable of distinguishing one part-of-speech from others, namely morphological features and positional information, and shows that it is possible to use unsupervised learning for Amharic POS tagging. To increase the performance of the unsupervised part-of-speech tagger, other features not included in this study, such as semantic information, need to be incorporated. Finally, based on the experimental results, the system achieves a maximum accuracy of 81%.
Keywords: POS tagging, Amharic, unsupervised learning, k-means
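The clustering and mapping phases described above can be sketched as follows. This is a minimal illustration, assuming each word has already been turned into a numeric feature vector; the deterministic initialization (first k points), the function names and the toy tags are assumptions made here, not details from the paper.

```python
from collections import Counter


def squared_distance(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations=20):
    """Plain K-means; centroids start at the first k points (an assumed, simple init)."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


def majority_tags(assignments, sample_tags):
    """Mapping phase: give each cluster the most common tag among its inspected members."""
    votes = {}
    for cluster, tag in zip(assignments, sample_tags):
        votes.setdefault(cluster, Counter())[tag] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}
```

In practice the initialization is usually randomized and the tags used for mapping come from manual inspection of a sample from each cluster.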
Procedia PDF Downloads 451
37787 Data Clustering in Wireless Sensor Network Implemented on Self-Organization Feature Map (SOFM) Neural Network
Authors: Krishan Kumar, Mohit Mittal, Pramod Kumar
Abstract:
A wireless sensor network is one of the most promising communication networks for monitoring remote environmental areas. In this network, all the sensor nodes communicate with each other via radio signals. The sensor nodes are capable of sensing, data storage and processing. The sensor nodes collect information through neighboring nodes and forward it to a particular node. Data collection and processing are done by data aggregation techniques. For data aggregation, a clustering technique is implemented in the sensor network using a self-organizing feature map (SOFM) neural network. Some of the sensor nodes are selected as cluster head nodes. The information is aggregated at cluster head nodes from non-cluster head nodes and is then transferred to the base station (or sink nodes). The aim of this paper is to manage the huge amount of data with the help of the SOFM neural network. Clustered data is selected for transfer to the base station instead of the whole information aggregated at cluster head nodes. This reduces the battery consumption of managing the huge data volume, and the network lifetime is extended to a great extent.
Keywords: artificial neural network, data clustering, self organization feature map, wireless sensor network
Procedia PDF Downloads 517
37786 Feature Based Unsupervised Intrusion Detection
Authors: Deeman Yousif Mahmood, Mohammed Abdullah Hussein
Abstract:
The goal of a network-based intrusion detection system is to classify network traffic activities into two major categories: normal and attack (intrusive) activities. Nowadays, data mining and machine learning play an important role in many sciences, including intrusion detection systems (IDS), using both supervised and unsupervised techniques. One of the essential steps of data mining is feature selection, which helps improve the efficiency, performance and prediction rate of the proposed approach. This paper applies the unsupervised K-means clustering algorithm with information gain (IG) for feature selection and reduction to build a network intrusion detection system. For our experimental analysis, we have used the new NSL-KDD dataset, which is a modified version of the KDDCup 1999 intrusion detection benchmark dataset. With a split of 60.0% for the training set and the remainder for the testing set, a two-class classification has been implemented (Normal, Attack). The Weka framework, a Java-based open source software consisting of a collection of machine learning algorithms for data mining tasks, has been used in the testing process. The experimental results show that the proposed approach is very accurate, with a low false positive rate and a high true positive rate, and that it takes less learning time than using the full feature set of the dataset with the same algorithm.
Keywords: information gain (IG), intrusion detection system (IDS), k-means clustering, Weka
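Information gain, the feature-selection criterion named above, is the reduction in class entropy obtained by splitting on a feature. The sketch below computes it for a discrete feature from the standard definition; it is not Weka code, and the function names and the toy protocol values are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for feat, lab in zip(feature_values, labels) if feat == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder
```

Features would be ranked by this score and only the top-ranked ones kept before clustering; continuous features need to be discretized first for this formulation to apply.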
Procedia PDF Downloads 296
37785 Ambiguity Resolution for Ground-based Pulse Doppler Radars Using Multiple Medium Pulse Repetition Frequency
Authors: Khue Nguyen Dinh, Loi Nguyen Van, Thanh Nguyen Nhu
Abstract:
In this paper, we propose an adaptive method to resolve ambiguities and a ghost target removal process to extract targets detected by a ground-based pulse-Doppler radar using medium pulse repetition frequency (PRF) waveforms. The ambiguity resolution method is an adaptive implementation of the coincidence algorithm, which is implemented on a two-dimensional (2D) range-velocity matrix to resolve range and velocity ambiguities simultaneously, with a proposed clustering filter to enhance the anti-error ability of the system. Here we consider the scenario of multiple target environments. The ghost target removal process, which is based on the power after Doppler processing, is proposed to mitigate ghosting detections to enhance the performance of ground-based radars using a short PRF schedule in multiple target environments. Simulation results on a ground-based pulsed Doppler radar model will be presented to show the effectiveness of the proposed approach.
Keywords: ambiguity resolution, coincidence algorithm, medium PRF, ghosting removal
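The basic idea behind the coincidence algorithm can be shown in one dimension. The sketch below unfolds the apparent range measured under each PRF into all candidate true ranges and keeps the candidates on which every PRF agrees. It is a simplified range-only illustration, not the adaptive 2D range-velocity method or the clustering filter proposed in the paper; the function names, the tolerance and the example values are assumptions.

```python
def unfold(apparent, unambiguous, max_range):
    """All candidate true ranges apparent + k * unambiguous up to max_range."""
    candidates = []
    value = apparent
    while value <= max_range:
        candidates.append(value)
        value += unambiguous
    return candidates


def coincidence(apparent_ranges, unambiguous_ranges, max_range, tolerance=0.5):
    """Return candidate ranges on which every PRF agrees within the tolerance.

    apparent_ranges[i] is the folded range measured with PRF i, and
    unambiguous_ranges[i] is that PRF's maximum unambiguous range (same units).
    """
    first = unfold(apparent_ranges[0], unambiguous_ranges[0], max_range)
    others = [
        unfold(a, u, max_range)
        for a, u in zip(apparent_ranges[1:], unambiguous_ranges[1:])
    ]
    return [
        c
        for c in first
        if all(any(abs(c - r) <= tolerance for r in other) for other in others)
    ]
```

With several targets, coincidences between returns from different targets can also pass this test, which is the source of the ghost targets the abstract refers to.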
Procedia PDF Downloads 151
37784 Credit Card Fraud Detection with Ensemble Model: A Meta-Heuristic Approach
Authors: Gong Zhilin, Jing Yang, Jian Yin
Abstract:
The purpose of this paper is to develop a novel system for credit card fraud detection based on sequential modeling of data using hybrid deep learning models. The proposed model comprises five major phases: pre-processing, imbalanced-data handling, feature extraction, optimal feature selection, and fraud detection with an ensemble classifier. The collected raw data (input) is pre-processed to enhance its quality by handling missing data, noisy data and null values. The pre-processed data are class-imbalanced in nature, and therefore they are handled with the K-means clustering-based SMOTE model. From the class-balanced data, the most relevant features are extracted: improved Principal Component Analysis (PCA) features, statistical features (mean, median, standard deviation) and higher-order statistical features (skewness and kurtosis). Among the extracted features, the optimal features are selected with the Self-improved Arithmetic Optimization Algorithm (SI-AOA). This SI-AOA model is a conceptual improvement of the standard Arithmetic Optimization Algorithm. The ensemble uses deep learning models: Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN), and an optimized Quantum Deep Neural Network (QDNN). The LSTM and CNN are trained with the selected optimal features. The outcomes from the LSTM and CNN are fed as input to the optimized QDNN, which provides the final detection outcome. Since the QDNN is the ultimate detector, its weight function is fine-tuned with the SI-AOA.
Keywords: credit card, data mining, fraud detection, money transactions
Procedia PDF Downloads 130