Search results for: agglomerative
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 18

Search results for: agglomerative

18 Performance Analysis of Hierarchical Agglomerative Clustering in a Wireless Sensor Network Using Quantitative Data

Authors: Tapan Jain, Davender Singh Saini

Abstract:

Clustering is a useful mechanism in wireless sensor networks which helps to cope with scalability and data transmission problems. The basic aim of our research work is to provide efficient clustering using Hierarchical agglomerative clustering (HAC). If the distance between the sensing nodes is calculated using their location then it’s quantitative HAC. This paper compares the various agglomerative clustering techniques applied in a wireless sensor network using the quantitative data. The simulations are done in MATLAB and the comparisons are made between the different protocols using dendrograms.

Keywords: routing, hierarchical clustering, agglomerative, quantitative, wireless sensor network

Procedia PDF Downloads 562
17 Agglomerative Hierarchical Clustering Using the Tθ Family of Similarity Measures

Authors: Salima Kouici, Abdelkader Khelladi

Abstract:

In this work, we begin with the presentation of the Tθ family of usual similarity measures concerning multidimensional binary data. Subsequently, some properties of these measures are proposed. Finally, the impact of the use of different inter-elements measures on the results of the Agglomerative Hierarchical Clustering Methods is studied.

Keywords: binary data, similarity measure, Tθ measures, agglomerative hierarchical clustering

Procedia PDF Downloads 445
16 Semi-Supervised Hierarchical Clustering Given a Reference Tree of Labeled Documents

Authors: Ying Zhao, Xingyan Bin

Abstract:

Semi-supervised clustering algorithms have been shown effective to improve clustering process with even limited supervision. However, semi-supervised hierarchical clustering remains challenging due to the complexities of expressing constraints for agglomerative clustering algorithms. This paper proposes novel semi-supervised agglomerative clustering algorithms to build a hierarchy based on a known reference tree. We prove that by enforcing distance constraints defined by a reference tree during the process of hierarchical clustering, the resultant tree is guaranteed to be consistent with the reference tree. We also propose a framework that allows the hierarchical tree generation be aware of levels of levels of the agglomerative tree under creation, so that metric weights can be learned and adopted at each level in a recursive fashion. The experimental evaluation shows that the additional cost of our contraint-based semi-supervised hierarchical clustering algorithm (HAC) is negligible, and our combined semi-supervised HAC algorithm outperforms the state-of-the-art algorithms on real-world datasets. The experiments also show that our proposed methods can improve clustering performance even with a small number of unevenly distributed labeled data.

Keywords: semi-supervised clustering, hierarchical agglomerative clustering, reference trees, distance constraints

Procedia PDF Downloads 504
15 Hybrid Hierarchical Clustering Approach for Community Detection in Social Network

Authors: Radhia Toujani, Jalel Akaichi

Abstract:

Social Networks generally present a hierarchy of communities. To determine these communities and the relationship between them, detection algorithms should be applied. Most of the existing algorithms, proposed for hierarchical communities identification, are based on either agglomerative clustering or divisive clustering. In this paper, we present a hybrid hierarchical clustering approach for community detection based on both bottom-up and bottom-down clustering. Obviously, our approach provides more relevant community structure than hierarchical method which considers only divisive or agglomerative clustering to identify communities. Moreover, we performed some comparative experiments to enhance the quality of the clustering results and to show the effectiveness of our algorithm.

Keywords: agglomerative hierarchical clustering, community structure, divisive hierarchical clustering, hybrid hierarchical clustering, opinion mining, social network, social network analysis

Procedia PDF Downloads 324
14 A Model Based Metaheuristic for Hybrid Hierarchical Community Structure in Social Networks

Authors: Radhia Toujani, Jalel Akaichi

Abstract:

In recent years, the study of community detection in social networks has received great attention. The hierarchical structure of the network leads to the emergence of the convergence to a locally optimal community structure. In this paper, we aim to avoid this local optimum in the introduced hybrid hierarchical method. To achieve this purpose, we present an objective function where we incorporate the value of structural and semantic similarity based modularity and a metaheuristic namely bees colonies algorithm to optimize our objective function on both hierarchical level divisive and agglomerative. In order to assess the efficiency and the accuracy of the introduced hybrid bee colony model, we perform an extensive experimental evaluation on both synthetic and real networks.

Keywords: social network, community detection, agglomerative hierarchical clustering, divisive hierarchical clustering, similarity, modularity, metaheuristic, bee colony

Procedia PDF Downloads 350
13 Finding Bicluster on Gene Expression Data of Lymphoma Based on Singular Value Decomposition and Hierarchical Clustering

Authors: Alhadi Bustaman, Soeganda Formalidin, Titin Siswantining

Abstract:

DNA microarray technology is used to analyze thousand gene expression data simultaneously and a very important task for drug development and test, function annotation, and cancer diagnosis. Various clustering methods have been used for analyzing gene expression data. However, when analyzing very large and heterogeneous collections of gene expression data, conventional clustering methods often cannot produce a satisfactory solution. Biclustering algorithm has been used as an alternative approach to identifying structures from gene expression data. In this paper, we introduce a transform technique based on singular value decomposition to identify normalized matrix of gene expression data followed by Mixed-Clustering algorithm and the Lift algorithm, inspired in the node-deletion and node-addition phases proposed by Cheng and Church based on Agglomerative Hierarchical Clustering (AHC). Experimental study on standard datasets demonstrated the effectiveness of the algorithm in gene expression data.

Keywords: agglomerative hierarchical clustering (AHC), biclustering, gene expression data, lymphoma, singular value decomposition (SVD)

Procedia PDF Downloads 250
12 Agglomerative Hierarchical Clustering Based on Morphmetric Parameters of the Populations of Labeo rohita

Authors: Fayyaz Rasool, Naureen Aziz Qureshi, Shakeela Parveen

Abstract:

Labeo rohita populations from five geographical locations from the hatchery and riverine system of Punjab-Pakistan were studied for the clustering on the basis of similarities and differences based on morphometric parameters within the species. Agglomerative Hierarchical Clustering (AHC) was done by using Pearson Correlation Coefficient and Unweighted Pair Group Method with Arithmetic Mean (UPGMA) as Agglomeration method by XLSTAT 2012 version 1.02. A dendrogram with the data on the morphometrics of the representative samples of each site divided the populations of Labeo rohita in to five major clusters or classes. The variance decomposition for the optimal classification values remained as 19.24% for within class variation, while 80.76% for the between class differences. The representative central objects of the each class, the distances between the class centroids and also the distance between the central objects of the classes were generated by the analysis. A measurable distinction between the classes of the populations of the Labeo rohita was indicated in this study which determined the impacts of changing environment and other possible factors influencing the variation level among the populations of the same species.

Keywords: AHC, Labeo rohita, hatchery, riverine, morphometric

Procedia PDF Downloads 418
11 A Product-Specific/Unobservable Approach to Segmentation for a Value Expressive Credit Card Service

Authors: Manfred F. Maute, Olga Naumenko, Raymond T. Kong

Abstract:

Using data from a nationally representative financial panel of Canadian households, this study develops a psychographic segmentation of the customers of a value-expressive credit card service and tests for effects on relational response differences. The variety of segments elicited by agglomerative and k means clustering and the familiar profiles of individual clusters suggest that the face validity of the psychographic segmentation was quite high. Segmentation had a significant effect on customer satisfaction and relationship depth. However, when socio-demographic characteristics like household size and income were accounted for in the psychographic segmentation, the effect on relational response differences was magnified threefold. Implications for the segmentation of financial services markets are considered.

Keywords: customer satisfaction, financial services, psychographics, response differences, segmentation

Procedia PDF Downloads 303
10 A Non-parametric Clustering Approach for Multivariate Geostatistical Data

Authors: Francky Fouedjio

Abstract:

Multivariate geostatistical data have become omnipresent in the geosciences and pose substantial analysis challenges. One of them is the grouping of data locations into spatially contiguous clusters so that data locations within the same cluster are more similar while clusters are different from each other, in some sense. Spatially contiguous clusters can significantly improve the interpretation that turns the resulting clusters into meaningful geographical subregions. In this paper, we develop an agglomerative hierarchical clustering approach that takes into account the spatial dependency between observations. It relies on a dissimilarity matrix built from a non-parametric kernel estimator of the spatial dependence structure of data. It integrates existing methods to find the optimal cluster number and to evaluate the contribution of variables to the clustering. The capability of the proposed approach to provide spatially compact, connected and meaningful clusters is assessed using bivariate synthetic dataset and multivariate geochemical dataset. The proposed clustering method gives satisfactory results compared to other similar geostatistical clustering methods.

Keywords: clustering, geostatistics, multivariate data, non-parametric

Procedia PDF Downloads 454
9 A Polynomial Time Clustering Algorithm for Solving the Assignment Problem in the Vehicle Routing Problem

Authors: Lydia Wahid, Mona F. Ahmed, Nevin Darwish

Abstract:

The vehicle routing problem (VRP) consists of a group of customers that needs to be served. Each customer has a certain demand of goods. A central depot having a fleet of vehicles is responsible for supplying the customers with their demands. The problem is composed of two subproblems: The first subproblem is an assignment problem where the number of vehicles that will be used as well as the customers assigned to each vehicle are determined. The second subproblem is the routing problem in which for each vehicle having a number of customers assigned to it, the order of visits of the customers is determined. Optimal number of vehicles, as well as optimal total distance, should be achieved. In this paper, an approach for solving the first subproblem (the assignment problem) is presented. In the approach, a clustering algorithm is proposed for finding the optimal number of vehicles by grouping the customers into clusters where each cluster is visited by one vehicle. Finding the optimal number of clusters is NP-hard. This work presents a polynomial time clustering algorithm for finding the optimal number of clusters and solving the assignment problem.

Keywords: vehicle routing problems, clustering algorithms, Clarke and Wright Saving Method, agglomerative hierarchical clustering

Procedia PDF Downloads 357
8 Antibacterial Evaluation, in Silico ADME and QSAR Studies of Some Benzimidazole Derivatives

Authors: Strahinja Kovačević, Lidija Jevrić, Miloš Kuzmanović, Sanja Podunavac-Kuzmanović

Abstract:

In this paper, various derivatives of benzimidazole have been evaluated against Gram-negative bacteria Escherichia coli. For all investigated compounds the minimum inhibitory concentration (MIC) was determined. Quantitative structure-activity relationships (QSAR) attempts to find consistent relationships between the variations in the values of molecular properties and the biological activity for a series of compounds so that these rules can be used to evaluate new chemical entities. The correlation between MIC and some absorption, distribution, metabolism and excretion (ADME) parameters was investigated, and the mathematical models for predicting the antibacterial activity of this class of compounds were developed. The quality of the multiple linear regression (MLR) models was validated by the leave-one-out (LOO) technique, as well as by the calculation of the statistical parameters for the developed models and the results are discussed on the basis of the statistical data. The results of this study indicate that ADME parameters have a significant effect on the antibacterial activity of this class of compounds. Principal component analysis (PCA) and agglomerative hierarchical clustering algorithms (HCA) confirmed that the investigated molecules can be classified into groups on the basis of the ADME parameters: Madin-Darby Canine Kidney cell permeability (MDCK), Plasma protein binding (PPB%), human intestinal absorption (HIA%) and human colon carcinoma cell permeability (Caco-2).

Keywords: benzimidazoles, QSAR, ADME, in silico

Procedia PDF Downloads 345
7 Identifying Autism Spectrum Disorder Using Optimization-Based Clustering

Authors: Sharifah Mousli, Sona Taheri, Jiayuan He

Abstract:

Autism spectrum disorder (ASD) is a complex developmental condition involving persistent difficulties with social communication, restricted interests, and repetitive behavior. The challenges associated with ASD can interfere with an affected individual’s ability to function in social, academic, and employment settings. Although there is no effective medication known to treat ASD, to our best knowledge, early intervention can significantly improve an affected individual’s overall development. Hence, an accurate diagnosis of ASD at an early phase is essential. The use of machine learning approaches improves and speeds up the diagnosis of ASD. In this paper, we focus on the application of unsupervised clustering methods in ASD as a large volume of ASD data generated through hospitals, therapy centers, and mobile applications has no pre-existing labels. We conduct a comparative analysis using seven clustering approaches such as K-means, agglomerative hierarchical, model-based, fuzzy-C-means, affinity propagation, self organizing maps, linear vector quantisation – as well as the recently developed optimization-based clustering (COMSEP-Clust) approach. We evaluate the performances of the clustering methods extensively on real-world ASD datasets encompassing different age groups: toddlers, children, adolescents, and adults. Our experimental results suggest that the COMSEP-Clust approach outperforms the other seven methods in recognizing ASD with well-separated clusters.

Keywords: autism spectrum disorder, clustering, optimization, unsupervised machine learning

Procedia PDF Downloads 81
6 The Relationship Between Car Drivers' Background Information and Risky Events In I- Dreams Project

Authors: Dagim Dessalegn Haile

Abstract:

This study investigated the interaction between the drivers' socio-demographic background information (age, gender, and driving experience) and the risky events score in the i-DREAMS platform. Further, the relationship between the participants' background driving behavior and the i-DREAMS platform behavioral output scores of risky events was also investigated. The i-DREAMS acronym stands for Smart Driver and Road Environment Assessment and Monitoring System. It is a European Union Horizon 2020 funded project consisting of 13 partners, researchers, and industry partners from 8 countries. A total of 25 Belgian car drivers (16 male and nine female) were considered for analysis. Drivers' ages were categorized into ages 18-25, 26-45, 46-65, and 65 and older. Drivers' driving experience was also categorized into four groups: 1-15, 16-30, 31-45, and 46-60 years. Drivers are classified into two clusters based on the recorded score for risky events during phase 1 (baseline) using risky events; acceleration, deceleration, speeding, tailgating, overtaking, and lane discipline. Agglomerative hierarchical clustering using SPSS shows Cluster 1 drivers are safer drivers, and Cluster 2 drivers are identified as risky drivers. The analysis result indicated no significant relationship between age groups, gender, and experience groups except for risky events like acceleration, tailgating, and overtaking in a few phases. This is mainly because the fewer participants create less variability of socio-demographic background groups. Repeated measure ANOVA shows that cluster 2 drivers improved more than cluster 1 drivers for tailgating, lane discipline, and speeding events. A positive relationship between background drivers' behavior and i-DREAMS platform behavioral output scores is observed. It implies that car drivers who in the questionnaire data indicate committing more risky driving behavior demonstrate more risky driver behavior in the i-DREAMS observed driving data.

Keywords: i-dreams, car drivers, socio-demographic background, risky events

Procedia PDF Downloads 34
5 A Comparative Analysis of Clustering Approaches for Understanding Patterns in Health Insurance Uptake: Evidence from Sociodemographic Kenyan Data

Authors: Nelson Kimeli Kemboi Yego, Juma Kasozi, Joseph Nkruzinza, Francis Kipkogei

Abstract:

The study investigated the low uptake of health insurance in Kenya despite efforts to achieve universal health coverage through various health insurance schemes. Unsupervised machine learning techniques were employed to identify patterns in health insurance uptake based on sociodemographic factors among Kenyan households. The aim was to identify key demographic groups that are underinsured and to provide insights for the development of effective policies and outreach programs. Using the 2021 FinAccess Survey, the study clustered Kenyan households based on their health insurance uptake and sociodemographic features to reveal patterns in health insurance uptake across the country. The effectiveness of k-prototypes clustering, hierarchical clustering, and agglomerative hierarchical clustering in clustering based on sociodemographic factors was compared. The k-prototypes approach was found to be the most effective at uncovering distinct and well-separated clusters in the Kenyan sociodemographic data related to health insurance uptake based on silhouette, Calinski-Harabasz, Davies-Bouldin, and Rand indices. Hence, it was utilized in uncovering the patterns in uptake. The results of the analysis indicate that inclusivity in health insurance is greatly related to affordability. The findings suggest that targeted policy interventions and outreach programs are necessary to increase health insurance uptake in Kenya, with the ultimate goal of achieving universal health coverage. The study provides important insights for policymakers and stakeholders in the health insurance sector to address the low uptake of health insurance and to ensure that healthcare services are accessible and affordable to all Kenyans, regardless of their socio-demographic status. The study highlights the potential of unsupervised machine learning techniques to provide insights into complex health policy issues and improve decision-making in the health sector.

Keywords: health insurance, unsupervised learning, clustering algorithms, machine learning

Procedia PDF Downloads 85
4 Investigating Homicide Offender Typologies Based on Their Clinical Histories and Crime Scene Behaviour Patterns

Authors: Valeria Abreu Minero, Edward Barker, Hannah Dickson, Francois Husson, Sandra Flynn, Jennifer Shaw

Abstract:

Purpose – The purpose of this paper is to identify offender typologies based on aspects of the offenders’ psychopathology and their associations with crime scene behaviours using data derived from the National Confidential Enquiry into Suicide and Safety in Mental Health concerning homicides in England and Wales committed by offenders in contact with mental health services in the year preceding the offence (n=759). Design/methodology/approach – The authors used multiple correspondence analysis to investigate the interrelationships between the variables and hierarchical agglomerative clustering to identify offender typologies. Variables describing: the offender’s mental health history; the offenders’ mental state at the time of offence; characteristics useful for police investigations; and patterns of crime scene behaviours were included. Findings – Results showed differences in the offender’s histories in relation to their crime scene behaviours. Further, analyses revealed three homicide typologies: externalising, psychosis and depression. Analyses revealed three homicide typologies: externalising, psychotic and depressive. Practical implications – These typologies may assist the police during homicide investigations by: furthering their understanding of the crime or likely suspect; offering insights into crime patterns; provide advice as to what an offender’s offence behaviour might signify about his/her mental health background; findings suggest information concerning offender psychopathology may be useful for offender profiling purposes in cases of homicide offenders with schizophrenia, depression and comorbid diagnosis of personality disorder and alcohol/drug dependence. Originality/value – Empirical studies with an emphasis on offender profiling have almost exclusively focussed on the inference of offender demographic characteristics. This study provides a first step in the exploration of offender psychopathology and its integration to the multivariate analysis of offence information for the purposes of investigative profiling of homicide by identifying the dominant patterns of mental illness within homicidal behaviour.

Keywords: offender profiling, mental illness, psychopathology, multivariate analysis, homicide, crime scene analysis, crime scene behviours, investigative advice

Procedia PDF Downloads 95
3 Phylogenetic Inferences based on Morphoanatomical Characters in Plectranthus esculentus N. E. Br. (Lamiaceae) from Nigeria

Authors: Otuwose E. Agyeno, Adeniyi A. Jayeola, Bashir A. Ajala

Abstract:

P. esculentus is indigenous to Nigeria yet no wild relation has been encountered or reported. This has made it difficult to establish proper lineages between the varieties and landraces under cultivation. The present work is the first to determine the apormophy of 135 morphoanatomical characters in organs of 46 accessions drawn from 23 populations of this species based on dicta. The character states were coded in accession x character-state matrices and only 83 were informative and utilised for neighbour joining clustering based on euclidean values, and heuristic search in parsimony analysis using PAST ver. 3.15 software. Compatibility and evolutionary trends between accessions were then explored from values and diagrams produced. The low consistency indices (CI) recorded support monophyly and low homoplasy in this taxon. Agglomerative schedules based on character type and source data sets divided the accessions into mainly 3 clades, each of complexes of accessions. Solenostemon rotundifolius (Poir) J.K Morton was the outgroup (OG) used, and it occurred within the largest clades except when the characters were combined in a data set. The OG showed better compatibility with accessions of populations of landrace Isci, and varieties Riyum and Long’at. Otherwise, its aerial parts are more consistent with those of accessions of variety Bebot. The highly polytomous clades produced due to anatomical data set may be an indication of how stable such characters are in this species. Strict consensus trees with more than 60 nodes outputted showed that the basal nodes were strongly supported by 3 to 17 characters across the data sets, suggesting that populations of this species are more alike. The OG was clearly the first diverging lineage and closely related to accessions of landrace Gwe and variety Bebot morphologically, but different from them anatomically. It was also distantly related to landrace Fina and variety Long’at in terms of root, stem and leaf structural attributes. There were at least 5 other clades with each comprising of complexes of accessions from different localities and terrains within the study area. Spherical stem in cross section, size of vascular bundles at the stem corners as well as the alternate and whorl phyllotaxy are attributes which may have facilitated each other’s evolution in all accessions of the landrace Gwe, and they may be innovative since such states are not characteristic of the larger Lamiaceae, and Plectranthus L’Her in particular. In conclusion, this study has provided valuable information about infraspecific diversity in this taxon. It supports recognition of the varietal statuses accorded to populations of P. esculentus, as well as the hypothesis that the wild gene might have been distributed on the Jos Plateau. However, molecular characterisation of accessions of populations of this species would resolve this problem better.

Keywords: clustering, lineage, morphoanatomical characters, Nigeria, phylogenetics, Plectranthus esculentus, population

Procedia PDF Downloads 108
2 Barriers to Tuberculosis Detection in Portuguese Prisons

Authors: M. F. Abreu, A. I. Aguiar, R. Gaio, R. Duarte

Abstract:

Background: Prison establishments constitute high-risk environments for the transmission and spread of tuberculosis (TB), given their epidemiological context and the difficulty of implementing preventive and control measures. Guidelines for control and prevention of tuberculosis in prisons have been described as incomplete and heterogeneous internationally, due to several identified obstacles, for example scarcity of human resources and funding of prisoner health services. In Portugal, a protocol was created in 2014 with the aim to define and standardize procedures of detection and prevention of tuberculosis within prisons. Objective: The main objective of this study was to identify and describe barriers to tuberculosis detection in prisons of Porto and Lisbon districts in Portugal. Methods: A cross-sectional study was conducted from 2ⁿᵈ January 2018 till 30ᵗʰ June 2018. Semi-structured questionnaires were applied to health care professionals working in the prisons of the districts of Porto (n=6) and Lisbon (n=8). As inclusion criteria we considered having work experience in the area of tuberculosis (either in diagnosis, treatment, or follow up). The questionnaires were self-administered, in paper format. Descriptive analyses of the questionnaire variables were made using frequencies and median. Afterwards, a hierarchical agglomerative clusters analysis was performed. After obtaining the clusters, the chi-square test was applied to study the association between the variables collected and the clusters. The level of significance considered was 0.05. Results: From the total of 186 health professionals, 139 met the criteria of inclusion and 82 health professionals were interviewed (62,2% of participation). Most were female, nurses, with a median age of 34 years, with term employment contract. From the cluster analysis, two groups were identified with different characteristics and behaviors for the procedures of this protocol. Statistically significant results were found in: elements of cluster 1 (78% of the total participants) work in prisons for a longer time (p=0.003), 45,3% work > 4 years while 50% of the elements of cluster 2 work for less than a year, and more frequently answered they know and apply the procedures of the protocol (p=0.000). Both clusters answered frequently the need of having theoretical-practical training for TB (p=0.000), especially in the areas of diagnosis, treatment and prevention and that there is scarcity of funding to prisoner health services (p=0.000). Regarding procedures for TB screening (periodic and contact screening) and procedures for transferring a prisoner with this disease, cluster 1 also answered more frequently to perform them (p=0.000). They also referred that the material/equipment for TB screening is accessible and available (p=0.000). From this clusters we identified as barriers scarcity of human resources, the need to theoretical-practical training for tuberculosis, inexperience in working in health services prisons and limited knowledge of protocol procedures. Conclusions: The barriers found in this study are the same described internationally. This protocol is mostly being applied in portuguese prisons. The study also showed the need to invest in human and material resources. This investigation bridged gaps in knowledge that could help prison health services optimize the care provided for early detection and adherence of prisoners to treatment of tuberculosis.

Keywords: barriers, health care professionals, prisons, protocol, tuberculosis

Procedia PDF Downloads 113
1 A Clustering-Based Approach for Weblog Data Cleaning

Authors: Amine Ganibardi, Cherif Arab Ali

Abstract:

This paper addresses the data cleaning issue as a part of web usage data preprocessing within the scope of Web Usage Mining. Weblog data recorded by web servers within log files reflect usage activity, i.e., End-users’ clicks and underlying user-agents’ hits. As Web Usage Mining is interested in End-users’ behavior, user-agents’ hits are referred to as noise to be cleaned-off before mining. Filtering hits from clicks is not trivial for two reasons, i.e., a server records requests interlaced in sequential order regardless of their source or type, website resources may be set up as requestable interchangeably by end-users and user-agents. The current methods are content-centric based on filtering heuristics of relevant/irrelevant items in terms of some cleaning attributes, i.e., website’s resources filetype extensions, website’s resources pointed by hyperlinks/URIs, http methods, user-agents, etc. These methods need exhaustive extra-weblog data and prior knowledge on the relevant and/or irrelevant items to be assumed as clicks or hits within the filtering heuristics. Such methods are not appropriate for dynamic/responsive Web for three reasons, i.e., resources may be set up to as clickable by end-users regardless of their type, website’s resources are indexed by frame names without filetype extensions, web contents are generated and cancelled differently from an end-user to another. In order to overcome these constraints, a clustering-based cleaning method centered on the logging structure is proposed. This method focuses on the statistical properties of the logging structure at the requested and referring resources attributes levels. It is insensitive to logging content and does not need extra-weblog data. The used statistical property takes on the structure of the generated logging feature by webpage requests in terms of clicks and hits. Since a webpage consists of its single URI and several components, these feature results in a single click to multiple hits ratio in terms of the requested and referring resources. Thus, the clustering-based method is meant to identify two clusters based on the application of the appropriate distance to the frequency matrix of the requested and referring resources levels. As the ratio clicks to hits is single to multiple, the clicks’ cluster is the smallest one in requests number. Hierarchical Agglomerative Clustering based on a pairwise distance (Gower) and average linkage has been applied to four logfiles of dynamic/responsive websites whose click to hits ratio range from 1/2 to 1/15. The optimal clustering set on the basis of average linkage and maximum inter-cluster inertia results always in two clusters. The evaluation of the smallest cluster referred to as clicks cluster under the terms of confusion matrix indicators results in 97% of true positive rate. The content-centric cleaning methods, i.e., conventional and advanced cleaning, resulted in a lower rate 91%. Thus, the proposed clustering-based cleaning outperforms the content-centric methods within dynamic and responsive web design without the need of any extra-weblog. Such an improvement in cleaning quality is likely to refine dependent analysis.

Keywords: clustering approach, data cleaning, data preprocessing, weblog data, web usage data

Procedia PDF Downloads 153