Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 28

Search results for: imputation

28 Two-Phase Sampling for Estimating a Finite Population Total in Presence of Missing Values

Authors: Daniel Fundi Murithi


Missing data is a real bane in many surveys. To overcome the problems caused by missing data, partial deletion and single imputation methods, among others, have been proposed. However, these methods have problems of their own, such as discarding usable data and inaccurately reproducing known population parameters and standard errors. Regression and stochastic imputation assume that a variable with complete cases is available as a predictor for the missing values in the other variable, and that the relationship between the two variables is linear, which might not be realistic in practice. In this project, we estimate the population total in the presence of missing values in two-phase sampling. Instead of regression or stochastic models, a nonparametric model-based regression is used to impute missing values. An empirical study showed that nonparametric model-based regression imputation is better than mean, median, regression, and stochastic imputation at reproducing the variance of the population total estimate obtained when there were no missing values. Although regression and stochastic imputation were better than nonparametric model-based imputation at reproducing the population total estimates obtained when there were no missing values in one of the sample sizes considered, nonparametric model-based imputation may be used when the relationship between outcome and predictor variables is not linear.
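The abstract does not say which nonparametric estimator is used. As a rough illustration only, the sketch below assumes a Nadaraya-Watson kernel regression with a Gaussian kernel; the function name `nw_impute` and the `bandwidth` parameter are hypothetical and not taken from the paper.

```python
import numpy as np

def nw_impute(x, y, bandwidth=1.0):
    """Fill NaNs in y with a Nadaraya-Watson kernel regression on x.

    Assumption: Gaussian kernel and a fixed bandwidth; the paper's
    actual estimator may differ.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float).copy()
    observed = ~np.isnan(y)
    for i in np.flatnonzero(~observed):
        # Kernel weight of every observed unit relative to unit i
        w = np.exp(-0.5 * ((x[observed] - x[i]) / bandwidth) ** 2)
        # Imputed value is the weighted average of observed outcomes
        y[i] = np.sum(w * y[observed]) / np.sum(w)
    return y

# Nonlinear toy relationship (y = x**2) with one missing outcome
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 4.0, np.nan, 16.0, 25.0])
y_filled = nw_impute(x, y, bandwidth=1.0)
```

Because the fit is local, no linear relationship between predictor and outcome is imposed, which is the property the abstract emphasizes.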

Keywords: finite population total, missing data, model-based imputation, two-phase sampling

Procedia PDF Downloads 48
27 Effect of Genuine Missing Data Imputation on Prediction of Urinary Incontinence

Authors: Suzan Arslanturk, Mohammad-Reza Siadat, Theophilus Ogunyemi, Ananias Diokno


Missing data is a common challenge in statistical analyses of most clinical survey datasets. A variety of methods have been developed to enable analysis of survey data with missing values. Imputation is the most commonly used of these methods. However, in order to minimize the bias introduced due to imputation, one must choose the right imputation technique and apply it to the correct type of missing data. In this paper, we have identified different types of missing values: missing data due to skip pattern (SPMD), undetermined missing data (UMD), and genuine missing data (GMD) and applied rough set imputation on only the GMD portion of the missing data. We have used rough set imputation to evaluate the effect of such imputation on prediction by generating several simulation datasets based on an existing epidemiological dataset (MESA). To measure how well each dataset lends itself to the prediction model (logistic regression), we have used p-values from the Wald test. To evaluate the accuracy of the prediction, we have considered the width of the 95% confidence interval for the probability of incontinence. Both imputed and non-imputed simulation datasets were fit to the prediction model, and they both turned out to be significant (p-value < 0.05). However, the Wald score shows a better fit for the imputed compared to non-imputed datasets (28.7 vs. 23.4). The average confidence interval width was decreased by 10.4% when the imputed dataset was used, meaning higher precision. The results show that using the rough set method for missing data imputation on GMD data improves the predictive capability of the logistic regression. Further studies are required to generalize this conclusion to other clinical survey datasets.

Keywords: rough set, imputation, clinical survey data simulation, genuine missing data, predictive index

Procedia PDF Downloads 82
26 A Neural Network Based Clustering Approach for Imputing Multivariate Values in Big Data

Authors: S. Nickolas, Shobha K.


The treatment of incomplete data is an important step in data pre-processing. Missing values create a noisy environment in all applications and are an unavoidable problem in big data management and analysis. Researchers have introduced numerous techniques for handling missing data, such as discarding rows with missing values, mean imputation, expectation maximization, neural networks with evolutionary algorithms or optimized techniques, and hot deck imputation. Among these, imputation techniques play a positive role in filling missing values when it is necessary to use all records in the data rather than discard records with missing values. In this paper, we propose a novel artificial neural network based clustering algorithm, Adaptive Resonance Theory-2 (ART2), for imputation of missing values in mixed attribute data sets. ART2 can recognize learned models quickly and adapt to new objects rapidly. It carries out model-based clustering by using competitive learning and a self-steady mechanism in a dynamic environment without supervision. The proposed approach not only imputes the missing values but also provides information about handling the outliers.

Keywords: ART2, data imputation, clustering, missing data, neural network, pre-processing

Procedia PDF Downloads 207
25 A Large Dataset Imputation Approach Applied to Country Conflict Prediction Data

Authors: Benjamin Leiby, Darryl Ahner


This study demonstrates an alternative stochastic imputation approach for large datasets when preferred commercial packages struggle to iterate due to numerical problems. A large country conflict dataset motivates the search to impute missing values well over a common threshold of 20% missingness. The methodology capitalizes on correlation while using model residuals to provide the uncertainty in estimating unknown values. Examination of the methodology provides insight toward choosing linear or nonlinear modeling terms. Static tolerances common in most packages are replaced with tailorable tolerances that exploit residuals to fit each data element. The methodology evaluation includes observing computation time, model fit, and the comparison of known values to replaced values created through imputation. Overall, the country conflict dataset illustrates promise with modeling first-order interactions while presenting a need for further refinement that mimics predictive mean matching.
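The general idea of stochastic regression imputation, where a fitted model supplies the prediction and its residuals supply the uncertainty, can be sketched as follows. This is a minimal single-predictor illustration and not the authors' methodology: the function name, the straight-line model, and the resampling of residuals with replacement are all assumptions.

```python
import numpy as np

def stochastic_regression_impute(x, y, rng):
    """Impute NaNs in y from x with a linear fit plus resampled residuals.

    Assumption: one complete predictor and a straight-line model;
    the study itself also considers nonlinear and interaction terms.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float).copy()
    observed = ~np.isnan(y)
    # np.polyfit returns coefficients from highest degree to lowest
    slope, intercept = np.polyfit(x[observed], y[observed], deg=1)
    residuals = y[observed] - (intercept + slope * x[observed])
    missing = ~observed
    # Add a randomly drawn observed residual to each prediction so that
    # imputed values carry the same scatter as the observed ones
    noise = rng.choice(residuals, size=int(missing.sum()), replace=True)
    y[missing] = intercept + slope * x[missing] + noise
    return y

rng = np.random.default_rng(42)
x_demo = np.arange(20.0)
y_demo = 3.0 * x_demo + rng.normal(0.0, 1.0, size=20)
y_demo[[4, 11, 17]] = np.nan
y_imputed = stochastic_regression_impute(x_demo, y_demo, rng)
```

Comparing imputed values against deliberately hidden known values, as the abstract describes, is one way to check whether the residual-based noise is of a sensible size.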

Keywords: correlation, country conflict, imputation, stochastic regression

Procedia PDF Downloads 39
24 Energy Complementary In Colombia: Imputation of Dataset

Authors: Felipe Villegas-Velasquez, Harold Pantoja-Villota, Sergio Holguin-Cardona, Alejandro Osorio-Botero, Brayan Candamil-Arango


Colombian electricity comes mainly from hydric resources, affected by environmental variations such as the El Niño phenomenon. That is why incorporating other types of resources is necessary to provide electricity constantly. This research seeks to fill the wind speed and global solar irradiance dataset for two years with the highest amount of information. A further result is the characterization of the data by region that led to infer which errors occurred and offered the incomplete dataset.

Keywords: energy, wind speed, global solar irradiance, Colombia, imputation

Procedia PDF Downloads 31
23 Overview of Adaptive Spline interpolation

Authors: Rongli Gai, Zhiyuan Chang


At this stage, in view of various situations in the interpolation process, most researchers use self-adaptation to adjust the interpolation process, which is also one of the current and future research hotspots in the field of CNC machining. In the interpolation process, according to the overview of the spline curve interpolation algorithm, the adaptive analysis is carried out from the factors affecting the interpolation process. The adaptive operation is reflected in various aspects, such as speed, parameters, errors, nodes, feed rates, random period, sensitive points, step size, curvature, adaptive segmentation, adaptive optimization, etc. This paper will analyze and summarize the research of adaptive imputation in the direction of the above factors affecting imputation.

Keywords: adaptive algorithm, CNC machining, interpolation constraints, spline curve interpolation

Procedia PDF Downloads 56
22 Bias-Corrected Estimation Methods for Receiver Operating Characteristic Surface

Authors: Khanh To Duc, Monica Chiogna, Gianfranco Adimari


With three diagnostic categories, assessment of the performance of diagnostic tests is achieved by the analysis of the receiver operating characteristic (ROC) surface, which generalizes the ROC curve for binary diagnostic outcomes. The volume under the ROC surface (VUS) is a summary index usually employed for measuring the overall diagnostic accuracy. When the true disease status can be exactly assessed by means of a gold standard (GS) test, unbiased nonparametric estimators of the ROC surface and VUS are easily obtained. In practice, unfortunately, disease status verification via the GS test could be unavailable for all study subjects, due to the expensiveness or invasiveness of the GS test. Thus, often only a subset of patients undergoes disease verification. Statistical evaluations of diagnostic accuracy based only on data from subjects with verified disease status are typically biased. This bias is known as verification bias. Here, we consider the problem of correcting for verification bias when continuous diagnostic tests for three-class disease status are considered. We assume that selection for disease verification does not depend on disease status, given test results and other observed covariates, i.e., we assume that the true disease status, when missing, is missing at random. Under this assumption, we discuss several solutions for ROC surface analysis based on imputation and re-weighting methods. In particular, verification bias-corrected estimators of the ROC surface and of VUS are proposed, namely, full imputation, mean score imputation, inverse probability weighting and semiparametric efficient estimators. Consistency and asymptotic normality of the proposed estimators are established, and their finite sample behavior is investigated by means of Monte Carlo simulation studies. Two illustrations using real datasets are also given.
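The re-weighting idea behind inverse probability weighting can be illustrated on a much simpler target than the ROC surface. The sketch below estimates a disease-class proportion from verified subjects only, weighting each by the inverse of its verification probability. The function name and the use of known probabilities are assumptions; in practice these probabilities would be estimated from test results and covariates under the missing-at-random assumption.

```python
import numpy as np

def ipw_proportion(indicator, verified, verify_prob):
    """Normalised IPW estimate of a class proportion.

    indicator   : 1.0 if the subject belongs to the class, 0.0 otherwise
                  (may be NaN for unverified subjects)
    verified    : True where the gold-standard status was observed
    verify_prob : probability of verification for each subject
                  (assumed known here; hypothetical input)
    """
    indicator = np.asarray(indicator, dtype=float)
    verified = np.asarray(verified, dtype=bool)
    verify_prob = np.asarray(verify_prob, dtype=float)
    # Verified subjects stand in for 1 / verify_prob subjects like them
    weights = 1.0 / verify_prob[verified]
    return float(np.sum(weights * indicator[verified]) / np.sum(weights))

status = np.array([1.0, 0.0, np.nan, np.nan])
verified = np.array([True, True, False, False])
probs = np.array([0.5, 0.25, 0.5, 0.25])
estimate = ipw_proportion(status, verified, probs)
```

A naive average over the two verified subjects would give 0.5, whereas the weighted estimate gives more influence to the subject who was less likely to be verified.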

Keywords: imputation, missing at random, inverse probability weighting, ROC surface analysis

Procedia PDF Downloads 328
21 dynr.mi: An R Program for Multiple Imputation in Dynamic Modeling

Authors: Yanling Li, Linying Ji, Zita Oravecz, Timothy R. Brick, Michael D. Hunter, Sy-Miin Chow


Assessing several individuals intensively over time yields intensive longitudinal data (ILD). Even though ILD provide rich information, they also bring other data analytic challenges. One of these is the increased occurrence of missingness with increased study length, possibly under non-ignorable missingness scenarios. Multiple imputation (MI) handles missing data by creating several imputed data sets, and pooling the estimation results across imputed data sets to yield final estimates for inferential purposes. In this article, we introduce dynr.mi(), a function in the R package, Dynamic Modeling in R (dynr). The package dynr provides a suite of fast and accessible functions for estimating and visualizing the results from fitting linear and nonlinear dynamic systems models in discrete as well as continuous time. By integrating the estimation functions in dynr and the MI procedures available from the R package, Multivariate Imputation by Chained Equations (MICE), the dynr.mi() routine is designed to handle possibly non-ignorable missingness in the dependent variables and/or covariates in a user-specified dynamic systems model via MI, with convergence diagnostic check. We utilized dynr.mi() to examine, in the context of a vector autoregressive model, the relationships among individuals’ ambulatory physiological measures, and self-report affect valence and arousal. The results from MI were compared to those from listwise deletion of entries with missingness in the covariates. When we determined the number of iterations based on the convergence diagnostics available from dynr.mi(), differences in the statistical significance of the covariate parameters were observed between the listwise deletion and MI approaches. These results underscore the importance of considering diagnostic information in the implementation of MI procedures.
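The pooling step of multiple imputation is commonly done with Rubin's rules. The abstract does not state the exact pooling formulas used by dynr.mi(), so the following Python sketch is an assumption about the standard approach rather than a description of the package; `pool_rubin` is a hypothetical name.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed data sets with Rubin's rules.

    estimates : point estimate from each imputed data set
    variances : squared standard error from each imputed data set
    Returns the pooled estimate and its total variance.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    pooled = q.mean()            # average of the m point estimates
    within = u.mean()            # average sampling variance
    between = q.var(ddof=1)      # variance of estimates across imputations
    total = within + (1.0 + 1.0 / m) * between
    return float(pooled), float(total)

pooled_estimate, total_variance = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what makes pooled standard errors wider than those from a single imputed data set, which is one reason significance conclusions can differ from listwise deletion.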

Keywords: dynamic modeling, missing data, mobility, multiple imputation

Procedia PDF Downloads 94
20 Self-Organizing Maps for Exploration of Partially Observed Data and Imputation of Missing Values in the Context of the Manufacture of Aircraft Engines

Authors: Sara Rejeb, Catherine Duveau, Tabea Rebafka


To monitor the production process of turbofan aircraft engines, multiple measurements of various geometrical parameters are systematically recorded on manufactured parts. Engine parts are subject to extremely high standards as they can impact the performance of the engine. Therefore, it is essential to analyze these databases to better understand the influence of the different parameters on the engine's performance. Self-organizing maps are unsupervised neural networks which achieve two tasks simultaneously: they visualize high-dimensional data by projection onto a 2-dimensional map and provide clustering of the data. This technique has become very popular for data exploration since it provides easily interpretable results and a meaningful global view of the data. As such, self-organizing maps are usually applied to aircraft engine condition monitoring. As databases in this field are huge and complex, they naturally contain multiple missing entries for various reasons. The classical Kohonen algorithm to compute self-organizing maps is conceived for complete data only. A naive approach to deal with partially observed data consists in deleting items or variables with missing entries. However, this requires a sufficient number of complete individuals to be fairly representative of the population; otherwise, deletion leads to a considerable loss of information. Moreover, deletion can also induce bias in the analysis results. Alternatively, one can first apply a common imputation method to create a complete dataset and then apply the Kohonen algorithm. However, the choice of the imputation method may have a strong impact on the resulting self-organizing map. Our approach is to address simultaneously the two problems of computing a self-organizing map and imputing missing values, as these tasks are not independent. In this work, we propose an extension of self-organizing maps for partially observed data, referred to as missSOM. 
First, we introduce a criterion to be optimized, that aims at defining simultaneously the best self-organizing map and the best imputations for the missing entries. As such, missSOM is also an imputation method for missing values. To minimize the criterion, we propose an iterative algorithm that alternates the learning of a self-organizing map and the imputation of missing values. Moreover, we develop an accelerated version of the algorithm by entwining the iterations of the Kohonen algorithm with the updates of the imputed values. This method is efficiently implemented in R and will soon be released on CRAN. Compared to the standard Kohonen algorithm, it does not come with any additional cost in terms of computing time. Numerical experiments illustrate that missSOM performs well in terms of both clustering and imputation compared to the state of the art. In particular, it turns out that missSOM is robust to the missingness mechanism, which is in contrast to many imputation methods that are appropriate for only a single mechanism. This is an important property of missSOM as, in practice, the missingness mechanism is often unknown. An application to measurements on one type of part is also provided and shows the practical interest of missSOM.

Keywords: imputation method of missing data, partially observed data, robustness to missingness mechanism, self-organizing maps

Procedia PDF Downloads 84
19 A Review of Methods for Handling Missing Data in the Form of Dropouts in Longitudinal Clinical Trials

Authors: A. Satty, H. Mwambi


Much clinical trial data-based research is characterized by the unavoidable problem of dropout as a result of missing or erroneous values. This paper aims to review some of the various techniques to address the dropout problems in longitudinal clinical trials. The fundamental concepts of the patterns and mechanisms of dropout are discussed. This study presents five general techniques for handling dropout: (1) Deletion methods; (2) Imputation-based methods; (3) Data augmentation methods; (4) Likelihood-based methods; and (5) MNAR-based methods. Under each technique, several methods that are commonly used to deal with dropout are presented, including a review of the existing literature in which we examine the effectiveness of these methods in the analysis of incomplete data. Two application examples are presented to study the potential strengths or weaknesses of some of the methods under certain dropout mechanisms as well as to assess the sensitivity of the modelling assumptions.

Keywords: incomplete longitudinal clinical trials, missing at random (MAR), imputation, weighting methods, sensitivity analysis

Procedia PDF Downloads 329
18 Imputation of Urban Movement Patterns Using Big Data

Authors: Eusebio Odiari, Mark Birkin, Susan Grant-Muller, Nicolas Malleson


Big data typically refers to consumer datasets revealing some detailed heterogeneity in human behavior, which if harnessed appropriately, could potentially revolutionize our understanding of the collective phenomena of the physical world. Inadvertent missing values skew these datasets and compromise the validity of the thesis. Here we discuss a conceptually consistent strategy for identifying other relevant datasets to combine with available big data, to plug the gaps and to create a rich requisite comprehensive dataset for subsequent analysis. Specifically, emphasis is on how these methodologies can for the first time enable the construction of more detailed pictures of passenger demand and drivers of mobility on the railways. These methodologies can predict the influence of changes within the network (like a change in time-table or impact of a new station), explain local phenomena outside the network (like rail-heading) and the other impacts of urban morphology. Our analysis also reveals that our new imputation data model provides for more equitable revenue sharing amongst network operators who manage different parts of the integrated UK railways.

Keywords: big-data, micro-simulation, mobility, ticketing-data, commuters, transport, synthetic, population

Procedia PDF Downloads 141
17 Linkage Disequilibrium and Haplotype Blocks Study from Two High-Density Panels and a Combined Panel in Nelore Beef Cattle

Authors: Priscila A. Bernardes, Marcos E. Buzanskas, Luciana C. A. Regitano, Ricardo V. Ventura, Danisio P. Munari


Genotype imputation has been used to reduce genomic selection costs. In order to increase haplotype detection accuracy in methods that consider the linkage disequilibrium, another approach could be used, such as combined genotype data from different panels. Therefore, this study aimed to evaluate the linkage disequilibrium and haplotype blocks in two high-density panels before and after the imputation to a combined panel in Nelore beef cattle. A total of 814 animals were genotyped with the Illumina BovineHD BeadChip (IHD), wherein 93 animals (23 bulls and 70 progenies) were also genotyped with the Affymetrix Axion Genome-Wide BOS 1 Array Plate (AHD). After the quality control, 809 IHD animals (509,107 SNPs) and 93 AHD (427,875 SNPs) remained for analyses. The combined genotype panel (CP) was constructed by merging both panels after quality control, resulting in 880,336 SNPs. Imputation analysis was conducted using software FImpute v.2.2b. The reference (CP) and target (IHD) populations consisted of 23 bulls and 786 animals, respectively. The linkage disequilibrium and haplotype blocks studies were carried out for IHD, AHD, and imputed CP. Two linkage disequilibrium measures were considered: the correlation coefficient between alleles from two loci (r²) and the |D’|. Both measures were calculated using the software PLINK. The haplotype blocks were estimated using the software Haploview. The r² measurement presented different decay when compared to |D’|, wherein AHD and IHD had almost the same decay. For r², even with possible overestimation by the sample size for AHD (93 animals), the IHD presented higher values when compared to AHD for shorter distances, but with the increase of distance, both panels presented similar values. The r² measurement is influenced by the minor allele frequency of the pair of SNPs, which can cause the observed difference comparing the r² decay and |D’| decay. 
As a sum of the combinations between Illumina and Affymetrix panels, the CP presented a decay equivalent to a mean of these combinations. The estimated haplotype blocks detected for IHD, AHD, and CP were 84,529, 63,967, and 140,336, respectively. The IHD was composed of haplotype blocks with a mean of 137.70 ± 219.05 kb, the AHD with a mean of 102.10 ± 155.47 kb, and the CP with a mean of 107.10 ± 169.14 kb. The majority of the haplotype blocks of these three panels were composed of fewer than 10 SNPs, with only 3,882 (IHD), 193 (AHD) and 8,462 (CP) haplotype blocks composed of 10 SNPs or more. There was an increase in the number of chromosomes covered with long haplotypes when CP was used as well as an increase in haplotype coverage for short chromosomes (23-29), which can contribute to studies that explore haplotype blocks. In general, using CP could be an alternative to increase density and number of haplotype blocks, increasing the probability of obtaining a marker close to a quantitative trait locus of interest.
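To make the contrast between r² and |D’| concrete, the sketch below computes both from haplotype and allele frequencies using the standard textbook definitions. It is illustrative only: the study used PLINK, and the function name and toy frequencies here are hypothetical.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (|D'|, r^2) for two biallelic loci.

    p_ab : frequency of the haplotype carrying allele A and allele B
    p_a  : frequency of allele A at the first locus
    p_b  : frequency of allele B at the second locus
    Assumes both loci are polymorphic (frequencies strictly between 0 and 1).
    """
    d = p_ab - p_a * p_b
    # D is scaled by its largest attainable magnitude given the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d_prime, r2

# Allele B (frequency 0.1) occurs only on haplotypes that also carry A (frequency 0.5)
d_prime, r2 = ld_measures(p_ab=0.1, p_a=0.5, p_b=0.1)
```

In this toy case |D’| reaches its maximum while r² stays low because the two allele frequencies differ, which is consistent with the abstract's remark that r² depends on the minor allele frequency of the SNP pair.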

Keywords: Bos taurus indicus, decay, genotype imputation, single nucleotide polymorphism

Procedia PDF Downloads 206
16 Genome-Wide Association Study Identify COL2A1 as a Susceptibility Gene for the Hand Development Failure of Kashin-Beck Disease

Authors: Feng Zhang


Kashin-Beck disease (KBD) is a chronic osteochondropathy. The mechanism of hand growth and development failure of KBD remains elusive. In this study, we conducted a two-stage genome-wide association study (GWAS) of palmar length-width ratio (LWR) of KBD, involving a total of 493 Chinese Han KBD patients. Affymetrix Genome Wide Human SNP Array 6.0 was applied for SNP genotyping. Association analysis was conducted by PLINK software. Imputation analysis was performed by IMPUTE against the reference panel of the 1000 Genomes Project. In the GWAS, the most significant association was observed between palmar LWR and rs2071358 of COL2A1 gene (P value = 4.68×10⁻⁸). Imputation analysis identified 3 SNPs surrounding rs2071358 with significant or suggestive association signals. Replication study observed additional significant association signals at both rs2071358 (P value = 0.017) and rs4760608 (P value = 0.002) of COL2A1 gene after Bonferroni correction. Our results suggest that COL2A1 gene was a novel susceptibility gene involved in the growth and development failure of hand of KBD.

Keywords: Kashin-Beck disease, genome-wide association study, COL2A1, hand

Procedia PDF Downloads 141
15 Ensemble Methods in Machine Learning: An Algorithmic Approach to Derive Distinctive Behaviors of Criminal Activity Applied to the Poaching Domain

Authors: Zachary Blanks, Solomon Sonya


Poaching presents a serious threat to endangered animal species, environment conservations, and human life. Additionally, some poaching activity has even been linked to supplying funds to support terrorist networks elsewhere around the world. Consequently, agencies dedicated to protecting wildlife habitats have a near intractable task of adequately patrolling an entire area (spanning several thousand kilometers) given limited resources, funds, and personnel at their disposal. Thus, agencies need predictive tools that are both high-performing and easily implementable by the user to help in learning how the significant features (e.g. animal population densities, topography, behavior patterns of the criminals within the area, etc) interact with each other in hopes of abating poaching. This research develops a classification model using machine learning algorithms to aid in forecasting future attacks that is both easy to train and performs well when compared to other models. In this research, we demonstrate how data imputation methods (specifically predictive mean matching, gradient boosting, and random forest multiple imputation) can be applied to analyze data and create significant predictions across a varied data set. Specifically, we apply these methods to improve the accuracy of adopted prediction models (Logistic Regression, Support Vector Machine, etc). Finally, we assess the performance of the model and the accuracy of our data imputation methods by learning on a real-world data set constituting four years of imputed data and testing on one year of non-imputed data. This paper provides three main contributions. First, we extend work done by the Teamcore and CREATE (Center for Risk and Economic Analysis of Terrorism Events) research group at the University of Southern California (USC) working in conjunction with the Department of Homeland Security to apply game theory and machine learning algorithms to develop more efficient ways of reducing poaching. 
This research introduces ensemble methods (Random Forests and Stochastic Gradient Boosting) and applies them to real-world poaching data gathered from the Ugandan rain forest park rangers. Next, we consider the effect of data imputation on both the performance of various algorithms and the general accuracy of the method itself when applied to a dependent variable where a large number of observations are missing. Third, we provide an alternate approach to predict the probability of observing poaching both by season and by month. The results from this research are very promising. We conclude that by using Stochastic Gradient Boosting to predict observations for non-commercial poaching by season, we are able to produce statistically equivalent results while being orders of magnitude faster in computation time and complexity. Additionally, when predicting potential poaching incidents by individual month rather than by entire seasons, boosting techniques produce a mean area under the curve increase of approximately 3% relative to previous prediction schedules by entire seasons.

Keywords: ensemble methods, imputation, machine learning, random forests, statistical analysis, stochastic gradient boosting, wildlife protection

Procedia PDF Downloads 203
14 Optimal Pricing Based on Real Estate Demand Data

Authors: Vanessa Kummer, Maik Meusel


Real estate demand estimates are typically derived from transaction data. However, in regions with excess demand, transactions are driven by supply and therefore do not indicate what people are actually looking for. To estimate the demand for housing in Switzerland, search subscriptions from all important Swiss real estate platforms are used. These data do, however, suffer from missing information—for example, many users do not specify how many rooms they would like or what price they would be willing to pay. In economic analyses, it is often the case that only complete data is used. Usually, however, the proportion of complete data is rather small which leads to most information being neglected. Also, the data might have a strong distortion if it is complete. In addition, the reason that data is missing might itself also contain information, which is however ignored with that approach. An interesting issue is, therefore, if for economic analyses such as the one at hand, there is an added value by using the whole data set with the imputed missing values compared to using the usually small percentage of complete data (baseline). Also, it is interesting to see how different algorithms affect that result. The imputation of the missing data is done using unsupervised learning. Out of the numerous unsupervised learning approaches, the most common ones, such as clustering, principal component analysis, or neural networks techniques are applied. By training the model iteratively on the imputed data and, thereby, including the information of all data into the model, the distortion of the first training set—the complete data—vanishes. In a next step, the performances of the algorithms are measured. This is done by randomly creating missing values in subsets of the data, estimating those values with the relevant algorithms and several parameter combinations, and comparing the estimates to the actual data. 
After having found the optimal parameter set for each algorithm, the missing values are being imputed. Using the resulting data sets, the next step is to estimate the willingness to pay for real estate. This is done by fitting price distributions for real estate properties with certain characteristics, such as the region or the number of rooms. Based on these distributions, survival functions are computed to obtain the functional relationship between characteristics and selling probabilities. Comparing the survival functions shows that estimates which are based on imputed data sets do not differ significantly from each other; however, the demand estimate that is derived from the baseline data does. This indicates that the baseline data set does not include all available information and is therefore not representative for the entire sample. Also, demand estimates derived from the whole data set are much more accurate than the baseline estimation. Thus, in order to obtain optimal results, it is important to make use of all available data, even though it involves additional procedures such as data imputation.
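The evaluation step described above, hiding known values at random, imputing them, and comparing the estimates with the actual data, can be sketched as follows. The mean imputer is only a stand-in for the clustering, PCA, and neural network methods mentioned in the abstract, and all function names are hypothetical.

```python
import numpy as np

def mean_impute(values):
    """Stand-in imputer: replace NaNs with the mean of the observed values."""
    out = values.copy()
    out[np.isnan(out)] = np.nanmean(values)
    return out

def masked_rmse(values, impute, mask_fraction, rng):
    """Hide a random subset of known values, impute them, and score the result."""
    values = np.asarray(values, dtype=float)
    n_mask = max(1, int(round(mask_fraction * values.size)))
    hidden = rng.choice(values.size, size=n_mask, replace=False)
    masked = values.copy()
    masked[hidden] = np.nan              # artificially create missing values
    filled = impute(masked)
    # Compare imputed estimates against the true values that were hidden
    return float(np.sqrt(np.mean((filled[hidden] - values[hidden]) ** 2)))

rng = np.random.default_rng(7)
prices = rng.normal(500_000.0, 50_000.0, size=200)   # synthetic complete data
score = masked_rmse(prices, mean_impute, mask_fraction=0.2, rng=rng)
```

Repeating this over several random masks and parameter settings gives a basis for choosing between imputation algorithms before applying the chosen one to the values that are truly missing.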

Keywords: demand estimate, missing-data imputation, real estate, unsupervised learning

Procedia PDF Downloads 203
13 Imputation of Incomplete Large-Scale Monitoring Count Data via Penalized Estimation

Authors: Mohamed Dakki, Genevieve Robin, Marie Suet, Abdeljebbar Qninba, Mohamed A. El Agbani, Asmâa Ouassou, Rhimou El Hamoumi, Hichem Azafzaf, Sami Rebah, Claudia Feltrup-Azafzaf, Nafouel Hamouda, Wed a.L. Ibrahim, Hosni H. Asran, Amr A. Elhady, Haitham Ibrahim, Khaled Etayeb, Essam Bouras, Almokhtar Saied, Ashrof Glidan, Bakar M. Habib, Mohamed S. Sayoud, Nadjiba Bendjedda, Laura Dami, Clemence Deschamps, Elie Gaget, Jean-Yves Mondain-Monval, Pierre Defos Du Rau


In biodiversity monitoring, large datasets are becoming more and more widely available and are increasingly used globally to estimate species trends and conservation status. These large-scale datasets challenge existing statistical analysis methods, many of which are not adapted to their size, incompleteness and heterogeneity. The development of scalable methods to impute missing data in incomplete large-scale monitoring datasets is crucial to balance sampling in time or space and thus better inform conservation policies. We developed a new method based on penalized Poisson models to impute and analyse incomplete monitoring data in a large-scale framework. The method allows parameterization of (a) space and time factors, (b) the main effects of predictor covariates, as well as (c) space–time interactions. It also benefits from robust statistical and computational capability in large-scale settings. The method was tested extensively on both simulated and real-life waterbird data, with the findings revealing that it outperforms six existing methods in terms of missing data imputation errors. Applying the method to 16 waterbird species, we estimated their long-term trends for the first time at the entire North African scale, a region where monitoring data suffer from many gaps in space and time series. This new approach opens promising perspectives to increase the accuracy of species-abundance trend estimations. We made it freely available in the R package ‘lori’ and recommend its use for large-scale count data, particularly in citizen science monitoring programmes.

Keywords: biodiversity monitoring, high-dimensional statistics, incomplete count data, missing data imputation, waterbird trends in North-Africa

Procedia PDF Downloads 50
12 Internal Migration and Poverty Dynamic Analysis Using a Bayesian Approach: The Tunisian Case

Authors: Amal Jmaii, Damien Rousseliere, Besma Belhadj


We explore the relationship between internal migration and poverty in Tunisia. We present a methodology combining the potential outcomes approach with multiple imputation to highlight the effect of internal migration on poverty states. We find that the probability of being poor decreases when moving from the poorest regions (the western areas) to richer regions (Greater Tunis and the eastern regions).

Keywords: internal migration, potential outcomes approach, poverty dynamics, Tunisia

Procedia PDF Downloads 231
11 Wind Speed Data Analysis in Colombia in 2013 and 2015

Authors: Harold P. Villota, Alejandro Osorio B.


Energy meteorology is a field that studies energy complementarity and the use of renewable sources in interconnected systems. To diversify the energy matrix in Colombia with wind sources, it is necessary to understand the available wind databases. However, the time series provided by 260 automatic weather stations contain empty and not-applicable values, so the purpose is to fill the time series by selecting two years to characterize and impute, and to use them as a basis for completing the data between 2005 and 2020.

Keywords: complementarity, wind speed, renewable, Colombia, characterization, imputation

Procedia PDF Downloads 25
10 Imputation Technique for Feature Selection in Microarray Data Set

Authors: Younies Saeed Hassan Mahmoud, Mai Mabrouk, Elsayed Sallam


Analysing DNA microarray data sets is a great challenge for bioinformaticians because of the complexity of applying statistical and machine learning techniques. The challenge is compounded when the microarray data sets contain missing data, which happens regularly, because these techniques cannot deal with missing data. One of the most important data analysis processes on a microarray data set is feature selection. This process finds the most important genes that affect a certain disease. In this paper, we introduce a technique for imputing the missing data in microarray data sets while performing feature selection.

Keywords: DNA microarray, feature selection, missing data, bioinformatics

Procedia PDF Downloads 429
9 Association of Nuclear – Mitochondrial Epistasis with BMI in Type 1 Diabetes Mellitus Patients

Authors: Agnieszka H. Ludwig-Slomczynska, Michal T. Seweryn, Przemyslaw Kapusta, Ewelina Pitera, Katarzyna Cyganek, Urszula Mantaj, Lucja Dobrucka, Ewa Wender-Ozegowska, Maciej T. Malecki, Pawel Wolkow


Obesity results from an imbalance between energy intake and its expenditure. Genome-Wide Association Study (GWAS) analyses have led to the discovery of only about 100 variants influencing body mass index (BMI), which explain only a small portion of genetic variability. Analysis of gene epistasis gives a chance to discover another part. Since it was shown that interaction and communication between the nuclear and mitochondrial genomes are indispensable for normal cell function, we have looked for epistatic interactions between the two genomes to find their correlation with BMI. Methods: The analysis was performed on 366 T1DM patients using the Illumina Infinium OmniExpressExome-8 chip, followed by imputation on the Michigan Imputation Server. Only genes which influence mitochondrial functioning (listed in Human MitoCarta 2.0) were included in the analysis: variants of nuclear origin (MAF > 5%) in 1140 genes and 42 mitochondrial variants (MAF > 1%). Gene expression analysis was performed on GTEx data. Association analysis between genetic variants and BMI was performed with the use of Linear Mixed Models as implemented in the package 'GENESIS' in R. Analysis of association between mRNA expression and BMI was performed with the use of linear models and standard significance tests in R. Results: Among variants involved in epistasis between mitochondria and nucleus we have identified one in the mitochondrial transcription factor TFB2M (rs6701836). It interacted with mitochondrial variants localized to MT-RNR1 (p=0.0004, MAF=15%), MT-ND2 (p=0.07, MAF=5%) and MT-ND4 (p=0.01, MAF=1.1%). Analysis of the interaction between nuclear variant rs6701836 (nuc) and rs3021088 localized to the MT-ND2 mitochondrial gene (mito) has shown that the combination of the two led to a BMI decrease (p=0.024). Each of the variants on its own does not correlate with higher BMI [p(nuc)=0.856, p(mito)=0.116]. Although rs6701836 is intronic, it influences gene expression in the thyroid (p=0.000037).
rs3021088 is a missense variant that leads to an alanine to threonine substitution in the MT-ND2 gene, which belongs to complex I of the electron transport chain. The analysis of the influence of genetic variants on gene expression has confirmed the trend explained above: the interaction of the two genes leads to a BMI decrease (p=0.0308). Each of the mRNAs on its own is associated with higher BMI (p(mito)=0.0244 and p(nuc)=0.0269). Conclusions: Our results show that nuclear-mitochondrial epistasis can influence BMI in T1DM patients. The correlation between transcription factor expression and mitochondrial genetic variants will be subject to further analysis.

Keywords: body mass index, epistasis, mitochondria, type 1 diabetes

Procedia PDF Downloads 98
8 Modern Imputation Technique for Missing Data in Linear Functional Relationship Model

Authors: Adilah Abdul Ghapor, Yong Zulina Zubairi, Rahmatullah Imon


The missing value problem is common in statistics and has been of interest for years. This article considers two modern techniques for handling missing data in the linear functional relationship model (LFRM), namely the Expectation-Maximization (EM) algorithm and the Expectation-Maximization with Bootstrapping (EMB) algorithm, using three performance indicators: the mean absolute error (MAE), root mean square error (RMSE) and estimated bias (EB). In this study, we applied the methods of imputing missing values in the LFRM. Results of the simulation study suggest that the EMB algorithm performs much better than the EM algorithm in both models. We also illustrate the applicability of the approach in a real data set.
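
Two of the performance indicators named above, MAE and RMSE, can be computed as in the sketch below. It assumes the indicators are taken over the imputed values against the true values that were deliberately removed in a simulation, which the abstract does not spell out. The function name `imputation_errors` and the example numbers are hypothetical, and the estimated bias (EB) is left out because its exact definition is not given in the abstract.

```python
import math

def imputation_errors(true_values, imputed_values):
    """Return (MAE, RMSE) of imputed values against the withheld true values."""
    errors = [imp - true for true, imp in zip(true_values, imputed_values)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n              # mean absolute error
    rmse = math.sqrt(sum(e * e for e in errors) / n)   # root mean square error
    return mae, rmse

# Hypothetical withheld values and their imputations.
mae, rmse = imputation_errors([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

RMSE penalizes large individual errors more heavily than MAE, so reporting both helps show whether an imputation method fails occasionally or is uniformly imprecise.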

Keywords: expectation-maximization, expectation-maximization with bootstrapping, linear functional relationship model, performance indicators

Procedia PDF Downloads 309
7 The Channels through Which Energy Tax Can Affect Economic Growth: Panel Data Analysis

Authors: Mahmoud Hassan, Walid Oueslati, Damien Rousseliere


This paper explores the channels through which energy taxes may affect economic growth, using a simultaneous equations model for a balanced panel of 31 OECD countries over the 1994–2013 period. The empirical results reveal a negative impact of energy taxes on physical investment in the short and long term. This impact is negatively sensitive to the existence and level of public debt. Additionally, the results show that energy taxes have an indirect effect on human capital through their impact on polluting emissions. The taxes on energy products are able to reduce both the flux and the stock of polluting emissions that have a negative impact on human capital skills in the short and long term. Finally, we found that energy taxes could encourage eco-innovation in the short and long term.

Keywords: energy taxes, economic growth, public debt, simultaneous equations model, multiple imputation

Procedia PDF Downloads 112
6 Analyzing the Performance of Machine Learning Models to Predict Alzheimer's Disease and its Stages Addressing Missing Value Problem

Authors: Carlos Theran, Yohn Parra Bautista, Victor Adankai, Richard Alo, Jimwi Liu, Clement G. Yedjou


Alzheimer's disease (AD) is a neurodegenerative disorder primarily characterized by deteriorating cognitive functions. AD has gained considerable attention in the last decade. An estimated 24 million people worldwide suffered from this disease by 2011. In 2016, an estimated 40 million were diagnosed with AD, and by 2050 the number of people affected is expected to reach 131 million. Therefore, detecting and confirming AD at its different stages is a priority for medical practices to provide adequate and accurate treatments. Recently, Machine Learning (ML) models have been used to study AD's stages while handling missing values in a multiclass setting, focusing on the delineation of Early Mild Cognitive Impairment (EMCI), Late Mild Cognitive Impairment (LMCI), and cognitively normal (CN) subjects. However, to the best of our knowledge, robust performance information for these models and the missing data analysis has not been presented in the literature. In this paper, we propose studying the performance of five different machine learning models for multiclass prediction of AD's stages in terms of accuracy, precision, and F1-score. An analysis of three imputation methods to handle the missing value problem is also presented. A framework that integrates an ML model for multiclass prediction of AD's stages is proposed, achieving an average accuracy of 84%.

Keywords: Alzheimer's disease, missing value, machine learning, performance evaluation

Procedia PDF Downloads 51
5 Higher Freshwater Fish and Sea Fish Intake Is Inversely Associated with Liver Cancer in Patients with Hepatitis B

Authors: Maomao Cao


Background and aims: Although the association between higher consumption of fish and lower liver cancer risk has been confirmed, the association between specific fish intake and liver cancer risk remains unknown. We aimed to identify the association between specific fish consumption and the risk of liver cancer. Methods: Based on a community-based seropositive hepatitis B cohort involving 18404 individuals, face-to-face interviews were conducted using a standardized questionnaire to acquire baseline information. Three common fish types were analyzed in this study: freshwater fish, sea fish, and small fish (shrimp, crab, conch, and shell). All participants received liver cancer screening, and possible cases were identified by CT or MRI. Multivariable logistic models were applied to estimate odds ratios (OR) and 95% confidence intervals (CI). Multivariate multiple imputation was used to impute observations with missing values. Results: 179 liver cancer cases were identified. Consumption of freshwater fish and sea fish at least once a week had a strong inverse association with liver cancer risk compared with the lowest intake level, with adjusted ORs of 0.53 (95% CI, 0.38-0.75) and 0.38 (95% CI, 0.19-0.73), respectively. This inverse association was also observed after the imputation. There was no statistically significant association between intake of small fish and liver cancer risk (OR=0.58; 95% CI, 0.32-1.08). Conclusions: Our findings suggest that consumption of freshwater fish and sea fish at least once a week could reduce liver cancer risk.
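
After multiple imputation, the estimates from each imputed dataset have to be combined into one result. The abstract does not say how this was done; the sketch below assumes the standard Rubin's rules, applied on the log odds ratio scale, and uses a normal approximation for the interval instead of Rubin's t-based degrees of freedom. The function name `pool_rubin` and all input numbers are hypothetical and are not taken from the study.

```python
import math

def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their variances with Rubin's rules."""
    m = len(estimates)
    q_bar = sum(estimates) / m                                   # pooled estimate
    within = sum(variances) / m                                  # within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1) # between-imputation variance
    total = within + (1 + 1 / m) * between                       # total variance
    return q_bar, total

# Hypothetical log odds ratios and variances from three imputed datasets.
log_or = [math.log(0.50), math.log(0.53), math.log(0.56)]
var = [0.030, 0.031, 0.029]
est, tot = pool_rubin(log_or, var)
se = math.sqrt(tot)
pooled_or = math.exp(est)
# Normal-approximation 95% interval, back-transformed to the OR scale.
ci = (math.exp(est - 1.96 * se), math.exp(est + 1.96 * se))
```

Pooling on the log scale and exponentiating afterwards keeps the interval asymmetric around the odds ratio, as it is in the reported results.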

Keywords: cross-sectional study, fish intake, liver cancer, risk factor

Procedia PDF Downloads 132
4 Hormone Replacement Therapy (HRT) and Its Impact on the All-Cause Mortality of UK Women: A Matched Cohort Study 1984-2017

Authors: Nurunnahar Akter, Elena Kulinskaya, Nicholas Steel, Ilyas Bakbergenuly


Although Hormone Replacement Therapy (HRT) is an effective treatment for ameliorating menopausal symptoms, it has mixed effects on different health outcomes, increasing, for instance, the risk of breast cancer. Because of this, many symptomatic women are left untreated. Untreated menopausal symptoms may result in other health issues, which eventually place an extra burden and cost on the health care system. All-cause mortality analysis may explain the net benefits and risks of HRT. However, it has received far less attention in HRT studies. This study investigated the impact of HRT on all-cause mortality using electronically recorded primary care data from The Health Improvement Network (THIN), which broadly represents the female population in the United Kingdom (UK). The study entry date was the date of the first HRT prescription from 1984 onwards, and patients were followed up until death, transfer to another GP practice, or the study end date, which was January 2017. 112,354 HRT users (cases) were matched with 245,320 non-users by age at HRT initiation and general practice (GP). The hazards of all-cause mortality associated with HRT were estimated by a parametric Weibull-Cox model adjusting for a wide range of important medical, lifestyle, and socio-demographic factors. Multilevel multiple imputation techniques were used to deal with missing data. This study found that during 32 years of follow-up, combined HRT reduced the hazard of all-cause mortality by 9% (hazard ratio (HR): 0.91; 95% Confidence Interval, 0.88-0.94) in women aged 46 to 65 at first treatment compared to non-users of the same age. Age-specific mortality analyses found that combined HRT decreased mortality by 13% (HR: 0.87; 95% CI, 0.82-0.92), 12% (HR: 0.88; 95% CI, 0.82-0.93), and 8% (HR: 0.92; 95% CI, 0.85-0.98) in the 51 to 55, 56 to 60, and 61 to 65 age groups at first treatment, respectively.
There was no association between estrogen-only HRT and women’s all-cause mortality. The findings from this study may help to inform the choices of women at menopause and to further educate clinicians and resource planners.

Keywords: hormone replacement therapy, multiple imputations, primary care data, the health improvement network (THIN)

Procedia PDF Downloads 97
3 Long Term Survival after a First Transient Ischemic Attack in England: A Case-Control Study

Authors: Padma Chutoo, Elena Kulinskaya, Ilyas Bakbergenuly, Nicholas Steel, Dmitri Pchejetski


Transient ischaemic attacks (TIAs) are warning signs for future strokes. TIA patients are at increased risk of stroke and cardiovascular events after a first episode. The majority of studies on TIA have focused on the occurrence of these ancillary events after a TIA. Long-term mortality after TIA has received only limited attention. We undertook this study to determine the long-term hazards of all-cause mortality following a first episode of a TIA using anonymised electronic health records (EHRs). We conducted a retrospective case-control study using electronic primary health care records from The Health Improvement Network (THIN) database. Patients born in or before 1960, resident in England, with a first diagnosis of TIA between January 1986 and January 2017 were matched to three controls on age, sex and general medical practice. The primary outcome was all-cause mortality. The hazards of all-cause mortality were estimated using a time-varying Weibull-Cox survival model which included both scale and shape effects and a random frailty effect of GP practice. 20,633 cases and 58,634 controls were included. Cases aged 39 to 60 years at the first TIA event had the highest hazard ratio (HR) of mortality compared to matched controls (HR = 3.04, 95% CI (2.91 - 3.18)). The HRs for cases aged 61-70 years, 71-76 years and 77+ years were 1.98 (1.55 - 2.30), 1.79 (1.20 - 2.07) and 1.52 (1.15 - 1.97) compared to matched controls. Aspirin provided long-term survival benefits to cases. Cases aged 39-60 years on aspirin had HRs of 0.93 (0.84 - 1.00), 0.90 (0.82 - 0.98) and 0.88 (0.80 - 0.96) at 5 years, 10 years and 15 years, respectively, compared to cases in the same age group who were not on antiplatelets. Similar beneficial effects of aspirin were observed in other age groups. There were no significant survival benefits with other antiplatelet options. No survival benefits of antiplatelet drugs were observed in controls.
Our study highlights the excess long-term risk of death of TIA patients and cautions that TIA should not be treated as a benign condition. The study further recommends aspirin as the better option for secondary prevention for TIA patients compared to clopidogrel, which is recommended by NICE guidelines. Managing risk factors and treatment strategies remains an important challenge in reducing the burden of disease.

Keywords: dual antiplatelet therapy (DAPT), general practice, multiple imputation, The Health Improvement Network (THIN), hazard ratio (HR), Weibull-Cox model

Procedia PDF Downloads 66
2 Survival Analysis after a First Ischaemic Stroke Event: A Case-Control Study in the Adult Population of England

Authors: Padma Chutoo, Elena Kulinskaya, Ilyas Bakbergenuly, Nicholas Steel, Dmitri Pchejetski


Stroke is associated with a significant risk of morbidity and mortality. There is a scarcity of research on long-term survival after first-ever ischaemic stroke (IS) events in England with regard to the effects of different medical therapies and comorbidities. The objective of this study was to model all-cause mortality after an IS diagnosis in the adult population of England. Using a retrospective case-control design, we extracted the electronic medical records of patients born in or before 1960 in England with a first-ever ischaemic stroke diagnosis from January 1986 to January 2017 within The Health Improvement Network (THIN) database. Participants with a history of ischaemic stroke were matched to 3 controls by sex, age at diagnosis and general practice. The primary outcome was all-cause mortality. The hazards of all-cause mortality were estimated using a Weibull-Cox survival model which included both scale and shape effects and a shared random effect of general practice. The model included sex, birth cohort, socio-economic status, comorbidities and medical therapies. 20,250 patients with a history of IS (cases) and 55,519 controls were followed up for up to 30 years. From 2008 to 2015, the one-year all-cause mortality for IS patients declined with an absolute change of -0.5%. Preventive treatments for cases increased considerably over time. These included prescriptions of statins and antihypertensives. However, prescriptions for antiplatelet drugs have decreased in routine general practice since 2010. The survival model revealed a survival benefit of antiplatelet treatment for stroke survivors, with a hazard ratio (HR) of 0.92 (0.90 – 0.94). IS diagnosis had significant interactions with gender, age at entry and hypertension diagnosis. IS diagnosis was associated with a high risk of all-cause mortality, with HR = 3.39 (3.05-3.72) for cases compared to controls.
Hypertension was associated with poor survival, with HR = 4.79 (4.49 - 5.09) for hypertensive cases relative to non-hypertensive controls, though the detrimental effect of hypertension did not reach significance for hypertensive controls, HR = 1.19 (0.82-1.56). This study of English primary care data showed that between 2008 and 2015, the rates of prescriptions of stroke preventive treatments increased, and short-term all-cause mortality after IS declined. However, stroke resulted in poor long-term survival. Hypertension, a modifiable risk factor, was found to be associated with poor survival outcomes in IS patients. Antiplatelet drugs were found to be protective for survival. Better efforts are required to reduce the burden of stroke through health service development and primary prevention.

Keywords: general practice, hazard ratio, health improvement network (THIN), ischaemic stroke, multiple imputation, Weibull-Cox model

Procedia PDF Downloads 67
1 Influence of Atmospheric Pollutants on Child Respiratory Disease in Cartagena De Indias, Colombia

Authors: Jose A. Alvarez Aldegunde, Adrian Fernandez Sanchez, Matthew D. Menden, Bernardo Vila Rodriguez


Up to five statistical pre-processing steps have been carried out on the pollutant records of the stations present in Cartagena de Indias, Colombia, also taking into account the childhood asthma incidence surveys conducted in hospitals in the city by the Health Ministry of Colombia for this study. These pre-processing steps have consisted of different techniques such as the determination of the quality of data collection, determination of the quality of the registration network, identification and debugging of errors in data collection, completion of missing data and purified data, as well as the improvement of the time scale of records. The characterization of the quality of the data has been conducted by means of density analysis of the pollutant registration stations using ArcGIS software and through mass balance techniques, making it possible to determine inconsistencies in the records by relating the registration data between stations through linear regression. The results obtained in this process have highlighted the positive quality of the pollutant registration process. Consequently, debugging of errors has allowed us to identify certain data as statistically non-significant in the incidence and contamination series. These data, together with certain missing records in the series recorded by the measuring stations, have been completed by statistical imputation equations. Following the application of these prior processes, the basic series of incidence data for respiratory disease and pollutant records have allowed the characterization of the influence of pollutants on respiratory diseases such as, for example, childhood asthma. This characterization has been carried out using statistical correlation methods, including visual correlation, simple linear regression correlation and spectral analysis with PAST software, which identifies maximum and minimum periodicity cycles using the Lomb periodogram.
In relation to part of the results obtained, up to eleven maximums and minimums considered contemporary between the incidence records and the particles have been identified through visual comparison. The spectral analyses that have been performed on the incidence and the PM2.5 have returned a series of similar maximum periods in both registers, which are at a maximum during a period of one year and another every 25 days (0.9 and 0.07 years). The bivariate analysis has ranked the variable "Daily Vehicular Flow" ninth in importance out of a total of 55 variables. However, the statistical correlation has not obtained a favorable result, having obtained a low value of the R2 coefficient. The series of analyses conducted has demonstrated the importance of the influence of pollutants such as PM2.5 in the development of childhood asthma in Cartagena. The quantification of the influence of the variables has been able to determine that there is a 56% probability of dependence between PM2.5 and childhood respiratory asthma in Cartagena. Considering this justification, the study could be completed through the application of the BenMap software, yielding a series of spatial results of interpolated values of the pollutant contamination records that exceeded the established legal limits (represented by homogeneous units up to the neighborhood level) and results of the impact on the exacerbation of pediatric asthma. As a final result, an economic estimate (in Colombian Pesos) of the monthly and individual savings derived from the percentage reduction of the influence of pollutants in relation to visits to the Hospital Emergency Room due to asthma exacerbation in pediatric patients has been obtained.
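
The spectral step described above was run in PAST; as a rough illustration of what a Lomb periodogram does on unevenly sampled records, the following sketch implements the classical Lomb-Scargle formula directly. It is a stand-in under stated assumptions, not the study's workflow: the function name `lomb_scargle`, the frequency grid, and the synthetic series with a one-year cycle are all hypothetical, and no data from the study are used.

```python
import numpy as np

def lomb_scargle(t, y, freqs):
    """Classical Lomb-Scargle power at ordinary frequencies (cycles per unit of t)."""
    y = y - y.mean()                         # remove the mean before fitting sinusoids
    power = np.empty(len(freqs))
    for k, f in enumerate(freqs):
        w = 2.0 * np.pi * f                  # angular frequency
        # Time offset tau that makes the sine and cosine terms orthogonal.
        tau = np.arctan2(np.sin(2 * w * t).sum(), np.cos(2 * w * t).sum()) / (2 * w)
        c = np.cos(w * (t - tau))
        s = np.sin(w * (t - tau))
        power[k] = 0.5 * ((y @ c) ** 2 / (c @ c) + (y @ s) ** 2 / (s @ s))
    return power

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 5.0, 200))      # irregular sampling times, in years
y = np.sin(2 * np.pi * t)                    # synthetic series with a one-year cycle
freqs = np.linspace(0.1, 5.0, 491)           # cycles per year
power = lomb_scargle(t, y, freqs)
best_period = 1.0 / freqs[np.argmax(power)]  # period (years) of the strongest peak
```

Because the method works directly on irregular sampling times, gaps in a monitoring series do not have to be filled before the dominant periods are located.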

Keywords: Asthma Incidence, BenMap, PM2.5, Statistical Analysis

Procedia PDF Downloads 45