Search results for: imputation
36 Two-Phase Sampling for Estimating a Finite Population Total in Presence of Missing Values
Authors: Daniel Fundi Murithi
Abstract:
Missing data is a real bane in many surveys. To overcome the problems caused by missing data, partial deletion, and single imputation methods, among others, have been proposed. However, problems such as discarding usable data and inaccuracy in reproducing known population parameters and standard errors are associated with them. For regression and stochastic imputation, it is assumed that there is a variable with complete cases to be used as a predictor in estimating missing values in the other variable, and the relationship between the two variables is linear, which might not be realistic in practice. In this project, we estimate population total in presence of missing values in two-phase sampling. Instead of regression or stochastic models, non-parametric model based regression model is used in imputing missing values. Empirical study showed that nonparametric model-based regression imputation is better in reproducing variance of population total estimate obtained when there were no missing values compared to mean, median, regression, and stochastic imputation methods. Although regression and stochastic imputation were better than nonparametric model-based imputation in reproducing population total estimates obtained when there were no missing values in one of the sample sizes considered, nonparametric model-based imputation may be used when the relationship between outcome and predictor variables is not linear.Keywords: finite population total, missing data, model-based imputation, two-phase sampling
Procedia PDF Downloads 13035 Effect of Genuine Missing Data Imputation on Prediction of Urinary Incontinence
Authors: Suzan Arslanturk, Mohammad-Reza Siadat, Theophilus Ogunyemi, Ananias Diokno
Abstract:
Missing data is a common challenge in statistical analyses of most clinical survey datasets. A variety of methods have been developed to enable analysis of survey data to deal with missing values. Imputation is the most commonly used among the above methods. However, in order to minimize the bias introduced due to imputation, one must choose the right imputation technique and apply it to the correct type of missing data. In this paper, we have identified different types of missing values: missing data due to skip pattern (SPMD), undetermined missing data (UMD), and genuine missing data (GMD) and applied rough set imputation on only the GMD portion of the missing data. We have used rough set imputation to evaluate the effect of such imputation on prediction by generating several simulation datasets based on an existing epidemiological dataset (MESA). To measure how well each dataset lends itself to the prediction model (logistic regression), we have used p-values from the Wald test. To evaluate the accuracy of the prediction, we have considered the width of 95% confidence interval for the probability of incontinence. Both imputed and non-imputed simulation datasets were fit to the prediction model, and they both turned out to be significant (p-value < 0.05). However, the Wald score shows a better fit for the imputed compared to non-imputed datasets (28.7 vs. 23.4). The average confidence interval width was decreased by 10.4% when the imputed dataset was used, meaning higher precision. The results show that using the rough set method for missing data imputation on GMD data improve the predictive capability of the logistic regression. Further studies are required to generalize this conclusion to other clinical survey datasets.Keywords: rough set, imputation, clinical survey data simulation, genuine missing data, predictive index
Procedia PDF Downloads 16834 A Neural Network Based Clustering Approach for Imputing Multivariate Values in Big Data
Authors: S. Nickolas, Shobha K.
Abstract:
The treatment of incomplete data is an important step in the data pre-processing. Missing values creates a noisy environment in all applications and it is an unavoidable problem in big data management and analysis. Numerous techniques likes discarding rows with missing values, mean imputation, expectation maximization, neural networks with evolutionary algorithms or optimized techniques and hot deck imputation have been introduced by researchers for handling missing data. Among these, imputation techniques plays a positive role in filling missing values when it is necessary to use all records in the data and not to discard records with missing values. In this paper we propose a novel artificial neural network based clustering algorithm, Adaptive Resonance Theory-2(ART2) for imputation of missing values in mixed attribute data sets. The process of ART2 can recognize learned models fast and be adapted to new objects rapidly. It carries out model-based clustering by using competitive learning and self-steady mechanism in dynamic environment without supervision. The proposed approach not only imputes the missing values but also provides information about handling the outliers.Keywords: ART2, data imputation, clustering, missing data, neural network, pre-processing
Procedia PDF Downloads 27433 Imputing Missing Data in Electronic Health Records: A Comparison of Linear and Non-Linear Imputation Models
Authors: Alireza Vafaei Sadr, Vida Abedi, Jiang Li, Ramin Zand
Abstract:
Missing data is a common challenge in medical research and can lead to biased or incomplete results. When the data bias leaks into models, it further exacerbates health disparities; biased algorithms can lead to misclassification and reduced resource allocation and monitoring as part of prevention strategies for certain minorities and vulnerable segments of patient populations, which in turn further reduce data footprint from the same population – thus, a vicious cycle. This study compares the performance of six imputation techniques grouped into Linear and Non-Linear models on two different realworld electronic health records (EHRs) datasets, representing 17864 patient records. The mean absolute percentage error (MAPE) and root mean squared error (RMSE) are used as performance metrics, and the results show that the Linear models outperformed the Non-Linear models in terms of both metrics. These results suggest that sometimes Linear models might be an optimal choice for imputation in laboratory variables in terms of imputation efficiency and uncertainty of predicted values.Keywords: EHR, machine learning, imputation, laboratory variables, algorithmic bias
Procedia PDF Downloads 8532 A Large Dataset Imputation Approach Applied to Country Conflict Prediction Data
Authors: Benjamin Leiby, Darryl Ahner
Abstract:
This study demonstrates an alternative stochastic imputation approach for large datasets when preferred commercial packages struggle to iterate due to numerical problems. A large country conflict dataset motivates the search to impute missing values well over a common threshold of 20% missingness. The methodology capitalizes on correlation while using model residuals to provide the uncertainty in estimating unknown values. Examination of the methodology provides insight toward choosing linear or nonlinear modeling terms. Static tolerances common in most packages are replaced with tailorable tolerances that exploit residuals to fit each data element. The methodology evaluation includes observing computation time, model fit, and the comparison of known values to replaced values created through imputation. Overall, the country conflict dataset illustrates promise with modeling first-order interactions while presenting a need for further refinement that mimics predictive mean matching.Keywords: correlation, country conflict, imputation, stochastic regression
Procedia PDF Downloads 12031 Energy Complementary in Colombia: Imputation of Dataset
Authors: Felipe Villegas-Velasquez, Harold Pantoja-Villota, Sergio Holguin-Cardona, Alejandro Osorio-Botero, Brayan Candamil-Arango
Abstract:
Colombian electricity comes mainly from hydric resources, affected by environmental variations such as the El Niño phenomenon. That is why incorporating other types of resources is necessary to provide electricity constantly. This research seeks to fill the wind speed and global solar irradiance dataset for two years with the highest amount of information. A further result is the characterization of the data by region that led to infer which errors occurred and offered the incomplete dataset.Keywords: energy, wind speed, global solar irradiance, Colombia, imputation
Procedia PDF Downloads 14630 Overview of Adaptive Spline interpolation
Authors: Rongli Gai, Zhiyuan Chang
Abstract:
At this stage, in view of various situations in the interpolation process, most researchers use self-adaptation to adjust the interpolation process, which is also one of the current and future research hotspots in the field of CNC machining. In the interpolation process, according to the overview of the spline curve interpolation algorithm, the adaptive analysis is carried out from the factors affecting the interpolation process. The adaptive operation is reflected in various aspects, such as speed, parameters, errors, nodes, feed rates, random Period, sensitive point, step size, curvature, adaptive segmentation, adaptive optimization, etc. This paper will analyze and summarize the research of adaptive imputation in the direction of the above factors affecting imputation.Keywords: adaptive algorithm, CNC machining, interpolation constraints, spline curve interpolation
Procedia PDF Downloads 20529 Outlier Detection in Stock Market Data using Tukey Method and Wavelet Transform
Authors: Sadam Alwadi
Abstract:
Outlier values become a problem that frequently occurs in the data observation or recording process. Thus, the need for data imputation has become an essential matter. In this work, it will make use of the methods described in the prior work to detect the outlier values based on a collection of stock market data. In order to implement the detection and find some solutions that maybe helpful for investors, real closed price data were obtained from the Amman Stock Exchange (ASE). Tukey and Maximum Overlapping Discrete Wavelet Transform (MODWT) methods will be used to impute the detect the outlier values.Keywords: outlier values, imputation, stock market data, detecting, estimation
Procedia PDF Downloads 8128 Single Imputation for Audiograms
Authors: Sarah Beaver, Renee Bryce
Abstract:
Audiograms detect hearing impairment, but missing values pose problems. This work explores imputations in an attempt to improve accuracy. This work implements Linear Regression, Lasso, Linear Support Vector Regression, Bayesian Ridge, K Nearest Neighbors (KNN), and Random Forest machine learning techniques to impute audiogram frequencies ranging from 125Hz to 8000Hz. The data contains patients who had or were candidates for cochlear implants. Accuracy is compared across two different Nested Cross-Validation k values. Over 4000 audiograms were used from 800 unique patients. Additionally, training on data combines and compares left and right ear audiograms versus single ear side audiograms. The accuracy achieved using Root Mean Square Error (RMSE) values for the best models for Random Forest ranges from 4.74 to 6.37. The R\textsuperscript{2} values for the best models for Random Forest ranges from .91 to .96. The accuracy achieved using RMSE values for the best models for KNN ranges from 5.00 to 7.72. The R\textsuperscript{2} values for the best models for KNN ranges from .89 to .95. The best imputation models received R\textsuperscript{2} between .89 to .96 and RMSE values less than 8dB. We also show that the accuracy of classification predictive models performed better with our best imputation models versus constant imputations by a two percent increase.Keywords: machine learning, audiograms, data imputations, single imputations
Procedia PDF Downloads 8227 Bias-Corrected Estimation Methods for Receiver Operating Characteristic Surface
Authors: Khanh To Duc, Monica Chiogna, Gianfranco Adimari
Abstract:
With three diagnostic categories, assessment of the performance of diagnostic tests is achieved by the analysis of the receiver operating characteristic (ROC) surface, which generalizes the ROC curve for binary diagnostic outcomes. The volume under the ROC surface (VUS) is a summary index usually employed for measuring the overall diagnostic accuracy. When the true disease status can be exactly assessed by means of a gold standard (GS) test, unbiased nonparametric estimators of the ROC surface and VUS are easily obtained. In practice, unfortunately, disease status verification via the GS test could be unavailable for all study subjects, due to the expensiveness or invasiveness of the GS test. Thus, often only a subset of patients undergoes disease verification. Statistical evaluations of diagnostic accuracy based only on data from subjects with verified disease status are typically biased. This bias is known as verification bias. Here, we consider the problem of correcting for verification bias when continuous diagnostic tests for three-class disease status are considered. We assume that selection for disease verification does not depend on disease status, given test results and other observed covariates, i.e., we assume that the true disease status, when missing, is missing at random. Under this assumption, we discuss several solutions for ROC surface analysis based on imputation and re-weighting methods. In particular, verification bias-corrected estimators of the ROC surface and of VUS are proposed, namely, full imputation, mean score imputation, inverse probability weighting and semiparametric efficient estimators. Consistency and asymptotic normality of the proposed estimators are established, and their finite sample behavior is investigated by means of Monte Carlo simulation studies. Two illustrations using real datasets are also given.Keywords: imputation, missing at random, inverse probability weighting, ROC surface analysis
Procedia PDF Downloads 41626 dynr.mi: An R Program for Multiple Imputation in Dynamic Modeling
Authors: Yanling Li, Linying Ji, Zita Oravecz, Timothy R. Brick, Michael D. Hunter, Sy-Miin Chow
Abstract:
Assessing several individuals intensively over time yields intensive longitudinal data (ILD). Even though ILD provide rich information, they also bring other data analytic challenges. One of these is the increased occurrence of missingness with increased study length, possibly under non-ignorable missingness scenarios. Multiple imputation (MI) handles missing data by creating several imputed data sets, and pooling the estimation results across imputed data sets to yield final estimates for inferential purposes. In this article, we introduce dynr.mi(), a function in the R package, Dynamic Modeling in R (dynr). The package dynr provides a suite of fast and accessible functions for estimating and visualizing the results from fitting linear and nonlinear dynamic systems models in discrete as well as continuous time. By integrating the estimation functions in dynr and the MI procedures available from the R package, Multivariate Imputation by Chained Equations (MICE), the dynr.mi() routine is designed to handle possibly non-ignorable missingness in the dependent variables and/or covariates in a user-specified dynamic systems model via MI, with convergence diagnostic check. We utilized dynr.mi() to examine, in the context of a vector autoregressive model, the relationships among individuals’ ambulatory physiological measures, and self-report affect valence and arousal. The results from MI were compared to those from listwise deletion of entries with missingness in the covariates. When we determined the number of iterations based on the convergence diagnostics available from dynr.mi(), differences in the statistical significance of the covariate parameters were observed between the listwise deletion and MI approaches. These results underscore the importance of considering diagnostic information in the implementation of MI procedures.Keywords: dynamic modeling, missing data, mobility, multiple imputation
Procedia PDF Downloads 16425 Self-Organizing Maps for Exploration of Partially Observed Data and Imputation of Missing Values in the Context of the Manufacture of Aircraft Engines
Authors: Sara Rejeb, Catherine Duveau, Tabea Rebafka
Abstract:
To monitor the production process of turbofan aircraft engines, multiple measurements of various geometrical parameters are systematically recorded on manufactured parts. Engine parts are subject to extremely high standards as they can impact the performance of the engine. Therefore, it is essential to analyze these databases to better understand the influence of the different parameters on the engine's performance. Self-organizing maps are unsupervised neural networks which achieve two tasks simultaneously: they visualize high-dimensional data by projection onto a 2-dimensional map and provide clustering of the data. This technique has become very popular for data exploration since it provides easily interpretable results and a meaningful global view of the data. As such, self-organizing maps are usually applied to aircraft engine condition monitoring. As databases in this field are huge and complex, they naturally contain multiple missing entries for various reasons. The classical Kohonen algorithm to compute self-organizing maps is conceived for complete data only. A naive approach to deal with partially observed data consists in deleting items or variables with missing entries. However, this requires a sufficient number of complete individuals to be fairly representative of the population; otherwise, deletion leads to a considerable loss of information. Moreover, deletion can also induce bias in the analysis results. Alternatively, one can first apply a common imputation method to create a complete dataset and then apply the Kohonen algorithm. However, the choice of the imputation method may have a strong impact on the resulting self-organizing map. Our approach is to address simultaneously the two problems of computing a self-organizing map and imputing missing values, as these tasks are not independent. In this work, we propose an extension of self-organizing maps for partially observed data, referred to as missSOM. First, we introduce a criterion to be optimized, that aims at defining simultaneously the best self-organizing map and the best imputations for the missing entries. As such, missSOM is also an imputation method for missing values. To minimize the criterion, we propose an iterative algorithm that alternates the learning of a self-organizing map and the imputation of missing values. Moreover, we develop an accelerated version of the algorithm by entwining the iterations of the Kohonen algorithm with the updates of the imputed values. This method is efficiently implemented in R and will soon be released on CRAN. Compared to the standard Kohonen algorithm, it does not come with any additional cost in terms of computing time. Numerical experiments illustrate that missSOM performs well in terms of both clustering and imputation compared to the state of the art. In particular, it turns out that missSOM is robust to the missingness mechanism, which is in contrast to many imputation methods that are appropriate for only a single mechanism. This is an important property of missSOM as, in practice, the missingness mechanism is often unknown. An application to measurements on one type of part is also provided and shows the practical interest of missSOM.Keywords: imputation method of missing data, partially observed data, robustness to missingness mechanism, self-organizing maps
Procedia PDF Downloads 15124 A Review of Methods for Handling Missing Data in the Formof Dropouts in Longitudinal Clinical Trials
Abstract:
Much clinical trials data-based research are characterized by the unavoidable problem of dropout as a result of missing or erroneous values. This paper aims to review some of the various techniques to address the dropout problems in longitudinal clinical trials. The fundamental concepts of the patterns and mechanisms of dropout are discussed. This study presents five general techniques for handling dropout: (1) Deletion methods; (2) Imputation-based methods; (3) Data augmentation methods; (4) Likelihood-based methods; and (5) MNAR-based methods. Under each technique, several methods that are commonly used to deal with dropout are presented, including a review of the existing literature in which we examine the effectiveness of these methods in the analysis of incomplete data. Two application examples are presented to study the potential strengths or weaknesses of some of the methods under certain dropout mechanisms as well as to assess the sensitivity of the modelling assumptions.Keywords: incomplete longitudinal clinical trials, missing at random (MAR), imputation, weighting methods, sensitivity analysis
Procedia PDF Downloads 41523 Imputation of Urban Movement Patterns Using Big Data
Authors: Eusebio Odiari, Mark Birkin, Susan Grant-Muller, Nicolas Malleson
Abstract:
Big data typically refers to consumer datasets revealing some detailed heterogeneity in human behavior, which if harnessed appropriately, could potentially revolutionize our understanding of the collective phenomena of the physical world. Inadvertent missing values skew these datasets and compromise the validity of the thesis. Here we discuss a conceptually consistent strategy for identifying other relevant datasets to combine with available big data, to plug the gaps and to create a rich requisite comprehensive dataset for subsequent analysis. Specifically, emphasis is on how these methodologies can for the first time enable the construction of more detailed pictures of passenger demand and drivers of mobility on the railways. These methodologies can predict the influence of changes within the network (like a change in time-table or impact of a new station), explain local phenomena outside the network (like rail-heading) and the other impacts of urban morphology. Our analysis also reveals that our new imputation data model provides for more equitable revenue sharing amongst network operators who manage different parts of the integrated UK railways.Keywords: big-data, micro-simulation, mobility, ticketing-data, commuters, transport, synthetic, population
Procedia PDF Downloads 23122 Linkage Disequilibrium and Haplotype Blocks Study from Two High-Density Panels and a Combined Panel in Nelore Beef Cattle
Authors: Priscila A. Bernardes, Marcos E. Buzanskas, Luciana C. A. Regitano, Ricardo V. Ventura, Danisio P. Munari
Abstract:
Genotype imputation has been used to reduce genomic selections costs. In order to increase haplotype detection accuracy in methods that considers the linkage disequilibrium, another approach could be used, such as combined genotype data from different panels. Therefore, this study aimed to evaluate the linkage disequilibrium and haplotype blocks in two high-density panels before and after the imputation to a combined panel in Nelore beef cattle. A total of 814 animals were genotyped with the Illumina BovineHD BeadChip (IHD), wherein 93 animals (23 bulls and 70 progenies) were also genotyped with the Affymetrix Axion Genome-Wide BOS 1 Array Plate (AHD). After the quality control, 809 IHD animals (509,107 SNPs) and 93 AHD (427,875 SNPs) remained for analyses. The combined genotype panel (CP) was constructed by merging both panels after quality control, resulting in 880,336 SNPs. Imputation analysis was conducted using software FImpute v.2.2b. The reference (CP) and target (IHD) populations consisted of 23 bulls and 786 animals, respectively. The linkage disequilibrium and haplotype blocks studies were carried out for IHD, AHD, and imputed CP. Two linkage disequilibrium measures were considered; the correlation coefficient between alleles from two loci (r²) and the |D’|. Both measures were calculated using the software PLINK. The haplotypes' blocks were estimated using the software Haploview. The r² measurement presented different decay when compared to |D’|, wherein AHD and IHD had almost the same decay. For r², even with possible overestimation by the sample size for AHD (93 animals), the IHD presented higher values when compared to AHD for shorter distances, but with the increase of distance, both panels presented similar values. The r² measurement is influenced by the minor allele frequency of the pair of SNPs, which can cause the observed difference comparing the r² decay and |D’| decay. As a sum of the combinations between Illumina and Affymetrix panels, the CP presented a decay equivalent to a mean of these combinations. The estimated haplotype blocks detected for IHD, AHD, and CP were 84,529, 63,967, and 140,336, respectively. The IHD were composed by haplotype blocks with mean of 137.70 ± 219.05kb, the AHD with mean of 102.10kb ± 155.47, and the CP with mean of 107.10kb ± 169.14. The majority of the haplotype blocks of these three panels were composed by less than 10 SNPs, with only 3,882 (IHD), 193 (AHD) and 8,462 (CP) haplotype blocks composed by 10 SNPs or more. There was an increase in the number of chromosomes covered with long haplotypes when CP was used as well as an increase in haplotype coverage for short chromosomes (23-29), which can contribute for studies that explore haplotype blocks. In general, using CP could be an alternative to increase density and number of haplotype blocks, increasing the probability to obtain a marker close to a quantitative trait loci of interest.Keywords: Bos taurus indicus, decay, genotype imputation, single nucleotide polymorphism
Procedia PDF Downloads 28021 Managing Incomplete PSA Observations in Prostate Cancer Data: Key Strategies and Best Practices for Handling Loss to Follow-Up and Missing Data
Authors: Madiha Liaqat, Rehan Ahmed Khan, Shahid Kamal
Abstract:
Multiple imputation with delta adjustment is a versatile and transparent technique for addressing univariate missing data in the presence of various missing mechanisms. This approach allows for the exploration of sensitivity to the missing-at-random (MAR) assumption. In this review, we outline the delta-adjustment procedure and illustrate its application for assessing the sensitivity to deviations from the MAR assumption. By examining diverse missingness scenarios and conducting sensitivity analyses, we gain valuable insights into the implications of missing data on our analyses, enhancing the reliability of our study's conclusions. In our study, we focused on assessing logPSA, a continuous biomarker in incomplete prostate cancer data, to examine the robustness of conclusions against plausible departures from the MAR assumption. We introduced several approaches for conducting sensitivity analyses, illustrating their application within the pattern mixture model (PMM) under the delta adjustment framework. This proposed approach effectively handles missing data, particularly loss to follow-up.Keywords: loss to follow-up, incomplete response, multiple imputation, sensitivity analysis, prostate cancer
Procedia PDF Downloads 8920 Genome-Wide Association Study Identify COL2A1 as a Susceptibility Gene for the Hand Development Failure of Kashin-Beck Disease
Authors: Feng Zhang
Abstract:
Kashin-Beck disease (KBD) is a chronic osteochondropathy. The mechanism of hand growth and development failure of KBD remains elusive now. In this study, we conducted a two-stage genome-wide association study (GWAS) of palmar length-width ratio (LWR) of KBD, totally involving 493 Chinese Han KBD patients. Affymetrix Genome Wide Human SNP Array 6.0 was applied for SNP genotyping. Association analysis was conducted by PLINK software. Imputation analysis was performed by IMPUTE against the reference panel of the 1000 genome project. In the GWAS, the most significant association was observed between palmar LWR and rs2071358 of COL2A1 gene (P value = 4.68×10-8). Imputation analysis identified 3 SNPs surrounding rs2071358 with significant or suggestive association signals. Replication study observed additional significant association signals at both rs2071358 (P value = 0.017) and rs4760608 (P value = 0.002) of COL2A1 gene after Bonferroni correction. Our results suggest that COL2A1 gene was a novel susceptibility gene involved in the growth and development failure of hand of KBD.Keywords: Kashin-Beck disease, genome-wide association study, COL2A1, hand
Procedia PDF Downloads 21919 Ensemble Methods in Machine Learning: An Algorithmic Approach to Derive Distinctive Behaviors of Criminal Activity Applied to the Poaching Domain
Authors: Zachary Blanks, Solomon Sonya
Abstract:
Poaching presents a serious threat to endangered animal species, environment conservations, and human life. Additionally, some poaching activity has even been linked to supplying funds to support terrorist networks elsewhere around the world. Consequently, agencies dedicated to protecting wildlife habitats have a near intractable task of adequately patrolling an entire area (spanning several thousand kilometers) given limited resources, funds, and personnel at their disposal. Thus, agencies need predictive tools that are both high-performing and easily implementable by the user to help in learning how the significant features (e.g. animal population densities, topography, behavior patterns of the criminals within the area, etc) interact with each other in hopes of abating poaching. This research develops a classification model using machine learning algorithms to aid in forecasting future attacks that is both easy to train and performs well when compared to other models. In this research, we demonstrate how data imputation methods (specifically predictive mean matching, gradient boosting, and random forest multiple imputation) can be applied to analyze data and create significant predictions across a varied data set. Specifically, we apply these methods to improve the accuracy of adopted prediction models (Logistic Regression, Support Vector Machine, etc). Finally, we assess the performance of the model and the accuracy of our data imputation methods by learning on a real-world data set constituting four years of imputed data and testing on one year of non-imputed data. This paper provides three main contributions. First, we extend work done by the Teamcore and CREATE (Center for Risk and Economic Analysis of Terrorism Events) research group at the University of Southern California (USC) working in conjunction with the Department of Homeland Security to apply game theory and machine learning algorithms to develop more efficient ways of reducing poaching. This research introduces ensemble methods (Random Forests and Stochastic Gradient Boosting) and applies it to real-world poaching data gathered from the Ugandan rain forest park rangers. Next, we consider the effect of data imputation on both the performance of various algorithms and the general accuracy of the method itself when applied to a dependent variable where a large number of observations are missing. Third, we provide an alternate approach to predict the probability of observing poaching both by season and by month. The results from this research are very promising. We conclude that by using Stochastic Gradient Boosting to predict observations for non-commercial poaching by season, we are able to produce statistically equivalent results while being orders of magnitude faster in computation time and complexity. Additionally, when predicting potential poaching incidents by individual month vice entire seasons, boosting techniques produce a mean area under the curve increase of approximately 3% relative to previous prediction schedules by entire seasons.Keywords: ensemble methods, imputation, machine learning, random forests, statistical analysis, stochastic gradient boosting, wildlife protection
Procedia PDF Downloads 29218 Optimal Pricing Based on Real Estate Demand Data
Authors: Vanessa Kummer, Maik Meusel
Abstract:
Real estate demand estimates are typically derived from transaction data. However, in regions with excess demand, transactions are driven by supply and therefore do not indicate what people are actually looking for. To estimate the demand for housing in Switzerland, search subscriptions from all important Swiss real estate platforms are used. These data do, however, suffer from missing information—for example, many users do not specify how many rooms they would like or what price they would be willing to pay. In economic analyses, it is often the case that only complete data is used. Usually, however, the proportion of complete data is rather small which leads to most information being neglected. Also, the data might have a strong distortion if it is complete. In addition, the reason that data is missing might itself also contain information, which is however ignored with that approach. An interesting issue is, therefore, if for economic analyses such as the one at hand, there is an added value by using the whole data set with the imputed missing values compared to using the usually small percentage of complete data (baseline). Also, it is interesting to see how different algorithms affect that result. The imputation of the missing data is done using unsupervised learning. Out of the numerous unsupervised learning approaches, the most common ones, such as clustering, principal component analysis, or neural networks techniques are applied. By training the model iteratively on the imputed data and, thereby, including the information of all data into the model, the distortion of the first training set—the complete data—vanishes. In a next step, the performances of the algorithms are measured. This is done by randomly creating missing values in subsets of the data, estimating those values with the relevant algorithms and several parameter combinations, and comparing the estimates to the actual data. After having found the optimal parameter set for each algorithm, the missing values are being imputed. Using the resulting data sets, the next step is to estimate the willingness to pay for real estate. This is done by fitting price distributions for real estate properties with certain characteristics, such as the region or the number of rooms. Based on these distributions, survival functions are computed to obtain the functional relationship between characteristics and selling probabilities. Comparing the survival functions shows that estimates which are based on imputed data sets do not differ significantly from each other; however, the demand estimate that is derived from the baseline data does. This indicates that the baseline data set does not include all available information and is therefore not representative for the entire sample. Also, demand estimates derived from the whole data set are much more accurate than the baseline estimation. Thus, in order to obtain optimal results, it is important to make use of all available data, even though it involves additional procedures such as data imputation.Keywords: demand estimate, missing-data imputation, real estate, unsupervised learning
Procedia PDF Downloads 28517 Imputation of Incomplete Large-Scale Monitoring Count Data via Penalized Estimation
Authors: Mohamed Dakki, Genevieve Robin, Marie Suet, Abdeljebbar Qninba, Mohamed A. El Agbani, Asmâa Ouassou, Rhimou El Hamoumi, Hichem Azafzaf, Sami Rebah, Claudia Feltrup-Azafzaf, Nafouel Hamouda, Wed a.L. Ibrahim, Hosni H. Asran, Amr A. Elhady, Haitham Ibrahim, Khaled Etayeb, Essam Bouras, Almokhtar Saied, Ashrof Glidan, Bakar M. Habib, Mohamed S. Sayoud, Nadjiba Bendjedda, Laura Dami, Clemence Deschamps, Elie Gaget, Jean-Yves Mondain-Monval, Pierre Defos Du Rau
Abstract:
In biodiversity monitoring, large datasets are becoming more and more widely available and are increasingly used globally to estimate species trends and con- servation status. These large-scale datasets challenge existing statistical analysis methods, many of which are not adapted to their size, incompleteness and heterogeneity. The development of scalable methods to impute missing data in incomplete large-scale monitoring datasets is crucial to balance sampling in time or space and thus better inform conservation policies. We developed a new method based on penalized Poisson models to impute and analyse incomplete monitoring data in a large-scale framework. The method al- lows parameterization of (a) space and time factors, (b) the main effects of predic- tor covariates, as well as (c) space–time interactions. It also benefits from robust statistical and computational capability in large-scale settings. The method was tested extensively on both simulated and real-life waterbird data, with the findings revealing that it outperforms six existing methods in terms of missing data imputation errors. Applying the method to 16 waterbird species, we estimated their long-term trends for the first time at the entire North African scale, a region where monitoring data suffer from many gaps in space and time series. This new approach opens promising perspectives to increase the accuracy of species-abundance trend estimations. We made it freely available in the r package ‘lori’ (https://CRAN.R-project.org/package=lori) and recommend its use for large- scale count data, particularly in citizen science monitoring programmes.Keywords: biodiversity monitoring, high-dimensional statistics, incomplete count data, missing data imputation, waterbird trends in North-Africa
Procedia PDF Downloads 15616 Internal Migration and Poverty Dynamic Analysis Using a Bayesian Approach: The Tunisian Case
Authors: Amal Jmaii, Damien Rousseliere, Besma Belhadj
Abstract:
We explore the relationship between internal migration and poverty in Tunisia. We present a methodology combining potential outcomes approach with multiple imputation to highlight the effect of internal migration on poverty states. We find that probability of being poor decreases when leaving the poorest regions (the west areas) to the richer regions (greater Tunis and the east regions).Keywords: internal migration, potential outcomes approach, poverty dynamics, Tunisia
Procedia PDF Downloads 31015 Wind Speed Data Analysis in Colombia in 2013 and 2015
Authors: Harold P. Villota, Alejandro Osorio B.
Abstract:
The energy meteorology is an area for study energy complementarity and the use of renewable sources in interconnected systems. Due to diversify the energy matrix in Colombia with wind sources, is necessary to know the data bases about this one. However, the time series given by 260 automatic weather stations have empty, and no apply data, so the purpose is to fill the time series selecting two years to characterize, impute and use like base to complete the data between 2005 and 2020.Keywords: complementarity, wind speed, renewable, colombia, characteri, characterization, imputation
Procedia PDF Downloads 16414 Imputation Technique for Feature Selection in Microarray Data Set
Authors: Younies Saeed Hassan Mahmoud, Mai Mabrouk, Elsayed Sallam
Abstract:
Analysing DNA microarray data sets is a great challenge, which faces the bioinformaticians due to the complication of using statistical and machine learning techniques. The challenge will be doubled if the microarray data sets contain missing data, which happens regularly because these techniques cannot deal with missing data. One of the most important data analysis process on the microarray data set is feature selection. This process finds the most important genes that affect certain disease. In this paper, we introduce a technique for imputing the missing data in microarray data sets while performing feature selection.Keywords: DNA microarray, feature selection, missing data, bioinformatics
Procedia PDF Downloads 57413 Association of Nuclear – Mitochondrial Epistasis with BMI in Type 1 Diabetes Mellitus Patients
Authors: Agnieszka H. Ludwig-Slomczynska, Michal T. Seweryn, Przemyslaw Kapusta, Ewelina Pitera, Katarzyna Cyganek, Urszula Mantaj, Lucja Dobrucka, Ewa Wender-Ozegowska, Maciej T. Malecki, Pawel Wolkow
Abstract:
Obesity results from an imbalance between energy intake and its expenditure. Genome-Wide Association Study (GWAS) analyses have led to discovery of only about 100 variants influencing body mass index (BMI), which explain only a small portion of genetic variability. Analysis of gene epistasis gives a chance to discover another part. Since it was shown that interaction and communication between nuclear and mitochondrial genome are indispensable for normal cell function, we have looked for epistatic interactions between the two genomes to find their correlation with BMI. Methods: The analysis was performed on 366 T1DM patients using Illumina Infinium OmniExpressExome-8 chip and followed by imputation on Michigan Imputation Server. Only genes which influence mitochondrial functioning (listed in Human MitoCarta 2.0) were included in the analysis – variants of nuclear origin (MAF > 5%) in 1140 genes and 42 mitochondrial variants (MAF > 1%). Gene expression analysis was performed on GTex data. Association analysis between genetic variants and BMI was performed with the use of Linear Mixed Models as implemented in the package 'GENESIS' in R. Analysis of association between mRNA expression and BMI was performed with the use of linear models and standard significance tests in R. Results: Among variants involved in epistasis between mitochondria and nucleus we have identified one in mitochondrial transcription factor, TFB2M (rs6701836). It interacted with mitochondrial variants localized to MT-RNR1 (p=0.0004, MAF=15%), MT-ND2 (p=0.07, MAF=5%) and MT-ND4 (p=0.01, MAF=1.1%). Analysis of the interaction between nuclear variant rs6701836 (nuc) and rs3021088 localized to MT-ND2 mitochondrial gene (mito) has shown that the combination of the two led to BMI decrease (p=0.024). Each of the variants on its own does not correlate with higher BMI [p(nuc)=0.856, p(mito)=0.116)]. Although rs6701836 is intronic, it influences gene expression in the thyroid (p=0.000037). rs3021088 is a missense variant that leads to alanine to threonine substitution in the MT-ND2 gene which belongs to complex I of the electron transport chain. The analysis of the influence of genetic variants on gene expression has confirmed the trend explained above – the interaction of the two genes leads to BMI decrease (p=0.0308). Each of the mRNAs on its own is associated with higher BMI (p(mito)=0.0244 and p(nuc)=0.0269). Conclusıons: Our results show that nuclear-mitochondrial epistasis can influence BMI in T1DM patients. The correlation between transcription factor expression and mitochondrial genetic variants will be subject to further analysis.Keywords: body mass index, epistasis, mitochondria, type 1 diabetes
Procedia PDF Downloads 17412 Modern Imputation Technique for Missing Data in Linear Functional Relationship Model
Authors: Adilah Abdul Ghapor, Yong Zulina Zubairi, Rahmatullah Imon
Abstract:
Missing value problem is common in statistics and has been of interest for years. This article considers two modern techniques in handling missing data for linear functional relationship model (LFRM) namely the Expectation-Maximization (EM) algorithm and Expectation-Maximization with Bootstrapping (EMB) algorithm using three performance indicators; namely the mean absolute error (MAE), root mean square error (RMSE) and estimated biased (EB). In this study, we applied the methods of imputing missing values in the LFRM. Results of the simulation study suggest that EMB algorithm performs much better than EM algorithm in both models. We also illustrate the applicability of the approach in a real data set.Keywords: expectation-maximization, expectation-maximization with bootstrapping, linear functional relationship model, performance indicators
Procedia PDF Downloads 39911 The Channels through Which Energy Tax Can Affect Economic Growth: Panel Data Analysis
Authors: Mahmoud Hassan, Walid Oueslati, Damien Rousseliere
Abstract:
This paper explores the channels through which energy taxes may affect economic growth, using a simultaneous equations model for a balanced panel data of 31 OECD countries over the 1994–2013 period. The empirical results reveal a negative impact of energy taxes on physical investment in the short and long term. This impact is negatively sensitive to the existence and level of public debt. Additionally, the results show that energy taxes have an indirect effect on human capital through their impact on polluting emissions. The taxes on energy products are able to reduce both the flux and the stock of polluting emissions that have a negative impact on human capital skills in the short and long term. Finally, we found that energy taxes could encourage eco-innovation in the short and long term.Keywords: energy taxes, economic growth, public debt, simultaneous equations model, multiple imputation
Procedia PDF Downloads 23210 Analyzing the Performance of Machine Learning Models to Predict Alzheimer's Disease and its Stages Addressing Missing Value Problem
Authors: Carlos Theran, Yohn Parra Bautista, Victor Adankai, Richard Alo, Jimwi Liu, Clement G. Yedjou
Abstract:
Alzheimer's disease (AD) is a neurodegenerative disorder primarily characterized by deteriorating cognitive functions. AD has gained relevant attention in the last decade. An estimated 24 million people worldwide suffered from this disease by 2011. In 2016 an estimated 40 million were diagnosed with AD, and for 2050 is expected to reach 131 million people affected by AD. Therefore, detecting and confirming AD at its different stages is a priority for medical practices to provide adequate and accurate treatments. Recently, Machine Learning (ML) models have been used to study AD's stages handling missing values in multiclass, focusing on the delineation of Early Mild Cognitive Impairment (EMCI), Late Mild Cognitive Impairment (LMCI), and normal cognitive (CN). But, to our best knowledge, robust performance information of these models and the missing data analysis has not been presented in the literature. In this paper, we propose studying the performance of five different machine learning models for AD's stages multiclass prediction in terms of accuracy, precision, and F1-score. Also, the analysis of three imputation methods to handle the missing value problem is presented. A framework that integrates ML model for AD's stages multiclass prediction is proposed, performing an average accuracy of 84%.Keywords: alzheimer's disease, missing value, machine learning, performance evaluation
Procedia PDF Downloads 2509 Higher Freshwater Fish and Sea Fish Intake Is Inversely Associated with Liver Cancer in Patients with Hepatitis B
Authors: Maomao Cao
Abstract:
Background and aims While the association between higher consumption of fish and lower liver cancer risk has been confirmed, however, the association between specific fish intake and liver cancer risk remains unknown. We aimed to identify the association between specific fish consumption and the risk of liver cancer. Methods: Based on a community-based seropositive hepatitis B cohort involving 18404 individuals, face to face interview was conducted by a standardized questionnaire to acquire baseline information. Three common fish types in this study were analyzed, including freshwater fish, sea fish, and small fish (shrimp, crab, conch, and shell). All participants received liver cancer screening, and possible cases were identified by CT or MRI. Multivariable logistic models were applied to estimate the odds ratio (OR) and 95% confidence intervals (CI). Multivariate multiple imputations were utilized to impute observations with missing values. Results: 179 liver cancer cases were identified. Consumption of freshwater fish and sea fish at least once a week had a strong inverse association with liver cancer risk compared with the lowest intake level, with an adjusted OR of 0.53 (95% CI, 0.38-0.75) and 0.38 (95% CI, 0.19-0.73), respectively. This inverse association was also observed after the imputation. There was no statistically significant association between intake of small fish and liver cancer risk (OR=0.58, 95%, CI 0.32-1.08). Conclusions: Our findings suggest that consumption of freshwater fish and sea fish at least once a week could reduce liver cancer risk.Keywords: cross-sectional study, fish intake, liver cancer, risk factor
Procedia PDF Downloads 2738 Primary Analysis of a Randomized Controlled Trial of Topical Analgesia Post Haemorrhoidectomy
Authors: James Jin, Weisi Xia, Runzhe Gao, Alain Vandal, Darren Svirkis, Andrew Hill
Abstract:
Background: Post-haemorrhoidectomy pain is concerned by patients/clinicians. Minimizing the postoperation pain is highly interested clinically. Combinations of topical cream targeting three hypothesised post-haemorrhoidectomy pain mechanisms were developed and their effectiveness were evaluated. Specifically, a multi-centred double-blinded randomized clinical trial (RCT) was conducted in adults undergoing excisional haemorrhoidectomy. The primary analysis was conveyed on the data collected to evaluate the effectiveness of the combinations of topical cream targeting three hypothesized pain mechanisms after the operations. Methods: 192 patients were randomly allocated to 4 arms (each arm has 48 patients), and each arm was provided with pain cream 10% metronidazole (M), M and 2% diltiazem (MD), M with 4% lidocaine (ML), or MDL, respectively. Patients were instructed to apply topical treatments three times a day for 7 days, and record outcomes for 14 days after the operations. The primary outcome was VAS pain on day 4. Covariates and models were selected in the blind review stage. Multiple imputations were applied for the missingness. LMER, GLMER models together with natural splines were applied. Sandwich estimators and Wald statistics were used. P-values < 0.05 were considered as significant. Conclusions: The addition of topical lidocaine or diltiazem to metronidazole does not add any benefit. ML had significantly better pain and recovery scores than combination MDL. Multimodal topical analgesia with ML after haemorrhoidectomy could be considered for further evaluation. Further trials considering only 3 arms (M, ML, MD) might be worth exploring.Keywords: RCT, primary analysis, multiple imputation, pain scores, haemorrhoidectomy, analgesia, lmer
Procedia PDF Downloads 1197 Optimizing Nitrogen Fertilizer Application in Rice Cultivation: A Decision Model for Top and Ear Dressing Dosages
Authors: Ya-Li Tsai
Abstract:
Nitrogen is a vital element crucial for crop growth, significantly influencing crop yield. In rice cultivation, farmers often apply substantial nitrogen fertilizer to maximize yields. However, excessive nitrogen application increases the risk of lodging and pest infestation, leading to yield losses. Additionally, conventional flooded irrigation methods consume significant water resources, necessitating precise agricultural and intelligent water management systems. In this study, it leveraged physiological data and field images captured by unmanned aerial vehicles, considering fertilizer treatment and irrigation as key factors. Statistical models incorporating rice physiological data, yield, and vegetation indices from image data were developed. Missing physiological data were addressed using multiple imputation and regression methods, and regression models were established using principal component analysis and stepwise regression. Target nitrogen accumulation at key growth stages was identified to optimize fertilizer application, with the difference between actual and target nitrogen accumulation guiding recommendations for ear dressing dosage. Field experiments conducted in 2022 validated the recommended ear dressing dosage, demonstrating no significant difference in final yield compared to traditional fertilizer levels under alternate wetting and drying irrigation. These findings highlight the efficacy of applying recommended dosages based on fertilizer decision models, offering the potential for reduced fertilizer use while maintaining yield in rice cultivation.Keywords: intelligent fertilizer management, nitrogen top and ear dressing fertilizer, rice, yield optimization
Procedia PDF Downloads 81