Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 373

Search results for: incomplete categorical covariates

373 Survival Data with Incomplete Missing Categorical Covariates

Authors: Madaki Umar Yusuf, Mohd Rizam B. Abubakar


The survival censored data with incomplete covariate data is a common occurrence in many studies in which the outcome is survival time. With model when the missing covariates are categorical, a useful technique for obtaining parameter estimates is the EM by the method of weights. The survival outcome for the class of generalized linear model is applied and this method requires the estimation of the parameters of the distribution of the covariates. In this paper, we propose some clinical trials with ve covariates, four of which have some missing values which clearly show that they were fully censored data.

Keywords: EM algorithm, incomplete categorical covariates, ignorable missing data, missing at random (MAR), Weibull Distribution

Procedia PDF Downloads 309
372 Measurement Errors and Misclassifications in Covariates in Logistic Regression: Bayesian Adjustment of Main and Interaction Effects and the Sample Size Implications

Authors: Shahadut Hossain


Measurement errors in continuous covariates and/or misclassifications in categorical covariates are common in epidemiological studies. Regression analysis ignoring such mismeasurements seriously biases the estimated main and interaction effects of covariates on the outcome of interest. Thus, adjustments for such mismeasurements are necessary. In this research, we propose a Bayesian parametric framework for eliminating deleterious impacts of covariate mismeasurements in logistic regression. The proposed adjustment method is unified and thus can be applied to any generalized linear and non-linear regression models. Furthermore, adjustment for covariate mismeasurements requires validation data usually in the form of either gold standard measurements or replicates of the mismeasured covariates on a subset of the study population. Initial investigation shows that adequacy of such adjustment depends on the sizes of main and validation samples, especially when prevalences of the categorical covariates are low. Thus, we investigate the impact of main and validation sample sizes on the adjusted estimates, and provide a general guideline about these sample sizes based on simulation studies.

Keywords: measurement errors, misclassification, mismeasurement, validation sample, Bayesian adjustment

Procedia PDF Downloads 326
371 Determination Power and Sample Size Zero-Inflated Negative Binomial Dependent Death Rate of Age Model (ZINBD): Regression Analysis Mortality Acquired Immune Deficiency De ciency Syndrome (AIDS)

Authors: Mohd Asrul Affendi Bin Abdullah


Sample size calculation is especially important for zero inflated models because a large sample size is required to detect a significant effect with this model. This paper verify how to present percentage of power approximation for categorical and then extended to zero inflated models. Wald test was chosen to determine power sample size of AIDS death rate because it is frequently used due to its approachability and its natural for several major recent contribution in sample size calculation for this test. Power calculation can be conducted when covariates are used in the modeling ‘excessing zero’ data and assist categorical covariate. Analysis of AIDS death rate study is used for this paper. Aims of this study to determine the power of sample size (N = 945) categorical death rate based on parameter estimate in the simulation of the study.

Keywords: power sample size, Wald test, standardize rate, ZINBDR

Procedia PDF Downloads 343
370 Clustering Categorical Data Using the K-Means Algorithm and the Attribute’s Relative Frequency

Authors: Semeh Ben Salem, Sami Naouali, Moetez Sallami


Clustering is a well known data mining technique used in pattern recognition and information retrieval. The initial dataset to be clustered can either contain categorical or numeric data. Each type of data has its own specific clustering algorithm. In this context, two algorithms are proposed: the k-means for clustering numeric datasets and the k-modes for categorical datasets. The main encountered problem in data mining applications is clustering categorical dataset so relevant in the datasets. One main issue to achieve the clustering process on categorical values is to transform the categorical attributes into numeric measures and directly apply the k-means algorithm instead the k-modes. In this paper, it is proposed to experiment an approach based on the previous issue by transforming the categorical values into numeric ones using the relative frequency of each modality in the attributes. The proposed approach is compared with a previously method based on transforming the categorical datasets into binary values. The scalability and accuracy of the two methods are experimented. The obtained results show that our proposed method outperforms the binary method in all cases.

Keywords: clustering, unsupervised learning, pattern recognition, categorical datasets, knowledge discovery, k-means

Procedia PDF Downloads 173
369 Estimation of Missing Values in Aggregate Level Spatial Data

Authors: Amitha Puranik, V. S. Binu, Seena Biju


Missing data is a common problem in spatial analysis especially at the aggregate level. Missing can either occur in covariate or in response variable or in both in a given location. Many missing data techniques are available to estimate the missing data values but not all of these methods can be applied on spatial data since the data are autocorrelated. Hence there is a need to develop a method that estimates the missing values in both response variable and covariates in spatial data by taking account of the spatial autocorrelation. The present study aims to develop a model to estimate the missing data points at the aggregate level in spatial data by accounting for (a) Spatial autocorrelation of the response variable (b) Spatial autocorrelation of covariates and (c) Correlation between covariates and the response variable. Estimating the missing values of spatial data requires a model that explicitly account for the spatial autocorrelation. The proposed model not only accounts for spatial autocorrelation but also utilizes the correlation that exists between covariates, within covariates and between a response variable and covariates. The precise estimation of the missing data points in spatial data will result in an increased precision of the estimated effects of independent variables on the response variable in spatial regression analysis.

Keywords: spatial regression, missing data estimation, spatial autocorrelation, simulation analysis

Procedia PDF Downloads 272
368 Using Genetic Algorithms and Rough Set Based Fuzzy K-Modes to Improve Centroid Model Clustering Performance on Categorical Data

Authors: Rishabh Srivastav, Divyam Sharma


We propose an algorithm to cluster categorical data named as ‘Genetic algorithm initialized rough set based fuzzy K-Modes for categorical data’. We propose an amalgamation of the simple K-modes algorithm, the Rough and Fuzzy set based K-modes and the Genetic Algorithm to form a new algorithm,which we hypothesise, will provide better Centroid Model clustering results, than existing standard algorithms. In the proposed algorithm, the initialization and updation of modes is done by the use of genetic algorithms while the membership values are calculated using the rough set and fuzzy logic.

Keywords: categorical data, fuzzy logic, genetic algorithm, K modes clustering, rough sets

Procedia PDF Downloads 43
367 An Estimating Equation for Survival Data with a Possibly Time-Varying Covariates under a Semiparametric Transformation Models

Authors: Yemane Hailu Fissuh, Zhongzhan Zhang


An estimating equation technique is an alternative method of the widely used maximum likelihood methods, which enables us to ease some complexity due to the complex characteristics of time-varying covariates. In the situations, when both the time-varying covariates and left-truncation are considered in the model, the maximum likelihood estimation procedures become much more burdensome and complex. To ease the complexity, in this study, the modified estimating equations those have been given high attention and considerations in many researchers under semiparametric transformation model was proposed. The purpose of this article was to develop the modified estimating equation under flexible and general class of semiparametric transformation models for left-truncated and right censored survival data with time-varying covariates. Besides the commonly applied Cox proportional hazards model, such kind of problems can be also analyzed with a general class of semiparametric transformation models to estimate the effect of treatment given possibly time-varying covariates on the survival time. The consistency and asymptotic properties of the estimators were intuitively derived via the expectation-maximization (EM) algorithm. The characteristics of the estimators in the finite sample performance for the proposed model were illustrated via simulation studies and Stanford heart transplant real data examples. To sum up the study, the bias for covariates has been adjusted by estimating density function for the truncation time variable. Then the effect of possibly time-varying covariates was evaluated in some special semiparametric transformation models.

Keywords: EM algorithm, estimating equation, semiparametric transformation models, time-to-event outcomes, time varying covariate

Procedia PDF Downloads 77
366 Imputation of Incomplete Large-Scale Monitoring Count Data via Penalized Estimation

Authors: Mohamed Dakki, Genevieve Robin, Marie Suet, Abdeljebbar Qninba, Mohamed A. El Agbani, Asmâa Ouassou, Rhimou El Hamoumi, Hichem Azafzaf, Sami Rebah, Claudia Feltrup-Azafzaf, Nafouel Hamouda, Wed a.L. Ibrahim, Hosni H. Asran, Amr A. Elhady, Haitham Ibrahim, Khaled Etayeb, Essam Bouras, Almokhtar Saied, Ashrof Glidan, Bakar M. Habib, Mohamed S. Sayoud, Nadjiba Bendjedda, Laura Dami, Clemence Deschamps, Elie Gaget, Jean-Yves Mondain-Monval, Pierre Defos Du Rau


In biodiversity monitoring, large datasets are becoming more and more widely available and are increasingly used globally to estimate species trends and con- servation status. These large-scale datasets challenge existing statistical analysis methods, many of which are not adapted to their size, incompleteness and heterogeneity. The development of scalable methods to impute missing data in incomplete large-scale monitoring datasets is crucial to balance sampling in time or space and thus better inform conservation policies. We developed a new method based on penalized Poisson models to impute and analyse incomplete monitoring data in a large-scale framework. The method al- lows parameterization of (a) space and time factors, (b) the main effects of predic- tor covariates, as well as (c) space–time interactions. It also benefits from robust statistical and computational capability in large-scale settings. The method was tested extensively on both simulated and real-life waterbird data, with the findings revealing that it outperforms six existing methods in terms of missing data imputation errors. Applying the method to 16 waterbird species, we estimated their long-term trends for the first time at the entire North African scale, a region where monitoring data suffer from many gaps in space and time series. This new approach opens promising perspectives to increase the accuracy of species-abundance trend estimations. We made it freely available in the r package ‘lori’ ( and recommend its use for large- scale count data, particularly in citizen science monitoring programmes.

Keywords: biodiversity monitoring, high-dimensional statistics, incomplete count data, missing data imputation, waterbird trends in North-Africa

Procedia PDF Downloads 44
365 Structural Behavior of Incomplete Box Girder Bridges Subjected to Unpredicted Loads

Authors: E. H. N. Gashti, J. Razzaghi, K. Kujala


In general, codes and regulations consider seismic loads only for completed structures of the bridges while, evaluation of incomplete structure of bridges, especially those constructed by free cantilever method, under these loads is also of great importance. Hence, this research tried to study the behavior of incomplete structure of common bridge type (box girder bridge), in construction phase under vertical seismic loads. Subsequently, the paper provided suitable guidelines and solutions to withstand this destructive phenomena. Research results proved that use of preventive methods can significantly reduce the stresses resulted from vertical seismic loads in box cross sections to an acceptable range recommended by design codes.

Keywords: box girder bridges, prestress loads, free cantilever method, seismic loads, construction phase

Procedia PDF Downloads 277
364 Geo-Additive Modeling of Family Size in Nigeria

Authors: Oluwayemisi O. Alaba, John O. Olaomi


The 2013 Nigerian Demographic Health Survey (NDHS) data was used to investigate the determinants of family size in Nigeria using the geo-additive model. The fixed effect of categorical covariates were modelled using the diffuse prior, P-spline with second-order random walk for the nonlinear effect of continuous variable, spatial effects followed Markov random field priors while the exchangeable normal priors were used for the random effects of the community and household. The Negative Binomial distribution was used to handle overdispersion of the dependent variable. Inference was fully Bayesian approach. Results showed a declining effect of secondary and higher education of mother, Yoruba tribe, Christianity, family planning, mother giving birth by caesarean section and having a partner who has secondary education on family size. Big family size is positively associated with age at first birth, number of daughters in a household, being gainfully employed, married and living with partner, community and household effects.

Keywords: Bayesian analysis, family size, geo-additive model, negative binomial

Procedia PDF Downloads 453
363 The Univalence Principle: Equivalent Mathematical Structures Are Indistinguishable

Authors: Michael Shulman, Paige North, Benedikt Ahrens, Dmitris Tsementzis


The Univalence Principle is the statement that equivalent mathematical structures are indistinguishable. We prove a general version of this principle that applies to all set-based, categorical, and higher-categorical structures defined in a non-algebraic and space-based style, as well as models of higher-order theories such as topological spaces. In particular, we formulate a general definition of indiscernibility for objects of any such structure, and a corresponding univalence condition that generalizes Rezk’s completeness condition for Segal spaces and ensures that all equivalences of structures are levelwise equivalences. Our work builds on Makkai’s First-Order Logic with Dependent Sorts, but is expressed in Voevodsky’s Univalent Foundations (UF), extending previous work on the Structure Identity Principle and univalent categories in UF. This enables indistinguishability to be expressed simply as identification, and yields a formal theory that is interpretable in classical homotopy theory, but also in other higher topos models. It follows that Univalent Foundations is a fully equivalence-invariant foundation for higher-categorical mathematics, as intended by Voevodsky.

Keywords: category theory, higher structures, inverse category, univalence

Procedia PDF Downloads 64
362 Zero Cross-Correlation Codes Based on Balanced Incomplete Block Design: Performance Analysis and Applications

Authors: Garadi Ahmed, Boubakar S. Bouazza


The Zero Cross-Correlation (C, w) code is a family of binary sequences of length C and constant Hamming-weight, the cross correlation between any two sequences equal zero. In this paper, we evaluate the performance of ZCC code based on Balanced Incomplete Block Design (BIBD) for Spectral Amplitude Coding Optical Code Division Multiple Access (SAC-OCDMA) system using direct detection. The BER obtained is better than 10-9 for five simultaneous users.

Keywords: spectral amplitude coding-optical code-division-multiple-access (SAC-OCDMA), phase induced intensity noise (PIIN), balanced incomplete block design (BIBD), zero cross-correlation (ZCC)

Procedia PDF Downloads 279
361 Causal Estimation for the Left-Truncation Adjusted Time-Varying Covariates under the Semiparametric Transformation Models of a Survival Time

Authors: Yemane Hailu Fissuh, Zhongzhan Zhang


In biomedical researches and randomized clinical trials, the most commonly interested outcomes are time-to-event so-called survival data. The importance of robust models in this context is to compare the effect of randomly controlled experimental groups that have a sense of causality. Causal estimation is the scientific concept of comparing the pragmatic effect of treatments conditional to the given covariates rather than assessing the simple association of response and predictors. Hence, the causal effect based semiparametric transformation model was proposed to estimate the effect of treatment with the presence of possibly time-varying covariates. Due to its high flexibility and robustness, the semiparametric transformation model which shall be applied in this paper has been given much more attention for estimation of a causal effect in modeling left-truncated and right censored survival data. Despite its wide applications and popularity in estimating unknown parameters, the maximum likelihood estimation technique is quite complex and burdensome in estimating unknown parameters and unspecified transformation function in the presence of possibly time-varying covariates. Thus, to ease the complexity we proposed the modified estimating equations. After intuitive estimation procedures, the consistency and asymptotic properties of the estimators were derived and the characteristics of the estimators in the finite sample performance of the proposed model were illustrated via simulation studies and Stanford heart transplant real data example. To sum up the study, the bias of covariates was adjusted via estimating the density function for truncation variable which was also incorporated in the model as a covariate in order to relax the independence assumption of failure time and truncation time. Moreover, the expectation-maximization (EM) algorithm was described for the estimation of iterative unknown parameters and unspecified transformation function. In addition, the causal effect was derived by the ratio of the cumulative hazard function of active and passive experiments after adjusting for bias raised in the model due to the truncation variable.

Keywords: causal estimation, EM algorithm, semiparametric transformation models, time-to-event outcomes, time-varying covariate

Procedia PDF Downloads 54
360 Tests for Zero Inflation in Count Data with Measurement Error in Covariates

Authors: Man-Yu Wong, Siyu Zhou, Zhiqiang Cao


In quality of life, health service utilization is an important determinant of medical resource expenditures on Colorectal cancer (CRC) care, a better understanding of the increased utilization of health services is essential for optimizing the allocation of healthcare resources to services and thus for enhancing the service quality, especially for high expenditure on CRC care like Hong Kong region. In assessing the association between the health-related quality of life (HRQOL) and health service utilization in patients with colorectal neoplasm, count data models can be used, which account for over dispersion or extra zero counts. In our data, the HRQOL evaluation is a self-reported measure obtained from a questionnaire completed by the patients, misreports and variations in the data are inevitable. Besides, there are more zero counts from the observed number of clinical consultations (observed frequency of zero counts = 206) than those from a Poisson distribution with mean equal to 1.33 (expected frequency of zero counts = 156). This suggests that excess of zero counts may exist. Therefore, we study tests for detecting zero-inflation in models with measurement error in covariates. Method: Under classical measurement error model, the approximate likelihood function for zero-inflation Poisson regression model can be obtained, then Approximate Maximum Likelihood Estimation(AMLE) can be derived accordingly, which is consistent and asymptotically normally distributed. By calculating score function and Fisher information based on AMLE, a score test is proposed to detect zero-inflation effect in ZIP model with measurement error. The proposed test follows asymptotically standard normal distribution under H0, and it is consistent with the test proposed for zero-inflation effect when there is no measurement error. Results: Simulation results show that empirical power of our proposed test is the highest among existing tests for zero-inflation in ZIP model with measurement error. In real data analysis, with or without considering measurement error in covariates, existing tests, and our proposed test all imply H0 should be rejected with P-value less than 0.001, i.e., zero-inflation effect is very significant, ZIP model is superior to Poisson model for analyzing this data. However, if measurement error in covariates is not considered, only one covariate is significant; if measurement error in covariates is considered, only another covariate is significant. Moreover, the direction of coefficient estimations for these two covariates is different in ZIP regression model with or without considering measurement error. Conclusion: In our study, compared to Poisson model, ZIP model should be chosen when assessing the association between condition-specific HRQOL and health service utilization in patients with colorectal neoplasm. and models taking measurement error into account will result in statistically more reliable and precise information.

Keywords: count data, measurement error, score test, zero inflation

Procedia PDF Downloads 206
359 A Modified Estimating Equations in Derivation of the Causal Effect on the Survival Time with Time-Varying Covariates

Authors: Yemane Hailu Fissuh, Zhongzhan Zhang


a systematic observation from a defined time of origin up to certain failure or censor is known as survival data. Survival analysis is a major area of interest in biostatistics and biomedical researches. At the heart of understanding, the most scientific and medical research inquiries lie for a causality analysis. Thus, the main concern of this study is to investigate the causal effect of treatment on survival time conditional to the possibly time-varying covariates. The theory of causality often differs from the simple association between the response variable and predictors. A causal estimation is a scientific concept to compare a pragmatic effect between two or more experimental arms. To evaluate an average treatment effect on survival outcome, the estimating equation was adjusted for time-varying covariates under the semi-parametric transformation models. The proposed model intuitively obtained the consistent estimators for unknown parameters and unspecified monotone transformation functions. In this article, the proposed method estimated an unbiased average causal effect of treatment on survival time of interest. The modified estimating equations of semiparametric transformation models have the advantage to include the time-varying effect in the model. Finally, the finite sample performance characteristics of the estimators proved through the simulation and Stanford heart transplant real data. To this end, the average effect of a treatment on survival time estimated after adjusting for biases raised due to the high correlation of the left-truncation and possibly time-varying covariates. The bias in covariates was restored, by estimating density function for left-truncation. Besides, to relax the independence assumption between failure time and truncation time, the model incorporated the left-truncation variable as a covariate. Moreover, the expectation-maximization (EM) algorithm iteratively obtained unknown parameters and unspecified monotone transformation functions. To summarize idea, the ratio of cumulative hazards functions between the treated and untreated experimental group has a sense of the average causal effect for the entire population.

Keywords: a modified estimation equation, causal effect, semiparametric transformation models, survival analysis, time-varying covariate

Procedia PDF Downloads 87
358 A PROMETHEE-BELIEF Approach for Multi-Criteria Decision Making Problems with Incomplete Information

Authors: H. Moalla, A. Frikha


Multi-criteria decision aid methods consider decision problems where numerous alternatives are evaluated on several criteria. These methods are used to deal with perfect information. However, in practice, it is obvious that this information requirement is too much strict. In fact, the imperfect data provided by more or less reliable decision makers usually affect decision results since any decision is closely linked to the quality and availability of information. In this paper, a PROMETHEE-BELIEF approach is proposed to help multi-criteria decisions based on incomplete information. This approach solves problems with incomplete decision matrix and unknown weights within PROMETHEE method. On the base of belief function theory, our approach first determines the distributions of belief masses based on PROMETHEE’s net flows and then calculates weights. Subsequently, it aggregates the distribution masses associated to each criterion using Murphy’s modified combination rule in order to infer a global belief structure. The final action ranking is obtained via pignistic probability transformation. A case study of real-world application concerning the location of a waste treatment center from healthcare activities with infectious risk in the center of Tunisia is studied to illustrate the detailed process of the BELIEF-PROMETHEE approach.

Keywords: belief function theory, incomplete information, multiple criteria analysis, PROMETHEE method

Procedia PDF Downloads 68
357 Modelling Causal Effects from Complex Longitudinal Data via Point Effects of Treatments

Authors: Xiaoqin Wang, Li Yin


Background and purpose: In many practices, one estimates causal effects arising from a complex stochastic process, where a sequence of treatments are assigned to influence a certain outcome of interest, and there exist time-dependent covariates between treatments. When covariates are plentiful and/or continuous, statistical modeling is needed to reduce the huge dimensionality of the problem and allow for the estimation of causal effects. Recently, Wang and Yin (Annals of statistics, 2020) derived a new general formula, which expresses these causal effects in terms of the point effects of treatments in single-point causal inference. As a result, it is possible to conduct the modeling via point effects. The purpose of the work is to study the modeling of these causal effects via point effects. Challenges and solutions: The time-dependent covariates often have influences from earlier treatments as well as on subsequent treatments. Consequently, the standard parameters – i.e., the mean of the outcome given all treatments and covariates-- are essentially all different (null paradox). Furthermore, the dimension of the parameters is huge (curse of dimensionality). Therefore, it can be difficult to conduct the modeling in terms of standard parameters. Instead of standard parameters, we have use point effects of treatments to develop likelihood-based parametric approach to the modeling of these causal effects and are able to model the causal effects of a sequence of treatments by modeling a small number of point effects of individual treatment Achievements: We are able to conduct the modeling of the causal effects from a sequence of treatments in the familiar framework of single-point causal inference. The simulation shows that our method achieves not only an unbiased estimate for the causal effect but also the nominal level of type I error and a low level of type II error for the hypothesis testing. We have applied this method to a longitudinal study of COVID-19 mortality among Scandinavian countries and found that the Swedish approach performed far worse than the other countries' approach for COVID-19 mortality and the poor performance was largely due to its early measure during the initial period of the pandemic.

Keywords: causal effect, point effect, statistical modelling, sequential causal inference

Procedia PDF Downloads 95
356 Spatial Point Process Analysis of Dengue Fever in Tainan, Taiwan

Authors: Ya-Mei Chang


This research is intended to apply spatio-temporal point process methods to the dengue fever data in Tainan. The spatio-temporal intensity function of the dataset is assumed to be separable. The kernel estimation is a widely used approach to estimate intensity functions. The intensity function is very helpful to study the relation of the spatio-temporal point process and some covariates. The covariate effects might be nonlinear. An nonparametric smoothing estimator is used to detect the nonlinearity of the covariate effects. A fitted parametric model could describe the influence of the covariates to the dengue fever. The correlation between the data points is detected by the K-function. The result of this research could provide useful information to help the government or the stakeholders making decisions.

Keywords: dengue fever, spatial point process, kernel estimation, covariate effect

Procedia PDF Downloads 261
355 Deep Learning Approach for Chronic Kidney Disease Complications

Authors: Mario Isaza-Ruget, Claudia C. Colmenares-Mejia, Nancy Yomayusa, Camilo A. González, Andres Cely, Jossie Murcia


Quantification of risks associated with complications development from chronic kidney disease (CKD) through accurate survival models can help with patient management. A retrospective cohort that included patients diagnosed with CKD from a primary care program and followed up between 2013 and 2018 was carried out. Time-dependent and static covariates associated with demographic, clinical, and laboratory factors were included. Deep Learning (DL) survival analyzes were developed for three CKD outcomes: CKD stage progression, >25% decrease in Estimated Glomerular Filtration Rate (eGFR), and Renal Replacement Therapy (RRT). Models were evaluated and compared with Random Survival Forest (RSF) based on concordance index (C-index) metric. 2.143 patients were included. Two models were developed for each outcome, Deep Neural Network (DNN) model reported C-index=0.9867 for CKD stage progression; C-index=0.9905 for reduction in eGFR; C-index=0.9867 for RRT. Regarding the RSF model, C-index=0.6650 was reached for CKD stage progression; decreased eGFR C-index=0.6759; RRT C-index=0.8926. DNN models applied in survival analysis context with considerations of longitudinal covariates at the start of follow-up can predict renal stage progression, a significant decrease in eGFR and RRT. The success of these survival models lies in the appropriate definition of survival times and the analysis of covariates, especially those that vary over time.

Keywords: artificial intelligence, chronic kidney disease, deep neural networks, survival analysis

Procedia PDF Downloads 29
354 Distances over Incomplete Diabetes and Breast Cancer Data Based on Bhattacharyya Distance

Authors: Loai AbdAllah, Mahmoud Kaiyal


Missing values in real-world datasets are a common problem. Many algorithms were developed to deal with this problem, most of them replace the missing values with a fixed value that was computed based on the observed values. In our work, we used a distance function based on Bhattacharyya distance to measure the distance between objects with missing values. Bhattacharyya distance, which measures the similarity of two probability distributions. The proposed distance distinguishes between known and unknown values. Where the distance between two known values is the Mahalanobis distance. When, on the other hand, one of them is missing the distance is computed based on the distribution of the known values, for the coordinate that contains the missing value. This method was integrated with Wikaya, a digital health company developing a platform that helps to improve prevention of chronic diseases such as diabetes and cancer. In order for Wikaya’s recommendation system to work distance between users need to be measured. Since there are missing values in the collected data, there is a need to develop a distance function distances between incomplete users profiles. To evaluate the accuracy of the proposed distance function in reflecting the actual similarity between different objects, when some of them contain missing values, we integrated it within the framework of k nearest neighbors (kNN) classifier, since its computation is based only on the similarity between objects. To validate this, we ran the algorithm over diabetes and breast cancer datasets, standard benchmark datasets from the UCI repository. Our experiments show that kNN classifier using our proposed distance function outperforms the kNN using other existing methods.

Keywords: missing values, incomplete data, distance, incomplete diabetes data

Procedia PDF Downloads 132
353 Secondary Traumatic Stress and Related Factors in Australian Social Workers and Psychologists

Authors: Cindy Davis, Samantha Rayner


Secondary traumatic stress (STS) is an indirect form of trauma affecting the psychological well-being of mental health workers; STS is found to be a prevalent risk in mental health occupations. Various factors impact the development of STS within the literature; including the level of trauma individuals are exposed to and their level of empathy. Research is limited on STS in mental health workers in Australia; therefore, this study examined STS and related factors of empathetic behavior and trauma caseload among mental health workers. The research utilized an online survey quantitative research design with a purposive sample of 190 mental health workers (176 females) recruited via professional websites and unofficial social media groups. Participants completed an online questionnaire comprising of demographics, the secondary traumatic stress scale and the empathy scale for social workers. A standard hierarchical regression analysis was conducted to examine the significance of covariates, traumatized clients, traumatic stress within workload and empathy in predicting STS. The current research found 29.5% of participants to meet the criteria for a diagnosis of STS. Age and past trauma within the covariates were significantly associated with STS. Amount of traumatized clients significantly predicted 4.7% of the variance in STS, traumatic stress within workload significantly predicted 4.8% of the variance in STS and empathy significantly predicted 4.9% of the variance in STS. These three independent variables and the covariates accounted for 18.5% of the variance in STS. Practical implications include a focus on developing risk strategies and treatment methods that can diminish the impact of STS.

Keywords: mental health, PTSD, social work, trauma

Procedia PDF Downloads 196
352 Generalized Additive Model for Estimating Propensity Score

Authors: Tahmidul Islam


Propensity Score Matching (PSM) technique has been widely used for estimating causal effect of treatment in observational studies. One major step of implementing PSM is estimating the propensity score (PS). Logistic regression model with additive linear terms of covariates is most used technique in many studies. Logistics regression model is also used with cubic splines for retaining flexibility in the model. However, choosing the functional form of the logistic regression model has been a question since the effectiveness of PSM depends on how accurately the PS been estimated. In many situations, the linearity assumption of linear logistic regression may not hold and non-linear relation between the logit and the covariates may be appropriate. One can estimate PS using machine learning techniques such as random forest, neural network etc for more accuracy in non-linear situation. In this study, an attempt has been made to compare the efficacy of Generalized Additive Model (GAM) in various linear and non-linear settings and compare its performance with usual logistic regression. GAM is a non-parametric technique where functional form of the covariates can be unspecified and a flexible regression model can be fitted. In this study various simple and complex models have been considered for treatment under several situations (small/large sample, low/high number of treatment units) and examined which method leads to more covariate balance in the matched dataset. It is found that logistic regression model is impressively robust against inclusion quadratic and interaction terms and reduces mean difference in treatment and control set equally efficiently as GAM does. GAM provided no significantly better covariate balance than logistic regression in both simple and complex models. The analysis also suggests that larger proportion of controls than treatment units leads to better balance for both of the methods.

Keywords: accuracy, covariate balances, generalized additive model, logistic regression, non-linearity, propensity score matching

Procedia PDF Downloads 290
351 Electrochemical Regeneration of GIC Adsorbent in a Continuous Electrochemical Reactor

Authors: S. N. Hussain, H. M. A. Asghar, H. Sattar, E. P. L. Roberts


Arvia™ introduced a novel technology consisting of adsorption followed by electrochemical regeneration with a graphite intercalation compound adsorbent that takes place in a single unit. The adsorbed species may lead to the formation of intermediate by-products products due to incomplete mineralization during electrochemical regeneration. Therefore, the investigation of breakdown products due to incomplete oxidation is of great concern regarding the commercial applications of this process. In the present paper, the formation of the chlorinated breakdown products during continuous process of adsorption and electrochemical regeneration based on a graphite intercalation compound adsorbent has been investigated.

Keywords: GIC, adsorption, electrochemical regeneration, chlorphenols

Procedia PDF Downloads 217
350 Syllogistic Reasoning with 108 Inference Rules While Case Quantities Change

Authors: Mikhail Zarechnev, Bora I. Kumova


A syllogism is a deductive inference scheme used to derive a conclusion from a set of premises. In a categorical syllogisms, there are only two premises and every premise and conclusion is given in form of a quantified relationship between two objects. The different order of objects in premises give classification known as figures. We have shown that the ordered combinations of 3 generalized quantifiers with certain figure provide in total of 108 syllogistic moods which can be considered as different inference rules. The classical syllogistic system allows to model human thought and reasoning with syllogistic structures always attracted the attention of cognitive scientists. Since automated reasoning is considered as part of learning subsystem of AI agents, syllogistic system can be applied for this approach. Another application of syllogistic system is related to inference mechanisms on the Semantic Web applications. In this paper we proposed the mathematical model and algorithm for syllogistic reasoning. Also the model of iterative syllogistic reasoning in case of continuous flows of incoming data based on case–based reasoning and possible applications of proposed system were discussed.

Keywords: categorical syllogism, case-based reasoning, cognitive architecture, inference on the semantic web, syllogistic reasoning

Procedia PDF Downloads 331
349 The Determinants of Country Corruption: Unobserved Heterogeneity and Individual Choice- An empirical Application with Finite Mixture Models

Authors: Alessandra Marcelletti, Giovanni Trovato


Corruption in public offices is found to be the reflection of country-specific features, however, the exact magnitude and the statistical significance of its determinants effect has not yet been identified. The paper aims to propose an estimation method to measure the impact of country fundamentals on corruption, showing that covariates could differently affect the extent of corruption across countries. Thus, we exploit a model able to take into account different factors affecting the incentive to ask or to be asked for a bribe, coherently with the use of the Corruption Perception Index. We assume that discordant results achieved in literature may be explained by omitted hidden factors affecting the agents' decision process. Moreover, assuming homogeneous covariates effect may lead to unreliable conclusions since the country-specific environment is not accounted for. We apply a Finite Mixture Model with concomitant variables to 129 countries from 1995 to 2006, accounting for the impact of the initial conditions in the socio-economic structure on the corruption patterns. Our findings confirm the hypothesis of the decision process of accepting or asking for a bribe varies with specific country fundamental features.

Keywords: Corruption, Finite Mixture Models, Concomitant Variables, Countries Classification

Procedia PDF Downloads 195
348 The Effect of Size and Tumor Depth on Histological Clearance Margins of Basal Cell Carcinomas

Authors: Martin Van, Mohammed Javed, Sarah Hemington-Gorse


Aim: Our aim was to determine the effect of size and tumor depth of basal cell carcinomas (BCCs) on surgical margin clearance. Methods: A retrospective study was conducted at the Welsh Centre for Burns and Plastic Surgery (WCBPS), Morriston Hospital between 1 Jan 2016 – 31 July 2016. Only patients with confirmed BCC on histopathological analysis were included. Patient data including anatomical region treated, lesion size, histopathological clearance margins and histological sub-types were recorded. An independent T-test was performed determine statistical significance. Results: A total of 228 BCCs were excised in 160 patients. Eleven lesions (4.8%) were incompletely excised. The nose area had the highest rate of incomplete excision. The mean diameter of incompletely excised lesions was 11.4mm vs 11.5mm in completely excised lesions (p=0.959) and the mean histological depth of incompletely excised lesions was 4.1mm vs. 2.5mm for completely excised BCCs (p < 0.05). Conclusions: BCC tumor depth of > 4.1 mm was associated with high rate of incomplete margin clearance. Hence, in prospective patients, a BCC tumor depth (>4 mm) on tissue biopsy should alert the surgeon of potentially higher risk of incomplete excision of lesion.

Keywords: basal cell carcinoma, excision margins, plastic surgery, treatment

Procedia PDF Downloads 151
347 Performance of the Cmip5 Models in Simulation of the Present and Future Precipitation over the Lake Victoria Basin

Authors: M. A. Wanzala, L. A. Ogallo, F. J. Opijah, J. N. Mutemi


The usefulness and limitations in climate information are due to uncertainty inherent in the climate system. For any given region to have sustainable development it is important to apply climate information into its socio-economic strategic plans. The overall objective of the study was to assess the performance of the Coupled Model Inter-comparison Project (CMIP5) over the Lake Victoria Basin. The datasets used included the observed point station data, gridded rainfall data from Climate Research Unit (CRU) and hindcast data from eight CMIP5. The methodology included trend analysis, spatial analysis, correlation analysis, Principal Component Analysis (PCA) regression analysis, and categorical statistical skill score. Analysis of the trends in the observed rainfall records indicated an increase in rainfall variability both in space and time for all the seasons. The spatial patterns of the individual models output from the models of MPI, MIROC, EC-EARTH and CNRM were closest to the observed rainfall patterns.

Keywords: categorical statistics, coupled model inter-comparison project, principal component analysis, statistical downscaling

Procedia PDF Downloads 286
346 Different Sampling Schemes for Semi-Parametric Frailty Model

Authors: Nursel Koyuncu, Nihal Ata Tutkun


Frailty model is a survival model that takes into account the unobserved heterogeneity for exploring the relationship between the survival of an individual and several covariates. In the recent years, proposed survival models become more complex and this feature causes convergence problems especially in large data sets. Therefore selection of sample from these big data sets is very important for estimation of parameters. In sampling literature, some authors have defined new sampling schemes to predict the parameters correctly. For this aim, we try to see the effect of sampling design in semi-parametric frailty model. We conducted a simulation study in R programme to estimate the parameters of semi-parametric frailty model for different sample sizes, censoring rates under classical simple random sampling and ranked set sampling schemes. In the simulation study, we used data set recording 17260 male Civil Servants aged 40–64 years with complete 10-year follow-up as population. Time to death from coronary heart disease is treated as a survival-time and age, systolic blood pressure are used as covariates. We select the 1000 samples from population using different sampling schemes and estimate the parameters. From the simulation study, we concluded that ranked set sampling design performs better than simple random sampling for each scenario.

Keywords: frailty model, ranked set sampling, efficiency, simple random sampling

Procedia PDF Downloads 122
345 Frailty Models for Modeling Heterogeneity: Simulation Study and Application to Quebec Pension Plan

Authors: Souad Romdhane, Lotfi Belkacem


When referring to actuarial analysis of lifetime, only models accounting for observable risk factors have been developed. Within this context, Cox proportional hazards model (CPH model) is commonly used to assess the effects of observable covariates as gender, age, smoking habits, on the hazard rates. These covariates may fail to fully account for the true lifetime interval. This may be due to the existence of another random variable (frailty) that is still being ignored. The aim of this paper is to examine the shared frailty issue in the Cox proportional hazard model by including two different parametric forms of frailty into the hazard function. Four estimated methods are used to fit them. The performance of the parameter estimates is assessed and compared between the classical Cox model and these frailty models through a real-life data set from the Quebec Pension Plan and then using a more general simulation study. This performance is investigated in terms of the bias of point estimates and their empirical standard errors in both fixed and random effect parts. Both the simulation and the real dataset studies showed differences between classical Cox model and shared frailty model.

Keywords: life insurance-pension plan, survival analysis, risk factors, cox proportional hazards model, multivariate failure-time data, shared frailty, simulations study

Procedia PDF Downloads 277
344 [Keynote Talk]: Evidence Fusion in Decision Making

Authors: Mohammad Abdullah-Al-Wadud


In the current era of automation and artificial intelligence, different systems have been increasingly keeping on depending on decision-making capabilities of machines. Such systems/applications may range from simple classifiers to sophisticated surveillance systems based on traditional sensors and related equipment which are becoming more common in the internet of things (IoT) paradigm. However, the available data for such problems are usually imprecise and incomplete, which leads to uncertainty in decisions made based on traditional probability-based classifiers. This requires a robust fusion framework to combine the available information sources with some degree of certainty. The theory of evidence can provide with such a method for combining evidence from different (may be unreliable) sources/observers. This talk will address the employment of the Dempster-Shafer Theory of evidence in some practical applications.

Keywords: decision making, dempster-shafer theory, evidence fusion, incomplete data, uncertainty

Procedia PDF Downloads 333