Search results for: incomplete data
25275 Imputing Missing Data in Electronic Health Records: A Comparison of Linear and Non-Linear Imputation Models
Authors: Alireza Vafaei Sadr, Vida Abedi, Jiang Li, Ramin Zand
Abstract:
Missing data is a common challenge in medical research and can lead to biased or incomplete results. When the data bias leaks into models, it further exacerbates health disparities; biased algorithms can lead to misclassification and reduced resource allocation and monitoring as part of prevention strategies for certain minorities and vulnerable segments of patient populations, which in turn further reduces the data footprint from the same population, creating a vicious cycle. This study compares the performance of six imputation techniques, grouped into Linear and Non-Linear models, on two different real-world electronic health records (EHRs) datasets, representing 17,864 patient records. The mean absolute percentage error (MAPE) and root mean squared error (RMSE) are used as performance metrics, and the results show that the Linear models outperformed the Non-Linear models in terms of both metrics. These results suggest that Linear models might sometimes be an optimal choice for the imputation of laboratory variables in terms of imputation efficiency and uncertainty of the predicted values.
Keywords: EHR, machine learning, imputation, laboratory variables, algorithmic bias
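As an illustration of the evaluation described above, the following Python sketch compares one linear and one non-linear imputer with MAPE and RMSE. It is not the authors' pipeline: the column names, the random masking scheme, and the choice of estimators are assumptions for demonstration only.

```python
# Minimal sketch: mask values, impute with a linear and a non-linear model,
# score by RMSE and MAPE on the hidden entries (illustrative data only).
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
full = pd.DataFrame(
    rng.normal(loc=[140, 4.2, 1.0, 100], scale=[5, 0.5, 0.3, 20], size=(500, 4)),
    columns=["sodium", "potassium", "creatinine", "glucose"])   # assumed lab names
mask = rng.random(full.shape) < 0.2          # hide 20% of the values at random
observed = full.mask(mask)

def score(estimator):
    imputer = IterativeImputer(estimator=estimator, random_state=0)
    imputed = imputer.fit_transform(observed)
    err = imputed[mask] - full.values[mask]
    rmse = np.sqrt(np.mean(err ** 2))
    mape = np.mean(np.abs(err / full.values[mask])) * 100
    return rmse, mape

for name, est in [("linear", BayesianRidge()),
                  ("non-linear", RandomForestRegressor(n_estimators=50))]:
    rmse, mape = score(est)
    print(f"{name}: RMSE={rmse:.3f}  MAPE={mape:.2f}%")
```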
Procedia PDF Downloads 85
25274 Bayesian Analysis of Topp-Leone Generalized Exponential Distribution
Authors: Najrullah Khan, Athar Ali Khan
Abstract:
The Topp-Leone distribution was introduced by Topp and Leone in 1955. In this paper, an attempt has been made to fit the Topp-Leone generalized exponential (TPGE) distribution. A real survival data set is used for illustration. Implementation is done using R and JAGS, and appropriate illustrations are made. R and JAGS codes have been provided to implement the censoring mechanism using both optimization and simulation tools. The main aim of this paper is to describe and illustrate the Bayesian modelling approach to the analysis of survival data. Emphasis is placed on the modelling of data and the interpretation of the results. Crucial to this is an understanding of the nature of the incomplete or 'censored' data encountered. Analytic approximation and simulation tools are covered here, but most of the emphasis is on Markov chain based Monte Carlo methods, including the independent Metropolis algorithm, which is currently the most popular technique. For analytic approximation, among the various optimization algorithms, the trust region method is found to be the best. In this paper, the TPGE model is also used to analyze lifetime data in the Bayesian paradigm. Results are evaluated from the above-mentioned real survival data set. The analytic approximation and simulation methods are implemented using some software packages. It is clear from our findings that simulation tools provide better results compared to those obtained by asymptotic approximation.
Keywords: Bayesian Inference, JAGS, Laplace Approximation, LaplacesDemon, posterior, R Software, simulation
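The independence Metropolis idea mentioned above can be sketched as follows. This is an assumption-laden illustration, not the authors' R/JAGS TPGE code: it samples the rate of a simple right-censored exponential survival model with a Gamma(a, b) prior, using a fixed gamma proposal.

```python
# Independence Metropolis sampler for a right-censored exponential rate
# (illustrative model; the data, prior, and proposal are assumptions).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
times = np.array([2.1, 3.5, 0.8, 5.0, 4.2, 1.1, 6.3, 2.9])
event = np.array([1, 1, 0, 1, 0, 1, 1, 0])      # 1 = observed death, 0 = censored
a, b = 1.0, 1.0                                  # Gamma prior hyperparameters

def log_post(lam):
    if lam <= 0:
        return -np.inf
    # exponential likelihood with right censoring, plus Gamma(a, b) prior
    return (a + event.sum() - 1) * np.log(lam) - lam * (b + times.sum())

prop_logpdf = lambda v: stats.gamma.logpdf(v, a=2.0, scale=1.0)  # fixed proposal
lam, draws = 1.0, []
for _ in range(5000):
    cand = rng.gamma(2.0, 1.0)                   # proposal independent of state
    log_ratio = (log_post(cand) - log_post(lam)
                 + prop_logpdf(lam) - prop_logpdf(cand))
    if np.log(rng.random()) < log_ratio:
        lam = cand
    draws.append(lam)

print("posterior mean of the rate:", np.mean(draws[1000:]))
```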
Procedia PDF Downloads 535
25273 Energy Complementary in Colombia: Imputation of Dataset
Authors: Felipe Villegas-Velasquez, Harold Pantoja-Villota, Sergio Holguin-Cardona, Alejandro Osorio-Botero, Brayan Candamil-Arango
Abstract:
Colombian electricity comes mainly from hydric resources, which are affected by environmental variations such as the El Niño phenomenon. That is why incorporating other types of resources is necessary to provide electricity constantly. This research seeks to fill in the wind speed and global solar irradiance datasets for the two years with the highest amount of information. A further result is the characterization of the data by region, which made it possible to infer which errors occurred and produced the incomplete dataset.
Keywords: energy, wind speed, global solar irradiance, Colombia, imputation
Procedia PDF Downloads 146
25272 Influence of Processing Parameters on the Reliability of Sieving as a Particle Size Distribution Measurements
Authors: Eseldin Keleb
Abstract:
In the pharmaceutical industry, particle size distribution is an important parameter for the characterization of pharmaceutical powders. The powder flowability, reactivity and compatibility, which have a decisive impact on the final product, are determined by particle size and size distribution. Therefore, the aim of this study was to evaluate the influence of processing parameters on particle size distribution measurements. Different size fractions of α-lactose monohydrate and 5% polyvinylpyrrolidone were prepared by wet granulation and were used for the preparation of samples. The influence of sieve load (50, 100, 150, 200, 250, 300, and 350 g), processing time (5, 10, and 15 min), sample size ratios (high percentage of small and large particles), type of disturbance (vibration and shaking), and process reproducibility have been investigated. The results obtained showed that a sieve load of 50 g produced the best separation; a further increase in sample weight resulted in incomplete separation even after the extension of the processing time to 15 min. Performing sieving using vibration was faster and more efficient than shaking. Meanwhile, between-day reproducibility showed that particle size distribution measurements are reproducible. However, for samples containing 70% fines or 70% large particles, which were processed at the optimized parameters, incomplete separation was always observed. These results indicate that sieving reliability is highly influenced by the particle size distribution of the sample, and care must be taken for samples with particle size distribution skewness.
Keywords: sieving, reliability, particle size distribution, processing parameters
Procedia PDF Downloads 613
25271 Expand Rabies Post-Exposure Prophylaxis to Where It Is Needed the Most
Authors: Henry Wilde, Thiravat Hemachudha
Abstract:
Human rabies deaths are underreported worldwide at 55,000 annual cases, more than those from dengue and Japanese encephalitis. Almost half are children. A recent study from the Philippines of nearly 2,000 rabies deaths revealed that the victims had received incomplete or no post-exposure prophylaxis. Coming from a canine rabies endemic country, this situation is not unique. There are two major barriers to reducing human rabies deaths: 1) the large number of unvaccinated dogs and 2) post-exposure prophylaxis (PEP) that is not available, incomplete, not affordable, or not within reach of bite victims' means of travel. Only the first barrier, inadequate vaccination of dogs, is now being seriously addressed. It is also often not done effectively or sustainably. Rabies PEP has evolved as a complex, prolonged process, usually delegated to centers in larger cities. It is virtually unavailable in villages or small communities where most dog bites occur and where victims are poor and usually unable to travel a long distance multiple times to receive PEP. Research that led to a better understanding of the pathophysiology of rabies and of immune responses to potent vaccines and immunoglobulin has allowed PEP to be shortened and made more evidence-based. This knowledge needs to be adopted and applied so that PEP can be rendered safely and affordably where it is needed the most: by village health care workers, who have long performed more complex services after appropriate training. Recent research makes this an important and long-neglected goal that is now within our means to implement.
Keywords: rabies, post-exposure prophylaxis, availability, immunoglobulin
Procedia PDF Downloads 264
25270 Electrical Cardiac Remodeling in Triathletes: A Comparative Study in Elite Male and Female Athletes
Authors: Lingxia Li, Frédéric Schnell, Thibault Lachard, Anne-Charlotte Dupont, Shuzhe Ding, Solène Le Douairon Lahaye
Abstract:
Background: Prolonged intensive endurance exercise is associated with cardiovascular adaptations in athletes. However, the sex differences in electrocardiographic (ECG) performance in triathletes are poorly understood. Methods: ECG results of male and female triathletes registered on the French ministerial lists of high-level athletes between 2015 and 2021 were included. The ECGs were evaluated according to commonly accepted criteria. Results: Eighty-six triathletes (50 male, 36 female) were included; the average age was 19.9 ± 4.8 years. The training volume was 21 ± 6 hours/week in males and 19 ± 6 hours/week in females (p > 0.05). Despite the relatively larger P wave (96.0 ± 12.0 vs. 89.9 ± 11.5 ms, p = 0.02) and longer QRS complex (96.6 ± 11.1 vs. 90.3 ± 8.6 ms, p = 0.005) in males than in females, all indicators were within normal ranges. The most common electrical manifestations were early repolarization (46.5%) and incomplete right bundle branch block (39.5%). No difference between sexes was found in electrical manifestations (p > 0.05). Conclusion: All ECG patterns were within normal limits under similar training volumes, but male triathletes were more susceptible to cardiovascular changes than females. The most common ECG manifestations in triathletes were early repolarization and incomplete right bundle branch block, with no disparity between males and females. Large samples involving both sexes are required.
Keywords: cardiovascular remodeling, electrocardiography, triathlon, elite athletes
Procedia PDF Downloads 6
25269 Gradient Boosted Trees on Spark Platform for Supervised Learning in Health Care Big Data
Authors: Gayathri Nagarajan, L. D. Dhinesh Babu
Abstract:
Health care is one of the prominent industries that generate voluminous data, thereby creating the need for machine learning techniques with big data solutions for efficient processing and prediction. Missing data, incomplete data, real-time streaming data, sensitive data, privacy, and heterogeneity are a few of the common challenges to be addressed for efficient processing and mining of health care data. In comparison with other applications, accuracy and fast processing are of higher importance for health care applications as they are directly related to human life. Though there are many machine learning techniques and big data solutions used for efficient processing and prediction in health care data, different techniques and different frameworks have proved to be effective for different applications, largely depending on the characteristics of the datasets. In this paper, we present a framework that uses the ensemble machine learning technique gradient boosted trees for data classification in health care big data. The framework is built on the Spark platform, which is fast in comparison with other traditional frameworks. Unlike other works that focus on a single technique, our work presents a comparison of six different machine learning techniques along with gradient boosted trees on datasets of different characteristics. Five benchmark health care datasets are considered for experimentation, and the results of the different machine learning techniques are discussed in comparison with gradient boosted trees. The metrics chosen for comparison are the misclassification error rate and the run time of the algorithms. The goal of this paper is to (i) compare the performance of gradient boosted trees with other machine learning techniques on the Spark platform specifically for health care big data and (ii) discuss the results from the experiments conducted on datasets of different characteristics, thereby drawing inferences and conclusions. The experimental results show that the accuracy is largely dependent on the characteristics of the datasets for the other machine learning techniques, whereas gradient boosted trees yield reasonably stable results in terms of accuracy without largely depending on the dataset characteristics.
Keywords: big data analytics, ensemble machine learning, gradient boosted trees, Spark platform
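A gradient boosted trees classifier on Spark, scored by misclassification error rate as above, can be sketched as follows. This is not the authors' full framework: the CSV path, the column names, and the train/test split are assumptions.

```python
# Minimal PySpark sketch: assemble features, train GBT, report 1 - accuracy.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("gbt-healthcare").getOrCreate()
df = spark.read.csv("health_data.csv", header=True, inferSchema=True)  # assumed file

features = [c for c in df.columns if c != "label"]       # assumed label column
assembled = VectorAssembler(inputCols=features, outputCol="features").transform(df)
train, test = assembled.randomSplit([0.8, 0.2], seed=42)

model = GBTClassifier(labelCol="label", featuresCol="features", maxIter=50).fit(train)
preds = model.transform(test)

accuracy = MulticlassClassificationEvaluator(
    labelCol="label", metricName="accuracy").evaluate(preds)
print("misclassification error rate:", 1.0 - accuracy)
```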
Procedia PDF Downloads 240
25268 Electrical Cardiac Remodeling in Elite Athletes: A Comparative Study between Triathletes and Cyclists
Authors: Lingxia Li, Frédéric Schnell, Thibault Lachard, Anne-Charlotte Dupont, Shuzhe Ding, Solène Le Douairon Lahaye
Abstract:
Background: Repetitive participation in triathlon training results in significant myocardial changes. However, whether the cardiac remodeling in triathletes is related to the specificities of the sport (consisting of three sports) raises questions. Methods: Elite triathletes and cyclists registered on the French ministerial lists of high-level athletes were included. Basic information and routine electrocardiogram records were obtained. Electrocardiograms were evaluated according to clinical criteria. Results: Of the 105 athletes included in the study, 42 were short-distance triathletes (40%) and 63 were road cyclists (60%). The average age was 22.1 ± 4.2 years. The P wave amplitude was significantly lower in triathletes than in cyclists (p = 0.005), and no significant statistical difference was found in heart rate, RR interval, PR or PQ interval, QRS complex, QRS axis, QT interval, or QTc (p > 0.05). All the measured parameters were within normal ranges. The most common electrical manifestations were early repolarization (60.95%) and incomplete right bundle branch block (43.81%); there was no statistical difference between the groups (p > 0.05). Conclusions: Prolonged intensive endurance exercise training induces physiological cardiac remodeling in both triathletes and cyclists. The most common electrocardiogram manifestations were early repolarization and incomplete right bundle branch block.
Keywords: cardiac screening, electrocardiogram, triathlon, cycling, elite athletes
Procedia PDF Downloads 6
25267 The Normal-Generalized Hyperbolic Secant Distribution: Properties and Applications
Authors: Hazem M. Al-Mofleh
Abstract:
In this paper, a new four-parameter univariate continuous distribution called the Normal-Generalized Hyperbolic Secant Distribution (NGHS) is defined and studied. Some general and structural distributional properties are investigated and discussed, including: central and non-central n-th moments and incomplete moments; quantile and generating functions; the hazard function; Rényi and Shannon entropies; shapes (skewed right, skewed left, and symmetric); modality regions (unimodal and bimodal); and maximum likelihood (MLE) estimators for the parameters. Finally, two real data sets are used to demonstrate empirically its flexibility and the strength of the new distribution.
Keywords: bimodality, estimation, hazard function, moments, Shannon’s entropy
Procedia PDF Downloads 348
25266 Molecular Alterations Shed Light on Alteration of Methionine Metabolism in Gastric Intestinal Metaplasia; Insight for Treatment Approach
Authors: Nigatu Tadesse, Ying Liu, Juan Li, Hong Ming Liu
Abstract:
Gastric carcinogenesis is a lengthy process of histopathological transition from normal mucosa to atrophic gastritis (AG), gastric intestinal metaplasia (GIM), and dysplasia toward gastric cancer (GC). The GIM stage is identified as a precancerous lesion with resistance to H. pylori eradication and recurrence after endoscopic surgical resection therapies. GIM is divided into two morphologically distinct phenotypes: complete GIM, bearing intestinal-type morphology, and the incomplete type, which has colonic-type morphology. Incomplete GIM is considered to be the greatest risk factor for the development of GC. Studies indicated that the expression of the caudal type homeobox 2 (CDX2) gene is responsible for the development of complete GIM, but its progressive downregulation from incomplete metaplasia toward advanced GC has been identified as a risk factor for IM progression and neoplastic transformation. The downregulation of the CDX2 gene has promoted cell growth and proliferation in gastric and colon cancers and has been implicated in chemo-treatment inefficacies. CDX2 is downregulated through promoter region hypermethylation, in which the methylation frequency positively correlates with the dietary history of the patients, suggesting the role of diet as a source of methyl carbon donors such as methionine. However, the metabolism of exogenous methionine is as yet unclear. Targeting exogenous methionine metabolism has become a promising approach to limit tumor cell growth, proliferation, and progression and to improve treatment outcomes. This review article discusses molecular alterations that could shed light on the potential of exogenous methionine metabolism, such as gut microbiota alteration as a source of methionine to host cells, metabolic pathway signaling via PI3K/AKT/mTORC1-c-MYC to rewire exogenous methionine, and signatures of increased gene methylation index, cell growth, and proliferation in GIM, with insights into new treatment avenues via targeting methionine metabolism, and the need for future integrated studies on molecular alterations and metabolomics to uncover altered methionine metabolism and to characterize CDX2 methylation in gastric intestinal metaplasia for potential therapeutic exploitation.
Keywords: altered methionine metabolism, intestinal metaplasia, CDX2 gene, gastric cancer
Procedia PDF Downloads 86
25265 Linear Frequency Modulation-Frequency Shift Keying Radar with Compressive Sensing
Authors: Ho Jeong Jin, Chang Won Seo, Choon Sik Cho, Bong Yong Choi, Kwang Kyun Na, Sang Rok Lee
Abstract:
In this paper, a radar signal processing technique using LFM-FSK (Linear Frequency Modulation-Frequency Shift Keying) is proposed for reducing the false alarm rate based on compressive sensing. The LFM-FSK method combines an FMCW (Frequency Modulation Continuous Wave) signal with FSK (Frequency Shift Keying). This offers the advantage of suppressing the ghost phenomenon without a complicated CFAR (Constant False Alarm Rate) algorithm. Moreover, a parametric sparse algorithm applying compressive sensing, which restores signals efficiently from incomplete data samples, is also integrated, reducing the burden on the ADC in the radar receiver. A 24 GHz FMCW signal with FSK-modulated data is applied and tested in a real environment to verify the proposed algorithm along with the compressive sensing.
Keywords: compressive sensing, LFM-FSK radar, radar signal processing, sparse algorithm
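The compressive-sensing step alone can be sketched as follows. This is an assumption, not the authors' parametric sparse algorithm: a sparse spectrum is recovered from a reduced number of random measurements with Orthogonal Matching Pursuit, and the sensing matrix and sizes are illustrative.

```python
# Minimal sketch: recover a sparse target spectrum from incomplete samples.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n_bins, n_targets, n_measurements = 256, 3, 64

x = np.zeros(n_bins)                          # sparse range/Doppler spectrum
x[rng.choice(n_bins, n_targets, replace=False)] = rng.uniform(1, 2, n_targets)

Phi = rng.standard_normal((n_measurements, n_bins)) / np.sqrt(n_measurements)
y = Phi @ x                                   # incomplete (compressed) samples

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_targets).fit(Phi, y)
print("true support:     ", np.flatnonzero(x))
print("recovered support:", np.flatnonzero(np.abs(omp.coef_) > 1e-6))
```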
Procedia PDF Downloads 482
25264 Marginalized Two-Part Joint Models for Generalized Gamma Family of Distributions
Authors: Mohadeseh Shojaei Shahrokhabadi, Ding-Geng (Din) Chen
Abstract:
Positive continuous outcomes with a substantial number of zero values and incomplete longitudinal follow-up are quite common in medical cost data. To jointly model semi-continuous longitudinal cost data and survival data and to provide marginalized covariate effect estimates, a marginalized two-part joint model (MTJM) has been developed for outcome variables with lognormal distributions. In this paper, we propose MTJM models for outcome variables from the generalized gamma (GG) family of distributions. The GG distribution constitutes a general family that includes nearly all of the most frequently used distributions, such as the gamma, exponential, Weibull, and log-normal. In the proposed MTJM-GG model, the conditional mean from a conventional two-part model with a three-parameter GG distribution is parameterized to provide the marginal interpretation for regression coefficients. In addition, MTJM-gamma and MTJM-Weibull are developed as special cases of MTJM-GG. To illustrate the applicability of the MTJM-GG, we applied the model to a set of real electronic health record data recently collected in Iran, and we provide SAS code for the application. The simulation results showed that when the outcome distribution is unknown or misspecified, which is usually the case in real data sets, the MTJM-GG consistently outperforms other models. The GG family of distributions facilitates estimating a model with improved fit over the MTJM-gamma, standard Weibull, or log-normal distributions.
Keywords: marginalized two-part model, zero-inflated, right-skewed, semi-continuous, generalized gamma
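The nesting of the gamma and Weibull inside the generalized gamma, which the abstract relies on, can be illustrated with a small Python sketch. This is not the authors' SAS/MTJM code; the simulated data and parameter values are assumptions, and the special cases are obtained by fixing one shape parameter.

```python
# Minimal sketch: fit a three-parameter generalized gamma and its special cases.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
costs = stats.gengamma.rvs(a=2.0, c=1.5, scale=100.0, size=1000, random_state=rng)

# full generalized gamma fit (location fixed at 0 for positive cost data)
a_hat, c_hat, _, scale_hat = stats.gengamma.fit(costs, floc=0)
print(f"GG fit:        a={a_hat:.2f} c={c_hat:.2f} scale={scale_hat:.1f}")

# special cases: c = 1 reduces to the gamma, a = 1 to the Weibull distribution
a_g, _, _, scale_g = stats.gengamma.fit(costs, fc=1, floc=0)
_, c_w, _, scale_w = stats.gengamma.fit(costs, fa=1, floc=0)
print(f"gamma case:    a={a_g:.2f} scale={scale_g:.1f}")
print(f"Weibull case:  c={c_w:.2f} scale={scale_w:.1f}")
```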
Procedia PDF Downloads 176
25263 Comparative Study of Estimators of Population Means in Two Phase Sampling in the Presence of Non-Response
Authors: Syed Ali Taqi, Muhammad Ismail
Abstract:
A comparative study of estimators of population means in two-phase sampling in the presence of non-response is made for the case of unknown population means of the auxiliary variable(s) and incomplete information on the study variable y as well as on the auxiliary variable(s). Three real data sets (university students, hospital, and unemployment) are used to compare all the available techniques in two-phase sampling in the presence of non-response with the newly generalized ratio estimators.
Keywords: two-phase sampling, ratio estimator, product estimator, generalized estimators
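For orientation, the classical two-phase (double sampling) ratio estimator that the generalized estimators extend can be sketched as follows. This is an assumption for illustration, not the paper's newly generalized estimators, and non-response is handled here simply by dropping missing units.

```python
# Minimal sketch: two-phase ratio estimator of a population mean.
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
x_pop = rng.gamma(shape=5.0, scale=2.0, size=N)       # auxiliary variable
y_pop = 3.0 * x_pop + rng.normal(0, 2, size=N)        # study variable

# first phase: only x is observed on a large sample (auxiliary mean unknown)
first = rng.choice(N, size=1000, replace=False)
xbar_first = x_pop[first].mean()

# second phase: both y and x observed on a subsample of the first-phase units
second = rng.choice(first, size=200, replace=False)
ybar, xbar = y_pop[second].mean(), x_pop[second].mean()

y_ratio = ybar * xbar_first / xbar                    # two-phase ratio estimator
print("ratio estimate:", round(y_ratio, 3), " true mean:", round(y_pop.mean(), 3))
```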
Procedia PDF Downloads 233
25262 Monte Carlo Methods and Statistical Inference of Multitype Branching Processes
Authors: Ana Staneva, Vessela Stoimenova
Abstract:
A parametric estimation of the multitype branching process (MBP) with a power series offspring distribution family is considered in this paper. The MLE for the parameters is obtained in the case when the observable data are incomplete and consist only of the generation sizes of the family tree of the MBP. The parameter estimation is calculated by using the Monte Carlo EM algorithm. The estimates of the posterior distribution and of the offspring distribution parameters are calculated by using the Bayesian approach and the Gibbs sampler. The article proposes various examples with bivariate branching processes, together with computational results, simulations, and an implementation using R.
Keywords: Bayesian, branching processes, EM algorithm, Gibbs sampler, Monte Carlo methods, statistical estimation
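The kind of incomplete data described above, generation sizes only, can be generated with a short simulation. This is an assumption for illustration, not the authors' R implementation: a two-type process with Poisson offspring whose mean matrix is chosen arbitrarily.

```python
# Minimal sketch: simulate generation sizes of a bivariate branching process.
import numpy as np

rng = np.random.default_rng(0)
# mean offspring matrix: m[i, j] = expected type-j children of a type-i parent
m = np.array([[0.6, 0.5],
              [0.4, 0.7]])

def simulate(generations=10, start=(5, 5)):
    sizes = [np.array(start)]
    for _ in range(generations):
        z = sizes[-1]
        nxt = np.zeros(2, dtype=int)
        for parent_type in range(2):
            # each parent of this type produces Poisson offspring of both types
            nxt += rng.poisson(m[parent_type], size=(z[parent_type], 2)).sum(axis=0)
        sizes.append(nxt)
    return np.array(sizes)

print(simulate())   # rows = generations, columns = counts of each type
```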
Procedia PDF Downloads 421
25261 Retail Strategy to Reduce Waste Keeping High Profit Utilizing Taylor's Law in Point-of-Sales Data
Authors: Gen Sakoda, Hideki Takayasu, Misako Takayasu
Abstract:
Waste reduction is a fundamental problem for sustainability. Methods for waste reduction with point-of-sales (POS) data are proposed, utilizing the knowledge of a recent econophysics study on a statistical property of POS data. Concretely, a non-stationary time series analysis method based on the particle filter is developed, which accounts for the anomalous fluctuation scaling known as Taylor's law. This method is extended to handle sales data that are incomplete because of stock-outs by introducing maximum likelihood estimation for censored data. A procedure for optimal stock determination that prices the cost of waste reduction is also proposed. This study focuses on the examination of the methods for large sales numbers, where Taylor's law is obvious. Numerical analysis using aggregated POS data shows the effectiveness of the methods in reducing food waste while maintaining a high profit for large sales numbers. Moreover, the pricing of the cost of waste reduction reveals that a small profit loss realizes substantial waste reduction, especially when the proportionality constant of Taylor's law is small. Specifically, around 1% profit loss realizes half the disposal at a proportionality constant of 0.12, which is the actual value for the processed food items used in this research. The methods provide practical and effective solutions for waste reduction keeping a high profit, especially with large sales numbers.
Keywords: food waste reduction, particle filter, point-of-sales, sustainable development goals, Taylor's law, time series analysis
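As a first step before any modelling, the Taylor's-law fluctuation scaling referred to above can be estimated from per-item sales counts. The sketch below is an assumption for illustration, not the authors' particle-filter method, and the synthetic demand model is arbitrary; it fits sigma ≈ c * mu^alpha on a log-log scale.

```python
# Minimal sketch: estimate Taylor's-law scaling from item-level daily sales.
import numpy as np

rng = np.random.default_rng(0)
n_items, n_days = 200, 365
base = rng.uniform(5, 500, size=n_items)                  # item-level mean demand
sales = rng.poisson(base[:, None] * rng.gamma(20, 1 / 20, size=(n_items, n_days)))

mu = sales.mean(axis=1)
sigma = sales.std(axis=1)

# least-squares fit of log(sigma) = log(c) + alpha * log(mu)
alpha, log_c = np.polyfit(np.log(mu), np.log(sigma), 1)
print(f"alpha = {alpha:.2f}, proportionality constant c = {np.exp(log_c):.2f}")
```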
Procedia PDF Downloads 131
25260 3D Model Completion Based on Similarity Search with Slim-Tree
Authors: Alexis Aldo Mendoza Villarroel, Ademir Clemente Villena Zevallos, Cristian Jose Lopez Del Alamo
Abstract:
With the advancement of technology, it is now possible to scan entire objects and obtain their digital representation by using point clouds or polygon meshes. However, some objects may be broken or have missing parts; thus, several methods focused on this problem have been proposed based on Geometric Deep Learning, such as GCNN, ACNN, and PointNet, among others. In this article, an approach from a different paradigm is proposed, using metric data structures to index global descriptors in the spectral domain and allow the recovery of a set of similar models in polynomial time; the Iterative Closest Point algorithm is then used to recover the parts of the incomplete model using the geometry and topology of the model with the smallest Hausdorff distance.
Keywords: 3D reconstruction method, point cloud completion, shape completion, similarity search
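The Iterative Closest Point alignment step can be sketched with a KD-tree and the SVD (Kabsch) rigid fit. This is an assumption-laden illustration, not the authors' Slim-Tree pipeline, and the toy clouds are synthetic.

```python
# Minimal ICP sketch: align a source cloud onto a target cloud.
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t mapping src onto dst."""
    cs, cd = src.mean(0), dst.mean(0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cd - R @ cs

def icp(source, target, iterations=30):
    tree = cKDTree(target)
    src = source.copy()
    for _ in range(iterations):
        _, idx = tree.query(src)         # closest target point for each source point
        R, t = best_rigid_transform(src, target[idx])
        src = src @ R.T + t
    return src

rng = np.random.default_rng(0)
target = rng.normal(size=(500, 3))
angle = np.pi / 8
Rz = np.array([[np.cos(angle), -np.sin(angle), 0],
               [np.sin(angle),  np.cos(angle), 0],
               [0, 0, 1]])
source = target @ Rz.T + np.array([0.3, -0.2, 0.1])   # rotated, shifted copy
aligned = icp(source, target)
print("mean residual after ICP:", np.linalg.norm(aligned - target, axis=1).mean())
```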
Procedia PDF Downloads 122
25259 Interrelationship between Quadriceps' Activation and Inhibition as a Function of Knee-Joint Angle and Muscle Length: A Torque and Electro and Mechanomyographic Investigation
Authors: Ronald Croce, Timothy Quinn, John Miller
Abstract:
Incomplete activation, or activation failure, of motor units during maximal voluntary contractions is often referred to as muscle inhibition (MI), and is defined as the inability of the central nervous system to maximally drive a muscle during a voluntary contraction. The purpose of the present study was to assess the interrelationship amongst peak torque (PT), muscle inhibition (MI; incomplete activation of motor units), and voluntary muscle activation (VMA) of the quadriceps muscle group as a function of knee angle and muscle length during maximal voluntary isometric contractions (MVICs). Nine young adult males (mean ± standard deviation: age: 21.58 ± 1.30 years; height: 180.07 ± 4.99 cm; weight: 89.07 ± 7.55 kg) performed MVICs in random order with the knee at 15, 55, and 95° flexion. MI was assessed using the interpolated twitch technique and was estimated by the amount of additional knee extensor PT evoked by the superimposed twitch during MVICs. Voluntary muscle activation was estimated by root mean square amplitude electromyography (EMGrms) and mechanomyography (MMGrms) of agonist (vastus medialis [VM], vastus lateralis [VL], and rectus femoris [RF]) and antagonist (biceps femoris [BF]) muscles during MVICs. Data were analyzed using separate repeated measures analyses of variance. Results revealed a strong dependency of quadriceps PT (p < 0.001), MI (p < 0.001) and MA (p < 0.01) on knee joint position: PT was smallest at the most shortened muscle position (15°) and greatest at mid-position (55°); MI and MA were smallest at the most shortened muscle position (15°) and greatest at the most lengthened position (95°), with the RF showing the greatest change in MA. It is hypothesized that the ability to more fully activate the quadriceps at short compared to longer muscle lengths (96% contracted at 15°; 91% at 55°; 90% at 95°) might partly compensate for the unfavorable force-length mechanics at the more extended position and the consequent declines in VMA (decreases in EMGrms and MMGrms muscle amplitude during MVICs) and force production (PT = 111 Nm at 15°, 217 Nm at 55°, 199 Nm at 95°). Biceps femoris EMG and MMG data showed no statistical differences (p = 0.11 and 0.12, respectively) at the joint angles tested, although there were greater values at the extended position. Increased BF muscle amplitude at this position could be a mechanism by which anterior shear and tibial rotation induced by high quadriceps activity are countered. Measuring and understanding the degree to which one sees MI and VMA in the quadriceps muscle has particular clinical relevance because different knee-joint disorders, such as ligament injuries or osteoarthritis, increase the levels of MI observed and markedly reduce the capability for full VMA.
Keywords: electromyography, interpolated twitch technique, mechanomyography, muscle activation, muscle inhibition
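The abstract reports activation percentages but not the formulas; the standard interpolated-twitch estimates, given here as an assumption about the computation rather than the authors' exact method, are:

```latex
% Standard interpolated-twitch estimates (assumed formulas, for orientation):
\[
  \mathrm{VA}\,(\%) \;=\; \left(1 - \frac{T_{\text{superimposed twitch}}}{T_{\text{resting twitch}}}\right)\times 100,
  \qquad
  \mathrm{MI}\,(\%) \;=\; 100 - \mathrm{VA}\,(\%) .
\]
% For example, a 4 Nm superimposed twitch on a 100 Nm resting twitch gives
% VA = 96% and MI = 4%, consistent with the ~96% activation reported at 15°.
```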
Procedia PDF Downloads 347
25258 Comparing the Apparent Error Rate of Gender Specifying from Human Skeletal Remains by Using Classification and Cluster Methods
Authors: Jularat Chumnaul
Abstract:
In forensic science, corpses from various homicides differ; they may be complete or incomplete, depending on the cause of death or form of homicide. For example, some corpses are cut into pieces, some are camouflaged by dumping into a river, some are buried, some are burned to destroy the evidence, and so on. If the corpses are incomplete, personal identification can be difficult because some tissues and bones are destroyed. To specify the gender of corpses from skeletal remains, the most precise method is DNA identification. However, this method is costly and takes longer, so other identification techniques are used instead. The first technique that is widely used is considering the features of the bones. In general, evidence from the corpses, such as some pieces of bone, especially the skull and pelvis, can be used to identify their gender. To use this technique, forensic scientists require observation skills in order to classify the difference between male and female bones. Although this technique is uncomplicated, saving time and cost, and forensic scientists can fairly accurately determine gender by using it (apparently an accuracy rate of 90% or more), the crucial disadvantage is that only some positions of the skeleton can be used to specify gender, such as the supraorbital ridge, nuchal crest, temporal lobe, mandible, and chin. Therefore, the skeletal remains that will be used have to be complete. The other technique that is widely used for gender specification in forensic science and archeology is skeletal measurement. The advantage of this method is that it can be used at several positions on one piece of bone, and it can be used even if the bones are not complete. In this study, classification and cluster analysis are applied to this technique, including the Kth Nearest Neighbor classification, Classification Tree, Ward Linkage Cluster, K-mean Cluster, and Two Step Cluster. The data contain 507 individuals and 9 skeletal measurements (diameter measurements), and the performance of the five methods is investigated by considering the apparent error rate (APER). The results from this study indicate that the Two Step Cluster and Kth Nearest Neighbor methods seem to be suitable for specifying gender from human skeletal remains because both yield small apparent error rates of 0.20% and 4.14%, respectively. On the other hand, the Classification Tree, Ward Linkage Cluster, and K-mean Cluster methods are not appropriate since they yield large apparent error rates of 10.65%, 10.65%, and 16.37%, respectively. However, there are other ways to evaluate the performance of classification, such as an estimate of the error rate using the holdout procedure or misclassification costs, and different methods can lead to different conclusions.
Keywords: skeletal measurements, classification, cluster, apparent error rate
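The apparent error rate used above is simply the resubstitution error: the classifier is scored on the same data it was fitted on. A minimal sketch with a Kth nearest neighbor classifier follows; the synthetic measurements and the value of k are assumptions, not the study's data or settings.

```python
# Minimal sketch: KNN on skeletal diameter measurements, scored by APER.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 507
sex = rng.integers(0, 2, size=n)                       # 0 = female, 1 = male
# nine synthetic diameter measurements with a sex-related shift
X = rng.normal(loc=30 + 3 * sex[:, None], scale=2.5, size=(n, 9))

Xs = StandardScaler().fit_transform(X)
knn = KNeighborsClassifier(n_neighbors=5).fit(Xs, sex)

# apparent error rate: misclassification on the same data used for fitting
aper = np.mean(knn.predict(Xs) != sex)
print(f"APER = {aper * 100:.2f}%")
```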
Procedia PDF Downloads 252
25257 Structural Invertibility and Optimal Sensor Node Placement for Error and Input Reconstruction in Dynamic Systems
Authors: Maik Kschischo, Dominik Kahl, Philipp Wendland, Andreas Weber
Abstract:
Understanding and modelling of real-world complex dynamic systems in biology, engineering and other fields is often made difficult by incomplete knowledge about the interactions between system states and by unknown disturbances to the system. In fact, most real-world dynamic networks are open systems receiving unknown inputs from their environment. To understand a system and to estimate the state dynamics, these inputs need to be reconstructed from output measurements. Reconstructing the input of a dynamic system from its measured outputs is an ill-posed problem if only a limited number of states is directly measurable. A first requirement for solving this problem is the invertibility of the input-output map. In our work, we exploit the fact that invertibility of a dynamic system is a structural property, which depends only on the network topology. Therefore, it is possible to check for invertibility using a structural invertibility algorithm which counts the number of node-disjoint paths linking inputs and outputs. The algorithm is efficient enough, even for large networks of up to a million nodes. To understand structural features influencing the invertibility of a complex dynamic network, we analyze synthetic and real networks using the structural invertibility algorithm. We find that invertibility largely depends on the degree distribution and that dense random networks are easier to invert than sparse inhomogeneous networks. We show that real networks are often very difficult to invert unless the sensor nodes are carefully chosen. To overcome this problem, we present a sensor node placement algorithm to achieve invertibility with a minimum set of measured states. This greedy algorithm is very fast and also guaranteed to find an optimal sensor node set if it exists. Our results provide a practical approach to experimental design for open, dynamic systems. Since invertibility is a necessary condition for unknown input observers and data assimilation filters to work, it can be used as a preprocessing step to check whether these input reconstruction algorithms can be successful. If not, we can suggest additional measurements providing sufficient information for input reconstruction. Invertibility is also important for systems design and model building. Dynamic models are always incomplete, and synthetic systems act in an environment where they receive inputs or even attack signals from their exterior. Being able to monitor these inputs is an important design requirement, which can be achieved by our algorithms for invertibility analysis and sensor node placement.
Keywords: data-driven dynamic systems, inversion of dynamic systems, observability, experimental design, sensor node placement
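The path-counting check described above can be sketched with NetworkX. This is a hedged illustration, not the authors' implementation: the toy graph, the node names, and the super-source/super-sink construction used to count node-disjoint paths from all inputs to all sensors are assumptions, and the check simply compares the count to the number of unknown inputs.

```python
# Minimal sketch: count vertex-disjoint input-to-sensor paths in a toy network.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([("u1", "a"), ("u2", "b"), ("a", "c"), ("b", "c"),
                  ("c", "y1"), ("b", "y2")])
inputs, sensors = ["u1", "u2"], ["y1", "y2"]

H = G.copy()
H.add_edges_from(("SRC", u) for u in inputs)     # super-source feeds all inputs
H.add_edges_from((y, "SNK") for y in sensors)    # all sensors feed super-sink

paths = list(nx.node_disjoint_paths(H, "SRC", "SNK"))
print("disjoint input-output paths:", len(paths))
print("structurally invertible:", len(paths) >= len(inputs))
```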
Procedia PDF Downloads 150
25256 Cleaning of Scientific References in Large Patent Databases Using Rule-Based Scoring and Clustering
Authors: Emiel Caron
Abstract:
Patent databases contain patent-related data, organized in a relational data model, and are used to produce various patent statistics. These databases store raw data about scientific references cited by patents. For example, Patstat holds references to tens of millions of scientific journal publications and conference proceedings. These references might be used to connect patent databases with bibliographic databases, e.g., to study the relation between science, technology, and innovation in various domains. Problematic in such studies is the low data quality of the references, i.e., they are often ambiguous, unstructured, and incomplete. Moreover, a complete bibliographic reference is stored in only one attribute. Therefore, a computerized cleaning and disambiguation method for large patent databases is developed in this work. The method uses rule-based scoring and clustering. The rules are based on bibliographic metadata, retrieved from the raw data by regular expressions, and are transparent and adaptable. The rules, in combination with string similarity measures, are used to detect pairs of records that are potential duplicates. Due to the scoring, different rules can be combined to join scientific references, i.e., the rules reinforce each other. The scores are based on expert knowledge and an initial method evaluation. After the scoring, pairs of scientific references that score above a certain threshold are clustered by means of a single-linkage clustering algorithm to form connected components. The method is designed to disambiguate all the scientific references in the Patstat database. The performance evaluation of the clustering method, on a large golden set with highly cited papers, shows on average a 99% precision and a 95% recall. The method is therefore accurate but careful, i.e., it weighs precision over recall. Consequently, separate clusters of high precision are sometimes formed when there is not enough evidence for connecting scientific references, e.g., in the case of missing year and journal information for a reference. The clusters produced by the method can be used to directly link the Patstat database with bibliographic databases such as the Web of Science or Scopus.
Keywords: clustering, data cleaning, data disambiguation, data mining, patent analysis, scientometrics
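The rule-based scoring followed by single-linkage grouping can be sketched as follows. This is an assumption for illustration, not the Patstat production code: the two rules (regex-extracted year match, fuzzy string similarity), their weights, and the threshold are invented, and connected components stand in for single-linkage clusters.

```python
# Minimal sketch: score reference pairs with simple rules, then cluster.
import re
from difflib import SequenceMatcher
from itertools import combinations
import networkx as nx

refs = [
    "Smith J., Gene expression atlas, Nature Genetics, 2004, 36(5):431-9",
    "J Smith, A gene expression atlas. Nat Genet 36 (2004) 431",
    "Doe A, Patent citation networks, Scientometrics, 2011",
]

def year(text):
    m = re.search(r"\b(19|20)\d{2}\b", text)
    return m.group(0) if m else None

def score(a, b):
    s = SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100   # fuzzy rule
    if year(a) and year(a) == year(b):
        s += 20                                                     # year rule
    return s

G = nx.Graph()
G.add_nodes_from(range(len(refs)))
for i, j in combinations(range(len(refs)), 2):
    if score(refs[i], refs[j]) >= 75:          # illustrative threshold
        G.add_edge(i, j)

for cluster in nx.connected_components(G):     # single-linkage components
    print(sorted(cluster))
```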
Procedia PDF Downloads 194
25255 Combined Analysis of Sudoku Square Designs with Same Treatments
Authors: A. Danbaba
Abstract:
Several experiments are conducted in different environments, such as locations or periods (seasons), with identical treatments in each experiment, purposely to study the interaction between the treatments and environments or between the treatments and periods (seasons). The commonly used designs of experiments for this purpose are the randomized block design, Latin square design, balanced incomplete block design, Youden design, and designs with one or more factors. The interest is to carry out a combined analysis of the data from these multi-environment experiments, instead of analyzing each experiment separately. This paper proposes a combined analysis of experiments conducted via Sudoku square designs of odd order with the same experimental treatments.
Keywords: combined analysis, sudoku design, common treatment, multi-environment experiments
Procedia PDF Downloads 345
25254 Challenges of Design, Cost and Surveying in Dams
Authors: Ali Mohammadi
Abstract:
The construction of embankment dams is considered one of the most challenging types of construction project, for which several main reasons can be mentioned. Excavation and embankment must be done over a large area, and the design is based on preliminary studies, but at the time of construction, it is possible that the excavation does not match the stability or slope of the rock, or that the design is incomplete, and corrections must be made in order to be able to carry out the excavation and embankment. Also, the progress of the work depends on several main factors, the lack of any of which can slow down the construction of the dams and lead to an increase in costs; the control of excavations and embankments and the calculation of their volumes are addressed in this work. In the following, we investigate three embankment dams in Iran that faced these challenges and how they overcame them. KHODA AFARIN on the Aras River between the two countries of IRAN and AZERBAIJAN, SIAH BISHEH PUMPED STORAGE on the CHALUS River, and GOTVAND on the KARUN River are among the most important dams built in Iran.
Keywords: section, data transfer, tunnel, free station
Procedia PDF Downloads 73
25253 Heat Transfer and Diffusion Modelling
Authors: R. Whalley
Abstract:
The heat transfer modelling for a diffusion process will be considered. Difficulties in computing the time-distance dynamics of the representation will be addressed. Incomplete and irrational Laplace functions will be identified as the computational issue. Alternative approaches to the response evaluation process will be provided. An illustrative application problem will be presented. Graphical results confirming the theoretical procedures employed will be provided.
Keywords: heat, transfer, diffusion, modelling, computation
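The abstract does not state the governing equations; as a standard textbook illustration of where an irrational Laplace-domain function arises in diffusion-type heat transfer (an assumed configuration, not necessarily the author's), consider conduction into a semi-infinite solid with a step surface temperature:

```latex
% Assumed 1-D example showing the irrational Laplace-domain function:
\[
  \frac{\partial T}{\partial t} = \alpha\,\frac{\partial^2 T}{\partial x^2},
  \qquad T(x,0)=0, \quad T(0,t)=T_0 .
\]
% The Laplace transform in time gives an ordinary differential equation whose
% bounded solution is irrational in the transform variable s:
\[
  s\,\bar T = \alpha\,\frac{d^2 \bar T}{dx^2}
  \;\;\Longrightarrow\;\;
  \bar T(x,s) = \frac{T_0}{s}\, e^{-x\sqrt{s/\alpha}} .
\]
% The sqrt(s) factor is the kind of irrational Laplace function whose exact
% time-domain inversion is awkward, motivating approximate response evaluation.
```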
Procedia PDF Downloads 553
25252 Firm-Created Social Media Communication and Consumer Brand Perceptions
Authors: Rabail Khalid
Abstract:
Social media has changed business communication strategies in the corporate world. Firms are using social media to reach the maximum number of stakeholders in minimum time across different social media forums. The current study examines the role of firm-created social media communication in consumer brand perceptions and loyalty to the brand. An online survey is conducted through social media forums, including Facebook and Twitter, to collect data regarding the social media communication of a well-reputed clothing company's brand in Pakistan. A link is sent to 900 customers of that company. Out of 900 questionnaires, 534 were received, so the response rate is 59.33%. During data screening and entry, 13 questionnaires are rejected due to incomplete answers. Therefore, 521 questionnaires are complete in all respects and usable for the study, so the effective response rate is 57.89%. The empirical results report a positive and significant influence of company-generated social media communication on brand trust, brand equity, and brand loyalty. The findings of this study provide important information to marketing professionals and brand managers to understand consumer behavior through social media communication.
Keywords: firm-created social media communication, brand trust, brand equity, consumer behavior, brand loyalty
Procedia PDF Downloads 386
25251 The Data Quality Model for the IoT based Real-time Water Quality Monitoring Sensors
Authors: Rabbia Idrees, Ananda Maiti, Saurabh Garg, Muhammad Bilal Amin
Abstract:
IoT devices are the basic building blocks of an IoT network and generate an enormous volume of real-time, high-speed data to help organizations and companies take intelligent decisions. Integrating this enormous amount of data from multiple sources and transferring it to the appropriate client is fundamental to IoT development. The handling of this huge quantity of devices, along with the huge volume of data, is very challenging. IoT devices are battery-powered and resource-constrained, and to provide energy-efficient communication, they go to sleep or wake up periodically and aperiodically depending on the traffic load to reduce energy consumption. Sometimes these devices get disconnected due to battery depletion. If a node is not available in the network, the IoT network provides incomplete, missing, and inaccurate data. Moreover, many IoT applications, like vehicle tracking and patient tracking, require the IoT devices to be mobile. Due to this mobility, if the distance of the device from the sink node becomes greater than required, the connection is lost. Due to such disconnections, other devices join the network to replace the broken-down and departed devices. This makes IoT devices dynamic in nature, which brings uncertainty and unreliability into the IoT network and hence produces poor-quality data. Due to this dynamic nature of IoT devices, we do not know the actual reason for abnormal data. If data are of poor quality, decisions are likely to be unsound. It is highly important to process data and estimate data quality before bringing it to use in IoT applications. In the past, many researchers tried to estimate data quality and provided several machine learning (ML), stochastic, and statistical methods to perform analysis on stored data in the data processing layer, without focusing on the challenges and issues arising from the dynamic nature of IoT devices and how they impact data quality. A comprehensive review on determining the impact of the dynamic nature of IoT devices on data quality is done in this research, and a data quality model is presented that can deal with this challenge and produce good-quality data. This research presents the data quality model for sensors monitoring water quality. DBSCAN clustering and weather sensors are used in this research to build the data quality model for the sensors monitoring water quality. An extensive study has been done in this research on finding the relationship between the data of weather sensors and of sensors monitoring the water quality of lakes and beaches. A detailed theoretical analysis has been presented in this research describing the correlation between the independent data streams of the two sets of sensors. With the help of this analysis and DBSCAN, a data quality model is prepared. This model encompasses five dimensions of data quality: outlier detection and removal, completeness, patterns of missing values, and checking the accuracy of the data with the help of the clusters' positions. At the end, statistical analysis has been done on the clusters formed as the result of DBSCAN, and consistency is evaluated through the coefficient of variation (CoV).
Keywords: clustering, data quality, DBSCAN, and Internet of things (IoT)
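The DBSCAN-plus-CoV idea above can be sketched in a few lines. This is not the paper's model: the synthetic sensor variables, the eps and min_samples settings, and the injected fault pattern are all assumptions.

```python
# Minimal sketch: flag outliers with DBSCAN, summarize clusters with CoV.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# columns: air temperature, rainfall, water temperature, turbidity (synthetic)
normal = rng.normal(loc=[20, 2, 18, 5], scale=[3, 1, 2, 1], size=(500, 4))
faulty = rng.normal(loc=[20, 2, 45, 40], scale=[3, 1, 5, 10], size=(15, 4))
readings = np.vstack([normal, faulty])

labels = DBSCAN(eps=0.8, min_samples=10).fit_predict(
    StandardScaler().fit_transform(readings))

print("flagged outliers:", np.sum(labels == -1))
for k in sorted(set(labels) - {-1}):
    cluster = readings[labels == k]
    cov = cluster.std(axis=0) / cluster.mean(axis=0)   # coefficient of variation
    print(f"cluster {k}: size={len(cluster)}, CoV per variable={np.round(cov, 2)}")
```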
Procedia PDF Downloads 139
25250 Exploring Time-Series Phosphoproteomic Datasets in the Context of Network Models
Authors: Sandeep Kaur, Jenny Vuong, Marcel Julliard, Sean O'Donoghue
Abstract:
Time-series data are useful for modelling as they enable model evaluation. However, when reconstructing models from phosphoproteomic data, non-exact methods are often utilised, as the knowledge regarding the network structure, such as which kinases and phosphatases lead to the observed phosphorylation state, is incomplete. Thus, such reactions are often hypothesised, which gives rise to uncertainty. Here, we propose a framework, implemented via a web-based tool (as an extension to Minardo), which, given time-series phosphoproteomic datasets, can generate κ models. The incompleteness and uncertainty in the generated model and reactions are clearly presented to the user via the visual method. Furthermore, we demonstrate, via a toy EGF signalling model, the use of algorithmic verification to verify κ models. Manually formulated requirements were evaluated with regard to the model, leading to the highlighting of the nodes causing unsatisfiability (i.e., error-causing nodes). We aim to integrate such methods into our web-based tool and demonstrate how the identified erroneous nodes can be presented to the user via the visual method. Thus, in this research we present a framework to enable a user to explore phosphorylation proteomic time-series data in the context of models. The observer can visualise which reactions in the model are highly uncertain and which nodes cause incorrect simulation outputs. A tool such as this enables an end-user to determine the empirical analysis to perform in order to reduce uncertainty in the presented model, thus enabling a better understanding of the underlying system.
Keywords: κ-models, model verification, time-series phosphoproteomic datasets, uncertainty and error visualisation
Procedia PDF Downloads 255
25249 Maximum Likelihood Estimation Methods on a Two-Parameter Rayleigh Distribution under Progressive Type-II Censoring
Authors: Daniel Fundi Murithi
Abstract:
Data from economic, social, clinical, and industrial studies are in some way incomplete or incorrect due to censoring. Such data may have adverse effects if used in the estimation problem. We propose the use of maximum likelihood estimation (MLE) under a progressive type-II censoring scheme to remedy this problem. In particular, maximum likelihood estimates (MLEs) for the location (µ) and scale (λ) parameters of the two-parameter Rayleigh distribution are obtained under a progressive type-II censoring scheme using the Expectation-Maximization (EM) and Newton-Raphson (NR) algorithms. These algorithms are used comparatively because they iteratively produce satisfactory results in the estimation problem. The progressive type-II censoring scheme is used because it allows the removal of test units before the termination of the experiment. Approximate asymptotic variances and confidence intervals for the location and scale parameters are derived and constructed. The efficiency of the EM and NR algorithms is compared in terms of root mean squared error (RMSE), bias, and coverage rate. The simulation study showed that in most sets of simulation cases, the estimates obtained using the Expectation-Maximization algorithm had smaller biases, smaller variances, narrower confidence interval widths, and smaller root mean squared errors compared to those generated via the Newton-Raphson (NR) algorithm. Further, the analysis of a real-life data set (data from simple experimental trials) showed that the Expectation-Maximization (EM) algorithm performs better than the Newton-Raphson (NR) algorithm in all simulation cases under the progressive type-II censoring scheme.
Keywords: expectation-maximization algorithm, maximum likelihood estimation, Newton-Raphson method, two-parameter Rayleigh distribution, progressive type-II censoring
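The Newton-Raphson side of the comparison can be illustrated on a deliberately simplified problem: ordinary type-II right censoring (no progressive removals) with the location fixed at zero, so only the Rayleigh scale is estimated. Those simplifications, the sample sizes, and the starting value are assumptions, not the paper's setup.

```python
# Minimal sketch: Newton-Raphson MLE of the Rayleigh scale under type-II censoring.
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 35                                   # n units on test, stop at m-th failure
x = np.sort(rng.rayleigh(scale=2.0, size=n))[:m]
censor_time = x[-1]                             # remaining n - m units censored here

S = np.sum(x**2) + (n - m) * censor_time**2     # total "time on test" in x^2 units

def dlogl(sigma):        # first derivative of the censored Rayleigh log-likelihood
    return -2 * m / sigma + S / sigma**3

def d2logl(sigma):       # second derivative (used as the Newton-Raphson denominator)
    return 2 * m / sigma**2 - 3 * S / sigma**4

sigma = 1.0                                     # starting value
for _ in range(25):
    sigma -= dlogl(sigma) / d2logl(sigma)

print("Newton-Raphson MLE:", round(sigma, 4))
print("closed-form check: ", round(np.sqrt(S / (2 * m)), 4))
```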
Procedia PDF Downloads 163
25248 Data Transformations in Data Envelopment Analysis
Authors: Mansour Mohammadpour
Abstract:
Data transformation refers to the modification of any point in a data set by a mathematical function. When applying transformations, the measurement scale of the data is modified. Data transformations are commonly employed to turn data into the appropriate form, which can serve various functions in the quantitative analysis of the data. This study investigates the use of data transformations in Data Envelopment Analysis (DEA). Although data transformations are important options for analysis, they fundamentally alter the nature of the variable, making the interpretation of the results somewhat more complex.
Keywords: data transformation, data envelopment analysis, undesirable data, negative data
Procedia PDF Downloads 20
25247 Hydrology and Hydraulics Analysis of Aremenie Earthen Dam, Ethiopia
Authors: Azazhu Wassie
Abstract:
This study analyzed the impact of the hydrologic and hydraulic parameters (catchment area, rainfall intensity, and runoff coefficient) on the referenced study area. The study was conducted in June 2023. The Aremenie River Dam has 30 years of record, which is reasonably sufficient data. It is a matter of common experience that, due to the failure of an instrument or the absence of a gauged river, the rainfall record at quite a number of stations is incomplete. From the analysis, the 50-year return period design flood is 62.685 m³/s at a peak time of 1.2 hr. This implies that, for this watershed, the peak flood rate per km² of watershed area is about this value, which indicates that high rainfall in the area can generate a high rate of runoff per km² of the generating catchment. The Aremenie River carries a large amount of sediment along with the water. These sediments are deposited in the reservoir upstream of the dam because of the reduction in velocity. Sediment reduces the available capacity of the reservoir; with continuous sedimentation, the useful life of the reservoir goes on decreasing.
Keywords: dam design, peak flood, rainfall, reservoir capacity, runoff
Procedia PDF Downloads 33
25246 Evidence Theory Based Emergency Multi-Attribute Group Decision-Making: Application in Facility Location Problem
Authors: Bidzina Matsaberidze
Abstract:
It is known that, in emergency situations, multi-attribute group decision-making (MAGDM) models are characterized by insufficient objective data and a lack of time to respond to the task. Evidence theory is an effective tool for describing such incomplete information in decision-making models when the expert and his knowledge are involved in the estimation of the MAGDM parameters. We consider an emergency decision-making model, where expert assessments on humanitarian aid from distribution centers (HADC) are represented in q-rung orthopair fuzzy numbers, and the data structure is described within the data body theory. Based on focal probability construction and experts' evaluations, an objective function, a distribution centers' selection ranking index, is constructed. Our approach for solving the constructed bicriteria partitioning problem consists of two phases. In the first phase, based on the covering matrix, we generate a matrix whose columns allow us to find all possible partitionings of the HADCs with the service centers. Some constraints are also taken into consideration while generating the matrix. In the second phase, based on the matrix and using our exact algorithm, we find the partitionings (allocations of the HADCs to the centers) which correspond to the Pareto-optimal solutions. For an illustration of the obtained results, a numerical example is given for the facility location-selection problem.
Keywords: emergency MAGDM, q-rung orthopair fuzzy sets, evidence theory, HADC, facility location problem, multi-objective combinatorial optimization problem, Pareto-optimal solutions
Procedia PDF Downloads 92