Search results for: categorical datasets
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 809

Search results for: categorical datasets

779 A Web-Based Systems Immunology Toolkit Allowing the Visualization and Comparative Analysis of Publically Available Collective Data to Decipher Immune Regulation in Early Life

Authors: Mahbuba Rahman, Sabri Boughorbel, Scott Presnell, Charlie Quinn, Darawan Rinchai, Damien Chaussabel, Nico Marr

Abstract:

Collections of large-scale datasets made available in public repositories can be used to identify and fill gaps in biomedical knowledge. But first, these data need to be made readily accessible to researchers for analysis and interpretation. Here a collection of transcriptome datasets was made available to investigate the functional programming of human hematopoietic cells in early life. Thirty two datasets were retrieved from the NCBI Gene Expression Omnibus (GEO) and loaded in a custom, interactive web application called the Gene Expression browser (GXB), designed for visualization and query of integrated large-scale data. Multiple sample groupings and gene rank lists were created based on the study design and variables in each dataset. Web links to customized graphical views can be generated by users and subsequently be used to graphically present data in manuscripts for publication. The GXB tool also enables browsing of a single gene across datasets, which can provide information on the role of a given molecule across biological systems. The dataset collection is available online. As a proof-of-principle, one of the datasets (GSE25087) was re-analyzed to identify genes that are differentially expressed by regulatory T cells in early life. Re-analysis of this dataset and a cross-study comparison using multiple other datasets in the above mentioned collection revealed that PMCH, a gene encoding a precursor of melanin-concentrating hormone (MCH), a cyclic neuropeptide, is highly expressed in a variety of other hematopoietic cell types, including neonatal erythroid cells as well as plasmacytoid dendritic cells upon viral infection. Our findings suggest an as yet unrecognized role of MCH in immune regulation, thereby highlighting the unique potential of the curated dataset collection and systems biology approach to generate new hypotheses which can be tested in future mechanistic studies.

Keywords: early-life, GEO datasets, PMCH, interactive query, systems biology

Procedia PDF Downloads 268
778 Business-Intelligence Mining of Large Decentralized Multimedia Datasets with a Distributed Multi-Agent System

Authors: Karima Qayumi, Alex Norta

Abstract:

The rapid generation of high volume and a broad variety of data from the application of new technologies pose challenges for the generation of business-intelligence. Most organizations and business owners need to extract data from multiple sources and apply analytical methods for the purposes of developing their business. Therefore, the recently decentralized data management environment is relying on a distributed computing paradigm. While data are stored in highly distributed systems, the implementation of distributed data-mining techniques is a challenge. The aim of this technique is to gather knowledge from every domain and all the datasets stemming from distributed resources. As agent technologies offer significant contributions for managing the complexity of distributed systems, we consider this for next-generation data-mining processes. To demonstrate agent-based business intelligence operations, we use agent-oriented modeling techniques to develop a new artifact for mining massive datasets.

Keywords: agent-oriented modeling (AOM), business intelligence model (BIM), distributed data mining (DDM), multi-agent system (MAS)

Procedia PDF Downloads 391
777 AutoML: Comprehensive Review and Application to Engineering Datasets

Authors: Parsa Mahdavi, M. Amin Hariri-Ardebili

Abstract:

The development of accurate machine learning and deep learning models traditionally demands hands-on expertise and a solid background to fine-tune hyperparameters. With the continuous expansion of datasets in various scientific and engineering domains, researchers increasingly turn to machine learning methods to unveil hidden insights that may elude classic regression techniques. This surge in adoption raises concerns about the adequacy of the resultant meta-models and, consequently, the interpretation of the findings. In response to these challenges, automated machine learning (AutoML) emerges as a promising solution, aiming to construct machine learning models with minimal intervention or guidance from human experts. AutoML encompasses crucial stages such as data preparation, feature engineering, hyperparameter optimization, and neural architecture search. This paper provides a comprehensive overview of the principles underpinning AutoML, surveying several widely-used AutoML platforms. Additionally, the paper offers a glimpse into the application of AutoML on various engineering datasets. By comparing these results with those obtained through classical machine learning methods, the paper quantifies the uncertainties inherent in the application of a single ML model versus the holistic approach provided by AutoML. These examples showcase the efficacy of AutoML in extracting meaningful patterns and insights, emphasizing its potential to revolutionize the way we approach and analyze complex datasets.

Keywords: automated machine learning, uncertainty, engineering dataset, regression

Procedia PDF Downloads 34
776 Visual Analytics of Higher Order Information for Trajectory Datasets

Authors: Ye Wang, Ickjai Lee

Abstract:

Due to the widespread of mobile sensing, there is a strong need to handle trails of moving objects, trajectories. This paper proposes three visual analytic approaches for higher order information of trajectory data sets based on the higher order Voronoi diagram data structure. Proposed approaches reveal geometrical information, topological, and directional information. Experimental results demonstrate the applicability and usefulness of proposed three approaches.

Keywords: visual analytics, higher order information, trajectory datasets, spatio-temporal data

Procedia PDF Downloads 382
775 Qualitative Data Analysis for Health Care Services

Authors: Taner Ersoz, Filiz Ersoz

Abstract:

This study was designed enable application of multivariate technique in the interpretation of categorical data for measuring health care services satisfaction in Turkey. The data was collected from a total of 17726 respondents. The establishment of the sample group and collection of the data were carried out by a joint team from The Ministry of Health and Turkish Statistical Institute (Turk Stat) of Turkey. The multiple correspondence analysis (MCA) was used on the data of 2882 respondents who answered the questionnaire in full. The multiple correspondence analysis indicated that, in the evaluation of health services females, public employees, younger and more highly educated individuals were more concerned and complainant than males, private sector employees, older and less educated individuals. Overall 53 % of the respondents were pleased with the improvements in health care services in the past three years. This study demonstrates the public consciousness in health services and health care satisfaction in Turkey. It was found that most the respondents were pleased with the improvements in health care services over the past three years. Awareness of health service quality increases with education levels. Older individuals and males would appear to have lower expectancies in health services.

Keywords: multiple correspondence analysis, multivariate categorical data, health care services, health satisfaction survey

Procedia PDF Downloads 202
774 A Theoretical Model for Pattern Extraction in Large Datasets

Authors: Muhammad Usman

Abstract:

Pattern extraction has been done in past to extract hidden and interesting patterns from large datasets. Recently, advancements are being made in these techniques by providing the ability of multi-level mining, effective dimension reduction, advanced evaluation and visualization support. This paper focuses on reviewing the current techniques in literature on the basis of these parameters. Literature review suggests that most of the techniques which provide multi-level mining and dimension reduction, do not handle mixed-type data during the process. Patterns are not extracted using advanced algorithms for large datasets. Moreover, the evaluation of patterns is not done using advanced measures which are suited for high-dimensional data. Techniques which provide visualization support are unable to handle a large number of rules in a small space. We present a theoretical model to handle these issues. The implementation of the model is beyond the scope of this paper.

Keywords: association rule mining, data mining, data warehouses, visualization of association rules

Procedia PDF Downloads 197
773 Imputation of Urban Movement Patterns Using Big Data

Authors: Eusebio Odiari, Mark Birkin, Susan Grant-Muller, Nicolas Malleson

Abstract:

Big data typically refers to consumer datasets revealing some detailed heterogeneity in human behavior, which if harnessed appropriately, could potentially revolutionize our understanding of the collective phenomena of the physical world. Inadvertent missing values skew these datasets and compromise the validity of the thesis. Here we discuss a conceptually consistent strategy for identifying other relevant datasets to combine with available big data, to plug the gaps and to create a rich requisite comprehensive dataset for subsequent analysis. Specifically, emphasis is on how these methodologies can for the first time enable the construction of more detailed pictures of passenger demand and drivers of mobility on the railways. These methodologies can predict the influence of changes within the network (like a change in time-table or impact of a new station), explain local phenomena outside the network (like rail-heading) and the other impacts of urban morphology. Our analysis also reveals that our new imputation data model provides for more equitable revenue sharing amongst network operators who manage different parts of the integrated UK railways.

Keywords: big-data, micro-simulation, mobility, ticketing-data, commuters, transport, synthetic, population

Procedia PDF Downloads 194
772 Multi-Level Air Quality Classification in China Using Information Gain and Support Vector Machine

Authors: Bingchun Liu, Pei-Chann Chang, Natasha Huang, Dun Li

Abstract:

Machine Learning and Data Mining are the two important tools for extracting useful information and knowledge from large datasets. In machine learning, classification is a wildly used technique to predict qualitative variables and is generally preferred over regression from an operational point of view. Due to the enormous increase in air pollution in various countries especially China, Air Quality Classification has become one of the most important topics in air quality research and modelling. This study aims at introducing a hybrid classification model based on information theory and Support Vector Machine (SVM) using the air quality data of four cities in China namely Beijing, Guangzhou, Shanghai and Tianjin from Jan 1, 2014 to April 30, 2016. China's Ministry of Environmental Protection has classified the daily air quality into 6 levels namely Serious Pollution, Severe Pollution, Moderate Pollution, Light Pollution, Good and Excellent based on their respective Air Quality Index (AQI) values. Using the information theory, information gain (IG) is calculated and feature selection is done for both categorical features and continuous numeric features. Then SVM Machine Learning algorithm is implemented on the selected features with cross-validation. The final evaluation reveals that the IG and SVM hybrid model performs better than SVM (alone), Artificial Neural Network (ANN) and K-Nearest Neighbours (KNN) models in terms of accuracy as well as complexity.

Keywords: machine learning, air quality classification, air quality index, information gain, support vector machine, cross-validation

Procedia PDF Downloads 199
771 Dataset Quality Index:Development of Composite Indicator Based on Standard Data Quality Indicators

Authors: Sakda Loetpiparwanich, Preecha Vichitthamaros

Abstract:

Nowadays, poor data quality is considered one of the majority costs for a data project. The data project with data quality awareness almost as much time to data quality processes while data project without data quality awareness negatively impacts financial resources, efficiency, productivity, and credibility. One of the processes that take a long time is defining the expectations and measurements of data quality because the expectation is different up to the purpose of each data project. Especially, big data project that maybe involves with many datasets and stakeholders, that take a long time to discuss and define quality expectations and measurements. Therefore, this study aimed at developing meaningful indicators to describe overall data quality for each dataset to quick comparison and priority. The objectives of this study were to: (1) Develop a practical data quality indicators and measurements, (2) Develop data quality dimensions based on statistical characteristics and (3) Develop Composite Indicator that can describe overall data quality for each dataset. The sample consisted of more than 500 datasets from public sources obtained by random sampling. After datasets were collected, there are five steps to develop the Dataset Quality Index (SDQI). First, we define standard data quality expectations. Second, we find any indicators that can measure directly to data within datasets. Thirdly, each indicator aggregates to dimension using factor analysis. Next, the indicators and dimensions were weighted by an effort for data preparing process and usability. Finally, the dimensions aggregate to Composite Indicator. The results of these analyses showed that: (1) The developed useful indicators and measurements contained ten indicators. (2) the developed data quality dimension based on statistical characteristics, we found that ten indicators can be reduced to 4 dimensions. (3) The developed Composite Indicator, we found that the SDQI can describe overall datasets quality of each dataset and can separate into 3 Level as Good Quality, Acceptable Quality, and Poor Quality. The conclusion, the SDQI provide an overall description of data quality within datasets and meaningful composition. We can use SQDI to assess for all data in the data project, effort estimation, and priority. The SDQI also work well with Agile Method by using SDQI to assessment in the first sprint. After passing the initial evaluation, we can add more specific data quality indicators into the next sprint.

Keywords: data quality, dataset quality, data quality management, composite indicator, factor analysis, principal component analysis

Procedia PDF Downloads 110
770 A Concept of Data Mining with XML Document

Authors: Akshay Agrawal, Anand K. Srivastava

Abstract:

The increasing amount of XML datasets available to casual users increases the necessity of investigating techniques to extract knowledge from these data. Data mining is widely applied in the database research area in order to extract frequent correlations of values from both structured and semi-structured datasets. The increasing availability of heterogeneous XML sources has raised a number of issues concerning how to represent and manage these semi structured data. In recent years due to the importance of managing these resources and extracting knowledge from them, lots of methods have been proposed in order to represent and cluster them in different ways.

Keywords: XML, similarity measure, clustering, cluster quality, semantic clustering

Procedia PDF Downloads 346
769 Retrospective Demographic Analysis of Patients Lost to Follow-Up from Antiretroviral Therapy in Mulanje Mission Hospital, Malawi

Authors: Silas Webb, Joseph Hartland

Abstract:

Background: Long-term retention of patients on ART has become a major health challenge in Sub-Saharan Africa (SSA). In 2010 a systematic review of 39 papers found that 30% of patients were no longer taking their ARTs two years after starting treatment. In the same review, it was noted that there was a paucity of data as to why patients become lost to follow-up (LTFU) in SSA. This project was performed in Mulanje Mission Hospital in Malawi as part of Swindon Academy’s Global Health eSSC. The HIV prevalence for Malawi is 10.3%, one of the highest rates in the world, however prevalence soars to 18% in the Mulanje. Therefore it is essential that patients at risk of being LTFU are identified early and managed appropriately to help them continue to participate in the service. Methodology: All patients on adult antiretroviral formulations at MMH, who were classified as ‘defaulters’ (patients missing a scheduled follow up visit by more than two months) over the last 12 months were included in the study. Demographic varibales were collected from Mastercards for data analysis. A comparison group of patients currently not lost to follow up was created by using all of the patients who attended the HIV clinic between 18th-22nd July 2016 who had never defaulted from ART. Data was analysed using the chi squared (χ²) test, as data collected was categorical, with alpha levels set at 0.05. Results: Overall, 136 patients had defaulted from ART over the past 12 months at MMH. Of these, 43 patients had missing Mastercards, so 93 defaulter datasets were analysed. In the comparison group 93 datasets were also analysed and statistical analysis done using Chi-Squared testing. A higher proportion of men in the defaulting group was noted (χ²=0.034) and defaulters tended to be younger (χ²=0.052). 94.6% of patients who defaulted were taking Tenofovir, Lamivudine and Efavirenz, the standard first line ART therapy in Malawi. The mean length of time on ART was 39.0 months (RR: -22.4-100.4) in the defaulters group and 47.3 months (RR: -19.71-114.23) in the control group, with a mean difference of 8.3 less months in the defaulters group (χ ²=0.056). Discussion: The findings in this study echo the literature, however this review expands on that and shows the demographic for the patient at most risk of defaulting and being LTFU would be: a young male who has missed more than 4 doses of ART and is within his first year of treatment. For the hospital, this data is important at it identifies significant areas for public health focus. For instance, fear of disclosure and stigma may be disproportionately affecting younger men, so interventions can be aimed specifically at them to improve their health outcomes. The mean length of time on medication was 8.3 months less in the defaulters group, with a p-value of 0.056, emphasising the need for more intensive follow-up in the early stages of treatment, when patients are at the highest risk of defaulting.

Keywords: anti-retroviral therapy, ART, HIV, lost to follow up, Malawi

Procedia PDF Downloads 157
768 Gradient Boosted Trees on Spark Platform for Supervised Learning in Health Care Big Data

Authors: Gayathri Nagarajan, L. D. Dhinesh Babu

Abstract:

Health care is one of the prominent industries that generate voluminous data thereby finding the need of machine learning techniques with big data solutions for efficient processing and prediction. Missing data, incomplete data, real time streaming data, sensitive data, privacy, heterogeneity are few of the common challenges to be addressed for efficient processing and mining of health care data. In comparison with other applications, accuracy and fast processing are of higher importance for health care applications as they are related to the human life directly. Though there are many machine learning techniques and big data solutions used for efficient processing and prediction in health care data, different techniques and different frameworks are proved to be effective for different applications largely depending on the characteristics of the datasets. In this paper, we present a framework that uses ensemble machine learning technique gradient boosted trees for data classification in health care big data. The framework is built on Spark platform which is fast in comparison with other traditional frameworks. Unlike other works that focus on a single technique, our work presents a comparison of six different machine learning techniques along with gradient boosted trees on datasets of different characteristics. Five benchmark health care datasets are considered for experimentation, and the results of different machine learning techniques are discussed in comparison with gradient boosted trees. The metric chosen for comparison is misclassification error rate and the run time of the algorithms. The goal of this paper is to i) Compare the performance of gradient boosted trees with other machine learning techniques in Spark platform specifically for health care big data and ii) Discuss the results from the experiments conducted on datasets of different characteristics thereby drawing inference and conclusion. The experimental results show that the accuracy is largely dependent on the characteristics of the datasets for other machine learning techniques whereas gradient boosting trees yields reasonably stable results in terms of accuracy without largely depending on the dataset characteristics.

Keywords: big data analytics, ensemble machine learning, gradient boosted trees, Spark platform

Procedia PDF Downloads 216
767 Supervised/Unsupervised Mahalanobis Algorithm for Improving Performance for Cyberattack Detection over Communications Networks

Authors: Radhika Ranjan Roy

Abstract:

Deployment of machine learning (ML)/deep learning (DL) algorithms for cyberattack detection in operational communications networks (wireless and/or wire-line) is being delayed because of low-performance parameters (e.g., recall, precision, and f₁-score). If datasets become imbalanced, which is the usual case for communications networks, the performance tends to become worse. Complexities in handling reducing dimensions of the feature sets for increasing performance are also a huge problem. Mahalanobis algorithms have been widely applied in scientific research because Mahalanobis distance metric learning is a successful framework. In this paper, we have investigated the Mahalanobis binary classifier algorithm for increasing cyberattack detection performance over communications networks as a proof of concept. We have also found that high-dimensional information in intermediate features that are not utilized as much for classification tasks in ML/DL algorithms are the main contributor to the state-of-the-art of improved performance of the Mahalanobis method, even for imbalanced and sparse datasets. With no feature reduction, MD offers uniform results for precision, recall, and f₁-score for unbalanced and sparse NSL-KDD datasets.

Keywords: Mahalanobis distance, machine learning, deep learning, NS-KDD, local intrinsic dimensionality, chi-square, positive semi-definite, area under the curve

Procedia PDF Downloads 50
766 Q-Map: Clinical Concept Mining from Clinical Documents

Authors: Sheikh Shams Azam, Manoj Raju, Venkatesh Pagidimarri, Vamsi Kasivajjala

Abstract:

Over the past decade, there has been a steep rise in the data-driven analysis in major areas of medicine, such as clinical decision support system, survival analysis, patient similarity analysis, image analytics etc. Most of the data in the field are well-structured and available in numerical or categorical formats which can be used for experiments directly. But on the opposite end of the spectrum, there exists a wide expanse of data that is intractable for direct analysis owing to its unstructured nature which can be found in the form of discharge summaries, clinical notes, procedural notes which are in human written narrative format and neither have any relational model nor any standard grammatical structure. An important step in the utilization of these texts for such studies is to transform and process the data to retrieve structured information from the haystack of irrelevant data using information retrieval and data mining techniques. To address this problem, the authors present Q-Map in this paper, which is a simple yet robust system that can sift through massive datasets with unregulated formats to retrieve structured information aggressively and efficiently. It is backed by an effective mining technique which is based on a string matching algorithm that is indexed on curated knowledge sources, that is both fast and configurable. The authors also briefly examine its comparative performance with MetaMap, one of the most reputed tools for medical concepts retrieval and present the advantages the former displays over the latter.

Keywords: information retrieval, unified medical language system, syntax based analysis, natural language processing, medical informatics

Procedia PDF Downloads 106
765 Direct Phoenix Identification and Antimicrobial Susceptibility Testing from Positive Blood Culture Broths

Authors: Waad Al Saleemi, Badriya Al Adawi, Zaaima Al Jabri, Sahim Al Ghafri, Jalila Al Hadhramia

Abstract:

Objectives: Using standard lab methods, a positive blood culture requires a minimum of two days (two occasions of overnight incubation) to obtain a final identification (ID) and antimicrobial susceptibility results (AST) report. In this study, we aimed to evaluate the accuracy and precision of identification and antimicrobial susceptibility testing of an alternative method (direct method) that will reduce the turnaround time by 24 hours. This method involves the direct inoculation of positive blood culture broths into the Phoenix system using serum separation tubes (SST). Method: This prospective study included monomicrobial-positive blood cultures obtained from January 2022 to May 2023 in SQUH. Blood cultures containing a mixture of organisms, fungi, or anaerobic organisms were excluded from this study. The result of the new “direct method” under study was compared with the current “standard method” used in the lab. The accuracy and precision were evaluated for the ID and AST using Clinical and Laboratory Standards Institute (CLSI) recommendations. The categorical agreement, essential agreement, and the rates of very major errors (VME), major errors (ME), and minor errors (MIE) for both gram-negative and gram-positive bacteria were calculated. Passing criteria were set according to CLSI. Result: The results of ID and AST were available for a total of 158 isolates. Of 77 isolates of gram-negative bacteria, 71 (92%) were correctly identified at the species level. Of 70 isolates of gram-positive bacteria, 47(67%) isolates were correctly identified. For gram-negative bacteria, the essential agreement of the direct method was ≥92% when compared to the standard method, while the categorical agreement was ≥91% for all tested antibiotics. The precision of ID and AST were noted to be 100% for all tested isolates. For gram-positive bacteria, the essential agreement was >93%, while the categorical agreement was >92% for all tested antibiotics except moxifloxacin. Many antibiotics were noted to have an unacceptable higher rate of very major errors including penicillin, cotrimoxazole, clindamycin, ciprofloxacin, and moxifloxacin. However, no error was observed in the results of vancomycin, linezolid, and daptomycin. Conclusion: The direct method of ID and AST for positive blood cultures using SST is reliable for gram negative bacteria. It will significantly decrease the turnaround time and will facilitate antimicrobial stewardship.

Keywords: bloodstream infection, oman, direct ast, blood culture, rapid identification, antimicrobial susceptibility, phoenix, direct inoculation

Procedia PDF Downloads 22
764 Adaptive Swarm Balancing Algorithms for Rare-Event Prediction in Imbalanced Healthcare Data

Authors: Jinyan Li, Simon Fong, Raymond Wong, Mohammed Sabah, Fiaidhi Jinan

Abstract:

Clinical data analysis and forecasting have make great contributions to disease control, prevention and detection. However, such data usually suffer from highly unbalanced samples in class distributions. In this paper, we target at the binary imbalanced dataset, where the positive samples take up only the minority. We investigate two different meta-heuristic algorithms, particle swarm optimization and bat-inspired algorithm, and combine both of them with the synthetic minority over-sampling technique (SMOTE) for processing the datasets. One approach is to process the full dataset as a whole. The other is to split up the dataset and adaptively process it one segment at a time. The experimental results reveal that while the performance improvements obtained by the former methods are not scalable to larger data scales, the later one, which we call Adaptive Swarm Balancing Algorithms, leads to significant efficiency and effectiveness improvements on large datasets. We also find it more consistent with the practice of the typical large imbalanced medical datasets. We further use the meta-heuristic algorithms to optimize two key parameters of SMOTE. Leading to more credible performances of the classifier, and shortening the running time compared with the brute-force method.

Keywords: Imbalanced dataset, meta-heuristic algorithm, SMOTE, big data

Procedia PDF Downloads 413
763 An MrPPG Method for Face Anti-Spoofing

Authors: Lan Zhang, Cailing Zhang

Abstract:

In recent years, many face anti-spoofing algorithms have high detection accuracy when detecting 2D face anti-spoofing or 3D mask face anti-spoofing alone in the field of face anti-spoofing, but their detection performance is greatly reduced in multidimensional and cross-datasets tests. The rPPG method used for face anti-spoofing uses the unique vital information of real face to judge real faces and face anti-spoofing, so rPPG method has strong stability compared with other methods, but its detection rate of 2D face anti-spoofing needs to be improved. Therefore, in this paper, we improve an rPPG(Remote Photoplethysmography) method(MrPPG) for face anti-spoofing which through color space fusion, using the correlation of pulse signals between real face regions and background regions, and introducing the cyclic neural network (LSTM) method to improve accuracy in 2D face anti-spoofing. Meanwhile, the MrPPG also has high accuracy and good stability in face anti-spoofing of multi-dimensional and cross-data datasets. The improved method was validated on Replay-Attack, CASIA-FASD, Siw and HKBU_MARs_V2 datasets, the experimental results show that the performance and stability of the improved algorithm proposed in this paper is superior to many advanced algorithms.

Keywords: face anti-spoofing, face presentation attack detection, remote photoplethysmography, MrPPG

Procedia PDF Downloads 143
762 Two Concurrent Convolution Neural Networks TC*CNN Model for Face Recognition Using Edge

Authors: T. Alghamdi, G. Alaghband

Abstract:

In this paper we develop a model that couples Two Concurrent Convolution Neural Network with different filters (TC*CNN) for face recognition and compare its performance to an existing sequential CNN (base model). We also test and compare the quality and performance of the models on three datasets with various levels of complexity (easy, moderate, and difficult) and show that for the most complex datasets, edges will produce the most accurate and efficient results. We further show that in such cases while Support Vector Machine (SVM) models are fast, they do not produce accurate results.

Keywords: Convolution Neural Network, Edges, Face Recognition , Support Vector Machine.

Procedia PDF Downloads 122
761 Studies of Rule Induction by STRIM from the Decision Table with Contaminated Attribute Values from Missing Data and Noise — in the Case of Critical Dataset Size —

Authors: Tetsuro Saeki, Yuichi Kato, Shoutarou Mizuno

Abstract:

STRIM (Statistical Test Rule Induction Method) has been proposed as a method to effectively induct if-then rules from the decision table which is considered as a sample set obtained from the population of interest. Its usefulness has been confirmed by simulation experiments specifying rules in advance, and by comparison with conventional methods. However, scope for future development remains before STRIM can be applied to the analysis of real-world data sets. The first requirement is to determine the size of the dataset needed for inducting true rules, since finding statistically significant rules is the core of the method. The second is to examine the capacity of rule induction from datasets with contaminated attribute values created by missing data and noise, since real-world datasets usually contain such contaminated data. This paper examines the first problem theoretically, in connection with the rule length. The second problem is then examined in a simulation experiment, utilizing the critical size of dataset derived from the first step. The experimental results show that STRIM is highly robust in the analysis of datasets with contaminated attribute values, and hence is applicable to realworld data.

Keywords: rule induction, decision table, missing data, noise

Procedia PDF Downloads 366
760 Enhancing Cultural Heritage Data Retrieval by Mapping COURAGE to CIDOC Conceptual Reference Model

Authors: Ghazal Faraj, Andras Micsik

Abstract:

The CIDOC Conceptual Reference Model (CRM) is an extensible ontology that provides integrated access to heterogeneous and digital datasets. The CIDOC-CRM offers a “semantic glue” intended to promote accessibility to several diverse and dispersed sources of cultural heritage data. That is achieved by providing a formal structure for the implicit and explicit concepts and their relationships in the cultural heritage field. The COURAGE (“Cultural Opposition – Understanding the CultuRal HeritAGE of Dissent in the Former Socialist Countries”) project aimed to explore methods about socialist-era cultural resistance during 1950-1990 and planned to serve as a basis for further narratives and digital humanities (DH) research. This project highlights the diversity of flourished alternative cultural scenes in Eastern Europe before 1989. Moreover, the dataset of COURAGE is an online RDF-based registry that consists of historical people, organizations, collections, and featured items. For increasing the inter-links between different datasets and retrieving more relevant data from various data silos, a shared federated ontology for reconciled data is needed. As a first step towards these goals, a full understanding of the CIDOC CRM ontology (target ontology), as well as the COURAGE dataset, was required to start the work. Subsequently, the queries toward the ontology were determined, and a table of equivalent properties from COURAGE and CIDOC CRM was created. The structural diagrams that clarify the mapping process and construct queries are on progress to map person, organization, and collection entities to the ontology. Through mapping the COURAGE dataset to CIDOC-CRM ontology, the dataset will have a common ontological foundation with several other datasets. Therefore, the expected results are: 1) retrieving more detailed data about existing entities, 2) retrieving new entities’ data, 3) aligning COURAGE dataset to a standard vocabulary, 4) running distributed SPARQL queries over several CIDOC-CRM datasets and testing the potentials of distributed query answering using SPARQL. The next plan is to map CIDOC-CRM to other upper-level ontologies or large datasets (e.g., DBpedia, Wikidata), and address similar questions on a wide variety of knowledge bases.

Keywords: CIDOC CRM, cultural heritage data, COURAGE dataset, ontology alignment

Procedia PDF Downloads 119
759 An Application of Modified M-out-of-N Bootstrap Method to Heavy-Tailed Distributions

Authors: Hannah F. Opayinka, Adedayo A. Adepoju

Abstract:

This study is an extension of a prior study on the modification of the existing m-out-of-n (moon) bootstrap method for heavy-tailed distributions in which modified m-out-of-n (mmoon) was proposed as an alternative method to the existing moon technique. In this study, both moon and mmoon techniques were applied to two real income datasets which followed Lognormal and Pareto distributions respectively with finite variances. The performances of these two techniques were compared using Standard Error (SE) and Root Mean Square Error (RMSE). The findings showed that mmoon outperformed moon bootstrap in terms of smaller SEs and RMSEs for all the sample sizes considered in the two datasets.

Keywords: Bootstrap, income data, lognormal distribution, Pareto distribution

Procedia PDF Downloads 155
758 Prevalence of Knee Pain and Risk Factors and Its Impact on Functional Impairment among Saudi Adolescents

Authors: Ali H.Alyami, Hussam Darraj, Faisal Hakami, Mohammed Awaf, Sulaiman Hamdi, Nawaf Bakri, Abdulaziz Saber, Khalid Hakami, Almuhanad Alyami, Mohammed khashab

Abstract:

Introduction: Adolescents frequently self-report pain, according to epidemiological research. The knee is one of the sites where the pain is most common. One of the main factors contributing to the number of years people spend disabled and having substantial personal, societal, and economic burdens globally are musculoskeletal disorders. Adolescents may have knee pain due to an abrupt, traumatic injury or an insidious, slowly building onset that neither the adolescent nor the parent is aware of. Objectives: The present study’s authors aimed to estimate the prevalence of knee pain in Saudi adolescents. Methods: This cross-sectional survey, carried out from June to November 2022, included 676 adolescents ages 10 to 18. Data are presented as frequencies and percentages for categorical variables. Analysis of variance (ANOVA) was used to compare means between groups, while the chi-square test was used for the comparison of categorical variables. Statistical significance was set at P< 0.05.Result: Adolescents were invited to take part in the study. 57.5% were girls, and 42.5% were males,68.8% were 676 aged between 15 and 18. The prevalence of knee pain was considerably high among females (26%), while it was 19.2% among males. Moreover, age was a significant predictor for knee pain; also BMI was significant for knee pain. Conclusion: Our study noted a high rate of knee pain among adolescents, so we need to raise awareness about risk factors. Adolescent knee pain can be prevented with conservative methods and some minor lifestyle/activity modifications.

Keywords: knee pain, prevalence of knee pain, exercise training, physical activity

Procedia PDF Downloads 74
757 Land Cover Remote Sensing Classification Advanced Neural Networks Supervised Learning

Authors: Eiman Kattan

Abstract:

This study aims to evaluate the impact of classifying labelled remote sensing images conventional neural network (CNN) architecture, i.e., AlexNet on different land cover scenarios based on two remotely sensed datasets from different point of views such as the computational time and performance. Thus, a set of experiments were conducted to specify the effectiveness of the selected convolutional neural network using two implementing approaches, named fully trained and fine-tuned. For validation purposes, two remote sensing datasets, AID, and RSSCN7 which are publicly available and have different land covers features were used in the experiments. These datasets have a wide diversity of input data, number of classes, amount of labelled data, and texture patterns. A specifically designed interactive deep learning GPU training platform for image classification (Nvidia Digit) was employed in the experiments. It has shown efficiency in training, validation, and testing. As a result, the fully trained approach has achieved a trivial result for both of the two data sets, AID and RSSCN7 by 73.346% and 71.857% within 24 min, 1 sec and 8 min, 3 sec respectively. However, dramatic improvement of the classification performance using the fine-tuning approach has been recorded by 92.5% and 91% respectively within 24min, 44 secs and 8 min 41 sec respectively. The represented conclusion opens the opportunities for a better classification performance in various applications such as agriculture and crops remote sensing.

Keywords: conventional neural network, remote sensing, land cover, land use

Procedia PDF Downloads 340
756 Automated End-to-End Pipeline Processing Solution for Autonomous Driving

Authors: Ashish Kumar, Munesh Raghuraj Varma, Nisarg Joshi, Gujjula Vishwa Teja, Srikanth Sambi, Arpit Awasthi

Abstract:

Autonomous driving vehicles are revolutionizing the transportation system of the 21st century. This has been possible due to intensive research put into making a robust, reliable, and intelligent program that can perceive and understand its environment and make decisions based on the understanding. It is a very data-intensive task with data coming from multiple sensors and the amount of data directly reflects on the performance of the system. Researchers have to design the preprocessing pipeline for different datasets with different sensor orientations and alignments before the dataset can be fed to the model. This paper proposes a solution that provides a method to unify all the data from different sources into a uniform format using the intrinsic and extrinsic parameters of the sensor used to capture the data allowing the same pipeline to use data from multiple sources at a time. This also means easy adoption of new datasets or In-house generated datasets. The solution also automates the complete deep learning pipeline from preprocessing to post-processing for various tasks allowing researchers to design multiple custom end-to-end pipelines. Thus, the solution takes care of the input and output data handling, saving the time and effort spent on it and allowing more time for model improvement.

Keywords: augmentation, autonomous driving, camera, custom end-to-end pipeline, data unification, lidar, post-processing, preprocessing

Procedia PDF Downloads 73
755 Post Pandemic Mobility Analysis through Indexing and Sharding in MongoDB: Performance Optimization and Insights

Authors: Karan Vishavjit, Aakash Lakra, Shafaq Khan

Abstract:

The COVID-19 pandemic has pushed healthcare professionals to use big data analytics as a vital tool for tracking and evaluating the effects of contagious viruses. To effectively analyze huge datasets, efficient NoSQL databases are needed. The analysis of post-COVID-19 health and well-being outcomes and the evaluation of the effectiveness of government efforts during the pandemic is made possible by this research’s integration of several datasets, which cuts down on query processing time and creates predictive visual artifacts. We recommend applying sharding and indexing technologies to improve query effectiveness and scalability as the dataset expands. Effective data retrieval and analysis are made possible by spreading the datasets into a sharded database and doing indexing on individual shards. Analysis of connections between governmental activities, poverty levels, and post-pandemic well being is the key goal. We want to evaluate the effectiveness of governmental initiatives to improve health and lower poverty levels. We will do this by utilising advanced data analysis and visualisations. The findings provide relevant data that supports the advancement of UN sustainable objectives, future pandemic preparation, and evidence-based decision-making. This study shows how Big Data and NoSQL databases may be used to address problems with global health.

Keywords: big data, COVID-19, health, indexing, NoSQL, sharding, scalability, well being

Procedia PDF Downloads 41
754 Combining Shallow and Deep Unsupervised Machine Learning Techniques to Detect Bad Actors in Complex Datasets

Authors: Jun Ming Moey, Zhiyaun Chen, David Nicholson

Abstract:

Bad actors are often hard to detect in data that imprints their behaviour patterns because they are comparatively rare events embedded in non-bad actor data. An unsupervised machine learning framework is applied here to detect bad actors in financial crime datasets that record millions of transactions undertaken by hundreds of actors (<0.01% bad). Specifically, the framework combines ‘shallow’ (PCA, Isolation Forest) and ‘deep’ (Autoencoder) methods to detect outlier patterns. Detection performance analysis for both the individual methods and their combination is reported.

Keywords: detection, machine learning, deep learning, unsupervised, outlier analysis, data science, fraud, financial crime

Procedia PDF Downloads 66
753 Dissimilarity Measure for General Histogram Data and Its Application to Hierarchical Clustering

Authors: K. Umbleja, M. Ichino

Abstract:

Symbolic data mining has been developed to analyze data in very large datasets. It is also useful in cases when entry specific details should remain hidden. Symbolic data mining is quickly gaining popularity as datasets in need of analyzing are becoming ever larger. One type of such symbolic data is a histogram, which enables to save huge amounts of information into a single variable with high-level of granularity. Other types of symbolic data can also be described in histograms, therefore making histogram a very important and general symbolic data type - a method developed for histograms - can also be applied to other types of symbolic data. Due to its complex structure, analyzing histograms is complicated. This paper proposes a method, which allows to compare two histogram-valued variables and therefore find a dissimilarity between two histograms. Proposed method uses the Ichino-Yaguchi dissimilarity measure for mixed feature-type data analysis as a base and develops a dissimilarity measure specifically for histogram data, which allows to compare histograms with different number of bins and bin widths (so called general histogram). Proposed dissimilarity measure is then used as a measure for clustering. Furthermore, linkage method based on weighted averages is proposed with the concept of cluster compactness to measure the quality of clustering. The method is then validated with application on real datasets. As a result, the proposed dissimilarity measure is found producing adequate and comparable results with general histograms without the loss of detail or need to transform the data.

Keywords: dissimilarity measure, hierarchical clustering, histograms, symbolic data analysis

Procedia PDF Downloads 132
752 Distances over Incomplete Diabetes and Breast Cancer Data Based on Bhattacharyya Distance

Authors: Loai AbdAllah, Mahmoud Kaiyal

Abstract:

Missing values in real-world datasets are a common problem. Many algorithms were developed to deal with this problem, most of them replace the missing values with a fixed value that was computed based on the observed values. In our work, we used a distance function based on Bhattacharyya distance to measure the distance between objects with missing values. Bhattacharyya distance, which measures the similarity of two probability distributions. The proposed distance distinguishes between known and unknown values. Where the distance between two known values is the Mahalanobis distance. When, on the other hand, one of them is missing the distance is computed based on the distribution of the known values, for the coordinate that contains the missing value. This method was integrated with Wikaya, a digital health company developing a platform that helps to improve prevention of chronic diseases such as diabetes and cancer. In order for Wikaya’s recommendation system to work distance between users need to be measured. Since there are missing values in the collected data, there is a need to develop a distance function distances between incomplete users profiles. To evaluate the accuracy of the proposed distance function in reflecting the actual similarity between different objects, when some of them contain missing values, we integrated it within the framework of k nearest neighbors (kNN) classifier, since its computation is based only on the similarity between objects. To validate this, we ran the algorithm over diabetes and breast cancer datasets, standard benchmark datasets from the UCI repository. Our experiments show that kNN classifier using our proposed distance function outperforms the kNN using other existing methods.

Keywords: missing values, incomplete data, distance, incomplete diabetes data

Procedia PDF Downloads 188
751 A Comparison of Caesarean Section Indications and Characteristics in 2009 and 2020 in a Saudi Tertiary Hospital

Authors: Sarah K. Basudan, Ragad I. Al Jazzar, Zeinah Sulaihim, Hanan M. Al-Kadri

Abstract:

Background: Cesarean section has been increasing in recent years, with a wide range of etiologies contributing to this rise. This study aimed to assess the indications, outcomes, and complications in Riyadh, Saudi Arabia. Methods: A Retrospective Cohort study was conducted at King Abdulaziz medical city. The study includes two cohorts: G1 (2009) and G2 (2020) groups who met the inclusion criteria. The data was transferred to the SPSS (statistical package for social sciences) version 24 for analysis. The initial descriptive statistics were run for all variables, including numerical and categorical data. The numerical data were reported as median, and standard deviation and categorical data were reported as frequencies and percentages. Results: The data were collected from 399 women who were divided into two groups, G1(199) and G2(200). The mean age of all participants is 32+-6​; G1 and G2 had significant differences in age means with 30+-6 and 34+-5, respectively, with a p-value of <0.001, which indicates delayed fertility by four years. Moreover, a breech presentation was less likely to occur in G2 (OR 0.64, CI: 0.21-0.62. P<0.001). Nonetheless, maternal causes such as repeated C-sections and maternal medical conditions were more likely to happen in G2 (OR 1.5, CI: 1.04-2.38, p=0.03) and (OR 5.4, CI: 1.12-23.9, P=0.01), respectively. Furthermore, postpartum hemorrhage showed an increase of 12% in G2 (OR 5.4, CI: 2.2-13.4, p<0.001). G2 was more likely to be admitted to the neonatal intensive care unit (NICU) (OR 16, CI: 7.4-38.7) and to special care baby (SCB) (OR 7.2, CI: 3.9-13.1), both with a p-value<0.001 compared to regular nursery admission. Conclusion: There are multiple factors that are contributing to the increase in c section rate in a Saudi tertiary hospitals. The factors were suggested to be previous c-sections, abnormal fetal heart rate, malpresentation, and maternal or fetal medical conditions.

Keywords: cesarean sections, maternal indications, maternal complications, neonatal condition

Procedia PDF Downloads 50
750 Detecting Memory-Related Gene Modules in sc/snRNA-seq Data by Deep-Learning

Authors: Yong Chen

Abstract:

To understand the detailed molecular mechanisms of memory formation in engram cells is one of the most fundamental questions in neuroscience. Recent single-cell RNA-seq (scRNA-seq) and single-nucleus RNA-seq (snRNA-seq) techniques have allowed us to explore the sparsely activated engram ensembles, enabling access to the molecular mechanisms that underlie experience-dependent memory formation and consolidation. However, the absence of specific and powerful computational methods to detect memory-related genes (modules) and their regulatory relationships in the sc/snRNA-seq datasets has strictly limited the analysis of underlying mechanisms and memory coding principles in mammalian brains. Here, we present a deep-learning method named SCENTBOX, to detect memory-related gene modules and causal regulatory relationships among themfromsc/snRNA-seq datasets. SCENTBOX first constructs codifferential expression gene network (CEGN) from case versus control sc/snRNA-seq datasets. It then detects the highly correlated modules of differential expression genes (DEGs) in CEGN. The deep network embedding and attention-based convolutional neural network strategies are employed to precisely detect regulatory relationships among DEG genes in a module. We applied them on scRNA-seq datasets of TRAP; Ai14 mouse neurons with fear memory and detected not only known memory-related genes, but also the modules and potential causal regulations. Our results provided novel regulations within an interesting module, including Arc, Bdnf, Creb, Dusp1, Rgs4, and Btg2. Overall, our methods provide a general computational tool for processing sc/snRNA-seq data from case versus control studie and a systematic investigation of fear-memory-related gene modules.

Keywords: sc/snRNA-seq, memory formation, deep learning, gene module, causal inference

Procedia PDF Downloads 85