Search results for: categorical datasets
779 Road Accidents Bigdata Mining and Visualization Using Support Vector Machines
Authors: Usha Lokala, Srinivas Nowduri, Prabhakar K. Sharma
Abstract:
Useful information has been extracted from road accident data in the United Kingdom (UK), using data analytics methods, to help avoid possible accidents in rural and urban areas. This analysis makes use of several methodologies such as data integration, support vector machines (SVM), correlation machines and multinomial goodness. The entire datasets have been imported from the traffic department of the UK with due permission. The information extracted from these huge datasets forms a basis for several predictions, which in turn avoid unnecessary memory lapses. Since data is expected to grow continuously over a period of time, this work primarily proposes a new framework model which can be trained and adapt itself to new data and make accurate predictions. This work also throws some light on the use of SVM methodology for text classifiers from the obtained traffic data. Finally, it emphasizes the uniqueness and adaptability of the SVM methodology appropriate for this kind of research work.
Keywords: support vector mechanism (SVM), machine learning (ML), support vector machines (SVM), department of transportation (DFT)
Procedia PDF Downloads 274
778 On an Approach for Rule Generation in Association Rule Mining
Authors: B. Chandra
Abstract:
In Association Rule Mining, much attention has been paid to developing algorithms for large (frequent/closed/maximal) itemsets, but very little attention has been paid to improving the performance of rule generation algorithms. Rule generation is an important part of Association Rule Mining. In this paper, a novel approach named NARG (Association Rule using Antecedent Support) is proposed for rule generation; it uses a memory-resident data structure named FCET (Frequent Closed Enumeration Tree) to find frequent/closed itemsets. In addition, the computational speed of NARG is enhanced by giving importance to the rules that have lower antecedent support. A comparative performance evaluation of NARG against a fast association rule mining algorithm for rule generation has been done on synthetic datasets and real-life datasets (taken from the UCI Machine Learning Repository). Performance analysis shows that NARG is computationally faster than the existing algorithms for rule generation.
Keywords: knowledge discovery, association rule mining, antecedent support, rule generation
Procedia PDF Downloads 325
777 Using Machine Learning Techniques for Autism Spectrum Disorder Analysis and Detection in Children
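The role of antecedent support in rule generation can be illustrated with a small sketch. This is not the NARG algorithm or the FCET structure from the abstract; it is a generic confidence-based rule generator, and the function name `generate_rules`, the dictionary layout, and the example supports are all assumptions made for illustration. It relies on the fact that, for a fixed itemset, confidence equals itemset support divided by antecedent support, so antecedents with lower support are tried first and the loop can stop early.

```python
from itertools import combinations


def generate_rules(itemset_support, min_confidence):
    """Generate rules A -> B from frequent itemsets, lowest antecedent support first.

    Assumes itemset_support maps frozensets to support values and contains
    every non-empty subset of each itemset (downward closure).
    """
    rules = []
    for itemset, support in itemset_support.items():
        if len(itemset) < 2:
            continue
        # Every non-empty proper subset is a candidate antecedent.
        candidates = [
            frozenset(c)
            for size in range(1, len(itemset))
            for c in combinations(sorted(itemset), size)
        ]
        # For a fixed itemset, lower antecedent support means higher confidence.
        candidates.sort(key=lambda a: itemset_support[a])
        for antecedent in candidates:
            confidence = support / itemset_support[antecedent]
            if confidence < min_confidence:
                break  # the remaining antecedents can only give lower confidence
            rules.append((antecedent, itemset - antecedent, confidence))
    return rules


# Hypothetical supports, for illustration only.
supports = {
    frozenset({"a"}): 0.6,
    frozenset({"b"}): 0.8,
    frozenset({"a", "b"}): 0.5,
}
rules = generate_rules(supports, min_confidence=0.7)
```

With these made-up supports only the rule a -> b passes the threshold, because the antecedent {a} has the lower support.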
Authors: Norah Mohammed Alshahrani, Abdulaziz Almaleh
Abstract:
Autism Spectrum Disorder (ASD) is a condition related to issues with brain development that affects how a person recognises and communicates with others, which results in difficulties with social interaction and communication, and it is constantly growing. Early recognition of ASD allows children to lead safe and healthy lives and helps doctors with accurate diagnoses and management of conditions. Therefore, it is crucial to develop a method that achieves good results with high accuracy for the measurement of ASD in children. In this paper, ASD datasets of toddlers and children have been analyzed. We employed the following machine learning techniques to explore ASD: Random Forest (RF), Decision Tree (DT), Naïve Bayes (NB) and Support Vector Machine (SVM). Feature selection was then used to provide fewer attributes from the ASD datasets while preserving model performance. As a result, we found that the best result was provided by the Support Vector Machine (SVM), achieving 0.98% in the toddler dataset and 0.99% in the children dataset.
Keywords: autism spectrum disorder, machine learning, feature selection, support vector machine
Procedia PDF Downloads 152
776 Efficient Recommendation System for Frequent and High Utility Itemsets over Incremental Datasets
Authors: J. K. Kavitha, D. Manjula, U. Kanimozhi
Abstract:
Mining frequent and high utility item sets has gained much significance in recent years. When the data arrives sporadically, incremental and interactive rule mining and utility mining approaches can be adopted to handle users’ dynamic environmental needs and avoid redundancies, using previous data structures and mining results. The dependence on recommendation systems has risen exponentially since the advent of search engines. This paper proposes a model for building a recommendation system that suggests frequent and high utility item sets over dynamic datasets for a cluster-based location prediction strategy to predict users’ trajectories using the Efficient Incremental Rule Mining (EIRM) algorithm and the Fast Update Utility Pattern Tree (FUUP) algorithm. Through comprehensive experimental evaluations, this scheme has been shown to deliver excellent performance.
Keywords: data sets, recommendation system, utility item sets, frequent item sets mining
Procedia PDF Downloads 293
775 Multimodal Data Fusion Techniques in Audiovisual Speech Recognition
Authors: Hadeer M. Sayed, Hesham E. El Deeb, Shereen A. Taie
Abstract:
In the big data era, we are facing a diversity of datasets from different sources in different domains that describe a single life event. These datasets consist of multiple modalities, each of which has a different representation, distribution, scale, and density. Multimodal fusion is the concept of integrating information from multiple modalities in a joint representation with the goal of predicting an outcome through a classification task or regression task. In this paper, multimodal fusion techniques are classified into two main classes: model-agnostic techniques and model-based approaches. It provides a comprehensive study of recent research in each class and outlines the benefits and limitations of each of them. Furthermore, the audiovisual speech recognition task is expressed as a case study of multimodal data fusion approaches, and the open issues through the limitations of the current studies are presented. This paper can be considered a powerful guide for interested researchers in the field of multimodal data fusion and audiovisual speech recognition particularly.
Keywords: multimodal data, data fusion, audio-visual speech recognition, neural networks
Procedia PDF Downloads 112
774 An Experimental Study on Some Conventional and Hybrid Models of Fuzzy Clustering
Authors: Jeugert Kujtila, Kristi Hoxhalli, Ramazan Dalipi, Erjon Cota, Ardit Murati, Erind Bedalli
Abstract:
Clustering is a versatile instrument in the analysis of collections of data, providing insights into the underlying structures of the dataset and enhancing the modeling capabilities. The fuzzy approach to the clustering problem increases the flexibility by involving the concept of partial memberships (some value in the continuous interval [0, 1]) of the instances in the clusters. Several fuzzy clustering algorithms have been devised, such as FCM, Gustafson-Kessel, Gath-Geva, kernel-based FCM, PCM, etc. Each of these algorithms has its own advantages and drawbacks, so none of them would be able to perform best on all datasets. In this paper we experimentally compare the FCM, GK and GG algorithms and a hybrid two-stage fuzzy clustering model combining the FCM and Gath-Geva algorithms. Firstly, we theoretically discuss the advantages and drawbacks of each of these algorithms and describe the hybrid clustering model, which exploits the advantages and diminishes the drawbacks of each algorithm. Secondly, we experimentally compare the accuracy of the hybrid model by applying it to several benchmark and synthetic datasets.
Keywords: fuzzy clustering, fuzzy c-means algorithm (FCM), Gustafson-Kessel algorithm, hybrid clustering model
Procedia PDF Downloads 514
773 Named Entity Recognition System for Tigrinya Language
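The idea of partial memberships in [0, 1] can be made concrete with the standard fuzzy c-means membership update, in which the membership of a point in a cluster depends on its distance to that cluster's center relative to its distances to all centers. The sketch below covers only this one step of FCM with Euclidean distance; it is not the hybrid FCM and Gath-Geva model from the abstract, and the function name, the fuzzifier default `m=2.0`, and the sample points are assumptions for illustration.

```python
def fcm_memberships(points, centers, m=2.0):
    """Return u[i][j], the membership of point j in cluster i, for fuzzifier m > 1."""
    exponent = 2.0 / (m - 1.0)
    memberships = [[0.0] * len(points) for _ in centers]
    for j, x in enumerate(points):
        # Euclidean distance from the point to every cluster center.
        dists = [sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5 for c in centers]
        zero = [i for i, d in enumerate(dists) if d == 0.0]
        if zero:
            # A point lying exactly on a center belongs fully to it (split on ties).
            for i in zero:
                memberships[i][j] = 1.0 / len(zero)
            continue
        for i, d_i in enumerate(dists):
            memberships[i][j] = 1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
    return memberships


# Hypothetical one-dimensional layout embedded in 2-D, for illustration only.
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
centers = [(0.0, 0.0), (4.0, 0.0)]
u = fcm_memberships(points, centers)
```

For each point the memberships across clusters sum to 1; the middle point, which is three times closer to the first center than to the second, receives a larger share of membership in the first cluster.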
Authors: Sham Kidane, Fitsum Gaim, Ibrahim Abdella, Sirak Asmerom, Yoel Ghebrihiwot, Simon Mulugeta, Natnael Ambassager
Abstract:
The lack of annotated datasets is a bottleneck to the progress of NLP in low-resourced languages. The work presented here consists of large-scale annotated datasets and models for the named entity recognition (NER) system for the Tigrinya language. Our manually constructed corpus comprises over 340K words tagged for NER, with over 118K of the tokens also having part-of-speech (POS) tags, annotated with 12 distinct classes of entities, represented using several types of tagging schemes. We conducted extensive experiments covering convolutional neural networks and transformer models; the highest performance achieved is 88.8% weighted F1-score. These results are especially noteworthy given the unique challenges posed by Tigrinya’s distinct grammatical structure and complex word morphologies. The system can be an essential building block for the advancement of NLP systems in Tigrinya and other related low-resourced languages and serve as a bridge for cross-referencing against higher-resourced languages.
Keywords: Tigrinya NER corpus, TiBERT, TiRoBERTa, BiLSTM-CRF
Procedia PDF Downloads 131
772 The Association of Excessive Work Stress with Job Satisfaction and Turnover Intention in Operating Room Nurses: A Cross-Sectional Study in a Metropolitan Teaching Hospital in Southern Taiwan
Authors: Chia Yu Chen, Shu Fen Wu, Chen-Fuh Lam, I-Ling Tsai, Shu Jiuan Chen, Yen Ling Liu
Abstract:
Aim: It remains undetermined whether increased work stress affects the job satisfaction and career loyalty of nursing staff in the operating room. The long-term goal of this study is to lengthen the professional life of operating room nurses by attenuating work stress and enhancing their contentment at work. Method: This was a cross-sectional, descriptive study performed in a metropolitan teaching hospital in southern Taiwan between May 2017 and July 2017. A structured self-administered questionnaire, modified from the Occupational Stress Indicator-2 (OSI-2) and Maslach Burnout Inventory (MBI) manual, was collected from the operating room nurses. The chi-square test was used to analyze the categorical data and Pearson correlation was used to analyze the association between two numerical datasets (SPSS version 20.0). Results: The response rate was 80% (80/100) and a total of 73 (73%) completed forms were eventually included in the analysis. The average scores for work stress and job satisfaction of the operating room nurses were 145.96±32.91 and 47.38±6.07, respectively. The correlation coefficients of work stress versus job satisfaction and organizational identity were r=-0.338 (p=0.003) and r=-0.354 (p=0.002), respectively. More nurses who worked rotating shifts quit working in the operating room than those who worked only day shifts (χ²=5.176, p<0.05). Nurses who reported lower job satisfaction had significantly higher turnover intention (t=3.714, p<0.01). Following multivariate regression analysis, rotating shift and low job satisfaction were identified as the two independent predictors of intention to quit working in the operating room. Conclusion: Our study clearly demonstrates that increased work stress significantly attenuates job satisfaction and organizational identity. Rotating shift is associated with higher work stress, lower job satisfaction, and higher turnover intention, which is consistent with previous surveys carried out in the department of medical technology. Therefore, improvement of working quality in the operating rooms is essential to increase the retention intention of the well-trained nursing staff. Further investigation into types of work shifts and other strategies for attenuating stress in the workplace is currently being undertaken in order to improve job satisfaction and to decrease turnover intention in the operating room.
Keywords: rotating shift, work stress, job satisfaction, turnover intention
Procedia PDF Downloads 197
771 Multilabel Classification with Neural Network Ensemble Method
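The two tests named in the method (chi-square for categorical data, Pearson correlation for two numerical variables) can be sketched in a few lines. The study itself used SPSS; the functions below are a plain-Python illustration, and every number in the example is made up and does not come from the study. The chi-square statistic is computed without a continuity correction, which is an assumption about the variant used.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic (no continuity correction) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length samples."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical counts: rows = shift type, columns = quit / stayed.
shift_by_outcome = [[10, 20], [20, 10]]
chi2 = chi_square_2x2(shift_by_outcome)

# Hypothetical paired scores: stress rises while satisfaction falls.
stress = [1.0, 2.0, 3.0, 4.0]
satisfaction = [8.0, 6.0, 4.0, 2.0]
r = pearson_r(stress, satisfaction)
```

A negative `r`, as in the reported results, indicates that higher values of one variable go with lower values of the other; the p-values quoted in the abstract would additionally require the relevant reference distributions.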
Authors: Sezin Ekşioğlu
Abstract:
Multilabel classification is of great importance for several applications, and it is also a challenging research topic. It is a kind of supervised learning that contains binary targets. The difference between multilabel and binary classification is that multilabel classification problems have more than one class. Features can belong to one class or many classes. There exists a wide range of applications for multilabel prediction, such as image labeling, text categorization, and gene functionality. Even though features are classified into many classes, they may not always be properly classified. There are many ensemble methods for classification. However, most researchers have been concerned with better multilabel methods. In particular, few focus on both the efficiency of classifiers and pairwise relationships at the same time in order to implement better multilabel classification. In this paper, we worked on modified ensemble methods that benefit from k-Nearest Neighbors and a neural network structure to address these issues and to obtain better results from multilabel classification. Publicly available datasets (yeast, emotion, scene and birds) are used to demonstrate the efficiency of the developed algorithm, and the technique is measured by accuracy, F1 score and Hamming loss metrics. Our algorithm improves on benchmarks for each dataset with different metrics.
Keywords: multilabel, classification, neural network, KNN
Procedia PDF Downloads 155
770 Evaluation of NASA POWER and CRU Precipitation and Temperature Datasets over a Desert-prone Yobe River Basin: An Investigation of the Impact of Drought in the North-East Arid Zone of Nigeria
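Of the metrics listed, Hamming loss is the one specific to multilabel problems: it is the fraction of individual label slots that are predicted incorrectly, averaged over all samples and labels. The sketch below is a minimal plain-Python version; the function name and the example label matrices are assumptions for illustration and are unrelated to the datasets in the abstract.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of label slots predicted incorrectly, over all samples and labels."""
    n_labels = len(y_true[0])
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / (len(y_true) * n_labels)


# Hypothetical binary indicator matrices: 2 samples, 3 labels each.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 1], [0, 1, 1]]
loss = hamming_loss(y_true, y_pred)
```

Lower is better; a loss of 0 means every label of every sample was predicted correctly.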
Authors: Yusuf Dawa Sidi, Abdulrahman Bulama Bizi
Abstract:
The most dependable and precise source of climate data is often gauge observation. However, long-term records of gauge observations are unavailable in many regions around the world. In recent years, a number of gridded climate datasets with high spatial and temporal resolutions have emerged as viable alternatives to gauge-based measurements. However, it is crucial to thoroughly evaluate their performance prior to utilising them in hydroclimatic applications. Therefore, this study aims to assess the effectiveness of NASA Prediction of Worldwide Energy Resources (NASA POWER) and Climate Research Unit (CRU) datasets in accurately estimating precipitation and temperature patterns within the dry region of Nigeria from 1990 to 2020. The study employs widely used statistical metrics and the Standardised Precipitation Index (SPI) to effectively capture the monthly variability of precipitation and temperature and inter-annual anomalies in rainfall. The findings suggest that CRU exhibited superior performance compared to NASA POWER in terms of monthly precipitation and minimum and maximum temperatures, demonstrating a high correlation and much lower error values for both RMSE and MAE. Nevertheless, NASA POWER has exhibited a moderate agreement with gauge observations in accurately replicating monthly precipitation. The analysis of the SPI reveals that the CRU product exhibits superior performance compared to NASA POWER in accurately reflecting inter-annual variations in rainfall anomalies. The findings of this study indicate that the CRU gridded product is often regarded as the most favourable gridded precipitation product.
Keywords: CRU, climate change, precipitation, SPI, temperature
Procedia PDF Downloads 89
769 Improving Fake News Detection Using K-means and Support Vector Machine Approaches
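The two error metrics mentioned, RMSE and MAE, compare a gridded estimate against gauge observations value by value. A minimal sketch follows; the function names and the monthly values are assumptions for illustration and are not taken from the study.

```python
import math


def rmse(observed, estimated):
    """Root mean square error between paired observations and estimates."""
    return math.sqrt(
        sum((o - e) ** 2 for o, e in zip(observed, estimated)) / len(observed)
    )


def mae(observed, estimated):
    """Mean absolute error between paired observations and estimates."""
    return sum(abs(o - e) for o, e in zip(observed, estimated)) / len(observed)


# Hypothetical monthly precipitation totals (mm), for illustration only.
gauge = [10.0, 0.0, 25.0, 5.0]
gridded = [12.0, 1.0, 22.0, 5.0]
gridded_rmse = rmse(gauge, gridded)
gridded_mae = mae(gauge, gridded)
```

RMSE penalises large individual errors more heavily than MAE, so reporting both gives a sense of whether a product's errors are evenly spread or dominated by a few months.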
Authors: Kasra Majbouri Yazdi, Adel Majbouri Yazdi, Saeid Khodayi, Jingyu Hou, Wanlei Zhou, Saeed Saedy
Abstract:
Fake news and false information are big challenges for all types of media, especially social media. There is a lot of false information, fake likes, views and duplicated accounts, as large social networks such as Facebook and Twitter have admitted. Most information appearing on social media is doubtful and in some cases misleading. Such content needs to be detected as soon as possible to avoid a negative impact on society. The dimensions of fake news datasets are growing rapidly, so to obtain better detection of false information with less computation time and complexity, the dimensions need to be reduced. One of the best techniques for reducing data size is feature selection. The aim of this technique is to choose a feature subset from the original set to improve the classification performance. In this paper, a feature selection method is proposed that integrates K-means clustering and Support Vector Machine (SVM) approaches and works in four steps. First, the similarities between all features are calculated. Then, features are divided into several clusters. Next, the final feature set is selected from all clusters, and finally, fake news is classified based on the final feature subset using the SVM method. The proposed method was evaluated by comparing its performance with other state-of-the-art methods on several specific benchmark datasets, and the outcome showed a better classification of false information for our work. The detection performance was improved in two aspects. On the one hand, the detection runtime decreased, and on the other hand, the classification accuracy increased because of the elimination of redundant features and the reduction of dataset dimensions.
Keywords: clustering, fake news detection, feature selection, machine learning, social media, support vector machine
Procedia PDF Downloads 176
768 A Novel Heuristic for Analysis of Large Datasets by Selecting Wrapper-Based Features
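The first three of the four steps (compare features, cluster them, pick a subset from the clusters) can be sketched as follows. This is a simplified illustration and not the authors' method: it assumes Euclidean distance between feature columns as the similarity measure, a deterministic farthest-first initialisation, and "closest to the cluster centroid" as the selection rule, none of which are stated in the abstract. The final SVM classification step is left out because it would need an external library; the function names and data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(vals) / len(vals) for vals in zip(*vectors)]


def select_features(columns, k, iterations=20):
    """Cluster feature columns with k-means and keep one representative per cluster."""
    # Farthest-first initialisation keeps this sketch deterministic.
    centroids = [list(columns[0])]
    while len(centroids) < k:
        far = max(columns, key=lambda col: min(sq_dist(col, c) for c in centroids))
        centroids.append(list(far))
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assign every feature column to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for idx, col in enumerate(columns):
            nearest = min(range(k), key=lambda c: sq_dist(col, centroids[c]))
            clusters[nearest].append(idx)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        centroids = [
            mean_vector([columns[i] for i in members]) if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
    # The representative of a cluster is the feature closest to its centroid.
    return sorted(
        min(members, key=lambda i: sq_dist(columns[i], centroids[c]))
        for c, members in enumerate(clusters)
        if members
    )


# Hypothetical data: each inner list is one feature observed over four samples.
columns = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 3.0, 4.2],  # nearly redundant with feature 0
    [4.0, 3.0, 2.0, 1.0],
    [4.2, 2.9, 2.1, 1.0],  # nearly redundant with feature 2
]
selected = select_features(columns, k=2)
```

Here the four features collapse to two, one from each redundant pair; a classifier would then be trained on the selected columns only.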
Authors: Bushra Zafar, Usman Qamar
Abstract:
Large sample sizes and high dimensionality reduce the effectiveness of conventional data mining methodologies. Data mining techniques are important tools for collecting knowledgeable information from a variety of databases; they provide supervised learning in the form of classification to design models that describe vital data classes, while the structure of the classifier is based on the class attribute. Classification efficiency and accuracy are often influenced to a great extent by noisy and undesirable features in real application data sets. The inherent nature of a data set greatly masks its quality analysis and leaves us with quite few practical approaches to use. To our knowledge for the first time, we present a new approach for investigating the structure and quality of datasets by providing a targeted analysis of the localization of noisy and irrelevant features of data sets. Machine learning relies on feature selection as a pre-processing step, which allows us to select a few features from a number of features as a subset by reducing the space according to a certain evaluation criterion. The primary objective of this study is to trim down the scope of the given data sample by searching for a small set of important features which may result in good classification performance. For this purpose, a heuristic for wrapper-based feature selection using a genetic algorithm is used, together with an external classifier for discriminative feature selection. A feature is selected based on its number of occurrences in the chosen chromosomes. A sample dataset has been used to demonstrate the proposed idea effectively. The proposed method achieves an improved average accuracy of about 95% across different datasets. Experimental results illustrate that the proposed algorithm increases the accuracy of prediction of different diseases.
Keywords: data mining, genetic algorithm, KNN algorithms, wrapper based feature selection
Procedia PDF Downloads 316
767 Quality Assurance for the Climate Data Store
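The selection rule described, choosing a feature by how often it occurs in the chosen chromosomes, can be illustrated with a small counting sketch. It assumes that each chromosome is a bit mask over the features (1 means the feature is included) and that a simple count threshold decides selection; the genetic search and the external classifier that would produce the chromosomes are not shown, and the function name, threshold, and example masks are hypothetical.

```python
from collections import Counter


def select_by_occurrence(chromosomes, min_count):
    """Keep the features that appear in at least min_count of the chosen chromosomes."""
    counts = Counter()
    for chromosome in chromosomes:
        # Count the index of every feature switched on in this chromosome.
        counts.update(i for i, bit in enumerate(chromosome) if bit)
    return sorted(i for i, n in counts.items() if n >= min_count)


# Hypothetical best chromosomes from several runs, over four features.
chosen = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
]
frequent_features = select_by_occurrence(chosen, min_count=2)
```

Features that survive in many well-performing chromosomes are kept, while features that appear only occasionally are treated as noisy or irrelevant.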
Authors: Judith Klostermann, Miguel Segura, Wilma Jans, Dragana Bojovic, Isadora Christel Jimenez, Francisco Doblas-Reyees, Judit Snethlage
Abstract:
The Climate Data Store (CDS), developed by the Copernicus Climate Change Service (C3S) implemented by the European Centre for Medium-Range Weather Forecasts (ECMWF) on behalf of the European Union, is intended to become a key instrument for exploring climate data. The CDS contains both raw and processed data to provide information to the users about the past, present and future climate of the earth. It allows for easy and free access to climate data and indicators, presenting an important asset for scientists and stakeholders on the path to achieving a more sustainable future. The C3S Evaluation and Quality Control (EQC) is assessing the quality of the CDS by undertaking a comprehensive user requirement assessment to measure the users’ satisfaction. Recommendations will be developed for the improvement and expansion of the CDS datasets and products. User requirements will be identified on the fitness of the datasets, the toolbox, and the overall CDS service. The EQC function of the CDS will help C3S to make the service more robust: integrated by validated data that follows high-quality standards while being user-friendly. This function will be developed in close cooperation with the users of the service. Through their feedback, suggestions, and contributions, the CDS can become more accessible and meet the requirements of a diverse range of users. Stakeholders and their active engagement are thus an important aspect of CDS development. This will be achieved through direct interactions with users, such as meetings, interviews or workshops, as well as different feedback mechanisms like surveys or helpdesk services at the CDS. The results provided by the users will be categorized as a function of CDS products so that their specific interests will be monitored and linked to the right product. Through this procedure, we will identify the requirements and criteria for data and products in order to build the corresponding recommendations for the improvement and expansion of the CDS datasets and products.
Keywords: climate data store, Copernicus, quality, user engagement
Procedia PDF Downloads 146
766 An Unsupervised Domain-Knowledge Discovery Framework for Fake News Detection
Authors: Yulan Wu
Abstract:
With the rapid development of social media, the issue of fake news has gained considerable prominence, drawing the attention of both the public and governments. The widespread dissemination of false information poses a tangible threat across multiple domains of society, including politics, economy, and health. However, much research has concentrated on supervised training models within specific domains, and their effectiveness diminishes when applied to identify fake news across multiple domains. To solve this problem, some approaches based on domain labels have been proposed. By segmenting news to their specific area in advance, judges in the corresponding field may be more accurate on fake news. However, these approaches disregard the fact that news records can pertain to multiple domains, resulting in a significant loss of valuable information. In addition, the datasets used for training must all be domain-labeled, which creates unnecessary complexity. To solve these problems, an unsupervised domain knowledge discovery framework for fake news detection is proposed. Firstly, to effectively retain the multidomain knowledge of the text, a low-dimensional vector is generated for each news text to capture domain embeddings. Subsequently, a feature extraction module utilizing the domain embeddings discovered without supervision is used to extract the comprehensive features of news. Finally, a classifier is employed to determine the authenticity of the news. To verify the proposed framework, a test is conducted on existing widely used datasets, and the experimental results demonstrate that this method is able to improve the detection performance for fake news across multiple domains. Moreover, even in datasets that lack domain labels, this method can still effectively transfer domain knowledge, which can reduce the time consumed by tagging without sacrificing the detection accuracy.
Keywords: fake news, deep learning, natural language processing, multiple domains
Procedia PDF Downloads 97
765 The Investigation of Work Stress and Burnout in Nurse Anesthetists: A Cross-Sectional Study
Authors: Yen Ling Liu, Shu-Fen Wu, Chen-Fuh Lam, I-Ling Tsai, Chia-Yu Chen
Abstract:
Purpose: Nurse anesthetists confront extraordinarily high job stress in their daily practice, deriving from fast-track anesthesia care, the risk of perioperative complications, routine rotating shifts, teaching programs and interactions with the surgical team in the operating room. This study investigated the influence of work stress on the burnout and turnover intention of nurse anesthetists in a regional general hospital in Southern Taiwan. Methods: This was a descriptive correlational study carried out in 66 full-time nurse anesthetists. Data were collected from March 2017 to June 2017 by in-person interview, and a self-administered structured questionnaire was completed by the interviewee. Outcome measurements included the Practice Environment Scale of the Nursing Work Index (PES-NWI), Maslach Burnout Inventory (MBI) and nursing staff turnover intention. Numerical data were analyzed by descriptive statistics, independent t test, or one-way ANOVA. Categorical data were compared using the chi-square test (χ²). Datasets were computed with Pearson product-moment correlation and linear regression. Data were analyzed using SPSS 20.0 software. Results: The average score for job burnout was 68.79±16.67 (out of 100). The three major components of burnout were emotional depletion (mean score of 26.32), depersonalization (mean score of 13.65), and personal accomplishment (mean score of 24.48). These average scores suggested that these nurse anesthetists were at high risk of burnout and inversely correlated with turnover intention (t = -4.048, P < 0.05). Using a linear regression model, emotional exhaustion and depersonalization were the two independent factors that predicted turnover intention in the nurse anesthetists (19.1% of total variance). Conclusion/Implications for Practice: The study identifies that the high risk of job burnout in nurse anesthetists is not simply derived from physical overload, but most likely results from additional emotional and psychological stress. The occurrence of job burnout may affect the quality of nursing work and also influence family harmony, which in turn may increase the turnover rate. A multimodal approach is warranted to reduce work stress and job burnout in nurse anesthetists and to enhance their willingness to contribute to anesthesia care.
Keywords: anesthesia nurses, burnout, job, turnover intention
Procedia PDF Downloads 296
764 Optimizing Machine Learning Through Python Based Image Processing Techniques
Authors: Srinidhi. A, Naveed Ahmed, Twinkle Hareendran, Vriksha Prakash
Abstract:
This work reviews some of the advanced image processing techniques for deep learning applications. Object detection by template matching, image denoising, edge detection, and super-resolution modelling are but a few of the tasks. The paper examines them in great detail, given that such tasks are crucial preprocessing steps that increase the quality and usability of image datasets in subsequent deep learning tasks. We review some of the methods for the assessment of image quality, more specifically sharpness, which is crucial to ensure robust performance of models. Further, we discuss the development of deep learning models specific to facial emotion detection, age classification, and gender classification, which essentially includes the preprocessing techniques interrelated with model performance. Conclusions from this study pinpoint the best practices in the preparation of image datasets, targeting the best trade-off between computational efficiency and retaining important image features critical for effective training of deep learning models.
Keywords: image processing, machine learning applications, template matching, emotion detection
Procedia PDF Downloads 16
763 Privacy Preservation Concerns and Information Disclosure on Social Networks: An Ongoing Research
Authors: Aria Teimourzadeh, Marc Favier, Samaneh Kakavand
Abstract:
The emergence of social networks has revolutionized the exchange of information. Every behavior on these platforms contributes to the generation of data known as social network data that are processed, stored and published by the social network service providers. Hence, it is vital to investigate the role of these platforms in user data by considering the privacy measures, especially when we observe the increased number of individuals and organizations engaging with the current virtual platforms without being aware that the data related to their positioning, connections and behavior is uncovered and used by third parties. Performing analytics on social network datasets may result in the disclosure of confidential information about the individuals or organizations which are the members of these virtual environments. Analyzing separate datasets can reveal private information about relationships, interests and more, especially when the datasets are analyzed jointly. Intentional breaches of privacy are the result of such analysis. Addressing these privacy concerns requires an understanding of the nature of the data being accumulated and relevant data privacy regulations, as well as motivations for disclosure of personal information on social network platforms. Some significant points about how users' online information is controlled by the influence of social factors, and to what extent users are concerned about future use of their personal information by organizations, are highlighted in this paper. Firstly, this research presents a short literature review about the structure of a network and the concept of privacy in Online Social Networks. Secondly, the factors of user behavior related to privacy protection and self-disclosure on these virtual communities are presented. In other words, we seek to demonstrate the impact of identified variables on user information disclosure that could be taken into account to explain the privacy preservation of individuals on social networking platforms. Thirdly, a few research directions are discussed to address this topic for new researchers.
Keywords: information disclosure, privacy measures, privacy preservation, social network analysis, user experience
Procedia PDF Downloads 281
762 Domain Adaptive Dense Retrieval with Query Generation
Authors: Rui Yin, Haojie Wang, Xun Li
Abstract:
Recently, mainstream dense retrieval methods have obtained state-of-the-art results on some datasets and tasks. However, they require large amounts of training data, which is not available in most domains. The severe performance degradation of dense retrievers on new data domains has limited the use of dense retrieval methods to only a few domains with large training datasets. In this paper, we propose an unsupervised domain-adaptive approach based on query generation. First, a generative model is used to generate relevant queries for each passage in the target corpus, and then, the generated queries are used for mining negative passages. Finally, the query-passage pairs are labeled with a cross-encoder and used to train a domain-adapted dense retriever. We also explore contrastive learning as a method for training domain-adapted dense retrievers and show that it leads to strong performance in various retrieval settings. Experiments show that our approach is more robust than previous methods in target domains that require less unlabeled data.
Keywords: dense retrieval, query generation, contrastive learning, unsupervised training
Procedia PDF Downloads 104
761 Bag of Local Features for Person Re-Identification on Large-Scale Datasets
Authors: Yixiu Liu, Yunzhou Zhang, Jianning Chi, Hao Chu, Rui Zheng, Libo Sun, Guanghao Chen, Fangtong Zhou
Abstract:
In the last few years, large-scale person re-identification has attracted a lot of attention from video surveillance since it has a potential application prospect in public safety management. However, it is still a challenging job considering the variation in human pose, the changing illumination conditions and the lack of paired samples. Although the accuracy has been significantly improved, the data dependence of the sample training is serious. To tackle this problem, a new strategy for designing the feature representation is proposed based on the bag of visual words (BoVW) model, which has been widely used in the field of image retrieval. The local features are extracted, and a more discriminative feature representation is obtained by cross-view dictionary learning (CDL); the assignment map is then obtained through k-means clustering. Finally, the BoVW histograms are formed, which encode the images with the statistics of the feature classes in the assignment map. Experiments conducted on the CUHK03, Market1501 and MARS datasets show that the proposed method performs favorably against existing approaches.
Keywords: bag of visual words, cross-view dictionary learning, person re-identification, reranking
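The final step described above, turning local features into a BoVW histogram, can be sketched as follows. This is a minimal illustration only: it assumes a visual vocabulary (cluster centres) is already available, uses squared Euclidean distance for the assignment, and omits the cross-view dictionary learning stage entirely. The function names `nearest_word` and `bovw_histogram` are hypothetical and do not come from the paper.

```python
def nearest_word(descriptor, vocabulary):
    """Return the index of the vocabulary entry closest to the descriptor."""
    best_index, best_dist = 0, float("inf")
    for index, word in enumerate(vocabulary):
        # Squared Euclidean distance (an assumption; the paper does not say).
        dist = sum((d - w) ** 2 for d, w in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def bovw_histogram(descriptors, vocabulary):
    """Count how many descriptors fall on each visual word, then normalise."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    total = sum(counts)
    # Return the raw zero counts when no descriptors were supplied.
    return [c / total for c in counts] if total else counts
```

In practice the vocabulary would come from k-means over training descriptors; here it is simply passed in so that the assignment and counting logic stays visible.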
Procedia PDF Downloads 195
760 Real-Time Big-Data Warehouse a Next-Generation Enterprise Data Warehouse and Analysis Framework
Authors: Abbas Raza Ali
Abstract:
Big Data technology is gradually becoming a dire need of large enterprises. These enterprises are generating massively large amounts of off-line and streaming data in both structured and unstructured formats on a daily basis. It is a challenging task to effectively extract useful insights from large-scale datasets, and sometimes it even becomes a technology constraint to manage transactional data history of more than a few months. This paper presents a framework to efficiently manage massively large and complex datasets. The framework has been tested on a communication service provider producing massively large complex streaming data in binary format. The communication industry is bound by the regulators to manage the history of their subscribers’ call records, where every action of a subscriber generates a record. Also, managing and analyzing transactional data allows service providers to better understand their customers’ behavior; for example, deep packet inspection requires transactional internet usage data to explain the internet usage behaviour of the subscribers. However, current relational database systems limit service providers to only maintaining history at the semantic level, which is aggregated at the subscriber level. The framework addresses these challenges by leveraging Big Data technology, which optimally manages and allows deep analysis of complex datasets. The framework has been applied to offload the existing Intelligent Network Mediation and relational Data Warehouse of the service provider onto Big Data. The service provider has a subscriber base of 50+ million with yearly growth of 7-10%. The end-to-end process takes not more than 10 minutes and involves binary to ASCII decoding of call detail records, stitching of all the interrogations against a call (transformations) and aggregations of all the call records of a subscriber.
Keywords: big data, communication service providers, enterprise data warehouse, stream computing, Telco IN Mediation
Procedia PDF Downloads 175
759 Learning from Small Amount of Medical Data with Noisy Labels: A Meta-Learning Approach
Authors: Gorkem Algan, Ilkay Ulusoy, Saban Gonul, Banu Turgut, Berker Bakbak
Abstract:
Computer vision systems recently made a big leap thanks to deep neural networks. However, these systems require correctly labeled large datasets in order to be trained properly, which are very difficult to obtain for medical applications. Two main reasons for label noise in medical applications are the high complexity of the data and conflicting opinions of experts. Moreover, medical imaging datasets are commonly tiny, which makes each data point very important in learning. As a result, if not handled properly, label noise significantly degrades the performance. Therefore, a label-noise-robust learning algorithm that makes use of the meta-learning paradigm is proposed in this article. The proposed solution is tested on a retinopathy of prematurity (ROP) dataset with a very high label noise of 68%. Results show that the proposed algorithm significantly improves the classification algorithm's performance in the presence of noisy labels.
Keywords: deep learning, label noise, robust learning, meta-learning, retinopathy of prematurity
Procedia PDF Downloads 161
758 SiamMask++: More Accurate Object Tracking through Layer Wise Aggregation in Visual Object Tracking
Authors: Hyunbin Choi, Jihyeon Noh, Changwon Lim
Abstract:
In this paper, we propose SiamMask++, an architecture that performs layer-wise aggregation and depth-wise cross-correlation, and introduce a multi-RPN module and a multi-MASK module to improve EAO (Expected Average Overlap), a representative performance evaluation metric for the Visual Object Tracking (VOT) challenge. The proposed architecture, SiamMask++, has two versions, namely, bi_SiamMask++, which runs in real time (56 fps) on systems equipped with GPUs (Titan XP), and rf_SiamMask++, which combines mask refinement modules for EAO improvements. Tests are performed on VOT2016, VOT2018 and VOT2019, the representative datasets of Visual Object Tracking tasks labeled as rotated bounding boxes. SiamMask++ performs better than SiamMask on all three datasets tested. On the VOT2018 dataset in particular, SiamMask++ achieves 62.6% accuracy, 26.2% robustness and 39.8% EAO. Compared to SiamMask, this is an improvement of 4.18%, 37.17% and 23.99%, respectively. In addition, we conduct an in-depth experimental analysis of how much the introduction of features extracted from the backbone and of the multi modules affects the performance of our model in the VOT task.
Keywords: visual object tracking, video, deep learning, layer wise aggregation, Siamese network
Procedia PDF Downloads 160
757 Systematic Evaluation of Convolutional Neural Network on Land Cover Classification from Remotely Sensed Images
Authors: Eiman Kattan, Hong Wei
Abstract:
In using a Convolutional Neural Network (CNN) for classification, there is a set of hyperparameters available for configuration. This study aims to evaluate the impact of a range of parameters in a CNN architecture, namely AlexNet, on land cover classification based on four remotely sensed datasets. The evaluation tests the influence of a set of hyperparameters on the classification performance. The parameters concerned are epoch values, batch size, and convolutional filter size against input image size. Thus, a set of experiments was conducted to specify the effectiveness of the selected parameters using two implementation approaches, named pretrained and fine-tuned. We first explore the number of epochs under several selected batch size values (32, 64, 128 and 200). The impact of the kernel size of convolutional filters (1, 3, 5, 7, 10, 15, 20, 25 and 30) was evaluated against the image size under testing (64, 96, 128, 180 and 224), which gave us insight into the relationship between the size of convolutional filters and image size. To generalise the validation, four remote sensing datasets, AID, RSD, UCMerced and RSCCN, which have different land covers and are publicly available, were used in the experiments. These datasets have a wide diversity of input data, such as number of classes, amount of labelled data, and texture patterns. A specifically designed interactive deep learning GPU training platform for image classification (NVIDIA DIGITS) was employed in the experiments. It has shown efficiency in both training and testing. The results have shown that increasing the number of epochs leads to a higher accuracy rate, as expected. However, the convergence state is highly related to the dataset. The batch size evaluation has shown that a larger batch size slightly decreases the classification accuracy compared to a small batch size.
For example, selecting the value 32 as the batch size on the RSCCN dataset achieves an accuracy rate of 90.34% at the 11th epoch, while decreasing the epoch value to one makes the accuracy rate drop to 74%. At the other extreme, increasing the batch size to 200 decreases the accuracy rate to 86.5% at the 11th epoch, and to 63% when using one epoch only. On the other hand, the choice of kernel size is only loosely related to the dataset. From a practical point of view, the filter size 20 produces 70.4286%. The last experiment performed, on image size, shows a dependency in the accuracy improvement. However, the performance gain was found to be expensive. The presented conclusions open opportunities toward better classification performance in various applications such as planetary remote sensing.
Keywords: CNNs, hyperparameters, remote sensing, land cover, land use
Procedia PDF Downloads 169
756 Hyperspectral Band Selection for Oil Spill Detection Using Deep Neural Network
Authors: Asmau Mukhtar Ahmed, Olga Duran
Abstract:
Hydrocarbon (HC) spills constitute a significant problem that causes great concern to the environment. With the latest technology (hyperspectral images) and state-of-the-art techniques (image processing tools), hydrocarbon spills can easily be detected at an early stage to mitigate the effects caused by such a menace. In this study, a controlled laboratory experiment was used, and clay soil was mixed and homogenized with different hydrocarbon types (diesel, bio-diesel, and petrol). The different mixtures were scanned with a HYSPEX hyperspectral camera under constant illumination to generate the hyperspectral datasets used for this experiment. So far, the Short Wave Infrared Region (SWIR) has been exploited in detecting HC spills with excellent accuracy. However, the Near-Infrared Region (NIR) is somewhat unexplored with regard to HC contamination and how it affects the spectrum of soils. In this study, a Deep Neural Network (DNN) was applied to the controlled datasets to detect and quantify the amount of HC spills in soils in the Near-Infrared Region. The initial results are extremely encouraging because they indicate that the DNN was able to identify features of HC in the Near-Infrared Region with a good level of accuracy.
Keywords: hydrocarbon, Deep Neural Network, short wave infrared region, near-infrared region, hyperspectral image
Procedia PDF Downloads 114
755 SPARK: An Open-Source Knowledge Discovery Platform That Leverages Non-Relational Databases and Massively Parallel Computational Power for Heterogeneous Genomic Datasets
Authors: Thilina Ranaweera, Enes Makalic, John L. Hopper, Adrian Bickerstaffe
Abstract:
Data are the primary asset of biomedical researchers, and the engine for both discovery and research translation. As the volume and complexity of research datasets increase, especially with new technologies such as large single nucleotide polymorphism (SNP) chips, so too does the requirement for software to manage, process and analyze the data. Researchers often need to execute complicated queries and conduct complex analyses of large-scale datasets. Existing tools to analyze such data, and other types of high-dimensional data, unfortunately suffer from one or more major problems. They typically require a high level of computing expertise, are too simplistic (i.e., do not fit realistic models that allow for complex interactions), are limited by computing power, do not exploit the computing power of large-scale parallel architectures (e.g. supercomputers, GPU clusters etc.), or are limited in the types of analysis available, compounded by the fact that integrating new analysis methods is not straightforward. Solutions to these problems, such as those developed and implemented on parallel architectures, are currently available to only a relatively small portion of medical researchers with access and know-how. The past decade has seen a rapid expansion of data management systems for the medical domain. Much attention has been given to systems that manage phenotype datasets generated by medical studies. The introduction of heterogeneous genomic data for research subjects that reside in these systems has highlighted the need for substantial improvements in software architecture. To address this problem, we have developed SPARK, an enabling and translational system for medical research, leveraging existing high performance computing resources, and analysis techniques currently available or being developed. It builds these into The Ark, an open-source web-based system designed to manage medical data.
SPARK provides a next-generation biomedical data management solution that is based upon a novel Micro-Service architecture and Big Data technologies. The system serves to demonstrate the applicability of Micro-Service architectures for the development of high performance computing applications. When applied to high-dimensional medical datasets such as genomic data, relational data management approaches with normalized data structures suffer from unfeasibly high execution times for basic operations such as insert (i.e. importing a GWAS dataset) and the queries that are typical of the genomics research domain. SPARK resolves these problems by incorporating non-relational NoSQL databases that have been driven by the emergence of Big Data. SPARK provides researchers across the world with user-friendly access to state-of-the-art data management and analysis tools while eliminating the need for high-level informatics and programming skills. The system will benefit health and medical research by eliminating the burden of large-scale data management, querying, cleaning, and analysis. SPARK represents a major advancement in genome research technologies, vastly reducing the burden of working with genomic datasets, and enabling cutting edge analysis approaches that have previously been out of reach for many medical researchers.
Keywords: biomedical research, genomics, information systems, software
Procedia PDF Downloads 270
754 Shifted Window Based Self-Attention via Swin Transformer for Zero-Shot Learning
Authors: Yasaswi Palagummi, Sareh Rowlands
Abstract:
Generalised Zero-Shot Learning, often known as GZSL, is an advanced variant of zero-shot learning in which the test samples may belong to either seen or unseen classes. GZSL methods typically have a bias towards the seen classes because they learn a model to perform recognition for both the seen and unseen classes using data samples from the seen classes. This frequently leads to the misclassification of data from the unseen classes into the seen classes, making the task of GZSL more challenging. In this work, to solve the GZSL problem, we propose an approach leveraging the Shifted Window based Self-Attention in the Swin Transformer (Swin-GZSL) to work in the inductive GZSL problem setting. We run experiments on three popular benchmark datasets: CUB, SUN, and AWA2, which are specifically used for ZSL and its other variants. The results show that our model based on Swin Transformer has achieved a state-of-the-art harmonic mean for two datasets, AWA2 and SUN, and a near-state-of-the-art one for the other dataset, CUB. More importantly, this technique has a linear computational complexity, which reduces training time significantly. We have also observed less bias than most of the existing GZSL models.
Keywords: generalised, zero-shot learning, inductive learning, shifted-window attention, Swin transformer, vision transformer
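The harmonic mean reported for GZSL is conventionally computed from the accuracy on seen classes and the accuracy on unseen classes, so that a model biased towards the seen classes is penalised. The sketch below assumes that convention; the function name and the sample values are illustrative and are not results from the paper.

```python
def harmonic_mean(seen_acc, unseen_acc):
    """Harmonic mean of seen-class and unseen-class accuracy.

    Returns 0.0 when both accuracies are zero to avoid division by zero.
    """
    if seen_acc + unseen_acc == 0:
        return 0.0
    return 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc)


# Made-up accuracies: a model that is strong on seen classes but weak on
# unseen ones scores well below the arithmetic mean of 0.6.
biased_score = harmonic_mean(0.8, 0.4)
```

Because the harmonic mean is dominated by the smaller of the two values, it drops to zero whenever either accuracy is zero, which is why it is preferred over a plain average in this setting.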
Procedia PDF Downloads 71
753 Enhancing Spatial Interpolation: A Multi-Layer Inverse Distance Weighting Model for Complex Regression and Classification Tasks in Spatial Data Analysis
Authors: Yakin Hajlaoui, Richard Labib, Jean-François Plante, Michel Gamache
Abstract:
This study introduces the Multi-Layer Inverse Distance Weighting Model (ML-IDW), inspired by the mathematical formulation of both multi-layer neural networks (ML-NNs) and the Inverse Distance Weighting model (IDW). ML-IDW leverages ML-NNs' processing capabilities, characterized by compositions of learnable non-linear functions applied to input features, and incorporates IDW's ability to learn anisotropic spatial dependencies, presenting a promising solution for nonlinear spatial interpolation and learning from complex spatial data. We employ gradient descent and backpropagation to train ML-IDW, comparing its performance against conventional spatial interpolation models such as Kriging and standard IDW on regression and classification tasks using simulated spatial datasets of varying complexity. The results highlight the efficacy of ML-IDW, particularly in handling complex spatial datasets, exhibiting lower mean square error in regression and higher F1 score in classification.
Keywords: deep learning, multi-layer neural networks, gradient descent, spatial interpolation, inverse distance weighting
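For orientation, the standard IDW baseline mentioned above can be sketched in a few lines. This is the classical isotropic form with a fixed distance exponent, not the multi-layer, learnable ML-IDW model of the paper; the function name `idw_predict` and the default `power=2.0` are assumptions made for illustration.

```python
import math


def idw_predict(known_points, query, power=2.0):
    """Classical inverse distance weighting at one 2-D query location.

    known_points: list of ((x, y), value) pairs with observed values.
    query: (x, y) location to interpolate.
    power: distance exponent (2.0 is a common choice, assumed here).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in known_points:
        dist = math.hypot(query[0] - x, query[1] - y)
        if dist == 0.0:
            # The query coincides with a sample, so return its value directly.
            return value
        weight = 1.0 / dist ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total
```

A query placed midway between two samples receives equal weights and therefore the average of their values; moving it closer to one sample pulls the prediction towards that sample's value.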
Procedia PDF Downloads 52
752 Sorting Maize Haploids from Hybrids Using Single-Kernel Near-Infrared Spectroscopy
Authors: Paul R Armstrong
Abstract:
Doubled haploids (DHs) have become an important breeding tool for creating maize inbred lines, although several bottlenecks in the DH production process limit wider development, application, and adoption of the technique. DH kernels are typically sorted manually and represent about 10% of the seeds in a much larger pool where the remaining 90% are hybrid siblings. This introduces time constraints on DH production, and manual sorting is often not accurate. Automated sorting based on the chemical composition of the kernel can be effective, but devices, namely NMR, have not achieved the sorting speed to be a cost-effective replacement for manual sorting. This study evaluated a single kernel near-infrared reflectance spectroscopy (skNIR) platform to accurately identify DH kernels based on oil content. The skNIR platform is a higher-throughput device, approximately 3 seeds/s, that uses spectra to predict the oil content of each kernel from maize crosses intentionally developed to create larger than normal oil differences, 1.5%-2%, between DH and hybrid kernels. Spectra from the skNIR were used to construct a partial least squares regression (PLS) model for oil and for a categorical reference model of 1 (DH kernel) or 2 (hybrid kernel), which were then used to sort several crosses to evaluate performance. Two approaches were used for sorting. The first used a general PLS model developed from all crosses to predict oil content, which was then used for sorting each induction cross; the second was the development of a specific model from a single induction cross where approximately fifty DH and one hundred hybrid kernels were used. This second approach used a categorical reference value of 1 and 2, instead of oil content, for the PLS model, and kernels selected for the calibration set were manually referenced based on traditional commercial methods using coloration of the tip cap and germ areas.
The generalized PLS oil model statistics were R2 = 0.94 and RMSE = 0.93% for kernels spanning an oil content of 2.7% to 19.3%. Sorting by this model resulted in extracting 55% to 85% of haploid kernels from the four induction crosses. Using the second method of generating a model for each cross yielded model statistics ranging from R2 = 0.96 to 0.98 and RMSE from 0.08 to 0.10. Sorting in this case resulted in 100% correct classification but required models that were cross-specific. In summary, the first method, the generalized oil model, could be used to sort a significant number of kernels from a kernel pool but was not close to the accuracy of developing a sorting model from a single cross. The penalty for the second method is that a PLS model would need to be developed for each individual cross. In conclusion, both methods could find useful application in the sorting of DH from hybrid kernels.
Keywords: NIR, haploids, maize, sorting
Procedia PDF Downloads 302
751 Comparison Of Virtual Non-Contrast To True Non-Contrast Images Using Dual Layer Spectral Computed Tomography
Authors: O’Day Luke
Abstract:
Purpose: To validate virtual non-contrast reconstructions generated from dual-layer spectral computed tomography (DL-CT) data as an alternative to the acquisition of a dedicated true non-contrast dataset during multiphase contrast studies. Material and methods: Thirty-three patients underwent a routine multiphase clinical CT examination, using Dual-Layer Spectral CT, from March to August 2021. True non-contrast (TNC) and virtual non-contrast (VNC) datasets, generated from both portal venous and arterial phase imaging, were evaluated. For every patient in both true and virtual non-contrast datasets, a region-of-interest (ROI) was defined in aorta, liver, fluid (i.e. gallbladder, urinary bladder), kidney, muscle, fat and spongious bone, resulting in 693 ROIs. Differences in attenuation for VNC and TNC images were compared, both separately and combined. Consistency between VNC reconstructions obtained from the arterial and portal venous phase was evaluated. Results: Comparison of CT density (HU) on the VNC and TNC images showed a high correlation. The mean difference between TNC and VNC images (excluding bone results) was 5.5 ± 9.1 HU and > 90% of all comparisons showed a difference of less than 15 HU. For all tissues but spongious bone, the mean absolute difference between TNC and VNC images was below 10 HU. VNC images derived from the arterial and the portal venous phase showed a good correlation in most tissue types. The aortic attenuation was, however, somewhat dependent on which dataset was used for reconstruction. Bone evaluation with VNC datasets continues to be a problem, as spectral CT algorithms are currently poor in differentiating bone and iodine. Conclusion: Given the increasing availability of DL-CT and proven accuracy of virtual non-contrast processing, VNC is a promising tool for generating additional data during routine contrast-enhanced studies.
This study shows the utility of virtual non-contrast scans as an alternative to true non-contrast studies during multiphase CT, with potential for dose reduction, without loss of diagnostic information.
Keywords: dual-layer spectral computed tomography, virtual non-contrast, true non-contrast, clinical comparison
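The agreement figures quoted in the results (mean difference, its standard deviation, mean absolute difference, and the share of comparisons within 15 HU) are simple statistics over paired ROI measurements. The sketch below shows one way they could be computed; the function name, the sign convention (VNC minus TNC) and the four sample values are assumptions for illustration, not data from the study. As a side note, the reported 693 ROIs appear consistent with 33 patients × 7 regions × 3 datasets (TNC plus the two VNC reconstructions).

```python
from statistics import mean, stdev


def agreement_summary(tnc_hu, vnc_hu, tolerance_hu=15.0):
    """Summarise agreement between paired TNC and VNC attenuation values (HU)."""
    # Sign convention (VNC minus TNC) is an assumption; the abstract does not state it.
    diffs = [vnc - tnc for tnc, vnc in zip(tnc_hu, vnc_hu)]
    abs_diffs = [abs(d) for d in diffs]
    within = sum(1 for d in abs_diffs if d < tolerance_hu) / len(diffs)
    return {
        "mean_diff": mean(diffs),
        "sd_diff": stdev(diffs),          # sample standard deviation
        "mean_abs_diff": mean(abs_diffs),
        "fraction_within": within,        # share of pairs differing by < tolerance
    }


# Made-up paired measurements for four ROIs (not study data).
example = agreement_summary([40, 55, 10, -100], [45, 50, 30, -98])
```

Reporting both the signed mean difference and the mean absolute difference is useful because positive and negative errors can cancel in the former while still being clinically relevant.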
Procedia PDF Downloads 141
750 Factors Affecting Cesarean Section among Women in Qatar Using Multiple Indicator Cluster Survey Database
Authors: Sahar Elsaleh, Ghada Farhat, Shaikha Al-Derham, Fasih Alam
Abstract:
Background: Cesarean section (CS) delivery is one of the major concerns in both developing and developed countries. The rate of CS deliveries is on the rise globally, and especially in Qatar. Many socio-economic, demographic, clinical and institutional factors play an important role in cesarean sections. This study aims to investigate factors affecting the prevalence of CS among women in Qatar using the UNICEF’s Multiple Indicator Cluster Survey (MICS) 2012 database. Methods: The study has focused on the women’s questionnaire of the MICS, which was successfully distributed to 5699 participants. Following study inclusion and exclusion criteria, a final sample of 761 women aged 19-49 years who had given birth at least once before the survey was included. A number of socio-economic, demographic, clinical and institutional factors, identified through literature review and available in the data, were considered for the analyses. Bivariate and multivariate logistic regression models, along with multi-level modeling to investigate the clustering effect, were undertaken to identify the factors that affect CS prevalence in Qatar. Results: The bivariate analyses showed that a number of categorical factors are statistically significantly associated with the dependent variable (CS). When identifying the factors from a multivariate logistic regression, the study found that only three categorical factors, ‘age of women’, ‘place at delivery’ and ‘baby weight’, appeared to be significantly affecting CS among women in Qatar. Although the MICS dataset is based on a cluster survey, an exploratory multi-level analysis did not show any clustering effect, i.e. no significant variation in results at the higher level (households), suggesting that all analyses at the lower level (individual respondent) are valid without any significant bias in results.
Conclusion: The study found a statistically significant association between the dependent variable (CS delivery) and age of women, frequency of TV watching, assistance at birth and place of birth. These results need to be interpreted cautiously; however, they can be used as an evidence base for further research on cesarean section delivery in Qatar.
Keywords: cesarean section, factors, multiple indicator cluster survey, MICS database, Qatar
Procedia PDF Downloads 116