Search results for: imbalanced datasets
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 741

Search results for: imbalanced datasets

501 Monocular Visual Odometry for Three Different View Angles by Intel Realsense T265 with the Measurement of Remote

Authors: Heru Syah Putra, Aji Tri Pamungkas Nurcahyo, Chuang-Jan Chang

Abstract:

MOIL-SDK method refers to the spatial angle that forms a view with a different perspective from the Fisheye image. Visual Odometry forms a trusted application for extending projects by tracking using image sequences. A real-time, precise, and persistent approach that is able to contribute to the work when taking datasets and generate ground truth as a reference for the estimates of each image using the FAST Algorithm method in finding Keypoints that are evaluated during the tracking process with the 5-point Algorithm with RANSAC, as well as produce accurate estimates the camera trajectory for each rotational, translational movement on the X, Y, and Z axes.

Keywords: MOIL-SDK, intel realsense T265, Fisheye image, monocular visual odometry

Procedia PDF Downloads 102
500 Benchmarking Bert-Based Low-Resource Language: Case Uzbek NLP Models

Authors: Jamshid Qodirov, Sirojiddin Komolov, Ravilov Mirahmad, Olimjon Mirzayev

Abstract:

Nowadays, natural language processing tools play a crucial role in our daily lives, including various techniques with text processing. There are very advanced models in modern languages, such as English, Russian etc. But, in some languages, such as Uzbek, the NLP models have been developed recently. Thus, there are only a few NLP models in Uzbek language. Moreover, there is no such work that could show which Uzbek NLP model behaves in different situations and when to use them. This work tries to close this gap and compares the Uzbek NLP models existing as of the time this article was written. The authors try to compare the NLP models in two different scenarios: sentiment analysis and sentence similarity, which are the implementations of the two most common problems in the industry: classification and similarity. Another outcome from this work is two datasets for classification and sentence similarity in Uzbek language that we generated ourselves and can be useful in both industry and academia as well.

Keywords: NLP, benchmak, bert, vectorization

Procedia PDF Downloads 23
499 A Novel PSO Based Decision Tree Classification

Authors: Ali Farzan

Abstract:

Classification of data objects or patterns is a major part in most of Decision making systems. One of the popular and commonly used classification methods is Decision Tree (DT). It is a hierarchical decision making system by which a binary tree is constructed and starting from root, at each node some of the classes is rejected until reaching the leaf nods. Each leaf node is a representative of one specific class. Finding the splitting criteria in each node for constructing or training the tree is a major problem. Particle Swarm Optimization (PSO) has been adopted as a metaheuristic searching method for finding the best splitting criteria. Result of evaluating the proposed method over benchmark datasets indicates the higher accuracy of the new PSO based decision tree.

Keywords: decision tree, particle swarm optimization, splitting criteria, metaheuristic

Procedia PDF Downloads 377
498 Automatic Threshold Search for Heat Map Based Feature Selection: A Cancer Dataset Analysis

Authors: Carlos Huertas, Reyes Juarez-Ramirez

Abstract:

Public health is one of the most critical issues today; therefore, there is great interest to improve technologies in the area of diseases detection. With machine learning and feature selection, it has been possible to aid the diagnosis of several diseases such as cancer. In this work, we present an extension to the Heat Map Based Feature Selection algorithm, this modification allows automatic threshold parameter selection that helps to improve the generalization performance of high dimensional data such as mass spectrometry. We have performed a comparison analysis using multiple cancer datasets and compare against the well known Recursive Feature Elimination algorithm and our original proposal, the results show improved classification performance that is very competitive against current techniques.

Keywords: biomarker discovery, cancer, feature selection, mass spectrometry

Procedia PDF Downloads 298
497 Evaluating Alternative Structures for Prefix Trees

Authors: Feras Hanandeh, Izzat Alsmadi, Muhammad M. Kwafha

Abstract:

Prefix trees or tries are data structures that are used to store data or index of data. The goal is to be able to store and retrieve data by executing queries in quick and reliable manners. In principle, the structure of the trie depends on having letters in nodes at the different levels to point to the actual words in the leafs. However, the exact structure of the trie may vary based on several aspects. In this paper, we evaluated different structures for building tries. Using datasets of words of different sizes, we evaluated the different forms of trie structures. Results showed that some characteristics may impact significantly, positively or negatively, the size and the performance of the trie. We investigated different forms and structures for the trie. Results showed that using an array of pointers in each level to represent the different alphabet letters is the best choice.

Keywords: data structures, indexing, tree structure, trie, information retrieval

Procedia PDF Downloads 421
496 Emotion-Convolutional Neural Network for Perceiving Stress from Audio Signals: A Brain Chemistry Approach

Authors: Anup Anand Deshmukh, Catherine Soladie, Renaud Seguier

Abstract:

Emotion plays a key role in many applications like healthcare, to gather patients’ emotional behavior. Unlike typical ASR (Automated Speech Recognition) problems which focus on 'what was said', it is equally important to understand 'how it was said.' There are certain emotions which are given more importance due to their effectiveness in understanding human feelings. In this paper, we propose an approach that models human stress from audio signals. The research challenge in speech emotion detection is finding the appropriate set of acoustic features corresponding to an emotion. Another difficulty lies in defining the very meaning of emotion and being able to categorize it in a precise manner. Supervised Machine Learning models, including state of the art Deep Learning classification methods, rely on the availability of clean and labelled data. One of the problems in affective computation is the limited amount of annotated data. The existing labelled emotions datasets are highly subjective to the perception of the annotator. We address the first issue of feature selection by exploiting the use of traditional MFCC (Mel-Frequency Cepstral Coefficients) features in Convolutional Neural Network. Our proposed Emo-CNN (Emotion-CNN) architecture treats speech representations in a manner similar to how CNN’s treat images in a vision problem. Our experiments show that Emo-CNN consistently and significantly outperforms the popular existing methods over multiple datasets. It achieves 90.2% categorical accuracy on the Emo-DB dataset. We claim that Emo-CNN is robust to speaker variations and environmental distortions. The proposed approach achieves 85.5% speaker-dependant categorical accuracy for SAVEE (Surrey Audio-Visual Expressed Emotion) dataset, beating the existing CNN based approach by 10.2%. To tackle the second problem of subjectivity in stress labels, we use Lovheim’s cube, which is a 3-dimensional projection of emotions. Monoamine neurotransmitters are a type of chemical messengers in the brain that transmits signals on perceiving emotions. The cube aims at explaining the relationship between these neurotransmitters and the positions of emotions in 3D space. The learnt emotion representations from the Emo-CNN are mapped to the cube using three component PCA (Principal Component Analysis) which is then used to model human stress. This proposed approach not only circumvents the need for labelled stress data but also complies with the psychological theory of emotions given by Lovheim’s cube. We believe that this work is the first step towards creating a connection between Artificial Intelligence and the chemistry of human emotions.

Keywords: deep learning, brain chemistry, emotion perception, Lovheim's cube

Procedia PDF Downloads 123
495 Changes in Geospatial Structure of Households in the Czech Republic: Findings from Population and Housing Census

Authors: Jaroslav Kraus

Abstract:

Spatial information about demographic processes are a standard part of outputs in the Czech Republic. That was also the case of Population and Housing Census which was held on 2011. This is a starting point for a follow up study devoted to two basic types of households: single person households and households of one completed family. Single person households and one family households create more than 80 percent of all households, but the share and spatial structure is in long-term changing. The increase of single households is results of long-term fertility decrease and divorce increase, but also possibility of separate living. There are regions in the Czech Republic with traditional demographic behavior, and regions like capital Prague and some others with changing pattern. Population census is based - according to international standards - on the concept of currently living population. Three types of geospatial approaches will be used for analysis: (i) firstly measures of geographic distribution, (ii) secondly mapping clusters to identify the locations of statistically significant hot spots, cold spots, spatial outliers, and similar features and (iii) finally analyzing pattern approach as a starting point for more in-depth analyses (geospatial regression) in the future will be also applied. For analysis of this type of data, number of households by types should be distinct objects. All events in a meaningful delimited study region (e.g. municipalities) will be included in an analysis. Commonly produced measures of central tendency and spread will include: identification of the location of the center of the point set (by NUTS3 level); identification of the median center and standard distance, weighted standard distance and standard deviational ellipses will be also used. Identifying that clustering exists in census households datasets does not provide a detailed picture of the nature and pattern of clustering but will be helpful to apply simple hot-spot (and cold spot) identification techniques to such datasets. Once the spatial structure of households will be determined, any particular measure of autocorrelation can be constructed by defining a way of measuring the difference between location attribute values. The most widely used measure is Moran’s I that will be applied to municipal units where numerical ratio is calculated. Local statistics arise naturally out of any of the methods for measuring spatial autocorrelation and will be applied to development of localized variants of almost any standard summary statistic. Local Moran’s I will give an indication of household data homogeneity and diversity on a municipal level.

Keywords: census, geo-demography, households, the Czech Republic

Procedia PDF Downloads 80
494 Random Subspace Ensemble of CMAC Classifiers

Authors: Somaiyeh Dehghan, Mohammad Reza Kheirkhahan Haghighi

Abstract:

The rapid growth of domains that have data with a large number of features, while the number of samples is limited has caused difficulty in constructing strong classifiers. To reduce the dimensionality of the feature space becomes an essential step in classification task. Random subspace method (or attribute bagging) is an ensemble classifier that consists of several classifiers that each base learner in ensemble has subset of features. In the present paper, we introduce Random Subspace Ensemble of CMAC neural network (RSE-CMAC), each of which has training with subset of features. Then we use this model for classification task. For evaluation performance of our model, we compare it with bagging algorithm on 36 UCI datasets. The results reveal that the new model has better performance.

Keywords: classification, random subspace, ensemble, CMAC neural network

Procedia PDF Downloads 303
493 Defect Detection for Nanofibrous Images with Deep Learning-Based Approaches

Authors: Gaokai Liu

Abstract:

Automatic defect detection for nanomaterial images is widely required in industrial scenarios. Deep learning approaches are considered as the most effective solutions for the great majority of image-based tasks. In this paper, an edge guidance network for defect segmentation is proposed. First, the encoder path with multiple convolution and downsampling operations is applied to the acquisition of shared features. Then two decoder paths both are connected to the last convolution layer of the encoder and supervised by the edge and segmentation labels, respectively, to guide the whole training process. Meanwhile, the edge and encoder outputs from the same stage are concatenated to the segmentation corresponding part to further tune the segmentation result. Finally, the effectiveness of the proposed method is verified via the experiments on open nanofibrous datasets.

Keywords: deep learning, defect detection, image segmentation, nanomaterials

Procedia PDF Downloads 114
492 GPS Refinement in Cities Using Statistical Approach

Authors: Ashwani Kumar

Abstract:

GPS plays an important role in everyday life for safe and convenient transportation. While pedestrians use hand held devices to know their position in a city, vehicles in intelligent transport systems use relatively sophisticated GPS receivers for estimating their current position. However, in urban areas where the GPS satellites are occluded by tall buildings, trees and reflections of GPS signals from nearby vehicles, GPS position estimation becomes poor. In this work, an exhaustive GPS data is collected at a single point in urban area under different times of day and under dynamic environmental conditions. The data is analyzed and statistical refinement methods are used to obtain optimal position estimate among all the measured positions. The results obtained are compared with publically available datasets and obtained position estimation refinement results are promising.

Keywords: global positioning system, statistical approach, intelligent transport systems, least squares estimation

Procedia PDF Downloads 259
491 Predicting Groundwater Areas Using Data Mining Techniques: Groundwater in Jordan as Case Study

Authors: Faisal Aburub, Wael Hadi

Abstract:

Data mining is the process of extracting useful or hidden information from a large database. Extracted information can be used to discover relationships among features, where data objects are grouped according to logical relationships; or to predict unseen objects to one of the predefined groups. In this paper, we aim to investigate four well-known data mining algorithms in order to predict groundwater areas in Jordan. These algorithms are Support Vector Machines (SVMs), Naïve Bayes (NB), K-Nearest Neighbor (kNN) and Classification Based on Association Rule (CBA). The experimental results indicate that the SVMs algorithm outperformed other algorithms in terms of classification accuracy, precision and F1 evaluation measures using the datasets of groundwater areas that were collected from Jordanian Ministry of Water and Irrigation.

Keywords: classification, data mining, evaluation measures, groundwater

Procedia PDF Downloads 251
490 Mask-Prompt-Rerank: An Unsupervised Method for Text Sentiment Transfer

Authors: Yufen Qin

Abstract:

Text sentiment transfer is an important branch of text style transfer. The goal is to generate text with another sentiment attribute based on a text with a specific sentiment attribute while maintaining the content and semantic information unrelated to sentiment unchanged in the process. There are currently two main challenges in this field: no parallel corpus and text attribute entanglement. In response to the above problems, this paper proposed a novel solution: Mask-Prompt-Rerank. Use the method of masking the sentiment words and then using prompt regeneration to transfer the sentence sentiment. Experiments on two sentiment benchmark datasets and one formality transfer benchmark dataset show that this approach makes the performance of small pre-trained language models comparable to that of the most advanced large models, while consuming two orders of magnitude less computing and memory.

Keywords: language model, natural language processing, prompt, text sentiment transfer

Procedia PDF Downloads 46
489 Exploring the Spatial Characteristics of Mortality Map: A Statistical Area Perspective

Authors: Jung-Hong Hong, Jing-Cen Yang, Cai-Yu Ou

Abstract:

The analysis of geographic inequality heavily relies on the use of location-enabled statistical data and quantitative measures to present the spatial patterns of the selected phenomena and analyze their differences. To protect the privacy of individual instance and link to administrative units, point-based datasets are spatially aggregated to area-based statistical datasets, where only the overall status for the selected levels of spatial units is used for decision making. The partition of the spatial units thus has dominant influence on the outcomes of the analyzed results, well known as the Modifiable Areal Unit Problem (MAUP). A new spatial reference framework, the Taiwan Geographical Statistical Classification (TGSC), was recently introduced in Taiwan based on the spatial partition principles of homogeneous consideration of the number of population and households. Comparing to the outcomes of the traditional township units, TGSC provides additional levels of spatial units with finer granularity for presenting spatial phenomena and enables domain experts to select appropriate dissemination level for publishing statistical data. This paper compares the results of respectively using TGSC and township unit on the mortality data and examines the spatial characteristics of their outcomes. For the mortality data between the period of January 1st, 2008 and December 31st, 2010 of the Taitung County, the all-cause age-standardized death rate (ASDR) ranges from 571 to 1757 per 100,000 persons, whereas the 2nd dissemination area (TGSC) shows greater variation, ranged from 0 to 2222 per 100,000. The finer granularity of spatial units of TGSC clearly provides better outcomes for identifying and evaluating the geographic inequality and can be further analyzed with the statistical measures from other perspectives (e.g., population, area, environment.). The management and analysis of the statistical data referring to the TGSC in this research is strongly supported by the use of Geographic Information System (GIS) technology. An integrated workflow that consists of the tasks of the processing of death certificates, the geocoding of street address, the quality assurance of geocoded results, the automatic calculation of statistic measures, the standardized encoding of measures and the geo-visualization of statistical outcomes is developed. This paper also introduces a set of auxiliary measures from a geographic distribution perspective to further examine the hidden spatial characteristics of mortality data and justify the analyzed results. With the common statistical area framework like TGSC, the preliminary results demonstrate promising potential for developing a web-based statistical service that can effectively access domain statistical data and present the analyzed outcomes in meaningful ways to avoid wrong decision making.

Keywords: mortality map, spatial patterns, statistical area, variation

Procedia PDF Downloads 223
488 From Two-Way to Multi-Way: A Comparative Study for Map-Reduce Join Algorithms

Authors: Marwa Hussien Mohamed, Mohamed Helmy Khafagy

Abstract:

Map-Reduce is a programming model which is widely used to extract valuable information from enormous volumes of data. Map-reduce designed to support heterogeneous datasets. Apache Hadoop map-reduce used extensively to uncover hidden pattern like data mining, SQL, etc. The most important operation for data analysis is joining operation. But, map-reduce framework does not directly support join algorithm. This paper explains and compares two-way and multi-way map-reduce join algorithms for map reduce also we implement MR join Algorithms and show the performance of each phase in MR join algorithms. Our experimental results show that map side join and map merge join in two-way join algorithms has the longest time according to preprocessing step sorting data and reduce side cascade join has the longest time at Multi-Way join algorithms.

Keywords: Hadoop, MapReduce, multi-way join, two-way join, Ubuntu

Procedia PDF Downloads 452
487 A Multi-Agent Urban Traffic Simulator for Generating Autonomous Driving Training Data

Authors: Florin Leon

Abstract:

This paper describes a simulator of traffic scenarios tailored to facilitate autonomous driving model training for urban environments. With the rising prominence of self-driving vehicles, the need for diverse datasets is very important. The proposed simulator provides a flexible framework that allows the generation of custom scenarios needed for the validation and enhancement of trajectory prediction algorithms. Its controlled yet dynamic environment addresses the challenges associated with real-world data acquisition and ensures adaptability to diverse driving scenarios. By providing an adaptable solution for scenario creation and algorithm testing, this tool proves to be a valuable resource for advancing autonomous driving technology that aims to ensure safe and efficient self-driving vehicles.

Keywords: autonomous driving, car simulator, machine learning, model training, urban simulation environment

Procedia PDF Downloads 15
486 Simulation-Based Unmanned Surface Vehicle Design Using PX4 and Robot Operating System With Kubernetes and Cloud-Native Tooling

Authors: Norbert Szulc, Jakub Wilk, Franciszek Górski

Abstract:

This paper presents an approach for simulating and testing robotic systems based on PX4, using a local Kubernetes cluster. The approach leverages modern cloud-native tools and runs on single-board computers. Additionally, this solution enables the creation of datasets for computer vision and the evaluation of control system algorithms in an end-to-end manner. This paper compares this approach to method commonly used Docker based approach. This approach was used to develop simulation environment for an unmanned surface vehicle (USV) for RoboBoat 2023 by running a containerized configuration of the PX4 Open-source Autopilot connected to ROS and the Gazebo simulation environment.

Keywords: cloud computing, Kubernetes, single board computers, simulation, ROS

Procedia PDF Downloads 43
485 Big Data for Local Decision-Making: Indicators Identified at International Conference on Urban Health 2017

Authors: Dana R. Thomson, Catherine Linard, Sabine Vanhuysse, Jessica E. Steele, Michal Shimoni, Jose Siri, Waleska Caiaffa, Megumi Rosenberg, Eleonore Wolff, Tais Grippa, Stefanos Georganos, Helen Elsey

Abstract:

The Sustainable Development Goals (SDGs) and Urban Health Equity Assessment and Response Tool (Urban HEART) identify dozens of key indicators to help local decision-makers prioritize and track inequalities in health outcomes. However, presentations and discussions at the International Conference on Urban Health (ICUH) 2017 suggested that additional indicators are needed to make decisions and policies. A local decision-maker may realize that malaria or road accidents are a top priority. However, s/he needs additional health determinant indicators, for example about standing water or traffic, to address the priority and reduce inequalities. Health determinants reflect the physical and social environments that influence health outcomes often at community- and societal-levels and include such indicators as access to quality health facilities, access to safe parks, traffic density, location of slum areas, air pollution, social exclusion, and social networks. Indicator identification and disaggregation are necessarily constrained by available datasets – typically collected about households and individuals in surveys, censuses, and administrative records. Continued advancements in earth observation, data storage, computing and mobile technologies mean that new sources of health determinants indicators derived from 'big data' are becoming available at fine geographic scale. Big data includes high-resolution satellite imagery and aggregated, anonymized mobile phone data. While big data are themselves not representative of the population (e.g., satellite images depict the physical environment), they can provide information about population density, wealth, mobility, and social environments with tremendous detail and accuracy when combined with population-representative survey, census, administrative and health system data. The aim of this paper is to (1) flag to data scientists important indicators needed by health decision-makers at the city and sub-city scale - ideally free and publicly available, and (2) summarize for local decision-makers new datasets that can be generated from big data, with layperson descriptions of difficulties in generating them. We include SDGs and Urban HEART indicators, as well as indicators mentioned by decision-makers attending ICUH 2017.

Keywords: health determinant, health outcome, mobile phone, remote sensing, satellite imagery, SDG, urban HEART

Procedia PDF Downloads 177
484 MULTI-FLGANs: Multi-Distributed Adversarial Networks for Non-Independent and Identically Distributed Distribution

Authors: Akash Amalan, Rui Wang, Yanqi Qiao, Emmanouil Panaousis, Kaitai Liang

Abstract:

Federated learning is an emerging concept in the domain of distributed machine learning. This concept has enabled General Adversarial Networks (GANs) to benefit from the rich distributed training data while preserving privacy. However, in a non-IID setting, current federated GAN architectures are unstable, struggling to learn the distinct features, and vulnerable to mode collapse. In this paper, we propose an architecture MULTI-FLGAN to solve the problem of low-quality images, mode collapse, and instability for non-IID datasets. Our results show that MULTI-FLGAN is four times as stable and performant (i.e., high inception score) on average over 20 clients compared to baseline FLGAN.

Keywords: federated learning, generative adversarial network, inference attack, non-IID data distribution

Procedia PDF Downloads 116
483 Real Time Multi Person Action Recognition Using Pose Estimates

Authors: Aishrith Rao

Abstract:

Human activity recognition is an important aspect of video analytics, and many approaches have been recommended to enable action recognition. In this approach, the model is used to identify the action of the multiple people in the frame and classify them accordingly. A few approaches use RNNs and 3D CNNs, which are computationally expensive and cannot be trained with the small datasets which are currently available. Multi-person action recognition has been performed in order to understand the positions and action of people present in the video frame. The size of the video frame can be adjusted as a hyper-parameter depending on the hardware resources available. OpenPose has been used to calculate pose estimate using CNN to produce heap-maps, one of which provides skeleton features, which are basically joint features. The features are then extracted, and a classification algorithm can be applied to classify the action.

Keywords: human activity recognition, computer vision, pose estimates, convolutional neural networks

Procedia PDF Downloads 112
482 Generating Product Description with Generative Pre-Trained Transformer 2

Authors: Minh-Thuan Nguyen, Phuong-Thai Nguyen, Van-Vinh Nguyen, Quang-Minh Nguyen

Abstract:

Research on automatically generating descriptions for e-commerce products is gaining increasing attention in recent years. However, the generated descriptions of their systems are often less informative and attractive because of lacking training datasets or the limitation of these approaches, which often use templates or statistical methods. In this paper, we explore a method to generate production descriptions by using the GPT-2 model. In addition, we apply text paraphrasing and task-adaptive pretraining techniques to improve the qualify of descriptions generated from the GPT-2 model. Experiment results show that our models outperform the baseline model through automatic evaluation and human evaluation. Especially, our methods achieve a promising result not only on the seen test set but also in the unseen test set.

Keywords: GPT-2, product description, transformer, task-adaptive, language model, pretraining

Procedia PDF Downloads 169
481 Dissimilarity-Based Coloring for Symbolic and Multivariate Data Visualization

Authors: K. Umbleja, M. Ichino, H. Yaguchi

Abstract:

In this paper, we propose a coloring method for multivariate data visualization by using parallel coordinates based on dissimilarity and tree structure information gathered during hierarchical clustering. The proposed method is an extension for proximity-based coloring that suffers from a few undesired side effects if hierarchical tree structure is not balanced tree. We describe the algorithm by assigning colors based on dissimilarity information, show the application of proposed method on three commonly used datasets, and compare the results with proximity-based coloring. We found our proposed method to be especially beneficial for symbolic data visualization where many individual objects have already been aggregated into a single symbolic object.

Keywords: data visualization, dissimilarity-based coloring, proximity-based coloring, symbolic data

Procedia PDF Downloads 137
480 Mining Scientific Literature to Discover Potential Research Data Sources: An Exploratory Study in the Field of Haemato-Oncology

Authors: A. Anastasiou, K. S. Tingay

Abstract:

Background: Discovering suitable datasets is an important part of health research, particularly for projects working with clinical data from patients organized in cohorts (cohort data), but with the proliferation of so many national and international initiatives, it is becoming increasingly difficult for research teams to locate real world datasets that are most relevant to their project objectives. We present a method for identifying healthcare institutes in the European Union (EU) which may hold haemato-oncology (HO) data. A key enabler of this research was the bibInsight platform, a scientometric data management and analysis system developed by the authors at Swansea University. Method: A PubMed search was conducted using HO clinical terms taken from previous work. The resulting XML file was processed using the bibInsight platform, linking affiliations to the Global Research Identifier Database (GRID). GRID is an international, standardized list of institutions, including the city and country in which the institution exists, as well as a category of the main business type, e.g., Academic, Healthcare, Government, Company. Countries were limited to the 28 current EU members, and institute type to 'Healthcare'. An article was considered valid if at least one author was affiliated with an EU-based healthcare institute. Results: The PubMed search produced 21,310 articles, consisting of 9,885 distinct affiliations with correspondence in GRID. Of these articles, 760 were from EU countries, and 390 of these were healthcare institutes. One affiliation was excluded as being a veterinary hospital. Two EU countries did not have any publications in our analysis dataset. The results were analysed by country and by individual healthcare institute. Networks both within the EU and internationally show institutional collaborations, which may suggest a willingness to share data for research purposes. Geographical mapping can ensure that data has broad population coverage. Collaborations with industry or government may exclude healthcare institutes that may have embargos or additional costs associated with data access. Conclusions: Data reuse is becoming increasingly important both for ensuring the validity of results, and economy of available resources. The ability to identify potential, specific data sources from over twenty thousand articles in less than an hour could assist in improving knowledge of, and access to, data sources. As our method has not yet specified if these healthcare institutes are holding data, or merely publishing on that topic, future work will involve text mining of data-specific concordant terms to identify numbers of participants, demographics, study methodologies, and sub-topics of interest.

Keywords: data reuse, data discovery, data linkage, journal articles, text mining

Procedia PDF Downloads 90
479 River's Bed Level Changing Pattern Due to Sedimentation, Case Study: Gash River, Kassala, Sudan

Authors: Faisal Ali, Hasssan Saad Mohammed Hilmi, Mustafa Mohamed, Shamseddin Musa

Abstract:

The Gash rivers an ephemeral river, it usually flows from July to September, it has a braided pattern with high sediment content, of 15200 ppm in suspension, and 360 kg/sec as bed load. The Gash river bed has an average slope of 1.3 m/Km. The objectives of this study were: assessing the Gash River bed level patterns; quantifying the annual variations in Gash bed level; and recommending a suitable method to reduce the sediment accumulation on the Gash River bed. The study covered temporally the period 1905-2013 using datasets included the Gash river flows, and the cross sections. The results showed that there is an increasing trend in the river bed of 5 cm3 per year. This is resulted in changing the behavior of the flood routing and consequently the flood hazard is tremendously increased in Kassala city.

Keywords: bed level, cross section, gash river, sedimentation

Procedia PDF Downloads 502
478 Decision Trees Constructing Based on K-Means Clustering Algorithm

Authors: Loai Abdallah, Malik Yousef

Abstract:

A domain space for the data should reflect the actual similarity between objects. Since objects belonging to the same cluster usually share some common traits even though their geometric distance might be relatively large. In general, the Euclidean distance of data points that represented by large number of features is not capturing the actual relation between those points. In this study, we propose a new method to construct a different space that is based on clustering to form a new distance metric. The new distance space is based on ensemble clustering (EC). The EC distance space is defined by tracking the membership of the points over multiple runs of clustering algorithm metric. Over this distance, we train the decision trees classifier (DT-EC). The results obtained by applying DT-EC on 10 datasets confirm our hypotheses that embedding the EC space as a distance metric would improve the performance.

Keywords: ensemble clustering, decision trees, classification, K nearest neighbors

Procedia PDF Downloads 160
477 A Comparison of YOLO Family for Apple Detection and Counting in Orchards

Authors: Yuanqing Li, Changyi Lei, Zhaopeng Xue, Zhuo Zheng, Yanbo Long

Abstract:

In agricultural production and breeding, implementing automatic picking robot in orchard farming to reduce human labour and error is challenging. The core function of it is automatic identification based on machine vision. This paper focuses on apple detection and counting in orchards and implements several deep learning methods. Extensive datasets are used and a semi-automatic annotation method is proposed. The proposed deep learning models are in state-of-the-art YOLO family. In view of the essence of the models with various backbones, a multi-dimensional comparison in details is made in terms of counting accuracy, mAP and model memory, laying the foundation for realising automatic precision agriculture.

Keywords: agricultural object detection, deep learning, machine vision, YOLO family

Procedia PDF Downloads 164
476 Monitoring Land Productivity Dynamics of Gombe State, Nigeria

Authors: Ishiyaku Abdulkadir, Satish Kumar J

Abstract:

Land Productivity is a measure of the greenness of above-ground biomass in health and potential gain and is not related to agricultural productivity. Monitoring land productivity dynamics is essential to identify, especially when and where the trend is characterized degraded for mitigation measures. This research aims to monitor the land productivity trend of Gombe State between 2001 and 2015. QGIS was used to compute NDVI from AVHRR/MODIS datasets in a cloud-based method. The result appears that land area with improving productivity account for 773sq.km with 4.31%, stable productivity traced to 4,195.6 sq.km with 23.40%, stable but stressed productivity represent 18.7sq.km account for 0.10%, early sign of decline productivity occupied 5203.1sq.km with 29%, declining productivity account for 7019.7sq.km, represent 39.2%, water bodies occupied 718.7sq.km traced to 4% of the state’s area.

Keywords: above-ground biomass, dynamics, land productivity, man-environment relationship

Procedia PDF Downloads 119
475 Robust Variable Selection Based on Schwarz Information Criterion for Linear Regression Models

Authors: Shokrya Saleh A. Alshqaq, Abdullah Ali H. Ahmadini

Abstract:

The Schwarz information criterion (SIC) is a popular tool for selecting the best variables in regression datasets. However, SIC is defined using an unbounded estimator, namely, the least-squares (LS), which is highly sensitive to outlying observations, especially bad leverage points. A method for robust variable selection based on SIC for linear regression models is thus needed. This study investigates the robustness properties of SIC by deriving its influence function and proposes a robust SIC based on the MM-estimation scale. The aim of this study is to produce a criterion that can effectively select accurate models in the presence of vertical outliers and high leverage points. The advantages of the proposed robust SIC is demonstrated through a simulation study and an analysis of a real dataset.

Keywords: influence function, robust variable selection, robust regression, Schwarz information criterion

Procedia PDF Downloads 113
474 Combining the Dynamic Conditional Correlation and Range-GARCH Models to Improve Covariance Forecasts

Authors: Piotr Fiszeder, Marcin Fałdziński, Peter Molnár

Abstract:

The dynamic conditional correlation model of Engle (2002) is one of the most popular multivariate volatility models. However, this model is based solely on closing prices. It has been documented in the literature that the high and low price of the day can be used in an efficient volatility estimation. We, therefore, suggest a model which incorporates high and low prices into the dynamic conditional correlation framework. Empirical evaluation of this model is conducted on three datasets: currencies, stocks, and commodity exchange-traded funds. The utilisation of realized variances and covariances as proxies for true variances and covariances allows us to reach a strong conclusion that our model outperforms not only the standard dynamic conditional correlation model but also a competing range-based dynamic conditional correlation model.

Keywords: volatility, DCC model, high and low prices, range-based models, covariance forecasting

Procedia PDF Downloads 148
473 Distorted Document Images Dataset for Text Detection and Recognition

Authors: Ilia Zharikov, Philipp Nikitin, Ilia Vasiliev, Vladimir Dokholyan

Abstract:

With the increasing popularity of document analysis and recognition systems, text detection (TD) and optical character recognition (OCR) in document images become challenging tasks. However, according to our best knowledge, no publicly available datasets for these particular problems exist. In this paper, we introduce a Distorted Document Images dataset (DDI-100) and provide a detailed analysis of the DDI-100 in its current state. To create the dataset we collected 7000 unique document pages, and extend it by applying different types of distortions and geometric transformations. In total, DDI-100 contains more than 100,000 document images together with binary text masks, text and character locations in terms of bounding boxes. We also present an analysis of several state-of-the-art TD and OCR approaches on the presented dataset. Lastly, we demonstrate the usefulness of DDI-100 to improve accuracy and stability of the considered TD and OCR models.

Keywords: document analysis, open dataset, optical character recognition, text detection

Procedia PDF Downloads 138
472 FPGA Implementation of Adaptive Clock Recovery for TDMoIP Systems

Authors: Semih Demir, Anil Celebi

Abstract:

Circuit switched networks widely used until the end of the 20th century have been transformed into packages switched networks. Time Division Multiplexing over Internet Protocol (TDMoIP) is a system that enables Time Division Multiplexing (TDM) traffic to be carried over packet switched networks (PSN). In TDMoIP systems, devices that send TDM data to the PSN and receive it from the network must operate with the same clock frequency. In this study, it was aimed to implement clock synchronization process in Field Programmable Gate Array (FPGA) chips using time information attached to the packages received from PSN. The designed hardware is verified using the datasets obtained for the different carrier types and comparing the results with the software model. Field tests are also performed by using the real time TDMoIP system.

Keywords: clock recovery on TDMoIP, FPGA, MATLAB reference model, clock synchronization

Procedia PDF Downloads 243