Search results for: datasets
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 697

487 Mask-Prompt-Rerank: An Unsupervised Method for Text Sentiment Transfer

Authors: Yufen Qin

Abstract:

Text sentiment transfer is an important branch of text style transfer. The goal is to generate text with another sentiment attribute based on a text with a specific sentiment attribute while keeping the content and semantic information unrelated to sentiment unchanged. There are currently two main challenges in this field: the lack of a parallel corpus and text attribute entanglement. In response to these problems, this paper proposes a novel solution: Mask-Prompt-Rerank. The method masks the sentiment words and then uses prompt-based regeneration to transfer the sentence sentiment. Experiments on two sentiment benchmark datasets and one formality transfer benchmark dataset show that this approach makes the performance of small pre-trained language models comparable to that of the most advanced large models, while consuming two orders of magnitude less compute and memory.
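
The masking step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a tiny hand-written sentiment lexicon and a `[MASK]` placeholder, both hypothetical; the abstract does not say how sentiment words are identified, and the prompt and rerank stages are not shown.

```python
import re

# Hypothetical toy lexicon (an assumption for illustration only).
SENTIMENT_WORDS = {"great", "terrible", "delicious", "awful", "love", "hate"}


def mask_sentiment_words(sentence, lexicon=SENTIMENT_WORDS, mask_token="[MASK]"):
    """Replace every token found in the lexicon with the mask token."""
    # Split into word tokens and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    return " ".join(mask_token if t.lower() in lexicon else t for t in tokens)


masked = mask_sentiment_words("The food was delicious but the service was awful.")
print(masked)
```

The masked sentence keeps the sentiment-neutral content, which a language model could then be prompted to fill in with words of the target sentiment.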

Keywords: language model, natural language processing, prompt, text sentiment transfer

Procedia PDF Downloads 47
486 Exploring the Spatial Characteristics of Mortality Map: A Statistical Area Perspective

Authors: Jung-Hong Hong, Jing-Cen Yang, Cai-Yu Ou

Abstract:

The analysis of geographic inequality relies heavily on location-enabled statistical data and quantitative measures to present the spatial patterns of the selected phenomena and analyze their differences. To protect the privacy of individual instances and link to administrative units, point-based datasets are spatially aggregated to area-based statistical datasets, where only the overall status for the selected levels of spatial units is used for decision making. The partition of the spatial units thus has a dominant influence on the outcomes of the analyzed results, well known as the Modifiable Areal Unit Problem (MAUP). A new spatial reference framework, the Taiwan Geographical Statistical Classification (TGSC), was recently introduced in Taiwan based on the spatial partition principles of homogeneous consideration of the number of population and households. Compared to the traditional township units, TGSC provides additional levels of spatial units with finer granularity for presenting spatial phenomena and enables domain experts to select an appropriate dissemination level for publishing statistical data. This paper compares the results of using TGSC and township units on the mortality data and examines the spatial characteristics of their outcomes. For the mortality data of Taitung County between January 1st, 2008 and December 31st, 2010, the all-cause age-standardized death rate (ASDR) ranges from 571 to 1757 per 100,000 persons for township units, whereas the 2nd dissemination area (TGSC) shows greater variation, ranging from 0 to 2222 per 100,000. The finer granularity of the TGSC spatial units clearly provides better outcomes for identifying and evaluating geographic inequality, and the results can be further analyzed with statistical measures from other perspectives (e.g., population, area, environment).
The management and analysis of the statistical data referring to the TGSC in this research is strongly supported by the use of Geographic Information System (GIS) technology. An integrated workflow that consists of the tasks of the processing of death certificates, the geocoding of street address, the quality assurance of geocoded results, the automatic calculation of statistic measures, the standardized encoding of measures and the geo-visualization of statistical outcomes is developed. This paper also introduces a set of auxiliary measures from a geographic distribution perspective to further examine the hidden spatial characteristics of mortality data and justify the analyzed results. With the common statistical area framework like TGSC, the preliminary results demonstrate promising potential for developing a web-based statistical service that can effectively access domain statistical data and present the analyzed outcomes in meaningful ways to avoid wrong decision making.
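
As an illustration of the age-standardized death rate (ASDR) reported above, the following sketch uses direct standardization: age-specific rates are weighted by a standard population. The function name and the example numbers are assumptions for illustration; the abstract does not state which standard population or age bands the authors used.

```python
def age_standardized_rate(deaths, population, standard_population, per=100_000):
    """Directly standardized rate: weight age-specific rates by a standard population."""
    if not (len(deaths) == len(population) == len(standard_population)):
        raise ValueError("all inputs must cover the same age groups")
    weighted = sum(
        d / p * w for d, p, w in zip(deaths, population, standard_population)
    )
    return per * weighted / sum(standard_population)


# Made-up example with two age groups (young, old).
asdr = age_standardized_rate(
    deaths=[1, 10],
    population=[1000, 1000],
    standard_population=[70_000, 30_000],
)
print(round(asdr, 1))
```

Small spatial units can have very few deaths in some age groups, which is one reason rates computed for finer units can vary more widely than township-level rates.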

Keywords: mortality map, spatial patterns, statistical area, variation

Procedia PDF Downloads 223
485 From Two-Way to Multi-Way: A Comparative Study for Map-Reduce Join Algorithms

Authors: Marwa Hussien Mohamed, Mohamed Helmy Khafagy

Abstract:

Map-Reduce is a programming model that is widely used to extract valuable information from enormous volumes of data. Map-Reduce is designed to support heterogeneous datasets. Apache Hadoop Map-Reduce is used extensively to uncover hidden patterns in applications such as data mining and SQL. The most important operation for data analysis is the join operation, but the Map-Reduce framework does not directly support join algorithms. This paper explains and compares two-way and multi-way Map-Reduce join algorithms. We also implement the MR join algorithms and show the performance of each phase. Our experimental results show that, among two-way join algorithms, map-side join and map-merge join take the longest time because of the preprocessing step that sorts the data, and that, among multi-way join algorithms, reduce-side cascade join takes the longest time.
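
To make the idea of a two-way reduce-side join concrete, here is a minimal single-process sketch in plain Python. It only imitates the map, shuffle and reduce phases in memory; it is not Hadoop code and not the authors' implementation, and the function names and sample tables are assumptions.

```python
from collections import defaultdict
from itertools import product


def map_phase(records, source_tag, key_index):
    """Emit (join_key, (source_tag, record)) pairs, as a mapper would."""
    for record in records:
        yield record[key_index], (source_tag, record)


def reduce_side_join(left, right, left_key=0, right_key=0):
    """Inner join of two lists of tuples on the given key positions."""
    # Shuffle: group the tagged records from both inputs by join key.
    groups = defaultdict(lambda: {"L": [], "R": []})
    for key, (tag, record) in map_phase(left, "L", left_key):
        groups[key][tag].append(record)
    for key, (tag, record) in map_phase(right, "R", right_key):
        groups[key][tag].append(record)
    # Reduce: for each key, pair every left record with every right record.
    joined = []
    for key in sorted(groups):
        for l_rec, r_rec in product(groups[key]["L"], groups[key]["R"]):
            joined.append(l_rec + r_rec)
    return joined


users = [(1, "ann"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]
result = reduce_side_join(users, orders)
print(result)
```

In a reduce-side join all records travel through the shuffle, whereas map-side variants avoid the shuffle but need the inputs pre-sorted or partitioned, which is the preprocessing cost the abstract refers to.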

Keywords: Hadoop, MapReduce, multi-way join, two-way join, Ubuntu

Procedia PDF Downloads 452
484 A Multi-Agent Urban Traffic Simulator for Generating Autonomous Driving Training Data

Authors: Florin Leon

Abstract:

This paper describes a simulator of traffic scenarios tailored to facilitate autonomous driving model training for urban environments. With the rising prominence of self-driving vehicles, the need for diverse datasets is very important. The proposed simulator provides a flexible framework that allows the generation of custom scenarios needed for the validation and enhancement of trajectory prediction algorithms. Its controlled yet dynamic environment addresses the challenges associated with real-world data acquisition and ensures adaptability to diverse driving scenarios. By providing an adaptable solution for scenario creation and algorithm testing, this tool proves to be a valuable resource for advancing autonomous driving technology that aims to ensure safe and efficient self-driving vehicles.

Keywords: autonomous driving, car simulator, machine learning, model training, urban simulation environment

Procedia PDF Downloads 18
483 Simulation-Based Unmanned Surface Vehicle Design Using PX4 and Robot Operating System With Kubernetes and Cloud-Native Tooling

Authors: Norbert Szulc, Jakub Wilk, Franciszek Górski

Abstract:

This paper presents an approach for simulating and testing robotic systems based on PX4, using a local Kubernetes cluster. The approach leverages modern cloud-native tools and runs on single-board computers. Additionally, this solution enables the creation of datasets for computer vision and the evaluation of control system algorithms in an end-to-end manner. This paper compares this approach to the commonly used Docker-based approach. The approach was used to develop a simulation environment for an unmanned surface vehicle (USV) for RoboBoat 2023 by running a containerized configuration of the PX4 Open-source Autopilot connected to ROS and the Gazebo simulation environment.

Keywords: cloud computing, Kubernetes, single board computers, simulation, ROS

Procedia PDF Downloads 45
482 Big Data for Local Decision-Making: Indicators Identified at International Conference on Urban Health 2017

Authors: Dana R. Thomson, Catherine Linard, Sabine Vanhuysse, Jessica E. Steele, Michal Shimoni, Jose Siri, Waleska Caiaffa, Megumi Rosenberg, Eleonore Wolff, Tais Grippa, Stefanos Georganos, Helen Elsey

Abstract:

The Sustainable Development Goals (SDGs) and Urban Health Equity Assessment and Response Tool (Urban HEART) identify dozens of key indicators to help local decision-makers prioritize and track inequalities in health outcomes. However, presentations and discussions at the International Conference on Urban Health (ICUH) 2017 suggested that additional indicators are needed to make decisions and policies. A local decision-maker may realize that malaria or road accidents are a top priority. However, s/he needs additional health determinant indicators, for example about standing water or traffic, to address the priority and reduce inequalities. Health determinants reflect the physical and social environments that influence health outcomes often at community- and societal-levels and include such indicators as access to quality health facilities, access to safe parks, traffic density, location of slum areas, air pollution, social exclusion, and social networks. Indicator identification and disaggregation are necessarily constrained by available datasets – typically collected about households and individuals in surveys, censuses, and administrative records. Continued advancements in earth observation, data storage, computing and mobile technologies mean that new sources of health determinants indicators derived from 'big data' are becoming available at fine geographic scale. Big data includes high-resolution satellite imagery and aggregated, anonymized mobile phone data. While big data are themselves not representative of the population (e.g., satellite images depict the physical environment), they can provide information about population density, wealth, mobility, and social environments with tremendous detail and accuracy when combined with population-representative survey, census, administrative and health system data. 
The aim of this paper is to (1) flag to data scientists important indicators needed by health decision-makers at the city and sub-city scale - ideally free and publicly available, and (2) summarize for local decision-makers new datasets that can be generated from big data, with layperson descriptions of difficulties in generating them. We include SDGs and Urban HEART indicators, as well as indicators mentioned by decision-makers attending ICUH 2017.

Keywords: health determinant, health outcome, mobile phone, remote sensing, satellite imagery, SDG, urban HEART

Procedia PDF Downloads 178
481 MULTI-FLGANs: Multi-Distributed Adversarial Networks for Non-Independent and Identically Distributed Distribution

Authors: Akash Amalan, Rui Wang, Yanqi Qiao, Emmanouil Panaousis, Kaitai Liang

Abstract:

Federated learning is an emerging concept in the domain of distributed machine learning. This concept has enabled Generative Adversarial Networks (GANs) to benefit from the rich distributed training data while preserving privacy. However, in a non-IID setting, current federated GAN architectures are unstable, struggle to learn the distinct features, and are vulnerable to mode collapse. In this paper, we propose an architecture, MULTI-FLGAN, to solve the problems of low-quality images, mode collapse, and instability for non-IID datasets. Our results show that MULTI-FLGAN is four times as stable and performant (i.e., high inception score) on average over 20 clients compared to baseline FLGAN.

Keywords: federated learning, generative adversarial network, inference attack, non-IID data distribution

Procedia PDF Downloads 116
480 Real Time Multi Person Action Recognition Using Pose Estimates

Authors: Aishrith Rao

Abstract:

Human activity recognition is an important aspect of video analytics, and many approaches have been proposed to enable action recognition. In this approach, the model is used to identify the actions of the multiple people in the frame and classify them accordingly. A few approaches use RNNs and 3D CNNs, which are computationally expensive and cannot be trained with the small datasets that are currently available. Multi-person action recognition has been performed in order to understand the positions and actions of people present in the video frame. The size of the video frame can be adjusted as a hyper-parameter depending on the hardware resources available. OpenPose has been used to calculate pose estimates, using a CNN to produce heat maps, one of which provides skeleton features, which are basically joint features. The features are then extracted, and a classification algorithm can be applied to classify the action.

Keywords: human activity recognition, computer vision, pose estimates, convolutional neural networks

Procedia PDF Downloads 112
479 Generating Product Description with Generative Pre-Trained Transformer 2

Authors: Minh-Thuan Nguyen, Phuong-Thai Nguyen, Van-Vinh Nguyen, Quang-Minh Nguyen

Abstract:

Research on automatically generating descriptions for e-commerce products has gained increasing attention in recent years. However, the descriptions generated by existing systems are often less informative and attractive because of a lack of training datasets or the limitations of these approaches, which often use templates or statistical methods. In this paper, we explore a method to generate product descriptions by using the GPT-2 model. In addition, we apply text paraphrasing and task-adaptive pretraining techniques to improve the quality of descriptions generated by the GPT-2 model. Experimental results show that our models outperform the baseline model in both automatic evaluation and human evaluation. In particular, our methods achieve promising results not only on the seen test set but also on the unseen test set.

Keywords: GPT-2, product description, transformer, task-adaptive, language model, pretraining

Procedia PDF Downloads 169
478 Dissimilarity-Based Coloring for Symbolic and Multivariate Data Visualization

Authors: K. Umbleja, M. Ichino, H. Yaguchi

Abstract:

In this paper, we propose a coloring method for multivariate data visualization using parallel coordinates, based on dissimilarity and tree structure information gathered during hierarchical clustering. The proposed method is an extension of proximity-based coloring, which suffers from a few undesired side effects if the hierarchical tree structure is not a balanced tree. We describe the algorithm for assigning colors based on dissimilarity information, show the application of the proposed method on three commonly used datasets, and compare the results with proximity-based coloring. We found our proposed method to be especially beneficial for symbolic data visualization, where many individual objects have already been aggregated into a single symbolic object.

Keywords: data visualization, dissimilarity-based coloring, proximity-based coloring, symbolic data

Procedia PDF Downloads 137
477 Mining Scientific Literature to Discover Potential Research Data Sources: An Exploratory Study in the Field of Haemato-Oncology

Authors: A. Anastasiou, K. S. Tingay

Abstract:

Background: Discovering suitable datasets is an important part of health research, particularly for projects working with clinical data from patients organized in cohorts (cohort data), but with the proliferation of so many national and international initiatives, it is becoming increasingly difficult for research teams to locate real world datasets that are most relevant to their project objectives. We present a method for identifying healthcare institutes in the European Union (EU) which may hold haemato-oncology (HO) data. A key enabler of this research was the bibInsight platform, a scientometric data management and analysis system developed by the authors at Swansea University. Method: A PubMed search was conducted using HO clinical terms taken from previous work. The resulting XML file was processed using the bibInsight platform, linking affiliations to the Global Research Identifier Database (GRID). GRID is an international, standardized list of institutions, including the city and country in which the institution exists, as well as a category of the main business type, e.g., Academic, Healthcare, Government, Company. Countries were limited to the 28 current EU members, and institute type to 'Healthcare'. An article was considered valid if at least one author was affiliated with an EU-based healthcare institute. Results: The PubMed search produced 21,310 articles, consisting of 9,885 distinct affiliations with correspondence in GRID. Of these articles, 760 were from EU countries, and 390 of these were healthcare institutes. One affiliation was excluded as being a veterinary hospital. Two EU countries did not have any publications in our analysis dataset. The results were analysed by country and by individual healthcare institute. Networks both within the EU and internationally show institutional collaborations, which may suggest a willingness to share data for research purposes. Geographical mapping can ensure that data has broad population coverage. 
Collaborations with industry or government may exclude healthcare institutes that may have embargos or additional costs associated with data access. Conclusions: Data reuse is becoming increasingly important both for ensuring the validity of results, and economy of available resources. The ability to identify potential, specific data sources from over twenty thousand articles in less than an hour could assist in improving knowledge of, and access to, data sources. As our method has not yet specified if these healthcare institutes are holding data, or merely publishing on that topic, future work will involve text mining of data-specific concordant terms to identify numbers of participants, demographics, study methodologies, and sub-topics of interest.

Keywords: data reuse, data discovery, data linkage, journal articles, text mining

Procedia PDF Downloads 92
476 River's Bed Level Changing Pattern Due to Sedimentation, Case Study: Gash River, Kassala, Sudan

Authors: Faisal Ali, Hasssan Saad Mohammed Hilmi, Mustafa Mohamed, Shamseddin Musa

Abstract:

The Gash River is an ephemeral river that usually flows from July to September. It has a braided pattern with a high sediment content of 15200 ppm in suspension and 360 kg/sec as bed load. The Gash River bed has an average slope of 1.3 m/km. The objectives of this study were: assessing the Gash River bed level patterns; quantifying the annual variations in the Gash bed level; and recommending a suitable method to reduce the sediment accumulation on the Gash River bed. The study covered the period 1905-2013, using datasets that included the Gash River flows and cross sections. The results showed that there is an increasing trend in the river bed of 5 cm3 per year. This has changed the behavior of the flood routing, and consequently the flood hazard has increased tremendously in Kassala city.

Keywords: bed level, cross section, gash river, sedimentation

Procedia PDF Downloads 502
475 Decision Trees Constructing Based on K-Means Clustering Algorithm

Authors: Loai Abdallah, Malik Yousef

Abstract:

A domain space for the data should reflect the actual similarity between objects, since objects belonging to the same cluster usually share some common traits even though their geometric distance might be relatively large. In general, the Euclidean distance between data points that are represented by a large number of features does not capture the actual relation between those points. In this study, we propose a new method to construct a different space that is based on clustering to form a new distance metric. The new distance space is based on ensemble clustering (EC). The EC distance space is defined by tracking the membership of the points over multiple runs of a clustering algorithm. Over this distance, we train the decision tree classifier (DT-EC). The results obtained by applying DT-EC to 10 datasets confirm our hypothesis that embedding the EC space as a distance metric would improve the performance.
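
One way to read "tracking the membership of the points over multiple runs" is sketched below: run k-means several times from different random starts and define the distance between two points as the fraction of runs in which they land in different clusters. This is an assumed interpretation for illustration, with a small pure-Python k-means; the authors' exact definition of the EC distance and their choice of parameters are not given in the abstract.

```python
import random


def kmeans_labels(points, k, rng, iterations=20):
    """Basic Lloyd's k-means; returns one cluster label per point."""
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(pt, centroids[c])),
            )
            for pt in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def ec_distance_matrix(points, k=2, runs=10, seed=0):
    """Fraction of clustering runs in which two points fall in different clusters."""
    rng = random.Random(seed)
    n = len(points)
    disagreements = [[0] * n for _ in range(n)]
    for _ in range(runs):
        labels = kmeans_labels(points, k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] != labels[j]:
                    disagreements[i][j] += 1
    return [[count / runs for count in row] for row in disagreements]


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
dist = ec_distance_matrix(points)
```

The resulting matrix could then replace the original feature space as input to a distance-based learner; how the decision tree consumes this space in DT-EC is not specified in the abstract.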

Keywords: ensemble clustering, decision trees, classification, K nearest neighbors

Procedia PDF Downloads 160
474 A Comparison of YOLO Family for Apple Detection and Counting in Orchards

Authors: Yuanqing Li, Changyi Lei, Zhaopeng Xue, Zhuo Zheng, Yanbo Long

Abstract:

In agricultural production and breeding, deploying automatic picking robots in orchards to reduce human labour and error is challenging. Their core function is automatic identification based on machine vision. This paper focuses on apple detection and counting in orchards and implements several deep learning methods. Extensive datasets are used, and a semi-automatic annotation method is proposed. The deep learning models considered belong to the state-of-the-art YOLO family. Since the models use various backbones, a detailed multi-dimensional comparison is made in terms of counting accuracy, mAP and model memory, laying the foundation for realising automatic precision agriculture.

Keywords: agricultural object detection, deep learning, machine vision, YOLO family

Procedia PDF Downloads 165
473 Monitoring Land Productivity Dynamics of Gombe State, Nigeria

Authors: Ishiyaku Abdulkadir, Satish Kumar J

Abstract:

Land productivity is a measure of the greenness of above-ground biomass in health and potential gain and is not related to agricultural productivity. Monitoring land productivity dynamics is essential for identifying when and where the trend is characterized as degraded, so that mitigation measures can be taken. This research aims to monitor the land productivity trend of Gombe State between 2001 and 2015. QGIS was used to compute NDVI from AVHRR/MODIS datasets in a cloud-based method. The results show that land with improving productivity accounts for 773 sq.km (4.31%), stable productivity for 4,195.6 sq.km (23.40%), stable but stressed productivity for 18.7 sq.km (0.10%), early signs of declining productivity for 5,203.1 sq.km (29%), and declining productivity for 7,019.7 sq.km (39.2%), while water bodies occupy 718.7 sq.km (4%) of the state's area.
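
For readers unfamiliar with NDVI, the sketch below shows the standard per-pixel formula, (NIR - Red) / (NIR + Red), and recomputes the percentage shares from the areas reported in the abstract. The function and dictionary names are assumptions; the authors' QGIS workflow and trend classification are not reproduced here.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel."""
    denominator = nir + red
    if denominator == 0:
        return 0.0  # assumption: treat an all-zero pixel as neutral/no data
    return (nir - red) / denominator


# Areas in sq.km as reported in the abstract.
areas = {
    "improving": 773.0,
    "stable": 4195.6,
    "stable but stressed": 18.7,
    "early signs of decline": 5203.1,
    "declining": 7019.7,
    "water bodies": 718.7,
}
total_area = sum(areas.values())
shares = {name: 100.0 * area / total_area for name, area in areas.items()}

for name, share in shares.items():
    print(f"{name}: {share:.2f}%")
```

Recomputing the shares from the listed areas is a quick consistency check on the reported percentages.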

Keywords: above-ground biomass, dynamics, land productivity, man-environment relationship

Procedia PDF Downloads 119
472 Robust Variable Selection Based on Schwarz Information Criterion for Linear Regression Models

Authors: Shokrya Saleh A. Alshqaq, Abdullah Ali H. Ahmadini

Abstract:

The Schwarz information criterion (SIC) is a popular tool for selecting the best variables in regression datasets. However, SIC is defined using an unbounded estimator, namely, the least-squares (LS), which is highly sensitive to outlying observations, especially bad leverage points. A method for robust variable selection based on SIC for linear regression models is thus needed. This study investigates the robustness properties of SIC by deriving its influence function and proposes a robust SIC based on the MM-estimation scale. The aim of this study is to produce a criterion that can effectively select accurate models in the presence of vertical outliers and high leverage points. The advantages of the proposed robust SIC are demonstrated through a simulation study and an analysis of a real dataset.
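
As background, the classical least-squares form of the criterion for a Gaussian linear model can be written as n ln(RSS / n) + k ln(n), where RSS is the residual sum of squares, n the number of observations and k the number of parameters. The sketch below implements only this classical form; it is not the robust MM-based criterion proposed in the paper, and the example numbers are made up.

```python
import math


def schwarz_information_criterion(rss, n_obs, n_params):
    """Classical LS-based SIC (BIC) for a Gaussian linear regression model."""
    if rss <= 0 or n_obs <= 0:
        raise ValueError("rss and n_obs must be positive")
    return n_obs * math.log(rss / n_obs) + n_params * math.log(n_obs)


# Two hypothetical candidate models fitted to the same 50 observations.
sic_small = schwarz_information_criterion(rss=12.0, n_obs=50, n_params=2)
sic_large = schwarz_information_criterion(rss=11.5, n_obs=50, n_params=3)
preferred = "small" if sic_small < sic_large else "large"
print(sic_small, sic_large, preferred)
```

Because RSS enters directly, a single large residual can change which model has the lowest value, which is the sensitivity to outliers that motivates a robust variant.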

Keywords: influence function, robust variable selection, robust regression, Schwarz information criterion

Procedia PDF Downloads 113
471 Combining the Dynamic Conditional Correlation and Range-GARCH Models to Improve Covariance Forecasts

Authors: Piotr Fiszeder, Marcin Fałdziński, Peter Molnár

Abstract:

The dynamic conditional correlation model of Engle (2002) is one of the most popular multivariate volatility models. However, this model is based solely on closing prices. It has been documented in the literature that the high and low price of the day can be used in an efficient volatility estimation. We, therefore, suggest a model which incorporates high and low prices into the dynamic conditional correlation framework. Empirical evaluation of this model is conducted on three datasets: currencies, stocks, and commodity exchange-traded funds. The utilisation of realized variances and covariances as proxies for true variances and covariances allows us to reach a strong conclusion that our model outperforms not only the standard dynamic conditional correlation model but also a competing range-based dynamic conditional correlation model.
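
To illustrate why daily high and low prices carry volatility information, the sketch below computes the Parkinson range-based variance estimate, (ln(H / L))^2 / (4 ln 2), for a single day. This is a well-known range estimator used here only as an example; it is not the Range-GARCH or DCC specification from the paper, and the prices are made up.

```python
import math


def parkinson_variance(high, low):
    """Parkinson range-based estimate of one day's return variance."""
    if high <= 0 or low <= 0 or high < low:
        raise ValueError("prices must be positive with high >= low")
    return math.log(high / low) ** 2 / (4.0 * math.log(2.0))


# Hypothetical day with a 10% intraday range.
variance = parkinson_variance(high=110.0, low=100.0)
volatility = math.sqrt(variance)
print(variance, volatility)
```

A day whose high equals its low yields zero estimated variance, while a closing-price-only measure would ignore any intraday movement that reverses before the close.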

Keywords: volatility, DCC model, high and low prices, range-based models, covariance forecasting

Procedia PDF Downloads 149
470 Distorted Document Images Dataset for Text Detection and Recognition

Authors: Ilia Zharikov, Philipp Nikitin, Ilia Vasiliev, Vladimir Dokholyan

Abstract:

With the increasing popularity of document analysis and recognition systems, text detection (TD) and optical character recognition (OCR) in document images have become challenging tasks. However, to the best of our knowledge, no publicly available datasets for these particular problems exist. In this paper, we introduce a Distorted Document Images dataset (DDI-100) and provide a detailed analysis of DDI-100 in its current state. To create the dataset, we collected 7000 unique document pages and extended it by applying different types of distortions and geometric transformations. In total, DDI-100 contains more than 100,000 document images together with binary text masks and text and character locations in terms of bounding boxes. We also present an analysis of several state-of-the-art TD and OCR approaches on the presented dataset. Lastly, we demonstrate the usefulness of DDI-100 for improving the accuracy and stability of the considered TD and OCR models.

Keywords: document analysis, open dataset, optical character recognition, text detection

Procedia PDF Downloads 138
469 FPGA Implementation of Adaptive Clock Recovery for TDMoIP Systems

Authors: Semih Demir, Anil Celebi

Abstract:

Circuit-switched networks, widely used until the end of the 20th century, have been replaced by packet-switched networks. Time Division Multiplexing over Internet Protocol (TDMoIP) is a system that enables Time Division Multiplexing (TDM) traffic to be carried over packet-switched networks (PSN). In TDMoIP systems, devices that send TDM data to the PSN and receive it from the network must operate with the same clock frequency. This study aims to implement the clock synchronization process in Field Programmable Gate Array (FPGA) chips using time information attached to the packets received from the PSN. The designed hardware is verified using datasets obtained for the different carrier types and by comparing the results with the software model. Field tests are also performed using a real-time TDMoIP system.

Keywords: clock recovery on TDMoIP, FPGA, MATLAB reference model, clock synchronization

Procedia PDF Downloads 244
468 Person Re-Identification using Siamese Convolutional Neural Network

Authors: Sello Mokwena, Monyepao Thabang

Abstract:

In this study, we propose a comprehensive approach to address the challenges in person re-identification models. By combining a centroid tracking algorithm with a Siamese convolutional neural network model, our method excels in detecting, tracking, and capturing robust person features across non-overlapping camera views. The algorithm efficiently identifies individuals in the camera network, while the neural network extracts fine-grained global features for precise cross-image comparisons. The approach's effectiveness is further accentuated by leveraging the camera network topology for guidance. Our empirical analysis on benchmark datasets highlights its competitive performance, particularly evident when background subtraction techniques are selectively applied, underscoring its potential in advancing person re-identification techniques.

Keywords: camera network, convolutional neural network topology, person tracking, person re-identification, siamese

Procedia PDF Downloads 43
467 Healthcare Data Mining Innovations

Authors: Eugenia Jilinguirian

Abstract:

In the healthcare industry, data mining is essential since it transforms the field by collecting useful data from large datasets. Data mining is the process of applying advanced analytical methods to large patient records and medical histories in order to identify patterns, correlations, and trends. Healthcare professionals can improve diagnosis accuracy, uncover hidden linkages, and predict disease outcomes by carefully examining these statistics. Additionally, data mining supports personalized medicine by tailoring treatment to the unique attributes of each patient. This proactive strategy helps allocate resources more efficiently, enhances patient care, and streamlines operations. However, to apply data mining effectively and ensure the responsible use of private healthcare information, issues like data privacy and security must be carefully considered. Data mining continues to be vital in the search for more effective, efficient, and individualized healthcare solutions as technology evolves.

Keywords: data mining, healthcare, big data, individualised healthcare, healthcare solutions, database

Procedia PDF Downloads 38
466 Estimating Estimators: An Empirical Comparison of Non-Invasive Analysis Methods

Authors: Yan Torres, Fernanda Simoes, Francisco Petrucci-Fonseca, Freddie-Jeanne Richard

Abstract:

Non-invasive samples are an alternative to collecting genetic samples directly. Non-invasive samples are collected without manipulating the animal (e.g., scats, feathers and hairs). Nevertheless, the use of non-invasive samples has some limitations. The main issue is degraded DNA, which leads to poorer extraction efficiency and genotyping. Those errors delayed the widespread use of non-invasive genetic information for some years. Genotyping errors can be limited by using analysis methods that can accommodate the errors and singularities of non-invasive samples. Genotype matching and population estimation algorithms can be highlighted as important analysis tools that have been adapted to deal with those errors. Despite this recent development of analysis methods, there is still a lack of empirical comparisons of their performance. A comparison of methods on datasets that differ in size and structure can be useful for future studies, since non-invasive samples are a powerful tool for obtaining information, especially for endangered and rare populations. To compare the analysis methods, four different datasets obtained from the Dryad digital repository were used. Three different matching algorithms (Cervus, Colony and Error Tolerant Likelihood Matching - ETLM) were used for matching genotypes and two different ones for population estimation (Capwire and BayesN). The three matching algorithms showed different patterns of results. ETLM produced fewer unique individuals and recaptures. A similarity in the matched genotypes between Colony and Cervus was observed. That is not a surprise, given the similarity between those methods in their pairwise likelihood and clustering algorithms. The matching of ETLM showed almost no similarity with the genotypes that were matched with the other methods. 
The different clustering algorithm and error model of ETLM seem to lead to a more stringent selection, although the processing time and interface friendliness of ETLM were the worst among the compared methods. The population estimators performed differently depending on the dataset. There was a consensus between the different estimators for only one dataset. BayesN showed both higher and lower estimates when compared with Capwire. Unlike Capwire, BayesN does not consider the total number of recaptures, only the recapture events. This makes the estimator sensitive to data heterogeneity, which here means different capture rates between individuals. In those examples, the tolerance for homogeneity seems to be crucial for BayesN to work properly. Both methods are user-friendly and have reasonable processing time. An extended analysis with simulated genotype data can clarify the sensitivity of the algorithms. The present comparison of the matching methods indicates that Colony seems to be more appropriate for general use considering a time/interface/robustness balance. The heterogeneity of the recaptures strongly affected the BayesN estimates, leading to over- and underestimation of population numbers. Capwire is therefore advisable for general use since it performs better in a wide range of situations.

Keywords: algorithms, genetics, matching, population

Procedia PDF Downloads 114
465 A Machine Learning Approach to Detecting Evasive PDF Malware

Authors: Vareesha Masood, Ammara Gul, Nabeeha Areej, Muhammad Asif Masood, Hamna Imran

Abstract:

The universal use of PDF files has prompted hackers to exploit them for malicious purposes by hiding malicious code in PDF files on their victims’ machines. Machine learning has proven to be the most efficient approach for identifying benign files and detecting files with PDF malware. This paper proposes an approach using a decision tree classifier with parameters. A modern, inclusive dataset, CIC-Evasive-PDFMal2022, produced by Lockheed Martin’s Cyber Security wing, is used. It is one of the most reliable datasets to use in this field. We designed a PDF malware detection system that achieved 99.2%. Compared with other cutting-edge models in the same field of study, the suggested model performs well in detecting PDF malware. Accordingly, we provide the fastest, most reliable, and most efficient PDF malware detection approach in this paper.
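
To make the idea of a decision tree classifier concrete, the sketch below finds a single tree split (a "stump") by minimizing Gini impurity. It is only an illustration of the general technique. The abstract does not state the features, parameters, or tree-building procedure used, so the feature (a count of embedded JavaScript objects), the data, and all function names are assumptions.

```python
# Hypothetical sketch of one decision-tree split chosen by Gini impurity.
# Feature values and labels are made up for illustration.


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' split."""
    best = (None, float("inf"))
    # The largest value is excluded because it would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Assumed feature: number of embedded JavaScript objects per file.
js_objects = [0, 0, 1, 0, 3, 5, 4, 2]
is_malicious = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, impurity = best_split(js_objects, is_malicious)


def predict(js_object_count):
    """Classify a file as malicious (1) if it exceeds the learned threshold."""
    return int(js_object_count > threshold)
```

A full decision tree applies the same split search recursively to each side until a stopping rule, such as a maximum depth, is met.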

Keywords: PDF, PDF malware, decision tree classifier, random forest classifier

Procedia PDF Downloads 55
464 Improvement of Ground Truth Data for Eye Location on Infrared Driver Recordings

Authors: Sorin Valcan, Mihail Gaianu

Abstract:

Labeling is a very costly and time-consuming process that aims to generate datasets for training neural networks across several functionalities and projects. For driver monitoring system projects, the need for labeled images has a significant impact on the budget and distribution of effort. This paper presents the modifications made to an algorithm used to generate ground truth data for 2D eye location on infrared images of drivers, in order to improve the quality of the data and the performance of the trained neural networks. The algorithm restrictions become tougher, which makes it more accurate but also less constant. The resulting dataset becomes smaller and shall not be altered by any kind of manual label adjustment before being used in the neural network training process. These changes resulted in much better performance of the trained neural networks.

Keywords: labeling automation, infrared camera, driver monitoring, eye detection, convolutional neural networks

Procedia PDF Downloads 80
463 Designing Emergency Response Network for Rail Hazmat Shipments

Authors: Ali Vaezi, Jyotirmoy Dalal, Manish Verma

Abstract:

The railroad is one of the primary transportation modes for hazardous materials (hazmat) shipments in North America. Installing an emergency response network capable of providing a commensurate response is one of the primary levers to contain (or mitigate) the adverse consequences from rail hazmat incidents. To this end, we propose a two-stage stochastic program to determine the location of and equipment packages to be stockpiled at each response facility. The raw input data collected from publicly available reports were processed, fed into the proposed optimization program, and then tested on a realistic railroad network in Ontario (Canada). From the resulting analyses, we conclude that the decisions based only on empirical datasets would undermine the effectiveness of the resulting network; coverage can be improved by redistributing equipment in the network, purchasing equipment with higher containment capacity, and making use of a disutility multiplier factor.

Keywords: hazmat, rail network, stochastic programming, emergency response

Procedia PDF Downloads 149
462 Understanding and Improving Neural Network Weight Initialization

Authors: Diego Aguirre, Olac Fuentes

Abstract:

In this paper, we present a taxonomy of weight initialization schemes used in deep learning. We survey the most representative techniques in each class and compare them in terms of overhead cost, convergence rate, and applicability. We also introduce a new weight initialization scheme. In this technique, we perform an initial feedforward pass through the network using an initialization mini-batch. Using statistics obtained from this pass, we initialize the weights of the network so that the following properties are met: 1) weight matrices are orthogonal; 2) ReLU layers produce a predetermined number of non-zero activations; 3) the output produced by each internal layer has unit variance; 4) weights in the last layer are chosen to minimize the error in the initial mini-batch. We evaluate our method on three popular architectures and achieve faster convergence rates on the MNIST, CIFAR-10/100, and ImageNet datasets when compared to state-of-the-art initialization techniques.
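
Properties 1 and 3 above can be illustrated for a single linear layer with the minimal sketch below. It is not the authors' implementation: it assumes Gram-Schmidt orthogonalization, a single scalar rescaling computed from one mini-batch, and variance pooled over all output units. All function names and data are made up. After rescaling, the rows stay mutually orthogonal but no longer have unit norm.

```python
import random
import statistics


def orthonormal_rows(n_rows, n_cols, rng):
    """Build mutually orthonormal rows by Gram-Schmidt (needs n_rows <= n_cols)."""
    rows = []
    while len(rows) < n_rows:
        v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
        for u in rows:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-8:  # skip degenerate draws
            rows.append([a / norm for a in v])
    return rows


def forward(weights, batch):
    """Linear layer: one output per weight row, for every input in the batch."""
    return [[sum(w * x for w, x in zip(row, inp)) for row in weights] for inp in batch]


def init_layer(n_out, n_in, batch, rng):
    """Orthogonal init, then rescale so outputs on `batch` have unit variance."""
    weights = orthonormal_rows(n_out, n_in, rng)
    outputs = [y for out in forward(weights, batch) for y in out]
    scale = 1.0 / statistics.pstdev(outputs)
    return [[w * scale for w in row] for row in weights]


rng = random.Random(0)
# Made-up initialization mini-batch: 32 inputs with 8 features each.
batch = [[rng.gauss(0.0, 3.0) for _ in range(8)] for _ in range(32)]
weights = init_layer(4, 8, batch, rng)

outputs = [y for out in forward(weights, batch) for y in out]
output_variance = statistics.pvariance(outputs)
row_dot = sum(a * b for a, b in zip(weights[0], weights[1]))
```

In a multi-layer network the same step would be repeated layer by layer, feeding each layer the mini-batch outputs of the previous one.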

Keywords: deep learning, image classification, supervised learning, weight initialization

Procedia PDF Downloads 106
461 Sentiment Analysis of Consumers’ Perceptions on Social Media about the Main Mobile Providers in Jamaica

Authors: Sherrene Bogle, Verlia Bogle, Tyrone Anderson

Abstract:

In recent years, organizations have become increasingly interested in the possibility of analyzing social media as a means of gaining meaningful feedback about their products and services. The aspect-based sentiment analysis approach is used to predict the sentiment for Twitter datasets for Digicel and Lime, the main mobile companies in Jamaica, using supervised learning classification techniques. The results indicate an average of 82.2 percent accuracy in classifying tweets when comparing three separate classification algorithms against the purported baseline of 70 percent and an average root mean squared error of 0.31. These results indicate that the analysis of sentiment on social media in order to gain customer feedback can be a viable solution for mobile companies looking to improve business performance.
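
For reference, the two metrics reported above can be computed as in the short sketch below. The labels are made up, and the abstract does not say whether the root mean squared error was computed on hard labels or on predicted probabilities, so the use of 0/1 labels here is an assumption.

```python
# Sketch of the two reported metrics on made-up binary sentiment labels
# (1 = positive, 0 = negative).


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return (sum(squared_errors) / len(y_true)) ** 0.5


y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

acc = accuracy(y_true, y_pred)
err = rmse(y_true, y_pred)
```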

Keywords: machine learning, sentiment analysis, social media, supervised learning

Procedia PDF Downloads 406
460 Opening up Government Datasets for Big Data Analysis to Support Policy Decisions

Authors: K. Hardy, A. Maurushat

Abstract:

Policy makers are increasingly looking to make evidence-based decisions. Evidence-based decisions have historically used rigorous methodologies of empirical studies by research institutes, as well as less reliable immediate surveys/polls, often with limited sample sizes. As we move into the era of Big Data analytics, policy makers are looking to different methodologies to deliver reliable empirics in real time. The question is not why these people did this for the last 10 years, but why these people are doing this now, whether this is undesirable, and how we can have an impact to promote change immediately. Big data analytics relies heavily on government data that has been released into the public domain. The open data movement promises greater productivity and more efficient delivery of services; however, Australian government agencies remain reluctant to release their data to the general public. This paper considers the barriers to releasing government data as open data, and how these barriers might be overcome.

Keywords: big data, open data, productivity, data governance

Procedia PDF Downloads 341
459 MarginDistillation: Distillation for Face Recognition Neural Networks with Margin-Based Softmax

Authors: Svitov David, Alyamkin Sergey

Abstract:

The use of convolutional neural networks (CNNs) in conjunction with the margin-based softmax approach demonstrates state-of-the-art performance for the face recognition problem. Recently, lightweight neural network models trained with the margin-based softmax have been introduced for the face identification task on edge devices. In this paper, we propose a distillation method for lightweight neural network architectures that outperforms other known methods for the face recognition task on the LFW, AgeDB-30 and Megaface datasets. The idea of the proposed method is to use class centers from the teacher network for the student network. The student network is then trained to reproduce the angles between the class centers and the face embeddings predicted by the teacher network.
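
The angle-matching idea can be sketched as follows. This is an illustrative loss only, not the authors' formulation: it assumes a mean squared difference between student and teacher angles, and the function names, two-dimensional embeddings, and class centers are made up.

```python
import math


def angle(u, v):
    """Angle in radians between two vectors, via the cosine of their dot product."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard against rounding
    return math.acos(cosine)


def angle_matching_loss(student_emb, teacher_emb, class_centers, labels):
    """Mean squared difference between student and teacher angles to the
    (teacher-provided) center of each sample's class."""
    total = 0.0
    for s, t, label in zip(student_emb, teacher_emb, labels):
        center = class_centers[label]
        total += (angle(s, center) - angle(t, center)) ** 2
    return total / len(labels)


# Made-up class centers taken from the teacher, one per identity.
class_centers = {0: [1.0, 0.0], 1: [0.0, 1.0]}
labels = [0, 1]
teacher_emb = [[1.0, 0.1], [0.2, 1.0]]

# A student that reproduces the teacher's embeddings has zero loss.
loss_matched = angle_matching_loss(teacher_emb, teacher_emb, class_centers, labels)

# A student whose embeddings sit at different angles is penalized.
student_emb = [[0.5, 0.5], [1.0, 0.2]]
loss_mismatched = angle_matching_loss(student_emb, teacher_emb, class_centers, labels)
```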

Keywords: ArcFace, distillation, face recognition, margin-based softmax

Procedia PDF Downloads 113
458 Fusion of Shape and Texture for Unconstrained Periocular Authentication

Authors: D. R. Ambika, K. R. Radhika, D. Seshachalam

Abstract:

Unconstrained authentication is an important component for personal automated systems and human-computer interfaces. Existing solutions mostly use the face as the primary object of analysis. The performance of face-based systems is largely determined by the extent of deformation in the facial region and the amount of useful information available in occluded face images. The periocular region is a useful portion of the face that combines discriminative ability with resistance to deformation. A reliable portion of the periocular area is available in occluded images. The present work demonstrates that a joint representation of periocular texture and periocular structure provides an effective expression- and pose-invariant representation. The proposed methodology provides an effective and compact description of periocular texture and shape. The method is tested on four benchmark datasets exhibiting varied acquisition conditions.

Keywords: periocular authentication, Zernike moments, LBP variance, shape and texture fusion

Procedia PDF Downloads 254