Search results for: imbalance dataset
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 1409

Search results for: imbalance dataset

1199 Data Science-Based Key Factor Analysis and Risk Prediction of Diabetic

Authors: Fei Gao, Rodolfo C. Raga Jr.

Abstract:

This research proposal will ascertain the major risk factors for diabetes and to design a predictive model for risk assessment. The project aims to improve diabetes early detection and management by utilizing data science techniques, which may improve patient outcomes and healthcare efficiency. The phase relation values of each attribute were used to analyze and choose the attributes that might influence the examiner's survival probability using Diabetes Health Indicators Dataset from Kaggle’s data as the research data. We compare and evaluate eight machine learning algorithms. Our investigation begins with comprehensive data preprocessing, including feature engineering and dimensionality reduction, aimed at enhancing data quality. The dataset, comprising health indicators and medical data, serves as a foundation for training and testing these algorithms. A rigorous cross-validation process is applied, and we assess their performance using five key metrics like accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC-ROC). After analyzing the data characteristics, investigate their impact on the likelihood of diabetes and develop corresponding risk indicators.

Keywords: diabetes, risk factors, predictive model, risk assessment, data science techniques, early detection, data analysis, Kaggle

Procedia PDF Downloads 75
1198 Electrocardiogram-Based Heartbeat Classification Using Convolutional Neural Networks

Authors: Jacqueline Rose T. Alipo-on, Francesca Isabelle F. Escobar, Myles Joshua T. Tan, Hezerul Abdul Karim, Nouar Al Dahoul

Abstract:

Electrocardiogram (ECG) signal analysis and processing are crucial in the diagnosis of cardiovascular diseases, which are considered one of the leading causes of mortality worldwide. However, the traditional rule-based analysis of large volumes of ECG data is time-consuming, labor-intensive, and prone to human errors. With the advancement of the programming paradigm, algorithms such as machine learning have been increasingly used to perform an analysis of ECG signals. In this paper, various deep learning algorithms were adapted to classify five classes of heartbeat types. The dataset used in this work is the synthetic MIT-BIH Arrhythmia dataset produced from generative adversarial networks (GANs). Various deep learning models such as ResNet-50 convolutional neural network (CNN), 1-D CNN, and long short-term memory (LSTM) were evaluated and compared. ResNet-50 was found to outperform other models in terms of recall and F1 score using a five-fold average score of 98.88% and 98.87%, respectively. 1-D CNN, on the other hand, was found to have the highest average precision of 98.93%.

Keywords: heartbeat classification, convolutional neural network, electrocardiogram signals, generative adversarial networks, long short-term memory, ResNet-50

Procedia PDF Downloads 128
1197 Leveraging Natural Language Processing for Legal Artificial Intelligence: A Longformer Approach for Taiwanese Legal Cases

Authors: Hsin Lee, Hsuan Lee

Abstract:

Legal artificial intelligence (LegalAI) has been increasing applications within legal systems, propelled by advancements in natural language processing (NLP). Compared with general documents, legal case documents are typically long text sequences with intrinsic logical structures. Most existing language models have difficulty understanding the long-distance dependencies between different structures. Another unique challenge is that while the Judiciary of Taiwan has released legal judgments from various levels of courts over the years, there remains a significant obstacle in the lack of labeled datasets. This deficiency makes it difficult to train models with strong generalization capabilities, as well as accurately evaluate model performance. To date, models in Taiwan have yet to be specifically trained on judgment data. Given these challenges, this research proposes a Longformer-based pre-trained language model explicitly devised for retrieving similar judgments in Taiwanese legal documents. This model is trained on a self-constructed dataset, which this research has independently labeled to measure judgment similarities, thereby addressing a void left by the lack of an existing labeled dataset for Taiwanese judgments. This research adopts strategies such as early stopping and gradient clipping to prevent overfitting and manage gradient explosion, respectively, thereby enhancing the model's performance. The model in this research is evaluated using both the dataset and the Average Entropy of Offense-charged Clustering (AEOC) metric, which utilizes the notion of similar case scenarios within the same type of legal cases. Our experimental results illustrate our model's significant advancements in handling similarity comparisons within extensive legal judgments. By enabling more efficient retrieval and analysis of legal case documents, our model holds the potential to facilitate legal research, aid legal decision-making, and contribute to the further development of LegalAI in Taiwan.

Keywords: legal artificial intelligence, computation and language, language model, Taiwanese legal cases

Procedia PDF Downloads 72
1196 A Hybrid Feature Selection and Deep Learning Algorithm for Cancer Disease Classification

Authors: Niousha Bagheri Khulenjani, Mohammad Saniee Abadeh

Abstract:

Learning from very big datasets is a significant problem for most present data mining and machine learning algorithms. MicroRNA (miRNA) is one of the important big genomic and non-coding datasets presenting the genome sequences. In this paper, a hybrid method for the classification of the miRNA data is proposed. Due to the variety of cancers and high number of genes, analyzing the miRNA dataset has been a challenging problem for researchers. The number of features corresponding to the number of samples is high and the data suffer from being imbalanced. The feature selection method has been used to select features having more ability to distinguish classes and eliminating obscures features. Afterward, a Convolutional Neural Network (CNN) classifier for classification of cancer types is utilized, which employs a Genetic Algorithm to highlight optimized hyper-parameters of CNN. In order to make the process of classification by CNN faster, Graphics Processing Unit (GPU) is recommended for calculating the mathematic equation in a parallel way. The proposed method is tested on a real-world dataset with 8,129 patients, 29 different types of tumors, and 1,046 miRNA biomarkers, taken from The Cancer Genome Atlas (TCGA) database.

Keywords: cancer classification, feature selection, deep learning, genetic algorithm

Procedia PDF Downloads 111
1195 A Multi-Role Oriented Collaboration Platform for Distributed Disaster Reduction in China

Authors: Linyao Qiu, Zhiqiang Du

Abstract:

As the rapid development of urbanization, economic developments, and steady population growth in China, the widespread devastation, economic damages, and loss of human lives caused by numerous forms of natural disasters are becoming increasingly serious every year. Disaster management requires available and effective cooperation of different roles and organizations in whole process including mitigation, preparedness, response and recovery. Due to the imbalance of regional development in China, the disaster management capabilities of national and provincial disaster reduction centers are uneven. When an undeveloped area suffers from disaster, neither local reduction department could get first-hand information like high-resolution remote sensing images from satellites and aircrafts independently, nor sharing mechanism is provided for the department to access to data resources deployed in other place directly. Most existing disaster management systems operate in a typical passive data-centric mode and work for single department, where resources cannot be fully shared. The impediment blocks local department and group from quick emergency response and decision-making. In this paper, we introduce a collaborative platform for distributed disaster reduction. To address the issues of imbalance of sharing data sources and technology in the process of disaster reduction, we propose a multi-role oriented collaboration business mechanism, which is capable of scheduling and allocating for optimum utilization of multiple resources, to link various roles for collaborative reduction business in different place. The platform fully considers the difference of equipment conditions in different provinces and provide several service modes to satisfy technology need in disaster reduction. An integrated collaboration system based on focusing services mechanism is designed and implemented for resource scheduling, functional integration, data processing, task management, collaborative mapping, and visualization. Actual applications illustrate that the platform can well support data sharing and business collaboration between national and provincial department. It could significantly improve the capability of disaster reduction in China.

Keywords: business collaboration, data sharing, distributed disaster reduction, focusing service

Procedia PDF Downloads 295
1194 Disaggregation the Daily Rainfall Dataset into Sub-Daily Resolution in the Temperate Oceanic Climate Region

Authors: Mohammad Bakhshi, Firas Al Janabi

Abstract:

High resolution rain data are very important to fulfill the input of hydrological models. Among models of high-resolution rainfall data generation, the temporal disaggregation was chosen for this study. The paper attempts to generate three different rainfall resolutions (4-hourly, hourly and 10-minutes) from daily for around 20-year record period. The process was done by DiMoN tool which is based on random cascade model and method of fragment. Differences between observed and simulated rain dataset are evaluated with variety of statistical and empirical methods: Kolmogorov-Smirnov test (K-S), usual statistics, and Exceedance probability. The tool worked well at preserving the daily rainfall values in wet days, however, the generated data are cumulated in a shorter time period and made stronger storms. It is demonstrated that the difference between generated and observed cumulative distribution function curve of 4-hourly datasets is passed the K-S test criteria while in hourly and 10-minutes datasets the P-value should be employed to prove that their differences were reasonable. The results are encouraging considering the overestimation of generated high-resolution rainfall data.

Keywords: DiMoN Tool, disaggregation, exceedance probability, Kolmogorov-Smirnov test, rainfall

Procedia PDF Downloads 201
1193 ECG Based Reliable User Identification Using Deep Learning

Authors: R. N. Begum, Ambalika Sharma, G. K. Singh

Abstract:

Identity theft has serious ramifications beyond data and personal information loss. This necessitates the implementation of robust and efficient user identification systems. Therefore, automatic biometric recognition systems are the need of the hour, and ECG-based systems are unquestionably the best choice due to their appealing inherent characteristics. The CNNs are the recent state-of-the-art techniques for ECG-based user identification systems. However, the results obtained are significantly below standards, and the situation worsens as the number of users and types of heartbeats in the dataset grows. As a result, this study proposes a highly accurate and resilient ECG-based person identification system using CNN's dense learning framework. The proposed research explores explicitly the calibre of dense CNNs in the field of ECG-based human recognition. The study tests four different configurations of dense CNN which are trained on a dataset of recordings collected from eight popular ECG databases. With the highest FAR of 0.04 percent and the highest FRR of 5%, the best performing network achieved an identification accuracy of 99.94 percent. The best network is also tested with various train/test split ratios. The findings show that DenseNets are not only extremely reliable but also highly efficient. Thus, they might also be implemented in real-time ECG-based human recognition systems.

Keywords: Biometrics, Dense Networks, Identification Rate, Train/Test split ratio

Procedia PDF Downloads 160
1192 Quantitative Analysis of Contract Variations Impact on Infrastructure Project Performance

Authors: Soheila Sadeghi

Abstract:

Infrastructure projects often encounter contract variations that can significantly deviate from the original tender estimates, leading to cost overruns, schedule delays, and financial implications. This research aims to quantitatively assess the impact of changes in contract variations on project performance by conducting an in-depth analysis of a comprehensive dataset from the Regional Airport Car Park project. The dataset includes tender budget, contract quantities, rates, claims, and revenue data, providing a unique opportunity to investigate the effects of variations on project outcomes. The study focuses on 21 specific variations identified in the dataset, which represent changes or additions to the project scope. The research methodology involves establishing a baseline for the project's planned cost and scope by examining the tender budget and contract quantities. Each variation is then analyzed in detail, comparing the actual quantities and rates against the tender estimates to determine their impact on project cost and schedule. The claims data is utilized to track the progress of work and identify deviations from the planned schedule. The study employs statistical analysis using R to examine the dataset, including tender budget, contract quantities, rates, claims, and revenue data. Time series analysis is applied to the claims data to track progress and detect variations from the planned schedule. Regression analysis is utilized to investigate the relationship between variations and project performance indicators, such as cost overruns and schedule delays. The research findings highlight the significance of effective variation management in construction projects. The analysis reveals that variations can have a substantial impact on project cost, schedule, and financial outcomes. The study identifies specific variations that had the most significant influence on the Regional Airport Car Park project's performance, such as PV03 (additional fill, road base gravel, spray seal, and asphalt), PV06 (extension to the commercial car park), and PV07 (additional box out and general fill). These variations contributed to increased costs, schedule delays, and changes in the project's revenue profile. The study also examines the effectiveness of project management practices in managing variations and mitigating their impact. The research suggests that proactive risk management, thorough scope definition, and effective communication among project stakeholders can help minimize the negative consequences of variations. The findings emphasize the importance of establishing clear procedures for identifying, assessing, and managing variations throughout the project lifecycle. The outcomes of this research contribute to the body of knowledge in construction project management by demonstrating the value of analyzing tender, contract, claims, and revenue data in variation impact assessment. However, the research acknowledges the limitations imposed by the dataset, particularly the absence of detailed contract and tender documents. This constraint restricts the depth of analysis possible in investigating the root causes and full extent of variations' impact on the project. Future research could build upon this study by incorporating more comprehensive data sources to further explore the dynamics of variations in construction projects.

Keywords: contract variation impact, quantitative analysis, project performance, claims analysis

Procedia PDF Downloads 40
1191 Mental Health Diagnosis through Machine Learning Approaches

Authors: Md Rafiqul Islam, Ashir Ahmed, Anwaar Ulhaq, Abu Raihan M. Kamal, Yuan Miao, Hua Wang

Abstract:

Mental health of people is equally important as of their physical health. Mental health and well-being are influenced not only by individual attributes but also by the social circumstances in which people find themselves and the environment in which they live. Like physical health, there is a number of internal and external factors such as biological, social and occupational factors that could influence the mental health of people. People living in poverty, suffering from chronic health conditions, minority groups, and those who exposed to/or displaced by war or conflict are generally more likely to develop mental health conditions. However, to authors’ best knowledge, there is dearth of knowledge on the impact of workplace (especially the highly stressed IT/Tech workplace) on the mental health of its workers. This study attempts to examine the factors influencing the mental health of tech workers. A publicly available dataset containing more than 65,000 cells and 100 attributes is examined for this purpose. Number of machine learning techniques such as ‘Decision Tree’, ‘K nearest neighbor’ ‘Support Vector Machine’ and ‘Ensemble’, are then applied to the selected dataset to draw the findings. It is anticipated that the analysis reported in this study would contribute in presenting useful insights on the attributes contributing in the mental health of tech workers using relevant machine learning techniques.

Keywords: mental disorder, diagnosis, occupational stress, IT workplace

Procedia PDF Downloads 288
1190 Characterization of the Blood Microbiome in Rheumatoid Arthritis Patients Compared to Healthy Control Subjects Using V4 Region 16S rRNA Sequencing

Authors: D. Hammad, D. P. Tonge

Abstract:

Rheumatoid arthritis (RA) is a disabling and common autoimmune disease during which the body's immune system attacks healthy tissues. This results in complicated and long-lasting actions being carried out by the immune system, which typically only occurs when the immune system encounters a foreign object. In the case of RA, the disease affects millions of people and causes joint inflammation, ultimately leading to the destruction of cartilage and bone. Interestingly, the disease mechanism still remains unclear. It is likely that RA occurs as a result of a complex interplay of genetic and environmental factors including an imbalance in the microorganism population inside our body. The human microbiome or microbiota is an extensive community of microorganisms in and on the bodies of animals, which comprises bacteria, fungi, viruses, and protozoa. Recently, the development of molecular techniques to characterize entire bacterial communities has renewed interest in the involvement of the microbiome in the development and progression of RA. We believe that an imbalance in some of the specific bacterial species in the gut, mouth and other sites may lead to atopobiosis; the translocation of these organisms into the blood, and that this may lead to changes in immune system status. The aim of this study was, therefore, to characterize the microbiome of RA serum samples in comparison to healthy control subjects using 16S rRNA gene amplification and sequencing. Serum samples were obtained from healthy control volunteers and from patients with RA both prior to, and following treatment. The bacterial community present in each sample was identified utilizing V4 region 16S rRNA amplification and sequencing. Bacterial identification, to the lowest taxonomic rank, was performed using a range of bioinformatics tools. Significantly, the proportions of the Lachnospiraceae, Ruminococcaceae, and Halmonadaceae families were significantly increased in the serum of RA patients compared with healthy control serum. Furthermore, the abundance of Bacteroides and Lachnospiraceae nk4a136_group, Lachnospiraceae_UGC-001, RuminococcaceaeUCG-014, Rumnococcus-1, and Shewanella was also raised in the serum of RA patients relative to healthy control serum. These data support the notion of a blood microbiome and reveal RA-associated changes that may have significant implications for biomarker development and may present much-needed opportunities for novel therapeutic development.

Keywords: blood microbiome, gut and oral bacteria, Rheumatoid arthritis, 16S rRNA gene sequencing

Procedia PDF Downloads 132
1189 Brain Tumor Detection and Classification Using Pre-Trained Deep Learning Models

Authors: Aditya Karade, Sharada Falane, Dhananjay Deshmukh, Vijaykumar Mantri

Abstract:

Brain tumors pose a significant challenge in healthcare due to their complex nature and impact on patient outcomes. The application of deep learning (DL) algorithms in medical imaging have shown promise in accurate and efficient brain tumour detection. This paper explores the performance of various pre-trained DL models ResNet50, Xception, InceptionV3, EfficientNetB0, DenseNet121, NASNetMobile, VGG19, VGG16, and MobileNet on a brain tumour dataset sourced from Figshare. The dataset consists of MRI scans categorizing different types of brain tumours, including meningioma, pituitary, glioma, and no tumour. The study involves a comprehensive evaluation of these models’ accuracy and effectiveness in classifying brain tumour images. Data preprocessing, augmentation, and finetuning techniques are employed to optimize model performance. Among the evaluated deep learning models for brain tumour detection, ResNet50 emerges as the top performer with an accuracy of 98.86%. Following closely is Xception, exhibiting a strong accuracy of 97.33%. These models showcase robust capabilities in accurately classifying brain tumour images. On the other end of the spectrum, VGG16 trails with the lowest accuracy at 89.02%.

Keywords: brain tumour, MRI image, detecting and classifying tumour, pre-trained models, transfer learning, image segmentation, data augmentation

Procedia PDF Downloads 74
1188 Analysis of the Lung Microbiome in Cystic Fibrosis Patients Using 16S Sequencing

Authors: Manasvi Pinnaka, Brianna Chrisman

Abstract:

Cystic fibrosis patients often develop lung infections that range anywhere in severity from mild to life-threatening due to the presence of thick and sticky mucus that fills their airways. Since many of these infections are chronic, they not only affect a patient’s ability to breathe but also increase the chances of mortality by respiratory failure. With a publicly available dataset of DNA sequences from bacterial species in the lung microbiome of cystic fibrosis patients, the correlations between different microbial species in the lung and the extent of deterioration of lung function were investigated. 16S sequencing technologies were used to determine the microbiome composition of the samples in the dataset. For the statistical analyses, referencing helped distinguish between taxonomies, and the proportions of certain taxa relative to another were determined. It was found that the Fusobacterium, Actinomyces, and Leptotrichia microbial types all had a positive correlation with the FEV1 score, indicating the potential displacement of these species by pathogens as the disease progresses. However, the dominant pathogens themselves, including Pseudomonas aeruginosa and Staphylococcus aureus, did not have statistically significant negative correlations with the FEV1 score as described by past literature. Examining the lung microbiology of cystic fibrosis patients can help with the prediction of the current condition of lung function, with the potential to guide doctors when designing personalized treatment plans for patients.

Keywords: bacterial infections, cystic fibrosis, lung microbiome, 16S sequencing

Procedia PDF Downloads 99
1187 Evaluation and Compression of Different Language Transformer Models for Semantic Textual Similarity Binary Task Using Minority Language Resources

Authors: Ma. Gracia Corazon Cayanan, Kai Yuen Cheong, Li Sha

Abstract:

Training a language model for a minority language has been a challenging task. The lack of available corpora to train and fine-tune state-of-the-art language models is still a challenge in the area of Natural Language Processing (NLP). Moreover, the need for high computational resources and bulk data limit the attainment of this task. In this paper, we presented the following contributions: (1) we introduce and used a translation pair set of Tagalog and English (TL-EN) in pre-training a language model to a minority language resource; (2) we fine-tuned and evaluated top-ranking and pre-trained semantic textual similarity binary task (STSB) models, to both TL-EN and STS dataset pairs. (3) then, we reduced the size of the model to offset the need for high computational resources. Based on our results, the models that were pre-trained to translation pairs and STS pairs can perform well for STSB task. Also, having it reduced to a smaller dimension has no negative effect on the performance but rather has a notable increase on the similarity scores. Moreover, models that were pre-trained to a similar dataset have a tremendous effect on the model’s performance scores.

Keywords: semantic matching, semantic textual similarity binary task, low resource minority language, fine-tuning, dimension reduction, transformer models

Procedia PDF Downloads 211
1186 Use of Gaussian-Euclidean Hybrid Function Based Artificial Immune System for Breast Cancer Diagnosis

Authors: Cuneyt Yucelbas, Seral Ozsen, Sule Yucelbas, Gulay Tezel

Abstract:

Due to the fact that there exist only a small number of complex systems in artificial immune system (AIS) that work out nonlinear problems, nonlinear AIS approaches, among the well-known solution techniques, need to be developed. Gaussian function is usually used as similarity estimation in classification problems and pattern recognition. In this study, diagnosis of breast cancer, the second type of the most widespread cancer in women, was performed with different distance calculation functions that euclidean, gaussian and gaussian-euclidean hybrid function in the clonal selection model of classical AIS on Wisconsin Breast Cancer Dataset (WBCD), which was taken from the University of California, Irvine Machine-Learning Repository. We used 3-fold cross validation method to train and test the dataset. According to the results, the maximum test classification accuracy was reported as 97.35% by using of gaussian-euclidean hybrid function for fold-3. Also, mean of test classification accuracies for all of functions were obtained as 94.78%, 94.45% and 95.31% with use of euclidean, gaussian and gaussian-euclidean, respectively. With these results, gaussian-euclidean hybrid function seems to be a potential distance calculation method, and it may be considered as an alternative distance calculation method for hard nonlinear classification problems.

Keywords: artificial immune system, breast cancer diagnosis, Euclidean function, Gaussian function

Procedia PDF Downloads 435
1185 JaCoText: A Pretrained Model for Java Code-Text Generation

Authors: Jessica Lopez Espejel, Mahaman Sanoussi Yahaya Alassan, Walid Dahhane, El Hassane Ettifouri

Abstract:

Pretrained transformer-based models have shown high performance in natural language generation tasks. However, a new wave of interest has surged: automatic programming language code generation. This task consists of translating natural language instructions to a source code. Despite the fact that well-known pre-trained models on language generation have achieved good performance in learning programming languages, effort is still needed in automatic code generation. In this paper, we introduce JaCoText, a model based on Transformer neural network. It aims to generate java source code from natural language text. JaCoText leverages the advantages of both natural language and code generation models. More specifically, we study some findings from state of the art and use them to (1) initialize our model from powerful pre-trained models, (2) explore additional pretraining on our java dataset, (3) lead experiments combining the unimodal and bimodal data in training, and (4) scale the input and output length during the fine-tuning of the model. Conducted experiments on CONCODE dataset show that JaCoText achieves new state-of-the-art results.

Keywords: java code generation, natural language processing, sequence-to-sequence models, transformer neural networks

Procedia PDF Downloads 284
1184 The Reproducibility and Repeatability of Modified Likelihood Ratio for Forensics Handwriting Examination

Authors: O. Abiodun Adeyinka, B. Adeyemo Adesesan

Abstract:

The forensic use of handwriting depends on the analysis, comparison, and evaluation decisions made by forensic document examiners. When using biometric technology in forensic applications, it is necessary to compute Likelihood Ratio (LR) for quantifying strength of evidence under two competing hypotheses, namely the prosecution and the defense hypotheses wherein a set of assumptions and methods for a given data set will be made. It is therefore important to know how repeatable and reproducible our estimated LR is. This paper evaluated the accuracy and reproducibility of examiners' decisions. Confidence interval for the estimated LR were presented so as not get an incorrect estimate that will be used to deliver wrong judgment in the court of Law. The estimate of LR is fundamentally a Bayesian concept and we used two LR estimators, namely Logistic Regression (LoR) and Kernel Density Estimator (KDE) for this paper. The repeatability evaluation was carried out by retesting the initial experiment after an interval of six months to observe whether examiners would repeat their decisions for the estimated LR. The experimental results, which are based on handwriting dataset, show that LR has different confidence intervals which therefore implies that LR cannot be estimated with the same certainty everywhere. Though the LoR performed better than the KDE when tested using the same dataset, the two LR estimators investigated showed a consistent region in which LR value can be estimated confidently. These two findings advance our understanding of LR when used in computing the strength of evidence in handwriting using forensics.

Keywords: confidence interval, handwriting, kernel density estimator, KDE, logistic regression LoR, repeatability, reproducibility

Procedia PDF Downloads 124
1183 Content-Aware Image Augmentation for Medical Imaging Applications

Authors: Filip Rusak, Yulia Arzhaeva, Dadong Wang

Abstract:

Machine learning based Computer-Aided Diagnosis (CAD) is gaining much popularity in medical imaging and diagnostic radiology. However, it requires a large amount of high quality and labeled training image datasets. The training images may come from different sources and be acquired from different radiography machines produced by different manufacturers, digital or digitized copies of film radiographs, with various sizes as well as different pixel intensity distributions. In this paper, a content-aware image augmentation method is presented to deal with these variations. The results of the proposed method have been validated graphically by plotting the removed and added seams of pixels on original images. Two different chest X-ray (CXR) datasets are used in the experiments. The CXRs in the datasets defer in size, some are digital CXRs while the others are digitized from analog CXR films. With the proposed content-aware augmentation method, the Seam Carving algorithm is employed to resize CXRs and the corresponding labels in the form of image masks, followed by histogram matching used to normalize the pixel intensities of digital radiography, based on the pixel intensity values of digitized radiographs. We implemented the algorithms, resized the well-known Montgomery dataset, to the size of the most frequently used Japanese Society of Radiological Technology (JSRT) dataset and normalized our digital CXRs for testing. This work resulted in the unified off-the-shelf CXR dataset composed of radiographs included in both, Montgomery and JSRT datasets. The experimental results show that even though the amount of augmentation is large, our algorithm can preserve the important information in lung fields, local structures, and global visual effect adequately. The proposed method can be used to augment training and testing image data sets so that the trained machine learning model can be used to process CXRs from various sources, and it can be potentially used broadly in any medical imaging applications.

Keywords: computer-aided diagnosis, image augmentation, lung segmentation, medical imaging, seam carving

Procedia PDF Downloads 222
1182 A Comparative Asessment of Some Algorithms for Modeling and Forecasting Horizontal Displacement of Ialy Dam, Vietnam

Authors: Kien-Trinh Thi Bui, Cuong Manh Nguyen

Abstract:

In order to simulate and reproduce the operational characteristics of a dam visually, it is necessary to capture the displacement at different measurement points and analyze the observed movement data promptly to forecast the dam safety. The accuracy of forecasts is further improved by applying machine learning methods to data analysis progress. In this study, the horizontal displacement monitoring data of the Ialy hydroelectric dam was applied to machine learning algorithms: Gaussian processes, multi-layer perceptron neural networks, and the M5-rules algorithm for modelling and forecasting of horizontal displacement of the Ialy hydropower dam (Vietnam), respectively, for analysing. The database which used in this research was built by collecting time series of data from 2006 to 2021 and divided into two parts: training dataset and validating dataset. The final results show all three algorithms have high performance for both training and model validation, but the MLPs is the best model. The usability of them are further investigated by comparison with a benchmark models created by multi-linear regression. The result show the performance which obtained from all the GP model, the MLPs model and the M5-Rules model are much better, therefore these three models should be used to analyze and predict the horizontal displacement of the dam.

Keywords: Gaussian processes, horizontal displacement, hydropower dam, Ialy dam, M5-Rules, multi-layer perception neural networks

Procedia PDF Downloads 210
1181 Non-intrusive Hand Control of Drone Using an Inexpensive and Streamlined Convolutional Neural Network Approach

Authors: Evan Lowhorn, Rocio Alba-Flores

Abstract:

The purpose of this work is to develop a method for classifying hand signals and using the output in a drone control algorithm. To achieve this, methods based on Convolutional Neural Networks (CNN) were applied. CNN's are a subset of deep learning, which allows grid-like inputs to be processed and passed through a neural network to be trained for classification. This type of neural network allows for classification via imaging, which is less intrusive than previous methods using biosensors, such as EMG sensors. Classification CNN's operate purely from the pixel values in an image; therefore they can be used without additional exteroceptive sensors. A development bench was constructed using a desktop computer connected to a high-definition webcam mounted on a scissor arm. This allowed the camera to be pointed downwards at the desk to provide a constant solid background for the dataset and a clear detection area for the user. A MATLAB script was created to automate dataset image capture at the development bench and save the images to the desktop. This allowed the user to create their own dataset of 12,000 images within three hours. These images were evenly distributed among seven classes. The defined classes include forward, backward, left, right, idle, and land. The drone has a popular flip function which was also included as an additional class. To simplify control, the corresponding hand signals chosen were the numerical hand signs for one through five for movements, a fist for land, and the universal “ok” sign for the flip command. Transfer learning with PyTorch (Python) was performed using a pre-trained 18-layer residual learning network (ResNet-18) to retrain the network for custom classification. An algorithm was created to interpret the classification and send encoded messages to a Ryze Tello drone over its 2.4 GHz Wi-Fi connection. The drone’s movements were performed in half-meter distance increments at a constant speed. When combined with the drone control algorithm, the classification performed as desired with negligible latency when compared to the delay in the drone’s movement commands.

Keywords: classification, computer vision, convolutional neural networks, drone control

Procedia PDF Downloads 210
1180 Network and Sentiment Analysis of U.S. Congressional Tweets

Authors: Chaitanya Kanakamedala, Hansa Pradhan, Carter Gilbert

Abstract:

Social media platforms, such as Twitter, are excellent datasets for understanding human interactions and sentiments. This report explores social dynamics among US Congressional members through a network analysis applied to a dataset of tweets spanning 2008 to 2017 from the ’US Congressional Tweets Dataset’. In this report, we preform network analysis where connections between users (edges) are established based on a similarity threshold: two tweets are connected if the tweets they post are similar. By utilizing the Natural Language Toolkit (NLTK) and NetworkX, we quantified tweet similarity and constructed a graph comprising various interconnected components. Each component represents a cluster of users with closely aligned content. We then preform sentiment analysis on each cluster to explore the prevalent emotions and opinions within these groups. Our findings reveal that despite the initial expectation of distinct ideological divisions typically aligning with party lines, the analysis exposed a high degree of topical convergence across tweets from different political affiliations. The analysis preformed in this report not only highlights the potential of social media as a tool for political communication but also suggests a complex layer of interaction that transcends traditional partisan boundaries, reflecting a complicated landscape of politics in the digital age.

Keywords: natural language processing, sentiment analysis, centrality analysis, topic modeling

Procedia PDF Downloads 33
1179 Save Balance of Power: Can We?

Authors: Swati Arun

Abstract:

The present paper argues that Balance of Power (BOP) needs to conjugate with certain contingencies like geography. It is evident that sea powers (‘insular’ for better clarity) are not balanced (if at all) in the same way as land powers. Its apparent that artificial insularity that the US has achieved reduces the chances of balancing (constant) and helps it maintain preponderance (variable). But how precise is this approach in assessing the dynamics between China’s rise and reaction of other powers and US. The ‘evolved’ theory can be validated by putting China and US in the equation. Systemic Relation between the nations was explained through the Balance of Power theory much before the systems theory was propounded. The BOP is the crux of functionality of ‘power relation’ dynamics which has played its role in the most astounding ways leading to situations of war and peace. Whimsical; but true that, the BOP has remained a complicated and indefinable concepts since Hans. Morganthau to Kenneth Waltz. A challenge of the BOP, however remains; “ that it has too many meanings”. In the recent times it has become evident that the myriad of expectations generated by BOP has not met the practicality of the current world politics. It is for this reason; the BoP has been replaced by Preponderance Theory (PT) to explain prevailing power situation. PT does provide an empirical reasoning for the success of this theory but fails in a abstract logical reasoning required for making a theory universal. Unipolarity clarifies the current system as one where balance of power has become redundant. It seems to reach beyond the contours of BoP, where a superpower does what it must to remain one. The centrality of this arguments pivots around - an exception, every time BOP fails to operate, preponderance of power emerges. PT does not sit well with the primary logic of a theory because it works on an exception. The evolution of such a pattern and system where BOP fails and preponderance emerges is absent. The puzzle here is- if BOP really has become redundant or it needs polishing. The international power structure changed from multipolar to bipolar to unipolar. BOP was looked at to provide inevitable logic behind such changes and answer the dilemma we see today- why US is unchecked, unbalanced? But why was Britain unchecked in 19th century and why China was unbalanced in 13th century? It is the insularity of the state that makes BOP reproduce “imbalance of power”, going a level up from off-shore balancer. This luxury of a state to maintain imbalance in the region of competition or threat is the causal relation between BOP’s and geography. America has applied imbalancing- meaning disequilibrium (in its favor) to maintain the regional balance so that over time the weaker does not get stronger and pose a competition. It could do that due to the significant parity present between the US and the rest.

Keywords: balance of power, china, preponderance of power, US

Procedia PDF Downloads 278
1178 Classification of Land Cover Usage from Satellite Images Using Deep Learning Algorithms

Authors: Shaik Ayesha Fathima, Shaik Noor Jahan, Duvvada Rajeswara Rao

Abstract:

Earth's environment and its evolution can be seen through satellite images in near real-time. Through satellite imagery, remote sensing data provide crucial information that can be used for a variety of applications, including image fusion, change detection, land cover classification, agriculture, mining, disaster mitigation, and monitoring climate change. The objective of this project is to propose a method for classifying satellite images according to multiple predefined land cover classes. The proposed approach involves collecting data in image format. The data is then pre-processed using data pre-processing techniques. The processed data is fed into the proposed algorithm and the obtained result is analyzed. Some of the algorithms used in satellite imagery classification are U-Net, Random Forest, Deep Labv3, CNN, ANN, Resnet etc. In this project, we are using the DeepLabv3 (Atrous convolution) algorithm for land cover classification. The dataset used is the deep globe land cover classification dataset. DeepLabv3 is a semantic segmentation system that uses atrous convolution to capture multi-scale context by adopting multiple atrous rates in cascade or in parallel to determine the scale of segments.

Keywords: area calculation, atrous convolution, deep globe land cover classification, deepLabv3, land cover classification, resnet 50

Procedia PDF Downloads 139
1177 Time Series Forecasting (TSF) Using Various Deep Learning Models

Authors: Jimeng Shi, Mahek Jain, Giri Narasimhan

Abstract:

Time Series Forecasting (TSF) is used to predict the target variables at a future time point based on the learning from previous time points. To keep the problem tractable, learning methods use data from a fixed-length window in the past as an explicit input. In this paper, we study how the performance of predictive models changes as a function of different look-back window sizes and different amounts of time to predict the future. We also consider the performance of the recent attention-based Transformer models, which have had good success in the image processing and natural language processing domains. In all, we compare four different deep learning methods (RNN, LSTM, GRU, and Transformer) along with a baseline method. The dataset (hourly) we used is the Beijing Air Quality Dataset from the UCI website, which includes a multivariate time series of many factors measured on an hourly basis for a period of 5 years (2010-14). For each model, we also report on the relationship between the performance and the look-back window sizes and the number of predicted time points into the future. Our experiments suggest that Transformer models have the best performance with the lowest Mean Average Errors (MAE = 14.599, 23.273) and Root Mean Square Errors (RSME = 23.573, 38.131) for most of our single-step and multi-steps predictions. The best size for the look-back window to predict 1 hour into the future appears to be one day, while 2 or 4 days perform the best to predict 3 hours into the future.

Keywords: air quality prediction, deep learning algorithms, time series forecasting, look-back window

Procedia PDF Downloads 154
1176 Static Analysis of Security Issues of the Python Packages Ecosystem

Authors: Adam Gorine, Faten Spondon

Abstract:

Python is considered the most popular programming language and offers its own ecosystem for archiving and maintaining open-source software packages. This system is called the python package index (PyPI), the repository of this programming language. Unfortunately, one-third of these software packages have vulnerabilities that allow attackers to execute code automatically when a vulnerable or malicious package is installed. This paper contributes to large-scale empirical studies investigating security issues in the python ecosystem by evaluating package vulnerabilities. These provide a series of implications that can help the security of software ecosystems by improving the process of discovering, fixing, and managing package vulnerabilities. The vulnerable dataset is generated using the NVD, the national vulnerability database, and the Snyk vulnerability dataset. In addition, we evaluated 807 vulnerability reports in the NVD and 3900 publicly known security vulnerabilities in Python Package Manager (pip) from the Snyk database from 2002 to 2022. As a result, many Python vulnerabilities appear in high severity, followed by medium severity. The most problematic areas have been improper input validation and denial of service attacks. A hybrid scanning tool that combines the three scanners bandit, snyk and dlint, which provide a clear report of the code vulnerability, is also described.

Keywords: Python vulnerabilities, bandit, Snyk, Dlint, Python package index, ecosystem, static analysis, malicious attacks

Procedia PDF Downloads 139
1175 Hard Disk Failure Predictions in Supercomputing System Based on CNN-LSTM and Oversampling Technique

Authors: Yingkun Huang, Li Guo, Zekang Lan, Kai Tian

Abstract:

Hard disk drives (HDD) failure of the exascale supercomputing system may lead to service interruption and invalidate previous calculations, and it will cause permanent data loss. Therefore, initiating corrective actions before hard drive failures materialize is critical to the continued operation of jobs. In this paper, a highly accurate analysis model based on CNN-LSTM and oversampling technique was proposed, which can correctly predict the necessity of a disk replacement even ten days in advance. Generally, the learning-based method performs poorly on a training dataset with long-tail distribution, especially fault prediction is a very classic situation as the scarcity of failure data. To overcome the puzzle, a new oversampling was employed to augment the data, and then, an improved CNN-LSTM with the shortcut was built to learn more effective features. The shortcut transmits the results of the previous layer of CNN and is used as the input of the LSTM model after weighted fusion with the output of the next layer. Finally, a detailed, empirical comparison of 6 prediction methods is presented and discussed on a public dataset for evaluation. The experiments indicate that the proposed method predicts disk failure with 0.91 Precision, 0.91 Recall, 0.91 F-measure, and 0.90 MCC for 10 days prediction horizon. Thus, the proposed algorithm is an efficient algorithm for predicting HDD failure in supercomputing.

Keywords: HDD replacement, failure, CNN-LSTM, oversampling, prediction

Procedia PDF Downloads 79
1174 Classification of Potential Biomarkers in Breast Cancer Using Artificial Intelligence Algorithms and Anthropometric Datasets

Authors: Aref Aasi, Sahar Ebrahimi Bajgani, Erfan Aasi

Abstract:

Breast cancer (BC) continues to be the most frequent cancer in females and causes the highest number of cancer-related deaths in women worldwide. Inspired by recent advances in studying the relationship between different patient attributes and features and the disease, in this paper, we have tried to investigate the different classification methods for better diagnosis of BC in the early stages. In this regard, datasets from the University Hospital Centre of Coimbra were chosen, and different machine learning (ML)-based and neural network (NN) classifiers have been studied. For this purpose, we have selected favorable features among the nine provided attributes from the clinical dataset by using a random forest algorithm. This dataset consists of both healthy controls and BC patients, and it was noted that glucose, BMI, resistin, and age have the most importance, respectively. Moreover, we have analyzed these features with various ML-based classifier methods, including Decision Tree (DT), K-Nearest Neighbors (KNN), eXtreme Gradient Boosting (XGBoost), Logistic Regression (LR), Naive Bayes (NB), and Support Vector Machine (SVM) along with NN-based Multi-Layer Perceptron (MLP) classifier. The results revealed that among different techniques, the SVM and MLP classifiers have the most accuracy, with amounts of 96% and 92%, respectively. These results divulged that the adopted procedure could be used effectively for the classification of cancer cells, and also it encourages further experimental investigations with more collected data for other types of cancers.

Keywords: breast cancer, diagnosis, machine learning, biomarker classification, neural network

Procedia PDF Downloads 135
1173 Scope of Heavy Oil as a Fuel of the Future

Authors: Kiran P. Chadayamuri, Saransh Bagdi

Abstract:

Increasing imbalance between energy supply and demand has made nations and companies involved in the energy sector to boost up their research and find suitable solutions. With the high rates at which conventional oil and gas resources are depleting, efficient exploration and exploitation of heavy oil could just be the answer. Heavy oil may be defined as crude oil having API gravity value of less than 20⁰. They are highly viscous, have low hydrogen to carbon ratios and are known to produce high carbon residues. They have high contents of asphaltenes, heavy metals, sulphur and nitrogen in them. Due to these properties extraction, transportation and refining of crude oil have its share of challenges. Lack of suitable technology has hindered its production in the past, but now things are going in a more positive direction. The aim of this paper is to study the various advantages of heavy oil, associated limitations and its feasibility as a fuel of the future.

Keywords: energy, heavy oil, fuel, future

Procedia PDF Downloads 285
1172 D3Advert: Data-Driven Decision Making for Ad Personalization through Personality Analysis Using BiLSTM Network

Authors: Sandesh Achar

Abstract:

Personalized advertising holds greater potential for higher conversion rates compared to generic advertisements. However, its widespread application in the retail industry faces challenges due to complex implementation processes. These complexities impede the swift adoption of personalized advertisement on a large scale. Personalized advertisement, being a data-driven approach, necessitates consumer-related data, adding to its complexity. This paper introduces an innovative data-driven decision-making framework, D3Advert, which personalizes advertisements by analyzing personalities using a BiLSTM network. The framework utilizes the Myers–Briggs Type Indicator (MBTI) dataset for development. The employed BiLSTM network, specifically designed and optimized for D3Advert, classifies user personalities into one of the sixteen MBTI categories based on their social media posts. The classification accuracy is 86.42%, with precision, recall, and F1-Score values of 85.11%, 84.14%, and 83.89%, respectively. The D3Advert framework personalizes advertisements based on these personality classifications. Experimental implementation and performance analysis of D3Advert demonstrate a 40% improvement in impressions. D3Advert’s innovative and straightforward approach has the potential to transform personalized advertising and foster widespread personalized advertisement adoption in marketing.

Keywords: personalized advertisement, deep Learning, MBTI dataset, BiLSTM network, NLP.

Procedia PDF Downloads 44
1171 A Transformer-Based Approach for Multi-Human 3D Pose Estimation Using Color and Depth Images

Authors: Qiang Wang, Hongyang Yu

Abstract:

Multi-human 3D pose estimation is a challenging task in computer vision, which aims to recover the 3D joint locations of multiple people from multi-view images. In contrast to traditional methods, which typically only use color (RGB) images as input, our approach utilizes both color and depth (D) information contained in RGB-D images. We also employ a transformer-based model as the backbone of our approach, which is able to capture long-range dependencies and has been shown to perform well on various sequence modeling tasks. Our method is trained and tested on the Carnegie Mellon University (CMU) Panoptic dataset, which contains a diverse set of indoor and outdoor scenes with multiple people in varying poses and clothing. We evaluate the performance of our model on the standard 3D pose estimation metrics of mean per-joint position error (MPJPE). Our results show that the transformer-based approach outperforms traditional methods and achieves competitive results on the CMU Panoptic dataset. We also perform an ablation study to understand the impact of different design choices on the overall performance of the model. In summary, our work demonstrates the effectiveness of using a transformer-based approach with RGB-D images for multi-human 3D pose estimation and has potential applications in real-world scenarios such as human-computer interaction, robotics, and augmented reality.

Keywords: multi-human 3D pose estimation, RGB-D images, transformer, 3D joint locations

Procedia PDF Downloads 80
1170 Automated Digital Mammogram Segmentation Using Dispersed Region Growing and Pectoral Muscle Sliding Window Algorithm

Authors: Ayush Shrivastava, Arpit Chaudhary, Devang Kulshreshtha, Vibhav Prakash Singh, Rajeev Srivastava

Abstract:

Early diagnosis of breast cancer can improve the survival rate by detecting cancer at an early stage. Breast region segmentation is an essential step in the analysis of digital mammograms. Accurate image segmentation leads to better detection of cancer. It aims at separating out Region of Interest (ROI) from rest of the image. The procedure begins with removal of labels, annotations and tags from the mammographic image using morphological opening method. Pectoral Muscle Sliding Window Algorithm (PMSWA) is used for removal of pectoral muscle from mammograms which is necessary as the intensity values of pectoral muscles are similar to that of ROI which makes it difficult to separate out. After removing the pectoral muscle, Dispersed Region Growing Algorithm (DRGA) is used for segmentation of mammogram which disperses seeds in different regions instead of a single bright region. To demonstrate the validity of our segmentation method, 322 mammographic images from Mammographic Image Analysis Society (MIAS) database are used. The dataset contains medio-lateral oblique (MLO) view of mammograms. Experimental results on MIAS dataset show the effectiveness of our proposed method.

Keywords: CAD, dispersed region growing algorithm (DRGA), image segmentation, mammography, pectoral muscle sliding window algorithm (PMSWA)

Procedia PDF Downloads 312