Search results for: dataset analysis
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 27631

27271 GeneNet: Temporal Graph Data Visualization for Gene Nomenclature and Relationships

Authors: Jake Gonzalez, Tommy Dang

Abstract:

This paper proposes a temporal graph approach to visualize and analyze the evolution of gene relationships and nomenclature over time. An interactive web-based tool implements this temporal graph, enabling researchers to traverse a timeline and observe coupled dynamics in network topology and naming conventions. Analysis of a real human genomic dataset reveals the emergence of densely interconnected functional modules over time, representing groups of genes involved in key biological processes. For example, the antimicrobial peptide DEFA1A3 shows increased connections to related alpha-defensins involved in infection response. Tracking shifts in degree and betweenness centrality over timeline iterations also quantitatively highlights the reprioritization of certain genes’ topological importance as knowledge advances. Examination of the CNR1 gene encoding the cannabinoid receptor CB1 demonstrates changing synonymous relationships and consolidating naming patterns over time, reflecting the discovery of its unique functional role. The integrated framework interconnecting these topological and nomenclature dynamics provides richer contextual insights compared to isolated analysis methods. Overall, this temporal graph approach enables a more holistic study of knowledge evolution to elucidate complex biology.
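As a rough illustration of tracking a gene's topological importance across timeline snapshots, the sketch below computes normalized degree centrality per snapshot using only the standard library. The snapshot years, the placeholder gene names (G1 to G4), and the function names are hypothetical and not taken from the paper; betweenness centrality is omitted for brevity.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Normalized degree centrality for an undirected edge list."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    # Divide by the maximum possible degree (n - 1).
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


def centrality_timeline(snapshots, gene):
    """Map each snapshot year to the gene's centrality (0.0 if absent)."""
    return {
        year: degree_centrality(edges).get(gene, 0.0)
        for year, edges in sorted(snapshots.items())
    }


# Hypothetical snapshots: year -> list of gene-gene relationships.
snapshots = {
    2000: [("G1", "G2")],
    2010: [("G1", "G2"), ("G1", "G3"), ("G2", "G3"), ("G3", "G4")],
}

timeline = centrality_timeline(snapshots, "G3")
print(timeline)
```

A gene that is absent from an early snapshot and well connected in a later one shows up as a rising value in the returned timeline.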

Keywords: temporal graph, gene relationships, nomenclature evolution, interactive visualization, biological insights

Procedia PDF Downloads 38
27270 Investigating the Impacts of Climate Change on Soil Erosion: A Case Study of Kasilian Watershed, Northern Iran

Authors: Mohammad Zare, Mahbubeh Sheikh

Abstract:

Many of the impacts of climate change will materialize through changes in soil erosion, which have rarely been addressed in Iran. This paper presents an investigation of the impacts of climate change on soil erosion for the Kasilian basin. LARS-WG5 was used to downscale the IPCM4 and GFCM21 predictions of the A2 scenario for the projected periods of 1985-2030 and 2080-2099. The analysis was carried out using the dataset of the International Centre for Theoretical Physics (ICTP) of Trieste. Soil loss was modeled using the Revised Universal Soil Loss Equation (RUSLE). Results indicate that soil erosion increases or decreases depending on which climate scenario is considered. The potential for climate change to increase the soil loss rate in future periods was established, whereas considerable decreases in erosion are projected when land use is increased from baseline periods.
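For readers unfamiliar with RUSLE, the equation estimates average annual soil loss as the product of five factors, A = R * K * LS * C * P. The sketch below is a minimal illustration of that product; the factor values are hypothetical and are not results from the study.

```python
def rusle_soil_loss(r, k, ls, c, p):
    """Average annual soil loss A = R * K * LS * C * P.

    r  : rainfall erosivity factor
    k  : soil erodibility factor
    ls : slope length and steepness factor (dimensionless)
    c  : cover management factor (dimensionless)
    p  : support practice factor (dimensionless)
    """
    factors = {"r": r, "k": k, "ls": ls, "c": c, "p": p}
    for name, value in factors.items():
        if value < 0:
            raise ValueError(f"factor {name} must be non-negative")
    return r * k * ls * c * p


# Hypothetical factor values for a baseline and a wetter scenario.
baseline = rusle_soil_loss(r=100.0, k=0.3, ls=1.5, c=0.2, p=1.0)
wetter = rusle_soil_loss(r=120.0, k=0.3, ls=1.5, c=0.2, p=1.0)
print(baseline, wetter)
```

Because the model is multiplicative, a change in rainfall erosivity under a given climate scenario scales the estimated soil loss proportionally when the other factors are held fixed, and a change in the cover factor C has the same kind of effect for land-use changes.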

Keywords: Kasilian watershed, climatic change, soil erosion, LARS-WG5 Model, RUSLE

Procedia PDF Downloads 484
27269 In-Depth Analysis on Sequence Evolution and Molecular Interaction of Influenza Receptors (Hemagglutinin and Neuraminidase)

Authors: Dong Tran, Thanh Dac Van, Ly Le

Abstract:

Hemagglutinin (HA) and Neuraminidase (NA) play an important role in host immune evasion throughout the evolution of the influenza virus. The correlation between HA and NA evolution with respect to epitopic evolution and drug interaction has yet to be investigated. In this study, by combining sequence-to-structure evolution with statistical analysis of epitopic/binding site specificity, we identified potential therapeutic features of HA and NA: a specific antibody binding site on HA and a specific binding distribution of current inhibitors within the NA active site. Our approach introduces the use of sequence variation and molecular interaction to provide an effective strategy for establishing experiment-based distributed representations of protein-protein/ligand complexes. The most important advantage of our method is that it does not require a complete dataset of complexes but rather directly infers feature interaction from sequence variation and molecular interaction. Using correlated sequence analysis, we additionally identified co-evolved mutations associated with maintaining HA/NA structural and functional variability toward immunity and therapeutic treatment. Our investigation of HA binding specificity revealed that a unique conserved stalk domain interacts with a unique loop domain of universal antibodies (CR9114, CT149, CR8043, CR8020, F16v3, CR6261, F10). On the other hand, NA inhibitors (Oseltamivir, Zanamivir, Laninamivir) showed specific conserved residue contributions similar to those of the NA substrate (sialic acid), which can be exploited for drug design. Our study provides important insight into the rational design and identification of novel therapeutics targeting universally recognized features of influenza HA/NA.

Keywords: influenza virus, hemagglutinin (HA), neuraminidase (NA), sequence evolution

Procedia PDF Downloads 136
27268 Hsa-miR-192-5p, and Hsa-miR-129-5p Prominent Biomarkers in Regulation Glioblastoma Cancer Stem Cells Genes Microenvironment

Authors: Rasha Ahmadi

Abstract:

Glioblastoma is one of the most frequent brain malignancies, with a high mortality rate and limited survival. Despite different treatments and surgery, glioblastoma cancer stem cells may recur as a subsequent tumor. For this reason, it is crucial to research the markers associated with glioblastoma stem cells and specifically their microenvironment. In this study, using bioinformatics analysis, we analyzed and nominated genes in the microenvironment pathways of glioblastoma stem cells. An appropriate dataset was selected for analysis from the GEO database. This dataset comprised gene expression patterns in stem cells derived from glioblastoma patients. Genes were clustered into high- and low-expression groups. Enrichment databases such as Enrichr, STRING, and GEPIA were utilized to analyze the data. Finally, we extracted 2700 high-expression and 1100 low-expression genes potentially implicated in the metabolic pathways of glioblastoma cancer progression. Cellular senescence, MAPK, TNF, hypoxia, zymosterol biosynthesis, and phosphatidylinositol metabolism pathways were substantially expressed and the metabolic pathways were downregulated. After assessing the association between protein networks, the MSMP, SOX2, FGD4, and CNTNAP3 genes with high expression and the DMKN and SBSN genes with low expression were selected. All of these genes were observed in the survival curve, with a survival of fewer than 10 percent over around 15 months. hsa-mir-192-5p, hsa-mir-129-5p, hsa-mir-215-5p, hsa-mir-335-5p, and hsa-mir-340-5p played key roles in glioblastoma cancer stem cell microenvironments. Through integrated bioinformatics analysis of gene expression profile data, we identified critical genes that can play an important role in targeting genes involved in the energy and microenvironment of glioblastoma cancer stem cells.
This study indicated that hsa-mir-192-5p and hsa-mir-129-5p are appropriate candidates for this purpose.

Keywords: Glioblastoma, Cancer Stem Cells, Biomarker Discovery, Gene Expression Profiles, Bioinformatics Analysis, Tumor Microenvironment

Procedia PDF Downloads 114
27267 Establishing a Computational Screening Framework to Identify Environmental Exposures Using Untargeted Gas-Chromatography High-Resolution Mass Spectrometry

Authors: Juni C. Kim, Anna R. Robuck, Douglas I. Walker

Abstract:

The human exposome, which includes chemical exposures over the lifetime and their effects, is now recognized as an important measure for understanding human health; however, the complexity of the data makes the identification of environmental chemicals challenging. The goal of our project was to establish a computational workflow for the improved identification of environmental pollutants containing chlorine or bromine. Using the “pattern.search” function available in the R package NonTarget, we wrote a multifunctional script that searches mass spectral clusters from untargeted gas-chromatography high-resolution mass spectrometry (GC-HRMS) for the presence of spectra consistent with chlorine- and bromine-containing organic compounds. The “pattern.search” function was incorporated into a different function that allows the evaluation of clusters containing multiple analyte fragments, has multi-core support, and provides a simplified output listing compounds containing chlorine and/or bromine. The new function was able to process 46,000 spectral clusters in under 8 seconds and identified over 150 potential halogenated spectra. We next applied our function to a deidentified dataset from patients diagnosed with primary biliary cholangitis (PBC), primary sclerosing cholangitis (PSC), and healthy controls. Twenty-two spectra corresponded to potential halogenated compounds in the PSC and PBC dataset, including six significantly different in PBC patients, while four differed in PSC patients. We have developed an improved algorithm for detecting halogenated compounds in GC-HRMS data, providing a strategy for prioritizing exposures in the study of human disease.
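The general idea behind screening spectra for chlorine or bromine can be sketched independently of the authors' R workflow: a single Cl or Br atom produces an isotope peak roughly 1.997 Da above the lighter peak, with an intensity ratio near 0.32 for chlorine and near 0.97 for bromine. The Python sketch below is an illustrative assumption, not the authors' script and not the R package's API; the function name, tolerances, and example peaks are hypothetical, and it only handles the single-halogen case.

```python
# Approximate mass gap (Da) between 35Cl/37Cl and between 79Br/81Br.
M2_SPACING = 1.997
# Approximate M+2 / M intensity ratio for one halogen atom.
EXPECTED_RATIO = {"Cl": 0.32, "Br": 0.97}


def find_halogen_pairs(peaks, mz_tol=0.005, ratio_tol=0.2):
    """Return (mz, mz_plus_2, element) for peak pairs matching a pattern.

    peaks     : list of (mz, intensity) tuples from one spectral cluster
    mz_tol    : absolute tolerance on the mass gap, in Da
    ratio_tol : relative tolerance on the intensity ratio
    """
    hits = []
    ordered = sorted(peaks)
    for i, (mz, intensity) in enumerate(ordered):
        if intensity <= 0:
            continue
        for mz2, intensity2 in ordered[i + 1:]:
            gap = mz2 - mz
            if gap > M2_SPACING + mz_tol:
                break  # later peaks are even further away
            if abs(gap - M2_SPACING) > mz_tol:
                continue
            ratio = intensity2 / intensity
            for element, expected in EXPECTED_RATIO.items():
                if abs(ratio - expected) <= ratio_tol * expected:
                    hits.append((mz, mz2, element))
    return hits


# Hypothetical cluster: one Cl-like pair, one Br-like pair, one lone peak.
cluster = [
    (100.000, 1000.0), (101.997, 320.0),
    (150.000, 500.0), (151.997, 490.0),
    (200.000, 800.0),
]
matches = find_halogen_pairs(cluster)
print(matches)
```

Compounds with several halogen atoms show more complex patterns (for example M+4 peaks), which a real screening tool would need to model.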

Keywords: exposome, metabolome, computational metabolomics, high-resolution mass spectrometry, exposure, pollutants

Procedia PDF Downloads 110
27266 Developing a Machine Learning-based Cost Prediction Model for Construction Projects using Particle Swarm Optimization

Authors: Soheila Sadeghi

Abstract:

Accurate cost prediction is essential for effective project management and decision-making in the construction industry. This study aims to develop a cost prediction model for construction projects using Machine Learning techniques and Particle Swarm Optimization (PSO). The research utilizes a comprehensive dataset containing project cost estimates, actual costs, resource details, and project performance metrics from a road reconstruction project. The methodology involves data preprocessing, feature selection, and the development of an Artificial Neural Network (ANN) model optimized using PSO. The study investigates the impact of various input features, including cost estimates, resource allocation, and project progress, on the accuracy of cost predictions. The performance of the optimized ANN model is evaluated using metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared. The results demonstrate the effectiveness of the proposed approach in predicting project costs, outperforming traditional benchmark models. The feature selection process identifies the most influential variables contributing to cost variations, providing valuable insights for project managers. However, this study has several limitations. Firstly, the model's performance may be influenced by the quality and quantity of the dataset used. A larger and more diverse dataset covering different types of construction projects would enhance the model's generalizability. Secondly, the study focuses on a specific optimization technique (PSO) and a single Machine Learning algorithm (ANN). Exploring other optimization methods and comparing the performance of various ML algorithms could provide a more comprehensive understanding of the cost prediction problem. Future research should focus on several key areas. 
Firstly, expanding the dataset to include a wider range of construction projects, such as residential buildings, commercial complexes, and infrastructure projects, would improve the model's applicability. Secondly, investigating the integration of additional data sources, such as economic indicators, weather data, and supplier information, could enhance the predictive power of the model. Thirdly, exploring the potential of ensemble learning techniques, which combine multiple ML algorithms, may further improve cost prediction accuracy. Additionally, developing user-friendly interfaces and tools to facilitate the adoption of the proposed cost prediction model in real-world construction projects would be a valuable contribution to the industry. The findings of this study have significant implications for construction project management, enabling proactive cost estimation, resource allocation, budget planning, and risk assessment, ultimately leading to improved project performance and cost control. This research contributes to the advancement of cost prediction techniques in the construction industry and highlights the potential of Machine Learning and PSO in addressing this critical challenge. However, further research is needed to address the limitations and explore the identified future research directions to fully realize the potential of ML-based cost prediction models in the construction domain.
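The evaluation metrics named in the abstract (MSE, RMSE, MAE, and R-squared) can be computed as in the minimal sketch below. The cost figures are hypothetical and only illustrate the arithmetic; they are not data from the study.

```python
import math


def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MAE and R-squared for paired observations."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        # R-squared is undefined when the actual values have no variance.
        "r2": 1 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
    }


# Hypothetical actual vs. predicted project costs.
metrics = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
print(metrics)
```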

Keywords: cost prediction, construction projects, machine learning, artificial neural networks, particle swarm optimization, project management, feature selection, road reconstruction

Procedia PDF Downloads 18
27265 An Adaptive Oversampling Technique for Imbalanced Datasets

Authors: Shaukat Ali Shahee, Usha Ananthakumar

Abstract:

A data set exhibits the class imbalance problem when one class has very few examples compared to the other class; this is also referred to as between-class imbalance. Traditional classifiers fail to classify the minority class examples correctly due to their bias towards the majority class. Apart from between-class imbalance, imbalance within classes, where classes are composed of different numbers of sub-clusters and these sub-clusters contain different numbers of examples, also deteriorates the performance of the classifier. Previously, many methods have been proposed for handling the imbalanced dataset problem. These methods can be classified into four categories: data preprocessing, algorithmic, cost-based, and ensembles of classifiers. Data preprocessing techniques have shown great potential as they attempt to improve the data distribution rather than the classifier. Data preprocessing techniques handle class imbalance either by increasing the minority class examples or by decreasing the majority class examples. Decreasing the majority class examples leads to loss of information, and when the minority class has an absolute rarity, removing majority class examples is generally not recommended. Existing methods for handling class imbalance do not address both between-class imbalance and within-class imbalance simultaneously. In this paper, we propose a method that handles between-class imbalance and within-class imbalance simultaneously for binary classification problems. Removing between-class imbalance and within-class imbalance simultaneously eliminates the bias of the classifier towards bigger sub-clusters by minimizing the error domination of bigger sub-clusters in the total error. The proposed method uses model-based clustering to find the presence of sub-clusters or sub-concepts in the dataset. The number of examples oversampled among the sub-clusters is determined based on the complexity of the sub-clusters.
The method also takes into consideration the scatter of the data in the feature space and adaptively copes with unseen test data using the Lowner-John ellipsoid to increase the accuracy of the classifier. In this study, a neural network is used, as it is a classifier in which the total error is minimized, and removing between-class imbalance and within-class imbalance simultaneously helps the classifier give equal weight to all the sub-clusters irrespective of the classes. The proposed method is validated on 9 publicly available data sets and compared with three existing oversampling techniques that rely on the spatial location of minority class examples in the Euclidean feature space. The experimental results show the proposed method to be statistically significantly superior to other methods in terms of various accuracy measures. Thus, the proposed method can serve as a good alternative for handling various problem domains like credit scoring, customer churn prediction, financial distress, etc., that typically involve imbalanced data sets.

Keywords: classification, imbalanced dataset, Lowner-John ellipsoid, model based clustering, oversampling

Procedia PDF Downloads 393
27264 Addressing the Exorbitant Cost of Labeling Medical Images with Active Learning

Authors: Saba Rahimi, Ozan Oktay, Javier Alvarez-Valle, Sujeeth Bharadwaj

Abstract:

Successful application of deep learning in medical image analysis necessitates unprecedented amounts of labeled training data. Unlike conventional 2D applications, radiological images can be three-dimensional (e.g., CT, MRI), consisting of many instances within each image. The problem is exacerbated when expert annotations are required for effective pixel-wise labeling, which incurs exorbitant labeling effort and cost. Active learning is an established research domain that aims to reduce labeling workload by prioritizing a subset of informative unlabeled examples to annotate. Our contribution is a cost-effective approach for U-Net 3D models that uses Monte Carlo sampling to analyze pixel-wise uncertainty. Experiments on the AAPM 2017 lung CT segmentation challenge dataset show that our proposed framework can achieve promising segmentation results by using only 42% of the training data.
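One common way to turn Monte Carlo samples into a pixel-wise uncertainty score is to average the predicted foreground probability over several stochastic forward passes and take the entropy of that average. The abstract does not specify the sampling procedure or the acquisition rule, so the sketch below is an assumption for illustration only; the function names, scan identifiers, and probabilities are hypothetical.

```python
import math


def pixel_entropy(prob_samples):
    """Binary entropy of the mean foreground probability for one pixel."""
    p = sum(prob_samples) / len(prob_samples)
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


def image_uncertainty(mc_probs):
    """Mean pixel entropy; mc_probs is a list of T flattened passes."""
    n_pixels = len(mc_probs[0])
    per_pixel = [
        pixel_entropy([single_pass[i] for single_pass in mc_probs])
        for i in range(n_pixels)
    ]
    return sum(per_pixel) / n_pixels


def select_for_annotation(pool, budget):
    """Pick the `budget` most uncertain unlabeled images from the pool."""
    ranked = sorted(pool, key=lambda key: image_uncertainty(pool[key]),
                    reverse=True)
    return ranked[:budget]


# Hypothetical pool: two images, two stochastic passes, two pixels each.
pool = {
    "scan_a": [[0.99, 0.01], [0.98, 0.02]],  # confident predictions
    "scan_b": [[0.60, 0.40], [0.40, 0.60]],  # passes disagree
}
chosen = select_for_annotation(pool, budget=1)
print(chosen)
```

Images whose predictions fluctuate between passes receive higher scores and are sent to annotators first, which is how an active learning loop reduces the amount of labeled data needed.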

Keywords: image segmentation, active learning, convolutional neural network, 3D U-Net

Procedia PDF Downloads 126
27263 A Novel Heuristic for Analysis of Large Datasets by Selecting Wrapper-Based Features

Authors: Bushra Zafar, Usman Qamar

Abstract:

Large sample sizes and high dimensionality reduce the effectiveness of conventional data mining methodologies. Data mining techniques are important tools for extracting useful information from a variety of databases, and they provide supervised learning in the form of classification to design models that describe important data classes, with the structure of the classifier based on the class attribute. Classification efficiency and accuracy are often influenced to a great extent by noisy and undesirable features in real application data sets. The inherent nature of a data set greatly masks its quality analysis and leaves us with few practical approaches. To our knowledge, we present for the first time a new approach for investigating the structure and quality of datasets by providing a targeted analysis of the localization of noisy and irrelevant features. Machine learning relies heavily on feature selection as a pre-processing step, which allows us to select a small subset of the available features by reducing the feature space according to a certain evaluation criterion. The primary objective of this study is to trim down the scope of the given data sample by searching for a small set of important features which may result in good classification performance. For this purpose, a heuristic for wrapper-based feature selection using a genetic algorithm is used, with an external classifier for discriminative feature selection. A feature is selected based on its number of occurrences in the chosen chromosomes. A sample dataset has been used to demonstrate the proposed idea. The proposed method improved the average accuracy across different datasets to about 95%. Experimental results illustrate that the proposed algorithm increases the accuracy of prediction of different diseases.
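The final selection step described above, choosing features by how often they occur in the chosen chromosomes, can be sketched as follows. This is a minimal illustration that assumes each chromosome is a binary mask over feature indices; the genetic algorithm and the external classifier that produce the chromosomes are not shown, and the function name and threshold are hypothetical.

```python
from collections import Counter


def features_by_occurrence(chromosomes, min_count):
    """Return indices of features set in at least `min_count` chromosomes.

    chromosomes : list of equal-length 0/1 lists (1 = feature selected)
    """
    counts = Counter()
    for chromosome in chromosomes:
        for index, bit in enumerate(chromosome):
            if bit:
                counts[index] += 1
    return sorted(index for index, count in counts.items()
                  if count >= min_count)


# Hypothetical best chromosomes from three runs over four features.
chosen_chromosomes = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
]
selected = features_by_occurrence(chosen_chromosomes, min_count=2)
print(selected)
```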

Keywords: data mining, genetic algorithm, KNN algorithms, wrapper based feature selection

Procedia PDF Downloads 299
27262 [Keynote Speech]: Feature Selection and Predictive Modeling of Housing Data Using Random Forest

Authors: Bharatendra Rai

Abstract:

Predictive data analysis and modeling involving machine learning techniques become challenging in the presence of too many explanatory variables or features. Too many features are known not only to slow algorithms down but also to decrease model prediction accuracy. This study involves a housing dataset with 79 quantitative and qualitative features that describe various aspects people consider while buying a new house. The Boruta algorithm, which supports feature selection using a wrapper approach built around random forest, is used in this study. This feature selection process leads to 49 confirmed features, which are then used for developing predictive random forest models. The study also explores five different data partitioning ratios, and their impact on model accuracy is captured using the coefficient of determination (R-squared) and root mean square error (RMSE).

Keywords: housing data, feature selection, random forest, Boruta algorithm, root mean square error

Procedia PDF Downloads 296
27261 Predictive Modelling of Aircraft Component Replacement Using Imbalanced Learning and Ensemble Method

Authors: Dangut Maren David, Skaf Zakwan

Abstract:

Adequate monitoring of vehicle components in order to obtain high uptime is the goal of predictive maintenance. The major challenge faced by businesses in industry is the significant cost associated with a delay in service delivery due to system downtime. Most of those businesses are interested in predicting such problems and proactively preventing them before they occur, which is the core advantage of Prognostic Health Management (PHM) applications. The recent emergence of Industry 4.0, or the industrial internet of things (IIoT), has led to the need for monitoring system activities and enhancing system-to-system or component-to-component interactions, which has resulted in the generation of large volumes of data known as big data. Analysis of big data is increasingly important; however, due to complexity inherent in the dataset, such as imbalanced classification problems, it becomes extremely difficult to build a model with high precision. Data-driven predictive modeling for condition-based maintenance (CBM) has recently drawn research interest, with growing attention from both academia and industry. The large data generated from industrial processes inherently comes with different degrees of complexity, which poses a challenge for analytics. Thus, the imbalanced classification problem exists pervasively in industrial datasets, which can affect the performance of learning algorithms, yielding poor classifier accuracy in model development. Misclassification of faults can result in unplanned breakdowns, leading to economic loss.
In this paper, an advanced approach for handling the imbalanced classification problem is proposed, and a prognostic model for predicting aircraft component replacement in advance is then developed by exploring aircraft historical data. The approach is based on a hybrid ensemble-based method which improves the prediction of the minority class during learning. We also investigate the impact of our approach on the multiclass imbalance problem. We validate the feasibility and effectiveness of our approach using real-world aircraft operation and maintenance datasets, which span over 7 years. Our approach shows better performance compared to other similar approaches. We also validate the strength of our approach for handling multiclass imbalanced datasets; our results also show good performance compared to other base classifiers.

Keywords: prognostics, data-driven, imbalance classification, deep learning

Procedia PDF Downloads 151
27260 High Fidelity Interactive Video Segmentation Using Tensor Decomposition, Boundary Loss, Convolutional Tessellations, and Context-Aware Skip Connections

Authors: Anthony D. Rhodes, Manan Goel

Abstract:

We provide a high fidelity deep learning algorithm (HyperSeg) for interactive video segmentation tasks using a dense convolutional network with context-aware skip connections and compressed, 'hypercolumn' image features combined with a convolutional tessellation procedure. In order to maintain high output fidelity, our model crucially processes and renders all image features in high resolution, without utilizing downsampling or pooling procedures. We maintain this consistent, high grade fidelity efficiently in our model chiefly through two means: (1) we use a statistically-principled, tensor decomposition procedure to modulate the number of hypercolumn features and (2) we render these features in their native resolution using a convolutional tessellation technique. For improved pixel-level segmentation results, we introduce a boundary loss function; for improved temporal coherence in video data, we include temporal image information in our model. Through experiments, we demonstrate the improved accuracy of our model against baseline models for interactive segmentation tasks using high resolution video data. We also introduce a benchmark video segmentation dataset, the VFX Segmentation Dataset, which contains over 27,046 high resolution video frames, including green screen and various composited scenes with corresponding, hand-crafted, pixel-level segmentations. Our work improves state-of-the-art segmentation fidelity with high resolution data and can be used across a broad range of application domains, including VFX pipelines and medical imaging disciplines.

Keywords: computer vision, object segmentation, interactive segmentation, model compression

Procedia PDF Downloads 100
27259 Multivariate Analysis on Water Quality Attributes Using Master-Slave Neural Network Model

Authors: A. Clementking, C. Jothi Venkateswaran

Abstract:

Mathematical and computational functionalities such as descriptive mining, optimization, and prediction are adopted to address natural resource planning. Optimization techniques are adopted for water quality prediction and for determining the influence of its attributes. Water properties are tainted when one water resource is merged with another. This work aimed to predict the connectivity of water resource distribution in accordance with water quality and sediment using a proposed master-slave back-propagation neural network model. The experimental results were obtained by collecting water quality attributes, computing the water quality index, designing and developing a neural network model to determine water quality and sediment, and applying the master-slave back-propagation neural network model to determine variations in water quality and sediment attributes between the water resources and to recommend connectivity. Homogeneous and parallel biochemical reactions influence water quality and sediment when water is distributed from one location to another. Therefore, an innovative master-slave neural network model [M (9:9:2)::S(9:9:2)] was designed and developed to predict the attribute variations. The result of the training dataset is given as input to the master model, and its maximum weights are assigned as input to the slave model to predict the water quality. The developed master-slave model predicted physicochemical attribute weight variations for 85% to 90% of water quality as target values. Sediment level variations were also predicted, from 0.01 to 0.05% of each water quality percentage. The model produced significant variations in physicochemical attribute weights. Based on the predicted experimental weight variations on the training data set, effective recommendations are made to connect different resources.

Keywords: master-slave back propagation neural network model(MSBPNNM), water quality analysis, multivariate analysis, environmental mining

Procedia PDF Downloads 449
27258 Feature Evaluation Based on Random Subspace and Multiple-K Ensemble

Authors: Jaehong Yu, Seoung Bum Kim

Abstract:

Clustering analysis can facilitate the extraction of intrinsic patterns in a dataset and reveal its natural groupings without requiring class information. For effective clustering analysis in high dimensional datasets, unsupervised dimensionality reduction is an important task. Unsupervised dimensionality reduction can generally be achieved by feature extraction or feature selection. In many situations, feature selection methods are more appropriate than feature extraction methods because of their clear interpretation with respect to the original features. Unsupervised feature selection can be categorized into feature subset selection and feature ranking methods, and we focused on unsupervised feature ranking methods, which evaluate the features based on their importance scores. Recently, several unsupervised feature ranking methods were developed based on ensemble approaches to achieve higher accuracy and stability. However, most of the ensemble-based feature ranking methods require the true number of clusters. Furthermore, these algorithms evaluate the feature importance depending on the ensemble clustering solution, and they produce undesirable evaluation results if the clustering solutions are inaccurate. To address these limitations, we proposed an ensemble-based feature ranking method with random subspace and multiple-k ensemble (FRRM). The proposed FRRM algorithm evaluates the importance of each feature with the random subspace ensemble, and all evaluation results are combined with the ensemble importance scores. Moreover, through the use of the multiple-k ensemble idea, FRRM does not require the true number of clusters to be determined in advance. Experiments on various benchmark datasets were conducted to examine the properties of the proposed FRRM algorithm and to compare its performance with that of existing feature ranking methods. The experimental results demonstrated that the proposed FRRM outperformed the competitors.

Keywords: clustering analysis, multiple-k ensemble, random subspace-based feature evaluation, unsupervised feature ranking

Procedia PDF Downloads 310
27257 Hyper Tuned RBF SVM: Approach for the Prediction of the Breast Cancer

Authors: Surita Maini, Sanjay Dhanka

Abstract:

Machine learning (ML) involves developing algorithms and statistical models that enable computers to learn and make predictions or decisions based on data without being explicitly programmed. Because of its broad capabilities, ML is gaining popularity in medical applications such as medical imaging, electronic health records, genomic data analysis, wearable devices, disease outbreak prediction, and disease diagnosis. In the last few decades, many researchers have tried to diagnose Breast Cancer (BC) using ML, because early detection of any disease can save millions of lives. Working in this direction, the authors have proposed a hybrid ML technique, RBF SVM, to predict BC at an earlier stage. The proposed method is implemented on the Breast Cancer UCI ML dataset with 569 instances and 32 attributes. The authors recorded the following performance metrics for the proposed model: Accuracy 98.24%, Sensitivity 98.67%, Specificity 97.43%, F1 Score 98.67%, Precision 98.67%, and run time 0.044769 seconds. The proposed method is validated by K-Fold cross-validation.
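The reported metrics all derive from the four cells of a binary confusion matrix. The sketch below shows the standard definitions; the counts are hypothetical and are not the paper's confusion matrix.

```python
def _ratio(numerator, denominator):
    """Safe division that returns NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = _ratio(tp, tp + fp)
    sensitivity = _ratio(tp, tp + fn)  # also called recall
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": _ratio(tn, tn + fp),
        "precision": precision,
        # F1 is the harmonic mean of precision and sensitivity.
        "f1": _ratio(2 * precision * sensitivity, precision + sensitivity),
    }


# Hypothetical counts for illustration only.
scores = classification_metrics(tp=90, fp=5, tn=95, fn=10)
print(scores)
```

When precision and sensitivity are equal, as in the figures reported above, F1 takes the same value, since it is their harmonic mean.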

Keywords: breast cancer, support vector classifier, machine learning, hyperparameter tuning

Procedia PDF Downloads 50
27256 Predicting Football Player Performance: Integrating Data Visualization and Machine Learning

Authors: Saahith M. S., Sivakami R.

Abstract:

In the realm of football analytics, particularly focusing on predicting football player performance, the ability to forecast player success accurately is of paramount importance for teams, managers, and fans. This study introduces an elaborate examination of predicting football player performance through the integration of data visualization methods and machine learning algorithms. The research entails the compilation of an extensive dataset comprising player attributes, conducting data preprocessing, feature selection, model selection, and model training to construct predictive models. The analysis within this study will involve delving into feature significance using methodologies like Select Best and Recursive Feature Elimination (RFE) to pinpoint pertinent attributes for predicting player performance. Various machine learning algorithms, including Random Forest, Decision Tree, Linear Regression, Support Vector Regression (SVR), and Artificial Neural Networks (ANN), will be explored to develop predictive models. The evaluation of each model's performance utilizing metrics such as Mean Squared Error (MSE) and R-squared will be executed to gauge their efficacy in predicting player performance. Furthermore, this investigation will encompass a top player analysis to recognize the top-performing players based on the anticipated overall performance scores. Nationality analysis will entail scrutinizing the player distribution based on nationality and investigating potential correlations between nationality and player performance. Positional analysis will concentrate on examining the player distribution across various positions and assessing the average performance of players in each position. Age analysis will evaluate the influence of age on player performance and identify any discernible trends or patterns associated with player age groups. 
The primary objective is to predict a football player's overall performance accurately based on their individual attributes, leveraging data-driven insights to enrich the comprehension of player success on the field. By amalgamating data visualization and machine learning methodologies, the aim is to furnish valuable tools for teams, managers, and fans to effectively analyze and forecast player performance. This research contributes to the progression of sports analytics by showcasing the potential of machine learning in predicting football player performance and offering actionable insights for diverse stakeholders in the football industry.
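The two evaluation metrics named in the abstract, Mean Squared Error and R-squared, can be illustrated with a minimal sketch. The player ratings below are made-up numbers for illustration only and do not come from the study.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Share of the variance in y_true that the predictions account for."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical overall ratings for three players (not data from the paper).
actual = [70.0, 80.0, 90.0]
predicted = [72.0, 78.0, 91.0]

mse = mean_squared_error(actual, predicted)
r2 = r_squared(actual, predicted)
print(f"MSE = {mse:.3f}, R-squared = {r2:.3f}")
```

A lower MSE and an R-squared closer to 1 indicate a better fit, which is how the candidate regressors would be ranked against each other.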

Keywords: football analytics, player performance prediction, data visualization, machine learning algorithms, random forest, decision tree, linear regression, support vector regression, artificial neural networks, model evaluation, top player analysis, nationality analysis, positional analysis

Procedia PDF Downloads 18
27255 The Importance of including All Data in a Linear Model for the Analysis of RNAseq Data

Authors: Roxane A. Legaie, Kjiana E. Schwab, Caroline E. Gargett

Abstract:

Studies looking at the changes in gene expression from RNAseq data often make use of linear models. It is also common practice to focus on a subset of data for a comparison of interest, leaving aside the samples not involved in this particular comparison. This work shows the importance of including all observations in the modeling process to better estimate variance parameters, even when the samples included are not directly used in the comparison under test. The human endometrium is a dynamic tissue, which undergoes cycles of growth and regression with each menstrual cycle. The mesenchymal stem cells (MSCs) present in the endometrium are likely responsible for this remarkable regenerative capacity. However, recent studies suggest that MSCs also play a role in the pathogenesis of endometriosis, one of the most common medical conditions affecting the lower abdomen in women, in which the endometrial tissue grows outside the womb. In this study, we used RNAseq to compare gene expression profiles between MSCs and non-stem cell counterparts (‘non-MSC’) obtained from women with (‘E’) or without (‘noE’) endometriosis. Raw read counts were used for differential expression analysis using a linear model with the limma-voom R package, including either all samples in the study or only the samples belonging to the subset of interest (e.g. for the comparison ‘E vs noE in MSC cells’, including only MSC samples from E and noE patients but not the non-MSC ones). Using the full dataset, we identified about 100 differentially expressed (DE) genes between E and noE samples in MSC samples (adj.p-val < 0.05 and |logFC| > 1), while only 9 DE genes were identified when using only the subset of data (MSC samples only). Important genes known to be involved in endometriosis, such as KLF9 and RND3, were missed in the latter case.
When looking at the MSC vs non-MSC cells comparison, the linear model including all samples identified 260 genes for noE samples (including the stem cell marker SUSD2) while the subset analysis did not identify any DE genes. When looking at E samples, 12 genes were identified with the first approach and only 1 with the subset approach. Although the stem cell marker RGS5 was found in both cases, the subset test missed important genes involved in stem cell differentiation such as NOTCH3 and other potentially related genes to be used for further investigation and pathway analysis.

Keywords: differential expression, endometriosis, linear model, RNAseq

Procedia PDF Downloads 409
27254 Automatic Lexicon Generation for Domain Specific Dataset for Mining Public Opinion on China Pakistan Economic Corridor

Authors: Tayyaba Azim, Bibi Amina

Abstract:

The growing popularity of opinion mining, together with the rapid growth of social networks, has created many research opportunities in the various domains of Sentiment Analysis and Natural Language Processing (NLP) using Artificial Intelligence approaches. The latest trend allows the public to actively use the internet for analyzing an individual’s opinion and exploring the effectiveness of published facts. The main theme of this research is to gauge public opinion on one of the most crucial and extensively discussed development projects, the China Pakistan Economic Corridor (CPEC), considered a game changer due to its promise of bringing economic prosperity to the region. So far, to the best of our knowledge, the theme of CPEC has not been analyzed for sentiment determination through a machine learning (ML) approach. This research aims to demonstrate the use of ML approaches to automatically analyze public sentiment in tweets about CPEC. A Support Vector Machine (SVM) is used to classify tweets into positive, negative and neutral classes. Word2vec and TF-IDF features are used with the SVM model, and the trained model is compared on manually labelled tweets and on the automatically generated lexicon. The contributions of this work are: the development of a sentiment analysis system for public tweets on CPEC, the automatic generation of a lexicon from public tweets on CPEC, and the identification of different themes among tweets, with sentiments assigned to each theme. It is worth noting that applications of web mining that empower e-democracy by improving political transparency and public participation in decision making via social media have not yet been explored or practised in Pakistan with regard to CPEC.

Keywords: machine learning, natural language processing, sentiment analysis, support vector machine, Word2vec

Procedia PDF Downloads 128
27253 A Reliable Multi-Type Vehicle Classification System

Authors: Ghada S. Moussa

Abstract:

Vehicle classification is an important task in traffic surveillance and intelligent transportation systems. Classification of vehicle images faces several problems, such as high intra-class vehicle variation, occlusion, shadow, and illumination. These problems and others must be considered to develop a reliable vehicle classification system. In this study, a reliable multi-type vehicle classification system based on the Bag-of-Words (BoW) paradigm is developed. Our proposed system uses and compares four well-known classifiers, Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), k-Nearest Neighbour (KNN), and Decision Tree, to classify vehicles into four categories: motorcycles, small, medium and large. Experiments on a large dataset show that our approach is efficient and reliable in classifying vehicles, with an accuracy of 95.7%. The SVM outperforms the other classification algorithms in terms of both accuracy and robustness, alongside a considerable reduction in execution time. The novelty of the developed system is that it can serve as a framework for many vehicle classification systems.

Keywords: vehicle classification, bag-of-words technique, SVM classifier, LDA classifier, KNN classifier, decision tree classifier, SIFT algorithm

Procedia PDF Downloads 332
27252 Landslide Vulnerability Assessment in Context with Indian Himalayan

Authors: Neha Gupta

Abstract:

Landslide vulnerability is considered a crucial parameter for the assessment of landslide risk. The term vulnerability is defined as the degree of damage to elements at risk across different dimensions, i.e., physical, social, economic, and environmental. The Himalaya region is very prone to multiple hazards such as floods, forest fires, earthquakes, and landslides. The increase in fatality rates and the loss of infrastructure and economy due to landslides in the Himalaya region call for the assessment of vulnerability. This study presents a methodology to measure a combination of vulnerability dimensions, i.e., social vulnerability, physical vulnerability, and environmental vulnerability, in one framework. A combined assessment of these vulnerabilities has rarely been carried out, and no such approach has been applied in the Indian scenario. The methodology was applied in an area of east Sikkim Himalaya, India. The physical vulnerability comprises a building footprint layer extracted from remote sensing data and Google Earth imagery. The social vulnerability was assessed by using population density based on land use. The land use map was derived from a high-resolution satellite image, and for environmental vulnerability assessment, NDVI, forest, agricultural land, and distance from the river were assessed from remote sensing and DEM. The classes of social vulnerability, physical vulnerability, and environmental vulnerability were normalized on a scale of 0 (no loss) to 1 (loss) to obtain a homogeneous dataset. Then Multi-Criteria Analysis (MCA) was used to assign individual weights to each dimension and integrate them into one framework. The final vulnerability was further classified into four classes from very low to very high.
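The normalize, weight, and classify steps described in the abstract can be sketched as follows. The weights, the equal-interval class breaks, the two middle class names, and the sample values are all assumptions made for illustration; the abstract does not state them.

```python
def min_max_normalize(values):
    """Rescale raw values to the 0 (no loss) to 1 (loss) range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combine(physical, social, environmental, weights):
    """Weighted sum of the three normalized dimensions (weights are assumed)."""
    w_p, w_s, w_e = weights
    total = w_p + w_s + w_e
    return [
        (w_p * p + w_s * s + w_e * e) / total
        for p, s, e in zip(physical, social, environmental)
    ]


def classify(score):
    """Four classes with equal-interval breaks (an assumption, not from the paper)."""
    if score < 0.25:
        return "very low"
    if score < 0.5:
        return "low"
    if score < 0.75:
        return "high"
    return "very high"


# Hypothetical raw values for three map cells.
physical = min_max_normalize([0, 5, 10])         # e.g. building footprint density
social = min_max_normalize([100, 300, 500])      # e.g. population density
environmental = min_max_normalize([2, 4, 6])     # e.g. an environmental indicator

scores = combine(physical, social, environmental, weights=(5, 3, 2))
classes = [classify(s) for s in scores]
print(list(zip(scores, classes)))
```

Dividing by the sum of the weights keeps the combined score in the same 0 to 1 range as its inputs, so the class breaks stay meaningful whatever weights are chosen.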

Keywords: landslide, multi-criteria analysis, MCA, physical vulnerability, social vulnerability

Procedia PDF Downloads 282
27251 Machine Learning for Disease Prediction Using Symptoms and X-Ray Images

Authors: Ravija Gunawardana, Banuka Athuraliya

Abstract:

Machine learning has emerged as a powerful tool for disease diagnosis and prediction. The use of machine learning algorithms has the potential to improve the accuracy of disease prediction, thereby enabling medical professionals to provide more effective and personalized treatments. This study focuses on developing a machine-learning model for disease prediction using symptoms and X-ray images. The importance of this study lies in its potential to assist medical professionals in accurately diagnosing diseases, thereby improving patient outcomes. Respiratory diseases are a significant cause of morbidity and mortality worldwide, and chest X-rays are commonly used in the diagnosis of these diseases. However, accurately interpreting X-ray images requires significant expertise and can be time-consuming, making it difficult to diagnose respiratory diseases in a timely manner. By incorporating machine learning algorithms, we can significantly enhance disease prediction accuracy, ultimately leading to better patient care. The study utilized the Mask R-CNN algorithm, which is a state-of-the-art method for object detection and segmentation in images, to process chest X-ray images. The model was trained and tested on a large dataset of patient information, which included both symptom data and X-ray images. The performance of the model was evaluated using a range of metrics, including accuracy, precision, recall, and F1-score. The results showed that the model achieved an accuracy rate of over 90%, indicating that it was able to accurately detect and segment regions of interest in the X-ray images. In addition to X-ray images, the study also incorporated symptoms as input data for disease prediction. The study used three different classifiers, namely Random Forest, K-Nearest Neighbor and Support Vector Machine, to predict diseases based on symptoms. These classifiers were trained and tested using the same dataset of patient information as the X-ray model. 
The results showed promising accuracy rates for predicting diseases using symptoms, with the ensemble learning techniques significantly improving the accuracy of disease prediction. The study's findings indicate that the use of machine learning algorithms can significantly enhance disease prediction accuracy, ultimately leading to better patient care. The model developed in this study has the potential to assist medical professionals in diagnosing respiratory diseases more accurately and efficiently. However, it is important to note that the accuracy of the model can be affected by several factors, including the quality of the X-ray images, the size of the dataset used for training, and the complexity of the disease being diagnosed. In conclusion, the study demonstrated the potential of machine learning algorithms for disease prediction using symptoms and X-ray images. The use of these algorithms can improve the accuracy of disease diagnosis, ultimately leading to better patient care. Further research is needed to validate the model's accuracy and effectiveness in a clinical setting and to expand its application to other diseases.

Keywords: K-nearest neighbor, mask R-CNN, random forest, support vector machine

Procedia PDF Downloads 114
27250 Key Factors Influencing Individual Knowledge Capability in KIFs

Authors: Salman Iqbal

Abstract:

Knowledge management (KM) literature has mainly focused on the antecedents of KM. The purpose of this study is to investigate the effect of specific human resource management (HRM) practices on employee knowledge sharing and its outcome as individual knowledge capability. Based on previous literature, a model is proposed for the study and hypotheses are formulated. The cross-sectional dataset comes from a sample of 19 knowledge intensive firms (KIFs). This study has run an item parceling technique followed by Confirmatory Factor Analysis (CFA) on the latent constructs of the research model. Employees’ collaboration and their interpersonal trust can help to improve their knowledge sharing behaviour and knowledge capability within organisations. This study suggests that in future, by using a larger sample, better statistical insight is possible. The findings of this study are beneficial for scholars, policy makers and practitioners. The empirical results of this study are entirely based on employees’ perceptions and make a significant research contribution, given there is a dearth of empirical research focusing on the subcontinent.

Keywords: employees’ collaboration, individual knowledge capability, knowledge sharing, monetary rewards, structural equation modelling

Procedia PDF Downloads 251
27249 An Experimental Study on Some Conventional and Hybrid Models of Fuzzy Clustering

Authors: Jeugert Kujtila, Kristi Hoxhalli, Ramazan Dalipi, Erjon Cota, Ardit Murati, Erind Bedalli

Abstract:

Clustering is a versatile instrument in the analysis of collections of data, providing insights into the underlying structures of the dataset and enhancing the modeling capabilities. The fuzzy approach to the clustering problem increases flexibility by introducing the concept of partial memberships (some value in the continuous interval [0, 1]) of the instances in the clusters. Several fuzzy clustering algorithms have been devised, such as FCM, Gustafson-Kessel, Gath-Geva, kernel-based FCM, PCM, etc. Each of these algorithms has its own advantages and drawbacks, so none of them is able to perform best on all datasets. In this paper, we experimentally compare the FCM, GK and GG algorithms and a hybrid two-stage fuzzy clustering model combining the FCM and Gath-Geva algorithms. Firstly, we theoretically discuss the advantages and drawbacks of each of these algorithms and describe the hybrid clustering model, which exploits the advantages and diminishes the drawbacks of each algorithm. Secondly, we experimentally evaluate the accuracy of the hybrid model by applying it to several benchmark and synthetic datasets.
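To make the idea of partial memberships concrete, here is a minimal sketch of the standard FCM membership formula for a single one-dimensional point. The centers, the point, and the fuzzifier value m = 2 are illustrative assumptions, not values taken from the paper.

```python
def fcm_memberships(point, centers, m=2.0):
    """Membership of one 1-D point in each cluster under the standard FCM rule.

    u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1)), where d_i is the distance
    from the point to center i and m > 1 is the fuzzifier.
    """
    distances = [abs(point - c) for c in centers]
    on_center = [d == 0.0 for d in distances]
    if any(on_center):
        # A point lying exactly on one or more centers is shared among them.
        share = 1.0 / sum(on_center)
        return [share if hit else 0.0 for hit in on_center]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_j) ** exponent for d_j in distances)
        for d_i in distances
    ]


# Hypothetical example: a point at 1.0 between centers at 0.0 and 4.0.
memberships = fcm_memberships(1.0, [0.0, 4.0])
print(memberships)
```

The memberships always sum to 1, and the closer center receives the larger share, which is the flexibility that distinguishes fuzzy from hard clustering.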

Keywords: fuzzy clustering, fuzzy c-means algorithm (FCM), Gustafson-Kessel algorithm, hybrid clustering model

Procedia PDF Downloads 484
27248 The Effect of Finding and Development Costs and Gas Price on Basins in the Barnett Shale

Authors: Michael Kenomore, Mohamed Hassan, Amjad Shah, Hom Dhakal

Abstract:

Shale gas reservoirs have been of greater importance than shale oil reservoirs since 2009, and with the current nature of the oil market, understanding the technical and economic performance of shale gas reservoirs is important. Using the Barnett shale as a case study, an economic model was developed to quantify the effect of finding and development costs and gas prices on the basins in the Barnett shale, using net present value as the evaluation parameter. A rate of return of 20% and a payback period of 60 months or less were used as the investment hurdle in the model. The Barnett was split into four basins (Strawn Basin, Ouachita Folded Belt, Fort Worth Syncline and Bend-arch Basin), with analysis conducted on each basin to provide a holistic outlook. The dataset consisted only of horizontal wells that started production between 2008 and 2015, with 1835 wells coming from the Strawn Basin, 137 wells from the Ouachita Folded Belt, 55 wells from the Bend-arch Basin and 724 wells from the Fort Worth Syncline. The data was analyzed initially in Microsoft Excel to determine the estimated ultimate recoverable (EUR). The range of EUR values from each basin was loaded into the Palisade Risk software, and a log-normal distribution typical of Barnett shale wells was fitted to the dataset. Monte Carlo simulation was then carried out over 1,000 iterations to obtain a cumulative distribution plot showing the probabilistic distribution of EUR for each basin. From the cumulative distribution plot, the P10, P50 and P90 EUR values for each basin were used in the economic model. Gas production from an individual well with an EUR similar to the calculated EUR was chosen and rescaled to fit the calculated EUR values for each basin at the respective percentiles, i.e. P10, P50 and P90.
The rescaled production was entered into the economic model to determine the effect of the finding and development cost and gas price on the net present value (10% discount rate/year), as well as to determine the scenario that satisfied the proposed investment hurdle. The finding and development costs used in this paper (assumed to consist only of the drilling and completion costs) were £1 million, £2 million and £4 million, while the gas price was varied from $2/MCF to $13/MCF based on Henry Hub spot prices from 2008 to 2015. The major findings of this study were that wells in the Bend-arch Basin were the least economic, that higher gas prices are needed in basins containing non-core counties, and that 90% of the Barnett shale wells were not economic at any of the finding and development costs, irrespective of the gas price, in all the basins. This study helps to determine the percentage of wells that are economic at different ranges of costs and gas prices, the basins that are most economic, and the wells that satisfy the investment hurdle.
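The two screening quantities mentioned in the abstract, net present value at a 10% annual discount rate and a payback period of at most 60 months, can be sketched as below. The cash flows are invented round numbers in thousands of currency units; they are not results from the study, and the study's actual cash-flow model is not described in the abstract.

```python
def net_present_value(initial_cost, yearly_cash_flows, annual_rate=0.10):
    """Discount end-of-year cash flows and subtract the up-front cost."""
    discounted = sum(
        cash / (1.0 + annual_rate) ** year
        for year, cash in enumerate(yearly_cash_flows, start=1)
    )
    return discounted - initial_cost


def payback_months(initial_cost, monthly_cash_flows):
    """First month in which cumulative (undiscounted) cash covers the cost."""
    cumulative = 0.0
    for month, cash in enumerate(monthly_cash_flows, start=1):
        cumulative += cash
        if cumulative >= initial_cost:
            return month
    return None  # the cost is never recovered within the given horizon


# Hypothetical well: 2,000 (thousand) drilling and completion cost.
npv = net_present_value(2000, [1100, 1210])
months = payback_months(2000, [50] * 60)
meets_payback_hurdle = months is not None and months <= 60
print(npv, months, meets_payback_hurdle)
```

In this toy case the discounted inflows exactly offset the cost, so the NPV is approximately zero, while the undiscounted payback is reached in month 40.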

Keywords: shale gas, Barnett shale, unconventional gas, estimated ultimate recoverable

Procedia PDF Downloads 279
27247 Machine Learning Driven Analysis of Kepler Objects of Interest to Identify Exoplanets

Authors: Akshat Kumar, Vidushi

Abstract:

This paper identifies 27 KOIs that have a high probability of being confirmed, 26 of which are currently classified as candidates and one as a false positive. For this purpose, 11 machine learning algorithms were implemented on the cumulative Kepler dataset sourced from the NASA Exoplanet Archive. The best-performing models were HistGradientBoosting and XGBoost, with a test accuracy of 93.5%, and the lowest-performing model was Gaussian NB, with a test accuracy of 54%. To test model performance, the F1 score, cross-validation score and ROC curve were calculated. Based on the learned models, the significant characteristics of confirmed exoplanets were identified, with emphasis on the object’s transit and stellar properties; these characteristics were koi_count, koi_prad, koi_period, koi_dor, koi_ror, and koi_smass, which were later used to filter out the potential KOIs. The paper also calculates the Earth similarity index based on the planetary radius and equilibrium temperature for each KOI identified to aid in their classification.
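As an illustration of the last step, the following sketch computes an Earth similarity index from radius and temperature using the commonly published product form of the index. The weight exponents (0.57 for radius, 5.58 for temperature) and the reference temperature of 255 K are assumptions taken from the general ESI literature; the abstract does not say which values the paper uses.

```python
def earth_similarity_index(
    radius,
    temperature,
    earth_radius=1.0,           # Earth radii
    earth_temperature=255.0,    # kelvin; assumed reference value
    radius_weight=0.57,         # assumed weight exponent
    temperature_weight=5.58,    # assumed weight exponent
):
    """Product of per-property similarity terms, each raised to weight / n."""
    terms = [
        (radius, earth_radius, radius_weight),
        (temperature, earth_temperature, temperature_weight),
    ]
    n = len(terms)
    esi = 1.0
    for value, reference, weight in terms:
        similarity = 1.0 - abs(value - reference) / (value + reference)
        esi *= similarity ** (weight / n)
    return esi


esi_earth_twin = earth_similarity_index(1.0, 255.0)
esi_two_radii = earth_similarity_index(2.0, 255.0)
print(esi_earth_twin, esi_two_radii)
```

An object matching both reference values scores exactly 1, and the score falls towards 0 as either property departs from the Earth value.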

Keywords: Kepler objects of interest, exoplanets, space exploration, machine learning, earth similarity index, transit photometry

Procedia PDF Downloads 38
27246 Emotion-Convolutional Neural Network for Perceiving Stress from Audio Signals: A Brain Chemistry Approach

Authors: Anup Anand Deshmukh, Catherine Soladie, Renaud Seguier

Abstract:

Emotion plays a key role in many applications like healthcare, to gather patients’ emotional behavior. Unlike typical ASR (Automated Speech Recognition) problems which focus on 'what was said', it is equally important to understand 'how it was said.' There are certain emotions which are given more importance due to their effectiveness in understanding human feelings. In this paper, we propose an approach that models human stress from audio signals. The research challenge in speech emotion detection is finding the appropriate set of acoustic features corresponding to an emotion. Another difficulty lies in defining the very meaning of emotion and being able to categorize it in a precise manner. Supervised Machine Learning models, including state-of-the-art Deep Learning classification methods, rely on the availability of clean and labelled data. One of the problems in affective computation is the limited amount of annotated data. The existing labelled emotion datasets are highly subject to the perception of the annotator. We address the first issue of feature selection by exploiting the use of traditional MFCC (Mel-Frequency Cepstral Coefficients) features in a Convolutional Neural Network. Our proposed Emo-CNN (Emotion-CNN) architecture treats speech representations in a manner similar to how CNNs treat images in a vision problem. Our experiments show that Emo-CNN consistently and significantly outperforms the popular existing methods over multiple datasets. It achieves 90.2% categorical accuracy on the Emo-DB dataset. We claim that Emo-CNN is robust to speaker variations and environmental distortions. The proposed approach achieves 85.5% speaker-dependent categorical accuracy on the SAVEE (Surrey Audio-Visual Expressed Emotion) dataset, beating the existing CNN-based approach by 10.2%. To tackle the second problem of subjectivity in stress labels, we use Lovheim’s cube, which is a 3-dimensional projection of emotions.
Monoamine neurotransmitters are a type of chemical messenger in the brain that transmit signals on perceiving emotions. The cube aims at explaining the relationship between these neurotransmitters and the positions of emotions in 3D space. The learnt emotion representations from the Emo-CNN are mapped to the cube using three-component PCA (Principal Component Analysis), which is then used to model human stress. This proposed approach not only circumvents the need for labelled stress data but also complies with the psychological theory of emotions given by Lovheim’s cube. We believe that this work is the first step towards creating a connection between Artificial Intelligence and the chemistry of human emotions.

Keywords: deep learning, brain chemistry, emotion perception, Lovheim's cube

Procedia PDF Downloads 127
27245 Comparison of Parametric and Bayesian Survival Regression Models in Simulated and HIV Patient Antiretroviral Therapy Data: Case Study of Alamata Hospital, North Ethiopia

Authors: Zeytu G. Asfaw, Serkalem K. Abrha, Demisew G. Degefu

Abstract:

Background: HIV/AIDS remains a major public health problem in Ethiopia, heavily affecting people of productive and reproductive age. We aimed to compare the performance of Parametric Survival Analysis and Bayesian Survival Analysis using simulations and a real dataset application focused on determining predictors of HIV patient survival. Methods: Parametric survival models with Exponential, Weibull, Log-normal, Log-logistic, Gompertz and Generalized gamma distributions were considered. A simulation study was carried out with two different algorithms, using informative and noninformative priors. A retrospective cohort study was implemented for HIV-infected patients under Highly Active Antiretroviral Therapy in Alamata General Hospital, North Ethiopia. Results: A total of 320 HIV patients were included in the study, of whom 52.19% were female and 47.81% were male. According to Kaplan-Meier survival estimates for the two sex groups, females showed better survival time than their male counterparts. The median survival time of HIV patients was 79 months. During the follow-up period, 89 (27.81%) deaths and 231 (72.19%) censored individuals were registered. The average baseline cluster of differentiation 4 (CD4) cell count for HIV/AIDS patients was 126.01, but after a three-year antiretroviral therapy follow-up the average CD4 cell count was 305.74, which was quite encouraging. Age, functional status, tuberculosis screen, past opportunistic infection, baseline CD4 cell count, World Health Organization clinical stage, sex, marital status, employment status, occupation type, and baseline weight were found to be statistically significant factors for longer survival of HIV patients. The standard errors of all covariates in the Bayesian log-normal survival model are smaller than those in the classical one.
Hence, Bayesian survival analysis showed better performance than classical parametric survival analysis when subjective data analysis was performed by considering expert opinions and historical knowledge about the parameters. Conclusions: HIV/AIDS patient mortality could be reduced through timely antiretroviral therapy with special care for the potential factors. Moreover, the Bayesian log-normal survival model was preferable to the classical log-normal survival model for determining predictors of HIV patient survival.
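The Kaplan-Meier estimate referred to in the results can be sketched in a few lines. The follow-up times and event flags below are toy values, not patient data from the study; an event flag of 1 marks an observed death and 0 marks a censored observation.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each observed event time.

    At each event time t the running survival is multiplied by
    (1 - deaths at t / number still at risk just before t).
    """
    event_times = sorted({t for t, e in zip(durations, events) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical follow-up times in months for five patients.
durations = [2, 3, 3, 5, 8]
events = [1, 0, 1, 1, 0]

km_curve = kaplan_meier(durations, events)
print(km_curve)
```

Censored patients never trigger a drop in the curve, but they do count in the at-risk denominator up to their censoring time, which is how the 231 censored individuals would still contribute information.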

Keywords: antiretroviral therapy (ART), Bayesian analysis, HIV, log-normal, parametric survival models

Procedia PDF Downloads 165
27244 Integration of Resistivity and Seismic Refraction Using Combine Inversion for Ancient River Findings at Sungai Batu, Lembah Bujang, Malaysia

Authors: Rais Yusoh, Rosli Saad, Mokhtar Saidin, Fauzi Andika, Sabiu Bala Muhammad

Abstract:

Resistivity and seismic refraction profiling have become common methods in pre-investigations for visualizing subsurface structure. Integrating the methods could reduce interpretation ambiguity. Both methods have their individual software packages for data inversion, and the potential to combine certain geophysical methods is restricted; however, research algorithms that have this functionality exist and are evaluated here. The interpretation of the subsurface was improved by combining inversion data from both methods, with each model influencing the other through closure coupling, so that the two methods support each other. These methods were applied to a field dataset from an archaeological pre-investigation aimed at finding the ancient river. There were no major changes in the inverted model when combining data inversion for this archetype, which is probably due to complex geology. The combined data analysis provides an additional technique for interpreting features such as alluvium, which can have a strong influence on the ancient river findings.

Keywords: ancient river, combine inversion, resistivity, seismic refraction

Procedia PDF Downloads 307
27243 Classification of Political Affiliations by Reduced Number of Features

Authors: Vesile Evrim, Aliyu Awwal

Abstract:

With the evolution of technology, the expression of opinions has shifted to the digital world. Politics, one of the hottest topics of opinion mining research, is combined here with behavior analysis for affiliation determination in text, which constitutes the subject of this paper. This study aims to classify text in news/blogs as either Republican or Democrat with the minimum number of features. As an initial set, 68 features, 64 of which are Linguistic Inquiry and Word Count (LIWC) features, are tested against 14 benchmark classification algorithms. In later experiments, the dimensionality of the feature vector is reduced using 7 feature selection algorithms. The results show that Decision Tree, Rule Induction and M5 Rule classifiers, when used with SVM and IGR feature selection algorithms, performed best, with up to 82.5% accuracy on the given dataset. Further tests on a single feature and on the linguistic-based feature sets showed similar results. The feature “function”, an aggregate feature of the linguistic category, was found to be the most differentiating of the 68 features, achieving 81% accuracy by itself in classifying articles as either Republican or Democrat.

Keywords: feature selection, LIWC, machine learning, politics

Procedia PDF Downloads 361
27242 The Inequality Effects of Natural Disasters: Evidence from Thailand

Authors: Annop Jaewisorn

Abstract:

This study explores the relationship between natural disasters and inequalities (both income and expenditure inequality) at the micro level in Thailand, as the first study of this nature for the country. The analysis uses a unique panel and remote-sensing dataset constructed for the purpose of this research. It contains provincial inequality measures and other economic and social indicators based on the Thailand Household Survey for the period between 1992 and 2019. The data on natural disasters, which are remote-sensing data, are obtained from several official geophysical or meteorological databases. Using a panel fixed-effects model, the results show that natural disasters significantly reduce household income and expenditure inequality as measured by the Gini index, implying that rich people in Thailand bear a higher cost of natural disasters than poor people. The effect on income inequality is mainly driven by droughts, while the effect on expenditure inequality is mainly driven by flood events. The results are robust across heterogeneity of the samples, lagged effects, outliers, and an alternative inequality measure.
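Since the findings hinge on the Gini index, a minimal sketch of one standard way to compute it (the mean absolute difference form) may help. The income lists are invented for illustration; the abstract does not describe how the study computes its provincial Gini measures.

```python
def gini_index(values):
    """Gini coefficient as the mean absolute difference divided by twice the mean.

    Returns 0.0 for perfect equality; values approach 1.0 as one unit
    holds everything.
    """
    n = len(values)
    mean = sum(values) / n
    if mean == 0:
        return 0.0
    total_abs_diff = sum(abs(a - b) for a in values for b in values)
    return total_abs_diff / (2.0 * n * n * mean)


# Hypothetical household incomes in two provinces.
gini_equal = gini_index([1, 1, 1, 1])
gini_unequal = gini_index([0, 0, 0, 4])
print(gini_equal, gini_unequal)
```

A disaster that lowers high incomes proportionally more than low ones pulls the values closer together and so reduces this index, which is the mechanism the abstract infers.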

Keywords: inequality, natural disasters, remote-sensing data, Thailand

Procedia PDF Downloads 102