Search results for: dataset analysis
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 27487

Search results for: dataset analysis

27307 Clothes Identification Using Inception ResNet V2 and MobileNet V2

Authors: Subodh Chandra Shakya, Badal Shrestha, Suni Thapa, Ashutosh Chauhan, Saugat Adhikari

Abstract:

To tackle our problem of clothes identification, we used different architectures of Convolutional Neural Networks. Among different architectures, the outcome from Inception ResNet V2 and MobileNet V2 seemed promising. On comparison of the metrices, we observed that the Inception ResNet V2 slightly outperforms MobileNet V2 for this purpose. So this paper of ours proposes the cloth identifier using Inception ResNet V2 and also contains the comparison between the outcome of ResNet V2 and MobileNet V2. The document here contains the results and findings of the research that we performed on the DeepFashion Dataset. To improve the dataset, we used different image preprocessing techniques like image shearing, image rotation, and denoising. The whole experiment was conducted with the intention of testing the efficiency of convolutional neural networks on cloth identification so that we could develop a reliable system that is good enough in identifying the clothes worn by the users. The whole system can be integrated with some kind of recommendation system.

Keywords: inception ResNet, convolutional neural net, deep learning, confusion matrix, data augmentation, data preprocessing

Procedia PDF Downloads 157
27306 Continual Learning Using Data Generation for Hyperspectral Remote Sensing Scene Classification

Authors: Samiah Alammari, Nassim Ammour

Abstract:

When providing a massive number of tasks successively to a deep learning process, a good performance of the model requires preserving the previous tasks data to retrain the model for each upcoming classification. Otherwise, the model performs poorly due to the catastrophic forgetting phenomenon. To overcome this shortcoming, we developed a successful continual learning deep model for remote sensing hyperspectral image regions classification. The proposed neural network architecture encapsulates two trainable subnetworks. The first module adapts its weights by minimizing the discrimination error between the land-cover classes during the new task learning, and the second module tries to learn how to replicate the data of the previous tasks by discovering the latent data structure of the new task dataset. We conduct experiments on HSI dataset Indian Pines. The results confirm the capability of the proposed method.

Keywords: continual learning, data reconstruction, remote sensing, hyperspectral image segmentation

Procedia PDF Downloads 219
27305 Machine Learning-Enabled Classification of Climbing Using Small Data

Authors: Nicholas Milburn, Yu Liang, Dalei Wu

Abstract:

Athlete performance scoring within the climbing do-main presents interesting challenges as the sport does not have an objective way to assign skill. Assessing skill levels within any sport is valuable as it can be used to mark progress while training, and it can help an athlete choose appropriate climbs to attempt. Machine learning-based methods are popular for complex problems like this. The dataset available was composed of dynamic force data recorded during climbing; however, this dataset came with challenges such as data scarcity, imbalance, and it was temporally heterogeneous. Investigated solutions to these challenges include data augmentation, temporal normalization, conversion of time series to the spectral domain, and cross validation strategies. The investigated solutions to the classification problem included light weight machine classifiers KNN and SVM as well as the deep learning with CNN. The best performing model had an 80% accuracy. In conclusion, there seems to be enough information within climbing force data to accurately categorize climbers by skill.

Keywords: classification, climbing, data imbalance, data scarcity, machine learning, time sequence

Procedia PDF Downloads 120
27304 Finding Related Scientific Documents Using Formal Concept Analysis

Authors: Nadeem Akhtar, Hira Javed

Abstract:

An important aspect of research is literature survey. Availability of a large amount of literature across different domains triggers the need for optimized systems which provide relevant literature to researchers. We propose a search system based on keywords for text documents. This experimental approach provides a hierarchical structure to the document corpus. The documents are labelled with keywords using KEA (Keyword Extraction Algorithm) and are automatically organized in a lattice structure using Formal Concept Analysis (FCA). This groups the semantically related documents together. The hierarchical structure, based on keywords gives out only those documents which precisely contain them. This approach open doors for multi-domain research. The documents across multiple domains which are indexed by similar keywords are grouped together. A hierarchical relationship between keywords is obtained. To signify the effectiveness of the approach, we have carried out the experiment and evaluation on Semeval-2010 Dataset. Results depict that the presented method is considerably successful in indexing of scientific papers.

Keywords: formal concept analysis, keyword extraction algorithm, scientific documents, lattice

Procedia PDF Downloads 305
27303 The Comparison of Joint Simulation and Estimation Methods for the Geometallurgical Modeling

Authors: Farzaneh Khorram

Abstract:

This paper endeavors to construct a block model to assess grinding energy consumption (CCE) and pinpoint blocks with the highest potential for energy usage during the grinding process within a specified region. Leveraging geostatistical techniques, particularly joint estimation, or simulation, based on geometallurgical data from various mineral processing stages, our objective is to forecast CCE across the study area. The dataset encompasses variables obtained from 2754 drill samples and a block model comprising 4680 blocks. The initial analysis encompassed exploratory data examination, variography, multivariate analysis, and the delineation of geological and structural units. Subsequent analysis involved the assessment of contacts between these units and the estimation of CCE via cokriging, considering its correlation with SPI. The selection of blocks exhibiting maximum CCE holds paramount importance for cost estimation, production planning, and risk mitigation. The study conducted exploratory data analysis on lithology, rock type, and failure variables, revealing seamless boundaries between geometallurgical units. Simulation methods, such as Plurigaussian and Turning band, demonstrated more realistic outcomes compared to cokriging, owing to the inherent characteristics of geometallurgical data and the limitations of kriging methods.

Keywords: geometallurgy, multivariate analysis, plurigaussian, turning band method, cokriging

Procedia PDF Downloads 25
27302 Analysis of Citation Rate and Data Reuse for Openly Accessible Biodiversity Datasets on Global Biodiversity Information Facility

Authors: Nushrat Khan, Mike Thelwall, Kayvan Kousha

Abstract:

Making research data openly accessible has been mandated by most funders over the last 5 years as it promotes reproducibility in science and reduces duplication of effort to collect the same data. There are evidence that articles that publicly share research data have higher citation rates in biological and social sciences. However, how and whether shared data is being reused is not always intuitive as such information is not easily accessible from the majority of research data repositories. This study aims to understand the practice of data citation and how data is being reused over the years focusing on biodiversity since research data is frequently reused in this field. Metadata of 38,878 datasets including citation counts were collected through the Global Biodiversity Information Facility (GBIF) API for this purpose. GBIF was used as a data source since it provides citation count for datasets, not a commonly available feature for most repositories. Analysis of dataset types, citation counts, creation and update time of datasets suggests that citation rate varies for different types of datasets, where occurrence datasets that have more granular information have higher citation rates than checklist and metadata-only datasets. Another finding is that biodiversity datasets on GBIF are frequently updated, which is unique to this field. Majority of the datasets from the earliest year of 2007 were updated after 11 years, with no dataset that was not updated since creation. For each year between 2007 and 2017, we compared the correlations between update time and citation rate of four different types of datasets. While recent datasets do not show any correlations, 3 to 4 years old datasets show weak correlation where datasets that were updated more recently received high citations. The results are suggestive that it takes several years to cumulate citations for research datasets. However, this investigation found that when searched on Google Scholar or Scopus databases for the same datasets, the number of citations is often not the same as GBIF. Hence future aim is to further explore the citation count system adopted by GBIF to evaluate its reliability and whether it can be applicable to other fields of studies as well.

Keywords: data citation, data reuse, research data sharing, webometrics

Procedia PDF Downloads 151
27301 Analysis of Brownfield Soil Contamination Using Local Government Planning Data

Authors: Emma E. Hellawell, Susan J. Hughes

Abstract:

BBrownfield sites are currently being redeveloped for residential use. Information on soil contamination on these former industrial sites is collected as part of the planning process by the local government. This research project analyses this untapped resource of environmental data, using site investigation data submitted to a local Borough Council, in Surrey, UK. Over 150 site investigation reports were collected and interrogated to extract relevant information. This study involved three phases. Phase 1 was the development of a database for soil contamination information from local government reports. This database contained information on the source, history, and quality of the data together with the chemical information on the soil that was sampled. Phase 2 involved obtaining site investigation reports for development within the study area and extracting the required information for the database. Phase 3 was the data analysis and interpretation of key contaminants to evaluate typical levels of contaminants, their distribution within the study area, and relating these results to current guideline levels of risk for future site users. Preliminary results for a pilot study using a sample of the dataset have been obtained. This pilot study showed there is some inconsistency in the quality of the reports and measured data, and careful interpretation of the data is required. Analysis of the information has found high levels of lead in shallow soil samples, with mean and median levels exceeding the current guidance for residential use. The data also showed elevated (but below guidance) levels of potentially carcinogenic polyaromatic hydrocarbons. Of particular concern from the data was the high detection rate for asbestos fibers. These were found at low concentrations in 25% of the soil samples tested (however, the sample set was small). Contamination levels of the remaining chemicals tested were all below the guidance level for residential site use. These preliminary pilot study results will be expanded, and results for the whole local government area will be presented at the conference. The pilot study has demonstrated the potential for this extensive dataset to provide greater information on local contamination levels. This can help inform regulators and developers and lead to more targeted site investigations, improving risk assessments, and brownfield development.

Keywords: Brownfield development, contaminated land, local government planning data, site investigation

Procedia PDF Downloads 112
27300 Assessing the Effectiveness of Machine Learning Algorithms for Cyber Threat Intelligence Discovery from the Darknet

Authors: Azene Zenebe

Abstract:

Deep learning is a subset of machine learning which incorporates techniques for the construction of artificial neural networks and found to be useful for modeling complex problems with large dataset. Deep learning requires a very high power computational and longer time for training. By aggregating computing power, high performance computer (HPC) has emerged as an approach to resolving advanced problems and performing data-driven research activities. Cyber threat intelligence (CIT) is actionable information or insight an organization or individual uses to understand the threats that have, will, or are currently targeting the organization. Results of review of literature will be presented along with results of experimental study that compares the performance of tree-based and function-base machine learning including deep learning algorithms using secondary dataset collected from darknet.

Keywords: deep-learning, cyber security, cyber threat modeling, tree-based machine learning, function-based machine learning, data science

Procedia PDF Downloads 123
27299 Automation of Embodied Energy Calculations for Buildings through Building Information Modelling

Authors: Ahmad Odeh

Abstract:

Researchers are currently more concerned about the calculations of energy at the operational stage, mainly due to its larger environmental impact, but the fact remains, embodied energies represent a substantial contributor unaccounted for in the overall energy computation method. The calculation of materials’ embodied energy during the construction stage is complicated. This is due to the various factors involved. The equipment used, fuel needed, and electricity required for each type of materials varies with location and thus the embodied energy will differ for each project. Moreover, the method used in manufacturing, transporting and putting in place will have significant influence on the materials’ embodied energy. This anomaly has made it difficult to calculate or even bench mark the usage of such energies. This paper presents a model aimed at calculating embodied energies based on such variabilities. It presents a systematic approach that uses an efficient method of calculation to provide a new insight for the selection of construction materials. The model is developed in a BIM environment. The quantification of materials’ energy is determined over the three main stages of their lifecycle: manufacturing, transporting and placing. The model uses three major databases each of which contains set of the construction materials that are most commonly used in building projects. The first dataset holds information about the energy required to manufacture any type of materials, the second includes information about the energy required for transporting the materials while the third stores information about the energy required by machinery to place the materials in their intended locations. Through geospatial data analysis, the model automatically calculates the distances between the suppliers and construction sites and then uses dataset information for energy computations. The computational sum of all the energies is automatically calculated and then the model provides designers with a list of usable equipment along with the associated embodied energies.

Keywords: BIM, lifecycle energy assessment, building automation, energy conservation

Procedia PDF Downloads 173
27298 A Deep Learning Model with Greedy Layer-Wise Pretraining Approach for Optimal Syngas Production by Dry Reforming of Methane

Authors: Maryam Zarabian, Hector Guzman, Pedro Pereira-Almao, Abraham Fapojuwo

Abstract:

Dry reforming of methane (DRM) has sparked significant industrial and scientific interest not only as a viable alternative for addressing the environmental concerns of two main contributors of the greenhouse effect, i.e., carbon dioxide (CO₂) and methane (CH₄), but also produces syngas, i.e., a mixture of hydrogen (H₂) and carbon monoxide (CO) utilized by a wide range of downstream processes as a feedstock for other chemical productions. In this study, we develop an AI-enable syngas production model to tackle the problem of achieving an equivalent H₂/CO ratio [1:1] with respect to the most efficient conversion. Firstly, the unsupervised density-based spatial clustering of applications with noise (DBSAN) algorithm removes outlier data points from the original experimental dataset. Then, random forest (RF) and deep neural network (DNN) models employ the error-free dataset to predict the DRM results. DNN models inherently would not be able to obtain accurate predictions without a huge dataset. To cope with this limitation, we employ reusing pre-trained layers’ approaches such as transfer learning and greedy layer-wise pretraining. Compared to the other deep models (i.e., pure deep model and transferred deep model), the greedy layer-wise pre-trained deep model provides the most accurate prediction as well as similar accuracy to the RF model with R² values 1.00, 0.999, 0.999, 0.999, 0.999, and 0.999 for the total outlet flow, H₂/CO ratio, H₂ yield, CO yield, CH₄ conversion, and CO₂ conversion outputs, respectively.

Keywords: artificial intelligence, dry reforming of methane, artificial neural network, deep learning, machine learning, transfer learning, greedy layer-wise pretraining

Procedia PDF Downloads 62
27297 An Improvement of ComiR Algorithm for MicroRNA Target Prediction by Exploiting Coding Region Sequences of mRNAs

Authors: Giorgio Bertolazzi, Panayiotis Benos, Michele Tumminello, Claudia Coronnello

Abstract:

MicroRNAs are small non-coding RNAs that post-transcriptionally regulate the expression levels of messenger RNAs. MicroRNA regulation activity depends on the recognition of binding sites located on mRNA molecules. ComiR (Combinatorial miRNA targeting) is a user friendly web tool realized to predict the targets of a set of microRNAs, starting from their expression profile. ComiR incorporates miRNA expression in a thermodynamic binding model, and it associates each gene with the probability of being a target of a set of miRNAs. ComiR algorithms were trained with the information regarding binding sites in the 3’UTR region, by using a reliable dataset containing the targets of endogenously expressed microRNA in D. melanogaster S2 cells. This dataset was obtained by comparing the results from two different experimental approaches, i.e., inhibition, and immunoprecipitation of the AGO1 protein; this protein is a component of the microRNA induced silencing complex. In this work, we tested whether including coding region binding sites in the ComiR algorithm improves the performance of the tool in predicting microRNA targets. We focused the analysis on the D. melanogaster species and updated the ComiR underlying database with the currently available releases of mRNA and microRNA sequences. As a result, we find that the ComiR algorithm trained with the information related to the coding regions is more efficient in predicting the microRNA targets, with respect to the algorithm trained with 3’utr information. On the other hand, we show that 3’utr based predictions can be seen as complementary to the coding region based predictions, which suggests that both predictions, from 3'UTR and coding regions, should be considered in a comprehensive analysis. Furthermore, we observed that the lists of targets obtained by analyzing data from one experimental approach only, that is, inhibition or immunoprecipitation of AGO1, are not reliable enough to test the performance of our microRNA target prediction algorithm. Further analysis will be conducted to investigate the effectiveness of the tool with data from other species, provided that validated datasets, as obtained from the comparison of RISC proteins inhibition and immunoprecipitation experiments, will be available for the same samples. Finally, we propose to upgrade the existing ComiR web-tool by including the coding region based trained model, available together with the 3’UTR based one.

Keywords: AGO1, coding region, Drosophila melanogaster, microRNA target prediction

Procedia PDF Downloads 413
27296 Fully Automated Methods for the Detection and Segmentation of Mitochondria in Microscopy Images

Authors: Blessing Ojeme, Frederick Quinn, Russell Karls, Shannon Quinn

Abstract:

The detection and segmentation of mitochondria from fluorescence microscopy are crucial for understanding the complex structure of the nervous system. However, the constant fission and fusion of mitochondria and image distortion in the background make the task of detection and segmentation challenging. In the literature, a number of open-source software tools and artificial intelligence (AI) methods have been described for analyzing mitochondrial images, achieving remarkable classification and quantitation results. However, the availability of combined expertise in the medical field and AI required to utilize these tools poses a challenge to its full adoption and use in clinical settings. Motivated by the advantages of automated methods in terms of good performance, minimum detection time, ease of implementation, and cross-platform compatibility, this study proposes a fully automated framework for the detection and segmentation of mitochondria using both image shape information and descriptive statistics. Using the low-cost, open-source python and openCV library, the algorithms are implemented in three stages: pre-processing, image binarization, and coarse-to-fine segmentation. The proposed model is validated using the mitochondrial fluorescence dataset. Ground truth labels generated using a Lab kit were also used to evaluate the performance of our detection and segmentation model. The study produces good detection and segmentation results and reports the challenges encountered during the image analysis of mitochondrial morphology from the fluorescence mitochondrial dataset. A discussion on the methods and future perspectives of fully automated frameworks conclude the paper.

Keywords: 2D, binarization, CLAHE, detection, fluorescence microscopy, mitochondria, segmentation

Procedia PDF Downloads 336
27295 Post Pandemic Mobility Analysis through Indexing and Sharding in MongoDB: Performance Optimization and Insights

Authors: Karan Vishavjit, Aakash Lakra, Shafaq Khan

Abstract:

The COVID-19 pandemic has pushed healthcare professionals to use big data analytics as a vital tool for tracking and evaluating the effects of contagious viruses. To effectively analyze huge datasets, efficient NoSQL databases are needed. The analysis of post-COVID-19 health and well-being outcomes and the evaluation of the effectiveness of government efforts during the pandemic is made possible by this research’s integration of several datasets, which cuts down on query processing time and creates predictive visual artifacts. We recommend applying sharding and indexing technologies to improve query effectiveness and scalability as the dataset expands. Effective data retrieval and analysis are made possible by spreading the datasets into a sharded database and doing indexing on individual shards. Analysis of connections between governmental activities, poverty levels, and post-pandemic well being is the key goal. We want to evaluate the effectiveness of governmental initiatives to improve health and lower poverty levels. We will do this by utilising advanced data analysis and visualisations. The findings provide relevant data that supports the advancement of UN sustainable objectives, future pandemic preparation, and evidence-based decision-making. This study shows how Big Data and NoSQL databases may be used to address problems with global health.

Keywords: big data, COVID-19, health, indexing, NoSQL, sharding, scalability, well being

Procedia PDF Downloads 43
27294 A Multivariate Exploratory Data Analysis of a Crisis Text Messaging Service in Order to Analyse the Impact of the COVID-19 Pandemic on Mental Health in Ireland

Authors: Hamda Ajmal, Karen Young, Ruth Melia, John Bogue, Mary O'Sullivan, Jim Duggan, Hannah Wood

Abstract:

The Covid-19 pandemic led to a range of public health mitigation strategies in order to suppress the SARS-CoV-2 virus. The drastic changes in everyday life due to lockdowns had the potential for a significant negative impact on public mental health, and a key public health goal is to now assess the evidence from available Irish datasets to provide useful insights on this issue. Text-50808 is an online text-based mental health support service, established in Ireland in 2020, and can provide a measure of revealed distress and mental health concerns across the population. The aim of this study is to explore statistical associations between public mental health in Ireland and the Covid-19 pandemic. Uniquely, this study combines two measures of emotional wellbeing in Ireland: (1) weekly text volume at Text-50808, and (2) emotional wellbeing indicators reported by respondents of the Amárach public opinion survey, carried out on behalf of the Department of Health, Ireland. For this analysis, a multivariate graphical exploratory data analysis (EDA) was performed on the Text-50808 dataset dated from 15th June 2020 to 30th June 2021. This was followed by time-series analysis of key mental health indicators including: (1) the percentage of daily/weekly texts at Text-50808 that mention Covid-19 related issues; (2) the weekly percentage of people experiencing anxiety, boredom, enjoyment, happiness, worry, fear and stress in Amárach survey; and Covid-19 related factors: (3) daily new Covid-19 case numbers; (4) daily stringency index capturing the effect of government non-pharmaceutical interventions (NPIs) in Ireland. The cross-correlation function was applied to measure the relationship between the different time series. EDA of the Text-50808 dataset reveals significant peaks in the volume of texts on days prior to level 3 lockdown and level 5 lockdown in October 2020, and full level 5 lockdown in December 2020. A significantly high positive correlation was observed between the percentage of texts at Text-50808 that reported Covid-19 related issues and the percentage of respondents experiencing anxiety, worry and boredom (at a lag of 1 week) in Amárach survey data. There is a significant negative correlation between percentage of texts with Covid-19 related issues and percentage of respondents experiencing happiness in Amárach survey. Daily percentage of texts at Text-50808 that reported Covid-19 related issues to have a weak positive correlation with daily new Covid-19 cases in Ireland at a lag of 10 days and with daily stringency index of NPIs in Ireland at a lag of 2 days. The sudden peaks in text volume at Text-50808 immediately prior to new restrictions in Ireland indicate an association between a rise in mental health concerns following the announcement of new restrictions. There is also a high correlation between emotional wellbeing variables in the Amárach dataset and the number of weekly texts at Text-50808, and this confirms that Text-50808 reflects overall public sentiment. This analysis confirms the benefits of the texting service as a community surveillance tool for mental health in the population. This initial EDA will be extended to use multivariate modeling to predict the effect of additional Covid-19 related factors on public mental health in Ireland.

Keywords: COVID-19 pandemic, data analysis, digital health, mental health, public health, digital health

Procedia PDF Downloads 113
27293 Mammographic Multi-View Cancer Identification Using Siamese Neural Networks

Authors: Alisher Ibragimov, Sofya Senotrusova, Aleksandra Beliaeva, Egor Ushakov, Yuri Markin

Abstract:

Mammography plays a critical role in screening for breast cancer in women, and artificial intelligence has enabled the automatic detection of diseases in medical images. Many of the current techniques used for mammogram analysis focus on a single view (mediolateral or craniocaudal view), while in clinical practice, radiologists consider multiple views of mammograms from both breasts to make a correct decision. Consequently, computer-aided diagnosis (CAD) systems could benefit from incorporating information gathered from multiple views. In this study, the introduce a method based on a Siamese neural network (SNN) model that simultaneously analyzes mammographic images from tri-view: bilateral and ipsilateral. In this way, when a decision is made on a single image of one breast, attention is also paid to two other images – a view of the same breast in a different projection and an image of the other breast as well. Consequently, the algorithm closely mimics the radiologist's practice of paying attention to the entire examination of a patient rather than to a single image. Additionally, to the best of our knowledge, this research represents the first experiments conducted using the recently released Vietnamese dataset of digital mammography (VinDr-Mammo). On an independent test set of images from this dataset, the best model achieved an AUC of 0.87 per image. Therefore, this suggests that there is a valuable automated second opinion in the interpretation of mammograms and breast cancer diagnosis, which in the future may help to alleviate the burden on radiologists and serve as an additional layer of verification.

Keywords: breast cancer, computer-aided diagnosis, deep learning, multi-view mammogram, siamese neural network

Procedia PDF Downloads 111
27292 Representativity Based Wasserstein Active Regression

Authors: Benjamin Bobbia, Matthias Picard

Abstract:

In recent years active learning methodologies based on the representativity of the data seems more promising to limit overfitting. The presented query methodology for regression using the Wasserstein distance measuring the representativity of our labelled dataset compared to the global distribution. In this work a crucial use of GroupSort Neural Networks is made therewith to draw a double advantage. The Wasserstein distance can be exactly expressed in terms of such neural networks. Moreover, one can provide explicit bounds for their size and depth together with rates of convergence. However, heterogeneity of the dataset is also considered by weighting the Wasserstein distance with the error of approximation at the previous step of active learning. Such an approach leads to a reduction of overfitting and high prediction performance after few steps of query. After having detailed the methodology and algorithm, an empirical study is presented in order to investigate the range of our hyperparameters. The performances of this method are compared, in terms of numbers of query needed, with other classical and recent query methods on several UCI datasets.

Keywords: active learning, Lipschitz regularization, neural networks, optimal transport, regression

Procedia PDF Downloads 61
27291 A Machine Learning-Based Model to Screen Antituberculosis Compound Targeted against LprG Lipoprotein of Mycobacterium tuberculosis

Authors: Syed Asif Hassan, Syed Atif Hassan

Abstract:

Multidrug-resistant Tuberculosis (MDR-TB) is an infection caused by the resistant strains of Mycobacterium tuberculosis that do not respond either to isoniazid or rifampicin, which are the most important anti-TB drugs. The increase in the occurrence of a drug-resistance strain of MTB calls for an intensive search of novel target-based therapeutics. In this context LprG (Rv1411c) a lipoprotein from MTB plays a pivotal role in the immune evasion of Mtb leading to survival and propagation of the bacterium within the host cell. Therefore, a machine learning method will be developed for generating a computational model that could predict for a potential anti LprG activity of the novel antituberculosis compound. The present study will utilize dataset from PubChem database maintained by National Center for Biotechnology Information (NCBI). The dataset involves compounds screened against MTB were categorized as active and inactive based upon PubChem activity score. PowerMV, a molecular descriptor generator, and visualization tool will be used to generate the 2D molecular descriptors for the actives and inactive compounds present in the dataset. The 2D molecular descriptors generated from PowerMV will be used as features. We feed these features into three different classifiers, namely, random forest, a deep neural network, and a recurring neural network, to build separate predictive models and choosing the best performing model based on the accuracy of predicting novel antituberculosis compound with an anti LprG activity. Additionally, the efficacy of predicted active compounds will be screened using SMARTS filter to choose molecule with drug-like features.

Keywords: antituberculosis drug, classifier, machine learning, molecular descriptors, prediction

Procedia PDF Downloads 359
27290 Estimating Estimators: An Empirical Comparison of Non-Invasive Analysis Methods

Authors: Yan Torres, Fernanda Simoes, Francisco Petrucci-Fonseca, Freddie-Jeanne Richard

Abstract:

The non-invasive samples are an alternative of collecting genetic samples directly. Non-invasive samples are collected without the manipulation of the animal (e.g., scats, feathers and hairs). Nevertheless, the use of non-invasive samples has some limitations. The main issue is degraded DNA, leading to poorer extraction efficiency and genotyping. Those errors delayed for some years a widespread use of non-invasive genetic information. Possibilities to limit genotyping errors can be done using analysis methods that can assimilate the errors and singularities of non-invasive samples. Genotype matching and population estimation algorithms can be highlighted as important analysis tools that have been adapted to deal with those errors. Although, this recent development of analysis methods there is still a lack of empirical performance comparison of them. A comparison of methods with dataset different in size and structure can be useful for future studies since non-invasive samples are a powerful tool for getting information specially for endangered and rare populations. To compare the analysis methods, four different datasets used were obtained from the Dryad digital repository were used. Three different matching algorithms (Cervus, Colony and Error Tolerant Likelihood Matching - ETLM) are used for matching genotypes and two different ones for population estimation (Capwire and BayesN). The three matching algorithms showed different patterns of results. The ETLM produced less number of unique individuals and recaptures. A similarity in the matched genotypes between Colony and Cervus was observed. That is not a surprise since the similarity between those methods on the likelihood pairwise and clustering algorithms. The matching of ETLM showed almost no similarity with the genotypes that were matched with the other methods. The different cluster algorithm system and error model of ETLM seems to lead to a more criterious selection, although the processing time and interface friendly of ETLM were the worst between the compared methods. The population estimators performed differently regarding the datasets. There was a consensus between the different estimators only for the one dataset. The BayesN showed higher and lower estimations when compared with Capwire. The BayesN does not consider the total number of recaptures like Capwire only the recapture events. So, this makes the estimator sensitive to data heterogeneity. Heterogeneity in the sense means different capture rates between individuals. In those examples, the tolerance for homogeneity seems to be crucial for BayesN work properly. Both methods are user-friendly and have reasonable processing time. An amplified analysis with simulated genotype data can clarify the sensibility of the algorithms. The present comparison of the matching methods indicates that Colony seems to be more appropriated for general use considering a time/interface/robustness balance. The heterogeneity of the recaptures affected strongly the BayesN estimations, leading to over and underestimations population numbers. Capwire is then advisable to general use since it performs better in a wide range of situations.

Keywords: algorithms, genetics, matching, population

Procedia PDF Downloads 118
27289 Standard Essential Patents for Artificial Intelligence Hardware and the Implications For Intellectual Property Rights

Authors: Wendy de Gomez

Abstract:

Standardization is a critical element in the ability of a society to reduce uncertainty, subjectivity, misrepresentation, and interpretation while simultaneously contributing to innovation. Technological standardization is critical to codify specific operationalization through legal instruments that provide rules of development, expectation, and use. In the current emerging technology landscape Artificial Intelligence (AI) hardware as a general use technology has seen incredible growth as evidenced from AI technology patents between 2012 and 2018 in the United States Patent Trademark Office (USPTO) AI dataset. However, as outlined in the 2023 United States Government National Standards Strategy for Critical and Emerging Technology the codification through standardization of emerging technologies such as AI has not kept pace with its actual technological proliferation. This gap has the potential to cause significant divergent possibilities for the downstream outcomes of AI in both the short and long term. This original empirical research provides an overview of the standardization efforts around AI in different geographies and provides a background to standardization law. It quantifies the longitudinal trend of Artificial Intelligence hardware patents through the USPTO AI dataset. It seeks evidence of existing Standard Essential Patents from these AI hardware patents through a text analysis of the Statement of patent history and the Field of the invention of these patents in Patent Vector and examines their determination as a Standard Essential Patent and their inclusion in existing AI technology standards across the four main AI standards bodies- European Telecommunications Standards Institute (ETSI); International Telecommunication Union (ITU)/ Telecommunication Standardization Sector (-T); Institute of Electrical and Electronics Engineers (IEEE); and the International Organization for Standardization (ISO). Once the analysis is complete the paper will discuss both the theoretical and operational implications of F/Rand Licensing Agreements for the owners of these Standard Essential Patents in the United States Court and Administrative system. It will conclude with an evaluation of how Standard Setting Organizations (SSOs) can work with SEP owners more effectively through various forms of Intellectual Property mechanisms such as patent pools.

Keywords: patents, artifical intelligence, standards, F/Rand agreements

Procedia PDF Downloads 46
27288 Multi-Level Attentional Network for Aspect-Based Sentiment Analysis

Authors: Xinyuan Liu, Xiaojun Jing, Yuan He, Junsheng Mu

Abstract:

Aspect-based Sentiment Analysis (ABSA) has attracted much attention due to its capacity to determine the sentiment polarity of the certain aspect in a sentence. In previous works, great significance of the interaction between aspect and sentence has been exhibited in ABSA. In consequence, a Multi-Level Attentional Networks (MLAN) is proposed. MLAN consists of four parts: Embedding Layer, Encoding Layer, Multi-Level Attentional (MLA) Layers and Final Prediction Layer. Among these parts, MLA Layers including Aspect Level Attentional (ALA) Layer and Interactive Attentional (ILA) Layer is the innovation of MLAN, whose function is to focus on the important information and obtain multiple levels’ attentional weighted representation of aspect and sentence. In the experiments, MLAN is compared with classical TD-LSTM, MemNet, RAM, ATAE-LSTM, IAN, AOA, LCR-Rot and AEN-GloVe on SemEval 2014 Dataset. The experimental results show that MLAN outperforms those state-of-the-art models greatly. And in case study, the works of ALA Layer and ILA Layer have been proven to be effective and interpretable.

Keywords: deep learning, aspect-based sentiment analysis, attention, natural language processing

Procedia PDF Downloads 117
27287 Application of MALDI-MS to Differentiate SARS-CoV-2 and Non-SARS-CoV-2 Symptomatic Infections in the Early and Late Phases of the Pandemic

Authors: Dmitriy Babenko, Sergey Yegorov, Ilya Korshukov, Aidana Sultanbekova, Valentina Barkhanskaya, Tatiana Bashirova, Yerzhan Zhunusov, Yevgeniya Li, Viktoriya Parakhina, Svetlana Kolesnichenko, Yeldar Baiken, Aruzhan Pralieva, Zhibek Zhumadilova, Matthew S. Miller, Gonzalo H. Hortelano, Anar Turmuhambetova, Antonella E. Chesca, Irina Kadyrova

Abstract:

Introduction: The rapidly evolving COVID-19 pandemic, along with the re-emergence of pathogens causing acute respiratory infections (ARI), has necessitated the development of novel diagnostic tools to differentiate various causes of ARI. MALDI-MS, due to its wide usage and affordability, has been proposed as a potential instrument for diagnosing SARS-CoV-2 versus non-SARS-CoV-2 ARI. The aim of this study was to investigate the potential of MALDI-MS in conjunction with a machine learning model to accurately distinguish between symptomatic infections caused by SARS-CoV-2 and non-SARS-CoV-2 during both the early and later phases of the pandemic. Furthermore, this study aimed to analyze mass spectrometry (MS) data obtained from nasal swabs of healthy individuals. Methods: We gathered mass spectra from 252 samples, comprising 108 SARS-CoV-2-positive samples obtained in 2020 (Covid 2020), 7 SARS-CoV- 2-positive samples obtained in 2023 (Covid 2023), 71 samples from symptomatic individuals without SARS-CoV-2 (Control non-Covid ARVI), and 66 samples from healthy individuals (Control healthy). All the samples were subjected to RT-PCR testing. For data analysis, we employed the caret R package to train and test seven machine-learning algorithms: C5.0, KNN, NB, RF, SVM-L, SVM-R, and XGBoost. We conducted a training process using a five-fold (outer) nested repeated (five times) ten-fold (inner) cross-validation with a randomized stratified splitting approach. Results: In this study, we utilized the Covid 2020 dataset as a case group and the non-Covid ARVI dataset as a control group to train and test various machine learning (ML) models. Among these models, XGBoost and SVM-R demonstrated the highest performance, with accuracy values of 0.97 [0.93, 0.97] and 0.95 [0.95; 0.97], specificity values of 0.86 [0.71; 0.93] and 0.86 [0.79; 0.87], and sensitivity values of 0.984 [0.984; 1.000] and 1.000 [0.968; 1.000], respectively. When examining the Covid 2023 dataset, the Naive Bayes model achieved the highest classification accuracy of 43%, while XGBoost and SVM-R achieved accuracies of 14%. For the healthy control dataset, the accuracy of the models ranged from 0.27 [0.24; 0.32] for k-nearest neighbors to 0.44 [0.41; 0.45] for the Support Vector Machine with a radial basis function kernel. Conclusion: Therefore, ML models trained on MALDI MS of nasopharyngeal swabs obtained from patients with Covid during the initial phase of the pandemic, as well as symptomatic non-Covid individuals, showed excellent classification performance, which aligns with the results of previous studies. However, when applied to swabs from healthy individuals and a limited sample of patients with Covid in the late phase of the pandemic, ML models exhibited lower classification accuracy.

Keywords: SARS-CoV-2, MALDI-TOF MS, ML models, nasopharyngeal swabs, classification

Procedia PDF Downloads 76
27286 Analysis of Matching Pursuit Features of EEG Signal for Mental Tasks Classification

Authors: Zin Mar Lwin

Abstract:

Brain Computer Interface (BCI) Systems have developed for people who suffer from severe motor disabilities and challenging to communicate with their environment. BCI allows them for communication by a non-muscular way. For communication between human and computer, BCI uses a type of signal called Electroencephalogram (EEG) signal which is recorded from the human„s brain by means of an electrode. The electroencephalogram (EEG) signal is an important information source for knowing brain processes for the non-invasive BCI. Translating human‟s thought, it needs to classify acquired EEG signal accurately. This paper proposed a typical EEG signal classification system which experiments the Dataset from “Purdue University.” Independent Component Analysis (ICA) method via EEGLab Tools for removing artifacts which are caused by eye blinks. For features extraction, the Time and Frequency features of non-stationary EEG signals are extracted by Matching Pursuit (MP) algorithm. The classification of one of five mental tasks is performed by Multi_Class Support Vector Machine (SVM). For SVMs, the comparisons have been carried out for both 1-against-1 and 1-against-all methods.

Keywords: BCI, EEG, ICA, SVM

Procedia PDF Downloads 252
27285 Chinese Sentence Level Lip Recognition

Authors: Peng Wang, Tigang Jiang

Abstract:

The computer based lip reading method of different languages cannot be universal. At present, for the research of Chinese lip reading, whether the work on data sets or recognition algorithms, is far from mature. In this paper, we study the Chinese lipreading method based on machine learning, and propose a Chinese Sentence-level lip-reading network (CNLipNet) model which consists of spatio-temporal convolutional neural network(CNN), recurrent neural network(RNN) and Connectionist Temporal Classification (CTC) loss function. This model can map variable-length sequence of video frames to Chinese Pinyin sequence and is trained end-to-end. More over, We create CNLRS, a Chinese Lipreading Dataset, which contains 5948 samples and can be shared through github. The evaluation of CNLipNet on this dataset yielded a 41% word correct rate and a 70.6% character correct rate. This evaluation result is far superior to the professional human lip readers, indicating that CNLipNet performs well in lipreading.

Keywords: lipreading, machine learning, spatio-temporal, convolutional neural network, recurrent neural network

Procedia PDF Downloads 101
27284 Comparative Analysis of Reinforcement Learning Algorithms for Autonomous Driving

Authors: Migena Mana, Ahmed Khalid Syed, Abdul Malik, Nikhil Cherian

Abstract:

In recent years, advancements in deep learning enabled researchers to tackle the problem of self-driving cars. Car companies use huge datasets to train their deep learning models to make autonomous cars a reality. However, this approach has certain drawbacks in that the state space of possible actions for a car is so huge that there cannot be a dataset for every possible road scenario. To overcome this problem, the concept of reinforcement learning (RL) is being investigated in this research. Since the problem of autonomous driving can be modeled in a simulation, it lends itself naturally to the domain of reinforcement learning. The advantage of this approach is that we can model different and complex road scenarios in a simulation without having to deploy in the real world. The autonomous agent can learn to drive by finding the optimal policy. This learned model can then be easily deployed in a real-world setting. In this project, we focus on three RL algorithms: Q-learning, Deep Deterministic Policy Gradient (DDPG), and Proximal Policy Optimization (PPO). To model the environment, we have used TORCS (The Open Racing Car Simulator), which provides us with a strong foundation to test our model. The inputs to the algorithms are the sensor data provided by the simulator such as velocity, distance from side pavement, etc. The outcome of this research project is a comparative analysis of these algorithms. Based on the comparison, the PPO algorithm gives the best results. When using PPO algorithm, the reward is greater, and the acceleration, steering angle and braking are more stable compared to the other algorithms, which means that the agent learns to drive in a better and more efficient way in this case. Additionally, we have come up with a dataset taken from the training of the agent with DDPG and PPO algorithms. It contains all the steps of the agent during one full training in the form: (all input values, acceleration, steering angle, break, loss, reward). This study can serve as a base for further complex road scenarios. Furthermore, it can be enlarged in the field of computer vision, using the images to find the best policy.

Keywords: autonomous driving, DDPG (deep deterministic policy gradient), PPO (proximal policy optimization), reinforcement learning

Procedia PDF Downloads 120
27283 Statistical Wavelet Features, PCA, and SVM-Based Approach for EEG Signals Classification

Authors: R. K. Chaurasiya, N. D. Londhe, S. Ghosh

Abstract:

The study of the electrical signals produced by neural activities of human brain is called Electroencephalography. In this paper, we propose an automatic and efficient EEG signal classification approach. The proposed approach is used to classify the EEG signal into two classes: epileptic seizure or not. In the proposed approach, we start with extracting the features by applying Discrete Wavelet Transform (DWT) in order to decompose the EEG signals into sub-bands. These features, extracted from details and approximation coefficients of DWT sub-bands, are used as input to Principal Component Analysis (PCA). The classification is based on reducing the feature dimension using PCA and deriving the support-vectors using Support Vector Machine (SVM). The experimental are performed on real and standard dataset. A very high level of classification accuracy is obtained in the result of classification.

Keywords: discrete wavelet transform, electroencephalogram, pattern recognition, principal component analysis, support vector machine

Procedia PDF Downloads 608
27282 Dual-Channel Reliable Breast Ultrasound Image Classification Based on Explainable Attribution and Uncertainty Quantification

Authors: Haonan Hu, Shuge Lei, Dasheng Sun, Huabin Zhang, Kehong Yuan, Jian Dai, Jijun Tang

Abstract:

This paper focuses on the classification task of breast ultrasound images and conducts research on the reliability measurement of classification results. A dual-channel evaluation framework was developed based on the proposed inference reliability and predictive reliability scores. For the inference reliability evaluation, human-aligned and doctor-agreed inference rationals based on the improved feature attribution algorithm SP-RISA are gracefully applied. Uncertainty quantification is used to evaluate the predictive reliability via the test time enhancement. The effectiveness of this reliability evaluation framework has been verified on the breast ultrasound clinical dataset YBUS, and its robustness is verified on the public dataset BUSI. The expected calibration errors on both datasets are significantly lower than traditional evaluation methods, which proves the effectiveness of the proposed reliability measurement.

Keywords: medical imaging, ultrasound imaging, XAI, uncertainty measurement, trustworthy AI

Procedia PDF Downloads 61
27281 Keypoint Detection Method Based on Multi-Scale Feature Fusion of Attention Mechanism

Authors: Xiaoxiao Li, Shuangcheng Jia, Qian Li

Abstract:

Keypoint detection has always been a challenge in the field of image recognition. This paper proposes a novelty keypoint detection method which is called Multi-Scale Feature Fusion Convolutional Network with Attention (MFFCNA). We verified that the multi-scale features with the attention mechanism module have better feature expression capability. The feature fusion between different scales makes the information that the network model can express more abundant, and the network is easier to converge. On our self-made street sign corner dataset, we validate the MFFCNA model with an accuracy of 97.8% and a recall of 81%, which are 5 and 8 percentage points higher than the HRNet network, respectively. On the COCO dataset, the AP is 71.9%, and the AR is 75.3%, which are 3 points and 2 points higher than HRNet, respectively. Extensive experiments show that our method has a remarkable improvement in the keypoint recognition tasks, and the recognition effect is better than the existing methods. Moreover, our method can be applied not only to keypoint detection but also to image classification and semantic segmentation with good generality.

Keywords: keypoint detection, feature fusion, attention, semantic segmentation

Procedia PDF Downloads 96
27280 Event Extraction, Analysis, and Event Linking

Authors: Anam Alam, Rahim Jamaluddin Kanji

Abstract:

With the rapid growth of event in everywhere, event extraction has now become an important matter to retrieve the information from the unstructured data. One of the challenging problems is to extract the event from it. An event is an observable occurrence of interaction among entities. The paper investigates the effectiveness of event extraction capabilities of three software tools that are Wandora, Nitro and SPSS. We performed standard text mining techniques of these tools on the data sets of (i) Afghan War Diaries (AWD collection), (ii) MUC4 and (iii) WebKB. Information retrieval measures such as precision and recall which are computed under extensive set of experiments for Event Extraction. The experimental study analyzes the difference between events extracted by the software and human. This approach helps to construct an algorithm that will be applied for different machine learning methods.

Keywords: event extraction, Wandora, nitro, SPSS, event analysis, extraction method, AFG, Afghan War Diaries, MUC4, 4 universities, dataset, algorithm, precision, recall, evaluation

Procedia PDF Downloads 557
27279 Domain-Specific Deep Neural Network Model for Classification of Abnormalities on Chest Radiographs

Authors: Nkechinyere Joy Olawuyi, Babajide Samuel Afolabi, Bola Ibitoye

Abstract:

This study collected a preprocessed dataset of chest radiographs and formulated a deep neural network model for detecting abnormalities. It also evaluated the performance of the formulated model and implemented a prototype of the formulated model. This was with the view to developing a deep neural network model to automatically classify abnormalities in chest radiographs. In order to achieve the overall purpose of this research, a large set of chest x-ray images were sourced for and collected from the CheXpert dataset, which is an online repository of annotated chest radiographs compiled by the Machine Learning Research Group, Stanford University. The chest radiographs were preprocessed into a format that can be fed into a deep neural network. The preprocessing techniques used were standardization and normalization. The classification problem was formulated as a multi-label binary classification model, which used convolutional neural network architecture to make a decision on whether an abnormality was present or not in the chest radiographs. The classification model was evaluated using specificity, sensitivity, and Area Under Curve (AUC) score as the parameter. A prototype of the classification model was implemented using Keras Open source deep learning framework in Python Programming Language. The AUC ROC curve of the model was able to classify Atelestasis, Support devices, Pleural effusion, Pneumonia, A normal CXR (no finding), Pneumothorax, and Consolidation. However, Lung opacity and Cardiomegaly had a probability of less than 0.5 and thus were classified as absent. Precision, recall, and F1 score values were 0.78; this implies that the number of False Positive and False Negative is the same, revealing some measure of label imbalance in the dataset. The study concluded that the developed model is sufficient to classify abnormalities present in chest radiographs into present or absent.

Keywords: transfer learning, convolutional neural network, radiograph, classification, multi-label

Procedia PDF Downloads 87
27278 Author Profiling: Prediction of Learners’ Gender on a MOOC Platform Based on Learners’ Comments

Authors: Tahani Aljohani, Jialin Yu, Alexandra. I. Cristea

Abstract:

The more an educational system knows about a learner, the more personalised interaction it can provide, which leads to better learning. However, asking a learner directly is potentially disruptive, and often ignored by learners. Especially in the booming realm of MOOC Massive Online Learning platforms, only a very low percentage of users disclose demographic information about themselves. Thus, in this paper, we aim to predict learners’ demographic characteristics, by proposing an approach using linguistically motivated Deep Learning Architectures for Learner Profiling, particularly targeting gender prediction on a FutureLearn MOOC platform. Additionally, we tackle here the difficult problem of predicting the gender of learners based on their comments only – which are often available across MOOCs. The most common current approaches to text classification use the Long Short-Term Memory (LSTM) model, considering sentences as sequences. However, human language also has structures. In this research, rather than considering sentences as plain sequences, we hypothesise that higher semantic - and syntactic level sentence processing based on linguistics will render a richer representation. We thus evaluate, the traditional LSTM versus other bleeding edge models, which take into account syntactic structure, such as tree-structured LSTM, Stack-augmented Parser-Interpreter Neural Network (SPINN) and the Structure-Aware Tag Augmented model (SATA). Additionally, we explore using different word-level encoding functions. We have implemented these methods on Our MOOC dataset, which is the most performant one comparing with a public dataset on sentiment analysis that is further used as a cross-examining for the models' results.

Keywords: deep learning, data mining, gender predication, MOOCs

Procedia PDF Downloads 116