Search results for: inverse models of data envelopment analysis
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 43807

Search results for: inverse models of data envelopment analysis

43447 A Structuring and Classification Method for Assigning Application Areas to Suitable Digital Factory Models

Authors: R. Hellmuth

Abstract:

The method of factory planning has changed a lot, especially when it is about planning the factory building itself. Factory planning has the task of designing products, plants, processes, organization, areas, and the building of a factory. Regular restructuring is becoming more important in order to maintain the competitiveness of a factory. Restrictions in new areas, shorter life cycles of product and production technology as well as a VUCA world (Volatility, Uncertainty, Complexity and Ambiguity) lead to more frequent restructuring measures within a factory. A digital factory model is the planning basis for rebuilding measures and becomes an indispensable tool. Furthermore, digital building models are increasingly being used in factories to support facility management and manufacturing processes. The main research question of this paper is, therefore: What kind of digital factory model is suitable for the different areas of application during the operation of a factory? First, different types of digital factory models are investigated, and their properties and usabilities for use cases are analysed. Within the scope of investigation are point cloud models, building information models, photogrammetry models, and these enriched with sensor data are examined. It is investigated which digital models allow a simple integration of sensor data and where the differences are. Subsequently, possible application areas of digital factory models are determined by means of a survey and the respective digital factory models are assigned to the application areas. Finally, an application case from maintenance is selected and implemented with the help of the appropriate digital factory model. It is shown how a completely digitalized maintenance process can be supported by a digital factory model by providing information. Among other purposes, the digital factory model is used for indoor navigation, information provision, and display of sensor data. In summary, the paper shows a structuring of digital factory models that concentrates on the geometric representation of a factory building and its technical facilities. A practical application case is shown and implemented. Thus, the systematic selection of digital factory models with the corresponding application cases is evaluated.

Keywords: building information modeling, digital factory model, factory planning, maintenance

Procedia PDF Downloads 99
43446 Discrimination in Insurance Pricing: A Textual-Analysis Perspective

Authors: Ruijuan Bi

Abstract:

Discrimination in insurance pricing is a topic of increasing concern, particularly in the context of the rapid development of big data and artificial intelligence. There is a need to explore the various forms of discrimination, such as direct and indirect discrimination, proxy discrimination, algorithmic discrimination, and unfair discrimination, and understand their implications in insurance pricing models. This paper aims to analyze and interpret the definitions of discrimination in insurance pricing and explore measures to reduce discrimination. It utilizes a textual analysis methodology, which involves gathering qualitative data from relevant literature on definitions of discrimination. The research methodology focuses on exploring the various forms of discrimination and their implications in insurance pricing models. Through textual analysis, this paper identifies the specific characteristics and implications of each form of discrimination in the general insurance industry. This research contributes to the theoretical understanding of discrimination in insurance pricing. By analyzing and interpreting relevant literature, this paper provides insights into the definitions of discrimination and the laws and regulations surrounding it. This theoretical foundation can inform future empirical research on discrimination in insurance pricing using relevant theories of probability theory.

Keywords: algorithmic discrimination, direct and indirect discrimination, proxy discrimination, unfair discrimination, insurance pricing

Procedia PDF Downloads 57
43445 Engagement Analysis Using DAiSEE Dataset

Authors: Naman Solanki, Souraj Mondal

Abstract:

With the world moving towards online communication, the video datastore has exploded in the past few years. Consequently, it has become crucial to analyse participant’s engagement levels in online communication videos. Engagement prediction of people in videos can be useful in many domains, like education, client meetings, dating, etc. Video-level or frame-level prediction of engagement for a user involves the development of robust models that can capture facial micro-emotions efficiently. For the development of an engagement prediction model, it is necessary to have a widely-accepted standard dataset for engagement analysis. DAiSEE is one of the datasets which consist of in-the-wild data and has a gold standard annotation for engagement prediction. Earlier research done using the DAiSEE dataset involved training and testing standard models like CNN-based models, but the results were not satisfactory according to industry standards. In this paper, a multi-level classification approach has been introduced to create a more robust model for engagement analysis using the DAiSEE dataset. This approach has recorded testing accuracies of 0.638, 0.7728, 0.8195, and 0.866 for predicting boredom level, engagement level, confusion level, and frustration level, respectively.

Keywords: computer vision, engagement prediction, deep learning, multi-level classification

Procedia PDF Downloads 102
43444 Simulation of Optimal Runoff Hydrograph Using Ensemble of Radar Rainfall and Blending of Runoffs Model

Authors: Myungjin Lee, Daegun Han, Jongsung Kim, Soojun Kim, Hung Soo Kim

Abstract:

Recently, the localized heavy rainfall and typhoons are frequently occurred due to the climate change and the damage is becoming bigger. Therefore, we may need a more accurate prediction of the rainfall and runoff. However, the gauge rainfall has the limited accuracy in space. Radar rainfall is better than gauge rainfall for the explanation of the spatial variability of rainfall but it is mostly underestimated with the uncertainty involved. Therefore, the ensemble of radar rainfall was simulated using error structure to overcome the uncertainty and gauge rainfall. The simulated ensemble was used as the input data of the rainfall-runoff models for obtaining the ensemble of runoff hydrographs. The previous studies discussed about the accuracy of the rainfall-runoff model. Even if the same input data such as rainfall is used for the runoff analysis using the models in the same basin, the models can have different results because of the uncertainty involved in the models. Therefore, we used two models of the SSARR model which is the lumped model, and the Vflo model which is a distributed model and tried to simulate the optimum runoff considering the uncertainty of each rainfall-runoff model. The study basin is located in Han river basin and we obtained one integrated runoff hydrograph which is an optimum runoff hydrograph using the blending methods such as Multi-Model Super Ensemble (MMSE), Simple Model Average (SMA), Mean Square Error (MSE). From this study, we could confirm the accuracy of rainfall and rainfall-runoff model using ensemble scenario and various rainfall-runoff model and we can use this result to study flood control measure due to climate change. Acknowledgements: This work is supported by the Korea Agency for Infrastructure Technology Advancement(KAIA) grant funded by the Ministry of Land, Infrastructure and Transport (Grant 18AWMP-B083066-05).

Keywords: radar rainfall ensemble, rainfall-runoff models, blending method, optimum runoff hydrograph

Procedia PDF Downloads 263
43443 A Demonstration of How to Employ and Interpret Binary IRT Models Using the New IRT Procedure in SAS 9.4

Authors: Ryan A. Black, Stacey A. McCaffrey

Abstract:

Over the past few decades, great strides have been made towards improving the science in the measurement of psychological constructs. Item Response Theory (IRT) has been the foundation upon which statistical models have been derived to increase both precision and accuracy in psychological measurement. These models are now being used widely to develop and refine tests intended to measure an individual's level of academic achievement, aptitude, and intelligence. Recently, the field of clinical psychology has adopted IRT models to measure psychopathological phenomena such as depression, anxiety, and addiction. Because advances in IRT measurement models are being made so rapidly across various fields, it has become quite challenging for psychologists and other behavioral scientists to keep abreast of the most recent developments, much less learn how to employ and decide which models are the most appropriate to use in their line of work. In the same vein, IRT measurement models vary greatly in complexity in several interrelated ways including but not limited to the number of item-specific parameters estimated in a given model, the function which links the expected response and the predictor, response option formats, as well as dimensionality. As a result, inferior methods (a.k.a. Classical Test Theory methods) continue to be employed in efforts to measure psychological constructs, despite evidence showing that IRT methods yield more precise and accurate measurement. To increase the use of IRT methods, this study endeavors to provide a comprehensive overview of binary IRT models; that is, measurement models employed on test data consisting of binary response options (e.g., correct/incorrect, true/false, agree/disagree). Specifically, this study will cover the most basic binary IRT model, known as the 1-parameter logistic (1-PL) model dating back to over 50 years ago, up until the most recent complex, 4-parameter logistic (4-PL) model. Binary IRT models will be defined mathematically and the interpretation of each parameter will be provided. Next, all four binary IRT models will be employed on two sets of data: 1. Simulated data of N=500,000 subjects who responded to four dichotomous items and 2. A pilot analysis of real-world data collected from a sample of approximately 770 subjects who responded to four self-report dichotomous items pertaining to emotional consequences to alcohol use. Real-world data were based on responses collected on items administered to subjects as part of a scale-development study (NIDA Grant No. R44 DA023322). IRT analyses conducted on both the simulated data and analyses of real-world pilot will provide a clear demonstration of how to construct, evaluate, and compare binary IRT measurement models. All analyses will be performed using the new IRT procedure in SAS 9.4. SAS code to generate simulated data and analyses will be available upon request to allow for replication of results.

Keywords: instrument development, item response theory, latent trait theory, psychometrics

Procedia PDF Downloads 338
43442 Observational Study Reveals Inverse Relationship: Rising PM₂.₅ Concentrations Linked to Decreasing Muon Flux

Authors: Yashas Mattur, Jensen Coonradt

Abstract:

Muon flux, the rate of muons reaching Earth from the atmosphere, is impacted by various factors such as air pressure, temperature, and humidity. However, the influence of concentrations of PM₂.₅ (particulate matter with diameters 2.5 mm or smaller) on muon detection rates remains unexplored. During the summer of 2023, smoke from Canadian wildfires (containing significant amounts of particulate matter) blew over regions in the Northern US, introducing huge fluctuations in PM₂.₅ concentrations, thus inspiring our experiment to investigate the correlation of PM₂.₅ concentrations and muon rates. To investigate this correlation, muon collision rates were measured and analyzed alongside PM₂.₅ concentration data over the periods of both light and heavy smoke. Other confounding variables, including temperature, humidity, and atmospheric pressure, were also considered. The results reveal a statistically significant inverse correlation between muon flux and PM₂.₅ concentrations, indicating that particulate matter has an impact on the rate of muons reaching the earth’s surface.

Keywords: Muon Flux, atmospheric effects on muons, PM₂.₅, airborne particulate matter

Procedia PDF Downloads 59
43441 An As-Is Analysis and Approach for Updating Building Information Models and Laser Scans

Authors: Rene Hellmuth

Abstract:

Factory planning has the task of designing products, plants, processes, organization, areas, and the construction of a factory. The requirements for factory planning and the building of a factory have changed in recent years. Regular restructuring of the factory building is becoming more important in order to maintain the competitiveness of a factory. Restrictions in new areas, shorter life cycles of product and production technology as well as a VUCA world (Volatility, Uncertainty, Complexity & Ambiguity) lead to more frequent restructuring measures within a factory. A building information model (BIM) is the planning basis for rebuilding measures and becomes an indispensable data repository to be able to react quickly to changes. Use as a planning basis for restructuring measures in factories only succeeds if the BIM model has adequate data quality. Under this aspect and the industrial requirement, three data quality factors are particularly important for this paper regarding the BIM model: up-to-dateness, completeness, and correctness. The research question is: how can a BIM model be kept up to date with required data quality and which visualization techniques can be applied in a short period of time on the construction site during conversion measures? An as-is analysis is made of how BIM models and digital factory models (including laser scans) are currently being kept up to date. Industrial companies are interviewed, and expert interviews are conducted. Subsequently, the results are evaluated, and a procedure conceived how cost-effective and timesaving updating processes can be carried out. The availability of low-cost hardware and the simplicity of the process are of importance to enable service personnel from facility mnagement to keep digital factory models (BIM models and laser scans) up to date. The approach includes the detection of changes to the building, the recording of the changing area, and the insertion into the overall digital twin. Finally, an overview of the possibilities for visualizations suitable for construction sites is compiled. An augmented reality application is created based on an updated BIM model of a factory and installed on a tablet. Conversion scenarios with costs and time expenditure are displayed. A user interface is designed in such a way that all relevant conversion information is available at a glance for the respective conversion scenario. A total of three essential research results are achieved: As-is analysis of current update processes for BIM models and laser scans, development of a time-saving and cost-effective update process and the conception and implementation of an augmented reality solution for BIM models suitable for construction sites.

Keywords: building information modeling, digital factory model, factory planning, restructuring

Procedia PDF Downloads 100
43440 The Effect of Artificial Intelligence on the Production of Agricultural Lands and Labor

Authors: Ibrahim Makram Ibrahim Salib

Abstract:

Agriculture plays an essential role in providing food for the world's population. It also offers numerous benefits to countries, including non-food products, transportation, and environmental balance. Precision agriculture, which employs advanced tools to monitor variability and manage inputs, can help achieve these benefits. The increasing demand for food security puts pressure on decision-makers to ensure sufficient food production worldwide. To support sustainable agriculture, unmanned aerial vehicles (UAVs) can be utilized to manage farms and increase yields. This paper aims to provide an understanding of UAV usage and its applications in agriculture. The objective is to review the various applications of UAVs in agriculture. Based on a comprehensive review of existing research, it was found that different sensors provide varying analyses for agriculture applications. Therefore, the purpose of the project must be determined before using UAV technology for better data quality and analysis. In conclusion, identifying a suitable sensor and UAV is crucial to gather accurate data and precise analysis when using UAVs in agriculture.

Keywords: agriculture land, agriculture land loss, Kabul city, urban land expansion, urbanization agriculture yield growth, agriculture yield prediction, explorative data analysis, predictive models, regression models drone, precision agriculture, farmer income

Procedia PDF Downloads 55
43439 A Comparative Study of Regional Climate Models and Global Coupled Models over Uttarakhand

Authors: Sudip Kumar Kundu, Charu Singh

Abstract:

As a great physiographic divide, the Himalayas affecting a large system of water and air circulation which helps to determine the climatic condition in the Indian subcontinent to the south and mid-Asian highlands to the north. It creates obstacles by defending chill continental air from north side into India in winter and also defends rain-bearing southwesterly monsoon to give up maximum precipitation in that area in monsoon season. Nowadays extreme weather conditions such as heavy precipitation, cloudburst, flash flood, landslide and extreme avalanches are the regular happening incidents in the region of North Western Himalayan (NWH). The present study has been planned to investigate the suitable model(s) to find out the rainfall pattern over that region. For this investigation, selected models from Coordinated Regional Climate Downscaling Experiment (CORDEX) and Coupled Model Intercomparison Project Phase 5 (CMIP5) has been utilized in a consistent framework for the period of 1976 to 2000 (historical). The ability of these driving models from CORDEX domain and CMIP5 has been examined according to their capability of the spatial distribution as well as time series plot of rainfall over NWH in the rainy season and compared with the ground-based Indian Meteorological Department (IMD) gridded rainfall data set. It is noted from the analysis that the models like MIROC5 and MPI-ESM-LR from the both CORDEX and CMIP5 provide the best spatial distribution of rainfall over NWH region. But the driving models from CORDEX underestimates the daily rainfall amount as compared to CMIP5 driving models as it is unable to capture daily rainfall data properly when it has been plotted for time series (TS) individually for the state of Uttarakhand (UK) and Himachal Pradesh (HP). So finally it can be said that the driving models from CMIP5 are better than CORDEX domain models to investigate the rainfall pattern over NWH region.

Keywords: global warming, rainfall, CMIP5, CORDEX, NWH

Procedia PDF Downloads 156
43438 Cirrhosis Mortality Prediction as Classification using Frequent Subgraph Mining

Authors: Abdolghani Ebrahimi, Diego Klabjan, Chenxi Ge, Daniela Ladner, Parker Stride

Abstract:

In this work, we use machine learning and novel data analysis techniques to predict the one-year mortality of cirrhotic patients. Data from 2,322 patients with liver cirrhosis are collected at a single medical center. Different machine learning models are applied to predict one-year mortality. A comprehensive feature space including demographic information, comorbidity, clinical procedure and laboratory tests is being analyzed. A temporal pattern mining technic called Frequent Subgraph Mining (FSM) is being used. Model for End-stage liver disease (MELD) prediction of mortality is used as a comparator. All of our models statistically significantly outperform the MELD-score model and show an average 10% improvement of the area under the curve (AUC). The FSM technic itself does not improve the model significantly, but FSM, together with a machine learning technique called an ensemble, further improves the model performance. With the abundance of data available in healthcare through electronic health records (EHR), existing predictive models can be refined to identify and treat patients at risk for higher mortality. However, due to the sparsity of the temporal information needed by FSM, the FSM model does not yield significant improvements. To the best of our knowledge, this is the first work to apply modern machine learning algorithms and data analysis methods on predicting one-year mortality of cirrhotic patients and builds a model that predicts one-year mortality significantly more accurate than the MELD score. We have also tested the potential of FSM and provided a new perspective of the importance of clinical features.

Keywords: machine learning, liver cirrhosis, subgraph mining, supervised learning

Procedia PDF Downloads 126
43437 Characterizing the Spatially Distributed Differences in the Operational Performance of Solar Power Plants Considering Input Volatility: Evidence from China

Authors: Bai-Chen Xie, Xian-Peng Chen

Abstract:

China has become the world's largest energy producer and consumer, and its development of renewable energy is of great significance to global energy governance and the fight against climate change. The rapid growth of solar power in China could help achieve its ambitious carbon peak and carbon neutrality targets early. However, the non-technical costs of solar power in China are much higher than at international levels, meaning that inefficiencies are rooted in poor management and improper policy design and that efficiency distortions have become a serious challenge to the sustainable development of the renewable energy industry. Unlike fossil energy generation technologies, the output of solar power is closely related to the volatile solar resource, and the spatial unevenness of solar resource distribution leads to potential efficiency spatial distribution differences. It is necessary to develop an efficiency evaluation method that considers the volatility of solar resources and explores the mechanism of the influence of natural geography and social environment on the spatially varying characteristics of efficiency distribution to uncover the root causes of managing inefficiencies. The study sets solar resources as stochastic inputs, introduces a chance-constrained data envelopment analysis model combined with the directional distance function, and measures the solar resource utilization efficiency of 222 solar power plants in representative photovoltaic bases in northwestern China. By the meta-frontier analysis, we measured the characteristics of different power plant clusters and compared the differences among groups, discussed the mechanism of environmental factors influencing inefficiencies, and performed statistical tests through the system generalized method of moments. Rational localization of power plants is a systematic project that requires careful consideration of the full utilization of solar resources, low transmission costs, and power consumption guarantee. Suitable temperature, precipitation, and wind speed can improve the working performance of photovoltaic modules, reasonable terrain inclination can reduce land cost, and the proximity to cities strongly guarantees the consumption of electricity. The density of electricity demand and high-tech industries is more important than resource abundance because they trigger the clustering of power plants to result in a good demonstration and competitive effect. To ensure renewable energy consumption, increased support for rural grids and encouraging direct trading between generators and neighboring users will provide solutions. The study will provide proposals for improving the full life-cycle operational activities of solar power plants in China to reduce high non-technical costs and improve competitiveness against fossil energy sources.

Keywords: solar power plants, environmental factors, data envelopment analysis, efficiency evaluation

Procedia PDF Downloads 76
43436 Frailty Models for Modeling Heterogeneity: Simulation Study and Application to Quebec Pension Plan

Authors: Souad Romdhane, Lotfi Belkacem

Abstract:

When referring to actuarial analysis of lifetime, only models accounting for observable risk factors have been developed. Within this context, Cox proportional hazards model (CPH model) is commonly used to assess the effects of observable covariates as gender, age, smoking habits, on the hazard rates. These covariates may fail to fully account for the true lifetime interval. This may be due to the existence of another random variable (frailty) that is still being ignored. The aim of this paper is to examine the shared frailty issue in the Cox proportional hazard model by including two different parametric forms of frailty into the hazard function. Four estimated methods are used to fit them. The performance of the parameter estimates is assessed and compared between the classical Cox model and these frailty models through a real-life data set from the Quebec Pension Plan and then using a more general simulation study. This performance is investigated in terms of the bias of point estimates and their empirical standard errors in both fixed and random effect parts. Both the simulation and the real dataset studies showed differences between classical Cox model and shared frailty model.

Keywords: life insurance-pension plan, survival analysis, risk factors, cox proportional hazards model, multivariate failure-time data, shared frailty, simulations study

Procedia PDF Downloads 349
43435 Web Data Scraping Technology Using Term Frequency Inverse Document Frequency to Enhance the Big Data Quality on Sentiment Analysis

Authors: Sangita Pokhrel, Nalinda Somasiri, Rebecca Jeyavadhanam, Swathi Ganesan

Abstract:

Tourism is a booming industry with huge future potential for global wealth and employment. There are countless data generated over social media sites every day, creating numerous opportunities to bring more insights to decision-makers. The integration of Big Data Technology into the tourism industry will allow companies to conclude where their customers have been and what they like. This information can then be used by businesses, such as those in charge of managing visitor centers or hotels, etc., and the tourist can get a clear idea of places before visiting. The technical perspective of natural language is processed by analysing the sentiment features of online reviews from tourists, and we then supply an enhanced long short-term memory (LSTM) framework for sentiment feature extraction of travel reviews. We have constructed a web review database using a crawler and web scraping technique for experimental validation to evaluate the effectiveness of our methodology. The text form of sentences was first classified through Vader and Roberta model to get the polarity of the reviews. In this paper, we have conducted study methods for feature extraction, such as Count Vectorization and TFIDF Vectorization, and implemented Convolutional Neural Network (CNN) classifier algorithm for the sentiment analysis to decide the tourist’s attitude towards the destinations is positive, negative, or simply neutral based on the review text that they posted online. The results demonstrated that from the CNN algorithm, after pre-processing and cleaning the dataset, we received an accuracy of 96.12% for the positive and negative sentiment analysis.

Keywords: counter vectorization, convolutional neural network, crawler, data technology, long short-term memory, web scraping, sentiment analysis

Procedia PDF Downloads 76
43434 Evaluation of QSRR Models by Sum of Ranking Differences Approach: A Case Study of Prediction of Chromatographic Behavior of Pesticides

Authors: Lidija R. Jevrić, Sanja O. Podunavac-Kuzmanović, Strahinja Z. Kovačević

Abstract:

The present study deals with the selection of the most suitable quantitative structure-retention relationship (QSRR) models which should be used in prediction of the retention behavior of basic, neutral, acidic and phenolic pesticides which belong to different classes: fungicides, herbicides, metabolites, insecticides and plant growth regulators. Sum of ranking differences (SRD) approach can give a different point of view on selection of the most consistent QSRR model. SRD approach can be applied not only for ranking of the QSRR models, but also for detection of similarity or dissimilarity among them. Applying the SRD analysis, the most similar models can be found easily. In this study, selection of the best model was carried out on the basis of the reference ranking (“golden standard”) which was defined as the row average values of logarithm of retention time (logtr) defined by high performance liquid chromatography (HPLC). Also, SRD analysis based on experimental logtr values as reference ranking revealed similar grouping of the established QSRR models already obtained by hierarchical cluster analysis (HCA).

Keywords: chemometrics, chromatography, pesticides, sum of ranking differences

Procedia PDF Downloads 364
43433 Phenotype Prediction of DNA Sequence Data: A Machine and Statistical Learning Approach

Authors: Mpho Mokoatle, Darlington Mapiye, James Mashiyane, Stephanie Muller, Gciniwe Dlamini

Abstract:

Great advances in high-throughput sequencing technologies have resulted in availability of huge amounts of sequencing data in public and private repositories, enabling a holistic understanding of complex biological phenomena. Sequence data are used for a wide range of applications such as gene annotations, expression studies, personalized treatment and precision medicine. However, this rapid growth in sequence data poses a great challenge which calls for novel data processing and analytic methods, as well as huge computing resources. In this work, a machine and statistical learning approach for DNA sequence classification based on $k$-mer representation of sequence data is proposed. The approach is tested using whole genome sequences of Mycobacterium tuberculosis (MTB) isolates to (i) reduce the size of genomic sequence data, (ii) identify an optimum size of k-mers and utilize it to build classification models, (iii) predict the phenotype from whole genome sequence data of a given bacterial isolate, and (iv) demonstrate computing challenges associated with the analysis of whole genome sequence data in producing interpretable and explainable insights. The classification models were trained on 104 whole genome sequences of MTB isoloates. Cluster analysis showed that k-mers maybe used to discriminate phenotypes and the discrimination becomes more concise as the size of k-mers increase. The best performing classification model had a k-mer size of 10 (longest k-mer) an accuracy, recall, precision, specificity, and Matthews Correlation coeffient of 72.0%, 80.5%, 80.5%, 63.6%, and 0.4 respectively. This study provides a comprehensive approach for resampling whole genome sequencing data, objectively selecting a k-mer size, and performing classification for phenotype prediction. The analysis also highlights the importance of increasing the k-mer size to produce more biological explainable results, which brings to the fore the interplay that exists amongst accuracy, computing resources and explainability of classification results. However, the analysis provides a new way to elucidate genetic information from genomic data, and identify phenotype relationships which are important especially in explaining complex biological mechanisms.

Keywords: AWD-LSTM, bootstrapping, k-mers, next generation sequencing

Procedia PDF Downloads 155
43432 Phenotype Prediction of DNA Sequence Data: A Machine and Statistical Learning Approach

Authors: Darlington Mapiye, Mpho Mokoatle, James Mashiyane, Stephanie Muller, Gciniwe Dlamini

Abstract:

Great advances in high-throughput sequencing technologies have resulted in availability of huge amounts of sequencing data in public and private repositories, enabling a holistic understanding of complex biological phenomena. Sequence data are used for a wide range of applications such as gene annotations, expression studies, personalized treatment and precision medicine. However, this rapid growth in sequence data poses a great challenge which calls for novel data processing and analytic methods, as well as huge computing resources. In this work, a machine and statistical learning approach for DNA sequence classification based on k-mer representation of sequence data is proposed. The approach is tested using whole genome sequences of Mycobacterium tuberculosis (MTB) isolates to (i) reduce the size of genomic sequence data, (ii) identify an optimum size of k-mers and utilize it to build classification models, (iii) predict the phenotype from whole genome sequence data of a given bacterial isolate, and (iv) demonstrate computing challenges associated with the analysis of whole genome sequence data in producing interpretable and explainable insights. The classification models were trained on 104 whole genome sequences of MTB isoloates. Cluster analysis showed that k-mers maybe used to discriminate phenotypes and the discrimination becomes more concise as the size of k-mers increase. The best performing classification model had a k-mer size of 10 (longest k-mer) an accuracy, recall, precision, specificity, and Matthews Correlation coeffient of 72.0 %, 80.5 %, 80.5 %, 63.6 %, and 0.4 respectively. This study provides a comprehensive approach for resampling whole genome sequencing data, objectively selecting a k-mer size, and performing classification for phenotype prediction. The analysis also highlights the importance of increasing the k-mer size to produce more biological explainable results, which brings to the fore the interplay that exists amongst accuracy, computing resources and explainability of classification results. However, the analysis provides a new way to elucidate genetic information from genomic data, and identify phenotype relationships which are important especially in explaining complex biological mechanisms

Keywords: AWD-LSTM, bootstrapping, k-mers, next generation sequencing

Procedia PDF Downloads 141
43431 Observed Changes in Constructed Precipitation at High Resolution in Southern Vietnam

Authors: Nguyen Tien Thanh, Günter Meon

Abstract:

Precipitation plays a key role in water cycle, defining the local climatic conditions and in ecosystem. It is also an important input parameter for water resources management and hydrologic models. With spatial continuous data, a certainty of discharge predictions or other environmental factors is unquestionably better than without. This is, however, not always willingly available to acquire for a small basin, especially for coastal region in Vietnam due to a low network of meteorological stations (30 stations) on long coast of 3260 km2. Furthermore, available gridded precipitation datasets are not fine enough when applying to hydrologic models. Under conditions of global warming, an application of spatial interpolation methods is a crucial for the climate change impact studies to obtain the spatial continuous data. In recent research projects, although some methods can perform better than others do, no methods draw the best results for all cases. The objective of this paper therefore, is to investigate different spatial interpolation methods for daily precipitation over a small basin (approximately 400 km2) located in coastal region, Southern Vietnam and find out the most efficient interpolation method on this catchment. The five different interpolation methods consisting of cressman, ordinary kriging, regression kriging, dual kriging and inverse distance weighting have been applied to identify the best method for the area of study on the spatio-temporal scale (daily, 10 km x 10 km). A 30-year precipitation database was created and merged into available gridded datasets. Finally, observed changes in constructed precipitation were performed. The results demonstrate that the method of ordinary kriging interpolation is an effective approach to analyze the daily precipitation. The mixed trends of increasing and decreasing monthly, seasonal and annual precipitation have documented at significant levels.

Keywords: interpolation, precipitation, trend, vietnam

Procedia PDF Downloads 269
43430 Benchmarking Machine Learning Approaches for Forecasting Hotel Revenue

Authors: Rachel Y. Zhang, Christopher K. Anderson

Abstract:

A critical aspect of revenue management is a firm’s ability to predict demand as a function of price. Historically hotels have used simple time series models (regression and/or pick-up based models) owing to the complexities of trying to build casual models of demands. Machine learning approaches are slowly attracting attention owing to their flexibility in modeling relationships. This study provides an overview of approaches to forecasting hospitality demand – focusing on the opportunities created by machine learning approaches, including K-Nearest-Neighbors, Support vector machine, Regression Tree, and Artificial Neural Network algorithms. The out-of-sample performances of above approaches to forecasting hotel demand are illustrated by using a proprietary sample of the market level (24 properties) transactional data for Las Vegas NV. Causal predictive models can be built and evaluated owing to the availability of market level (versus firm level) data. This research also compares and contrast model accuracy of firm-level models (i.e. predictive models for hotel A only using hotel A’s data) to models using market level data (prices, review scores, location, chain scale, etc… for all hotels within the market). The prospected models will be valuable for hotel revenue prediction given the basic characters of a hotel property or can be applied in performance evaluation for an existed hotel. The findings will unveil the features that play key roles in a hotel’s revenue performance, which would have considerable potential usefulness in both revenue prediction and evaluation.

Keywords: hotel revenue, k-nearest-neighbors, machine learning, neural network, prediction model, regression tree, support vector machine

Procedia PDF Downloads 119
43429 Seafloor and Sea Surface Modelling in the East Coast Region of North America

Authors: Magdalena Idzikowska, Katarzyna Pająk, Kamil Kowalczyk

Abstract:

Seafloor topography is a fundamental issue in geological, geophysical, and oceanographic studies. Single-beam or multibeam sonars attached to the hulls of ships are used to emit a hydroacoustic signal from transducers and reproduce the topography of the seabed. This solution provides relevant accuracy and spatial resolution. Bathymetric data from ships surveys provides National Centers for Environmental Information – National Oceanic and Atmospheric Administration. Unfortunately, most of the seabed is still unidentified, as there are still many gaps to be explored between ship survey tracks. Moreover, such measurements are very expensive and time-consuming. The solution is raster bathymetric models shared by The General Bathymetric Chart of the Oceans. The offered products are a compilation of different sets of data - raw or processed. Indirect data for the development of bathymetric models are also measurements of gravity anomalies. Some forms of seafloor relief (e.g. seamounts) increase the force of the Earth's pull, leading to changes in the sea surface. Based on satellite altimetry data, Sea Surface Height and marine gravity anomalies can be estimated, and based on the anomalies, it’s possible to infer the structure of the seabed. The main goal of the work is to create regional bathymetric models and models of the sea surface in the area of the east coast of North America – a region of seamounts and undulating seafloor. The research includes an analysis of the methods and techniques used, an evaluation of the interpolation algorithms used, model thickening, and the creation of grid models. Obtained data are raster bathymetric models in NetCDF format, survey data from multibeam soundings in MB-System format, and satellite altimetry data from Copernicus Marine Environment Monitoring Service. The methodology includes data extraction, processing, mapping, and spatial analysis. Visualization of the obtained results was carried out with Geographic Information System tools. The result is an extension of the state of the knowledge of the quality and usefulness of the data used for seabed and sea surface modeling and knowledge of the accuracy of the generated models. Sea level is averaged over time and space (excluding waves, tides, etc.). Its changes, along with knowledge of the topography of the ocean floor - inform us indirectly about the volume of the entire water ocean. The true shape of the ocean surface is further varied by such phenomena as tides, differences in atmospheric pressure, wind systems, thermal expansion of water, or phases of ocean circulation. Depending on the location of the point, the higher the depth, the lower the trend of sea level change. Studies show that combining data sets, from different sources, with different accuracies can affect the quality of sea surface and seafloor topography models.

Keywords: seafloor, sea surface height, bathymetry, satellite altimetry

Procedia PDF Downloads 67
43428 Analysis of Operating Speed on Four-Lane Divided Highways under Mixed Traffic Conditions

Authors: Chaitanya Varma, Arpan Mehar

Abstract:

The present study demonstrates the procedure to analyse speed data collected on various four-lane divided sections in India. Field data for the study was collected at different straight and curved sections on rural highways with the help of radar speed gun and video camera. The data collected at the sections were analysed and parameters pertain to speed distributions were estimated. The different statistical distribution was analysed on vehicle type speed data and for mixed traffic speed data. It was found that vehicle type speed data was either follows the normal distribution or Log-normal distribution, whereas the mixed traffic speed data follows more than one type of statistical distribution. The most common fit observed on mixed traffic speed data were Beta distribution and Weibull distribution. The separate operating speed model based on traffic and roadway geometric parameters were proposed in the present study. The operating speed model with traffic parameters and curve geometry parameters were established. Two different operating speed models were proposed with variables 1/R and Ln(R) and were found to be realistic with a different range of curve radius. The models developed in the present study are simple and realistic and can be used for forecasting operating speed on four-lane highways.

Keywords: highway, mixed traffic flow, modeling, operating speed

Procedia PDF Downloads 454
43427 Stability Analysis of Modelling the Effect of Vaccination and Novel Quarantine-Adjusted Incidence on the Spread of Newcastle Disease

Authors: Nurudeen O. Lasisi, Sirajo Abdulrahman, Abdulkareem A. Ibrahim

Abstract:

Newcastle disease is an infection of domestic poultry and other bird species with the virulent Newcastle disease virus (NDV). In this paper, we study the dynamics of the modeling of the Newcastle disease virus (NDV) using a novel quarantine-adjusted incidence. The comparison of Vaccination, linear incident rate and novel quarantine-adjusted incident rate in the models are discussed. The dynamics of the models yield disease-free and endemic equilibrium states.The effective reproduction numbers of the models are computed in order to measure the relative impact of an individual bird or combined intervention for effective disease control. We showed the local and global stability of endemic equilibrium states of the models and we found that the stability of endemic equilibrium states of models are globally asymptotically stable if the effective reproduction numbers of the models equations are greater than a unit.

Keywords: effective reproduction number, Endemic state, Mathematical model, Newcastle disease virus, novel quarantine-adjusted incidence, stability analysis

Procedia PDF Downloads 102
43426 In-Context Meta Learning for Automatic Designing Pretext Tasks for Self-Supervised Image Analysis

Authors: Toktam Khatibi

Abstract:

Self-supervised learning (SSL) includes machine learning models that are trained on one aspect and/or one part of the input to learn other aspects and/or part of it. SSL models are divided into two different categories, including pre-text task-based models and contrastive learning ones. Pre-text tasks are some auxiliary tasks learning pseudo-labels, and the trained models are further fine-tuned for downstream tasks. However, one important disadvantage of SSL using pre-text task solving is defining an appropriate pre-text task for each image dataset with a variety of image modalities. Therefore, it is required to design an appropriate pretext task automatically for each dataset and each downstream task. To the best of our knowledge, the automatic designing of pretext tasks for image analysis has not been considered yet. In this paper, we present a framework based on In-context learning that describes each task based on its input and output data using a pre-trained image transformer. Our proposed method combines the input image and its learned description for optimizing the pre-text task design and its hyper-parameters using Meta-learning models. The representations learned from the pre-text tasks are fine-tuned for solving the downstream tasks. We demonstrate that our proposed framework outperforms the compared ones on unseen tasks and image modalities in addition to its superior performance for previously known tasks and datasets.

Keywords: in-context learning (ICL), meta learning, self-supervised learning (SSL), vision-language domain, transformers

Procedia PDF Downloads 67
43425 A Comparative Analysis of Machine Learning Techniques for PM10 Forecasting in Vilnius

Authors: Mina Adel Shokry Fahim, Jūratė Sužiedelytė Visockienė

Abstract:

With the growing concern over air pollution (AP), it is clear that this has gained more prominence than ever before. The level of consciousness has increased and a sense of knowledge now has to be forwarded as a duty by those enlightened enough to disseminate it to others. This realisation often comes after an understanding of how poor air quality indices (AQI) damage human health. The study focuses on assessing air pollution prediction models specifically for Lithuania, addressing a substantial need for empirical research within the region. Concentrating on Vilnius, it specifically examines particulate matter concentrations 10 micrometers or less in diameter (PM10). Utilizing Gaussian Process Regression (GPR) and Regression Tree Ensemble, and Regression Tree methodologies, predictive forecasting models are validated and tested using hourly data from January 2020 to December 2022. The study explores the classification of AP data into anthropogenic and natural sources, the impact of AP on human health, and its connection to cardiovascular diseases. The study revealed varying levels of accuracy among the models, with GPR achieving the highest accuracy, indicated by an RMSE of 4.14 in validation and 3.89 in testing.

Keywords: air pollution, anthropogenic and natural sources, machine learning, Gaussian process regression, tree ensemble, forecasting models, particulate matter

Procedia PDF Downloads 41
43424 An Exploratory Sequential Design: A Mixed Methods Model for the Statistics Learning Assessment with a Bayesian Network Representation

Authors: Zhidong Zhang

Abstract:

This study established a mixed method model in assessing statistics learning with Bayesian network models. There are three variants in exploratory sequential designs. There are three linked steps in one of the designs: qualitative data collection and analysis, quantitative measure, instrument, intervention, and quantitative data collection analysis. The study used a scoring model of analysis of variance (ANOVA) as a content domain. The research study is to examine students’ learning in both semantic and performance aspects at fine grain level. The ANOVA score model, y = α+ βx1 + γx1+ ε, as a cognitive task to collect data during the student learning process. When the learning processes were decomposed into multiple steps in both semantic and performance aspects, a hierarchical Bayesian network was established. This is a theory-driven process. The hierarchical structure was gained based on qualitative cognitive analysis. The data from students’ ANOVA score model learning was used to give evidence to the hierarchical Bayesian network model from the evidential variables. Finally, the assessment results of students’ ANOVA score model learning were reported. Briefly, this was a mixed method research design applied to statistics learning assessment. The mixed methods designs expanded more possibilities for researchers to establish advanced quantitative models initially with a theory-driven qualitative mode.

Keywords: exploratory sequential design, ANOVA score model, Bayesian network model, mixed methods research design, cognitive analysis

Procedia PDF Downloads 155
43423 Multimedia Data Fusion for Event Detection in Twitter by Using Dempster-Shafer Evidence Theory

Authors: Samar M. Alqhtani, Suhuai Luo, Brian Regan

Abstract:

Data fusion technology can be the best way to extract useful information from multiple sources of data. It has been widely applied in various applications. This paper presents a data fusion approach in multimedia data for event detection in twitter by using Dempster-Shafer evidence theory. The methodology applies a mining algorithm to detect the event. There are two types of data in the fusion. The first is features extracted from text by using the bag-ofwords method which is calculated using the term frequency-inverse document frequency (TF-IDF). The second is the visual features extracted by applying scale-invariant feature transform (SIFT). The Dempster - Shafer theory of evidence is applied in order to fuse the information from these two sources. Our experiments have indicated that comparing to the approaches using individual data source, the proposed data fusion approach can increase the prediction accuracy for event detection. The experimental result showed that the proposed method achieved a high accuracy of 0.97, comparing with 0.93 with texts only, and 0.86 with images only.

Keywords: data fusion, Dempster-Shafer theory, data mining, event detection

Procedia PDF Downloads 397
43422 A Comparative Analysis of E-Government Quality Models

Authors: Abdoullah Fath-Allah, Laila Cheikhi, Rafa E. Al-Qutaish, Ali Idri

Abstract:

Many quality models have been used to measure e-government portals quality. However, the absence of an international consensus for e-government portals quality models results in many differences in terms of quality attributes and measures. The aim of this paper is to compare and analyze the existing e-government quality models proposed in literature (those that are based on ISO standards and those that are not) in order to propose guidelines to build a good and useful e-government portals quality model. Our findings show that, there is no e-government portal quality model based on the new international standard ISO 25010. Besides that, the quality models are not based on a best practice model to allow agencies to both; measure e-government portals quality and identify missing best practices for those portals.

Keywords: e-government, portal, best practices, quality model, ISO, standard, ISO 25010, ISO 9126

Procedia PDF Downloads 542
43421 Copula-Based Estimation of Direct and Indirect Effects in Path Analysis Model

Authors: Alam Ali, Ashok Kumar Pathak

Abstract:

Path analysis is a statistical technique used to evaluate the strength of the direct and indirect effects of variables. One or more structural regression equations are used to estimate a series of parameters in order to find the better fit of data. Sometimes, exogenous variables do not show a significant strength of their direct and indirect effect when the assumption of classical regression (ordinary least squares (OLS)) are violated by the nature of the data. The main motive of this article is to investigate the efficacy of the copula-based regression approach over the classical regression approach and calculate the direct and indirect effects of variables when data violates the OLS assumption and variables are linked through an elliptical copula. We perform this study using a well-organized numerical scheme. Finally, a real data application is also presented to demonstrate the performance of the superiority of the copula approach.

Keywords: path analysis, copula-based regression models, direct and indirect effects, k-fold cross validation technique

Procedia PDF Downloads 60
43420 Statistical Assessment of Models for Determination of Soil–Water Characteristic Curves of Sand Soils

Authors: S. J. Matlan, M. Mukhlisin, M. R. Taha

Abstract:

Characterization of the engineering behavior of unsaturated soil is dependent on the soil-water characteristic curve (SWCC), a graphical representation of the relationship between water content or degree of saturation and soil suction. A reasonable description of the SWCC is thus important for the accurate prediction of unsaturated soil parameters. The measurement procedures for determining the SWCC, however, are difficult, expensive, and time-consuming. During the past few decades, researchers have laid a major focus on developing empirical equations for predicting the SWCC, with a large number of empirical models suggested. One of the most crucial questions is how precisely existing equations can represent the SWCC. As different models have different ranges of capability, it is essential to evaluate the precision of the SWCC models used for each particular soil type for better SWCC estimation. It is expected that better estimation of SWCC would be achieved via a thorough statistical analysis of its distribution within a particular soil class. With this in view, a statistical analysis was conducted in order to evaluate the reliability of the SWCC prediction models against laboratory measurement. Optimization techniques were used to obtain the best-fit of the model parameters in four forms of SWCC equation, using laboratory data for relatively coarse-textured (i.e., sandy) soil. The four most prominent SWCCs were evaluated and computed for each sample. The result shows that the Brooks and Corey model is the most consistent in describing the SWCC for sand soil type. The Brooks and Corey model prediction also exhibit compatibility with samples ranging from low to high soil water content in which subjected to the samples that evaluated in this study.

Keywords: soil-water characteristic curve (SWCC), statistical analysis, unsaturated soil, geotechnical engineering

Procedia PDF Downloads 326
43419 Estimation of Geotechnical Parameters by Comparing Monitoring Data with Numerical Results: Case Study of Arash–Esfandiar-Niayesh Under-Passing Tunnel, Africa Tunnel, Tehran, Iran

Authors: Aliakbar Golshani, Seyyed Mehdi Poorhashemi, Mahsa Gharizadeh

Abstract:

The under passing tunnels are strongly influenced by the soils around. There are some complexities in the specification of real soil behavior, owing to the fact that lots of uncertainties exist in soil properties, and additionally, inappropriate soil constitutive models. Such mentioned factors may cause incompatible settlements in numerical analysis with the obtained values in actual construction. This paper aims to report a case study on a specific tunnel constructed by NATM. The tunnel has a depth of 11.4 m, height of 12.2 m, and width of 14.4 m with 2.5 lanes. The numerical modeling was based on a 2D finite element program. The soil material behavior was modeled by hardening soil model. According to the field observations, the numerical estimated settlement at the ground surface was approximately four times more than the measured one, after the entire installation of the initial lining, indicating that some unknown factors affect the values. Consequently, the geotechnical parameters are accurately revised by a numerical back-analysis using laboratory and field test data and based on the obtained monitoring data. The obtained result confirms that typically, the soil parameters are conservatively low-estimated. And additionally, the constitutive models cannot be applied properly for all soil conditions.

Keywords: NATM tunnel, initial lining, laboratory test data, numerical back-analysis

Procedia PDF Downloads 353
43418 Multi-Source Data Fusion for Urban Comprehensive Management

Authors: Bolin Hua

Abstract:

In city governance, various data are involved, including city component data, demographic data, housing data and all kinds of business data. These data reflects different aspects of people, events and activities. Data generated from various systems are different in form and data source are different because they may come from different sectors. In order to reflect one or several facets of an event or rule, data from multiple sources need fusion together. Data from different sources using different ways of collection raised several issues which need to be resolved. Problem of data fusion include data update and synchronization, data exchange and sharing, file parsing and entry, duplicate data and its comparison, resource catalogue construction. Governments adopt statistical analysis, time series analysis, extrapolation, monitoring analysis, value mining, scenario prediction in order to achieve pattern discovery, law verification, root cause analysis and public opinion monitoring. The result of Multi-source data fusion is to form a uniform central database, which includes people data, location data, object data, and institution data, business data and space data. We need to use meta data to be referred to and read when application needs to access, manipulate and display the data. A uniform meta data management ensures effectiveness and consistency of data in the process of data exchange, data modeling, data cleansing, data loading, data storing, data analysis, data search and data delivery.

Keywords: multi-source data fusion, urban comprehensive management, information fusion, government data

Procedia PDF Downloads 376