Commenced in January 2007

Frequency: Monthly

Edition: International

Paper Count: 1269

Search results for: predictive analytics

9 Integration of Building Information Modeling Framework for 4D Constructability Review and Clash Detection Management of a Sewage Treatment Plant

Authors: Malla Vijayeta, Y. Vijaya Kumar, N. Ramakrishna Raju, K. Satyanarayana

Abstract:

Global AEC (architecture, engineering, and construction) industry has been coined as one of the most resistive domains in embracing technology. Although this digital era has been inundated with software tools like CAD, STADD, CANDY, Microsoft Project, Primavera etc. the key stakeholders have been working in siloes and processes remain fragmented. Unlike the yesteryears’ simpler project delivery methods, the current projects are of fast-track, complex, risky, multidisciplinary, stakeholder’s influential, statutorily regulative etc. pose extensive bottlenecks in preventing timely completion of projects. At this juncture, a paradigm shift surfaced in construction industry, and Building Information Modeling, aka BIM, has been a panacea to bolster the multidisciplinary teams’ cooperative and collaborative work leading to productive, sustainable and leaner project outcome. Building information modeling has been integrative, stakeholder engaging and centralized approach in providing a common platform of communication. A common misconception that BIM can be used for building/high rise projects in Indian Construction Industry, while this paper discusses of the implementation of BIM processes/methodologies in water and waste water industry. It elucidates about BIM 4D planning and constructability reviews of a Sewage Treatment Plant in India. Conventional construction planning and logistics management involves a blend of experience coupled with imagination. Even though the excerpts or judgments or lessons learnt gained from veterans might be predictive and helpful, but the uncertainty factor persists. This paper shall delve about the case study of real time implementation of BIM 4D planning protocols for one of the Sewage Treatment Plant of Dravyavati River Rejuvenation Project in India and develops a Time Liner to identify logistics planning and clash detection. With this BIM processes, we shall find that there will be significant reduction of duplication of tasks and reworks. Also another benefit achieved will be better visualization and workarounds during conception stage and enables for early involvement of the stakeholders in the Project Life cycle of Sewage Treatment Plant construction. Moreover, we have also taken an opinion poll of the benefits accrued utilizing BIM processes versus traditional paper based communication like 2D and 3D CAD tools. Thus this paper concludes with BIM framework for Sewage Treatment Plant construction which will achieve optimal construction co-ordination advantages like 4D construction sequencing, interference checking, clash detection checking and resolutions by primary engagement of all key stakeholders thereby identifying potential risks and subsequent creation of risk response strategies. However, certain hiccups like hesitancy in adoption of BIM technology by naïve users and availability of proficient BIM trainers in India poses a phenomenal impediment. Hence the nurture of BIM processes from conception, construction and till commissioning, operation and maintenance along with deconstruction of a project’s life cycle is highly essential for Indian Construction Industry in this digital era.

Keywords: integrated BIM workflow, 4D planning with BIM, building information modeling, clash detection and visualization, constructability reviews, project life cycle

Procedia PDF Downloads 113

8 Internet of Things, Edge and Cloud Computing in Rock Mechanical Investigation for Underground Surveys

Authors: Esmael Makarian, Ayub Elyasi, Fatemeh Saberi, Olusegun Stanley Tomomewo

Abstract:

Rock mechanical investigation is one of the most crucial activities in underground operations, especially in surveys related to hydrocarbon exploration and production, geothermal reservoirs, energy storage, mining, and geotechnics. There is a wide range of traditional methods for driving, collecting, and analyzing rock mechanics data. However, these approaches may not be suitable or work perfectly in some situations, such as fractured zones. Cutting-edge technologies have been provided to solve and optimize the mentioned issues. Internet of Things (IoT), Edge, and Cloud Computing technologies (ECt & CCt, respectively) are among the most widely used and new artificial intelligence methods employed for geomechanical studies. IoT devices act as sensors and cameras for real-time monitoring and mechanical-geological data collection of rocks, such as temperature, movement, pressure, or stress levels. Structural integrity, especially for cap rocks within hydrocarbon systems, and rock mass behavior assessment, to further activities such as enhanced oil recovery (EOR) and underground gas storage (UGS), or to improve safety risk management (SRM) and potential hazards identification (P.H.I), are other benefits from IoT technologies. EC techniques can process, aggregate, and analyze data immediately collected by IoT on a real-time scale, providing detailed insights into the behavior of rocks in various situations (e.g., stress, temperature, and pressure), establishing patterns quickly, and detecting trends. Therefore, this state-of-the-art and useful technology can adopt autonomous systems in rock mechanical surveys, such as drilling and production (in hydrocarbon wells) or excavation (in mining and geotechnics industries). Besides, ECt allows all rock-related operations to be controlled remotely and enables operators to apply changes or make adjustments. It must be mentioned that this feature is very important in environmental goals. More often than not, rock mechanical studies consist of different data, such as laboratory tests, field operations, and indirect information like seismic or well-logging data. CCt provides a useful platform for storing and managing a great deal of volume and different information, which can be very useful in fractured zones. Additionally, CCt supplies powerful tools for predicting, modeling, and simulating rock mechanical information, especially in fractured zones within vast areas. Also, it is a suitable source for sharing extensive information on rock mechanics, such as the direction and size of fractures in a large oil field or mine. The comprehensive review findings demonstrate that digital transformation through integrated IoT, Edge, and Cloud solutions is revolutionizing traditional rock mechanical investigation. These advanced technologies have empowered real-time monitoring, predictive analysis, and data-driven decision-making, culminating in noteworthy enhancements in safety, efficiency, and sustainability. Therefore, by employing IoT, CCt, and ECt, underground operations have experienced a significant boost, allowing for timely and informed actions using real-time data insights. The successful implementation of IoT, CCt, and ECt has led to optimized and safer operations, optimized processes, and environmentally conscious approaches in underground geological endeavors.

Keywords: rock mechanical studies, internet of things, edge computing, cloud computing, underground surveys, geological operations

Procedia PDF Downloads 51

7 Evaluation of Random Forest and Support Vector Machine Classification Performance for the Prediction of Early Multiple Sclerosis from Resting State FMRI Connectivity Data

Authors: V. Saccà, A. Sarica, F. Novellino, S. Barone, T. Tallarico, E. Filippelli, A. Granata, P. Valentino, A. Quattrone

Abstract:

The work aim was to evaluate how well Random Forest (RF) and Support Vector Machine (SVM) algorithms could support the early diagnosis of Multiple Sclerosis (MS) from resting-state functional connectivity data. In particular, we wanted to explore the ability in distinguishing between controls and patients of mean signals extracted from ICA components corresponding to 15 well-known networks. Eighteen patients with early-MS (mean-age 37.42±8.11, 9 females) were recruited according to McDonald and Polman, and matched for demographic variables with 19 healthy controls (mean-age 37.55±14.76, 10 females). MRI was acquired by a 3T scanner with 8-channel head coil: (a)whole-brain T1-weighted; (b)conventional T2-weighted; (c)resting-state functional MRI (rsFMRI), 200 volumes. Estimated total lesion load (ml) and number of lesions were calculated using LST-toolbox from the corrected T1 and FLAIR. All rsFMRIs were pre-processed using tools from the FMRIB's Software Library as follows: (1) discarding of the first 5 volumes to remove T1 equilibrium effects, (2) skull-stripping of images, (3) motion and slice-time correction, (4) denoising with high-pass temporal filter (128s), (5) spatial smoothing with a Gaussian kernel of FWHM 8mm. No statistical significant differences (t-test, p < 0.05) were found between the two groups in the mean Euclidian distance and the mean Euler angle. WM and CSF signal together with 6 motion parameters were regressed out from the time series. We applied an independent component analysis (ICA) with the GIFT-toolbox using the Infomax approach with number of components=21. Fifteen mean components were visually identified by two experts. The resulting z-score maps were thresholded and binarized to extract the mean signal of the 15 networks for each subject. Statistical and machine learning analysis were then conducted on this dataset composed of 37 rows (subjects) and 15 features (mean signal in the network) with R language. The dataset was randomly splitted into training (75%) and test sets and two different classifiers were trained: RF and RBF-SVM. We used the intrinsic feature selection of RF, based on the Gini index, and recursive feature elimination (rfe) for the SVM, to obtain a rank of the most predictive variables. Thus, we built two new classifiers only on the most important features and we evaluated the accuracies (with and without feature selection) on test-set. The classifiers, trained on all the features, showed very poor accuracies on training (RF:58.62%, SVM:65.52%) and test sets (RF:62.5%, SVM:50%). Interestingly, when feature selection by RF and rfe-SVM were performed, the most important variable was the sensori-motor network I in both cases. Indeed, with only this network, RF and SVM classifiers reached an accuracy of 87.5% on test-set. More interestingly, the only misclassified patient resulted to have the lowest value of lesion volume. We showed that, with two different classification algorithms and feature selection approaches, the best discriminant network between controls and early MS, was the sensori-motor I. Similar importance values were obtained for the sensori-motor II, cerebellum and working memory networks. These findings, in according to the early manifestation of motor/sensorial deficits in MS, could represent an encouraging step toward the translation to the clinical diagnosis and prognosis.

Keywords: feature selection, machine learning, multiple sclerosis, random forest, support vector machine

Procedia PDF Downloads 233

6 ChatGPT 4.0 Demonstrates Strong Performance in Standardised Medical Licensing Examinations: Insights and Implications for Medical Educators

Authors: K. O'Malley

Abstract:

Background: The emergence and rapid evolution of large language models (LLMs) (i.e., models of generative artificial intelligence, or AI) has been unprecedented. ChatGPT is one of the most widely used LLM platforms. Using natural language processing technology, it generates customized responses to user prompts, enabling it to mimic human conversation. Responses are generated using predictive modeling of vast internet text and data swathes and are further refined and reinforced through user feedback. The popularity of LLMs is increasing, with a growing number of students utilizing these platforms for study and revision purposes. Notwithstanding its many novel applications, LLM technology is inherently susceptible to bias and error. This poses a significant challenge in the educational setting, where academic integrity may be undermined. This study aims to evaluate the performance of the latest iteration of ChatGPT (ChatGPT4.0) in standardized state medical licensing examinations. Methods: A considered search strategy was used to interrogate the PubMed electronic database. The keywords ‘ChatGPT’ AND ‘medical education’ OR ‘medical school’ OR ‘medical licensing exam’ were used to identify relevant literature. The search included all peer-reviewed literature published in the past five years. The search was limited to publications in the English language only. Eligibility was ascertained based on the study title and abstract and confirmed by consulting the full-text document. Data was extracted into a Microsoft Excel document for analysis. Results: The search yielded 345 publications that were screened. 225 original articles were identified, of which 11 met the pre-determined criteria for inclusion in a narrative synthesis. These studies included performance assessments in national medical licensing examinations from the United States, United Kingdom, Saudi Arabia, Poland, Taiwan, Japan and Germany. ChatGPT 4.0 achieved scores ranging from 67.1 to 88.6 percent. The mean score across all studies was 82.49 percent (SD= 5.95). In all studies, ChatGPT exceeded the threshold for a passing grade in the corresponding exam. Conclusion: The capabilities of ChatGPT in standardized academic assessment in medicine are robust. While this technology can potentially revolutionize higher education, it also presents several challenges with which educators have not had to contend before. The overall strong performance of ChatGPT, as outlined above, may lend itself to unfair use (such as the plagiarism of deliverable coursework) and pose unforeseen ethical challenges (arising from algorithmic bias). Conversely, it highlights potential pitfalls if users assume LLM-generated content to be entirely accurate. In the aforementioned studies, ChatGPT exhibits a margin of error between 11.4 and 32.9 percent, which resonates strongly with concerns regarding the quality and veracity of LLM-generated content. It is imperative to highlight these limitations, particularly to students in the early stages of their education who are less likely to possess the requisite insight or knowledge to recognize errors, inaccuracies or false information. Educators must inform themselves of these emerging challenges to effectively address them and mitigate potential disruption in academic fora.

Keywords: artificial intelligence, ChatGPT, generative ai, large language models, licensing exam, medical education, medicine, university

Procedia PDF Downloads 19

5 A Copula-Based Approach for the Assessment of Severity of Illness and Probability of Mortality: An Exploratory Study Applied to Intensive Care Patients

Authors: Ainura Tursunalieva, Irene Hudson

Abstract:

Continuous improvement of both the quality and safety of health care is an important goal in Australia and internationally. The intensive care unit (ICU) receives patients with a wide variety of and severity of illnesses. Accurately identifying patients at risk of developing complications or dying is crucial to increasing healthcare efficiency. Thus, it is essential for clinicians and researchers to have a robust framework capable of evaluating the risk profile of a patient. ICU scoring systems provide such a framework. The Acute Physiology and Chronic Health Evaluation III and the Simplified Acute Physiology Score II are ICU scoring systems frequently used for assessing the severity of acute illness. These scoring systems collect multiple risk factors for each patient including physiological measurements then render the assessment outcomes of individual risk factors into a single numerical value. A higher score is related to a more severe patient condition. Furthermore, the Mortality Probability Model II uses logistic regression based on independent risk factors to predict a patient’s probability of mortality. An important overlooked limitation of SAPS II and MPM II is that they do not, to date, include interaction terms between a patient’s vital signs. This is a prominent oversight as it is likely there is an interplay among vital signs. The co-existence of certain conditions may pose a greater health risk than when these conditions exist independently. One barrier to including such interaction terms in predictive models is the dimensionality issue as it becomes difficult to use variable selection. We propose an innovative scoring system which takes into account a dependence structure among patient’s vital signs, such as systolic and diastolic blood pressures, heart rate, pulse interval, and peripheral oxygen saturation. Copulas will capture the dependence among normally distributed and skewed variables as some of the vital sign distributions are skewed. The estimated dependence parameter will then be incorporated into the traditional scoring systems to adjust the points allocated for the individual vital sign measurements. The same dependence parameter will also be used to create an alternative copula-based model for predicting a patient’s probability of mortality. The new copula-based approach will accommodate not only a patient’s trajectories of vital signs but also the joint dependence probabilities among the vital signs. We hypothesise that this approach will produce more stable assessments and lead to more time efficient and accurate predictions. We will use two data sets: (1) 250 ICU patients admitted once to the Chui Regional Hospital (Kyrgyzstan) and (2) 37 ICU patients’ agitation-sedation profiles collected by the Hunter Medical Research Institute (Australia). Both the traditional scoring approach and our copula-based approach will be evaluated using the Brier score to indicate overall model performance, the concordance (or c) statistic to indicate the discriminative ability (or area under the receiver operating characteristic (ROC) curve), and goodness-of-fit statistics for calibration. We will also report discrimination and calibration values and establish visualization of the copulas and high dimensional regions of risk interrelating two or three vital signs in so-called higher dimensional ROCs.

Keywords: copula, intensive unit scoring system, ROC curves, vital sign dependence

Procedia PDF Downloads 144

4 Pulmonary Disease Identification Using Machine Learning and Deep Learning Techniques

Authors: Chandu Rathnayake, Isuri Anuradha

Abstract:

Early detection and accurate diagnosis of lung diseases play a crucial role in improving patient prognosis. However, conventional diagnostic methods heavily rely on subjective symptom assessments and medical imaging, often causing delays in diagnosis and treatment. To overcome this challenge, we propose a novel lung disease prediction system that integrates patient symptoms and X-ray images to provide a comprehensive and reliable diagnosis.In this project, develop a mobile application specifically designed for detecting lung diseases. Our application leverages both patient symptoms and X-ray images to facilitate diagnosis. By combining these two sources of information, our application delivers a more accurate and comprehensive assessment of the patient's condition, minimizing the risk of misdiagnosis. Our primary aim is to create a user-friendly and accessible tool, particularly important given the current circumstances where many patients face limitations in visiting healthcare facilities. To achieve this, we employ several state-of-the-art algorithms. Firstly, the Decision Tree algorithm is utilized for efficient symptom-based classification. It analyzes patient symptoms and creates a tree-like model to predict the presence of specific lung diseases. Secondly, we employ the Random Forest algorithm, which enhances predictive power by aggregating multiple decision trees. This ensemble technique improves the accuracy and robustness of the diagnosis. Furthermore, we incorporate a deep learning model using Convolutional Neural Network (CNN) with the RestNet50 pre-trained model. CNNs are well-suited for image analysis and feature extraction. By training CNN on a large dataset of X-ray images, it learns to identify patterns and features indicative of lung diseases. The RestNet50 architecture, known for its excellent performance in image recognition tasks, enhances the efficiency and accuracy of our deep learning model. By combining the outputs of the decision tree-based algorithms and the deep learning model, our mobile application generates a comprehensive lung disease prediction. The application provides users with an intuitive interface to input their symptoms and upload X-ray images for analysis. The prediction generated by the system offers valuable insights into the likelihood of various lung diseases, enabling individuals to take appropriate actions and seek timely medical attention. Our proposed mobile application has significant potential to address the rising prevalence of lung diseases, particularly among young individuals with smoking addictions. By providing a quick and user-friendly approach to assessing lung health, our application empowers individuals to monitor their well-being conveniently. This solution also offers immense value in the context of limited access to healthcare facilities, enabling timely detection and intervention. In conclusion, our research presents a comprehensive lung disease prediction system that combines patient symptoms and X-ray images using advanced algorithms. By developing a mobile application, we provide an accessible tool for individuals to assess their lung health conveniently. This solution has the potential to make a significant impact on the early detection and management of lung diseases, benefiting both patients and healthcare providers.

Keywords: CNN, random forest, decision tree, machine learning, deep learning

Procedia PDF Downloads 71

3 Identification of a Panel of Epigenetic Biomarkers for Early Detection of Hepatocellular Carcinoma in Blood of Individuals with Liver Cirrhosis

Authors: Katarzyna Lubecka, Kirsty Flower, Megan Beetch, Lucinda Kurzava, Hannah Buvala, Samer Gawrieh, Suthat Liangpunsakul, Tracy Gonzalez, George McCabe, Naga Chalasani, James M. Flanagan, Barbara Stefanska

Abstract:

Hepatocellular carcinoma (HCC), the most prevalent type of primary liver cancer, is the second leading cause of cancer death worldwide. Late onset of clinical symptoms in HCC results in late diagnosis and poor disease outcome. Approximately 85% of individuals with HCC have underlying liver cirrhosis. However, not all cirrhotic patients develop cancer. Reliable early detection biomarkers that can distinguish cirrhotic patients who will develop cancer from those who will not are urgently needed and could increase the cure rate from 5% to 80%. We used Illumina-450K microarray to test whether blood DNA, an easily accessible source of DNA, bear site-specific changes in DNA methylation in response to HCC before diagnosis with conventional tools (pre-diagnostic). Top 11 differentially methylated sites were selected for validation by pyrosequencing. The diagnostic potential of the 11 pyrosequenced probes was tested in blood samples from a prospective cohort of cirrhotic patients. We identified 971 differentially methylated CpG sites in pre-diagnostic HCC cases as compared with healthy controls (P < 0.05, paired Wilcoxon test, ICC ≥ 0.5). Nearly 76% of differentially methylated CpG sites showed lower levels of methylation in cases vs. controls (P = 2.973E-11, Wilcoxon test). Classification of the CpG sites according to their location relative to CpG islands and transcription start site revealed that those hypomethylated loci are located in regulatory regions important for gene transcription such as CpG island shores, promoters, and 5’UTR at higher frequency than hypermethylated sites. Among 735 CpG sites hypomethylated in cases vs. controls, 482 sites were assigned to gene coding regions whereas 236 hypermethylated sites corresponded to 160 genes. Bioinformatics analysis using GO, KEGG and DAVID knowledgebase indicate that differentially methylated CpG sites are located in genes associated with functions that are essential for gene transcription, cell adhesion, cell migration, and regulation of signal transduction pathways. Taking into account the magnitude of the difference, statistical significance, location, and consistency across the majority of matched pairs case-control, we selected 11 CpG loci corresponding to 10 genes for further validation by pyrosequencing. We established that methylation of CpG sites within 5 out of those 10 genes distinguish cirrhotic patients who subsequently developed HCC from those who stayed cancer free (cirrhotic controls), demonstrating potential as biomarkers of early detection in populations at risk. The best predictive value was detected for CpGs located within BARD1 (AUC=0.70, asymptotic significance ˂0.01). Using an additive logistic regression model, we further showed that 9 CpG loci within those 5 genes, that were covered in pyrosequenced probes, constitute a panel with high diagnostic accuracy (AUC=0.887; 95% CI:0.80-0.98). The panel was able to distinguish pre-diagnostic cases from cirrhotic controls free of cancer with 88% sensitivity at 70% specificity. Using blood as a minimally invasive material and pyrosequencing as a straightforward quantitative method, the established biomarker panel has high potential to be developed into a routine clinical test after validation in larger cohorts. This study was supported by Showalter Trust, American Cancer Society (IRG#14-190-56), and Purdue Center for Cancer Research (P30 CA023168) granted to BS.

Keywords: biomarker, DNA methylation, early detection, hepatocellular carcinoma

Procedia PDF Downloads 297

2 Establishment of a Classifier Model for Early Prediction of Acute Delirium in Adult Intensive Care Unit Using Machine Learning

Authors: Pei Yi Lin

Abstract:

Objective: The objective of this study is to use machine learning methods to build an early prediction classifier model for acute delirium to improve the quality of medical care for intensive care patients. Background: Delirium is a common acute and sudden disturbance of consciousness in critically ill patients. After the occurrence, it is easy to prolong the length of hospital stay and increase medical costs and mortality. In 2021, the incidence of delirium in the intensive care unit of internal medicine was as high as 59.78%, which indirectly prolonged the average length of hospital stay by 8.28 days, and the mortality rate is about 2.22% in the past three years. Therefore, it is expected to build a delirium prediction classifier through big data analysis and machine learning methods to detect delirium early. Method: This study is a retrospective study, using the artificial intelligence big data database to extract the characteristic factors related to delirium in intensive care unit patients and let the machine learn. The study included patients aged over 20 years old who were admitted to the intensive care unit between May 1, 2022, and December 31, 2022, excluding GCS assessment <4 points, admission to ICU for less than 24 hours, and CAM-ICU evaluation. The CAMICU delirium assessment results every 8 hours within 30 days of hospitalization are regarded as an event, and the cumulative data from ICU admission to the prediction time point are extracted to predict the possibility of delirium occurring in the next 8 hours, and collect a total of 63,754 research case data, extract 12 feature selections to train the model, including age, sex, average ICU stay hours, visual and auditory abnormalities, RASS assessment score, APACHE-II Score score, number of invasive catheters indwelling, restraint and sedative and hypnotic drugs. Through feature data cleaning, processing and KNN interpolation method supplementation, a total of 54595 research case events were extracted to provide machine learning model analysis, using the research events from May 01 to November 30, 2022, as the model training data, 80% of which is the training set for model training, and 20% for the internal verification of the verification set, and then from December 01 to December 2022 The CU research event on the 31st is an external verification set data, and finally the model inference and performance evaluation are performed, and then the model has trained again by adjusting the model parameters. Results: In this study, XG Boost, Random Forest, Logistic Regression, and Decision Tree were used to analyze and compare four machine learning models. The average accuracy rate of internal verification was highest in Random Forest (AUC=0.86), and the average accuracy rate of external verification was in Random Forest and XG Boost was the highest, AUC was 0.86, and the average accuracy of cross-validation was the highest in Random Forest (ACC=0.77). Conclusion: Clinically, medical staff usually conduct CAM-ICU assessments at the bedside of critically ill patients in clinical practice, but there is a lack of machine learning classification methods to assist ICU patients in real-time assessment, resulting in the inability to provide more objective and continuous monitoring data to assist Clinical staff can more accurately identify and predict the occurrence of delirium in patients. It is hoped that the development and construction of predictive models through machine learning can predict delirium early and immediately, make clinical decisions at the best time, and cooperate with PADIS delirium care measures to provide individualized non-drug interventional care measures to maintain patient safety, and then Improve the quality of care.

Keywords: critically ill patients, machine learning methods, delirium prediction, classifier model

Procedia PDF Downloads 63

1 Towards Dynamic Estimation of Residential Building Energy Consumption in Germany: Leveraging Machine Learning and Public Data from England and Wales

Authors: Philipp Sommer, Amgad Agoub

Abstract:

The construction sector significantly impacts global CO₂ emissions, particularly through the energy usage of residential buildings. To address this, various governments, including Germany's, are focusing on reducing emissions via sustainable refurbishment initiatives. This study examines the application of machine learning (ML) to estimate energy demands dynamically in residential buildings and enhance the potential for large-scale sustainable refurbishment. A major challenge in Germany is the lack of extensive publicly labeled datasets for energy performance, as energy performance certificates, which provide critical data on building-specific energy requirements and consumption, are not available for all buildings or require on-site inspections. Conversely, England and other countries in the European Union (EU) have rich public datasets, providing a viable alternative for analysis. This research adapts insights from these English datasets to the German context by developing a comprehensive data schema and calibration dataset capable of predicting building energy demand effectively. The study proposes a minimal feature set, determined through feature importance analysis, to optimize the ML model. Findings indicate that ML significantly improves the scalability and accuracy of energy demand forecasts, supporting more effective emissions reduction strategies in the construction industry. Integrating energy performance certificates into municipal heat planning in Germany highlights the transformative impact of data-driven approaches on environmental sustainability. The goal is to identify and utilize key features from open data sources that significantly influence energy demand, creating an efficient forecasting model. Using Extreme Gradient Boosting (XGB) and data from energy performance certificates, effective features such as building type, year of construction, living space, insulation level, and building materials were incorporated. These were supplemented by data derived from descriptions of roofs, walls, windows, and floors, integrated into three datasets. The emphasis was on features accessible via remote sensing, which, along with other correlated characteristics, greatly improved the model's accuracy. The model was further validated using SHapley Additive exPlanations (SHAP) values and aggregated feature importance, which quantified the effects of individual features on the predictions. The refined model using remote sensing data showed a coefficient of determination (R²) of 0.64 and a mean absolute error (MAE) of 4.12, indicating predictions based on efficiency class 1-100 (G-A) may deviate by 4.12 points. This R² increased to 0.84 with the inclusion of more samples, with wall type emerging as the most predictive feature. After optimizing and incorporating related features like estimated primary energy consumption, the R² score for the training and test set reached 0.94, demonstrating good generalization. The study concludes that ML models significantly improve prediction accuracy over traditional methods, illustrating the potential of ML in enhancing energy efficiency analysis and planning. This supports better decision-making for energy optimization and highlights the benefits of developing and refining data schemas using open data to bolster sustainability in the building sector. The study underscores the importance of supporting open data initiatives to collect similar features and support the creation of comparable models in Germany, enhancing the outlook for environmental sustainability.

Keywords: machine learning, remote sensing, residential building, energy performance certificates, data-driven, heat planning

Procedia PDF Downloads 47