• Title/Summary/Keyword: missForest

Search Result 8, Processing Time 0.026 seconds

Comparision of Missing Imputaion Methods In fine dust data (미세먼지 자료에서의 결측치 대체 방법 비교)

  • Kim, YeonJin;Park, HeonJin
    • The Journal of Bigdata
    • /
    • v.4 no.2
    • /
    • pp.105-114
    • /
    • 2019
  • Missing value replacement is one of the big issues in data analysis. If you ignore the occurrence of the missing value and proceed with the analysis, a bias can occur and give incorrect results for the estimate. In this paper, we need to find and apply an appropriate alternative to missing data from weather data. Through this, we attempted to clarify and compare the simulations for various situations using existing methods such as MICE and MissForest based on R and time series-based models. When comparing these results with each variable, it was determined that the kalman filter of the auto arima model using the ImputeTS package and the MissForest model gave good results in the weather data.

  • PDF

A comparison of imputation methods using machine learning models

  • Heajung Suh;Jongwoo Song
    • Communications for Statistical Applications and Methods
    • /
    • v.30 no.3
    • /
    • pp.331-341
    • /
    • 2023
  • Handling missing values in data analysis is essential in constructing a good prediction model. The easiest way to handle missing values is to use complete case data, but this can lead to information loss within the data and invalid conclusions in data analysis. Imputation is a technique that replaces missing data with alternative values obtained from information in a dataset. Conventional imputation methods include K-nearest-neighbor imputation and multiple imputations. Recent methods include missForest, missRanger, and mixgb ,all which use machine learning algorithms. This paper compares the imputation techniques for datasets with mixed datatypes in various situations, such as data size, missing ratios, and missing mechanisms. To evaluate the performance of each method in mixed datasets, we propose a new imputation performance measure (IPM) that is a unified measurement applicable to numerical and categorical variables. We believe this metric can help find the best imputation method. Finally, we summarize the comparison results with imputation performances and computational times.

Performance analysis and comparison of various machine learning algorithms for early stroke prediction

  • Vinay Padimi;Venkata Sravan Telu;Devarani Devi Ningombam
    • ETRI Journal
    • /
    • v.45 no.6
    • /
    • pp.1007-1021
    • /
    • 2023
  • Stroke is the leading cause of permanent disability in adults, and it can cause permanent brain damage. According to the World Health Organization, 795 000 Americans experience a new or recurrent stroke each year. Early detection of medical disorders, for example, strokes, can minimize the disabling effects. Thus, in this paper, we consider various risk factors that contribute to the occurrence of stoke and machine learning algorithms, for example, the decision tree, random forest, and naive Bayes algorithms, on patient characteristics survey data to achieve high prediction accuracy. We also consider the semisupervised self-training technique to predict the risk of stroke. We then consider the near-miss undersampling technique, which can select only instances in larger classes with the smaller class instances. Experimental results demonstrate that the proposed method obtains an accuracy of approximately 98.83% at low cost, which is significantly higher and more reliable compared with the compared techniques.

Data-centric XAI-driven Data Imputation of Molecular Structure and QSAR Model for Toxicity Prediction of 3D Printing Chemicals (3D 프린팅 소재 화학물질의 독성 예측을 위한 Data-centric XAI 기반 분자 구조 Data Imputation과 QSAR 모델 개발)

  • ChanHyeok Jeong;SangYoun Kim;SungKu Heo;Shahzeb Tariq;MinHyeok Shin;ChangKyoo Yoo
    • Korean Chemical Engineering Research
    • /
    • v.61 no.4
    • /
    • pp.523-541
    • /
    • 2023
  • As accessibility to 3D printers increases, there is a growing frequency of exposure to chemicals associated with 3D printing. However, research on the toxicity and harmfulness of chemicals generated by 3D printing is insufficient, and the performance of toxicity prediction using in silico techniques is limited due to missing molecular structure data. In this study, quantitative structure-activity relationship (QSAR) model based on data-centric AI approach was developed to predict the toxicity of new 3D printing materials by imputing missing values in molecular descriptors. First, MissForest algorithm was utilized to impute missing values in molecular descriptors of hazardous 3D printing materials. Then, based on four different machine learning models (decision tree, random forest, XGBoost, SVM), a machine learning (ML)-based QSAR model was developed to predict the bioconcentration factor (Log BCF), octanol-air partition coefficient (Log Koa), and partition coefficient (Log P). Furthermore, the reliability of the data-centric QSAR model was validated through the Tree-SHAP (SHapley Additive exPlanations) method, which is one of explainable artificial intelligence (XAI) techniques. The proposed imputation method based on the MissForest enlarged approximately 2.5 times more molecular structure data compared to the existing data. Based on the imputed dataset of molecular descriptor, the developed data-centric QSAR model achieved approximately 73%, 76% and 92% of prediction performance for Log BCF, Log Koa, and Log P, respectively. Lastly, Tree-SHAP analysis demonstrated that the data-centric-based QSAR model achieved high prediction performance for toxicity information by identifying key molecular descriptors highly correlated with toxicity indices. Therefore, the proposed QSAR model based on the data-centric XAI approach can be extended to predict the toxicity of potential pollutants in emerging printing chemicals, chemical process, semiconductor or display process.

Machine Learning-based landslide susceptibility mapping - Inje area, South Korea

  • Chanul Choi;Le Xuan Hien;Seongcheon Kwon;Giha Lee
    • Proceedings of the Korea Water Resources Association Conference
    • /
    • 2023.05a
    • /
    • pp.248-248
    • /
    • 2023
  • In recent years, the number of landslides in Korea has been increasing due to extreme weather events such as localized heavy rainfall and typhoons. Landslides often occur with debris flows, land subsidence, and earthquakes. They cause significant damage to life and property. 64% of Korea's land area is made up of mountains, the government wanted to predict landslides to reduce damage. In response, the Korea Forest Service has established a 'Landslide Information System' to predict the likelihood of landslides. This system selects a total of 13 landslide factors based on past landslide events. Using the LR technique (Logistic Regression) to predict the possibility of a landslide occurrence and the accuracy is known to be 0.75. However, most of the data used for learning in the current system is on landslides that occurred from 2005 to 2011, and it does not reflect recent typhoons or heavy rain. Therefore, in this study, we will apply a total of six machine learning techniques (KNN, LR, SVM, XGB, RF, GNB) to predict the occurrence of landslides based on the data of Inje, Gangwon-do, which was recently produced by the National Institute of Forest. To predict the occurrence of landslides, it is necessary to process converting landslide events and factors data into a suitable form for machine learning techniques through ArcGIS and Python. In addition, there is a large difference in the number of data between areas where landslides occurred or not. Therefore, the prediction was performed after correcting the unbalanced data using Tomek Links and Near Miss techniques. Moreover, to control unbalanced data, a model that reflects soil properties will use to remove absolute safe areas.

  • PDF

Systematic Review of Reciprocal Changes after Spinal Reconstruction Surgery : Do Not Miss the Forest for the Trees

  • Kim, Chang-Wook;Hyun, Seung-Jae;Kim, Ki-Jeong
    • Journal of Korean Neurosurgical Society
    • /
    • v.64 no.6
    • /
    • pp.843-852
    • /
    • 2021
  • The purpose of this review was to synthesize the research on global spinal alignment and reciprocal changes following cervical or thoracolumbar reconstruction surgery. We carried out a search of PubMed, EMBASE, and Cochrane Library for studies through May 2020, and ultimately included 11 articles. The optimal goal of a truly balanced spine is to maintain the head over the femoral heads. When spinal imbalance occurs, the human body reacts through various compensatory mechanisms to maintain the head over the pelvis and to retain a horizontal gaze. Historically, deformity correction has focused on correcting scoliosis and preventing scoliotic curve progression. Following substantial correction of a spinal deformity, reciprocal changes take place in the flexible segments proximal and distal to the area of correction. Restoration of lumbar lordosis following surgery to correct a thoracolumbar deformity induces reciprocal changes in T1 slope, cervical lordosis, pelvic shift, and lower extremity parameters. Patients with cervical kyphosis exhibit different patterns of reciprocal changes depending on whether they have head-balanced or trunk-balanced kyphosis. These reciprocal changes should be considered to in order to prevent secondary spine disorders. We emphasize the importance of evaluating the global spinal alignment to assess postoperative changes.

Automatic Detection of Dead Trees Based on Lightweight YOLOv4 and UAV Imagery

  • Yuanhang Jin;Maolin Xu;Jiayuan Zheng
    • Journal of Information Processing Systems
    • /
    • v.19 no.5
    • /
    • pp.614-630
    • /
    • 2023
  • Dead trees significantly impact forest production and the ecological environment and pose constraints to the sustainable development of forests. A lightweight YOLOv4 dead tree detection algorithm based on unmanned aerial vehicle images is proposed to address current limitations in dead tree detection that rely mainly on inefficient, unsafe and easy-to-miss manual inspections. An improved logarithmic transformation method was developed in data pre-processing to display tree features in the shadows. For the model structure, the original CSPDarkNet-53 backbone feature extraction network was replaced by MobileNetV3. Some of the standard convolutional blocks in the original extraction network were replaced by depthwise separable convolution blocks. The new ReLU6 activation function replaced the original LeakyReLU activation function to make the network more robust for low-precision computations. The K-means++ clustering method was also integrated to generate anchor boxes that are more suitable for the dataset. The experimental results show that the improved algorithm achieved an accuracy of 97.33%, higher than other methods. The detection speed of the proposed approach is higher than that of YOLOv4, improving the efficiency and accuracy of the detection process.

Performance Evaluation of Snow Detection Using Himawari-8 AHI Data (Himawari-8 AHI 적설 탐지의 성능 평가)

  • Jin, Donghyun;Lee, Kyeong-sang;Seo, Minji;Choi, Sungwon;Seong, Noh-hun;Lee, Eunkyung;Han, Hyeon-gyeong;Han, Kyung-soo
    • Korean Journal of Remote Sensing
    • /
    • v.34 no.6_1
    • /
    • pp.1025-1032
    • /
    • 2018
  • Snow Cover is a form of precipitation that is defined by snow on the surface and is the single largest component of the cryosphere that plays an important role in maintaining the energy balance between the earth's surface and the atmosphere. It affects the regulation of the Earth's surface temperature. However, since snow cover is mainly distributed in area where human access is difficult, snow cover detection using satellites is actively performed, and snow cover detection in forest area is an important process as well as distinguishing between cloud and snow. In this study, we applied the Normalized Difference Snow Index (NDSI) and the Normalized Difference Vegetation Index (NDVI) to the geostationary satellites for the snow detection of forest area in existing polar orbit satellites. On the rest of the forest area, the snow cover detection using $R_{1.61{\mu}m}$ anomaly technique and NDSI was performed. As a result of the indirect validation using the snow cover data and the Visible Infrared Imaging Radiometer (VIIRS) snow cover data, the probability of detection (POD) was 99.95 % and the False Alarm Ratio (FAR) was 16.63 %. We also performed qualitative validation using the Himawari-8 Advanced Himawari Imager (AHI) RGB image. The result showed that the areas detected by the VIIRS Snow Cover miss pixel are mixed with the area detected by the research false pixel.