• Title/Summary/Keyword: random sets

Influence Measures for the Likelihood Ratio Test on Independence of Two Random Vectors

  • Jung, Kang-Mo
    • Proceedings of the Korean Data and Information Science Society Conference / 2001.10a / pp.13-16 / 2001
  • We compare methods for detecting observations that strongly influence the likelihood ratio test statistic for the hypothesis that two sets of variables are uncorrelated with one another. For this purpose, we derive the deletion diagnostic, the influence function, the standardized influence matrix, and the local influence. An illustrative example is given.
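
A deletion diagnostic of the kind the abstract mentions can be sketched for the simplest case. This is a minimal illustration, not the paper's derivation: it assumes each "vector" is a single variable, so the likelihood ratio statistic reduces to -n log(1 - r²), and the data and function names are hypothetical.

```python
import math

def lrt_independence(x, y):
    """LRT statistic -n*log(1 - r^2) for H0: x and y are uncorrelated (scalar case)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r2 = sxy ** 2 / (sxx * syy)
    return -n * math.log(1.0 - r2)

def deletion_diagnostics(x, y):
    """Change in the statistic when each observation is deleted in turn."""
    full = lrt_independence(x, y)
    return [
        full - lrt_independence(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        for i in range(len(x))
    ]

# Hypothetical data: the last point is an outlier that creates most of the correlation.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
y = [2.1, 1.9, 2.2, 1.8, 2.0, 9.0]
diag = deletion_diagnostics(x, y)
most_influential = max(range(len(diag)), key=lambda i: abs(diag[i]))
print(most_influential, [round(d, 2) for d in diag])
```

An observation whose deletion changes the statistic far more than the others is a candidate influential point; the influence function and local influence approaches named in the abstract are not covered by this sketch.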

An Analytical Study on Automatic Classification of Domestic Journal articles Using Random Forest (랜덤포레스트를 이용한 국내 학술지 논문의 자동분류에 관한 연구)

  • Kim, Pan Jun
    • Journal of the Korean Society for Information Management / v.36 no.2 / pp.57-77 / 2019
  • Random Forest (RF), a representative ensemble technique, was applied to the automatic classification of journal articles in the field of library and information science. In particular, I ran experiments on the main factors affecting classification performance, namely the number of trees, feature selection, and learning set size, for the task of automatically assigning class labels to domestic journal articles. Through these experiments, I explored ways to optimize RF performance on imbalanced datasets in real environments. The results suggest that, for the automatic classification of domestic journal articles, RF can be expected to perform best with a tree number in the interval 100~1000 (C), a small feature set (10%) selected by the chi-square statistic (CHI), and the largest learning sets (9-10 years).
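
The chi-square (CHI) feature selection step can be sketched as follows. This is a minimal illustration under assumptions: it uses the standard 2x2 term/class chi-square formula from the text-categorization literature, and the term names, counts, and the `select_top_fraction` helper are hypothetical rather than taken from the paper.

```python
def chi_square(a, b, c, d):
    """Chi-square score for a term/class 2x2 contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def select_top_fraction(scores, fraction=0.10):
    """Keep the highest-scoring fraction of features (at least one)."""
    k = max(1, round(len(scores) * fraction))
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical (a, b, c, d) counts for ten terms against one class.
counts = {
    "ontology": (10, 0, 0, 10),
    "library": (5, 5, 5, 5),
    "metadata": (8, 2, 2, 8),
    "study": (6, 5, 4, 5),
    "method": (5, 4, 5, 6),
    "user": (7, 3, 3, 7),
    "data": (5, 6, 5, 4),
    "index": (6, 4, 4, 6),
    "result": (4, 5, 6, 5),
    "system": (5, 5, 5, 5),
}
scores = {term: chi_square(*table) for term, table in counts.items()}
selected = select_top_fraction(scores, fraction=0.10)
print(selected)
```

The selected terms would then be the only features passed to the random forest; the 10% fraction mirrors the "small feature set" reported in the abstract.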

Rice yield prediction in South Korea by using random forest (Random Forest를 이용한 남한지역 쌀 수량 예측 연구)

  • Kim, Junhwan;Lee, Juseok;Sang, Wangyu;Shin, Pyeong;Cho, Hyeounsuk;Seo, Myungchul
    • Korean Journal of Agricultural and Forest Meteorology / v.21 no.2 / pp.75-84 / 2019
  • In this study, the random forest approach was used to predict the national mean rice yield of South Korea from mean climatic factors at a national scale. The random forest model used monthly climate variables and year as important predictors of crop yield. Annual yield change is affected by technical improvements in crop management as well as by climate, and year as a predictor represents this technical improvement. Thus, it is likely that the variables of importance identified for the random forest model could result in a large error in the prediction of rice yield in practice. It was also found that removing the trend from the yield data resulted in reasonable prediction accuracy with the random forest model. For example, yield prediction using the training set (data obtained from 1991 to 2005) had relatively high degree-of-agreement statistics. Although the degree-of-agreement statistics for the test set (2006-2015) were not as good as those for the training set, the relative root mean square error (RRMSE) was less than 5%. In the variable importance plot, a significant difference was noted in the importance of climate factors between the training and test sets. This difference could be attributed to a shift in the transplanting date, which might have affected the growing season. This suggests that acceptable yield prediction could be achieved using random forest when the data set includes consistent planting or transplanting dates in the predicted area.
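
The two quantities the abstract relies on, trend removal and RRMSE, can be sketched as follows. This is an illustration under assumptions: the abstract does not say how the trend was removed, so a least-squares linear trend in year is assumed, RRMSE is taken as RMSE divided by the observed mean, and all yield values are hypothetical.

```python
import math

def detrend(years, yields):
    """Remove a least-squares linear trend in year; returns the residuals."""
    n = len(years)
    mean_t, mean_v = sum(years) / n, sum(yields) / n
    slope = sum((t - mean_t) * (v - mean_v) for t, v in zip(years, yields)) / sum(
        (t - mean_t) ** 2 for t in years
    )
    return [v - (mean_v + slope * (t - mean_t)) for t, v in zip(years, yields)]

def rrmse(observed, predicted):
    """Relative RMSE: RMSE divided by the mean of the observations, in percent."""
    n = len(observed)
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
    return 100.0 * rmse / (sum(observed) / n)

# Hypothetical yields (kg per 10a) for illustration only.
residuals = detrend([1991, 1992, 1993], [400.0, 410.0, 420.0])
score = rrmse([500.0, 510.0, 490.0, 505.0], [495.0, 520.0, 500.0, 500.0])
print(residuals, round(score, 3))
```

An RRMSE below 5, as reported for the test set, means the typical prediction error is under 5% of the mean observed yield.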

Comparison of Machine Learning-Based Radioisotope Identifiers for Plastic Scintillation Detector

  • Jeon, Byoungil;Kim, Jongyul;Yu, Yonggyun;Moon, Myungkook
    • Journal of Radiation Protection and Research / v.46 no.4 / pp.204-212 / 2021
  • Background: Identification of radioisotopes with plastic scintillation detectors is challenging because their spectra have poor energy resolution and lack photopeaks. To overcome this weakness, many researchers have conducted radioisotope identification studies using machine learning algorithms; however, the effect of data normalization on radioisotope identification has not yet been addressed. Furthermore, studies on machine learning-based radioisotope identifiers for plastic scintillation detectors are limited. Materials and Methods: In this study, machine learning-based radioisotope identifiers were implemented, and their performance under different data normalization methods was compared. Eight classes of radioisotopes, consisting of combinations of 22Na, 60Co, and 137Cs together with the background, were defined. The training set was generated by random sampling from probability density functions acquired by experiments and simulations, and the test set was acquired by experiments. A support vector machine (SVM), an artificial neural network (ANN), and a convolutional neural network (CNN) were implemented as radioisotope identifiers with six data normalization methods and trained using the generated training set. Results and Discussion: The implemented identifiers were evaluated on test sets acquired by experiments with and without gain shifts to confirm their robustness against the gain shift effect. Among the three machine learning-based radioisotope identifiers, prediction accuracy followed the order SVM > ANN > CNN, while the training time followed the order SVM > ANN > CNN. Conclusion: The prediction accuracy for the combined test sets was highest with the SVM. The CNN exhibited the smallest variation in prediction accuracy across classes, even though it had the lowest prediction accuracy for the combined test sets among the three identifiers.
The SVM exhibited the highest prediction accuracy for the combined test sets, and its training time was the shortest among the three identifiers.
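
The class definition and training-set generation described above can be sketched as follows. This is a minimal illustration under assumptions: the four-channel density, the count total, and min-max scaling as one possible normalization are all hypothetical, since the abstract does not list the six normalization methods or the channel layout.

```python
import random
from itertools import combinations

ISOTOPES = ["22Na", "60Co", "137Cs"]
# Every subset of the three isotopes; the empty subset stands for background only.
CLASSES = [
    combo for r in range(len(ISOTOPES) + 1) for combo in combinations(ISOTOPES, r)
]

def sample_spectrum(pdf, n_counts, rng):
    """Draw n_counts events from a per-channel probability density and histogram them."""
    channels = rng.choices(range(len(pdf)), weights=pdf, k=n_counts)
    spectrum = [0] * len(pdf)
    for ch in channels:
        spectrum[ch] += 1
    return spectrum

def min_max_normalize(spectrum):
    """Scale counts to [0, 1]; one possible normalization, assumed for illustration."""
    lo, hi = min(spectrum), max(spectrum)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in spectrum]

rng = random.Random(0)
toy_pdf = [0.1, 0.4, 0.3, 0.2]  # hypothetical 4-channel density
spectrum = sample_spectrum(toy_pdf, 1000, rng)
normalized = min_max_normalize(spectrum)
print(len(CLASSES), spectrum, normalized)
```

Repeating the sampling with different count totals would yield many synthetic spectra per class, which is one plausible reading of how a training set can be built from measured or simulated densities.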

Numerical investigations on stability evaluation of a jointed rock slope during excavation using an optimized DDARF method

  • Li, Yong;Zhou, Hao;Dong, Zhenxing;Zhu, Weishen;Li, Shucai;Wang, Shugang
    • Geomechanics and Engineering / v.14 no.3 / pp.271-281 / 2018
  • The stability of a jointed rock slope was simulated with a discontinuous deformation analysis numerical method to investigate the failure process and safety factors for different crack distributions and different overloading situations. An optimized method using Discontinuous Deformation Analysis for Rock Failure (DDARF) is presented to perform numerical investigations on the jointed rock slope stability evaluation of the Dagangshan hydropower station. During pre-processing, an integrated software system including AutoCAD, Screen Capture, and Excel is adopted to facilitate building the numerical model with a random joint network. These optimizations during the pre-processing stage of DDARF can remarkably improve simulation efficiency, making complex model calculations feasible. In the numerical investigations using the optimized DDARF, three calculation schemes are considered in the numerical model: (I) no joints; (II) two sets of regular parallel joints; and (III) multiple sets of random joints. This model is capable of replicating the entire process, including crack initiation, propagation, formation of shear zones, and local failures, and is thus able to provide constructive suggestions for slope support schemes. Overloading numerical simulations under the same three schemes have also been performed. The overloading safety factors of the three schemes are 5.68, 2.42, and 1.39, respectively, obtained by analyzing the displacement evolution of key monitoring points during overloading.

Performance of PN Code Based Time Hopping Sequences in M-ary Ultra Wide Band Multiple Access Systems Using Equicorrelated Signal Sets (동일 상관 신호군을 이용하는 M-ary UWB 다원 접속 시스템에서 PN 부호 기반 시간 도약 시퀀스의 성능)

  • 양석철;신요안
    • The Journal of Korean Institute of Communications and Information Sciences / v.28 no.10A / pp.816-829 / 2003
  • In this paper, we evaluate the performance of PN (Pseudo Noise) code based time hopping sequences for M-ary UWB (Ultra Wide Band) multiple access systems using equicorrelated signal sets. In particular, we consider two types of M-ary UWB systems in UWB indoor wireless multipath channels. The first type (System #1) has an identical symbol transmission rate regardless of the number of symbols M, since the length of the signal pulse train is fixed while M increases. The second type (System #2) has the same bit transmission rate regardless of M, since the length of the signal pulse train is extended as M increases. We compare the proposed systems with those using the ideal random time hopping sequence in terms of symbol error rate performance. Simulation results show that the PN code based time hopping sequence achieves performance comparable to that of the ideal random sequence. Moreover, as M increases, System #2 shows better robustness against multiple access interference than System #1.
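
A PN code based time hopping sequence can be sketched with a small linear feedback shift register. This is an illustration under assumptions, not the paper's construction: the 4-stage register with polynomial x^4 + x + 1, the seed, and the mapping of two PN bits to one of four hop slots are all hypothetical choices.

```python
def pn_sequence(seed=0b0001, length=15):
    """Bits of a 4-stage LFSR with recurrence a[n+4] = a[n] XOR a[n+1] (x^4 + x + 1)."""
    state, bits = seed, []
    for _ in range(length):
        bits.append(state & 1)                  # output the lowest bit
        feedback = (state ^ (state >> 1)) & 1   # XOR of the two lowest bits
        state = (state >> 1) | (feedback << 3)  # shift right, insert feedback on top
    return bits

def time_hopping_slots(bits, bits_per_hop=2):
    """Group consecutive PN bits into integers, one hop slot per pulse (assumed mapping)."""
    slots = []
    for i in range(0, len(bits) - bits_per_hop + 1, bits_per_hop):
        value = 0
        for b in bits[i:i + bits_per_hop]:
            value = (value << 1) | b
        slots.append(value)
    return slots

bits = pn_sequence()
slots = time_hopping_slots(bits)
print(bits)
print(slots)
```

The register output repeats every 15 bits, the maximal period for four stages, which is what makes such a sequence look random while remaining reproducible at the receiver.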

A longitudinal study for child aggression with Korea Welfare Panel Study data (한국복지패널 자료를 이용한 아동기 공격성에 대한 경시적 자료 분석)

  • Choi, Nayeon;Huh, Jib
    • Journal of the Korean Data and Information Science Society / v.25 no.6 / pp.1439-1447 / 2014
  • Most of the literature on Korean child aggression is based on cross-sectional data sets. Although there is a related study with a longitudinal data set, it assumes that the repeated measurements in the longitudinal data are mutually independent. A longitudinal data analysis of Korean child aggression is therefore necessary. This study analyzes the effect of child development outcomes, including academic achievement, self-esteem, depression and anxiety, delinquency, victimization by peers, abuse by parents, and internet usage time, on child aggression, using Korea Welfare Panel Study data observed three times between 2006 and 2012. Since the Korea Welfare Panel Study data have missing values, they are assumed to be missing at random. The linear mixed effects model and restricted maximum likelihood estimation are used.

A Study on Predictive Modeling of I-131 Radioactivity Based on Machine Learning (머신러닝 기반 고용량 I-131의 용량 예측 모델에 관한 연구)

  • Yeon-Wook You;Chung-Wun Lee;Jung-Soo Kim
    • Journal of Radiological Science and Technology / v.46 no.2 / pp.131-139 / 2023
  • High-dose I-131 used for the treatment of thyroid cancer causes localized exposure among the radiology technologists handling it. There is a delay between the calibration date and the time the I-131 dose is administered to a patient. Therefore, it is necessary to directly measure the radioactivity of the administered dose using a dose calibrator. In this study, we applied machine learning modeling to external dose rates measured from shielded I-131 in order to predict its radioactivity. External dose rates were measured at distances of 1 m, 0.3 m, and 0.1 m from a shielded container holding the I-131, for a total of 868 sets of measurements. For the modeling process, we used the hold-out method to partition the data in a 7:3 ratio (609 for the training set and 259 for the test set). For the machine learning algorithms, we chose linear regression, decision tree, random forest, and XGBoost. We assessed accuracy with root mean square error (RMSE), mean square error (MSE), and mean absolute error (MAE), and explanatory power with R². The evaluation results are as follows: linear regression (RMSE 268.15, MSE 71901.87, MAE 231.68, R² 0.92), decision tree (RMSE 108.89, MSE 11856.92, MAE 19.24, R² 0.99), random forest (RMSE 8.89, MSE 79.10, MAE 6.55, R² 0.99), and XGBoost (RMSE 10.21, MSE 104.22, MAE 7.68, R² 0.99). The random forest model achieved the highest predictive ability. Improving the model's performance in the future is expected to contribute to lowering exposure among radiology technologists.
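
The hold-out split and the four evaluation metrics can be sketched as follows. This is a minimal illustration: the split sizes (609 and 259 of 868) come from the abstract, while the shuffling seed, the `regression_metrics` helper, and the activity values are hypothetical.

```python
import math
import random

def regression_metrics(y_true, y_pred):
    """RMSE, MSE, MAE, and R^2 for paired observations and predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_tot
    return {"RMSE": math.sqrt(mse), "MSE": mse, "MAE": mae, "R2": r2}

# Hold-out split of the 868 measurement sets into 609 training and 259 test rows.
rng = random.Random(0)  # seed chosen arbitrarily for reproducibility
indices = list(range(868))
rng.shuffle(indices)
train_idx, test_idx = indices[:609], indices[609:]

# Hypothetical true and predicted activities, for illustration only.
metrics = regression_metrics([100.0, 200.0, 300.0, 400.0], [110.0, 190.0, 310.0, 390.0])
print(len(train_idx), len(test_idx), metrics)
```

Because RMSE is the square root of MSE, the two always rank models identically; MAE is less sensitive to a few large errors, which may explain why the decision tree's MAE is small relative to its RMSE.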

Development of benthic macroinvertebrate species distribution models using the Bayesian optimization (베이지안 최적화를 통한 저서성 대형무척추동물 종분포모델 개발)

  • Go, ByeongGeon;Shin, Jihoon;Cha, Yoonkyung
    • Journal of Korean Society of Water and Wastewater / v.35 no.4 / pp.259-275 / 2021
  • This study explored the usefulness and implications of Bayesian hyperparameter optimization in developing species distribution models (SDMs). A variety of machine learning (ML) algorithms, namely support vector machine (SVM), random forest (RF), boosted regression tree (BRT), XGBoost (XGB), and multilayer perceptron (MLP), were used to predict the occurrence of four benthic macroinvertebrate species. The Bayesian optimization method successfully tuned model hyperparameters, with all ML models achieving an area under the curve (AUC) > 0.7. In addition, hyperparameter search ranges that generally clustered around the optimal values suggest that Bayesian optimization is efficient at finding optimal sets of hyperparameters. Tree-based ensemble algorithms (BRT, RF, and XGB) tended to perform better than SVM and MLP. Important hyperparameters and optimal values differed by species and ML model, indicating the need for hyperparameter tuning to improve individual model performance. The optimization results show that, for all macroinvertebrate species, SVM and RF required fewer trials to reach optimal hyperparameter sets, leading to reduced computational cost compared with the other ML algorithms. The results of this study suggest that Bayesian optimization is an efficient method for hyperparameter optimization of machine learning algorithms.
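
The AUC criterion used to judge the tuned models can be sketched as follows. This is a minimal illustration of the rank-based definition of AUC; the presence/absence labels and predicted scores are hypothetical, and the Bayesian optimization loop itself is not shown.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical occurrence labels (1 = species present) and model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
value = auc(labels, scores)
print(round(value, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the reported threshold of 0.7 indicates that every tuned model ranked presences above absences clearly better than chance.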

Selecting the Optimal Hidden Layer of Extreme Learning Machine Using Multiple Kernel Learning

  • Zhao, Wentao;Li, Pan;Liu, Qiang;Liu, Dan;Liu, Xinwang
    • KSII Transactions on Internet and Information Systems (TIIS) / v.12 no.12 / pp.5765-5781 / 2018
  • Extreme learning machine (ELM) is emerging as a powerful machine learning method in a variety of application scenarios due to its high accuracy, fast learning speed, and ease of implementation. However, how to select the optimal hidden layer of ELM is still an open question in the ELM community. In particular, the number of hidden layer nodes is a sensitive hyperparameter that significantly affects the performance of ELM. To address this problem, we propose adopting multiple kernel learning (MKL) to design a multi-hidden-layer-kernel ELM (MHLK-ELM). Specifically, we first integrate kernel functions with the random feature mapping of ELM to design a hidden-layer-kernel ELM (HLK-ELM), which serves as the base of MHLK-ELM. Then, we use the MKL method to propose two versions of MHLK-ELM, called sparse and non-sparse MHLK-ELMs. Both types of MHLK-ELM can effectively find the optimal linear combination of multiple HLK-ELMs for different classification and regression problems. Experimental results on seven data sets, three for classification and four for regression, demonstrate that the proposed MHLK-ELM achieves superior performance compared with conventional ELM and basic HLK-ELM.