• Title/Summary/Keyword: 10-fold Validation

Cloning of Korean Morphological Analyzers using Pre-analyzed Eojeol Dictionary and Syllable-based Probabilistic Model (기분석 어절 사전과 음절 단위의 확률 모델을 이용한 한국어 형태소 분석기 복제)

  • Shim, Kwangseob
    • KIISE Transactions on Computing Practices / v.22 no.3 / pp.119-126 / 2016
  • In this study, we verified the feasibility of a Korean morphological analyzer that uses a pre-analyzed Eojeol dictionary and a syllable-based probabilistic model. For the verification, two Korean morphological analyzers, MACH and KLT2000, were cloned using a pre-analyzed Eojeol dictionary and a syllable-based probabilistic model. The analysis results of the cloned analyzers were compared with those of MACH and KLT2000. The 10-million-Eojeol Sejong corpus was segmented into 10 sets for cross-validation. The 10-fold cross-validated precision and recall were 97.16% and 98.31% for the cloned MACH and 96.80% and 99.03% for the cloned KLT2000. The analysis speed of the cloned MACH was 308,000 Eojeols per second, and that of the cloned KLT2000 was 436,000 Eojeols per second. The experimental results indicate that a Korean morphological analyzer that uses a pre-analyzed Eojeol dictionary and a syllable-based probabilistic model could be used in practical applications.
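
The evaluation protocol described in this abstract (split the corpus into 10 sets, hold each one out in turn, average precision and recall) can be sketched as follows. This is a minimal illustration, not the paper's code: the function names, the shuffling seed, and the `train_and_score` callback are hypothetical, and the abstract does not state how analyses were counted when computing precision and recall.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and deal them into k nearly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def precision_recall(predicted, gold):
    """Precision and recall of a predicted set of analyses against a gold set."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

def cross_validate(n_items, train_and_score, k=10):
    """Hold out each fold once and average the (precision, recall) pairs.

    train_and_score(train_idx, test_idx) is a hypothetical callback that
    builds a model from the training indices and scores it on the test indices.
    """
    folds = k_fold_indices(n_items, k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Every fold except the held-out one goes into the training set.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(train_and_score(train_idx, test_idx))
    mean_precision = sum(p for p, _ in scores) / k
    mean_recall = sum(r for _, r in scores) / k
    return mean_precision, mean_recall
```

Precision and recall can differ here because an analyzer may return more or fewer candidate analyses for an Eojeol than the reference contains.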

Fully Automatic Segmentation of Acute Ischemic Lesions on Diffusion-Weighted Imaging Using Convolutional Neural Networks: Comparison with Conventional Algorithms

  • Ilsang Woo;Areum Lee;Seung Chai Jung;Hyunna Lee;Namkug Kim;Se Jin Cho;Donghyun Kim;Jungbin Lee;Leonard Sunwoo;Dong-Wha Kang
    • Korean Journal of Radiology / v.20 no.8 / pp.1275-1284 / 2019
  • Objective: To develop algorithms using convolutional neural networks (CNNs) for automatic segmentation of acute ischemic lesions on diffusion-weighted imaging (DWI) and compare them with conventional algorithms, including a thresholding-based segmentation. Materials and Methods: Between September 2005 and August 2015, 429 patients presenting with acute cerebral ischemia (training:validation:test set = 246:89:94) were retrospectively enrolled in this study, which was performed under Institutional Review Board approval. Ground truth segmentations for acute ischemic lesions on DWI were manually drawn under the consensus of two expert radiologists. CNN algorithms were developed using a two-dimensional U-Net with squeeze-and-excitation blocks (U-Net) and a DenseNet with squeeze-and-excitation blocks (DenseNet) for automatic segmentation of acute ischemic lesions on DWI. The CNN algorithms were compared with conventional algorithms based on DWI and the apparent diffusion coefficient (ADC) signal intensity. The performances of the algorithms were assessed using the Dice index with 5-fold cross-validation. The Dice indices were analyzed according to infarct volumes (< 10 mL, ≥ 10 mL), number of infarcts (≤ 5, 6-10, ≥ 11), b-value of 1000 (b1000) signal intensities (< 50, 50-100, > 100), time intervals to DWI, and DWI protocols. Results: The CNN algorithms were significantly superior to conventional algorithms (p < 0.001). Dice indices for the CNN algorithms were 0.85 for U-Net and DenseNet and 0.86 for an ensemble of U-Net and DenseNet, while the indices were 0.58 for ADC-b1000 and b1000-ADC and 0.52 for the commercial ADC algorithm. The Dice indices for small and large lesions, respectively, were 0.81 and 0.88 with U-Net, 0.80 and 0.88 with DenseNet, and 0.82 and 0.89 with the ensemble of U-Net and DenseNet. The CNN algorithms showed significant differences in Dice indices according to infarct volumes (p < 0.001). Conclusion: The CNN algorithm for automatic segmentation of acute ischemic lesions on DWI achieved Dice indices greater than or equal to 0.85 and showed superior performance to conventional algorithms.
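
For reference, the Dice index used above is conventionally defined as 2|A∩B| / (|A| + |B|) for a predicted mask A and a ground-truth mask B. The sketch below assumes binary masks flattened into sequences of 0/1. The function name is hypothetical, and returning 1.0 when both masks are empty is a convention chosen here, not something the abstract specifies.

```python
def dice_index(pred_mask, truth_mask):
    """Dice index 2|A∩B| / (|A| + |B|) for two flat binary masks."""
    if len(pred_mask) != len(truth_mask):
        raise ValueError("masks must have the same number of voxels")
    # Voxels marked as lesion in both masks.
    overlap = sum(1 for p, t in zip(pred_mask, truth_mask) if p and t)
    # Lesion voxels in each mask, counted separately.
    total = sum(1 for p in pred_mask if p) + sum(1 for t in truth_mask if t)
    # Assumed convention: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * overlap / total
```

Because the overlap is divided by the combined mask size, a fixed number of mislabeled voxels lowers the score more for a small lesion than for a large one, which is consistent with the lower indices reported above for lesions under 10 mL.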

Image Mood Classification Using Deep CNN and Its Application to Automatic Video Generation (심층 CNN을 활용한 영상 분위기 분류 및 이를 활용한 동영상 자동 생성)

  • Cho, Dong-Hee;Nam, Yong-Wook;Lee, Hyun-Chang;Kim, Yong-Hyuk
    • Journal of the Korea Convergence Society / v.10 no.9 / pp.23-29 / 2019
  • In this paper, the mood of images was classified into eight categories using a deep convolutional neural network, and a video was automatically generated with suitable background music. The classification model was trained on the collected image data using a multilayer perceptron (MLP). The MLP performs multi-class classification to predict the mood of the images used for video generation, and the video is generated by matching pre-classified music to the predicted mood. In 10-fold cross-validation and in experiments on actual images, accuracies of 72.4% and 64% (confusion-matrix accuracy), respectively, were achieved. In cases of misclassification, the video was classified into a similar mood, and it was confirmed that the music of the video had no great mismatch with the images.

Development of a deep neural network model to estimate solar radiation using temperature and precipitation (온도와 강수를 이용하여 일별 일사량을 추정하기 위한 심층 신경망 모델 개발)

  • Kang, DaeGyoon;Hyun, Shinwoo;Kim, Kwang Soo
    • Korean Journal of Agricultural and Forest Meteorology / v.21 no.2 / pp.85-96 / 2019
  • Solar radiation is an important variable for estimating the energy balance and water cycle in natural and agricultural ecosystems. A deep neural network (DNN) model was developed to estimate daily global solar radiation. Temperature and precipitation, which are more widely available from weather stations than other variables such as sunshine duration, were used as inputs to the DNN model. Five-fold cross-validation was applied to train and test the DNN models. Meteorological data were collected at 15 weather stations in Korea over a long period (> 30 years). The DNN model obtained from the cross-validation had a relatively small RMSE ($3.75\;\mathrm{MJ}\;\mathrm{m}^{-2}\;\mathrm{d}^{-1}$) for estimates of daily solar radiation at the weather station in Suwon. The DNN model explained about 68% of the variation in observed solar radiation at the Suwon weather station. It was found that the measurements of solar radiation in 1985 and 1998 were considerably low for a short period of time compared with sunshine duration. This suggests that the quality of the solar radiation observations should be assessed in further studies. When data for those years were excluded from the analysis, the DNN model had slightly better agreement statistics. For example, the values of $R^2$ and RMSE were 0.72 and $3.55\;\mathrm{MJ}\;\mathrm{m}^{-2}\;\mathrm{d}^{-1}$, respectively. Our results indicate that a DNN would be useful for the development of a solar radiation estimation model using temperature and precipitation, which are usually available in downscaled scenario data for future climate conditions. Thus, such a DNN model would be useful for assessing the impact of climate change on crop production where solar radiation is a required input variable to a crop model.
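
The two agreement statistics reported in this abstract can be computed as in the sketch below. It assumes the common definitions RMSE = sqrt(mean squared error) and R² = 1 − SS_res / SS_tot. The abstract does not say which R² definition was used (it may instead be a squared correlation), and the function names are hypothetical.

```python
import math

def rmse(observed, estimated):
    """Root mean square error between observed and estimated values."""
    n = len(observed)
    return math.sqrt(sum((o - e) ** 2 for o, e in zip(observed, estimated)) / n)

def r_squared(observed, estimated):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: what the estimates fail to explain.
    ss_res = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    # Total sum of squares: variation of the observations around their mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

RMSE keeps the units of the estimated variable (here MJ m⁻² d⁻¹), whereas R² is unitless, so the two are usually reported together.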

Prediction of Key Variables Affecting NBA Playoffs Advancement: Focusing on 3 Points and Turnover Features (미국 프로농구(NBA)의 플레이오프 진출에 영향을 미치는 주요 변수 예측: 3점과 턴오버 속성을 중심으로)

  • An, Sehwan;Kim, Youngmin
    • Journal of Intelligence and Information Systems / v.28 no.1 / pp.263-286 / 2022
  • This study acquires NBA statistical information for a total of 32 years, from 1990 to 2022, using web crawling, observes variables of interest through exploratory data analysis, and generates related derived variables. Unused variables were removed by cleaning the input data, and correlation analysis, t-tests, and ANOVA were performed on the remaining variables. For the variables of interest, the difference in means between the teams that advanced to the playoffs and those that did not was tested, and, to complement this, the mean differences among three groups (higher/middle/lower) based on ranking were also checked. Of the input data, only the current season's data was used as a test set, and 5-fold cross-validation was performed by dividing the remaining data into training and validation sets for model training. The overfitting problem was addressed by comparing the cross-validation results with the final results on the test set to confirm that there was no difference in the performance metrics. Because the quality of the raw data is high and the statistical assumptions are satisfied, most of the models showed good results despite the small data set. This study not only predicts NBA game results or classifies whether a team advances to the playoffs using machine learning, but also examines whether the variables of interest are among the most important variables by analyzing the importance of the input attributes. Visualizing SHAP values made it possible to overcome the limited interpretability of feature importance alone and to compensate for the lack of consistency in importance calculations when variables are added or removed. A number of variables related to three-point shots and turnovers, the subjects of interest in this study, were found to be among the major variables affecting advancement to the NBA playoffs. Although this study resembles existing work in sports data analysis in that it covers topics such as match results, playoffs, and championship predictions and comparatively analyzes several machine learning models, it differs in that the features of interest are set in advance and statistically verified, and the results are then compared with those of the machine learning analysis. It also differs from existing studies by presenting explanatory visualizations using SHAP, one of the XAI methods.

A Study for Estimation of High Resolution Temperature Using Satellite Imagery and Machine Learning Models during Heat Waves (위성영상과 머신러닝 모델을 이용한 폭염기간 고해상도 기온 추정 연구)

  • Lee, Dalgeun;Lee, Mi Hee;Kim, Boeun;Yu, Jeonghum;Oh, Yeongju;Park, Jinyi
    • Korean Journal of Remote Sensing / v.36 no.5_4 / pp.1179-1194 / 2020
  • This study investigates the feasibility of three algorithms, K-Nearest Neighbors (K-NN), Random Forest (RF), and Neural Network (NN), for estimating the air temperature of unobserved areas where no weather station is installed. The satellite images were obtained from Landsat-8 and MODIS Aqua/Terra in 2019, and the ground meteorological data came from the AWS/ASOS data of the Korea Meteorological Administration and the Korea Forest Service. In addition, to improve the estimation accuracy, a digital surface model, solar radiation, aspect, and slope were used. The accuracy of the machine learning methods was assessed by calculating R² (coefficient of determination) and Root Mean Square Error (RMSE) through 10-fold cross-validation, and the estimated values were compared for each target area. As a result, the neural network algorithm showed the most stable result among the three algorithms, with R² = 0.805 and RMSE = 0.508. The neural network algorithm was applied to each data set for each Landsat imagery scene. It was possible to generate a mean air temperature map for June to September 2019, which confirmed that detailed air temperature information could be estimated. The result is expected to be utilized for national disaster safety management, such as heat wave response policies and heat island mitigation research.

Motor Imagery EEG Classification Method using EMD and FFT (EMD와 FFT를 이용한 동작 상상 EEG 분류 기법)

  • Lee, David;Lee, Hee-Jae;Lee, Sang-Goog
    • Journal of KIISE / v.41 no.12 / pp.1050-1057 / 2014
  • Electroencephalogram (EEG)-based brain-computer interfaces (BCI) can be used for a number of purposes in a variety of industries, such as to replace body parts like hands and feet or to improve user convenience. In this paper, we propose a method to decompose motor imagery EEG signals and extract features from them using Empirical Mode Decomposition (EMD) and the Fast Fourier Transform (FFT). The EEG signal classification consists of the following three steps. First, during signal decomposition, the EMD is used to generate Intrinsic Mode Functions (IMFs) from the EEG signal. Then, during feature extraction, the power spectral density (PSD) is used to identify the frequency band of the generated IMFs. The FFT is used to extract the features for motor imagery from an IMF that includes the mu rhythm. Finally, during classification, a Support Vector Machine (SVM) is used to classify the features of the motor imagery EEG signal. 10-fold cross-validation was then used to estimate the generalization capability of the classifier, and the results show that the proposed method has an accuracy of 84.50%, which is higher than that of other methods.

A Study on Predicting Cryptocurrency Distribution Prices Using Machine Learning Techniques (머신러닝 기법을 활용한 암호화폐 유통 가격 예측 연구)

  • KIM, Han-Min;KIM, Hoik
    • Journal of Distribution Science / v.17 no.11 / pp.93-101 / 2019
  • Purpose: Blockchain technology suggests ways to solve problems in existing industries. The cryptocurrency system, an element of Blockchain technology, is a very important factor in operating a Blockchain. As Blockchain cryptocurrencies have attracted attention, studies on cryptocurrency prices have been conducted; however, previous studies have mainly focused on Bitcoin prices. Although various cryptocurrencies are created and traded on the basis of the Blockchain system, little research has been done on cryptocurrencies other than Bitcoin. Hence, this study attempts to find variables related to the prices of the Dash, Litecoin, and Monero cryptocurrencies using machine learning techniques. We also attempt to find differences in the price-related variables for each cryptocurrency and to examine which machine learning techniques provide better performance. Research design, data, and methodology: This study performed price prediction analysis for Dash, Litecoin, and Monero using Blockchain information and machine learning techniques. We employed the number of transactions in the Blockchain, the amount of generated cryptocurrency, transaction fees, the number of active accounts in the Blockchain, block creation difficulty, block size, and the number of created blocks as independent variables. This study tried to ensure the reliability of the analysis results through 10-fold cross-validation. Blockchain information was hierarchically added for price prediction, and the analysis results were measured using RMSE and MAPE. Results: The analysis shows that the prices of the Dash, Litecoin, and Monero cryptocurrencies are related to Blockchain information. Also, we found that different Blockchain information improves the analysis results for each cryptocurrency.
In addition, this study found that the neural network machine learning technique provides better analysis results than support-vector machine in predicting cryptocurrency prices. Conclusion: This study concludes that the information of Blockchain should be considered for the prediction of the price of Dash, Litecoin, and Monero cryptocurrency. It also suggests that Blockchain information related to the price of cryptocurrency differs depending on the type of cryptocurrency. We suggest that future research on various types of cryptocurrencies is needed. The findings of this study can provide a theoretical basis for future cryptocurrency research in distribution management.
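
As a reminder of what the MAPE measure in this abstract captures, the sketch below computes the mean absolute percentage error under its usual definition, the mean of |actual − predicted| / |actual| expressed in percent. The function name is hypothetical, and the sketch raises an error for zero actual values, where the measure is undefined.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    # Each error is scaled by the size of the actual value it belongs to.
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)
```

Because each error is scaled by the actual price, MAPE lets prediction quality be compared across coins that trade at very different price levels, which RMSE alone does not.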

Project Failure Main Factors Analysis using Text Mining in Audit Evaluation (감리결과에 텍스트마이닝 기법을 적용한 프로젝트 실패 주요요인 분석)

  • Jang, Kyoungae;Jang, Seong Yong;Kim, Woo-Je
    • Journal of KIISE / v.42 no.4 / pp.468-474 / 2015
  • Because corporations need to respond quickly to rapid external changes, they should recognize the importance of projects, identify their failure factors, prevent risks in advance, and raise success rates. There are some previous studies on the success and failure factors of projects; however, most of them are limited in objectivity and quantitative analysis because they rely on data gathered through surveys, statistical sampling, and analysis. This study analyzes the failure factors of projects using data mining to find problems with projects in audit reports, which are objective project evaluation reports. To do this, we identified the texts in the paragraphs containing suggestions for improvement. We used three well-performing classification algorithms in this study: NaiveBayes, SMO, and J48. They were evaluated in terms of recall and precision after performing 10-fold cross-validation. The failure factors of projects were analyzed in the identified texts so that they could be utilized in project implementation.

Named Entity Recognition for Patent Documents Based on Conditional Random Fields (조건부 랜덤 필드를 이용한 특허 문서의 개체명 인식)

  • Lee, Tae Seok;Shin, Su Mi;Kang, Seung Shik
    • KIPS Transactions on Software and Data Engineering / v.5 no.9 / pp.419-424 / 2016
  • Named entity recognition is required to improve the retrieval accuracy of patent documents or similar patents in the claims and patent descriptions. In this paper, we propose an automatic named entity recognition method for patents using a conditional random field, one of the best-performing methods in machine learning research. The named entity recognition system was constructed from a training set of a tagged corpus with 660,000 words, and 70,000 words were used as a test set for evaluation. The experiment shows that the accuracy is 93.6% and the Kappa coefficient between manual tagging and the automatic tagging system is 0.67. This figure is higher than the Kappa coefficient of 0.6 for manually tagged results, which shows that an automatic named entity tagging system can be used as a practical replacement for manual tagging of patent documents.
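
The Kappa coefficient cited above corrects raw agreement for the agreement expected by chance. The sketch below assumes Cohen's kappa for two taggers, κ = (p_o − p_e) / (1 − p_e). The abstract does not say which kappa variant was used, and the function name and the example tags are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of labels."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both taggers must label the same tokens")
    n = len(labels_a)
    # Observed agreement: fraction of tokens given the same label.
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    # Chance agreement from each tagger's own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both taggers used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

For example, `cohen_kappa(["ORG", "ORG", "O", "O"], ["ORG", "O", "O", "O"])` gives 0.5: the taggers agree on 3 of 4 tokens (0.75), but 0.5 agreement is expected by chance. This is why a high accuracy such as 93.6% can coexist with a much lower kappa when one label dominates the corpus.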