• Title/Summary/Keyword: Validation data set

Development and Testing of a Machine Learning Model Using 18F-Fluorodeoxyglucose PET/CT-Derived Metabolic Parameters to Classify Human Papillomavirus Status in Oropharyngeal Squamous Carcinoma

  • Changsoo Woo;Kwan Hyeong Jo;Beomseok Sohn;Kisung Park;Hojin Cho;Won Jun Kang;Jinna Kim;Seung-Koo Lee
    • Korean Journal of Radiology
    • /
    • v.24 no.1
    • /
    • pp.51-61
    • /
    • 2023
  • Objective: To develop and test a machine learning model for classifying the human papillomavirus (HPV) status of patients with oropharyngeal squamous cell carcinoma (OPSCC) using 18F-fluorodeoxyglucose (18F-FDG) PET-derived parameters and an appropriate combination of machine learning methods. Materials and Methods: This retrospective study enrolled 126 patients (118 male; mean age, 60 years) with newly diagnosed, pathologically confirmed OPSCC who underwent 18F-FDG PET-computed tomography (CT) between January 2012 and February 2020. Patients were randomly assigned to training and internal validation sets in a 7:3 ratio. An external test set of 19 patients (16 male; mean age, 65.3 years) was recruited sequentially from two other tertiary hospitals. Model 1 used only PET parameters, Model 2 used only clinical features, and Model 3 used both PET and clinical parameters. Multiple feature transforms, feature selection methods, oversampling methods, and training models were investigated. The external test set was used to test the three models that performed best in the internal validation set. The values for area under the receiver operating characteristic curve (AUC) were compared between models. Results: In the external test set, ExtraTrees-based Model 3, which used two PET-derived parameters and three clinical features with a combination of MinMaxScaler, mutual information selection, and the adaptive synthetic sampling approach, showed the best performance (AUC = 0.78; 95% confidence interval, 0.46-1). Model 3 outperformed Model 1 using PET parameters alone (AUC = 0.48, p = 0.047) and Model 2 using clinical parameters alone (AUC = 0.52, p = 0.142) in predicting HPV status. 
Conclusion: Using oversampling and mutual information selection, an ExtraTrees-based HPV status classifier was developed by combining metabolic parameters derived from 18F-FDG PET/CT with clinical parameters in OPSCC; it exhibited higher performance than the models using either PET or clinical parameters alone.
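
The models in this abstract are compared by AUC. As a minimal sketch of what that number measures (not the authors' code), the function below computes AUC as the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive case outranks a
    random negative case; tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = HPV-positive, 0 = HPV-negative.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.6, 0.4, 0.5, 0.2]
auc = roc_auc(labels, scores)  # 5 of the 6 positive/negative pairs are ordered correctly
```

With a test set as small as the 19 patients described above, the number of positive/negative pairs is limited, so the estimate moves in coarse steps.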

Prediction of Key Variables Affecting NBA Playoffs Advancement: Focusing on 3 Points and Turnover Features (미국 프로농구(NBA)의 플레이오프 진출에 영향을 미치는 주요 변수 예측: 3점과 턴오버 속성을 중심으로)

  • An, Sehwan;Kim, Youngmin
    • Journal of Intelligence and Information Systems
    • /
    • v.28 no.1
    • /
    • pp.263-286
    • /
    • 2022
  • This study acquires NBA statistics for a total of 32 years, from 1990 to 2022, using web crawling, observes the variables of interest through exploratory data analysis, and generates related derived variables. Unused variables were removed from the input data through a cleaning process, and correlation analysis, t-tests, and ANOVA were performed on the remaining variables. For each variable of interest, the difference in means between the teams that advanced to the playoffs and those that did not was tested; to complement this, the mean differences among three groups (upper/middle/lower) based on ranking were also examined. Only the current season's data were used as the test set, and 5-fold cross-validation was performed by dividing the remaining data into training and validation sets for model training. Overfitting was ruled out by comparing the cross-validation results with the final results on the test set and confirming that the performance metrics did not differ. Because the quality of the raw data is high and the statistical assumptions are satisfied, most of the models showed good results despite the small data set. This study not only predicts NBA game results or classifies playoff advancement using machine learning, but also examines, through the importance of the input attributes, whether the variables of interest are among the major variables with high importance. Visualizing SHAP values made it possible to overcome the limitation that the results could not be interpreted from feature importance alone, and to compensate for the lack of consistency in importance calculations when variables are added or removed. A number of variables related to three-point shots and turnovers, the subjects of interest in this study, were found to be among the major variables affecting advancement to the NBA playoffs. 
Like existing work in sports data analysis, this study addresses topics such as match results, playoffs, and championship predictions, and comparatively analyzes several machine learning models. It differs in that the features of interest are set in advance and statistically verified, and then compared with the machine learning results. It is also differentiated from existing studies by presenting explanatory visualizations using SHAP, one of the XAI methods.
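
The 5-fold cross-validation described in this abstract can be sketched as follows. This is an illustration under assumptions, not the authors' code: the helper name `k_fold_indices`, the seed, and the sample count of 23 are hypothetical, and the sketch only produces index splits for the data that remain after the test season has been held out.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for k-fold cross-validation.

    Indices are shuffled once, dealt into k folds, and each fold serves
    as the validation set exactly once.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = sorted(folds[i])
        train = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train, val


# Hypothetical: 23 training rows remain after holding out the test season.
splits = list(k_fold_indices(23, k=5))
```

Each model would be fitted on `train` and scored on `val` for every split, and the averaged score then compared with the score on the held-out test season.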

Development, Demonstration and Validation of the Deep Space Orbit Determination Software Using Lunar Prospector Tracking Data

  • Lee, Eunji;Kim, Youngkwang;Kim, Minsik;Park, Sang-Young
    • Journal of Astronomy and Space Sciences
    • /
    • v.34 no.3
    • /
    • pp.213-223
    • /
    • 2017
  • The deep space orbit determination software (DSODS) is part of the flight dynamics subsystem (FDS) for the Korean Pathfinder Lunar Orbiter (KPLO), a lunar exploration mission expected to launch after 2018. The DSODS consists of several submodules, of which the orbit determination (OD) module employs a weighted least squares algorithm for estimating the parameters related to the motion and the tracking system of the spacecraft, together with subroutines for performance improvement and detailed analysis of the orbit solution. In this research, DSODS is demonstrated and validated in a lunar orbit at an altitude of 100 km using actual Lunar Prospector tracking data. A set of a priori states is generated, and the robustness of DSODS to a priori error is confirmed against the NASA Planetary Data System (PDS) orbit solutions. Furthermore, the accuracy of the orbit solutions is determined by solution comparison and overlap analysis to be on the order of tens of meters. These analyses demonstrate the ability of the DSODS to provide proper orbit solutions for the KPLO.
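
The weighted least squares idea named in this abstract can be illustrated on a much simpler problem. The sketch below fits a straight line with each observation weighted by the inverse of its noise variance, by solving the 2x2 normal equations. It is a toy linear case under stated assumptions, not the DSODS algorithm: real orbit determination is nonlinear and iterates on a linearized model. The function name and the sample data are hypothetical.

```python
def weighted_least_squares_line(t, y, sigma):
    """Fit y ~ a + b*t, weighting each observation by 1/sigma**2.

    Solves the normal equations
        [S_w  S_t ] [a]   [S_y ]
        [S_t  S_tt] [b] = [S_ty]
    in closed form.
    """
    w = [1.0 / s ** 2 for s in sigma]
    s_w = sum(w)
    s_t = sum(wi * ti for wi, ti in zip(w, t))
    s_tt = sum(wi * ti * ti for wi, ti in zip(w, t))
    s_y = sum(wi * yi for wi, yi in zip(w, y))
    s_ty = sum(wi * ti * yi for wi, ti, yi in zip(w, t, y))
    det = s_w * s_tt - s_t ** 2
    if det == 0:
        raise ValueError("observations do not determine both parameters")
    a = (s_tt * s_y - s_t * s_ty) / det
    b = (s_w * s_ty - s_t * s_y) / det
    return a, b


# Hypothetical noise-free observations of y = 2 + 3t with unequal noise levels.
a, b = weighted_least_squares_line(
    t=[0.0, 1.0, 2.0, 3.0],
    y=[2.0, 5.0, 8.0, 11.0],
    sigma=[1.0, 2.0, 1.0, 0.5],
)
```

Observations with a larger `sigma` pull the fit less, which is how a tracking system can combine measurements of differing quality.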

Korean Goral Potential Habitat Suitability Model at Soraksan National Park Using Fuzzy Set and Multi-Criteria Evaluation (설악산국립공원내 산양(Nemorhaedus Caudatus Raddeanus)의 잠재 서식지 적합성 모형; 다기준평가기법(MCE)과 퍼지집합(Fuzzy Set)의 도입을 통하여)

  • Choi Tae-Young;Park Chong-Hwa
    • Journal of the Korean Institute of Landscape Architecture
    • /
    • v.32 no.4
    • /
    • pp.28-38
    • /
    • 2004
  • The Korean goral (Nemorhaedus caudatus raddeanus) is an endangered species in Korea, and the rugged terrain of Soraksan National Park (373㎢) is a critical habitat for the species. However, the goral population is threatened by habitat fragmentation caused by roads and hiking trails. The objective of this study was to develop a potential habitat suitability model for the Korean goral in the park, based on the concepts of fuzzy set theory and multi-criteria evaluation. The suitability modeling process was divided into three steps. First, data for the modeling were collected through field work and a literature survey. The collected data included 204 GPS points obtained through a goral trace survey and the number of daily visitors to each hiking trail during the park's peak season. Second, fuzzy set theory was employed to build a GIS database of the environmental factors affecting the suitability of goral habitat. Finally, a multi-criteria evaluation was performed to produce the goral habitat suitability model. The results of the study were as follows. First, suitable habitats were characterized by proximity to rock cliffs, scattered pine (Pinus densiflora) patches, ridges, elevations of 700-800 m, and south and southeast aspects. Second, at a cutoff value of 0.5, the habitat suitability model had a high classification accuracy of 93.9% for the analysis site and 95.7% for the validation site. Finally, 11.7% of the habitat with a habitat suitability index above 0.5 was affected by roads and hiking trails in the park.
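
The combination of fuzzy membership functions and multi-criteria evaluation described in this abstract can be sketched as below. This is a minimal illustration, not the authors' model: the membership shapes, breakpoints, weights, and the two factors chosen are all hypothetical, except that the 700-800 m elevation band and the 0.5 cutoff are taken from the abstract. The aggregation used here is a weighted linear combination, one common form of multi-criteria evaluation.

```python
def trapezoid(x, a, b, c, d):
    """Fuzzy membership: 0 outside (a, d), 1 on [b, c], linear in between.
    Assumes a < b <= c < d."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


def linear_decreasing(x, full, zero):
    """Membership 1 at or below `full`, 0 at or above `zero`, linear between."""
    if x <= full:
        return 1.0
    if x >= zero:
        return 0.0
    return (zero - x) / (zero - full)


def suitability(memberships, weights):
    """Weighted linear combination of factor memberships, normalized to [0, 1]."""
    total = sum(weights.values())
    return sum(weights[k] * memberships[k] for k in weights) / total


# Hypothetical grid cell and hypothetical factor weights.
cell = {"elevation_m": 750.0, "cliff_distance_m": 100.0}
memberships = {
    "elevation": trapezoid(cell["elevation_m"], 400.0, 700.0, 800.0, 1200.0),
    "cliff": linear_decreasing(cell["cliff_distance_m"], 50.0, 550.0),
}
weights = {"elevation": 0.4, "cliff": 0.6}
score = suitability(memberships, weights)
is_suitable = score > 0.5  # cutoff value reported in the abstract
```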

A Study on Big Data Based Investment Strategy Using Internet Search Trends (인터넷 검색추세를 활용한 빅데이터 기반의 주식투자전략에 대한 연구)

  • Kim, Minsoo;Koo, Pyunghoi
    • Journal of the Korean Operations Research and Management Science Society
    • /
    • v.38 no.4
    • /
    • pp.53-63
    • /
    • 2013
  • Together with soaring interest in Big Data, a growing number of reports from various application areas are uncovering the social value hidden in such data. Many of these reports use data such as Internet search histories from Google, social relationships from Facebook, and transactional or locational traces collected from various ubiquitous devices. Much of this research, however, is based on data sets accumulated in North America and Europe, so directly interpreting and applying its findings to other regions such as Korea can be problematic. This research began as a validation study, in the Korean environment, of an earlier paper which claims that an investment strategy exploiting the rises and falls of Google search volume for a carefully selected set of terms shows high market performance. A large difference between the North American and Korean environments can be seen in the profit rates produced by the corresponding sets of search terms: the two sets of search terms showed low correlation in their profit rates across the two financial markets. Even an experiment comparing profit rates over two different investment periods with the same set of search terms showed no meaningful result that outperformed the market average. Based on these results, we cautiously conclude that establishing an investment strategy that exploits Internet search volume for a specified word set requires a more careful approach.

A design of VHDL compiler front-end for the VHDL-to-C mapping (VHDL-to-C 사상을 위한 VHDL 컴파일러 전반부의 설계)

  • 공진흥;고형일
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.22 no.12
    • /
    • pp.2834-2851
    • /
    • 1997
  • This paper describes the design and implementation of a VHDL compiler front-end that aims to support the full set of the VHDL '87 & '93 LRM and to carry out the preprocessing for VHDL-to-C mapping. The VHDL compiler front-end includes 1) the symbol tree of analyzed data to represent the hierarchy, the scope and visibility, the overloading and homographs, and the concurrent multiple stacks in VHDL; 2) the data structures and supporting routines to deal with the objects, the types and subtypes, and the attributes and operations in VHDL; and 3) the analysis of the concurrent/sequential statements and the behavioral/structural descriptions, together with the semantic token and the propagation of symbol & type to improve the registration and retrieval procedure of analyzed data. In the experiments with the Validation Suite, the VHDL compiler front-end could support the full-set specification of the VHDL LRM '87 & '93; in the experiments to assess the performance of the semantic token for VHDL hierarchy/visibility/concurrency/semantic checking, an improvement of about 20-30% was achieved.

Korean Sentence Generation Using Phoneme-Level LSTM Language Model (한국어 음소 단위 LSTM 언어모델을 이용한 문장 생성)

  • Ahn, SungMahn;Chung, Yeojin;Lee, Jaejoon;Yang, Jiheon
    • Journal of Intelligence and Information Systems
    • /
    • v.23 no.2
    • /
    • pp.71-88
    • /
    • 2017
  • Language models were originally developed for speech recognition and language processing. Using a set of example sentences, a language model predicts the next word or character based on sequential input data. N-gram models have been widely used, but they cannot model the correlation between input units efficiently because they are probabilistic models based on the frequency of each unit in the training set. Recently, with the development of deep learning algorithms, recurrent neural network (RNN) models and long short-term memory (LSTM) models have been widely used for neural language models (Ahn, 2016; Kim et al., 2016; Lee et al., 2016). These models can reflect dependency between the objects that are entered sequentially into the model (Gers and Schmidhuber, 2001; Mikolov et al., 2010; Sundermeyer et al., 2012). In order to train a neural language model, texts need to be decomposed into words or morphemes. However, since a training set of sentences generally includes a huge number of words or morphemes, the dictionary is very large, which increases model complexity. In addition, word-level or morpheme-level models can generate only the vocabulary contained in the training set. Furthermore, with highly morphological languages such as Turkish, Hungarian, Russian, Finnish, or Korean, morpheme analyzers are more likely to cause errors in the decomposition process (Lankinen et al., 2016). Therefore, this paper proposes a phoneme-level language model for Korean based on LSTM models. A phoneme, such as a vowel or a consonant, is the smallest unit that comprises Korean texts. We construct the language model using three or four LSTM layers. Each model was trained using the stochastic gradient algorithm and more advanced optimization algorithms such as Adagrad, RMSprop, Adadelta, Adam, Adamax, and Nadam. The simulation study was conducted on Old Testament texts using the deep learning package Keras based on Theano. 
After pre-processing the texts, the dataset included 74 unique characters, including vowels, consonants, and punctuation marks. We then constructed an input vector from 20 consecutive characters and an output from the following 21st character. In total, 1,023,411 input-output vector pairs were included in the dataset, and we divided them into training, validation, and test sets in the proportion 70:15:15. All the simulations were conducted on a system equipped with an Intel Xeon CPU (16 cores) and an NVIDIA GeForce GTX 1080 GPU. We compared the loss function evaluated on the validation set, the perplexity evaluated on the test set, and the time taken to train each model. As a result, all the optimization algorithms except the stochastic gradient algorithm showed similar validation loss and perplexity, which were clearly superior to those of the stochastic gradient algorithm. The stochastic gradient algorithm also took the longest to train for both the 3- and 4-LSTM-layer models. On average, the 4-LSTM-layer model took 69% longer to train than the 3-LSTM-layer model. However, the validation loss and perplexity did not improve significantly, and even became worse under specific conditions. On the other hand, when comparing the automatically generated sentences, the 4-LSTM-layer model tended to generate sentences closer to natural language than the 3-LSTM-layer model. Although there were slight differences in the completeness of the generated sentences between the models, the sentence generation performance was quite satisfactory under all simulation conditions: the models generated only legitimate Korean letters, and the use of postpositions and the conjugation of verbs were almost perfectly grammatical. The results of this study are expected to be widely used for the processing of Korean in the fields of language processing and speech recognition, which are the basis of artificial intelligence systems.
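
The data preparation described in this abstract (20 consecutive characters as input, the 21st as target, then a 70:15:15 split) and the perplexity metric can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: the function names and the toy text are hypothetical, the split is taken contiguously because the abstract does not say whether samples were shuffled, and perplexity is computed with its usual definition as the exponential of the mean negative log-probability of the true next characters.

```python
import math


def make_windows(text, context=20):
    """Return (input, target) pairs: `context` consecutive characters
    and the single character that follows them."""
    return [(text[i:i + context], text[i + context])
            for i in range(len(text) - context)]


def split_70_15_15(samples):
    """Contiguous 70:15:15 split using integer arithmetic;
    the test set receives any remainder."""
    n = len(samples)
    n_train = n * 70 // 100
    n_val = n * 15 // 100
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])


def perplexity(true_char_probs):
    """exp of the mean negative log-probability assigned to the true characters."""
    nll = -sum(math.log(p) for p in true_char_probs) / len(true_char_probs)
    return math.exp(nll)


# Hypothetical toy text standing in for the phoneme-decomposed corpus.
text = "abcdefghijklmnopqrstuvwxyz" * 5
windows = make_windows(text)
train, val, test = split_70_15_15(windows)

# A model that guesses uniformly over 74 characters has perplexity 74.
uniform_ppl = perplexity([1.0 / 74] * 10)
```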

A Metrics Set for Measuring Software Module Severity (소프트웨어 모듈 심각도 측정을 위한 메트릭 집합)

  • Hong, Euy-Seok
    • Journal of the Korea Society of Computer and Information
    • /
    • v.20 no.1
    • /
    • pp.197-206
    • /
    • 2015
  • Defect severity, a measure of the impact caused by a defect, plays an important role in software quality activities because not all software defects are equal. Earlier studies have concentrated on defining defect severity levels, but there have been almost no attempts to measure module severity. In this paper, we first define a defect severity metric in the form of an exponential function, using the characteristic that defect severity values increase much faster than severity levels. We then define a new metrics set for software module severity using the number of defects in a module and their defect severity metric values. To show the applicability of the proposed metrics, we performed an analytical validation using Weyuker's properties and an experimental validation using NASA open data sets. The results show that ms is very useful for measuring module severity and that msd can be used to compare different systems in terms of module severity.

Nondestructive Prediction of Fatty Acid Composition in Sesame Seeds by Near Infrared Reflectance Spectroscopy

  • Kim, Kwan-Su;Park, Si-Hyung;Choung, Myoung-Gun;Kim, Sun-Lim
    • KOREAN JOURNAL OF CROP SCIENCE
    • /
    • v.51 no.spc1
    • /
    • pp.304-309
    • /
    • 2006
  • Near infrared reflectance spectroscopy (NIRS) was used to develop a rapid and nondestructive method for the determination of fatty acid composition in sesame (Sesamum indicum L.) seed oil. A total of ninety-three samples of intact seeds were scanned in the reflectance mode of a scanning monochromator, and reference values for fatty acid composition were measured by gas-liquid chromatography. Calibration equations were developed using modified partial least squares regression with internal cross-validation (n=63). The equations obtained had low standard errors of cross-validation and moderate $R^2$ (coefficient of determination in calibration). Prediction of an external validation set (n=30) showed significant correlation between reference values and NIRS-estimated values based on the SEP (standard error of prediction), $r^2$ (coefficient of determination in prediction), and the ratio of the standard deviation (SD) of reference data to SEP. The models developed in this study had relatively high values (more than 2.0) of SD/SEP(C) for oleic and linoleic acid, indicating good correlation between reference and NIRS estimates. The results indicated that NIRS, a nondestructive screening method, could be used to rapidly determine fatty acid composition in sesame seeds in breeding programs for high-quality sesame oil.
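
The external-validation statistics named in this abstract can be computed as in the sketch below. It is a minimal illustration, not the authors' procedure: it assumes SEP(C) means the bias-corrected standard error of prediction with an n - 1 denominator and that $r^2$ is the squared Pearson correlation, which are common conventions in NIRS work but are not spelled out in the abstract. The function name and the sample values are hypothetical.

```python
import math
import statistics


def prediction_stats(reference, predicted):
    """Bias, bias-corrected SEP, squared correlation, and SD/SEP(C)."""
    n = len(reference)
    residuals = [p - r for r, p in zip(reference, predicted)]
    bias = sum(residuals) / n
    sep_c = math.sqrt(sum((e - bias) ** 2 for e in residuals) / (n - 1))

    mean_r = sum(reference) / n
    mean_p = sum(predicted) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(reference, predicted))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r2 = cov ** 2 / (var_r * var_p)

    sd = statistics.stdev(reference)  # sample SD of the reference values
    return {"bias": bias, "sep_c": sep_c, "r2": r2, "sd_over_sep": sd / sep_c}


# Hypothetical oleic acid contents (%): laboratory reference vs. NIRS estimate.
reference = [40.0, 42.0, 44.0, 46.0, 48.0]
predicted = [41.0, 42.0, 45.0, 46.0, 49.0]
stats = prediction_stats(reference, predicted)
```

A ratio above 2.0, the threshold quoted in the abstract, means the spread of the reference values is more than twice the typical prediction error.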

A Comparative Study on Arrhenius-Type Constitutive Models with Regression Methods

  • Lee, Kyunghoon;Murugesan, Mohanraj;Lee, Seung-Min;Kang, Beom-Soo
    • Transactions of Materials Processing
    • /
    • v.26 no.1
    • /
    • pp.18-27
    • /
    • 2017
  • A comparative study was performed on strain-compensated Arrhenius-type constitutive models established with two regression methods: polynomial regression and regression Kriging. For measurements at high temperatures, experimental data for 70Cr3Mo steel were adopted from previous research. An Arrhenius-type constitutive model requires strain compensation of its material constants to account for the effect of strain. To associate the material constants with strain, we first evaluated them at a set of discrete strains and then used surrogate modeling to represent the material constants as functions of strain. As a result, two distinct flow stress models were formed via the two regression methods. The constructed constitutive models were examined systematically against measured flow stresses using validation methods. The predicted material constants were found to be quite accurate compared with the actual material constants. However, notable mismatches between measured and predicted flow stresses were revealed by the proposed validation techniques, which carry out validation with a single tensile test case rather than the entire set.
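
For readers unfamiliar with Arrhenius-type constitutive models, the sketch below uses the widely used hyperbolic-sine form, in which the strain rate equals A * sinh(alpha * stress)**n * exp(-Q / (R * T)) and the flow stress follows by inverting that relation through the Zener-Hollomon parameter Z. The abstract does not state the exact form the authors used, so this form is an assumption, and every constant below is a hypothetical placeholder rather than a fitted value for 70Cr3Mo steel. In a strain-compensated model, each constant would itself be evaluated from a fitted function of strain.

```python
import math

GAS_CONSTANT = 8.314  # J/(mol*K)


def flow_stress(strain_rate, temperature_k, A, alpha, n, Q):
    """Flow stress from the sinh-type Arrhenius law.

    Z = strain_rate * exp(Q / (R*T)) and stress = asinh((Z/A)**(1/n)) / alpha.
    """
    z = strain_rate * math.exp(Q / (GAS_CONSTANT * temperature_k))
    return math.asinh((z / A) ** (1.0 / n)) / alpha


def strain_rate_from_stress(stress, temperature_k, A, alpha, n, Q):
    """Forward form of the same law, used here as a consistency check."""
    return (A * math.sinh(alpha * stress) ** n
            * math.exp(-Q / (GAS_CONSTANT * temperature_k)))


# Hypothetical material constants (alpha in 1/MPa, Q in J/mol, A in 1/s).
constants = {"A": 1.0e12, "alpha": 0.01, "n": 5.0, "Q": 300.0e3}

stress_1273 = flow_stress(1.0, 1273.0, **constants)   # stress in MPa at 1273 K
stress_1373 = flow_stress(1.0, 1373.0, **constants)   # hotter, so softer
rate_back = strain_rate_from_stress(stress_1273, 1273.0, **constants)
```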