• Title/Summary/Keyword: cross validation


Improvements for Atmospheric Motion Vectors Algorithm Using First Guess by Optical Flow Method (옵티컬 플로우 방법으로 계산된 초기 바람 추정치에 따른 대기운동벡터 알고리즘 개선 연구)

  • Oh, Yurim;Park, Hyungmin;Kim, Jae Hwan;Kim, Somyoung
    • Korean Journal of Remote Sensing / v.36 no.5_1 / pp.763-774 / 2020
  • Wind data forecast by a numerical weather prediction (NWP) model is generally used as the first guess in the target-tracking process for deriving atmospheric motion vectors (AMVs), because it increases tracking accuracy and reduces computation time. However, this creates a contradiction: the NWP model used as the first guess is used again as the reference in AMV verification. Overcoming this problem requires a model-independent first guess. In this study, we propose deriving AMVs with the Lucas-Kanade optical flow method and using the result as the first guess. AMVs were retrieved from Himawari-8/AHI geostationary satellite level-1B data at 00, 06, 12, and 18 UTC from August 19 to September 5, 2015. To evaluate the impact of the optical flow method on AMV derivation, cross-validation was conducted in three ways: (1) without a first guess, (2) with NWP (KMA/UM) forecast wind as the first guess, and (3) with optical-flow-based wind as the first guess. Verification against ECMWF ERA-Interim reanalysis data showed that the highest precision (RMSVD: 5.296-5.804 m/s) was obtained using optical-flow-based winds as the first guess. In addition, AMV derivation was slowest without a first guess, while the other two configurations performed similarly. These results show that the optical flow method is very effective as a first guess for model-independent AMV derivation.
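The Lucas-Kanade step at the heart of the proposed first guess can be sketched as a single least-squares solve over image gradients; this numpy-only illustration recovers one displacement for a whole patch, whereas the operational AMV algorithm tracks many targets (the function name, window handling, and test pattern are assumptions, not the authors' code):

```python
import numpy as np

def lucas_kanade_shift(img1, img2):
    """Estimate a single (dy, dx) displacement between two image patches
    by solving the Lucas-Kanade least-squares system over all pixels."""
    # Spatial gradients of the first image (central differences)
    Iy, Ix = np.gradient(img1.astype(float))
    # Temporal gradient between the two frames
    It = img2.astype(float) - img1.astype(float)
    # Brightness constancy: Ix*u + Iy*v = -It, stacked over every pixel
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return v, u  # (dy, dx)
```

In practice the operational tracker would apply this solve within many small target windows and iterate with image pyramids; the sketch shows only the core linearized estimate.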

The Effect of Meta-Features of Multiclass Datasets on the Performance of Classification Algorithms (다중 클래스 데이터셋의 메타특징이 판별 알고리즘의 성능에 미치는 영향 연구)

  • Kim, Jeonghun;Kim, Min Yong;Kwon, Ohbyung
    • Journal of Intelligence and Information Systems / v.26 no.1 / pp.23-45 / 2020
  • Big data is being created in a wide variety of fields such as medical care, manufacturing, logistics, sales, and SNS, and dataset characteristics are equally diverse. To stay competitive, companies need to improve their decision-making capacity using classification algorithms, yet most lack sufficient knowledge about which classification algorithm suits a given problem area. In other words, determining the appropriate algorithm for a dataset's characteristics has been a task requiring expertise and effort, because the relationship between dataset characteristics (called meta-features) and classifier performance has not been fully understood. Moreover, there has been little research on meta-features reflecting the characteristics of multi-class data. The purpose of this study is therefore to empirically analyze whether meta-features of multi-class datasets significantly affect the performance of classification algorithms. Meta-features were grouped into two factors (data structure and data complexity), and seven representative meta-features were selected. Among them, the Herfindahl-Hirschman Index (HHI), originally a market-concentration measure, was included to replace the imbalance ratio (IR), and a new index called the Reverse ReLU Silhouette Score was developed and added to the meta-feature set. Six representative datasets from the UCI Machine Learning Repository (Balance Scale, PageBlocks, Car Evaluation, User Knowledge-Modeling, Wine Quality (red), Contraceptive Method Choice) were selected, and each was classified with the algorithms chosen for the study (KNN, Logistic Regression, Naïve Bayes, Random Forest, and SVM) under 10-fold cross-validation. Oversampling from 10% to 100% was applied to each fold, and the meta-features of the dataset were measured: HHI, number of classes, number of features, entropy, Reverse ReLU Silhouette Score, nonlinearity of a linear classifier, and hub score. F1-score was the dependent variable. The results showed that six meta-features, including the proposed Reverse ReLU Silhouette Score and HHI, have a significant effect on classification performance: (1) the proposed HHI was significant; (2) unlike the number of classes, the number of variables had a significant positive effect; (3) the number of classes had a negative effect; (4) entropy had a significant effect; (5) the Reverse ReLU Silhouette Score was significant at the 0.01 level; and (6) the nonlinearity of linear classifiers had a significant negative effect. The analyses by individual classification algorithm were consistent, except that for Naïve Bayes the number of variables was not significant. The study makes two theoretical contributions: (1) two new meta-features (HHI and the Reverse ReLU Silhouette Score) were shown to be significant, and (2) the effects of data characteristics on classification performance were investigated through meta-features. Practically, (1) the results can support a system that recommends classification algorithms according to dataset characteristics, and (2) because data scientists often search for the optimal algorithm by repeatedly tuning parameters, wasting hardware, cost, time, and manpower, the results can reduce such waste. The study is expected to be useful for machine learning and data mining researchers, practitioners, and developers of machine-learning-based systems. It is organized into introduction, related research, research model, experiment, and conclusion and discussion.
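The HHI meta-feature borrowed from market-concentration analysis can be computed directly from the class labels; a minimal sketch (the function name and interface are my own, not the paper's):

```python
from collections import Counter

def herfindahl_hirschman_index(labels):
    """Class-imbalance meta-feature: the sum of squared class shares.
    Equals 1/k for a perfectly balanced k-class dataset and approaches
    1 as a single class dominates."""
    counts = Counter(labels)
    n = len(labels)
    return sum((c / n) ** 2 for c in counts.values())
```

Because the index rises monotonically with concentration, it can stand in for an imbalance ratio while remaining well defined for any number of classes.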

Estimation of Near Surface Air Temperature Using MODIS Land Surface Temperature Data and Geostatistics (MODIS 지표면 온도 자료와 지구통계기법을 이용한 지상 기온 추정)

  • Shin, HyuSeok;Chang, Eunmi;Hong, Sungwook
    • Spatial Information Research / v.22 no.1 / pp.55-63 / 2014
  • Near-surface air temperature, one of the essential variables in hydrology, meteorology, and climatology, has drawn substantial attention from various academic domains and societies. Meteorological observations, however, face strong spatio-temporal constraints owing to the limited number and distribution of stations over the earth's surface. To overcome these limits, many studies have estimated near-surface air temperature from satellite imagery at regional or continental scales using simple regression methods. Alternatively, we applied various Kriging methods (ordinary Kriging, universal Kriging, cokriging, and regression Kriging) in search of an optimal estimation method, based on near-surface air temperatures observed at automatic weather stations (AWS) in South Korea throughout 2010 (365 days) and MODIS land surface temperature (LST) data (MOD11A1, 365 images). Because of high spatial heterogeneity, auxiliary data such as land cover and a DEM (digital elevation model) were also analyzed to account for factors that can affect near-surface air temperature. Prior to the main estimation, we calculated the root mean square error (RMSE) of temperature differences between the 365-day LST and AWS data by season and land cover. The coefficient of variation (CV) of the RMSE by season is 0.86, while the CV by land cover is 0.00746; differences between LST and AWS data were thus greater by season than by land cover. Seasonal RMSE was lowest in winter (3.72). A linear regression analysis of the relationship among AWS, LST, and the auxiliary data showed that the coefficient of determination was highest in winter (0.818) and lowest in summer (0.078), indicating a significant level of seasonal variation. Based on these results, we applied the Kriging techniques to estimate surface temperature.
The results of cross-validation in each Kriging model show that the measure of model accuracy was 1.71, 1.71, 1.848, and 1.630 for universal Kriging, ordinary Kriging, cokriging, and regression Kriging, respectively. The estimates from regression Kriging thus proved to be the most accurate among the Kriging methods compared.
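The leave-one-out style cross-validation behind these accuracy scores can be sketched as follows; note that for brevity a simple inverse-distance-weighting interpolator stands in for the Kriging models, so this illustrates the validation loop, not the paper's estimators (all names and the stand-in predictor are assumptions):

```python
import numpy as np

def idw(coords, values, target, power=2.0):
    """Inverse-distance weighting: a simple stand-in spatial interpolator."""
    d = np.linalg.norm(coords - target, axis=1)
    w = 1.0 / np.maximum(d, 1e-12) ** power
    return float(np.sum(w * values) / np.sum(w))

def loo_rmse(coords, values, predict):
    """Leave-one-out cross-validation RMSE: each station is withheld in
    turn and predicted from the remaining stations."""
    n = len(values)
    errs = []
    for i in range(n):
        mask = np.arange(n) != i
        pred = predict(coords[mask], values[mask], coords[i])
        errs.append(pred - values[i])
    return float(np.sqrt(np.mean(np.square(errs))))
```

Running this loop with each candidate interpolator and comparing the resulting RMSE values is exactly the kind of comparison summarized above for the four Kriging variants.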

Monitoring Ground-level SO2 Concentrations Based on a Stacking Ensemble Approach Using Satellite Data and Numerical Models (위성 자료와 수치모델 자료를 활용한 스태킹 앙상블 기반 SO2 지상농도 추정)

  • Choi, Hyunyoung;Kang, Yoojin;Im, Jungho;Shin, Minso;Park, Seohui;Kim, Sang-Min
    • Korean Journal of Remote Sensing / v.36 no.5_3 / pp.1053-1066 / 2020
  • Sulfur dioxide (SO2) is primarily released through industrial, residential, and transportation activities, and creates secondary air pollutants through chemical reactions in the atmosphere. Long-term exposure to SO2 can harm the human body, causing respiratory or cardiovascular disease, which makes effective and continuous monitoring of SO2 crucial. In South Korea, SO2 is monitored at ground stations, but this does not provide spatially continuous information on SO2 concentrations. This research therefore estimated spatially continuous ground-level SO2 concentrations at 1 km resolution over South Korea through the synergistic use of satellite data and numerical models. A stacking ensemble approach, fusing multiple machine learning algorithms at two levels (base and meta), was adopted for ground-level SO2 estimation using data from January 2015 to April 2019. Random forest and extreme gradient boosting were used as base models, and multiple linear regression was adopted as the meta-model. Cross-validation showed that the meta-model improved performance by 25% over the base models, yielding a correlation coefficient of 0.48 and a root-mean-square error of 0.0032 ppm. In addition, the temporal transferability of the approach was evaluated on one year of data not used in model development. The spatial distribution of ground-level SO2 concentrations from the proposed model agreed with the general seasonality of SO2 and the temporal patterns of emission sources.
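The two-level stacking scheme above can be sketched in a few lines; this is an illustrative numpy-only version in which ordinary least squares and a global mean stand in for the paper's random forest and extreme gradient boosting base models (all names and the toy learners are assumptions):

```python
import numpy as np

def ols_base(Xtr, ytr):
    """Base learner 1: ordinary least squares (stand-in for random forest)."""
    w, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
    return lambda Xte: Xte @ w

def mean_base(Xtr, ytr):
    """Base learner 2: global mean (stand-in for gradient boosting)."""
    m = ytr.mean()
    return lambda Xte: np.full(len(Xte), m)

def fit_stacking(X, y, k=5, seed=0):
    """Two-level stacking: out-of-fold base predictions become the
    inputs of a linear-regression meta-model."""
    bases = [ols_base, mean_base]
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    meta_X = np.zeros((n, len(bases)))
    for fold in folds:
        mask = np.ones(n, bool)
        mask[fold] = False
        for j, base in enumerate(bases):
            # predictions for the held-out fold come from models that never saw it
            meta_X[fold, j] = base(X[mask], y[mask])(X[fold])
    # meta-model: linear regression (with intercept) on the stacked predictions
    w, *_ = np.linalg.lstsq(np.c_[meta_X, np.ones(n)], y, rcond=None)
    fitted = [base(X, y) for base in bases]  # refit bases on all data
    def predict(Xte):
        P = np.column_stack([f(Xte) for f in fitted])
        return np.c_[P, np.ones(len(Xte))] @ w
    return predict
```

The key design point, preserved from the paper's setup, is that the meta-model is trained only on out-of-fold base predictions, which prevents it from simply memorizing base-model overfitting.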

Development of Prediction Equation of Diffusing Capacity of Lung for Koreans

  • Hwang, Yong Il;Park, Yong Bum;Yoon, Hyoung Kyu;Lim, Seong Yong;Kim, Tae-Hyung;Park, Joo Hun;Lee, Won-Yeon;Park, Seong Ju;Lee, Sei Won;Kim, Woo Jin;Kim, Ki Uk;Shin, Kyeong Cheol;Kim, Do Jin;Kim, Hui Jung;Kim, Tae-Eun;Yoo, Kwang Ha;Shim, Jae Jeong
    • Tuberculosis and Respiratory Diseases / v.81 no.1 / pp.42-48 / 2018
  • Background: The diffusing capacity of the lung is influenced by multiple factors such as age, sex, height, weight, ethnicity, and smoking status. Although a prediction equation for diffusing capacity in Korea was proposed in the mid-1980s, it is no longer used. The aim of this study was to develop a new prediction equation for the diffusing capacity of Koreans. Methods: Using data from the Korean National Health and Nutrition Examination Survey, a total of 140 nonsmokers with normal chest X-rays were enrolled in this study. Results: New prediction equations for diffusing capacity were developed by linear regression analysis. For men: carbon monoxide diffusing capacity (DLco) = -10.4433 - 0.1434 × age (years) + 0.2482 × height (cm); DLco/alveolar volume (VA) = 6.01507 - 0.02374 × age (years) - 0.00233 × height (cm). For women: DLco = -12.8895 - 0.0532 × age (years) + 0.2145 × height (cm); DLco/VA = 7.69516 - 0.02219 × age (years) - 0.01377 × height (cm). All equations were internally validated by the k-fold cross-validation method. Conclusion: In this study, we developed new prediction equations for the diffusing capacity of the lungs of Koreans. A further study is needed to validate the new prediction equations.
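The reported equations can be wrapped directly in a small function; the interface below is my own, but the coefficients are exactly those given in the abstract:

```python
def dlco_predicted(age_yr, height_cm, sex):
    """Predicted DLco and DLco/VA from the reference equations reported
    in the abstract (Korean nonsmokers with normal chest X-rays)."""
    if sex == "male":
        dlco = -10.4433 - 0.1434 * age_yr + 0.2482 * height_cm
        dlco_va = 6.01507 - 0.02374 * age_yr - 0.00233 * height_cm
    elif sex == "female":
        dlco = -12.8895 - 0.0532 * age_yr + 0.2145 * height_cm
        dlco_va = 7.69516 - 0.02219 * age_yr - 0.01377 * height_cm
    else:
        raise ValueError("sex must be 'male' or 'female'")
    return dlco, dlco_va
```

For example, a 40-year-old man of 170 cm has a predicted DLco of -10.4433 - 0.1434·40 + 0.2482·170 = 26.0147.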

Korean Word Sense Disambiguation using Dictionary and Corpus (사전과 말뭉치를 이용한 한국어 단어 중의성 해소)

  • Jeong, Hanjo;Park, Byeonghwa
    • Journal of Intelligence and Information Systems / v.21 no.1 / pp.1-13 / 2015
  • As opinion mining in big data applications has been highlighted, much research on unstructured data has been conducted. Social media on the Internet generate unstructured or semi-structured data every second, often in the natural languages we use in daily life. Many words in human languages have multiple meanings or senses, which makes it very difficult for computers to extract useful information from these datasets. Traditional web search engines are usually based on keyword search, producing incorrect results that are far from users' intentions. Although much progress has been made in recent years in enhancing search-engine performance to provide users with appropriate results, considerable room for improvement remains. Word sense disambiguation plays a very important role in natural language processing and is considered one of the most difficult problems in the area. Major approaches to word sense disambiguation can be classified as knowledge-based, supervised corpus-based, and unsupervised corpus-based. This paper presents a method that automatically generates a corpus for word sense disambiguation from the examples in existing dictionaries, avoiding expensive sense-tagging. The effectiveness of the method is evaluated with the Naïve Bayes model, a supervised learning algorithm, using the Korean standard unabridged dictionary and the Sejong Corpus. The Korean standard unabridged dictionary has approximately 57,000 sentences; the Sejong Corpus has about 790,000 sentences tagged with both part-of-speech and senses. For the experiment, the dictionary and the corpus were evaluated both combined and separately using cross-validation, with only nouns, the targets of word sense disambiguation, selected. 93,522 word senses among 265,655 nouns, together with 56,914 sentences from related proverbs and examples, were additionally combined into the corpus. The Sejong Corpus merged easily with the Korean standard unabridged dictionary because it is tagged with the sense indices defined by that dictionary. Sense vectors were formed after the merged corpus was created, and the terms used in creating them were added to the named-entity dictionary of a Korean morphological analyzer. Using the extended named-entity dictionary, term vectors were extracted from input sentences. Given an extracted term vector and the sense-vector model built during preprocessing, sense-tagged terms were determined by vector-space-model-based word sense disambiguation. The experiment shows that better precision and recall are obtained with the merged corpus, suggesting that the approach can practically enhance the performance of Internet search engines and help capture sentence meaning more accurately in natural language processing for search engines, opinion mining, and text mining. The Naïve Bayes classifier is a supervised learning algorithm based on Bayes' theorem and assumes that all senses are independent. Although this assumption is unrealistic and ignores correlations between attributes, the classifier is widely used for its simplicity and is known to be very effective in practice in applications such as text classification and medical diagnosis. Further research, however, is needed to consider all possible combinations, or partial combinations, of the senses in a sentence. The effectiveness of word sense disambiguation may also improve if rhetorical structures or morphological dependencies between words are analyzed through syntactic analysis.
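The supervised Naïve Bayes disambiguation described above can be sketched as follows, with Laplace smoothing added for unseen context words; this is a generic illustration, not the paper's implementation (the class name, smoothing choice, and toy senses are assumptions):

```python
import math
from collections import Counter, defaultdict

class NaiveBayesWSD:
    """Naive Bayes word-sense disambiguation: each training example is
    (context_words, sense); context words are assumed conditionally
    independent given the sense (the Naive Bayes assumption)."""

    def fit(self, examples):
        self.sense_counts = Counter(s for _, s in examples)
        self.word_counts = defaultdict(Counter)
        vocab = set()
        for words, sense in examples:
            self.word_counts[sense].update(words)
            vocab.update(words)
        self.vocab_size = len(vocab)
        self.total = sum(self.sense_counts.values())
        return self

    def predict(self, context_words):
        best, best_lp = None, -math.inf
        for sense, sc in self.sense_counts.items():
            lp = math.log(sc / self.total)  # log prior of the sense
            denom = sum(self.word_counts[sense].values()) + self.vocab_size
            for w in context_words:  # Laplace-smoothed log likelihoods
                lp += math.log((self.word_counts[sense][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = sense, lp
        return best
```

Training such a model on dictionary example sentences, as the paper proposes, sidesteps manual sense tagging because each example is already attached to a sense entry.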

Knowledge graph-based knowledge map for efficient expression and inference of associated knowledge (연관지식의 효율적인 표현 및 추론이 가능한 지식그래프 기반 지식지도)

  • Yoo, Keedong
    • Journal of Intelligence and Information Systems / v.27 no.4 / pp.49-71 / 2021
  • Users who intend to apply knowledge to solve given problems proceed by sequential and cross-exploration of associated knowledge related by criteria such as content relevance. A knowledge map is a diagram or taxonomy giving an overview of the knowledge currently managed in a knowledge base, and it supports users' knowledge exploration through the relationships between pieces of knowledge. A knowledge map must therefore be expressed in networked form, linking related knowledge by defined relationship types, and should be implemented with technologies or tools specialized in defining and inferring such relationships. To this end, this study proposes a methodology for developing a knowledge-graph-based knowledge map using a graph DB, which is well suited to expressing and inferring the entities and relationships stored in a knowledge base. The procedure comprises modeling the graph data; creating nodes, properties, and relationships; and composing knowledge networks by combining the identified links between knowledge. Among the various graph DBs, Neo4j is used in this study for its credibility and its wide range of application cases. To examine the validity of the methodology, a knowledge-graph-based knowledge map was implemented on the graph DB, and a performance comparison was run on a previous study's data to check whether this study's knowledge map yields the same level of performance. The previous study built a process-based knowledge map using ontology technology, identifying links between related knowledge from the sequences of tasks that produce knowledge or are activated by it. In other words, since a task is activated by knowledge as input and produces knowledge as output, input and output knowledge are linked as a flow through the task; and since a business process is composed of affiliated tasks fulfilling the purpose of the process, the knowledge networks within a process follow from the sequences of its tasks. Accordingly, processes, tasks, and knowledge, together with the relationships among them, were defined in Neo4j as nodes and relationships so that knowledge links could be identified from task sequences. The knowledge network obtained by aggregating the identified links is the knowledge map functioning as a knowledge graph, so its performance was tested against the previous study's validation results in two respects: the correctness of the knowledge links, examined with 7 questions, and the ability to infer new types of knowledge, checked by extracting two such items. The knowledge map constructed through the proposed methodology showed the same level of performance as the previous one while handling knowledge definition and relationship inference more efficiently. Furthermore, compared with the previous ontology-based approach, the graph-DB-based approach proved more capable of intensively managing only the knowledge of interest, dynamically defining knowledge and relationships to reflect varying situations and purposes, agilely inferring knowledge and relationships through Cypher-based queries, and easily creating new relationships by aggregating existing ones. The study's artifacts can be applied to implement user-friendly knowledge exploration that reflects how users move through associated knowledge, and can further underpin an intelligent knowledge base that expands autonomously by inferring new knowledge and relationships. Beyond this, the study offers an immediate path to the networked knowledge map that contemporary users need in order to find the right knowledge to use.
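The task-sequence linking rule described above (input knowledge linked through a task to the knowledge the task produces) can be sketched with plain dictionaries; Neo4j and Cypher are replaced here by an in-memory illustration, so all names are assumptions rather than the study's schema:

```python
def knowledge_links(task_inputs, task_outputs):
    """Derive knowledge-to-knowledge links through tasks: knowledge that
    activates a task is linked to the knowledge that task produces."""
    links = set()
    for task, inputs in task_inputs.items():
        for k_in in inputs:
            for k_out in task_outputs.get(task, []):
                if k_in != k_out:
                    links.add((k_in, k_out))
    return links

def reachable(links, start):
    """Infer indirectly associated knowledge by following links
    transitively, mimicking relationship inference over the graph."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for a, b in links:
            if a == node and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen
```

In an actual Neo4j deployment the same composition would be a Cypher path query over (:Knowledge)-[:ACTIVATES]->(:Task)-[:PRODUCES]->(:Knowledge) style relationships; the sketch only shows the logical rule.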

Development and Validation of the 'Food Safety and Health' Workbook for High School (고등학교 「식품안전과 건강」 워크북 개발 및 타당도 검증)

  • Park, Mi Jeong;Jung, Lan-Hee;Yu, Nan Sook;Choi, Seong-Youn
    • Journal of Korean Home Economics Education Association / v.34 no.1 / pp.59-80 / 2022
  • The purpose of this study was to develop a workbook that can support the class and evaluation of the subject, 「Food safety and health」 and to verify its validity. The development direction of the workbook was set by analyzing the 「Food safety and health」 curriculum, dietary education materials, and previous studies related to the workbook, and the overall structure was designed by deriving the activity ideas for each area. Based on this, the draft was developed, and the draft went through several rounds of cross-review by the authors and the examination and revision by the Ministry of Food and Drug Safety, before the final edited version was developed. The workbook was finalized with corrections and enhancements based on the advice of 9 experts and 44 home economics teachers. The workbook consists of 4 areas: the 'food selection' area, with 10 learning topics and 36 lessons, the 'food poisoning and food management' area, with 10 learning topics and 36 lessons, the 'cooking' area, with 11 learning topics and 43 lessons, and the 'healthy eating' area, with 11 learning topics and 55 lessons, resulting in a total of 42 learning topics, 170 lessons. The workbook was designed to evenly cultivate practical problem-solving competency, self-reliance capacity, creative thinking capacity, and community capacity. In-depth inquiry-learning is conducted on the content, and the context is structured so that self-diagnosis can be made through evaluation. According to the validity test of the workbook, it was evaluated to be very appropriate for encouraging student-participatory classes and evaluations, and to create a class atmosphere that promotes inquiry by strengthening experiments and practices. 
In the current situation, where the high school credit system is implemented and students' individual learning options are emphasized, the results of this study are expected to help expand the scope of home economics-based elective courses and contribute to realizing student-led, inquiry-focused classrooms.

Preliminary Inspection Prediction Model to select the on-Site Inspected Foreign Food Facility using Multiple Correspondence Analysis (차원축소를 활용한 해외제조업체 대상 사전점검 예측 모형에 관한 연구)

  • Hae Jin Park;Jae Suk Choi;Sang Goo Cho
    • Journal of Intelligence and Information Systems / v.29 no.1 / pp.121-142 / 2023
  • As the number and weight of imported foods steadily increase, safety management of imported food to prevent food-safety accidents is becoming more important. The Ministry of Food and Drug Safety conducts on-site inspections of foreign food facilities before customs clearance, as well as import inspection at the customs-clearance stage; however, given limited time, cost, and resources, a data-based safety-management plan for imported food is needed. In this study, we sought to increase the efficiency of on-site inspection with a machine learning prediction model that pre-selects the facilities expected to fail. Basic information on 303,272 foreign food manufacturing and processing facilities collected in the Integrated Food Safety Information Network and 1,689 on-site inspection records from 2019 to April 2022 were gathered. After preprocessing the foreign-facility data, only the records subject to on-site inspection were extracted using the foreign food facility_code, leaving 1,689 observations with 103 variables. Of the 103 variables, those with a Theil's U of '0' were removed, and after dimensionality reduction by Multiple Correspondence Analysis, 49 characteristic variables were finally derived. Eight different models were built, hyperparameters were tuned via 5-fold cross-validation, and the performance of the resulting models was evaluated. Because the goal of selecting facilities for on-site inspection is to maximize recall, the probability of judging nonconforming facilities as nonconforming, the Random Forest model, which had the highest Recall_macro, AUROC, Average PR, F1-score, and Balanced Accuracy among the algorithms applied, was evaluated as the best model. Finally, we applied Kernel SHAP (SHapley Additive exPlanations) to present the reasons individual facilities were selected as nonconforming, and we discuss applicability to an on-site inspection facility-selection system. Based on these results, the study is expected to contribute to the efficient use of limited resources such as manpower and budget by establishing an imported-food management system built on a data-based, scientific risk-management model.
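The Recall_macro criterion used for model selection above can be computed as follows (a minimal sketch; the function name and interface are mine, not the paper's):

```python
def recall_macro(y_true, y_pred):
    """Macro-averaged recall: per-class recall averaged with equal class
    weight, so a rare 'nonconforming' class counts as much as the
    majority class."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        actual = sum(1 for t in y_true if t == c)
        recalls.append(tp / actual)
    return sum(recalls) / len(classes)
```

Because each class contributes equally regardless of its frequency, this metric rewards models that catch the minority nonconforming facilities, which is exactly the stated selection objective.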

A Comparative Study on Factors Affecting Satisfaction by Travel Purpose for Urban Demand Response Transport Service: Focusing on Sejong Shucle (도심형 수요응답 교통서비스의 통행목적별 만족도 영향요인 비교연구: 세종특별자치시 셔클(Shucle)을 중심으로)

  • Wonchul Kim;Woo Jin Han;Juntae Park
    • The Journal of The Korea Institute of Intelligent Transport Systems / v.23 no.2 / pp.132-141 / 2024
  • In this study, differences in user satisfaction with demand-responsive transport (DRT) and in the variables influencing it were compared by travel purpose, divided into commuting/school and shopping/leisure travel. A survey of 'Shucle' users in Sejong City was used for the analysis, and least absolute shrinkage and selection operator (LASSO) regression was applied to mitigate the overfitting problems of the multiple linear model. The results confirmed that introducing DRT service could eliminate blind spots in existing public transportation, reduce private-car use, support low-carbon and public-transport revitalization policies, and provide suitable transportation to people with intermittent travel patterns (e.g., elderly people and homemakers). Waiting time after calling a DRT, travel time after boarding, convenience of the DRT app, punctuality of expected departure/arrival times, and the location of pickup and drop-off points were common factors that positively influenced satisfaction for both commuting/school and shopping/leisure travel. Meanwhile, the ease of transfer to other transport modes affected satisfaction only for commuting/school travel, not for shopping/leisure travel. To promote DRT service, the five common factors above should be addressed, along with the differentiating factors identified for each purpose: for commuting/school travel, where users value time, transfers to other modes should be made more convenient to reduce total travel time; for shopping/leisure travel, facilities that let users easily and conveniently designate pickup and drop-off locations should be considered.
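The LASSO used above to curb overfitting can be sketched as cyclic coordinate descent with soft-thresholding; this is a generic textbook formulation, not the authors' estimation code (the function name, objective scaling, and iteration count are assumptions):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """LASSO via cyclic coordinate descent. Minimizes
    (1/2n)||y - Xw||^2 + lam * ||w||_1, with features assumed to be
    roughly standardized."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # partial residual: remove every feature's contribution except j
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r / n
            # soft-thresholding shrinks small coefficients exactly to zero
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return w
```

The exact zeros produced by the soft-thresholding step are what make LASSO act as a variable selector, which is why it suits a satisfaction model with many candidate explanatory factors.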