• Title/Summary/Keyword: cross-validation


Exploration of the Multiple Structure of Relational Self and Construct Validation among Korean Adults (한국남녀의 관계적 자아의 특성: 다원적 구성요인 탐색 및 타당성 분석)

  • Ji Kyung Kim;Myoung So Kim
    • Korean Journal of Culture and Social Issue / v.9 no.2 / pp.41-59 / 2003
  • The present study was conducted to (1) explore how Korean men and women perceive what constitutes an important relationship for them and how each gender group construes the relational self, and (2) develop a scale to assess the factors of relational self and verify the construct validity of the scale. Forty college students and 60 adults participated in a survey and focus group interviews (FGI), respectively, and content analysis of their responses yielded 2 categories with 39 characteristics of relational self. One category, named 'instrumentality', was important to men, and the other, named 'expressivity', was important to women. The list of 39 items was administered to a nationwide sample of 1503 Korean adults to assess their construal of relational self on a 6-point Likert scale. Principal axis factor analysis showed that the two categories were unidimensional with high reliability. Factor analysis on each category extracted a total of 9 factors. Specifically, instrumentality consisted of utilitarianism, independence, initiativeness, self-assurance, and competence, and the factors of expressivity were empathy, passiveness, dependency, and consideration. Tests of mean differences revealed that men had higher scores on most of the instrumental factors, while women had higher scores on most of the expressive factors. However, there was no sex difference on the interdependent self-construal scale (Cross, 2000), which has frequently been used to measure relational self. This is related to Koreans' collectivist cultural characteristics, and it was concluded that relationships with others are very important to both Korean men and women, but that the meaning and expectations of relationships, as well as the methods for preserving them, differ between the sexes. 
In addition, correlation analyses indicated that the femininity score was positively correlated with expressiveness, while the masculinity score was positively correlated with instrumentality. This result implies that the differences in relational self among Korean people are related to the socialization process of each sex, i.e., sex role identity. Finally, limitations of this study and directions for future research are discussed.


Ensemble Learning with Support Vector Machines for Bond Rating (회사채 신용등급 예측을 위한 SVM 앙상블학습)

  • Kim, Myoung-Jong
    • Journal of Intelligence and Information Systems / v.18 no.2 / pp.29-45 / 2012
  • Bond rating is regarded as an important event for measuring the financial risk of companies and for determining the investment returns of investors. As a result, predicting companies' credit ratings by applying statistical and machine learning techniques has been a popular research topic. Statistical techniques, including multiple regression, multiple discriminant analysis (MDA), logistic models (LOGIT), and probit analysis, have traditionally been used in bond rating. However, one major drawback is that they rely on strict assumptions, including linearity, normality, independence among predictor variables, and pre-existing functional forms relating the criterion variables and the predictor variables. These strict assumptions have limited the application of traditional statistics to the real world. Machine learning techniques used in bond rating prediction models include decision trees (DT), neural networks (NN), and the Support Vector Machine (SVM). In particular, SVM is recognized as a new and promising classification and regression analysis method. SVM learns a separating hyperplane that maximizes the margin between two categories. SVM is simple enough to be analyzed mathematically and leads to high performance in practical applications. SVM implements the structural risk minimization principle and seeks to minimize an upper bound of the generalization error. In addition, the solution of SVM may be a global optimum, and thus overfitting is unlikely to occur with SVM. Furthermore, SVM does not require many training samples, since it builds prediction models using only a few representative samples near the boundaries, called support vectors. A number of experimental studies have indicated that SVM has been successfully applied in a variety of pattern recognition fields. However, there are three major drawbacks that can degrade SVM's performance. 
First, SVM was originally proposed for solving binary classification problems. Methods for combining SVMs for multi-class classification, such as One-Against-One and One-Against-All, have been proposed, but they do not perform as well on multi-class problems as SVM does on binary classification. Second, approximation algorithms (e.g., decomposition methods and the sequential minimal optimization algorithm) can be used to reduce computation time in multi-class problems, but they can deteriorate classification performance. Third, multi-class prediction is made difficult by the data imbalance problem, which occurs when the number of instances in one class greatly outnumbers the number of instances in another class. Such data sets often cause a default classifier to be built because of a skewed boundary, which reduces the classification accuracy of the classifier. SVM ensemble learning is one machine learning method for coping with the above drawbacks. Ensemble learning is a method for improving the performance of classification and prediction algorithms. AdaBoost is one of the most widely used ensemble learning techniques. It constructs a composite classifier by sequentially training classifiers while increasing the weight on misclassified observations through iterations. Observations that are incorrectly predicted by previous classifiers are chosen more often than those that are correctly predicted. Thus, boosting attempts to produce new classifiers that are better able to predict examples for which the current ensemble's performance is poor. In this way, it can reinforce the training of the misclassified observations of the minority class. This paper proposes multiclass Geometric Mean-based Boosting (MGM-Boost) to resolve the multiclass prediction problem. 
Since MGM-Boost introduces the notion of the geometric mean into AdaBoost, its learning process can take into account the geometric mean-based accuracy and errors across multiple classes. This study applies MGM-Boost to a real-world bond rating case for Korean companies to examine its feasibility. 10-fold cross-validation is performed three times with different random seeds to ensure that the comparison among the three classifiers does not happen by chance. For each 10-fold cross-validation, the entire data set is first partitioned into ten equal-sized sets, and then each set is in turn used as the test set while the classifier trains on the other nine sets. That is, the cross-validated folds were tested independently for each algorithm. Through these steps, we obtained results for the classifiers on each of the 30 experiments. In the comparison of arithmetic mean-based prediction accuracy between individual classifiers, MGM-Boost (52.95%) shows higher prediction accuracy than both AdaBoost (51.69%) and SVM (49.47%). MGM-Boost (28.12%) also shows higher prediction accuracy than AdaBoost (24.65%) and SVM (15.42%) in terms of geometric mean-based prediction accuracy. A t-test is used to examine whether the performance of each classifier over the 30 folds is significantly different. The results indicate that the performance of MGM-Boost is significantly different from that of the AdaBoost and SVM classifiers at the 1% level. These results mean that MGM-Boost can provide robust and stable solutions to multi-class problems such as bond rating.
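
The evaluation protocol in this abstract (10-fold cross-validation repeated three times, scored by both arithmetic and geometric mean accuracy) can be sketched roughly as follows. This is only an illustration, not the paper's code: the function names are hypothetical, and the geometric mean accuracy is assumed here to be the geometric mean of per-class recall, which the abstract does not spell out.

```python
import math
import random


def repeated_kfold(n_samples, k=10, repeats=3, base_seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV.

    Each repeat reshuffles the indices with a different random seed,
    then deals them into k folds of (near-)equal size.
    """
    for r in range(repeats):
        idx = list(range(n_samples))
        random.Random(base_seed + r).shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for f, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != f for i in fold]
            yield r, f, train_idx, test_idx


def geometric_mean_accuracy(y_true, y_pred):
    """Geometric mean of per-class recall (assumed definition).

    The score drops to zero if any class is missed entirely, which is
    why it penalizes classifiers that ignore minority classes.
    """
    recalls = []
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return math.prod(recalls) ** (1 / len(recalls))


splits = list(repeated_kfold(100, k=10, repeats=3))
print(len(splits))  # 30 train/test experiments, as in the abstract
```

A classifier that always predicts the majority class can still reach a high arithmetic accuracy on imbalanced data but scores 0 on this geometric mean, which may explain the much lower geometric figures reported above.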

Optimization of Multiclass Support Vector Machine using Genetic Algorithm: Application to the Prediction of Corporate Credit Rating (유전자 알고리즘을 이용한 다분류 SVM의 최적화: 기업신용등급 예측에의 응용)

  • Ahn, Hyunchul
    • Information Systems Review / v.16 no.3 / pp.161-177 / 2014
  • Corporate credit rating assessment consists of complicated processes in which various factors describing a company are taken into consideration. Such assessment is known to be very expensive, since domain experts must be employed to assess the ratings. As a result, data-driven corporate credit rating prediction using statistical and artificial intelligence (AI) techniques has received considerable attention from researchers and practitioners. In particular, statistical methods such as multiple discriminant analysis (MDA) and multinomial logistic regression analysis (MLOGIT), and AI methods including case-based reasoning (CBR), artificial neural network (ANN), and multiclass support vector machine (MSVM), have been applied to corporate credit rating. Among them, MSVM has recently become popular because of its robustness and high prediction accuracy. In this study, we propose a novel optimized MSVM model and apply it to corporate credit rating prediction in order to enhance accuracy. Our model, named 'GAMSVM (Genetic Algorithm-optimized Multiclass Support Vector Machine),' is designed to simultaneously optimize the kernel parameters and the feature subset selection. Prior studies such as Lorena and de Carvalho (2008) and Chatterjee (2013) show that proper kernel parameters may improve the performance of MSVMs. Also, the results of studies such as Shieh and Yang (2008) and Chatterjee (2013) imply that appropriate feature selection may lead to higher prediction accuracy. Based on these prior studies, we propose to apply GAMSVM to corporate credit rating prediction. As a tool for optimizing the kernel parameters and the feature subset selection, we suggest the genetic algorithm (GA). GA is known as an efficient and effective search method that attempts to simulate biological evolution. By applying genetic operations such as selection, crossover, and mutation, it is designed to gradually improve the search results. 
In particular, the mutation operator prevents GA from falling into local optima, so we can find a globally optimal or near-optimal solution with it. GA has been widely applied to search for optimal parameters or feature subsets of AI techniques including MSVM. For these reasons, we also adopt GA as an optimization tool. To empirically validate the usefulness of GAMSVM, we applied it to a real-world case of credit rating in Korea. Our application is in bond rating, which is the most frequently studied area of credit rating for specific debt issues or other financial obligations. The experimental dataset was collected from a large credit rating company in South Korea. It contained 39 financial ratios of 1,295 companies in the manufacturing industry, together with their credit ratings. Using various statistical methods including one-way ANOVA and stepwise MDA, we selected 14 financial ratios as the candidate independent variables. The dependent variable, i.e., credit rating, was labeled as four classes: 1 (A1); 2 (A2); 3 (A3); 4 (B and C). Eighty percent of the data for each class was used for training, and the remaining 20 percent was used for validation. To overcome the small sample size, we applied five-fold cross-validation to our dataset. To examine the competitiveness of the proposed model, we also experimented with several comparative models including MDA, MLOGIT, CBR, ANN, and MSVM. For MSVM, we adopted the One-Against-One (OAO) and DAGSVM (Directed Acyclic Graph SVM) approaches because they are known to be the most accurate among the various MSVM approaches. GAMSVM was implemented using LIBSVM, an open-source software package, and Evolver 5.5, a commercial software package that enables GA. The other comparative models were run using various statistical and AI packages such as SPSS for Windows, Neuroshell, and Microsoft Excel VBA (Visual Basic for Applications). Experimental results showed that the proposed model, GAMSVM, outperformed all the competing models. 
In addition, the model was found to use fewer independent variables while showing higher accuracy. In our experiments, five variables, X7 (total debt), X9 (sales per employee), X13 (years after founded), X15 (accumulated earning to total asset), and X39 (the index related to the cash flows from operating activity), were found to be the most important factors in predicting corporate credit ratings. However, the values of the finally selected kernel parameters were found to be almost the same among the data subsets. To examine whether the predictive performance of GAMSVM was significantly greater than that of the other models, we used the McNemar test. As a result, we found that GAMSVM was better than MDA, MLOGIT, CBR, and ANN at the 1% significance level, and better than OAO and DAGSVM at the 5% significance level.
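
The McNemar test mentioned in this abstract compares two classifiers on the same test cases by looking only at the cases where exactly one of them is correct. A minimal sketch follows; it assumes the continuity-corrected chi-square form of the test with one degree of freedom (the abstract does not say which variant was used), and the function name and sample data are made up for illustration.

```python
import math


def mcnemar_test(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar test for two classifiers (assumed variant).

    b = cases model A gets right and model B gets wrong,
    c = cases model A gets wrong and model B gets right.
    Returns (chi-square statistic, p-value with 1 degree of freedom).
    """
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    if b + c == 0:
        return 0.0, 1.0  # the two models never disagree in correctness
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square(1): P(Z^2 > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical data: model A is right on all 20 cases, model B on only 10.
y_true = [1] * 20
pred_a = [1] * 20
pred_b = [0] * 10 + [1] * 10
stat, p = mcnemar_test(y_true, pred_a, pred_b)
print(stat, p < 0.01)
```

Because the test uses paired predictions on identical cases, it suits comparisons like the one described above, where every model is evaluated on the same validation data.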

Accuracy evaluation of microwave water surface current meter for measurement angles in middle flow condition (전자파표면유속계의 측정 각도에 따른 평수기 유속 측정 정확도 분석)

  • Son, Geunsoo;Kim, Dongsu;Kim, Kyungdong;Kim, Jongmin
    • Journal of Korea Water Resources Association / v.53 no.1 / pp.15-27 / 2020
  • Streamflow discharge, as a fundamental riverine quantity, plays a crucial role in water resources management and therefore requires accurate in-situ measurement. Recent advances in instrumentation for streamflow discharge measurement have complemented or replaced classical devices and methods. Among the various potential methods, the microwave surface current meter has increasingly been applied not only to flood discharge but also to normal flow discharge measurement, enabling practitioners to measure flow velocity remotely and safely without direct contact. With minimal field preparation, this method has eased flood discharge measurement in difficult in-situ conditions such as extreme floods, since it actively emits a 24.125 GHz microwave signal and does not rely on natural light. In South Korea, a rectangular instrument named the Microwave Water Surface Current Meter (MWSCM) was developed and commercially released around 2010, and domestic agencies in charge of streamflow observation have taken an interest in it as a potential substitute. Despite the attention this new device has received for efficient flow measurement, however, there have been few notable efforts to evaluate its performance systematically and comprehensively under various measurement and riverine conditions, which has held back its prompt and widespread use in practice. This study attempted to evaluate the MWSCM in terms of the instrument's monitoring configuration, particularly the tilt and yaw angles. When the instrument is pointed at the measurement spot in a given cross-section, the observation campaign inevitably faces accuracy issues related to different tilt and yaw angles, which are conventionally a major source of error for this type of instrument. 
Focusing on instrument configuration, the instrument was tested in a controlled outdoor river channel at the KICT River Experiment Center under a fixed flow condition with a flow speed of around 1 m/s and a steady flow supply, a channel width of 6 m, and a shallow flow depth of less than 1 m, where detailed velocity measurements with a SonTek micro-ADV were used for validation. The results showed that tilt angles of less than 15 degrees produced much higher deviation, and that larger yaw angles proportionally increased the coefficient of variation. Yaw angles affected accuracy in terms of the measurement area.

A Multimodal Profile Ensemble Approach to Development of Recommender Systems Using Big Data (빅데이터 기반 추천시스템 구현을 위한 다중 프로파일 앙상블 기법)

  • Kim, Minjeong;Cho, Yoonho
    • Journal of Intelligence and Information Systems / v.21 no.4 / pp.93-110 / 2015
  • A recommender system recommends products to customers who are likely to be interested in them. Based on automated information filtering technology, various recommender systems have been developed. Collaborative filtering (CF), one of the most successful recommendation algorithms, has been applied in a number of different domains, such as recommending Web pages, books, movies, music, and products. However, CF is known to have a critical shortcoming. CF finds neighbors whose preferences are like those of the target customer and recommends the products those neighbors have liked most. Thus, CF works properly only when there is a sufficient number of ratings on common products from customers. When customer ratings are scarce, CF forms an inaccurate neighborhood, resulting in poor recommendations. To improve the performance of CF-based recommender systems, most related studies have focused on developing novel algorithms under the assumption of a single profile, which is created from users' item ratings, purchase transactions, or Web access logs. With the advent of big data, companies have come to collect more data and to use a wide variety of large-scale information. Many companies therefore recognize the importance of utilizing big data, because it allows them to improve their competitiveness and create new value. In particular, the use of personal big data in recommender systems is an emerging issue, because personal big data facilitate more accurate identification of users' preferences and behaviors. The proposed recommendation methodology is as follows. First, multimodal user profiles are created from personal big data in order to grasp the preferences and behavior of users from various viewpoints. We derive five user profiles based on personal information such as rating, site preference, demographic, Internet usage, and topic in text. 
Next, the similarity between users is calculated based on the profiles, and the neighbors of each user are found from the results. One of three ensemble approaches is applied to calculate the similarity. The three approaches use the similarity of the combined profile, the average similarity of each profile, and the weighted average similarity of each profile, respectively. Finally, the products that the neighbors prefer most are recommended to the target users. For the experiments, we used demographic data and a very large volume of Web log transactions for 5,000 panel users of a company that specializes in analyzing the rankings of Web sites. R and SAS E-miner were used to implement the proposed recommender system and to conduct the topic analysis using keyword search, respectively. To evaluate the recommendation performance, we used 60% of the data for training and 40% for testing. 5-fold cross-validation was also conducted to enhance the reliability of our experiments. The F1 metric, a widely used combination metric that gives equal weight to recall and precision, was employed for our evaluation. In the evaluation, the proposed methodology achieved a significant improvement over the single-profile CF algorithm. In particular, the ensemble approach using weighted average similarity showed the highest performance: the rate of improvement in F1 is 16.9 percent for the ensemble approach using weighted average similarity and 8.1 percent for the ensemble approach using the average similarity of each profile. From these results, we conclude that the multimodal profile ensemble approach is a viable solution to the problems encountered when customer ratings are scarce. This study is significant in suggesting what kinds of information can be used to create profiles in a big data environment and how they can be combined and utilized effectively. 
However, our methodology should be studied further with a view to real-world application. We need to compare the differences in recommendation accuracy when the proposed method is applied to different recommendation algorithms, and then identify which combination shows the best performance.
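
Two of the quantities named in this abstract, the weighted average similarity across profiles and the F1 metric, are simple enough to sketch. The code below is only illustrative: the function names, the weights, and the sample data are hypothetical, and the abstract does not say how the profile weights were chosen.

```python
def weighted_average_similarity(similarities, weights):
    """Combine per-profile similarities between two users into one score.

    `similarities` and `weights` are parallel lists, one entry per profile
    (e.g. rating, site preference, demographic, Internet usage, topic).
    """
    total_weight = sum(weights)
    return sum(s * w for s, w in zip(similarities, weights)) / total_weight


def f1_score(recommended, relevant):
    """F1 for a recommendation list: harmonic mean of precision and recall."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(recommended)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: two profiles, the first weighted three times as much.
print(weighted_average_similarity([0.8, 0.4], [3, 1]))
# Hypothetical example: 2 of 4 recommended items are among 3 relevant ones.
print(f1_score(["a", "b", "c", "d"], ["a", "b", "e"]))
```

Setting all weights equal reduces the first function to the plain average similarity, which corresponds to the second of the three ensemble approaches described above.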

A Management Plan According to the Estimation of Nutria (Myocastor coypus) Distribution Density and Potential Suitable Habitat (뉴트리아(Myocastor coypus) 분포밀도 및 잠재적 서식가능지역 예측에 따른 관리방향)

  • Kim, Areum;Kim, Young-Chae;Lee, Do-Hun
    • Journal of Environmental Impact Assessment / v.27 no.2 / pp.203-214 / 2018
  • The purpose of this study is to estimate the concentrated distribution area and the potential suitable habitat of nutria (Myocastor coypus) and to provide useful data for setting an effective management direction. Based on nationwide distribution data of nutria, the cross-validation value was applied to analyze the distribution density. As a result, concentrated distribution areas requiring preferential elimination were found in 14 administrative areas, including Busan Metropolitan City, Daegu Metropolitan City, 11 cities and counties in Gyeongsangnam-do, and 1 county in Gyeongsangbuk-do. In the potential suitable habitat estimation using a MaxEnt (Maximum Entropy) model, a possibility of emergence was found in the middle and lower reaches of the Nakdong River, the lower reaches of the Seomjin River, and the Gahwacheon River area. As for the contribution of the model variables, DEM, precipitation of the driest month, minimum temperature of the coldest month, and distance from river contributed in descending order. In terms of the relation with the probability of appearance, the probability of emergence was higher than the threshold value in areas with an altitude of less than 34 m, a minimum temperature of the coldest month of -5.7°C to -0.6°C, precipitation of the driest month of 15-30 mm, and a distance of less than 1,373 m from the river. Considering the research results and the physiological and ecological characteristics of nutria, altitude, the presence of water, and winter temperature affected the settlement and expansion of nutria. Therefore, it is necessary to reflect them as important variables in future modeling for detecting habitable areas and estimating expansion. It is essential to distinguish the concentrated distribution area from the management area of invasive alien species such as nutria, and to establish and apply a suitable management strategy to each management site for permanent control. 
The results of this study can be used as useful data for strategic management, such as rapid management of the preferential management area and preemptive and preventive management of areas of possible spread.

Impacts assessment of Climate changes in North Korea based on RCP climate change scenarios II. Impacts assessment of hydrologic cycle changes in Yalu River (RCP 기후변화시나리오를 이용한 미래 북한지역의 수문순환 변화 영향 평가 II. 압록강유역의 미래 수문순환 변화 영향 평가)

  • Jeung, Se Jin;Kang, Dong Ho;Kim, Byung Sik
    • Journal of Wetlands Research / v.21 no.spc / pp.39-50 / 2019
  • This study aims to assess the influence of climate change on the hydrological cycle at the basin level in North Korea. The model selected for this study is MRI-CGCM 3, which was used for the Coupled Model Intercomparison Project Phase 5 (CMIP5). Moreover, this study adopted Spatial Disaggregation-Quantile Delta Mapping (SDQDM), one of the stochastic downscaling techniques, to conduct bias correction for the climate change scenarios. A comparison of the results before and after applying SDQDM supported the study's review of the technique's validity. In addition, as this study determined the influence of climate change on the hydrological cycle, it also examined runoff in North Korea. To predict such influence, the parameters of the runoff model used for the analysis should be optimized. However, North Korea is classified as an ungauged region because of its political circumstances, and it was difficult to collect the country's runoff observation data. Hence, the study selected 16 basins for which high-quality runoff data were available, and the optimized parameters of the M-RAT model were calculated. The study also analyzed the correlation among the basin characteristic variables to account for multicollinearity. Then, based on a phased regression analysis, the study developed an equation to calculate parameters for ungauged basins. To verify the equation, the study treated the Osipcheon River, Namdaecheon Stream, Yongdang Reservoir, and Yonggang Stream as ungauged basins and conducted cross-validation. As a result, high efficiency was confirmed for all four basins, with efficiency coefficients of 0.8 or higher. The study used the climate change scenarios and the estimated runoff model parameters to assess the changes in hydrological cycle processes at the basin level due to climate change in the Amnokgang River basin of North Korea. 
The results showed that climate change would lead to an increase in precipitation, and that the corresponding rise in temperature is predicted to increase evapotranspiration. However, the storage capacity in the basin was found to decrease. The flow duration analysis indicated a decrease in flow on the 95th day; an increase in the drought flow during the periods of Future 1 and Future 2; and an increase in both flows for the period of Future 3.

Improvements for Atmospheric Motion Vectors Algorithm Using First Guess by Optical Flow Method (옵티컬 플로우 방법으로 계산된 초기 바람 추정치에 따른 대기운동벡터 알고리즘 개선 연구)

  • Oh, Yurim;Park, Hyungmin;Kim, Jae Hwan;Kim, Somyoung
    • Korean Journal of Remote Sensing / v.36 no.5_1 / pp.763-774 / 2020
  • Wind data forecast by a numerical weather prediction (NWP) model are generally used as the first guess in the target tracking process to obtain atmospheric motion vectors (AMVs), because this increases tracking accuracy and reduces computation time. However, there is a contradiction in that the NWP model used as the first guess is used again as the reference in the AMV verification process. To overcome this problem, model-independent first guesses are required. In this study, we propose deriving AMVs with the Lucas-Kanade optical flow method and then using them as the first guess. To retrieve AMVs, Himawari-8/AHI geostationary satellite level-1B data at 00, 06, 12, and 18 UTC from August 19 to September 5, 2015 were used. To evaluate the impact of applying the optical flow method to AMV derivation, cross-validation was conducted in three ways: (1) without a first guess, (2) with NWP (KMA/UM) forecast wind as the first guess, and (3) with optical flow-based wind as the first guess. In verification against ECMWF ERA-Interim reanalysis data, the highest precision (RMSVD: 5.296-5.804 m/s) was obtained using optical flow-based winds as the first guess. In addition, the computation speed for AMV derivation was slowest in the test without a first guess, while the other two tests had similar performance. Thus, by applying the optical flow method in the target tracking process of the AMV algorithm, this study showed that the optical flow method is very effective as a first guess for model-independent AMV derivation.

The Effect of Meta-Features of Multiclass Datasets on the Performance of Classification Algorithms (다중 클래스 데이터셋의 메타특징이 판별 알고리즘의 성능에 미치는 영향 연구)

  • Kim, Jeonghun;Kim, Min Yong;Kwon, Ohbyung
    • Journal of Intelligence and Information Systems / v.26 no.1 / pp.23-45 / 2020
  • Big data is being created in a wide variety of fields such as medical care, manufacturing, logistics, sales sites, and SNS, and dataset characteristics are also diverse. To secure competitiveness, companies need to improve their decision-making capacity using classification algorithms. However, most of them do not have sufficient knowledge of which classification algorithm is appropriate for a specific problem area. In other words, determining which classification algorithm is appropriate for the characteristics of a dataset has been a task that requires expertise and effort. This is because the relationship between the characteristics of datasets (called meta-features) and the performance of classification algorithms has not been fully understood. Moreover, there has been little research on meta-features reflecting the characteristics of multi-class data. Therefore, the purpose of this study is to empirically analyze whether meta-features of multi-class datasets have a significant effect on the performance of classification algorithms. In this study, meta-features of multi-class datasets were grouped into two factors (data structure and data complexity), and seven representative meta-features were selected. Among those, we included the Herfindahl-Hirschman Index (HHI), originally a market concentration index, to replace IR (Imbalanced Ratio). We also developed a new index called the Reverse ReLU Silhouette Score and added it to the meta-feature set. From the UCI Machine Learning Repository, six representative datasets (Balance Scale, PageBlocks, Car Evaluation, User Knowledge-Modeling, Wine Quality (red), Contraceptive Method Choice) were selected. The classes of each dataset were predicted using the classification algorithms selected in the study (KNN, Logistic Regression, Naïve Bayes, Random Forest, and SVM). For each dataset, we applied 10-fold cross-validation. 
Oversampling of 10% to 100% is applied to each fold, and the meta-features of the dataset are measured. The meta-features selected are HHI, Number of Classes, Number of Features, Entropy, Reverse ReLU Silhouette Score, Nonlinearity of Linear Classifier, and Hub Score. The F1-score was selected as the dependent variable. The results showed that six meta-features, including the Reverse ReLU Silhouette Score and HHI proposed in this study, have a significant effect on classification performance. (1) The meta-feature HHI proposed in this study has a significant effect on classification performance. (2) The number of variables has a significant effect on classification performance and, unlike the number of classes, its effect is positive. (3) The number of classes has a negative effect on classification performance. (4) Entropy has a significant effect on classification performance. (5) The Reverse ReLU Silhouette Score also significantly affects classification performance at a significance level of 0.01. (6) The nonlinearity of linear classifiers has a significant negative effect on classification performance. In addition, the results of the analysis by classification algorithm were consistent. In the regression analysis by classification algorithm, the number of variables does not have a significant effect for the Naïve Bayes algorithm, unlike the other classification algorithms. This study has two theoretical contributions: (1) two new meta-features (HHI and Reverse ReLU Silhouette Score) were shown to be significant, and (2) the effects of data characteristics on classification performance were investigated using meta-features. As practical contributions, (1) the results can be utilized in developing a classification algorithm recommendation system based on the characteristics of datasets. 
(2) Because data characteristics differ, many data scientists search for the optimal algorithm for a given situation by repeatedly adjusting algorithm parameters, a process that wastes considerable hardware, cost, time, and manpower. This study is expected to be useful for machine learning and data mining researchers, practitioners, and developers of machine learning-based systems. This paper consists of an introduction, related research, research model, experiment, and conclusion and discussion.
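
To make the HHI meta-feature concrete, the sketch below computes a Herfindahl-Hirschman Index over class shares, treating each class as a "firm" and its share of the samples as its "market share". This is an assumed reading of how the index is applied to class imbalance; the paper's exact scaling is not given in the abstract, and the function name is hypothetical.

```python
from collections import Counter


def class_hhi(labels):
    """Herfindahl-Hirschman Index of a label distribution.

    Sum of squared class shares, on a 0-1 scale (assumed): 1/k for k
    perfectly balanced classes, rising to 1.0 when a single class
    holds every sample.
    """
    counts = Counter(labels)
    n = len(labels)
    return sum((c / n) ** 2 for c in counts.values())


print(class_hhi(["a", "b", "c", "d"]))  # 0.25, four balanced classes
print(class_hhi(["a", "a", "a", "b"]))  # 0.625, skewed toward class "a"
```

Unlike a ratio of the largest to the smallest class, this index reflects the whole class distribution, which may be why it is proposed as a replacement for the imbalance ratio in multi-class settings.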

Estimation of Near Surface Air Temperature Using MODIS Land Surface Temperature Data and Geostatistics (MODIS 지표면 온도 자료와 지구통계기법을 이용한 지상 기온 추정)

  • Shin, HyuSeok;Chang, Eunmi;Hong, Sungwook
    • Spatial Information Research / v.22 no.1 / pp.55-63 / 2014
  • Near-surface air temperature data, which are one of the essential factors in hydrology, meteorology, and climatology, have drawn substantial attention from various academic domains and societies. Meteorological observations, however, are subject to strong spatio-temporal constraints because of the limited number and distribution of stations over the earth's surface. To overcome such limits, many studies have sought to estimate near-surface air temperature from satellite image data at a regional or continental scale with simple regression methods. Alternatively, we applied various Kriging methods, such as ordinary Kriging, universal Kriging, cokriging, and regression Kriging, in search of an optimal estimation method based on near-surface air temperature data observed at automatic weather stations (AWS) in South Korea throughout 2010 (365 days) and MODIS land surface temperature (LST) data (MOD11A1, 365 images). Because of high spatial heterogeneity, auxiliary data such as land cover and a DEM (digital elevation model) were also analyzed to consider factors that can affect near-surface air temperature. Prior to the main estimation, we calculated the root mean square error (RMSE) of the temperature differences between the 365-day LST and AWS data by season and land cover. The results show that the coefficient of variation (CV) of RMSE by season is 0.86, whereas the equivalent CV by land cover is 0.00746. Seasonal differences between LST and AWS data were greater than those by land cover. Seasonal RMSE was the lowest in winter (3.72). The results of a linear regression analysis examining the relationship among AWS, LST, and auxiliary data show that the coefficient of determination was the highest in winter (0.818) and the lowest in summer (0.078), indicating a significant level of seasonal variation. Based on these results, we utilized a variety of Kriging techniques to estimate the surface temperature. 
The results of cross-validation in each Kriging model show that the measure of model accuracy was 1.71, 1.71, 1.848, and 1.630 for universal Kriging, ordinary Kriging, cokriging, and regression Kriging, respectively. The estimates from regression Kriging thus proved to be the most accurate among the Kriging methods compared.
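
Cross-validation of a spatial interpolator is commonly run as leave-one-out: each station is withheld in turn, predicted from the remaining stations, and the errors are summarized as an RMSE. The sketch below illustrates that loop only. It assumes leave-one-out RMSE as the accuracy measure (the abstract does not name it), and it uses inverse distance weighting as a simple stand-in predictor, not any of the Kriging variants compared in the study. All names and data are hypothetical.

```python
import math


def idw_predict(x, y, stations, power=2.0):
    """Inverse-distance-weighted estimate at (x, y).

    Stand-in for a Kriging predictor; `stations` is a list of
    (x, y, value) tuples.
    """
    num = den = 0.0
    for sx, sy, value in stations:
        d = math.hypot(x - sx, y - sy)
        if d == 0:
            return value  # exact hit on a station location
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den


def loocv_rmse(stations, predict=idw_predict):
    """Leave-one-out cross-validation RMSE for a spatial predictor."""
    squared_errors = []
    for i, (sx, sy, value) in enumerate(stations):
        rest = stations[:i] + stations[i + 1:]  # withhold station i
        squared_errors.append((predict(sx, sy, rest) - value) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical stations: (x, y, air temperature)
stations = [(0, 0, 10.0), (1, 0, 12.0), (0, 1, 11.0), (1, 1, 13.0)]
print(loocv_rmse(stations))
```

Any interpolator with the same `(x, y, stations)` call signature could be passed as `predict`, which is how several methods could be compared on equal terms.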