Search | Korea Science

Entropy-based Clustering Validation Technique for Categorical Data Sets (범주형 데이터 집합에 대한 엔트로피 기반 군집 유효화 기술)

Park Namhyun;Ahn Chang Wook;Ramakrishna R.S.
- Proceedings of the Korea Information Processing Society Conference / 2004.11a / pp.477-480 / 2004
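This paper examines a cluster validation technique for high-dimensional categorical data sets. First, defining the centroid of a cluster for categorical data makes it possible to use the pairwise similarity measures employed in general clustering methods. Next, an entropy-based cluster validity index is used to determine the optimal number of clusters from the results of an incremental clustering algorithm for categorical data sets. This readily resolves the threshold-selection problem that general clustering algorithms must solve to obtain optimal results. Finally, these ideas are tested experimentally on several data sets.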
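
As a rough illustration of what an entropy-based measure of cluster quality can look like for categorical data, the sketch below computes a size-weighted sum of per-attribute Shannon entropies within each cluster. This is not the index defined in the paper, whose exact form is not given in the abstract; the function names `attribute_entropy` and `expected_entropy` and the toy data are assumptions made for this example.

```python
from collections import Counter
from math import log2


def attribute_entropy(values):
    """Shannon entropy (in bits) of one categorical attribute."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def expected_entropy(rows, labels):
    """Size-weighted sum of per-attribute entropies over all clusters.

    rows   : list of equal-length tuples of categorical values
    labels : cluster label for each row
    Lower values mean more homogeneous clusters.
    """
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)
    n = len(rows)
    score = 0.0
    for members in clusters.values():
        # Sum the entropy of every attribute column inside this cluster.
        within = sum(attribute_entropy(col) for col in zip(*members))
        score += (len(members) / n) * within
    return score


# Hypothetical toy data: two attributes, four objects.
rows = [("red", "small"), ("red", "small"), ("blue", "large"), ("blue", "large")]
pure = expected_entropy(rows, [0, 0, 1, 1])   # each cluster is homogeneous
mixed = expected_entropy(rows, [0, 1, 0, 1])  # each cluster mixes both kinds
```

A raw within-cluster entropy like this tends to fall as the number of clusters grows, so a practical validity index would normally combine it with some penalty or between-cluster term before being used to choose the cluster count.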

A polychotomous regression model with tensor product splines and direct sums (연속형의 텐서곱과 범주형의 직합을 사용한 다항 로지스틱 회귀모형)

Sim, Songyong;Kang, Heemo
- Journal of the Korean Data and Information Science Society / v.25 no.1 / pp.19-26 / 2014
In this paper, we propose a polychotomous regression model for the case where the independent variables include both categorical and numerical variables. Direct sums are used for the categorical independent variables, and tensor product splines are used for the continuous ones. BIC is used as the variable selection criterion. We implemented the algorithm and applied it to real data. The model using direct sums and tensor products outperformed the usual multinomial logistic regression model.
https://doi.org/10.7465/jkdi.2014.25.1.19

Improving Classification Performance for Data with Numeric and Categorical Attributes Using Feature Wrapping (특징 래핑을 통한 숫자형 특징과 범주형 특징이 혼합된 데이터의 클래스 분류 성능 향상 기법)

Lee, Jae-Sung;Kim, Dae-Won
- Journal of KIISE:Software and Applications / v.36 no.12 / pp.1024-1027 / 2009
In this letter, we evaluate classification performance on mixed numeric and categorical data in order to compare the effectiveness of feature filtering and feature wrapping. Because the mixed data consist of numeric and categorical features, the feature selection methods were applied after discretizing the numeric features in the given data set. We then choose the feature subset that improves classification performance on the preprocessed data set. The experimental comparison shows that the feature wrapping method is more reliable than the feature filtering method in terms of classification accuracy.

Validation Comparison of Credit Rating Models for Categorized Financial Data (범주형 재무자료에 대한 신용평가모형 검증 비교)

Hong, Chong-Sun;Lee, Chang-Hyuk;Kim, Ji-Hun
- Communications for Statistical Applications and Methods / v.15 no.4 / pp.615-631 / 2008
Current credit evaluation models that rely only on financial data, excluding non-financial data, use continuous data and produce credit scores for ranking. In this work, we discuss some problems of credit evaluation models based on transformed continuous financial data and propose improved credit evaluation models based on categorized financial data. After analyzing and comparing goodness-of-fit tests for the two models, we explain the applicability of credit evaluation models for categorized financial data.
https://doi.org/10.5351/CKSS.2008.15.4.615

Integration of Categorical Data using Multivariate Kriging for Spatial Interpolation of Ground Survey Data (현장 조사 자료의 공간 보간을 위한 다변량 크리깅을 이용한 범주형 자료의 통합)

Park, No-Wook
- Spatial Information Research / v.19 no.4 / pp.81-89 / 2011
This paper presents a multivariate kriging algorithm that integrates categorical data as secondary data for spatial interpolation of sparsely sampled ground survey data. Instead of using a constant mean value for each attribute of the categorical data, disaggregated local mean values at the target grid points are first estimated by area-to-point kriging and are then used as local means in simple kriging with local means. The algorithm is illustrated through a case study on the spatial interpolation of a geochemical element, copper, using geological map data. Cross-validation results indicate that the presented algorithm improves prediction capability by 15% and 25% compared with univariate ordinary kriging and conventional simple kriging with constant mean values, respectively. The multivariate kriging algorithm applied in this study is expected to be effective for spatial interpolation with categorical data.

Skyline Query Algorithm in the Categoric Data (범주형 데이터에 대한 스카이라인 질의 알고리즘)

Lee, Woo-Key;Choi, Jung-Ho;Song, Jong-Su
- Journal of KIISE:Computing Practices and Letters / v.16 no.7 / pp.819-823 / 2010
The skyline query is an effective method for handling large, multi-dimensional data sets. By using the concept of dominance, a skyline query can pinpoint the target data so that the dominated items, about 95% of the data, are efficiently excluded as unnecessary. Most skyline query algorithms, however, have been developed for numerical data sets. This paper explores a new domain, categorical data, and proposes corresponding ranking measures for skyline queries. In the experiments, the ACM Computing Classification System is used, and the proposed methods are evaluated on performance measures such as processing time and precision ratio.
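
The dominance concept mentioned in the abstract can be sketched for the ordinary numerical case as follows. This is the textbook skyline definition with a naive pairwise scan, not the categorical ranking measures proposed in the paper; the "smaller is better" convention and the `hotels` data are assumptions of this example.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere.

    Smaller values are treated as better (an assumption of this sketch).
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points):
    """Return the points not dominated by any other point (naive O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (price, distance) pairs; lower is better on both axes.
hotels = [(50, 8), (60, 3), (80, 1), (70, 5), (90, 9)]
result = skyline(hotels)
# (70, 5) is dominated by (60, 3) and (90, 9) by (50, 8), so both are excluded.
```

For categorical attributes there is no natural ordering for `<=`, which is why a ranking measure has to be defined before dominance can be applied.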

A Big Data Analysis by Between-Cluster Information using k-Modes Clustering Algorithm (k-Modes 분할 알고리즘에 의한 군집의 상관정보 기반 빅데이터 분석)

Park, In-Kyoo
- Journal of Digital Convergence / v.13 no.11 / pp.157-164 / 2015
This paper describes subspace clustering of categorical data for convergence and integration. Because conventional evaluation measures were designed for numerical data, they are likely to be limited on categorical data by the absence of ordering, high dimensionality, and sparse frequencies. Hence, a conditional entropy measure is proposed to evaluate the cohesion among attributes within each cluster. We propose a new objective function that reflects optimal clustering, so that within-cluster dispersion is minimized and between-cluster separation is enhanced. We performed experiments on five real-world datasets, comparing the performance of our algorithms with four algorithms using three evaluation metrics: accuracy, F-measure, and adjusted Rand index. According to the experiments, the proposed algorithm outperforms the algorithms considered in the evaluation on these metrics.
https://doi.org/10.14400/JDC.2015.13.11.157
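
For readers unfamiliar with the k-Modes algorithm named in the title, a minimal sketch of the standard procedure is given below: objects are compared by simple matching dissimilarity, and each cluster is summarized by its per-attribute mode. It does not implement the conditional-entropy measure or the objective function proposed in the paper; the function names, the initial modes, and the toy data are assumptions of this example.

```python
from collections import Counter


def matching_distance(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))


def mode_of(members):
    """Per-attribute most frequent value of a cluster (its mode)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))


def k_modes(rows, modes, max_iter=10):
    """Alternate assignment and mode update until the modes stop changing."""
    modes = list(modes)
    labels = []
    for _ in range(max_iter):
        # Assign every row to the nearest mode.
        labels = [min(range(len(modes)), key=lambda j: matching_distance(r, modes[j]))
                  for r in rows]
        # Recompute each mode; keep the old one if a cluster is empty.
        new_modes = []
        for j, old in enumerate(modes):
            members = [r for r, l in zip(rows, labels) if l == j]
            new_modes.append(mode_of(members) if members else old)
        if new_modes == modes:
            break
        modes = new_modes
    return labels, modes


# Hypothetical toy data with three categorical attributes.
rows = [("a", "x", "p"), ("a", "x", "q"), ("b", "y", "q"), ("b", "y", "r")]
labels, modes = k_modes(rows, [rows[0], rows[3]])
```

The result depends on the initial modes, which here are simply taken from the first and last rows; practical implementations usually try several initializations.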

Developing of Exact Tests for Order-Restrictions in Categorical Data (범주형 자료에서 순서화된 대립가설 검정을 위한 정확검정의 개발)

Nam, Jusun;Kang, Seung-Ho
- The Korean Journal of Applied Statistics / v.26 no.4 / pp.595-610 / 2013
Testing of order-restricted alternative hypotheses in $2 \times k$ contingency tables can be applied in various fields, including medicine, sociology, and business administration. Most testing methods have been developed on the basis of large-sample theory. When the sample size is small or unbalanced, the Type I error rate of a testing method based on large-sample theory is very different from the target level of 5%. In this paper, an exact testing method is introduced for testing an order-restricted alternative hypothesis in categorical data, particularly for small sample sizes or extremely unbalanced data. Power and exact p-values are calculated.
https://doi.org/10.5351/KJAS.2013.26.4.595

Categorical time series clustering: Case study of Korean pro-baseball data (범주형 시계열 자료의 군집화: 프로야구 자료의 사례 연구)

Pak, Ro Jin
- Journal of the Korean Data and Information Science Society / v.27 no.3 / pp.621-627 / 2016
A given professional baseball team tends to be very weak against one particular opponent. For example, team S, the strongest team in Korea, is relatively weak against team H. In this paper, we cluster the Korean baseball teams based on their records against team S to investigate whether team H's record follows a different pattern from those of the other teams. The technique employed is 'time series clustering', or more specifically 'categorical time series clustering'. Three methods are considered: (i) a distance-based method, (ii) a genetic sequencing method, and (iii) a periodogram method. Each method has its own advantages and disadvantages in handling categorical time series, so we recommend drawing conclusions by considering the results of all three methods together.
https://doi.org/10.7465/jkdi.2016.27.3.621

Clustering of Categorical Data using Rough Entropy (러프 엔트로피를 이용한 범주형 데이터의 클러스터링)

Park, Inkyoo
- The Journal of the Institute of Internet, Broadcasting and Communication / v.13 no.5 / pp.183-188 / 2013
A variety of cluster analysis techniques are a prerequisite for grouping objects with similar characteristics in data mining. However, these algorithms have considerable difficulty dealing with categorical data in databases. The imprecise handling of uncertainty in categorical data during clustering stems from the purely algebraic logic of rough sets, which degrades stability and effectiveness. This paper proposes an information-theoretic rough entropy (RE) that takes the dependency of attributes into account, and a technique called min-mean-mean roughness (MMMR) for selecting the clustering attribute. We analyze and compare the performance of the proposed technique with K-means, fuzzy techniques, and other standard deviation roughness methods on the Zoo dataset. The results verify the better performance of the proposed approach.
https://doi.org/10.7236/JIIBC.2013.13.5.183

Search Result 547, Processing Time 0.033 seconds
