Selection of Optimal Variables for Clustering of Seoul using Genetic Algorithm

Kim, Hyung Jin; Jung, Jae Hoon; Lee, Jung Bin; Kim, Sang Min; Heo, Joon

doi:10.7319/kogsis.2014.22.4.175

Journal of Korean Society for Geospatial Information Science (대한공간정보학회지)

Volume 22, Issue 4 / Pages 175-181 / 2014 / 1598-2955 (pISSN) / 2287-6693 (eISSN)

Korea Spatial Information Society (대한공간정보학회)

유전자 알고리즘을 이용한 서울시 군집화 최적 변수 선정

Kim, Hyung Jin (Department of Civil and Environmental Engineering, Yonsei University) ;
Jung, Jae Hoon (Department of Civil and Environmental Engineering, Yonsei University) ;
Lee, Jung Bin (Department of Civil and Environmental Engineering, Yonsei University) ;
Kim, Sang Min (Department of Civil and Environmental Engineering, Yonsei University) ;
Heo, Joon (Department of Civil and Environmental Engineering, Yonsei University)

김형진 (연세대학교 토목환경공학과) ;
정재훈 (연세대학교 토목환경공학과) ;
이정빈 (연세대학교 토목환경공학과) ;
김상민 (연세대학교 토목환경공학과) ;
허준 (연세대학교 토목환경공학과)

Received : 2014.12.19
Accepted : 2014.12.19
Published : 2014.12.31

Abstract

The Korean government has proposed a new initiative, 'Government 3.0', under which the administration opens its datasets to the public proactively, before they are requested. The City of Seoul is the front runner in the disclosure of government data. If we know which attributes are the governing factors for a given segmentation, the results can be applied to real-world problems in marketing and business strategy and to administrative decision making. However, for the City of Seoul, selecting the optimal variables from an open dataset of up to several thousand attributes would require an enormous amount of computation time, because it may require combinatorial optimization that maximizes dissimilarity measures between clusters. In this study, we acquired a dataset of 718 attributes from Statistics Korea and used a genetic algorithm with Dunn's index to select the variables that best differentiate Gangnam from the other districts. We also used the Microsoft Azure cloud computing platform to reduce processing time. As a result, 28 optimal variables were selected, and validation with Ward's minimum variance method and the K-means algorithm showed that these 28 variables effectively separate Gangnam from the other districts.
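The selection procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the chromosome encoding, the genetic operators, or the exact form of Dunn's index, so the binary mask, truncation selection, one-point crossover, mutation rate, Euclidean distance, and the toy data below are all assumptions made for illustration. The function names `dunn_index` and `select_variables` are hypothetical, and the cloud-based distribution of the computation is omitted.

```python
import math
import random


def dunn_index(rows, labels, mask):
    """Dunn's index of a labelled partition, using only columns where mask is 1.

    Computed as (smallest distance between points in different clusters)
    divided by (largest distance between points in the same cluster).
    """
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an empty variable set cannot separate anything
    min_between = math.inf
    max_within = 0.0
    for i in range(len(rows)):
        for k in range(i + 1, len(rows)):
            d = math.sqrt(sum((rows[i][j] - rows[k][j]) ** 2 for j in cols))
            if labels[i] == labels[k]:
                max_within = max(max_within, d)
            else:
                min_between = min(min_between, d)
    if max_within == 0.0 or min_between == math.inf:
        return 0.0
    return min_between / max_within


def select_variables(rows, labels, generations=40, pop_size=30,
                     mutation_rate=0.05, seed=0):
    """Search for a 0/1 variable mask that maximizes Dunn's index."""
    rng = random.Random(seed)
    n_vars = len(rows[0])

    def fitness(mask):
        return dunn_index(rows, labels, mask)

    # Start from the all-variables mask plus random masks.
    population = [[1] * n_vars] + [
        [rng.randint(0, 1) for _ in range(n_vars)] for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elite = population[: pop_size // 2]  # truncation selection with elitism
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            cut = rng.randrange(1, n_vars)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Flip each bit independently with probability mutation_rate.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        population = elite + children
    return max(population, key=fitness)


# Toy stand-in data (made up for illustration): 8 areas x 6 variables.
# Variables 0 and 1 separate the target group; variables 2-5 are noise.
_rng = random.Random(1)
labels = [1, 1, 1, 0, 0, 0, 0, 0]
rows = [
    [5.0 * lab + _rng.random(), 5.0 * lab + _rng.random()]
    + [10.0 * _rng.random() for _ in range(4)]
    for lab in labels
]
best = select_variables(rows, labels)
print(best, dunn_index(rows, labels, best))
```

Because the all-variables mask is in the initial population and the best half of each generation is carried over unchanged, the returned mask never scores lower than using every variable. With 718 attributes the search space has 2^718 masks, which is why exhaustive enumeration is impractical and a heuristic search is used instead.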

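Under the new government operating plan known as Government 3.0, a wide range of public information has become available for private use, and Seoul in particular is leading the disclosure and use of such administrative information. If the administrative factors that characterize each area can be identified from the disclosed information, they could be reflected in decision making for administrative policy and could also serve as a marketing tool for identifying the customer characteristics of a specific area and selling specialized services or products. However, finding the optimal combination of variables that clearly distinguishes the characteristics of each cluster from a vast amount of administrative data is a combinatorial optimization problem that requires considerable computation. In this study, we sought to find, from the multidimensional administrative data provided by the City of Seoul, the administrative factors that effectively distinguish the three Gangnam districts (Seocho-gu, Gangnam-gu, and Songpa-gu), a representative center of culture and industry in Seoul, from the other areas. A genetic algorithm was used as the optimization method for selecting the factors that maximize the difference between the two clusters, and Dunn's index was used as the measure of the difference between clusters. In addition, to speed up the genetic algorithm, distributed processing was carried out using cloud computing provided by Microsoft Azure. A total of 718 administrative data items obtained from Statistics Korea were used, of which 28 were selected as the optimal variables. For validation, clustering was performed with Ward's minimum variance method and the K-means algorithm using the 28 selected variables as input; in both cases, the three Gangnam districts were effectively separated from the other areas.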
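The validation step can be sketched in the same spirit: cluster the areas on the selected variables only, without using the labels, and check whether the target group is recovered. The sketch below assumes a plain Lloyd-style K-means with a deterministic farthest-point start, which is a swapped-in simplification and not necessarily the K-means variant used in the study. Ward's method is not shown. The function names and the toy values are hypothetical.

```python
import math


def kmeans(rows, k=2, iterations=100):
    """Plain Lloyd-style k-means with a deterministic farthest-point start."""
    # Begin with the first row, then add the row farthest from the chosen centres.
    centres = [list(rows[0])]
    while len(centres) < k:
        farthest = max(rows, key=lambda r: min(math.dist(r, c) for c in centres))
        centres.append(list(farthest))
    labels = None
    for _ in range(iterations):
        # Assign every row to its nearest centre.
        new_labels = [
            min(range(k), key=lambda c: math.dist(r, centres[c])) for r in rows
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Move each centre to the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def same_partition(a, b):
    """True when two labelings induce the same grouping, ignoring label names."""
    return len(set(zip(a, b))) == len(set(a)) == len(set(b))


# Toy values (made up): 8 areas measured on 2 already-selected variables.
target = [1, 1, 1, 0, 0, 0, 0, 0]  # 1 marks the hypothetical target group
selected = [
    [9.0, 8.5], [8.7, 9.1], [9.3, 8.8],
    [2.0, 1.5], [1.2, 2.2], [2.5, 2.1], [1.8, 1.1], [2.2, 2.6],
]
found = kmeans(selected, k=2)
print(same_partition(found, target))
```

Comparing partitions rather than raw labels matters here because K-means numbers its clusters arbitrarily, so the target group may come back labelled 0 or 1.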

Cited by

A Combinatorial Optimization for Influential Factor Analysis: a Case Study of Political Preference in Korea vol.35, pp.5, 2017, https://doi.org/10.7848/ksgpc.2017.35.5.415
