Clustering Meta Information of K-Pop Girl Groups Using Term Frequency-inverse Document Frequency Vectorization

단어-역문서 빈도 벡터화를 통한 한국 걸그룹의 음반 메타 정보 군집화

  • 현준서 (Department of Software Engineering, Jeonbuk National University) ;
  • 조재혁 (Department of Software Engineering, Jeonbuk National University)
  • Received : 2023.04.04
  • Accepted : 2023.05.16
  • Published : 2023.06.30

Abstract

In the 2020s, girl groups have drawn more attention than boy groups in the K-Pop market, and the fourth generation more than the third. This paper presents methods and results for lyric clustering to investigate whether a generational shift among girl groups has begun. We collected meta-information for 1,469 songs by 47 groups released from 2013 to 2022, divided it into lyric information and non-lyric meta-information, and quantified each separately. Following previous studies, the lyric information was preprocessed by applying term frequency-inverse document frequency (TF-IDF) vectorization and then keeping only the highest vector values. The non-lyric meta-information was preprocessed with One-Hot Encoding and included to reduce the bias of relying on lyric information alone and to improve the clustering results. On the preprocessed data, Spherical K-Means scored 129% higher on the Silhouette Coefficient and 45% higher on the Calinski-Harabasz Score than Hierarchical Clustering. This study is expected to contribute to research on the history of Korean popular music and on the analysis and clustering of girl group lyrics.
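The pipeline summarized in the abstract (TF-IDF vectors trimmed to their top values, One-Hot Encoded meta-information, then Spherical K-Means) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the toy lyrics, the `genre` field, the cut-off `k=3`, the plain `tf * log(N/df)` weighting, the way lyric and one-hot features are merged, and the choice of seed songs are all assumptions made here for illustration.

```python
import math
from collections import Counter


def tfidf_top_k(docs, k):
    """TF-IDF per tokenized document, keeping only the k largest weights."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        weights = {
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by weight (descending), then by term so ties are deterministic.
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        vectors.append(dict(ranked[:k]))
    return vectors


def unit(vec):
    """Scale a sparse vector to unit length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else dict(vec)


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def spherical_kmeans(vectors, seed_indices, max_iter=20):
    """K-means on the unit sphere: assign by cosine similarity, renormalize centroids."""
    points = [unit(v) for v in vectors]
    centroids = [points[i] for i in seed_indices]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            max(range(len(centroids)), key=lambda c: dot(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centroids)):
            total = {}
            for p, label in zip(points, labels):
                if label == c:
                    for t, w in p.items():
                        total[t] = total.get(t, 0.0) + w
            if total:  # keep the old centroid if a cluster becomes empty
                centroids[c] = unit(total)
    return labels


# Hypothetical toy data: four "songs" as token lists plus one categorical field.
lyrics = [
    ["love", "love", "heart", "night", "yeah"],
    ["love", "heart", "heart", "dream", "yeah"],
    ["dance", "party", "night", "dance", "yeah"],
    ["party", "dance", "party", "fire", "yeah"],
]
genres = ["ballad", "ballad", "dance", "dance"]

lyric_vectors = tfidf_top_k(lyrics, k=3)
# Assumption: one-hot meta features are simply appended to the lyric features.
features = [{**vec, f"genre={g}": 1.0} for vec, g in zip(lyric_vectors, genres)]
labels = spherical_kmeans(features, seed_indices=[0, 2])
print(labels)
```

On this toy input the first two songs fall into one cluster and the last two into the other, and the word `yeah`, which appears in every song, receives a TF-IDF weight of zero and is dropped by the top-k step. Cluster quality measures such as the Silhouette Coefficient or Calinski-Harabasz Score would be computed on the resulting labels; they are omitted here to keep the sketch short.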

Acknowledgement

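This research was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) and the Ministry of Science and ICT. (Open Innovation Platform Digital Open Lab, Project No. 2021-0-00546)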
