Topic Modeling Insomnia Social Media Corpus using BERTopic and Building Automatic Deep Learning Classification Model

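Korean title (translated): Topic Modeling of Insomnia Social Data Using BERTopic and Construction of a Deep Learning Automatic Classification Model for Insomnia-Tendency Documents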

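  • 고영수 (Department of Library and Information Science, Yonsei University) ;
  • 이수빈 (Department of Library and Information Science, Yonsei University) ;
  • 차민정 (Social Omics Research Center, Yonsei University) ;
  • 김성덕 (Department of Library and Information Science, Yonsei University) ;
  • 이주희 (Department of Library and Information Science, Yonsei University) ;
  • 한지영 (Department of Library and Information Science, Yonsei University) ;
  • 송민 (Department of Library and Information Science, Yonsei University)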
  • Received : 2022.05.13
  • Accepted : 2022.06.08
  • Published : 2022.06.30

Abstract

Insomnia is a chronic disease in modern society, with the number of new patients increasing by more than 20% in the last five years. It is a serious condition that requires diagnosis and treatment, because insufficient sleep causes serious individual and social problems and the triggers of insomnia are complex. This study collected 5,699 documents from 'insomnia', a community on Reddit, a social media platform where users freely express their opinions. Based on the International Classification of Sleep Disorders (ICSD-3) criteria and guidelines prepared with the help of experts, each document was tagged as either an insomnia-tendency document or a non-insomnia-tendency document to construct an insomnia corpus. Five deep learning language models (BERT, RoBERTa, ALBERT, ELECTRA, XLNet) were trained on this corpus. In the performance evaluation, RoBERTa performed best, with an accuracy of 81.33%. For an in-depth analysis of the insomnia social data, topic modeling was performed with BERTopic, a recently introduced method that addresses weaknesses of LDA, which has been widely used in the past. The analysis identified eight topic groups ('Negative emotions', 'Advice and help and gratitude', 'Insomnia-related diseases', 'Sleeping pills', 'Exercise and eating habits', 'Physical characteristics', 'Activity characteristics', 'Environmental characteristics'). Users expressed negative emotions and sought help and advice from the Reddit insomnia community. They also mentioned diseases related to insomnia, discussed the use of sleeping pills, and expressed interest in exercise and eating habits. The insomnia-related characteristics found were physical characteristics such as breathing, pregnancy, and the heart; activity characteristics such as zombies, hypnic jerks, and grogginess; and environmental characteristics such as sunlight, blankets, temperature, and naps.
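
The abstract names BERTopic as the topic modeling method. As a rough illustration of how BERTopic turns clusters of documents into topic keywords, the sketch below applies the class-based TF-IDF weighting described in the BERTopic paper, W(t, c) = tf(t, c) * log(1 + A / tf(t)), where A is the average number of words per cluster and tf(t) is the frequency of term t across all clusters. This is not the authors' code: the function names, the toy sentences, and the hand-made cluster assignments are assumptions for illustration only, and the embedding and clustering steps that BERTopic performs before this weighting are skipped.

```python
import math
from collections import Counter
from typing import Dict, List


def c_tf_idf(clusters: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Score each term per cluster as tf(t, c) * log(1 + A / tf(t))."""
    # Treat every cluster as one concatenated "class document".
    class_counts = {
        name: Counter(" ".join(docs).lower().split())
        for name, docs in clusters.items()
    }
    # tf(t): frequency of each term across all clusters.
    total_counts: Counter = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # A: average number of words per cluster.
    avg_words = sum(total_counts.values()) / len(class_counts)
    return {
        name: {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        for name, counts in class_counts.items()
    }


def top_terms(scores: Dict[str, float], n: int = 3) -> List[str]:
    """Return the n highest-scoring terms of one cluster."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical toy clusters; a real pipeline would derive them by clustering
# document embeddings rather than assigning them by hand.
toy_clusters = {
    "medication": [
        "melatonin did not help my sleep",
        "doctor prescribed melatonin for sleep",
    ],
    "environment": [
        "room temperature ruins my sleep",
        "blanket and temperature matter",
    ],
}

topic_scores = c_tf_idf(toy_clusters)
for cluster_name, term_scores in topic_scores.items():
    print(cluster_name, top_terms(term_scores))
```

Terms that are frequent inside one cluster but rare elsewhere (here "melatonin" and "temperature") rise to the top, while a word shared by both clusters such as "sleep" is down-weighted.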

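Insomnia is a chronic disease of modern society, with the number of patients increasing by more than 20% over the last five years. Diagnosis and treatment are important for this condition, because the individual and social problems caused by insufficient sleep are serious and the triggers of insomnia act in combination. This study collected 5,699 data items from 'insomnia', the insomnia community on Reddit, a social media platform where opinions are expressed freely, and built an insomnia corpus by tagging them as insomnia-tendency or non-tendency documents, based on the International Classification of Sleep Disorders (ICSD-3) criteria and guidelines developed in consultation with a psychiatrist. Five deep learning language models (BERT, RoBERTa, ALBERT, ELECTRA, XLNet) were trained with the constructed insomnia corpus as training data, and in the performance evaluation RoBERTa showed the highest performance in accuracy, precision, recall, and F1 score. To analyze the insomnia social data in depth, topic modeling was conducted with BERTopic, a newly introduced method that compensates for the weaknesses of the previously widely used LDA. Hierarchical clustering analysis identified eight topic groups ('Negative emotions', 'Advice and help and gratitude', 'Insomnia-related diseases', 'Sleeping pills', 'Exercise and eating habits', 'Physical characteristics', 'Activity characteristics', 'Environmental characteristics'). Users expressed negative emotions and sought help and advice in the insomnia community. They also mentioned insomnia-related diseases, discussed the use of sleeping pills, and expressed interest in exercise and eating habits. The insomnia-related characteristics found included physical characteristics such as breathing, pregnancy, and the heart; activity characteristics such as zombies, hypnic jerks, and grogginess; and environmental characteristics such as sunlight, blankets, temperature, and naps.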
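
The abstract reports classifier performance in terms of accuracy, precision, recall, and F1 score. The following is a minimal sketch of how these four metrics are computed for a binary insomnia-tendency classifier. The function name and the example labels are hypothetical and do not come from the study's data; label 1 is assumed to mean "insomnia tendency" and label 0 "non-insomnia tendency".

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Return accuracy, precision, recall, and F1 for the positive class (label 1)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical gold labels and model predictions for eight documents.
gold = [1, 1, 1, 0, 0, 0, 1, 0]
pred = [1, 1, 0, 0, 0, 1, 1, 0]
scores = binary_metrics(gold, pred)
print(scores)
```

Reporting precision and recall alongside accuracy matters when the two classes are imbalanced, since accuracy alone can look high even if the insomnia-tendency class is rarely detected.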

Acknowledgement

This research was supported by the National Research Foundation of Korea, funded by the Korean government (NRF-2018S1A3A2075114).
