Latent class model for mixed variables with applications to text data

Analysis of text data using a mixed-mode latent class model

  • Shin, Hyun Soo (Department of Statistics, Sungkyunkwan University) ;
  • Seo, Byungtae (Department of Statistics, Sungkyunkwan University)
  • 신현수 (Department of Statistics, Sungkyunkwan University) ;
  • 서병태 (Department of Statistics, Sungkyunkwan University)
  • Received : 2019.07.01
  • Accepted : 2019.10.20
  • Published : 2019.12.31

Abstract

Latent class models (LCMs) are useful tools for extracting hidden information from categorical data. An LCM can also be interpreted as a mixture model with multinomial component distributions. In some cases, however, an available dataset may contain count or continuous variables in addition to categorical ones. For such cases, the LCM can be extended to a mixture model whose component distributions combine multinomial distributions with others, such as normal and Poisson distributions. In this paper, we consider an LCM for data containing both categorical and count variables and use it to analyze the Drug Review dataset, which contains categorical responses and text reviews. This analysis shows that the extended model yields more specific hidden information than an LCM fitted to the categorical responses alone.

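The latent class model, which can be viewed as a kind of mixture of multinomial distributions, is a useful tool for extracting important information that is not directly observed from categorical data. However, when the data contain continuous or count variables in addition to categorical variables, this model cannot be applied directly. In this paper, we focus on the case where categorical and count variables appear together and use a mixed-mode latent class model, an extension of the latent class model, to analyze drug review data that contain both text reviews and categorical response items. Through this analysis, we confirm that the mixed-mode latent class model, which also uses the text data, gives more detailed information about the latent classes than the ordinary latent class model based on categorical responses alone.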
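
The mixture described in the abstract can be illustrated with a small EM sketch. This is not the authors' implementation. It assumes that items are independent given the latent class, that each categorical item has its own class-specific multinomial probabilities, and that each count variable (for example, a word frequency) is Poisson with a class-specific rate. The function names `fit_mixed_lcm` and `simulate`, the smoothing constant `eps`, and the simulated data are all hypothetical.

```python
import math
import numpy as np


def fit_mixed_lcm(x_cat, x_cnt, n_classes, n_levels, n_iter=100, seed=0):
    """EM sketch for a latent class model with categorical and Poisson items.

    x_cat: (n, p) int array; item j takes values 0 .. n_levels[j] - 1.
    x_cnt: (n, q) array of non-negative counts (e.g. word frequencies).
    Items are assumed independent given the latent class.
    """
    rng = np.random.default_rng(seed)
    n, p = x_cat.shape
    # One-hot encode every categorical item: each entry has shape (n, L_j).
    onehot = [np.eye(n_levels[j])[x_cat[:, j]] for j in range(p)]
    # log(x!) is the same for every class; it only shifts the log-likelihood.
    log_fact = np.vectorize(math.lgamma)(x_cnt + 1.0).sum(axis=1)
    resp = rng.dirichlet(np.ones(n_classes), size=n)  # random soft start
    eps = 1e-10  # keeps probabilities and rates strictly positive
    history = []
    for _ in range(n_iter):
        # M-step: class weights, item probabilities, Poisson rates.
        nk = resp.sum(axis=0)
        pi = nk / n
        theta = [(resp.T @ oh + eps) / (nk[:, None] + eps * oh.shape[1])
                 for oh in onehot]
        lam = (resp.T @ x_cnt + eps) / (nk[:, None] + eps)
        # E-step: log of (class weight x multinomial terms x Poisson terms).
        log_joint = np.log(np.maximum(pi, 1e-300))[None, :]
        for oh, th in zip(onehot, theta):
            log_joint = log_joint + oh @ np.log(th).T
        log_joint = log_joint + x_cnt @ np.log(lam).T - lam.sum(axis=1)[None, :]
        log_joint = log_joint - log_fact[:, None]
        # Normalise over classes with a stable log-sum-exp.
        top = log_joint.max(axis=1, keepdims=True)
        log_marg = top + np.log(np.exp(log_joint - top).sum(axis=1, keepdims=True))
        resp = np.exp(log_joint - log_marg)
        history.append(float(log_marg.sum()))
    return {"pi": pi, "theta": theta, "lam": lam, "resp": resp, "loglik": history}


def simulate(n=400, seed=1):
    """Hypothetical two-class data: 2 three-level items and 3 count variables."""
    rng = np.random.default_rng(seed)
    z = rng.integers(0, 2, size=n)
    probs = np.array([[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]])
    rates = np.array([1.0, 8.0])
    x_cat = np.array([[rng.choice(3, p=probs[k]) for _ in range(2)] for k in z])
    x_cnt = rng.poisson(np.repeat(rates[z][:, None], 3, axis=1))
    return x_cat, x_cnt, z
```

Calling `fit_mixed_lcm(x_cat, x_cnt, n_classes=2, n_levels=[3, 3])` on the output of `simulate()` returns the estimated class weights, item probabilities, Poisson rates, and posterior class memberships. In practice the number of classes would be compared across fits with a criterion such as AIC or BIC, and the count columns would come from a document-term matrix built from the review text.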
