Automatic Classification and Vocabulary Analysis of Political Bias in News Articles by Using Subword Tokenization

  • Received : 2020.07.22
  • Accepted : 2020.08.25
  • Published : 2021.01.31

Abstract

News articles on politics exhibit polarized, biased characteristics, such as conservative and liberal stances, which is called political bias. To classify the bias of news articles, we constructed a keyword-based dataset. Most embedding studies represent a sentence as a sequence of morphemes. In this work, we expect the number of unknown tokens to be reduced when sentences are composed of subwords segmented by a language model. We propose a document embedding model based on subword tokenization and apply it to an SVM and a feedforward neural network to classify political bias. Compared with a document embedding model based on morphological analysis, the subword-based document embedding model achieved the highest accuracy, 78.22%, and we confirmed that subword tokenization reduces the number of unknown tokens. Using the best-performing embedding model from the bias classification task, we extracted keywords associated with politicians and verified the bias of each keyword through its average similarity with the vectors of politicians of each political tendency.
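
The claim that subword tokenization reduces unknown tokens can be illustrated with a minimal sketch. The vocabularies below are toy examples, and the greedy longest-match segmenter stands in for the paper's language-model-based segmentation; neither is the authors' actual implementation:

```python
# Toy vocabularies (illustrative only, not from the paper's corpus).
WORD_VOCAB = {"the", "senator", "voted", "for", "reform"}
SUBWORD_VOCAB = {"the", "senat", "or", "vot", "ed", "for", "re", "form", "s"}

def word_tokenize(sentence):
    """Word-level tokenization: any unseen word becomes <unk>."""
    return [w if w in WORD_VOCAB else "<unk>" for w in sentence.split()]

def subword_tokenize(word, vocab=SUBWORD_VOCAB):
    """Greedy longest-match segmentation of a word into subword pieces."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:  # no piece matches at this position: emit an unknown token
            pieces.append("<unk>")
            i += 1
    return pieces

sentence = "the senators voted for reforms"
# "senators" and "reforms" are out-of-vocabulary at the word level,
# but are fully covered by in-vocabulary subword pieces.
print(word_tokenize(sentence))
print([p for w in sentence.split() for p in subword_tokenize(w)])
```

Because an out-of-vocabulary surface form can still be decomposed into known pieces, the subword segmentation emits no `<unk>` tokens for this sentence, which mirrors the reduction in unknown tokens reported above.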

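The verification step, judging a keyword's bias by its average similarity to politician vectors of each tendency, can be sketched as follows. The three-dimensional vectors are hypothetical toy embeddings, and `mean_similarity` is an assumed helper, not the authors' code:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_similarity(keyword_vec, politician_vecs):
    """Average cosine similarity of a keyword to a group of politician vectors."""
    return sum(cosine(keyword_vec, p) for p in politician_vecs) / len(politician_vecs)

# Hypothetical 3-dimensional embeddings (illustrative only).
conservative_politicians = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
liberal_politicians = [[0.1, 0.9, 0.1], [0.0, 0.8, 0.2]]
keyword = [0.7, 0.2, 0.1]

cons = mean_similarity(keyword, conservative_politicians)
lib = mean_similarity(keyword, liberal_politicians)
label = "conservative" if cons > lib else "liberal"
print(label, round(cons, 3), round(lib, 3))
```

A keyword is labeled with the tendency whose politician vectors it is, on average, closer to; in this toy example the keyword vector points toward the conservative group.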
