http://dx.doi.org/10.3745/KTSDE.2021.10.1.1

Automatic Classification and Vocabulary Analysis of Political Bias in News Articles by Using Subword Tokenization  

Cho, Dan Bi (Dept. of Computer Engineering, Kookmin University)
Lee, Hyun Young (Dept. of Computer Engineering, Kookmin University)
Jung, Won Sup (School of Liberal Studies, Kyungnam University)
Kang, Seung Shik (School of Software, Kookmin University)
Publication Information
KIPS Transactions on Software and Data Engineering / v.10, no.1, 2021, pp. 1-8
Abstract
In the political coverage of news articles, reporting is polarized along conservative and liberal lines, a phenomenon called political bias. We constructed a keyword-based dataset to classify the political bias of news articles. Most embedding studies represent a sentence as a sequence of morphemes. In this work, we expect the number of unknown tokens to be reduced when sentences are composed of subwords segmented by a language model. We propose a document embedding model with subword tokenization and apply it to an SVM and a feedforward neural network to classify political bias. Compared with document embeddings based on morphological analysis, the subword-based document embedding model achieved the highest accuracy, 78.22%, and we confirmed that subword tokenization reduced the number of unknown tokens. Using the best-performing embedding model for the bias classification task, we extracted keywords associated with politicians. The bias of each keyword was verified by its average similarity to the vectors of politicians of each political tendency.
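The two key ideas in the abstract, subword segmentation to reduce unknown tokens and keyword-bias verification by average similarity to politician vectors, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the greedy longest-match segmenter stands in for a trained subword tokenizer, and the vocabulary, vectors, and function names are hypothetical toy data.

```python
import math

def segment(word, subword_vocab):
    """Greedy longest-match-first subword segmentation (WordPiece-style sketch)."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in subword_vocab:  # take the longest known subword
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append("<unk>")  # no subword covers this character
            i += 1
    return pieces

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def avg_similarity(keyword_vec, politician_vecs):
    """Average cosine similarity of a keyword vector to a group of politician vectors."""
    return sum(cosine(keyword_vec, p) for p in politician_vecs) / len(politician_vecs)

# A word unseen at the word level is still covered by its subwords.
word_vocab = {"news", "bias"}
subword_vocab = {"politic", "s", "news", "bias"}
print("politics" in word_vocab)            # False: unknown token at word level
print(segment("politics", subword_vocab))  # ['politic', 's']

# Toy bias check: the keyword is closer, on average, to one group's vectors.
conservative = [[1.0, 0.1], [0.9, 0.2]]
liberal = [[0.1, 1.0], [0.2, 0.9]]
keyword = [0.8, 0.3]
print(avg_similarity(keyword, conservative) > avg_similarity(keyword, liberal))  # True
```

In the paper's setting, the subword pieces would come from a trained language-model tokenizer and the vectors from the proposed document embedding model; the sketch only shows why subword coverage shrinks the unknown-token count and how average similarity ranks a keyword's leaning.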
Keywords
Political Bias; AI Bias; Lexical Bias; Document Embedding; Subword Tokenizer;
Citations & Related Records
Reference
1 S. Greene and P. Resnik, "More than words: Syntactic packaging and implicit sentiment," in Proceedings of the Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Boulder, pp.503-511, 2009.
2 M. Recasens, C. Danescu-Niculescu-Mizil, and D. Jurafsky, "Linguistic models for analyzing and detecting biased language," in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Vol.1, pp.1650-1659, 2013.
3 C. Hube and B. Fetahu, "Detecting biased statements in wikipedia," in Companion Proceedings of the Web Conference 2018, Lyon, pp.1779-1786, 2018.
4 L. Fan, M. White, E. Sharma, R. Su, P. Choubey, R. Huang, and L. Wang, "In plain sight: media bias through the lens of factual reporting," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, pp.6343-6349, 2019.
5 N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan, "A survey on bias and fairness in machine learning," arXiv preprint arXiv:1908.09635, 2019.
6 T. Bolukbasi, K. Chang, J. Zou, V. Saligrama, and A. Kalai, "Man is to computer programmer as woman is to homemaker? Debiasing word embeddings," in Proceedings of the Advances in Neural Information Processing Systems, Red Hook, pp.4349-4357, 2016.
7 S. Bordia and S. Bowman, "Identifying and reducing gender bias in word-level language models," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, pp.7-15, 2019.
8 J. Font and M. Costa-Jussa, "Equalizing gender biases in neural machine translation with word embeddings techniques," in Proceedings of the 1st ACL Workshop on Gender Bias in Natural Language Processing, Florence, pp.147-154, 2019.
9 E. Sheng, K. Chang, P. Natarajan, and N. Peng, "The woman worked as a babysitter: On biases in language generation," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, pp.3407-3412, 2019.
10 T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Proceedings of the Advances in Neural Information Processing Systems, Nevada, pp.3111-3119, 2013.
11 T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," in Proceedings of the 1st International Conference on Learning Representations, 2013.
12 J. Pennington, R. Socher, and C. Manning, "Glove: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, pp.1532-1543, 2014.
13 J. Botha and P. Blunsom, "Compositional morphology for word representations and language modelling," in Proceedings of the International Conference on Machine Learning, Beijing, Vol.32, pp.1899-1907, 2014.
14 R. Cotterell and H. Schutze, "Morphological word embeddings," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics, Denver, pp.1287-1292, 2015.
15 J. Wieting, M. Bansal, K. Gimpel, and K. Livescu, "Towards universal paraphrastic sentence embeddings," in Proceedings of the 4th International Conference on Learning Representations, 2016.
16 T. Kudo, "Subword regularization: Improving neural network translation models with multiple subword candidates," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Vol.1, pp.66-75, 2018.
17 D. Cho, H. Lee, and S. Kang, "Sentiment analysis for informal text by using sentencepiece tokenizer and subword embedding," in Proceedings of the Korea Computer Congress 2020, Online, pp.395-397, 2020.
18 M. Domingo, M. Garcia-Martinez, A. Helle, F. Casacuberta, and M. Herranz, "How much does tokenization affect neural machine translation?," arXiv preprint arXiv:1812.08621, 2018.
19 P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, Vol.5, pp.135-146, 2017. DOI