http://dx.doi.org/10.9708/jksci.2020.25.05.187

Self-Supervised Document Representation Method  

Yun, Yeoil (Graduate School of Business IT, Kookmin University)
Kim, Namgyu (School of Management Information Systems, Kookmin University)
Abstract
Recently, various deep learning-based text embedding methods have been proposed. In particular, embedding new text with a pre-trained language model, which is trained on a tremendous amount of text data, has become the dominant approach. However, traditional pre-trained language models have a limitation: when a text contains too many tokens, it is difficult for them to capture the unique context of that text. In this paper, we propose a self-supervised learning-based fine-tuning method for pre-trained language models that infers vectors for long texts. We applied the proposed method to news articles, classified them into categories, and compared the classification accuracy with that of traditional models. The results confirm that the vectors generated by the proposed model express the inherent characteristics of a document more accurately than the vectors generated by the traditional models.
Keywords
Deep Learning; Document Embedding; Pre-Trained Language Model; Self-Supervised Learning; Text Mining
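
The abstract summarizes the pipeline at a high level: a pre-trained language model is further fine-tuned on the target corpus with a self-supervised objective, and the fine-tuned encoder is then used to infer document vectors whose quality is evaluated through category classification. The following is a minimal, hypothetical sketch of that pipeline, not the authors' implementation: it assumes a BERT-style model from the Hugging Face transformers library, uses masked-language modeling as a stand-in self-supervised objective (the paper's own objective may differ), and mean-pools token representations into a document vector.

```python
# Hypothetical sketch: self-supervised fine-tuning of a pre-trained encoder
# on the target corpus, followed by document-vector inference.
import torch
from datasets import Dataset
from transformers import (AutoModel, AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Placeholder corpus; in the paper this would be the news-article collection.
texts = ["First long news article ...", "Second long news article ..."]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = (Dataset.from_dict({"text": texts})
            .map(tokenize, batched=True, remove_columns=["text"]))

# Step 1: self-supervised fine-tuning on the target corpus (masked-language
# modeling used here as a stand-in objective; no labels are required).
mlm_model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)
trainer = Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="news-mlm", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=train_ds,
    data_collator=collator,
)
trainer.train()
trainer.save_model("news-mlm")

# Step 2: infer document vectors from the fine-tuned encoder by mean-pooling
# token representations (padding tokens are excluded from the average).
encoder = AutoModel.from_pretrained("news-mlm")
encoder.eval()
with torch.no_grad():
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=512, return_tensors="pt")
    hidden = encoder(**enc).last_hidden_state             # (batch, seq_len, dim)
    mask = enc["attention_mask"].unsqueeze(-1).float()
    doc_vectors = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# doc_vectors can now be fed to any downstream classifier (e.g., a softmax
# layer over news categories) to evaluate embedding quality, as in the paper.
```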