http://dx.doi.org/10.6109/jkiice.2021.25.11.1505

Sentence Filtering Dataset Construction Method about Web Corpus  

Nam, Chung-Hyeon (Department of Computer Engineering, Korea University of Technology and Education)
Jang, Kyung-Sik (Department of Computer Engineering, Korea University of Technology and Education)
Abstract
Pretrained models that achieve high performance on various natural language processing tasks have the advantage of learning the linguistic patterns of sentences from a large corpus during training, which allows each token in an input sentence to be represented by an appropriate feature vector. One way to construct the corpus required to train such a model is to collect text with a web crawler. However, sentences on the web follow diverse patterns, so some or all of a collected sentence may consist of unnecessary words. In this paper, we propose a dataset construction method for filtering sentences that contain unnecessary words, using neural network models applied to a corpus collected from the web. As a result, we construct a dataset containing a total of 2,330 sentences. We also evaluate the performance of several neural network models on the constructed dataset; the BERT model shows the highest performance, with an accuracy of 93.75%.
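The filtering step described in the abstract amounts to binary sentence classification with a pretrained encoder. Below is a minimal sketch, not the authors' implementation, of how such a filter could be set up with the Hugging Face Transformers library cited in the references. The checkpoint name and the label convention (1 = filter out, 0 = keep) are assumptions, and the classifier would first need to be fine-tuned on a labeled dataset such as the 2,330-sentence one constructed in the paper before its outputs are meaningful.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Assumed checkpoint; the paper does not specify which BERT variant was used.
CHECKPOINT = "bert-base-multilingual-cased"

tokenizer = BertTokenizer.from_pretrained(CHECKPOINT)
model = BertForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)
model.eval()

# Example web-crawled sentences: one clean, one dominated by boilerplate words.
sentences = [
    "Pretrained language models learn linguistic patterns from large corpora.",
    "Click here!!! Subscribe now free download banner login",
]

# Tokenize the batch and take the argmax over the two logits to get a
# keep/filter decision per sentence (the label semantics are an assumption).
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

for sentence, label in zip(sentences, logits.argmax(dim=-1).tolist()):
    print("FILTER" if label == 1 else "KEEP", "|", sentence)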
Keywords
Natural language processing; Deep learning; Sentence filtering; Corpus construction
References
1 J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, pp. 4171-4186, 2019.
2 P. W. Park, "Text-CNN Based Intent Classification Method for Automatic Input of Intent Sentences in Chatbot," Journal of Korean Institute of Information Technology, vol. 18, no. 1, pp. 19-25, Jan. 2020.
3 N. Utiu and V. S. Ionescu, "Learning Web Content Extraction with DOM Features," in Proceedings of the 2018 IEEE 14th International Conference on Intelligent Computer Communication and Processing (ICCP), Cluj-Napoca, pp. 1724-1734, 2018.
4 Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, "XLNet: Generalized Autoregressive Pretraining for Language Understanding," in Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS), Vancouver, pp. 5754-5764, 2019.
5 Wikipedia, Wikipedia Dump Data [Online]. Available: https://www.wikipedia.org/.
6 J. M. Kim and J. H. Lee, "Text Document Classification Based on Recurrent Neural Network Using Word2vec," Journal of Korean Institute of Intelligent Systems, vol. 27, no. 6, pp. 560-565, Dec. 2017.
7 H. J. Jeon and C. Koh, "Text Extraction Algorithm using the HTML Logical Structure Analysis," The KDCS Transactions, vol. 16, no. 3, pp. 445-455, Jun. 2015.
8 Huggingface, Transformers [Online]. Available: https://www.github.com/huggingface/.
9 B. D. Nguyen-Hoang, B. T. Pham-Hong, Y. Jin, and P. T. V. Le, "Genre-Oriented Web Content Extraction with Deep Convolutional Neural Networks and Statistical Methods," in Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation, Hong Kong, pp. 476-485, 2018.