http://dx.doi.org/10.5909/JBE.2019.24.1.77

Related Documents Classification System by Similarity between Documents  

Jeong, Jisoo (Department of Software Convergence, Sejong University)
Jee, Minkyu (Department of Software Convergence, Sejong University)
Go, Myunghyun (Department of Digital Contents, Sejong University)
Kim, Hakdong (Department of Digital Contents, Sejong University)
Lim, Heonyeong (Department of Digital Contents, Sejong University)
Lee, Yurim (Department of Artificial Intelligence and Linguistic Engineering, Sejong University)
Kim, Wonil (Department of Software, Sejong University)
Publication Information
Journal of Broadcast Engineering / v.24, no.1, 2019, pp. 77-86
Abstract
This paper proposes using machine learning to analyze previously collected documents and to classify documents based on that analysis. Data is collected using keywords associated with a specific domain, and non-meaningful elements such as special characters are removed. A Korean morpheme analyzer then tags the words of each collected document, identifying nouns, verbs, and sentences. The documents are embedded using a Doc2Vec model, which converts documents into vectors. The similarity between documents is measured with the embedded model, and a document classifier is trained using machine learning algorithms. Among the trained classification models that were compared, the support vector machine performed best, with an F1-score of 0.83.
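The pipeline described in the abstract (cleaning, vectorizing, measuring similarity, scoring a classifier with F1) can be illustrated with a minimal sketch. This is not the authors' implementation. To stay dependency-free it swaps in a plain token-count vector for the Doc2Vec embedding, whitespace splitting for the Korean morpheme analyzer, and a fixed similarity threshold for the trained classifier. All function names and the threshold value are hypothetical.

```python
import math
import re
from collections import Counter


def clean_text(text):
    """Replace special characters (anything that is not a word character
    or whitespace) with a space, then lowercase the result."""
    return re.sub(r"[^\w\s]", " ", text).lower()


def embed(text):
    """Stand-in for a Doc2Vec embedding (assumption): a sparse token-count
    vector built by whitespace splitting, not by morpheme analysis."""
    return Counter(clean_text(text).split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_related(doc_a, doc_b, threshold=0.5):
    """Hypothetical rule: call two documents related when their similarity
    reaches the threshold. The paper trains classifiers instead."""
    return cosine_similarity(embed(doc_a), embed(doc_b)) >= threshold


def f1_score(y_true, y_pred, positive=1):
    """F1-score: harmonic mean of precision and recall for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In a setup closer to the one the abstract describes, `embed` would presumably be replaced by a trained Doc2Vec model and `is_related` by a fitted classifier such as a support vector machine. The similarity and F1 calculations would keep the same form.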
Keywords
Document analysis; Related document; Doc2Vec; Machine learning; Document classification;
Citations & Related Records
Times Cited By KSCI: 3