• Title/Summary/Keyword: Document-Classification

A Study on Automatic Classification of Newspaper Articles Based on Unsupervised Learning by Departments (비지도학습 기반의 행정부서별 신문기사 자동분류 연구)

  • Kim, Hyun-Jong;Ryu, Seung-Eui;Lee, Chul-Ho;Nam, Kwang Woo
    • Journal of the Korea Academia-Industrial cooperation Society / v.21 no.9 / pp.345-351 / 2020
  • Administrative agencies today are paying keen attention to big data analysis to improve their policy responsiveness. Of all the big data, news articles can be used to understand public opinion regarding policy and policy issues. The amount of news output has increased rapidly because of the emergence of new online media outlets, which calls for the use of automated bots or automatic document classification tools. There are, however, limits to the automatic collection of news articles related to specific agencies or departments based on the existing news article categories and keyword search queries. Thus, this paper proposes a method to process articles using classification glossaries that take into account each agency's different work features. To this end, classification glossaries were developed by extracting the work features of different departments using Word2Vec and topic modeling techniques from news articles related to different agencies. As a result, the automatic classification of newspaper articles for each department yielded approximately 71% accuracy. This study is meaningful in making academic and practical contributions because it presents a method of extracting the work features for each department, and it is an unsupervised learning-based automatic classification method for automatically classifying news articles relevant to each agency.
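
The glossary-based assignment described in this abstract can be illustrated with a minimal sketch. The department names, glossary terms, and scoring rule (counting glossary terms that occur in an article) are hypothetical placeholders chosen for illustration; they are not the authors' Word2Vec and topic-modeling pipeline.

```python
from collections import Counter

# Hypothetical glossaries; in the paper these are derived from
# department-related news articles with Word2Vec and topic modeling.
GLOSSARIES = {
    "transport": {"bus", "subway", "road", "traffic"},
    "welfare": {"pension", "childcare", "subsidy", "elderly"},
}

def classify(article: str, glossaries=GLOSSARIES):
    """Return the department whose glossary terms occur most often,
    or None when no glossary term appears in the article."""
    tokens = Counter(article.lower().split())
    scores = {
        dept: sum(tokens[term] for term in terms)
        for dept, terms in glossaries.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Because no labeled training articles are needed, a scheme of this kind stays unsupervised: only the glossaries have to be built per department.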

Study on SCS CN Estimation and Flood Flow Characteristics According to the Classification Criteria of Hydrologic Soil Groups (수문학적 토양군의 분류기준에 따른 SCS CN 및 유출변화특성에 관한 연구)

  • Ahn, Seung-Seop;Park, Ro-Sam;Ko, Soo-Hyun;Song, In-Ryeol
    • Journal of Environmental Science International / v.15 no.8 / pp.775-784 / 2006
  • In this study, CN values were estimated from a detailed soil map and land cover characteristics for the basin upstream of the Kumho water-level gauge, located in the upper reaches of the Kumho River, and the hydrologic and morphological characteristic factors of the basin were extracted from DEM data. Runoff analysis was then conducted with the WMS model to examine how the estimated CN values affect runoff characteristics. When the 1987 classification criteria were applied, most soil in the study area was identified as type D. Under the 1995 criteria, however, type B and type C soils were distributed more widely. Comparing the CN values estimated under the two standards, the values under the 1995 criteria were lower than those under the 1987 criteria. It was also inferred that the CN value was underestimated when the Kumho River maintenance plan was drawn up. In the runoff analysis, the runoff generation patterns under the two sets of soil group classification criteria appeared similar, but the peak runoff rate under the 1995 criteria was smaller overall than under the 1987 criteria. Considering statistics such as the prediction errors, the mean squared errors, and the coefficient of determination together, the 1995 classification criteria seemed appropriate for hydrologic soil group classification. The differences in these statistics, however, were fairly small. Future studies need to gather further evidence on soil classification across many more watersheds and under heavy rainfall.
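
The link between a lower CN and smaller runoff can be made concrete with the standard SCS curve number relation, sketched below in metric units. This computes direct runoff depth only; it is not the WMS model used in the study, and the CN values in the usage note are illustrative, not taken from the paper. The default initial abstraction ratio of 0.2 is the conventional choice and is an assumption here.

```python
def scs_runoff_mm(rainfall_mm: float, cn: float, ia_ratio: float = 0.2) -> float:
    """Direct runoff depth (mm) from the SCS curve number method."""
    if not 0 < cn <= 100:
        raise ValueError("CN must be in (0, 100]")
    s = 25400.0 / cn - 254.0   # potential maximum retention S (mm)
    ia = ia_ratio * s          # initial abstraction Ia = 0.2 * S by default
    if rainfall_mm <= ia:
        return 0.0             # all rainfall is abstracted, no direct runoff
    return (rainfall_mm - ia) ** 2 / (rainfall_mm - ia + s)
```

For the same storm, `scs_runoff_mm(100, 70)` is smaller than `scs_runoff_mm(100, 85)`, which matches the direction of the result reported above: criteria that yield lower CN values yield less runoff.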

Research on Text Classification of Research Reports using Korea National Science and Technology Standards Classification Codes (국가 과학기술 표준분류 체계 기반 연구보고서 문서의 자동 분류 연구)

  • Choi, Jong-Yun;Hahn, Hyuk;Jung, Yuchul
    • Journal of the Korea Academia-Industrial cooperation Society / v.21 no.1 / pp.169-177 / 2020
  • In South Korea, the results of R&D in science and technology are submitted to the National Science and Technology Information Service (NTIS) in reports that carry Korea national science and technology standard classification codes (K-NSCC). However, because there are more than 2000 sub-categories, it is non-trivial to choose correct classification codes without a clear understanding of the K-NSCC. In addition, there are few cases of automatic document classification research based on the K-NSCC, and there are no training data in the public domain. To the best of our knowledge, this study is the first attempt to build a highly performing K-NSCC classification system based on NTIS report meta-information from the last five years (2013-2017). To this end, about 210 mid-level categories were selected, and we conducted preprocessing that takes the characteristics of research report metadata into account. More specifically, we propose a convolutional neural network (CNN) technique using only task names and keywords, which are the most influential fields. The proposed model is compared with several machine learning methods that show good performance in text classification (e.g., the linear support vector classifier, CNN, gated recurrent unit, etc.), and it has a performance advantage of 1% to 7% based on a top-three F1 score.

The Analysis of MOUs and their Activities Related to Port State Control

  • Min, Byung-Sun;Kim, Soon-Kap;Kong, Gil-Young;Kim, Chol-Seong;Lee, Yoon-Sok;Kim, Jung-Man;Lee, Chung-Ro
    • Journal of Navigation and Port Research / v.27 no.3 / pp.321-327 / 2003
  • A Memorandum of Understanding (MOU) is a document of intent signed between Port State Control (PSC) authorities to carry out uniform control as agreed. Although an MOU is not legally binding, denunciation follows when agreed items are violated without just cause. The International Maritime Organization (IMO) and regional MOUs have been amending and reinforcing the relevant requirements so that port State authorities can effectively eradicate substandard vessels. However, various problems have arisen from the differing requirements of each MOU, the lack of information exchange between MOUs, the lack of uniform PSC implementation within the same MOU, and the lack of an adequate system owing to the short history of MOUs. In this paper, the MOU records for three years (1999-2001) were analyzed by MOU, type of ship, deficiency code, classification society, number of inspected ships, and number of detained ships to assess these problems (statistics for 2002 will be published after August 2003). The purpose of this study is to help better understand PSC activities within each MOU and to establish effective countermeasures by identifying the problems that currently exist in PSC.

Using Ontologies for Semantic Text Mining (시맨틱 텍스트 마이닝을 위한 온톨로지 활용 방안)

  • Yu, Eun-Ji;Kim, Jung-Chul;Lee, Choon-Youl;Kim, Nam-Gyu
    • The Journal of Information Systems / v.21 no.3 / pp.137-161 / 2012
  • The increasing interest in big data analysis using various data mining techniques indicates that many commercial data mining tools now need to be equipped with fundamental text analysis modules. The most essential prerequisite for accurate analysis of text documents is an understanding of the exact semantics of each term in a document. The difficulties in understanding the exact semantics of terms are mainly attributable to homonym and synonym problems, which are traditional problems in the natural language processing field. Some major text mining tools provide a thesaurus to solve these problems, but a thesaurus cannot be used to resolve complex synonym problems. Furthermore, a thesaurus does not address homonym problems and hence cannot solve them. In this paper, we propose a semantic text mining methodology that uses ontologies to improve the quality of text mining results by resolving the semantic ambiguity caused by homonym and synonym problems. We evaluate the practical applicability of the proposed methodology by performing a classification analysis to predict customer churn using real transactional data and Q&A articles from the "S" online shopping mall in Korea. The experiments revealed that the prediction model produced by our proposed semantic text mining method outperformed the model produced by traditional text mining in terms of prediction accuracy measures such as the response, captured response, and lift.

A Study of Designing the Intelligent Information Retrieval System by Automatic Classification Algorithm (자동분류 알고리즘을 이용한 지능형 정보검색시스템 구축에 관한 연구)

  • Seo, Whee
    • Journal of Korean Library and Information Science Society / v.39 no.4 / pp.283-304 / 2008
  • This study develops an intelligent retrieval system that, through a learning function, automatically presents category terms for the initial query (association terms connected with the knowledge structure of the relevant terminology), automatically reformulates the search, and runs it with the association terms. To this end, this theoretical study of an intelligent automatic indexing system extracts experts' index terms through learning and clustering algorithms for automatic classification, text mining (categorization), and document category representation. The system also performs well in terms of cost, time, recall ratio, and precision ratio.

Analysis of Term Ambiguity based on Genetic Algorithm (유전자 알고리즘 기반 용어 중의성 분석)

  • Kim, Jeong-Joon;Chung, Sung-Taek;Park, Jeong-Min
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.17 no.5 / pp.131-136 / 2017
  • Recently, with the development of Internet media, the volume of documents on the web has been increasing exponentially. These materials are described in text and are classified according to what the text is mostly about. However, text leaves much room for ambiguous interpretation and must be examined from various angles to be interpreted correctly. Conventional classification methods classify documents based only on the surface form of the text. In this paper, we analyze term ambiguity using a genetic algorithm and local preserving based techniques, and implement a clustering system that subdivides the terms. Finally, the performance of the proposed approach was evaluated by comparing the implementation results with traditional methods.

The Classification Method of the Document Plagiarism Similarity based on Similar Syntagma Tree and Non-Index Term (유사 어절 트리와 비 색인어 기반의 문서 표절 유사도 분류 방법)

  • 천승환;김미영;이귀상
    • Journal of the Korea Computer Industry Society / v.3 no.8 / pp.1039-1048 / 2002
  • It is difficult and laborious to distinguish originals from plagiarized copies among electronic documents or documents received online, especially student homework, because in many cases the assignments are written on the same subject. Existing methods, which find the most appropriate category using the frequency of index terms in the documents to be classified, are not suited to this problem. In this paper, a new classification method is proposed to distinguish originals from plagiarized copies among similarly written documents; it is based on the syntagma vector, except the similar syntagma tree structure and non-index terms.

Automatic Classification of Web documents According to their Styles (스타일에 따른 웹 문서의 자동 분류)

  • Lee, Kong-Joo;Lim, Chul-Su;Kim, Jae-Hoon
    • The KIPS Transactions:PartB / v.11B no.5 / pp.555-562 / 2004
  • A genre or a style is a view of documents that differs from a subject or a topic. Style is also a criterion for classifying documents. There have been several studies on detecting the style of textual documents, but only a few of them have dealt with web documents. In this paper we suggest sets of features to detect the styles of web documents. Web documents differ from textual documents in that they contain URLs and HTML tags within the pages. We introduce features specific to web documents, which are extracted from URLs and HTML tags. Experimental results enable us to evaluate their characteristics and performance.
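
Extracting features from URLs and HTML tags could look like the sketch below, which uses only the Python standard library. The specific features (URL path depth, file extension, and counts of link, image, and table tags) are assumptions chosen for illustration; the abstract does not list the authors' actual feature sets.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse

class TagCounter(HTMLParser):
    """Counts the opening tags seen in an HTML page."""
    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1

def style_features(url: str, html: str) -> dict:
    """Hypothetical style features taken from the URL and the HTML tags."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    last = segments[-1] if segments else ""
    ext = last.rsplit(".", 1)[1] if "." in last else ""
    counter = TagCounter()
    counter.feed(html)
    return {
        "url_depth": len(segments),      # number of path segments
        "url_ext": ext,                  # file extension, "" if none
        "n_links": counter.tags["a"],
        "n_images": counter.tags["img"],
        "n_tables": counter.tags["table"],
    }
```

A feature dictionary like this can be fed to any standard classifier alongside conventional text features.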

Financial Instruments Recommendation based on Classification Financial Consumer by Text Mining Techniques (비정형 데이터 분석을 통한 금융소비자 유형화 및 그에 따른 금융상품 추천 방법)

  • Lee, Jaewoong;Kim, Young-Sik;Kwon, Ohbyung
    • Journal of Information Technology Services / v.15 no.4 / pp.1-24 / 2016
  • With innovation in information technology, non-face-to-face robo-advisors offering high accessibility and convenience are spreading. Current robo-advisors recommend appropriate investment products after inferring investment propensity from structured data entered directly or indirectly by individuals. However, having financial consumers state or input their own subjective investment propensity is inconvenient and obtrusive. Hence, this study proposes a way to infer investment propensity from unstructured data that customers voluntarily disclose during consultations or online. Since prediction performance on unstructured documents differs according to the characteristics of the text, this study selects a classification algorithm optimized for the text left by financial consumers by evaluating the prediction performance of various learning discrimination algorithms, and proposes an intelligent method that automatically recommends investment products. User tests were given to MBA students: after being shown the recommended investment and a list of investment products, they were asked about their satisfaction. Satisfaction was measured separately for investment propensity and for recommended products. The results suggest that users were highly satisfied with the investment products recommended by the proposed method and that the method can be applied to non-face-to-face robo-advisors.