• Title/Summary/Keyword: Document Evaluation

A Methodology for Automatic Multi-Categorization of Single-Categorized Documents (단일 카테고리 문서의 다중 카테고리 자동확장 방법론)

  • Hong, Jin-Sung; Kim, Namgyu; Lee, Sangwon
    • Journal of Intelligence and Information Systems / v.20 no.3 / pp.77-92 / 2014
  • Recently, numerous documents containing unstructured data and text have been created due to the rapid increase in the use of social media and the Internet. Each document is usually assigned a specific category for the convenience of users. In the past, categorization was performed manually. However, manual categorization cannot guarantee accuracy and also requires a large amount of time and cost. Many studies have addressed the automatic creation of categories to overcome the limitations of manual categorization. Unfortunately, most of these methods cannot be applied to complex documents with multiple topics because they assume that a document can be assigned to only one category. To overcome this limitation, some studies have attempted to categorize each document into multiple categories. However, these are also limited in that their learning process requires training on a multi-categorized document set. These methods therefore cannot be applied to multi-categorization of most documents unless multi-categorized training sets are provided. To remove the requirement of a multi-categorized training set imposed by traditional multi-categorization algorithms, we propose a new methodology that can extend the category of a single-categorized document to multiple categories by analyzing relationships among categories, topics, and documents. First, we find the relationship between documents and topics by using the result of topic analysis for single-categorized documents. Second, we construct a correspondence table between topics and categories by investigating the relationship between them. Finally, we calculate the matching scores of each document against multiple categories.
The results imply that a document can be classified into a certain category if and only if the matching score is higher than the predefined threshold. For example, we can classify a certain document into three categories whose matching scores exceed the predefined threshold. The main contribution of our study is that our methodology can improve the applicability of traditional multi-category classifiers by generating multi-categorized documents from single-categorized documents. Additionally, we propose a module for verifying the accuracy of the proposed methodology. For performance evaluation, we performed intensive experiments with news articles. News articles are clearly categorized by theme, and they contain less vulgar language and slang than other common text documents. We collected news articles from July 2012 to June 2013. The number of articles varies widely across categories. This is because readers have different levels of interest in each category. The variation is also attributed to differences in the frequency of events in each category. To minimize the distortion caused by the differing number of articles per category, we extracted 3,000 articles from each of the eight categories. Therefore, the total number of articles used in our experiments was 24,000. The eight categories were "IT Science," "Economy," "Society," "Life and Culture," "World," "Sports," "Entertainment," and "Politics." Using the collected news articles, we calculated the document/category correspondence scores from the topic/category and document/topic correspondence scores. The document/category correspondence score indicates the degree to which each document corresponds to a certain category. As a result, we could present two additional categories for each of the 23,089 documents.
Precision, recall, and F-score were 0.605, 0.629, and 0.617, respectively, when only the top 1 predicted category was evaluated, whereas they were 0.838, 0.290, and 0.431 when the top 1-3 predicted categories were considered. Notably, precision, recall, and F-score varied widely across the eight categories.
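
The three steps described in the abstract above (document/topic scores, a topic/category correspondence table, then thresholded document/category scores) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the document/category score is the sum over topics of the document/topic weight times the topic/category weight, and all weights, topic names, and the `THRESHOLD` value are hypothetical.

```python
# Hypothetical document/topic weights for one document (e.g. from a topic model).
doc_topic = {"t1": 0.6, "t2": 0.3, "t3": 0.1}

# Hypothetical topic/category correspondence table.
topic_category = {
    "t1": {"Economy": 0.7, "Politics": 0.3},
    "t2": {"Politics": 0.8, "World": 0.2},
    "t3": {"Sports": 1.0},
}

THRESHOLD = 0.2  # assumed cut-off; the abstract does not state a value


def category_scores(doc_topic, topic_category):
    """Combine document/topic and topic/category weights into document/category scores."""
    scores = {}
    for topic, weight in doc_topic.items():
        for category, strength in topic_category.get(topic, {}).items():
            scores[category] = scores.get(category, 0.0) + weight * strength
    return scores


def assign_categories(scores, threshold):
    """Return every category whose score is strictly above the threshold."""
    return sorted(c for c, s in scores.items() if s > threshold)


scores = category_scores(doc_topic, topic_category)
assigned = assign_categories(scores, THRESHOLD)
print(assigned)
```

With these made-up weights the document is assigned to more than one category, which is the kind of multi-category extension the abstract describes.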

Impact of Diverse Document-evaluation Measure-based Searching Methods in Big Data Search Accuracy (빅데이터 검색 정확도에 미치는 다양한 측정 방법 기반 검색 기법의 효과)

  • Kim, Ji young; Han, DaHyeon; Kim, Jongkwon
    • Journal of KIISE / v.44 no.5 / pp.553-558 / 2017
  • With the rapid growth of Big Data, research on extracting meaningful information is being pursued by both academia and industry. In particular, data characteristics derived from analysis and the researcher's intention are key factors for search algorithms to produce accurate output. Therefore, properly reflecting both data characteristics and researcher intention is the final goal of data analysis research. Properly analyzed data can help users increase their loyalty to the service provided by a company and use information more effectively and efficiently. In this paper, we explore various document-evaluation methods to improve the accuracy of article search, one of the most frequently used types of search in real life. We also analyze the experimental results and suggest appropriate ways to use the various methods.

A Study on the Case Study and Evaluation Methodology of Operational Availability for a Naval Ship using OT&E Data (운용시험평가 데이터를 활용한 함정 운용가용도 평가 방안 및 사례 연구)

  • Paik, Soonhuem
    • Journal of the Korea Institute of Military Science and Technology / v.17 no.4 / pp.471-478 / 2014
  • The ROK Navy requires more than 90% operational availability in the requirements document for combat ships. This study proposes an evaluation methodology for operational availability, covering the evaluation process, the calculation formula, and the analysis of operational test data. As a case study, the developed methodology is applied to the 00 batch-I naval ship using data acquired during the operational test period. The operational availability calculated from the test data was 90.03%, which satisfied the objective value of 90%. The paper will contribute to establishing an evaluation methodology for operational availability not only for combat ships but also for other general weapon systems.
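
The abstract above does not reproduce the paper's calculation formula. As a rough sketch, the commonly used definition of operational availability (uptime divided by uptime plus downtime) can be checked against a target value as shown here. The function name and the hour figures are hypothetical and are not taken from the paper's test data.

```python
def operational_availability(uptime_hours, downtime_hours):
    """Common definition: Ao = uptime / (uptime + downtime)."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_hours / total


# Hypothetical observation period, for illustration only.
ao = operational_availability(uptime_hours=1850.0, downtime_hours=150.0)
meets_requirement = ao >= 0.90  # 90% target, as stated in the abstract
print(f"Ao = {ao:.2%}, meets 90% target: {meets_requirement}")
```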

Relevance Feedback based on Medicine Ontology for Retrieval Performance Improvement (검색 성능 향상을 위한 약품 온톨로지 기반 연관 피드백)

  • Lim, Soo-Yeon
    • Journal of the Korean Society for Information Management / v.22 no.2 s.56 / pp.41-56 / 2005
  • To extend the Web so that machines can understand and process information, the Semantic Web shares knowledge in the form of ontologies. For precise query processing, this paper proposes a method that uses semantic relations in an ontology as relevance feedback information for query expansion. We conducted experiments in the pharmacy domain. To verify the effectiveness of the semantic relations in the ontology, we compared a keyword-based document retrieval system that assigns weights using frequency information with an ontology-based document retrieval system that uses relevant information in the ontology as relevance feedback. The evaluation of retrieval performance showed that the search engine used the concepts and relations in the ontology to improve precision effectively. It also used them as a basis for inference to improve retrieval performance.
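
The idea of using ontology relations for query expansion can be illustrated with a small sketch. This is not the paper's system: the toy `ONTOLOGY` dictionary, the relation names, and the one-hop expansion strategy are all assumptions made for illustration.

```python
# Toy ontology: term -> {relation name: [related terms]}. Illustrative entries only.
ONTOLOGY = {
    "aspirin": {"is_a": ["analgesic"], "synonym": ["acetylsalicylic acid"]},
    "analgesic": {"is_a": ["drug"]},
}


def expand_query(terms, ontology, relations=("is_a", "synonym")):
    """Append terms reachable in one hop over the chosen relations, without duplicates."""
    expanded = list(terms)
    for term in terms:
        for relation in relations:
            for related in ontology.get(term, {}).get(relation, []):
                if related not in expanded:
                    expanded.append(related)
    return expanded


expanded = expand_query(["aspirin", "headache"], ONTOLOGY)
print(expanded)
```

Only the original query terms are expanded here, so "drug" (two hops from "aspirin") is not added. A real system would also need to weight the added terms against the original ones.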

Design and implementation of an XML Repository System supporting Document Version (버전을 지원하는 XML 저장관리 시스템 설계 및 구현)

  • Son, Chung-Beom; Oh, Kyoung-Keun; Yoo, Jae-Soo
    • The KIPS Transactions:PartD / v.10D no.1 / pp.13-22 / 2003
  • Recently, as the importance of managing Internet documents has increased, XML repository systems that store, retrieve, and manage large XML documents have been actively researched. Version management for XML documents is required in XML applications such as patent documents, software design, and system manuals, in which modified documents have to be managed. In this paper, we propose a data model based on a fragmentation model that supports document versioning. We also design and implement an XML repository system supporting document versioning. Performance evaluation shows that our system outperforms the existing repository system.

RAM(Rolling Airframe missile) Canister Localization on development of technical process and Quality Assurance procedure establishment (RAM(Rolling Airframe Missile) 발사관 국산화에 대한 공정 기술 개발 및 품질 인증 절차 확립)

  • Lee, Sang-Woo; Jo, Jung-Pyo; Lee, Sang-Jae
    • Proceedings of the Korean Society of Propulsion Engineers Conference / 2010.11a / pp.541-547 / 2010
  • This document presents research and results on the localization of the fiber-based launching canister with helical rails for the rolling airframe missile. It covers the development of the technical process, including the manufacturing of the helical rails and the application of flame spraying, which differs from other fiber-based launching canisters in Korea. It also covers the quality assurance procedure, carried out through qualification tests and structural evaluation under the conditions the canister must withstand in a real shipboard environment.

An Innovative Approach of Bangla Text Summarization by Introducing Pronoun Replacement and Improved Sentence Ranking

  • Haque, Md. Majharul; Pervin, Suraiya; Begum, Zerina
    • Journal of Information Processing Systems / v.13 no.4 / pp.752-777 / 2017
  • This paper proposes an automatic method to summarize Bangla news documents. In the proposed approach, pronoun replacement is performed for the first time to minimize dangling pronouns in the summary. After pronouns are replaced, sentences are ranked using term frequency, sentence frequency, numerical figures, and title words. If two sentences have at least 60% cosine similarity, the frequency of the larger sentence is increased, and the smaller sentence is removed to eliminate redundancy. Moreover, the first sentence is always included in the summary if it contains any title word. In Bangla text, numerical figures can be presented both in words and in digits, in a variety of forms. All these forms are identified to assess the importance of sentences. We used a rule-based system in this approach, together with a hidden Markov model and a Markov chain model. To derive the rules, we analyzed 3,000 Bangla news documents and studied some Bangla grammar books. A series of experiments was performed on 200 Bangla news documents and 600 summaries (three summaries per document). The evaluation results demonstrate the effectiveness of the proposed technique over the four latest methods.
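
The redundancy step described above (drop the smaller of two sentences whose cosine similarity is at least 60%) might look roughly like the sketch below. It is not the authors' code: it assumes whitespace tokenization and raw term-frequency vectors, uses English placeholder sentences, and omits the step that increases the frequency of the larger sentence.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between term-frequency vectors of two sentences."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def remove_redundant(sentences, threshold=0.6):
    """Keep longer sentences first; drop any sentence too similar to one already kept."""
    order = sorted(range(len(sentences)), key=lambda i: -len(sentences[i].split()))
    kept = []
    for i in order:
        if all(cosine_similarity(sentences[i], sentences[j]) < threshold for j in kept):
            kept.append(i)
    return [sentences[i] for i in sorted(kept)]  # restore original order


sentences = [
    "the river flooded the village last night",
    "the river flooded the village",
    "schools will reopen on monday",
]
summary_candidates = remove_redundant(sentences)
print(summary_candidates)
```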

The Project and Prospects of Old Documents Information Systems in Korea (한국 고문헌 정보시스템의 구축 및 전망)

  • Kang, Soon-Ae
    • Journal of the Korean Society for Library and Information Science / v.31 no.4 / pp.83-112 / 1997
  • The purpose of this paper is to describe the issues involved in planning the best information system for Korean old books. It analyzes: i) the range of definitions of old books, ii) their characteristics and the current state of processing old documents, iii) the scope of automation and the building of the library institution, iv) the construction of Korean old book information systems, v) a case study, and vi) the evaluation and vision of the system. The old document information system has been organized on the basis of library network systems led by the National Central Library; the implemented system has subsystems such as a cataloging system, an annotation system, a full-text or image-based system, and a retrieval system. As case studies, two examples are presented, which have been built at the National Central Library and Sung Kyun Kwan University. Finally, the paper provides evaluation criteria and a vision for libraries that design old document information systems.

Keyword Reorganization Techniques for Improving the Identifiability of Topics (토픽 식별성 향상을 위한 키워드 재구성 기법)

  • Yun, Yeoil; Kim, Namgyu
    • Journal of Information Technology Services / v.18 no.4 / pp.135-149 / 2019
  • Recently, much research has been conducted on extracting meaningful information from large amounts of text data. Among the various approaches to extracting information from text, topic modeling, which expresses latent topics as groups of keywords, is widely used. Topic modeling presents several topic keywords ranked by term/topic weight, and the quality of those keywords is usually evaluated through coherence, which reflects the similarity of the keywords. However, a topic quality evaluation method based only on keyword similarity has limitations because it is difficult to describe the content of a topic accurately enough with just a set of similar words. In this research, therefore, we propose a method for reorganizing topic keywords to improve the identifiability of topics. To reorganize topic keywords, each document first needs to be labeled with one representative topic, which can be extracted through traditional topic modeling. After that, classification rules for classifying each document into its corresponding label are generated, and new topic keywords are extracted based on the classification rules. To evaluate the performance of our method, we performed an experiment on 1,000 news articles. The experiment confirmed that the keywords extracted by our proposed method have better identifiability than traditional topic keywords.
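
The reorganization steps above can be illustrated with a small sketch. It labels each document with its highest-weighted topic and then, as a simple stand-in for the paper's classification rules (which the abstract does not specify), ranks terms by how much more often they occur in documents with a given label than in the rest. The toy documents, topic weights, and scoring function are all hypothetical.

```python
from collections import Counter

# Toy corpus: tokens plus hypothetical topic weights from a topic model.
docs = [
    {"tokens": ["election", "vote", "party"], "topic_weights": [0.8, 0.2]},
    {"tokens": ["vote", "policy", "party"], "topic_weights": [0.7, 0.3]},
    {"tokens": ["match", "goal", "team"], "topic_weights": [0.1, 0.9]},
    {"tokens": ["team", "goal", "vote"], "topic_weights": [0.4, 0.6]},
]


def representative_topic(weights):
    """Label a document with the index of its highest-weighted topic."""
    return max(range(len(weights)), key=lambda k: weights[k])


def doc_freq(group):
    """Fraction of documents in the group that contain each term."""
    counts = Counter()
    for d in group:
        counts.update(set(d["tokens"]))
    return {t: c / len(group) for t, c in counts.items()} if group else {}


def discriminative_keywords(docs, topic, top_n=2):
    """Rank terms by in-label document frequency minus out-of-label frequency."""
    inside = [d for d in docs if representative_topic(d["topic_weights"]) == topic]
    outside = [d for d in docs if representative_topic(d["topic_weights"]) != topic]
    f_in, f_out = doc_freq(inside), doc_freq(outside)
    scored = {t: f - f_out.get(t, 0.0) for t, f in f_in.items()}
    return sorted(scored, key=lambda t: (-scored[t], t))[:top_n]


print(discriminative_keywords(docs, 0))
print(discriminative_keywords(docs, 1))
```

In this toy data "vote" is frequent under the first label but also appears under the second, so it ranks below "party", which appears only in the first group.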

Review of Wind Energy Publications in Korea Citation Index using Latent Dirichlet Allocation (잠재디리클레할당을 이용한 한국학술지인용색인의 풍력에너지 문헌검토)

  • Kim, Hyun-Goo; Lee, Jehyun; Oh, Myeongchan
    • New & Renewable Energy / v.16 no.4 / pp.33-40 / 2020
  • The research topics of more than 1,900 wind energy papers registered in the Korea Citation Index (KCI) were modeled into 25 topics using latent Dirichlet allocation (LDA), and their consistency was cross-validated through principal component analysis (PCA) of the document-word matrix. Key research topics in the wind energy field were identified as "offshore, wind farm," "blade, design," "generator, voltage, control," "dynamic, load, noise," and "performance test." As a new method to determine the similarity between research topics in journals, a systematic evaluation method was proposed that analyzes the correlation between topics by constructing a journal-topic matrix (JTM) and clustering journals based on their topic similarity. By evaluating 24 journals that published more than 20 wind energy papers, it was confirmed that they were classified into meaningful clusters of mechanical engineering, electrical engineering, marine engineering, and renewable energy. It is expected that the proposed systematic method can be applied to the evaluation of the specificity of subsequent journals.
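
A journal-topic matrix of the kind mentioned above could be built and compared along these lines. The abstract does not say how the JTM is aggregated or which similarity measure is used, so this sketch assumes each journal's row is the mean of its papers' topic distributions and that journals are compared by cosine similarity. The journal names and distributions are made up.

```python
import math

# (journal, per-paper topic distribution) pairs; all values are hypothetical.
papers = [
    ("Journal A", [0.8, 0.1, 0.1]),
    ("Journal A", [0.6, 0.3, 0.1]),
    ("Journal B", [0.7, 0.2, 0.1]),
    ("Journal C", [0.1, 0.1, 0.8]),
]


def journal_topic_matrix(papers):
    """Average the topic distributions of each journal's papers."""
    sums, counts = {}, {}
    for journal, dist in papers:
        acc = sums.setdefault(journal, [0.0] * len(dist))
        for k, w in enumerate(dist):
            acc[k] += w
        counts[journal] = counts.get(journal, 0) + 1
    return {j: [v / counts[j] for v in acc] for j, acc in sums.items()}


def cosine(u, v):
    """Cosine similarity between two topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


jtm = journal_topic_matrix(papers)
sim_ab = cosine(jtm["Journal A"], jtm["Journal B"])
sim_ac = cosine(jtm["Journal A"], jtm["Journal C"])
print(f"A~B: {sim_ab:.3f}, A~C: {sim_ac:.3f}")
```

Pairwise similarities like these could then be passed to any standard clustering routine to group journals by research focus.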