• Title/Summary/Keyword: Document similarity


Export Control System based on Case Based Reasoning: Design and Evaluation (사례 기반 지능형 수출통제 시스템 : 설계와 평가)

  • Hong, Woneui;Kim, Uihyun;Cho, Sinhee;Kim, Sansung;Yi, Mun Yong;Shin, Donghoon
    • Journal of Intelligence and Information Systems / v.20 no.3 / pp.109-131 / 2014
  • As the demand for nuclear power plant equipment continues to grow worldwide, the importance of handling nuclear strategic materials is also increasing. While the number of cases submitted for the export of nuclear-power commodities and technology is increasing dramatically, the preadjudication (or prescreening, for short) of strategic materials has so far been done by experts with long experience and extensive field knowledge. However, there is a severe shortage of experts in this domain, not to mention that it takes a long time to develop an expert. Because human experts must manually evaluate all the documents submitted for export permission, the current practice of nuclear material export control is neither time-efficient nor cost-effective. To alleviate the problem of relying only on costly human experts, our research proposes a new system designed to help field experts make their decisions more effectively and efficiently. The proposed system is built upon case-based reasoning, which in essence extracts key features from existing cases, compares those features with the features of a new case, and derives a solution for the new case by referencing similar cases and their solutions. Our research proposes a framework for a case-based reasoning system, designs a case-based reasoning system for the control of nuclear material exports, and evaluates the performance of alternative keyword extraction methods (fully automatic, fully manual, and semi-automatic). A keyword extraction method is an essential component of the case-based reasoning system, as it is used to extract the key features of the cases. The fully automatic method uses TF-IDF, a widely used de facto standard method for representative keyword extraction in text mining. TF (Term Frequency) is based on the frequency count of a term within a document, showing how important the term is within that document, while IDF (Inverse Document Frequency) is based on the infrequency of the term across the document set, showing how uniquely the term represents the document. The results show that the semi-automatic approach, which is based on the collaboration of machine and human, is the most effective solution regardless of whether the human is a field expert or a student majoring in nuclear engineering. Moreover, we propose a new approach to computing nuclear document similarity along with a new framework of document analysis. The proposed algorithm for nuclear document similarity considers both document-to-document similarity (α) and document-to-nuclear-system similarity (β) in order to derive a final score (γ) for deciding whether the presented case concerns strategic material or not. The final score (γ) represents the document similarity between the past cases and the new case. The score is induced not only by exploiting conventional TF-IDF but also by utilizing a nuclear system similarity score, which takes the context of the nuclear system domain into account. Finally, the system retrieves the top-3 documents stored in the case base that are considered the most similar to the new case and provides them with a degree of credibility. With this final score and the credibility score, it becomes easier for a user to see which documents in the case base are most worth looking up, so that the user can make a proper decision at relatively lower cost. The evaluation of the system was conducted by developing a prototype and testing it with field data. The system workflows and outcomes have been verified by field experts. This research is expected to contribute to the growth of the knowledge service industry by proposing a new system that can effectively reduce the burden of relying on costly human experts for the export control of nuclear materials and that can be considered a meaningful example of a knowledge service application.
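
As a rough illustration of the scoring idea, the sketch below combines a TF-IDF cosine similarity (α) with a placeholder document-to-nuclear-system similarity (β) into a final score (γ). The 50/50 weighting and the β values are hypothetical; the abstract does not specify how the two scores are combined.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# toy case base of past export-review cases (illustrative text only)
case_base = [
    "zirconium alloy cladding tubes for fuel assemblies",
    "heat exchanger plates for a secondary cooling loop",
]
new_case = "cladding tubes made of zirconium alloy"

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(case_base + [new_case])

# alpha: document-to-document similarity via conventional TF-IDF cosine
alpha = cosine_similarity(tfidf[-1], tfidf[:-1])[0]

# beta: document-to-nuclear-system similarity (placeholder values; in the
# paper this score comes from a domain-specific nuclear system model)
beta = [0.9, 0.3]

# gamma: final score combining both similarities (hypothetical 50/50 weights)
gamma = [0.5 * a + 0.5 * b for a, b in zip(alpha, beta)]

# retrieve the top-3 most similar past cases (only 2 exist in this toy base)
print(sorted(enumerate(gamma), key=lambda x: x[1], reverse=True)[:3])
```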

A Text Similarity Measurement Method Based on Singular Value Decomposition and Semantic Relevance

  • Li, Xu;Yao, Chunlong;Fan, Fenglong;Yu, Xiaoqiang
    • Journal of Information Processing Systems / v.13 no.4 / pp.863-875 / 2017
  • Traditional text similarity measurement methods based on word frequency vectors ignore the semantic relationships between words, which, together with the high dimensionality and sparsity of document vectors, has become an obstacle to text similarity calculation. To address these problems, an improved singular value decomposition is used to reduce the dimensionality and remove noise from the text representation model. The optimal number of singular values is analyzed, and the semantic relevance between words can be calculated in the constructed semantic space. An inverted index construction algorithm and similarity definitions between vectors are proposed to calculate the similarity between two documents on the semantic level. Experimental results on a benchmark corpus demonstrate that the proposed method improves the F-measure.
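
A minimal LSA-style sketch of the core idea: reduce a TF-IDF space with a truncated SVD and compare documents in the resulting semantic space. The paper's improved SVD and inverted-index construction are not reproduced here, and the number of retained singular values is fixed arbitrarily.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the car is driven on the road",
    "the truck is driven on the highway",
    "a cat sat quietly on the mat",
]

tfidf = TfidfVectorizer().fit_transform(docs)

# k is the number of singular values kept; the paper analyzes how to choose
# an optimal k, here it is fixed arbitrarily at 2
svd = TruncatedSVD(n_components=2, random_state=0)
semantic = svd.fit_transform(tfidf)

# pairwise similarity on the semantic level rather than raw word frequencies
print(cosine_similarity(semantic))
```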

An Efficient kNN Algorithm (효율적인 kNN 알고리즘)

  • Lee Jae Moon
    • The KIPS Transactions:PartB / v.11B no.7 s.96 / pp.849-854 / 2004
  • This paper proposes an algorithm to reduce the execution time of kNN in document classification. The proposed algorithm minimizes the cost of computing the similarity between two documents by using inverted (term-to-document) lists, whereas conventional kNN uses the usual (document-to-term) lists. The inverted lists can be obtained by applying a matrix transposition to the document-to-term lists at the training phase of document classification. The paper analyzes the time complexity of the proposed algorithm, compares it with conventional kNN, and also compares the two experimentally on the Reuters-21578 data set. The experimental results show that the proposed algorithm outperforms conventional kNN by about 90% in terms of execution time.
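
A minimal sketch of the transposition idea under the reading above: accumulating similarity contributions over an inverted (term-to-document) list touches only the terms that actually occur in the query, rather than scanning every training document.

```python
from collections import defaultdict

# training documents as {term: weight} vectors
train = {
    "d1": {"nuclear": 0.8, "export": 0.6},
    "d2": {"text": 0.7, "mining": 0.7},
}

# build the inverted list by "transposing" the document-term structure
inverted = defaultdict(list)
for doc_id, vec in train.items():
    for term, w in vec.items():
        inverted[term].append((doc_id, w))

def similarities(query):
    """Dot-product similarity against all training docs in one pass."""
    scores = defaultdict(float)
    for term, qw in query.items():
        for doc_id, w in inverted.get(term, []):
            scores[doc_id] += qw * w
    return dict(scores)

print(similarities({"nuclear": 0.5, "mining": 0.5}))
```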

An Improvement Of Efficiency For kNN By Using A Heuristic (휴리스틱을 이용한 kNN의 효율성 개선)

  • Lee, Jae-Moon
    • The KIPS Transactions:PartB / v.10B no.6 / pp.719-724 / 2003
  • This paper proposes a heuristic to enhance the speed of kNN without loss of accuracy. The proposed heuristic minimizes the computation of the similarity between two documents, which is the dominant cost in kNN. To do this, the paper proposes a method to calculate an upper limit on the similarity and to sort the training documents accordingly. The proposed heuristic was implemented on an existing text categorization framework, AI::Categorizer, and compared with conventional kNN on the well-known Reuters-21578 data set. The comparisons show that the proposed heuristic outperforms conventional kNN by about 30-40% with respect to execution time.
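
A minimal sketch of the pruning idea, assuming a Cauchy-Schwarz norm bound as the upper limit on similarity (the abstract does not state the paper's actual bound): training documents are sorted by the bound, and the exact computation stops once the bound cannot beat the k-th best score found so far.

```python
import heapq
import math

def dot(a, b):
    return sum(a[t] * b.get(t, 0.0) for t in a)

def knn_with_pruning(query, train, k):
    qnorm = math.sqrt(sum(w * w for w in query.values()))
    # upper bound: dot(q, d) <= ||q|| * ||d|| (Cauchy-Schwarz)
    bounded = sorted(
        ((qnorm * math.sqrt(sum(w * w for w in vec.values())), doc_id, vec)
         for doc_id, vec in train.items()),
        key=lambda t: t[0],
        reverse=True,
    )
    best = []  # min-heap of (similarity, doc_id)
    for bound, doc_id, vec in bounded:
        if len(best) == k and bound <= best[0][0]:
            break  # no remaining document can enter the top k
        sim = dot(query, vec)
        if len(best) < k:
            heapq.heappush(best, (sim, doc_id))
        elif sim > best[0][0]:
            heapq.heapreplace(best, (sim, doc_id))
    return sorted(best, reverse=True)

train = {"d1": {"a": 1.0, "b": 2.0}, "d2": {"a": 0.1}, "d3": {"b": 0.2}}
print(knn_with_pruning({"a": 1.0, "b": 1.0}, train, k=2))
```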

An Innovative Approach of Bangla Text Summarization by Introducing Pronoun Replacement and Improved Sentence Ranking

  • Haque, Md. Majharul;Pervin, Suraiya;Begum, Zerina
    • Journal of Information Processing Systems / v.13 no.4 / pp.752-777 / 2017
  • This paper proposes an automatic method to summarize Bangla news documents. In the proposed approach, pronoun replacement is accomplished for the first time to minimize dangling pronouns in the summary. After pronoun replacement, sentences are ranked using term frequency, sentence frequency, numerical figures, and title words. If two sentences have at least 60% cosine similarity, the frequency of the larger sentence is increased and the smaller sentence is removed to eliminate redundancy. Moreover, the first sentence is always included in the summary if it contains any title word. In Bangla text, numerical figures can be presented in both words and digits in a variety of forms; all these forms are identified to assess the importance of sentences. We use a rule-based system in this approach together with a hidden Markov model and a Markov chain model. To derive the rules, we analyzed 3,000 Bangla news documents and studied several Bangla grammar books. A series of experiments was performed on 200 Bangla news documents and 600 summaries (3 summaries for each document). The evaluation results demonstrate the effectiveness of the proposed technique over four recent methods.
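
A minimal sketch of the redundancy-elimination step described above: if two sentences are at least 60% cosine-similar, the longer one is kept and boosted by the shorter one's weight. Plain whitespace tokenization stands in for the paper's Bangla-specific processing, and the toy sentences and scores are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def remove_redundant(sentences, scores, threshold=0.6):
    vecs = CountVectorizer().fit_transform(sentences)
    sims = cosine_similarity(vecs)
    removed = set()
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if i in removed or j in removed:
                continue
            if sims[i, j] >= threshold:
                # keep the larger sentence, boost its score, drop the smaller
                keep, drop = (i, j) if len(sentences[i]) >= len(sentences[j]) else (j, i)
                scores[keep] += scores[drop]
                removed.add(drop)
    return [(s, sc) for idx, (s, sc) in enumerate(zip(sentences, scores))
            if idx not in removed]

sents = ["the river flooded the village",
         "the river flooded the small village",
         "markets opened today"]
print(remove_redundant(sents, [1.0, 1.2, 0.8]))
```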

Multi-class Support Vector Machines Model Based Clustering for Hierarchical Document Categorization in Big Data Environment (빅 데이터 환경에서 계층적 문서 유형 분류를 위한 클러스터링 기반 다중 SVM 모델)

  • Kim, Young Soo;Lee, Byoung Yup
    • The Journal of the Korea Contents Association / v.17 no.11 / pp.600-608 / 2017
  • Data volumes have recently been growing exponentially with the rapid expansion of the Internet. Since users need only part of all this information, they face a heavy workload in examining and discovering the necessary contents. Information retrieval must therefore provide hierarchical class information and a priority for examination through the evaluation of similarity between queries and documents. In this paper we propose a multi-class support vector machine model based on clustering for hierarchical document categorization, which makes semantic search possible by considering word co-occurrence measures. The combination of hierarchical document categorization and an SVM classifier gives high performance for the analytical classification of web documents, which grow exponentially as the document hierarchy is extended. We expect more information retrieval systems to adopt the proposed model in their development, enabling accurate and rapid information retrieval services.
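
A minimal sketch of the two-level structure, assuming k-means for the coarse clustering step (the abstract does not name the clustering algorithm): documents are grouped into coarse clusters, then one SVM per cluster handles the fine-grained categories.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

docs = ["stock market rally", "bond yields fall",
        "football final tonight", "tennis open begins"]
fine_labels = ["equities", "fixed-income", "football", "tennis"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# level 1: coarse clusters (intended to separate finance from sports)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# level 2: one multi-class SVM per cluster over its fine-grained labels;
# a cluster containing a single label needs no classifier
models = {}
for c in range(2):
    idx = [i for i, lab in enumerate(kmeans.labels_) if lab == c]
    labels = [fine_labels[i] for i in idx]
    models[c] = LinearSVC().fit(X[idx], labels) if len(set(labels)) > 1 else labels[0]

# route a query document down the hierarchy: cluster first, then SVM
query = vec.transform(["championship match tonight"])
m = models[kmeans.predict(query)[0]]
print(m.predict(query)[0] if hasattr(m, "predict") else m)
```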

Related Documents Classification System by Similarity between Documents (문서 유사도를 통한 관련 문서 분류 시스템 연구)

  • Jeong, Jisoo;Jee, Minkyu;Go, Myunghyun;Kim, Hakdong;Lim, Heonyeong;Lee, Yurim;Kim, Wonil
    • Journal of Broadcast Engineering / v.24 no.1 / pp.77-86 / 2019
  • This paper proposes using machine learning technology to analyze and classify collected documents. Data is collected based on keywords associated with a specific domain, and non-conceptual elements such as special characters are removed. Each word of the collected documents is then tagged by part of speech (nouns, verbs, etc.) using a Korean morphological analyzer. The documents are embedded using a Doc2Vec model that converts documents into vectors. The similarity between documents is measured through the embedding model, and a document classifier is trained using a machine learning algorithm. Among the classifiers compared, the best-performing support vector machine achieved an F1-score of 0.83.
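
A minimal sketch of the pipeline, using gensim's Doc2Vec for the embeddings and an SVM as the classifier; simple whitespace tokenization replaces the Korean morphological analysis step, and the toy corpus and labels are illustrative.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.svm import SVC

corpus = [
    ("the treaty was signed in 1905", "history"),
    ("the war ended with an armistice", "history"),
    ("stock prices rose sharply today", "finance"),
    ("the bank cut interest rates", "finance"),
]
tagged = [TaggedDocument(text.split(), [i]) for i, (text, _) in enumerate(corpus)]

model = Doc2Vec(vector_size=50, min_count=1, epochs=40, seed=0)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

# embed each document and learn a classifier over the vectors
X = [model.dv[i] for i in range(len(corpus))]
y = [label for _, label in corpus]
clf = SVC().fit(X, y)

# classify an unseen document via its inferred embedding
print(clf.predict([model.infer_vector("rates were cut by the bank".split())]))
```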

Generative probabilistic model with Dirichlet prior distribution for similarity analysis of research topic

  • Milyahilu, John;Kim, Jong Nam
    • Journal of Korea Multimedia Society / v.23 no.4 / pp.595-602 / 2020
  • We propose a generative probabilistic model with a Dirichlet prior distribution for topic modeling and text similarity analysis. It assigns a topic to each document and calculates the text correlation between documents within a corpus. It also provides the posterior probabilities assigned to each topic of a document based on the prior distribution over the corpus. We then present a Gibbs sampling algorithm for inference about the posterior distribution and compute the text correlation among 50 abstracts from papers published by IEEE. We also conduct supervised learning to set a benchmark against which the performance of LDA (Latent Dirichlet Allocation) can be justified. The experiments show that the accuracy of topic assignment to a given document is 76% for LDA. The results for supervised learning show an accuracy of 61%, a precision of 93%, and an F1-score of 96%. The discussion of the experimental results provides a thorough justification based on probabilities, distributions, evaluation metrics, and correlation coefficients with respect to topic assignment.
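
A minimal sketch of the approach: fit an LDA model (a generative probabilistic model with Dirichlet priors), read off each document's posterior topic distribution, and correlate those distributions as a similarity measure. sklearn's variational inference stands in here for the paper's Gibbs sampler, and the toy abstracts are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

abstracts = [
    "deep networks for image recognition",
    "convolutional networks improve image classification",
    "query optimization in relational databases",
]

counts = CountVectorizer().fit_transform(abstracts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# posterior topic distribution of each document
theta = lda.transform(counts)

# text correlation between documents via their topic distributions
print(np.corrcoef(theta))
```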

A Two Phases Plagiarism Detection System for the Newspaper Articles by using a Web Search and a Document Similarity Estimation (웹 검색과 문서 유사도를 활용한 2 단계 신문 기사 표절 탐지 시스템)

  • Cho, Jung-Hyun;Jung, Hyun-Ki;Kim, Yu-Seop
    • The KIPS Transactions:PartB / v.16B no.2 / pp.181-194 / 2009
  • With increased interest in document copyright, much research related to document plagiarism has been done to date. The plagiarism problem of newspaper articles has attracted particular interest because plagiarism cases involving articles of high commercial value now occur very often. Much of the existing research on document plagiarism is hard to apply to newspaper articles because of their strong real-time characteristics: to detect plagiarism of articles, human inspectors would have to manually read every one of the thousands of articles published by hundreds of newspaper companies. In this paper, we first sort out the articles with a high possibility of having been copied by utilizing the OpenAPI modules provided by web search companies such as Naver and Daum. Then we measure the document similarity between the selected articles and the original article and let the system decide whether each article was plagiarized or not. In the experiments, we used Yonhap News articles as the original articles and had the system select suspicious articles from all the articles returned by the Naver and Daum news search services.
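
A minimal sketch of the two-phase structure. Phase 1 (candidate retrieval through the Naver and Daum OpenAPI search services) is represented by a hypothetical fetch_candidates() stub, since those APIs are not reproduced here; phase 2 scores each candidate against the original article, with an arbitrary decision threshold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def fetch_candidates(query: str) -> list[str]:
    """Hypothetical stand-in for the web-search phase (phase 1)."""
    return ["a suspiciously similar article body", "an unrelated article"]

def detect_plagiarism(original: str, threshold: float = 0.5):
    candidates = fetch_candidates(original)
    tfidf = TfidfVectorizer().fit_transform([original] + candidates)
    scores = cosine_similarity(tfidf[0], tfidf[1:])[0]
    # phase 2 decision: flag candidates whose similarity exceeds the threshold
    return [(c, s) for c, s in zip(candidates, scores) if s >= threshold]

print(detect_plagiarism("a suspiciously similar article text"))
```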

A Dynamic Recommendation System Using User Log Analysis and Document Similarity in Clusters (사용자 로그 분석과 클러스터 내의 문서 유사도를 이용한 동적 추천 시스템)

  • 김진수;김태용;최준혁;임기욱;이정현
    • Journal of KIISE:Software and Applications / v.31 no.5 / pp.586-594 / 2004
  • Because web documents are created and disappear rapidly, users require a recommendation system that lets them browse web documents conveniently and accurately. One largely untapped source of knowledge about large data collections is contained in the cumulative experience of individuals finding useful information in the collection; recommendation systems attempt to extract such useful information by capturing and mining one or more measures of the usefulness of the data. Existing information filtering systems have the shortcoming that they require a user profile, while collaborative filtering systems have the shortcoming that users must first rate each web document, and in high-quantity, low-quality environments users may cover only a tiny percentage of the available documents. Dynamic recommendation systems that rely on user browsing patterns alone can also present users with unrelated web documents. This paper classifies web documents by the similarity between them under each web document type and extracts a sequential browsing pattern DB from users' session information based on the web server log file. When a user accesses a web document, the proposed dynamic recommendation system recommends a top-N set of associated web documents that have high similarity to the current document, together with a set that has sequential specificity, using the extracted information and the user's session information.
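
A minimal sketch of the combination described above: candidate documents are ranked by content similarity to the current document, blended with how often past sessions moved from the current document to each candidate. The blending weight and toy data are hypothetical.

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = {"a": "nuclear export control",
        "b": "nuclear material safety",
        "c": "football results"}
sessions = [["a", "b"], ["a", "b"], ["a", "c"]]  # from web server logs

# transition counts approximate the sequential browsing pattern DB
transitions = Counter((s[i], s[i + 1]) for s in sessions for i in range(len(s) - 1))

def recommend(current, top_n=2, w=0.5):
    ids = [d for d in docs if d != current]
    tfidf = TfidfVectorizer().fit_transform([docs[current]] + [docs[d] for d in ids])
    sims = cosine_similarity(tfidf[0], tfidf[1:])[0]
    total = sum(transitions[(current, d)] for d in ids) or 1
    # blend content similarity with the sequential transition frequency
    scores = {d: w * sims[i] + (1 - w) * transitions[(current, d)] / total
              for i, d in enumerate(ids)}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("a"))
```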