• Title/Summary/Keyword: Similar Documents

Search Results: 283

A Study on the Classification of Yi Dynasty Documents and Records (고문서(古文書)의 유형별(類型別) 분류(分類)에 관한 연구(硏究))

  • Lee, Choon-Hee
    • Journal of the Korean BIBLIA Society for Library and Information Science
    • /
    • v.6 no.1
    • /
    • pp.81-109
    • /
    • 1984
  • The purpose of this research is (i) to establish principles particularly appropriate for the arrangement of archival collections in Korea, and (ii) to produce a workable model classification scheme in conformity with the established principles. The archival collections in Korea are roughly divided into two groups: (1) the collections of professional archival institutions such as the Korean National Archives, and (2) the collections preserved by libraries, museums, and other similar institutions as secondary collections, which are generally gathered unsystematically. For the arrangement of the former collections, the concept of "respect des fonds", a universally accepted principle in archives, is also applicable. But in the case of the latter collections, the above-mentioned principle is inappropriate, because those collections are built from separate pieces of documents and records without any relevance to the original function or structure of the institution. Consequently, devices for the arrangement of these archival collections are badly needed, since the archival collections of Korea, in the majority of cases, belong to the latter group. The author produced a tentative classification scheme, adopting the traditional Korean form (or type) of documents and records as the cardinal principle of the classification. The scheme is presented at the end of this paper.

The Development of the Drawing Information Management System Based on Group Technology (Group Technology를 이용한 설계정보관리 시스템의 개발)

  • Moon, H.S.;Kim, S.H.
    • Journal of the Korean Society for Precision Engineering
    • /
    • v.14 no.1
    • /
    • pp.58-68
    • /
    • 1997
  • In order to provide economical, high-quality products to customers in a timely manner, companies have devoted much effort to decreasing the time spent on engineering design and information management. As a part of this effort, we have developed the Drawing Information Management System (DIMS) based on GT (Group Technology), which can decrease design processing time through speedy and rational management of design processes. The characteristics of DIMS are as follows: First, the concept of Concurrent Engineering was applied to DIMS. Through a LAN, reviewers are able to attach comments to electronic documents using annotation functions called Mark-up. The reviewer annotations are collected and combined with the original document to revise it. Second, we have developed a Classification and Coding (C&C) system suitable for electronic component parts based on GT. The C&C system groups parts and drawings with similar characteristics into families and helps users search existing documents or create new drawings promptly. Finally, DIMS provides the Engineering BOM (Bill of Material) using the concept of a Family BOM based on model options.
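
The family-building idea behind a C&C system can be sketched minimally: parts whose classification codes share a prefix are grouped into one family. The GT code format and prefix rule below are illustrative assumptions, not the paper's actual coding scheme.

```python
from collections import defaultdict

def group_into_families(parts, prefix_len=3):
    """Group parts whose GT classification codes share a prefix into families."""
    families = defaultdict(list)
    for name, code in parts:
        families[code[:prefix_len]].append(name)
    return dict(families)

parts = [("resistor-A", "R10-05"), ("resistor-B", "R10-07"), ("capacitor-A", "C22-01")]
families = group_into_families(parts)
# {'R10': ['resistor-A', 'resistor-B'], 'C22': ['capacitor-A']}
```

A designer searching for an existing drawing would then look only within the family matching the new part's code prefix.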

Hierarchical Overlapping Clustering to Detect Complex Concepts (중복을 허용한 계층적 클러스터링에 의한 복합 개념 탐지 방법)

  • Hong, Su-Jeong;Choi, Joong-Min
    • Journal of Intelligence and Information Systems
    • /
    • v.17 no.1
    • /
    • pp.111-125
    • /
    • 2011
  • Clustering is a process of grouping similar or relevant documents into a cluster and assigning a meaningful concept to the cluster. By this process, clustering facilitates fast and correct search for relevant documents by narrowing the range of searching to only the collection of documents belonging to related clusters. For effective clustering, techniques are required for identifying similar documents and grouping them into a cluster, and for discovering a concept that is most relevant to the cluster. One of the problems often appearing in this context is the detection of a complex concept that overlaps with several simple concepts at the same hierarchical level. Previous clustering methods were unable to identify and represent a complex concept that belongs to several different clusters at the same level in the concept hierarchy, and also could not validate the semantic hierarchical relationship between a complex concept and each of the simple concepts. In order to solve these problems, this paper proposes a new clustering method that identifies and represents complex concepts efficiently. We developed the Hierarchical Overlapping Clustering (HOC) algorithm by modifying the traditional Agglomerative Hierarchical Clustering algorithm to allow overlapping clusters at the same level in the concept hierarchy. The HOC algorithm represents the clustering result not by a tree but by a lattice in order to detect complex concepts. We developed a system that employs the HOC algorithm to carry out the goal of complex concept detection. This system operates in three phases: 1) the preprocessing of documents, 2) clustering using the HOC algorithm, and 3) the validation of semantic hierarchical relationships among the concepts in the lattice obtained as a result of clustering. The preprocessing phase represents the documents as x-y coordinate values in a 2-dimensional space by considering the weights of terms appearing in the documents.
First, it goes through a refinement process, applying stop-word removal and stemming to extract index terms. Then, each index term is assigned a TF-IDF weight value, and the x-y coordinate value for each document is determined by combining the TF-IDF values of the terms in it. The clustering phase uses the HOC algorithm, in which the similarity between documents is calculated using the Euclidean distance. Initially, a cluster is generated for each document by grouping those documents that are closest to it. Then, the distance between any two clusters is measured, grouping the closest clusters into a new cluster. This process is repeated until the root cluster is generated. In the validation phase, the feature selection method is applied to validate the appropriateness of the cluster concepts built by the HOC algorithm, to see if they have meaningful hierarchical relationships. Feature selection is a method of extracting key features from a document by identifying and assigning weight values to important and representative terms in the document. In order to correctly select key features, a method is needed to determine how each term contributes to the class of the document. Among several methods achieving this goal, this paper adopted the $\chi^2$ statistic, which measures the degree of dependency of a term t on a class c and represents the relationship between t and c by a numerical value. To demonstrate the effectiveness of the HOC algorithm, a series of performance evaluations was carried out using the well-known Reuters-21578 news collection. The results of the performance evaluation showed that the HOC algorithm greatly contributes to detecting and producing complex concepts by generating the concept hierarchy in a lattice structure.
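
The $\chi^2$ statistic used in the validation phase is conventionally computed from a 2x2 contingency table of document counts for a term t and a class c; a minimal sketch (the variable naming is ours, not the paper's):

```python
def chi_square(A, B, C, D):
    """Chi-square dependency of term t on class c.
    A: docs in c containing t,   B: docs outside c containing t,
    C: docs in c without t,      D: docs outside c without t."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

strong = chi_square(40, 10, 10, 40)  # term concentrated in the class -> 36.0
none = chi_square(25, 25, 25, 25)    # term independent of the class -> 0.0
```

Higher values indicate a stronger dependency between the term and the class, so high-scoring terms serve as the cluster's key features.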

Statistical Techniques for Automatic Indexing and Some Experiments with Korean Documents (자동색인의 통계적기법과 한국어 문헌의 실험)

  • Chung Young Mee;Lee Tae Young
    • Journal of the Korean Society for Library and Information Science
    • /
    • v.9
    • /
    • pp.99-118
    • /
    • 1982
  • This paper first reviews various techniques proposed for automatic indexing, with special emphasis placed on statistical techniques. Frequency-based statistical techniques are categorized into the following three approaches for further investigation, on the basis of index term selection criteria: the term frequency approach, the document frequency approach, and the probabilistic approach. In the experimental part of this study, Pao's technique based on Goffman's transition region formula and Harter's 2-Poisson distribution model with a measure of the potential effectiveness of index terms were tested. The experimental document collection consists of 30 agriculture-related documents written in Korean. Pao's technique did not yield good results, presumably due to the difference in word usage between Korean and English. However, Harter's model holds some promise for Korean document indexing, because the evaluation result from this experiment was similar to Harter's.
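
The document frequency approach surveyed above can be sketched as a simple two-sided filter: keep terms that are neither too rare nor too common in the collection. The thresholds and toy documents below are illustrative assumptions, not the paper's experimental settings.

```python
from collections import Counter

def select_index_terms(docs, min_df=2, max_df_ratio=0.5):
    """Keep terms that occur in at least min_df documents but in no more
    than max_df_ratio of the collection (neither too rare nor too common)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))  # document frequency, not raw counts
    n = len(docs)
    return sorted(t for t, f in df.items() if f >= min_df and f / n <= max_df_ratio)

docs = ["rice paddy yield", "rice farming yield data",
        "weather data report", "soil report analysis"]
terms = select_index_terms(docs)  # ['data', 'report', 'rice', 'yield']
```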

A Hangul Document Classification System using Case-based Reasoning (사례기반 추론을 이용한 한글 문서분류 시스템)

  • Lee, Jae-Sik;Lee, Jong-Woon
    • Asia pacific journal of information systems
    • /
    • v.12 no.2
    • /
    • pp.179-195
    • /
    • 2002
  • In this research, we developed an efficient Hangul document classification system for text mining. By 'efficient', we mean maintaining acceptable classification performance while requiring shorter computing time. In our system, given a query document, k documents are first retrieved from the document case base using the k-nearest neighbor technique, which is the main algorithm of case-based reasoning. Then, the TFIDF method, the traditional vector model in information retrieval, is applied to the query document and the k retrieved documents to classify the query document. We call this procedure the 'CB_TFIDF' method. The results of our research showed that the classification accuracy of CB_TFIDF was similar to that of the traditional TFIDF method, while the average time for classifying one document decreased remarkably.
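
The retrieve-then-classify skeleton of CB_TFIDF can be sketched as below. This sketch substitutes a simple majority vote over the k retrieved cases for the paper's final TFIDF-based classification step, and the tokenized toy cases are invented, so the details are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vectors for a list of token lists."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(query_tokens, cases, k=3):
    """cases: list of (tokens, label). Retrieve the k nearest cases, then vote."""
    vecs = tfidf_vectors([tokens for tokens, _ in cases] + [query_tokens])
    qvec = vecs[-1]
    ranked = sorted(((cosine(qvec, v), label) for v, (_, label) in zip(vecs, cases)),
                    reverse=True)
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

cases = [("buy cheap pills".split(), "spam"), ("cheap spam offer".split(), "spam"),
         ("meeting schedule agenda".split(), "ham"), ("agenda meeting notes".split(), "ham")]
label = classify("cheap spam buy".split(), cases)  # 'spam'
```

The speed advantage reported in the paper comes from restricting the expensive TFIDF comparison to the k retrieved cases rather than the whole case base.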

A Method for Measuring Similarity Measure of Thesaurus Transformation Documents using DBSCAN (DBSCAN을 활용한 유의어 변환 문서 유사도 측정 방법)

  • Kim, Byeongsik;Shin, Juhyun
    • Journal of Korea Multimedia Society
    • /
    • v.21 no.9
    • /
    • pp.1035-1043
    • /
    • 2018
  • There are cases where the core content of another person's work is presented as though it were one's own by paraphrasing it without showing the source. The plagiarism test of the CopyKiller free service used for plagiarism checking is performed by comparing matches of six or more consecutive words. However, a six-word match is not sufficient to judge plagiarism if words have been replaced with synonyms. Therefore, in this paper, we construct word clusters using the DBSCAN algorithm, find synonyms, convert the words in the clusters into representative synonyms, and construct L-R tables through L-R parsing. We then propose a method for determining the similarity of documents by applying weights to the synonyms and weights to each paragraph of the thesis.
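
The cluster-then-normalize step can be sketched with a minimal DBSCAN over toy 2-D word vectors: words that land in the same density cluster are rewritten to one representative before matching. The vectors, eps, and min_pts values below are illustrative assumptions, not the paper's setup.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster label per point (-1 = noise)."""
    labels = [None] * len(points)
    def neighbors(i):
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # noise for now; may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: claimed, but not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_pts:
                queue.extend(jn)  # core point: keep expanding the cluster
        cluster += 1
    return labels

words = ["big", "large", "huge", "cat"]
vecs = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0)]
labels = dbscan(vecs, eps=0.3, min_pts=2)  # [0, 0, 0, -1]

# Normalize every clustered word to the first word of its cluster:
rep = {}
for w, c in zip(words, labels):
    if c != -1:
        rep[w] = next(v for v, l in zip(words, labels) if l == c)
# rep == {'big': 'big', 'large': 'big', 'huge': 'big'}
```

After this normalization, "a huge dog" and "a big dog" compare as identical word sequences, defeating the simple synonym-substitution evasion described above.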

A Study on Development of Patent Information Retrieval Using Textmining (텍스트 마이닝을 이용한 특허정보검색 개발에 관한 연구)

  • Go, Gwang-Su;Jung, Won-Kyo;Shin, Young-Geun;Park, Sang-Sung;Jang, Dong-Sik
    • Journal of the Korea Academia-Industrial cooperation Society
    • /
    • v.12 no.8
    • /
    • pp.3677-3688
    • /
    • 2011
  • The patent information retrieval system can serve a variety of purposes. In general, patent information is retrieved using a limited set of keywords, and repeated effort is needed to identify earlier technology and priority rights. This study proposes a method of content-based retrieval using text mining. Using the proposed algorithm, each document is assigned a characteristic value. The characteristic values are used to compare similarities between query documents and database documents. Text analysis is composed of three steps: stop-word removal, keyword analysis, and weight calculation. In the test results, the general retrieval and the proposed algorithm were compared using accuracy measurements. Because the study ranks the result documents by their similarity to the query document, users can improve efficiency by reviewing the most similar documents first. Also, because the full text of patent documents can be used as input, users unfamiliar with search can use the system easily and quickly. Using content-based retrieval instead of keyword-based retrieval extends the scope of the search and reduces the amount of missing data.
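
The three-step text analysis followed by similarity ranking can be sketched as below; the stop-word list and plain term-frequency weighting are illustrative stand-ins for the paper's actual choices.

```python
import math
from collections import Counter

STOPWORDS = {"a", "an", "the", "of", "and", "to", "is", "in", "for", "using"}

def keywords(text):
    """Steps 1-2: stop-word removal and keyword extraction."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def rank_by_similarity(query, database):
    """Step 3: weight keywords by term frequency, rank documents by cosine similarity."""
    qv = Counter(keywords(query))
    qn = math.sqrt(sum(v * v for v in qv.values()))
    def score(text):
        dv = Counter(keywords(text))
        dn = math.sqrt(sum(v * v for v in dv.values()))
        dot = sum(qv[t] * dv[t] for t in qv)
        return dot / (qn * dn) if qn and dn else 0.0
    return sorted(database, key=score, reverse=True)

db = ["patent retrieval using text mining", "cooking recipes for pasta"]
results = rank_by_similarity("text mining for patent search", db)
# results[0] == "patent retrieval using text mining"
```

Because the query can be a full document rather than a few keywords, the ranking surfaces the most similar database documents first, as the abstract describes.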

A Study on Automatic Discovery and Summarization Method of Battlefield Situation Related Documents using Natural Language Processing and Collaborative Filtering (자연어 처리 및 협업 필터링 기반의 전장상황 관련 문서 자동탐색 및 요약 기법연구)

  • Kunyoung Kim;Jeongbin Lee;Mye Sohn
    • Journal of Internet Computing and Services
    • /
    • v.24 no.6
    • /
    • pp.127-135
    • /
    • 2023
  • With the development of information and communication technology, the amount of information produced and shared in the battlefield and stored and managed in the system has dramatically increased. This means that the amount of information that can support the commanders' situational awareness and decision making has increased, but on the other hand, it is also a factor that hinders rapid decision making by increasing the information overload on the commanders. To overcome this limitation, this study proposes a method to automatically search, select, and summarize documents that can help commanders understand the battlefield situation reports they receive. First, named entities are discovered from the battlefield situation report using a named entity recognition method. Second, the documents related to each named entity are discovered. Third, a language model and collaborative filtering are used to select the documents: the language model is used to calculate the similarity between the received report and the discovered documents, and collaborative filtering is used to reflect the commander's document reading history. Finally, sentences containing each named entity are selected from the documents and sorted. The experiment was carried out using academic papers, since their characteristics are similar to those of military documents, and the validity of the proposed method was verified.
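
The document-selection step blends a language-model similarity with a collaborative-filtering signal from the reader's history. A minimal user-based sketch follows; the history data, the peer definition, and the blending weight alpha are all illustrative assumptions, not the paper's formulation.

```python
def cf_score(history, user, doc):
    """Fraction of users with overlapping reading history who have read doc."""
    read = history[user]
    peers = [u for u, docs in history.items() if u != user and docs & read]
    if not peers:
        return 0.0
    return sum(doc in history[u] for u in peers) / len(peers)

def combined_score(lm_similarity, cf, alpha=0.7):
    """Blend language-model similarity with the collaborative-filtering score."""
    return alpha * lm_similarity + (1 - alpha) * cf

history = {"cmdr": {"d1", "d2"}, "a": {"d1", "d3"}, "b": {"d2", "d3"}, "c": {"d9"}}
cf = cf_score(history, "cmdr", "d3")  # both peers of cmdr read d3 -> 1.0
total = combined_score(0.5, cf)
```

Documents with the highest combined scores would be the ones selected for summarization.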

A Method for Extracting Equipment Specifications from Plant Documents and Cross-Validation Approach with Similar Equipment Specifications (플랜트 설비 문서로부터 설비사양 추출 및 유사설비 사양 교차 검증 접근법)

  • Jae Hyun Lee;Seungeon Choi;Hyo Won Suh
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.29 no.2
    • /
    • pp.55-68
    • /
    • 2024
  • Plant engineering companies create or refer to requirements documents for each related field, such as plant process/equipment/piping/instrumentation, in different engineering departments. The process-related requirements document includes not only a description of the process but also the requirements of the equipment or related facilities that will operate it. Since the authors and reviewers of the requirements documents are different, there is a possibility that inconsistencies may occur between equipment or parts design specifications described in different requirements documents. Ensuring consistency in these matters can increase the reliability of the overall plant design information. However, the volume of documents and the scattered nature of requirements for the same equipment and parts across different documents make it challenging for engineers to trace and manage requirements. This paper proposes a method to analyze requirement sentences and calculate their similarity in order to identify semantically identical sentences. To calculate the similarity of requirement sentences, we propose a named entity recognition method to identify compound words for the parts and properties that are semantically central to the requirements. A method to calculate the similarity of the identified compound words for parts and properties is also proposed. The proposed method is explained using sentences from practical documents, and experimental results are described.
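
One simple way to compare sentences through their NER-identified compound words is a Jaccard overlap over (part, property) pairs. The pair representation and the example extractions below are illustrative assumptions; the paper's actual similarity calculation may differ.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

# (part compound word, property) pairs as a NER step might extract them
req1 = {("cooling water pump", "capacity"), ("cooling water pump", "pressure")}
req2 = {("cooling water pump", "capacity")}
similarity = jaccard(req1, req2)  # 1 shared pair of 2 total -> 0.5
```

Requirement sentences whose pair sets score above a threshold would be flagged for cross-validation between documents.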

An Efficient Preprocessing System for Searching Similar Texts among Massive Document Repository (대용량 문서 집합에서 유사 문서 탐색을 위한 효과적인 전처리 시스템의 설계)

  • Park, Sun-Young;Kim, Ji-Hun;Kim, Seon-Yeong;Kim, Hyung-Joon;Cho, Hwan-Gue
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.16 no.5
    • /
    • pp.626-630
    • /
    • 2010
  • Since paper plagiarism has become an important social issue, it is necessary to develop systems for measuring the similarity between papers. The speed and accuracy of such a system are very important features, and many researchers are studying them. In this paper, we propose a preprocessing method using a 'Global Dictionary' model to enhance the performance of the system. The global dictionary includes information about all words in the document repository. The system uses the model to find similar papers with low computing time. Finally, our experiment showed that a set of more than 20,000 documents could be drastically reduced to about 50 documents by our filtering techniques, which demonstrates the effectiveness of our system.
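
The 'Global Dictionary' filtering idea can be sketched as a shared-word prefilter over an inverted index: only documents sharing enough vocabulary with the query survive to the expensive similarity comparison. The threshold and toy documents are illustrative assumptions.

```python
from collections import Counter, defaultdict

def build_global_dictionary(docs):
    """Inverted index: word -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for w in set(text.lower().split()):
            index[w].add(doc_id)
    return index

def candidate_documents(query_text, index, min_shared=2):
    """Keep only documents sharing at least min_shared words with the query."""
    counts = Counter()
    for w in set(query_text.lower().split()):
        for doc_id in index.get(w, ()):
            counts[doc_id] += 1
    return {d for d, c in counts.items() if c >= min_shared}

docs = {1: "plagiarism detection in papers",
        2: "cooking pasta recipes",
        3: "detection of plagiarism cases"}
index = build_global_dictionary(docs)
cands = candidate_documents("plagiarism detection methods", index)  # {1, 3}
```

A pairwise similarity check then runs only on the surviving candidates, which is how a repository of 20,000+ documents can shrink to a few dozen before the costly comparison stage.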