• Title/Summary/Keyword: Document similarity

XML-based Modeling for Semantic Retrieval of Syslog Data (Syslog 데이터의 의미론적 검색을 위한 XML 기반의 모델링)

  • Lee Seok-Joon;Shin Dong-Cheon;Park Sei-Kwon
    • The KIPS Transactions:PartD / v.13D no.2 s.105 / pp.147-156 / 2006
  • Event logging plays an increasingly important role in system and network management, and syslog is a de facto standard for logging system events. However, because Common Log Format data is semi-structured, most studies on log analysis focus on frequent patterns. The Extensible Markup Language (XML) can provide a good representation scheme for structuring and searching the formatted data found in syslog messages. However, previous XML-formatted schemes and applications for system logging are not suitable for semantic approaches such as ranking-based search or similarity measurement for log data. In this paper, based on ranked keyword search techniques over XML documents, we propose an XML tree structure for syslog data through a new data modeling approach. Finally, we show the suitability of the proposed structure for semantic retrieval.

A Study on Information Retrieval Using Query Splitting Relevance Feedback (질의분해 적합성 피드백을 이용한 정보검색에 관한 연구)

  • 김영천;박병권;이성주
    • Journal of the Korean Institute of Intelligent Systems / v.11 no.3 / pp.252-257 / 2001
  • In conventional Boolean retrieval systems, document ranking is not supported and similarity coefficients cannot be computed between queries and documents. The MMM, Paice, and P-norm models have been proposed to add ranking to Boolean retrieval systems. They share the property of interpreting Boolean operators softly. In this paper, we propose a new soft evaluation method for information retrieval using a query splitting relevance feedback model. We also show through performance comparison that query splitting relevance feedback (QSRF) is more efficient and effective than MMM, Paice, and P-norm.
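
To make "interpreting Boolean operators softly" concrete, the following is a minimal sketch of how two of the baseline models named in the abstract, MMM and P-norm, are commonly formulated for a query whose terms have document weights in [0, 1]. The function names, the coefficient values (0.7), and the equal query-term weights are illustrative assumptions; the sketch does not implement the QSRF method proposed in the paper.

```python
def mmm_or(weights, c_or=0.7):
    """Mixed Min and Max OR: leans toward the max weight (c_or > 0.5 assumed)."""
    return c_or * max(weights) + (1 - c_or) * min(weights)


def mmm_and(weights, c_and=0.7):
    """Mixed Min and Max AND: leans toward the min weight (c_and > 0.5 assumed)."""
    return c_and * min(weights) + (1 - c_and) * max(weights)


def pnorm_or(weights, p=2.0):
    """P-norm OR with equal query-term weights (an assumption of this sketch)."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1 / p)


def pnorm_and(weights, p=2.0):
    """P-norm AND with equal query-term weights (an assumption of this sketch)."""
    n = len(weights)
    return 1 - (sum((1 - w) ** p for w in weights) / n) ** (1 / p)


if __name__ == "__main__":
    doc_weights = [0.2, 0.8]  # hypothetical index-term weights of one document
    print(mmm_or(doc_weights), mmm_and(doc_weights))
    print(pnorm_or(doc_weights), pnorm_and(doc_weights))
```

With a coefficient of 1, the MMM functions reduce to plain max and min, which on binary weights is the strict Boolean reading. With p = 1, both P-norm functions reduce to the mean of the term weights.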

Measurement of WSD based Document Similarity using U-WIN (U-WIN을 이용한 WSD 기반의 문서 유사도 측정)

  • Shim, Kang-Seop;Bae, Young-Jun;Ock, Cheol-Young;Choe, Ho-Seop
    • Annual Conference on Human and Language Technology / 2008.10a / pp.90-95 / 2008
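  • Abroad, much research is already under way on measuring document similarity with semantic language resources such as WordNet. In Korea, however, language resources comparable to WordNet are still scarce, so research on document similarity measures built on them, and on ways to use their results, remains insufficient. Most document similarity measures previously used in Korea were not based on the meanings of the words appearing in a document; they relied instead on simple matching of those words, frequency-based weighting, or extraction of important words by weight. As a result, existing similarity measures do not capture the contextual information of a document, depend on large document collections to compute word frequencies, and handle poorly those similar documents that express a particular concept (meaning) with different words or that use similar or related words. Motivated by this, this paper proposes a document similarity measure that uses U-WIN, a Korean lexical semantic network, together with overlap information for the words used in context, so that the measure is not based simply on surface words but exploits basic contextual information and is grounded in word meaning.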

Printed Hangul Recognition with Adaptive Hierarchical Structures Depending on 6-Types (6-유형 별로 적응적 계층 구조를 갖는 인쇄 한글 인식)

  • Ham, Dae-Sung;Lee, Duk-Ryong;Choi, Kyung-Ung;Oh, Il-Seok
    • The Journal of the Korea Contents Association / v.10 no.1 / pp.10-18 / 2010
  • Because Hangul character recognition involves a large number of classes, a six-type preclassification stage is commonly used. After preclassification, the first consonant, vowel, and last consonant can be classified separately. Although each of the three components has only a few classes, classification errors often occur because of shape similarity, as between 'ㅔ' and 'ㅖ'. This paper therefore proposes a hierarchical recognition method that adopts multi-stage tree structures for each of the six types. In addition, to reduce interference among the three components, the method uses the recognition results for the first consonant and vowel as features of the vowel classifier. The recognition accuracy on the test set of the PHD08 database was 98.96%.

The Historical Study of SDI System (SDI System의 사적연구(史的硏究)(1))

  • Kim, Chong Hwoe
    • Journal of the Korean Society for Information Management / v.1 no.1 / pp.146-161 / 1984
  • This study introduces the SDI (Selective Dissemination of Information) system, a typical form of information retrieval system that is now quite popular. The term "SDI" is most often used to describe systems that use electronic data processing equipment to match the terms of a user-interest profile against document descriptors and to select those documents with a specified degree of similarity to the profile terms. Up-to-date information on the SDI systems developed since Luhn first introduced the original idea is reviewed and compared. The stages of development, structure, characteristics, and other aspects of SDI systems are analyzed and discussed.

A Method to Measure the Self-Supplied News Volumes of Internet Newspaper Company

  • Kim, Dong-Joo;Lee, Won Joo
    • Journal of the Korea Society of Computer and Information / v.20 no.10 / pp.99-105 / 2015
  • The growth of internet infrastructure and a tremendous increase in internet users have led to the founding of many internet newspaper companies, which are able to uncover and publish their own news articles. Despite this quantitative growth, the qualitative growth of these companies has not kept pace. Therefore, to demand social responsibility and build a healthy media environment, the Korean government has put in force a registration system for internet newspaper companies. Under this system, an internet newspaper company must produce over 30 percent of its weekly publications in-house, and this requirement increases the need for verification. This paper investigates technologies for measuring the self-supplied news volume of an internet newspaper company, examines their validity, and presents an appropriate measurement method. To compare a huge number of news articles rapidly, the presented method is based on a modified edit distance that reflects human cognition of words and related empirical information. To demonstrate the correctness of the presented method, we show experimental results for some real internet news articles.
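
The abstract does not specify how the edit distance is modified, so the following is only a minimal sketch of the standard (unmodified) Levenshtein distance computed over word tokens, the kind of baseline such a method would start from. The function names, whitespace tokenization, and the normalization into a similarity score are illustrative assumptions, not the authors' method.

```python
def word_edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance over whitespace-separated words."""
    x, y = a.split(), b.split()
    # prev[j] holds the distance between the words seen so far in x and y[:j].
    prev = list(range(len(y) + 1))
    for i, wx in enumerate(x, start=1):
        curr = [i]
        for j, wy in enumerate(y, start=1):
            cost = 0 if wx == wy else 1
            curr.append(min(prev[j] + 1,          # delete wx
                            curr[j - 1] + 1,      # insert wy
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def word_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical texts, 0.0 for no shared alignment."""
    longest = max(len(a.split()), len(b.split()))
    if longest == 0:
        return 1.0
    return 1 - word_edit_distance(a, b) / longest


if __name__ == "__main__":
    print(word_edit_distance("the cat sat", "the dog sat"))
    print(word_similarity("the cat sat", "the dog sat"))
```

Working at the word level rather than the character level keeps the dynamic-programming table small for long articles, which matters when many article pairs must be compared.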

Comparative Analysis of Index Terms and Social Tags: Medical Subject Headings vs. BibSonomy and Delicious

  • Lee, Danielle H.
    • Journal of the Korean Society for Library and Information Science / v.49 no.2 / pp.291-311 / 2015
  • This paper presents a comparative analysis of the similarities and differences between Medical Subject Headings (MeSH) and social tags. Both types of metadata have the same purpose, namely succinctly abstracting the content of a given document, but are created from heterogeneous viewpoints. MeSH terms reflect the perspective of publication-related professionals, whereas social tags reflect the perspectives of general readers. When both types of metadata are assigned to the same publications, do they use different nomenclatures reflecting the heterogeneous viewpoints, or are they similar, since both describe the same publications? Social tags are also compared with the family terms of MeSH terms in the MeSH hierarchy, so as to understand the specificity of social tags relative to MeSH terms. Lastly, given that readers assign social tags casually and without any restricted vocabulary, we tested how many social tags contain consumer health terms, which are familiar to laypeople. Through these comparisons, we ultimately aim to examine how well the highly controlled publication index reflects general readers' cognitive understanding, and to stress the necessity of general readers' involvement in the publication indexing process.

An Incremental Clustering Technique of XML Documents using Cluster Histograms (클러스터의 히스토그램을 이용한 XML 문서의 점진적 클러스터링 기법)

  • Hwang, Jeong-Hee
    • Journal of KIISE:Databases / v.34 no.3 / pp.261-269 / 2007
  • As basic research toward efficiently integrating and retrieving XML documents, this paper proposes a method for clustering XML documents by structure. We apply an algorithm designed for processing large volumes of transaction data to the clustering of XML documents, an approach quite different from previous algorithms that measure structural similarity. Our method clusters XML documents using cluster histograms, which represent the distribution of items in clusters, while also considering global cluster cohesion. We compare the proposed method with existing techniques experimentally. The experiments show that our method not only creates good-quality clusters but also improves processing time.

Research on Subjective-type Grading System Using Syntactic-Semantic Tree Comparator (구문의미트리 비교기를 이용한 주관식 문항 채점 시스템에 대한 연구)

  • Kang, WonSeog
    • The Journal of Korean Association of Computer Education / v.21 no.6 / pp.83-92 / 2018
  • Subjective questions are well suited to evaluating deep thinking, but they are not easy to score. Because graders can produce different scores even under the same scoring criteria, an objective automatic evaluation system is needed. However, such a system faces the problems of analyzing and comparing Korean text. This paper proposes a Korean syntactic analysis method and a subjective-question grading system that uses a syntactic-semantic tree comparator. The system is a hybrid that combines word-based grading with syntactic-semantic tree-based grading, and it grades answers to subjective questions using the comparator. The proposed system produced good results. It can be applied to Korean syntactic-semantic analysis, subjective question grading, and document classification.

A Study on Analysis of a Process Similarity for the Service Reuse (서비스 재사용을 위한 프로세스 유사도 분석에 관한 연구)

  • Hwang, Chi-Gon;Yun, Chang-Pyo;Jung, Kye-Dong
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2014.05a / pp.238-240 / 2014
  • Cloud computing includes SaaS frameworks that make software available as a service. Even when a service already exists, if the service provider rebuilds it to accommodate differences among tenants and uses, additional resources are required in terms of cost and management. We therefore propose a technique that analyzes software structure using process algebra in order to reuse existing software. Process algebra is used to analyze the structure of software expressed as business processes or in different languages, and to verify whether it can be reused. Because CCS, a process algebra, is useful for converting business processes or XML, we use it to structure processes, and we propose a meta storage for comparing and managing the structured documents.
