• Title/Summary/Keyword: 중복 인덱싱 (duplicate indexing)

A Method of Summary based Indexing in De-duplication File System (중복제거 파일시스템에서 서머리 기반 인덱싱 기법)

  • Lee, Joongsoo;Ahn, Chang-Won
    • Proceedings of the Korea Information Processing Society Conference / 2012.11a / pp.312-313 / 2012
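  • De-duplication file systems are widely used to reduce storage consumption for files that share a large amount of redundant data, such as virtual machine images. Deduplication often relies on a summary vector used together with an index, but this approach consumes a lot of memory and, depending on the index structure, requires multiple hard disk accesses. This paper proposes an indexing technique that uses the summary vector inside the index and reduces the number of hard disk accesses.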
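
The conventional arrangement that the abstract describes as the baseline (a summary vector consulted before the index) can be sketched as follows. This is only an illustration under stated assumptions: the summary vector is modelled as a Bloom filter, the on-disk index is a plain dictionary, and the names `SummaryVector`, `DedupStore`, and `disk_lookups` are hypothetical. It does not reproduce the paper's proposed scheme, which the abstract does not detail.

```python
import hashlib


class SummaryVector:
    """Bloom-filter-style summary vector (an assumption, not taken from the paper)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, fingerprint):
        # Derive num_hashes bit positions by salting the fingerprint.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + fingerprint).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, fingerprint):
        for pos in self._positions(fingerprint):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, fingerprint):
        # False means "definitely new"; True means "possibly seen before".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(fingerprint))


class DedupStore:
    """Toy chunk store; the dict stands in for an on-disk fingerprint index."""

    def __init__(self):
        self.summary = SummaryVector()
        self.index = {}        # fingerprint -> chunk id (hypothetical on-disk index)
        self.chunks = []       # stored unique chunks
        self.disk_lookups = 0  # counts simulated index accesses

    def write_chunk(self, data):
        fp = hashlib.sha256(data).digest()
        # Consult the index only when the summary vector says "possibly present".
        if self.summary.might_contain(fp):
            self.disk_lookups += 1
            if fp in self.index:
                return self.index[fp]
        self.chunks.append(data)
        self.index[fp] = len(self.chunks) - 1
        self.summary.add(fp)
        return self.index[fp]
```

In this sketch, a chunk whose fingerprint is absent from the summary vector is stored without touching the index at all, so only duplicates and rare false positives pay for an index access.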

An Efficient Method for Detecting Duplicated Documents in a Blog Service System (블로그 서비스 시스템을 위한 효과적인 중복문서의 검출 기법)

  • Lee, Sang-Chul;Lee, Soon-Haeng;Kim, Sang-Wook
    • Journal of KIISE:Databases / v.37 no.1 / pp.50-55 / 2010
  • Duplicate documents in a blog service system are one of the causes that degrade both the quality and the performance of blog searches. Unlike in the WWW environment, the creation of every document is recorded in a blog service system, which makes it possible to distinguish the original document from its duplicates. Based on this observation, this paper proposes a novel method for detecting duplicate documents in a blog service system. The method determines whether a document is original at the time it is stored in the system. As a result, it keeps duplicate documents out of the index used by the blog search engine, so they no longer appear in search results. This paper also proposes three indexing methods that preserve the accuracy of the earlier approach, Min-hashing. We identify the most effective indexing method through extensive experiments on real-life blog data.
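
Min-hashing, the baseline the abstract refers to, can be sketched in a few lines. This is a generic illustration, not the paper's indexing methods: the word-shingle size, the number of hash functions, and the helper names `shingles`, `minhash_signature`, and `estimated_similarity` are assumptions chosen for the example.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(items, num_hashes=64):
    """For each salted hash function, keep the minimum hash value over the set."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(hashlib.sha256(salt + s.encode("utf-8")).digest()[:8], "big")
            for s in items
        ))
    return signature


def estimated_similarity(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

Two documents would be treated as duplicates when the estimated similarity exceeds a chosen threshold; the threshold itself is application-specific and is not given in the abstract.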

MLR-tree : Spatial Indexing Method for Window Query of Multi-Level Geographic Data (MLR 트리 : 다중 레벨 지리정보 데이터의 윈도우 질의를 위한 공간 인덱싱 기법)

  • 권준희;윤용익
    • Journal of KIISE:Databases / v.30 no.5 / pp.521-531 / 2003
  • Multi-level geographic data can be manipulated by a window query such as a zoom operation. To handle multi-level geographic data efficiently, a spatial indexing method that supports window queries is needed. However, conventional spatial indexing methods cannot access multi-level geographic data quickly. A few spatial indexing methods for multi-level geographic data have been proposed to address this, but they do not support all types of multi-level geographic data. This paper presents a new efficient spatial indexing method, the MLR-tree, for window queries over multi-level geographic data. Experiments show that the MLR-tree offers high search performance without data redundancy. Moreover, the MLR-tree supports all types of multi-level geographic data.

A New Spatial Indexing Method for Level-Of-Detailed Data (레벨별로 상세화된 공간 데이터를 위한 새로운 공간 인덱싱 기법)

  • 권준희;윤용익
    • Journal of Korea Multimedia Society / v.5 no.4 / pp.361-371 / 2002
  • An efficient access technique is one of the most important requirements in GIS. With level-of-detail data, spatial data can be accessed efficiently because the fully detailed spatial data need not be read. Previous spatial access methods do not handle data with levels of detail efficiently. A few spatial access methods for such data are known, but they support only a few kinds of level-of-detail data, i.e., data produced through selection and simplification operations. We therefore propose a new spatial indexing method that supports fast searching over all kinds of data with levels of detail. In the proposed method, the collection of per-level indexes is integrated into a single index structure. Experimental results show that our method offers high search performance without data redundancy.

CopyCheck: Korean Plagiarism Detection System (CopyCheck: 한국어 표절 검사 시스템)

  • Jang, Eun-Seo;Kwon, Do-Hyoung;Kim, Nak-Won;Park, So-Yeong;Kang, Seung-Shik
    • Annual Conference on Human and Language Technology / 2012.10a / pp.117-118 / 2012
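  • Existing plagiarism detection software suffers from problems such as excessively long running times or flagging passages that hardly constitute plagiarism. This paper designs and develops CopyCheck, software that universities can use to check assignments for plagiarism. CopyCheck extracts a document-specific signature set from each target document and compares the sets, builds a duplicate index set between documents suspected of plagiarism to narrow down the suspicious passages, and then applies local alignment to locate the matching passages. In this way it finds plagiarized passages quickly even across a large number of documents.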

A Method for Non-redundant Keyword Search over Graph Data (그래프 데이터에 대한 비-중복적 키워드 검색 방법)

  • Park, Chang-Sup
    • The Journal of the Korea Contents Association / v.16 no.6 / pp.205-214 / 2016
  • As large amounts of graph-structured data are widely used in applications such as social networks, the semantic web, and bioinformatics, keyword-based search over graph data has been attracting a lot of attention. In this paper, we propose an efficient method for keyword search over graph data that finds a set of top-k answers that are both relevant and structurally non-redundant. We define a non-redundant answer structure for a keyword query and a relevance measure for the answer. We suggest a new indexing scheme on the relevant paths between nodes and keyword terms in the graph, and also propose a query processing algorithm that finds top-k non-redundant answers efficiently by exploiting the pre-calculated indexes. An experiment on a real dataset demonstrates the effectiveness and efficiency of the proposed approach compared with the previous method.

Content based Image Retrieval using RGB Maximum Frequency Indexing and BW Clustering (RGB 최대 주파수 인덱싱과 BW 클러스터링을 이용한 콘텐츠 기반 영상 검색)

  • Kang, Ji-Young;Beak, Jung-Uk;Kang, Gwang-Won;An, Young-Eun;Park, Jong-An
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology / v.1 no.2 / pp.71-79 / 2008
  • This study proposes a content-based image retrieval system that uses RGB maximum frequency indexing and BW clustering to address the retrieval errors of existing histogram-based methods. We split RGB color images into their R, G, and B channels, build a histogram evenly divided into 32 bins for each channel, analyse the pixel counts in each bin, and obtain the maximum value. We index the resulting color information, retrieve 100 similar images using these values, and perform the final retrieval using the total number and distribution rate of clusters. The proposed algorithm uses spatial information, drawn from the R, G, and B features and from the clusters, to obtain effective features. This overcomes the disadvantage of existing gray-scale algorithms, which perceive different images as the same if they have the same shade frequencies. Measured by recall and precision, the retrieval rate and priority of the proposed algorithm are better than those of the existing algorithm.
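
One plausible reading of the maximum frequency indexing step is sketched below: for each of the R, G, and B channels, build a 32-bin histogram and record the bin that holds the most pixels. This is an interpretation of the abstract, not the authors' implementation; the function name `max_frequency_index`, the pixel representation as `(r, g, b)` tuples of 8-bit values, and the tie-breaking rule are assumptions.

```python
from collections import Counter


def max_frequency_index(pixels, bins=32):
    """Return, per channel, the histogram bin containing the most pixels.

    pixels: iterable of (r, g, b) tuples with 8-bit values (0-255).
    """
    pixels = list(pixels)
    width = 256 // bins  # 8 intensity levels per bin when bins == 32
    index = []
    for channel in range(3):
        counts = Counter(p[channel] // width for p in pixels)
        # Assumed tie-break: prefer the lowest-numbered bin.
        best = min(counts, key=lambda b: (-counts[b], b))
        index.append(best)
    return tuple(index)
```

For example, an image with three pure-red pixels and one pure-blue pixel yields `(31, 0, 0)`: the red channel peaks in the top bin, while green and blue peak in the lowest bin.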

Techniques of XML Fragment Stream Organization for Efficient XML Query Processing in Mobile Clients (이동 클라이언트에서 효율적인 XML 질의 처리를 위한 XML 조각 스트림 구성 기법)

  • Ryu, Jeong-Hoon;Kang, Hyun-Chul
    • The Journal of Society for e-Business Studies / v.14 no.4 / pp.75-94 / 2009
  • Since XML emerged as a standard for data exchange on the web, it has been established as a core component of e-Commerce, and efficient query processing over XML data in ubiquitous computing environments has also been receiving much attention. Recently, techniques have been proposed in which an XML document is split into XML fragments that are streamed, and mobile clients process queries over the stream as they receive it. When processing queries over an XML fragment stream, the average access time depends significantly on the order of fragments in the stream. Query performance therefore requires an efficient organization of the XML fragment stream, as well as indexing that makes query processing energy-efficient by reducing tuning time. This paper proposes a technique for organizing an XML fragment stream based on query frequencies, fragment sizes, and fragment access frequencies, together with an active XML-based indexing scheme. Implementation and performance experiments show that our techniques are efficient compared with conventional XML fragment stream organizations.

An Enhancing Technique for Scan Performance of a Skip List with MVCC (MVCC 지원 스킵 리스트의 범위 탐색 향상 기법)

  • Kim, Leeju;Lee, Eunji
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.20 no.5 / pp.107-112 / 2020
  • Recently, unstructured data has been produced rapidly by web-based services. NoSQL systems and key-value stores, which process unstructured data as key and value pairs, are widely used in various applications. This paper studies the skip list used for in-memory data management in an LSM-tree based key-value store. The skip list used in the key-value store is an insertion-based skip list that does not allow overwriting and processes all changes only by inserting. This behavior can support Multi-Version Concurrency Control (MVCC), which processes multiple read/write requests simultaneously through snapshot isolation. However, because duplicate keys exist in the skip list, performance degrades significantly due to unnecessary node visits during list traversal. The overhead is especially serious for a range query or scan operation that searches a specific range of data collectively. This paper proposes a newly designed Stride SkipList to reduce this overhead. The Stride SkipList additionally maintains an indexing pointer to the last node of the same key to avoid unnecessary node visits. The proposed scheme is implemented using RocksDB's in-memory component, and the performance evaluation shows that SCAN performance improves by up to 350 times compared with the existing skip list across various workloads.
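
The idea of a pointer to the last node of the same key can be illustrated on the bottom level of a skip list alone. The sketch below is a simplification under stated assumptions: it models only a sorted linked list (no upper skip-list levels, no concurrency), orders versions of a key newest first, and uses the hypothetical names `Node`, `build_level0`, and `scan`. It is not the paper's Stride SkipList or RocksDB's code.

```python
class Node:
    def __init__(self, key, seq, value):
        self.key, self.seq, self.value = key, seq, value
        self.next = None
        self.stride = self  # will point to the last node holding the same key


def build_level0(entries):
    """Build a sorted list from (key, seq, value) entries.

    Ordering assumption: key ascending, sequence number descending,
    so the newest version of each key comes first.
    """
    ordered = sorted(entries, key=lambda e: (e[0], -e[1]))
    nodes = [Node(*e) for e in ordered]
    for a, b in zip(nodes, nodes[1:]):
        a.next = b
    # Walk backwards so every node learns the last node of its key run.
    last = None
    for node in reversed(nodes):
        if last is None or last.key != node.key:
            last = node
        node.stride = last
    return nodes[0] if nodes else None


def scan(head, use_stride):
    """Return the newest value per key and the number of nodes visited."""
    results, visited = [], 0
    prev_key = object()  # sentinel that equals no real key
    node = head
    while node is not None:
        visited += 1
        if node.key != prev_key:
            results.append((node.key, node.value))
            prev_key = node.key
            if use_stride:
                node = node.stride  # jump over older versions of this key
        node = node.next
    return results, visited
```

With six entries spread over three keys, a scan without the stride pointer visits all six nodes, while a scan that follows it visits one node per key and returns the same result.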