• Title/Abstract/Keywords: Locality sensitive hashing

12 search results (processing time: 0.257 s)

Locality-Sensitive Hashing Techniques for Nearest Neighbor Search

  • Lee, Keon Myung
    • International Journal of Fuzzy Logic and Intelligent Systems / Vol. 12, No. 4 / pp.300-307 / 2012
  • As data volumes grow, even simple tasks can become a significant concern. Nearest neighbor search is one such task: it finds, in a data set, the k data points nearest to a query. Locality-sensitive hashing techniques have been developed for approximate but fast nearest neighbor search. This paper introduces the notion of locality-sensitive hashing and surveys locality-sensitive hashing techniques. It categorizes them by several criteria, presents their characteristics, and compares their performance.
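As background to the notion mentioned in the abstract above, locality sensitivity is commonly formalized as follows. This is the standard textbook definition, given here as a sketch rather than quoted from the paper; the symbols $\mathcal{H}$, $d$, $r$, $c$, $p_1$, and $p_2$ are notation assumed for this note. A family $\mathcal{H}$ of hash functions is called $(r, cr, p_1, p_2)$-sensitive for a distance $d$, with $c > 1$ and $p_1 > p_2$, if for any two points $p$ and $q$:

$$
\begin{aligned}
d(p, q) \le r \;&\Longrightarrow\; \Pr_{h \in \mathcal{H}}\bigl[h(p) = h(q)\bigr] \ge p_1, \\
d(p, q) \ge cr \;&\Longrightarrow\; \Pr_{h \in \mathcal{H}}\bigl[h(p) = h(q)\bigr] \le p_2.
\end{aligned}
$$

Nearby points therefore collide with higher probability than distant ones, which is what allows a query to be compared only against the points in its own buckets.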

Locality-Sensitive Hashing for Data with Categorical and Numerical Attributes Using Dual Hashing

  • Lee, Keon Myung
    • International Journal of Fuzzy Logic and Intelligent Systems / Vol. 14, No. 2 / pp.98-104 / 2014
  • Locality-sensitive hashing techniques have been developed to efficiently handle nearest neighbor searches and similar pair identification problems for large volumes of high-dimensional data. This study proposes a locality-sensitive hashing method that can be applied to nearest neighbor search problems for data sets containing both numerical and categorical attributes. The proposed method makes use of dual hashing functions, where one function is dedicated to numerical attributes and the other to categorical attributes. The method consists of creating an indexing structure for each of the dual hashing functions, gathering and combining the candidate sets, and thoroughly examining them to determine the nearest ones. The proposed method is evaluated on a few synthetic data sets, and the results show that it improves performance for large amounts of data with both numerical and categorical attributes.

Enhanced Locality Sensitive Clustering in High Dimensional Space

  • Chen, Gang;Gao, Hao-Lin;Li, Bi-Cheng;Hu, Guo-En
    • Transactions on Electrical and Electronic Materials / Vol. 15, No. 3 / pp.125-129 / 2014
  • A dataset can be clustered by merging the bucket indices that come from the random projections of locality sensitive hashing functions. For this to work, the merging interval must be calculated first. To improve the feasibility of large-scale data clustering in high-dimensional space, we propose an enhanced Locality Sensitive Hashing clustering method. First, multiple hashing functions are generated. Second, data points are projected to bucket indices. Third, bucket indices are clustered to obtain class labels. Experimental results show that on synthetic datasets this method achieves high accuracy at much improved clustering speed. These attributes make it well suited to clustering data in high-dimensional space.

An Improvement in K-NN Graph Construction using re-grouping with Locality Sensitive Hashing on MapReduce

  • 이인희;오혜성;김형주
    • KIISE Transactions on Computing Practices / Vol. 21, No. 11 / pp.681-688 / 2015
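  • The k-nearest neighbor (k-NN) graph is a data structure that represents the k nearest neighbors of every node, and it is used in collaborative filtering, similarity search, and many information retrieval and recommendation systems. Despite these advantages, brute-force k-NN graph construction has $O(n^2)$ time complexity, which makes it impractical for big data sets. Accordingly, algorithms have been studied that build the k-NN graph by applying Locality Sensitive Hashing, which is efficient for high-dimensional sparse data, on MapReduce, a (key, value)-based distributed environment. A two-stage method is used on MapReduce: Locality Sensitive Hashing groups users into neighbor-candidate groups, and similarity is computed by brute force only for pairs within a candidate group. Because similarity computation takes the most time during graph construction, how the candidate groups are formed is important. Existing methods are limited in their ability to prevent large candidate groups. This paper presents an algorithm that restructures large candidate groups for efficient k-NN graph construction. Experiments confirm that the proposed algorithm performs well in terms of graph accuracy and scan rate.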

A Dynamic Locality Sensitive Hashing Algorithm for Efficient Security Applications

  • Mohammad Y. Khanafseh;Ola M. Surakhi
    • International Journal of Computer Science & Network Security / Vol. 24, No. 5 / pp.79-88 / 2024
  • The information retrieval domain deals with the retrieval of unstructured data such as text documents. Searching documents is a main component of a modern information retrieval system. Locality Sensitive Hashing (LSH) is one of the most popular methods for searching documents in a high-dimensional space. The main benefit of LSH is its theoretical guarantee of query accuracy in a multi-dimensional space. LSH can be enhanced further by adding to its steps. In this paper, a new Dynamic Locality Sensitive Hashing (DLSH) algorithm is proposed as an improved version of the LSH algorithm. It relies on hierarchical selection of the LSH parameters (number of bands, number of shingles, and number of permutation lists), based on the similarity achieved by the algorithm, to optimize search accuracy and increase its score. The technique was applied to several tampered file structures, and its performance was evaluated. In some circumstances, the matching accuracy of DLSH exceeds 95% when the optimal values are selected for the number of bands, the number of shingles, and the number of permutation lists. This result makes the DLSH algorithm suitable for many critical applications that depend on accurate searching, such as forensics technology.
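The three parameters named in the abstract above (shingles, permutation lists, and bands) are those of conventional MinHash-based LSH with banding. The following is a minimal Python sketch of that conventional scheme, not of the DLSH algorithm itself: the hierarchical parameter selection is not shown. All function names are hypothetical, and seeded BLAKE2b hashes are assumed as a stand-in for random permutations.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of word k-shingles of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def _hash(seed, token):
    # Assumption: a seeded 64-bit hash approximates one random permutation.
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_perm=64):
    """Keep the minimum hash value under each of num_perm seeded hash functions."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_perm)]


def candidate_pairs(signatures, bands=16):
    """Split signatures into bands; documents that agree on any whole band are candidates."""
    num_perm = len(next(iter(signatures.values())))
    rows = num_perm // bands
    pairs = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(doc_id)
        for ids in buckets.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

In this sketch, more bands with fewer rows per band make candidate generation more permissive, while larger shingles make matching stricter. That trade-off is why the choice of these parameters affects search accuracy.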

Automated Fact Checking Model Using Efficient Transformer

  • Yun, Hee Seung;Jung, Jason J.
    • Journal of the Korea Institute of Information and Communication Engineering / Vol. 25, No. 9 / pp.1275-1278 / 2021
  • Nowadays, fake news from newspapers and social media is a serious issue for news credibility. Several machine learning methods (such as LSTM, logistic regression, and Transformer) have been applied to fact checking. In this paper, we present a Transformer-based fact checking model with improved computational efficiency. Locality Sensitive Hashing (LSH) is employed to compute attention values efficiently, which reduces computation time. With LSH, the model can group semantically similar words and compute attention values within each group. The proposed model achieves 75% accuracy, with F1 micro and F1 macro scores of 42.9% and 75%, respectively.

A Study on Malware Clustering Technique Using API Call Sequence and Locality Sensitive Hashing

  • 고동우;김휘강
    • Journal of the Korea Institute of Information Security & Cryptology / Vol. 27, No. 1 / pp.91-101 / 2017
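  • API (Application Program Interface) call sequence analysis extracts API call information from the program under analysis and then analyzes it. Compared with other techniques, it has the advantage of being able to characterize the behavior of the target. However, existing API call sequence analysis techniques misidentify functions that perform the same task as different functions. This study adds a step that abstracts each API in order to resolve this identification problem and improve analysis performance. LSH (Locality Sensitive Hashing) is then applied to the abstracted API call sequences obtained from the analysis targets to compute the similarity between targets and to form clusters of similar types. This study can be useful for identifying the type of a malware sample during malware analysis, and ultimately can help improve the accuracy of malware analysis based on that type information.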

Copy-move Forgery Detection Robust to Various Transformation and Degradation Attacks

  • Deng, Jiehang;Yang, Jixiang;Weng, Shaowei;Gu, Guosheng;Li, Zheng
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 12, No. 9 / pp.4467-4486 / 2018
  • To address the low robustness of Copy-Move Forgery Detection (CMFD) under various transformation and degradation attacks, this paper proposes a novel CMFD method. The main advantages of the proposed work include: (1) the Discrete Analytical Fourier-Mellin Transform (DAFMT) and Locality Sensitive Hashing (LSH) are combined to extract the block features and detect the potential copy-move pairs; (2) the Euclidean distance is combined with the pixel variance to filter out false potential copy-move pairs in the post-verification step. In addition to extracting the effective features of an image block, the DAFMT has the properties of rotation and scale invariance. Unlike the traditional lexicographic sorting method, LSH is robust to the degradations of Gaussian noise and JPEG compression. Because most of the false copy-move pairs are located close to each other in the spatial domain or lie in homogeneous regions, the Euclidean distance and pixel variance are employed in the post-verification step. In a quantitative evaluation with the precision-recall-$F_1$ model on the Image Manipulation Dataset (IMD) and the Copy-Move Hard Dataset (CMHD), our method outperforms the works of Emam et al. and Li et al. in recall and $F_1$.

k-NN Join Based on LSH in Big Data Environment

  • Ji, Jiaqi;Chung, Yeongjee
    • Journal of information and communication convergence engineering / Vol. 16, No. 2 / pp.99-105 / 2018
  • k-Nearest neighbor join (k-NN Join) is a computationally intensive algorithm that is designed to find k-nearest neighbors from a dataset S for every object in another dataset R. Most related studies on k-NN Join are based on single-computer operations. As the data dimensions and data volume increase, running the k-NN Join algorithm on a single computer cannot generate results quickly. To solve this scalability problem, we introduce the locality-sensitive hashing (LSH) k-NN Join algorithm implemented in Spark, an approach for high-dimensional big data. LSH is used to map similar data onto the same bucket, which can reduce the data search scope. In order to achieve parallel implementation of the algorithm on multiple computers, the Spark framework is used to accelerate the computation of distances between objects in a cluster. Results show that our proposed approach is fast and accurate for high-dimensional and big data.
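The bucket-based search described in the abstract above can be sketched on a single machine as follows. This is an illustrative Python sketch only, not the paper's Spark implementation. It assumes the common p-stable hash for Euclidean distance, h(x) = floor((a·x + b) / w), and the names `make_hash`, `lsh_knn_join`, `num_proj`, `width`, and `num_tables` are hypothetical.

```python
import math
import random
from collections import defaultdict


def make_hash(dim, num_proj=4, width=4.0, seed=0):
    """Build a compound hash: num_proj values of floor((a.x + b) / width)."""
    rng = random.Random(seed)
    projs = [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, width))
             for _ in range(num_proj)]

    def h(x):
        return tuple(math.floor((sum(ai * xi for ai, xi in zip(a, x)) + b) / width)
                     for a, b in projs)
    return h


def lsh_knn_join(R, S, k, num_tables=8, **hash_kwargs):
    """For each r in R, return indices of up to k approximate nearest neighbors in S."""
    dim = len(S[0])
    tables = []
    for t in range(num_tables):
        h = make_hash(dim, seed=t, **hash_kwargs)
        buckets = defaultdict(list)
        for j, s in enumerate(S):
            buckets[h(s)].append(j)  # index S once per hash table
        tables.append((h, buckets))

    result = []
    for r in R:
        candidates = set()
        for h, buckets in tables:
            # Only points sharing a bucket with r are ever compared to it.
            candidates.update(buckets.get(h(r), ()))
        ranked = sorted(candidates, key=lambda j: math.dist(r, S[j]))
        result.append(ranked[:k])
    return result
```

Exact distances are computed only for the candidates that share a bucket with the query, which is how the search scope is reduced. The result is approximate: a true neighbor that never shares a bucket with the query is missed, and using more tables lowers that risk at the cost of more candidates.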

Object Classification based on Weakly Supervised E2LSH and Saliency map Weighting

  • Zhao, Yongwei;Li, Bicheng;Liu, Xin;Ke, Shengcai
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 10, No. 1 / pp.364-380 / 2016
  • The most popular approach to object classification is based on the bag of visual-words model, which has several fundamental problems that restrict its performance, such as low time efficiency, the synonymy and polysemy of visual words, and the lack of spatial information between visual words. In view of this, an object classification method based on weakly supervised E2LSH and saliency map weighting is proposed. First, E2LSH (Exact Euclidean Locality Sensitive Hashing) is employed to generate a group of weakly randomized visual dictionaries by clustering the SIFT features of the training dataset, and the selection of hash functions is supervised, inspired by random forest ideas, to reduce the randomness of E2LSH. Second, the graph-based visual saliency (GBVS) algorithm is applied to detect the saliency maps of different images and to weight the visual words according to the saliency prior. Finally, a saliency map weighted visual language model is used to accomplish object classification. Experimental results on the Pascal 2007 and Caltech-256 datasets indicate that the distinguishability of objects is effectively improved and that our method is superior to state-of-the-art object classification methods.