• Title/Summary/Keyword: Similarity hash

Search Result 25, Processing Time 0.025 seconds

Similarity measurement based on Min-Hash for Preserving Privacy

  • Cha, Hyun-Jong;Yang, Ho-Kyung;Song, You-Jin
    • International Journal of Advanced Culture Technology
    • /
    • v.10 no.2
    • /
    • pp.240-245
    • /
    • 2022
  • Because of the importance of the information, encryption algorithms are heavily used. Raw data is encrypted and secure, but problems arise when the key for decryption is exposed. In particular, large-scale Internet sites such as Facebook and Amazon suffer serious damage when user data is exposed. Recently, research into a new fourth-generation encryption technology that can protect user-related data without the use of a key required for encryption is attracting attention. Also, data clustering technology using encryption is attracting attention. In this paper, we try to reduce key exposure by using homomorphic encryption. In addition, we want to maintain privacy through similarity measurement. Additionally, holistic similarity measurements are time-consuming and expensive as the data size and scope increases. Therefore, Min-Hash has been studied to efficiently estimate the similarity between two signatures Methods of measuring similarity that have been studied in the past are time-consuming and expensive as the size and area of data increases. However, Min-Hash allowed us to efficiently infer the similarity between the two sets. Min-Hash is widely used for anti-plagiarism, graph and image analysis, and genetic analysis. Therefore, this paper reports privacy using homomorphic encryption and presents a model for efficient similarity measurement using Min-Hash.

An Improved Histogram-Based Image Hash (Histogram에 기반한 Image Hash 개선)

  • Kim, So-Young;Kim, Hyoung-Joong
    • 한국정보통신설비학회:학술대회논문집
    • /
    • 2008.08a
    • /
    • pp.531-534
    • /
    • 2008
  • Image Hash specifies as a descriptor that can be used to measure similarity in images. Among all image Hash methods, histogram based image Hash has robustness to common noise-like operation and various geometric except histogram _equalization. In this_paper an improved histogram based Image Hash that is using "Imadjust" filter I together is proposed. This paper has achieved a satisfactory performance level on histogram equalization as well as geometric deformation.

  • PDF

Min-Max Hash for Similarity Measurement based on Multiset (Min-Max Hash를 활용한 다중 집합 기반의 유사도 측정)

  • Yoon, Jin-Uk;Kim, Byoungwook
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2019.05a
    • /
    • pp.36-39
    • /
    • 2019
  • 데이터 마이닝에서 클러스터링은 서로 유사한 특징을 갖는 데이터들을 동일한 클래스로 분류하는 방법이다. 클러스터링에는 다양한 방법이 존재하지만 대표적으로 집합으로 표현된 데이터들의 유사도를 측정하기 위해서는 자카드 유사도(Jaccard Similarity)를 이용한다. 자카드 유사도는 서로 다른 집합 간의 공통된 부분을 상대적으로 평가하여 유사도를 측정하는 방법이다. 그러나 최근에는 데이터를 저장할 수 있는 기술과 매체의 발전으로 표현할 수 있는 데이터의 영역과 범위는 발전되고 있기 때문에 많은 연산과 시간의 비용이 발생하게 된다. 이를 해결하기 위해서 두 데이터의 표본의 유사도를 통해 실제 데이터들의 유사도를 추정할 수 있는 Min-Hash 가 제안되었다. 본 논문에서는 이를 활용하여 집합의 영역을 다중 집합(Multiset)으로 확장하여 중복되는 값을 가질 수 있는 두 데이터 간의 유사도를 효율적으로 추정할 수 있는 Min-Max Hash 를 제안한다.

Method of Similarity Hash-Based Malware Family Classification (유사성 해시 기반 악성코드 유형 분류 기법)

  • Kim, Yun-jeong;Kim, Moon-sun;Lee, Man-hee
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.32 no.5
    • /
    • pp.945-954
    • /
    • 2022
  • Billions of malicious codes are detected every year, of which only 0.01% are new types of malware. In this situation, an effective malware type classification tool is needed, but previous studies have limitations in quickly analyzing a large amount of malicious code because it requires a complex and massive amount of data pre-processing. To solve this problem, this paper proposes a method to classify the types of malicious code based on the similarity hash without complex data preprocessing. This approach trains the XGBoost model based on the similarity hash information of the malware. To evaluate this approach, we used the BIG-15 dataset, which is widely used in the field of malware classification. As a result, the malicious code was classified with an accuracy of 98.9% also, identified 3,432 benign files with 100% accuracy. This result is superior to most recent studies using complex preprocessing and deep learning models. Therefore, it is expected that more efficient malware classification is possible using the proposed approach.

Function partitioning methods for malware variant similarity comparison (변종 악성코드 유사도 비교를 위한 코드영역의 함수 분할 방법)

  • Park, Chan-Kyu;Kim, Hyong-Shik;Lee, Tae Jin;Ryou, Jae-Cheol
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.25 no.2
    • /
    • pp.321-330
    • /
    • 2015
  • There have been found many modified malwares which could avoid detection simply by replacing a sequence of characters or a part of code. Since the existing anti-virus program performs signature-based analysis, it is difficult to detect a malware which is slightly different from the well-known malware. This paper suggests a method of detecting modified malwares by extending a hash-value based code comparison. We generated hash values for individual functions and individual code blocks as well as the whole code, and thus use those values to find whether a pair of codes are similar in a certain degree. We also eliminated some numeric data such as constant and address before generating hash values to avoid incorrectness incurred from them. We found that the suggested method could effectively find inherent similarity between original malware and its derived ones.

Research on the Classification Model of Similarity Malware using Fuzzy Hash (퍼지해시를 이용한 유사 악성코드 분류모델에 관한 연구)

  • Park, Changwook;Chung, Hyunji;Seo, Kwangseok;Lee, Sangjin
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.22 no.6
    • /
    • pp.1325-1336
    • /
    • 2012
  • In the past about 10 different kinds of malicious code were found in one day on the average. However, the number of malicious codes that are found has rapidly increased reachingover 55,000 during the last 10 year. A large number of malicious codes, however, are not new kinds of malicious codes but most of them are new variants of the existing malicious codes as same functions are newly added into the existing malicious codes, or the existing malicious codes are modified to evade anti-virus detection. To deal with a lot of malicious codes including new malicious codes and variants of the existing malicious codes, we need to compare the malicious codes in the past and the similarity and classify the new malicious codes and the variants of the existing malicious codes. A former calculation method of the similarity on the existing malicious codes compare external factors of IPs, URLs, API, Strings, etc or source code levels. The former calculation method of the similarity takes time due to the number of malicious codes and comparable factors on the increase, and it leads to employing fuzzy hashing to reduce the amount of calculation. The existing fuzzy hashing, however, has some limitations, and it causes come problems to the former calculation of the similarity. Therefore, this research paper has suggested a new comparison method for malicious codes to improve performance of the calculation of the similarity using fuzzy hashing and also a classification method employing the new comparison method.

Distributed data deduplication technique using similarity based clustering and multi-layer bloom filter (SDS 환경의 유사도 기반 클러스터링 및 다중 계층 블룸필터를 활용한 분산 중복제거 기법)

  • Yoon, Dabin;Kim, Deok-Hwan
    • The Journal of Korean Institute of Next Generation Computing
    • /
    • v.14 no.5
    • /
    • pp.60-70
    • /
    • 2018
  • A software defined storage (SDS) is being deployed in cloud environment to allow multiple users to virtualize physical servers, but a solution for optimizing space efficiency with limited physical resources is needed. In the conventional data deduplication system, it is difficult to deduplicate redundant data uploaded to distributed storages. In this paper, we propose a distributed deduplication method using similarity-based clustering and multi-layer bloom filter. Rabin hash is applied to determine the degree of similarity between virtual machine servers and cluster similar virtual machines. Therefore, it improves the performance compared to deduplication efficiency for individual storage nodes. In addition, a multi-layer bloom filter incorporated into the deduplication process to shorten processing time by reducing the number of the false positives. Experimental results show that the proposed method improves the deduplication ratio by 9% compared to deduplication method using IP address based clusters without any difference in processing time.

Twitter HashTag Recommendation Scheme based on Similar Tweet Analysis (유사 트윗 분석에 기반한 트위터 해시태그 추천기법)

  • Jeon, Mina;Jun, Sanghoon;Hwang, Eenjun
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2013.11a
    • /
    • pp.962-963
    • /
    • 2013
  • 트위터 해시태그(#, HashTag)는 트윗(Tweets)에서 특정 키워드나 내용을 주제별로 분류하고 검색을 보다 효율적으로 사용하기 위한 사용자 정의 태그이다. 사용자가 정의하기에 따라 다양한 형태로 작성되기 때문에 오히려 검색의 효율성이 떨어질 수 있으며, 사용자는 자신이 작성한 트윗에 어떤 해시태그를 추가해야 하는지에 대한 궁금증이 생기는 경우가 발생한다. 본 논문에서는 이러한 문제를 해결하기 위해 사용자가 작성한 트윗에 적합한 해시태그를 추천하는 기법을 제안한다. 수집한 트윗과 해시태그의 키워드를 추출하고 트윗의 유사도를 계산하기 위해 TF-IDF와 Cosine Similarity를 적용하여 유사한 트윗을 갖는 해시태그를 추천한다. 본 논문에서 제안된 기법을 검증하기 위한 실험으로 추천의 정확성을 평가했다.

Similar Contents Recommendation Model Based On Contents Meta Data Using Language Model (언어모델을 활용한 콘텐츠 메타 데이터 기반 유사 콘텐츠 추천 모델)

  • Donghwan Kim
    • Journal of Intelligence and Information Systems
    • /
    • v.29 no.1
    • /
    • pp.27-40
    • /
    • 2023
  • With the increase in the spread of smart devices and the impact of COVID-19, the consumption of media contents through smart devices has significantly increased. Along with this trend, the amount of media contents viewed through OTT platforms is increasing, that makes contents recommendations on these platforms more important. Previous contents-based recommendation researches have mostly utilized metadata that describes the characteristics of the contents, with a shortage of researches that utilize the contents' own descriptive metadata. In this paper, various text data including titles and synopses that describe the contents were used to recommend similar contents. KLUE-RoBERTa-large, a Korean language model with excellent performance, was used to train the model on the text data. A dataset of over 20,000 contents metadata including titles, synopses, composite genres, directors, actors, and hash tags information was used as training data. To enter the various text features into the language model, the features were concatenated using special tokens that indicate each feature. The test set was designed to promote the relative and objective nature of the model's similarity classification ability by using the three contents comparison method and applying multiple inspections to label the test set. Genres classification and hash tag classification prediction tasks were used to fine-tune the embeddings for the contents meta text data. As a result, the hash tag classification model showed an accuracy of over 90% based on the similarity test set, which was more than 9% better than the baseline language model. Through hash tag classification training, it was found that the language model's ability to classify similar contents was improved, which demonstrated the value of using a language model for the contents-based filtering.

The Design of Rescreening System for Social Network (소셜 네트워크 재검색 시스템의 설계)

  • Sim, Gyu Ri;Kim, Dong Hyun
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2022.07a
    • /
    • pp.139-140
    • /
    • 2022
  • 최근 소셜 네트워크 서비스 시장이 급속히 성장함에 따라 SNS 사용자 또한 지속적으로 증가하고 있다. 그러나, 광고성 게시물도 함께 증가함에 따라 해시태그 기반 검색의 정확도가 감소하는 문제점을 가지고 있다. 본 연구에서는 SNS 검색 활동의 정확도와 효율성을 개선하기 위하여 SNS 해시태그 기반 재검색 시스템을 제안한다. 제안 시스템을 적용하면 SNS 사용자의 검색 활동의 정확도와 효율성이 증가할 것으로 기대된다.

  • PDF