• Title/Summary/Keyword: Jaccard Similarity

Search Result 50, Processing Time 0.027 seconds

Improving Performance of Jaccard Coefficient for Collaborative Filtering

  • Lee, Soojung
    • Journal of the Korea Society of Computer and Information
    • /
    • v.21 no.11
    • /
    • pp.121-126
    • /
    • 2016
  • In recommender systems based on collaborative filtering, measuring similarity is very critical for determining the range of recommenders. Data sparsity problem is fundamental in collaborative filtering systems, which is partly solved by Jaccard coefficient combined with traditional similarity measures. This study proposes a new coefficient for improving performance of Jaccard coefficient by compensating for its drawbacks. We conducted experiments using datasets of various characteristics for performance analysis. As a result of comparison between the proposed and the similarity metric of Pearson correlation widely used up to date, it is found that the two metrics yielded competitive performance on a dense dataset while the proposed showed much better performance on a sparser dataset. Also, the result of comparing the proposed with Jaccard coefficient showed that the proposed yielded far better performance as the dataset is denser. Overall, the proposed coefficient demonstrated the best prediction and recommendation performance among the experimented metrics.

Performance Analysis of Similarity Reflecting Jaccard Index for Solving Data Sparsity in Collaborative Filtering (협력필터링의 데이터 희소성 해결을 위한 자카드 지수 반영의 유사도 성능 분석)

  • Lee, Soojung
    • The Journal of Korean Association of Computer Education
    • /
    • v.19 no.4
    • /
    • pp.59-66
    • /
    • 2016
  • It has been studied to reflect the number of co-rated items for solving data sparsity problem in collaborative filtering systems. A well-known method of Jaccard index allowed performance improvement, when combined with previous similarity measures. However, the degree of performance improvement when combined with existing similarity measures in various data environments are seldom analyzed, which is the objective of this study. Jaccard index as a sole similarity measure yielded much higher prediction quality than traditional measures and very high recommendation quality in a sparse dataset. In general, previous similarity measures combined with Jaccard index improved performance regardless of dataset characteristics. Especially, cosine similarity achieved the highest improvement in sparse datasets, while similarity of Mean Squared Difference degraded prediction quality in denser sets. Therefore, one needs to consider characteristics of data environment and similarity measures before combining Jaccard index for similarity use.

Applying Different Similarity Measures based on Jaccard Index in Collaborative Filtering

  • Lee, Soojung
    • Journal of the Korea Society of Computer and Information
    • /
    • v.26 no.5
    • /
    • pp.47-53
    • /
    • 2021
  • Sparse ratings data hinder reliable similarity computation between users, which degrades the performance of memory-based collaborative filtering techniques for recommender systems. Many works in the literature have been developed for solving this data sparsity problem, where the most simple and representative ones are the methods of utilizing Jaccard index. This index reflects the number of commonly rated items between two users and is mostly integrated into traditional similarity measures to compute similarity more accurately between the users. However, such integration is very straightforward with no consideration of the degree of data sparsity. This study suggests a novel idea of applying different similarity measures depending on the numeric value of Jaccard index between two users. Performance experiments are conducted to obtain optimal values of the parameters used by the proposed method and evaluate it in comparison with other relevant methods. As a result, the proposed demonstrates the best and comparable performance in prediction and recommendation accuracies.

Multi-Modal Based Malware Similarity Estimation Method (멀티모달 기반 악성코드 유사도 계산 기법)

  • Yoo, Jeong Do;Kim, Taekyu;Kim, In-sung;Kim, Huy Kang
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.29 no.2
    • /
    • pp.347-363
    • /
    • 2019
  • Malware has its own unique behavior characteristics, like DNA for living things. To respond APT (Advanced Persistent Threat) attacks in advance, it needs to extract behavioral characteristics from malware. To this end, it needs to do classification for each malware based on its behavioral similarity. In this paper, various similarity of Windows malware is estimated; and based on these similarity values, malware's family is predicted. The similarity measures used in this paper are as follows: 'TF-IDF cosine similarity', 'Nilsimsa similarity', 'malware function cosine similarity' and 'Jaccard similarity'. As a result, we find the prediction rate for each similarity measure is widely different. Although, there is no similarity measure which can be applied to malware classification with high accuracy, this result can be helpful to select a similarity measure to classify specific malware family.

Jaccard Index Reflecting Time-Context for User-based Collaborative Filtering

  • Soojung Lee
    • Journal of the Korea Society of Computer and Information
    • /
    • v.28 no.10
    • /
    • pp.163-170
    • /
    • 2023
  • The user-based collaborative filtering technique, one of the implementation methods of the recommendation system, recommends the preferred items of neighboring users based on the calculations of neighboring users with similar rating histories. However, it fundamentally has a data scarcity problem in which the quality of recommendations is significantly reduced when there is little common rating history. To solve this problem, many existing studies have proposed various methods of combining Jaccard index with a similarity measure. In this study, we introduce a time-aware concept to Jaccard index and propose a method of weighting common items with different weights depending on the rating time. As a result of conducting experiments using various performance metrics and time intervals, it is confirmed that the proposed method showed the best performance compared to the original Jaccard index at most metrics, and that the optimal time interval differs depending on the type of performance metric.

Hierarchic Document Clustering in OPAC (OPAC에서 자동분류 열람을 위한 계층 클러스터링 연구)

  • 노정순
    • Journal of the Korean Society for information Management
    • /
    • v.21 no.1
    • /
    • pp.93-117
    • /
    • 2004
  • This study is to develop a hierarchic clustering model fur document classification and browsing in OPAC systems. Two automatic indexing techniques (with and without controlled terms), two term weighting methods (based on term frequency and binary weight), five similarity coefficients (Dice, Jaccard, Pearson, Cosine, and Squared Euclidean). and three hierarchic clustering algorithms (Between Average Linkage, Within Average Linkage, and Complete Linkage method) were tested on the document collection of 175 books and theses on library and information science. The best document clusters resulted from the Between Average Linkage or Complete Linkage method with Jaccard or Dice coefficient on the automatic indexing with controlled terms in binary vector. The clusters from Between Average Linkage with Jaccard has more likely decimal classification structure.

Comparison of Plant Diversity of Natural Forest and Plantations of Rema-Kalenga Wildlife Sanctuary of Bangladesh

  • Sobuj, Norul-Alam;Rahman, Mizanur
    • Journal of Forest and Environmental Science
    • /
    • v.27 no.3
    • /
    • pp.127-134
    • /
    • 2011
  • The purpose of the study was to assess and compare the diversity of plant species (trees, shrubs, herbs) of natural forest and plantations. A total of 52 plant species were recorded in the natural forest, of which 16 were trees, 15 were shrubs and 21 were herbs. On the contrary, 31 species of plants including 11 trees, 8 shrubs and 12 herbs were identified in plantation forest. Shannon-Wiener diversity index were 2.70, 2.72 and 3.12 for trees, shrubs and herbs respectively in the natural forest. However, it was 2.35 for tree species, 2.31 for shrub species and 2.81 for herb species in the plantation forest. Jaccard's similarity index showed that 71% species of trees, 44% species of shrubs and 43% species of herbs were same in plantations and natural forest.

Recruitment matching mentoring system using Jaccard Similarity (자카드 유사도 기법을 이용한 채용 매칭 멘토링 시스템)

  • Seunghun Jang;Bong-Jun Choi
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2023.07a
    • /
    • pp.699-700
    • /
    • 2023
  • 최근 국내 기업에서는 블라인트 테스트나 포트폴리오와 같은 자료를 활용하여 채용하는 추세이다. 지원자마다 개인의 역량이 다를 뿐만 아니라 기업에서 요구하는 기술/경험, 지원 자격, 특정 기술에 대한 경험을 요구한다. 따라서 본 논문에서는 국내 기업의 채용 공고에 기재된 지원 자격, 우대 기술, 우대 사항 등의 데이터와 지원자의 개인 역량(기술 스택, 전공 역량, 진행 프로젝트 등) 데이터를 활용하여 키워드를 추출한다. 지원자와 기업이 입력한 데이터를 통해 추출한 키워드들을 두 개의 집합으로 나눈 뒤 각각의 키워드를 할당한다. 할당받은 집합들을 비교하여 지원자의 정보가 기업의 채용 조건에 얼마나 부합하는지 계산한 후, 해당확률을 지원자에게 제공하는 방식의 시스템이다.

  • PDF

Road Tracking based on Prior Information in Video Sequences (비디오 영상에서 사전정보 기반의 도로 추적)

  • Lee, Chang Woo
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.18 no.2
    • /
    • pp.19-25
    • /
    • 2013
  • In this paper, we propose an approach to tracking road regions from video sequences. The proposed method segments and tracks road regions by utilizing the prior information from the result of the previous frame. For the efficiency of the system, we have a simple assumption that the road region is usually shown in the lower part of input images so that lower 60% of input images is set to the region of interest(ROI). After initial segmentation using flood-fill algorithm, we merge neighboring regions based on color similarity measure. The previous segmentation result, in which seed points for the successive frame are extracted, is used as prior information to segment the current frame. The similarity between the road region of the previous frame and that of the current frame is measured by the modified Jaccard coefficient. According to the similarity we refine and track the detected road regions. The experimental results reveal that the proposed method is effective to segment and track road regions in noisy and non-noisy environments.

Development of a Personalized Similarity Measure using Genetic Algorithms for Collaborative Filtering

  • Lee, Soojung
    • Journal of the Korea Society of Computer and Information
    • /
    • v.23 no.12
    • /
    • pp.219-226
    • /
    • 2018
  • Collaborative filtering has been most popular approach to recommend items in online recommender systems. However, collaborative filtering is known to suffer from data sparsity problem. As a simple way to overcome this problem in literature, Jaccard index has been adopted to combine with the existing similarity measures. We analyze performance of such combination in various data environments. We also find optimal weights of factors in the combination using a genetic algorithm to formulate a similarity measure. Furthermore, optimal weights are searched for each user independently, in order to reflect each user's different rating behavior. Performance of the resulting personalized similarity measure is examined using two datasets with different data characteristics. It presents overall superiority to previous measures in terms of recommendation and prediction qualities regardless of the characteristics of the data environment.