• Title/Summary/Keyword: Pearson similarity

On the Study of Perfect Coverage for Recommender System

  • Lee, Hee-Choon;Lee, Seok-Jun
    • Journal of the Korean Data and Information Science Society / v.17 no.4 / pp.1151-1160 / 2006
  • Pearson's correlation coefficient, used as a similarity weight in recommender systems, has the weakness that it cannot produce a prediction for every target rating. Vector similarity, the other common similarity weight, gives higher prediction coverage than Pearson's correlation coefficient but suffers from a high MAE. The purpose of this study is to suggest a way to raise prediction coverage. The MAE and prediction coverage of the suggested method were compared with those obtained using Pearson's correlation coefficient and vector similarity. The MAE of the suggested method was higher than that obtained with Pearson's correlation coefficient but lower than that obtained with vector similarity. Its prediction coverage was higher than that of both Pearson's correlation coefficient and vector similarity.
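The two similarity weights compared in this abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: ratings are assumed to be stored as dicts mapping item to score, the function names are hypothetical, and treating Pearson's coefficient as undefined when there are fewer than two co-rated items or zero variance is an assumption about one common cause of lost coverage.

```python
from math import sqrt

def co_rated(u, v):
    """Items rated by both users; u and v map item -> rating."""
    return sorted(set(u) & set(v))

def pearson_weight(u, v):
    """Pearson correlation over co-rated items, or None if undefined."""
    items = co_rated(u, v)
    if len(items) < 2:
        return None  # too few co-rated items (assumed cutoff)
    mean_u = sum(u[i] for i in items) / len(items)
    mean_v = sum(v[i] for i in items) / len(items)
    num = sum((u[i] - mean_u) * (v[i] - mean_v) for i in items)
    den = sqrt(sum((u[i] - mean_u) ** 2 for i in items)) * \
          sqrt(sum((v[i] - mean_v) ** 2 for i in items))
    if den == 0:
        return None  # one user gave constant ratings: zero variance
    return num / den

def vector_weight(u, v):
    """Vector (cosine) similarity: co-rated dot product, per-user norms."""
    num = sum(u[i] * v[i] for i in co_rated(u, v))
    den = sqrt(sum(r * r for r in u.values())) * \
          sqrt(sum(r * r for r in v.values()))
    return num / den if den else None

alice = {"m1": 5, "m2": 3, "m3": 4}
bob = {"m1": 4, "m2": 4, "m3": 4}    # constant ratings
carol = {"m1": 4, "m2": 2, "m3": 3}

print(pearson_weight(alice, bob))    # undefined for this pair
print(vector_weight(alice, bob))     # still defined
print(pearson_weight(alice, carol))
```

In this toy data the Pearson weight between `alice` and `bob` is undefined while the vector similarity is not, which is one way a Pearson-based predictor can end up with lower coverage.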

A Study on the Maximizing Coverage for Recommender System

  • Lee, Hee-Choon;Lee, Seok-Jun;Park, Ji-Won;Kim, Chul-Seoung
    • Proceedings of the Korean Data and Information Science Society Conference (한국데이터정보과학회:학술대회논문집) / 2006.11a / pp.119-128 / 2006
  • Pearson's correlation coefficient, used as a similarity weight in recommender systems, has the weakness that it cannot produce a prediction for every target rating. Vector similarity, the other common similarity weight, gives higher prediction coverage than Pearson's correlation coefficient but suffers from a high MAE. The purpose of this study is to suggest a way to raise prediction coverage. The MAE and prediction coverage of the suggested method were compared with those obtained using Pearson's correlation coefficient and vector similarity. The MAE of the suggested method was higher than that obtained with Pearson's correlation coefficient but lower than that obtained with vector similarity. Its prediction coverage was higher than that of both Pearson's correlation coefficient and vector similarity.
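The two evaluation measures named in this abstract, MAE and prediction coverage, can be computed as in the sketch below. It assumes the usual definitions (mean absolute error over the predictions that were made, and the share of test ratings for which a prediction was possible); the function name and the data layout are hypothetical.

```python
def mae_and_coverage(actual, predicted):
    """Return (MAE, coverage) for a set of test ratings.

    actual:    dict mapping (user, item) -> true rating
    predicted: dict mapping (user, item) -> predicted rating, or None
               when the method could not produce a prediction
    """
    made = [(predicted[k], actual[k])
            for k in actual if predicted.get(k) is not None]
    coverage = len(made) / len(actual)
    # MAE is only defined over the predictions that were actually made.
    mae = sum(abs(p - a) for p, a in made) / len(made) if made else None
    return mae, coverage

actual = {("u1", "m1"): 4, ("u1", "m2"): 2,
          ("u2", "m1"): 5, ("u2", "m3"): 3}
predicted = {("u1", "m1"): 3.5, ("u1", "m2"): 2.5,
             ("u2", "m1"): None, ("u2", "m3"): 3.0}

mae, coverage = mae_and_coverage(actual, predicted)
print(mae, coverage)
```

Because MAE is averaged only over covered ratings, a method can lower its MAE by declining to predict hard cases, so the two measures are best reported together.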

Parametric and Non Parametric Measures for Text Similarity (텍스트 유사성을 위한 파라미터 및 비 파라미터 측정)

  • Mlyahilu, John;Kim, Jong-Nam
    • Journal of the Institute of Convergence Signal Processing / v.20 no.4 / pp.193-198 / 2019
  • The wide spread of genuine and fake information on the internet has led to various studies on text analysis. Copying and pasting others' work without acknowledgement and manipulating research results without proof have been trending for a while in the era of data science. Various tools have been developed to reduce, combat and possibly eradicate plagiarism in various research fields. Text similarity can be measured with both parametric and non-parametric methods; this study implements cosine similarity and Pearson correlation as parametric measures and Spearman correlation as a non-parametric measure. Cosine similarity and Pearson correlation achieved the highest similarity coefficients, while Spearman correlation showed low similarity coefficients. We recommend non-parametric methods for measuring text similarity because they do not assume normality, unlike parametric methods, which rely on normality assumptions and are prone to bias.
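The three measures named in this abstract can be sketched on term-frequency vectors as follows. This is an illustration under assumptions, not the authors' pipeline: whitespace tokenization, raw counts as features, and Spearman computed as Pearson on average ranks are all choices made here, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt

def term_vectors(a, b):
    """Term-frequency vectors of two texts over their shared vocabulary."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    vocab = sorted(set(ca) | set(cb))
    return [ca[w] for w in vocab], [cb[w] for w in vocab]

def cosine(x, y):
    den = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    if den == 0:
        return float("nan")  # undefined for an all-zero vector
    return sum(a * b for a, b in zip(x, y)) / den

def pearson(x, y):
    """Pearson correlation: cosine of the mean-centered vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mx for a in x], [b - my for b in y])

def average_ranks(x):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

x, y = term_vectors("the cat sat", "the cat ran")
print(cosine(x, y), pearson(x, y), spearman(x, y))
```

Term counts in short texts contain many ties, so the rank transform changes the data far less than it would for continuous measurements; that is worth keeping in mind when comparing the coefficients.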

On the Effect of Significance of Correlation Coefficient for Recommender System

  • Lee, Hee-Choon
    • Journal of the Korean Data and Information Science Society / v.17 no.4 / pp.1129-1139 / 2006
  • Pearson's correlation coefficient and vector similarity are generally used as the user similarity weight in user-based recommender systems. This study examines how the correlation coefficient used as a similarity weight is affected by the number of paired responses and by the significance probability. Correlation coefficients are classified by a significance test and by the number of paired responses, and the change in MAE is studied by comparing the prediction accuracy of the two. The experimental results relate the change in MAE to the significance of the correlation coefficient and the number of paired responses.
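A standard way to test whether a Pearson coefficient r computed from n paired responses differs from zero is the statistic t = r·sqrt((n − 2) / (1 − r²)) with n − 2 degrees of freedom. The sketch below assumes this is the kind of significance test the abstract refers to, which the abstract itself does not specify; the function name is hypothetical.

```python
from math import sqrt

def correlation_t_statistic(r, n):
    """t statistic for H0: correlation = 0, given r from n paired responses.

    Returns None when the statistic is undefined (n < 3 or |r| >= 1).
    Compare the result against a t distribution with n - 2 degrees of freedom.
    """
    if n < 3 or abs(r) >= 1:
        return None
    return r * sqrt((n - 2) / (1 - r * r))

# The same coefficient is weaker evidence when it rests on fewer pairs.
print(correlation_t_statistic(0.5, 5))
print(correlation_t_statistic(0.5, 11))
```

The statistic grows with the number of paired responses for a fixed r, which is why the number of co-rated items and the significance of the coefficient are studied together.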

Identifying Spatial Distribution Pattern of Water Quality in Masan Bay Using Spatial Autocorrelation Index and Pearson's r (공간자기상관 지수와 Pearson 상관계수를 이용한 마산만 수질의 공간분포 패턴 규명)

  • Choi, Hyun-Woo;Park, Jae-Moon;Kim, Hyun-Wook;Kim, Young-Ok
    • Ocean and Polar Research / v.29 no.4 / pp.391-400 / 2007
  • To identify the spatial distribution pattern of water quality in Masan Bay, Pearson's correlation, a common statistical method, and Moran's I, a spatial autocorrelation statistic, were applied to hydrological data collected seasonally from Masan Bay over two years (2004-2005). Among the hydrological parameters, salinity, DO and silicate were strongly clustered in space, while chlorophyll a showed weak clustering. When the similarity matrix of Moran's I was compared with the correlation matrix of Pearson's r, only the relationships of temperature vs. salinity, temperature vs. silicate, and silicate vs. total inorganic nitrogen showed both significant correlation and similar spatial clustering. Considering Pearson's correlation and the spatial autocorrelation results, the water quality distribution patterns of Masan Bay were conceptually simplified into four types. Based on the simplified types, Moran's I and Pearson's r were each compared with spatial distribution maps of salinity and silicate, which have a strongly clustered pattern, and of chlorophyll a, which has no clustered pattern. According to these test results, the spatial distribution of water quality in Masan Bay can be summed up in four patterns. This summary should be developed into a spatial index linked with pollutant and ecological indicators for coastal health assessment.
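Moran's I, the spatial autocorrelation statistic used alongside Pearson's r in this study, can be sketched as below using its standard global form. The station layout and the binary adjacency weights are made-up illustrations, not the Masan Bay data, and the function name is hypothetical.

```python
def morans_i(values, weights):
    """Global Moran's I for values at n sites and an n x n weight matrix.

    I = (n / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2
    where W is the sum of all weights.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_total = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_total) * num / den

# Four hypothetical stations along a line; neighbors get weight 1.
line_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

gradient = morans_i([1, 2, 3, 4], line_weights)      # similar neighbors
alternating = morans_i([1, 4, 1, 4], line_weights)   # dissimilar neighbors
print(gradient, alternating)
```

A positive value indicates that neighboring stations hold similar values (clustering), and a negative value indicates that neighbors tend to differ.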

A Study on the Effect of Co-Ratings and Correlation Coefficient for Recommender System

  • Lee, Hee-Choon;Lee, Seok-Jun;Park, Ji-Won;Kim, Chul-Seung
    • Proceedings of the Korean Data and Information Science Society Conference (한국데이터정보과학회:학술대회논문집) / 2006.11a / pp.59-69 / 2006
  • Pearson's correlation coefficient and vector similarity are generally used as the user similarity weight in user-based recommender systems. This study examines how the correlation coefficient used as a similarity weight is affected by the number of paired responses and by the significance probability. Correlation coefficients are classified by a significance test and by the number of paired responses, and the change in MAE is studied by comparing the prediction accuracy of the two. The experimental results relate the change in MAE to the significance of the correlation coefficient and the number of paired responses.

Empirical Comparison of Word Similarity Measures Based on Co-Occurrence, Context, and a Vector Space Model

  • Kadowaki, Natsuki;Kishida, Kazuaki
    • Journal of Information Science Theory and Practice / v.8 no.2 / pp.6-17 / 2020
  • Word similarity is often measured to enhance system performance in the information retrieval field and other related areas. This paper reports on an experimental comparison of values for word similarity measures that were computed based on 50 intentionally selected words from a Reuters corpus. There were three targets, including (1) co-occurrence-based similarity measures (for which a co-occurrence frequency is counted as the number of documents or sentences), (2) context-based distributional similarity measures obtained from a latent Dirichlet allocation (LDA), nonnegative matrix factorization (NMF), and Word2Vec algorithm, and (3) similarity measures computed from the tf-idf weights of each word according to a vector space model (VSM). Here, a Pearson correlation coefficient for a pair of VSM-based similarity measures and co-occurrence-based similarity measures according to the number of documents was highest. Group-average agglomerative hierarchical clustering was also applied to similarity matrices computed by individual measures. An evaluation of the cluster sets according to an answer set revealed that VSM- and LDA-based similarity measures performed best.

An Analysis Scheme Design of Customer Spending Pattern using Text Mining (텍스트 마이닝을 이용한 소비자 소비패턴 분석 기법 설계)

  • Jeong, Eun-Hee;Lee, Byung-Kwan
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology / v.11 no.2 / pp.181-188 / 2018
  • In this paper, we propose a scheme for analyzing customer spending patterns using text mining. The proposed scheme first analyzes the similarity of users' ratings using Pearson correlation, then analyzes the similarity of users' reviews using TF-IDF cosine similarity, and finally analyzes the consistency between ratings and reviews using SentiWordNet. We then select the nearest neighbors using rating similarity and review similarity, and provide a recommendation list suited to the consumption pattern. The precision of the recommendation list is 0.79 for Pearson correlation, 0.73 for TF-IDF, and 0.82 for the proposed scheme. That is, the proposed scheme can analyze consumption patterns more accurately because it uses both the quantitative ratings and the qualitative reviews of consumers.

The Effect of Co-rating on the Recommender System of User Base

  • Lee, Hee-Choon;Lee, Seok-Jun;Chung, Young-Jun
    • Journal of the Korean Data and Information Science Society / v.17 no.3 / pp.775-784 / 2006
  • This study investigates the effect of the number of co-rated users on the MAE. User-based collaborative filtering algorithms generally use a similarity weight to quantify the relation between the active user and other users. The original GroupLens estimation algorithm used Pearson's correlation coefficient, and other researchers soon adopted various other weightings. Pearson's correlation coefficient and vector similarity, which is used in the field of information retrieval, are commonly used in the estimation algorithm. We analyze the effect of the number of co-rated users on prediction in the user-based recommender system.

Similarity Measurement Between Titles and Abstracts Using Bijection Mapping and Phi-Correlation Coefficient

  • John N. Mlyahilu;Jong-Nam Kim
    • Journal of the Institute of Convergence Signal Processing / v.23 no.3 / pp.143-149 / 2022
  • This paper presents a quantitative measure of the relationship between a research title and its abstract, using journal articles from various journals indexed in the Korean Citation Index (KCI) database. We propose a machine learning-based similarity metric that does not assume normality of the dataset and that accounts for the imbalanced dataset problem and the zero-variance problem affecting most rule-based algorithms. The advantage of this algorithm is that it eliminates the limitations of the Pearson correlation coefficient (r) and also solves the imbalanced dataset problem. A total of 107 journal articles collected from the database were used to build a corpus containing the authors, year of publication, title, and abstract of each. In the experiments, the proposed algorithm achieved higher correlation values than cosine similarity, Euclidean distance, and the Pearson correlation coefficient, reaching a maximum correlation of 1, whereas the other measures returned not-a-number (NaN) values in some experiments. From these results, we conclude that an effective title must have a high correlation coefficient with its abstract.