• Title/Summary/Keyword: Interestingness measure

Search Result 24, Processing Time 0.022 seconds

Utilization of similarity measures by PIM with AMP as association rule thresholds (모든 주변 비율을 고려한 확률적 흥미도 측도 기반 유사성 측도의 연관성 평가 기준 활용 방안)

  • Park, Hee Chang
    • Journal of the Korean Data and Information Science Society
    • /
    • v.24 no.1
    • /
    • pp.117-124
    • /
    • 2013
  • Association rule of data mining techniques is the method to quantify the relationship between a set of items in a huge database, andhas been applied in various fields like internet shopping mall, healthcare, insurance, and education. There are three primary interestingness measures for association rule, support and confidence and lift. Confidence is the most important measure of these measures, and we generate some association rules using confidence. But it is an asymmetric measure and has only positive value. So we can face with difficult problems in generation of association rules. In this paper we apply the similarity measures by probabilistic interestingness measure (PIM) with all marginal proportions (AMP) to solve this problem. The comparative studies with support, confidences, lift, chi-square statistics, and some similarity measures by PIM with AMPare shown by numerical example. As the result, we knew that the similarity measures by PIM with AMP could be seen the degree of association same as confidence. And we could confirm the direction of association because they had the sign of their values, and select the best similarity measure by PIM with AMP.

Bounds of PIM-based similarity measures with partially marginal proportion (부분적 주변 비율에 의한 확률적 흥미도 측도 기반 유사성 측도의 상한 및 하한의 설정)

  • Park, Hee Chang
    • Journal of the Korean Data and Information Science Society
    • /
    • v.26 no.4
    • /
    • pp.857-864
    • /
    • 2015
  • By Wikipedia, data mining is the computational process of discovering patterns in huge data sets involving methods at the intersection of association rule, decision tree, clustering, artificial intelligence, machine learning. Clustering or cluster analysis is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups. The similarity measures being used in the clustering may be classified into various types depending on the characteristics of data. In this paper, we computed bounds for similarity measures based on the probabilistic interestingness measure with partially marginal probability such as Peirce I, Peirce II, Cole I, Cole II, Loevinger, Park I, and Park II measure. We confirmed the absolute value of Loevinger measure wasthe upper limit of the absolute value of any other existing measures. Ordering of other measures is determined by the size of concurrence proportion, non-simultaneous occurrence proportion, and mismatch proportion.

Effect of Market Basket Size on the Accuracy of Association Rule Measures (장바구니 크기가 연관규칙 척도의 정확성에 미치는 영향)

  • Kim, Nam-Gyu
    • Asia pacific journal of information systems
    • /
    • v.18 no.2
    • /
    • pp.95-114
    • /
    • 2008
  • Recent interests in data mining result from the expansion of the amount of business data and the growing business needs for extracting valuable knowledge from the data and then utilizing it for decision making process. In particular, recent advances in association rule mining techniques enable us to acquire knowledge concerning sales patterns among individual items from the voluminous transactional data. Certainly, one of the major purposes of association rule mining is to utilize acquired knowledge in providing marketing strategies such as cross-selling, sales promotion, and shelf-space allocation. In spite of the potential applicability of association rule mining, unfortunately, it is not often the case that the marketing mix acquired from data mining leads to the realized profit. The main difficulty of mining-based profit realization can be found in the fact that tremendous numbers of patterns are discovered by the association rule mining. Due to the many patterns, data mining experts should perform additional mining of the results of initial mining in order to extract only actionable and profitable knowledge, which exhausts much time and costs. In the literature, a number of interestingness measures have been devised for estimating discovered patterns. Most of the measures can be directly calculated from what is known as a contingency table, which summarizes the sales frequencies of exclusive items or itemsets. A contingency table can provide brief insights into the relationship between two or more itemsets of concern. However, it is important to note that some useful information concerning sales transactions may be lost when a contingency table is constructed. For instance, information regarding the size of each market basket(i.e., the number of items in each transaction) cannot be described in a contingency table. It is natural that a larger basket has a tendency to consist of more sales patterns. Therefore, if two itemsets are sold together in a very large basket, it can be expected that the basket contains two or more patterns and that the two itemsets belong to mutually different patterns. Therefore, we should classify frequent itemset into two categories, inter-pattern co-occurrence and intra-pattern co-occurrence, and investigate the effect of the market basket size on the two categories. This notion implies that any interestingness measures for association rules should consider not only the total frequency of target itemsets but also the size of each basket. There have been many attempts on analyzing various interestingness measures in the literature. Most of them have conducted qualitative comparison among various measures. The studies proposed desirable properties of interestingness measures and then surveyed how many properties are obeyed by each measure. However, relatively few attentions have been made on evaluating how well the patterns discovered by each measure are regarded to be valuable in the real world. In this paper, attempts are made to propose two notions regarding association rule measures. First, a quantitative criterion for estimating accuracy of association rule measures is presented. According to this criterion, a measure can be considered to be accurate if it assigns high scores to meaningful patterns that actually exist and low scores to arbitrary patterns that co-occur by coincidence. Next, complementary measures are presented to improve the accuracy of traditional association rule measures. By adopting the factor of market basket size, the devised measures attempt to discriminate the co-occurrence of itemsets in a small basket from another co-occurrence in a large basket. Intensive computer simulations under various workloads were performed in order to analyze the accuracy of various interestingness measures including traditional measures and the proposed measures.

A study on the ordering of PIM family similarity measures without marginal probability (주변 확률을 고려하지 않는 확률적 흥미도 측도 계열 유사성 측도의 서열화)

  • Park, Hee Chang
    • Journal of the Korean Data and Information Science Society
    • /
    • v.26 no.2
    • /
    • pp.367-376
    • /
    • 2015
  • Today, big data has become a hot keyword in that big data may be defined as collection of data sets so huge and complex that it becomes difficult to process by traditional methods. Clustering method is to identify the information in a big database by assigning a set of objects into the clusters so that the objects in the same cluster are more similar to each other clusters. The similarity measures being used in the cluster analysis may be classified into various types depending on the nature of the data. In this paper, we computed upper and lower limits for probability interestingness measure based similarity measures without marginal probability such as Yule I and II, Michael, Digby, Baulieu, and Dispersion measure. And we compared these measures by real data and simulated experiment. By Warrens (2008), Coefficients with the same quantities in the numerator and denominator, that are bounded, and are close to each other in the ordering, are likely to be more similar. Thus, results on bounds provide means of classifying various measures. Also, knowing which coefficients are similar provides insight into the stability of a given algorithm.

Exploration of relationship between confirmation measures and association thresholds (기준 확인 측도와 연관성 평가기준과의 관계 탐색)

  • Park, Hee Chang
    • Journal of the Korean Data and Information Science Society
    • /
    • v.24 no.4
    • /
    • pp.835-845
    • /
    • 2013
  • Association rule of data mining techniques is the method to quantify the relevance between a set of items in a big database, andhas been applied in various fields like manufacturing industry, shopping mall, healthcare, insurance, and education. Philosophers of science have proposed interestingness measures for various kinds of patterns, analyzed their theoretical properties, evaluated them empirically, and suggested strategies to select appropriate measures for particular domains and requirements. Such interestingness measures are divided into objective, subjective, and semantic measures. Objective measures are based on data used in the discovery process and are typically motivated by statistical considerations. Subjective measures take into account not only the data but also the knowledge and interests of users who examine the pattern, while semantic measures additionally take into account utility and actionability. In a very different context, researchers have devoted a lot of attention to measures of confirmation or evidential support. The focus in this paper was on asymmetric confirmation measures, and we compared confirmation measures with basic association thresholds using some simulation data. As the result, we could distinguish the direction of association rule by confirmation measures, and interpret degree of association operationally by them. Futhermore, the result showed that the measure by Rips and that by Kemeny and Oppenheim were better than other confirmation measures.

A Measure for Improvement in Quality of Association Rules in the Item Response Dataset (문항 응답 데이터에서 문항간 연관규칙의 질적 향상을 위한 도구 개발)

  • Kwak, Eun-Young;Kim, Hyeoncheol
    • The Journal of Korean Association of Computer Education
    • /
    • v.10 no.3
    • /
    • pp.1-8
    • /
    • 2007
  • In this paper, we introduce a new measure called surprisal that estimates the informativeness of transactional instances and attributes in the item response dataset and improve the quality of association rules. In order to this, we set artificial dataset and eliminate noisy and uninformative data using the surprisal first, and then generate association rules between items. And we compare the association rules from the dataset after surprisal-based pruning with support-based pruning and original dataset unpruned. Experimental result that the surprisal-based pruning improves quality of association rules in question item response datasets significantly.

  • PDF

The proposition of compared and attributably pure confidence in association rule mining (연관 규칙 마이닝에서 비교 기여 순수 신뢰도의 제안)

  • Park, Hee Chang
    • Journal of the Korean Data and Information Science Society
    • /
    • v.24 no.3
    • /
    • pp.523-532
    • /
    • 2013
  • Generally, data mining is the process of analyzing big data from different perspectives and summarizing it into useful information. The most widely used data mining technique is to generate association rules, and it finds the relevance between two items in a huge database. This technique has been used to find the relationship between each set of items based on the interestingness measures such as support, confidence, lift, etc. Among many interestingness measures, confidence is the most frequently used, but it has the drawback that it can not determine the direction of the association. The attributably pure confidence and compared confidence are able to determine the direction of the association, but their ranges are not [-1, +1]. So we can not interpret the degree of association operationally by their values. This paper propose a compared and attributably pure confidence to compensate for this drawback, and then describe some properties for a proposed measure. The comparative studies with confidence, compared confidence, attributably pure confidence, and a proposed measure are shown by numerical example. The results show that the a compared and attributably pure confidence is better than any other confidences.

Frequent Pattern Mining By using a Completeness for BigData (빅데이터에 대한 Completeness를 이용한 빈발 패턴 마이닝)

  • Park, In-Kyu
    • Journal of Korea Game Society
    • /
    • v.18 no.2
    • /
    • pp.121-130
    • /
    • 2018
  • Most of those studies use frequency, the number of times a pattern appears in a transaction database, as the key measure for pattern interestingness. It prerequisites that any interesting pattern should occupy a maximum portion of the transactions it appears. But in our real world scenarios the completeness of any pattern is more likely to become various in transactions. Hence, we should also consider the problem of finding the qualified patterns with the significant values of the weighted support by completeness in order to reduce the loss of information within any pattern in transaction. In these pattern recommendation applications, patterns with higher completeness may lead to higher recall while patterns with higher completeness may lead to higher recall while patterns with higher frequency lead to higher precision. In this paper, we propose a measure of weighted support and completeness and an algorithm WSCFPM(weigted support and completeness frequent pattern mining). Our algorithm handles the invalidation of the monotone or anti-monotone property which does not hold on completeness. Extensive performance analysis show that our algorithm is very efficient and scalable for word pattern mining.

The proposition of attributably pure confidence in association rule mining (연관 규칙 마이닝에서 기여 순수 신뢰도의 제안)

  • Park, Hee-Chang
    • Journal of the Korean Data and Information Science Society
    • /
    • v.22 no.2
    • /
    • pp.235-243
    • /
    • 2011
  • The most widely used data mining technique is to explore association rules. This technique has been used to find the relationship between each set of items based on the association thresholds such as support, confidence, lift, etc. There are many interestingness measures as the criteria for evaluating association rules. Among them, confidence is the most frequently used, but it has the drawback that it can not determine the direction of the association. The net confidence measure was developed to compensate for this drawback, but it is useless in the case that the value of positive confidence is the same as that of negative confidence. This paper propose a attributably pure confidence to evaluate association rules and then describe some properties for a proposed measure. The comparative studies with confidence, net confidence, and attributably pure confidence are shown by numerical example. The results show that the attributably pure confidence is better than confidence or net confidence.

Proposition of negatively pure association rule threshold (음의 순수 연관성 규칙 평가 기준의 제안)

  • Park, Hee-Chang
    • Journal of the Korean Data and Information Science Society
    • /
    • v.22 no.2
    • /
    • pp.179-188
    • /
    • 2011
  • Association rule represents the relationship between items in a massive database by quantifying their relationship, and is used most frequently in data mining techniques. In general, association rule technique generates the rule, 'If A, then B.', whereas negative association rule technique generates the rule, 'If A, then not B.', or 'If not A, then B.'. We can determine whether we promote other products in addition to promote its products only if we add negative association rules to existing association rules. In this paper, we proposed the negatively pure association rules by negatively pure support, negatively pure confidence, and negatively pure lift to overcome the problems faced by negative association rule technique. In checking the usefulness of this technique through numerical examples, we could find the direction of association by the sign of the negatively pure association rule measure.