• Title/Summary/Keyword: Jaccard similarity

Semantic Process Retrieval with Similarity Algorithms (유사도 알고리즘을 활용한 시맨틱 프로세스 검색방안)

  • Lee, Hong-Joo;Klein, Mark
    • Asia pacific journal of information systems / v.18 no.1 / pp.79-96 / 2008
  • One of the roles of Semantic Web services is to execute dynamic intra-organizational services including the integration and interoperation of business processes. Since different organizations design their processes differently, the retrieval of similar semantic business processes is necessary in order to support inter-organizational collaborations. Most approaches for finding services that have certain features and support certain business processes have relied on some type of logical reasoning and exact matching. This paper presents our approach of using imprecise matching for expanding results from an exact matching engine to query the OWL (Web Ontology Language) MIT Process Handbook. The MIT Process Handbook is an electronic repository of best-practice business processes. The Handbook is intended to help people (1) redesign organizational processes, (2) invent new processes, and (3) share ideas about organizational practices. In order to use the MIT Process Handbook for process retrieval experiments, we had to export it into an OWL-based format. We model the Process Handbook meta-model in OWL and export the processes in the Handbook as instances of the meta-model. Next, we need to find a sizable number of queries and their corresponding correct answers in the Process Handbook. Many previous studies devised artificial datasets composed of randomly generated numbers without real meaning and used subjective ratings for correct answers and similarity values between processes. To generate a semantic-preserving test data set, we create 20 variants for each target process that are syntactically different but semantically equivalent using mutation operators. These variants represent the correct answers of the target process. We devise diverse similarity algorithms based on values of process attributes and structures of business processes.
We use simple similarity algorithms for text retrieval such as TF-IDF and Levenshtein edit distance to devise our approaches, and use a tree edit distance measure because semantic processes appear to have a graph structure. We also design similarity algorithms that consider the similarity of process structure, such as part process, goal, and exception. Since we can identify relationships between a semantic process and its subcomponents, this information can be used to calculate similarities between processes. Dice's coefficient and Jaccard similarity measures are used to calculate the proportion of overlap between processes in diverse ways. We perform retrieval experiments to compare the performance of the devised similarity algorithms. We measure the retrieval performance in terms of precision, recall, and F-measure, the harmonic mean of precision and recall. The tree edit distance shows the poorest performance on all measures. TF-IDF and the method combining the TF-IDF measure with Levenshtein edit distance perform better than the other devised methods. These two measures focus on the similarity between the names and descriptions of processes. In addition, we calculate a rank correlation coefficient, Kendall's tau b, between the number of process mutations and the ranking of similarity values among the mutation sets. In this experiment, similarity measures based on process structure, such as Dice's, Jaccard, and derivatives of these measures, show greater coefficients than measures based on values of process attributes. However, the Lev-TFIDF-JaccardAll measure, which considers process structure and attribute values together, shows reasonably better performance in these two experiments. For retrieving semantic processes, it therefore appears better to consider diverse aspects of process similarity, such as process structure and values of process attributes.
We generate semantic process data and a dataset for retrieval experiments from the MIT Process Handbook repository. We suggest imprecise query algorithms that expand retrieval results from an exact matching engine such as SPARQL, and compare the retrieval performance of the similarity algorithms. As for limitations and future work, experiments with datasets from other domains are needed. Also, since there are many similarity values from diverse measures, we may find better ways to identify relevant processes by applying these values simultaneously.
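The overlap measures and the F-measure named in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each process is reduced to a set of subcomponent names, and the sample process parts are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| (1.0 by convention if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dice(a: set, b: set) -> float:
    """Dice's coefficient 2|A ∩ B| / (|A| + |B|) (1.0 by convention if both are empty)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


# Hypothetical part-process names for a target process and one mutated variant.
target = {"identify need", "select supplier", "place order", "receive goods"}
variant = {"identify need", "choose vendor", "place order", "receive goods"}

print(jaccard(target, variant))  # 3 shared parts out of 5 distinct parts
print(dice(target, variant))     # 2 * 3 shared parts over 4 + 4 parts
print(f_measure(0.5, 1.0))
```

Dice's coefficient counts the shared elements twice, so it is never lower than the Jaccard similarity for the same pair of sets.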

Malicious Trojan Horse Application Discrimination Mechanism using Realtime Event Similarity on Android Mobile Devices (안드로이드 모바일 단말에서의 실시간 이벤트 유사도 기반 트로이 목마 형태의 악성 앱 판별 메커니즘)

  • Ham, You Joung;Lee, Hyung-Woo
    • Journal of Internet Computing and Services / v.15 no.3 / pp.31-43 / 2014
  • A large number of Android mobile applications have been developed and deployed through the Android open market as the number of users of Android-based smart work devices has grown. However, security vulnerabilities have been discovered in malicious applications developed and deployed through the open market or third-party markets. Most malicious applications that embed Trojan horse code leak the user's personal and financial information from the mobile device to an external server without the user's knowledge. Therefore, to minimize the damage caused by the steadily increasing number of malicious applications, a proactive detection mechanism is required. In this paper, we analyze the pros and cons of existing techniques for detecting malicious applications, and present discrimination and detection results obtained with a malicious application discrimination mechanism based on Jaccard similarity, applied to events collected during real-time execution on Android mobile devices.

Comparison of User-generated Tags with Subject Descriptors, Author Keywords, and Title Terms of Scholarly Journal Articles: A Case Study of Marine Science

  • Vaidya, Praveenkumar;Harinarayana, N.S.
    • Journal of Information Science Theory and Practice / v.7 no.1 / pp.29-38 / 2019
  • Information retrieval is the challenge of the Web 2.0 world. The experiment of knowledge organisation in the context of abundant information available from various sources proves a major hurdle in obtaining information retrieval with greater precision and recall. The fast-changing landscape of information organisation through social networking sites at a personal level creates a world of opportunities for data scientists and also library professionals to assimilate the social data with expert-created data. Thus, folksonomies or social tags play a vital role in information organisation and retrieval. The comparison of these user-created tags with expert-created index terms, author keywords, and title words will throw light on the differentiation between these sets of data. Such comparative studies reveal a new set of terms that enhance subject access and reflect the extent of similarity between user-generated tags and the other sets of terms. The CiteULike tags extracted from 5,150 scholarly journal articles in marine science were compared with corresponding Aquatic Science and Fisheries Abstracts descriptors, author keywords, and title terms. The Jaccard similarity coefficient method was employed to compare the social tags with the above-mentioned word sets, and the results showed the presence of user-generated keywords in Aquatic Science and Fisheries Abstracts descriptors, author keywords, and title words. When information retrieval techniques such as stemming and lemmatization were applied, the results were found to enhance keyword-based subject access.

Performance Analysis of Forwarding Schemes Based on Similarities for Opportunistic Networks (기회적 네트워크에서의 유사도 기반의 포워딩 기법의 성능 분석)

  • Kim, Sun-Kyum;Lee, Tae-Seok;Kim, Wan-Jong
    • KIISE Transactions on Computing Practices / v.24 no.3 / pp.145-150 / 2018
  • Forwarding in opportunistic networks shows low performance because there may be no connecting path between the source and destination nodes due to intermittent connectivity. Social network analysis is currently an active research topic, and similarity is one of its methods. In this paper, we propose forwarding schemes based on representative similarities and evaluate how much the forwarding performance increases. Because the schemes are based on similarities, they forward messages only to relay nodes with higher similarity, toward the destination node. These schemes have low network traffic and hop count while maintaining a stable transmission delay.

RAPD Analysis of Three Deer Species in Malaysia

  • El-Jaafari, Habiba A.A.;Panandam, Jothi M.;Idris, Ismail;Siraj, Siti Shapor
    • Asian-Australasian Journal of Animal Sciences / v.21 no.9 / pp.1233-1237 / 2008
  • The genetic variability within and among three deer species in Malaysia, namely Cervus nippon (sika), Cervus timorensis (rusa) and Cervus unicolor (sambar), was evaluated using the RAPD technique. The DNA extracted from the buffy coat of 34 sika, 38 rusa and 9 sambar was analysed using ten primers that gave bands which showed good resolution. The primers generated 164 RAPD markers in total, and these ranged in size from 150 to 900 bp. The percentage of polymorphism of the bands generated per primer ranged from 66.66-93.33% for rusa, 36.84-61.14% for sambar and 52.38-100% for sika. The overall percentage of polymorphism observed for the 164 RAPD markers was 99.39%. The results revealed five exclusive, monomorphic markers for sambar and one exclusive, monomorphic marker for sika; none was observed for rusa. However, these cannot be declared as markers for the identification of the species without analysis of more samples, populations and species. The means of within-population genetic distances, based on Dice's and Jaccard's similarity indices, were similar for the rusa (0.383 and 0.542, respectively) and sika (0.397 and 0.558, respectively) populations, with the sambar population being the least variable (0.194 and 0.323, respectively). The Dice-based genetic distances within the species ranged from 0.194 to 0.397 and the genetic distances among the species were 0.791-0.911. The genetic distances based on Dice's and Jaccard's similarity indices between the rusa and sambar were 0.556 and 0.713, between the rusa and sika populations were 0.552 and 0.710, and between sambar and sika were 0.622 and 0.766, respectively.
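The paired Dice and Jaccard figures in this abstract are linked by a fixed identity, sketched below. The sketch assumes that distance is defined as 1 minus similarity, which the abstract does not state; the function names and the band sets are hypothetical. For a single pair of samples the identity is exact, while for means over many pairs it holds only approximately because the mapping is nonlinear.

```python
def dice_to_jaccard_similarity(s_dice: float) -> float:
    """Convert a Dice similarity to a Jaccard similarity: J = D / (2 - D)."""
    return s_dice / (2.0 - s_dice)


def dice_to_jaccard_distance(d_dice: float) -> float:
    """Convert a Dice distance (1 - D) to a Jaccard distance (1 - J).

    Substituting D = 1 - d into J = D / (2 - D) gives 1 - J = 2d / (1 + d).
    """
    return 2.0 * d_dice / (1.0 + d_dice)


# Check the identity on two hypothetical band sets (marker IDs present per sample).
bands_a = {1, 2, 3, 4}
bands_b = {3, 4, 5}
shared = len(bands_a & bands_b)
dice_sim = 2 * shared / (len(bands_a) + len(bands_b))
jaccard_sim = shared / len(bands_a | bands_b)
print(abs(dice_to_jaccard_similarity(dice_sim) - jaccard_sim) < 1e-12)

# Between-species Dice distances quoted in the abstract, converted for comparison
# with the quoted Jaccard distances (0.713, 0.710, 0.766).
for d in (0.556, 0.552, 0.622):
    print(d, round(dice_to_jaccard_distance(d), 3))
```

Under that assumption the converted values land within a few thousandths of the Jaccard-based distances quoted for the species pairs.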

Vegetation Types and Ecological Characteristics of Larix kaempferi Plantations in Baekdudaegan Protected Area, South Korea (백두대간 보호지역 일본잎갈나무림의 현존식생 유형과 생태적 특성)

  • Oh, Seung-Hwan;Kim, Jun-Soo;Cho, Joon-Hee;Cho, Hyun-Je
    • Journal of Korean Society of Forest Science / v.110 no.4 / pp.530-542 / 2021
  • To establish the basic unit for the ecological management of the Larix kaempferi plantations in the Baekdudaegan protected area, we classified the vegetation types using TWINSPAN and DCA ordination analysis based on the vegetation information collected from 119 plots and analyzed their spatial arrangement status. Vegetation types were classified into seven types: Quercus mongolica-Rhododendron schlippenbachii type, Q. mongolica-Lespedeza maximowiczii type, Cornus controversa-Morus australis type, Q. mongolica-Carpinus cordata type, Lindera erythrocarpa-Rosa multiflora type, Q. serrata-Zanthoxylum schinifolium type, and Q. serrata-Sasa borealis type. These types generally reflected differences in floristic composition according to latitude, elevation, establishment period, operation history, characteristics of the surrounding stands, and degree of disturbance. Furthermore, the Jaccard coefficient was used to compare the floristic composition similarity between Larix kaempferi plantations and the surrounding potential natural vegetation (Q. mongolica and Q. serrata forests). Although it differed somewhat by vegetation type, it averaged 0.21 with Q. mongolica forest and 0.32 with Q. serrata forest, indicating that the floristic composition was still heterogeneous.

3-Step Security Vulnerability Risk Scoring considering CVE Trends (CVE 동향을 반영한 3-Step 보안 취약점 위험도 스코어링)

  • Jihye, Lim;Jaewoo, Lee
    • Journal of the Korea Institute of Information and Communication Engineering / v.27 no.1 / pp.87-96 / 2023
  • As the number of security vulnerabilities increases yearly, security threats continue to occur, and the risk posed by each vulnerability matters as well. We devise a security threat score calculation that reflects trends in order to determine the risk of security vulnerabilities. The three stages consider key elements such as attack type, supplier, vulnerability trend, and current attack methods and techniques. First, it reflects the results of checking the relevance of the attack type, supplier, and CVE. Second, it considers the characteristics of the topic groups identified by the LDA algorithm and of the CVE, compared using the Jaccard similarity technique. Third, it considers the attack methods and technology trends in the latest version of the MITRE ATT&CK framework and their relevance to the CVE. We used data from overseas sites that provide reliable security information to review the usability of the proposed final formula, CTRS. The scoring formula makes it possible to patch quickly and respond to related information by identifying vulnerabilities with high relevance and risk from just a few particular phrases.

Cluster Analysis with Balancing Weight on Mixed-type Data

  • Chae, Seong-San;Kim, Jong-Min;Yang, Wan-Youn
    • Communications for Statistical Applications and Methods / v.13 no.3 / pp.719-732 / 2006
  • A set of clustering algorithms is presented in which the distance formulation carries a proper weight and extends to mixed numeric and multiple binary values. The simple matching and Jaccard coefficients are used to measure similarity between objects for multiple binary attributes. Similarities are converted to dissimilarities between the i-th and j-th objects. The performance of clustering algorithms with balancing weight on different similarity measures is demonstrated. Our experiments show that clustering algorithms with a proper weight applied give a competitive recovery level when a dataset with mixed numeric and multiple binary attributes is clustered.
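The two binary-attribute coefficients mentioned here differ only in how they treat joint absences, as the following sketch shows. It is an illustration under assumptions: the conversion to dissimilarity is taken to be 1 minus similarity, which the abstract does not specify, and the weighting scheme of the paper is not reproduced. All names and sample vectors are hypothetical.

```python
def binary_counts(x, y):
    """Return (both present, both absent, mismatches) for two 0/1 vectors."""
    if len(x) != len(y) or not x:
        raise ValueError("vectors must be non-empty and of equal length")
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    neither = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return both, neither, len(x) - both - neither


def simple_matching(x, y) -> float:
    """Simple matching coefficient: joint presences and joint absences both count."""
    both, neither, _ = binary_counts(x, y)
    return (both + neither) / len(x)


def jaccard_binary(x, y) -> float:
    """Jaccard coefficient: joint absences are ignored."""
    both, _, mismatch = binary_counts(x, y)
    return both / (both + mismatch) if both + mismatch else 1.0


# Two hypothetical objects described by six binary attributes.
obj_i = [1, 0, 0, 1, 0, 0]
obj_j = [1, 0, 0, 0, 0, 1]

# Assumed conversion: dissimilarity = 1 - similarity.
print(1 - simple_matching(obj_i, obj_j))
print(1 - jaccard_binary(obj_i, obj_j))
```

Because most attributes are absent in both objects, simple matching rates the pair as more similar than the Jaccard coefficient does.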

Min-Max Hash for Similarity Measurement based on Multiset (Min-Max Hash를 활용한 다중 집합 기반의 유사도 측정)

  • Yoon, Jin-Uk;Kim, Byoungwook
    • Proceedings of the Korea Information Processing Society Conference / 2019.05a / pp.36-39 / 2019
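  • In data mining, clustering is a method of grouping data with similar characteristics into the same class. Various clustering methods exist, but Jaccard similarity is typically used to measure the similarity of data represented as sets. Jaccard similarity measures similarity by evaluating, in relative terms, the portion shared by two different sets. However, as recent advances in storage technology and media have expanded the domain and range of data that can be represented, computing it incurs a large cost in operations and time. To address this, Min-Hash was proposed, which estimates the similarity of the actual data from the similarity of samples of the two datasets. In this paper, we build on it and propose Min-Max Hash, which extends the domain of sets to multisets and efficiently estimates the similarity between two datasets that may contain duplicate values.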
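The two building blocks this abstract refers to can be sketched as follows: standard Min-Hash estimation of Jaccard similarity for sets, and the exact Jaccard similarity of multisets that such an estimator would target. This is not the paper's Min-Max Hash, whose details the abstract does not give. The salted SHA-1 hash family, the signature length of 256, and all function names are assumptions made for illustration.

```python
import hashlib
from collections import Counter


def _salted_hash(seed: int, item) -> int:
    """Deterministic 64-bit hash of an item under a given seed (assumed hash family)."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes: int = 256):
    """For each seeded hash function, keep the minimum hash over a non-empty set."""
    return [min(_salted_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b) -> float:
    """The fraction of positions where the two signatures agree estimates Jaccard."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def multiset_jaccard(a: Counter, b: Counter) -> float:
    """Exact multiset Jaccard: sum of minimum counts over sum of maximum counts."""
    denom = sum((a | b).values())  # Counter union keeps the maximum count per key
    return sum((a & b).values()) / denom if denom else 1.0


set_a = set(range(0, 100))
set_b = set(range(50, 150))  # true Jaccard similarity is 50 / 150
print(estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b)))
print(multiset_jaccard(Counter("aab"), Counter("abb")))
```

The signatures have a fixed length regardless of set size, which is where the saving in computation comes from; a longer signature narrows the estimation error at the cost of more hashing.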

Analysis of Genetic Variability Using RAPD Markers in Paeonia spp. Grown in Korea

  • Lim, Mi Young;Jana, Sonali;Sivanesan, Iyyakkannu;Park, Hyun Rho;Hwang, Ji Hyun;Park, Young Hoon;Jeong, Byoung Ryong
    • Horticultural Science & Technology / v.31 no.3 / pp.322-327 / 2013
  • The genetic diversity and phylogenetic relationships of eleven herbaceous peonies grown in Korea were analyzed by random amplified polymorphic DNA (RAPD). Twenty-four decamer RAPD primers were used in a comparative analysis of these Korean peony species. Of the 142 total RAPD fragments amplified, 124 (87.3%) were found to be polymorphic. The remaining 18 fragments (12.7%) were monomorphic, shared by individuals of all 11 peony species. Cluster analysis based on the presence or absence of bands was performed using Jaccard's similarity coefficient and the Unweighted Pair Group Method with Arithmetic Averages. Genetic similarity ranged from 0.39 to 0.90 with a mean of 0.64. This study offers a rapid and reliable method for estimating variability among different peony species, which breeders could use for further improvement of the local peony species. The results also suggest that the RAPD marker technique is a useful tool for evaluating genetic diversity and relationships among different peony species.