• Title/Summary/Keyword: Jaccard Coefficient

Exploration of Hierarchical Techniques for Clustering Korean Author Names (한글 저자명 군집화를 위한 계층적 기법 비교)

  • Kang, In-Su
    • Journal of Information Management / v.40 no.2 / pp.95-115 / 2009
  • Author resolution disambiguates occurrences of the same author name into real individuals. To do this, pairwise similarities are computed between author name entities, and clustering is then performed. Many studies have employed hierarchical clustering for author disambiguation, but the various hierarchical clustering methods have not been sufficiently investigated. This study presents an empirical evaluation and analysis of hierarchical clustering applied to Korean author resolution, using multiple distance functions such as the Dice coefficient, cosine similarity, Euclidean distance, the Jaccard coefficient, and the Pearson correlation coefficient.

Vegetation Types and Ecological Characteristics of Larix kaempferi Plantations in Baekdudaegan Protected Area, South Korea (백두대간 보호지역 일본잎갈나무림의 현존식생 유형과 생태적 특성)

  • Oh, Seung-Hwan;Kim, Jun-Soo;Cho, Joon-Hee;Cho, Hyun-Je
    • Journal of Korean Society of Forest Science / v.110 no.4 / pp.530-542 / 2021
  • To establish the basic unit for the ecological management of Larix kaempferi plantations in the Baekdudaegan protected area, we classified vegetation types using TWINSPAN and DCA ordination based on vegetation information collected from 119 plots, and analyzed their spatial arrangement. Seven vegetation types were identified: the Quercus mongolica-Rhododendron schlippenbachii type, Q. mongolica-Lespedeza maximowiczii type, Cornus controversa-Morus australis type, Q. mongolica-Carpinus cordata type, Lindera erythrocarpa-Rosa multiflora type, Q. serrata-Zanthoxylum schinifolium type, and Q. serrata-Sasa borealis type. These types generally reflected differences in floristic composition according to latitude, elevation, establishment period, operation history, characteristics of the surrounding stands, and degree of disturbance. Furthermore, the Jaccard coefficient was used to compare the floristic composition of the Larix kaempferi plantations with that of the surrounding potential natural vegetation (Q. mongolica and Q. serrata forests). Although the values varied somewhat by vegetation type, the similarity averaged 0.21 with Q. mongolica forest and 0.32 with Q. serrata forest, indicating that the floristic composition was still heterogeneous.

Comparison of User-generated Tags with Subject Descriptors, Author Keywords, and Title Terms of Scholarly Journal Articles: A Case Study of Marine Science

  • Vaidya, Praveenkumar;Harinarayana, N.S.
    • Journal of Information Science Theory and Practice / v.7 no.1 / pp.29-38 / 2019
  • Information retrieval is the challenge of the Web 2.0 world. Organising knowledge amid abundant information from various sources is a major hurdle to achieving retrieval with greater precision and recall. The fast-changing landscape of information organisation through social networking sites at a personal level creates a world of opportunities for data scientists and library professionals to combine social data with expert-created data. Thus, folksonomies or social tags play a vital role in information organisation and retrieval. Comparing these user-created tags with expert-created index terms, author keywords, and title words sheds light on the differences between these sets of data. Such comparative studies reveal new sets of terms that can enhance subject access and show the extent of similarity between user-generated tags and other sets of terms. The CiteULike tags extracted from 5,150 scholarly journal articles in marine science were compared with the corresponding Aquatic Science and Fisheries Abstracts descriptors, author keywords, and title terms. The Jaccard similarity coefficient was used to compare the social tags with the above-mentioned word sets, and the results confirmed the presence of user-generated keywords in Aquatic Science and Fisheries Abstracts descriptors, author keywords, and title words. When information retrieval techniques such as stemming and lemmatization were applied, the results were found to improve the usefulness of keywords for subject access.
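As a rough illustration of the comparison described in this abstract, the sketch below computes the Jaccard similarity between two term sets, with and without a simple normalization step. The term lists are hypothetical, and the `normalize` helper is only a lowercase-and-strip stand-in for the stemming and lemmatization mentioned in the abstract. It is not the authors' procedure.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections of terms."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(set_a & set_b) / len(union)


def normalize(terms):
    """Lowercase and strip each term (an assumed stand-in for stemming/lemmatization)."""
    return {t.strip().lower() for t in terms}


# Hypothetical term sets for one article (not data from the study).
tags = ["Coral Reefs", "bleaching", "climate change"]
descriptors = ["coral reefs", "Bleaching", "sea surface temperature", "climate change"]

raw_score = jaccard(tags, descriptors)                          # exact string match
norm_score = jaccard(normalize(tags), normalize(descriptors))   # after normalization

print(f"raw: {raw_score:.3f}, normalized: {norm_score:.3f}")
```

In this toy case, normalization raises the score because terms that differ only in letter case are counted as shared.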

Genetic Stability of the Plant-materials Induced in the Process of in vitro Organogenesis of Japanese Blood Grass (화본과 식물의 기내 기관분화 단계별 기관분화체의 유전적 안전성)

  • Ye-Jin Lee;In-Jin Kang;Chang-Hyu Bae
    • Proceedings of the Plant Resources Society of Korea Conference / 2023.04a / pp.35-35 / 2023
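  • Securing a stable supply of seedlings is important in plug-seedling production for smart crop production, and mass propagation of seedlings with high genetic stability through in vitro culture is an important step in seedling and plug-seedling production. It is therefore important to remove the barrier of somaclonal variation that arises during in vitro culture. In this study, regenerants at each stage of organogenesis were produced from Japanese blood grass (Imperata cylindrica ‘Rubra’), a member of the grass family, and the genetic stability of the in vitro regenerants during organogenesis was examined. To assess genetic variability with ISSR markers, 21 individuals of 7 types, comprising regenerants at each stage of organogenesis and regenerated plants, were analyzed. Genetic polymorphism in the stage-specific regenerants and the acclimatized regenerants was equal to or higher than that of the control mother plant (1.4%), indicating somewhat lower genetic stability in the regenerants. In addition, the genetic similarity index among the 21 individuals, evaluated with the Jaccard coefficient, ranged from 0.747 to 1.0, with a mean of 0.868. Cluster analysis by the average linkage method based on ISSR marker bands placed all individuals within a similarity index of 0.809 to 1.000. The individuals clustered into two groups at a genetic similarity index of 0.809, and the mother plant was grouped together with the regenerated green plants grown indoors and in the open field. These results provide basic information on the somaclonal variation that occurs during organogenesis in in vitro culture of grass-family plants. These data on the stability of in vitro regenerants during organogenesis should provide a useful background for stable mass propagation of in vitro plants.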

An Experimental Study on Selecting Association Terms Using Text Mining Techniques (텍스트 마이닝 기법을 이용한 연관용어 선정에 관한 실험적 연구)

  • Kim, Su-Yeon;Chung, Young-Mee
    • Journal of the Korean Society for Information Management / v.23 no.3 s.61 / pp.147-165 / 2006
  • In this study, experiments on selecting association terms were conducted to find the best method for choosing additional terms related to an initial query term. Association term sets were generated using the support, confidence, and lift measures of the Apriori algorithm, and also using similarity measures such as GSS, the Jaccard coefficient, the cosine coefficient, Sokal & Sneath 5, and mutual information. Term selection methods were evaluated by the precision of the association terms and by the overlap ratio between the association terms and the indexing terms of relevant documents. The Apriori algorithm and GSS achieved the highest performance.
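The association-rule measures named in this abstract can be sketched for a single term pair as follows. This is a minimal illustration that assumes each document is represented as a set of index terms. The document collection and the function name `association_measures` are hypothetical, and the standard textbook definitions of support, confidence, and lift are used rather than anything specific to the study.

```python
def association_measures(docs, x, y):
    """Support, confidence, lift, and Jaccard for the term pair (x, y).

    docs is a list of sets of terms; the rule considered is x -> y.
    """
    n = len(docs)
    n_x = sum(1 for d in docs if x in d)               # documents containing x
    n_y = sum(1 for d in docs if y in d)               # documents containing y
    n_xy = sum(1 for d in docs if x in d and y in d)   # documents containing both

    support = n_xy / n if n else 0.0
    confidence = n_xy / n_x if n_x else 0.0
    lift = confidence / (n_y / n) if n_y else 0.0
    either = n_x + n_y - n_xy                          # documents containing x or y
    jaccard = n_xy / either if either else 0.0
    return {"support": support, "confidence": confidence,
            "lift": lift, "jaccard": jaccard}


# Hypothetical mini-collection (not data from the study).
docs = [
    {"retrieval", "query", "index"},
    {"retrieval", "query"},
    {"retrieval", "ranking"},
    {"query", "index"},
    {"clustering", "index"},
]

scores = association_measures(docs, "retrieval", "query")
print(scores)
```

A lift above 1 suggests the two terms co-occur more often than independence would predict, which is the usual reason for treating one as a candidate association term for the other.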

RAPD Polymorphism and Genetic Distance among Phenotypic Variants of Tamarindus indica

  • Mayavel, A;Vikashini, B;Bhuvanam, S;Shanthi, A;Kamalakannan, R;Kim, Ki-Won;Kang, Kyu-Suk
    • Journal of Korean Society of Forest Science / v.109 no.4 / pp.421-428 / 2020
  • Tamarind (Tamarindus indica L.) is a multipurpose tree species distributed in tropical and subtropical climates. It is an important fruit-yielding tree that supports livelihoods and has high social and cultural value for rural communities. The vegetative, reproductive, qualitative, and quantitative traits of tamarind vary widely. Characterization of phenotypic and genetic structure is essential for selecting suitable accessions for sustainable cultivation and conservation. This study aimed to examine the genetic relationships among collected accessions of sweet, red, and sour tamarind using Random Amplified Polymorphic DNA (RAPD) primers. Nine accessions were collected from germplasm gene banks and subjected to marker analysis. Fifteen highly polymorphic primers generated a total of 169 fragments, of which 138 bands were polymorphic. The polymorphic information content of the RAPD markers varied from 0.10 to 0.44, and Jaccard's similarity coefficient values ranged from 0.37 to 0.70. The genetic clustering showed sizable genetic variation among the tamarind accessions at the molecular level. The molecular and biochemical variations in the selected accessions are very important for developing varieties with high sugar, anthocyanin, and acidity traits in the ongoing tamarind improvement program.

Cluster Analysis with Balancing Weight on Mixed-type Data

  • Chae, Seong-San;Kim, Jong-Min;Yang, Wan-Youn
    • Communications for Statistical Applications and Methods / v.13 no.3 / pp.719-732 / 2006
  • A set of clustering algorithms is presented in which a proper weight in the distance formulation extends them to data with mixed numeric and multiple binary values. The simple matching and Jaccard coefficients are used to measure similarity between objects on multiple binary attributes. Similarities between the i-th and j-th objects are converted to dissimilarities. The performance of clustering algorithms with a balancing weight on different similarity measures is demonstrated. Our experiments show that clustering algorithms applying a proper weight give a competitive recovery level when data with mixed numeric and multiple binary attributes are clustered.
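The two binary similarity measures named in this abstract differ only in how they treat attributes that are absent from both objects. The sketch below shows that difference on hypothetical 0/1 vectors. The conversion to a dissimilarity as 1 minus the similarity is an assumption made here for illustration, since the abstract does not state which conversion the authors use.

```python
def match_counts(u, v):
    """Count the four agreement/disagreement cases between two 0/1 vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    a = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)  # present in both
    b = sum(1 for x, y in zip(u, v) if x == 1 and y == 0)  # present only in u
    c = sum(1 for x, y in zip(u, v) if x == 0 and y == 1)  # present only in v
    d = sum(1 for x, y in zip(u, v) if x == 0 and y == 0)  # absent from both
    return a, b, c, d


def simple_matching(u, v):
    """Simple matching coefficient: (a + d) / (a + b + c + d)."""
    a, b, c, d = match_counts(u, v)
    total = a + b + c + d
    return (a + d) / total if total else 0.0


def jaccard_binary(u, v):
    """Jaccard coefficient: a / (a + b + c); joint absences are ignored."""
    a, b, c, _ = match_counts(u, v)
    denom = a + b + c
    return a / denom if denom else 0.0


# Hypothetical binary attribute vectors for objects i and j.
obj_i = [1, 1, 0, 0, 0, 0]
obj_j = [1, 0, 1, 0, 0, 0]

smc = simple_matching(obj_i, obj_j)
jac = jaccard_binary(obj_i, obj_j)

# Assumed conversion: dissimilarity = 1 - similarity.
d_smc = 1 - smc
d_jac = 1 - jac
print(f"SMC={smc:.3f} (d={d_smc:.3f}), Jaccard={jac:.3f} (d={d_jac:.3f})")
```

Because the three joint absences count as agreement for simple matching but not for Jaccard, the same pair of objects looks considerably more similar under simple matching.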

Phytoplankton Flora and Community Dynamics of the Seonakdong River (서낙동강의 식물플랑크톤상과 군집동태)

  • Choe, Cheol-Man;Mun, Seong-Gi
    • Proceedings of the Korean Environmental Sciences Society Conference / 2005.05a / pp.257-259 / 2005
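  • The phytoplankton surveyed in the Seonakdong River comprised 128 taxa in 6 classes and 31 families, of which green algae (Chlorophyceae) accounted for 49 taxa (38.3%) and diatoms (Bacillariophyceae) for 44 taxa (34.4%). By season, the number of species peaked at 80 in summer and fell to a minimum of 47 in winter, a result that differed from the usual pattern. By station, the most taxa (63) were found at station 1 in summer and the fewest (18) at station 4 in autumn and winter, so the number of taxa differed widely by season and station. There were 63 ecologically important taxa in total: 33 widely distributed taxa including Actinastrum hantzschii var. fluviatile, 28 sewage-indicator taxa including Ankistrodesmus falcatus, 23 red-tide-causing taxa including Aulacoseira granulata var. angustissima f. spiralis, 8 dominant taxa including Aphanizomenon flos-aquae, and 7 frequently occurring species including Asterionella formosa. Cluster analysis based on Jaccard's coefficient showed that, in almost every season, the stations either separated around the Seonakdong Bridge into an upstream area (st. 1 to st. 3) and a downstream area (st. 4 to st. 6), or formed two groups consisting of a freshwater area (st. 1 to st. 4) and an area expected to be influenced by seawater (st. 5 to st. 6).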

Comparative Analysis of Segmentation Methods in Psoriasis Area (건선 영역 분할기법 비교분석)

  • Yoo, Hyun-Jong;Lee, Ji-Won;Moon, Cho-I;Kim, Eun-Bin;Baek, Yoo-Sang;Jang, Sang-Hoon;Lee, OnSeok
    • Proceedings of the Korea Information Processing Society Conference / 2019.10a / pp.657-659 / 2019
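  • This paper aims to identify the segmentation technique that most effectively isolates psoriasis lesions in skin images. Psoriasis regions are segmented using interactive graph cuts (IGC) and the level set method (LSM), and the Jaccard Index (JI) and Dice Similarity Coefficient (DSC) are then used to propose the segmentation method that is most effective for psoriasis regions.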
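For segmentation evaluation of the kind described in this abstract, the Jaccard Index and Dice Similarity Coefficient are typically computed from the pixel overlap between a predicted mask and a reference mask. The sketch below assumes masks given as nested lists of 0/1 values. The tiny masks and the function name `overlap_scores` are hypothetical, and treating two empty masks as perfect agreement is a convention chosen here.

```python
def overlap_scores(pred, truth):
    """Return (Jaccard index, Dice coefficient) for two binary masks of equal shape."""
    flat_pred = [p for row in pred for p in row]
    flat_truth = [t for row in truth for t in row]
    if len(flat_pred) != len(flat_truth):
        raise ValueError("masks must contain the same number of pixels")

    inter = sum(1 for p, t in zip(flat_pred, flat_truth) if p and t)
    n_pred = sum(1 for p in flat_pred if p)     # pixels labeled lesion in pred
    n_truth = sum(1 for t in flat_truth if t)   # pixels labeled lesion in truth
    union = n_pred + n_truth - inter

    if union == 0:
        return 1.0, 1.0  # both masks empty: treated as perfect agreement here
    return inter / union, 2 * inter / (n_pred + n_truth)


# Hypothetical 3x3 masks (1 = lesion pixel); not data from the paper.
truth = [[0, 1, 1],
         [0, 1, 1],
         [0, 0, 0]]
pred = [[0, 0, 1],
        [0, 1, 1],
        [0, 1, 0]]

ji, dsc = overlap_scores(pred, truth)
print(f"JI={ji:.2f}, DSC={dsc:.2f}")
```

The two scores are monotonically related by DSC = 2·JI / (1 + JI), so they rank competing segmentation methods identically on a single image. Reporting both mainly helps comparison with other studies.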

Semantic Process Retrieval with Similarity Algorithms (유사도 알고리즘을 활용한 시맨틱 프로세스 검색방안)

  • Lee, Hong-Joo;Klein, Mark
    • Asia pacific journal of information systems / v.18 no.1 / pp.79-96 / 2008
  • One of the roles of the Semantic Web services is to execute dynamic intra-organizational services including the integration and interoperation of business processes. Since different organizations design their processes differently, the retrieval of similar semantic business processes is necessary in order to support inter-organizational collaborations. Most approaches for finding services that have certain features and support certain business processes have relied on some type of logical reasoning and exact matching. This paper presents our approach of using imprecise matching to expand the results of an exact matching engine when querying the OWL (Web Ontology Language) MIT Process Handbook. The MIT Process Handbook is an electronic repository of best-practice business processes. The Handbook is intended to help people (1) redesign organizational processes, (2) invent new processes, and (3) share ideas about organizational practices. In order to use the MIT Process Handbook for process retrieval experiments, we had to export it into an OWL-based format. We model the Process Handbook meta-model in OWL and export the processes in the Handbook as instances of the meta-model. Next, we need to find a sizable number of queries and their corresponding correct answers in the Process Handbook. Many previous studies devised artificial datasets composed of randomly generated numbers without real meaning and used subjective ratings for correct answers and similarity values between processes. To generate a semantics-preserving test data set, we use mutation operators to create 20 variants of each target process that are syntactically different but semantically equivalent. These variants represent the correct answers for the target process. We devise diverse similarity algorithms based on values of process attributes and structures of business processes.
We use simple similarity algorithms for text retrieval such as TF-IDF and Levenshtein edit distance to devise our approaches, and use a tree edit distance measure because semantic processes appear to have a graph structure. We also design similarity algorithms that consider the similarity of process structure, such as part process, goal, and exception. Since we can identify relationships between a semantic process and its subcomponents, this information can be used to calculate similarities between processes. Dice's coefficient and Jaccard similarity measures are used to calculate the portion of overlap between processes in diverse ways. We perform retrieval experiments to compare the performance of the devised similarity algorithms. We measure retrieval performance in terms of precision, recall, and the F measure, the harmonic mean of precision and recall. The tree edit distance shows the poorest performance on all measures. TF-IDF and the method combining TF-IDF with Levenshtein edit distance perform better than the other devised methods. These two measures focus on similarity between the names and descriptions of processes. In addition, we calculate a rank correlation coefficient, Kendall's tau b, between the number of process mutations and the ranking of similarity values among the mutation sets. In this experiment, similarity measures based on process structure, such as Dice's, Jaccard, and derivatives of these measures, show greater coefficients than measures based on values of process attributes. However, the Lev-TFIDF-JaccardAll measure, which considers process structure and attribute values together, shows reasonably better performance in both experiments. For retrieving semantic processes, it therefore appears better to consider diverse aspects of process similarity, such as process structure and values of process attributes.
We generate semantic process data and a dataset for retrieval experiments from the MIT Process Handbook repository. We suggest imprecise query algorithms that expand the retrieval results of an exact matching engine such as SPARQL, and compare the retrieval performance of the similarity algorithms. As for limitations and future work, experiments with datasets from other domains are needed. Also, since there are many similarity values from diverse measures, we may find better ways to identify relevant processes by applying these values simultaneously.
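The evaluation measures named in this abstract (precision, recall, and the F measure as their harmonic mean) can be sketched as below. The identifiers and the retrieved set are hypothetical. Only the figure of 20 correct variants per target process comes from the abstract, and the function name `precision_recall_f` is an assumption for illustration.

```python
def precision_recall_f(retrieved, relevant):
    """Precision, recall, and F measure (harmonic mean) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# 20 semantically equivalent variants are the correct answers for a target process.
relevant = {f"variant_{i}" for i in range(20)}

# Hypothetical result list: 12 correct variants plus 4 unrelated processes.
retrieved = {f"variant_{i}" for i in range(12)} | {f"other_{i}" for i in range(4)}

p, r, f = precision_recall_f(retrieved, relevant)
print(f"precision={p:.2f}, recall={r:.2f}, F={f:.3f}")
```

The harmonic mean penalizes an imbalance between precision and recall more than an arithmetic mean would, which is why a method cannot score well on F by maximizing only one of the two.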