• Title/Summary/Keyword: Bayesian clustering analysis


Genetic Diversity and Genetic Structure of Phellodendron amurense Populations in South Korea (황벽나무 자연집단의 유전다양성 및 유전구조 분석)

  • Lee, Jei-Wan;Hong, Kyung-Nak;Kang, Jin-Taek
    • Journal of Korean Society of Forest Science / v.103 no.1 / pp.51-58 / 2014
  • Genetic diversity and genetic structure were estimated in seven natural populations of Phellodendron amurense Rupr. in South Korea using ISSR markers. Six ISSR primers yielded a total of 27 polymorphic loci; the average number of polymorphic loci per primer was 4.5 and the proportion of polymorphic loci per population was 78.8%. Shannon's diversity index (I) was 0.421 and the expected heterozygosity ($H_e$) was 0.285, which was similar to the heterozygosity (hs = 0.287) inferred by the Bayesian method. In AMOVA, 7.6% of the total genetic variation was attributable to differences among populations and the remaining 92.4% to differences among individuals within populations. Genetic differentiation (${\theta}^{II}$) and the inbreeding coefficient (f) for the total population were estimated by the Bayesian method to be 0.066 and 0.479, respectively. In Bayesian clustering analysis, the seven populations were assigned to three groups, a result similar to the genetic relationships obtained by UPGMA and PCA. The first group included the Hwacheon, Gapyeong, Bongpyeong and Yongpyeong populations, and the second included the two populations in the Sancheong region. The Muju population was assigned to a separate third group despite its short geographic distance from the Sancheong region. Mantel's test showed no significant correlation between genetic relationship and geographic distribution among populations. For conservation of Phellodendron trees, it would be effective to consider the findings of this study together with the ecological traits and life history of the species.
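As a rough illustration of the diversity statistics named in this abstract, the following Python sketch computes per-locus expected heterozygosity and Shannon's index from dominant (band present/absent) ISSR scores. It assumes Hardy-Weinberg equilibrium, so that the null-allele frequency is the square root of the band-absent fraction. The abstract does not say which estimator the authors used, and the function name, input layout and example scores are hypothetical.

```python
import math

def dominant_locus_diversity(bands):
    """bands: list of 0/1 scores (1 = band present) for one locus.

    Returns (expected heterozygosity, Shannon's index) under the
    Hardy-Weinberg assumption described in the lead-in (an assumption,
    not necessarily the estimator used in the paper).
    """
    absent_fraction = bands.count(0) / len(bands)
    q = math.sqrt(absent_fraction)  # null (band-absent) allele frequency
    p = 1.0 - q                     # band-present allele frequency
    h_e = 2.0 * p * q
    shannon = -sum(f * math.log(f) for f in (p, q) if f > 0)
    return h_e, shannon

# Hypothetical scores: 3 of 12 individuals lack the band, so q = 0.5.
h_e, shannon = dominant_locus_diversity([1] * 9 + [0] * 3)
print(round(h_e, 3), round(shannon, 3))
```

Population-level values such as those reported above would then be averages of these per-locus quantities over all scored loci.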

Comparison of genome-wide association and genomic prediction methods for milk production traits in Korean Holstein cattle

  • Lee, SeokHyun;Dang, ChangGwon;Choy, YunHo;Do, ChangHee;Cho, Kwanghyun;Kim, Jongjoo;Kim, Yousam;Lee, Jungjae
    • Asian-Australasian Journal of Animal Sciences / v.32 no.7 / pp.913-921 / 2019
  • Objective: The objectives of this study were to compare the informative regions identified through two genome-wide association study (GWAS) approaches and to determine the accuracy and bias of the direct genomic value (DGV) for milk production traits in Korean Holstein cattle using two genomic prediction approaches: single-step genomic best linear unbiased prediction (ss-GBLUP) and Bayes-B. Methods: Records on production traits such as adjusted 305-day milk (MY305), fat (FY305), and protein (PY305) yields were collected from 265,271 first-parity cows. After quality control, 50,765 single-nucleotide polymorphism genotypes were available for analysis. In GWAS for ss-GBLUP (ssGWAS) and Bayes-B (BayesGWAS), the proportion of genetic variance for each 1-Mb genomic window was calculated and used to identify informative genomic regions. Accuracy of the DGV was estimated by five-fold cross-validation with random clustering. As a measure of accuracy for DGV, we also assessed the correlation between DGV and deregressed estimated breeding value (DEBV). The bias of DGV for each method was obtained by determining regression coefficients. Results: A total of nine and five significant windows (1 Mb) were identified for MY305 using ssGWAS and BayesGWAS, respectively. Using ssGWAS and BayesGWAS, we also detected multiple significant regions for FY305 (12 and 7) and PY305 (14 and 2), respectively. Both single-step DGV and Bayes DGV showed moderate accuracy for MY305 (0.32 to 0.34), FY305 (0.37 to 0.39), and PY305 (0.35 to 0.36). The mean biases of DGVs determined using the single-step and Bayesian methods were $1.50 \pm 0.21$ and $1.18 \pm 0.26$ for MY305, $1.75 \pm 0.33$ and $1.14 \pm 0.20$ for FY305, and $1.59 \pm 0.20$ and $1.14 \pm 0.15$ for PY305, respectively. Conclusion: From the bias perspective, we believe that genomic selection based on Bayesian approaches would be more suitable than ss-GBLUP in Korean Holstein populations.
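The accuracy and bias measures described in this abstract can be sketched in Python as follows. The sketch assumes that accuracy is the Pearson correlation between DGV and DEBV in a validation fold and that bias is the slope of the regression of DEBV on DGV; the abstract mentions "regression coefficients" without giving the direction, so that choice is an assumption. All function names and the fold-assignment scheme are hypothetical.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def regression_slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def five_fold_indices(n, seed=0):
    """Randomly assign n animals to five validation folds (assumed scheme)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[k::5] for k in range(5)]

def accuracy_and_bias(dgv, debv):
    """Return (correlation, slope of DEBV on DGV) for one validation fold."""
    return pearson_r(dgv, debv), regression_slope(dgv, debv)
```

Under this convention a slope of 1 means the predictions are on the same scale as the DEBV, and the per-fold values would be averaged over the five folds.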

Development of Medical Cost Prediction Model Based on the Machine Learning Algorithm (머신러닝 알고리즘 기반의 의료비 예측 모델 개발)

  • Han Bi KIM;Dong Hoon HAN
    • Journal of Korea Artificial Intelligence Association / v.1 no.1 / pp.11-16 / 2023
  • Accurate hospital case modeling and prediction are crucial for efficient healthcare. In this study, we demonstrate the implementation of regression analysis methods in machine learning systems using mathematical statistics and machine learning techniques. The developed machine learning model includes Bayesian linear, artificial neural network, decision tree, decision forest, and linear regression analysis models. Through the application of these algorithms, corresponding regression models were constructed and analyzed. The results suggest the potential of leveraging machine learning systems for medical research. The experiment aimed to create an Azure Machine Learning Studio tool for the speedy evaluation of multiple regression models. The tool facilitates the comparison of 5 types of regression models in a unified experiment and presents assessment results with performance metrics. Evaluation of the regression machine learning models highlighted the advantages of boosted decision tree regression and decision forest regression in hospital case prediction. These findings could lay the groundwork for the deliberate development of new directions in medical data processing and decision making. Potential avenues for future research include exploring methods such as clustering, classification, and anomaly detection in healthcare systems.

Genetic Variation of Abies holophylla Populations in South Korea Based on ISSR Markers (ISSR 분석에 의한 전나무 집단의 유전변이)

  • Kim, Young-Mi;Hong, Kyung Nak;Lee, Jei Wan;Yang, Byeong-Hoon
    • Journal of Korean Society of Forest Science / v.103 no.2 / pp.182-188 / 2014
  • Genetic diversity and genetic differentiation in six natural populations of Abies holophylla Maxim. were investigated using the ISSR marker system. With 6 ISSR primers, the average percentage of polymorphic loci was 85.6% and the average expected heterozygosity ($H_e$) was 0.288. In AMOVA, 94.4% of the total genetic variation came from differences among individuals within populations and 5.6% from differences among populations. On the basis of Bayesian inference, genetic differentiation (${\theta}^{II}$ and $G_{ST}$) and the inbreeding coefficient for all populations were 0.045, 0.038, and 0.509, respectively. The correlation between genetic distance and geographical distance was highly significant in Mantel's test (r = 0.74, P < 0.05). The six populations were divided into two groups according to the results of UPGMA and PCA: one group comprised the Namwon, Cheongdo and Mungyeong populations, and the other the Inje, Hongcheon and Pyeongchang populations. In Bayesian clustering analysis, the six populations were also divided into two clusters, but the Cheongdo population was assigned to the other cluster, unlike in UPGMA and PCA. When the regions defined by the cluster analysis were included in AMOVA, 3.9% of the genetic variation came from regional differences. The UPGMA dendrogram could provide the most genetically reasonable explanation for the distribution of Abies holophylla populations in South Korea.
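The correlation between genetic and geographic distance reported in this abstract comes from a Mantel test. A minimal permutation-based sketch in Python is given below. It is a generic textbook version, not the authors' software or settings; the function names, the one-sided p-value convention and the number of permutations are assumptions.

```python
import math
import random

def _upper(matrix, order):
    """Upper-triangle entries of a square matrix after reordering rows/columns."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def _pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mantel(genetic, geographic, permutations=999, seed=0):
    """Return (r, one-sided p-value) for two symmetric distance matrices."""
    rng = random.Random(seed)
    identity = list(range(len(genetic)))
    geo = _upper(geographic, identity)
    r_obs = _pearson(_upper(genetic, identity), geo)
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # relabel populations in the genetic matrix only
        if _pearson(_upper(genetic, order), geo) >= r_obs - 1e-12:
            hits += 1
    return r_obs, hits / (permutations + 1)
```

With only six populations there are just 720 possible relabelings, so an exact enumeration would also be feasible in place of random permutations.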

Classification and Analysis of Data Mining Algorithms (데이터마이닝 알고리즘의 분류 및 분석)

  • Lee, Jung-Won;Kim, Ho-Sook;Choi, Ji-Young;Kim, Hyon-Hee;Yong, Hwan-Seung;Lee, Sang-Ho;Park, Seung-Soo
    • Journal of KIISE:Databases / v.28 no.3 / pp.279-300 / 2001
  • Data mining plays an important role in the knowledge discovery process, and existing algorithms are usually selected according to the specific purpose of the mining. Data mining techniques are currently being actively applied to statistics, business, electronic commerce, biology, and medicine, and numerous algorithms are being researched and developed for these applications. However, in the long run, only a few algorithms that are well suited to specific applications and perform well on large databases will survive, so it is reasonable to focus future effort on those selected algorithms. This paper classifies about 30 existing algorithms into 7 categories: association rule, clustering, neural network, decision tree, genetic algorithm, memory-based reasoning, and Bayesian network. First, this work analyzes the systematic hierarchy and characteristics of the algorithms, and we present 14 criteria for classifying the algorithms together with the results based on these criteria. Finally, we propose the best algorithms among comparable algorithms with different features and performance. The results of this paper can be used as a guideline for data mining research as well as for field applications of data mining.


Evaluation of the taxonomic rank of the terrestrial orchid Cephalanthera subaphylla based on allozymes

  • CHUNG, Mi Yoon;SON, Sungwon;CHUNG, Jae Min;LOPEZ-PUJOL, Jordi;YUKAWA, Tomohisa;CHUNG, Myong Gi
    • Korean Journal of Plant Taxonomy / v.49 no.2 / pp.118-126 / 2019
  • The taxonomic rank of the tiny-leaved terrestrial orchid Cephalanthera subaphylla Miyabe & Kudô has been somewhat controversial, as it has been treated as a species or as an infraspecific taxon under C. erecta (Thunb.) Blume [C. erecta var. subaphylla (Miyabe & Kudô) Ohwi and C. erecta f. subaphylla (Miyabe & Kudô) M. Hiro]. Allozyme markers, traditionally employed for delimiting species boundaries, are used here to gain information for determining the taxonomic status of C. subaphylla. To do this, we sampled three populations of each of five taxa (a total of 15 populations) of Cephalanthera native to the Korean Peninsula [C. erecta, C. falcata (Thunb.) Blume, C. longibracteata Blume, C. longifolia (L.) Fritsch, and C. subaphylla]. Among 20 putative loci resolved, three were monomorphic (Dia-2, Pgi-1, and Tpi-1) across the five species. Apart from C. longibracteata, there was no allozyme variation within the remaining four species. Of the 51 alleles harbored by the 17 polymorphic loci, 27 alleles at 14 loci were each unique to a single species. Accordingly, we found low average values of Nei's genetic identity (I) between the ten species pairs (from I = 0.250 for C. erecta vs. C. longifolia to I = 0.603 for C. falcata vs. C. longibracteata), with C. subaphylla being clearly differentiated genetically from the other species (from I = 0.349 for C. subaphylla vs. C. longifolia to I = 0.400 for C. subaphylla vs. C. falcata). These results clearly indicate that C. subaphylla is not genetically related to any of the other taxa of Cephalanthera that are native to the Korean Peninsula, including C. erecta. In a principal coordinate analysis (PCoA), C. subaphylla was positioned distant not only from C. falcata, C. longibracteata, and C. longifolia, but also from C. erecta. Finally, K = 5 was the best clustering scheme using a Bayesian approach, with the five clusters corresponding precisely to the five taxa. Thus, our allozyme results strongly suggest that C. subaphylla merits the rank of species.
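Nei's genetic identity, the statistic quoted for the species pairs in this abstract, can be sketched in Python as follows. This is the standard formula I = J_xy / sqrt(J_x J_y) computed from allele frequencies; the input layout (one dictionary per locus), the function name and the allele labels in the example are hypothetical, and the frequencies are made up for illustration rather than taken from the paper.

```python
import math

def nei_identity(pop_x, pop_y):
    """Nei's genetic identity between two populations.

    pop_x, pop_y: lists with one dict per locus (same locus order in both),
    each mapping allele label -> frequency.
    """
    jx = jy = jxy = 0.0
    for fx, fy in zip(pop_x, pop_y):
        jx += sum(p * p for p in fx.values())
        jy += sum(p * p for p in fy.values())
        jxy += sum(p * fy.get(allele, 0.0) for allele, p in fx.items())
    # Dividing each sum by the number of loci would cancel in the ratio.
    return jxy / math.sqrt(jx * jy)

# Hypothetical two-locus example: identical at locus 1, fixed for
# different alleles at locus 2.
species_a = [{"a": 1.0}, {"a": 1.0}]
species_b = [{"a": 1.0}, {"b": 1.0}]
print(nei_identity(species_a, species_b))
```

Nei's standard genetic distance is then D = -ln(I), so pairs with no shared alleles at any locus have I = 0 and an undefined (infinite) distance.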

Predictive Clustering-based Collaborative Filtering Technique for Performance-Stability of Recommendation System (추천 시스템의 성능 안정성을 위한 예측적 군집화 기반 협업 필터링 기법)

  • Lee, O-Joun;You, Eun-Soon
    • Journal of Intelligence and Information Systems / v.21 no.1 / pp.119-142 / 2015
  • With the explosive growth in the volume of information, Internet users are experiencing considerable difficulty in obtaining the information they need online. Against this backdrop, ever-greater importance is being placed on recommender systems that provide information tailored to user preferences and tastes in order to address information overload. To this end, a number of techniques have been proposed, including content-based filtering (CBF), demographic filtering (DF) and collaborative filtering (CF). Among them, CBF and DF require external information and thus cannot be applied to a variety of domains. CF, on the other hand, is widely used since it is relatively free from domain constraints. The CF technique is broadly classified into memory-based CF, model-based CF and hybrid CF. Model-based CF addresses the drawbacks of CF by using a Bayesian model, clustering model or dependency network model. This filtering technique not only alleviates the sparsity and scalability issues but also boosts predictive performance. However, it involves expensive model building and results in a tradeoff between performance and scalability. This tradeoff is attributed to reduced coverage, which is a type of sparsity issue. In addition, expensive model building may lead to performance instability, since changes in the domain environment cannot be immediately incorporated into the model because of the high costs involved. Cumulative changes in the domain environment that are not reflected in the model eventually undermine system performance. This study incorporates the Markov model of transition probabilities and the concept of fuzzy clustering into CBCF to propose predictive clustering-based CF (PCCF), which addresses the issues of reduced coverage and unstable performance. The method reduces performance instability by tracking changes in user preferences and bridging the gap between the static model and dynamic users. Furthermore, the issue of reduced coverage is mitigated by expanding the coverage based on transition probabilities and clustering probabilities. The proposed method consists of four processes. First, user preferences are normalized in preference clustering. Second, changes in user preferences are detected from review score entries during preference transition detection. Third, user propensities are normalized using patterns of changes (propensities) in user preferences in propensity clustering. Lastly, the preference prediction model is developed to predict user preferences for items during preference prediction. The proposed method was validated by testing its robustness against performance instability and the scalability-performance tradeoff. The initial test compared and analyzed the performance of individual recommender systems based on IBCF, CBCF, ICFEC and PCCF in an environment where data sparsity had been minimized. The following test adjusted the optimal number of clusters in CBCF, ICFEC and PCCF for a comparative analysis of the subsequent changes in system performance. The test results revealed that the proposed method produced an insignificant improvement in performance compared with the existing techniques. In addition, it failed to achieve a significant improvement in the standard deviation, which indicates the degree of data fluctuation. Nevertheless, it showed a marked improvement over the existing techniques in terms of range, which indicates the level of performance fluctuation. The level of performance fluctuation before and after model generation improved by 51.31% in the initial test. In the following test, there was a 36.05% improvement in the level of performance fluctuation driven by changes in the number of clusters. This signifies that the proposed method, despite the slight performance improvement, clearly offers better performance stability than the existing techniques. Further research will be directed toward enhancing the recommendation performance, which failed to demonstrate a significant improvement over the existing techniques. Future research will consider introducing a high-dimensional parameter-free clustering algorithm or a deep learning-based model in order to improve recommendation performance.
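The Markov transition probabilities mentioned in this abstract can be illustrated with a short Python sketch that estimates, from each user's history of preference-cluster labels, the probability of moving from one cluster to another. This is only a generic first-order Markov estimate under the assumption that each user's history is an ordered list of cluster labels; it is not the PCCF algorithm itself, and the function name and data layout are hypothetical.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Estimate first-order transition probabilities between cluster labels.

    sequences: iterable of per-user lists of cluster labels in time order.
    Returns {current_cluster: {next_cluster: probability}}.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for state, row in counts.items()
    }

# Hypothetical histories for two users over preference clusters 0 and 1.
histories = [[0, 1, 1, 0], [0, 1]]
print(transition_probabilities(histories))
```

A predictor built on such a table could weight each cluster's recommendations by the probability that the user moves into that cluster next, which is one way to read the coverage expansion described above.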

Performance Improvement of Collaborative Filtering System Using Associative User's Clustering Analysis for the Recalculation of Preference and Representative Attribute-Neighborhood (선호도 재계산을 위한 연관 사용자 군집 분석과 Representative Attribute-Neighborhood를 이용한 협력적 필터링 시스템의 성능향상)

  • Jung, Kyung-Yong;Kim, Jin-Su;Kim, Tae-Yong;Lee, Jung-Hyun
    • The KIPS Transactions:PartB / v.10B no.3 / pp.287-296 / 2003
  • There has been much research on collaborative filtering techniques in recommender systems. However, these studies suffer from the first-rater problem and the sparsity problem. The main purpose of this paper is to solve these problems. We propose a method for predicting user preference that uses Bayesian estimated values and associative user clustering for the recalculation of preference. In addition, to compensate for the shortcoming that item attributes are not considered, we use the Representative Attribute-Neighborhood method, which extracts the representative attributes that most affect preference and uses them for prediction when finding similar neighbors. We improved efficiency by applying associative user clustering analysis to the collaborative filtering algorithm in order to calculate the preference for a specific item within the cluster item vector. Furthermore, to address the sparsity and first-rater problems, associative users are clustered by genre using the Association Rule Hypergraph Partitioning algorithm. New users are classified into one of these genres by a Naive Bayes classifier. In addition, to obtain the similarity between the users belonging to the classified genre and new users, this paper assigns different estimated values, obtained through Naive Bayes learning, to the items that users have rated. Applying the preferences weighted by these estimated values to the Pearson correlation coefficient yields higher accuracy because errors caused by missing values are reduced. We evaluate our method on a large collaborative filtering database of user ratings, and it significantly outperforms previously proposed methods.
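The Pearson-correlation step at the core of this kind of collaborative filtering can be sketched in Python as below. This is the generic neighborhood-based prediction (mean-centered, similarity-weighted average), not the paper's method: it omits the clustering, Naive Bayes weighting and Representative Attribute-Neighborhood steps, and the function names and rating data are hypothetical.

```python
import math

def pearson_sim(ra, rb):
    """Pearson correlation over the items that both users have rated."""
    common = [i for i in ra if i in rb]
    if len(common) < 2:
        return 0.0
    ma = sum(ra[i] for i in common) / len(common)
    mb = sum(rb[i] for i in common) / len(common)
    num = sum((ra[i] - ma) * (rb[i] - mb) for i in common)
    den = math.sqrt(sum((ra[i] - ma) ** 2 for i in common)
                    * sum((rb[i] - mb) ** 2 for i in common))
    return num / den if den else 0.0

def predict(ratings, user, item):
    """Predict a rating as the user's mean plus weighted neighbor deviations."""
    mean_u = sum(ratings[user].values()) / len(ratings[user])
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or item not in r:
            continue
        sim = pearson_sim(ratings[user], r)
        mean_o = sum(r.values()) / len(r)
        num += sim * (r[item] - mean_o)
        den += abs(sim)
    return mean_u + num / den if den else mean_u

# Hypothetical ratings: predict user "a"'s rating of item "w".
ratings = {
    "a": {"x": 5, "y": 3, "z": 4},
    "b": {"x": 4, "y": 2, "z": 3, "w": 5},
    "c": {"x": 1, "y": 5, "w": 2},
}
print(predict(ratings, "a", "w"))
```

The sketch does not clip predictions to the rating scale, and a user with no overlapping neighbors simply receives their own mean, which is where the sparsity and first-rater problems discussed above show up.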

Reliability of microarray analysis for studying periodontitis: low consistency in 2 periodontitis cohort data sets from different platforms and an integrative meta-analysis

  • Jeon, Yoon-Seon;Shivakumar, Manu;Kim, Dokyoon;Kim, Chang-Sung;Lee, Jung-Seok
    • Journal of Periodontal and Implant Science / v.51 no.1 / pp.18-29 / 2021
  • Purpose: The aim of this study was to compare the characteristic expression patterns of advanced periodontitis in 2 cohort data sets analyzed using different microarray platforms, and to identify differentially expressed genes (DEGs) through a meta-analysis of both data sets. Methods: Twenty-two patients for cohort 1 and 40 patients for cohort 2 were recruited with the same inclusion criteria. The 2 cohort groups were analyzed using different platforms: Illumina and Agilent. A meta-analysis was performed to increase reliability by removing statistical differences between platforms. An integrative meta-analysis based on an empirical Bayesian methodology (ComBat) was conducted. DEGs for the integrated data sets were identified using the limma package to adjust for age, sex, and platform and compared with the results for cohorts 1 and 2. Clustering and pathway analyses were also performed. Results: This study detected 557 and 246 DEGs in cohorts 1 and 2, respectively, with 146 and 42 significantly enriched gene ontology (GO) terms. Overlapping between cohorts 1 and 2 was present in 59 DEGs and 18 GO terms. However, only 6 genes from the top 30 enriched DEGs overlapped, and there were no overlapping GO terms in the top 30 enriched pathways. The integrative meta-analysis detected 34 DEGs, of which 10 overlapped in all the integrated data sets of cohorts 1 and 2. Conclusions: The characteristic expression pattern differed between periodontitis and the healthy periodontium, but the consistency between the data sets from different cohorts and metadata was too low to suggest specific biomarkers for identifying periodontitis.

Genetic diversity of Indonesian cattle breeds based on microsatellite markers

  • Agung, Paskah Partogi;Saputra, Ferdy;Zein, Moch Syamsul Arifin;Wulandari, Ari Sulistyo;Putra, Widya Pintaka Bayu;Said, Syahruddin;Jakaria, Jakaria
    • Asian-Australasian Journal of Animal Sciences / v.32 no.4 / pp.467-476 / 2019
  • Objective: This research was conducted to study the genetic diversity of several Indonesian cattle breeds using microsatellite markers in order to classify the breeds. Methods: A total of 229 DNA samples from 10 cattle breeds were used in this study. The polymerase chain reaction was conducted using 12 labeled primers. Allele sizes were determined by multiplex DNA fragment analysis. The POPGEN and CERVUS programs were used to obtain the observed number of alleles, effective number of alleles, observed heterozygosity, expected heterozygosity, allele frequency, genetic differentiation, the global heterozygote deficit among breeds, the heterozygote deficit within breeds, gene flow, Hardy-Weinberg equilibrium, and polymorphism information content values. The MEGA program was used to generate a dendrogram illustrating the relationships among the cattle populations. Bayesian clustering assignments were analyzed using the STRUCTURE program. The GENETIX program was used to perform the correspondence factorial analysis (CFA). The GENALEX program was used to perform the principal coordinates analysis (PCoA) and analysis of molecular variance. The principal component analysis (PCA) was performed using the adegenet package of the R program. Results: A total of 862 alleles were detected in this study. The INRA23 allele 205 is a candidate specific allele for Sumba Ongole cattle, while allele 219 is a candidate specific allele for Ongole Grade cattle. This study revealed a very close genetic relationship between the Ongole Grade and Sumba Ongole cattle and between the Madura and Pasundan cattle. The results of the CFA, PCoA, and PCA provide scientific evidence regarding the genetic relationship between Banteng and Bali cattle. According to the genetic relationships, the Pesisir cattle were classified as Bos indicus cattle. Conclusion: The alleles identified in this study were able to classify the cattle populations into three clusters, i.e., a Bos taurus cluster (Simmental Purebred, Simmental Crossbred, and Holstein Friesian cattle), a Bos indicus cluster (Sumba Ongole, Ongole Grade, Madura, Pasundan, and Pesisir cattle), and a Bos javanicus cluster (Banteng and Bali cattle).
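Several of the per-locus statistics listed in the Methods of this abstract follow directly from microsatellite genotypes. The Python sketch below computes allele frequencies, observed heterozygosity, expected heterozygosity and polymorphism information content (PIC) for a single locus using the standard textbook formulas. It is not the POPGEN or CERVUS implementation, the function names are hypothetical, and the genotypes in the example are made up (the allele sizes merely echo the two INRA23 alleles named above).

```python
from collections import Counter

def allele_frequencies(genotypes):
    """genotypes: list of (allele1, allele2) tuples for one locus."""
    counts = Counter(a for g in genotypes for a in g)
    total = sum(counts.values())
    return {allele: c / total for allele, c in counts.items()}

def observed_heterozygosity(genotypes):
    return sum(1 for a, b in genotypes if a != b) / len(genotypes)

def he_and_pic(freqs):
    """Return (expected heterozygosity, PIC) from a list of allele frequencies."""
    he = 1.0 - sum(p * p for p in freqs)
    cross = sum(
        2.0 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(len(freqs))
        for j in range(i + 1, len(freqs))
    )
    return he, he - cross

# Hypothetical genotypes at one locus (allele sizes in base pairs).
genotypes = [(205, 205), (205, 219), (219, 219), (205, 219)]
freqs = allele_frequencies(genotypes)
print(observed_heterozygosity(genotypes), he_and_pic(list(freqs.values())))
```

For a locus with two equally frequent alleles, as in this toy example, the expected heterozygosity is 0.5 and the PIC is 0.375, the familiar upper bounds for a biallelic marker.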