• Title/Abstract/Keywords: hierarchical cluster analysis

The effect of missing levels of nesting in multilevel analysis

  • Park, Seho;Chung, Yujin
    • Genomics & Informatics / v.20 no.3 / pp.34.1-34.11 / 2022
  • Multilevel analysis is an appropriate and powerful tool for analyzing hierarchically structured data and is widely applied in fields ranging from public health to genomics. In practice, however, information on multiple nesting levels may be lost, either because the data fail to capture all levels of the hierarchy or because the top or intermediate levels are ignored in the analysis. In this study, we consider a multilevel linear mixed effect model (LMM) with single imputation that can involve all levels of the data hierarchy in the presence of missing top or intermediate-level clusters. We evaluate the performance of a multilevel LMM with single imputation and compare it with other models that ignore the data hierarchy or the missing intermediate-level clusters. To this end, we applied a multilevel LMM with single imputation and the other models to hierarchically structured cohort data with some intermediate levels missing and to simulated data with various cluster sizes and missing rates of intermediate-level clusters. A thorough simulation study demonstrated that an LMM with single imputation estimates the fixed coefficients and variance components of a multilevel model more accurately, in terms of mean squared error and coverage probability, than models that ignore the data hierarchy or the missing clusters. In particular, when models ignoring the data hierarchy or the missing clusters were applied, the variance components of the random effects were overestimated. We observed similar results in the analysis of the hierarchically structured cohort data.

Hierarchical and Incremental Clustering for Semi Real-time Issue Analysis on News Articles (준 실시간 뉴스 이슈 분석을 위한 계층적·점증적 군집화)

  • Kim, Hoyong;Lee, SeungWoo;Jang, Hong-Jun;Seo, DongMin
    • The Journal of the Korea Contents Association / v.20 no.6 / pp.556-578 / 2020
  • Many studies have examined how to analyze issues from real-time news streams. However, few analyze issues hierarchically from news articles, and in an earlier approach to hierarchical issue analysis, clustering slows down as the number of news articles grows. In this paper, we propose a hierarchical and incremental clustering method for semi-real-time issue analysis of news articles. We trained a weighted cosine similarity model based on a Siamese neural network, applied this model to the k-means algorithm to build word clusters, and converted news articles into document vectors using these word clusters. Finally, we initialized an issue cluster tree from the document vectors, updated the tree whenever new articles arrived, and analyzed issues in semi-real time. In our experiments, the method improved normalized mutual information (NMI) by up to about 0.26, and incremental clustering ran about 10 times faster than the previous approach.
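
The NMI score reported above compares a predicted clustering with reference labels. The following is a minimal sketch of one common definition, assuming normalization by the arithmetic mean of the two entropies; the abstract does not state which normalization the authors used, and the function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two label assignments of the same items."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (nij / n) * math.log(n * nij / (count_a[i] * count_b[j]))
        for (i, j), nij in joint.items()
    )

def nmi(a, b):
    """NMI with arithmetic-mean normalization (an assumption, see lead-in)."""
    ha, hb = entropy(a), entropy(b)
    if ha == 0 and hb == 0:
        return 1.0  # both assignments put everything in one cluster
    return mutual_information(a, b) / ((ha + hb) / 2)

# Identical partitions under different label names score 1;
# statistically independent partitions score 0.
same = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Because NMI depends only on how items are grouped, not on the label names, it suits comparing an issue cluster tree level against a hand-labeled reference.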

A Method for Comparing Multiple Bacterial Community Structures from 16S rDNA Clone Library Sequences

  • Hur, Inae;Chun, Jongsik
    • Journal of Microbiology / v.42 no.1 / pp.9-13 / 2004
  • Culture-independent approaches based on 16S rDNA sequences are used extensively in modern microbial ecology. Sequencing a clone library generated from environmental DNA has advantages over fingerprint-based methods, such as denaturing gradient gel electrophoresis, because it provides precise identification and quantification of the phylotypes present in samples. However, to date, no method exists for comparing multiple bacterial community structures using clone library sequences. In this study, an automated method for this purpose was developed by applying pairwise alignment, hierarchical clustering, and principal component analysis. The method was shown to be successful in comparing samples from various environments. The program, named CommCluster, was written in Java and is freely available at http://chunlab.snu.ac.kr/commcluster/.

Classification of Ambient Particulate Samples Using Cluster Analysis and Disjoint Principal Component Analysis (군집분석법과 분산주성분분석법을 이용한 대기분진시료의 분류)

  • 유상준;김동술
    • Journal of Korean Society for Atmospheric Environment / v.13 no.1 / pp.51-63 / 1997
  • Total suspended particulate matter in ambient air was analyzed for eight chemical elements (Ca, Co, Cu, Fe, Mn, Pb, Si, and Zn) using X-ray fluorescence spectrometry (XRF) at the Kyung Hee University Suwon Campus from 1989 to 1994. To use these data as a basis for a source identification study, each sample was assigned to one of several well-defined sample groups. The data set, consisting of 83 objects and 8 variables, was initially separated into two groups: fine ($d_p < 3.3\ \mu\mathrm{m}$) and coarse ($d_p > 3.3\ \mu\mathrm{m}$) particles. Hierarchical clustering was examined to obtain candidate members of homogeneous sample classes for each of the two groups by transforming the raw data and applying various distance measures. Disjoint principal component analysis was then used to define homogeneous sample classes after deleting outliers. Five homogeneous sample classes were determined for each of the fine and coarse particle groups. The data were properly classified by applying a logarithmic transformation and the Euclidean distance. After the homogeneous classes were determined, correlation coefficients among the eight chemical variables and the meteorological variables (temperature, relative humidity, wind speed, wind direction, and precipitation) were examined within each homogeneous class to interpret the environmental factors influencing the characteristics of each class in each group. The analysis showed that each class had its own distinct seasonal pattern, which was most sensitive to wind direction.
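
The preprocessing that worked best in this abstract, a logarithmic transformation followed by Euclidean distance, can be sketched as follows. The concentrations are made-up illustrative values, not the paper's data, and base-10 logarithms are an assumption because the abstract does not name the base.

```python
import math

def log_transform(rows):
    """Apply log10 to every concentration (all values must be positive)."""
    return [[math.log10(v) for v in row] for row in rows]

def euclidean(p, q):
    """Euclidean distance between two samples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def distance_matrix(rows):
    """Pairwise distance matrix, the usual input to hierarchical clustering."""
    return [[euclidean(p, q) for q in rows] for p in rows]

# Hypothetical samples with two elements on very different scales.
samples = [
    [1000.0, 1.0],
    [1000.0, 10.0],   # trace element is 10x higher than in sample 0
    [10000.0, 1.0],   # major element is 10x higher than in sample 0
]

raw = distance_matrix(samples)
logged = distance_matrix(log_transform(samples))
```

On the raw scale, sample 2 looks a thousand times farther from sample 0 than sample 1 does; after the log transform, both tenfold changes count equally, so abundant elements no longer dominate the clustering.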

Classifying and Characterizing the Types of Gentrified Commercial Districts Based on Sense of Place Using Big Data: Focusing on 14 Districts in Seoul (빅데이터를 활용한 젠트리피케이션 상권의 장소성 분류와 특성 분석 -서울시 14개 주요상권을 중심으로-)

  • Young-Jae Kim;In Kwon Park
    • Journal of the Korean Regional Science Association / v.39 no.1 / pp.3-20 / 2023
  • This study aims to categorize the 14 major gentrified commercial areas of Seoul and analyze their characteristics based on their sense of place. To achieve this, we conducted hierarchical cluster analysis using text data collected from Naver Blog. We divided the districts into two dimensions: "experience" and "feature" and analyzed their characteristics using LDA (Latent Dirichlet Allocation) of the text data and statistical data collected from Seoul Open Data Square. As a result, we classified the commercial districts of Seoul into 5 categories: 'theater district,' 'traditional cultural district,' 'female-beauty district,' 'exclusive restaurant and medical district,' and 'trend-leading district.' The findings of this study are expected to provide valuable insights for policy-makers to develop more efficient and suitable commercial policies.

Country Clustering Based on Environmental Factors Influencing on Software Piracy (소프트웨어 불법복제에 영향을 미치는 환경 요인에 기반한 국가 분류)

  • Suh, Bomil;Shim, Junho
    • The Journal of Information Systems / v.26 no.4 / pp.227-246 / 2017
  • Purpose: As software has grown in importance, the software market has continued to expand. Software piracy, however, adversely affects the development of that market. In this study, we classify countries around the world based on the macro-environmental factors that influence software piracy, and we identify how software piracy differs across the resulting types. Design/methodology/approach: This study takes a data-driven approach. From the BSA, the World Bank, and the OECD, we collected data from 1990 to 2015 on 127 environmental variables for 225 countries. Cronbach's $\alpha$ analysis, item-to-total correlation analysis, and exploratory factor analysis derived 15 constructs from the data. We applied a two-step approach to cluster analysis: in the first step, hierarchical cluster analysis determined the number of clusters to be 5, and in the second step, the countries were classified by K-means clustering. We conducted ANOVA and MANOVA to verify the differences in environmental factors and software piracy among the derived clusters. Findings: The five clusters are identified as underdeveloped countries, developing countries, developed countries, world powers, and developing countries with large markets. There are statistically significant differences in the environmental factors among the clusters, as well as in software piracy rate, pirated value, and legal software sales.
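
The two-step approach described above (hierarchical clustering to fix the number of clusters, then K-means to assign members) can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: average linkage, the "largest jump in merge distance" stopping rule, and seeding K-means from the hierarchical centroids are all choices made here, and the points are toy data.

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def average_linkage(a, b):
    """Mean pairwise distance between two clusters."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))

def agglomerate(points):
    """Merge the closest pair repeatedly; record (merge distance, clusters)."""
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        d, i, j = min(
            (average_linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        history.append((d, clusters))
    return history

def kmeans(points, centers, iters=100):
    """Plain Lloyd iterations starting from the given centers."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: dist(p, centers[c]))
            groups[nearest].append(p)
        new_centers = [centroid(g) if g else centers[i] for i, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups

def two_step(points):
    """Step 1 picks k at the largest jump in merge distance (needs >= 3 points);
    step 2 refines the partition with K-means seeded by the step-1 centroids."""
    history = agglomerate(points)
    dists = [d for d, _ in history]
    jumps = [dists[t + 1] - dists[t] for t in range(len(dists) - 1)]
    t = max(range(len(jumps)), key=jumps.__getitem__)
    seeds = history[t][1]
    centers, groups = kmeans(points, [centroid(c) for c in seeds])
    return len(seeds), centers, groups

# Toy data: three well-separated blobs of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
k, centers, groups = two_step(points)
```

Splitting the work this way uses the dendrogram only to choose the cluster count, while K-means is free to reassign borderline members that an early merge would otherwise lock in.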

A Comparative Study on Statistical Clustering Methods and Kohonen Self-Organizing Maps for Highway Characteristic Classification of National Highway (일반국도 도로특성분류를 위한 통계적 군집분석과 Kohonen Self-Organizing Maps의 비교연구)

  • Cho, Jun Han;Kim, Seong Ho
    • KSCE Journal of Civil and Environmental Engineering Research / v.29 no.3D / pp.347-356 / 2009
  • This paper describes a cluster analysis that classifies highways by their traffic characteristics, as an alternative to existing highway functional classification methodologies. The research compares the performance of clustering techniques based on total within-group error and derives the optimal number of clusters. It analyzes statistical clustering methods (hierarchical Ward's minimum-variance method and nonhierarchical K-means) and Kohonen self-organizing maps for highway characteristic classification. The outcomes of the clustering techniques were compared in terms of the number of samples and the traffic characteristics of the subsets derived with the optimal number of clusters. Overall, the K-means method outperformed the other methods for fewer than 12 clusters, while Kohonen self-organizing maps gave the best results for more than 20 clusters. The main contribution of this research is a highway characteristic classification that is expected to serve as basic road attribute information.
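
The comparison criterion named above, total within-group error, is commonly computed as the sum of squared distances from each point to its cluster centroid. The sketch below assumes that definition (the abstract does not spell it out) and uses toy partitions rather than traffic data.

```python
def within_group_sse(clusters):
    """Total within-group error: squared distance of each point to its
    cluster centroid, summed over all clusters."""
    total = 0.0
    for pts in clusters:
        center = [sum(coords) / len(pts) for coords in zip(*pts)]
        total += sum(
            sum((a - b) ** 2 for a, b in zip(p, center)) for p in pts
        )
    return total

# Two candidate partitions of the same four points, as two competing
# clustering methods might produce them.
partition_a = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
partition_b = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]

sse_a = within_group_sse(partition_a)  # compact clusters
sse_b = within_group_sse(partition_b)  # stretched clusters
```

For a fixed number of clusters, the method with the lower total is the better fit by this criterion; comparing across different cluster counts needs care because the total always falls as clusters are added.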

Differences of Narrative Representations by Foster Care, Adopted and Biological Family Children (가정위탁유아, 연장입양유아와 일반유아의 내적표상에서의 차이)

  • Shin, Hye Won;Min, Sung Hye
    • Korean Journal of Child Studies / v.29 no.3 / pp.157-174 / 2008
  • This study used a person-oriented approach to explore differences in the narrative representations of 97 4-, 5-, and 6-year-old children (30 foster care, 40 biological family, 17 adopted). Using the MacArthur Story Stem Battery (Bretherton et al., 1990), observations were made to obtain the children's narrative representations in terms of content themes and performance. Descriptive statistics, ANOVA, and hierarchical cluster analyses were performed. The results were as follows: (1) Biological family children showed more empathy/helping representations. Foster care children and adopted children showed more anxious representations, and foster care children showed more dysregulated aggression. (2) Four clusters were found among foster care and adopted children, and five clusters among biological family children.

Comprehensive review on Clustering Techniques and its application on High Dimensional Data

  • Alam, Afroj;Muqeem, Mohd;Ahmad, Sultan
    • International Journal of Computer Science & Network Security / v.21 no.6 / pp.237-244 / 2021
  • Clustering is a powerful unsupervised machine learning technique that divides instances into homogeneous groups, called clusters. It is mainly used to generate good-quality clusters through which hidden patterns and knowledge can be discovered in large datasets. It has wide application in fields such as medicine, healthcare, gene expression, image processing, agriculture, fraud detection, and profitability analysis. The goal of this paper is to explore both hierarchical and partitioning clustering and to understand their problems along with various approaches to solving them. Among clustering methods, the paper finds K-means preferable because of its linear time complexity. The paper also covers data mining on high-dimensional datasets, the problems that arise, and existing approaches to them.

Visualized Determination for Installation Location of Monitoring Devices using CPTED (CPTED기법을 통한 모니터링 시스템 설치위치 시각화 결정법)

  • Kim, Joohwan;Nam, Doohee
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.15 no.2 / pp.145-150 / 2015
  • Residents' need for safety is important in an urbanized society with many elderly and small households. People are looking for CPTED-based safety information systems and devices, and both the demand for and the installation of CCTV have increased steadily. However, there has been no scientific analysis of the validity, systematic planning, or location of security CCTV; devices are simply placed in the areas with the most demand. Increasing the density of CCTVs alone has limits as a way to ensure residents' safety. Crime is characterized by clustering and strong interconnectivity. In this study, two years of crime data were geocoded as exploratory spatial data, and cluster analysis and spatial statistical analysis were carried out through GIS spatial analysis, with 18 variables divided into socioeconomic, urban space, crime prevention facility, and crime occurrence indices. Nearest neighbor distance analysis and Ripley's K function show clustering of the 5 major crimes, theft, violence, and sexual violence. Criminal correlation analysis also shows strong interconnectivity among crimes. Finding a crime cluster makes it possible to locate a crime hotspot. This study therefore established the concept of a hotspot, considered techniques for selecting hotspots, and then selected hotspots for the 5 major crimes, theft, violence, and sexual violence through Nearest Neighbor Hierarchical Spatial Clustering.
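
Nearest neighbor distance analysis of the kind mentioned above is often summarized with the Clark-Evans nearest neighbor index, the ratio of the observed mean nearest-neighbor distance to the value expected under complete spatial randomness. The sketch below assumes that index with no edge correction; the abstract does not say which variant the authors used, and the coordinates are hypothetical.

```python
import math

def nearest_neighbor_index(points, area):
    """Clark-Evans ratio R: observed mean nearest-neighbor distance divided
    by the expected distance 0.5 / sqrt(n / area) for a random pattern.
    R < 1 suggests clustering, R > 1 dispersion. No edge correction."""
    n = len(points)
    observed = sum(
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ) / n
    expected = 0.5 / math.sqrt(n / area)
    return observed / expected

# Hypothetical incident locations packed into one corner of a
# 100 x 100 study area.
incidents = [(1, 1), (1, 2), (2, 1), (2, 2)]
r = nearest_neighbor_index(incidents, area=100 * 100)
```

An index well below 1, as here, indicates that incidents sit much closer together than chance would predict, which is the kind of evidence that motivates searching for hotspots.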