• Title/Summary/Keyword: over-clustering

Deconstructing Opinion Survey: A Case Study

  • Alanazi, Entesar
    • International Journal of Computer Science & Network Security / v.21 no.4 / pp.52-58 / 2021
  • Questionnaires and surveys are increasingly used to collect information from participants in empirical software engineering studies. Such data is usually analyzed with statistical methods to give an overall picture of participants' agreement or disagreement. In general, the whole survey population is treated as one group, with some methods used to extract variation. Sometimes different opinions exist within the same group but are not well discovered. In some analyses, the population is divided into subgroups according to some data, yet the opinions of different segments of the population may be the same. Even though the existing approach can capture general trends, there is a risk that the opinions of different subgroups are lost. The problem becomes more complex in longitudinal studies, where minority opinions might fade over time. Longitudinal survey data may include several interesting patterns that can be extracted using a clustering process, which can discover new information and draw attention to different opinions. We suggest using a data mining approach to find the diversity among the different groups in longitudinal studies. Our study shows that diversity can be revealed and tracked over time using the clustering approach, so that minorities have an opportunity to be heard.

Extended Kepler Grid-based System for Diabetes Study Workspace

  • Hazemi, Fawaz Al;Youn, Chan-Hyun
    • Proceedings of the Korea Information Processing Society Conference / 2011.04a / pp.230-233 / 2011
  • Chronic disease is linked to a patient's lifestyle, so a doctor has to monitor each patient over time. This may involve reviewing many reports, finding any changes, and modifying several treatments. One way to reduce this burden is a timeline-based visualization tool in which all reports and medications are integrated in a problem-centric, time-based style, enabling the doctor to predict and adjust the treatment plan. This solution was proposed by Bui et al. [2] to observe the medical history of a patient. However, it was limited in studying a diabetes patient's history to find the cause of the current development in the patient's condition and, moreover, to predict the implications of the current state of a diabetes-related factor (such as fat, cholesterol, or potassium). In this paper, we propose a Grid-based Interactive Diabetes System (GIDS) to support bioinformatics analysis applications for diabetes. GIDS uses an agglomerative clustering algorithm as its primary clustering correlation algorithm to focus the medical researcher on findings that help predict the implications for the diabetes patient under study. The algorithm is Chronological Clustering, proposed by P. Legendre [11] [12].
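
The idea of agglomerative clustering restricted to time order can be illustrated with a minimal sketch. It is not the GIDS implementation and not Legendre's full method, which decides each fusion with a permutation test; here that test is replaced by a simple assumption: merge the two time-adjacent segments whose means are closest until a requested number of segments remains. The function name and the sample readings are hypothetical.

```python
def chronological_clusters(values, n_clusters):
    """Merge only time-adjacent segments until n_clusters remain.

    Simplified stand-in for chronological clustering: the fusion
    criterion is the gap between neighboring segment means (an
    assumption), not a permutation test.
    """
    if not 1 <= n_clusters <= len(values):
        raise ValueError("n_clusters must be between 1 and len(values)")
    segments = [[v] for v in values]  # one segment per time point
    while len(segments) > n_clusters:
        means = [sum(s) / len(s) for s in segments]
        # Distance between each pair of neighboring segments.
        gaps = [abs(means[i + 1] - means[i]) for i in range(len(means) - 1)]
        i = gaps.index(min(gaps))  # closest adjacent pair
        segments[i:i + 2] = [segments[i] + segments[i + 1]]
    return segments


# Made-up readings of one factor over eight consecutive visits.
readings = [5.1, 5.0, 5.2, 7.9, 8.1, 8.0, 6.0, 6.1]
phases = chronological_clusters(readings, 3)
```

Because only neighbors can merge, every resulting segment is a contiguous stretch of the patient's timeline, which is what makes the clusters readable as phases of the history.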

Deconstructing Agile Survey to Identify Agile Skeptics

  • Entesar Alanazi;Mohammad Mahdi Hassan
    • International Journal of Computer Science & Network Security / v.24 no.3 / pp.201-210 / 2024
  • In empirical software engineering research, questionnaires and surveys are increasingly used to collect information from practitioners. Typically, such data is then analyzed using overall, descriptive statistics. These analyses treat the whole survey population as a single group, with some sampling techniques used to extract variation. In some cases, the population is also partitioned into subgroups based on background information. However, this does not properly reveal opinion diversity, because similar opinions can exist in different segments of the population, whereas people within the same group might hold different opinions. Even though the existing approach can capture general trends, there is a risk that the opinions of different subgroups are lost. The problem becomes more complex in longitudinal studies, where minority opinions might fade or resolve over time. Survey-based longitudinal data may contain patterns that can be extracted through a clustering process, which may reveal new information and draw attention to alternative perspectives. We suggest using a data mining approach to find the diversity among the different groups in longitudinal studies (agile skeptics). Our study shows that diversity can be revealed and tracked over time using a clustering approach, so that minorities have an opportunity to be heard.

Detection of Differentially Expressed Genes by Clustering Genes Using Class-Wise Averaged Data in Microarray Data

  • Kim, Seung-Gu
    • Communications for Statistical Applications and Methods / v.14 no.3 / pp.687-698 / 2007
  • A normal mixture model that incorporates dependence between classes is proposed to detect differentially expressed genes. Gene clustering approaches suffer from the high column dimension of the microarray expression data matrix, which leads to over-fitting. Various methods have been proposed to solve this problem. In this paper, simple averaging of the data within each class is proposed to overcome the problems caused by high dimensionality when the normal mixture model is fitted. Experiments on a simulated data set and a real data set demonstrate its practical applicability.
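
The class-wise averaging step can be sketched as follows. This is only an illustration of the dimensionality reduction described in the abstract, assuming a genes-by-samples matrix stored as nested lists; the function name and the toy values are hypothetical, and the mixture-model fitting itself is not shown.

```python
from collections import defaultdict


def classwise_average(expression, class_labels):
    """Collapse each gene's samples into one mean value per class.

    expression:   list of rows, one row of sample values per gene
    class_labels: class of each sample (column), same length as a row
    Returns the sorted class names and the reduced genes-by-classes matrix.
    """
    classes = sorted(set(class_labels))
    reduced = []
    for gene_row in expression:
        sums = defaultdict(float)
        counts = defaultdict(int)
        for value, label in zip(gene_row, class_labels):
            sums[label] += value
            counts[label] += 1
        reduced.append([sums[c] / counts[c] for c in classes])
    return classes, reduced


# Toy data: 2 genes, 4 samples, 2 classes (values are made up).
expression = [[1.0, 3.0, 10.0, 12.0],
              [2.0, 2.0, 2.0, 4.0]]
labels = ["A", "A", "B", "B"]
classes, averaged = classwise_average(expression, labels)
```

Each gene is then described by as many numbers as there are classes rather than samples, which is the reduction the abstract relies on before fitting the mixture model.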

Top-down Hierarchical Clustering using Multidimensional Indexes (다차원 색인을 이용한 하향식 계층 클러스터링)

  • Hwang, Jae-Jun;Mun, Yang-Se;Hwang, Gyu-Yeong
    • Journal of KIISE:Databases / v.29 no.5 / pp.367-380 / 2002
  • Due to the recent increase in applications that require huge amounts of data, such as spatial data analysis and image analysis, clustering on large databases has been actively studied. In a hierarchical clustering method, a tree representing a hierarchical decomposition of the database is first created and then used for efficient clustering. Existing hierarchical clustering methods have mainly adopted the bottom-up approach, which creates the tree from the bottom to the topmost level of the hierarchy. These bottom-up methods require at least one scan over the entire database to build the tree and need to search most nodes of the tree, since the clustering algorithm starts from the leaf level. In this paper, we propose a novel top-down hierarchical clustering method that uses multidimensional indexes, which are already maintained in most database applications. Multidimensional indexes generally have a clustering property: they store similar objects in the same (or adjacent) data pages. Using this property, we can find adjacent objects without calculating the distances among them. We first formally define a cluster based on the density of objects. For this definition, we propose the concept of the region contrast partition, which is based on the density of the region. To speed up the clustering algorithm, we use a branch-and-bound algorithm; we propose the bounds and formally prove their correctness. Experimental results show that the proposed method is at least as effective in clustering quality as BIRCH, a bottom-up hierarchical clustering method, while reducing the number of page accesses by a factor of 26 to 187, depending on the size of the database. As a result, we believe that the proposed method significantly improves clustering performance in large databases and is practically usable in various database applications.

Determining on Model-based Clusters of Time Series Data (시계열데이터의 모델기반 클러스터 결정)

  • Jeon, Jin-Ho;Lee, Gye-Sung
    • The Journal of the Korea Contents Association / v.7 no.6 / pp.22-30 / 2007
  • Most real-world systems, such as the world economy, the stock market, and medical applications, contain a series of dynamic and complex phenomena. A common way to understand these systems is to build a model and analyze the system's behavior. In this paper, we investigate methods for best clustering time series data. As a first step, a BIC (Bayesian Information Criterion) approximation is used to determine the number of clusters. A search technique that improves clustering efficiency is also suggested, based on an analysis of the relationship between data size and BIC values. For clustering, two kinds of methods, model-based and similarity-based, are analyzed and compared. A number of experiments were performed on real data (stock prices) to check validity. The experiments confirm that the BIC approximation suggests the best number of clusters, provided that the amount of data is relatively large. They also confirm that model-based clustering produces more reliable clusters than similarity-based clustering.
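
Choosing the number of clusters with BIC can be sketched in a few lines. The sketch assumes the common Schwarz form, BIC = -2 ln L + p ln n, where lower is better; the paper's own BIC approximation and sign convention may differ. The function names are hypothetical, and the log-likelihood values are invented for illustration, not results from the paper.

```python
import math


def bic(log_likelihood: float, n_params: int, n_samples: int) -> float:
    """Schwarz form of BIC; with this sign convention, lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)


def best_cluster_count(fits: dict[int, tuple[float, int]], n_samples: int) -> int:
    """fits maps k -> (maximized log-likelihood, number of free parameters)."""
    return min(fits, key=lambda k: bic(fits[k][0], fits[k][1], n_samples))


# Hypothetical fits for k = 1, 2, 3 clusters on 200 observations.
fits = {1: (-520.0, 2), 2: (-450.0, 5), 3: (-448.0, 8)}
chosen_k = best_cluster_count(fits, n_samples=200)
```

In this made-up example the third cluster raises the likelihood only slightly, so the ln n penalty on its extra parameters outweighs the gain and the two-cluster model is selected.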

A Dual-layer Energy Efficient Distributed Clustering Algorithm for Wireless Sensor Networks (무선 센서 네트워크를 위한 에너지 효율적인 이중 레이어 분산 클러스터링 기법)

  • Yeo, Myung-Ho;Kim, Yu-Mi;Yoo, Jae-Soo
    • Journal of KIISE:Databases / v.35 no.1 / pp.84-95 / 2008
  • Wireless sensor networks have recently emerged as a platform for several applications. By deploying wireless sensor nodes and constructing a sensor network, we can remotely obtain information about the behavior, conditions, and positions of objects in a region. Since sensor nodes operate on batteries, energy-efficient mechanisms for gathering sensor data are indispensable for prolonging the lifetime of a sensor network as much as possible. In this paper, we propose a novel clustering algorithm that distributes the energy consumption of a cluster head. First, we analyze the energy consumption of cluster heads and divide each cluster into a collection layer and a transmission layer according to their roles. Then, we elect a cluster head for each layer to distribute the energy consumption of a single cluster head. To show the superiority of our clustering algorithm, we compare it with existing clustering algorithms in terms of sensor network lifetime. Our experimental results show that the proposed clustering algorithm achieves about 10% to 40% performance improvement over the existing clustering algorithms.

A Data Mining Procedure for Unbalanced Binary Classification (불균형 이분 데이터 분류분석을 위한 데이터마이닝 절차)

  • Jung, Han-Na;Lee, Jeong-Hwa;Jun, Chi-Hyuck
    • Journal of Korean Institute of Industrial Engineers / v.36 no.1 / pp.13-21 / 2010
  • Predicting customers' contract cancellation is essential for insurance companies, but it is a difficult problem because the customer database is large and the target (cancelled) customers make up a small proportion of it. This paper proposes a new data mining approach to binary classification on large-scale unbalanced data. Over-sampling, clustering, regularized logistic regression, and boosting are incorporated in the proposed approach. The approach was applied to a real insurance data set, and the results were compared with those of other classification techniques.
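
As an illustration of the over-sampling step only, the sketch below balances a binary data set by duplicating randomly chosen minority rows. The abstract does not say which over-sampling variant the authors use, so plain random over-sampling is an assumption here, and the function name and toy data are hypothetical.

```python
import random


def oversample_minority(rows, labels, minority_label=1, seed=0):
    """Duplicate random minority rows until both classes have equal counts.

    Random over-sampling is assumed for illustration; the original
    rows are kept and the sampled copies are appended at the end.
    """
    rng = random.Random(seed)
    minority = [r for r, y in zip(rows, labels) if y == minority_label]
    if not minority:
        raise ValueError("no minority examples to sample from")
    majority_count = sum(1 for y in labels if y != minority_label)
    extra = max(0, majority_count - len(minority))
    sampled = [rng.choice(minority) for _ in range(extra)]
    return list(rows) + sampled, list(labels) + [minority_label] * extra


# Toy data: 8 retained customers (label 0) and 2 cancellations (label 1).
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = oversample_minority(rows, labels)
```

Over-sampling should be applied to the training split only, so that duplicated minority rows never leak into the data used for evaluation.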

An Energy-Efficient Clustering Scheme in Underwater Acoustic Sensor Networks (수중음향 센서 네트워크에서 효율적인 저전력 군집화 기법)

  • Lee, Jae-Hun;Seo, Bo-Min;Cho, Ho-Shin
    • The Journal of the Acoustical Society of Korea / v.33 no.5 / pp.341-350 / 2014
  • In this paper, an energy-efficient clustering scheme using a self-organization method is proposed. The proposed scheme selects a cluster head by considering not only the number of neighbor nodes but also the residual battery level. In addition, network lifetime is extended by re-selecting cluster heads only when the current cluster head's residual energy falls below a certain threshold. Accordingly, energy consumption is evenly distributed over all network nodes. The cluster head delivers the data collected from member nodes to a sink node by multi-hop relaying. To evaluate the proposed scheme, we run computer simulations and compare it with LEACH, one of the most popular network clustering schemes, in terms of the total residual battery, the number of alive nodes after a certain amount of time, the accumulated energy cost of network configuration, and the deviation of energy consumption across all nodes. Numerical results show that the proposed scheme achieves twice the network lifetime of LEACH and distributes energy consumption much more evenly over the entire network.
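
The two rules in this abstract, head selection from neighbor count plus residual battery and re-selection only below an energy threshold, can be sketched as follows. The abstract gives no formula, so the equally weighted score, the threshold value, and all names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    node_id: int
    neighbors: int  # number of one-hop neighbor nodes
    energy: float   # residual battery, normalized to [0, 1]


def head_score(node: Node, max_neighbors: int, weight: float = 0.5) -> float:
    """Hypothetical score mixing neighbor density and residual energy."""
    density = node.neighbors / max_neighbors if max_neighbors else 0.0
    return weight * density + (1.0 - weight) * node.energy


def elect_head(nodes: list[Node]) -> Node:
    """Pick the node with the highest combined score."""
    max_neighbors = max(n.neighbors for n in nodes)
    return max(nodes, key=lambda n: head_score(n, max_neighbors))


def maybe_reelect(current: Node, nodes: list[Node], threshold: float = 0.2) -> Node:
    """Keep the current head unless its residual energy is below the threshold."""
    if current.energy >= threshold:
        return current
    return elect_head(nodes)


cluster = [Node(1, 4, 0.9), Node(2, 8, 0.3), Node(3, 6, 0.8)]
head = elect_head(cluster)                # node 3 under the assumed score
same_head = maybe_reelect(head, cluster)  # energy 0.8 >= 0.2, no re-election
head.energy = 0.1                         # the head drains its battery
new_head = maybe_reelect(head, cluster)   # below threshold, re-elect
```

Skipping re-election while the head is still healthy avoids the configuration messages that a periodic rotation would spend, which is the saving the abstract attributes to the threshold rule.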

Keyphrase Extraction Using Active Learning and Clustering (Active Learning과 군집화를 이용한 고정키어구 추출)

  • Lee, Hyun-Woo;Cha, Jeong-Won
    • MALSORI / no.66 / pp.87-103 / 2008
  • We describe a new active learning method for keyphrase extraction in the conditional random fields (CRFs) framework. To reduce annotation effort, we use diversity and representativeness measures. We select high-diversity training candidates by sentence confidence value, and high-representativeness candidates by clustering the part-of-speech patterns of contexts. In experiments on a dialog corpus, our method achieves 86.80% while saving 88% of the training corpus compared with the supervised method. The experimental results show that the proposed method improves on previous methods. Additionally, the proposed method can easily be applied to other applications, since its implementation is application-independent.
