• Title/Summary/Keyword: Clustering for High Dimensional Data


Hybrid Learning-Based Cell Morphology Profiling Framework for Classifying Cancer Heterogeneity (암의 이질성 분류를 위한 하이브리드 학습 기반 세포 형태 프로파일링 기법)

  • Min, Chanhong;Jeong, Hyuntae;Yang, Sejung;Shin, Jennifer Hyunjong
    • Journal of Biomedical Engineering Research
    • /
    • v.42 no.5
    • /
    • pp.232-240
    • /
    • 2021
  • Heterogeneity in cancer is the major obstacle to precision medicine and has become a critical issue in the field of cancer diagnosis. Many attempts have been made to disentangle this complexity through molecular classification. However, the multi-dimensional information arising from the dynamic responses of cancer poses fundamental limitations on conventional approaches based on biomolecular markers. Cell morphology, which reflects the physiological state of the cell, can be used to track the temporal behavior of cancer cells conveniently. Here, we first present a hybrid learning-based platform that extracts cell morphology in a time-dependent manner using a deep convolutional neural network to incorporate multivariate data. Feature selection is then conducted on more than 200 morphological features, filtering out less significant variables to aid interpretation. The platform then performs unsupervised clustering to unveil dynamic behavior patterns hidden in the high-dimensional dataset. As a result, we visualize the morphology state-space through a two-dimensional embedding, together with representative morphology clusters and trajectories. This hybrid learning-based cell morphology profiling strategy enables simplification of the heterogeneous cancer population.

Feature-Based Image Retrieval using SOM-Based R*-Tree

  • Shin, Min-Hwa;Kwon, Chang-Hee;Bae, Sang-Hyun
    • Proceedings of the KAIS Fall Conference
    • /
    • 2003.11a
    • /
    • pp.223-230
    • /
    • 2003
  • Feature-based similarity retrieval has become an important research issue in multimedia database systems. The features of multimedia data are useful for discriminating between multimedia objects (e.g., documents, images, video, music scores). For example, images are represented by their color histograms, texture vectors, and shape descriptors, which are usually high-dimensional data. The performance of conventional multidimensional data structures (e.g., the R-tree family, K-D-B tree, grid file, TV-tree) tends to deteriorate as the number of dimensions of the feature vectors increases. The R*-tree is the most successful variant of the R-tree. In this paper, we propose a SOM-based R*-tree as a new indexing method for high-dimensional feature vectors. The SOM-based R*-tree combines a SOM and an R*-tree to achieve search performance that scales better to high dimensionalities. Self-Organizing Maps (SOMs) provide a mapping from high-dimensional feature vectors onto a two-dimensional space. The mapping preserves the topology of the feature vectors. The map is called a topological feature map; it preserves the mutual relationships (similarity) in the feature space of the input data, clustering mutually similar feature vectors in neighboring nodes. Each node of the topological feature map holds a codebook vector. A best-matching-image-list (BMIL) holds the similar images that are closest to each codebook vector. In a topological feature map, there are empty nodes in which no image is classified. When we build an R*-tree, we use the codebook vectors of the topological feature map, which eliminates the empty nodes that cause unnecessary disk accesses and degrade retrieval performance. We experimentally compare the retrieval time cost of a SOM-based R*-tree with that of a SOM and an R*-tree using color feature vectors extracted from 40,000 images. The results show that the SOM-based R*-tree outperforms both the SOM and the R*-tree, owing to the reduction in the number of nodes required to build the R*-tree and in retrieval time cost.

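The BMIL construction described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the SOM has already been trained, and the names `build_bmil` and `non_empty_nodes` as well as the toy codebook and feature vectors are hypothetical.

```python
import math

def build_bmil(codebook, features):
    """Map each SOM node index to the ids of the images whose nearest
    codebook vector (Euclidean distance) belongs to that node."""
    bmil = {node: [] for node in range(len(codebook))}
    for image_id, vec in enumerate(features):
        best = min(range(len(codebook)), key=lambda n: math.dist(codebook[n], vec))
        bmil[best].append(image_id)
    return bmil

def non_empty_nodes(codebook, bmil):
    """Keep only nodes that hold at least one image; only these codebook
    vectors would be passed on to the R*-tree."""
    return [(codebook[n], ids) for n, ids in bmil.items() if ids]

# Toy data (assumption): three codebook vectors, four image feature vectors.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
features = [(0.1, 0.2), (0.9, 1.1), (1.2, 0.8), (0.0, 0.1)]

bmil = build_bmil(codebook, features)
entries = non_empty_nodes(codebook, bmil)
print(bmil)
print(len(entries), "non-empty nodes would be indexed")
```

In this toy run the third node attracts no image, so it is dropped before indexing, which is the effect the abstract credits for the reduced node count.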

A Big Data Analysis by Between-Cluster Information using k-Modes Clustering Algorithm (k-Modes 분할 알고리즘에 의한 군집의 상관정보 기반 빅데이터 분석)

  • Park, In-Kyoo
    • Journal of Digital Convergence
    • /
    • v.13 no.11
    • /
    • pp.157-164
    • /
    • 2015
  • This paper describes subspace clustering of categorical data for convergence and integration. Because conventional evaluation measures are designed to deal only with numerical data, they are likely to be limited when applied to categorical data by the absence of ordering, the high dimensionality, and the scarcity of frequency. Hence, a conditional entropy measure is proposed to evaluate a close approximation of the cohesion among attributes within each cluster. We propose a new objective function that reflects the optimal clustering, so that the within-cluster dispersion is minimized and the between-cluster separation is enhanced. We performed experiments on five real-world datasets, comparing the performance of our algorithm with that of four other algorithms using three evaluation metrics: accuracy, F-measure, and adjusted Rand index. According to the experiments, the proposed algorithm outperforms the algorithms considered in the evaluation on all of these metrics.
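For orientation, the baseline k-modes procedure named in the title can be sketched as below. This is the textbook version (simple matching dissimilarity plus per-attribute mode updates), not the conditional-entropy objective the paper proposes; the function names and the toy rows are assumptions.

```python
from collections import Counter

def hamming(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))

def mode_of(rows):
    """Per-attribute most frequent category of a group of rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def k_modes(rows, modes, iterations=10):
    """Alternate assignment and mode update until the modes stop changing."""
    modes = list(modes)
    labels = []
    for _ in range(iterations):
        labels = [min(range(len(modes)), key=lambda k: hamming(r, modes[k]))
                  for r in rows]
        new_modes = []
        for k, old in enumerate(modes):
            members = [r for r, lab in zip(rows, labels) if lab == k]
            new_modes.append(mode_of(members) if members else old)
        if new_modes == modes:
            break
        modes = new_modes
    return labels, modes

# Toy categorical records (assumption).
rows = [
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("blue", "large", "round"),
    ("blue", "large", "square"),
    ("red", "small", "round"),
]
labels, modes = k_modes(rows, modes=[rows[0], rows[2]])
print(labels)
print(modes)
```

Initial modes are taken from the data here for reproducibility; any other seeding strategy could be substituted.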

Design of Black Plastics Classifier Using Data Information (데이터 정보를 이용한 흑색 플라스틱 분류기 설계)

  • Park, Sang-Beom;Oh, Sung-Kwun
    • The Transactions of The Korean Institute of Electrical Engineers
    • /
    • v.67 no.4
    • /
    • pp.569-577
    • /
    • 2018
  • In this paper, a preprocessing algorithm-based black plastic classifier is designed with the aid of the information contained in the data. The slope and area of the spectrum obtained by laser-induced breakdown spectroscopy (LIBS) are analyzed for each material, and the resulting information is applied as the input data of the proposed classifier. The slope is represented by the rate of change of wavelength and intensity. The area is calculated at the wavelength of the spectrum peak where the material properties of chemical elements such as carbon and hydrogen appear. Using information such as the slope and area, the input data of the proposed classifier are constructed. In the preprocessing part of the classifier, Principal Component Analysis (PCA) and the fuzzy transform are used for dimensionality reduction from high-dimensional input variables to low-dimensional input variables. As a result, the characteristic analysis of the materials as well as the processing speed of the classifier is improved. In the condition part, FCM clustering is applied, and a linear function is used as the connection weight in the conclusion part. By means of Particle Swarm Optimization (PSO), parameters such as the number of clusters, the fuzzification coefficient, and the number of input variables are optimized. To demonstrate the superiority of the classification performance, the classification rate is compared using the WEKA 3.8 data mining software, which contains various classifiers such as Naive Bayes, SVM, and multilayer perceptron.
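The FCM clustering used in the condition part can be illustrated with the standard fuzzy c-means updates. This is a generic sketch rather than the classifier from the paper: the fuzzification coefficient is fixed at `m=2.0` here (the paper tunes it with PSO), and the sample points and initial centers are assumptions.

```python
import math

def memberships(points, centers, m=2.0):
    """u[k][i]: degree to which point k belongs to cluster i (rows sum to 1)."""
    power = 2.0 / (m - 1.0)
    u = []
    for p in points:
        d = [math.dist(p, c) for c in centers]
        if 0.0 in d:  # point coincides with a center: crisp membership
            zeros = [1.0 if x == 0.0 else 0.0 for x in d]
            u.append([z / sum(zeros) for z in zeros])
        else:
            u.append([1.0 / sum((di / dj) ** power for dj in d) for di in d])
    return u

def update_centers(points, u, m=2.0):
    """Each center is the mean of all points weighted by membership ** m."""
    dim = len(points[0])
    centers = []
    for i in range(len(u[0])):
        w = [row[i] ** m for row in u]
        total = sum(w)
        centers.append(tuple(sum(wk * p[j] for wk, p in zip(w, points)) / total
                             for j in range(dim)))
    return centers

def fcm(points, centers, m=2.0, iterations=50, tol=1e-6):
    """Alternate the two updates until the centers move less than tol."""
    for _ in range(iterations):
        u = memberships(points, centers, m)
        new_centers = update_centers(points, u, m)
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships(points, centers, m)

# Toy two-dimensional inputs, e.g. after dimensionality reduction (assumption).
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (4.0, 4.0), (4.2, 3.9), (3.9, 4.1)]
centers, u = fcm(points, centers=[(0.0, 0.0), (4.0, 4.0)])
print(centers)
print([round(row[0], 3) for row in u])
```

Unlike k-means, every point keeps a graded membership in every cluster, which is what a fuzzy rule-based classifier would consume in its condition part.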

Dual graph-regularized Constrained Nonnegative Matrix Factorization for Image Clustering

  • Sun, Jing;Cai, Xibiao;Sun, Fuming;Hong, Richang
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.11 no.5
    • /
    • pp.2607-2627
    • /
    • 2017
  • Nonnegative matrix factorization (NMF) has received considerable attention because of its effectiveness in reducing high-dimensional data and its ability to produce a parts-based image representation. Most existing NMF variants build on the assumption that the observed data are distributed on a nonlinear low-dimensional manifold. However, recent research has shown that not only the observed data but also the features lie on low-dimensional manifolds. In addition, a small amount of hard prior label information is often available and helps to uncover the intrinsic geometrical and discriminative structures of the data space. Motivated by these two aspects, we propose a novel algorithm to enhance the effectiveness of image representation, called Dual graph-regularized Constrained Nonnegative Matrix Factorization (DCNMF). The underlying philosophy of the proposed method is that it not only considers the geometric structures of the data manifold and the feature manifold simultaneously, but also mines valuable information from a few known labeled examples. These schemes improve the performance of image representation and thus enhance the effectiveness of image classification. Extensive experiments on common benchmarks demonstrate that DCNMF is superior in image classification to state-of-the-art methods.
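As background for the abstract above, plain NMF with the classic multiplicative updates can be sketched as follows. This is the unregularized baseline only: the dual graph regularizers and label constraints of DCNMF are not included, and the helper names, the seed, and the toy matrix are assumptions.

```python
import random

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frob_error(x, w, h):
    """Squared Frobenius norm of X - WH."""
    wh = matmul(w, h)
    return sum((xv - yv) ** 2 for xr, yr in zip(x, wh) for xv, yv in zip(xr, yr))

def nmf(x, rank, iterations=200, eps=1e-9, seed=0):
    """Factor nonnegative X (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T X) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h

# Toy nonnegative matrix with two obvious "parts" (assumption).
x = [[1, 1, 0, 0], [2, 2, 0, 0], [0, 0, 3, 3], [0, 0, 1, 1]]
w0, h0 = nmf(x, rank=2, iterations=0)   # same seed, so this is the starting point
w, h = nmf(x, rank=2)
err_start, err_end = frob_error(x, w0, h0), frob_error(x, w, h)
print(err_start, err_end)
```

Because the updates only multiply by nonnegative ratios, the factors stay nonnegative without any explicit projection step.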

Subspace Projection-Based Clustering and Temporal ACRs Mining on MapReduce for Direct Marketing Service

  • Lee, Heon Gyu;Choi, Yong Hoon;Jung, Hoon;Shin, Yong Ho
    • ETRI Journal
    • /
    • v.37 no.2
    • /
    • pp.317-327
    • /
    • 2015
  • A reliable analysis of consumer preference from a large amount of purchase data acquired in real time and an accurate customer characterization technique are essential for successful direct marketing campaigns. In this study, an optimal segmentation of post office customers in Korea is performed using a subspace projection-based clustering method to generate an accurate customer characterization from a high-dimensional census dataset. Moreover, a traditional temporal mining method is extended to an algorithm using the MapReduce framework for consumer preference analysis. The experimental results show that parallel mining is possible through a MapReduce-based algorithm and that the execution time of the algorithm is shorter than that of a traditional method.

A Clustering using Incremental Projection for High Dimensional Data (고차원 데이터에서 점진적 프로젝션을 이용한 클러스터링)

  • 이혜명;박영배
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2000.10a
    • /
    • pp.189-191
    • /
    • 2000
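  • Among data mining methodologies, clustering is a descriptive task that identifies groups of similar database objects based on their attribute values. However, for most algorithms, as the dimensionality of the data increases, the overall data space becomes so vast that searching for meaningful clusters becomes more difficult. Effective clustering therefore requires predicting the regions of the data space that will contain clusters. In this paper, we propose a clustering method for high-dimensional data that uses incremental projection onto each dimension. The proposed method determines candidate regions of the data space that are likely to contain clusters and searches for clusters in these regions around the mean of their points.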

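One way to read the incremental projection idea in the abstract above is sketched below: project the points onto one dimension at a time, keep only the dense intervals, and refine the surviving regions with the next dimension. The abstract does not give the actual algorithm, so the equal-width bins, the `min_count` density threshold, and all names here are assumptions.

```python
from collections import Counter

def dense_bins(values, width, min_count):
    """Indices of equal-width bins that hold at least min_count projected values."""
    counts = Counter(int(v // width) for v in values)
    return {b for b, c in counts.items() if c >= min_count}

def candidate_regions(points, width=1.0, min_count=3):
    """Project dimension by dimension; a region survives only while the points
    inside it still form a dense bin on the next dimension."""
    regions = {(): list(points)}
    for d in range(len(points[0])):
        refined = {}
        for key, members in regions.items():
            for b in dense_bins([p[d] for p in members], width, min_count):
                refined[key + (b,)] = [p for p in members if int(p[d] // width) == b]
        regions = refined
    return regions

def centroid(members):
    """Mean of the points in a candidate region, used as the cluster center."""
    return tuple(sum(c) / len(members) for c in zip(*members))

# Toy 3-D data (assumption): two groups plus two stray points.
points = [
    (0.2, 0.3, 0.4), (0.5, 0.6, 0.1), (0.8, 0.2, 0.9), (0.4, 0.7, 0.5),
    (5.1, 5.2, 5.3), (5.6, 5.4, 5.8), (5.9, 5.7, 5.2),
    (0.5, 5.5, 9.5), (9.1, 0.3, 3.2),
]
regions = candidate_regions(points)
for key, members in sorted(regions.items()):
    print(key, len(members), centroid(members))
```

The stray point (0.5, 5.5, 9.5) survives the first projection but is discarded at the second, which shows why the refinement is done incrementally rather than on each dimension in isolation.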

DATA MINING-BASED MULTIDIMENSIONAL EXTRACTION METHOD FOR INDICATORS OF SOCIAL SECURITY SYSTEM FOR PEOPLE WITH DISABILITIES

  • BATYHA, RADWAN M.
    • Journal of applied mathematics & informatics
    • /
    • v.40 no.1_2
    • /
    • pp.289-303
    • /
    • 2022
  • This article examines the multidimensional index extraction method of the disability social security system based on data mining. While creating the data warehouse of the social security system for the disabled, we need to know the elements of the social security indicators for the disabled. In this context, a clustering algorithm was used to extract the indicators of the social security system for the disabled by investigating the historical dimension of social security for the disabled. The simulation results show that the index extraction method has high coverage, sensitivity and reliability. In this paper, a multidimensional extraction method is introduced to extract the indicators of the social security system for the disabled based on data mining. The simulation experiments show that the method presented in this paper is more reliable, and the indicators of social security system for the disabled extracted are more effective in practical application.

Music Composition Using Markov Chain and Hierarchical Clustering (마르코프 체인과 계층적 클러스터링 기법을 이용한 작곡 기법)

  • Kwon, Ji-Yong;Lee, In-Kwon
    • 한국HCI학회:학술대회논문집
    • /
    • 2008.02a
    • /
    • pp.744-748
    • /
    • 2008
  • In this paper, we propose a novel technique that generates a new song from given example songs. Our system uses a k-th order Markov chain in which each state represents the notes in a measure. Because using the notes in a measure directly as a state of the Markov chain would require a very high-dimensional space, we apply a hierarchical clustering technique to the given example songs and use each cluster as a state. Each given example can then be represented as a sequence of cluster IDs, and we use these sequences as training data for the Markov chain. The resulting Markov chain effectively produces a new song similar to the given examples.

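The Markov chain step described above can be sketched as follows, assuming the hierarchical clustering has already turned each measure into a cluster ID. The function names, the order of 2, and the example ID sequences are hypothetical and not taken from the paper.

```python
import random
from collections import Counter, defaultdict

def train(sequences, order):
    """Count how often each cluster ID follows each context of `order` IDs."""
    table = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - order):
            table[tuple(seq[i:i + order])][seq[i + order]] += 1
    return table

def generate(table, start, length, seed=0):
    """Sample a new sequence of cluster IDs, one measure at a time."""
    rng = random.Random(seed)
    out = list(start)
    order = len(start)
    while len(out) < length:
        counts = table.get(tuple(out[-order:]))
        if not counts:
            break  # context never seen in the training songs: stop early
        states, weights = zip(*counts.items())
        out.append(rng.choices(states, weights=weights)[0])
    return out

# Two example songs as sequences of measure-cluster IDs (assumption).
songs = [[0, 1, 2, 1, 0, 1, 2, 3], [0, 1, 2, 3, 0, 1, 2, 1]]
table = train(songs, order=2)
new_song = generate(table, start=(0, 1), length=8)
print(new_song)
```

Turning the generated IDs back into notes would require picking a representative measure from each cluster, a step this sketch leaves out.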

A Method of Extracting Features of Sensor-only Facilities for Autonomous Cooperative Driving

  • Hyung Lee;Chulwoo Park;Handong Lee;Sanyeon Won
    • Journal of the Korea Society of Computer and Information
    • /
    • v.28 no.12
    • /
    • pp.191-199
    • /
    • 2023
  • In this paper, we propose a method to extract the features of five sensor-only facilities, built as infrastructure for autonomous cooperative driving, from point cloud data acquired by LiDAR. Data acquired by the image sensors installed in autonomous vehicles are inconsistent because of the climatic environment and camera characteristics, so a LiDAR sensor was used in their place. In addition, high-intensity reflectors were designed and attached to each facility to make it easier to distinguish it from other existing facilities with LiDAR. From the five sensor-only facilities developed and the point cloud data acquired by the data acquisition system, feature points were extracted based on the average reflective intensity of the high-intensity reflective paper attached to the facility, clustered by the DBSCAN method, and converted to two-dimensional coordinates by a projection method. The features of the facility at each distance consist of three-dimensional point coordinates, two-dimensional projected coordinates, and reflection intensity, and will be used as training data for a facility recognition model to be developed in the future.
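The extraction steps in the abstract above (intensity filtering, DBSCAN clustering, projection to two dimensions) can be sketched as below. This is an illustration under stated assumptions, not the authors' pipeline: the intensity threshold, `eps`, `min_pts`, the choice to project by dropping the x coordinate, and the sample point cloud are all hypothetical.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id >= 0, or -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels

def extract_features(cloud, min_intensity, eps=0.5, min_pts=3):
    """Keep high-intensity returns, cluster them, and project them to 2-D."""
    bright = [(x, y, z) for x, y, z, intensity in cloud if intensity >= min_intensity]
    labels = dbscan(bright, eps, min_pts)
    projected = [(y, z) for _, y, z in bright]  # assumption: drop the x (range) axis
    return bright, labels, projected

# Toy point cloud of (x, y, z, intensity) tuples (assumption).
cloud = [
    (10.0, 0.0, 1.0, 200), (10.1, 0.1, 1.0, 210), (10.0, 0.2, 1.1, 190), (10.1, 0.0, 1.2, 205),
    (10.0, 3.0, 1.0, 220), (10.1, 3.1, 1.1, 215), (10.0, 3.2, 1.0, 225),
    (10.0, 1.5, 1.0, 30), (12.0, 0.0, 0.0, 20),
    (15.0, -4.0, 2.0, 230),
]
bright, labels, projected = extract_features(cloud, min_intensity=150)
print(labels)
print(projected[0])
```

The isolated bright return is labeled -1, so a stray reflection does not become a facility feature even though it passes the intensity filter.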