• Title/Summary/Keyword: data classification (데이타 분류)

Search Result 305, Processing Time 0.03 seconds

An Improved Co-training Method without Feature Split (속성분할이 없는 향상된 협력학습 방법)

  • 이창환;이소민
    • Journal of KIISE:Software and Applications
    • /
    • v.31 no.10
    • /
    • pp.1259-1265
    • /
    • 2004
  • In many applications, producing labeled data is costly and time-consuming, while an enormous amount of unlabeled data is available at little cost. It is therefore natural to ask whether we can take advantage of these unlabeled data in classification learning. In the machine learning literature, the co-training method has been widely used for this purpose. However, the current co-training method requires the entire feature set to be split into two independent sets. In this paper, we improve the current co-training method in a number of ways and propose a new co-training method that does not need the feature split. Experimental results show that our proposed method can significantly improve the performance of the current co-training algorithm.

A Sliding Window-based Multivariate Stream Data Classification (슬라이딩 윈도우 기반 다변량 스트림 데이타 분류 기법)

  • Seo, Sung-Bo;Kang, Jae-Woo;Nam, Kwang-Woo;Ryu, Keun-Ho
    • Journal of KIISE:Databases
    • /
    • v.33 no.2
    • /
    • pp.163-174
    • /
    • 2006
  • In distributed wireless sensor networks, it is difficult to transmit and analyze the entire data stream because of limited network, power, and processor resources. It is therefore appropriate to classify the continuous stream data first and then apply alternative stream data processing. We propose a classification framework for continuous multivariate stream data. The proposed approach works in two steps. In the preprocessing step, it takes a sliding window of multivariate stream data as input and discretizes the data in the window into a string of symbols that characterize the signal changes. In the classification step, it uses a standard text classification algorithm to classify the discretized data in the window. We evaluated both supervised and unsupervised classification algorithms. For supervised classification, we tested a Bayesian classifier and SVM; for unsupervised classification, we tested Jaccard, TFIDF, Jaro, and Jaro-Winkler. In our experiments, SVM and TFIDF outperformed the other classification methods. In particular, we observed that classification accuracy improves when the correlation of attributes is considered along with the n-gram tokens of symbols.
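A minimal sketch of the two-step idea in the abstract above: discretize each sliding window into symbols that mark signal changes, then compare the resulting strings as text. For brevity it handles a single attribute. The symbol alphabet (U/D/S), the threshold `eps`, the window size and step, and the use of Jaccard similarity over character n-grams are illustrative assumptions, not the authors' settings.

```python
from typing import List, Set


def discretize(window: List[float], eps: float = 0.5) -> str:
    """Map consecutive differences to symbols: U (up), D (down), S (steady)."""
    symbols = []
    for prev, cur in zip(window, window[1:]):
        delta = cur - prev
        if delta > eps:
            symbols.append("U")
        elif delta < -eps:
            symbols.append("D")
        else:
            symbols.append("S")
    return "".join(symbols)


def sliding_windows(stream: List[float], size: int, step: int) -> List[List[float]]:
    """Cut the stream into (possibly overlapping) windows of `size` samples."""
    return [stream[i:i + size] for i in range(0, len(stream) - size + 1, step)]


def ngrams(text: str, n: int = 2) -> Set[str]:
    """Set of character n-grams, used as the tokens of a symbol string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical single-attribute stream: rises, plateaus, then falls.
stream = [0.0, 1.0, 2.0, 3.0, 3.0, 3.0, 2.0, 1.0, 0.0]
strings = [discretize(w) for w in sliding_windows(stream, size=5, step=2)]
similarity = jaccard(ngrams(strings[0]), ngrams(strings[1]))
```

A window could then be assigned the label of the most similar labeled window, or the n-gram tokens could be fed to any text classifier.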

A Study on a Classification Scheme for Data Management in Distributed Environments (분산 환경하에서의 데이타관리 분류체계에 대한 연구)

  • 박주석;편흥렬
    • Proceedings of the Korean Operations and Management Science Society Conference
    • /
    • 1994.04a
    • /
    • pp.49-57
    • /
    • 1994
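  • Implementing the distributed databases needed to achieve downsizing in its true sense currently faces a number of technical problems. Suitable methods must therefore be developed to solve the technical problems of databases in distributed environments, such as concurrency control, update propagation, recovery, query processing, and catalog management. Such methods can be developed by selecting and operating alternatives through a data management classification scheme for relational databases. We establish a data management classification scheme for relational databases usable in distributed environments, dividing data into basic tables and views from the perspectives of availability, expression, and currency. Because current updates are essential for basic tables, they are classified in terms of availability and expression; views are classified by whether a physical file exists and by differences in execution timing. We then discuss how the characteristics that follow from these classification criteria can be used in building distributed databases. Specifically, we propose, as one approach, developing a distributed view update architecture that supports current materialized views as well as non-current materialized views. That is, we propose developing an efficient distributed system that can manage 100% of distributed data resources by using current views for immediate and deferred updates and non-current views for periodic updates. The significance of this paper is that it establishes, in its own way, a classification scheme for data management in relational databases under distributed environments, which database theory has not yet settled. Its application is also important in that it can mitigate, to some extent, the technical problems that currently arise in building distributed databases.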

Parallel Sorting Algorithm by Median-Median (중위수의 중위수에 의한 병렬 분류 알고리즘)

  • Min, Yong-Sik
    • The Journal of the Acoustical Society of Korea
    • /
    • v.14 no.1E
    • /
    • pp.14-21
    • /
    • 1995
  • This paper presents a parallel sorting algorithm suitable for SIMD multiprocessors. The algorithm finds pivots for partitioning the data into ordered subsets. Because it uses probability theory, the data to be sorted can be distributed evenly. For n data elements to be sorted on p processors, when $n \geq p^2$, the algorithm is shown to be asymptotically optimal. In practice, sorting 8 million data items on 64 processors achieved a 48.43-fold speedup, while PSRS achieved a 44.4-fold speedup. On a variety of shared and distributed memory machines, the algorithm achieved better than half-linear speedups.
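The abstract above does not spell out its median-of-medians pivot rule, so the sketch below substitutes regular sampling in the style of PSRS to show the general shape of pivot-partitioned sorting. It is a sequential simulation in which `p` stands in for the processor count; the function name and the fallback for small inputs are assumptions made for illustration.

```python
from bisect import bisect_right
from typing import List


def pivot_partition_sort(data: List[int], p: int) -> List[int]:
    """Sequentially simulate a p-way pivot-partitioned sort."""
    if p < 2 or len(data) < p * p:
        return sorted(data)  # too little data to sample pivots meaningfully

    # Step 1: split into p chunks and sort each (one chunk per processor).
    chunk = (len(data) + p - 1) // p
    chunks = [sorted(data[i:i + chunk]) for i in range(0, len(data), chunk)]

    # Step 2: take p regularly spaced samples from every sorted chunk.
    samples = sorted(c[(j * len(c)) // p] for c in chunks for j in range(p))

    # Step 3: choose p - 1 pivots at regular positions in the sorted sample.
    pivots = [samples[(k * len(samples)) // p] for k in range(1, p)]

    # Step 4: route each element to the bucket its value falls into.
    buckets: List[List[int]] = [[] for _ in range(p)]
    for x in data:
        buckets[bisect_right(pivots, x)].append(x)

    # Step 5: sort buckets independently; their concatenation is sorted
    # because every element of bucket i is smaller than every element of
    # bucket i + 1.
    return [x for b in buckets for x in sorted(b)]


# Hypothetical input: 200 pseudo-shuffled integers on 4 simulated processors.
values = [(i * 7919) % 1009 for i in range(200)]
result = pivot_partition_sort(values, 4)
```

In a real parallel setting, steps 1 and 5 would run concurrently on separate processors, and how evenly the buckets are filled determines the achievable speedup.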


A Co-training Method based on Classification Using Unlabeled Data (비분류표시 데이타를 이용하는 분류 기반 Co-training 방법)

  • 윤혜성;이상호;박승수;용환승;김주한
    • Journal of KIISE:Software and Applications
    • /
    • v.31 no.8
    • /
    • pp.991-998
    • /
    • 2004
  • In many practical learning problems, including those in bioinformatics, there is a small amount of labeled data along with a large pool of unlabeled data. Labeled examples are fairly expensive to obtain because they require human effort. In contrast, unlabeled examples can be gathered inexpensively without an expert. A common method for data classification and analysis with unlabeled data is co-training. This method uses a small set of labeled examples to learn a classifier in each of two views. Each classifier is then applied to all unlabeled examples, and co-training selects the examples on which each classifier makes its most confident predictions. After some iterations, new classifiers are learned on the training data and the number of labeled examples increases. In this paper, we propose a new co-training strategy using unlabeled data. We evaluate our method with two classifiers and two experimental datasets: WebKB and BIND XML data. Our experiments show that the proposed co-training technique effectively improves classification accuracy when the number of labeled examples is very small.
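The standard co-training loop described in the abstract above can be sketched as follows. This illustrates the general method only, not the authors' new strategy: the nearest-centroid base classifier, the distance-margin confidence score, the toy data, and the names `co_train`, `rounds`, and `per_round` are all assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Model = Dict[int, List[float]]


def centroids(xs: List[Vector], ys: List[int]) -> Model:
    """Mean vector per class: a stand-in for any base classifier."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(model: Model, x: Vector) -> Tuple[int, float]:
    """Return (label, confidence); confidence is the squared-distance margin."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, c)), y) for y, c in model.items()
    )
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 0.0
    return dists[0][1], margin


def co_train(view1, view2, labels, unl1, unl2, rounds=2, per_round=1):
    """Grow the labeled set with each view's most confident predictions."""
    view1, view2, labels = list(view1), list(view2), list(labels)
    pool = list(range(len(unl1)))  # indices of still-unlabeled examples
    for _ in range(rounds):
        if not pool:
            break
        m1, m2 = centroids(view1, labels), centroids(view2, labels)
        picked: Dict[int, int] = {}
        for model, unl in ((m1, unl1), (m2, unl2)):
            ranked = sorted(
                pool, key=lambda i: predict(model, unl[i])[1], reverse=True
            )
            for i in ranked[:per_round]:
                # If both views pick the same example, keep the first label.
                picked.setdefault(i, predict(model, unl[i])[0])
        for i, y in picked.items():
            view1.append(unl1[i])
            view2.append(unl2[i])
            labels.append(y)
            pool.remove(i)
    return centroids(view1, labels), centroids(view2, labels), labels


# Toy data: one feature per view, two labeled and four unlabeled examples.
m1, m2, grown = co_train(
    view1=[(0.0,), (10.0,)], view2=[(0.0,), (10.0,)], labels=[0, 1],
    unl1=[(0.5,), (9.0,), (2.0,), (8.0,)],
    unl2=[(0.5,), (9.0,), (3.0,), (7.0,)],
)
```

Each round retrains both view classifiers on the enlarged labeled set, so confident predictions from one view become training data for the other.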

Temporal Associative Classification based on Calendar Patterns (캘린더 패턴 기반의 시간 연관적 분류 기법)

  • Lee Heon Gyu;Noh Gi Young;Seo Sungbo;Ryu Keun Ho
    • Journal of KIISE:Databases
    • /
    • v.32 no.6
    • /
    • pp.567-584
    • /
    • 2005
  • Temporal data mining, the incorporation of temporal semantics into existing data mining techniques, refers to a set of techniques for discovering implicit and useful temporal knowledge from temporal data. Association rules and classification, which are typical data mining problems, are applied in various applications. However, these approaches do not consider temporal attributes and have been pursued for discovering knowledge from static data, although a large proportion of data contains a temporal dimension. Moreover, data mining research on temporal data treats the problem as discovering knowledge from data stamped with time points and adding time constraints. It therefore does not consider the temporal semantics and temporal relationships contained in the data. This paper proposes a temporal associative classification technique based on temporal class association rules. This temporal classification applies rules discovered by temporal class association rules, which extend existing associative classification with a temporal dimension, to generate temporal classification rules. The technique can therefore discover more useful knowledge than typical classification techniques.

Bio-data Classification using Modified Additive Factor Model (변형된 팩터 분석 모델을 이용한 생체데이타 분류 시스템)

  • Cho, Min-Kook;Park, Hye-Young
    • Journal of KIISE:Software and Applications
    • /
    • v.34 no.7
    • /
    • pp.667-680
    • /
    • 2007
  • Bio-data processing uses bio-signals obtained from human individuals for a given purpose. Recently, there has been increasing demand for applying bio-data in various applications. However, because of the nature of the problem domain, the number of data in each class is often limited while the number of classes is large. Conventional pattern recognition systems and classification methods therefore suffer from low generalization performance, because a system built on scarce data is strongly influenced by its noise. To solve this problem, we propose a modified additive factor model for bio-data generation with two factors: the class factor, which affects the properties of each individual, and the environment factor, such as noise, which affects all classes. We then develop a classification system by defining a new similarity function using the proposed model. The proposed method makes maximal use of class information for classification. We can therefore expect good generalization performance that is robust to noise from a small number of bio-data samples. Experimental results on real bio-data show that the proposed method significantly outperforms conventional methods.

The database construction of a classification system using an optimal cluster analysis model (최적 클러스터 분석 모델을 이용한 분류시스템의 데이터베이스 구축)

  • 이현숙
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.23 no.4
    • /
    • pp.1045-1050
    • /
    • 1998
  • Classification techniques are often an important component of intelligent systems and are used for both data preprocessing and decision making. In the design of a classification system, the labeled samples must be given to provide a priori information for the classification. Moreover, the number of classes to be categorized must be known a priori information, called OFCAM. In OFCAM, an unsupervised by OFCAM, the database of a classification system, called PCSDB, is constructed. Then, PCSDB can be effectively used in the decision process of the system.


Building a Classifier for Integrated Microarray Datasets through Two-Stage Approach (2 단계 접근법을 통한 통합 마이크로어레이 데이타의 분류기 생성)

  • Yoon, Young-Mi;Lee, Jong-Chan;Park, Sang-Hyun
    • Journal of KIISE:Databases
    • /
    • v.34 no.1
    • /
    • pp.46-58
    • /
    • 2007
  • Since microarray data capture tens of thousands of gene expression values simultaneously, they can be very useful in identifying the phenotypes of diseases. However, analyses of several microarray datasets that were produced independently with the same biological objectives can yield different results. One of the main reasons is the limited number of samples involved in a single microarray experiment. To increase classification accuracy, it is desirable to enlarge the sample size by integrating independently conducted microarray datasets and making maximal use of them. In this paper, we propose a novel two-stage approach that first integrates individual microarray datasets to overcome the problem caused by the limited number of samples and identifies informative genes, and then builds a classifier using only the informative genes. On an independent test dataset, the classifier built from the larger sample obtained by integrating independent microarray datasets achieves high accuracy (up to a 24.19% increase over the comparison methods), sensitivity, and specificity.

The Use of Linearly Transformed LANDSAT Data in Landuse Classification (선형 변환된 LANDSAT 데이타를 이용한 토지이용분류(낙동강 하구역을 중심으로))

  • 안철호;박병욱;김종인
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography
    • /
    • v.7 no.2
    • /
    • pp.85-92
    • /
    • 1989
  • The aim of this study is to find an effective combination of transformed data for the classification of particular objects with remote sensing techniques, by transforming the MSS and TM data of the LANDSAT satellite into several linearly transformed datasets. Since one of the problems in processing LANDSAT data is its sheer volume, linear transformation can be a way to analyze such data more efficiently and economically. By transforming multispectral data through linear calculation and statistical transformation, the method (1) simplifies complex data, (2) selectively processes redundant data and removes unnecessary data, and (3) emphasizes the object of the study. In this study, the data were analyzed and transformed by means of band ratioing and principal component analysis. In the classification results, infrared/red ratioing, which enhances the characterization of green vegetation, was useful for distinguishing it from other classes. For principal component analysis, bands 1, 2, and 7 were effective in classifying green vegetation.
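Band ratioing, as mentioned in the abstract above, divides one spectral band by another pixel by pixel. The sketch below is a minimal illustration on a tiny hypothetical image; the pixel values, the zero-denominator convention, and the vegetation threshold of 1.5 are assumptions, not values from the study.

```python
from typing import List

Image = List[List[float]]


def band_ratio(infrared: Image, red: Image) -> Image:
    """Pixel-wise infrared/red ratio; pixels with red == 0 are set to 0.0."""
    return [
        [ir / r if r else 0.0 for ir, r in zip(ir_row, r_row)]
        for ir_row, r_row in zip(infrared, red)
    ]


def vegetation_mask(ratio: Image, threshold: float = 1.5) -> List[List[bool]]:
    """Flag pixels whose ratio exceeds a (hypothetical) threshold."""
    return [[value > threshold for value in row] for row in ratio]


# Hypothetical 2x2 scene: green vegetation reflects strongly in infrared.
infrared = [[80.0, 20.0], [60.0, 10.0]]
red = [[20.0, 20.0], [30.0, 0.0]]
ratio = band_ratio(infrared, red)
mask = vegetation_mask(ratio)
```

Because healthy vegetation reflects much more infrared than red light, high ratios tend to separate it from other land-use classes.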
