• Title/Abstract/Keyword: Data Sets

3,740 search results (processing time: 0.035 seconds)

Cross platform classification of microarrays by rank comparison

  • Lee, Sunho
    • Journal of the Korean Data and Information Science Society
    • /
    • Vol. 26, No. 2
    • /
    • pp.475-486
    • /
    • 2015
  • Mining the microarray data accumulated in public data repositories can save experimental cost and time and provide valuable biomedical information. Big data analysis that pools multiple data sets increases statistical power, improves the reliability of the results, and reduces the bias specific to an individual study. However, integrating several data sets from different studies requires dealing with many problems. In this study, I limit the focus to cross-platform classification, in which the platform of a testing sample differs from that of the training set, and suggest a simple classification method based on rank. This method is compared with diagonal linear discriminant analysis, the k-nearest neighbor method, and the support vector machine using cross-platform real example data sets of two cancers.
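A minimal sketch of the rank-based idea: within-sample ranks remove platform-specific scale and location effects, after which any simple classifier can be applied. The nearest-centroid rule and the toy data below are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def rank_transform(x):
    # Replace raw expression values by their within-sample ranks;
    # ranks are comparable across platforms even when raw scales differ.
    order = np.argsort(x)
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    return ranks

def rank_classify(train_X, train_y, test_x):
    # Nearest-centroid classification in rank space (a plausible
    # instance of rank-based classification, chosen for brevity).
    X = np.asarray(train_X, dtype=float)
    y = np.asarray(train_y)
    R = np.apply_along_axis(rank_transform, 1, X)
    classes = sorted(set(train_y))
    centroids = {c: R[y == c].mean(axis=0) for c in classes}
    r = rank_transform(np.asarray(test_x, dtype=float))
    return min(classes, key=lambda c: np.linalg.norm(r - centroids[c]))
```

Because only the ordering of genes within a sample matters, a test sample measured on a different platform can be ranked and classified directly against rank centroids of the training set.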

웨이블릿에 기반한 시그널 형태를 지닌 대형 자료의 feature 추출 방법 (A Wavelet based Feature Selection Method to Improve Classification of Large Signal-type Data)

  • 장우성;장우진
    • 대한산업공학회지
    • /
    • Vol. 32, No. 2
    • /
    • pp.133-140
    • /
    • 2006
  • Large signal-type data sets are difficult to classify, especially if the data sets are non-stationary. In this paper, large signal-type, non-stationary data sets are wavelet transformed so that distinct features of the data are extracted in the wavelet domain rather than in the time domain. For the classification of the data, a few wavelet coefficients representing class properties are fed into statistical classification methods: linear discriminant analysis, quadratic discriminant analysis, neural networks, etc. The application of our wavelet-based feature selection method to a mass spectrometry data set for ovarian cancer diagnosis resulted in 100% classification accuracy.
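The core of such a method can be sketched with a hand-rolled Haar transform: fully decompose the signal, then keep the few largest-magnitude coefficients as features. The Haar wavelet and the largest-magnitude selection rule are assumptions for illustration; the paper's actual wavelet basis and selection criterion may differ.

```python
import numpy as np

def haar_step(x):
    # One level of the orthonormal Haar transform: pairwise scaled
    # averages (approximation) and differences (detail).
    a = (x[0::2] + x[1::2]) / np.sqrt(2)
    d = (x[0::2] - x[1::2]) / np.sqrt(2)
    return a, d

def haar_features(signal, n_features):
    # Fully decompose a power-of-two-length signal, then keep the
    # largest-magnitude coefficients as classification features.
    coeffs = []
    a = np.asarray(signal, dtype=float)
    while len(a) > 1:
        a, d = haar_step(a)
        coeffs.extend(d)
    coeffs.extend(a)
    coeffs = np.array(coeffs)
    idx = np.argsort(-np.abs(coeffs))[:n_features]
    return idx, coeffs[idx]
```

The selected coefficients (rather than thousands of raw time-domain points) would then be passed to LDA, QDA, or a neural network.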

A Novel Reversible Data Hiding Scheme for VQ-Compressed Images Using Index Set Construction Strategy

  • Qin, Chuan;Chang, Chin-Chen;Chen, Yen-Chang
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • Vol. 7, No. 8
    • /
    • pp.2027-2041
    • /
    • 2013
  • In this paper, we propose a novel reversible data hiding scheme for the index tables of vector quantization (VQ) compressed images based on an index set construction strategy. On the sender side, three index sets are constructed, in which the first set and the second set include the indices with greater and lesser occurrence counts in the given VQ index table, respectively. The index values in the index table belonging to the second set are prefixed with values from the third set to eliminate collisions with the two mapping sets derived from the first set, and this prefixing operation additionally provides data hiding capacity. The main data embedding procedure is achieved simply by mapping the index values in the first set to the corresponding values in the two derived mapping sets. The same three index sets, reconstructed on the receiver side, ensure the correctness of secret data extraction and the lossless recovery of the index table. Experimental results demonstrate the effectiveness of the proposed scheme.
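A toy version of the index-mapping idea: each occurrence of a frequent index carries one secret bit by being remapped into one of two disjoint "derived" code ranges, and the receiver inverts the mapping losslessly. This sketch ignores the paper's prefix-based collision handling for the second set; the code ranges and padding rule are illustrative assumptions.

```python
from collections import Counter

def embed(index_table, bits, k):
    # S1 holds the k most frequent indices; codes >= base form the two
    # interleaved derived mapping sets (even offset: bit 0, odd: bit 1).
    base = max(index_table) + 1
    s1 = [i for i, _ in Counter(index_table).most_common(k)]
    pos = {v: p for p, v in enumerate(s1)}
    out, bit_iter = [], iter(bits)
    for v in index_table:
        if v in pos:
            b = next(bit_iter, 0)        # pad with 0 when bits run out
            out.append(base + 2 * pos[v] + b)
        else:
            out.append(v)                # stays below base: no collision
    return out, base, s1

def extract(stego, base, s1):
    # Recover both the secret bits and the original index table.
    bits, table = [], []
    for v in stego:
        if v >= base:
            code = v - base
            bits.append(code % 2)
            table.append(s1[code // 2])
        else:
            table.append(v)
    return table, bits
```

Reversibility holds because unmapped indices stay strictly below `base` while mapped codes lie at or above it, so the two cases never collide.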

유사도 측정 데이터 셋과 쓰레숄드 (Practical Datasets for Similarity Measures and Their Threshold Values)

  • 양병주;심준호
    • 한국전자거래학회지
    • /
    • Vol. 18, No. 1
    • /
    • pp.97-105
    • /
    • 2013
  • When handling massive volumes of e-commerce data objects, similarity measures that find identical or similar objects are important. Measuring similarity between objects involves comparing similarity values over pairs of objects, so it takes longer as the number of objects grows. Several recent studies on similarity measures have proposed techniques to perform this more efficiently and have evaluated their performance on real data sets. This paper analyzes the characteristics of the data sets used in those studies and the meaning of the threshold values used in their experiments. This analysis serves as a reference baseline for performance evaluation experiments on new similarity measurement techniques.
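The baseline that such studies optimize can be stated in a few lines: an all-pairs similarity join that reports pairs whose Jaccard similarity meets a threshold. Real systems prune most pairs first (e.g. with prefix filtering); the naive loop and the sample records below are for illustration only.

```python
def jaccard(a, b):
    # Jaccard similarity of two token sets: |A ∩ B| / |A ∪ B|.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def similar_pairs(records, threshold):
    # Naive O(n^2) similarity join; the threshold decides which
    # pairs count as "similar", which is why its value matters.
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if jaccard(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

Note how sensitive the output is to the threshold: at 0.5 two records sharing half their tokens already match, while at 0.9 only near-duplicates survive, which is exactly the experimental knob the paper examines.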

퍼지-Rough 집합에 관한 연구 (A Study on Fuzzy-Rough sets)

  • 정구범;김명순
    • 한국컴퓨터정보학회논문지
    • /
    • Vol. 1, No. 1
    • /
    • pp.183-188
    • /
    • 1996
  • The fuzzy set introduced by Zadeh is a concept that enables the processing of, and reasoning over, vague information by means of a membership function. The concept of the rough set was introduced by Pawlak; it enables the classification, reduction, and approximate reasoning of data that are hard to discriminate. Pawlak compared fuzzy sets and rough sets as distinct concepts and defined them as impossible to combine. The purpose of this paper, contrary to Pawlak's definition, is to establish the concept of the fuzzy-rough set, which combines fuzzy sets and rough sets by applying the membership function of fuzzy sets to rough sets.

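One standard way to apply a fuzzy membership function to rough approximations (the Dubois-Prade formulation, which may differ in detail from this paper's construction) takes the infimum of the membership over each equivalence class for the lower approximation and the supremum for the upper:

```python
def fuzzy_rough_approx(universe, mu, partition):
    # mu: fuzzy membership of each element; partition: the crisp
    # equivalence class (block label) of each element.
    # Lower approximation = inf of mu over an element's class,
    # upper approximation = sup of mu over that class.
    lower, upper = {}, {}
    for x in universe:
        members = [y for y in universe if partition[y] == partition[x]]
        lower[x] = min(mu[y] for y in members)
        upper[x] = max(mu[y] for y in members)
    return lower, upper
```

Elements that are indiscernible (same class) but have different fuzzy memberships end up with a lower approximation below their upper one, which is exactly the rough "boundary" expressed in fuzzy terms.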

Locality-Sensitive Hashing for Data with Categorical and Numerical Attributes Using Dual Hashing

  • Lee, Keon Myung
    • International Journal of Fuzzy Logic and Intelligent Systems
    • /
    • Vol. 14, No. 2
    • /
    • pp.98-104
    • /
    • 2014
  • Locality-sensitive hashing techniques have been developed to efficiently handle nearest neighbor searches and similar-pair identification problems for large volumes of high-dimensional data. This study proposes a locality-sensitive hashing method that can be applied to nearest neighbor search problems for data sets containing both numerical and categorical attributes. The proposed method makes use of dual hashing functions, where one function is dedicated to numerical attributes and the other to categorical attributes. The method consists of creating indexing structures for each of the dual hashing functions, gathering and combining the candidate sets, and thoroughly examining them to determine the nearest neighbors. The proposed method is evaluated on a few synthetic data sets, and the results show that it improves performance in cases of large amounts of data with both numerical and categorical attributes.
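A sketch of the dual-hashing structure: a random-projection bucket hash (E2LSH style) for the numerical part, a plain hash for the categorical part, and candidate generation as the union of both buckets. The specific hash choices, bucket width, and combination rule are assumptions; the paper's exact functions may differ.

```python
import random

def numeric_hash(vec, a, b, w):
    # E2LSH-style bucket index for numerical attributes:
    # floor((a . v + b) / w), so nearby vectors tend to share buckets.
    return int((sum(ai * vi for ai, vi in zip(a, vec)) + b) // w)

def categorical_hash(attrs):
    # Exact-match hash of the categorical attributes.
    return hash(tuple(sorted(attrs)))

class DualLSH:
    def __init__(self, dim, w=4.0, seed=42):
        rng = random.Random(seed)
        self.a = [rng.gauss(0, 1) for _ in range(dim)]
        self.b = rng.uniform(0, w)
        self.w = w
        self.num_index, self.cat_index, self.data = {}, {}, []

    def insert(self, num, cat):
        i = len(self.data)
        self.data.append((num, cat))
        self.num_index.setdefault(
            numeric_hash(num, self.a, self.b, self.w), set()).add(i)
        self.cat_index.setdefault(categorical_hash(cat), set()).add(i)

    def candidates(self, num, cat):
        # Combine the candidate sets from both indexing structures;
        # a full search would then rank these candidates exactly.
        c1 = self.num_index.get(
            numeric_hash(num, self.a, self.b, self.w), set())
        c2 = self.cat_index.get(categorical_hash(cat), set())
        return c1 | c2
```

Only the small combined candidate set is examined exactly, which is where the speedup over a linear scan comes from.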

LS-SVM for large data sets

  • Park, Hongrak;Hwang, Hyungtae;Kim, Byungju
    • Journal of the Korean Data and Information Science Society
    • /
    • Vol. 27, No. 2
    • /
    • pp.549-557
    • /
    • 2016
  • In this paper we propose a multiclass classification method for large data sets that ensembles least squares support vector machines (LS-SVMs) trained on principal components instead of raw input vectors. We use the revised one-vs-all method for multiclass classification, a voting scheme based on combining several binary classifications. The revised one-vs-all method is performed using the hat matrix of the LS-SVM ensemble, which is obtained by ensembling LS-SVMs trained on random samples drawn from the whole large training data set. The leave-one-out cross validation (CV) function is used to find the optimal values of the hyperparameters that affect the performance of the multiclass LS-SVM ensemble. We present a generalized cross validation function to reduce the computational burden of the leave-one-out CV function. Experimental results on real data sets illustrate the performance of the proposed multiclass LS-SVM ensemble.
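The building block can be shown concretely: a single LS-SVM is trained by solving one linear system rather than a quadratic program, and plain one-vs-all voting turns binary machines into a multiclass classifier. The RBF kernel, the fixed hyperparameters, and the simple (non-revised, non-ensembled) voting here are illustrative assumptions.

```python
import numpy as np

def lssvm_train(X, y, gamma=1.0, sigma=1.0):
    # Solve the LS-SVM dual system with an RBF kernel:
    # [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y].
    X = np.asarray(X, dtype=float)
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * sigma ** 2))
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate([[0.0], y]))
    return sol[0], sol[1:]            # bias b, dual weights alpha

def lssvm_decision(X, x, b, alpha, sigma=1.0):
    X = np.asarray(X, dtype=float)
    x = np.asarray(x, dtype=float)
    k = np.exp(-((X - x) ** 2).sum(-1) / (2 * sigma ** 2))
    return float(alpha @ k + b)

def one_vs_all_predict(X, y, x, classes, gamma=1.0, sigma=1.0):
    # Standard one-vs-all voting: one binary LS-SVM per class,
    # predict the class with the largest decision value.
    scores = {}
    for c in classes:
        yc = np.where(np.array(y) == c, 1.0, -1.0)
        b, a = lssvm_train(X, yc, gamma, sigma)
        scores[c] = lssvm_decision(X, x, b, a, sigma)
    return max(scores, key=scores.get)
```

Because training reduces to a linear solve, subsampling the large training set and ensembling the resulting machines, as the paper does, keeps each solve small.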

Frequency Matrix 기법을 이용한 결측치 자료로부터의 개인신용예측 (Predicting Personal Credit Rating with Incomplete Data Sets Using Frequency Matrix technique)

  • 배재권;김진화;황국재
    • Journal of Information Technology Applications and Management
    • /
    • Vol. 13, No. 4
    • /
    • pp.273-290
    • /
    • 2006
  • This study suggests a frequency matrix technique to predict personal credit ratings more efficiently from incomplete data sets. First, this study tests multiple discriminant analysis and logistic regression analysis for predicting personal credit ratings with incomplete data sets, where missing values are filled in with the mean imputation and regression imputation methods. An artificial neural network and the frequency matrix technique are also tested for their performance in predicting personal credit ratings. A data set of 8,234 customers in 2004 from the personal credit information of Bank A is collected for the test. The performance of the frequency matrix technique is compared with that of the other methods. The results from the experiments show that the performance of the frequency matrix technique is superior to that of all the other models, namely MDA-mean, Logit-mean, MDA-regression, Logit-regression, and artificial neural networks.

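The mean imputation baseline used in the comparison can be written in a few lines (the frequency matrix technique itself is not specified in the abstract, so only this baseline is sketched; `None` marking a missing value is an assumption):

```python
def mean_impute(rows):
    # Column-wise mean imputation: replace each missing value (None)
    # with the mean of the observed values in that column.
    cols = list(zip(*rows))
    means = []
    for col in cols:
        present = [v for v in col if v is not None]
        means.append(sum(present) / len(present))
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]
```

A completed table like this is what the MDA-mean and Logit-mean models in the comparison are trained on.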

반복적 2차원 프로젝션 필터링을 이용한 확장 고차원 클러스터링 (Extended High Dimensional Clustering using Iterative Two Dimensional Projection Filtering)

  • 이혜명;박영배
    • 정보처리학회논문지D
    • /
    • Vol. 8D, No. 5
    • /
    • pp.573-580
    • /
    • 2001
  • Large, high-dimensional data sets contain a considerable amount of noise due to the inherent sparsity of high-dimensional data, which makes effective high-dimensional clustering difficult. CLIP was developed as a clustering algorithm that accommodates these characteristics of high-dimensional data. CLIP incrementally applies one-dimensional linear-transformation projections and finds the product set of the one-dimensional clusters obtained in each projection space. This set contains the clusters but may also contain noise. In this paper, we propose an extended CLIP algorithm that refines the product set containing the clusters: iterative two-dimensional projections are applied to the product set already found by CLIP to remove the high-dimensional noise from the clusters. To evaluate the performance of the extended algorithm, its effectiveness is demonstrated through a series of experiments on synthetic data.

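The refinement step can be illustrated with a toy filter: for every pair of dimensions, a candidate region is kept only if its two-dimensional projection contains enough points. The axis-aligned boxes and the density threshold are illustrative assumptions, not the CLIP algorithm itself.

```python
import numpy as np
from itertools import combinations

def filter_candidates(data, candidates, min_pts):
    # Each candidate is a list of (low, high) bounds per dimension.
    # Drop a candidate if any 2-D projection of it is too sparse,
    # mimicking iterative two-dimensional projection filtering.
    data = np.asarray(data, dtype=float)
    kept = []
    for box in candidates:
        dense_everywhere = True
        for i, j in combinations(range(data.shape[1]), 2):
            inside = ((data[:, i] >= box[i][0]) & (data[:, i] <= box[i][1]) &
                      (data[:, j] >= box[j][0]) & (data[:, j] <= box[j][1]))
            if inside.sum() < min_pts:
                dense_everywhere = False
                break
        if dense_everywhere:
            kept.append(box)
    return kept
```

A candidate produced by combining dense one-dimensional intervals can still be empty in higher dimensions; checking all two-dimensional projections removes exactly this kind of noise.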

Demonstration of the Effectiveness of Monte Carlo-Based Data Sets with the Simplified Approach for Shielding Design of a Laboratory with the Therapeutic Level Proton Beam

  • Lai, Bo-Lun;Chang, Szu-Li;Sheu, Rong-Jiun
    • Journal of Radiation Protection and Research
    • /
    • Vol. 47, No. 1
    • /
    • pp.50-57
    • /
    • 2022
  • Background: There are several proton therapy facilities in operation or planned in Taiwan, and these facilities are anticipated not only to treat cancer but also to provide beam services to industry and academia. The simplified approach based on Monte Carlo-based data sets (source terms and attenuation lengths) with the point-source line-of-sight approximation is convenient in the design stage of proton therapy facilities because it is intuitive and easy to use. The purpose of this study is to expand the Monte Carlo-based data sets so that the simplified approach can cover applications of proton beams more widely. Materials and Methods: In this work, the MCNP6 Monte Carlo code was used in three sets of simulations: neutron yield calculation, generation of the Monte Carlo-based data sets, and dose assessment in simple cases to demonstrate the effectiveness of the generated data sets. Results and Discussion: The consistent agreement between the simplified approach and the Monte Carlo simulation results shows the effectiveness and advantage of applying the data sets to quick shielding design and conservative dose assessment for proton therapy facilities. Conclusion: This study has expanded the existing Monte Carlo-based data sets so that the simplified approach can be used for dose assessment or shielding design for beam services in proton therapy facilities. It should be noted that the default physics model of MCNP6 is no longer the Bertini model but the CEM (cascade-exciton model); therefore, the results of the simplified approach will be more conservative when used for double confirmation of the final shielding design.
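The point-source line-of-sight approximation itself is a one-line formula: the dose rate behind a shield is the angular source term attenuated exponentially through the shield thickness and reduced by the inverse square of the distance. The function below is a direct transcription; the numerical inputs are placeholders, with real source terms and attenuation lengths coming from the Monte Carlo-generated data sets.

```python
import math

def dose_rate(source_term, att_length, distance, thickness):
    # Point-source line-of-sight model:
    #   H = H0(theta) * exp(-t / lambda) / r^2
    # source_term: H0(theta), angular source term (e.g. uSv.m^2 per proton)
    # att_length:  lambda, attenuation length of the shield material
    # distance:    r, source-to-point distance (same length unit as t)
    # thickness:   t, slant shield thickness along the line of sight
    return source_term * math.exp(-thickness / att_length) / distance ** 2
```

Its simplicity is why the approach is attractive at the design stage: sweeping shield thicknesses or distances is immediate once the data sets supply `source_term` and `att_length` for each emission angle and beam energy.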