• Title/Summary/Keyword: data-sets

Search results: 3,782

Cross platform classification of microarrays by rank comparison

  • Lee, Sunho
    • Journal of the Korean Data and Information Science Society, v.26 no.2, pp.475-486, 2015
  • Mining the microarray data accumulated in public data repositories can save experimental cost and time and provide valuable biomedical information. Pooling multiple data sets for big data analysis increases statistical power, improves the reliability of the results, and reduces the bias specific to any individual study. Integrating data sets from different studies, however, raises many problems. In this study, I restrict attention to cross-platform classification, in which the platform of a test sample differs from the platform of the training set, and suggest a simple rank-based classification method. The method is compared with diagonal linear discriminant analysis, the k-nearest-neighbor method, and the support vector machine on real cross-platform data sets for two cancers.
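
The abstract leaves the rank procedure implicit. A minimal sketch of one plausible rank-based classifier is shown below, assuming that each expression profile is converted to within-sample ranks (which removes platform-specific scale and location effects) and that a test sample is assigned to the class whose mean rank profile it correlates with best; all names are hypothetical, not the author's code.

    import numpy as np
    from scipy.stats import rankdata

    def rank_centroid_classify(X_train, y_train, X_test):
        """Cross-platform classification by comparing within-sample ranks."""
        R_train = np.apply_along_axis(rankdata, 1, X_train)  # per-sample gene ranks
        R_test = np.apply_along_axis(rankdata, 1, X_test)
        classes = np.unique(y_train)
        # Mean rank profile (centroid) of each class in the training set.
        centroids = [R_train[y_train == c].mean(axis=0) for c in classes]
        preds = []
        for r in R_test:
            # Pearson correlation of rank vectors gives a Spearman-style similarity.
            corr = [np.corrcoef(r, c)[0, 1] for c in centroids]
            preds.append(classes[int(np.argmax(corr))])
        return np.array(preds)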

A Wavelet based Feature Selection Method to Improve Classification of Large Signal-type Data (웨이블릿에 기반한 시그널 형태를 지닌 대형 자료의 feature 추출 방법)

  • Jang, Woosung;Chang, Woojin
    • Journal of Korean Institute of Industrial Engineers, v.32 no.2, pp.133-140, 2006
  • Large signal-type data sets are difficult to classify, especially when they are non-stationary. In this paper, large, non-stationary signal-type data sets are wavelet transformed so that distinct features of the data are extracted in the wavelet domain rather than the time domain. For classification, a few wavelet coefficients representing class properties are fed to statistical classification methods: linear discriminant analysis, quadratic discriminant analysis, neural networks, and so on. Applying our wavelet-based feature selection method to a mass spectrometry data set for ovarian cancer diagnosis resulted in 100% classification accuracy.
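
As a rough illustration of the pipeline the abstract describes (wavelet transform, then a few discriminative coefficients fed to a standard classifier), the sketch below uses PyWavelets and scikit-learn; the wavelet family, decomposition level, and F-score ranking are assumptions, not the authors' exact settings.

    import numpy as np
    import pywt
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import FunctionTransformer

    def wavelet_features(X, wavelet="db4", level=4):
        """Flatten each signal's multilevel DWT coefficients into a feature row."""
        return np.array([np.concatenate(pywt.wavedec(x, wavelet, level=level))
                         for x in X])

    # Keep the k coefficients that best separate the classes, then apply LDA.
    clf = make_pipeline(
        FunctionTransformer(wavelet_features),
        SelectKBest(f_classif, k=20),
        LinearDiscriminantAnalysis(),
    )
    # Usage: clf.fit(X_train, y_train); clf.predict(X_test), with signals as rows of X.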

A Novel Reversible Data Hiding Scheme for VQ-Compressed Images Using Index Set Construction Strategy

  • Qin, Chuan;Chang, Chin-Chen;Chen, Yen-Chang
    • KSII Transactions on Internet and Information Systems (TIIS), v.7 no.8, pp.2027-2041, 2013
  • In this paper, we propose a novel reversible data hiding scheme for the index tables of vector quantization (VQ) compressed images, based on an index set construction strategy. On the sender side, three index sets are constructed, of which the first and second contain the indices with greater and lesser occurrence counts in the given VQ index table, respectively. Index values in the table that belong to the second set are given prefixes from the third set to eliminate collisions with the two mapping sets derived from the first set, and this prefixing operation additionally provides data hiding capacity. The main embedding procedure is then achieved simply by mapping index values in the first set to the corresponding values in the two derived mapping sets. Reconstructing the same three index sets on the receiver side ensures correct extraction of the secret data and lossless recovery of the index table. Experimental results demonstrate the effectiveness of the proposed scheme.
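
The following is a much-simplified sketch of the general index-mapping idea (embedding one bit per high-frequency index by rewriting it into one of two derived mapping sets), not the authors' exact construction; the prefixing of second-set indices is reduced to a comment, and all names are hypothetical.

    def embed_bits(index_table, s1, m0, m1, bits):
        """Reversible embedding: an index i in s1 carries bit b by being
        rewritten to m0[k] or m1[k], where k is i's position in s1. m0 and
        m1 are disjoint mapping sets, the same size as s1, unused elsewhere
        in the table. Indices outside s1 are left unchanged here (the paper
        prefixes them to avoid collisions with m0/m1, which also embeds
        extra data)."""
        pos = {v: k for k, v in enumerate(s1)}
        out, it = [], iter(bits)
        for i in index_table:
            if i in pos:
                b = next(it, None)  # no bits left -> keep the original index
                out.append(i if b is None else (m1 if b else m0)[pos[i]])
            else:
                out.append(i)
        return out

    def extract_bits(stego_table, s1, m0, m1):
        """Recover the bits and losslessly restore the original index table."""
        inv = {**{v: (0, s1[k]) for k, v in enumerate(m0)},
               **{v: (1, s1[k]) for k, v in enumerate(m1)}}
        bits, orig = [], []
        for i in stego_table:
            if i in inv:
                b, o = inv[i]
                bits.append(b)
                orig.append(o)
            else:
                orig.append(i)
        return bits, orig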

Practical Datasets for Similarity Measures and Their Threshold Values (유사도 측정 데이터 셋과 쓰레숄드)

  • Yang, Byoungju;Shim, Junho
    • The Journal of Society for e-Business Studies, v.18 no.1, pp.97-105, 2013
  • In the e-business domain, where the number of data objects is large, measuring similarity to find identical or similar objects is important. Doing so basically requires comparing and computing the features of objects in pairs, and therefore takes longer as the amount of data grows. Recent studies have presented various algorithms to perform this efficiently, and most demonstrate their performance superiority through empirical tests over particular data sets. In this paper, we introduce those data sets and present their characteristics and the meaningful threshold values that each data set inherently contains. This analysis of practical data sets with respect to their threshold values may serve as a reference baseline for future experiments on newly developed algorithms.
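
As a baseline for the kind of workload these data sets benchmark, a naive all-pairs similarity join under a threshold can be sketched as follows (Jaccard similarity over set-valued records; names are illustrative):

    from itertools import combinations

    def jaccard(a: set, b: set) -> float:
        """|A intersect B| / |A union B|, a common set-based similarity measure."""
        return len(a & b) / len(a | b) if (a or b) else 1.0

    def similarity_join(records, threshold):
        """Keep all pairs whose similarity meets the threshold. Quadratic in
        the number of records; the algorithms the paper surveys exist
        precisely to prune most of these comparisons."""
        return [(i, j)
                for (i, a), (j, b) in combinations(enumerate(records), 2)
                if jaccard(a, b) >= threshold]

    # Example: similarity_join([{"ipad", "case"}, {"ipad", "cover"}], 0.3) -> [(0, 1)]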

A Study on Fuzzy-Rough sets (퍼지-Rough 집합에 관한 연구)

  • 정구범;김명순
    • Journal of the Korea Society of Computer and Information, v.1 no.1, pp.183-188, 1996
  • Fuzzy sets, introduced by Zadeh, are a concept for processing and reasoning about vague information using membership functions. The notion of rough sets, introduced by Pawlak, is based on the ability to classify, reduce, and perform approximate reasoning over indiscernible data. A comparison between fuzzy sets and rough sets was given by Pawlak, where it was shown that the two concepts are different and cannot be combined directly. The purpose of this paper is to introduce and define the notion of fuzzy-rough sets, which joins the membership function of fuzzy sets to rough sets.
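
The abstract does not give the defining equations. For context, a standard way to join a fuzzy membership function to rough approximations (the Dubois-Prade formulation, in the same spirit as the construction described here, though not necessarily this paper's exact definition) sets, for a fuzzy set A and a fuzzy similarity relation R on a universe U:

    \mu_{\underline{R}A}(x) = \inf_{y \in U} \max\bigl(1 - R(x,y),\, \mu_A(y)\bigr)
    \mu_{\overline{R}A}(x)  = \sup_{y \in U} \min\bigl(R(x,y),\, \mu_A(y)\bigr)

Here the lower approximation measures how certainly x belongs to A given everything similar to it, and the upper approximation measures how possibly it does.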


Locality-Sensitive Hashing for Data with Categorical and Numerical Attributes Using Dual Hashing

  • Lee, Keon Myung
    • International Journal of Fuzzy Logic and Intelligent Systems, v.14 no.2, pp.98-104, 2014
  • Locality-sensitive hashing techniques have been developed to efficiently handle nearest neighbor searches and similar-pair identification problems for large volumes of high-dimensional data. This study proposes a locality-sensitive hashing method that can be applied to nearest neighbor search problems for data sets containing both numerical and categorical attributes. The proposed method makes use of dual hashing functions, where one function is dedicated to numerical attributes and the other to categorical attributes. The method consists of creating indexing structures for each of the dual hashing functions, gathering and combining the candidate sets, and thoroughly examining them to determine the nearest ones. The proposed method is evaluated on several synthetic data sets, and the results show that it improves performance in cases of large amounts of data with both numerical and categorical attributes.
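
A minimal sketch of the dual-hashing idea follows: sign-bit random projections (a classic LSH family for the numerical part) paired with ordinary exact-match hashing for the categorical part, the two codes together forming the bucket key. The details are illustrative assumptions, not the paper's exact design.

    import numpy as np

    def dual_hash(num_vec, cat_vec, planes, num_buckets=1024):
        """Bucket key for a record with numerical and categorical attributes."""
        num_code = tuple((planes @ num_vec) > 0)       # sign-bit LSH sketch
        cat_code = hash(tuple(cat_vec)) % num_buckets  # exact-match hash
        return num_code, cat_code                      # agree on both => same bucket

    rng = np.random.default_rng(0)
    planes = rng.normal(size=(8, 3))  # 8 hash bits over 3 numerical dimensions
    key = dual_hash(np.array([1.0, 0.2, -0.5]), ("red", "large"), planes)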

LS-SVM for large data sets

  • Park, Hongrak;Hwang, Hyungtae;Kim, Byungju
    • Journal of the Korean Data and Information Science Society, v.27 no.2, pp.549-557, 2016
  • In this paper we propose a multiclass classification method for large data sets that ensembles least squares support vector machines (LS-SVMs) trained on principal components instead of raw input vectors. We use the revised one-vs-all method for multiclass classification, a voting scheme based on combining several binary classifications. The revised one-vs-all method is carried out using the hat matrix of the LS-SVM ensemble, which is obtained by ensembling LS-SVMs trained on random samples drawn from the whole large training set. The leave-one-out cross-validation (CV) function is used to choose the hyper-parameter values that affect the performance of the multiclass LS-SVM ensemble, and we present a generalized cross-validation function to reduce the computational burden of leave-one-out CV. Experimental results on real data sets illustrate the performance of the proposed multiclass LS-SVM ensemble.
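
The binary LS-SVM underlying the ensemble has a closed-form solution via one linear system, which is what makes training on many random subsamples cheap. A minimal sketch with an RBF kernel follows (hypothetical names; the revised one-vs-all voting and the PCA step are omitted):

    import numpy as np

    def rbf_kernel(A, B, sigma=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    def lssvm_fit(X, y, gamma=1.0, sigma=1.0):
        """Solve the LS-SVM dual system [[0, 1^T], [1, K + I/gamma]][b; a] = [0; y]."""
        n = len(y)
        A = np.zeros((n + 1, n + 1))
        A[0, 1:] = A[1:, 0] = 1.0
        A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
        sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
        return sol[0], sol[1:]  # bias b and dual coefficients alpha

    def lssvm_decision(X_train, b, alpha, X_test, sigma=1.0):
        # For one-vs-all: fit one model per class on +/-1 labels, predict the argmax.
        return rbf_kernel(X_test, X_train, sigma) @ alpha + b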

Predicting Personal Credit Rating with Incomplete Data Sets Using Frequency Matrix technique (Frequency Matrix 기법을 이용한 결측치 자료로부터의 개인신용예측)

  • Bae, Jae-Kwon;Kim, Jin-Hwa;Hwang, Kook-Jae
    • Journal of Information Technology Applications and Management, v.13 no.4, pp.273-290, 2006
  • This study suggests a frequency matrix technique to predict personal credit ratings more efficiently from incomplete data sets. First, multiple discriminant analysis (MDA) and logistic regression analysis are tested for predicting personal credit ratings with incomplete data sets, with missing values imputed by the mean imputation method and the regression imputation method. An artificial neural network and the frequency matrix technique are also tested on their performance in predicting personal credit ratings. A data set of 8,234 customers from 2004, drawn from the personal credit information of Bank A, was collected for the test. The performance of the frequency matrix technique is compared with that of the other methods. The experimental results show that the frequency matrix technique outperforms all the other models, namely MDA-mean, Logit-mean, MDA-regression, Logit-regression, and artificial neural networks.
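
The frequency matrix technique itself is specific to the paper and is not reconstructed here; for reference, the two baseline imputation methods it is compared against can be sketched as follows (hypothetical helpers; the regression variant assumes the remaining columns are complete):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def mean_impute(X):
        """Replace each missing entry (NaN) with its column mean."""
        X = X.copy()
        col_means = np.nanmean(X, axis=0)
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = col_means[cols]
        return X

    def regression_impute(X, col):
        """Predict the missing values of one column from the other columns."""
        X = X.copy()
        others = np.delete(np.arange(X.shape[1]), col)
        miss = np.isnan(X[:, col])
        model = LinearRegression().fit(X[~miss][:, others], X[~miss, col])
        X[miss, col] = model.predict(X[miss][:, others])
        return X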


Extended High Dimensional Clustering using Iterative Two Dimensional Projection Filtering (반복적 2차원 프로젝션 필터링을 이용한 확장 고차원 클러스터링)

  • Lee, Hye-Myeong;Park, Yeong-Bae
    • The KIPS Transactions:PartD, v.8D no.5, pp.573-580, 2001
  • Large amounts of high-dimensional data contain a significant amount of noise owing to their inherent sparsity, which adds difficulty to high-dimensional clustering. CLIP was developed as a clustering algorithm suited to the characteristics of high-dimensional data. CLIP is based on incremental one-dimensional projection onto each axis and finds product sets of the one-dimensional clusters. These product sets contain all the high-dimensional clusters, but they may also contain noise. In this paper, we propose an extended CLIP algorithm that refines the product sets containing clusters. We remove high-dimensional noise by iteratively applying two-dimensional projections to the product sets already found by CLIP. To evaluate the extended algorithm, we demonstrate its effectiveness through a series of experiments on synthetic data sets.
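
The per-axis projection step can be pictured as finding runs of unusually dense histogram bins on each axis; candidate high-dimensional clusters are then product sets of these intervals, and the extension re-projects each candidate onto pairs of axes to discard noise. A hedged sketch of the one-dimensional step follows (illustrative thresholds, not CLIP's actual density test):

    import numpy as np

    def dense_intervals(values, n_bins=50, density_factor=2.0):
        """Find 1-D clusters on one axis as maximal runs of dense bins."""
        counts, edges = np.histogram(values, bins=n_bins)
        dense = counts > density_factor * counts.mean()
        runs, start = [], None
        for i, d in enumerate(dense):
            if d and start is None:
                start = i                              # a dense run begins
            elif not d and start is not None:
                runs.append((edges[start], edges[i]))  # run covers bins start..i-1
                start = None
        if start is not None:
            runs.append((edges[start], edges[-1]))
        return runs

    # Product sets of per-axis intervals are the candidate clusters; iterative
    # 2-D projections over each candidate then filter out the sparse members.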


Demonstration of the Effectiveness of Monte Carlo-Based Data Sets with the Simplified Approach for Shielding Design of a Laboratory with the Therapeutic Level Proton Beam

  • Lai, Bo-Lun;Chang, Szu-Li;Sheu, Rong-Jiun
    • Journal of Radiation Protection and Research, v.47 no.1, pp.50-57, 2022
  • Background: Several proton therapy facilities are in operation or planned in Taiwan, and they are anticipated not only to treat cancer but also to provide beam services to industry and academia. The simplified approach based on Monte Carlo-based data sets (source terms and attenuation lengths) with the point-source line-of-sight approximation is convenient in the design stage of proton therapy facilities because it is intuitive and easy to use. The purpose of this study is to expand the Monte Carlo-based data sets so that the simplified approach covers applications of proton beams more widely. Materials and Methods: In this work, the MCNP6 Monte Carlo code was used in three simulations: neutron yield calculation, generation of the Monte Carlo-based data sets, and dose assessment in simple cases to demonstrate the effectiveness of the generated data sets. Results and Discussion: The consistent agreement between the simplified approach and the Monte Carlo simulation results shows the effectiveness and advantage of applying the data sets to quick shielding design and conservative dose assessment for proton therapy facilities. Conclusion: This study has expanded the existing Monte Carlo-based data sets so that the simplified approach can be used for dose assessment or shielding design for beam services in proton therapy facilities. It should be noted that the default physics model in MCNP6 is no longer the Bertini model but the cascade-exciton model (CEM); therefore, the results of the simplified approach will be more conservative when used as a double check of the final shielding design.
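
For orientation, the point-source line-of-sight approximation underlying the simplified approach combines a tabulated source term with exponential attenuation through the shield; a schematic sketch with illustrative parameter names (not the paper's data sets) is:

    import math

    def dose_rate(H0, r, d, lam):
        """Point-source line-of-sight estimate of the dose behind a shield.

        H0  -- source term for the relevant angle/energy bin
        r   -- distance from the source to the dose point (m)
        d   -- shield thickness along the line of sight
        lam -- attenuation length for the same bin, in the same units as d
        """
        return H0 / r ** 2 * math.exp(-d / lam)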