Missing values imputation for time course gene expression data using the pattern consistency index adaptive nearest neighbors

  • Shin, Heyseo (Department of Biomedicine·Health Science, The Catholic University of Korea)
  • Kim, Dongjae (Department of Biomedicine·Health Science, The Catholic University of Korea)
  • Received : 2020.02.26
  • Accepted : 2020.05.06
  • Published : 2020.06.30

Abstract

Time course gene expression data are large-scale data obtained by observing microarray experiments over time, and they allow gene expression levels to be identified simultaneously. However, the experimental process is complex, so missing values arise frequently from a variety of causes. In this paper, we propose the pattern consistency index adaptive nearest neighbors (PANN) method for imputing missing values in time course gene expression data. The method combines the adaptive nearest neighbors (ANN) method, which reflects local characteristics, with the pattern consistency index, which considers the degree of agreement in gene expression between observation time points. To evaluate the performance of the proposed PANN method, we conducted a Monte Carlo simulation study using two real yeast time course data sets.
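
The abstract describes PANN only at a high level, so the following is a minimal Python sketch of the general idea rather than the authors' algorithm. It assumes NumPy, a genes-by-time-points matrix in which NaN marks a missing value, and three hypothetical helpers (`pattern_agreement`, `impute_knn_pattern`, `monte_carlo_nrmse`). Two simplifications are assumptions of this sketch: the pattern measure is simply the share of consecutive time steps in which two profiles move in the same direction, which is not the pattern consistency index of the paper, and the neighborhood uses a fixed `k` rather than the adaptive choice made by ANN. The evaluation loop hides known values at random and reports a normalized root-mean-square error, an assumed error measure, on synthetic data standing in for the yeast data sets.

```python
import numpy as np


def pattern_agreement(x, y):
    """Share of consecutive time steps in which x and y move in the same direction.

    Hypothetical stand-in, NOT the pattern consistency index of the paper.
    Steps with a missing endpoint in either series are skipped.
    """
    dx, dy = np.diff(x), np.diff(y)
    usable = ~np.isnan(dx) & ~np.isnan(dy)
    if not usable.any():
        return 0.0
    return float(np.mean(np.sign(dx[usable]) == np.sign(dy[usable])))


def impute_knn_pattern(data, k=3, eps=1e-6):
    """Fill NaNs in a genes x time-points matrix from the k best-scoring complete genes."""
    data = np.asarray(data, dtype=float)
    filled = data.copy()
    complete = data[~np.isnan(data).any(axis=1)]  # candidate neighbor genes
    if len(complete) == 0:
        raise ValueError("need at least one gene without missing values")
    for i in np.flatnonzero(np.isnan(data).any(axis=1)):
        row = data[i]
        obs = ~np.isnan(row)
        if not obs.any():
            continue  # nothing to compare on; the row is left unchanged
        # Root-mean-square distance over the time points observed for this gene
        dist = np.sqrt(np.mean((complete[:, obs] - row[obs]) ** 2, axis=1))
        agree = np.array([pattern_agreement(row, c) for c in complete])
        # Assumed scoring rule: genes with an agreeing pattern count as closer
        score = dist / (agree + eps)
        nearest = np.argsort(score)[:k]  # fixed k, not the adaptive choice of ANN
        weights = 1.0 / (score[nearest] + eps)
        filled[i, ~obs] = weights @ complete[nearest][:, ~obs] / weights.sum()
    return filled


def monte_carlo_nrmse(full, missing_rate=0.05, n_rep=20, k=3, seed=0):
    """Hide one value in a random subset of genes, impute, and average the NRMSE."""
    full = np.asarray(full, dtype=float)
    rng = np.random.default_rng(seed)
    n_genes, n_times = full.shape
    n_miss = max(1, int(missing_rate * n_genes))
    errors = []
    for _ in range(n_rep):
        genes = rng.choice(n_genes, size=n_miss, replace=False)
        times = rng.integers(0, n_times, size=n_miss)
        masked = full.copy()
        masked[genes, times] = np.nan
        estimate = impute_knn_pattern(masked, k=k)
        err = estimate[genes, times] - full[genes, times]
        # Assumed normalization: RMSE divided by the standard deviation of the full data
        errors.append(np.sqrt(np.mean(err ** 2)) / np.std(full))
    return float(np.mean(errors))


# Synthetic stand-in for a real data set: four phase-shifted sine profiles plus noise
rng = np.random.default_rng(1)
t = np.linspace(0, 2 * np.pi, 12)
phases = rng.choice([0.0, 0.5, 1.0, 1.5], size=60) * np.pi
synthetic = np.sin(t[None, :] + phases[:, None]) + rng.normal(0, 0.05, (60, 12))
print("mean NRMSE:", monte_carlo_nrmse(synthetic))
```

Dividing the distance by the agreement share is only one of several plausible ways to combine the two ingredients; a weighted sum or a hard agreement threshold would fit the same description equally well.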
