Analysis of massive data in astronomy

  • Shin, Min-Su (Korea Astronomy and Space Science Institute)
  • Received : 2016.09.19
  • Accepted : 2016.10.06
  • Published : 2016.10.31

Abstract

Recent astronomical surveys have produced substantial amounts of data and have fundamentally changed conventional methods of analyzing astronomical data. Both classical statistical inference and modern machine learning methods are used at every step of data analysis, from data calibration to the inference of physical models. Machine learning methods are becoming increasingly popular for classical problems of astronomical data analysis because large-scale detectors have made data acquisition inexpensive and fast computer networks allow large volumes of data to be shared. Analyses of big astronomical data commonly have to account for the effects of inhomogeneous spatial and temporal coverage. The growing size of the data requires parallel distributed computing environments as well as machine learning algorithms. However, distributed data analysis systems have not yet been widely adopted for the general analysis of massive astronomical data. In astronomy, gathering adequate training data through observation is expensive, so training data are generally assembled from multiple sources; therefore, semi-supervised and ensemble machine learning methods will become important for the analysis of big astronomical data.

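The abstract's point that labeled training data are expensive is the usual motivation for semi-supervised learning. As a rough illustration only, the sketch below shows self-training, one simple semi-supervised scheme: a classifier fitted on a few labeled objects assigns pseudo-labels to unlabeled objects it classifies with a clear margin, then refits. The nearest-centroid classifier, the `min_margin` threshold, the function names, and the two-feature "star"/"galaxy" data are all assumptions made up for this example; none of them is taken from the paper.

```python
from statistics import mean


def centroids(points, labels):
    """Return the mean feature vector of each class."""
    dims = range(len(points[0]))
    return {
        c: tuple(mean(p[i] for p, l in zip(points, labels) if l == c) for i in dims)
        for c in sorted(set(labels))
    }


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(cents, p):
    """Return (label, margin): nearest class and its distance gap to the runner-up."""
    ranked = sorted(cents, key=lambda c: sq_dist(cents[c], p))
    margin = sq_dist(cents[ranked[1]], p) - sq_dist(cents[ranked[0]], p)
    return ranked[0], margin


def self_train(labeled, labels, unlabeled, min_margin=1.0, max_rounds=10):
    """Grow the training set with confidently pseudo-labeled points."""
    points, labs, pool = list(labeled), list(labels), list(unlabeled)
    for _ in range(max_rounds):
        cents = centroids(points, labs)
        scored = [(p, predict(cents, p)) for p in pool]
        accepted = [(p, lab) for p, (lab, m) in scored if m >= min_margin]
        if not accepted:
            break  # nothing left that the current model is confident about
        for p, lab in accepted:
            points.append(p)
            labs.append(lab)
        pool = [p for p, (_, m) in scored if m < min_margin]
    return centroids(points, labs), pool


# Hypothetical data: four labeled objects and four unlabeled ones.
labeled = [(0.0, 0.0), (0.2, 0.1), (2.0, 2.0), (2.2, 1.9)]
labels = ["star", "star", "galaxy", "galaxy"]
unlabeled = [(0.1, 0.3), (1.9, 2.1), (0.4, 0.2), (1.0, 1.0)]

model, leftover = self_train(labeled, labels, unlabeled)
print(predict(model, (0.3, 0.3))[0], leftover)
```

The ambiguous point near the class boundary never clears the margin, so it stays unlabeled rather than contaminating the training set; that conservatism is the main design choice in this kind of scheme.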
