• Title/Summary/Keyword: sub-data re-sampling

How Many SNPs Should Be Used for the Human Phylogeny of Highly Related Ethnicities? A Case of Pan Asian 63 Ethnicities

  • Ghang, Ho-Young;Han, Young-Joo;Jeong, Sang-Jin;Bhak, Jong;Lee, Sung-Hoon;Kim, Tae-Hyung;Kim, Chul-Hong;Kim, Sang-Soo;Al-Mulla, Fahd;Youn, Chan-Hyun;Yoo, Hyang-Sook;The HUGO Pan-Asian SNP Consortium
    • Genomics & Informatics / v.9 no.4 / pp.181-188 / 2011
  • In planning a model-based phylogenetic study of highly related ethnic groups, the number of SNP markers is an important factor in inferring relationships. Genotype frequency data from 63 Pan-Asian ethnic groups were sub-sampled to determine the minimum number of SNPs required to establish such relationships. Bootstrap random sub-samples were drawn from the 5.6K PASNPi SNP data. For every re-sampled data set, the DA distance was calculated and a neighbor-joining tree was drawn. Consensus trees were built from the same 100 sub-samples, and bootstrap proportions were calculated. Consistency with the tree obtained from the whole marker set improved as the number of markers increased. The bootstrap proportions became reliable when more than 7,000 SNPs were used at a time. Within highly related ethnic groups, the minimum number of SNPs for robust neighbor-joining tree inference was about 7,000 at 95% bootstrap support.
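
The sub-sampling procedure in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it assumes biallelic SNPs summarized by one allele frequency per population, draws SNPs with replacement, and uses Nei's DA distance in its standard form, D_A = 1 - (1/L) Σ_loci Σ_alleles sqrt(x·y). The function names (`da_distance`, `subsample_distances`, `mean_deviation`) and the synthetic frequencies are hypothetical. Neighbor-joining and consensus-tree construction are left out; the sketch only compares sub-sampled distance matrices with the full-set matrix.

```python
import numpy as np

def da_distance(freqs):
    """Pairwise Nei's D_A distance for biallelic SNPs.

    freqs: (n_pops, n_snps) array holding the frequency of one allele
    per population and SNP (assumed input layout).
    """
    p = np.sqrt(freqs)          # sqrt of first-allele frequencies
    q = np.sqrt(1.0 - freqs)    # sqrt of second-allele frequencies
    # Sum sqrt(x * y) over both alleles, then average over loci.
    similarity = (p @ p.T + q @ q.T) / freqs.shape[1]
    return 1.0 - similarity

def subsample_distances(freqs, n_snps, n_reps, rng):
    """D_A matrices for n_reps random SNP sub-samples of size n_snps."""
    n_total = freqs.shape[1]
    reps = []
    for _ in range(n_reps):
        # Assumption: bootstrap-style draw with replacement.
        cols = rng.choice(n_total, size=n_snps, replace=True)
        reps.append(da_distance(freqs[:, cols]))
    return np.stack(reps)

def mean_deviation(freqs, n_snps, n_reps, rng):
    """Mean absolute difference between sub-sample and full-set distances."""
    full_matrix = da_distance(freqs)
    rep_matrices = subsample_distances(freqs, n_snps, n_reps, rng)
    return float(np.mean(np.abs(rep_matrices - full_matrix)))

# Synthetic allele frequencies, for illustration only.
rng = np.random.default_rng(0)
freqs = rng.uniform(0.05, 0.95, size=(5, 2000))

full = da_distance(freqs)
reps = subsample_distances(freqs, n_snps=200, n_reps=100, rng=rng)
err_small = mean_deviation(freqs, n_snps=50, n_reps=100, rng=rng)
err_large = mean_deviation(freqs, n_snps=1000, n_reps=100, rng=rng)
print(err_small, err_large)
```

Sweeping `n_snps` and watching the deviation shrink mirrors, at the level of distance matrices, the abstract's observation that consistency with the full-marker tree improves as more markers are used.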

Efficient Outlier Detection of the Water Temperature Monitoring Data (수온 관측 자료의 효율적인 이상 자료 탐지)

  • Cho, Hongyeon;Jeong, Shin Taek;Ko, Dong Hui;Son, Kyeong-Pyo
    • Journal of Korean Society of Coastal and Ocean Engineers / v.26 no.5 / pp.285-291 / 2014
  • Statistics derived from coastal water temperature monitoring data can be biased by outliers and missing intervals. Although a number of outlier detection methods have been developed, their application to in-situ monitoring data is very limited because they assume a priori information about the outliers and data without missing intervals, and because some methods require excessive computation time. In this study, a practical robust method is developed that can detect outliers efficiently and effectively in large data sets. The model has two parts: the first constructs the approximate components of the monitoring data using robust smoothing and data re-sampling, and the second is the main iterative outlier detection, which operates on the detailed components of the data estimated from the approximate components. The model is tested on two years of 5-minute-interval water temperature data from Lake Saemangeum. The outlier proportion of the data is estimated to be about 1.6-3.7%. Most of the outliers in the data are satisfactorily detected and removed by the model. To detect and remove outliers effectively, outlier detection with long-span smoothing should be applied before detection with short-span smoothing.
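
The two-part scheme described in this abstract (a smooth approximation, then iterative detection on the residual detail, long span before short span) might look roughly like the sketch below. The abstract does not name the smoother or the threshold rule, so everything specific here is an assumption: a NaN-tolerant rolling median stands in for the robust smoother, a scaled median absolute deviation (MAD) sets the threshold, and the spans (289 and 13 samples), the factor `k`, and the function names are hypothetical. The data re-sampling step is not reproduced.

```python
import warnings
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_median(x, span):
    """Centered rolling median that skips NaN (missing) values; span is odd."""
    half = span // 2
    padded = np.pad(x, half, mode="constant", constant_values=np.nan)
    windows = sliding_window_view(padded, span)
    with warnings.catch_warnings():
        # Windows made entirely of missing values yield NaN; silence the warning.
        warnings.simplefilter("ignore", RuntimeWarning)
        return np.nanmedian(windows, axis=1)

def detect_outliers(x, spans=(289, 13), k=5.0, max_iter=10):
    """Flag outliers iteratively, applying the long span before the short one."""
    clean = np.asarray(x, dtype=float).copy()
    flagged = np.zeros(clean.size, dtype=bool)
    for span in spans:
        for _ in range(max_iter):
            approx = rolling_median(clean, span)   # approximate component
            detail = clean - approx                # detailed component
            mad = np.nanmedian(np.abs(detail - np.nanmedian(detail)))
            if not mad > 0:
                break
            # 1.4826 * MAD estimates the standard deviation for normal noise.
            new = np.abs(detail) > k * 1.4826 * mad
            new &= ~np.isnan(clean)
            if not new.any():
                break
            flagged |= new
            clean[new] = np.nan    # removed points no longer affect smoothing
    return flagged

# Synthetic 5-minute series with a daily cycle, a gap, and injected spikes.
rng = np.random.default_rng(1)
t = np.arange(2000)
temp = 15.0 + 2.0 * np.sin(2 * np.pi * t / 288) + rng.normal(0.0, 0.1, t.size)
temp[300:320] = np.nan              # missing interval
spikes = np.array([100, 700, 1500])
temp[spikes] += np.array([30.0, 3.0, -3.0])

flagged = detect_outliers(temp)
print(flagged.sum(), np.flatnonzero(flagged)[:10])
```

In this sketch the long-span pass removes only the gross spike, which keeps it from inflating the spread estimate when the short-span pass looks for the smaller excursions; that is one plausible reading of why the abstract recommends long-span smoothing first.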