An assessment of the taxonomic reliability of DNA barcode sequences in publicly available databases

  • Jin, Soyeong (School of Biological Sciences and Technology, Chonnam National University)
  • Kim, Kwang Young (Department of Oceanography, Chonnam National University)
  • Kim, Min-Seok (Dental Science Research Institute, School of Dentistry, Chonnam National University)
  • Park, Chungoo (School of Biological Sciences and Technology, Chonnam National University)
  • Received : 2020.07.29
  • Accepted : 2020.09.04
  • Published : 2020.09.21


DNA barcoding has a wide range of applications, from taxonomic studies that help elucidate cryptic species and phylogenetic relationships to the analysis of environmental samples for biodiversity monitoring and species conservation assessments. Once DNA barcode sequences are obtained, they are commonly analyzed by sequence similarity-based homology search; that is, the obtained barcode sequences are compared against DNA barcode reference databases. The accuracy of species identification therefore depends directly on the quantity and quality of the reference databases, which must be stringently monitored. With the development of next-generation sequencing techniques, a large number of DNA barcode sequences have been produced and stored in online databases, but their validity, accuracy, and reliability have not been extensively investigated. In this study, we investigated the amount and types of erroneous barcode sequences deposited in publicly accessible databases. Over 4.1 million sequences were examined in three large-scale DNA barcode databases (NCBI GenBank, Barcode of Life Data System [BOLD], and Protist Ribosomal Reference database [PR2]) for four major DNA barcodes (cytochrome c oxidase subunit 1 [COI], internal transcribed spacer [ITS], ribulose bisphosphate carboxylase large chain [rbcL], and 18S ribosomal RNA [18S rRNA]); approximately 2% of the barcode sequences were found to be erroneous, and their taxonomic distribution was uneven. Our findings provide compelling evidence of data quality problems, along with insufficient and unreliable annotation of taxonomic data, in DNA barcode databases. We therefore suggest that when ambiguous taxa arise during barcoding analysis, further validation with other DNA barcode loci or morphological characters should be mandatory.
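The similarity-based lookup described above can be illustrated with a toy sketch. This is not the pipeline used in the study: the function names, the 97% identity cutoff, the record identifiers, and the sequences are all hypothetical, and the sequences are assumed to be pre-aligned and of equal length (real analyses would rely on an alignment tool such as BLAST+). The sketch only shows how a single conflicting reference label can turn an otherwise clean identification into an ambiguous one.

```python
from typing import Dict, List, Tuple


def percent_identity(a: str, b: str) -> float:
    """Percentage of matching positions between two pre-aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a.upper(), b.upper()) if x == y)
    return 100.0 * matches / len(a)


def identify(
    query: str,
    references: Dict[str, Tuple[str, str]],
    min_identity: float = 97.0,  # hypothetical cutoff, not taken from the study
) -> Tuple[str, List[str]]:
    """Compare a query against reference records (id -> (species label, sequence)).

    Returns a status and the species labels found among the best-scoring hits:
    "identified" (one label), "ambiguous" (several labels), or "no_match".
    """
    if not references:
        return "no_match", []
    scores = [
        (percent_identity(query, seq), species)
        for species, seq in references.values()
    ]
    best = max(score for score, _ in scores)
    if best < min_identity:
        return "no_match", []
    taxa = sorted({species for score, species in scores if score == best})
    status = "identified" if len(taxa) == 1 else "ambiguous"
    return status, taxa


# Hypothetical reference records, for illustration only.
QUERY = "ACGTACGTACGTACGTACGT"
CLEAN_REFS = {
    "rec1": ("Species A", "ACGTACGTACGTACGTACGT"),
    "rec2": ("Species B", "ACGTACGTACGTTTTTACGA"),
}
# Same database plus one record carrying Species A's sequence under another label.
CONFLICTING_REFS = dict(CLEAN_REFS, rec3=("Species B", "ACGTACGTACGTACGTACGT"))

clean_result = identify(QUERY, CLEAN_REFS)
conflicting_result = identify(QUERY, CONFLICTING_REFS)
```

With the clean references the query resolves to a single label, whereas the conflicting record makes the same query ambiguous; that is the situation in which validation with another barcode locus or with morphological characters would be called for.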



We thank the members of the CSB lab and the anonymous reviewers for their valuable comments. This research was supported by the "Research center for fishery resource management based on the information and communication technology" (ICT to C.P.) of the Korea Institute of Marine Science and Technology Promotion (KIMST), funded by the Ministry of Oceans and Fisheries, Korea, and by National Research Foundation (NRF) of Korea grants funded by the Korea government (MSIT) (NRF-2020R1A2C3005053 to K.Y.K. and NRF-2017R1A2B1007928 to M.S.K.).

