Classification in Different Genera by Cytochrome Oxidase Subunit I Gene Using CNN-LSTM Hybrid Model

  • Meijing Li (Department of Artificial Intelligence and Data Engineering, Sangmyung University)
  • Dongkeun Kim (Department of Intelligent Engineering Informatics for Human, College of Convergence Engineering, Sangmyung University)
  • Received : 2023.02.23
  • Accepted : 2023.05.08
  • Published : 2023.06.30

Abstract

The COI barcode is a sequence of approximately 650 bp at the 5' end of the mitochondrial cytochrome c oxidase subunit I (COI) gene. As an effective deoxyribonucleic acid (DNA) barcode, it is widely used for the taxonomic identification and evolutionary analysis of species. We created a CNN-LSTM hybrid model by combining the gene features partially extracted by a long short-term memory (LSTM) network with the feature maps obtained by a convolutional neural network (CNN). The model was trained on 278 samples covering 15 genera from two orders and compared with K-means clustering, support vector machines (SVM), and a standalone CNN classifier; on a test set of 118 samples, the CNN-LSTM hybrid model achieved 94% accuracy. After we augmented the training set with more samples and with four genera from four orders, the classification accuracy on the test set reached 100%. This study also proposes calculating the cosine similarity between the training and test sets as an initial check on the reliability of the predicted results and as a way to flag potential new species.
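The cosine-similarity check mentioned at the end of the abstract can be illustrated with a minimal sketch. The abstract does not say how sequences are turned into vectors, so the k-mer count encoding, the value k=3, the 0.8 threshold, and all function names below are assumptions made for illustration only, not the authors' implementation.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_vector(seq, k=3):
    """Count overlapping k-mers over the alphabet ACGT (assumed encoding).

    k-mers containing other symbols (e.g. N) are ignored.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts.get(km, 0) for km in kmers]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_u * norm_v)


def max_training_similarity(test_seq, training_seqs, k=3):
    """Highest cosine similarity between one test sequence and any training sequence."""
    test_vec = kmer_vector(test_seq.upper(), k)
    return max(
        cosine_similarity(test_vec, kmer_vector(s.upper(), k))
        for s in training_seqs
    )


def flag_low_similarity(test_seq, training_seqs, threshold=0.8, k=3):
    """True when the test sequence resembles no training sequence.

    The threshold is a hypothetical value; it would need tuning on real data.
    """
    return max_training_similarity(test_seq, training_seqs, k) < threshold
```

Under these assumptions, a test sequence flagged by `flag_low_similarity` would be one whose predicted genus deserves less trust, since nothing similar was seen in training; such a sequence might be a candidate for a taxon absent from the training set.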
