An Effective Data Analysis System for Improving Throughput of Shotgun Proteomic Data based on Machine Learning

(Korean title, translated: A Machine Learning-Based System for Effectively Interpreting Large-Scale Proteome Data)

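  • Seungjin Na (Department of Mechanical and Information Engineering, University of Seoul)
  • Eunok Paek (Department of Mechanical and Information Engineering, University of Seoul)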
  • Published : 2007.10.15

Abstract

In proteomics, recent advances in mass spectrometry and in protein extraction and separation technology have made high-throughput analysis possible. A single LC-MS/MS experiment now yields thousands to hundreds of thousands of MS/MS spectra. Such a large amount of data creates significant computational challenges, so there is a great need for data analysis methods that use computational resources efficiently while providing more peptide identifications. Here, the SIFTER system is designed to avoid inefficient processing of shotgun proteomic data. SIFTER provides software tools that improve the throughput of mass spectrometry-based peptide identification by filtering out poor-quality tandem mass spectra and estimating the peptide charge state before analysis algorithms are applied. The SIFTER tools characterize and assess spectral features, and thus significantly reduce computation time and false positive rates by singling out, prior to full-blown analysis, the spectra that would lead to wrong identifications. SIFTER enables fast and in-depth interpretation of tandem mass spectra.

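Recent advances in protein extraction and separation techniques, together with high-performance mass spectrometers, have made it possible to analyze samples rapidly and in large quantities in proteomics, so the amount of data obtained from a single experiment has grown dramatically. How to process such large volumes of data and extract only the necessary information has therefore become a major issue. Conventional data interpretation pipelines, however, contain many steps that waste computational resources unnecessarily; this not only increases interpretation time but also often produces incorrect interpretations, lowering confidence in the results. This paper points out the problems in the conventional data interpretation process and proposes the SIFTER system, which raises the efficiency of data processing while improving the reliability of the interpretation results. Before full-scale data interpretation, the SIFTER system provides software that assesses the quality of mass spectra and determines their charge state. By providing a method that accurately determines spectral quality and charge state from the characteristics of the fragment ions that appear in tandem mass spectra, it removes in advance the spectra whose quality is so low that they clearly cannot be interpreted, and prevents incorrect information from being used during spectrum interpretation. As a result, substantial improvements can be expected in both the efficiency of the data interpretation process and the accuracy of the interpretation results.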
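
The pre-filtering idea described in the abstract (discard spectra that are clearly uninterpretable, then attach a charge estimate before the expensive search) can be illustrated with a minimal Python sketch. This is not the SIFTER implementation: the abstract does not give its features or classifier, and the title indicates that SIFTER relies on machine learning, whereas the sketch substitutes hand-set thresholds. The names `Spectrum`, `looks_poor_quality`, `guess_charge_class`, and `prefilter`, and all threshold values, are hypothetical. The charge rule is a commonly used heuristic for low-resolution data: a singly charged precursor cannot produce fragments above its own m/z, so a spectrum with nearly all of its intensity below the precursor m/z is treated as 1+.

```python
from dataclasses import dataclass
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


@dataclass
class Spectrum:
    precursor_mz: float
    peaks: List[Peak]


def total_intensity(spectrum: Spectrum) -> float:
    """Sum of all fragment peak intensities."""
    return sum(intensity for _, intensity in spectrum.peaks)


def fraction_below_precursor(spectrum: Spectrum) -> float:
    """Share of total intensity carried by peaks below the precursor m/z."""
    total = total_intensity(spectrum)
    if total == 0:
        return 0.0
    below = sum(i for mz, i in spectrum.peaks if mz < spectrum.precursor_mz)
    return below / total


def looks_poor_quality(spectrum: Spectrum,
                       min_peaks: int = 10,
                       min_total_intensity: float = 100.0) -> bool:
    """Hypothetical quality rule: too few peaks or too little signal.

    Stand-in for a trained classifier; the thresholds are illustrative only.
    """
    return (len(spectrum.peaks) < min_peaks
            or total_intensity(spectrum) < min_total_intensity)


def guess_charge_class(spectrum: Spectrum, cutoff: float = 0.95) -> str:
    """Label a spectrum '1+' if almost all intensity lies below the precursor."""
    if fraction_below_precursor(spectrum) >= cutoff:
        return "1+"
    return "multiple"


def prefilter(spectra: List[Spectrum]) -> List[Tuple[Spectrum, str]]:
    """Drop poor-quality spectra and annotate the rest with a charge class."""
    kept = []
    for spectrum in spectra:
        if looks_poor_quality(spectrum):
            continue  # never sent to the downstream search engine
        kept.append((spectrum, guess_charge_class(spectrum)))
    return kept
```

Only the spectra returned by `prefilter` would be passed to a database search or de novo sequencing tool, and a spectrum labeled `"1+"` would be searched under one charge hypothesis instead of several, which is where the time saving would come from.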
