ALGORITHMS FOR SHOTGUN PROTEOMICS SPECTRAL IDENTIFICATION
AND QUALITY ASSESSMENT
By
Ze-Qiang Ma
Dissertation
Submitted to the Faculty of the
Graduate School of Vanderbilt University
in partial fulfillment of the requirements
for the degree of
DOCTOR OF PHILOSOPHY
Professor David L. Tabb
Professor Daniel C. Liebler
Professor Bing Zhang
Professor Kathleen L. Gould
Professor Zhongming Zhao
ACKNOWLEDGMENTS
I would like to express profound gratitude to my advisor, Dr. David L. Tabb, for his invaluable support, supervision and helpful suggestions throughout all my graduate school research work. I am also grateful to my other dissertation committee members, Dr. Daniel C. Liebler, Dr. Bing Zhang, Dr. Kathleen L. Gould and Dr. Zhongming Zhao, who were very supportive of my research and provided valuable advice on my dissertation work.
I would like to thank the other members of the Tabb group, particularly Dr. Surendra Dasari and our star programmer Matt Chambers, for their tremendous help in my research. I always found it fun to work with them, and I learned something new from them every day.
I am also grateful to Dr. Amy-Joan L. Ham, Dr. Stacy D. Sherrod and Dr. Robbert Slebos at the Jim Ayers Institute for Precancer Detection and Diagnosis at Vanderbilt University for providing testing data sets and helpful discussions for my dissertation work.
Finally, I would like to express my gratitude to my wife Yang Wang and our lovely daughter Olivia Ma for their unconditional support and patience. I want to thank my parents for being ever so understanding and supportive.
Thanks to NIH grants R01 CA126218 and U24 CA126479 for supporting my research work.
ABBREVIATIONS
1D, 2D    One-Dimensional, Two-Dimensional
BSA    Bovine Serum Albumin
CID    Collision Induced Dissociation
CPTAC    Clinical Proteomic Tumor Analysis Consortium
Da    Dalton
DNA    DeoxyriboNucleic Acid
DTT    DiThioThreitol
ESI    ElectroSpray Ionization
ETD    Electron Transfer Dissociation
FDR    False Discovery Rate
FPR    False Positive Rate
FTICR    Fourier Transform Ion Cyclotron Resonance
GUI    Graphical User Interface
HCD    Higher-energy Collision Dissociation
HPLC    High Pressure Liquid Chromatography
IMAC    Immobilized Metal Ion Affinity Chromatography
MALDI    Matrix Assisted Laser Desorption and Ionization
MRM    Multiple Reaction Monitoring
NCI    National Cancer Institute
NGS    Next Generation Sequencing
NIST    National Institute of Standards and Technology
OMSSA    Open Mass Spectrometry Search Algorithm
PEP    Posterior Error Probability
SDS-PAGE    Sodium Dodecyl Sulfate PolyAcrylamide Gel Electrophoresis
S/N    Signal-to-Noise ratio
TCGA    The Cancer Genome Atlas
XIC    Extracted Ion Chromatograms
LIST OF TABLES
LIST OF FIGURES
Chapter I. INTRODUCTION
I.1 Mass Spectrometry-Based Proteomics
I.1.2 Sample Preparation and Separation
I.1.3 Protein Digestion
I.1.4 Mass Spectrometry Instruments
I.1.5 Peptide Fragmentation
I.2 Proteomics Data Analysis
I.2.2 Peptide Identification
I.2.3 Peptide Validation
I.2.4 Protein Inference
I.3 Instrumentation Quality Control
I.4 Dissertation Outline
IDENTIFICATIONS VIA SPECTRAL CLUSTERING
II.2.2 Spectral Clustering
II.2.3 Rescue of Spectral Identifications
II.2.4 Bayesian Average Score
II.3 Data Sources
II.4 Results and Discussion
II.4.2 Rescue of Spectra in Comparative Analysis
II.4.3 Rescue of Spectra in a Variety of Datasets
III. SCANRANKER: QUALITY ASSESSMENT OF TANDEM MASS SPECTRA VIA SEQUENCE TAGGING
III.2.2 BestTagScore Subscore
III.2.3 BestTagTIC Subscore
III.2.4 TagMzRange Subscore
III.3 Data Sources
III.4 Results and Discussion
III.4.1 Subscore Evaluation
III.4.2 Removal of Low Quality Spectra
III.4.3 Recovery of Unidentified High Quality Spectra
III.4.4 Comparison of ScanRanker to QualScore
III.4.5 Prediction of Richness of Identifiable Spectra
III.4.6 Use of Quality Score in Peptide Validation
III.4.7 Selection of Spectra for De Novo Sequencing
III.4.8 Use of ScanRanker in Cross-linking Analysis
IV. QUAMETER: MULTI-VENDOR PERFORMANCE METRICS FOR LC-MS/MS PROTEOMICS INSTRUMENTATION
IV.3 Data Sources
IV.4 Results and Discussion
IV.4.1 Differences between QuaMeter and MSQC
IV.4.2 Multi-vendor Performance
IV.4.3 Impact of identification tools
V.1 Summary of Results
V.2 Future Direction
V.2.1 Peptide Identification
V.2.2 PTM Identification and Validation
V.2.3 Next Generation Sequencing and Proteomics
V.2.4 Integration of Omics Data
V.2.5 Targeted Proteomics
Appendix A. SOFTWARE CONFIGURATIONS
Table 1. Bioinformatics tools for MS-based proteomics data analysis.
Table 2. Experimental datasets for the evaluation of IDBoost.
Table 3. Experimental datasets for the evaluation of ScanRanker.
Table 4. Experimental datasets for the evaluation of QuaMeter.
Figure 1. The typical MS-based proteomics workflow.
Figure 2. Theoretical fragmentation of a peptide.
Figure 3. Mobile proton model for peptide fragmentation.
Figure 4. The typical MS-based proteomics data analysis workflow.
Figure 5. Four peptide identification strategies.
Figure 6. Peptide identification by the database search strategy.
Figure 7. Score distribution for correct and incorrect PSMs.
Figure 8. A simplified example of protein inference.
Figure 9. A diagram of rescuing unidentified spectra in a cluster.
Figure 10. Analysis of rescued PSMs in phosphorylation studies.
Figure 11. Impact of IDBoost on recognition of differentially expressed proteins in comparative analysis.
Figure 12. IDBoost performance in a variety of datasets.
Figure 13. A screenshot of ScanRanker GUI.
Figure 14. A screenshot of IonMatcher GUI.
Figure 15. Combining three subscores improves the discriminating power of ScanRanker.
Figure 16. Removing poor MS/MS scans in ScanRanker does not significantly reduce identifications.
Figure 18. Evaluation of ScanRanker to recover unidentified high quality spectra.
Figure 19. Comparison of ScanRanker to QualScore.
Figure 20. ScanRanker scores predict the richness of identifiable spectra.
Figure 21. Adding ScanRanker scores in peptide validation increases the number of confident spectrum identifications.
Figure 22. ScanRanker scores can be used to predict de novo sequencing success.
Figure 23. ScanRanker helps to prioritize spectra for manual inspection in cross-linking analysis.
Figure 24. Workflow diagram for QuaMeter operation.
Figure 25. QuaMeter generates metrics similar to those of MSQC, except for several chromatographic metrics that differ because the two tools use distinct chromatogram extraction tools.
Figure 26. QuaMeter generates reliable chromatographic data in instruments from multiple vendors via the Crawdad function in ProteoWizard.
Figure 27. QuaMeter computes QC metrics for multiple instrument platforms.
Figure 28. QuaMeter metrics help to spot abnormal instrument performance.
Figure 29. Distinct identification tools produce different QC metrics with similar variation.
Figure 30. A summary of three bioinformatics tools in proteomics data analysis workflow.
The topic of this dissertation is the development of novel algorithms and bioinformatics tools for proteomics data analysis. This chapter provides a general introduction to the field of proteomics and the data analysis process. It is not intended as complete coverage of all areas of proteomics, but rather as an overview that provides the background needed to understand the work detailed in the following chapters.
Proteomics as a discipline can be defined as the identification and quantification of the complete set of proteins in a cell or tissue in a particular state. Although a number of alternative proteomics strategies, such as protein array-based methods, have been developed, mass spectrometry (MS)-based proteomics has become the method of choice for large-scale studies. MS-based proteomics approaches have proved successful in molecular and cellular biology research, including post-translational modification (PTM) identification and the study of protein-protein interactions (Aebersold & Mann 2003). With recent improvements in instrumentation and methodology, proteomics has advanced tremendously over the past few years, enabling many powerful applications such as functional analysis of complex organisms (Schrimpf et al. 2009), global analysis of PTMs (Witze et al. 2007), large-scale reconstruction of protein interaction networks (Gstaiger & Aebersold 2009), and the introduction of proteomics into clinical and translational research (Bousquet-Dubouch et al. 2011).
[Figure 1 diagram; recoverable stage labels: Tandem Mass Spectra, Peptide Identifications, Confident Peptide List, Assembled Protein List]
Figure 1. The typical MS-based proteomics workflow.
The typical workflow for a bottom-up MS-based proteomics experiment is illustrated in Figure 1. The first step is to reduce the complexity of a biological sample by one or several separation techniques such as SDS-PAGE and two-dimensional (2D) gel electrophoresis. Large proteins are then digested into peptides using site-specific proteases. Next, peptide mixtures are separated by liquid chromatography and ionized in a mass spectrometer. Precursor ions with particular mass-to-charge (m/z) values are selected and collided with a nonreactive gas to generate fragment ions. The m/z values and peak intensities of the fragment ions are recorded in tandem mass spectra, which computational tools interpret as peptide sequences. Finally, the identified peptides are assembled into a list of proteins that are most likely present in the sample.
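The digestion and precursor m/z steps of this workflow can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: trypsin is assumed as the site-specific protease (the text does not name one), with the commonly used rule that it cleaves after K or R unless the next residue is P; missed cleavages and modifications are ignored; and the function names `digest_trypsin` and `peptide_mz` are hypothetical, not part of any tool described in this dissertation.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O added to the residue sum (Da)
PROTON = 1.007276   # mass of a proton (Da)


def digest_trypsin(protein):
    """Assumed rule: cleave after K or R unless followed by P; no missed cleavages."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide without K/R
    return peptides


def peptide_mz(peptide, charge):
    """m/z of an unmodified peptide carrying `charge` protons."""
    neutral_mass = sum(MONO[residue] for residue in peptide) + WATER
    return (neutral_mass + charge * PROTON) / charge


if __name__ == "__main__":
    for pep in digest_trypsin("AKRPGK"):
        print(pep, round(peptide_mz(pep, 2), 4))
```

The same neutral mass yields a different m/z for each charge state, which is why a search tool needs the precursor charge (or must try several) before comparing an observed precursor to candidate peptides.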
In proteomics studies, complex biological samples that contain a large number of proteins are often separated into simpler mixtures prior to MS analysis. Various separation techniques can be used for this purpose. A widely used approach is to separate protein mixtures by SDS-PAGE and then cut the gel into fractions for MS analysis. Samples of high complexity are now often fractionated by 2D-gel electrophoresis (Kenrick & Margolis 1970), which separates proteins based on their isoelectric points and molecular weights. Each spot in the gel may represent one or several purified proteins that can be further analyzed by MS. Recently, a gel-based peptide-level isoelectric focusing approach (Hörth et al. 2006) has been shown to provide complementary coverage to the conventional gel-based fractionation method and to yield higher identification rates (Hubner et al. 2008).
A gel-free approach known as shotgun proteomics directly analyzes large mixtures of peptides by coupling the electrospray ionization (ESI) source of the mass spectrometer in-line with a liquid chromatography (LC) system. Peptides are separated in the chromatography system to reduce sample complexity. Two major types of LC systems are reverse phase high pressure liquid chromatography (RP-HPLC), which separates molecules by hydrophobicity, and ion exchange chromatography, which separates molecules by charge. High complexity samples can be separated using the multidimensional protein identification technology (MudPIT) (Washburn et al. 2001), which couples two chromatographic dimensions. The first dimension is usually a strong cation exchange (SCX) column with high loading capacity. Eluted samples are subsequently separated by reverse phase chromatography.
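The idea of two orthogonal separation dimensions can be illustrated with a toy model. The sketch below is not how MudPIT or any chromatography system is implemented; it assumes two crude proxies that are not taken from the text: the first dimension groups peptides by an approximate positive charge at acidic pH (the N-terminus plus each K, R and H), and the second orders peptides within each group by mean Kyte-Doolittle hydropathy as a rough stand-in for reverse phase retention. The function names are hypothetical.

```python
from collections import defaultdict

# Kyte-Doolittle hydropathy values, used here only as a crude proxy
# for hydrophobicity-driven retention on a reverse phase column.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def approx_positive_charge(peptide):
    """Assumed proxy for the SCX dimension: N-terminus plus K, R and H residues."""
    return 1 + sum(peptide.count(residue) for residue in "KRH")


def mean_hydropathy(peptide):
    """Assumed proxy for the reverse phase dimension: mean Kyte-Doolittle value."""
    return sum(KD[residue] for residue in peptide) / len(peptide)


def two_dimensional_order(peptides):
    """Group by approximate charge, then sort each group by hydropathy."""
    fractions = defaultdict(list)
    for peptide in peptides:
        fractions[approx_positive_charge(peptide)].append(peptide)
    return {
        charge: sorted(group, key=mean_hydropathy)
        for charge, group in sorted(fractions.items())
    }


if __name__ == "__main__":
    for charge, group in two_dimensional_order(["LLIVK", "GDSEK", "AHKR"]).items():
        print(charge, group)
```

Because the two proxies depend on different peptide properties, peptides that fall in the same first-dimension group can still be resolved in the second, which is the motivation for combining an ion exchange step with a reverse phase step.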