Assignment of Local Protein Structure
with Different Strategies
Dissertation for the award of the academic degree of
Doctor of Natural Sciences (Dr. rer. nat.)
submitted to the Department of Biology, Chemistry, Pharmacy
of the Freie Universität Berlin
presented in English by
Berlin, October 2014
The present work was carried out under the supervision of Prof. Dr. E. W. Knapp from May 2010 to August 2014 at the Institute of Chemistry / Physical and Theoretical Chemistry of the Freie Universität Berlin, Department of Biology, Chemistry and Pharmacy.
1st reviewer: Prof. Dr. Ernst-Walter Knapp
2nd reviewer: Prof. Dr. Markus Wahl
Disputation on 16.12.2014
Preamble
This thesis summarizes my doctoral research work. It is mainly based on the following two peer-reviewed journal publications:
J. Zacharias and E. W. Knapp, “Geometry motivated alternative view on local protein backbone structures,” Protein Sci., vol. 22, no. 11, pp. 1669–74, Nov. 2013.
http://dx.doi.org/10.1002/pro.2364
J. Zacharias and E.-W. Knapp, “Protein Secondary Structure Classification Revisited: Processing DSSP Information with PSSC,” J. Chem. Inf. Model., Jun. 2014.
http://dx.doi.org/10.1021/ci5000856
Acknowledgements
This work was carried out at the Freie Universität Berlin in the group of Prof. Ernst-Walter Knapp. I would like to thank him for fruitful discussions and his valuable support.
Arturo Robertazzi for proofreading this manuscript.
Nadia Elghobashi-Meinhardt for proofreading both papers.
All members of the Knapp Group who created a cooperative and friendly working environment.
My family for their constant moral and occasional financial support.
Statutory Declaration
I hereby testify that this thesis is the result of my own work and research, except for any explicitly referenced material, whose source can be found in the bibliography. This work contains material that is the copyright property of others which cannot be reproduced without the permission of the copyright owner.
Jan Zacharias
Table of contents
Introduction
“Geometry motivated alternative view on local protein backbone structures”
“Protein Secondary Structure Classification Revisited:
Processing DSSP Information with PSSC”
Protein Backbone Geometry
Backbone dihedral angles
Hydrogen bonds in proteins
Definition in DSSP and PSSC
Secondary Structure Types
3₁₀-, α-, and π-helix
Amino Acid Preferences
Secondary Structure Assignment
Development of PSSC
Differences between DSSP and PSSC
Hydrogens and Hydrogen Bonds
Efficient evaluation of Hydrogen-bonded residue pairs
Solvent Accessible Area Calculation
Secondary Structure Assignment with PSSC
Turns and Helices
Bridges and Strands
Seven Building Blocks of Hydrogen-Bonded Secondary Structure
Coils and Bends
Assessment of Dihedral Angles
Isolated Strands and Polyproline Helix
Discriminating between Strands, Isolated Strands, and PII Helices
Development of a Web Frontend for PSSC
Modeling of Hydrogen Positions
Preparation of Structural Data
Adding Hydrogen Atoms
Summary in German
Introduction
Proteins are polymers of amino acids and are essential for all living organisms. They play an important role in virtually all biological reactions, a fact reflected by the abundance of proteins in eukaryotic cells, which consist of 70% water and 15% proteins. The broad range of protein functions covers active roles such as immune response, cell signaling, cell reproduction, and catalysis of biochemical reactions, as well as passive tasks such as structural functions in the viral envelope, in collagen, and in keratin.
The function of a protein is determined by its structure, which is usually described on four different levels of organization: the primary, secondary, tertiary, and quaternary structure.
Proteins consist of polypeptide chains of highly variable length, built from the twenty different proteinogenic amino acids. The sequence of these amino acids represents the primary structure of the protein. The size of proteins spans the whole range from 20 amino acid residues, as in the synthetic Trp-Cage miniprotein, up to 33,000 residues in titin, which provides the passive elasticity of muscles.
Figure 1: The four levels of biomolecular structure. Minor modifications to original artwork by Mariana Ruiz Villarreal.
The secondary structure describes the spatial arrangement of shorter protein segments. The most prominent examples of regularly repeating secondary structure motifs are the α-helix and the β-strand. Following the preliminary work by William Astbury in the early 1930s and the prediction by Linus Pauling in 1951, the first X-ray structures of myoglobin and hemoglobin were solved three years later, confirming the existence of these structures. In fact, roughly one half¹ of all residues in a protein are either helical or part of a β-strand. The terms “α-helix” and “β-strand” were derived from the fibrous structural proteins α- and β-keratin, which are both rich in the respective motifs. Neighboring β-strands form the so-called β-pleated sheets (also called β-sheets). Helices and β-sheets are stabilized by a repeating hydrogen-bond pattern between the C=O and N-H groups of the protein backbone of different amino acid residues.
Figure 2: Two visualizations of the pepsin inhibitor-3 protein (PDB id: 1F34) from Ascaris suum (large roundworm of pigs). Left: All-atom representation with backbone atoms in solid and side-chain atoms in transparent mode. Right: Cartoon representation of the same protein. α-helices are shown in purple, 3₁₀-helices in dark blue, strands in yellow, and hydrogen-bonded turns in turquoise. Both images were created with VMD, which uses STRIDE for secondary structure assignment.
¹ 54% for the Astral40 dataset (version 1.75) according to PSSC and DSSP.
At the next higher level of organization, the tertiary structure describes the complete three-dimensional folding of the protein’s peptide chain. Besides covalent disulfide bonds between cysteine side chains, a combination of non-covalent interactions stabilizes the structure of a protein: the hydrophobic effect of polypeptide-water interactions, salt bridges, and hydrogen bonds involving backbone and side-chain groups.
Several protein chains can form a protein complex. A specific protein may only be functional as such a multimer. As an example, antibodies consist of four chains, i.e., two copies of the immunoglobulin heavy chain and two of the immunoglobulin light chain. The arrangement of protein subunits in space is described by the quaternary structure.
Knowledge about a protein’s fold, the arrangement of major secondary structural elements, is a key step towards the understanding of the protein’s function. Even though a strong correlation between structure and function exists, structural conservation between functionally similar proteins is more pronounced than conservation of amino acid sequences.
The protein folding problem is the task of predicting the three-dimensional structure of a protein from its sequence, i.e., secondary and tertiary structure from primary structure. Experimental determination of a protein’s structure is mostly carried out with either X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy. NMR is usually restricted to smaller water-soluble proteins, while X-ray crystallography requires the protein to be prepared in the crystalline state first, a non-trivial process that may be challenging or even impossible for some proteins. In contrast, full genome sequencing has become a largely automated process with high throughput at low cost. Hence, the number of known protein sequences rises at a much higher rate than that of known structures. A reliable method for secondary structure assignment is crucial both for measuring the quality of a secondary structure prediction and for training the algorithms employed for this task.
A versatile tool to describe a protein’s structure in a two-dimensional graph is the Ramachandran or (φ, ψ) plot, introduced in 1963. In this representation, the backbone torsion angle ψ of a residue is plotted against the torsion angle φ, leading to a distinctive scatter plot in which residues of similar secondary structure are found in close proximity, independent of their distance in space and in sequence.
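As an illustration of how the two angles are obtained from atomic coordinates, the following Python sketch computes a signed torsion angle from four points. For residue i, φ is defined by the atoms C(i−1), N(i), Cα(i), C(i), and ψ by N(i), Cα(i), C(i), N(i+1). The function name dihedral_deg is hypothetical and is not part of DSSP or PSSC; the sketch only assumes Cartesian coordinates given as 3-tuples.

```python
import math


def _sub(a, b):
    """Component-wise difference a - b of two 3-vectors."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    """Scalar product of two 3-vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    """Vector product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_deg(p0, p1, p2, p3):
    """Signed torsion angle (degrees, range -180..180) about the p1-p2 bond.

    The sign follows the usual convention: positive if, looking from p1
    to p2, the front bond must be rotated clockwise to eclipse the rear bond.
    """
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)          # normal of the plane (p0, p1, p2)
    n2 = _cross(b2, b3)          # normal of the plane (p1, p2, p3)
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))
```

Calling dihedral_deg with the coordinates of C(i−1), N(i), Cα(i), and C(i) yields φ of residue i; the same call with N(i), Cα(i), C(i), and N(i+1) yields ψ.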
Protein structures can also be depicted in a three-dimensional manner on modern computer hardware. The “cartoon” or ribbon representation has become the most common way of representing protein structures, both in publications (usually created with tools such as molscript, VMD, and PyMOL) and in online tools for interactive protein visualization such as Jmol, JSmol, and GLmol. For such tools to work properly, a solid assignment of helix and β-strand residues is critical to avoid visually unattractive and, most importantly, misleading results.
Interestingly, despite the indisputable importance of secondary structure prediction and hence structure assignment, a widely accepted canonical definition of protein secondary structure has not yet been proposed. Textbooks as well as publications dealing with protein structure almost exclusively focus on idealized motifs that are of infinite length without disruptions and ambiguities. In the transition region between two secondary structural motifs, however, residues exist that may be assigned to either of the two connected motifs. Helices in particular tend to possess contractions and bulges that add 3₁₀-helical or π-helical character to residues of an α-helix.
While some publications deal with the problem of helix capping and kinks in longer helices, the majority of available secondary structure assignment software does not take the information from these studies into account.
The de facto standard for assigning protein secondary structure remains the software DSSP, which was developed in 1983 by Kabsch and Sander. During my work, I developed a fork of this software, named PSSC (Protein Secondary Structure Characterization), that fixes many of the problems of the original software and adds new features such as the identification of mixed secondary structure classes, left-handed hydrogen-bonded helices, and the polyproline II helix.
Publications
“Geometry motivated alternative view on local protein backbone structures”
Authors: Zacharias, J., Knapp, E.W.
• Development of the research question
• Development of the webpage and necessary software tools
• Generation and analysis of the results
• Manuscript preparation
Summary
In this publication, the (d, ϑ)-plot is introduced as an alternative to the well-known Ramachandran plot. Instead of the (φ, ψ) backbone angles, the helix rotation angle ϑ and the helical rise parameter d are displayed in a polar diagram. Both parameters are derived from a description of the local protein backbone structure in terms of the helix that would occur if the (φ, ψ) angles were repeated indefinitely. As repeated values of φ and ψ always result in a helical symmetry of the backbone structure, this transformation is possible for the whole (φ, ψ) space. A helix can be described by the angular rotation step ϑ and the rise d per residue, both with respect to the helical axis.
Assuming standard backbone geometry, the formulas for d and ϑ are then given by:
The sign of ϑ corresponds to the handedness of the helix (positive for right-handed, negative for left-handed), and the number of residues per full turn is given by n = 360°/|ϑ|. Hence, a clear discrimination of the handedness of a local structural motif is gained: residues on the left side of the (d, ϑ)-plot correspond to (φ, ψ) values that would generate left-handed helices if repeated. For this publication, all pairs of the parameters n, r, d, D, ϑ, φ, and ψ (where D = d·n) were examined, and the combination of d and ϑ was found to be most insightful.
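The transformation from (φ, ψ) to (d, ϑ) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the numerical coefficients are commonly quoted values for a planar trans peptide unit with standard bond lengths and angles (Miyazawa-type relations) and are not taken from this work, so they may differ slightly from the ones used for the plots here. The function name helix_parameters and the convention of reporting d ≥ 0 with the handedness carried by the sign of ϑ are likewise assumptions made for illustration.

```python
import math

# Assumed coefficients for standard trans peptide geometry (not from this
# thesis); the rise coefficients are in Angstrom.
C_SUM, C_DIFF = -0.817, 0.045    # cos(theta/2) = C_SUM*sin(s) + C_DIFF*sin(t)
D_SUM, D_DIFF = 2.967, -0.664    # d*sin(theta/2) = D_SUM*cos(s) + D_DIFF*cos(t)


def helix_parameters(phi_deg, psi_deg):
    """Return (d, theta, n) for a backbone with repeated (phi, psi).

    d     -- rise per residue along the helix axis (Angstrom, >= 0)
    theta -- rotation per residue in degrees, positive for right-handed
    n     -- residues per full turn, 360 / |theta|
    """
    s = math.radians(0.5 * (phi_deg + psi_deg))
    t = math.radians(0.5 * (phi_deg - psi_deg))

    # |cos_half| <= 0.817 + 0.045 < 1, so acos is defined and sin(half) > 0.
    cos_half = C_SUM * math.sin(s) + C_DIFF * math.sin(t)
    half = math.acos(cos_half)
    d = (D_SUM * math.cos(s) + D_DIFF * math.cos(t)) / math.sin(half)

    theta = math.degrees(2.0 * half)      # in (0, 360)
    if theta > 180.0:                     # map to (-180, 180]
        theta -= 360.0
    if d < 0.0:                           # reverse the axis direction so d >= 0
        d, theta = -d, -theta
    n = 360.0 / abs(theta)
    return d, theta, n
```

With these assumed coefficients, typical α-helical angles such as (φ, ψ) = (−57°, −47°) give a rise of roughly 1.5 Å and a rotation of roughly +100° per residue, while the mirrored angles (57°, 47°) give the same rise with a negative ϑ.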
Lines of constant ϑ are almost parallel to the lines for φ + ψ = const. Firstly, it should be noted that helical residues possess dihedral angles such that the sum φ + ψ is approximately constant²; secondly, the boundary between the PII basin and the β-strand region is also diagonally shaped. Both features make the (d, ϑ)-plot very appealing for secondary structure assignment.
Figure 3: Comparison of protein visualizations. A: Cartoon representation of the crystal structure of chain A of human superoxide dismutase (PDB id: 1KKC). B: Ramachandran plot with sterically allowed regions shaded in gray. C: (d, ϑ)-plot of the same data.
² To substantiate this claim, the mean and the standard deviation of the dihedral angles were evaluated for all residues in the Astral40 protein dataset that belong to α-helices according to PSSC. The results are φ = -64° ± 12°, ψ = -40° ± 12°, and φ + ψ = -105° ± 13°. If the values of φ and ψ were uncorrelated, the standard deviation of their sum would be √(12² + 12²)° ≈ 17°. A clear diagonal trend can also be observed for the allowed regions in the Ramachandran plot of Figure 3.
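The argument can be made explicit with the variance of a sum of two correlated quantities. The following short calculation is an illustration based only on the rounded values quoted above; the symbol ρ for the correlation coefficient of φ and ψ is introduced here and the resulting number is therefore only a rough estimate.

```latex
\sigma_{\varphi+\psi}^{2}
  = \sigma_{\varphi}^{2} + \sigma_{\psi}^{2}
    + 2\,\rho\,\sigma_{\varphi}\,\sigma_{\psi}
\qquad\Longrightarrow\qquad
\rho = \frac{\sigma_{\varphi+\psi}^{2} - \sigma_{\varphi}^{2} - \sigma_{\psi}^{2}}
            {2\,\sigma_{\varphi}\,\sigma_{\psi}}
     \approx \frac{13^{2} - 12^{2} - 12^{2}}{2 \cdot 12 \cdot 12}
     = \frac{-119}{288} \approx -0.41
```

For ρ = 0 the expression reduces to σ = √(12² + 12²)° ≈ 17°. The smaller observed value of 13° thus corresponds to a negative correlation between φ and ψ within α-helices, in line with an approximately constant sum φ + ψ.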
“Protein Secondary Structure Classification Revisited: Processing DSSP Information with PSSC”
Authors: Zacharias, J., Knapp, E.W.
• Software development
• Generation and analysis of the results