THE SIGNIFICANCE AND IMPACTS
OF PROTEIN DISORDER AND
Jenny Gu and Vincent Hilser
INTRODUCTION
Protein disorder deserves attention from the structural bioinformatics community largely for the technical challenges it presents to the field, but also for its biological and functional implications. The success of structural genomics efforts using X-ray crystallography depends on overcoming several potential bottlenecks (Chapter 40), one of which is the formation of protein crystals, which can be obstructed by highly flexible and disordered regions. Although such regions limit the number of structures that can be obtained, and thus the coverage of protein space, our current general understanding of disordered regions is itself a result of structural bioinformatics efforts that extracted and analyzed the patterns associated with them. The resulting disorder predictors have proven useful in advancing our understanding of disordered regions and could improve the success rate of structural genomics efforts, particularly those focused on eukaryotic proteins (Oldfield et al., 2005b).
Structural Bioinformatics, Second Edition Edited by Jenny Gu and Philip E. Bourne Copyright © 2009 John Wiley & Sons, Inc.
Resolving the differences observed among conformational variants within protein families, and understanding their impact, is also an emerging issue. Most structural genomics efforts aim to solve a representative structure for each protein family to maximize the coverage of protein space, with a particular focus on identifying new protein folds. However, it is equally important to understand the structural changes that result from sequence differences introduced by a few single point mutations, insertions, and/or deletions, since these can have a large functional impact. Furthermore, it is often overlooked that the structural information recorded in the Protein Data Bank (PDB) is a macroscopic view of a collection of microscopic ensembles that give rise to the observed protein structure. In other words, the observed protein structure is not the only conformation adopted by the protein. In fact, most observed biological phenomena are a macroscopic consequence of the collective microscopic states. Understanding the differences among the microscopic states, and how changes in them affect the macroscopic event, is currently addressed in several ways that will be discussed.
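The relationship between microscopic states and a macroscopic observation can be illustrated with a minimal sketch. It assumes a small, discrete set of conformational states weighted by the Boltzmann distribution, which is a standard statistical-mechanics simplification and not a method described in this chapter. The state free energies and the per-state observable (`helicity`) are hypothetical numbers chosen only for illustration.

```python
import math

GAS_CONSTANT = 0.0019872041  # kcal/(mol*K)


def boltzmann_weights(free_energies_kcal, temperature_k=298.15):
    """Return the equilibrium probability of each microstate."""
    rt = GAS_CONSTANT * temperature_k
    g_min = min(free_energies_kcal)  # shift energies for numerical stability
    factors = [math.exp(-(g - g_min) / rt) for g in free_energies_kcal]
    partition = sum(factors)
    return [f / partition for f in factors]


def ensemble_average(observable, weights):
    """Macroscopic value = probability-weighted mean over microstates."""
    return sum(o * w for o, w in zip(observable, weights))


# Hypothetical three-state ensemble: folded, partly unfolded, unfolded.
free_energies = [0.0, 0.5, 2.0]   # kcal/mol relative to the folded state
helicity = [1.0, 0.4, 0.0]        # made-up per-state observable
weights = boltzmann_weights(free_energies)
observed = ensemble_average(helicity, weights)
```

In this toy picture, a single deposited structure corresponds to the dominant state, while an experimental measurement reports something closer to `observed`, which mixes contributions from all states.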
By exploiting the technical weakness in structural data, researchers have been able to gain insight into the potential biological significance of these otherwise poorly characterized disordered regions (Ringe and Petsko, 1986). The importance of protein disorder in biological function was recognized as early as the late 1970s, when disordered regions were seen to recur within particular features of enzymes such as the zymogens of pancreatic serine proteases and tyrosyl-tRNA synthetases (Blow, 1977). In light of these investigations, the hypothesis presented at the time was that reactivity and specificity are associated with more rigid structures, while disordered regions may be involved in the control of function. Since then, many functional roles of disordered regions, including regulatory control, have been implicated through experimental investigation, statistical mechanics, and structural bioinformatics approaches.
While the topics of protein disorder and conformational variation are intrinsically related to protein flexibility, they warrant a separate chapter from “Protein Motion: Simulation” (Chapter 37), largely because they deal with time frames and complexity beyond what is captured by protein dynamics modeling approaches (Figure 38.1). Molecular dynamics simulations have been used to study conformational disorder and protein variants, albeit with limitations (Torda and Scheek, 1990; Kuriyan et al., 1991; Fuentes et al., 2005). Longer molecular dynamics simulations are reserved for smaller proteins; for larger proteins, simulations are restricted to short time frames on the order of nanoseconds. As such, the conformational changes observed in these simulations are also limited. The topics of disorder and conformational variation discussed here extend beyond what molecular dynamics simulations can offer, although various strategies have been used to address this issue, such as Monte Carlo sampling (Lindorff-Larsen et al., 2004) and averaging over a few generated conformers while applying experimental constraints (Kemmink et al., 1993; Bonvin and Brunger, 1995). Coarse-grained dynamic modeling addresses molecular motion beyond the time frame limitations of classical molecular dynamics. However, a systematic comparison between disordered regions and the large-amplitude fluctuating regions modeled by these rigid-body-based approaches has yet to be conducted.
Figure 38.1. Range of protein dynamics and structural observation. Protein flexibility lies on a spectrum where fluctuations occur over a range of time scales. Ordered structures can be visualized with simulated motion limited to the nanosecond range. Beyond these limits, protein dynamics is perceived as protein disorder, lacking stable structure.
In this chapter, we briefly discuss the experimental methods used to study disordered regions and highlight the computational resources that have largely fueled the advancement of this field by providing many of the current generalized observations. The biological importance of protein disorder and conformational variations as they exist in microscopic ensembles will also be examined in more detail. We aim to provide an introduction to the subject and apologize if not all research efforts in this rapidly growing field are represented.
PROTEIN DISORDER: UNDERSTANDING THE REALM OF “INVISIBLE”
Defining Protein Disorder
Before proceeding, we must first make clear that the field currently lacks a unifying definition for protein flexibility, disorder, and intrinsically unstructured proteins. These terms are often used interchangeably, largely because their definitions are qualitative, and this can confuse readers if the slight distinctions are not clarified. Other disorder-related terms coined in the field include intrinsic coils, random coils, unfolded proteins, molten globules, and premolten globules, all of which describe protein states that are not natively folded. These terms often refer to the global state of the protein rather than to specific disordered regions within the protein structure. Without setting a standard nomenclature for the field, we use “disorder” in this chapter to mean regions of the protein structure where the equilibrium positions of the backbone atoms, along with the dihedral angles, have no specific values and vary significantly over time.
When evaluating and using disorder predictors, it is also important to know how disordered regions were defined when the predictors were trained. Some sequence-based disorder predictors, such as PONDR (Romero et al., 1997) and DISOPRED (Jones and Ward, 2003), were trained on disorder defined as regions missing from X-ray crystallographic structures. This definition is also used by evaluators to benchmark the performance of disorder predictors in CASP experiments (Chapter 28). Other predictors, such as GlobPlot (Linding et al., 2003b) and DisEMBL (Linding et al., 2003a), were instead trained on a definition that applies a temperature factor (B-value) threshold to X-ray crystal structures. Finally, further subtle differences should be considered for predictors such as RONN (Yang et al., 2005) and Wiggle (Gu, Gribskov, and Bourne, 2006). RONN additionally uses curated information from homologous proteins to make predictions about disordered regions, and Wiggle was trained on a data set in which flexible regions are defined using dynamic modeling techniques. These distinctions should be noted when considering which predictor would best serve the scientific question at hand.
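The two structure-derived definitions above (missing coordinates versus a B-value threshold) can label the same chain quite differently. The sketch below illustrates this under simple assumptions: residues are indexed from 0, one B-value is available per observed residue, and the threshold of 45.0 is an arbitrary placeholder, not the cutoff used by any of the predictors named above. The function names and the example chain are hypothetical.

```python
def disorder_from_missing(sequence, observed_positions):
    """'D' where a residue has no coordinates in the structure, else 'O'."""
    observed = set(observed_positions)
    return "".join("O" if i in observed else "D" for i in range(len(sequence)))


def disorder_from_bfactor(sequence_length, bfactor_by_position, threshold):
    """'D' where the B-value exceeds the threshold, 'O' otherwise.

    Residues without coordinates have no B-value and are marked '-'.
    """
    labels = []
    for i in range(sequence_length):
        b = bfactor_by_position.get(i)
        if b is None:
            labels.append("-")
        else:
            labels.append("D" if b > threshold else "O")
    return "".join(labels)


# Hypothetical 10-residue chain; only positions 2..7 were resolved.
sequence = "MSEQNTLKVA"
bfactors = {2: 55.0, 3: 20.0, 4: 18.0, 5: 22.0, 6: 48.0, 7: 60.0}

by_missing = disorder_from_missing(sequence, bfactors.keys())
by_bfactor = disorder_from_bfactor(len(sequence), bfactors, threshold=45.0)
print(by_missing)  # DDOOOOOODD
print(by_bfactor)  # --DOOODD--
```

Here positions 2, 6, and 7 count as ordered under the missing-residue definition but as disordered under the B-value definition, which is why a predictor's training definition matters when interpreting its output.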
Prevalence of Disordered Protein Regions
Flexible and disordered regions present two challenges to our understanding of protein structures. Aside from preventing atomic coordinates from being resolved for these regions, they also interfere with the formation of the protein crystals needed to collect X-ray diffraction data. Disordered regions are often addressed by removing them from proteins targeted for structure determination. These disordered regions can also be detected using nuclear magnetic resonance (NMR; Chapter 5), but their structure cannot be easily determined because of the increased conformational space they sample. An analysis of a nonredundant subset of the PDB shows that only ~7% of the complete sequences, as deposited in the Swiss-Prot database, contained no disordered regions (Le Gall et al., 2007). Sequences for which 95% of the protein is structurally resolved make up only ~25% of the data set, a surprisingly small share that illustrates the prevalence of disordered regions within protein structures.
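As a rough illustration of how such coverage statistics could be tallied, the sketch below compares the number of resolved residues in each structure with the length of the full deposited sequence. It is not the procedure of Le Gall et al. (2007); the input format, the function names, and the reading of the 95% figure as "at least 95% resolved" are all assumptions, and the four example chains are invented.

```python
def resolved_fraction(sequence_length, n_resolved):
    """Fraction of the full sequence that has coordinates in the structure."""
    return n_resolved / sequence_length


def coverage_summary(chains, cutoff=0.95):
    """Return (share fully resolved, share resolved at or above `cutoff`).

    `chains` is a list of (sequence_length, n_resolved) pairs.
    """
    fractions = [resolved_fraction(length, n) for length, n in chains]
    fully = sum(f == 1.0 for f in fractions) / len(fractions)
    mostly = sum(f >= cutoff for f in fractions) / len(fractions)
    return fully, mostly


# Hypothetical data set of four chains.
chains = [(100, 100), (200, 196), (150, 120), (80, 60)]
fully, mostly = coverage_summary(chains)
print(fully, mostly)  # 0.25 0.5
```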
The presence of disordered regions is not a technical artifact, and several different techniques have been employed to study this phenomenon. Early studies used spectroscopic techniques such as infrared circular dichroism (CD), Fourier transform infrared (FTIR), electron paramagnetic resonance (EPR), and optical rotatory dispersion (ORD) to detect native and nonnative structures that may form within the disordered regions. More recently, NMR and small-angle X-ray scattering (SAXS) have been used to study disordered and denatured proteins (Kern, Eisenmesser, and Wolf-Watz, 2005; Mittag and Forman-Kay, 2007; Sasakawa et al., 2007; Tsutakawa et al., 2007). These approaches provide quantitative data that can be incorporated into the calculation of the conformational ensembles observed in solution, yielding structural information about denatured, unfolded, and intrinsically disordered proteins. Hydrogen–deuterium (H/D) exchange mass spectrometry (Chapter 7) has also been used to study dynamic processes such as the role of transient structural disorder as a facilitator of protein–ligand binding (Xiao and Kaltashov, 2005). These experiments have detected structure forming within disordered regions, and these structures have been associated with functional implications.
With the development of sequence-based predictors, the prevalence of disordered regions has been investigated across the three kingdoms of life (Oldfield et al., 2005a; Ward et al., 2004). The frequency of native disorder was calculated for several representative genomes and was found to be much higher in eukaryotic proteins (33.0%) than in archaeal (2.0%) and eubacterial (4.2%) proteins (Ward et al., 2004). The analysis showed that proteins containing disorder are often located in the cell nucleus and are functionally associated with the regulation of transcription and cell signaling. In a separate study, increased intrinsic disorder content was observed in regulatory, cell signaling, cytoskeletal, and human cancer-associated proteins (Iakoucheva et al., 2002).
Disordered regions are currently being curated into a database, DisProt (Sickmeier et al., 2007), which contains 472 proteins and 1121 disordered regions as reported for release 3.6 (June 29, 2007).
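Genome-wide comparisons like the one above reduce to a tally over per-residue predictions. The sketch below shows one way such a tally could be made, counting the share of proteins that contain at least one long run of predicted disorder. The minimum run length (`min_run=30`), the 'D'/'O' label strings, and the toy proteome are assumptions for illustration; the criterion actually used in the cited studies is not restated here.

```python
from itertools import groupby


def longest_disordered_run(labels):
    """Length of the longest consecutive stretch of 'D' in a label string."""
    return max(
        (sum(1 for _ in group) for key, group in groupby(labels) if key == "D"),
        default=0,
    )


def fraction_with_long_disorder(proteome, min_run=30):
    """Share of proteins with at least one disordered run >= min_run."""
    hits = sum(
        longest_disordered_run(labels) >= min_run for labels in proteome.values()
    )
    return hits / len(proteome)


# Hypothetical per-residue predictions ('D' = disordered, 'O' = ordered).
proteome = {
    "p1": "O" * 50,
    "p2": "O" * 10 + "D" * 35 + "O" * 5,
    "p3": "D" * 10 + "O" * 40 + "D" * 12,
    "p4": "D" * 40,
}
print(fraction_with_long_disorder(proteome))  # 0.5
```

Requiring a minimum run length keeps short flexible loops and termini from being counted as disordered proteins, so the result depends strongly on the chosen cutoff.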
Computational Approaches to Understanding Protein Disorder
The computational tools developed to predict regions of protein flexibility and disorder range from simple sequence complexity profiles to machine learning schemes such as neural networks and support vector machines (SVMs) (Figure 38.2). The successful development of these tools is attributed to the fact that protein disorder has sequence signatures. Training sets for these predictors are most often built from residues reported missing in X-ray crystallographic structures, but reported temperature factors (B-factors) and NMR-characterized disordered regions have also been used. First, we will discuss algorithms that do not use structural information to identify and understand disordered regions. This is achieved either by examining the sequence space only or by focusing on residues for which the structure cannot be resolved. Then we will follow with alternative strategies that use temperature
Figure 38.2. General strategies to predict the disorder sequence space. Schema of various strategies used to identify and understand the sequence space of disordered regions. The differences stem largely from how the disordered regions were defined and from the underlying infrastructure for analysis and prediction tool development. Within the whole of sequence space, a subset is associated with regions of low complexity, regions detected as disordered, or regions transitioning between an ordered and a disordered state. Overlaps can occur between the subsets.
approach hinges on the assumption that disordered regions have low sequence complexities.
However, many disordered regions are not detected by SEG, which suggests that features other than sequence complexity are involved.
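The idea of flagging low-complexity sequence can be illustrated with a sliding-window Shannon entropy profile. This is a simplified stand-in and not the SEG algorithm itself; the window length, the entropy threshold, and the function names are assumed placeholders chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    n = len(window)
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_mask(sequence, window=12, threshold=2.2):
    """Mark every residue covered by a window whose entropy <= threshold."""
    mask = [False] * len(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) <= threshold:
            for i in range(start, start + window):
                mask[i] = True
    return mask


# Hypothetical sequence: a poly-Q stretch between two diverse flanks.
sequence = "MKVLAEGTWR" + "Q" * 12 + "DEHIKLMNPF"
mask = low_complexity_mask(sequence)
flagged = "".join("x" if m else "." for m in mask)
```

A compositionally diverse but disordered segment would pass through this filter unflagged, which matches the observation that sequence complexity alone misses many disordered regions.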