Improving de novo model quality and its application in ab initio phasing
A Dissertation Presented
By Rojan Shrestha
Submitted to
The Graduate School of Frontier Sciences of the
University of Tokyo in partial fulfillment of the requirements
for the degree of
DOCTOR OF PHILOSOPHY
Department of Computational Biology
De novo models are computationally predicted three-dimensional models of proteins built using only amino acid sequence information. The key components of de novo modeling are a method for searching the conformational space and an energy function for evaluating each conformation accurately. The conformational space is astronomically large because of the degrees of freedom associated with each residue, which makes it challenging to develop an efficient search method. Another challenge in de novo modeling is to devise an accurate energy function to evaluate the conformers.
Despite these challenges, de novo modeling has succeeded in generating accurate models for small, single-domain proteins. Fragment assembly is an effective and efficient method for de novo modeling. It assembles fragments from known structures under the guidance of an energy function. This concept was implemented in Rosetta, which has achieved a number of breakthrough successes.
Rosetta generates the final model from the input sequence in two major stages, termed coarse-grained sampling and all-atom refinement. In the initial stage, three-residue and nine-residue fragments obtained from known structures are assembled to generate full-length coarse-grained models. These models contain only the backbone atoms and the centroid of the side-chain atoms. Subsequently, in all-atom refinement, side-chain atoms are packed to construct all-atom models, which are then energy minimized. However, many challenges remain in predicting models accurate enough for practical uses such as solving the crystallographic phase problem.
To address these issues, I have focused on method development, namely biased conformational sampling and fragment quality improvement, to enhance the quality of predicted models. Furthermore, I have developed a method that uses de novo fragments for phasing and assembles these fragments after phasing, for cases in which a full-length model is difficult to predict accurately enough for phasing.
First, I have developed a method that improves the conformational space search in order to improve accuracy. The method first generated coarse-grained models using Rosetta. Second, an ensemble of the lowest-energy coarse-grained models was selected, and the deviation of each model from the other models in the ensemble was calculated. The deviation of each residue was also computed, and this score was called the average pairwise residue distance score. The score correlated with the accuracy of the predicted residues in the model: residues with larger scores were considered less accurate, and vice versa. Lastly, the conformational search was biased using the score, so that residues with larger scores were sampled more frequently. This procedure rebuilt the selected coarse-grained models, then packed the side-chain atoms and performed energy minimization. Molecular replacement was run on these all-atom models, and the entire simulation was terminated after a few correct solutions were obtained. The method was tested on 10 difficult targets for which previous studies using other methods (Rosetta and RosettaX) had failed. The rebuilding procedure improved the accuracy of coarse-grained models from 4.93 Å to 4.06 Å on average. For seven of the ten protein targets, molecular replacement with the rebuilt models produced a successful solution.
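The per-residue score and the biased sampling described above can be illustrated with a minimal sketch. It assumes that the ensemble models are already superposed in a common frame, that each model is a list of C-alpha coordinates, and that the score is the mean C-alpha distance over all model pairs; the function names and the exact weighting scheme are hypothetical and are not taken from the actual implementation.

```python
import math
import random
from itertools import combinations


def per_residue_deviation(models):
    """Mean pairwise CA-CA distance at each residue position.

    `models` is a list of models; each model is a list of (x, y, z)
    tuples of equal length, assumed to be superposed already.
    """
    pairs = list(combinations(range(len(models)), 2))
    scores = []
    for i in range(len(models[0])):
        total = sum(math.dist(models[a][i], models[b][i]) for a, b in pairs)
        scores.append(total / len(pairs))
    return scores


def sampling_weights(scores):
    """Normalise scores so that residues with larger scores get larger weights."""
    total = sum(scores)
    if total == 0:
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def pick_positions(scores, n_moves, seed=0):
    """Draw residue positions for rebuilding, biased by the score."""
    rng = random.Random(seed)
    weights = sampling_weights(scores)
    return rng.choices(range(len(scores)), weights=weights, k=n_moves)


# Toy ensemble: residue 0 agrees in all models, residue 1 diverges.
toy_models = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)],
]
toy_scores = per_residue_deviation(toy_models)
toy_picks = pick_positions(toy_scores, n_moves=10)
```

In this toy ensemble only the second residue diverges between models, so every sampled position falls on it. A real protocol would likely keep a small baseline weight for well-converged residues as well.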
The second method focused on improving fragment quality in order to generate better-quality models. In this study, a method was developed to generate new fragment libraries using a resampling process. First, models were generated using Rosetta and the lowest-energy all-atom models were selected. These models were broken into overlapping three-residue and nine-residue fragments. The average pairwise residue deviation score was computed for the three-residue and nine-residue fragments to remove distant fragments. The remaining fragments were clustered, and twenty-five fragments were randomly selected from the top five clusters. These new fragments were used for a second round of prediction. The performance of the method was tested on a benchmark set of 30 different proteins, and the accuracy of the new fragments and of the predicted models was evaluated. The results showed that the new fragment library contained better fragments and was enriched with many high-quality fragments. To evaluate the performance, the lowest-energy model and the best of the top five models were each taken as the best prediction, and their C-alpha root mean square deviation (CA-RMSD), template modeling score (TM-score), and global distance test total score (GDT-TS) to the native structure were computed. On all of these assessment criteria, this method performed significantly better than Rosetta, both for the lowest-energy models and for the best of the top five models. On average, this method improved CA-RMSD from 5.99 Å to 5.03 Å when the lowest-energy models were selected as the best predicted models. Similarly, it improved both the TM-score and GDT-TS by 7%.
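Breaking a model into overlapping fragments, as described above, amounts to a sliding window with a step of one residue. The sketch below illustrates only this step; the function name is hypothetical, and a model is represented simply as a list of residues.

```python
def overlapping_fragments(residues, length):
    """Return (start_index, window) for every run of `length` consecutive residues.

    Windows advance one residue at a time, so a model of N residues yields
    N - length + 1 fragments (or none if the model is shorter than `length`).
    """
    if length <= 0:
        raise ValueError("fragment length must be positive")
    return [
        (start, residues[start:start + length])
        for start in range(len(residues) - length + 1)
    ]


# A hypothetical 12-residue model, with residues shown as one-letter labels.
toy_model = list("ABCDEFGHIJKL")
three_mers = overlapping_fragments(toy_model, 3)
nine_mers = overlapping_fragments(toy_model, 9)
```

For the 12-residue toy model this gives ten three-residue fragments and four nine-residue fragments. Each fragment keeps its start index so that it can later be grouped by residue position.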
Lastly, a new method was developed to tackle the phase problem using a fragmentation and fragment reassembly approach for cases in which the full-length model was too inaccurate to use as the template model in molecular replacement. In this method, de novo models were fragmented, independently phased, and reassembled. The lowest-energy all-atom models produced using Rosetta were chosen for fragmentation. For each residue position, constant-length overlapping fragments were constructed. These fragments were clustered, and two hundred candidate fragments were randomly selected for each residue position. The selected fragments were used independently as search models in molecular replacement. The fragments were assembled after molecular replacement. For reassembly, one fragment was selected as a seed fragment and one low-energy de novo model was taken as a reference model. The reference model was superposed onto the seed fragment. Using the seed fragment and the reference model, the position and orientation of the other fragments in the crystallographic unit cell were determined and a partial model was obtained. The combinations of permissible origins and symmetry operators of the space group with unit cell translations were computed to identify the location of the other fragments. The combination that gave the smallest distance between the reference model and the candidate fragment was taken as the correct location. In this way, all the fragments were reassembled in the asymmetric unit. This method was tested on ten difficult proteins with three different fragment lengths: thirteen, seventeen, and twenty-one residues. The ten targets were considered difficult because their best predicted full-length models, which had an average CA-RMSD of 3.97 Å, were unable to provide the phase angles after molecular replacement. Using seventeen-residue fragments, the crystal structures of eight of the ten protein targets were solved, with an average CA-RMSD of 1.25 Å.
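The search over symmetry operators, permissible origins, and unit cell translations can be sketched as an exhaustive minimisation of the distance to the reference model. The sketch below is a simplified illustration under several assumptions: each fragment and the reference are reduced to a single point in fractional coordinates, the unit cell is orthogonal, and the operators and origin shifts are toy values. All names are hypothetical.

```python
import math
from itertools import product


def apply_op(rot, trans, frac):
    """Apply a symmetry operator (3x3 rotation, translation) to fractional coordinates."""
    return tuple(
        sum(rot[r][c] * frac[c] for c in range(3)) + trans[r] for r in range(3)
    )


def cartesian_distance(frac_a, frac_b, cell):
    """Distance between two fractional points, assuming an orthogonal cell (a, b, c)."""
    return math.sqrt(
        sum(((fa - fb) * edge) ** 2 for fa, fb, edge in zip(frac_a, frac_b, cell))
    )


def closest_placement(fragment, reference, sym_ops, origins, cell, shifts=(-1, 0, 1)):
    """Try every operator x origin x cell translation; keep the copy nearest the reference.

    Returns (distance, placed_point, operator_index, origin, cell_shift).
    """
    best = None
    for (op_index, (rot, trans)), origin, shift in product(
        enumerate(sym_ops), origins, product(shifts, repeat=3)
    ):
        moved = apply_op(rot, trans, fragment)
        placed = tuple(m + o + s for m, o, s in zip(moved, origin, shift))
        distance = cartesian_distance(placed, reference, cell)
        if best is None or distance < best[0]:
            best = (distance, placed, op_index, origin, shift)
    return best


# Toy setup: identity plus a twofold screw along b, with four origin shifts.
toy_ops = [
    ([[1, 0, 0], [0, 1, 0], [0, 0, 1]], (0.0, 0.0, 0.0)),
    ([[-1, 0, 0], [0, 1, 0], [0, 0, -1]], (0.0, 0.5, 0.0)),
]
toy_origins = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.0, 0.5), (0.5, 0.0, 0.5)]
toy_cell = (40.0, 50.0, 60.0)  # hypothetical cell edges in Å

best_fit = closest_placement(
    fragment=(0.9, 0.1, 0.9),
    reference=(0.1, 0.6, 0.1),
    sym_ops=toy_ops,
    origins=toy_origins,
    cell=toy_cell,
)
```

In this toy case the screw operator combined with a translation of one cell along a and c maps the fragment onto the reference point. A full implementation would compare whole fragments against the matching residues of the reference model and would handle non-orthogonal cells through the orthogonalisation matrix.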
Acknowledgements
First and foremost, I would like to thank my supervisor, Professor Kam Y. J. Zhang. It has been an honor to be his first Ph.D. student at The University of Tokyo and RIKEN. He has provided me with a fantastic academic and research environment in which I have had the chance to develop logical thinking, creativity, and research skills, and to become an independent and collaborative research professional. I appreciate all his efforts to make my PhD study productive and enjoyable.
I would also like to thank Professor Masahiro Kasahara for stimulating discussions about programming, which were very fruitful for this study. I am also grateful to Professor Kasahara for serving as a member of the thesis evaluation committee. I would also like to thank Professor Yutaka Suzuki and Professor Koji Tsuda from The University of Tokyo for serving as judges on the thesis evaluation committee. Similarly, I would like to thank Professor Min Yao from Hokkaido University for serving as the external referee on my PhD thesis committee.
I would like to thank all of my co-workers, Dr. Asuhtosh Kumar, Dr. Arnout Voet, Dr. David Simoncini, Dr. Muhammad Muddassar, Dr. Kamlesh Sahu, Dr. Taeho Jo, Dr. Yong Zhou, Dr. Ryo Takahashi, Mr. Francois Berenger, and Ms. Xiao Yin Lee, for their professional and personal support. Their support has made my PhD study enjoyable and interesting. I would also like to thank the secretary, Ms. Hiroko Kani, for her support in many aspects of life in Japan.
I gratefully acknowledge RIKEN, Japan, for many things. First, RIKEN funded my PhD study for three years. The financial support provided by RIKEN through the International Program Associate (IPA) program allowed me to live well in Japan during my PhD study. Without support from RIKEN, I would not have reached the point of writing this PhD dissertation. I would also like to thank the Graduate School of Frontier Sciences, The University of Tokyo, for various research grants. RIKEN has also provided the highly sophisticated facilities required for this research, from workstations to a supercomputer. I appreciate the supercomputing power provided by the RIKEN Integrated Cluster of Clusters and acknowledge the Advanced Center for Computing and Communication, RIKEN. All experiments presented in this thesis were carried out on the RIKEN Integrated Cluster of Clusters.
I appreciate the open source community, which has freely provided source code written in different programming languages and saved me tremendous time and effort. In particular, I would like to thank the researchers and developers of the Rosetta software team from the University of Washington, the Phaser program group from the University of Cambridge, and Kevin Cowtan, developer of Clipper, from the University of York.
Lastly, I sincerely thank my family for their constant support, love, and encouragement. I am grateful to my parents (Min Bahadur Shrestha and Dropati Shrestha), who raised me with a love of science and supported me in all my pursuits. I thank my sister, Roj, and brother, Ujjan, for all their support. Finally, I appreciate my wife, Shalu, for her love, support, and encouragement during the period of this PhD.
Table of Contents
Abstract
List of Figures
List of Tables
Chapter 1. Introduction
1.1. Protein and its structure
1.2. Computational methods for protein structure prediction
1.3. X-ray crystallography for protein structure determination
1.4. Phase problem
1.5. Ab initio phasing with de novo models
Chapter 2. Objective of the study
Chapter 3. MORPHEUS – error-estimation-guided rebuilding of de novo models increases the success rate of ab initio phasing
3.2.1. Benchmark dataset and initial model generation
3.2.2. Determine incorrectly predicted residues or regions
3.2.3. Rebuilt inaccurately predicted residues
3.2.4. Molecular replacement with rebuilt models
3.3.1. Model accuracy correlated with their divergence
3.3.2. Accuracy improvement after rebuilding
3.3.3. Ab initio phasing with rebuilt de novo models
3.3.4. Performance measurement
3.4.2. Biased conformational space searching
3.4.3. Molecular replacement with rebuilt models
Chapter 4. NEFILIM – improving fragment quality for de novo structure prediction
4.2.1. Benchmark data set and initial model generation
4.2.2. Improved fragment library generation
4.2.3. Resampling with new fragments
4.3.1. New fragments from the de novo models
4.3.2. Model accuracy improvement
4.3.3. Improved performance in resampling
Chapter 5. FRAP – ab initio phasing with de novo fragments for difficult targets
5.2.1. Benchmark data selection
5.2.2. De novo fragments generation for molecular replacement
5.2.3. Fragment assembly after molecular replacement
5.3. Result and Discussion
5.3.1. Seed fragment and reference model
5.3.2. De novo fragments and molecular replacement
5.3.3. Fragment assembly
Chapter 6. Summary
Chapter 7. Reference
List of Figures
Figure 1.1 Different level of protein structure
Figure 1.2 Bond length, bond angle, and dihedral angle
Figure 3.1 Schematic diagram of MORPHEUS program
Figure 3.2 Scatter plot between coarse-grained energy and accuracy of the models
Figure 3.3 Correlation between APMDS and model accuracy
Figure 3.4 Correlation between APRDS and CA-RMSD of the residue in the sequence
Figure 3.5 Comparison of accuracy of models before and after rebuilding
Figure 3.6 Comparison of accuracy of residues before and after rebuilding
Figure 3.7 Comparison of average improvement in models before and after rebuilding
Figure 3.8 Distribution of APRDS of model before and after rebuilding with their accuracy
Figure 3.9 Superposition of models after rebuilding to the native structures
Figure 3.10 Total elapsed time spent by Rosetta3.2 and MORPHEUS
Figure 4.1 An overview of NEFILIM
Figure 4.2 Quality of best fragment in structure-derived and sequence-derived fragment library
Figure 4.3 Enrichment of good quality in sequence-derived and structure-derived fragments
Figure 4.4 Best fragment for each residue position (nine-residue)
Figure 4.5 Average accuracy of fragments at each residue position (nine-residue)