«PRIIT ADLER Analysis and visualisation of large scale microarray data DISSERTATIONES BIOLOGICAE UNIVERSITATIS TARTUENSIS 277 DISSERTATIONES ...»
Analysis and visualisation
of large scale microarray data
DISSERTATIONES BIOLOGICAE UNIVERSITATIS TARTUENSIS 277
PRIIT ADLER
Analysis and visualisation of large scale microarray data
Institute of Molecular and Cellular Biology, University of Tartu, Estonia
Dissertation is accepted for the commencement of the degree of Doctor of Philosophy in bioinformatics at University of Tartu on 19th of June 2015 by the Council of the Institute of Molecular and Cellular Biology, University of Tartu.
Prof. Jaak Vilo, PhD Institute of Computer Science University of Tartu Tartu, Estonia Prof. Juhan Sedman, PhD Institute of Molecular and Cell Biology University of Tartu Tartu, Estonia
Gabriella Rustici, PhD School of the Biological Sciences University of Cambridge Cambridge, United Kingdom
Room No 105, 23B Riia St, Tartu, on August 26th, 2015, at 10:15
The publication of this dissertation was financed by the Institute of Computer Science, University of Tartu.
ISSN 1024-6479
ISBN 978-9949-32-873-4 (print)
ISBN 978-9949-32-874-1 (pdf)
Copyright: Priit Adler, 2015
University of Tartu Press
www.tyk.ee
“An education was a bit like a communicable sexual disease. It made you unsuitable for a lot of jobs and then you had the urge to pass it on”
Terry Pratchett
TABLE OF CONTENTS
LIST OF ORIGINAL PUBLICATIONS
LIST OF ABBREVIATIONS
INTRODUCTION
I. REVIEW OF LITERATURE
1.1. High-throughput expression data
1.1.1. Normalisation
1.1.2. Quality assessment
1.1.3. Visualisation
I Adler, P.*, Reimand, J.*, Jänes, J., Kolde, R., Peterson, H., and Vilo, J. (2008). KEGGanim: pathway animations for high-throughput data. Bioinformatics, 24(4):588–90.
II Adler, P.*, Peterson, H.*, Agius, P., Reimand, J., and Vilo, J. (2009). Ranking genes by their co-expression to subsets of pathway members. Annals of the New York Academy of Sciences, 1158:1–13.
III Adler, P.*, Kolde, R.*, Kull, M., Tkachenko, A., Peterson, H., Reimand, J., and Vilo, J. (2009). Mining for coexpression across hundreds of datasets using novel rank aggregation and visualization methods. Genome Biology, 10(12):R139.
IV Kolde, R., Laur, S., Adler, P., and Vilo, J. (2012). Robust rank aggregation for gene list integration and meta-analysis. Bioinformatics, 28(4):573–580.
The articles listed above have been printed with the permission of the copyright owners.
My contribution to these articles:
Ref. I – Designed and implemented the visualisation framework for KEGG pathways and implemented the web application. Prepared the expression data used as examples. Participated in writing the manuscript.
Ref. II – Co-conducted the study, managed high-throughput expression data and performed cross-validation analysis on Reactome pathways and participated in interpreting the results. Participated in writing the manuscript.
Ref. III – Designed and implemented Multi Experiment Matrix (MEM) tool and its web interface. Downloaded and prepared high-throughput expression data used by the application. Participated in developing the rank aggregation algorithm. Performed one of the proof of principle analyses in the article. Participated in writing the manuscript.
Ref. IV – Performed one of the proof of principle analyses for the study.
INTRODUCTION
High-throughput gene expression data have been generated across the globe for almost two decades. A wealth of publicly available data has been gathered into large databases such as ArrayExpress or GEO. Although these data have already been analysed once, they still contain answers to questions that others have not explored. As new methods of data analysis are developed and innovative visualisations become possible, a systematic approach to revisiting and reanalysing existing data might reveal new knowledge.
In the first part of this thesis we give a short overview of high-throughput gene expression data, introduce common analysis and visualisation methods for single datasets and cover relevant meta-analysis pipelines. Besides public gene expression databases, we also provide an overview of the pathway databases KEGG and Reactome, which are used extensively in the publications that are part of this thesis.
In the practical part of this thesis, we first demonstrate how high-throughput expression data can be visualised and animated using KEGG pathways.
Visualising expression data in the context of a KEGG pathway and observing the expression dynamics across samples enables a more detailed interpretation of experimental results. To make this approach accessible to a wider audience, we have implemented the KEGGanim web tool.
Neither KEGG nor any other public pathway database covers the entire genome.
Only roughly one third of all genes are annotated to biological pathways. We present a study in which we measured the predictive power of high-throughput gene expression data to reconstruct Reactome pathways and to propose potential new candidates. A public high-throughput data collection with more than 6000 samples was used to perform cross-validation on 35 Reactome pathways. We give an overview of the results and discuss the observed benefit of using only a subset of pathway genes in the analysis, as they might be more tightly co-regulated than the entire pathway.
A similar argument can be made about gene expression data: only a subset of the expression data should be used to study condition-specific co-expression patterns of related genes. It has been proposed that only approximately one fifth of all genes are expressed at once in any biological condition. We describe a framework in which co-expression queries can be performed across hundreds of publicly available high-throughput gene expression datasets. Relevant datasets are first selected based on the standard deviation of the query gene. In each dataset, co-expression values are calculated and all genes are ranked by the resulting distances. Finally, a novel statistical rank aggregation approach is used to create a unified prioritised list of globally co-expressed genes. The method has been implemented in the Multi Experiment Matrix (MEM) web tool.
The described rank aggregation method is also suitable for solving problems outside the MEM framework and has been published as an R package. We provide an overview of some other experimental settings with real and simulated data to highlight the features of the presented robust rank aggregation method.
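The co-expression query workflow outlined above (dataset selection by the standard deviation of the query gene, per-dataset ranking, aggregation of ranks) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the toy datasets, the `min_sd` threshold and the use of Pearson correlation as the co-expression measure are hypothetical, and the final step uses a plain mean of normalised ranks as a stand-in, not the robust rank aggregation method presented in this thesis.

```python
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation of two equally long profiles (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def coexpression_query(datasets, query, min_sd=0.5):
    """Rank genes by co-expression with `query` across several datasets.

    datasets: list of dicts mapping gene name -> expression profile.
    min_sd:   hypothetical threshold on the query gene's standard deviation.
    """
    rank_lists = []
    for data in datasets:
        # Step 1: skip datasets in which the query gene barely varies.
        if query not in data or stdev(data[query]) < min_sd:
            continue
        # Step 2: order the other genes by decreasing correlation with the query.
        others = [g for g in data if g != query]
        ordered = sorted(others, key=lambda g: pearson(data[query], data[g]),
                         reverse=True)
        # Normalised rank in (0, 1]; smaller means more strongly co-expressed.
        rank_lists.append({g: (r + 1) / len(ordered) for r, g in enumerate(ordered)})
    # Step 3 (stand-in): average the normalised ranks over the selected datasets.
    genes = set().union(*rank_lists)
    scores = {g: mean(r[g] for r in rank_lists if g in r) for g in genes}
    return sorted(scores, key=lambda g: (scores[g], g))


# Toy data: the second dataset is filtered out because "Q" is almost constant in it.
datasets = [
    {"Q": [1, 2, 3, 4], "A": [2, 4, 6, 8], "B": [4, 3, 2, 1], "C": [1, 3, 2, 4]},
    {"Q": [5.0, 5.0, 5.0, 5.1], "A": [9, 1, 5, 3], "B": [1, 2, 3, 9], "C": [4, 4, 4, 4]},
    {"Q": [1, 3, 2, 5], "A": [2, 6, 4, 10], "B": [5, 1, 3, 0], "C": [1, 1, 2, 1]},
]
print(coexpression_query(datasets, "Q"))  # ['A', 'C', 'B']
```

Averaging ranks is used here only because it is short; it is sensitive to datasets in which a gene is ranked poorly by chance, which is one motivation for a statistically grounded aggregation step.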
I. REVIEW OF LITERATURE
In eukaryotic cells the hereditary information is stored as long sequences of deoxyribonucleic acid (DNA) molecules. The long DNA molecules are also referred to as polynucleotides, as they consist of repeating single nucleotides. The order of the nucleotides in these long chains defines the information they contain.
Regions within DNA that are used to encode other types of functional molecular polymers are referred to as genes. Hence, the overall sum of DNA molecules is also called the genome. In the human genome the total length of the DNA molecules is approximately three billion bases. It is organised into individual molecules: 22 autosomal chromosomes, each represented by two copies (one from the mother and one from the father), and two sex chromosomes. Altogether there are 46 DNA molecules per cell.
There are approximately 22000 genes defined in the human genome. Genetic information is read from the DNA through a process called transcription. Transcription yields messenger ribonucleic acid (messenger RNA or mRNA), which is another type of polynucleotide. It is similar to DNA, but instead of deoxyribose it has ribose and instead of thymine it has uracil. Messenger RNA is used to transport the genetic code out of the nucleus. In the cell cytoplasm there are molecular machineries called ribosomes that process mRNA to produce proteins through translation. Proteins are the main building blocks of cells. They participate in reactions as enzymes and signalling agents and also take part in the transcriptional regulation of genes. Each protein can have one very specific task or several, depending on its configuration and post-processing. Compared to 21855 protein-coding genes, there are 86434 proteins defined for human in Ensembl database version 80 (Cunningham et al., 2015). For each gene there are a number of ways in which the mRNA can be alternatively spliced (Modrek and Lee, 2002).
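The relationship between a DNA sequence and its mRNA described above can be illustrated with a very small Python sketch. It assumes the input is the coding (sense) strand written 5' to 3', so that the mRNA has the same sequence with uracil in place of thymine; the function name `transcribe` and the example sequence are hypothetical, and splicing and all other processing steps are ignored.

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA sequence corresponding to a DNA coding (sense) strand."""
    dna = coding_strand.upper()
    if set(dna) - set("ACGT"):
        raise ValueError("expected a DNA sequence containing only A, C, G and T")
    # The mRNA carries the same base order as the coding strand, with U instead of T.
    return dna.replace("T", "U")


print(transcribe("ATGGCCTTT"))  # AUGGCCUUU
```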
Although only 1.5% of the entire genome is covered by protein-coding genes, a recent study states that more than 75% of the genome is covered by other transcriptional activity (Kellis et al., 2014), most of which is very rare. This percentage might still be an underestimate, as only a selection of cell types was covered.
In the human body there are hundreds of different types of tissues and cells. Although each cell contains the same DNA, the way in which the information is read and processed leads to different cell types and different stages in the cell lifecycle.
Malfunctions in DNA reading or gene regulation can lead to various diseases, including cancer. Gene regulation is a complicated process and consists of many steps. One of the more straightforward steps is regulation through transcription. The existence and quantity of mRNA molecules are the first prerequisites for protein production. There are no cost-effective high-throughput methods to quantify protein levels in cells, but there are high-throughput methods to quantify mRNA levels.
In this thesis we focus on the characterisation of gene expression at the transcriptional level, as this can be performed in a high-throughput manner and has been for the last two decades (DeRisi et al., 1997; Lashkari et al., 1997).
1.1. High-throughput expression data
Advances in biotechnology have given rise to microarrays. Microarrays are glass slides, or slides with another hard surface, that are covered with small oligonucleotide molecules (probes). The oligonucleotide molecules are attached to the microarray surface by one end. Their sequence is complementary to a sequence of a specific gene (Lockhart et al., 1996). Microarrays make it possible to quantify mRNA levels for many thousands of genes simultaneously from a biological sample. First, mRNA is extracted from the biological sample and converted into complementary DNA (cDNA) by reverse transcriptase. Probes then catch cDNA molecules from the sample solution in a sequence-specific manner. Each microarray can contain hundreds of thousands of different probes corresponding to different genes, covering the vast majority of genes of an organism. This kind of technology makes it possible to take transcriptional still images of cellular activity. More images lead to a better understanding of the underlying processes and help us to decipher cellular functions.
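The sequence-specific capture of target molecules by probes can be pictured with the following Python sketch. It is a strong simplification under stated assumptions: hybridisation is modelled as a perfect reverse-complement match, whereas real hybridisation depends on temperature, probe length and mismatch tolerance. The function names and sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def hybridises(probe: str, target: str) -> bool:
    """True if the target contains a stretch perfectly complementary to the probe."""
    return reverse_complement(probe) in target.upper()


probe = "ATGC"                          # hypothetical probe sequence
print(reverse_complement(probe))        # GCAT
print(hybridises(probe, "TTGCATAA"))    # True: the target contains GCAT
print(hybridises(probe, "TTTTTTTT"))    # False: no complementary stretch
```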
The microarrays discussed within this thesis are gene expression microarrays.
There are also other types of microarrays, for example for genotyping, and next-generation sequencing is also performed in a microarray format, but these are not the focus of the current thesis.
1.1.1. Normalisation
Generating the data is only the first step of the whole experiment. Methods to process, normalise and analyse the data are essential for interpreting gene expression microarray data. Raw microarray data is considered to be noisy (Bolstad et al., 2003).
There are two principal sources of noise: biological and technical. Both types of noise can be controlled or tested by generating more biological and technical replicate samples (Klebanov and Yakovlev, 2007). Still, in its raw format the data is rarely suitable for interpretation. Statistical methods are used to transform the data so that it meets the requirements of the analysis methods while still retaining its biological signal. This process is called normalisation. The objective of normalisation is to make separate samples within an experiment comparable to each other. Many normalisation procedures also transform the data so that the distribution of signal values is approximately normal.
Different microarray technologies have different standards. Several normalisation methods have been developed to meet the design of Affymetrix GeneChips, for example robust multi-array average (RMA) (Irizarry et al., 2003), MAS5.0 (Hubbell et al., 2002) and FARMS (Hochreiter et al., 2006), to name a few.
On Affymetrix GeneChip platforms a probe set is a small collection of probes that represent the same transcript. There can be more than one probe set representing a single gene. The result of preprocessing gene expression microarray data is a numeric matrix, the expression matrix, where columns represent different samples and each row represents expression values summarised at the probe set or gene level. A row and a column of this matrix are referred to as a probe set expression profile and a sample expression profile, respectively.
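The layout of the expression matrix described above can be sketched in Python as follows. The probe set identifiers, sample names and values are made up for illustration, and the helper functions are hypothetical; the sketch only shows which slice of the matrix corresponds to which kind of profile.

```python
# Rows are probe sets, columns are samples (all names and values are hypothetical).
probe_sets = ["ps_1", "ps_2", "ps_3"]
samples = ["sample_A", "sample_B"]
matrix = [
    [7.1, 7.4],  # ps_1
    [3.2, 9.8],  # ps_2
    [5.0, 5.1],  # ps_3
]


def probe_set_profile(name):
    """Expression of one probe set across all samples (a row of the matrix)."""
    return matrix[probe_sets.index(name)]


def sample_profile(name):
    """Expression of all probe sets in one sample (a column of the matrix)."""
    j = samples.index(name)
    return [row[j] for row in matrix]


print(probe_set_profile("ps_2"))   # [3.2, 9.8]
print(sample_profile("sample_B"))  # [7.4, 9.8, 5.1]
```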
The most widely used normalisation method to date is RMA. It uses log transformation and quantile normalisation between samples. In quantile normalisation, the quantile values of each individual sample's signal distribution are made equal to those of the distribution across all samples. This ensures that all individual samples follow the same signal value distribution and are therefore more comparable to each other.
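The quantile normalisation step can be illustrated with a short Python sketch. It covers only this one step and is not an implementation of RMA as a whole. It assumes an expression matrix with probe sets as rows and samples as columns, takes the mean across samples at each rank as the common reference distribution, and breaks ties by row order, which is a simplification. The function name and the toy values are hypothetical.

```python
def quantile_normalise(matrix):
    """Give every sample (column) the same distribution of values.

    matrix: list of rows (probe sets), each a list with one value per sample.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    # Reference distribution: mean of the r-th smallest value over all samples.
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(col[r] for col in sorted_cols) / n_cols for r in range(n_rows)]
    # Replace each value by the reference value of its rank within its own sample.
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, i in enumerate(order):
            result[i][j] = reference[rank]
    return result


raw = [
    [3.0, 4.0],
    [1.0, 10.0],
    [4.0, 3.0],
    [2.0, 7.0],
]
normalised = quantile_normalise(raw)
print(normalised)  # [[5.0, 3.0], [2.0, 7.0], [7.0, 2.0], [3.0, 5.0]]
```

After normalisation both columns contain the same set of values (2, 3, 5 and 7), while the ordering of probe sets within each sample is preserved.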