
DISSERTATIONES BIOLOGICAE UNIVERSITATIS TARTUENSIS 277

PRIIT ADLER








Analysis and visualisation of large scale microarray data






Institute of Molecular and Cellular Biology, University of Tartu, Estonia

The dissertation is accepted for the commencement of the degree of Doctor of Philosophy in bioinformatics at the University of Tartu on 19th of June 2015 by the Council of the Institute of Molecular and Cellular Biology, University of Tartu.


Prof. Jaak Vilo, PhD, Institute of Computer Science, University of Tartu, Tartu, Estonia

Prof. Juhan Sedman, PhD, Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia


Gabriella Rustici, PhD School of the Biological Sciences University of Cambridge Cambridge, United Kingdom


Room No 105, 23B Riia St, Tartu, on August 26th, 2015, at 10:15

The publication of this dissertation was financed by the Institute of Computer Science, University of Tartu.

ISSN 1024-6479
ISBN 978-9949-32-873-4 (print)
ISBN 978-9949-32-874-1 (pdf)

Copyright: Priit Adler, 2015

University of Tartu Press
www.tyk.ee

“An education was a bit like a communicable sexual disease. It made you unsuitable for a lot of jobs and then you had the urge to pass it on”
Terry Pratchett





1.1. High-throughput expression data
1.1.1. Normalisation
1.1.2. Quality assessment
1.1.3. Visualisation


I Adler, P.*, Reimand, J.*, Jänes, J., Kolde, R., Peterson, H., and Vilo, J. (2008). KEGGanim: pathway animations for high-throughput data. Bioinformatics, 24(4):588–90.

II Adler, P.*, Peterson, H.*, Agius, P., Reimand, J., and Vilo, J. (2009). Ranking genes by their co-expression to subsets of pathway members. Annals of the New York Academy of Sciences, 1158:1–13.

III Adler, P.*, Kolde, R.*, Kull, M., Tkachenko, A., Peterson, H., Reimand, J., and Vilo, J. (2009). Mining for coexpression across hundreds of datasets using novel rank aggregation and visualization methods. Genome Biology, 10(12):R139.

IV Kolde, R., Laur, S., Adler, P., and Vilo, J. (2012). Robust rank aggregation for gene list integration and meta-analysis. Bioinformatics, 28(4):573–580.

The articles listed above have been printed with the permission of the copyright owners.

My contribution to these articles:

Ref. I – Designed and implemented the visualisation framework for KEGG pathways and implemented the web application. Prepared the expression data used as examples. Participated in writing the manuscript.

Ref. II – Co-conducted the study, managed high-throughput expression data and performed cross-validation analysis on Reactome pathways and participated in interpreting the results. Participated in writing the manuscript.

Ref. III – Designed and implemented the Multi Experiment Matrix (MEM) tool and its web interface. Downloaded and prepared the high-throughput expression data used by the application. Participated in developing the rank aggregation algorithm. Performed one of the proof of principle analyses in the article. Participated in writing the manuscript.

Ref. IV – Performed one of the proof of principle analyses for the study.




High-throughput gene expression data have been generated across the globe for almost two decades. A wealth of publicly available data has been gathered into large databases such as ArrayExpress and GEO. Although these data have already been analysed once, they still contain answers to questions that have not yet been explored. As new methods of data analysis are developed and innovative visualisations become possible, a systematic approach to revisiting and reanalysing existing data may reveal new knowledge.

In the first part of this thesis we give a short overview of high-throughput gene expression data, introduce common analysis and visualisation methods for single datasets and cover relevant meta-analysis pipelines. Besides public gene expression databases, we also provide an overview of the pathway databases KEGG and Reactome, which are used extensively in the publications that are part of this thesis.

In the practical part of this thesis, we first demonstrate how high-throughput expression data can be visualised and animated using KEGG pathways. Visualising expression data in the context of a KEGG pathway and observing the expression dynamics across samples enables a more detailed interpretation of experimental results. To make this accessible to a wider audience, we have implemented the KEGGanim web tool.

Neither KEGG nor any other public pathway database covers the entire genome. Only roughly one third of all genes are annotated to biological pathways. We present a study in which we measured the power of high-throughput gene expression data to reconstruct Reactome pathways and to propose potential new candidate members. A public high-throughput data collection with more than 6000 samples was used to perform cross-validation on 35 Reactome pathways. We give an overview of the results and discuss the observed benefit of using only a subset of pathway genes in the analysis, as these might be more tightly co-regulated than the entire pathway.

A similar argument can be made about gene expression data: only a subset of the expression data should be used to study condition-specific co-expression patterns of related genes. It has been proposed that only approximately one fifth of all genes are expressed at once in any biological condition. We describe a framework in which co-expression queries can be performed across hundreds of publicly available high-throughput gene expression datasets. Relevant datasets are first selected based on the standard deviation of the query gene. In each dataset, co-expression values are calculated and all genes are ranked by the resulting distances. Finally, a novel statistical rank aggregation approach is used to create a unified prioritised list of globally co-expressed genes. The method has been implemented in the Multi Experiment Matrix (MEM) web tool.
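The query steps described above can be sketched roughly as follows. This is a minimal illustration, not the MEM implementation: the function names, the `min_sd` threshold, the use of one minus the Pearson correlation as the distance, and the averaging of normalised ranks in the last step are all assumptions made for the example. In particular, the mean rank is only a simple stand-in for the statistical rank aggregation used in MEM.

```python
from math import sqrt
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def coexpression_query(datasets, query, min_sd=0.5):
    """Return genes ordered by co-expression with `query` across datasets.

    datasets: list of dicts mapping gene name -> expression profile.
    min_sd:   hypothetical cut-off on the query gene's standard deviation.
    """
    per_dataset_ranks = []
    for data in datasets:
        profile = data.get(query)
        # Step 1: keep only datasets in which the query gene varies enough.
        if profile is None or stdev(profile) < min_sd:
            continue
        # Step 2: distance to the query profile (assumed: 1 - correlation).
        dist = {g: 1.0 - pearson(profile, p)
                for g, p in data.items() if g != query}
        # Step 3: rank genes by distance and normalise ranks to (0, 1].
        ordered = sorted(dist, key=dist.get)
        per_dataset_ranks.append(
            {g: (i + 1) / len(ordered) for i, g in enumerate(ordered)})
    # Step 4 (stand-in): average the normalised ranks over the datasets
    # in which each gene was measured.
    genes = set().union(*per_dataset_ranks)
    score = {g: mean(r[g] for r in per_dataset_ranks if g in r) for g in genes}
    return sorted(score, key=score.get)
```

Filtering on the query gene's standard deviation means that datasets in which the gene is flat contribute nothing to the final ordering, which matches the idea that only condition-relevant data should inform the co-expression ranking.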

The described rank aggregation method is also suitable for problems outside the MEM framework and has been published as an R package. We provide an overview of some other experimental settings with real and simulated data to highlight the features of the presented robust rank aggregation method.
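To give a feel for how a rank-based score of this kind can be computed, the sketch below scores one gene from its normalised ranks in several lists using order statistics under a uniform null model. It is an illustration under stated assumptions, not the published R package: the function names are hypothetical, and the final multiplication by the number of lists is a simple Bonferroni-style correction chosen for the example.

```python
from math import comb


def order_statistic_pvalue(x, k, n):
    """Probability that the k-th smallest of n uniform(0, 1) ranks is <= x.

    Equals the binomial tail sum_{j=k..n} C(n, j) x^j (1 - x)^(n - j).
    """
    return sum(comb(n, j) * x ** j * (1 - x) ** (n - j)
               for j in range(k, n + 1))


def rank_score(normalised_ranks):
    """Score a gene from its normalised ranks (each in (0, 1]) in n lists.

    Small scores mean the gene ranks higher than expected by chance in at
    least some of the lists.
    """
    ranks = sorted(normalised_ranks)
    n = len(ranks)
    best = min(order_statistic_pvalue(x, k, n)
               for k, x in enumerate(ranks, start=1))
    # Assumed correction for having taken the minimum over n order statistics.
    return min(1.0, best * n)
```

Because the score takes the minimum over all order statistics, a gene that ranks near the top in only some of the lists can still score well, which is what makes this style of aggregation tolerant of uninformative lists.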



In eukaryotic cells the hereditary information is stored as long sequences of deoxyribonucleic acid (DNA) molecules. These long DNA molecules are also referred to as polynucleotides, as they consist of repeating single nucleotides. The order of the nucleotides in these long chains defines the information they contain. Regions within DNA that encode other types of functional molecular polymers are referred to as genes, and the overall sum of DNA molecules is called the genome. In the human genome the total length of the DNA molecules is approximately three billion bases. It is organised into individual molecules: 22 autosomal chromosomes, each represented by two copies (one from the mother and one from the father), and two sex chromosomes. Altogether there are 46 DNA molecules per cell.

There are approximately 22000 genes defined in the human genome. Genetic information is read from the DNA through a process called transcription. The transcription process yields messenger ribonucleic acid (messenger RNA or mRNA), which is another type of polynucleotide. It is similar to DNA, but instead of deoxyribose it has ribose and instead of thymine it has uracil. Messenger RNA is used to transport the genetic code out of the nucleus. In the cell cytoplasm there are molecular machineries called ribosomes that process mRNA to produce proteins through translation. Proteins are the main building blocks of cells. They participate in reactions as enzymes and signalling agents and also take part in the transcriptional regulation of genes. Each protein can have one very specific task or several, depending on its configuration and post-processing. Compared to 21855 protein-coding genes, there are 86434 proteins defined for human in the Ensembl database version 80 (Cunningham et al., 2015). For each gene there are a number of ways in which the mRNA can be alternatively spliced (Modrek and Lee, 2002).

Although only 1.5% of the entire genome is covered by protein-coding genes, a recent study states that more than 75% of the genome is covered by other transcriptional activity (Kellis et al., 2014), although most of it is very rare. This percentage might still be an underestimate, as only a selection of cell types was covered.

In the human body there are hundreds of different types of tissues and cells. Although each cell contains the same DNA, the way in which the information is read and processed leads to different cell types and different stages in the cell lifecycle. Malfunctions in DNA reading or gene regulation can lead to various diseases, including cancer. Gene regulation is a complicated process and consists of many steps. One of the more straightforward steps is regulation through transcription. The existence and quantity of mRNA molecules are the first prerequisites for protein production. There are no cost-effective high-throughput methods to quantify protein levels in cells, but there are high-throughput methods to quantify mRNA levels.

In this thesis we focus on the characterisation of gene expression at the transcriptional level, as this can be performed in a high-throughput manner and has been for the last two decades (DeRisi et al., 1997; Lashkari et al., 1997).

1.1. High-throughput expression data

The advances in biotechnology have given rise to microarrays. Microarrays are glass slides, or other hard surface slides, that are covered by small oligonucleotide molecules (probes). The oligonucleotide molecules are attached to the microarray surface by one end. Their sequence is complementary to the sequence of a specific gene (Lockhart et al., 1996). Microarrays make it possible to quantify mRNA levels for many thousands of genes simultaneously from a biological sample. First, mRNA is extracted from the biological sample and converted into complementary DNA (cDNA) by reverse transcriptase. The probes then catch cDNA molecules from the sample solution in a sequence-specific manner. Each microarray can contain hundreds of thousands of different probes corresponding to different genes, covering the vast majority of genes of an organism. This kind of technology makes it possible to take transcriptional still images of cellular activity. More images lead to a better understanding of the underlying processes and help us to decipher cellular functions.

The microarrays discussed in this thesis are gene expression microarrays. Other applications, for example genotyping or next generation sequencing, are also performed in a microarray format, but these are not the focus of the current thesis.

1.1.1. Normalisation

Generating the data is only the first step in the whole experiment. Methods to process, normalise and analyse the data are essential for interpreting gene expression microarray data. Raw microarray data is considered to be noisy (Bolstad et al., 2003). There are two principal sources of noise: biological and technical. Both types of noise can be controlled or tested by generating more biological and technical replicate samples (Klebanov and Yakovlev, 2007). Still, in its raw format the data is rarely suitable for interpretation. Statistical methods are used to transform the data so that it meets the requirements of the analysis methods while still retaining its biological signal. This process is called normalisation. The objective of normalisation is to make separate samples within the experiment comparable to each other. Many normalisation applications also transform the data so that the distribution of signal values is approximately normal.

Different microarray technologies have different standards. Several normalisation methods have been developed to meet the design of Affymetrix GeneChips, for example robust multi-array average (RMA) (Irizarry et al., 2003), MAS5.0 (Hubbell et al., 2002) and FARMS (Hochreiter et al., 2006), to name a few.

On Affymetrix GeneChip platforms a probe set is a small collection of probes that represent the same transcript. There can be more than one probe set representing a single gene. The result of preprocessing gene expression microarray data is a numeric matrix, the expression matrix, in which columns represent different samples and each row represents expression values summarised at the probe set or gene level. A row and a column of this matrix are referred to as a probe set expression profile and a sample expression profile, respectively.

The most widely used normalisation method to date is RMA. It uses log transformation and quantile normalisation between samples. The quantiles of each individual sample's signal distribution are made equal to those of the distribution across all samples. This ensures that all individual samples follow the same signal value distribution and are therefore more comparable to each other.
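As an illustration of the quantile normalisation step, the following minimal sketch operates on an expression matrix stored as a list of rows (probe sets) with one column per sample. It covers only the log transformation and quantile normalisation mentioned above, not the full RMA procedure (which also includes background correction and probe-level summarisation). The function names are hypothetical, and ties are broken by row order, which is a simplification.

```python
from math import log2


def log2_transform(matrix):
    """Log2-transform every (positive) intensity in the matrix."""
    return [[log2(value) for value in row] for row in matrix]


def quantile_normalise(matrix):
    """Give every sample (column) the same distribution of values.

    The k-th smallest value in each column is replaced by the mean of the
    k-th smallest values across all columns.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    # Reference distribution: mean of each quantile across the samples.
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(col[k] for col in sorted_cols) / n_cols
                 for k in range(n_rows)]
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        # Row indices ordered by their value in this sample.
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, i in enumerate(order):
            result[i][j] = reference[rank]
    return result
```

After normalisation every column contains exactly the same set of values, only in a different order, so differences between samples reflect the relative ordering of probe sets rather than sample-wide intensity shifts.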
