Journal of Integrative Bioinformatics 2007 http://journal.imbio.de/

Mapping protein information to disease terminologies

Anaïs Mottaz1, Yum L. Yip1,2, Patrick Ruch2,3, and Anne-Lise Veuthey1


1 Swiss Institute of Bioinformatics,
2 Dept. of Structural Biology and Bioinformatics, University of Geneva,
3 Medical Informatics Service, University Hospitals of Geneva,
Geneva, Switzerland


In order to improve the accessibility of genomic and proteomic information to medical researchers, we have developed a procedure to link biological information on proteins involved in diseases to the MeSH and ICD-10 disease terminologies. For this purpose, we took advantage of the manually curated disease annotations in more than 2,000 human protein entries of the UniProt KnowledgeBase. We mapped disease names extracted from the entry comment lines or from the corresponding OMIM entry to the MeSH. The method was assessed on a benchmark set of 200 manually mapped disease comment lines. We obtained a recall of 54% for 91% precision. The same procedure was used to map the more than 3,000 diseases in Swiss-Prot to MeSH with comparable efficiency.

Tested on ICD-10, the coverage of the mapped terms was lower, which could be explained by the coarse-grained structure of this terminology for hereditary disease description. The mapping is provided as supplementary material at http://research.isb-sib.ch/unimed.

1 Introduction

With the emergence of high-throughput technologies, the amount of biomedical data available to researchers and clinicians has increased drastically over the last decade. In the genomic/proteomic era, new methods of knowledge management will soon allow researchers to move beyond the analysis of a single molecule or pathway to consider global mechanisms, such as pathological processes, from an integrated point of view. One of the challenges for bioinformatics in this context is to bridge the gap between biological knowledge and clinical data. Currently, the main obstacles to achieving this objective are the compartmentalization of the data in different databases and the inconsistencies in the vocabulary used by these resources to describe biomedical entities and concepts. A key solution to this interoperability problem lies in the development of common terminologies capable of acting as a metadata layer to provide the missing links between the various resources. Successful initiatives for the development of standardized vocabularies in the biological domain started some years ago with the creation of the Gene Ontology (GO) for the description of biological functions and processes [1]. This was followed by the development of numerous biological ontologies under the Open Biological Ontologies initiative (OBO) [2]. In the medical domain, the effort to develop standard terminologies started many years before these initiatives in the molecular biology domain. Key vocabularies such as the International Classification of Diseases (ICD) [3], the SNOMED clinical terminology [4], and the Medical Subject Headings (MeSH) [5] were developed in order to standardize information on various domains of medicine, from patient care to biomedical literature indexing. The Unified Medical Language System (UMLS) [6] was developed by the US National Library of Medicine (NLM) to function as an umbrella over these resources by providing a system of interrelations between all these terminologies.

Journal of Integrative Bioinformatics, 4(3):79, 2007

Even if the recent integration of GO in the UMLS has opened new ways of linking biological and medical resources via terminologies, relationships between gene functions and diseases are still poorly documented in terminologies. Several initiatives have been set up to link phenotypes to genotypes [7], and systems have been developed to detect such associations. For instance, GenesTrace™ [8] and BioMeKe [9] use the relationships between GO and UMLS concepts of disease-related semantic types to infer gene-disease relationships. PhenoGO uses natural language processing methods to assign phenotypic context to GO annotations [10]. The MedGene database gathers relationships between human gene names and diseases extracted from MEDLINE [11]. GFINDer uses textual information from the Online Mendelian Inheritance in Man database (OMIM) to analyze the correlation of diseases with gene expression in microarray results [12]. All these systems rely on inference and therefore depend closely upon the accuracy of the various methods. A more direct way to link genes to diseases would be to use the disease-related information directly provided by some specific biological databases. Take the example of the UniProt Knowledgebase (UniProtKB) [13], the most comprehensive protein warehouse, with extensive annotation and cross-references to other database resources. In UniProtKB, more than 2,000 human proteins contain manually curated information related to their involvement in pathologies. This information comes with the type and position of the single amino acid polymorphisms known to cause the disease, and cross-references to variant databases and genomic resources, such as dbSNP and Ensembl. While this information is clearly of value, it is not easily accessible to clinical researchers because UniProtKB does not use standard medical vocabularies to describe diseases associated with proteins and their variants.

In this study, we have developed an automatic approach to map the disease terms in UniProtKB to two well-known and widely used disease terminologies within the UMLS: MeSH, the controlled vocabulary thesaurus used for indexing biomedical and health-related documents [5], and ICD-10, the official disease classification provided by the WHO [3]. We took advantage of the manual annotation in UniProtKB as well as the curated links of UniProtKB entries to OMIM, the comprehensive knowledge base of human genes and genetic diseases [14]. A benchmark set was created for the refinement of the term-matching algorithm as well as for the definition of the matching score and score threshold. This work provides a basis for further work aiming to increase the interoperability between data resources from the medical informatics and the bioinformatics domains.

2 Methods

2.1 Extraction of disease names

UniProtKB/Swiss-Prot (release 52.5) and OMIM (version May 2007) were used for this study. In UniProtKB/Swiss-Prot entries, disease information related to the protein is expressed in free-text comment lines qualified by the category ‘Disease’ (Figure 1). By manual inspection, we first established a list of regular expressions that indicate the presence of a disease name within these lines (e.g. ‘cause(s)’, ‘cause of’, ‘involved in’, ‘contribute(s) to’, ‘induce(s)’). The disease name was usually delimited either by the end of a sentence, a conjunction or relative clause, or by the corresponding OMIM identifier. We also defined a list of specific words, such as ‘susceptibility to’, ‘development (of)’ and ‘various types of’, to remove terms that have no direct connection with the disease name. In rare cases where several diseases were described in the same comment line, we restricted the extraction to the first mentioned disease.
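The extraction step described above can be illustrated with a minimal Python sketch. The trigger and filler phrases are those quoted in the text; the exact regular expressions, the delimiter rules, the function name extract_disease_name and the example lines in the usage note are assumptions made for illustration, not the patterns actually used in the study.

```python
import re

# Trigger phrases quoted in the text; their regex form is an assumption.
# 'cause of' is listed before 'causes?' so that it wins the alternation.
TRIGGER = re.compile(
    r"\b(?:cause of|causes?|involved in|contributes? to|induces?)\s+",
    re.IGNORECASE,
)

# Assumed delimiters: an OMIM reference, sentence punctuation,
# or a conjunction / relative pronoun.
END = re.compile(r"\s*\[MIM:|[.;,]|\s+(?:and|or|which|that)\b")

# Leading phrases with no direct connection to the disease name.
FILLER = re.compile(
    r"^(?:(?:susceptibility to|development of|various types of)\s+)+",
    re.IGNORECASE,
)


def extract_disease_name(comment):
    """Return the first disease name found in a comment line, or None."""
    trigger = TRIGGER.search(comment)
    if trigger is None:
        return None  # no clear reference to a disease
    rest = comment[trigger.end():]
    end = END.search(rest)
    name = rest[:end.start()] if end else rest
    name = FILLER.sub("", name.strip()).strip()
    return name or None
```

For example, a line written in the style of a Swiss-Prot disease comment, such as "Defects in CFTR are the cause of cystic fibrosis [MIM:219700].", is reduced to the name preceding the OMIM reference, and a leading "susceptibility to" is stripped. Because only the first trigger is used, the sketch returns a single name per line, in keeping with the restriction stated above.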

In parallel, we took advantage of the citations to OMIM phenotypes (#) and genes with phenotypes (+) in the disease comment lines to extract the fields ‘Title’ and ‘Alternative titles; symbols’ from the corresponding OMIM entries. These two fields provide the disease name in OMIM as well as a set of synonyms. For names coming from “gene with phenotype (+)” entries, we did not try to distinguish between gene names and disease names; both types were included in the disease list.

Figure 1: disease comment lines in a UniProtKB/Swiss-Prot entry


We mapped the extracted disease names to the terms from the disease category of the MeSH terminology (version 2007). The MeSH thesaurus is structured in a hierarchy of descriptors, each descriptor including a set of related concepts, and each concept itself containing a set of terms, which are synonyms and lexical variants. We mapped the disease names to the MeSH terms and linked the results to the corresponding MeSH descriptors. For ICD-10, we mapped the disease names to all non-redundant terms of ICD-10, without distinction of their types.

The mapping procedure consisted of two successive term matching steps:

(1) we found exact matches, where all words composing the name had an identical correspondent in a MeSH term and vice versa, the word order and the case not being taken into consideration.

(2) in case of no exact match, we looked for partial matches by decomposing the name into its word components and calculated a similarity score for names having at least one word in common.
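The exact-match step (1) can be sketched as a lookup on case-insensitive word sets. This is only an illustration: the tokenisation rule, the use of sets (which ignore repeated words), the function names and the descriptor labels in the test data are assumptions, not details given in the text.

```python
import re


def words(term):
    """Case-insensitive, order-independent set of the words in a term."""
    return frozenset(re.findall(r"\w+", term.lower()))


def build_index(mesh_terms):
    """Map each word set to the descriptors of the terms that produce it.

    mesh_terms is an iterable of (term, descriptor) pairs.
    """
    index = {}
    for term, descriptor in mesh_terms:
        index.setdefault(words(term), set()).add(descriptor)
    return index


def exact_match(name, index):
    """Return the descriptors whose term has exactly the same words as name."""
    return index.get(words(name), set())
```

With this representation, "Duchenne muscular dystrophy" and "Muscular Dystrophy, Duchenne" share the same key, so word order, case and punctuation do not prevent a match. Names that return an empty set would then go on to the partial-match step (2).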

The score used to determine the similarity between two terms was calculated as a function of the number of words in common minus the number of words that differ. In order to take into account the informative content of each word composing the term, we weighted the words according to an adaptation of the weighting scheme ‘Term Frequency X Inverse Document Frequency’ (TF X IDF), commonly used in information retrieval techniques [15]. We calculated the inverse document frequency (IDF) of each word present in the three sources of terms, namely Swiss-Prot disease lines, OMIM Titles and Alternative titles, and disease MeSH terms or ICD-10 terms.

The similarity score was calculated according to a formula (not reproduced here) in which freq = n/N, where n is the number of occurrences of the word in all OMIM fields (Titles, Alternative titles), Swiss-Prot disease comment lines, and MeSH terms (disease category) or ICD-10 terms, and N is the total number of words in these documents. cw stands for words in common and ncw for words present in only one of the terms. The term size(disease) is a normalization factor consisting of the number of words composing the disease name to be mapped.
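As an illustration of how these quantities could be combined, the sketch below uses one form consistent with the description: each word is weighted by -log(freq), the weights of the common words (cw) are added, those of the non-common words (ncw) are subtracted, and the result is divided by size(disease). The logarithmic weight, the whitespace tokenisation and the function names are assumptions; the formula actually used in the study may differ.

```python
import math
from collections import Counter


def word_weights(terms):
    """IDF-like weight per word, assumed here to be -log(freq) with freq = n/N."""
    counts = Counter(word for term in terms for word in term.lower().split())
    total = sum(counts.values())  # N: total number of words in all sources
    return {word: -math.log(n / total) for word, n in counts.items()}


def similarity(disease, candidate, weights):
    """Weighted common words minus weighted differing words, normalised.

    Every word of both terms must be present in weights.
    """
    disease_words = disease.lower().split()
    d, c = set(disease_words), set(candidate.lower().split())
    gain = sum(weights[w] for w in d & c)     # cw: words in common
    penalty = sum(weights[w] for w in d ^ c)  # ncw: words in only one term
    return (gain - penalty) / len(disease_words)  # size(disease)
```

Under these assumptions, rare words contribute more than frequent ones, so sharing a specific word raises the score more than sharing a generic word such as "disease", and a candidate with many extra words is penalised.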

Hyphenated words were treated in a special way to avoid false positive matches without penalizing the sensitivity. Each of their components was considered as a distinct word. If all components had a matched equivalent, their respective weights were summed in the score calculation. Otherwise, their weights were subtracted.
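The all-or-nothing rule for hyphenated words can be sketched as follows. The function name, its arguments and the weights in the usage note are hypothetical; only the rule itself (sum the component weights if every component is matched, subtract them otherwise) comes from the text.

```python
def hyphenated_contribution(word, other_term_words, weights):
    """Score contribution of one hyphenated word.

    The components count positively only if every one of them is
    matched in the other term; otherwise all of them count negatively.
    """
    parts = word.lower().split("-")
    total = sum(weights[part] for part in parts)
    if all(part in other_term_words for part in parts):
        return total
    return -total
```

With hypothetical weights of 3.0, 2.0 and 1.5 for "charcot", "marie" and "tooth", the word "Charcot-Marie-Tooth" contributes 6.5 when the other term contains all three components and -6.5 when it contains only some of them, so a partial overlap on an eponym does not inflate the score.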

2.3 Mapping evaluation

In order to evaluate the mapping procedure, 200 disease comments from 97 UniProtKB/Swiss-Prot entries were manually mapped to MeSH by a medical expert. Swiss-Prot entries were selected randomly. However, care was taken so that the chosen sample of entries would be representative and lead to a proportion of exact and partial matches similar to that found in a preliminary mapping attempt. The disease terms were mapped, whenever possible, to a single MeSH term of the same granularity or close in the hierarchy. However, when no equivalent term was found in the terminology, the disease name was mapped to several parents in different hierarchies or to high-level concepts.

The mapping procedure was assessed in terms of precision=TP/(TP+FP) and recall=TP/total number of terms, where TP is the number of correct mappings (true positives), and FP the number of incorrect mappings (false positives).
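The two measures defined above translate directly into code. The function name and the counts in the usage note are hypothetical and are not results from the study.

```python
def precision_recall(tp, fp, total_terms):
    """Precision = TP / (TP + FP); recall = TP / total number of terms."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / total_terms if total_terms else 0.0
    return precision, recall
```

For instance, with hypothetical counts of 90 correct and 10 incorrect mappings on a benchmark of 200 terms, precision is 0.90 and recall 0.45. Because the denominator of recall is the total number of terms, a term for which no mapping is proposed lowers recall without affecting precision.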

3 Results

In UniProtKB/Swiss-Prot (release 52.5), 2,167 human protein entries contained information on the involvement of these proteins in diseases. This corresponded to a total of 3,197 diseases, mainly of genetic causes. Among these diseases, 2,410 had a link to a corresponding phenotype described in OMIM, which represented 77% of the total OMIM entries of phenotypes with a known molecular basis (version May 2007). We mapped the disease names to the 38,646 terms of the MeSH disease category (version 2007) and 29,550 non-redundant terms of ICD-10. We treated names provided by Swiss-Prot and those provided by OMIM independently. A benchmark set consisting of 200 disease comment lines with 173 references to OMIM was used to evaluate the mapping procedure.

3.1 Disease name extraction

Swiss-Prot disease names were extracted from the comment lines with a set of regular expressions. As the Swiss-Prot disease lines are usually well structured, we were able to extract almost all disease names. The extraction failed in only 7 comment lines where a clear reference to a disease was not expressed, for instance:

“(CBL) can be converted to an oncogenic protein by deletions or mutations that disturb its ability to down-regulate RTKs.” (P22681)

The system was constructed to extract only a single disease name per line. By manual assessment of the extraction results, we noticed that in some cases it failed to correctly treat lines such as:

“KRT16 and KRT17 are coexpressed only in pathological situations such as metaplasias and carcinomas of the uterine cervix and in psoriasis vulgaris.” (P08779)

We did not investigate these cases further, as the structure of disease lines is planned for revision in the framework of Swiss-Prot comment standardization efforts.

Extraction of OMIM’s disease names from the ‘Title’ and ‘Alternative titles; symbols’ fields was simple. We kept all words composing a term, except qualifiers such as “included” or “obsolete”.
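A minimal sketch of this clean-up is shown below. The two qualifiers are those named in the text; the tokenisation rule, the function name and the example title in the usage note are assumptions made for illustration.

```python
import re

# Qualifiers named in the text; the full list used in the study is not given.
QUALIFIERS = {"included", "obsolete"}


def omim_title_words(title):
    """Lower-cased words of an OMIM title, without qualifier words."""
    tokens = re.findall(r"[\w-]+", title.lower())
    return [token for token in tokens if token not in QUALIFIERS]
```

Applied to a made-up title such as "EXAMPLE SYNDROME, TYPE 2, INCLUDED", the function keeps "example", "syndrome", "type" and "2" and drops the qualifier. Hyphenated words are kept whole at this stage so that they can be handled separately during scoring.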

3.2 Mapping on the benchmark

The results from a benchmark of 200 diseases manually mapped to MeSH terms are shown in Table 1. The mapping was done independently on disease names extracted from Swiss-Prot and on Title or Alternative titles of OMIM.

The mapping procedure was divided into two successive steps. First, we checked for exact matches with MeSH terms. Exact matches covered about 20% of the benchmark with excellent precision. The only three false positive matches were caused by a difference of classification between MeSH and OMIM. More specifically, OMIM considers these terms as synonyms, whereas MeSH classifies them in different concepts. For instance, two types of epidermolysis bullosa, which are distinct MeSH descriptors, are synonyms in OMIM. When we gathered the exact matches provided by the two resources, the coverage increased to 26%.
