Journal of Integrative Bioinformatics 2007 http://journal.imbio.de/
Mapping protein information to disease terminologies
Anaïs Mottaz1, Yum L. Yip1,2, Patrick Ruch2,3, and Anne-Lise Veuthey1
Swiss Institute of Bioinformatics,
Dept. of Structural Biology and Bioinformatics, University of Geneva,
Medical Informatics Service, University Hospitals of Geneva,
In order to improve the accessibility of genomic and proteomic information to medical researchers, we have developed a procedure to link biological information on proteins involved in diseases to the MeSH and ICD-10 disease terminologies. For this purpose, we took advantage of the manually curated disease annotations in more than 2,000 human protein entries of the UniProt KnowledgeBase. We mapped disease names extracted from the entry comment lines or from the corresponding OMIM entry to the MeSH. The method was assessed on a benchmark set of 200 manually mapped disease comment lines. We obtained a recall of 54% for 91% precision. The same procedure was used to map the more than 3,000 diseases in Swiss-Prot to MeSH with comparable efficiency.
Tested on ICD-10, the coverage of the mapped terms was lower, which could be explained by the coarse-grained structure of this terminology for hereditary disease description. The mapping is provided as supplementary material at http://research.isb-sib.ch/unimed.
1 Introduction
With the emergence of high-throughput technologies, the amount of biomedical data available to researchers and clinicians has increased drastically over the last decade. In the genomic/proteomic era, new methods of knowledge management will soon allow researchers to move beyond the analysis of a single molecule or pathway to consider global mechanisms, such as pathological processes, from an integrated point of view. One of the challenges for bioinformatics in this context is to bridge the gap between biological knowledge and clinical data. Currently, the main obstacles to achieving this objective are the compartmentalization of the data in different databases and the inconsistencies in the vocabulary used by these resources to describe biomedical entities and concepts. A key solution to this interoperability problem lies in the development of common terminologies capable of acting as a metadata layer to provide the missing links between the various resources. Successful initiatives for the development of standardized vocabularies in the biological domain started some years ago with the creation of the Gene Ontology (GO) for the description of biological functions and processes. It was followed by the development of numerous biological ontologies under the Open Biological Ontologies (OBO) initiative. In the medical domain, efforts to develop standard terminologies started many years before these initiatives in molecular biology. Key vocabularies such as the International Classification of Diseases (ICD), the SNOMED clinical terminology, and the Medical Subject Headings (MeSH) were developed in order to standardize information on various domains of medicine, from patient care to biomedical literature indexing. The Unified Medical Language System (UMLS) was developed by the US National Library of Medicine (NLM) to function as an umbrella over these resources by providing a system of interrelations between all these terminologies.
Journal of Integrative Bioinformatics, 4(3):79, 2007
Even if the recent integration of GO in the UMLS has opened new ways of linking biological and medical resources via terminologies, relationships between gene functions and diseases are still poorly documented in terminologies. Several initiatives have been set up to link phenotypes to genotypes, and systems have been developed to detect such associations. For instance, GenesTrace™ and BioMeKe use the relationships between GO and UMLS concepts of disease-related semantic types to infer gene-disease relationships. PhenoGO uses natural language processing methods to assign phenotypic context to GO annotations. The MedGene database gathers relationships between human gene names and diseases extracted from MEDLINE. GFINDer uses textual information from the Online Mendelian Inheritance in Man database (OMIM) to analyze the correlation of diseases with gene expression in microarray results. All these systems rely on inference and therefore depend closely on the accuracy of the underlying methods. A more direct way to link genes to diseases would be to use the disease-related information provided by specific biological databases. Take the example of the UniProt Knowledgebase (UniProtKB), the most comprehensive protein warehouse, with extensive annotation and cross-references to other database resources. In UniProtKB, more than 2,000 human proteins contain manually curated information related to their involvement in pathologies. This information comes with the type and position of the single amino acid polymorphisms known to cause the disease, and with cross-references to variant databases and genomic resources such as dbSNP and Ensembl. While this information is clearly of value, it is not easily accessible to clinical researchers because UniProtKB does not use standard medical vocabularies to describe the diseases associated with proteins and their variants.
In this study, we have developed an automatic approach to map the disease terms in UniProtKB to two well-known and widely used disease terminologies within the UMLS: MeSH, the controlled vocabulary thesaurus used for indexing biomedical and health-related documents, and ICD-10, the official disease classification provided by the WHO. We took advantage of the manual annotation in UniProtKB as well as the curated links of UniProtKB entries to OMIM, the comprehensive knowledge base of human genes and genetic diseases. A benchmark set was created for the refinement of the term matching algorithm as well as for the definition of the matching score and score threshold. This work provides a basis for further work aiming to increase the interoperability between data resources from the medical informatics and bioinformatics domains.
2 Methods
2.1 Extraction of disease names
UniProtKB/Swiss-Prot (release 52.5) and OMIM (version of May 2007) were used for this study. In UniProtKB/Swiss-Prot entries, disease information related to the protein is expressed in free-text comment lines qualified by the category ‘Disease’ (Figure 1). By manual inspection, we first established a list of regular expressions that indicate the presence of a disease name within these lines (e.g. ‘cause(s)’, ‘cause of’, ‘involved in’, ‘contribute(s) to’, ‘induce(s)’). The disease name was usually delimited either by the end of a sentence, a conjunction or relative clause, or by the corresponding OMIM identifier. We also defined a list of specific phrases, such as ‘susceptibility to’, ‘development (of)’ and ‘various types of’, to remove terms that have no direct connection with the disease name. In the rare cases where several diseases were described in the same comment line, we restricted the extraction to the first disease mentioned.
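The extraction step described above can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the trigger and filler lists contain only the examples quoted in the text, and the delimiter pattern (an OMIM reference written as `[MIM:...]`, a conjunction or relative pronoun, or sentence-ending punctuation) is an assumption. The function name `extract_disease_name` is hypothetical.

```python
import re
from typing import Optional

# Trigger phrases taken from the examples quoted in the text; the complete
# list used in the study is not given, so this set is an assumption.
TRIGGERS = r"\b(?:cause of|causes?|involved in|contributes? to|induces?)"

# Filler phrases that precede the disease name and carry no disease content.
FILLERS = r"^(?:(?:susceptibility to|development of|various types of)\s+)+"

# Assumed delimiters: an OMIM reference, a conjunction or relative pronoun,
# sentence-ending punctuation, or the end of the string.
STOP = r"(?=\s*\[MIM:|,?\s+(?:and|or|which|that)\b|[.;]|$)"

PATTERN = re.compile(TRIGGERS + r"\s+(?P<name>.+?)" + STOP, re.IGNORECASE)


def extract_disease_name(comment: str) -> Optional[str]:
    """Return the first disease name found in a comment line, or None."""
    match = PATTERN.search(comment)
    if match is None:
        return None
    name = re.sub(FILLERS, "", match.group("name").strip(), flags=re.IGNORECASE)
    return name.strip() or None


print(extract_disease_name(
    "Defects in CFTR are the cause of cystic fibrosis (CF) [MIM:219700]."
))
```

Because the sketch stops at the first conjunction, a disease name that itself contains "and" or "or" would be truncated; a production rule set would need to handle such names explicitly.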
In parallel, we took advantage of the citations to OMIM phenotypes (#) and genes with phenotypes (+) in the disease comment lines to extract the fields Title and Alternative titles; symbols from the corresponding OMIM entries. These two fields provide the disease name in OMIM as well as a set of synonyms. For names coming from “gene with phenotype (+)” entries, we did not try to distinguish between gene names and disease names; both types were included in the disease list.
Figure 1: disease comment lines in a UniProtKB/Swiss-Prot entry
We mapped the extracted disease names to the terms from the disease category of the MeSH terminology (version 2007). The MeSH thesaurus is structured in a hierarchy of descriptors, each descriptor including a set of related concepts, and each concept itself containing a set of terms, which are synonyms and lexical variants. We mapped the disease names to the MeSH terms and linked the results to the corresponding MeSH descriptors. For ICD-10, we mapped the disease names to all non-redundant terms of ICD-10, without distinction of their types.
The mapping procedure consisted of two successive term matching steps:
(1) we found exact matches, where all words composing the name had an identical correspondent in a MeSH term and vice versa, the word order and the case not being taken into consideration.
(2) in case of no exact match, we looked for partial matches by decomposing the name into its word components and calculated a similarity score for names having at least one word in common.
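Step (1) can be illustrated with a small index keyed on the words of each term, so that word order and case are ignored. This is a sketch under stated assumptions: the tokenization rule (runs of letters and digits) and the function names `word_key`, `build_index` and `exact_matches` are hypothetical, and the terminology is represented as simple (term, descriptor) pairs.

```python
import re
from collections import defaultdict


def word_key(term):
    """Order- and case-insensitive key: the sorted, lowercased words of a term."""
    return tuple(sorted(re.findall(r"[a-z0-9]+", term.lower())))


def build_index(terms):
    """Map each word key to the set of descriptors whose terms share that key.

    `terms` is an iterable of (term, descriptor) pairs.
    """
    index = defaultdict(set)
    for term, descriptor in terms:
        index[word_key(term)].add(descriptor)
    return index


def exact_matches(name, index):
    """Return the descriptors whose terms contain exactly the same words as `name`."""
    return index.get(word_key(name), set())


# Toy terminology: two lexical variants of one descriptor plus another descriptor.
toy_terms = [
    ("Cystic Fibrosis", "Cystic Fibrosis"),
    ("Fibrosis, Cystic", "Cystic Fibrosis"),
    ("Pulmonary Fibrosis", "Pulmonary Fibrosis"),
]
toy_index = build_index(toy_terms)
print(exact_matches("cystic fibrosis", toy_index))
```

Names that return an empty set here would be passed on to the partial matching of step (2).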
The score used to determine the similarity between two terms was calculated as a function of the number of words in common minus the number of words that differ. In order to take into account the informative content of each word composing the term, we weighted the words according to an adaptation of the ‘Term Frequency × Inverse Document Frequency’ (TF × IDF) weighting scheme commonly used in information retrieval. We calculated the inverse document frequency (IDF) of each word present in the three sources of terms, namely Swiss-Prot disease lines, OMIM Titles and Alternative titles, and disease MeSH terms or ICD-10 terms. The similarity score was calculated according to the following formula:
where freq = n/N, with n the number of occurrences of the word in all OMIM fields (Titles, Alternative titles), Swiss-Prot disease comment lines, and MeSH terms (disease category) or ICD-10 terms, and N the total number of words in these documents. cw stands for words in common and ncw for words present in only one of the terms. The term size(disease) is a normalization factor consisting of the number of words composing the disease name to be mapped.
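The exact formula is not reproduced in the text above, so the following Python sketch rests on an assumed form that is consistent with the definitions given: each word is weighted by −log(freq), the weights of the words in common (cw) are added, those of the words present in only one term (ncw) are subtracted, and the result is divided by size(disease). The logarithmic weight, the tokenization and the function names are assumptions, not the authors' specification.

```python
import math
import re
from collections import Counter


def tokenize(term):
    """Lowercased words of a term (assumed tokenization: letters and digits)."""
    return re.findall(r"[a-z0-9]+", term.lower())


def idf_weights(all_terms):
    """Weight each word by -log(freq), with freq = n / N over all source terms."""
    counts = Counter(word for term in all_terms for word in tokenize(term))
    total = sum(counts.values())
    return {word: -math.log(n / total) for word, n in counts.items()}


def similarity(disease, candidate, weights):
    """Assumed score: (sum of cw weights - sum of ncw weights) / size(disease).

    Every word of both terms is expected to be present in `weights`.
    """
    disease_words = tokenize(disease)
    if not disease_words:
        return 0.0
    d, c = set(disease_words), set(tokenize(candidate))
    common = sum(weights[w] for w in d & c)      # cw: words in common
    differing = sum(weights[w] for w in d ^ c)   # ncw: words in only one term
    return (common - differing) / len(disease_words)


corpus = ["cystic fibrosis", "pulmonary fibrosis", "cystic kidney disease"]
w = idf_weights(corpus)
print(similarity("cystic fibrosis", "pulmonary fibrosis", w))
```

Under this assumed form, rare words dominate the score, so a mismatch on an uncommon word costs more than a mismatch on a frequent one such as "disease".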
Hyphenated words were treated in a special way to avoid false positive matches without penalizing sensitivity. Each of their components was considered as a distinct word. If all components had a matched equivalent, their respective weights were added to the score. Otherwise, their weights were subtracted.
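One possible reading of this rule is sketched below: the hyphenated word contributes the sum of its component weights when every component is found in the other term, and the negative of that sum otherwise. The function name `hyphen_contribution`, the zero weight for unknown components and the toy weights are assumptions made for illustration.

```python
def hyphen_contribution(word, other_words, weights):
    """Signed weight contributed by one hyphenated word.

    All components matched in `other_words`: weights are added.
    Any component missing: the same weights are subtracted.
    """
    parts = [p for p in word.lower().split("-") if p]
    total = sum(weights.get(p, 0.0) for p in parts)
    return total if all(p in other_words for p in parts) else -total


# Toy weights, chosen only to make the arithmetic easy to follow.
toy_weights = {"charcot": 3.0, "marie": 2.0, "tooth": 2.5}
print(hyphen_contribution(
    "Charcot-Marie-Tooth", {"charcot", "marie", "tooth", "disease"}, toy_weights
))
```

Treating the compound as all-or-nothing means that a term sharing only one component of an eponym does not receive partial credit.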
2.3 Mapping evaluation
In order to evaluate the mapping procedure, 200 disease comments from 97 UniProtKB/Swiss-Prot entries were manually mapped to MeSH by a medical expert. Swiss-Prot entries were selected randomly. However, care was taken to ensure that the chosen sample of entries was representative and led to a proportion of exact and partial matches similar to that found in a preliminary mapping attempt. The disease terms were mapped, whenever possible, to a single MeSH term of the same granularity or close in the hierarchy. However, when no equivalent term was found in the terminology, the disease name was mapped to several parents in different hierarchies or to high-level concepts.
The mapping procedure was assessed in terms of precision=TP/(TP+FP) and recall=TP/total number of terms, where TP is the number of correct mappings (true positives), and FP the number of incorrect mappings (false positives).
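The two measures can be computed as in the short Python sketch below. Note that recall, as defined here, is taken over the total number of terms to be mapped. The function name and the counts in the usage line are illustrative and are not results from the study.

```python
def precision_recall(tp, fp, total_terms):
    """Precision = TP / (TP + FP); recall = TP / total number of terms."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / total_terms if total_terms else 0.0
    return precision, recall


# Illustrative counts only: 45 correct and 5 incorrect mappings for 100 terms.
precision, recall = precision_recall(tp=45, fp=5, total_terms=100)
print(f"precision={precision:.2f} recall={recall:.2f}")
```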
3 Results
In UniProtKB/Swiss-Prot (release 52.5), 2,167 human protein entries contained information on the involvement of these proteins in diseases. This corresponded to a total of 3,197 diseases, mainly of genetic origin. Among these diseases, 2,410 had a link to a corresponding phenotype described in OMIM, which represented 77% of the total OMIM entries of phenotypes with a known molecular basis (version of May 2007). We mapped the disease names to the 38,646 terms of the MeSH disease category (version 2007) and to the 29,550 non-redundant terms of ICD-10. We treated the names provided by Swiss-Prot and those provided by OMIM independently. A benchmark set consisting of 200 disease comment lines with 173 references to OMIM was used to evaluate the mapping procedure.
3.1 Disease name extraction
Swiss-Prot disease names were extracted from the comment lines with a set of regular expressions. As the Swiss-Prot disease lines are usually well structured, we were able to extract almost all disease names. The extraction failed in only 7 comment lines where a clear reference to a disease was not expressed, for instance:
“(CBL) can be converted to an oncogenic protein by deletions or mutations that disturb its ability to down-regulate RTKs.” (P22681)
The system was constructed to extract only a single disease name per line. By manual assessment of the extraction results, we noticed that in some cases it failed to handle correctly lines such as:
“KRT16 and KRT17 are coexpressed only in pathological situations such as metaplasias and carcinomas of the uterine cervix and in psoriasis vulgaris.” (P08779)
We did not investigate these cases further, as the structure of disease lines is due to be revised in the framework of the Swiss-Prot comment standardization efforts.
Extraction of OMIM disease names from the Title and Alternative titles; symbols fields was straightforward. We kept all words composing a term, except qualifiers such as “included” or “obsolete”.
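The qualifier filtering can be sketched as a simple word filter in Python. Only the two qualifiers named in the text are listed, so the set is incomplete by assumption; the function name `clean_omim_title` and the example title are made up for illustration, and the sketch also drops punctuation as a side effect of its tokenization.

```python
import re

# Qualifiers named in the text; the full list used in the study is not given.
QUALIFIERS = {"included", "obsolete"}


def clean_omim_title(title):
    """Keep every word of an OMIM title except the listed qualifiers."""
    words = re.findall(r"[A-Za-z0-9'-]+", title)
    return " ".join(w for w in words if w.lower() not in QUALIFIERS)


# Made-up title in the general style of an OMIM entry.
print(clean_omim_title("EXAMPLE SYNDROME, TYPE 2, INCLUDED"))
```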
3.2 Mapping on the benchmark
The results from a benchmark of 200 diseases manually mapped to MeSH terms are shown in Table 1. The mapping was done independently on disease names extracted from Swiss-Prot and on Title or Alternative titles of OMIM.
The mapping procedure was divided into two successive steps. First, we checked for exact matches with MeSH terms. Exact matches covered about 20% of the benchmark with excellent precision. The only three false positive matches were caused by a difference in classification between MeSH and OMIM. More specifically, OMIM considers these terms as synonyms, whereas MeSH classifies them under different concepts. For instance, two types of epidermolysis bullosa, which are distinct MeSH descriptors, are synonyms in OMIM. When we combined the exact matches provided by the two resources, the coverage increased to 26%.