
Improving Term Extraction with Terminological Resources

Sophie Aubin, Thierry Hamon

To cite this version:

Sophie Aubin, Thierry Hamon. Improving Term Extraction with Terminological Resources. In: Tapio Salakoski, Filip Ginter, Sampo Pyysalo, Tapio Pahikkala (eds.), Springer, 2006, LNAI 4139. hal-00091444

HAL Id: hal-00091444


Submitted on 6 Sep 2006

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Improving term extraction with terminological resources

Sophie Aubin and Thierry Hamon

LIPN – UMR CNRS 7030
99 av. J.B. Clément, F-93430 Villetaneuse
Tél.: 33 1 49 40 40 82, Fax: 33 1 48 26 07 12
firstname.lastname@lipn.univ-paris13.fr
WWW home page: www-lipn.univ-paris13.fr/~lastname

Abstract. Studies of different term extractors on a corpus of the biomedical domain revealed decreasing performances when applied to highly technical texts. Facing the difficulty or impossibility of customizing existing tools, we developed a tunable term extractor. It exploits linguistic-based rules in co…


interface [2] or the exploitation of external resources. We propose a combination of the three methods.

The terminology extractor we implemented uses techniques comparable to state-of-the-art tools, among which chunking based on morpho-syntactic frontiers and production of the syntactic analysis of the extracted terms. We further propose new solutions for chunking and parsing by using external resources. In addition, we chose to perform positive filtering in the parsing step through the mechanism of islands of reliability (see Section 3.1). In comparison, other tools produce all parsing solutions and filter out invalid ones a posteriori.

We first discuss the limitations of matching existing terminologies on corpora and of automatic extraction tools. In answer to this, we propose a combination of terminology extraction with the exploitation of testified resources. We describe the extraction process of YaTeA, which implements the method we propose. We finally present the results of experiments run on a biomedical corpus to characterise the effects of recycling existing terminologies in a term extractor.

2 Which approach to identify terms?

Terms can be identified in corpora following two approaches: matching terms issued from terminological resources, or designing automatic term extraction methods.

Using terminological resources to identify terms in texts raises the question of the usability of the resources on working corpora, namely their coverage and their adequacy. This leads to evaluating how well terms issued from resources, i.e. testified terms, match the working corpus. As terminological resources are widely available in the biomedical field, many experiments have been done on recycling terminologies to identify terms in medical and biological corpora. Coverage is generally mitigated. The coverage of well-known classifications such as ICD-9, ICD-10 or SNOMED III has been observed on a 14,247-word corpus of clinical texts [6]. The evaluation leads to the conclusion that no classification covers the corpus sufficiently, although SNOMED has the best content coverage. Similar observations have been made regarding the evaluation of the usability of the Gene Ontology for NLP [7]: 37% of the GO terms are found in a 400,000-citation Medline corpus.

Results vary depending on the GO categories, from 28% to 53% in the Medline corpus. [7] consider that this low content coverage could be due to the size of the working corpus or its narrow scope. Still, content coverage is even worse on a set of 3 million noun phrases randomly selected among 14 million terms extracted from the Medline corpus [8]: most of them are not present in the UMLS.
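The kind of coverage measurement reported above can be sketched as a simple string-matching check. The term list, corpus snippet, and function names below are invented for illustration, not taken from the cited studies, and real coverage studies also match lemmatized and variant forms:

```python
# Hedged sketch: fraction of resource terms found verbatim in a corpus.
# Plain lowercase substring matching is a deliberate simplification.

def coverage(resource_terms, corpus_text):
    """Return (ratio, matched) for terms found at least once."""
    text = corpus_text.lower()
    matched = [t for t in resource_terms if t.lower() in text]
    return len(matched) / len(resource_terms), matched

terms = ["gene expression", "cell cycle", "sporulation"]
corpus = "Sporulation in Bacillus subtilis requires regulated gene expression."
ratio, matched = coverage(terms, corpus)
print(f"coverage: {ratio:.0%}")  # 2 of 3 terms found
```

A real evaluation would additionally distinguish exact matches from partial overlaps with longer corpus terms, which is where the coverage figures cited above diverge.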

In [9], we showed that, in the context of the indexation of specialized texts, even if the combination of resources is useful to identify numerous testified terms or variants, the indexation varies greatly according to the documents.

Alternatives, based on the automatic extraction of terms, have been widely proposed since the 1990s. [4] give an overview of the proposed term extractors.

These term identification methods generally exploit linguistic information like boundaries or, more often, patterns. Such approaches are difficult to evaluate without a gold standard, and evaluations vary according to the methods. However, recall is generally good ([2] estimates silence at 5%), while precision is rather low ([2] rejects 50% of the extracted term candidates, and the system discussed in [10] has an error rate of 20%).
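These evaluation notions can be made concrete with a small sketch. The term sets are toy data, not from [2] or [10], and "silence" is taken, as in the literature above, to be the fraction of reference terms the extractor misses, i.e. 1 − recall:

```python
# Illustrative sketch of silence and precision for a term extractor.
# Reference and extracted term sets are invented for the example.

def silence(reference_terms, extracted_terms):
    """Fraction of reference terms missed by the extractor (1 - recall)."""
    missed = set(reference_terms) - set(extracted_terms)
    return len(missed) / len(reference_terms)

def precision(extracted_terms, reference_terms):
    """Fraction of extracted candidates that are actual terms."""
    extracted = set(extracted_terms)
    return len(extracted & set(reference_terms)) / len(extracted)

ref = {"gene expression", "cell cycle", "promoter region", "sigma factor"}
ext = {"gene expression", "cell cycle", "sigma factor", "of course"}
print(silence(ref, ext))    # 0.25: one reference term in four is missed
print(precision(ext, ref))  # 0.75: three of four candidates are valid
```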

Pure term extraction methods rarely use terminological resources. Such domain information is rather exploited at the filtering step [10]. However, the usefulness of terminological resources in a term extraction process is demonstrated in FASTR [11]. Results of this term variant extraction system are rather good as term variation acquisition increases the terminological resource coverage. The limitation of this approach is the acquisition of terms unrelated to testified ones.

In light of the works discussed above, it seems clear that terminological resources provide precious information that should be used in a term identification task. However, exploiting terminological resources requires their availability and their adequacy to the targeted corpus. Conversely, automatic term extraction approaches suffer from a necessary human validation step. In that respect, we aim at combining both approaches by developing a term extraction method that exploits terminological resources when available.

3 Strategy of term extraction

The software YaTeA, developed in the context of the ALVIS project1, aims at extracting noun phrases that look like terms from a corpus. It provides their syntactic analysis in a head-modifier format. As input, the term extractor requires a corpus that has been segmented into words and sentences, lemmatized, and tagged with part-of-speech (POS) information. The implementation of this term extractor allows the processing of large corpora. It is not dependent on a specific language in the sense that all linguistic features can be modified or created for a new language, sub-language or tagset. In the experiments described here, we used the GENIA tagger2 [12], which is specifically designed for biomedical corpora and uses the Penn Treebank tagset.

The main strategy of analysis of the term candidates is based on the exploitation of simple parsing patterns and endogenous disambiguation. Exogenous disambiguation is also made possible for the identification and the analysis of term candidates by the use of external resources, i.e. lists of testified terms.

This section presents both the endogenous and exogenous disambiguation strategies. We also describe the whole extraction process implemented in YaTeA.


3.1 Endogenous and exogenous disambiguation

Endogenous disambiguation consists in the exploitation of intermediate extraction results for the parsing of a given Maximal Noun Phrase (MNP).

1 European Project STREP IST-1-002068-STP, http://www.alvis.info/alvis/
2 http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/

All the MNPs corresponding to parsing patterns are parsed first. In a second step, remaining unparsed MNPs are processed using the MNPs parsed during the first step as islands of reliability. An island of reliability is a subsequence (contiguous or not) of an MNP that corresponds to a shorter term candidate

in either its inflected or lemmatized form. It is used as an anchor as follows:

the subsequence covered by the island is reduced to the word found to be the syntactic head of the island. Parsing patterns are then applied to the simplified MNP.

This feature allows the parsing of complex noun phrases using a limited number of simple parsing patterns (80 patterns containing a maximum of 3 content words were defined for the experiments described below). In addition, islands increase the degree of reliability of the parse, as shown in Figure 1.
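The island mechanism can be sketched roughly as follows. The island table, head assignments, and pattern set are hypothetical, YaTeA's actual data structures are richer, and non-contiguous islands, mentioned in the text, are not handled here:

```python
# Rough sketch of islands of reliability: a contiguous subsequence of
# an MNP that matches an already-parsed shorter term is collapsed to
# that term's syntactic head, letting a short pattern cover the rest.

ISLANDS = {("transcription", "factor"): "factor"}  # known term -> its head
PATTERNS = {("NN", "NN")}                          # allowed POS parses

def simplify(mnp):
    """Replace the first matching island by its head word."""
    words = [w for w, _ in mnp]
    for term, head in ISLANDS.items():
        n = len(term)
        for i in range(len(words) - n + 1):
            if tuple(words[i:i + n]) == term:
                # the collapsed island keeps the POS of its head word
                return mnp[:i] + [(head, mnp[i + n - 1][1])] + mnp[i + n:]
    return mnp

def has_parse(mnp):
    tags = tuple(t for _, t in simplify(mnp))
    return tags in PATTERNS

mnp = [("binding", "NN"), ("transcription", "NN"), ("factor", "NN")]
print(has_parse(mnp))  # True: the island reduces a 3-word MNP to NN NN
```

Without the island, the three-tag sequence matches no pattern; collapsing "transcription factor" to its head "factor" is exactly what lets the two-word pattern apply.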

Fig. 1. Effect of an island on parsing

YaTeA allows exogenous disambiguation, i.e. the exploitation of existing (testified) terminologies to assist the chunking, parsing and extraction steps.

During chunking, sequences of words corresponding to testified terms are identified. They cannot be further split or deleted. Their POS tags and lemmas can be corrected according to those associated with the testified term. If an MNP corresponds to a testified term for which a parse exists (provided by the user or computed using parsing patterns), it is recorded as a term candidate with the highest reliability score. As with endogenous disambiguation, subsequences of MNPs corresponding to testified terms are used as islands of reliability in order to increase the number and quality of parsed MNPs.

3.2 Term candidate extraction process

A noun phrase is extracted from the corpus and considered a term candidate if at least one parse is found for it. This is performed in three main steps: (1) chunking, i.e. construction of a list of Maximal Noun Phrases from the corpus; (2) parsing, i.e. attempts to find at least one syntactic parse for each MNP; and (3) extraction of term candidates. The result of the term extraction process is two lists of noun phrases: one contains parsed MNPs, called term candidates; the other contains MNPs for which no parse was found. Both lists are proposed to the user through a validation interface (ongoing development).

1. Chunking: the corpus is chunked into Maximal Noun Phrases.

The POS tags associated to the words of the corpus are used to delimit the MNPs according to the resources provided by the user: chunking frontiers and exceptions, forbidden structures and potentially, testified terms.

Chunking frontiers are tags or words that are not allowed to appear in MNPs, e.g. verbs (VBG) or prepositions (IN). Chunking exceptions are used to refine frontiers. For instance, "of" is a frontier exception among prepositions, "many" and "several" being exceptions among adjectives. Forbidden structures are exceptions for more complex structures and are used to prevent the extraction of sequences that look like terms (syntactically valid) but are known not to be terms or parts of terms, like "of course". MNPs and subparts of MNPs corresponding to testified terms (when available) are protected and cannot be modified using the chunking data. For instance, the tag FW is a priori not allowed in MNPs. However, if an MNP is equal to or contains the testified term "in/IN vitro/FW", it will be kept as such.
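A bare-bones version of this frontier-and-exception chunking might look like the following. The frontier tags, exception words, and sentence are illustrative choices, not YaTeA's shipped configuration, and testified-term protection and forbidden structures are omitted:

```python
# Minimal sketch of frontier-based chunking into Maximal Noun Phrases:
# frontier POS tags split the token stream, unless the word itself is
# listed as an exception (e.g. "of" may stay inside an MNP).

FRONTIER_TAGS = {"DT", "IN", "VBZ"}   # determiners, prepositions, verbs
EXCEPTION_WORDS = {"of"}              # frontier exceptions

def chunk_mnps(tagged_sentence):
    mnps, current = [], []
    for word, tag in tagged_sentence:
        if tag in FRONTIER_TAGS and word not in EXCEPTION_WORDS:
            if current:               # a frontier closes the open chunk
                mnps.append(current)
            current = []
        else:
            current.append((word, tag))
    if current:
        mnps.append(current)
    return mnps

sentence = [("the", "DT"), ("product", "NN"), ("of", "IN"),
            ("sigB", "NN"), ("is", "VBZ"), ("a", "DT"),
            ("sigma", "NN"), ("factor", "NN")]
for mnp in chunk_mnps(sentence):
    print(" ".join(w for w, _ in mnp))  # "product of sigB", "sigma factor"
```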

2. Parsing: for each identified MNP type, except monolexical MNPs, different parsing methods are applied in decreasing order of reliability. Once a method succeeds in parsing the MNP, the parsing process comes to an end. Still, one method can compute several parses for the same MNP, making the parsing non-deterministic if desired. We consider three different parsing methods:

– tt-covered: the MNP inflected or lemmatized form corresponds to one or several combined testified terms (TT);

– pattern-covered: the POS sequence of the (possibly simplified) MNP corresponds to a parsing pattern provided by the user;

– progressive: the MNP is progressively reduced at its left and right ends by the application of parsing patterns. Islands of reliability from term candidates or testified terms are also used to reduce the sequence of the MNP so as to allow the application of parsing patterns.

3. Extraction of term candidates: MNPs that received a parse in the previous processing step are considered term candidates. Statistical measures will further be implemented to order MNPs according to their likelihood to be a term in order to facilitate their validation by the user.
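The decreasing-reliability cascade of step 2 can be sketched as follows. The testified terms, patterns, and especially the stub standing in for the progressive method are invented placeholders; only the control flow mirrors the text:

```python
# Sketch of the parsing cascade: each method is tried in order and the
# first one that yields a result ends the process. The progressive
# method is a stub; the real one reduces the MNP at its ends using
# parsing patterns and islands of reliability.

TESTIFIED = {("in", "vitro")}     # testified terms (word sequences)
PATTERNS = {("NN", "NN")}         # parsing patterns over POS sequences

def tt_covered(mnp):
    """Most reliable: the MNP matches combined testified terms."""
    words = tuple(w for w, _ in mnp)
    return "tt-covered" if words in TESTIFIED else None

def pattern_covered(mnp):
    """The MNP's POS sequence matches a user-provided pattern."""
    tags = tuple(t for _, t in mnp)
    return "pattern-covered" if tags in PATTERNS else None

def progressive(mnp):
    return "progressive"          # placeholder fallback

def parse(mnp):
    for method in (tt_covered, pattern_covered, progressive):
        result = method(mnp)
        if result is not None:    # first successful method wins
            return result

print(parse([("in", "FW"), ("vitro", "FW")]))      # tt-covered
print(parse([("sigma", "NN"), ("factor", "NN")]))  # pattern-covered
```

Because the cascade stops at the first success, a testified-term match always pre-empts the cheaper but less reliable pattern and progressive methods.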

4 Experiments

To characterise the effects of resources on term extraction, we compare the results provided by YaTeA with and without existing terminologies on a biomedical corpus. We present and comment on the effects on the chunking, parsing and extraction of term candidates.

4.1 Materials

Working corpus. We carry out an experiment on a corpus of 16,600 sentences (438,513 words) describing genomic interactions of the model organism Bacillus subtilis. The corpus was tagged and lemmatized using the GENIA tagger [12].

Terminological resources. To study the reuse of terminologies in the term extractor, we tested two types of resources: terms from two public databases, and a list of terms extracted from the working corpus. We first selected and merged two specialized resources covering genomic vocabulary: the Gene Ontology [13] and MeSH [14], both issued from the December 2005 release of the UMLS [15].

The Gene Ontology resource3 (henceforth GO) aims at providing a controlled vocabulary for the genomic description of any organism, prokaryote as well as eukaryote [16]. GO proposes a list of 24,803 terms. The Medical Subject Headings thesaurus4 (henceforth MeSH) is dedicated to the indexing of the Medline database. The UMLS version of MeSH offers 390,489 terms used in the medical domain [17].

The TAC (Terms Acquired in Corpus) resource is a list of 515 terms extracted from our working corpus using three term extractors [5]. The 515 terms occur at least 20 times in the corpus and were validated by a biologist.

4.2 Results


Chunking is affected by resources in several ways. As shown in Table 1, they allow the identification of new MNPs that were originally rejected due to their POS tag(s). In addition, the MNPs tend to be longer and monolexical terms less numerous. As MNPs are more complex, the number of types of POS sequences to be parsed is augmented. However, this increase in diversity is expected to be compensated by the parsing mechanism related to islands of reliability.
