
Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking

Estelle DELPECH (1), Béatrice DAILLE (1), Emmanuel MORIN (1), Claire LEMAIRE (2,3)

(1) UNIVERSITÉ DE NANTES – LINA UMR 6241, 2 rue de la Houssinière, BP 92208, 44322 Nantes Cedex 3, France

(2) UNIVERSITÉ STENDHAL – GRENOBLE 3, BP 25, 38040 Grenoble Cedex 9, France

(3) LINGUA ET MACHINA, c/o Inria Rocquencourt BP 105, Le Chesnay Cedex 78153, France

(1){name.surname}@univ-nantes.fr (2){initials}@lingua-et-machina.com

ABSTRACT

This paper proposes a method for extracting translations of morphologically constructed terms from comparable corpora. The method is based on compositional translation and exploits translation equivalences at the morpheme-level, which allows for the generation of “fertile” translations (translation pairs in which the target term has more words than the source term).

Ranking methods relying on corpus-based and translation-based features are used to select the best candidate translation. We obtain an average precision of 91% on the Top1 candidate translation. The method was tested on two language pairs (English-French and English-German) and on small specialized comparable corpora (400k words per language).

TITLE AND ABSTRACT IN ANOTHER LANGUAGE, FRENCH

Extraction de lexiques bilingues spécialisés à partir de corpus comparables : traduction compositionnelle et ordonnancement

Cet article propose une méthode permettant d'extraire des traductions de termes morphologiquement construits à partir de corpus comparables. La méthode se base sur la traduction compositionnelle et exploite des équivalences traductionnelles au niveau morphologique, ce qui nous permet de générer des traductions “fertiles” (des paires de traductions dans lesquelles le terme cible a plus de mots que le terme source). Des méthodes d'ordonnancement s'appuyant sur des traits extraits du corpus et des paires de traduction sont utilisées pour sélectionner la meilleure traduction candidate. Nous obtenons une précision de 91% sur le Top1 en moyenne. La méthode a été testée sur deux paires de langues (anglais-français et anglais-allemand) et sur un corpus comparable spécialisé de petite taille (400k mots par langue).

KEYWORDS: COMPUTER-AIDED TRANSLATION, MACHINE TRANSLATION, COMPARABLE CORPORA, LEARNING-TO-RANK, COMPOSITIONALITY, TERMINOLOGY

MOTS-CLÉS : TRADUCTION ASSISTÉE PAR ORDINATEUR, TRADUCTION AUTOMATIQUE, CORPUS COMPARABLES, LEARNING-TO-RANK, COMPOSITIONNALITÉ, TERMINOLOGIE

Introduction

Comparable corpora are composed of texts in different languages which are not translations but deal with the same subject matter and were produced in similar situations of communication. They are used in Computer-Aided Translation to provide technical translators with domain-specific bilingual lexicons when there is no parallel data available (e.g. translation memories, multilingual terminologies). This situation arises when translators have to translate texts which deal with emerging technical domains or when the translation is done from/to an under-resourced language. Comparable corpora also have the advantage of containing more idiomatic expressions than parallel corpora do because the target texts do not bear the influence of the source language.

Indeed, Baker (1996) observed that translated texts tend to bear features like explicitation, simplification, normalization and levelling out. As a consequence, one of the difficulties with comparable corpora is that the translation of a source term may not be present in its “normalized” or “canonical” form but rather in the form of a morphological or paraphrastic variant (e.g. postmenopausal translates to après la ménopause 'after the menopause' instead of postménopausique). Another limitation is that algorithms output, for each source term, a set of candidate translations instead of just one target term. This state of affairs makes it very challenging for translators to use lexicons extracted from comparable corpora in real-life situations (Delpech, 2011).

Increasing the size of the corpus in order to find more translation pairs or to extract parallel segments of text (Fung & Cheung, 2004; Rauf & Schwenk, 2009) is only possible when large amounts of text are available. In the case of the extraction of domain-specific lexicons, we quickly face the problem of data scarcity: in order to extract high-quality lexicons, the corpus must contain text dealing with very specific subject domains, and the target and source texts must be highly comparable. If one tries to increase the size of the corpus, one takes the risk of decreasing its quality by adding out-of-domain texts. Studies support the idea that the quality of a corpus is more important than its size. Morin et al. (2007) show that the discourse categorization of the documents increases the precision of the lexicon despite the data sparsity. Bo & Gaussier (2010) show that the quality of the extracted lexicon improves when the comparability of the corpus is improved by selecting a smaller, but more comparable, corpus from an initial set of documents.





This paper proposes methods for ranking and extracting canonical translations as well as translation variants, with a special focus on the extraction of fertile translations. In parallel text processing, the notion of fertility was defined by Brown et al. (1993) as the number of target words to which a source word e is connected in a randomly selected alignment. Similarly, we call a fertile translation a translation pair in which the target term has more words than the source term. The identification of fertile translations is useful because (i) they frequently correspond to non-canonical translations, e.g. paraphrastic variants, and (ii) they tend to correspond to vulgarized forms of technical terms (e.g. « cytotoxic » vs. « toxic to the cells »), which are useful when the translator translates lay science texts. Up to now, fertility has received little attention in the field of comparable corpora processing. To our knowledge, only Daille & Morin (2005) and Weller et al. (2011) have tried to extract translation pairs of different lengths from comparable corpora. Daille & Morin (2005) focus on the specific case of multi-word terms whose meaning is not compositional and try to align these multi-word terms with either single-word terms or multi-word terms using a context-based approach. Weller et al. (2011) concentrate on translating noun compounds to noun phrases. Similar to the approach presented here, Claveau & Kijak (2011) use translation equivalences between morphemes to generate translations and can handle fertility. However, their method is not suited to comparable corpora since it requires domain-specific parallel data (in their case, a multilingual terminology) to learn alignment probabilities.
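The definition of a fertile translation given above can be checked mechanically. The following is a minimal sketch, assuming that words are delimited by whitespace (a simplification; the tokenization actually used is not specified here). The function names are illustrative and do not come from the paper.

```python
def word_count(term: str) -> int:
    """Count whitespace-separated words in a term (assumed tokenization)."""
    return len(term.split())


def is_fertile(source_term: str, target_term: str) -> bool:
    """A pair is fertile when the target term has more words than the source term."""
    return word_count(target_term) > word_count(source_term)


# Pairs taken from the examples in the text.
pairs = [
    ("cytotoxic", "toxic to the cells"),
    ("postmenopausal", "après la ménopause"),
    ("postmenopausal", "postménopausique"),
]
for source, target in pairs:
    label = "fertile" if is_fertile(source, target) else "not fertile"
    print(f"{source} -> {target}: {label}")
```

Under this definition, the first two pairs are fertile and the third, a one-word to one-word translation, is not.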

Our method is based on compositional translation. We chose this approach because: (i) according to Namer & Baud (2007), compositional terms form a major part of the new terms found in technical and scientific domains, and this is not restricted to the field of biomedicine, as is generally believed; (ii) compositionality-based methods have been shown to clearly outperform context-based ones for the translation of terms with compositional meaning, both in terms of translation accuracy and rank of the correct candidate translation (Morin & Daille, 2010); (iii) we believe that compositionality-based methods offer the opportunity to generate fertile translations if combined with a morphology-based approach. This method, which we call morphocompositional translation, consists in: (i) decomposing the source term into morphemes: postmenopause is split into post- + menopause (see note 1 for the notation); (ii) translating the morphemes to bound morphemes or fully autonomous words: post- becomes post- or après, menopause becomes ménopause; (iii) recomposing the translated elements into a target term: post-ménopause 'postmenopause', après la ménopause 'after the menopause'. Fertile translations can be generated because we allow bound morphemes to be translated to autonomous lexical items (e.g. prefix post- → preposition après). The proposed ranking methods exploit various corpus-based and translation-based features.
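The three steps (decomposition, translation, recomposition) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the two small dictionaries, the prefix-stripping heuristic and the function names are hypothetical and are not the resources or algorithms used in the paper. In particular, the sketch does not insert function words, so it produces après ménopause where the text gives après la ménopause.

```python
from itertools import product

# Hypothetical toy resources (not the paper's lexicons).
# A bound morpheme may translate to a bound morpheme or to a free word.
PREFIXES = {"post-": ["post-", "après"]}
WORDS = {"menopause": ["ménopause"]}


def decompose(term):
    """Split a term into a known prefix and a known base, if possible."""
    for prefix in PREFIXES:
        stem = prefix.rstrip("-")
        if term.startswith(stem) and term[len(stem):] in WORDS:
            return [prefix, term[len(stem):]]
    return [term]


def translate(morphemes):
    """Return every combination of the morphemes' translations."""
    options = [PREFIXES.get(m) or WORDS.get(m, []) for m in morphemes]
    return [list(combo) for combo in product(*options)]


def recompose(parts):
    """Attach a part directly after a bound morpheme, else separate with a space."""
    out = ""
    for part in parts:
        if out and not out.endswith("-"):
            out += " "
        out += part
    return out


def candidates(term):
    return [recompose(parts) for parts in translate(decompose(term))]


print(candidates("postmenopause"))  # ['post-ménopause', 'après ménopause']
```

The second candidate is fertile: it arises because the prefix post- was allowed to translate to the free word après.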

This paper is organized in four sections. Section 1 outlines recent research in compositional approaches to bilingual lexicon extraction. Section 2 explains the methods we designed for translation generation and ranking. Section 3 describes our experimental data. Section 4 presents and discusses the results of our experiments.

1 Compositional approaches to bilingual lexicon extraction

The core of compositional translation consists in generating candidate translations following the principle of compositionality: “the meaning of the whole is a function of the meaning of the parts” (Keenan & Faltz, 1985, pp. 24-25). Once the candidate translations have been generated, one generally ranks them and selects the TopN candidate translations. Generation methods are described in section 1.1. Ranking methods are described in section 1.2.

1.1 Generation methods

Compositional translation consists in decomposing the source term into atomic components, translating these components into the target language and recomposing the translated components into target terms. Existing implementations differ on the kind of atomic components they use for translation.

Lexical compositional translation (Baldwin & Tanaka, 2004; Grefenstette, 1999; Morin & Daille, 2009; Robitaille et al., 2006) deals with multi-word term to multi-word term alignment and uses lexical words as atomic components: rate of evaporation is translated into French as taux d'évaporation by translating rate to taux and evaporation to évaporation using dictionary lookup. Recomposition may be done by permuting the translated components (Morin & Daille, 2010) or with translation patterns (Baldwin & Tanaka, 2004).

Note 1: We use the following notations: trailing hyphen for prefixes (a-), leading hyphen for suffixes (-a), both for confixes (-a-), no hyphen for autonomous morphemes (a) and a plus sign (+) for intra-word morpheme boundaries. The term confix is borrowed from (Martinet, 1979) and refers to neoclassical (Latin or Ancient Greek) roots.
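Lexical compositional translation with recomposition by permutation can be sketched as follows. This is an illustrative sketch only: the dictionary, the stop-word lists, the list of attested target terms and the matching on content words are all assumptions, not a description of any of the cited systems.

```python
from itertools import permutations, product

# Hypothetical toy resources, for illustration only.
DICTIONARY = {"rate": ["taux"], "evaporation": ["évaporation"]}
SOURCE_STOP = {"of"}
TARGET_STOP = {"d'", "de"}
TARGET_TERMS = ["taux d'évaporation", "taux de change"]  # terms seen in a target corpus


def content_words(term, stop_words):
    """Tokenize on whitespace and apostrophes, then drop function words."""
    tokens = term.replace("'", "' ").split()
    return tuple(t for t in tokens if t not in stop_words)


def translate_term(source_term):
    """Translate each content word, permute, and keep attested target terms."""
    words = content_words(source_term, SOURCE_STOP)
    options = [DICTIONARY.get(w, []) for w in words]
    generated = set()
    for combo in product(*options):
        generated.update(permutations(combo))
    return [t for t in TARGET_TERMS if content_words(t, TARGET_STOP) in generated]


print(translate_term("rate of evaporation"))  # ["taux d'évaporation"]
```

Comparing content words rather than full strings is one simple way to let a generated sequence such as taux + évaporation match an attested term that contains a function word.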

Sublexical compositional translation deals with single-word term translation. The atomic components are subparts of the source single-word term. Cartoni (2009) translates neologisms created by prefixation with a formalism called Bilingual Lexeme Formation Rules. The atomic components are the prefix and the lexical base: the Italian neologism ricostruire 'rebuild' is translated into French reconstruire by translating the prefix ri- to re- and the lexical base costruire to construire. Weller et al. (2011) translate two types of single-word terms. German single-word terms formed by the concatenation of two neoclassical roots are decomposed into these two roots, then the roots are translated into target language roots and recomposed into an English or French single-word term, e.g. Kalori(1)metrie(2) is translated as calori(1)metry(2). German NOUN1+NOUN2 compounds are translated into French and English NOUN1 NOUN2 or NOUN1 PREP NOUN2 multi-word terms, e.g. Elektronen(N1)-mikroskop(N2) is translated to electron(N1) microscope(N2).

Garera & Yarowsky (2008) translate various compound sequences (NOUN1+NOUN2, ADJ1+NOUN2 …). They generate an English literal gloss of the compounds with the compositional method (for instance, the English gloss for the Albanian word hekurudhë 'railway' is iron path). Then, they search for entries in Lx-to-English dictionaries where the entry in language Lx is a word-to-word translation of the English gloss (e.g. iron path matches the German entry Eisenbahn and the Italian entry ferrovia). The final candidate translations are the fluent English translations proposed by the bilingual dictionaries (e.g. Eisenbahn and ferrovia both translate to railway; railway is therefore considered as a potential translation for hekurudhë).
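The gloss-matching idea described above can be sketched with toy data. The data structures below are hypothetical: each pivot dictionary entry is assumed to store both its literal English gloss and its fluent English translation, which is a simplification of the procedure described in the text.

```python
# Hypothetical toy resources, for illustration only.
COMPONENT_GLOSSES = {"hekur": "iron", "udhë": "path"}  # source component -> English
PIVOT_ENTRIES = {
    # language -> {entry: (literal English gloss, fluent English translation)}
    "German": {"Eisenbahn": (("iron", "path"), "railway")},
    "Italian": {"ferrovia": (("iron", "path"), "railway")},
}


def literal_gloss(components):
    """Gloss a compound word-by-word into English."""
    return tuple(COMPONENT_GLOSSES[c] for c in components)


def fluent_candidates(components):
    """Map each fluent translation to the pivot languages that support it."""
    gloss = literal_gloss(components)
    support = {}
    for language, entries in PIVOT_ENTRIES.items():
        for _entry, (entry_gloss, fluent) in entries.items():
            if entry_gloss == gloss:
                support.setdefault(fluent, set()).add(language)
    return {fluent: sorted(langs) for fluent, langs in support.items()}


print(fluent_candidates(["hekur", "udhë"]))  # {'railway': ['German', 'Italian']}
```

Keeping the list of supporting languages for each candidate makes it possible to count how many pivot languages agree on a given fluent translation.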

1.2 Ranking and selection methods

Generally, compositional translation generates several possible translations for one source term. One has to find a way to rank the translations from the most to the least reliable. Garera & Yarowsky (2008) tried two ranking methods: (i) a probability score P based on the number of different languages exhibiting the association between the literal gloss and the fluent translation; (ii) the probability score P combined with the similarity of the source and target words' contexts, computed with context-based methods as in the work of Rapp (1995) and Fung (1997). Robitaille et al. (2006) extract translation pairs from a corpus built by querying a search engine with a set of seed translation pairs. They select the candidate translations which are semantically related to the target seed terms. The semantic similarity measure is based on the number of hits containing the seed term and/or the candidate translation (Jaccard coefficient). Other works simply select the candidate translations which occur in the target corpus (Weller et al., 2011; Morin & Daille, 2010) or which are significantly attested on the Web (Cartoni, 2009).
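A Jaccard coefficient over search-engine hit counts can be computed as the number of hits containing both terms divided by the number of hits containing at least one of them. The sketch below assumes that three counts are available (hits for each term alone and for both together); the exact queries used in the cited work are not described here, and the numbers in the example are made up.

```python
def jaccard_from_hits(hits_a: int, hits_b: int, hits_both: int) -> float:
    """Jaccard coefficient: |A and B| / |A or B|, computed from hit counts."""
    union = hits_a + hits_b - hits_both
    return hits_both / union if union > 0 else 0.0


# Made-up hit counts for a seed term, a candidate translation, and both together.
similarity = jaccard_from_hits(hits_a=120, hits_b=80, hits_both=50)
print(round(similarity, 3))  # 0.333
```

A candidate would then be kept when its similarity to the seed terms is high enough; the threshold is a design choice not specified in the text.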

Only Baldwin & Tanaka (2004) use machine learning. They train an SVM classifier with corpus-based, dictionary-based and translation pattern-based features and use the value returned by the classifier (a continuous value between -1 and +1) to rank the candidate translations. Their approach amounts to a point-wise approach to learning-to-rank. To our knowledge, no research work has investigated the possible contribution of advanced learning-to-rank algorithms to the ranking of candidate translations. Learning-to-rank algorithms are widely used in Information Retrieval for ranking documents from the most to the least relevant to a given query (Li, 2011). They can be easily ported to the problem of ranking the candidate translations of a source term.
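In a point-wise setting, each candidate translation is scored independently and the candidates are then sorted by score. The sketch below illustrates only this scoring-then-sorting scheme: the hand-set linear weights stand in for a trained model, and the feature names and values are invented for the example rather than taken from any of the cited works.

```python
# Hypothetical weights standing in for a trained scoring model.
WEIGHTS = {"target_frequency": 0.6, "dictionary_coverage": 0.4}


def score(features):
    """Score one candidate independently of the others (point-wise)."""
    return sum(WEIGHTS[name] * value for name, value in features.items())


def top_n(candidates, n=1):
    """Rank (term, features) pairs by decreasing score and keep the first n terms."""
    ranked = sorted(candidates, key=lambda c: score(c[1]), reverse=True)
    return [term for term, _features in ranked[:n]]


# Invented feature values for two candidate translations of one source term.
candidates = [
    ("post-ménopause", {"target_frequency": 0.4, "dictionary_coverage": 1.0}),
    ("après la ménopause", {"target_frequency": 0.9, "dictionary_coverage": 0.5}),
]
print(top_n(candidates, n=1))  # ['après la ménopause']
```

Pair-wise and list-wise learning-to-rank methods differ from this scheme in that they learn from comparisons between candidates of the same source term rather than from isolated scores.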


