Extraction of domain-specific bilingual lexicon from comparable corpora: compositional translation and ranking

Estelle DELPECH1, Béatrice DAILLE1, Emmanuel MORIN1, Claire LEMAIRE2,3

(1) UNIVERSITÉ DE NANTES – LINA UMR 6241, 2 rue de la Houssinière, BP 92208, 44322 Nantes Cedex 3, France

(2) UNIVERSITÉ STENDHAL – GRENOBLE 3, BP 25, 38040 Grenoble Cedex 9, France

(3) LINGUA ET MACHINA, c/o Inria Rocquencourt BP 105, Le Chesnay Cedex 78153, France

(1){name.surname}@univ-nantes.fr (2){initials}@lingua-et-machina.com


This paper proposes a method for extracting translations of morphologically constructed terms from comparable corpora. The method is based on compositional translation and exploits translation equivalences at the morpheme-level, which allows for the generation of “fertile” translations (translation pairs in which the target term has more words than the source term).

Ranking methods relying on corpus-based and translation-based features are used to select the best candidate translation. We obtain an average precision of 91% on the Top1 candidate translation. The method was tested on two language pairs (English-French and English-German) and with small specialized comparable corpora (400k words per language).









Introduction

Comparable corpora are composed of texts in different languages which are not translations but deal with the same subject matter and were produced in similar situations of communication.

They are used in Computer-Aided Translation to provide technical translators with domain-specific bilingual lexicons when no parallel data is available (e.g. translation memories, multilingual terminologies). This situation arises when translators have to translate texts which deal with emerging technical domains or when the translation is done from or to an under-resourced language. Comparable corpora also have the advantage of containing more idiomatic expressions than parallel corpora do, because the target texts do not bear the influence of the source language.

Indeed, Baker (1996) observed that translated texts tend to bear features like explicitation, simplification, normalization and levelling out. As a consequence, one of the difficulties with comparable corpora is that the translation of a source term may not be present in its “normalized” or “canonical” form but rather in the form of a morphological or paraphrastic variant (e.g. postmenopausal translates to après la ménopause 'after the menopause' instead of postménopausique). Another limitation is that algorithms output, for each source term, a set of candidate translations instead of just one target term. This state of affairs makes it very challenging for translators to use lexicons extracted from comparable corpora in real-life situations (Delpech, 2011).

Increasing the size of the corpus in order to find more translation pairs or to extract parallel segments of text (Fung & Cheung, 2004; Rauf & Schwenk, 2009) is only possible when large amounts of text are available. In the case of the extraction of domain-specific lexicons, we quickly face the problem of data scarcity: in order to extract high-quality lexicons, the corpus must contain text dealing with very specific subject domains, and the target and source texts must be highly comparable. If one tries to increase the size of the corpus, one takes the risk of decreasing its quality by adding out-of-domain texts. Studies support the idea that the quality of the corpus is more important than its size. Morin et al. (2007) show that the discourse categorization of the documents increases the precision of the lexicon despite the data sparsity. Bo & Gaussier (2010) show that the quality of the extracted lexicon improves when the comparability of the corpus is improved by selecting a smaller, but more comparable, corpus from an initial set of documents.

This paper proposes methods for ranking and extracting canonical translations as well as translation variants, with a special focus on the extraction of fertile translations. In parallel text processing, the notion of fertility has been defined by Brown et al. (1993). They defined the fertility of a source word e as the number of target words to which e is connected in a randomly selected alignment. Similarly, we call a fertile translation a translation pair in which the target term has more words than the source term. The identification of fertile translations is useful because (i) they frequently correspond to non-canonical translations, e.g. paraphrastic variants, and (ii) they tend to correspond to vulgarized forms of technical terms (e.g. « cytotoxic » vs. « toxic to the cells »), which are useful when the translator translates lay science texts. Up to now, fertility has received little attention in the field of comparable corpora processing. To our knowledge, only Daille & Morin (2005) and Weller et al. (2011) tried to extract translation pairs of different lengths from comparable corpora. Daille & Morin (2005) focus on the specific case of multi-word terms whose meaning is not compositional and try to align these multi-word terms with either single-word terms or multi-word terms using a context-based approach. Weller et al. (2011) concentrate on translating noun compounds to noun phrases. Similar to the approach presented here, Claveau & Kijak (2011) use translation equivalences between morphemes to generate translations and can handle fertility. However, their method is not suited to comparable corpora since it requires domain-specific parallel data (in their case, a multilingual terminology) to learn alignment probabilities.

Our method is based on compositional translation. We chose this approach because: (i) according to Namer & Baud (2007), compositional terms form a major part of the new terms found in technical and scientific domains, and this is not restricted to the field of biomedicine, as is generally believed; (ii) compositionality-based methods have been shown to clearly outperform context-based ones for the translation of terms with compositional meaning, both in terms of translation accuracy and rank of the correct candidate translation (Morin & Daille, 2010); (iii) we believe that compositionality-based methods offer the opportunity to generate fertile translations if combined with a morphology-based approach.

This method, which we call morpho-compositional translation, consists in: (i) decomposing the source term into morphemes: postmenopause is split into post- + menopause (see footnote 1 for the notation); (ii) translating the morphemes to bound morphemes or fully autonomous words: post- becomes post- or après, menopause becomes ménopause; (iii) recomposing the translated elements into a target term: post-ménopause 'postmenopause', après la ménopause 'after the menopause'. Fertile translations can be generated because we allow bound morphemes to be translated to autonomous lexical items (e.g. prefix post- → preposition après). The proposed ranking methods exploit various corpus-based and translation-based features.
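The three steps can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper: the prefix list, the morpheme translation table and the function names are hypothetical, and the French article in après la ménopause is folded into the table entry as a simplification.

```python
from itertools import product

# Hypothetical toy resources, for illustration only.
PREFIXES = ("post",)
MORPHEME_TRANSLATIONS = {
    "post-": ["post-", "après la"],  # bound morpheme or autonomous words
    "menopause": ["ménopause"],
}


def decompose(term):
    """Step (i): split a source term into a known prefix and a base."""
    for prefix in PREFIXES:
        if term.startswith(prefix) and len(term) > len(prefix):
            return [prefix + "-", term[len(prefix):]]
    return [term]


def recompose(parts):
    """Step (iii): attach bound morphemes (trailing hyphen) to the next
    element; separate autonomous words with a space."""
    out = ""
    for part in parts:
        if not out or out.endswith("-"):
            out += part
        else:
            out += " " + part
    return out


def generate(term):
    """Steps (i) to (iii): return every recomposed candidate translation."""
    morphemes = decompose(term)
    # Step (ii): look up the possible translations of each morpheme.
    options = [MORPHEME_TRANSLATIONS.get(m, []) for m in morphemes]
    return [recompose(combo) for combo in product(*options)]


def is_fertile(source, target):
    """A pair is fertile when the target has more words than the source."""
    return len(target.split()) > len(source.split())


candidates = generate("postmenopause")
fertile = [c for c in candidates if is_fertile("postmenopause", c)]
```

With these toy tables, `candidates` holds post-ménopause and après la ménopause, and only the second is kept in `fertile`.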

This paper is organized in four sections. Section 1 outlines recent research in compositional approaches to bilingual lexicon extraction. Section 2 explains the methods we designed for translation generation and ranking. Section 3 describes our experimental data. Section 4 presents and discusses the results of our experiments.

1 Compositional approaches to bilingual lexicon extraction

The core of compositional translation consists in generating candidate translations following the principle of compositionality: “the meaning of the whole is a function of the meaning of the parts” (Keenan & Faltz, 1985, pp. 24-25). Once the candidate translations have been generated, one generally ranks them and selects the TopN candidate translations. Generation methods are described in section 1.1. Ranking methods are described in section 1.2.

1.1 Generation methods

Compositional translation consists in decomposing the source term into atomic components, translating these components into the target language and recomposing the translated components into target terms. Existing implementations differ on the kind of atomic components they use for translation.

Lexical compositional translation (Baldwin & Tanaka, 2004; Grefenstette, 1999; Morin & Daille, 2009; Robitaille et al., 2006) deals with multi-word term to multi-word term alignment and uses lexical words as atomic components: rate of evaporation is translated into French as taux d'évaporation by translating rate to taux and evaporation to évaporation using dictionary lookup. Recomposition may be done by permuting the translated components (Morin & Daille, 2010) or with translation patterns (Baldwin & Tanaka, 2004).

Footnote 1: We use the following notations: trailing hyphen for prefixes (a-), leading hyphen for suffixes (-a), both for confixes (-a-), no hyphen for autonomous morphemes (a) and a plus sign (+) for intra-word morpheme boundaries. The term confix is borrowed from (Martinet, 1979) and refers to neoclassical (Latin or Ancient Greek) roots.
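The lexical variant, with dictionary lookup followed by recomposition through permutation, can be sketched as follows. This is an assumption-laden illustration and not the code of any of the cited systems: the dictionary, the stop word lists, the target term list and all function names are hypothetical.

```python
from itertools import permutations, product

# Hypothetical toy resources, for illustration only.
DICTIONARY = {"rate": ["taux"], "evaporation": ["évaporation"]}
EN_STOP_WORDS = {"of", "the"}
FR_STOP_WORDS = {"de", "d", "du", "la", "le"}


def content_words(term, stop_words):
    """Tokenize a term and drop function words (elisions are split on ')."""
    tokens = term.replace("'", " ").split()
    return tuple(t for t in tokens if t not in stop_words)


def lexical_candidates(source_term):
    """Translate each lexical word, then permute the translated words."""
    words = content_words(source_term, EN_STOP_WORDS)
    options = [DICTIONARY.get(w, []) for w in words]
    candidates = set()
    for combo in product(*options):
        candidates.update(permutations(combo))
    return candidates


def translate(source_term, target_terms):
    """Keep the target-corpus terms whose content words match a candidate."""
    keys = lexical_candidates(source_term)
    return [t for t in target_terms
            if content_words(t, FR_STOP_WORDS) in keys]


matches = translate("rate of evaporation",
                    ["taux d'évaporation", "taux de change"])
```

Comparing content words only is one simple way to tolerate the function word that appears in taux d'évaporation; translation patterns would instead state the expected target structure explicitly.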

Sublexical compositional translation deals with single-word term translation. The atomic components are subparts of the source single-word term. Cartoni (2009) translates neologisms created by prefixation with a formalism called Bilingual Lexeme Formation Rules. Atomic components are the prefix and the lexical base: the Italian neologism ricostruire 'rebuild' is translated into French as reconstruire by translating the prefix ri- to re- and the lexical base costruire to construire. Weller et al. (2011) translate two types of single-word terms. German single-word terms formed by the concatenation of two neoclassical roots are decomposed into these two roots, then the roots are translated into target language roots and recomposed into an English or French single-word term, e.g. Kalori(1)metrie(2) is translated as calori(1)metry(2), where the indices mark corresponding components. German NOUN1+NOUN2 compounds are translated into French and English NOUN1 NOUN2 or NOUN1 PREP NOUN2 multi-word terms, e.g. Elektronen(N1)-mikroskop(N2) is translated to electron(N1) microscope(N2).

Garera & Yarowsky (2008) translate various compound sequences (NOUN1+NOUN2, ADJ1+NOUN2 …). They generate an English literal gloss of the compounds with the compositional method (for instance, the English gloss for the Albanian word hekurudhë 'railway' is iron path). Then, they search for entries in Lx-to-English dictionaries where the entry in language Lx is a word-to-word translation of the English gloss (e.g. iron path matches the German entry Eisenbahn and the Italian entry ferrovia). The final candidate translations are the fluent English translations proposed by the bilingual dictionaries (e.g. Eisenbahn and ferrovia both translate to railway; railway is considered as a potential translation for hekurudhë).
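The gloss-based pivot lookup can be sketched as below. The data structures are hypothetical and heavily simplified: the component glosses and the literal glosses of the dictionary entries are given directly, whereas the original work derives them compositionally. The segmentation of hekurudhë is also an assumption made for the example.

```python
from collections import Counter

# Hypothetical toy resources, for illustration only.
# Component -> English gloss (the segmentation is assumed).
COMPONENT_GLOSSES = {"hekur": "iron", "udhë": "path"}
# Language -> {entry: (literal English gloss, fluent English translation)}
PIVOT_DICTIONARIES = {
    "German": {"Eisenbahn": ("iron path", "railway")},
    "Italian": {"ferrovia": ("iron path", "railway")},
}


def literal_gloss(components):
    """Build the word-for-word English gloss of a compound."""
    return " ".join(COMPONENT_GLOSSES[c] for c in components)


def pivot_candidates(components):
    """Collect fluent translations of entries sharing the literal gloss,
    counting how many languages support each one."""
    gloss = literal_gloss(components)
    votes = Counter()
    for entries in PIVOT_DICTIONARIES.values():
        # A set ensures each language votes at most once per translation.
        fluent = {f for (g, f) in entries.values() if g == gloss}
        votes.update(fluent)
    return votes.most_common()


result = pivot_candidates(["hekur", "udhë"])
```

Here `result` pairs railway with a count of 2, since both the German and the Italian entries share the gloss iron path.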

1.2 Ranking and selection methods

Generally, compositional translation generates several possible translations for one source term. One has to find a way to rank the translations from the most to the least reliable. Garera & Yarowsky (2008) tried two ranking methods: (i) a probability score P based on the number of different languages exhibiting the association between the literal gloss and the fluent translation; (ii) the probability score P combined with the similarity of the source and target words' contexts using context-based methods like in the work of Rapp (1995) and Fung (1997). Robitaille et al. (2006) extract translation pairs from a corpus built by querying a search engine with a set of seed translation pairs. They select the candidate translations which are semantically related to the target seed terms. The semantic similarity measure is based on the number of hits containing the seed term and/or the candidate translation (Jaccard coefficient). Other works simply select the candidate translations which occur in the target corpus (Weller et al., 2011; Morin and Daille, 2010) or which are significantly attested on the Web (Cartoni, 2009).
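As an illustration of a hit-count-based Jaccard coefficient, the sketch below computes the ratio of hits containing both terms to hits containing at least one of them. The function names, the selection threshold and the hit counts are hypothetical, and the exact formulation used by Robitaille et al. (2006) may differ.

```python
def jaccard_from_hits(hits_seed, hits_candidate, hits_both):
    """Jaccard coefficient: hits with both terms / hits with either term."""
    union = hits_seed + hits_candidate - hits_both
    return hits_both / union if union > 0 else 0.0


def select(candidates, threshold):
    """Keep the candidates whose similarity to the seed term reaches the
    threshold. `candidates` maps a candidate translation to a triple
    (hits_seed, hits_candidate, hits_both)."""
    return [term for term, hits in candidates.items()
            if jaccard_from_hits(*hits) >= threshold]


# Made-up hit counts, for illustration only.
example = {
    "candidate A": (120, 80, 40),   # 40 / (120 + 80 - 40) = 0.25
    "candidate B": (120, 500, 6),   # 6 / (120 + 500 - 6), about 0.01
}
kept = select(example, threshold=0.1)
```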

Only Baldwin and Tanaka (2004) use machine learning. They train an SVM classifier with corpus-based, dictionary-based and translation pattern-based features and use the value returned by the classifier (a continuous value between -1 and +1) to rank the candidate translations. Their approach is equivalent to a point-wise approach in learning-to-rank. To our knowledge, no research work has investigated the possible contribution of advanced learning-to-rank algorithms to candidate translation ranking. Learning-to-rank algorithms are widely used in Information Retrieval for ranking documents from the most to the least relevant to a given query (Li, 2011). They can be easily ported to the problem of ranking the candidate translations of a source term.
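The point-wise setting can be sketched as follows: each candidate is scored independently of the others, and the candidates of a source term are then sorted by score. The linear scoring function with hand-set weights is only a stand-in for the decision value of a trained classifier; the feature names, weights and values are hypothetical.

```python
def score(features, weights, bias=0.0):
    """Linear score of one candidate, computed independently of the others."""
    return sum(weights[name] * value
               for name, value in features.items()) + bias


def rank(candidates, weights):
    """Sort candidate translations from the highest to the lowest score."""
    return sorted(candidates,
                  key=lambda c: score(c["features"], weights),
                  reverse=True)


# Hypothetical weights and feature values, for illustration only.
weights = {"in_target_corpus": 0.5, "dictionary_coverage": 1.0}
candidates = [
    {"term": "évaporation taux",
     "features": {"in_target_corpus": 0.0, "dictionary_coverage": 1.0}},
    {"term": "taux d'évaporation",
     "features": {"in_target_corpus": 1.0, "dictionary_coverage": 1.0}},
]
ranked = [c["term"] for c in rank(candidates, weights)]
```

Pair-wise and list-wise learning-to-rank algorithms differ in that they learn from the relative order of the candidates of a same source term rather than from each candidate in isolation.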

