NeoTrack: semiautomatic neologism detection
NeoTrack is a web-based tool for the semiautomatic detection of neologisms in electronic corpora. NeoTrack was developed for the Observatório de Néologia de Português (ONP) to allow the daily observation of two major newspapers (Diário de Notícias and Público) for the occurrence of new words. This article describes the working of the NeoTrack application, its integration with the MorDebe database, and the criteria used in its application by the ONP.
1. Introduction
NeoTrack is a web-based tool for the semiautomatic detection of neologisms in electronic corpora. NeoTrack was developed for the Observatório de Néologia de Português (ONP) by the Instituto de Linguística Teórica e Computacional (ILTEC) to allow the daily observation of two major newspapers (Diário de Notícias and Público) for the occurrence of new words.
The disadvantage of computer-aided, corpus-based neologism research is that computer tools can only find formal neologisms (or, in the case of NeoTrack, orthographic neologisms; see section 4.1). This is because without semantic analysis it is impossible to tell what the words mean, and hence whether they are used in a new meaning. The advantage of computer-aided research, however, is not just that it saves a lot of time but, more importantly, that it provides the means to establish (relatively) objective criteria for what a neologism is. Without computers, it is virtually impossible to determine which words in a given text are really new: words may sound new without actually being so, or sound familiar although they have never occurred in any text before. This is why Rey (1975), writing in the pre-computer era, said that labelling a word neologistic is no more than the expression of a subjective sentiment.
With computer-aided corpus research, it becomes possible to establish which words are really new by comparing the new text with all the text in a reference corpus. This makes it possible to find not just words that are completely newly created and also feel new, such as pen-drive, but also words from the potential lexicon that have recently become actualised. An example is the word actor-chave, a predictable word that is not in the dictionary and has only recently come into actual use (according to the criteria of the ONP, described in section 4.2).
Because of its more objective character, computer-aided neologism research can be used for more than just updating dictionaries. The analysis of the neologisms obtained with NeoTrack gives an impression of the dynamics of the Portuguese language: which processes are most frequently used for the creation of new words, which languages most often supply new loanwords, which suffixes are most productive in new words, etc.
This article describes the NeoTrack application: its design and user interface, and the way NeoTrack is integrated with the MorDebe database. Alongside this, the article describes the criteria used in the application of NeoTrack by the ONP.
2. NeoTrack Design
NeoTrack is a lightweight tool for the observation of neologisms using a method of exclusion-based neologism candidate extraction. Under this method, a word in a text is possibly a neologism (a neologism candidate) when it does not appear in a list of previously known words, called the exclusion list. Neologism candidate extraction is a semiautomatic process: the computer is used to generate a list of all possible neologisms, but it is up to a human user to decide whether these neologism candidates are indeed neologisms or false candidates. Although this latter step could in principle be made automatic, fully automatic neologism extraction is more commonly fully statistics-based, without the intervention of a neologism candidate list.
The way neologism candidate extraction is implemented in NeoTrack is illustrated in figure (1): to extract all the neologism candidates from a given text (corpus file), the system first creates a list of all the unique words occurring in that text (corpus words) by tokenising the cleaned-up version of the corpus file (corpus text). This list is then compared with a list of known words (exclusion list) to render a list of all the words the system does not recognise: the neologism candidates.
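The extraction step described above can be sketched in a few lines of Python. This is only an illustration of the exclusion principle, not NeoTrack's actual code: the tokenising pattern, the lower-casing of word forms, and the names corpus_words and neologism_candidates are all assumptions made for the example.

```python
import re

# Hypothetical tokeniser: runs of letters, optionally joined by hyphens.
# NeoTrack's real tokenisation rules are not specified here.
WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def corpus_words(corpus_text):
    """Return the set of unique, lower-cased word forms in the cleaned-up text."""
    return {match.group(0).lower() for match in WORD_RE.finditer(corpus_text)}


def neologism_candidates(corpus_text, exclusion_list):
    """Return, sorted, the corpus words that are absent from the exclusion list."""
    return sorted(corpus_words(corpus_text) - set(exclusion_list))


# Toy exclusion list; the real one holds all inflected forms of all known words.
exclusion_list = {"o", "é", "um", "novo", "actor", "chave"}
text = "O actor-chave é um novo actor."
print(neologism_candidates(text, exclusion_list))  # ['actor-chave']
```

Because the comparison is a plain set difference, its cost depends on the number of unique word forms in the text, not on the number of tokens.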
Figure 1: NeoTrack flow-chart
The exclusion list in NeoTrack is created from a morphological database called MorDebe, which in turn is created partially from lexicographic sources (see section 2.1). The exclusion list does not contain only the citation forms of all the words, but also all their inflectional forms.
NeoTrack not only extracts the lists of neologism candidates, but also features a user-friendly interface to separate the neologisms from the false candidates. False candidates are those unknown words that are not neologistic, either because they are existing words that were missing from the exclusion list, or because they are strings that should not be counted as words: proper names, typographic errors, etc. The user interface (see section 3) is fully web-based and can be accessed via any Internet browser. The use of a server-based system allows the linguists to work from any computer they want, even allowing neologism observation from an Internet café. This is not merely a convenience: it allows researchers from different institutes, and even from different countries speaking the same language, to cooperate in a single project, working with the same neologism database.
2.1 Integration with MorDebe
NeoTrack is integrated with a morphological database called MorDebe (Janssen 2005a; Janssen 2005b). MorDebe is a large-scale lexical resource which contains a large number of correct Portuguese words, including all their inflected forms. MorDebe is an online service that works as an orthographic guide, a verb dictionary, and an inverse dictionary, with a rich set of search options. The design of the database is language-independent, but only data for Portuguese are available for the moment. The aim of MorDebe is not to provide as many words as possible, but to provide a lexicographically controlled lexicon with manual verification at every point. The database started with a semi-automatic inflection of the lemmas of the Porto Editora dictionary, but has since been updated with words from various sources, including the CETEMPublico corpus, the Academia and Houaiss dictionaries, and the NeoTrack research. At this moment, the database contains well over 125,000 lexical entries for Portuguese, with an emphasis on the European variant of Portuguese.
MorDebe is not just used in NeoTrack: it was originally conceived for the purpose of the ONP neologism observatory with NeoTrack, and the two systems are fully integrated. On the one hand, the exclusion list used in NeoTrack is created on the fly from the MorDebe database: just before subtracting the exclusion list from the corpus words to create the neologism candidates, the exclusion list is (optionally) updated with all the word-forms in the MorDebe database, so that the most recently added words are excluded as well. In this way, the observation of neologisms speeds up with the growth of MorDebe, because the number of neologism candidates will diminish.
On the other hand, NeoTrack is used as one of the methods to keep MorDebe up to date: when a linguist using NeoTrack encounters a word that is not a neologism, but an existing correct word that was somehow missing from MorDebe, that word can be added to MorDebe directly from the NeoTrack interface (after corpus verification). Also, the MorDebe database is periodically updated with all those words from the neologism database created by NeoTrack that turn out not to be occasionalisms.
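The two-way link between the lexical database and the exclusion list can be illustrated with a toy model. The sketch below is not MorDebe's schema or API: the class LexicalDatabase, its methods, and the sample words are hypothetical and serve only to show why rebuilding the exclusion list on the fly makes newly added entries disappear from later candidate lists.

```python
class LexicalDatabase:
    """Toy stand-in for a morphological database: citation form -> inflected forms."""

    def __init__(self):
        self.entries = {}

    def add_entry(self, citation_form, inflected_forms):
        # Store the whole paradigm, always including the citation form itself.
        self.entries[citation_form] = set(inflected_forms) | {citation_form}

    def exclusion_list(self):
        # Built on the fly, so entries added a moment ago are already included.
        return set().union(*self.entries.values())


def candidates(words, database):
    """Subtract the freshly built exclusion list from the corpus words."""
    return sorted(set(words) - database.exclusion_list())


db = LexicalDatabase()
db.add_entry("jornal", ["jornais"])
words = ["jornais", "blogue", "blogues"]
print(candidates(words, db))  # ['blogue', 'blogues']

# A linguist decides that "blogue" is an existing word and adds its paradigm.
db.add_entry("blogue", ["blogues"])
print(candidates(words, db))  # []
```

Adding the whole paradigm rather than the single attested form is what removes both "blogue" and "blogues" from the next extraction.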
3. User Interface
The NeoTrack user interface is divided into three major parts: the management of source files, the neologism candidate sorting, and the neologism database itself. This section gives a brief overview of the design of these three parts.
3.1. File Management
NeoTrack is a web-based system, which means that all files to be processed need to be uploaded to the server first. Processing a corpus file therefore happens in two steps: in the first step, the file is uploaded from the local computer to the server and stored in the list of files to be processed; in the second step, the corpus file is analysed and the list of neologism candidates is extracted. The final step of this process leads to a list of files in progress, as shown in figure (2). Each file is shown with its source and the number of neologism candidates encountered in the text, with an indication of how many candidates have yet to be processed. With each file, as with all other data in the system, NeoTrack keeps track of which user added the file, and when he/she did so.
Figure 2: File Management
For each corpus in progress, it is possible to start/continue the process of neologism candidate sorting (see next section), or view the list of all the candidates of that corpus together with their status: open, or an indication of the action performed on the candidate. To get more information about the corpora, it is also possible to view the frequency distribution list of all token words in the different corpora, or view the original HTML file.
In the design of NeoTrack, the comparison between the input text and the exclusion list is done only once. This means that a word added to the exclusion list after the candidate list has been created will not affect existing candidate lists. Therefore, it is advisable to keep a file in the unanalysed state until it is actually going to be treated.
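The consequence of comparing only once can be made concrete with a small sketch. The CorpusFile record and its analyse method are hypothetical names, not part of NeoTrack; the point is only that a candidate list is a snapshot taken at analysis time.

```python
from dataclasses import dataclass, field


@dataclass
class CorpusFile:
    """Hypothetical record for an uploaded corpus file."""
    name: str
    words: set
    analysed: bool = False
    candidates: list = field(default_factory=list)

    def analyse(self, exclusion_list):
        # The comparison runs once; the result is stored and never recomputed.
        if not self.analysed:
            self.candidates = sorted(self.words - set(exclusion_list))
            self.analysed = True
        return self.candidates


exclusion = {"jornal"}
early = CorpusFile("early.html", {"jornal", "blogue"})
late = CorpusFile("late.html", {"jornal", "blogue"})

print(early.analyse(exclusion))  # ['blogue']
exclusion.add("blogue")          # the word becomes known afterwards
print(late.analyse(exclusion))   # []
print(early.analyse(exclusion))  # ['blogue'] (the earlier snapshot is unchanged)
```

The file analysed later benefits from the enlarged exclusion list, which is why postponing analysis reduces the number of false candidates to sort.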
3.2. Neologism candidate sorting
Figure 3: Candidate sorting
The main window for the manual sorting of neologisms is shown in figure (3), with circles added to indicate the main components. Every neologism candidate carries with it the spelling of the neologistic form, as well as the source in which it was encountered. This information is shown under (1). Next to that is an indication of the number of candidates in that source that have not been processed yet. To decide whether a candidate is a neologism, the original context is shown under (5), where clicking on the line number will display the original HTML file to see the entire context. If the candidate appears more than once, multiple context lines are displayed.
The main purpose of the sorting window is to allow the user to decide whether the neologism candidate is indeed a neologism or not. When the candidate is a neologism, the relevant data about that neologism can be entered under (2) – the citation form, syntactic category, its typography, and neologism type. The context in which the neologism occurred is automatically selected – but can be edited when the context is longer or shorter than desired. When the same candidate appears various times in the same source, the context of the first occurrence is selected. When validated as a neologism under (2) the candidate will be put in the neologism database, with all the associated data.
When the candidate is not a neologism but a false candidate, it will not be stored in the neologism database, and can be discarded. There are several reasons for rejecting a candidate, which are shown under (4): the candidate can be a typographic error or a proper name; it can be part of a foreign-language quotation; or it can be something which is not a word, such as an e-mail address, a code, etc. All the buttons under (4) have the same effect, but the motivation for rejecting the candidate is kept on file so that the information can be used later, for instance to select all proper names. It is also possible to postpone a specific candidate until later in case there is some doubt about it.
Finally, the candidate can also be non-neologistic because it is an existing correct word, but just one that was not yet on the exclusion list. In that case, the word is not only removed from the candidate list, but added to the exclusion list so that it will not show up as a neologism candidate again. Since MorDebe is used for the creation of the exclusion list in NeoTrack, the word can be directly added to MorDebe under (3) – by indicating citation form and word class. Clicking on ‘Add’ will open the MorDebe administration page, where not just the particular form occurring in the source, but the entire inflectional paradigm of the word will be added to the MorDebe database. To decide whether a word is new or old, it is necessary to consult reference corpora. Therefore, under (6) are some quick links to look up the candidate in some on-line corpora.
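The possible outcomes of sorting a candidate can be summarised as a small data model. The sketch below is an illustration under assumptions: the Decision values paraphrase the options described in this section, while the names Decision, Candidate, and sort_candidate, the sample forms, and the sample file names are hypothetical and do not reflect NeoTrack's internal representation.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    OPEN = "open"
    NEOLOGISM = "neologism"
    TYPO = "typographic error"
    PROPER_NAME = "proper name"
    FOREIGN = "foreign-language quotation"
    NON_WORD = "not a word"
    EXISTING_WORD = "existing word"
    POSTPONED = "postponed"


@dataclass
class Candidate:
    form: str
    source: str
    decision: Decision = Decision.OPEN


def sort_candidate(candidate, decision, neologism_db, exclusion_list):
    """Record the decision and apply its side effect, if any."""
    candidate.decision = decision
    if decision is Decision.NEOLOGISM:
        neologism_db.append(candidate)       # stored with its associated data
    elif decision is Decision.EXISTING_WORD:
        exclusion_list.add(candidate.form)   # will not be proposed again
    # All rejection reasons behave alike; only the recorded motivation differs.


neologism_db, exclusion_list = [], set()
items = [Candidate("pen-drive", "file1.html"),
         Candidate("Saramago", "file1.html"),
         Candidate("jornais", "file2.html")]
sort_candidate(items[0], Decision.NEOLOGISM, neologism_db, exclusion_list)
sort_candidate(items[1], Decision.PROPER_NAME, neologism_db, exclusion_list)
sort_candidate(items[2], Decision.EXISTING_WORD, neologism_db, exclusion_list)

# The stored motivation makes later selections possible, e.g. all proper names.
print([c.form for c in items if c.decision is Decision.PROPER_NAME])  # ['Saramago']
```

Keeping the reason as a value on the candidate, rather than deleting rejected candidates, is what allows such selections afterwards.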
3.3. Neologism database
In the neologism database section, it is possible to view and edit all the neologisms already stored in the neologism database. An example from the ONP neologism list is shown in figure (4).
Figure 4: Neologism database listing
Each candidate is shown with the source it was encountered in, and the person who treated the neologism. By clicking on view it is possible to see all the data associated with an individual neologism. It is also possible to search the neologism database on all the various fields, or edit erroneous items in the neologism database.
4. Identifying Neologisms
An important aspect of the detection of neologisms is a proper specification of which words count as neologisms. There are various ways of defining what a neologism is; the definition used by the ONP is called the extended lexicographic diachronic criterion (Janssen, unpublished). This criterion is a hybrid of the traditional lexicographic criterion and the corpus-based criterion (Cabré, 1992).
On the one hand the hybrid criterion is dictionary based in the sense that it uses dictionaries for its exclusion list: any word appearing in the dictionary is not a neologism. The dictionaries used for this purpose by the ONP are the Porto Editora, Houaiss, and Academia dictionaries (see 4.2). Rather than using the dictionaries directly, the system is based on a morphological database explicitly listing all inflected forms of all the lemmas, information often left implicit in dictionaries.