NeoTrack: semiautomatic neologism detection

Maarten Janssen

ILTEC, Lisboa


Abstract

NeoTrack is a web-based tool for the semiautomatic detection of neologisms in electronic corpora. NeoTrack was developed for the Observatório de Néologia de Português (ONP) to allow the daily observation of two major newspapers (Diário de Notícias and Público) for the occurrence of new words. This article describes the working of the NeoTrack application, its integration with the MorDebe database, and the criteria used in its application by the ONP.

1. Introduction

NeoTrack is a web-based tool for the semiautomatic detection of neologisms in electronic corpora. NeoTrack was developed for the Observatório de Néologia de Português (ONP) by the Instituto de Linguística Teórica e Computacional (ILTEC) to allow the daily observation of two major newspapers (Diário de Notícias and Público) for the occurrence of new words.

The disadvantage of computer-aided corpus-based neologism research is that computer tools are only capable of finding formal neologisms (or, in the case of NeoTrack, orthographic neologisms; see section 4.1). This is because without semantic analysis it is impossible to tell the meaning of the words, and hence whether words are used in a new meaning. But the advantage of computer-aided research is not just that it saves a lot of time, but more importantly that it provides the means to establish (relatively) objective criteria about what a neologism is. Without the use of computers, it is virtually impossible to determine which words in a given text are really new: words may sound new without actually being so, or sound familiar although they have never occurred in any text before. This is why Rey (1975), in the pre-computer era, said that to label a word neologistic is no more than the expression of a subjective sentiment.

With the use of computer-aided corpus research, it becomes possible to establish which words are really new by comparing the new text to the collection of all the texts in a reference corpus. This makes it possible to find not just words that are completely newly created and also feel new, such as pen-drive, but also words from the potential lexicon that have recently become actualised. An example is the word actor-chave, which is a predictable word, not in the dictionary, which has recently come into actual use (according to the criteria of the ONP, described in section 4.2).

Because of its more objective character, computer-aided neologism research can be used for more than just updating dictionaries: the analysis of the neologisms obtained with NeoTrack gives an impression of the dynamics of the Portuguese language: which processes are most frequently used for the creation of new words, which languages are mostly used for new loanwords, which suffixes are most productive in new words, etc.

This article describes the NeoTrack application: its design and user interface, and the way NeoTrack is integrated with the MorDebe database. Alongside this, the article describes the criteria used in the application of NeoTrack by the ONP.

2. NeoTrack Design

NeoTrack is a light-weight tool for the observation of neologisms using a method of exclusion-based neologism candidate extraction. This method says that a word in a text is possibly a neologism (a neologism candidate) when it does not appear in a list of previously known words, called the exclusion list. Neologism candidate extraction is a semiautomatic process: the computer is used to generate a list of all possible neologisms, but it is up to a human user to decide whether these neologism candidates are indeed neologisms or false candidates. Although this latter step could in principle be made automatic, fully automatic neologism extraction is more commonly fully statistics-based, without the intervention of a neologism candidate list.

The way neologism candidate extraction is implemented in NeoTrack is illustrated in figure (1): to extract all the neologism candidates from a given text (corpus file), the system first creates a list of all the unique words occurring in that text (corpus words) by tokenising the cleaned-up version of the corpus file (corpus text). This list is then compared with a list of known words (exclusion list) to render a list of all the words the system does not recognise: the neologism candidates.
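
The extraction step described above can be sketched in a few lines of Python. This is only a minimal illustration, not NeoTrack's actual code: the function names, the tokenisation pattern, and the lower-casing of all words are assumptions made for the example (a real system would probably treat capitalisation more carefully, for instance to recognise proper names).

```python
import re

# Hypothetical tokeniser: runs of letters, optionally joined by hyphens.
# NeoTrack's real tokenisation rules are not specified in this article.
WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def corpus_words(corpus_text):
    """Return the set of unique (lower-cased) words in the cleaned-up text."""
    return {match.group(0).lower() for match in WORD_RE.finditer(corpus_text)}


def neologism_candidates(corpus_text, exclusion_list):
    """Return the words of the text that are absent from the exclusion list."""
    return corpus_words(corpus_text) - set(exclusion_list)


# Toy exclusion list and sentence, invented for illustration.
exclusion = {"o", "é", "um", "novo", "importante"}
text = "O pen-drive é um novo actor-chave importante."
print(sorted(neologism_candidates(text, exclusion)))
# ['actor-chave', 'pen-drive']
```

Using sets keeps the comparison a plain set difference, which mirrors the description: every unique corpus word that the exclusion list does not contain becomes a candidate for manual inspection.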

Figure 1: NeoTrack flow-chart

The exclusion list in NeoTrack is created from a morphological database called MorDebe, which in turn is created partially from lexicographic sources (see section 2.1). The exclusion list does not contain only the citation forms of all the words, but also all their inflectional forms.

NeoTrack not only extracts the lists of neologism candidates, but also features a user-friendly interface to split the neologisms from the false candidates. False candidates are those unknown words that are not neologistic, either because they are existing words that were missing from the exclusion list, or because they are strings that should not be counted as words: proper names, typographic errors, etc. The user interface (see section 3) is fully web-based and can be accessed via any Internet browser. The use of a server-based system allows the linguists to work from any computer they want, even allowing neologism observation from an Internet café. This is not merely a convenience: it allows researchers of different institutes, and even of different countries speaking the same language, to cooperate in a single project, working with the same neologism database.


2.1 Integration with MorDebe

NeoTrack is integrated with a morphological database called MorDebe (Janssen 2005a; Janssen 2005b). MorDebe is a large-scale lexical resource which contains a large number of correct Portuguese words, including all their inflected forms. MorDebe is an online service that works as an orthographic guide, a verb dictionary, and an inverse dictionary, with a rich set of search options. The design of the database is language-independent, but only data for Portuguese are available for the moment. The aim of MorDebe is not to provide as many words as possible, but to provide a lexicographically controlled lexicon with manual verification at every point. The database started with a semi-automatic inflection of the lemmas of the Porto Editora dictionary, but has since been updated with words from various sources including the CETEMPublico corpus, the Academia and Houaiss dictionaries and the NeoTrack research. At this moment, the database contains well over 125,000 lexical entries for Portuguese, with an emphasis on the European variant of Portuguese.

MorDebe is not just used in NeoTrack, but was originally conceived for the purpose of the ONP neologism observatory with NeoTrack, and the two systems are fully integrated. On the one hand, the exclusion list used in NeoTrack is created on the fly from the MorDebe database: just before the exclusion list is subtracted from the corpus words to create the neologism candidates, the exclusion list is (optionally) updated with all the word-forms in the MorDebe database, so that the most recently added words are excluded as well. In this way, the observation of neologisms speeds up with the growth of MorDebe, because the number of neologism candidates will diminish.
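
As a rough illustration of how an exclusion list containing all inflected forms might be derived from a morphological lexicon, consider the sketch below. The `LexicalEntry` class and `build_exclusion_list` function are hypothetical stand-ins invented for this example; they do not reflect MorDebe's actual schema or interface, and the two sample entries are abbreviated toy paradigms.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalEntry:
    """Hypothetical stand-in for a lexicon entry: a lemma plus inflected forms."""
    citation_form: str
    word_class: str
    inflected_forms: set = field(default_factory=set)


def build_exclusion_list(entries):
    """Collect every citation form and every inflected form into one set."""
    forms = set()
    for entry in entries:
        forms.add(entry.citation_form)
        forms.update(entry.inflected_forms)
    return forms


# Toy lexicon with partial paradigms, invented for illustration.
lexicon = [
    LexicalEntry("casa", "noun", {"casa", "casas"}),
    LexicalEntry("falar", "verb", {"falo", "falas", "fala", "falamos", "falam"}),
]
exclusion = build_exclusion_list(lexicon)
print("falamos" in exclusion, "blogues" in exclusion)  # True False

# Rebuilding the list after the lexicon has grown excludes the new forms too.
lexicon.append(LexicalEntry("blogue", "noun", {"blogue", "blogues"}))
print("blogues" in build_exclusion_list(lexicon))  # True
```

Rebuilding the set on each run, rather than caching it, is one simple way to obtain the on-the-fly behaviour described above: any word added to the lexicon automatically disappears from later candidate lists.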

On the other hand, NeoTrack is used as one of the methods to keep MorDebe up-to-date: when a linguist using NeoTrack encounters a word that is not a neologism, but an existing correct word that was somehow missing from MorDebe, that word can be directly added to MorDebe from the interface of NeoTrack (after corpus verification). Also, the MorDebe database is periodically updated with all those words from the neologism database created by NeoTrack that turn out not to be occasionalisms.

3. User Interface

The NeoTrack user interface is divided into three major parts: the management of source files, the neologism candidate sorting, and the neologism database itself. This section gives a brief overview of the design of these three parts.

3.1. File Management

NeoTrack is a web-based system, which means that all files to be processed need to be uploaded to the server first. Processing a corpus file therefore happens in two steps: in the first step, the file is uploaded from the local computer to the server, and stored in the list of files to be processed. In the second step, the corpus file is analysed, and the list of neologism candidates is extracted. The final step of this process leads to a list of files in progress, as shown in figure (2). Each file is shown with its source and the number of neologism candidates encountered in the text, with an indication of how many candidates have yet to be processed. With each file, as with all other data in the system, NeoTrack keeps track of which user added the file, and when he/she did so.

Figure 2: File Management

For each corpus in progress, it is possible to start/continue the process of neologism candidate sorting (see next section), or view the list of all the candidates of that corpus together with their status: open, or an indication of the action performed on the candidate. To get more information about the corpora, it is also possible to view the frequency distribution list of all token words in the different corpora, or view the original HTML file.

In the design of NeoTrack, the comparison between the input text and the exclusion list is done only once. This means that a word added to the exclusion list after the candidate list has been created will not affect existing candidate lists. Therefore, it is advisable to keep a file in the unanalysed state until it is actually being treated.
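
The consequence of this one-off comparison can be made concrete with a small sketch. The `CorpusFile` class below is purely hypothetical and only models the behaviour described in this section: the candidate list is computed once, at analysis time, and is not recomputed when the exclusion list changes afterwards.

```python
class CorpusFile:
    """Hypothetical model of an uploaded file and its one-off analysis."""

    def __init__(self, words):
        self.words = set(words)
        self.candidates = None  # None means: uploaded but not yet analysed

    def analyse(self, exclusion_list):
        # The comparison is made once; the result is stored, not recomputed.
        if self.candidates is None:
            self.candidates = self.words - set(exclusion_list)
        return self.candidates


exclusion = {"a", "casa"}
early = CorpusFile({"a", "casa", "blogue"})
late = CorpusFile({"a", "casa", "blogue"})

early.analyse(exclusion)   # analysed before the word is added
exclusion.add("blogue")    # the word enters the exclusion list afterwards
late.analyse(exclusion)    # analysed after the update

print(early.candidates)  # {'blogue'}
print(late.candidates)   # set()
```

The file that was analysed early still lists the word as a candidate, whereas the file left unanalysed until later benefits from the updated exclusion list, which is why postponing the analysis saves manual work.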

3.2. Neologism candidate sorting

Figure 3: Candidate sorting

The main window for the manual sorting of neologisms is shown in figure (3), with circles added to indicate the main components. Every neologism candidate carries with it the spelling of the neologistic form, as well as the source in which it was encountered. This information is shown under (1). Next to that is an indication of the number of candidates in that source that have not been processed yet. To decide whether a candidate is a neologism, the original context is shown under (5), where clicking on the line number will display the original HTML file to see the entire context. If the candidate appears more than once, multiple context lines are displayed.

The main purpose of the sorting window is to allow the user to decide whether the neologism candidate is indeed a neologism or not. When the candidate is a neologism, the relevant data about that neologism can be entered under (2) – the citation form, syntactic category, its typography, and neologism type. The context in which the neologism occurred is automatically selected – but can be edited when the context is longer or shorter than desired. When the same candidate appears various times in the same source, the context of the first occurrence is selected. When validated as a neologism under (2) the candidate will be put in the neologism database, with all the associated data.

When the candidate is not a neologism but a false candidate, it will not be stored in the neologism database, and can be discarded. There are several reasons for rejecting a candidate as a neologism, which are shown under (4): the candidate can be a typographic error, or a proper name. It can be a part of a foreign-language quotation, or it can be something which is not a word, such as an e-mail address, a code, etc. All the buttons under (4) do the same, but the motivation for rejecting a candidate is kept on file, so that the information can be used later, for instance to select all proper names. It is also possible to postpone a specific candidate until later in case there is some doubt about it.

Finally, the candidate can also be non-neologistic because it is an existing correct word, but just one that was not yet on the exclusion list. In that case, the word is not only removed from the candidate list, but added to the exclusion list so that it will not show up as a neologism candidate again. Since MorDebe is used for the creation of the exclusion list in NeoTrack, the word can be directly added to MorDebe under (3) – by indicating citation form and word class. Clicking on ‘Add’ will open the MorDebe administration page, where not just the particular form occurring in the source, but the entire inflectional paradigm of the word will be added to the MorDebe database. To decide whether a word is new or old, it is necessary to consult reference corpora. Therefore, under (6) are some quick links to look up the candidate in some on-line corpora.
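
The possible outcomes of the sorting step can be summarised in a small sketch. The `Decision` labels and the `SortingSession` class are assumptions made for illustration and are not taken from NeoTrack itself; the sketch only mirrors the behaviour described above: every decision is recorded together with its motivation, validated neologisms are stored with their data, and existing words are added to the exclusion list.

```python
from enum import Enum


class Decision(Enum):
    """Possible states of a candidate; the labels are illustrative only."""
    OPEN = "open"
    POSTPONED = "postponed"
    NEOLOGISM = "neologism"
    EXISTING_WORD = "existing word"
    TYPO = "typographic error"
    PROPER_NAME = "proper name"
    FOREIGN = "foreign-language quotation"
    NON_WORD = "not a word"


class SortingSession:
    """Hypothetical bookkeeping for the manual sorting of one corpus file."""

    def __init__(self, candidates, exclusion_list):
        self.status = {c: Decision.OPEN for c in candidates}
        self.exclusion_list = set(exclusion_list)
        self.neologisms = {}

    def decide(self, candidate, decision, **data):
        # Every decision is kept on file, including the reason for a rejection.
        self.status[candidate] = decision
        if decision is Decision.NEOLOGISM:
            self.neologisms[candidate] = data
        elif decision is Decision.EXISTING_WORD:
            self.exclusion_list.add(candidate)

    def with_status(self, decision):
        """Return all candidates that currently carry the given decision."""
        return sorted(c for c, d in self.status.items() if d is decision)


# Toy candidates, invented for illustration.
session = SortingSession(["pen-drive", "Lisboa", "teh", "blogue"], {"casa"})
session.decide("pen-drive", Decision.NEOLOGISM,
               citation_form="pen-drive", category="noun")
session.decide("Lisboa", Decision.PROPER_NAME)
session.decide("teh", Decision.TYPO)
session.decide("blogue", Decision.EXISTING_WORD)

print(session.with_status(Decision.PROPER_NAME))  # ['Lisboa']
print("blogue" in session.exclusion_list)         # True
print(session.with_status(Decision.OPEN))         # []
```

Storing the rejection reason as a status, rather than simply deleting the candidate, is what makes later queries such as "select all proper names" possible.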

3.3. Neologism database

In the neologism database section, it is possible to view and edit all the neologisms already stored in the neologism database. An example from the ONP neologism list is shown in figure (4).

Figure 4: Neologism database listing

Each neologism is shown with the source it was encountered in, and the person who treated it. By clicking on view it is possible to see all the data associated with an individual neologism. It is also possible to search the neologism database on all the various fields, or edit erroneous items in the neologism database.

4. Identifying Neologisms

An important aspect of the detection of neologisms is a proper specification of which words do count as neologisms. Although there are various ways of defining what a neologism is, the definition used by the ONP is called the extended lexicographic diachronic criterion (Janssen, unpublished). This criterion is a hybrid of the traditional lexicographic criterion and the corpus-based criterion (Cabré, 1992).

On the one hand the hybrid criterion is dictionary based in the sense that it uses dictionaries for its exclusion list: any word appearing in the dictionary is not a neologism. The dictionaries used for this purpose by the ONP are the Porto Editora, Houaiss, and Academia dictionaries (see 4.2). Rather than using the dictionaries directly, the system is based on a morphological database explicitly listing all inflected forms of all the lemmas, information often left implicit in dictionaries.
