
LEXICAL BUNDLES IN L1 AND L2 ACADEMIC WRITING


June 2010, Volume 14, Number 2

Language Learning & Technology

http://llt.msu.edu/vol14num2/chenbaker.pdf pp. 30–49


Yu-Hua Chen and Paul Baker

Lancaster University

This paper adopts an automated frequency-driven approach to identify frequently-used word combinations (i.e., lexical bundles) in academic writing. Lexical bundles retrieved from one corpus of published academic texts and two corpora of student academic writing (one L1, the other L2) were investigated both quantitatively and qualitatively. Published academic writing was found to exhibit the widest range of lexical bundles whereas L2 student writing showed the smallest range. Furthermore, some high-frequency expressions in published texts, such as “in the context of,” were underused in both student corpora, while the L2 student writers overused certain expressions (e.g., “all over the world”) which native academics rarely used. The findings drawn from structural and functional analyses of lexical bundles also have some pedagogical implications.


“Phraseology” (Granger & Meunier, 2008; Meunier & Granger, 2007) and “formulaic sequences/language” (Schmitt, 2004; Wray, 2002, 2008) are two umbrella terms often used to refer to various types of multi-word units. In recent years, an increasing number of studies have made use of corpus data to add weight to the importance of multi-word units in language. For instance, Altenberg (1998), in his exploration of the London-Lund Corpus, estimated that 80% of the words in the corpus formed part of recurrent word combinations. As Wray (2002, p. 9) observes, however, there is a “problem of terminology” when describing word co-occurrence. On the one hand, the same term might be used in different ways by different scholars; on the other hand, various terms are used to refer to similar or even the same notion of word co-occurrence. Some examples of such terms include clusters (Hyland, 2008a; Schmitt, Grandage & Adolphs, 2004; also used in the corpus tool WordSmith), recurrent word combinations (Altenberg, 1998; De Cock, 1998), phrasicon (De Cock, Granger, Leech, & McEnery, 1998), n-grams (Stubbs, 2007a, 2007b) and lexical bundles (e.g., Biber & Barbieri, 2007; Cortes, 2002).

These terms—clusters, phrasicon, n-grams, recurrent word combinations, lexical bundles—actually refer to continuous word sequences retrieved by taking a corpus-driven approach with specified frequency and distribution criteria. The retrieved recurrent sequences are fixed multi-word units that have customary pragmatic and/or discourse functions, used and recognized by the speakers of a language within certain contexts. This methodology is considered to be a frequency-based approach for determining phraseology (see Granger & Paquot, 2008).

From a psycholinguistic viewpoint, formulaic language has been found to have “a processing advantage over creatively generated language” for non-native as well as native speakers (Conklin & Schmitt, 2008, p. 72), although different psycholinguistic studies have used various types of formulaic language, such as idioms (e.g., “take the bull by the horns”) or non-idiomatic phrases (e.g., “as soon as”), as the target forms. A particularly inspirational study was conducted by Jiang and Nekrasova (2007), in which they utilized corpus-derived recurrent word combinations as materials in two online grammaticality-judgment experiments. Their findings provide “prevailing evidence in support of the holistic nature of formula representation and processing in second language speakers” (Jiang & Nekrasova, 2007, p. 433). Schmitt et al. (2004) also investigated the psycholinguistic validity of corpus-derived recurrent clusters, and their study shares some similarities with that of Jiang and Nekrasova (2007).

In a series of lexical bundle studies conducted by Biber and colleagues (Biber & Barbieri, 2007; Biber & Conrad, 1999; Biber, Conrad, & Cortes, 2003, 2004; Biber, Johansson, Leech, Conrad, & Finegan, 1999), it was found that conversation and academic prose present distinctive distribution patterns of lexical bundles. For example, most bundles in conversation are clausal, whereas most bundles in academic prose are phrasal. Other studies of bundles have focused primarily on comparisons between expert and nonexpert writing. Cortes (2002) investigated bundles in native freshman compositions and found that the bundles used by these novice writers were functionally different from those in published academic prose.

In another study, Cortes (2004) compared native student writing with that in academic journals, concluding that students rarely used the lexical bundles identified in the corpus of published writing. Even if they did, the students used these bundles in a different manner. Working with academic writing only, Hyland (2008b) indicated that there was disciplinary variation in the use of lexical bundles. He also investigated the role of lexical bundles in published academic prose and in postgraduate writing and found that postgraduate students tended to employ more formulaic expressions than native academics in order to display their competence (Hyland, 2008a).

To date, only a few studies of L2 written data have performed structural and functional categorization of lexical bundles. Although Hyland, in his two studies (2008a, 2008b), included master’s theses and doctoral dissertations produced by L2 English students in Hong Kong, he did not begin from a perspective of second-language learning. Instead, he treated L2 postgraduate writing as “highly proficient,” on the grounds that all the texts in his corpus had been awarded high passes. Drawing on the previous research, the present study aims to compare the use of recurrent word combinations in native-speaker and non-native-speaker academic writing in order to reveal potential problems in second language learning. Quantitative and qualitative analyses were carried out on three corpora in order to identify similarities and differences in recurrent word combinations at different levels of writing proficiency. One corpus (the L2 or learner corpus) contained writing from L1 Chinese learners of L2 English, while the other two comprised L1 writing: one from academics (whom we term “expert” writers) and the other from university students (who are similar in background to the L1 Chinese learners, aside from their first language). Lexical bundles is adopted as the primary term throughout this study, as it is used by Biber in a series of studies upon which the theoretical and analytical framework of the current study is based. Another term, recurrent word combination, is also used interchangeably, given its transparent literal meaning.


Data

Two existing corpora are used in the present study: the Freiburg-Lancaster-Oslo/Bergen (FLOB) corpus, and the British Academic Written English (BAWE) corpus. To ensure comparability, only part of each corpus was selected for investigation. The FLOB corpus is a one-million-word corpus of written British English from the early 1990s, comprising fifteen genre categories. For the current study, only the category of academic prose, FLOB-J, was used to represent native expert writing. FLOB-J contains eighty 2,000-word excerpts from published academic texts, retrieved from journals or book sections. With regard to L1 and L2 student academic writing, parts of the BAWE corpus were utilized. The BAWE corpus, released in 2008, contains approximately 3,000 pieces (approx. 6.5 million words) of proficient assessed student writing from British universities. Two subcorpora were selected from the BAWE corpus: BAWE-CH contains essays produced by L1 Chinese students of L2 English, and BAWE-EN is a comparable dataset contributed by peer L1 English students. FLOB-J, BAWE-CH and BAWE-EN cover a wide range of disciplines, including arts and humanities, life sciences, physical sciences and social sciences (for BAWE, see Alsop & Nesi, 2009; for FLOB, see Hundt, Sand & Siemund, 1998). The size of each finalized corpus for investigation is around 150,000 words (see Table 1).


Table 1. Constituents of the Three Academic Corpora

Representation          Corpus    Word count   Average length of text   No. of texts
Native expert writing   FLOB-J    164,742      2,059                    80
Native peer writing     BAWE-EN   155,781      2,596                    60
Learner writing         BAWE-CH   146,872      2,771                    53

Operationalization

Several key criteria have been pinpointed in the literature regarding how to generate a list of lexical bundles using automated corpus tools. The first criterion is the cut-off frequency, which determines the number of lexical bundles to be included in the analysis. The normalized frequency threshold for large written corpora generally ranges between 20 and 40 per million words (e.g., Biber et al., 2004; Hyland, 2008b), while for relatively small spoken corpora, a raw cut-off frequency is often used, ranging from 2 upward (e.g., Altenberg, 1998; De Cock, 1998). The second criterion is the requirement that combinations have to occur in different texts, usually in at least 3-5 texts (e.g., Biber & Barbieri, 2007; Cortes, 2004), or 10% of texts (e.g., Hyland, 2008a), which helps to avoid idiosyncrasies from individual writers/speakers.

The last issue concerns the length of word combinations, usually 2-, 3-, 4-, 5-, or 6-word units. Four-word sequences are found to be the most researched length for writing studies, probably because the number of 4-word bundles is often within a manageable size (around 100) for manual categorization and concordance checks. The frequency and dispersion thresholds adopted vary from study to study, and even the sizes of corpora and subcorpora differ drastically, ranging from around 40,000 to over 5 million words.
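The three criteria above (cut-off frequency, dispersion across texts, and sequence length) can be illustrated with a minimal Python sketch. It is not the procedure of any particular corpus tool: the function name `extract_bundles`, its parameters, and the naive whitespace tokenization are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


def extract_bundles(texts, n, min_freq, min_texts):
    """Return {bundle: (frequency, number_of_texts)} for every n-word
    sequence that meets both the frequency and the dispersion threshold.

    `texts` is a list of strings, one per corpus text. Tokenization is a
    plain lowercase whitespace split, which is an illustrative assumption.
    """
    freq = Counter()               # total occurrences of each n-word sequence
    text_ids = defaultdict(set)    # which texts each sequence occurs in
    for text_id, text in enumerate(texts):
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            freq[gram] += 1
            text_ids[gram].add(text_id)
    return {
        gram: (count, len(text_ids[gram]))
        for gram, count in freq.items()
        if count >= min_freq and len(text_ids[gram]) >= min_texts
    }


# Toy data (hypothetical, three very short "texts")
toy_texts = [
    "as a result of the change",
    "as a result of the test",
    "this is as a result of it",
]
print(extract_bundles(toy_texts, n=4, min_freq=3, min_texts=3))
```

With the toy data, only one sequence occurs three times across three different texts; sequences such as "a result of the" occur twice and are filtered out by the frequency threshold.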

After repeated experiments with the corpus data under investigation, the frequency and distribution thresholds for determining 4-word lexical bundles were set to 4 times or more (approximately 25 times per million words on average), occurring in at least three texts. This resulted in an “optimum” number of bundles, which was considered sufficiently representative of the corpora being examined. One might argue that an identical standardized threshold, such as 20 or 40 times per million words, should be applied to each of the corpora investigated, as generally reported in the literature. However, when a normalized rate is converted to raw frequencies, it substantially affects the number of generated word combinations when comparing corpora of various sizes. For instance, if we compare an 80,000-word corpus with a 40,000-word corpus with a cut-off standardized frequency set at 40 times per million words, it means that the converted raw-frequency threshold for the larger corpus is 3.2, whereas the converted raw-frequency threshold for the smaller corpus is much lower, at 1.6. Any decimals have to be rounded up or down in order to function as an operational cut-off frequency. Yet rounding down 3.2 to 3 results in a normalized rate of 37.5 whereas rounding up 1.6 to 2 generates a normalized rate of 50, both of which are different from the originally reported frequency threshold of 40 times per million words. Reporting only the standardized frequency criterion could therefore be misleading, because a standardized cut-off frequency would inevitably lose its expected impartiality after being converted into raw frequencies corresponding to different corpus sizes. In this study, it could be argued that both the raw cut-off frequency and corresponding normalized frequency should be reported in order to reflect transparently the threshold adopted. 
For the sake of comparison, if the frequency threshold is set at 25 times per million words for the present study, the converted raw frequencies for each corpus are 3.7, 3.9 and 4.1 times respectively, which are all rounded up or down to 4 (cf. Table 2 and Table 3).
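The arithmetic behind this argument can be checked with a short Python sketch. The word counts are those given in Table 1; the function names `raw_threshold` and `normalized_rate` are hypothetical helpers introduced only for this illustration.

```python
def raw_threshold(per_million, corpus_size):
    """Convert a normalized threshold (per million words) to a raw frequency."""
    return per_million * corpus_size / 1_000_000


def normalized_rate(raw_count, corpus_size):
    """Convert a raw frequency back to a rate per million words."""
    return raw_count * 1_000_000 / corpus_size


# The 80,000-word vs. 40,000-word example at 40 per million words
for size in (80_000, 40_000):
    exact = raw_threshold(40, size)
    cutoff = round(exact)  # operational cut-off must be a whole number
    print(size, exact, cutoff, normalized_rate(cutoff, size))

# Word counts from Table 1, with a threshold of 25 per million words
CORPUS_SIZES = {"FLOB-J": 164_742, "BAWE-EN": 155_781, "BAWE-CH": 146_872}
for name, size in CORPUS_SIZES.items():
    exact = raw_threshold(25, size)
    cutoff = round(exact)
    print(name, round(exact, 1), cutoff, round(normalized_rate(cutoff, size), 1))
```

For all three corpora the rounded cut-off comes out at 4, while the rate actually implied by a raw cut-off of 4 differs from corpus to corpus (roughly 24 to 27 per million words), which is why reporting both figures is more transparent.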


After automatic retrieval of 4-word clusters using the corpus tool WordSmith 4.0 (Scott, 2007), word sequences containing content words that were present in the essay questions (e.g., “financial and non financial”), or any other context-dependent bundles, usually incorporating proper nouns (e.g., “in the UK and,” “the Second World War”), were manually excluded from the extracted bundle lists. It was also found that overlapping word sequences could inflate the results of quantitative analysis. Overlaps were thus checked manually via concordance analyses. Two major types of overlaps are discussed here. One is “complete overlap,” referring to two 4-word bundles which are actually derived from a single 5-word combination. For example, “it has been suggested” and “has been suggested that” both occur six times, coming from the longer expression “it has been suggested that.” The other type of overlap is “complete subsumption,” referring to a situation where two or more 4-word bundles overlap and the occurrences of one of the bundles subsume those of the other overlapping bundle(s). For example, “as a result of” occurs 17 times, while “a result of the” occurs five times, and every occurrence of the latter falls within the 5-word bundle “as a result of the.” Each case of the above overlapping word sequences (12 cases in total) was combined into one longer unit so as to guard against inflated results.
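The two overlap types can be approximated from frequency counts alone, as in the following Python sketch. This is only a rough stand-in for the manual concordance checks described above: the function `classify_overlap`, the label "other", and the 5-word counts in the example are assumptions chosen to be consistent with the frequencies reported in the text.

```python
def classify_overlap(first, second, four_counts, five_counts):
    """Classify how two 4-word bundles overlap, using frequency counts only.

    `first` must be the bundle that starts one word earlier than `second`.
    Returns "complete overlap", "complete subsumption", "other", or "none".
    """
    a, b = first.split(), second.split()
    # The bundles must share three words: last 3 of `first` = first 3 of `second`.
    if a[1:] != b[:-1]:
        return "none"
    five = " ".join(a + b[-1:])          # the 5-word sequence that joins them
    n5 = five_counts.get(five, 0)
    na, nb = four_counts[first], four_counts[second]
    if n5 == 0:
        return "none"
    if n5 == na == nb:
        # Every occurrence of both bundles comes from the 5-word sequence.
        return "complete overlap"
    if n5 == min(na, nb):
        # Every occurrence of the rarer bundle lies inside the 5-word sequence.
        return "complete subsumption"
    return "other"


# 4-word frequencies as reported in the text; 5-word frequencies are assumed.
four_counts = {
    "it has been suggested": 6,
    "has been suggested that": 6,
    "as a result of": 17,
    "a result of the": 5,
}
five_counts = {"it has been suggested that": 6, "as a result of the": 5}

print(classify_overlap("it has been suggested", "has been suggested that",
                       four_counts, five_counts))
print(classify_overlap("as a result of", "a result of the",
                       four_counts, five_counts))
```

A count-based check of this kind can only flag candidates; whether the overlapping bundles should actually be merged still depends on inspecting the concordance lines.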

A further potential problem when comparing bundles across corpora involves what is actually counted (i.e., type/token distinction). Should we count the number of types of bundles (e.g., counting “as a result of” and “it is possible to” each as one type of bundle), or should we count the total occurrence of bundles (e.g., “as a result of” might occur 20 times in one corpus and 50 times in another)? One corpus could exhibit a very narrow range of bundles but have very high frequencies of them, while another might have the opposite pattern. We therefore distinguished between different types of bundles (types) and frequencies of bundles (tokens).1 The numbers of bundle types and tokens, before and after data refinement, including removing context-dependent bundles and overlapping ones, are shown in Table 4 below.
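The type/token distinction can be made concrete with a small Python sketch. The bundle frequencies below are hypothetical and serve only to show how two corpora can differ in opposite directions on the two measures; `types_and_tokens` is an illustrative helper name.

```python
def types_and_tokens(bundle_counts):
    """Return (number of bundle types, number of bundle tokens).

    `bundle_counts` maps each distinct bundle to its frequency in a corpus.
    """
    return len(bundle_counts), sum(bundle_counts.values())


# Hypothetical corpus with a narrow range of bundles used very often
narrow = {"as a result of": 50, "it is possible to": 30}

# Hypothetical corpus with a wider range of bundles used less often
wide = {
    "as a result of": 20,
    "it is possible to": 10,
    "in the context of": 8,
    "on the other hand": 7,
}

print(types_and_tokens(narrow))  # fewer types, more tokens
print(types_and_tokens(wide))    # more types, fewer tokens
```

Reporting both figures side by side avoids concluding that a corpus is "richer" in bundles when it merely repeats a few of them.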


