Language Learning & Technology
June 2010, Volume 14, Number 2, pp. 30–49
http://llt.msu.edu/vol14num2/chenbaker.pdf

LEXICAL BUNDLES IN L1 AND L2 ACADEMIC WRITING

Yu-Hua Chen and Paul Baker
Lancaster University
This paper adopts an automated frequency-driven approach to identify frequently used word combinations (i.e., lexical bundles) in academic writing. Lexical bundles retrieved from one corpus of published academic texts and two corpora of student academic writing (one L1, the other L2) were investigated both quantitatively and qualitatively. Published academic writing was found to exhibit the widest range of lexical bundles, whereas L2 student writing showed the smallest range. Furthermore, some high-frequency expressions in published texts, such as in the context of, were underused in both student corpora, while the L2 student writers overused certain expressions (e.g., all over the world) which native academics rarely used. The findings drawn from structural and functional analyses of lexical bundles also have some pedagogical implications.
A range of terms has been used to describe such recurrent sequences, including clusters (Schmitt, Grandage, & Adolphs, 2004; also the term used in the corpus tool WordSmith), recurrent word combinations (Altenberg, 1998; De Cock, 1998), phrasicon (De Cock, Granger, Leech, & McEnery, 1998), n-grams (Stubbs, 2007a, 2007b), and lexical bundles (e.g., Biber & Barbieri, 2007; Cortes, 2002).
These terms—clusters, phrasicon, n-grams, recurrent word combinations, lexical bundles—actually refer to continuous word sequences retrieved by taking a corpus-driven approach with specified frequency and distribution criteria. The retrieved recurrent sequences are fixed multi-word units that have customary pragmatic and/or discourse functions, used and recognized by the speakers of a language within certain contexts. This methodology is considered to be a frequency-based approach for determining phraseology (see Granger & Paquot, 2008).
From a psycholinguistic viewpoint, formulaic language has been found to have “a processing advantage over creatively generated language” for non-native as well as native speakers (Conklin & Schmitt, 2008, p. 72), although different psycholinguistic studies have used various types of formulaic language as target forms, such as idioms (e.g., take the bull by the horns) or non-idiomatic phrases (e.g., as soon as). A particularly inspirational study was conducted by Jiang and Nekrasova (2007), who utilized corpus-derived recurrent word combinations as materials in two online grammaticality-judgment experiments. Their findings provide “prevailing evidence in support of the holistic nature of formula representation and processing in second language speakers” (Jiang & Nekrasova, 2007, p. 433). Schmitt et al. (2004) also investigated the psycholinguistic validity of corpus-derived recurrent clusters, reporting findings that share some similarities with those of Jiang and Nekrasova (2007).
In a series of lexical bundle studies conducted by Biber and colleagues (Biber & Barbieri, 2007; Biber & Conrad, 1999; Biber, Conrad, & Cortes, 2003, 2004; Biber, Johansson, Leech, Conrad, & Finegan, 1999), it was found that conversation and academic prose present distinctive distribution patterns of lexical bundles. For example, most bundles in conversation are clausal, whereas most bundles in academic prose are phrasal. Other studies of bundles have focused primarily on comparisons between expert and non-expert writing. Cortes (2002) investigated bundles in native freshman compositions and found that the bundles used by these novice writers were functionally different from those in published academic prose.
In another study, Cortes (2004) compared native student writing with that in academic journals, concluding that students rarely used the lexical bundles identified in the corpus of published writing. Even if they did, the students used these bundles in a different manner. Working with academic writing only, Hyland (2008b) indicated that there was disciplinary variation in the use of lexical bundles. He also investigated the role of lexical bundles in published academic prose and in postgraduate writing and found that postgraduate students tended to employ more formulaic expressions than native academics in order to display their competence (Hyland, 2008a).
To date, only a few studies of L2 written data have performed structural and functional categorization of lexical bundles. Although Hyland, in his two studies (2008a, 2008b), included master’s theses and doctoral dissertations produced by L2 English students in Hong Kong, he did not begin from a second-language learning perspective. Instead, he treated L2 postgraduate writing as “highly proficient,” on the grounds that all the texts in his corpus had been awarded high passes. Drawing on this previous research, the present study compares the use of recurrent word combinations in native-speaker and non-native-speaker academic writing in order to reveal potential problems in second language learning. Quantitative and qualitative analyses were carried out on three corpora in order to identify similarities and differences in recurrent word combinations at different levels of writing proficiency. One corpus (the L2 or learner corpus) contained writing from L1 Chinese learners of L2 English, while the other two comprised L1 writing: one from academics (whom we term “expert” writers) and the other from university students (who are similar in background to the L1 Chinese learners, aside from their first language). Lexical bundles is adopted as the primary term throughout this study, as it is the term used by Biber in the series of studies upon which the theoretical and analytical framework of the current study is based. Another term, recurrent word combination, is used interchangeably, given its transparent literal meaning.
DATA AND METHODOLOGY

Data

Two existing corpora are used in the present study: the Freiburg-Lancaster-Oslo/Bergen (FLOB) corpus and the British Academic Written English (BAWE) corpus. To ensure comparability, only part of each corpus was selected for investigation. The FLOB corpus is a one-million-word corpus of written British English from the early 1990s, comprising fifteen genre categories. For the current study, only the category of academic prose, FLOB-J, was used to represent native expert writing. FLOB-J contains eighty 2,000-word excerpts from published academic texts, retrieved from journals or book sections. With regard to L1 and L2 student academic writing, parts of the BAWE corpus were utilized. The BAWE corpus, released in 2008, contains approximately 3,000 pieces (approximately 6.5 million words) of proficient assessed student writing from British universities. Two subcorpora were selected from the BAWE corpus: BAWE-CH contains essays produced by L1 Chinese students of L2 English, and BAWE-EN is a comparable dataset contributed by peer L1 English students. FLOB-J, BAWE-EN, and BAWE-CH cover a wide range of disciplines, including arts and humanities, life sciences, physical sciences, and social sciences (for BAWE, see Alsop & Nesi, 2009; for FLOB, see Hundt, Sand, & Siemund, 1998). The size of each finalized corpus is around 150,000 words (see Table 1).
Table 1. Constituents of the Three Academic Corpora

Representation          Corpus    Word count   Average length of text   No. of texts
Native expert writing   FLOB-J    164,742      2,059                    80
Native peer writing     BAWE-EN   155,781      2,596                    60
Learner writing         BAWE-CH   146,872      2,771                    53

Operationalization

Several key criteria have been pinpointed in the literature regarding how to generate a list of lexical bundles using automated corpus tools. The first criterion is the cut-off frequency, which determines the number of lexical bundles to be included in the analysis. The normalized frequency threshold for large written corpora generally ranges between 20 and 40 occurrences per million words (e.g., Biber et al., 2004; Hyland, 2008b), while for relatively small spoken corpora, a raw cut-off frequency is often used instead, sometimes as low as 2 (e.g., Altenberg, 1998; De Cock, 1998). The second criterion is the requirement that combinations occur in different texts, usually in at least 3-5 texts (e.g., Biber & Barbieri, 2007; Cortes, 2004) or in 10% of texts (e.g., Hyland, 2008a), which helps to avoid idiosyncrasies of individual writers/speakers.
The last issue concerns the length of word combinations, usually 2-, 3-, 4-, 5-, or 6-word units. Four-word sequences are the most frequently researched length in writing studies, probably because the number of 4-word bundles is often of a manageable size (around 100) for manual categorization and concordance checks. The frequency and dispersion thresholds adopted vary from study to study, and even the sizes of corpora and subcorpora differ drastically, ranging from around 40,000 to over 5 million words.
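To make these criteria concrete, the following is a minimal sketch of frequency-driven bundle retrieval, assuming each corpus is available as a list of pre-tokenized texts (lists of lowercased word tokens) and that both a raw frequency cut-off and a dispersion (number-of-texts) threshold apply. The function name extract_bundles and the data layout are our own illustration, not the procedure of any particular corpus tool.

```python
from collections import Counter

def extract_bundles(texts, n=4, min_freq=4, min_texts=3):
    """Return {bundle: frequency} for n-grams meeting both thresholds."""
    freq = Counter()        # total occurrences across the corpus
    dispersion = Counter()  # number of distinct texts containing each n-gram
    for tokens in texts:
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        freq.update(ngrams)
        dispersion.update(set(ngrams))  # count each n-gram once per text
    return {" ".join(g): f for g, f in freq.items()
            if f >= min_freq and dispersion[g] >= min_texts}

# e.g., bundles occurring at least 4 times and in at least 3 texts:
# bundles = extract_bundles(corpus_texts, min_freq=4, min_texts=3)
```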
After repeated experiments with the corpus data under investigation, the frequency and distribution thresholds for determining 4-word lexical bundles were set to 4 occurrences or more (approximately 25 times per million words on average), occurring in at least three texts. This resulted in an “optimum” number of bundles, which was considered sufficiently representative of the corpora being examined. One might argue that an identical standardized threshold, such as 20 or 40 times per million words, should be applied to each of the corpora investigated, as generally reported in the literature. However, when a normalized rate is converted to raw frequencies, it substantially affects the number of generated word combinations when comparing corpora of various sizes. For instance, if we compare an 80,000-word corpus with a 40,000-word corpus using a standardized cut-off frequency of 40 times per million words, the converted raw-frequency threshold for the larger corpus is 3.2, whereas that for the smaller corpus is much lower, at 1.6. Any decimals have to be rounded up or down in order to function as an operational cut-off frequency. Yet rounding 3.2 down to 3 results in a normalized rate of 37.5 per million words, whereas rounding 1.6 up to 2 generates a normalized rate of 50 per million words, both of which differ from the originally reported frequency threshold of 40 times per million words. Reporting only the standardized frequency criterion could therefore be misleading, because a standardized cut-off frequency inevitably loses its expected impartiality after being converted into raw frequencies corresponding to different corpus sizes. We would therefore argue that both the raw cut-off frequency and the corresponding normalized frequency should be reported in order to reflect the adopted threshold transparently. For the sake of comparison, if the frequency threshold is set at 25 times per million words for the present study, the converted raw frequencies for the three corpora are 3.7, 3.9, and 4.1 respectively, all of which round to 4 (cf. Table 2 and Table 3).
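As a quick check of this arithmetic, the short script below (a sketch; the helper raw_threshold is our own) converts a normalized rate into raw cut-offs for corpora of different sizes, then recovers the effective normalized rate implied by the rounded raw value, reproducing the figures cited above.

```python
# Convert a normalized rate (occurrences per million words) into a raw
# cut-off for a given corpus size, then recover the effective normalized
# rate implied by the rounded raw value.

def raw_threshold(rate_pmw, corpus_size):
    return rate_pmw * corpus_size / 1_000_000

for size in (80_000, 40_000):
    raw = raw_threshold(40, size)            # 3.2 and 1.6
    rounded = round(raw)                     # 3 and 2
    effective = rounded / size * 1_000_000   # 37.5 and 50.0 per million words
    print(size, raw, rounded, effective)

# The three corpora in this study, at 25 per million words:
for size in (146_872, 155_781, 164_742):
    print(round(raw_threshold(25, size), 1))  # 3.7, 3.9, 4.1 -> all round to 4
```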
After automatic retrieval of 4-word clusters using the corpus tool WordSmith 4.0 (Scott, 2007), word sequences containing content words present in the essay questions (e.g., financial and non financial), or any other context-dependent bundles, usually incorporating proper nouns (e.g., in the UK and, the Second World War), were manually excluded from the extracted bundle lists. It was also found that overlapping word sequences could inflate the results of the quantitative analysis, so overlaps were checked manually via concordance analyses. Two major types of overlap are discussed here. One is “complete overlap,” in which two 4-word bundles are actually derived from a single 5-word combination. For example, it has been suggested and has been suggested that both occur six times, in every case coming from the longer expression it has been suggested that. The other type is “complete subsumption,” in which two or more 4-word bundles overlap and the occurrences of one bundle subsume those of the other overlapping bundle(s). For example, as a result of occurs 17 times, while a result of the occurs five times, and all five of the latter occurrences fall within the 5-word sequence as a result of the. Each of the above overlapping word sequences (12 cases in total) was combined into one longer unit so as to guard against inflated results.
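The “complete overlap” case can be illustrated programmatically. The sketch below, assuming frequency dictionaries for the 4-word and 5-word bundles have already been generated, merges two 4-word bundles into their parent 5-word unit whenever both occur exactly as often as that 5-word combination; complete subsumption, by contrast, was resolved through manual concordance checks in this study and is not automated here. The function name merge_complete_overlaps is our own.

```python
# Merge "complete overlaps": two 4-word bundles whose occurrences all come
# from one 5-word combination are replaced by that single 5-word unit.

def merge_complete_overlaps(four_grams, five_grams):
    merged = dict(four_grams)
    for five, f5 in five_grams.items():
        words = five.split()
        left = " ".join(words[:4])    # e.g., "it has been suggested"
        right = " ".join(words[1:])   # e.g., "has been suggested that"
        # Complete overlap: both 4-grams occur exactly as often as the 5-gram
        if merged.get(left) == f5 and merged.get(right) == f5:
            del merged[left], merged[right]
            merged[five] = f5
    return merged
```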
A further potential problem when comparing bundles across corpora involves what is actually counted (i.e., the type/token distinction). Should we count the number of types of bundles (e.g., counting as a result of and it is possible to as one type each), or should we count the total occurrences of bundles (e.g., as a result of might occur 20 times in one corpus and 50 times in another)? One corpus could exhibit a very narrow range of bundles but use them at very high frequencies, while another might show the opposite pattern. We therefore distinguished between different types of bundles (types) and frequencies of bundles (tokens).1 The numbers of bundle types and tokens, before and after data refinement (i.e., removing context-dependent and overlapping bundles), are shown in Table 4 below.
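Given a bundle-frequency mapping of the kind produced above, the two counts fall out directly; the following one-liner (our own illustration) makes the distinction explicit.

```python
# Given a {bundle: frequency} mapping for one corpus, report the number of
# distinct bundles (types) and their total occurrences (tokens).

def type_token_counts(bundles):
    return len(bundles), sum(bundles.values())

# type_token_counts({"as a result of": 20, "it is possible to": 7}) -> (2, 27)
```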