Mattias Heldner

Jens Edlund

Kornel Laskowski

Antoine Pelcé

1 Introduction

In this study, we describe the range of prosodic variation observed in two types of dialogue context, using fully automatic methods. The first type of context is that of speaker-changes, or transitions from only one participant speaking to only the other, involving either acoustic silence or acoustic overlap. The second type of context comprises mutual silence or overlap where a speaker change could in principle occur but does not. For lack of a better term, we will refer to these contexts as non-speaker-changes. More specifically, we investigate F0 patterns in the intervals immediately preceding overlaps and silences, in order to assess whether prosody before overlaps or silences may invite or inhibit speaker change.

Previous work indicates that a number of prosodic and phonetic features are associated with speaker-changes and non-speaker-changes. With respect to F0 patterns, several studies have suggested that rising as well as falling pitch patterns are correlates of speaker-changes (e.g. Local & Kelly, 1986; Local, Kelly, & Wells, 1986; Ogden, 2001), and similarly that flat F0 patterns in the middle of a speaker’s pitch range are correlates of non-speaker-changes (e.g. Caspers, 2003; Duncan, 1972; Koiso, Horiuchi, Tutiya, Ichikawa, & Den, 1998; Local & Kelly, 1986; Ogden, 2001; Selting, 1996). Furthermore, stretches of low F0 have been reported to invite backchannels in overlap as well as following silence (Ward & Tsukahara, 2000); flat intonation has also been reported to act as an inhibitory cue for backchannels (Noguchi & Den, 1998).

A fundamental problem in exploring prosody in dialogue lies in identifying locations at which prosody may turn out to be salient. Much prior work has relied on the concepts of turns and floors, and thereby on manual or sufficiently accurate automatic detection of punctuation, disfluencies, and dialogue act types. Frequently in naturally occurring dialogue, these concepts are ill-defined. In previous work of our own, we investigated to what extent speaker-changes and non-speaker-changes can be predicted from a very limited number of F0 pattern types (Edlund & Heldner, 2005), as well as from a direct representation of F0 variation (Laskowski, Edlund, & Heldner, 2008a, 2008b; Laskowski, Wölfel, Heldner, & Edlund, 2008), at locations dictated by low-level characterizations of the interactive state of the dialogue. In the present study, we take one step back to instead describe the range of diversity in F0 patterns occurring immediately before mutual silences or intervals of overlapping speech. We operationalize the annotation of these transitions using a standard finite state automaton over joint speech activity states. We then extract pitch variation features for these transition types and construct descriptive models to characterize them. An important contribution of this work is the visualization of these models, yielding an end-to-end methodology for zero-manual-effort analysis of pitch variation, conditioned on interactive dialogue context.

2 Methods

2.1 Materials

We used speech material from the Swedish Map Task Corpus (Helgason, 2006), designed as a Swedish counterpart to the HCRC Map Task Corpus (Anderson et al., 1991). The map task is a cooperative task involving two speakers, intended to elicit natural spontaneous dialogues. Each of the two speakers has one map which the other speaker cannot see. One of the speakers, the instruction giver (g), has a route marked on his or her map. The other speaker, the instruction follower (f), has no such route. The two maps are not identical and the subjects are explicitly told that the maps differ, but not how. The task is to reproduce the giver’s route on the follower’s map ("The design of the HCRC Map Task Corpus," n.d.).

Eight speakers, five females and three males, are represented in the corpus. The speakers formed four pairs, three female-male pairs and one female-female pair. Each speaker acted as instruction giver and follower at least once, and no speaker occurred in more than one pair. The corpus includes ten such dialogues, the total duration of which is approximately 2 hours and 18 minutes. The dialogues were recorded in an anechoic room, using close-talking microphones, with the subjects facing away from each other, and with acceptable acoustic separation of the speaker channels.

2.2 Procedures

The procedures involved defining, identifying and classifying instances of the two context types, extracting F0 patterns immediately before these, and summarizing and visualizing them. In this section, we outline and motivate how this was done.

2.2.1 Identifying interaction state transitions

As mentioned in the introduction, naturally occurring human-human dialogue contains a significant number of phenomena, such as backchannels, disfluencies, and cross-channel disruptions, which make it difficult to condition prosodic extraction on objectively defined syntactic or semantic boundaries. To address this problem, we limit ourselves to boundaries in conversation flow, defined by the relative timing of talkspurt deployment by the two parties. We annotate every instant in a dialogue with an explicit interaction state label; states describe the joint vocal activity of both speakers, building on a tradition of computational models of interaction (e.g. Brady, 1968; Dabbs & Ruback, 1984; Jaffe & Feldstein, 1970; Norwine & Murphy, 1938; Sellen, 1995). We note that, importantly, each participant’s vocal activity is a binary variable, such that for example backchannel speech (Yngve, 1970) is not treated differently from other speech. We use the resulting conversation state labels to identify state transitions which define the end of the target intervals at which we subsequently extract prosodic features. The procedure involves three steps, as depicted in Figure 1.

First, we perform vocal activity detection, individually for each speaker, using the VADER voice activity detector from the CMU Sphinx Project ("The CMU Sphinx Group Open Source Speech Recognition Engines," n.d.). This results in the labeling of each instant, for each speaker, as either SPEECH or SILENCE.

Figure 1. Illustration of how between-speaker silences (BSS), between-speaker overlaps (BSO), within-speaker silences (WSS), and within-speaker overlaps (WSO) are defined and classified, as well as how the target intervals (TI) are located with respect to these. The illustration shows all three steps (as in the text) from the perspectives of both g and f.

Second, at each instant, the states of the two speakers are combined to derive a four-class label of the communicative state of the conversation, describing both speakers’ activity, from the point of view of each speaker. The four states we consider include SELF, OTHER, NONE and BOTH. For example, from the point of view of the instruction giver g, the state is SELF if g is speaking and the instruction follower f is not; it is OTHER if g is silent and f is speaking, NONE if neither speaker is speaking, and BOTH if both are. The process of defining communicative states from the point of view of speaker f is similar; we illustrate this process for both speakers in the middle panel of Figure 1.
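As an illustration of this second step, the following minimal sketch combines two per-frame SPEECH/SILENCE tracks into the four-class labels. It is not the authors' implementation; the function name `interaction_states`, the boolean encoding of vocal activity, and the toy frame sequences are assumptions made for the example.

```python
from typing import List


def interaction_states(self_vad: List[bool], other_vad: List[bool]) -> List[str]:
    """Label each frame SELF, OTHER, NONE or BOTH from the first speaker's point of view.

    True means SPEECH and False means SILENCE (an assumed encoding).
    """
    if len(self_vad) != len(other_vad):
        raise ValueError("both tracks must cover the same frames")
    labels = []
    for mine, theirs in zip(self_vad, other_vad):
        if mine and theirs:
            labels.append("BOTH")
        elif mine:
            labels.append("SELF")
        elif theirs:
            labels.append("OTHER")
        else:
            labels.append("NONE")
    return labels


# Toy vocal activity tracks for the giver (g) and the follower (f).
g = [True, True, False, False, True, True]
f = [False, False, False, True, True, False]

from_g = interaction_states(g, f)  # states as seen by g
from_f = interaction_states(f, g)  # the same frames as seen by f
```

Swapping the argument order gives the other speaker's perspective: NONE and BOTH are identical in the two views, while SELF and OTHER trade places.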

Finally, in a third step (comprising a third pass over the data, for illustration purposes), the NONE and BOTH states from Step 2 are further classified in terms of whether they are within- or between-speaker events, from the point of view of each speaker. This division leads to four context types: within-speaker overlap (WSO), SELF–BOTH–SELF; between-speaker overlap (BSO), SELF–BOTH–OTHER; within-speaker silence (WSS), SELF–NONE–SELF; and between-speaker silence (BSS), SELF–NONE–OTHER.

Speaker changes with neither overlap nor silence (i.e. with silence or overlap smaller than 10ms) are exceedingly rare in the material, and are not reported here.
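The third step can be sketched as a lookup over consecutive runs of state labels. This is a hedged illustration rather than the authors' code: the names `CONTEXTS` and `classify_contexts`, the run-length collapsing, and the toy label sequence are assumptions, and the sketch does not apply the 10 ms minimum duration mentioned above.

```python
from itertools import groupby
from typing import List, Tuple

# State triples (before, during, after) and the context type they define.
CONTEXTS = {
    ("SELF", "NONE", "SELF"): "WSS",
    ("SELF", "NONE", "OTHER"): "BSS",
    ("SELF", "BOTH", "SELF"): "WSO",
    ("SELF", "BOTH", "OTHER"): "BSO",
}


def classify_contexts(labels: List[str]) -> List[Tuple[str, int]]:
    """Return (context type, first frame of the NONE/BOTH interval) pairs."""
    # Collapse the per-frame labels into runs of (label, start frame).
    runs = []
    pos = 0
    for label, group in groupby(labels):
        runs.append((label, pos))
        pos += len(list(group))
    # Slide a window of three consecutive runs and look up each triple.
    found = []
    for (a, _), (b, start), (c, _) in zip(runs, runs[1:], runs[2:]):
        kind = CONTEXTS.get((a, b, c))
        if kind is not None:
            found.append((kind, start))
    return found


labels = ["SELF", "SELF", "NONE", "SELF", "BOTH", "BOTH", "OTHER", "NONE", "OTHER"]
events = classify_contexts(labels)
```

Only triples that begin in SELF are reported, so the final OTHER–NONE–OTHER stretch in the toy sequence is ignored here; it would surface as a within-speaker silence when the same dialogue is labeled from the other speaker's point of view.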

For completeness, we note that the four states, for each of the two speakers, together with the two states in which either g or f is speaking alone, constitute a 10-state finite state automaton (FSA) describing the evolution of dialogue in which only one party at a time may change vocal activity state. The number of states in such an interaction FSA may be augmented to model other subclassifications, or to model sojourn times, without loss of generality; here, we limit ourselves to an FSA of 10 states, and specifically to the 4 phenomena mentioned, as it is most directly relevant to our ongoing work in conversational spoken dialogue systems.

2.2.2 Extracting F0 patterns

Once the silences and overlaps are identified and classified, we collect F0 patterns from the last 500ms of speech in SELF-state preceding BSS, BSO, WSS and WSO (see the target intervals in Figure 1). It is in these intervals, approximately the last two syllables, before silences or overlaps, that we look for potential prosodic features inviting or inhibiting speaker-changes. The prosodic features we explored are all related to F0 patterns, but we use two different ways of capturing such patterns: one based on regular F0 extraction, and the other on a direct representation of F0 variation, known as the fundamental frequency variation spectrum.
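Locating a target interval can be sketched as walking backwards from the start of a silence or overlap through the preceding SELF frames. The function name `target_interval`, the 10 ms label frame length, and the decision to keep a shorter interval when the SELF run lasts less than 500 ms are all assumptions of this sketch; the text does not specify them.

```python
from typing import List, Tuple


def target_interval(
    labels: List[str],
    event_start: int,
    frame_ms: int = 10,    # assumed frame length of the state labels
    window_ms: int = 500,  # target interval length given in the text
) -> Tuple[int, int]:
    """Return (first, end) frame indices of the target interval, end exclusive.

    The interval covers at most window_ms of contiguous SELF frames that end
    where the silence or overlap begins (event_start).
    """
    max_frames = window_ms // frame_ms
    first = event_start
    while (
        first > 0
        and labels[first - 1] == "SELF"
        and event_start - first < max_frames
    ):
        first -= 1
    return first, event_start


# 600 ms of SELF speech followed by silence: the interval is capped at 50 frames.
long_run = ["OTHER"] * 3 + ["SELF"] * 60 + ["NONE"] * 5
long_ti = target_interval(long_run, 63)

# Only 200 ms of SELF speech: the interval is truncated to the available frames.
short_run = ["OTHER"] * 3 + ["SELF"] * 20 + ["NONE"] * 2
short_ti = target_interval(short_run, 23)
```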

The F0 estimates are computed using YIN (de Cheveigné & Kawahara, 2002). They are then transformed from Hertz to semitones, to make the pitch excursions of men and women more comparable. The data are subsequently smoothed using a median filter (over nine 10 ms frames) to eliminate outlier errors. The resulting contours of smoothed F0 estimates are shifted along the vertical octave axis such that the median of the first three voiced frames in each contour falls on the midpoint of the y-axis. Because the contours are plotted with partially transparent dots, the visualizations indicate the distribution of different patterns: darker bands mark concentrations of patterns, and lighter regions mark sparser ones. We refer to this visualization as bitmap clustering.
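The post-processing of the F0 estimates can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' pipeline: the 100 Hz reference frequency, the use of `None` for unvoiced frames, the shortened filter window at contour edges, and all function names are hypothetical choices made for the example.

```python
import math
from statistics import median
from typing import List, Optional

Contour = List[Optional[float]]  # None marks an unvoiced frame (assumed encoding)


def hz_to_semitones(f0_hz: float, ref_hz: float = 100.0) -> float:
    """Convert Hertz to semitones relative to ref_hz (reference is an assumption)."""
    return 12.0 * math.log2(f0_hz / ref_hz)


def median_smooth(values: Contour, width: int = 9) -> Contour:
    """Median filter over `width` frames, skipping unvoiced frames in each window."""
    half = width // 2
    out: Contour = []
    for i, value in enumerate(values):
        if value is None:
            out.append(None)
            continue
        window = [x for x in values[max(0, i - half): i + half + 1] if x is not None]
        out.append(median(window))
    return out


def normalise_contour(f0_hz: Contour, ref_hz: float = 100.0) -> Contour:
    """Semitone conversion, smoothing, and a vertical shift so that the median
    of the first three voiced frames sits at zero."""
    semitones = [None if f is None else hz_to_semitones(f, ref_hz) for f in f0_hz]
    smooth = median_smooth(semitones)
    voiced = [v for v in smooth if v is not None]
    if not voiced:
        return smooth
    anchor = median(voiced[:3])
    return [None if v is None else v - anchor for v in smooth]


# A flat 100 Hz contour with one octave-jump outlier and unvoiced edges.
raw = [None, 100.0, 100.0, 100.0, 100.0, 200.0, 100.0, 100.0, 100.0, 100.0, None]
contour = normalise_contour(raw)
```

Because every contour is shifted by its own anchor, the choice of reference frequency cancels out in the final plot; it only matters if the unshifted semitone values are inspected directly.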

In addition, we use a recently introduced vector-valued spectral representation of F0 variation – the fundamental frequency variation (FFV) spectrum – to capture F0 variation patterns (Laskowski, Edlund, et al., 2008a, 2008b; Laskowski, Wölfel, et al., 2008). Briefly, this technique involves passing the sequence of FFV spectra (a sample spectrum is shown in the left panel of Figure 2) through a filterbank (shown in the right panel of Figure 2), and inferring a statistical model over the filterbank representation.

Figure 2. A sample fundamental frequency variation spectrum (left); the x-axis is in octaves per 8ms. Filters in the filterbank (right); the two extremity filters are not shown.

The filterbank attempts to capture meaningful prosodic variation, and contains a conservative filter for perceptually “flat” pitch, two filters for “slowly changing” rising and falling pitch, two filters for “rapidly changing” rising and falling pitch, and two wide extremity filters to capture unvoiced frames.

3 Results and discussion

From informal listening to the extracted regions, we observed that the instruction giver g and instruction follower f roles in the Swedish Map Task Corpus were somewhat unbalanced with respect to the utterance types that occurred (see Cathcart, Carletta, & Klein, 2003 for similar observations in the HCRC Map Task Corpus). For example, whereas the speech before silences in the giver channel included a relatively high proportion of propositional statements, the follower channel instead contained a large proportion of continuers, that is, backchannels such as “mm” or “aa” indicating that the giver should go on talking (e.g. Jurafsky, Shriberg, Fox, & Curl, 1998). Because of this imbalance, we decided to analyze giver prosody and follower prosody separately.

Table 1 shows the number of instances of the interaction state transition types under study, given our definitions in Section 2.2.1. We note that, interestingly, the number of observed between-speaker phenomena, including silences and overlaps, is split evenly between the giver and follower roles, while an imbalance with respect to roles is already evident in the relative proportions of the within-speaker phenomena.

Table 1. The number of observed interaction state transitions under study; the relative proportion per speaker role is shown in parentheses.

–  –  –

3.1 F0 patterns before between- and within-speaker silences (BSS & WSS)

Figure 3 shows bitmap cluster plots of F0 patterns during the 500ms preceding between- and within-speaker silences in the giver and follower channels. Our expectations before between-speaker silences included rising as well as falling F0 contours. As can be seen, there are falls and rises both in the giver and in the follower plots; broadly, the observations are in line with our expectations. However, it appears that there are relatively more falls in the giver plot, and relatively more rises in the follower plot. Furthermore, the falls tend to start earlier with respect to the subsequent silence than do the rises. These second-order trends are the subject of our ongoing exploratory analysis.

For the within-speaker silences, our expectations based on the literature were that we would observe mainly flat patterns. Indeed, in comparison to the between-speaker silences, there seem to be relatively fewer rises and falls and relatively more flat patterns in this context type. The plots for between-speaker silences have more of a fan or plume shape extending forward, whereas those for within-speaker silences are more tightly concentrated around the midline. We note that this concentration is to some extent an artifact of the shifting of the contours along the y-axis; the effect, however, is the same for all conditions.
