PROSODIC FEATURES IN THE VICINITY OF SILENCES AND OVERLAPS

1 Introduction

In this study, we describe the range of prosodic variation observed in two types of
dialogue context, using fully automatic methods. The first type of context is that
of speaker-changes, or transitions from only one participant speaking to only the
other, involving either acoustic silence or acoustic overlap. The second type of context comprises mutual silence or overlap where a speaker change could in principle occur but does not. For lack of a better term, we will refer to these contexts as non-speaker-changes. More specifically, we investigate F0 patterns in the intervals immediately preceding overlaps and silences, in order to assess whether prosody before overlaps or silences may invite or inhibit speaker change.
Previous work indicates that a number of prosodic and phonetic features are associated with speaker-changes and non-speaker-changes. With respect to F0 patterns, several studies have suggested that rising as well as falling pitch patterns are correlates of speaker-changes (e.g. Local & Kelly, 1986; Local, Kelly, & Wells, 1986; Ogden, 2001), and similarly that flat F0 patterns in the middle of a speaker’s pitch range are correlates of non-speaker-changes (e.g. Caspers, 2003;
Duncan, 1972; Koiso, Horiuchi, Tutiya, Ichikawa, & Den, 1998; Local & Kelly, 1986; Ogden, 2001; Selting, 1996). Furthermore, stretches of low F0 have been reported to invite backchannels in overlap as well as following silence (Ward & Tsukahara, 2000); flat intonation has also been reported to act as an inhibitory cue for backchannels (Noguchi & Den, 1998).
A fundamental problem in exploring prosody in dialogue lies in identifying locations at which prosody may turn out to be salient, and much of prior work has relied on the concepts of turns and floors, and thereby on manual or sufficiently accurate automatic detection of punctuation, disfluencies, and dialog act types.
Frequently in naturally-occurring dialogue, these concepts are ill-defined. In previous work of our own, we investigated to what extent speaker-changes and non-speaker-changes can be predicted from a very limited number of F0 pattern types (Edlund & Heldner, 2005), as well as from a direct representation of F0 variation (Laskowski, Edlund, & Heldner, 2008a, 2008b; Laskowski, Wölfel, Heldner, & Edlund, 2008), at locations dictated by low-level characterizations of the interactive state of the dialogue. In the present study, we take one step back to instead describe the range of diversity in F0 patterns occurring immediately before mutual silences or intervals of overlapping speech. We operationalize the annotation of these transitions using a standard finite state automaton over joint speech activity states. We then extract pitch variation features for these transition types and construct descriptive models to characterize them. An important contribution of this work is the visualization of these models, yielding an end-to-end methodology for zero-manual-effort analysis of pitch variation, conditioned on interactive dialogue context.
2.1 Materials

We used speech material from the Swedish Map Task Corpus (Helgason, 2006), designed as a Swedish counterpart to the HCRC Map Task Corpus (Anderson, et al., 1991). The map task is a cooperative task involving two speakers, intended to elicit natural spontaneous dialogues. Each of two speakers has one map which the other speaker cannot see. One of the speakers, the instruction giver (g), has a route marked on his or her map. The other speaker, the instruction follower (f), has no such route. The two maps are not identical and the subjects are explicitly told that the maps differ, but not how. The task is to reproduce the giver’s route on the follower’s map ("The design of the HCRC Map Task Corpus," n.d.).
Eight speakers, five females and three males, are represented in the corpus.
The speakers formed four pairs, three female-male pairs and one female-female pair. Each speaker acted as instruction giver and follower at least once, and no speaker occurred in more than one pair. The corpus includes ten such dialogues, the total duration of which is approximately 2 hours and 18 minutes. The dialogues were recorded in an anechoic room, using close-talking microphones, with the subjects facing away from each other, and with acceptable acoustic separation of the speaker channels.
2.2 Procedures

The procedures involved defining, identifying and classifying instances of the two context types, extracting F0 patterns immediately before these, and summarizing and visualizing them. In this section, we outline and motivate how this was done.
2.2.1 Identifying interaction state transitions

As mentioned in the introduction, naturally occurring human-human dialogue contains a significant number of phenomena, such as backchannels, disfluencies, and cross-channel disruptions, which make it difficult to condition prosodic extraction on objectively defined syntactic or semantic boundaries. To address this problem, we limit ourselves to boundaries in conversation flow, defined by the relative timing of talkspurt deployment by the two parties. We annotate every instant in a dialogue with an explicit interaction state label; states describe the joint vocal activity of both speakers, building on a tradition of computational models of interaction (e.g. Brady, 1968; Dabbs & Ruback, 1984; Jaffe & Feldstein, 1970; Norwine & Murphy, 1938; Sellen, 1995). We note that, importantly, each participant’s vocal activity is a binary variable, such that for example backchannel speech (Yngve, 1970) is not treated differently from other speech. We use the resulting conversation state labels to identify state transitions which define the end of the target intervals at which we subsequently extract prosodic features. The procedure involves three steps, as depicted in Figure 1.
First, we perform vocal activity detection, individually for each speaker, using the VADER voice activity detector from the CMU Sphinx Project ("The CMU Sphinx Group Open Source Speech Recognition Engines," n.d.). This results in the labeling of each instant, for each speaker, as either SPEECH or SILENCE.
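The per-frame labeling of Step 1 can be sketched as follows. VADER's actual decision logic is more sophisticated; this energy-threshold stand-in (the threshold, frame size, and function name are our assumptions) only illustrates the output format:

```python
import numpy as np

def label_frames(samples, sr, frame_ms=10, threshold_db=-35.0):
    """Label each frame of one speaker channel as SPEECH or SILENCE.

    A minimal energy-threshold stand-in for the VADER detector used in
    the text; threshold and frame size are illustrative choices.
    """
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    labels = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(frame ** 2)) + 1e-12   # avoid log(0)
        db = 20 * np.log10(rms)
        labels.append("SPEECH" if db > threshold_db else "SILENCE")
    return labels
```

Running this independently on each close-talking channel yields the two binary activity streams that the subsequent steps combine.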
Figure 1. Illustration of how between-speaker silences (BSS), between-speaker overlaps (BSO), within-speaker silences (WSS), and within-speaker overlaps (WSO) are defined and classified, as well as how the target intervals (TI) are located with respect to these.
The illustration shows all three steps (as in the text) from the perspectives of both g and f.
Second, at each instant, the states of the two speakers are combined to derive a four-class label of the communicative state of the conversation, describing both speakers’ activity, from the point of view of each speaker. The four states we consider include SELF, OTHER, NONE and BOTH. For example, from the point of view of the instruction giver g, the state is SELF if g is speaking and the instruction follower f is not; it is OTHER if g is silent and f is speaking, NONE if neither speaker is speaking, and BOTH if both are. The process of defining communicative states from the point of view of speaker f is similar; we illustrate this process for both speakers in the middle panel of Figure 1.
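Step 2 reduces to a simple truth table over the two binary activity streams; a minimal sketch (function names are ours):

```python
def joint_state(self_speaking, other_speaking):
    """Combine two binary vocal-activity labels into one of the four
    communicative states, from the point of view of one speaker."""
    if self_speaking and other_speaking:
        return "BOTH"
    if self_speaking:
        return "SELF"
    if other_speaking:
        return "OTHER"
    return "NONE"

def label_dialogue(self_frames, other_frames):
    """Frame-by-frame joint labels, e.g. from the giver's perspective."""
    return [joint_state(s, o) for s, o in zip(self_frames, other_frames)]
```

Swapping the argument order gives the follower's perspective from the same two streams.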
Finally, in a third step (comprising a third pass of the data, for illustration purposes), the NONE and BOTH states from Step 2 are further classified in terms of whether they are within- or between-speaker events, from the point of view of each speaker. This division leads to four context types: within-speaker overlap, SELF–BOTH–SELF; between-speaker overlap, SELF–BOTH–OTHER; within-speaker silence, SELF–NONE–SELF; and between-speaker-silence, SELF–NONE–OTHER.
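Step 3 can be implemented by scanning a joint-state sequence for the SELF–{NONE,BOTH}–{SELF,OTHER} patterns above; a sketch, where the run-scanning strategy and names are our own:

```python
def classify_gaps(states):
    """Classify each NONE or BOTH run in a joint-state sequence as a
    within-speaker or between-speaker event. Returns (start, end, label)
    tuples with end exclusive; runs not preceded by SELF are skipped,
    matching the SELF-{NONE,BOTH}-{SELF,OTHER} patterns in the text."""
    events = []
    i = 0
    while i < len(states):
        if states[i] in ("NONE", "BOTH") and i > 0 and states[i - 1] == "SELF":
            j = i
            while j < len(states) and states[j] == states[i]:
                j += 1                      # extend the run
            if j < len(states) and states[j] in ("SELF", "OTHER"):
                within = states[j] == "SELF"
                silence = states[i] == "NONE"
                label = ("WSS" if within else "BSS") if silence else \
                        ("WSO" if within else "BSO")
                events.append((i, j, label))
            i = j
        else:
            i += 1
    return events
```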
Speaker changes with neither overlap nor silence (i.e. with silence or overlap smaller than 10ms) are exceedingly rare in the material, and are not reported here.
For completion, we note that the four states, per each of two speakers, together with the two states in which either g or f are speaking alone, constitute a 10-state finite state automaton (FSA) describing the evolution of dialogue in which only one party at a time may change vocal activity state. The number of states in such an interaction FSA may be augmented to model other subclassifications, or to model sojourn times, without loss of generality; here, we limit ourselves to an FSA of 10 states, and specifically to the 4 phenomena mentioned, as it is most directly relevant to our ongoing work in conversational spoken dialogue systems.
2.2.2 Extracting F0 patterns

Once the silences and overlaps are identified and classified, we collect F0 patterns from the last 500ms of speech in SELF-state preceding BSS, BSO, WSS and WSO (see the target intervals in Figure 1). It is in these intervals, approximately the last two syllables, before silences or overlaps, that we look for potential prosodic features inviting or inhibiting speaker-changes. The prosodic features we explored are all related to F0 patterns, but we use two different ways of capturing such patterns: one based on regular F0 extraction, and the other on a direct representation of F0 variation, known as the fundamental frequency variation spectrum.
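Locating such a target interval from a joint-state sequence might look like this (the 10ms frame size matches the text; the indexing convention is our assumption):

```python
FRAME_MS = 10
TARGET_MS = 500

def target_interval(states, event_start):
    """Return frame indices of up to 500ms of SELF-state speech
    immediately preceding a silence or overlap beginning at event_start.
    Returns fewer frames if the SELF stretch is shorter than 500ms."""
    n = TARGET_MS // FRAME_MS
    i = event_start
    frames = []
    while i > 0 and states[i - 1] == "SELF" and len(frames) < n:
        i -= 1
        frames.append(i)
    return list(reversed(frames))
```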
The F0 estimates are computed using YIN (de Cheveigné & Kawahara, 2002).
They are then transformed from Hertz to semitones, to make the pitch excursions of men and women more comparable. The data are subsequently smoothed using a median filter (over nine 10ms frames) to eliminate outlier errors. The resulting contours of smoothed F0 estimates are shifted along the vertical octave axis such that the median of the first three voiced frames in each contour falls on the midpoint of the y-axis. By plotting the contours with partially transparent dots, the visualizations give an indication of the distribution of different patterns, with darker bands where patterns concentrate and lighter regions where they are sparse. We refer to this visualization as bitmap clustering.
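The post-processing chain can be sketched as follows; the semitone reference frequency is arbitrary (a shift that the final alignment step cancels), and the pure-NumPy median filter and function names are our own:

```python
import numpy as np

def median_smooth(x, k=9):
    """Median filter over k frames, with edge padding."""
    pad = np.pad(x, k // 2, mode="edge")
    return np.array([np.median(pad[i:i + k]) for i in range(len(x))])

def postprocess_f0(f0_hz, ref_hz=100.0):
    """Post-process an F0 track as in the text: Hz -> semitones (ref_hz
    is an arbitrary reference, our assumption), median filter over nine
    10ms frames to remove outliers, then shift so the median of the first
    three voiced frames falls at 0 (the y-axis midpoint). Unvoiced frames
    are NaN on input and remain NaN on output."""
    f0 = np.asarray(f0_hz, float)
    st = 12.0 * np.log2(f0 / ref_hz)
    voiced_mask = ~np.isnan(st)
    sm = median_smooth(np.where(voiced_mask, st, 0.0))
    sm[~voiced_mask] = np.nan
    voiced = sm[voiced_mask]
    if len(voiced) >= 3:
        sm = sm - np.median(voiced[:3])
    return sm
```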
In addition, we use a recently introduced vector-valued spectral representation of F0 variation – the fundamental frequency variation (FFV) spectrum – to capture F0 variation patterns (Laskowski, Edlund, et al., 2008a, 2008b;
Laskowski, Wölfel, et al., 2008). Briefly, this technique involves passing the sequence of FFV spectra (a sample spectrum is shown in the left panel of Figure
2) through a filterbank (shown in the right panel of Figure 2), and inferring a statistical model over the filterbank representation.
Figure 2. A sample fundamental frequency variation spectrum (left); the x-axis is in octaves per 8ms. Filters in the filterbank (right); the two extremity filters are not shown.
The filterbank attempts to capture meaningful prosodic variation, and contains a conservative filter for perceptually “flat” pitch, two filters for “slowly changing” rising and falling pitch, two filters for “rapidly changing” rising and falling pitch, and two wide extremity filters to capture unvoiced frames.
3 Results and discussion

From informal listening to the extracted regions, we observed that the instruction giver g and instruction follower f roles in the Swedish Map Task Corpus were somewhat unbalanced with respect to the kind of utterance types that occurred (see Cathcart, Carletta, & Klein, 2003 for similar observations in the HCRC Map Task Corpus). For example, whereas the speech before silences in the giver channel included a relatively high proportion of propositional statements, the follower channel instead contained a large proportion of continuers, that is backchannels indicating that the giver should go on talking (e.g. Jurafsky, Shriberg, Fox, & Curl, 1998) such as “mm” or “aa”. Because of this imbalance, we decided to analyze giver prosody and follower prosody separately.
Table 1 shows the number of instances of the interaction state transition types under study, given our definitions in Section 2.2.1. We note that, interestingly, the number of observed between-speaker phenomena, including silences and overlaps, is split evenly between the giver and follower roles, while indications of role imbalance are already evident in the relative proportions of the within-speaker phenomena.
Table 1. The number of observed interaction state transitions under study; the relative proportion per speaker role is shown in parentheses.
3.1 F0 patterns before between- and within-speaker silences (BSS & WSS)

Figure 3 shows bitmap cluster plots of F0 patterns during the 500ms preceding between- and within-speaker silences in the giver and follower channels. Our expectations before between-speaker silences included rising as well as falling F0 contours. As can be seen, there are falls and rises both in the giver and in the follower plots; broadly, the observations are in line with our expectations.
However, it appears that there are relatively more falls in the giver plot, and relatively more rises in the follower plot. Furthermore, the falls tend to start earlier with respect to the subsequent silence than do the rises. These second-order trends are the subject of our ongoing exploratory analysis.
For the within-speaker silences, our expectations based on the literature were that we would observe mainly flat patterns. Indeed, in comparison to the between-speaker silences, there seem to be relatively fewer rises and falls and relatively more flat patterns in this context type. The plots for between-speaker silences have more of a fan or plume shape extending forward, whereas those for within-speaker silences are more tightly concentrated around the midline. We note that this concentration is to some extent an artifact of the shifting of the contours along the y-axis; the effect, however, is the same for all conditions.