In: EACL91, pp. 8-14.
Designing Illustrated Texts:
How Language Production Is Influenced by Graphics Generation
Wolfgang Wahlster, Elisabeth André, Winfried Graf, Thomas Rist
Multimodal interfaces combining, e.g., natural language and graphics take advantage of both
the individual strength of each communication mode and the fact that several modes can be
employed in parallel, e.g., in the text-picture combinations of illustrated documents. It is an important goal of this research not simply to merge the verbalization results of a natural language generator and the visualization results of a knowledge-based graphics generator, but to carefully coordinate graphics and text in such a way that they complement each other. We describe the architecture of the knowledge-based presentation system WIP which guarantees a design process with a large degree of freedom that can be used to tailor the presentation to suit the specific context. In WIP, decisions of the language generator may influence graphics generation and graphical constraints may sometimes force decisions in the language production process. In this paper, we focus on the influence of graphical constraints on text generation. In particular, we describe the generation of cross-modal references, the revision of text due to graphical constraints and the clarification of graphics through text.
Table of Contents
1 Introduction
2 The Architecture of WIP
2.1 The Presentation Planner
2.2 The Layout Manager
2.3 The Text Generator
2.4 The Graphics Generator
3 The Generation of Cross-Modal References
4 The Revision of Text Due to Graphical Constraints
5 The Clarification of Graphics through Text
6 Conclusion
Acknowledgements
References
1 INTRODUCTION
With increases in the amount and sophistication of information that must be communicated to the users of complex technical systems comes a corresponding need to find new ways to present that information flexibly and efficiently. Intelligent presentation systems are important building blocks of the next generation of user interfaces, as they translate from the narrow output channels provided by most of the current application systems into high-bandwidth communications tailored to the individual user. Since in many situations information is only presented efficiently through a particular combination of communication modes, the automatic generation of multimodal presentations is one of the tasks of such presentation systems. The task of the knowledge-based presentation system WIP is the generation of a variety of multimodal documents from an input consisting of a formal description of the communicative intent of a planned presentation. The generation process is controlled by a set of generation parameters such as target audience, presentation objective, resource limitations, and target language.
One of the basic principles underlying the WIP project is that the various constituents of a multimodal presentation should be generated from a common representation. This raises the question of how to divide a given communicative goal into subgoals to be realized by the various mode-specific generators, so that they complement each other. To address this problem, we have to explore computational models of the cognitive decision processes coping with questions such as what should go into text, what should go into graphics, and which kinds of links between the verbal and non-verbal fragments are necessary.
In the project WIP, we try to generate on the fly illustrated texts that are customized for the intended target audience and situation, flexibly presenting information whose content, in contrast to hypermedia systems, cannot be fully anticipated. The current testbed for WIP is the generation of instructions for the use of an espresso-machine. It is a rare instruction manual that does not contain illustrations. WIP's 2D display of 3D graphics of machine parts helps the addressee of the synthesized multimodal presentation to develop a 3D mental model of the object that he can constantly match with his visual perceptions of the real machine in front of him. Fig. 1 shows a typical text-picture sequence which may be used to instruct a user in filling the water container of an espresso-machine.
Fig. 1: Example Instruction
Currently, the technical knowledge to be presented by WIP is encoded in a hybrid knowledge representation language of the KL-ONE family including a terminological and assertional component (see Nebel 90). In addition to this propositional representation, which includes the relevant information about the structure, function, behavior, and use of the espresso-machine, WIP has access to an analogical representation of the geometry of the machine in the form of a wireframe model.
The automatic design of multimodal presentations has only recently received significant attention in artificial intelligence research (cf. the projects SAGE (Roth et al. 89), COMET (Feiner & McKeown 89), FN/ANDD (Marks & Reiter 90) and WIP (Wahlster et al. 89)). The WIP and COMET projects share a strong research interest in the coordination of text and graphics. They differ from systems such as SAGE and FN/ANDD in that they deal with physical objects (espresso-machine, radio vs. charts, diagrams) that the user can access directly. For example, in the WIP project we assume that the user is looking at a real espresso-machine and uses the presentations generated by WIP to understand the operation of the machine. In spite of many similarities, there are major differences between COMET and WIP, e.g., in the systems' architecture. In COMET, the layout component combines text and graphics fragments produced by mode-specific generators during one of the final processing steps. In WIP, by contrast, a layout manager can interact with a presentation planner before text and graphics are generated, so that layout considerations may influence the early stages of the planning process and constrain the mode-specific generators.
2 THE ARCHITECTURE OF WIP
The architecture of the WIP system guarantees a design process with a large degree of freedom that can be used to tailor the presentation to suit the specific context. During the design process, a presentation planner and a layout manager orchestrate the mode-specific generators, and the document history handler (see Fig. 2) provides information about intermediate results of the presentation design that is exploited in order to prevent disconcerting or incoherent output. This means that decisions of the language generator may influence graphics generation and that graphical constraints may sometimes force decisions in the language production process. In this paper, we focus on the influence of graphical constraints on text generation (see Wahlster et al. 91 for a discussion of the inverse influence).
Fig. 2 shows a sketch of WIP's current architecture used for the generation of illustrated documents. Note that WIP includes two parallel processing cascades for the incremental generation of text and graphics. In WIP, the design of a multimodal document is viewed as a non-monotonic process that includes various revisions of preliminary results, massive replanning or plan repairs, and many negotiations between the corresponding design and realization components in order to achieve a fine-grained and optimal division of work between the selected presentation modes.
2.1 THE PRESENTATION PLANNER
The presentation planner is responsible for contents and mode selection. A basic assumption behind the presentation planner is that not only the generation of text, but also the generation of multimodal documents can be considered as a sequence of communicative acts which aim to achieve certain goals (cf. André & Rist 90a). For the synthesis of illustrated texts, we have designed presentation strategies that refer to both text and picture production.
To represent the strategies, we follow the approach proposed by Moore and colleagues (cf. Moore & Paris 89) to operationalize Rhetorical Structure Theory (RST, cf. Mann & Thompson 88) for text planning.
The strategies are represented by a name, a header, an effect, a set of applicability conditions and a specification of main and subsidiary acts. Whereas the header of a strategy indicates which communicative function the corresponding document part is to fill, its effect refers to an intentional goal. The applicability conditions specify when a strategy may be used and put restrictions on the variables to be instantiated. The main and subsidiary acts form the kernel of the strategies. For example, the strategy below can be used to enable the identification of an object shown in a picture (for further details see André & Rist 90b). Whereas graphics is to be used to carry out the main act, the mode for the subsidiary acts is open.
Header: (Provide-Background P A ?x ?px ?pic GRAPHICS)
Effect: (BMB P A (Identifiable A ?x ?px ?pic))
Applicability Conditions: (AND (Bel P (Perceptually-Accessible A ?x)) (Bel P (Part-of ?x ?z)))
Main Acts: (Depict P A (Background ?z) ?pz ?pic)
Subsidiary Acts: (Achieve P (BMB P A (Identifiable A ?z ?pz ?pic)) ?mode)
For the automatic generation of illustrated documents, the presentation strategies are treated as operators of a planning system. During the planning process, presentation strategies are selected and instantiated according to the presentation task. After the selection of a strategy, the main and subsidiary acts are carried out unless the corresponding presentation goals are already satisfied. Elementary acts, such as Depict or Assert, are performed by the text and graphics generators.
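The treatment of strategies as planning operators can be illustrated with a minimal sketch. It is not WIP's planner: unification, backtracking and mode selection are left out, goals are reduced to (predicate, object) pairs, and the names PART_OF, Strategy, STRATEGIES and plan, as well as the two toy strategies, are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple

Goal = Tuple[str, str]   # (predicate, object), e.g. ("Identifiable", "water-container")
Act = Tuple[str, str]    # elementary act, e.g. ("Depict", "espresso-machine")

# Hypothetical domain fact standing in for the knowledge base.
PART_OF = {"water-container": "espresso-machine"}


@dataclass
class Strategy:
    name: str
    effect: str                              # goal predicate the strategy achieves
    applicable: Callable[[str], bool]        # applicability conditions
    main_acts: Callable[[str], List[Act]]    # elementary acts to perform
    subsidiary: Callable[[str], List[Goal]]  # subgoals to achieve in turn


STRATEGIES = [
    # Loosely modelled on Provide-Background: depict the whole, then make it identifiable.
    Strategy(
        name="provide-background",
        effect="Identifiable",
        applicable=lambda x: x in PART_OF,
        main_acts=lambda x: [("Depict-Background", PART_OF[x])],
        subsidiary=lambda x: [("Identifiable", PART_OF[x])],
    ),
    # Fallback for objects that are not part of anything else.
    Strategy(
        name="show-object",
        effect="Identifiable",
        applicable=lambda x: True,
        main_acts=lambda x: [("Depict", x)],
        subsidiary=lambda x: [],
    ),
]


def plan(goal: Goal, satisfied: Set[Goal], acts: List[Act]) -> None:
    """Expand a goal depth-first, appending elementary acts to `acts`."""
    if goal in satisfied:  # goals that already hold are not planned again
        return
    predicate, obj = goal
    for strategy in STRATEGIES:
        if strategy.effect == predicate and strategy.applicable(obj):
            acts.extend(strategy.main_acts(obj))
            for subgoal in strategy.subsidiary(obj):
                plan(subgoal, satisfied, acts)
            satisfied.add(goal)
            return
    raise ValueError(f"no applicable strategy for {goal}")


satisfied: Set[Goal] = set()
acts: List[Act] = []
plan(("Identifiable", "water-container"), satisfied, acts)
```

In this sketch the list acts plays the role of the elementary acts handed to the mode-specific generators, and the set satisfied records goals that need not be planned again.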
2.2 THE LAYOUT MANAGER
The main task of the layout manager is to convey certain semantic and pragmatic relations specified by the planner through the arrangement of graphic and text fragments received from the mode-specific generators, i.e., to determine the size of the boxes and the exact coordinates for positioning them on the document page. We use a grid-based approach as an ordering system for efficiently designing functional (i.e., uniform, coherent and consistent) layouts (cf. Müller-Brockmann 81).
A central problem for automatic layout is the representation of design-relevant knowledge. Constraint networks seem to be a natural formalism to declaratively incorporate aesthetic knowledge into the layout process, e.g., perceptual criteria concerning the organization of boxes such as sequential ordering, alignment, grouping, symmetry or similarity.
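A declarative statement of such criteria might look like the following sketch. It only checks a given arrangement against named constraints and does not solve for box positions, which a constraint network would do. The Box type, the helpers left_aligned, above and violated, and all coordinates are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Box:
    x: int  # left edge in grid units
    y: int  # top edge in grid units (y grows downwards)
    w: int  # width
    h: int  # height


Layout = Dict[str, Box]
Constraint = Tuple[str, Callable[[Layout], bool]]


def left_aligned(a: str, b: str) -> Constraint:
    """Alignment: both boxes share the same left edge."""
    return (f"left-aligned({a},{b})", lambda boxes: boxes[a].x == boxes[b].x)


def above(a: str, b: str) -> Constraint:
    """Sequential ordering: box a ends before box b starts."""
    return (f"above({a},{b})", lambda boxes: boxes[a].y + boxes[a].h <= boxes[b].y)


def violated(constraints: List[Constraint], boxes: Layout) -> List[str]:
    """Return the names of all constraints that the layout does not satisfy."""
    return [name for name, holds in constraints if not holds(boxes)]


constraints = [left_aligned("text", "picture"), above("text", "picture")]
good = {"text": Box(0, 0, 40, 10), "picture": Box(0, 12, 40, 30)}
bad = {"text": Box(0, 0, 40, 10), "picture": Box(5, 12, 40, 30)}
```

Here violated(constraints, good) is empty, whereas violated(constraints, bad) reports the broken alignment constraint.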
Layout constraints can be classified as semantic, geometric, topological, and temporal.
Semantic constraints essentially correspond to coherence relations, such as sequence and contrast, and can be easily reflected through specific design constraints. A powerful way of expressing such knowledge is to organize the constraints hierarchically by assigning a preference scale to the constraint network (cf. Borning et al. 89). We distinguish obligatory, optional and default constraints. The latter state default values that remain fixed unless the corresponding constraint is removed by a stronger one. Since there are constraints that have
2.3 THE TEXT GENERATOR
WIP's text generator is based on the formalism of tree adjoining grammars (TAGs). In particular, lexicalized TAGs with unification are used for the incremental verbalization of logical forms produced by the presentation planner (cf. Harbusch 90 and Schauder 91). The grammar is divided into an LD (linear dominance) and an LP (linear precedence) part so that the piecewise construction of syntactic constituents is separated from their linearization according to word order rules (Finkler & Neumann 89).
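The separation of constituent construction from linearization can be sketched as follows. This is an illustrative simplification, not the TAG formalism itself: the daughters of a constituent are collected as an unordered mapping, and a list of precedence rules is applied afterwards by topological sorting. The function linearize, the category labels and the example phrases are assumptions.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Tuple


def linearize(daughters: Dict[str, str], lp_rules: List[Tuple[str, str]]) -> List[str]:
    """Order the daughters so that every (before, after) precedence rule holds."""
    # Map each category to the set of categories that must precede it.
    graph = {category: set() for category in daughters}
    for before, after in lp_rules:
        if before in daughters and after in daughters:
            graph[after].add(before)
    return [daughters[c] for c in TopologicalSorter(graph).static_order()]


# Daughters arrive in arbitrary order, as in incremental construction.
daughters = {"PP": "with water", "V": "fill", "NP": "the container"}
lp_rules = [("V", "NP"), ("NP", "PP")]
words = linearize(daughters, lp_rules)
```

Because word order is decided only in the second step, further daughters can be added to the mapping without revising what has already been built.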
The text generator uses a TAG parser in a local anticipation feedback loop (see Jameson & Wahlster 82). The generator and parser form a bidirectional system, i.e., both processes are based on the same TAG. By parsing a planned utterance, the generator makes sure that it does not contain unintended structural ambiguities.
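The idea of such a feedback loop can be sketched with a toy example. Note the substitution: instead of a TAG parser, the sketch uses a small context-free grammar in Chomsky normal form and a CKY chart that counts derivations. The grammar, the lexicon, the candidate sentences and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import List, Optional

# Toy lexicon and binary rules (parent, left child, right child).
LEXICON = {
    "lift": ["V"], "take": ["V"], "the": ["Det"], "lid": ["N"],
    "handle": ["N"], "with": ["P"], "and": ["Conj"],
}
RULES = [
    ("VP", "V", "NP"), ("VP", "VP", "PP"), ("VP", "VP", "ConjVP"),
    ("ConjVP", "Conj", "VP"), ("NP", "Det", "N"), ("NP", "NP", "PP"),
    ("PP", "P", "NP"),
]


def count_parses(tokens: List[str], start: str = "VP") -> int:
    """Count the derivations of `tokens` from `start` with a CKY chart."""
    n = len(tokens)
    # chart[(i, j)][A] = number of ways category A derives tokens[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, token in enumerate(tokens):
        for category in LEXICON.get(token, ()):
            chart[(i, i + 1)][category] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for parent, left, right in RULES:
                    ways = chart[(i, k)].get(left, 0) * chart[(k, j)].get(right, 0)
                    if ways:
                        chart[(i, j)][parent] += ways
    return chart[(0, n)].get(start, 0)


def first_unambiguous(candidates: List[str]) -> Optional[str]:
    """Return the first planned utterance that has exactly one parse."""
    for sentence in candidates:
        if count_parses(sentence.split()) == 1:
            return sentence
    return None


candidates = ["lift the lid with the handle", "take the handle and lift the lid"]
chosen = first_unambiguous(candidates)
```

The first candidate has two derivations, because the prepositional phrase can attach to the verb phrase or to the noun phrase, so the loop falls through to the second candidate.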
Since the TAG-based generator is used in designing illustrated documents, it has to generate not only complete sentences, but also sentence fragments such as NPs, PPs, or VPs, e.g., for figure captions, section headings, picture annotations, or itemized lists. Given that capability and the incrementality of the generation process, it becomes possible to interleave generation with parsing in order to check for ambiguities as soon as possible. Currently, we are exploring different domains of locality for such feedback loops and trying to relate them to resource limitations specified in WIP's generation parameters. One parameter of the generation process in the current implementation is the number of adjoinings allowed in a sentence. This parameter can be used by the presentation planner to control the syntactic complexity of the generated utterances and sentence length. If the number of allowed adjoinings is small, a logical form that can be verbalized as a single complex sentence may lead to a sequence of simple sentences. The leeway created by this parameter can be exploited for mode coordination. For example, constraints set up by the graphics generator or layout manager can force delimitation of sentences, since in a good design, picture breaks should correspond to sentence breaks, and vice versa (see McKeown & Feiner 90).
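The effect of the adjoining limit can be illustrated with a deliberately simplified sketch. It does not perform TAG adjoining: each modifier is given both as a relative clause and as a standalone predicate, and modifiers beyond the limit are realized as separate simple sentences. The function verbalize and the example content are assumptions.

```python
from typing import List, Tuple


def verbalize(subject: str, predicate: str,
              modifiers: List[Tuple[str, str]], max_adjoinings: int) -> List[str]:
    """Realize a clause plus modifiers as one or more sentences.

    Each modifier is a pair (relative clause, standalone predicate).
    At most `max_adjoinings` modifiers are attached to the main clause.
    """
    adjoined = modifiers[:max_adjoinings]
    deferred = modifiers[max_adjoinings:]
    main = subject
    if adjoined:
        main += ", " + " and ".join(rel for rel, _ in adjoined) + ","
    sentences = [f"{main} {predicate}."]
    # Modifiers that exceed the limit become simple sentences of their own.
    sentences += [f"{subject} {standalone}." for _, standalone in deferred]
    return sentences


modifiers = [("which is on top of the machine", "is on top of the machine")]
complex_form = verbalize("The lid", "covers the water container", modifiers, 1)
simple_form = verbalize("The lid", "covers the water container", modifiers, 0)
```

With a limit of one the content is expressed as a single complex sentence. With a limit of zero it becomes two simple sentences, which gives the layout manager an additional sentence break at which a picture could be placed.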
2.4 THE GRAPHICS GENERATOR
When generating illustrations of physical objects WIP does not rely on previously authored picture fragments or predefined icons stored in the knowledge base. Rather, we start from a hybrid object representation which includes a wireframe model for each object.
Although these wireframe models, along with a specification of physical attributes such as surface color or transparency, form the basic input of the graphics generator, the design of illustrations is regarded as a knowledge-intensive process that exploits various knowledge sources to achieve a given presentation goal efficiently. E.g., when a picture of an object is requested, we have to determine an appropriate perspective in a context-sensitive way (cf.