A Computational Model to Connect Gestalt Perception
and Natural Language
Sheel Sanjay Dhande
Bachelor of Engineering in Computer Engineering,
University of Pune, 2001
Submitted to the Program in Media Arts and Sciences,
School of Architecture and Planning, in partial fulfillment of the
requirements for the degree of
MASTER OF SCIENCE IN MEDIA ARTS AND SCIENCES
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
September 2003
© Massachusetts Institute of Technology 2003. All rights reserved.
Program in Media Arts and Sciences
August 8, 2003

Certified by: Deb K. Roy, Assistant Professor of Media Arts and Sciences, Thesis Supervisor

Accepted by: Andrew Lippman, Chairperson, Department Committee on Graduate Students, Program in Media Arts and Sciences

Abstract

We present a computational model that connects gestalt visual perception and language.
The model grounds the meaning of natural language words and phrases in terms of the perceptual properties of visually salient groups. We focus on the semantics of a class of words that we call conceptual aggregates, e.g., pair, group, stuff, which inherently refer to groups of objects. The model provides an explanation for how the semantics of these natural language terms interact with gestalt processes in order to connect referring expressions to visual groups.
Our computational model can be divided into two stages. The first stage performs grouping on visual scenes. It takes a visual scene segmented into block objects as input, and creates a space of possible salient groups arising from the scene. This stage also assigns a saliency score to each group. In the second stage, visual grounding, the space of salient groups output by the previous stage is taken as input along with a linguistic scene description. The visual grounding stage selects the best match between the linguistic description and a set of objects. Parameters of the model are trained on observed data from a linguistic description and visual selection task.
The proposed model has been implemented in the form of a program that takes a synthetic visual scene and a linguistic description as input, and identifies likely groups of objects within the scene that correspond to the description. We present an evaluation of the performance of the model on a visual referent identification task. This model may be applied in natural language understanding and generation systems that utilize visual context, such as scene description systems for the visually impaired and functionally illiterate.
Thesis Supervisor: Deb K. Roy Title: Assistant Professor of Media Arts and Sciences
Acknowledgments

I would like to thank my advisor Deb Roy for his valuable guidance on this work. I would also like to thank the members of cogmac, especially Peter Gorniak, for helpful discussions and advice. I am grateful to my readers, Sandy Pentland and John Maeda, for their helpful suggestions and advice.
I met a lot of great people at MIT and I would like to collectively thank them all. Finally, I would like to thank my family for their never ending support.
List of Figures

5-1 Average values of results calculated using evaluation criterion C1
5-2 Average values of results calculated using evaluation criterion C2
5-3 Visual grouping scenario in which proximity alone fails
Chapter 1

Introduction

Each day, from the moment we wake up, our senses are hit with a mind-boggling amount of information in all forms. From the visual richness of the world around us to the sounds and smells of our environment, our bodies receive a constant stream of sensory input.
Nevertheless, we seem to make sense of all this information with relative ease. Further, we use all this sensory input to describe what we perceive using natural language. The explanation, we believe, lies in the connection between visual organization, in the form of gestalt grouping, and language.
Visual grouping has been recognized as an essential component of a computational model of the human vision system. Such visually salient groups offer a concise representation for the complexity of the real world. For example, when we see a natural scene and hear descriptions like the pair on top or the stuff over there, intuitively we form an idea of what is being referred to. However, if we analyze the words in these descriptions, there is no information about the properties of the objects being referred to, and in some cases not even a specification of the number of objects. How then do we disambiguate the correct referent object(s) from all others present in the visual scene? This resolution of ambiguity occurs through the use of the visual properties of the objects and the visual organization of the objects in the scene. These visual cues provide clues on how to reduce the natural scene, composed of numerous pixels, to a concise representation composed of groups of objects.
This concise representation is shared with other cognitive faculties, specifically language. This is why, in language, we use aggregate terms such as stuff and pair that describe visual groups composed of individual objects. Language also plays an important role in guiding visual organization and priming our search for visually salient groups.
A natural language understanding and generation system that utilizes visual context needs a model of the interdependence of language and visual organization. In this thesis we present such a model that connects visual grouping and language. This work, to the best of our knowledge, is one of the first attempts to connect the semantics of specific linguistic terms to the perceptual properties of visually salient groups.
1.1 Connecting gestalt perception and language
Gestalt perception is the ability to organize perceptual input. It enables us to perceive wholes that are greater than the sum of their parts. This combination of parts into wholes is known as gestalt grouping. The ability to form gestalts is an important component of our vision system.
The relationship between language and visual perception is well established, and can be stated as: how we describe what we see, and how we see what is described. Words and phrases referring to implicit groups in spoken language provide evidence that our vision system performs a visual gestalt analysis of scenes. However, to date, there has been relatively little investigation of how gestalt perception aids linguistic description formation, and how linguistic descriptions guide the search for gestalt groups in a visual scene.
In this thesis, we present our work towards building an adaptive and context-sensitive computational model that connects gestalt perception and language. To do this, we ground the meaning of English language terms in the visual context composed of perceptual properties of visually salient groups in a scene. We specifically focus on the semantics of a class of words we term conceptual aggregates, such as group, pair and stuff. Further, to show how language affects gestalt perception, we train our computational model on data collected from a linguistic description task. The linguistic description of a visual scene is parsed to identify words and their corresponding visual group referent. We extract visual features from the group referent and use them as exemplars for training our model. In this manner our model adapts its notion of grouping by learning from human judgements of visual group referents.
For evaluation, we show the performance of our model on a visual referent identification task. Given a scene and a sentence describing object(s) in the scene, the model returns the set of objects that best matches the description. For example, given the sentence the red pair, the correct pair is identified.
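As a toy illustration of this task (a sketch, not the trained model described in this thesis), resolving a description like the red pair might combine the cardinality implied by the aggregate term with a color preference. The word lists, the hard size filter, and the scoring scheme below are all our assumptions:

```python
# Hypothetical sketch of referent resolution for conceptual aggregates.
# "pair" implies exactly two objects; "group"/"stuff" leave size open.
AGGREGATE_SIZE = {"pair": 2}

def match_description(words, groups):
    """Toy referent resolution: keep candidate groups whose size fits the
    aggregate term, then prefer the group whose objects best fit the
    color word (if any)."""
    size = next((AGGREGATE_SIZE[w] for w in words if w in AGGREGATE_SIZE), None)
    colors = [w for w in words if w in {"red", "green", "blue"}]
    # Filter candidates by the cardinality implied by the aggregate term.
    viable = [g for g in groups if size is None or len(g) == size]
    if not viable:
        return None

    def color_score(group):
        # Fraction of objects in the group matching the mentioned color.
        if not colors:
            return 0.0
        return sum(1 for obj in group if obj["color"] == colors[0]) / len(group)

    return max(viable, key=color_score)
```

For example, given one red pair, one blue pair, and a red triple as candidate groups, the words the red pair would filter out the triple by size and prefer the red pair by color.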
1.2 Towards perceptually grounded language understanding
Our vision is to build computational systems that can ground the meaning of language in their perceptual context. The term grounding is defined as acquiring the semantics of language by connecting a word, purely symbolic, to its perceptual correlates, purely non-symbolic. The perceptual correlates of a word can span different modalities, such as visual and aural. Grounded language models can be used to build natural language processing systems that understand verbal instructions and, in response, perform actions or give a verbal reply.
In our research, our initial motivation was derived from the idea of building a program that can create linguistic descriptions for electronic documents and maps. Such a program could be used by visually impaired and functionally illiterate users. When we see an electronic document, we implicitly tend to cluster parts of the document into groups and perceive part/whole relationships between the salient groups. The application we envision can utilize these salient groups to understand the referents of descriptions, and to create its own descriptions.
There are other domains of application as well, for example, building conversational robots that can communicate through situated natural language. The ability to organize visual input and use it for understanding and creating language descriptions would allow a robot to act as an assistive aid and give the robot a deeper semantic understanding of conceptual aggregate terms. This research is also applicable to building language-enabled intuitive interfaces for portable devices with small or no displays, e.g., mobile phones.
1.3 The Big Picture
Figure 1-1 shows our entire model. It can be divided into two stages.
The first stage, indicated by the Grouping block, performs grouping on visual scenes. It takes a visual scene segmented into block objects as input, and creates a space of possible salient groups, labeled candidate groups, arising from the scene. We use a weighted-sum strategy to integrate the influence of different visual properties such as color and proximity.
This stage also assigns a saliency score to each group. In the second stage, visual grounding, denoted in the figure by the Grounding block, the space of salient groups output by the previous stage is taken as input along with a linguistic scene description. The visual grounding stage selects the best match between the linguistic description and a set of objects. The parameters of this model are learned from positive and negative examples derived from human judgement data collected in an experimental visual referent identification task.
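As a rough illustration of the weighted-sum strategy (not the thesis implementation), per-property distances between two objects can be combined with learned weights, and a group's saliency derived from how tight the group is under that combined distance. The property names, the mean-pairwise-distance choice, and the 1/(1+d) saliency mapping are assumptions made for this sketch:

```python
import math

def weighted_distance(obj_a, obj_b, weights):
    """Weighted-sum combination of per-property distances. The property
    names ("color", "position") and weight keys are illustrative."""
    d_color = math.dist(obj_a["color"], obj_b["color"])
    d_proximity = math.dist(obj_a["position"], obj_b["position"])
    return weights["color"] * d_color + weights["proximity"] * d_proximity

def group_saliency(group, weights):
    """Score a candidate group by its mean pairwise weighted distance:
    tighter, more homogeneous groups get scores closer to 1."""
    pairs = [(a, b) for i, a in enumerate(group) for b in group[i + 1:]]
    if not pairs:  # singleton group: maximally tight by convention
        return 1.0
    mean_d = sum(weighted_distance(a, b, weights) for a, b in pairs) / len(pairs)
    return 1.0 / (1.0 + mean_d)
```

Under this sketch, raising the proximity weight makes spatially compact groups more salient than color-homogeneous but scattered ones, which is exactly the kind of trade-off the learned weights control.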
In the next chapter, Chapter 2, we discuss relevant previous research related to perceptual organization, visual perception, and systems that integrate language and vision. In Chapter 3 we describe the visual grouping stage in detail, including a description of our grouping algorithm and the saliency measure of a group. In Chapter 4 we describe how visual grounding is implemented. We give details of our feature selection, the data collection task, and the training of our model. The chapter concludes with a fully worked example that takes the reader through the entire processing of our model, from a scene and a description to the identification of the correct referent. In Chapter 5 we give details of our evaluation task and the results we achieved. In Chapter 6 we conclude and discuss directions for future research.
The contributions of this thesis are:
• A computational model for grounding linguistic terms to gestalt perception
• A saliency measure for groups based on a hierarchical clustering framework, using a weighted distance function
• A framework for learning, from a visual referent identification task, the weights used to combine the influence of each individual perceptual property
• A program that takes as input a visual scene and a linguistic description, and identifies the correct group referent
Chapter 2

Background

Our aim is to create a system that connects language and gestalt perception. We want to apply it to the task of identifying referents in a visual scene, specified by a linguistic description. This research builds on two major areas of previous work: the first is visual perception and perceptual organization, and the second is building systems that integrate linguistic and visual information.
2.1 Perceptual Organization
Perception involves simultaneous sensing and comprehension of a large number of stimuli.
Yet, we do not see each stimulus as an individual input. Collections of stimuli are organized into groups that serve as cognitive handles for interpreting an agent's environment. As an example, consider the visual scene shown in Figure 2-1. The majority of people would parse the scene as being composed of 3 sets of 2 dots, in other words 3 pairs rather than 6 individual dots. Asked to describe the scene, most observers are likely to say three pairs of dots. This example illustrates two facets of perceptual organization: (a) the grouping of stimuli, e.g., visual stimuli, and (b) the use of these groups by other cognitive abilities, e.g., language.