
A Computational Model to Connect Gestalt Perception

and Natural Language

by

Sheel Sanjay Dhande

Bachelor of Engineering in Computer Engineering,

University of Pune, 2001

Submitted to the Program in Media Arts and Sciences,

School of Architecture and Planning, in partial fulfillment of the

requirements for the degree of

MASTER OF SCIENCE IN MEDIA ARTS AND SCIENCES

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2003

© Massachusetts Institute of Technology 2003. All rights reserved.

Author ..............................................................
Program in Media Arts and Sciences
August 8, 2003

Certified by ..............................................................
Deb K. Roy
Assistant Professor of Media Arts and Sciences
Thesis Supervisor

Accepted by ..............................................................
Andrew Lippman
Chairperson, Department Committee on Graduate Students
Program in Media Arts and Sciences

A Computational Model to Connect Gestalt Perception and Natural Language

by

Sheel Sanjay Dhande

Submitted to the Program in Media Arts and Sciences on August 8, 2003, in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE IN MEDIA ARTS AND SCIENCES

Abstract

We present a computational model that connects gestalt visual perception and language.

The model grounds the meaning of natural language words and phrases in terms of the perceptual properties of visually salient groups. We focus on the semantics of a class of words that we call conceptual aggregates, e.g., pair, group, stuff, which inherently refer to groups of objects. The model provides an explanation for how the semantics of these natural language terms interact with gestalt processes in order to connect referring expressions to visual groups.

Our computational model can be divided into two stages. The first stage performs grouping on visual scenes. It takes a visual scene segmented into block objects as input, and creates a space of possible salient groups arising from the scene. This stage also assigns a saliency score to each group. In the second stage, visual grounding, the space of salient groups, which is the output of the previous stage, is taken as input along with a linguistic scene description. The visual grounding stage comes up with the best match between a linguistic description and a set of objects. Parameters of the model are trained on the basis of observed data from a linguistic description and visual selection task.

The proposed model has been implemented in the form of a program that takes as input a synthetic visual scene and linguistic description, and as output identifies likely groups of objects within the scene that correspond to the description. We present an evaluation of the performance of the model on a visual referent identification task. This model may be applied in natural language understanding and generation systems that utilize visual context such as scene description systems for the visually impaired and functionally illiterate.

Thesis Supervisor: Deb K. Roy
Title: Assistant Professor of Media Arts and Sciences


Acknowledgments

I would like to thank my advisor Deb Roy for his valuable guidance on this work. I would also like to thank the members of cogmac, especially Peter Gorniak, for helpful discussions and advice. I am grateful to my readers, Sandy Pentland and John Maeda, for their helpful suggestions and advice.

I met a lot of great people at MIT and I would like to collectively thank them all. Finally, I would like to thank my family for their never-ending support.

List of Figures

5-1 Average values of results calculated using evaluation criterion C1
5-2 Average values of results calculated using evaluation criterion C2
5-3 Visual grouping scenario in which proximity alone fails

Chapter 1

Introduction

Each day, from the moment we wake up, our senses are hit with a mind-boggling amount of information in all forms. From the visual richness of the world around us, to the sounds and smells of our environment, our bodies receive a constant stream of sensory input.

Nevertheless, we seem to make sense of all this information with relative ease. Further, we use all this sensory input to describe what we perceive using natural language. The explanation, we believe, lies in the connection between visual organization, in the form of gestalt grouping, and language.

Visual grouping has been recognized as an essential component of a computational model of the human vision system [4]. Visually salient groups offer a concise representation of the complexity of the real world. For example, when we see a natural scene and hear descriptions like the pair on top or the stuff over there, we intuitively form an idea of what is being referred to. Yet if we analyze the words in these descriptions, there is no information about the properties of the objects being referred to, and in some cases no specification of the number of objects either. How, then, do we disambiguate the correct referent object(s) from all others present in the visual scene? This resolution of ambiguity occurs through the use of the visual properties of the objects, and the visual organization of the objects in the scene. These visual cues provide clues on how to map the natural scene, composed of numerous pixels, to a concise representation composed of groups of objects.

This concise representation is shared with other cognitive faculties, specifically language. This is why, in language, we use aggregate terms such as stuff and pair that describe visual groups composed of individual objects. Language also plays an important role in guiding visual organization, and in priming our search for visually salient groups.

A natural language understanding and generation system that utilizes visual context needs a model of the interdependence of language and visual organization. In this thesis we present such a model, connecting visual grouping and language. This work is, to the best of our knowledge, one of the first attempts to connect the semantics of specific linguistic terms to the perceptual properties of visually salient groups.

1.1 Connecting gestalt perception and language

Gestalt perception is the ability to organize perceptual input [31]. It enables us to perceive wholes that are greater than the sum of their parts. This sum or combination of parts into wholes is known as gestalt grouping. The ability to form gestalts is an important component of our vision system.

The relationship between language and visual perception is well established, and can be stated as: how we describe what we see, and how we see what is described. Words and phrases referring to implicit groups in spoken language provide evidence that our vision system performs a visual gestalt analysis of scenes. However, to date, there has been relatively little investigation of how gestalt perception aids linguistic description formation, and how linguistic descriptions guide the search for gestalt groups in a visual scene.

In this thesis, we present our work towards building an adaptive and context-sensitive computational model that connects gestalt perception and language. To do this, we ground the meaning of English language terms to the visual context composed of perceptual properties of visually salient groups in a scene. We specifically focus on the semantics of a class of words we term conceptual aggregates, such as group, pair and stuff. Further, to show how language affects gestalt perception, we train our computational model on data collected from a linguistic description task. The linguistic description of a visual scene is parsed to identify words and their corresponding visual group referent. We extract visual features from the group referent and use them as exemplars for training our model. In this manner our model adapts its notion of grouping by learning from human judgements of visual group referents.
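The exemplar-based training described above can be illustrated with a toy sketch. This is our illustration, not the thesis implementation: the feature vector (group size and mean spatial spread), the class name, and the object representation (dicts with x and y coordinates) are all hypothetical stand-ins for the actual visual features used in the model.

```python
from collections import defaultdict

def group_features(group):
    """Toy feature vector for a group: (size, mean spread from the centroid)."""
    n = len(group)
    cx = sum(o["x"] for o in group) / n
    cy = sum(o["y"] for o in group) / n
    spread = sum(((o["x"] - cx) ** 2 + (o["y"] - cy) ** 2) ** 0.5
                 for o in group) / n
    return (n, spread)

class ExemplarModel:
    """Grounds each word in feature vectors of the groups it referred to."""

    def __init__(self):
        self.exemplars = defaultdict(list)

    def train(self, word, referent_group):
        # store the features of a human-identified referent as an exemplar
        self.exemplars[word].append(group_features(referent_group))

    def score(self, word, group):
        """Similarity of a candidate group to the closest stored exemplar."""
        if not self.exemplars[word]:
            return 0.0
        f = group_features(group)
        dists = [sum((a - b) ** 2 for a, b in zip(f, e)) ** 0.5
                 for e in self.exemplars[word]]
        return 1.0 / (1.0 + min(dists))
```

After training on a few referents of "pair", the model scores two-object candidates higher than larger clusters, adapting its notion of grouping to the observed judgements.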

For evaluation, we show the performance of our model on a visual referent identification task. Given a scene and a sentence describing object(s) in the scene, the set of objects that best matches the description is returned. For example, given the sentence the red pair, the correct pair is identified.

1.2 Towards Perceptually Grounded Language Understanding

Our vision is to build computational systems that can ground the meaning of language to their perceptual context. The term grounding is defined as acquiring the semantics of language by connecting a word, purely symbolic, to its perceptual correlates, purely non-symbolic [11]. The perceptual correlates of a word can span different modalities, such as visual and aural. Grounded language models can be used to build natural language processing systems: smart applications that understand verbal instructions and, in response, perform actions or give a verbal reply.

In our research, our initial motivation was derived from the idea of building a program that can create linguistic descriptions for electronic documents and maps. Such a program could be used by visually impaired and functionally illiterate users. When we see an electronic document we implicitly tend to cluster parts of the document into groups and perceive part/whole relationships between the salient groups. The application we envision can utilize these salient groups to understand the referents of descriptions, and create its own descriptions.

There are other domains of application as well, for example, building conversational robots that can communicate through situated, natural language. The ability to organize visual input and use it for understanding and creating language descriptions would allow a robot to act as an assistive aid, and give the robot a deeper semantic understanding of conceptual aggregate terms. This research is also applicable to building language-enabled, intuitive interfaces for portable devices with small or no displays, e.g., mobile phones.

1.3 The Big Picture

Figure 1-1 shows our entire model. It can be divided into two stages.

The first stage, indicated by the Grouping block, performs grouping on visual scenes. It takes a visual scene segmented into block objects as input, and creates a space of possible salient groups, labeled candidate groups, arising from the scene. We use a weighted-sum strategy to integrate the influence of different visual properties, such as color and proximity.

This stage also assigns a saliency score to each group. In the second stage, visual grounding, denoted in the figure by the Grounding block, the space of salient groups, which is the output of the previous stage, is taken as input along with a linguistic scene description. The visual grounding stage comes up with the best match between a linguistic description and a set of objects. The parameters of this model are learned from positive and negative examples that are derived from human judgement data collected from an experimental visual referent identification task.
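The weighted-sum scoring of candidate groups can be sketched as follows. This is a minimal illustration under our own assumptions, not the thesis implementation: the particular similarity functions, the 1/(1+d) scaling, and the object representation (dicts with x, y, and an RGB color) are ours.

```python
import itertools
from statistics import mean

def color_similarity(a, b):
    """Similarity in RGB space: inverse Euclidean distance, scaled to (0, 1]."""
    d = sum((ca - cb) ** 2 for ca, cb in zip(a["color"], b["color"])) ** 0.5
    return 1.0 / (1.0 + d)

def proximity(a, b):
    """Spatial similarity: inverse distance between object centers, in (0, 1]."""
    d = ((a["x"] - b["x"]) ** 2 + (a["y"] - b["y"]) ** 2) ** 0.5
    return 1.0 / (1.0 + d)

def saliency(group, weights):
    """Saliency score of a candidate group: a weighted sum of the average
    pairwise similarity under each visual property."""
    pairs = list(itertools.combinations(group, 2))
    if not pairs:
        return 0.0
    return (weights["color"] * mean(color_similarity(a, b) for a, b in pairs)
            + weights["proximity"] * mean(proximity(a, b) for a, b in pairs))
```

Under this scheme two nearby, same-colored blocks form a more salient group than two distant, differently colored ones, and the weights control how strongly each property influences the score.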

1.4 Organization

In the next chapter, Chapter 2, we discuss relevant previous research related to perceptual organization, visual perception, and systems that integrate language and vision. In Chapter 3 we describe in detail the visual grouping stage, including a description of our grouping algorithm and the saliency measure of a group. In Chapter 4 we describe how visual grounding is implemented. We give details of our feature selection, the data collection task, and the training of our model. The chapter concludes with a fully worked out example that takes the reader through the entire processing of our model, starting from a scene and a description to the identification of the correct referent. In Chapter 5 we give details of our evaluation task and the results we achieved. In Chapter 6 we conclude, and discuss directions for future research.

[Figure 1-1: block diagram of the model, from the input scene through the Grouping and Grounding stages]

1.5 Contributions

The contributions of this thesis are:

• A computational model for grounding linguistic terms to gestalt perception

• A saliency measure for groups based on a hierarchical clustering framework, using a weighted distance function

• A framework for learning the weights used to combine the influence of each individual perceptual property, from a visual referent identification task

• A program that takes as input a visual scene and a linguistic description, and identifies the correct group referent
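One way to realize a hierarchical clustering framework with a weighted distance function is sketched below. This is our simplified illustration, not the thesis's algorithm: the single-link merge rule, the linear combination of spatial and color distance, and the object representation are all assumptions. Each intermediate merge is recorded as a candidate group, mirroring the idea that the hierarchy generates the space of possible salient groups.

```python
def weighted_distance(a, b, w):
    """Weighted combination of spatial and color distance between two objects."""
    d_xy = ((a["x"] - b["x"]) ** 2 + (a["y"] - b["y"]) ** 2) ** 0.5
    d_col = sum((ca - cb) ** 2 for ca, cb in zip(a["color"], b["color"])) ** 0.5
    return w["proximity"] * d_xy + w["color"] * d_col

def agglomerate(objects, w):
    """Single-link agglomerative clustering. Every merge produced on the way
    to a single cluster is emitted as a candidate group."""
    clusters = [[o] for o in objects]
    candidates = []
    while len(clusters) > 1:
        # find the closest pair of clusters under the weighted distance
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: min(weighted_distance(a, b, w)
                               for a in clusters[ij[0]]
                               for b in clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        candidates.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return candidates
```

On a scene with two tight pairs far apart, the first two candidates produced are the pairs themselves, and the final candidate is the whole scene, giving a nested space of groups at every scale.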

Chapter 2

Background

Our aim is to create a system that connects language and gestalt perception. We want to apply it to the task of identifying referents in a visual scene, specified by a linguistic description. This research connects to two major areas of previous work: the first is visual perception and perceptual organization; the second is building systems that integrate linguistic and visual information.

2.1 Perceptual Organization

Perception involves simultaneous sensing and comprehension of a large number of stimuli.

Yet, we do not see each stimulus as an individual input. Collections of stimuli are organized into groups that serve as cognitive handles for interpreting an agent's environment. As an example, consider the visual scene shown in Figure 2-1. The majority of people would parse the scene as being composed of 3 sets of 2 dots, in other words 3 pairs rather than 6 individual dots. Asked to describe the scene, most observers are likely to say three pairs of dots. This example illustrates two facets of perceptual organization: (a) the grouping of stimuli, e.g., visual stimuli, and (b) the usage of these groups by other cognitive abilities, e.g., language.
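The three-pairs percept can be reproduced by a proximity-only grouping rule. The sketch below is our illustration of the gestalt law of proximity, not the thesis model: it groups points into connected components under a hypothetical distance threshold.

```python
def proximity_groups(points, threshold):
    """Group 2-D points into connected components under the relation
    'closer than threshold', a simple model of grouping by proximity."""
    groups = []
    for p in points:
        # collect every existing group that has a point near p
        near = [g for g in groups
                if any(((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5 < threshold
                       for q in g)]
        # merge p with all such groups (p may bridge several of them)
        merged = [p] + [q for g in near for q in g]
        groups = [g for g in groups if g not in near]
        groups.append(merged)
    return groups

# six dots laid out as three well-separated close pairs
dots = [(0, 0), (1, 0), (5, 0), (6, 0), (10, 0), (11, 0)]
```

With a threshold of 2, the six dots resolve into three groups of two, matching the "three pairs of dots" description most observers would give.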


