On the Use of Functionals on Boundaries
in Hierarchical Models of
Ian Hyla Jermyn
A dissertation submitted in partial fulﬁllment
of the requirements for the degree of
Doctor of Philosophy
Department of Computer Science
New York University
© Ian Jermyn
All Rights Reserved, 2000
If there were any real proof that the sun is in the centre of the universe and the
earth in the third heaven, and that the sun does not go around the earth, but the earth round the sun, then we would have to proceed with great circumspection in explaining passages of scripture which appear to teach the contrary, and rather admit that we did not understand them, than declare an opinion to be false which is proved to be true. As for myself, I shall not believe that there are such proofs until they are shown to me. Nor is it proof that, if the sun be supposed to be at the centre of the universe and the earth in the third heaven, everything works out the same as if it were the other way around.
Cardinal Roberto Bellarmino, Master of Controversial Questions at the Collegio Romano, in a letter of 12 April 1615 to Paolo Antonio Foscarini, Carmelite monk, replying to an enquiry about the truth of the Copernican system (Opere, Vol. 12, p.
* * *

Indeed, someone who does philosophy or psychology will perhaps say “I feel that I think in my head”. But what that means he won’t be able to say. For he will not be able to say what kind of feeling that is; but merely to use the expression that he ‘feels’;
as if he were saying “I feel this stitch here”. Thus he is unaware that it remains to be investigated what his expression “I feel” means here, that is to say: what consequences we are permitted to draw from this utterance. Whether we may draw the same ones as we would from the utterance “I feel a stitch here”.
Ludwig Wittgenstein, Remarks on the Philosophy of Psychology, Vol. 1, # 350, Basil Blackwell, Oxford, 1980. Translated by G. E. M. Anscombe.
DEDICATION

To Leslie, who has shown me what is truly important, and In Memory of my Beloved Grandmother, Elizabeth Norah Jermyn.
Acknowledgements

My greatest thanks go to Davi Geiger who, after indulging my inclination to other things, showed me the computer vision light. As well as all the advice, encouragement, psychotherapy, and stern admonishments that he has given so well and so freely over the years, we have also become friends. I thank him very much for everything.
Hiroshi Ishikawa is my great and good friend. We have shared many things, particularly our adventure in Rio, during which, fuelled by cafezinhos and caipirinhas, we began the collaboration that produced much of the work in this thesis. I thank him very much for his friendship, his humour, and his thought.
I would like to thank Pete Wyckoff, who befriended me when I was a novice Englishman in New York, who introduced me to Leslie, and who has been a dear and steadfast friend ever since. My life now would be quite different without him.
Thank you to Fabian Monrose for being my fellow un-American and for sharing my love of music. When New York becomes too much, I know I can always rely on him for sympathy and great chicken.
When I visited India with Leslie in 1995, Laxmi Parida invited us into her home and showed us the beauty of her country. A growing friendship was sealed. Thank you so much to her for being a listening ear, a stimulus to thought and a deadly pool opponent.
Thank you to Ken Pao for his friendship and gentle company in the ofﬁce and outside it.
Au revoir to all these good friends. Remember that Antibes is a good place to visit.
Thank you very much to David Jacobs, who besides being my kindly boss during the summer at NECI, has acted as a second advisor. His calm support has been a great help.
Thank you to Ernie Davis for his perhaps unknowing encouragement to me when I needed it. I apologize to him for the lack of real AI in this thesis.
Thank you to Nava Rubin, for serving on my thesis committee, and for providing me with food for thought through her work.
Thank you to Alan Siegel for his attempt to convince me of the error of my ways.
Thank you to everyone at Courant, particularly Anina, Rosemary and Lourdes, who are continually friendly and helpful and kind.
I would like to thank the Instituto de Matemática Pura e Aplicada in Rio de Janeiro for their generous hospitality during the above-mentioned sojourn.
I will never forget the monkeys and butterﬂies and the steamy sounds of the tropical forest ﬂoating through my ofﬁce window.
Finally, thank you to my family, my parents Richard and Leonie, and my brother and sister Phil and Anna, for always being there, quietly supporting and encouraging me.
New York City, August 7, 2000
Abstract

Object recognition is a central problem in computer vision. Typically it is assumed to follow a sequential model in which successively more speciﬁc hypotheses are generated about the image. This is a rather simplistic model, allowing as it does no margin for error at any point. We follow a more general approach in which the various representations involved are allowed to inﬂuence one another from the outset. As a guide and ultimate goal, we study the problem of ﬁnding the region occupied by human beings in images, and the separation of the region into arms, legs and head. We approach the problem as that of deﬁning a functional on the space of boundaries in images whose minimum speciﬁes the region occupied by the human ﬁgure.
Previous work that uses such functionals suffers from a number of difﬁculties. These include an uncontrollable dependence on scale, an inability to ﬁnd the global minimum for boundaries in polynomial time, and the inability to include region as well as boundary information. We present a new form of functional on boundaries in a manifold that solves these problems, and is also the unique form of functional in a speciﬁc class that possesses a nontrivial, efﬁciently computable global minimum. We describe applications of the model to single images and to the extraction of boundaries from stereo pairs and motion sequences.
In addition, the functionals used in previous work could not include information about the shape of the region sought. We develop a model for the part structures of boundaries that extends previous work to the case of real images, thus including shape information in the functional framework. We show that such part structures are hyperpaths in a hypergraph. An ‘optimal hyperpath’ algorithm is developed that globally minimizes the functional under some conditions.
We show how to use exemplars of a shape to construct a functional that includes speciﬁc information about the topology of the part structure sought.
An algorithm is developed that globally minimizes such functionals in the case of a ﬁxed boundary. The behaviour of the functional mimics an aspect of human shape comparison.
A background is drawn for the work. The study of vision is difﬁcult both philosophically and practically, but the notion of seeing machines clariﬁes the issues somewhat. A deﬁnition of a visual system as a module of a seeing machine is given, and this necessitates a discussion of image semantics as the appropriate output of a visual system. The ideas discussed are formalized using probability theory and working assumptions used to render the problem tractable. We then consider brieﬂy what it means to test a visual system empirically.
The nature of vision is obscure. To a great extent this reﬂects the difﬁculties associated with any discussion of mental phenomena, whether in the biological/psychological sciences or in computer science. Indeed the very use of the word phenomena here is misleading. What we refer to as mental phenomena are exclusively experiences of ourselves, unless we count particular physical and chemical measurements that may be made on our brains and whose connection to the ﬁrst kind of mental phenomena is largely unknown.
These experiences are not phenomena in the same sense that the behavior of a falling object is a phenomenon. Others do not observe my ‘mental phenomena’. They may hear me speak as if I have observed something, but we do not observe ourselves as we observe a physical event or even as we observe others, except metaphorically. It is not clear what we mean when we say that we ‘see’ something or that we ‘recognize’ an object, once we step outside the normal realms of discourse and attempt to analyze such statements in the abstract. For example, what does it mean to ask the questions “do we recognize every object in our ﬁeld of view?” or “do we see every object in our ﬁeld of view?”? Avoiding the dilemmas and confusions raised by these issues is not always easy.
By way of contrast, computer vision is the attempt to construct seeing machines. In full generality, a seeing machine is any machine that uses images to help accomplish a task. Such tasks are extremely varied. They range over almost all of human and animal activity: counting widgets passing by on a conveyor belt; navigating through a complex environment; extracting the region corresponding to a human being in an image; animation; copying a design; handwriting recognition; and on and on. Human beings allegedly devote a third of the volume of their brains to visual processing, which gives some indication of the problems facing computer vision. Nevertheless, by approaching the study of vision in this operational way, it is to be hoped that we can avoid the philosophical concerns mentioned in the ﬁrst paragraph, and eventually shed some light on what we are talking about when we discuss human vision, as well as constructing useful technology along the way.
The ﬁrst thing we will do however, is to make a simplifying assumption that reduces the operational content of our model. We will postulate a separation between those parts of the machine that deal with the images themselves and those parts that perform other tasks such as planning or locomotion. The picture is of a ‘module’ (called the visual system) that takes images as input (the images are made available according to a plan formulated elsewhere in the machine), and that produces as output statements about the image. Such a picture has advantages and disadvantages. On the positive side, it is a useful abstraction since we are not forced to contemplate general intelligent behaviour in addition to the already formidable difﬁculties of image understanding, and it opens the possibility of discovering task-independent methods. On the negative side, the separation means that we must now test the performance of the visual system independently of a speciﬁc task. In what could such a test consist? We are forced to refer the notion of image understanding to human performance, since that is the only visual system to whose output we have access.
1. IMAGE SEMANTICS In performing a given task, the images used by the seeing machine will be endowed with a semantics. This semantics encodes what the seeing machine as a whole does with the images it acquires: what consequences it can draw from these images. A semantics can be thought of as a collection of statements about the image that are true. In general the semantics will clearly depend on the task. The job of the visual system is to output a statement from the semantics on receiving an image as input.
In order for the semantics to be testable in any meaningful way, the relevant people must agree on the statements in it: ground truth is established by human consensus. This may be because the semantics is agreed upon for a speciﬁc type of image and task, for example a blueprint, but often this is not the case. For example, the statement that there is a black rectangle at such and such a location in ﬁgure 1 is unlikely to produce disagreement among observers. On the other hand, the statement that this image is a picture of a
FIGURE 1. A black rectangle or a book?
black book might well, and yet it is not an unreasonable interpretation of the image. While this may seem to subjectivize the notion of the meaning of an image, in practice it is all that we have once we separate visual understanding from task performance. In the future, given a theory as to why we divide the world into the objects and concepts that we use (such a theory is not inconceivable: perhaps there is an informationally optimal way to do this, to which human understanding is an approximation), this situation might be changed. In the meantime, human consensus is what we mean by image understanding.
In order to compare two visual systems, we must have not only the notion of ground truth provided by human consensus, but also a notion of how ‘close’ to correct a given statement is. Given the output of a visual system on a particular image, this latter notion (an evaluation function) will compare the statement to the image semantics and output a real number, the evaluation of the output. Two visual systems can then be compared by, for example, using a probability distribution of possible inputs and computing the mean evaluations. The evaluation function is not given a priori. It too must be agreed upon, and will in general be task-dependent. In fact, in a typical task the evaluation function will depend upon a number of other factors that only logically become available to us once we consider the task itself. For example, the resources needed for the visual system to output its statement might be extremely important in reality, and may offset the accuracy of the result. These factors are completely task-dependent and we do not consider them further except to ensure that they are not prohibitive (for example, an algorithm that takes time exponential in the size of the input).
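The comparison just described can be made concrete in a small sketch. Here a visual system is modelled as a function from images to statements, and an evaluation function scores a statement against the consensus semantics for that image; systems are compared by their mean evaluation over a sample of inputs. All names, the 0/1 evaluation function, and the toy data are illustrative assumptions, not definitions from this thesis.

```python
# Hedged sketch: comparing two 'visual systems' by mean evaluation.
# The evaluation function here is a toy: 1.0 if the output statement
# matches the consensus ground truth exactly, 0.0 otherwise. In general
# the evaluation function is task-dependent and must itself be agreed upon.

def evaluate(statement, ground_truth):
    """Score a statement against the agreed semantics for an image."""
    return 1.0 if statement == ground_truth else 0.0

def mean_evaluation(visual_system, sample):
    """Mean evaluation over a sample of (image, ground_truth) pairs,
    standing in for an expectation over the distribution of inputs."""
    scores = [evaluate(visual_system(image), truth) for image, truth in sample]
    return sum(scores) / len(scores)

# Toy 'images' paired with their consensus semantics (cf. Figure 1).
sample = [("img_rectangle", "black rectangle"),
          ("img_book", "black book")]

# System A always outputs the same statement; system B knows the answers.
system_a = lambda image: "black rectangle"
system_b = lambda image: {"img_rectangle": "black rectangle",
                          "img_book": "black book"}[image]

print(mean_evaluation(system_a, sample))  # 0.5
print(mean_evaluation(system_b, sample))  # 1.0
```

The real difficulty, of course, lies in agreeing on the semantics and on the evaluation function, not in computing the mean once both are fixed.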
It is hard to give a clearly deﬁned semantics for many images. For example, depictions of real scenes can be given a semantics by making statements about possible scenes preﬁxed by “If a real scene had generated the image, then in that scene... ”. The problem is that in some cases there may not be enough consensus to render such statements free of their dependence on the speaker.