# The Manhattan World Assumption: Regularities in scene statistics which enable Bayesian inference

James M. Coughlan and A.L. Yuille
Smith-Kettlewell Eye Research Institute
2318 Fillmore St.
San Francisco, CA 94115
coughlan@ski.org, yuille@ski.org

**Abstract**

Preliminary work by the authors made use of the so-called “Manhattan world” assumption about the scene statistics of city and indoor scenes. This assumption states that such scenes are built on a Cartesian grid, which leads to regularities in the image edge gradient statistics. In this paper we explore the general applicability of this assumption and show that, surprisingly, it holds in a large variety of less structured environments, including rural scenes. This enables us, from a single image, to determine the orientation of the viewer relative to the scene structure and also to detect target objects which are not aligned with the grid. These inferences are performed using a Bayesian model with probability distributions (e.g. on the image gradient statistics) learnt from real data.

**1 Introduction**

In recent years, there has been growing interest in the statistics of natural images (see Huang and Mumford [4] for a recent review). Our focus, however, is on the discovery of scene statistics which are useful for solving visual inference problems.

For example, in related work [5] we have analyzed the statistics of filter responses on and off edges and hence derived effective edge detectors.

In this paper we present results on statistical regularities of the image gradient responses as a function of the global scene structure. This builds on preliminary work [2] on city and indoor scenes, which observed that such scenes are based on a Cartesian coordinate system that puts (probabilistic) constraints on the image gradient statistics.

Our current work shows that this so-called “Manhattan world” assumption about the scene statistics applies far more generally than urban scenes. Many rural scenes contain sufficient structure in the distribution of edges to provide a natural Cartesian reference frame for the viewer. The viewer's orientation relative to this frame can be determined by Bayesian inference. In addition, certain structures in the scene stand out by being unaligned to this natural reference frame. In our theory such structures appear as “outlier” edges, which makes them easier to detect. Informal evidence that human observers use a form of the Manhattan world assumption is provided by the Ames room illusion, see figure (6), where observers appear to erroneously make this assumption, thereby grotesquely distorting the sizes of objects in the room.

**2 Previous Work and Three-Dimensional Geometry**

Our preliminary work on city scenes was presented in [2]. There is related work in computer vision on the detection of vanishing points in 3-d scenes [1], [6], which proceeds through the stages of edge detection, grouping by Hough transforms, and finally the estimation of the geometry.

We refer the reader to [3] for details on the geometry of the Manhattan world and report only the main results here. Briefly, we calculate expressions for the orientations of x, y, z lines imaged under perspective projection in terms of the orientation of the camera relative to the x, y, z axes. The camera orientation relative to the xyz axis system may be specified by three Euler angles: the azimuth (or compass angle) α, corresponding to rotation about the z axis; the elevation β above the xy plane; and the twist γ about the camera's line of sight. We use Ψ = (α, β, γ) to denote all three Euler angles of the camera orientation. Our previous work [2] assumed that the elevation and twist were both zero, which turned out to be invalid for many of the images presented in this paper.

We can then compute the normal orientation of lines parallel to the x, y, z axes, measured in the image plane, as a function of film coordinates (u, v) and the camera orientation Ψ. We express the results in terms of orthogonal unit camera axes a, b and c, which are aligned to the body of the camera and are determined by Ψ. For x lines (see Figure 1, left panel) we have tan θ_x = −(u c_x + f a_x)/(v c_x + f b_x), where θ_x is the normal orientation of the x line at film coordinates (u, v) and f is the focal length of the camera. Similarly, tan θ_y = −(u c_y + f a_y)/(v c_y + f b_y) for y lines and tan θ_z = −(u c_z + f a_z)/(v c_z + f b_z) for z lines. In the next section we will see how to relate the normal orientation of an object boundary (such as an x, y, or z line) at a point (u, v) to the magnitude and direction of the image gradient at that location.
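As a concrete illustration, the orientation formulas above can be coded directly. The following is a minimal Python/NumPy sketch; the particular Euler-angle composition used for the camera rotation is our assumption (the paper defers the exact convention to [3]), and the function names are ours:

```python
import numpy as np

def camera_axes(alpha, beta, gamma):
    """Unit camera axes a, b, c from Euler angles: azimuth alpha (about z),
    elevation beta, and twist gamma (about the line of sight).
    The composition order below is an assumed convention."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    Rz = np.array([[ca, -sa, 0], [sa, ca, 0], [0, 0, 1]])   # azimuth
    Rx = np.array([[1, 0, 0], [0, cb, -sb], [0, sb, cb]])   # elevation
    Rt = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])   # twist
    R = Rt @ Rx @ Rz
    return R[0], R[1], R[2]   # rows of R give a, b, c

def normal_orientation(u, v, axis, a, b, c, f):
    """Normal orientation theta of an x/y/z line at film coordinates (u, v):
    tan(theta_i) = -(u*c_i + f*a_i) / (v*c_i + f*b_i)."""
    i = {'x': 0, 'y': 1, 'z': 2}[axis]
    return np.arctan2(-(u * c[i] + f * a[i]), v * c[i] + f * b[i])
```

Using `arctan2` rather than `arctan` keeps the orientation well defined even when the denominator vanishes.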


Figure 1: (Left) Geometry of an x line projected onto (u, v) image plane. θ is the normal orientation of the line in the image. (Right) Histogram of edge orientation error (displayed modulo 180◦ ). Observe the strong peak at 0◦, indicating that the image gradient direction at an edge is usually very close to the true normal orientation of the edge.

**3 P_on and P_off: Characterizing Edges Statistically**

Since we do not know where the x, y, z lines are in the image, we have to infer their locations and orientations from image gradient information. This inference is done using a purely local statistical model of edges. A key element of our approach is that it allows the model to infer camera orientation without having to group pixels into x, y, z lines. Most grouping procedures rely on binary edge maps, which often make premature decisions based on too little information. Edge detection is made particularly difficult by the poor quality of some of the images – underexposed and overexposed – and by the fact that some of the images lack x, y, z lines long enough to group reliably.

Following work by Konishi et al. [5], we determine the distributions P_on(E_u) and P_off(E_u) of the image gradient magnitude E_u at position u in the image, conditioned on whether we are on or off an edge. These distributions quantify the tendency of the image gradient to be high on object boundaries and low off them, see Figure 2. They were learned by Konishi et al. from the Sowerby image database, which contains one hundred presegmented images.
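The learning step amounts to histogramming gradient magnitudes separately over on-edge and off-edge pixels of a presegmented database. A sketch, where the bin count and the normalization details are illustrative guesses rather than the settings used by Konishi et al.:

```python
import numpy as np

def learn_on_off_histograms(grad_mag, on_edge, bins=20):
    """Estimate P_on and P_off as normalized histograms of the gradient
    magnitude over on-edge and off-edge pixels (labels taken from a
    presegmented database). Returns both histograms and the shared bin edges."""
    bin_edges = np.linspace(0.0, grad_mag.max(), bins + 1)
    on_counts, _ = np.histogram(grad_mag[on_edge], bins=bin_edges)
    off_counts, _ = np.histogram(grad_mag[~on_edge], bins=bin_edges)
    return (on_counts / on_counts.sum(),
            off_counts / off_counts.sum(),
            bin_edges)
```

Because each histogram is normalized separately, the two act as conditional distributions P(E_u | on) and P(E_u | off) over the same bins.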

Figure 2: The on-edge and off-edge distributions P_on(E_u) and P_off(E_u) of the image gradient magnitude.

We extend the work of Konishi et al by putting probability distributions on how accurately the image gradient direction estimates the true normal direction of the edge. These were learned for this dataset by measuring the true orientations of the edges and comparing them to those estimated from the image gradients.

This gives us distributions on the magnitude and direction of the intensity gradient, P_on(E_u|θ) and P_off(E_u), where the vector E_u = (E_u, φ_u) combines the gradient magnitude E_u and gradient direction φ_u measured at point u = (u, v), and θ is the true normal orientation of the edge. We make a factorization assumption that P_on(E_u|θ) = P_on(E_u) P_ang(φ_u − θ) and P_off(E_u) = P_off(E_u) U(φ_u). P_ang(·) (with its argument evaluated modulo 2π and normalized to 1 over the range 0 to 2π) is based on experimental data, see Figure 1 (right), and is peaked about 0 and π. In practice, we use a simple box-shaped function to model the distribution: P_ang(δθ) = (1 − ε)/4τ if δθ is within angle τ of 0 or π, and ε/(2π − 4τ) otherwise (i.e. the chance of an angular error greater than ±τ is ε).

In our experiments ε = 0.1, and τ = 4° for indoor scenes and 6° for outdoor scenes. By contrast, U(·) = 1/2π is the uniform distribution.
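The box-shaped model is simple to implement. A sketch, where folding the error modulo π is our reading of "within τ of 0 or π":

```python
import numpy as np

def p_ang(dtheta, eps=0.1, tau=np.deg2rad(4)):
    """Box-shaped angular error density on [0, 2*pi): mass 1-eps is spread
    uniformly over the tau-wide windows around 0 and pi, mass eps over the rest."""
    d = np.mod(dtheta, np.pi)        # fold so the peaks at 0 and pi coincide
    d = np.minimum(d, np.pi - d)     # angular distance to the nearest peak
    return np.where(d <= tau,
                    (1.0 - eps) / (4.0 * tau),
                    eps / (2.0 * np.pi - 4.0 * tau))
```

One can check that this integrates to one over [0, 2π), with total mass 1 − ε inside the ±τ windows, matching the normalization stated above.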

**4 Bayesian Model**

We devised a Bayesian model which combines knowledge of the three-dimensional geometry of the Manhattan world with statistical knowledge of edges in images. The model assumes that, while the majority of pixels in the image convey no information about camera orientation, most of the pixels with high edge responses arise from the presence of x, y, z lines in the three-dimensional scene. An important feature of the Bayesian model is that it does not force us to decide prematurely which pixels are on and off an object boundary (or whether an on pixel is due to x, y, or z), but allows us to sum over all possible interpretations of each pixel.

The image data E_u at a single pixel u is explained by one of five models m_u: m_u = 1, 2, 3 mean the data is generated by an edge due to an x, y, or z line, respectively, in the scene; m_u = 4 means the data is generated by an outlier edge (not due to an x, y, z line); and m_u = 5 means the pixel is off-edge. The prior probability P(m_u) of each model was estimated empirically to be 0.02, 0.02, 0.02, 0.04, 0.9 for m_u = 1, 2, ..., 5.

Using the factorization assumption mentioned before, we assume the probability of the image data E_u has two factors, one for the magnitude of the edge strength and another for the edge direction:

P(E_u | m_u, Ψ, u) = P(E_u | m_u) P(φ_u | m_u, Ψ, u)   (1)

where P(E_u | m_u) equals P_off(E_u) if m_u = 5 and P_on(E_u) if m_u ≠ 5. Also, P(φ_u | m_u, Ψ, u) equals P_ang(φ_u − θ(Ψ, m_u, u)) if m_u = 1, 2, 3 and U(φ_u) if m_u = 4, 5.

Here θ(Ψ, m_u, u) is the predicted normal orientation of lines determined by the equations tan θ_x = −(u c_x + f a_x)/(v c_x + f b_x) for x lines, tan θ_y = −(u c_y + f a_y)/(v c_y + f b_y) for y lines, and tan θ_z = −(u c_z + f a_z)/(v c_z + f b_z) for z lines.

In summary, the edge strength probability is modeled by P_on for models 1 through 4 and by P_off for model 5. For models 1, 2 and 3 the edge orientation is modeled by a distribution peaked about the orientation of an x, y, or z line predicted by the camera orientation at pixel location u; for models 4 and 5 the edge orientation is assumed to be uniformly distributed over 0 to 2π.
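The case analysis of Eq. (1) amounts to a small lookup per pixel. A sketch, with the learned distributions passed in as plain callables; the function and argument names here are ours, not the paper's:

```python
import numpy as np

def pixel_likelihood(E, phi, m, theta_pred, p_on, p_off, p_ang):
    """P(E_u | m_u, Psi, u) per Eq. (1): an edge-strength factor times an
    angle factor. theta_pred maps m in {1, 2, 3} to the predicted x/y/z
    normal orientation at this pixel under the candidate camera orientation."""
    strength = p_off(E) if m == 5 else p_on(E)   # P_on for models 1-4, P_off for 5
    if m in (1, 2, 3):
        angle = p_ang(phi - theta_pred[m])       # peaked about the predicted normal
    else:
        angle = 1.0 / (2.0 * np.pi)              # uniform direction, models 4 and 5
    return strength * angle
```

Keeping the distributions as callables means the same function works whether they come from learned histograms or analytic forms.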

Rather than decide on a particular model at each pixel, we marginalize over all five possible models (i.e. creating a mixture model):

P(E_u | Ψ, u) = Σ_{m_u=1}^{5} P(E_u | m_u, Ψ, u) P(m_u)   (2)

(Although the conditional independence assumption neglects the coupling of gradients at neighboring pixels, it is a useful approximation that makes the model computationally tractable.) Thus the posterior distribution on the camera orientation is given by ∏_u P(E_u | Ψ, u) P(Ψ)/Z, where Z is a normalization factor and P(Ψ) is a uniform prior on the camera orientation.

To find the MAP (maximum a posteriori) estimate, our algorithm maximizes the log posterior term log[P({E_u}|Ψ)P(Ψ)] = log P(Ψ) + Σ_u log[Σ_{m_u} P(E_u | m_u, Ψ, u) P(m_u)] numerically by searching over a quantized set of compass directions Ψ in a certain range. For details on this procedure, as well as coarse-to-fine techniques for speeding up the search, see [3].
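The quantized search can be sketched as a plain grid search over candidate orientations; the coarse-to-fine speedups of [3] are omitted, and `likelihood` stands in for the per-pixel term P(E_u | m_u, Ψ, u):

```python
import numpy as np

PRIOR = [0.02, 0.02, 0.02, 0.04, 0.90]  # P(m_u) for m_u = 1..5

def map_orientation(pixels, candidates, likelihood):
    """Exhaustive search over a quantized set of orientations Psi for the MAP
    estimate. For each candidate, sum over pixels the log of the five-model
    mixture of Eq. (2); the uniform prior P(Psi) drops out of the argmax."""
    best_psi, best_lp = None, -np.inf
    for psi in candidates:
        lp = sum(np.log(sum(likelihood(psi, pix, m) * PRIOR[m - 1]
                            for m in range(1, 6)))
                 for pix in pixels)
        if lp > best_lp:
            best_psi, best_lp = psi, lp
    return best_psi, best_lp
```

Summing log-mixtures per pixel, rather than multiplying raw likelihoods, avoids numerical underflow when many pixels contribute.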

**5 Experimental Results**

This section presents results on the domains for which the viewer orientation relative to the scene can be detected using the Manhattan world assumption. In particular, we demonstrate results for: (I) indoor and outdoor scenes (as reported in [2]), (II) rural English road scenes, (III) rural English fields, (IV) a painting of the French countryside, (V) a field of broccoli in the American mid-west, (VI) the Ames room, and (VII) the ruins of the Parthenon (in Athens). The results show strong success for inference using the Manhattan world assumption even in domains where it might seem unlikely to apply. (Some examples of failure are given in [3]; for example, a helicopter in a hilly scene, where the algorithm mistakenly interprets the hill silhouettes as horizontal lines.)

The first set of images was of city and indoor scenes in San Francisco, taken by the second author [2]. We include four typical results, see figure 3, for comparison with the results on other domains.

Figure 3: Estimates of the camera orientation obtained by our algorithm for two indoor scenes (left) and two outdoor scenes (right). The estimated orientations of the x, y lines, derived for the estimated camera orientation Ψ, are indicated by the black line segments drawn on the input image. (The z line orientations have been omitted for clarity.) At each point on a subgrid two such segments are drawn – one for x and one for y. In the image on the far left, observe how the x directions align with the wall on the right hand side and with features parallel to this wall. The y lines align with the wall on the left (and objects parallel to it).

We now extend this work to less structured scenes in the English countryside. Figure (4) shows two images of roads in rural scenes and two images of fields. These images come from the Sowerby database. The next three images, shown in figure (5), were either downloaded from the web or digitized (the painting): the mid-west broccoli field, the Parthenon ruins, and the painting of the French countryside.

**6 Detecting Objects in Manhattan World**

We now consider applying the Manhattan assumption to the alternative problem of detecting target objects in background clutter. To perform such a task effectively requires modelling the properties of the background clutter in addition to those of the target object. It has recently been appreciated that good statistical modelling of the image background can improve the performance of target recognition [7].

The Manhattan world assumption gives an alternative way of probabilistically modelling background clutter. The background clutter will correspond to the regular structure of buildings and roads, and its edges will be aligned to the Manhattan grid. The target object, however, is assumed to be unaligned (at least in part) to this grid. Therefore many of the edges of the target object will be assigned to model 4 by the algorithm. (Note that the algorithm first finds the MAP estimate Ψ* of the compass orientation, see section (4), and then estimates the model at each pixel u by taking the MAP of P(m_u | E_u, Ψ*, u).) This enables us to significantly simplify the detection task by removing all edges in the images except those assigned to model 4.

Figure 4: Results on rural images in England without strong Manhattan structure. Same conventions as before. Two images of roads in the countryside (left panels) and two images of fields (right panels).

Figure 5: Results on an American mid-west broccoli field, the ruins of the Parthenon, and a digitized painting of the French countryside.
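The per-pixel model assignment described above can be sketched as follows, again with the per-pixel likelihood passed in as a callable and the names being ours:

```python
import numpy as np

PRIOR = [0.02, 0.02, 0.02, 0.04, 0.90]  # P(m_u) for m_u = 1..5

def assign_models(pixels, psi_star, likelihood):
    """Given the estimated orientation Psi*, label each pixel with the model
    maximizing P(m_u | E_u, Psi*, u); since the evidence is constant per pixel,
    the unnormalized product likelihood * prior suffices for the argmax.
    Pixels labelled 4 are the candidate 'outlier' (target) edges."""
    labels = []
    for pix in pixels:
        posterior = [likelihood(psi_star, pix, m) * PRIOR[m - 1]
                     for m in range(1, 6)]
        labels.append(1 + int(np.argmax(posterior)))
    return labels
```

Filtering the edge map down to the pixels labelled 4 then implements the simplification described above: only edges unaligned with the Manhattan grid survive.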