
Foundations and Trends® in Computer Graphics and Vision

Vol. 2, No. 1 (2006) 1–104

© 2006 R. Szeliski

DOI: 10.1561/0600000009

# Image Alignment and Stitching: A Tutorial

Richard Szeliski

Microsoft Research, USA, szeliski@microsoft.com

Abstract

This tutorial reviews image alignment and image stitching algorithms.

Image alignment algorithms can discover the correspondence relationships among images with varying degrees of overlap. They are ideally suited for applications such as video stabilization, summarization, and the creation of panoramic mosaics. Image stitching algorithms take the alignment estimates produced by such registration algorithms and blend the images in a seamless manner, taking care to deal with potential problems such as blurring or ghosting caused by parallax and scene movement, as well as varying image exposures. This tutorial reviews the basic motion models underlying alignment and stitching algorithms, describes effective direct (pixel-based) and feature-based alignment algorithms, and describes blending algorithms used to produce seamless mosaics. It ends with a discussion of open research problems in the area.

## 1 Introduction

Algorithms for aligning images and stitching them into seamless photo-mosaics are among the oldest and most widely used in computer vision.

Frame-rate image alignment is used in every camcorder that has an “image stabilization” feature. Image stitching algorithms create the high-resolution photo-mosaics used to produce today’s digital maps and satellite photos. They also come bundled with most digital cameras currently being sold, and can be used to create beautiful ultra wide-angle panoramas.

An early example of a widely used image registration algorithm is the patch-based translational alignment (optical flow) technique developed by Lucas and Kanade [123]. Variants of this algorithm are used in almost all motion-compensated video compression schemes such as MPEG and H.263 [113]. Similar parametric motion estimation algorithms have found a wide variety of applications, including video summarization [20, 93, 111, 203], video stabilization [81], and video compression [95, 114]. More sophisticated image registration algorithms have also been developed for medical imaging and remote sensing – see [29, 71, 226] for some previous surveys of image registration techniques.

In the photogrammetry community, more manually intensive methods based on surveyed ground control points or manually registered tie points have long been used to register aerial photos into large-scale photo-mosaics [181]. One of the key advances in this community was the development of bundle adjustment algorithms that could simultaneously solve for the locations of all of the camera positions, thus yielding globally consistent solutions [207]. One of the recurring problems in creating photo-mosaics is the elimination of visible seams, for which a variety of techniques have been developed over the years [1, 50, 135, 136, 148].

In film photography, special cameras were developed at the turn of the century to take ultra wide-angle panoramas, often by exposing the film through a vertical slit as the camera rotated on its axis [131]. In the mid-1990s, image alignment techniques began to be applied to the construction of wide-angle seamless panoramas from regular hand-held cameras [43, 124, 193, 194]. More recent work in this area has addressed the need to compute globally consistent alignments [167, 178, 199], the removal of “ghosts” due to parallax and object movement [1, 50, 178, 210], and dealing with varying exposures [1, 116, 124, 210]. (A collection of some of these papers can be found in [19].) These techniques have spawned a large number of commercial stitching products [43, 168], for which reviews and comparisons can be found on the Web.

While most of the above techniques work by directly minimizing pixel-to-pixel dissimilarities, a different class of algorithms works by extracting a sparse set of features and then matching these to each other [7, 30, 35, 38, 129, 227]. Feature-based approaches have the advantage of being more robust against scene movement and are potentially faster, if implemented the right way. Their biggest advantage, however, is the ability to “recognize panoramas,” i.e., to automatically discover the adjacency (overlap) relationships among an unordered set of images, which makes them ideally suited for fully automated stitching of panoramas taken by casual users [30].

What, then, are the essential problems in image alignment and stitching? For image alignment, we must first determine the appropriate mathematical model relating pixel coordinates in one image to pixel coordinates in another. Section 2 reviews these basic motion models. Next, we must somehow estimate the correct alignments relating various pairs (or collections) of images. Section 3 discusses how direct pixel-to-pixel comparisons combined with gradient descent (and other optimization techniques) can be used to estimate these parameters. Section 4 discusses how distinctive features can be found in each image and then efficiently matched to rapidly establish correspondences between pairs of images. When multiple images exist in a panorama, techniques must be developed to compute a globally consistent set of alignments and to efficiently discover which images overlap one another.

These issues are discussed in Section 5.

For image stitching, we must first choose a final compositing surface onto which to warp and place all of the aligned images (Section 6).

We also need to develop algorithms to seamlessly blend overlapping images, even in the presence of parallax, lens distortion, scene motion, and exposure differences (Section 6). In the last section of this survey, additional applications of image stitching and open research problems are discussed.

## 2 Motion Models

Before we can register and align images, we need to establish the mathematical relationships that map pixel coordinates from one image to another. A variety of such parametric motion models are possible, from simple 2D transforms, to planar perspective models, 3D camera rotations, lens distortions, and the mapping to nonplanar (e.g., cylindrical) surfaces [194].

To facilitate working with images at different resolutions, we adopt a variant of the normalized device coordinates used in computer graphics [145, 216]. For a typical (rectangular) image or video frame, we let the pixel coordinates range from [−1, 1] along the longer axis, and [−a, a] along the shorter, where a is the inverse of the aspect ratio, as shown in Figure 2.1.¹ For an image with width W and height H, the equations mapping integer pixel coordinates x = (x, y) to normalized device coordinates are

$$x = \frac{2x - W}{S} \quad \text{and} \quad y = \frac{2y - H}{S}, \quad \text{where } S = \max(W, H). \tag{2.1}$$

¹ In computer graphics, it is usual to have both axes range from [−1, 1], but this requires the use of two different focal lengths for the vertical and horizontal dimensions, and makes it more awkward to handle mixed portrait and landscape mode images.
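As a concrete check on the mapping in Eq. (2.1), here is a minimal sketch in Python (the helper name `to_ndc` is my own, not from the tutorial):

```python
def to_ndc(x, y, W, H):
    """Map integer pixel coordinates to normalized device coordinates
    per Eq. (2.1): the longer axis spans [-1, 1], the shorter [-a, a]."""
    S = max(W, H)
    return (2 * x - W) / S, (2 * y - H) / S

# For a 640x480 (landscape) image, a = H / W = 0.75:
print(to_ndc(0, 0, 640, 480))      # top-left corner -> (-1.0, -0.75)
print(to_ndc(640, 480, 640, 480))  # bottom-right corner -> (1.0, 0.75)
```

Note how the same scale S is applied to both axes, which is what lets a single focal length serve both dimensions.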

Fig. 2.1 Mapping from pixel coordinates to normalized device coordinates.

Note that if we work with images in a pyramid, we need to halve the S value after each decimation step rather than recomputing it from max(W, H), since the (W, H) values may get rounded off or truncated in an unpredictable manner. Note that for the rest of this paper, we use normalized device coordinates when referring to pixel coordinates.

### 2.1 2D (Planar) Motions

Having defined our coordinate system, we can now describe how coordinates are transformed. The simplest transformations occur in the 2D plane and are illustrated in Figure 2.2.

**Table 2.1 Hierarchy of 2D coordinate transformations.** The 2 × 3 matrices are extended with a third [0ᵀ 1] row to form a full 3 × 3 matrix for homogeneous coordinate transformations.

| Name | Matrix | # d.o.f. | Preserves |
|---|---|---|---|
| translation | [I t] (2 × 3) | 2 | orientation |
| rigid (Euclidean) | [R t] (2 × 3) | 3 | lengths |
| similarity | [sR t] (2 × 3) | 4 | angles |
| affine | [A] (2 × 3) | 6 | parallelism |
| projective | [H̃] (3 × 3) | 8 | straight lines |

**Hierarchy of 2D Transformations.** The preceding set of transformations is illustrated in Figure 2.2 and summarized in Table 2.1. The easiest way to think of these is as a set of (potentially restricted) 3 × 3 matrices operating on 2D homogeneous coordinate vectors. Hartley and Zisserman [86] contains a more detailed description of the hierarchy of 2D planar transformations.

The above transformations form a nested set of groups, i.e., they are closed under composition and have an inverse that is a member of the same group. Each (simpler) group is a subset of the more complex group below it.
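The closure properties described above can be verified numerically. The sketch below (Python with NumPy; `similarity` is an illustrative helper of my own) composes two 4-d.o.f. similarity transforms and checks that both the product and the inverse remain similarities:

```python
import numpy as np

def similarity(s, theta, tx, ty):
    """3x3 homogeneous similarity: scaled rotation plus translation (4 d.o.f.)."""
    c, si = s * np.cos(theta), s * np.sin(theta)
    return np.array([[c, -si, tx],
                     [si,  c,  ty],
                     [0.,  0., 1.]])

A = similarity(2.0,  np.pi / 4, 1.0, 0.0)
B = similarity(0.5, -np.pi / 4, 0.0, 2.0)

# Closure under composition: scales multiply (2 * 0.5 = 1) and angles add
# (pi/4 - pi/4 = 0), so the upper-left 2x2 block of C is the identity.
C = A @ B
assert np.allclose(C[:2, :2], np.eye(2))

# Closure under inversion: the inverse undoes the transform and is itself
# a similarity (with scale 1/s and angle -theta).
pt = np.array([1.0, 2.0, 1.0])   # homogeneous 2D point
assert np.allclose(np.linalg.inv(A) @ (A @ pt), pt)
```

The same check works for any of the groups in Table 2.1, since each is closed under composition and inversion.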

### 2.2 3D Transformations

A similar nested hierarchy exists for 3D coordinate transformations that can be denoted using 4 × 4 transformation matrices, with 3D equivalents to translation, rigid body (Euclidean) and affine transformations, and homographies (sometimes called collineations) [86].

The process of central projection maps 3D coordinates p = (X, Y, Z) to 2D coordinates x = (x, y, 1) through a pinhole at the camera origin onto a 2D projection plane a distance f along the z axis,

Fig. 2.3 Central projection, showing the relationship between the 3D and 2D coordinates p and x, as well as the relationship between the focal length f and the field of view θ.

where K = diag(f, f, 1) is called the intrinsic calibration matrix.² This matrix can be replaced by a more general upper-triangular matrix K that accounts for nonsquare pixels, skew, and a variable optic center location [86]. However, in practice, the simple focal length scaling used above provides high-quality results when stitching images from regular cameras.

In this paper, I prefer to use a 4 × 4 projection matrix, P,

$$\tilde{x} \sim \begin{bmatrix} K & 0 \\ 0^T & 1 \end{bmatrix} p = P p, \tag{2.13}$$

which maps the homogeneous 4-vector p = (X, Y, Z, 1) to a special kind of homogeneous screen vector x̃ = (x, y, 1, d). This allows me to denote the upper-left 3 × 3 portion of the projection matrix P as K (making it compatible with the computer vision literature), while not dropping altogether the inverse screen depth information d (which is also sometimes called the disparity d [144]). This latter quantity is necessary to reason about mappings between images of a 3D scene, as described below.
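Under the convention of Eq. (2.13), a short numeric sketch (Python/NumPy; K = diag(f, f, 1), with normalization by the third coordinate) shows how the fourth component of the screen vector carries the inverse depth d = 1/Z:

```python
import numpy as np

def projection_matrix(f):
    """4x4 projection P = [[K, 0], [0^T, 1]] with K = diag(f, f, 1)."""
    P = np.eye(4)
    P[0, 0] = P[1, 1] = f
    return P

p = np.array([2.0, 4.0, 8.0, 1.0])  # homogeneous 3D point (X, Y, Z, 1)
x = projection_matrix(f=1.0) @ p    # homogeneous screen vector
x = x / x[2]                        # normalize so the third entry is 1
print(x)  # (x, y, 1, d) = (X/Z, Y/Z, 1, 1/Z); here d = 1/8 = 0.125
```

Keeping d around, rather than discarding it as in the usual 3 × 4 formulation, is precisely what makes the plane-induced mappings below expressible as matrix products.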

What happens when we take two images of a 3D scene from different camera positions and/or orientations (Figure 2.4a)? A 3D point p gets mapped to an image coordinate x̃0 in camera 0 through the

² The last column of K usually contains the optical center (cx, cy), but this can be set to zero if we use normalized device coordinates.

combination of a 3D rigid-body (Euclidean) motion E0 and a projection P0, i.e., x̃0 ∼ P0 E0 p.

Unfortunately, we do not usually have access to the depth coordinates of pixels in a regular photographic image. However, for a planar scene, we can replace the last row of P0 in (2.13) with a general plane equation, n̂0 · p + c0, that maps points on the plane to d0 = 0 values (Figure 2.4b). Then, if we set d0 = 0, we can ignore the last column of M10 in (2.17) and also its last row, since we do not care about the final z-buffer depth. The mapping equation (2.17) thus reduces to

³ For points off the reference plane, we get out-of-plane parallax motion, which is why this representation is often called the plane plus parallax representation [110, 165, 196].

⁴ Note that for a single pair of images, the fact that a 3D plane is being viewed by a set

Fig. 2.5 Pure 3D camera rotation. The form of the homography (mapping) is particularly simple and depends only on the 3D rotation matrix and focal lengths.

**Rotational Panoramas.** The more interesting case is when the camera undergoes pure rotation (Figure 2.5), which is equivalent to assuming all points are very far from the camera, i.e., on the plane at infinity.

Setting t0 = t1 = 0, we get the simplified 3 × 3 homography

$$\tilde{x}_1 \sim K_1 R_1 R_0^{-1} K_0^{-1} \tilde{x}_0 = K_1 R_{10} K_0^{-1} \tilde{x}_0.$$

Estimating the 3D rotation and focal length associated with each image is intrinsically more stable than estimating a full 8-d.o.f. homography, which makes this the method of choice for large-scale image stitching algorithms [30, 178, 199].
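The rotational homography is easy to exercise numerically. In the sketch below (Python/NumPy; `rot_y` and `rotation_homography` are my own names), a 10° camera pan warps the center of image 0 into image 1:

```python
import numpy as np

def rot_y(phi):
    """Rotation about the vertical (y) axis, i.e., a camera pan."""
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[ c, 0., s],
                     [0., 1., 0.],
                     [-s, 0., c]])

def rotation_homography(f0, f1, R10):
    """Homography between two views differing by a pure rotation:
    H10 = K1 R10 K0^{-1}, with K = diag(f, f, 1)."""
    K0 = np.diag([f0, f0, 1.0])
    K1 = np.diag([f1, f1, 1.0])
    return K1 @ R10 @ np.linalg.inv(K0)

H = rotation_homography(f0=1.0, f1=1.0, R10=rot_y(np.deg2rad(10)))
x0 = np.array([0.0, 0.0, 1.0])  # center of image 0 (homogeneous NDC)
x1 = H @ x0
x1 = x1 / x1[2]
print(x1[:2])  # the optic axis of camera 0 lands at x = tan(10 deg) in image 1
```

Note the 3-parameter rotation plus one focal length per image: 4 d.o.f. rather than the 8 of a general homography, which is the stability advantage claimed above.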

**Parametrizing 3D Rotations.** If we are going to represent panoramas using a combination of rotations and focal lengths, what is the best way to represent the rotations themselves? The choices include:

- **Euler angles:** a rotation around each of the three coordinate axes; generally not recommended, since the result depends on the order in which the rotations are applied and the parametrization suffers from singularities.
- **Axis/angle (exponential twist):** a rotation axis n̂ and angle θ, often packaged as the 3-vector ω = θn̂; well suited to small, incremental rotation updates.
- **Unit quaternions:** a unit-length 4-vector q, related to axis/angle by q = (sin(θ/2) n̂, cos(θ/2)); singularity-free and easy to compose and interpolate.
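As a concrete instance of the axis/angle parametrization, Rodrigues' formula converts an axis and angle into a rotation matrix. A minimal sketch (Python/NumPy; `axis_angle_to_R` is an illustrative name):

```python
import numpy as np

def axis_angle_to_R(axis, angle):
    """Rodrigues' formula: R = I + sin(a) [n]_x + (1 - cos(a)) [n]_x^2,
    where [n]_x is the cross-product (skew-symmetric) matrix of the
    unit axis n."""
    n = np.asarray(axis, dtype=float)
    n = n / np.linalg.norm(n)
    N = np.array([[0., -n[2],  n[1]],
                  [n[2],  0., -n[0]],
                  [-n[1], n[0],  0.]])
    return np.eye(3) + np.sin(angle) * N + (1 - np.cos(angle)) * (N @ N)

R = axis_angle_to_R([0, 1, 0], np.pi / 2)  # 90 degree pan about y
print(np.round(R @ np.array([0., 0., 1.]), 6))  # z axis maps onto x axis
```

For the small incremental rotations used inside alignment optimizers, the same formula linearizes to R ≈ I + [ω]ₓ, which is why the axis/angle form is convenient for update steps.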

### 2.3 Cylindrical and Spherical Coordinates

An alternative to using homographies or 3D motions to align images is to first warp the images into cylindrical coordinates and to then use a pure translational model to align them [43, 193]. Unfortunately, this only works if the images are all taken with a level camera or with a known tilt angle.

Assume for now that the camera is in its canonical position, i.e., its rotation matrix is the identity, R = I, so that the optic axis is aligned with the z axis and the y axis is aligned vertically. The 3D ray corresponding to an (x, y) pixel is therefore (x, y, f).

We wish to project this image onto a cylindrical surface of unit radius [193]. Points on this surface are parameterized by an angle θ and a height h, with the 3D cylindrical coordinates corresponding to (θ, h) given by (sin θ, h, cos θ).

Fig. 2.7 An example of a cylindrical panorama: (a) two cylindrically warped images related by a horizontal translation; (b) part of a cylindrical panorama composited from a sequence of images.

Under this warping, images taken from a leveled, purely panning camera are related by a pure horizontal translation.⁶ This makes it attractive as an initial class project in an introductory computer vision course, since the full complexity of the perspective alignment algorithm (Subsections 3.5 and 4.3) can be avoided. Figure 2.7 shows how two cylindrically warped images from a leveled rotational panorama are related by a pure translation [199].
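The cylindrical warp can be sketched in a few lines (Python/NumPy; the function name and the scale parameter s, conventionally set to f, are my own choices):

```python
import numpy as np

def to_cylindrical(x, y, f, s=None):
    """Project the image ray (x, y, f) onto a unit-radius cylinder:
    theta = atan(x / f), h = y / sqrt(x^2 + f^2).
    The warped coordinates are scaled by s (typically s = f)."""
    s = f if s is None else s
    theta = np.arctan2(x, f)
    h = y / np.hypot(x, f)
    return s * theta, s * h

# A point on the optic axis is unchanged; off-axis points get
# compressed toward the center, producing the bowed edges of Fig. 2.7a.
print(to_cylindrical(0.0, 0.0, f=1.0))
print(to_cylindrical(1.0, 1.0, f=1.0))  # (pi/4, 1/sqrt(2)) for f = s = 1
```

After this warp, a camera pan by an angle Δφ simply shifts the image horizontally by sΔφ, which is the translation exploited above.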