
Sufficient Dimension Reduction and Prediction in Regression

By Kofi P. Adragni and R. Dennis Cook

University of Minnesota, 313 Ford Hall, 224 Church Street S.E., Minneapolis, MN 55455, USA

Dimension reduction for regression is a prominent issue today because technological advances now allow scientists to routinely formulate regressions in which the number of predictors is considerably larger than in the past. While several methods have been proposed to deal with such regressions, principal components still seem to be the most widely used across the applied sciences. We give a broad overview of ideas underlying a particular class of methods for dimension reduction that includes principal components, along with an introduction to the corresponding methodology. New methods are proposed for prediction in regressions with many predictors.

Keywords: Lasso, Partial least squares, Principal components, Principal component regression, Principal fitted components

1. Introduction

Consider the frequently encountered goal of determining a rule m(x) for predicting a future observation of a univariate response variable Y at the given value x of a p × 1 vector X of continuous predictors. Assuming that Y is quantitative, continuous or discrete, the mean squared error E{Y − m(x)}^2 is minimized by choosing m(x) to be the mean E(Y |X = x) of the conditional distribution of Y |(X = x). Consequently, the prediction goal is often specialized immediately to the task of estimating the conditional mean function E(Y |X) from the regression of Y on X. When the response is categorical with sample space SY consisting of h categories, SY = {C1, ..., Ch}, the mean function is no longer a relevant quantity for prediction. Instead, given an observation x on X, the predicted category C∗ is usually taken to be the one with the largest conditional probability, C∗ = arg max Pr(Ck |X = x), where the maximization is over SY.

When pursuing estimation of E(Y |X) or Pr(Ck |X) it is nearly always worthwhile to consider predictions based on a function R(X) of dimension less than p, provided that it captures all of the information that X contains about Y, so that E(Y |X) = E(Y |R(X)). We can think of R(X) as a function that concentrates the relevant information in X. The action of replacing X with a lower-dimensional function R(X) is called dimension reduction; it is called sufficient dimension reduction when R(X) retains all the relevant information about Y. A potential advantage of sufficient dimension reduction is that predictions based on an estimated R may be substantially less variable than those based on X, without introducing worrisome bias. This advantage is not confined to predictions, but may accrue in other phases of a regression analysis as well.
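The categorical prediction rule C∗ = arg max Pr(Ck |X = x) can be sketched numerically. The setup below is entirely our own illustration (three hypothetical classes with assumed normal class-conditional densities and assumed priors), not an example from the paper; it simply evaluates the posterior on the log scale and takes the arg max.

```python
import numpy as np

# Hypothetical setup: h = 3 classes, class-conditional densities X|Y=C_k taken
# to be bivariate normal with identity covariance and the (assumed) means below.
means = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])  # class means (assumed)
priors = np.array([0.5, 0.25, 0.25])                    # class priors pi_k (assumed)

def log_density(x, mu):
    """Log of an identity-covariance bivariate normal density at x, up to a constant."""
    return -0.5 * np.sum((x - mu) ** 2)

def predict_class(x):
    """C* = argmax_k Pr(C_k | X = x), computed on the log scale:
    log pi_k + log f_k(x) is a monotone transform of the posterior."""
    scores = [np.log(pk) + log_density(x, mu) for pk, mu in zip(priors, means)]
    return int(np.argmax(scores))

# A point near the second class mean is assigned to class index 1.
print(predict_class(np.array([2.1, -0.1])))  # -> 1
print(predict_class(np.array([-0.5, 0.1])))  # -> 0
```

Working on the log scale avoids underflow and drops the normalizing constant Pr(X = x), which is common to all classes and so does not affect the arg max.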

One goal of this article is to give a broad overview of ideas underlying sufficient dimension reduction for regression, along with an introduction to the corresponding methodology. Sections 1a, 1b, 2 and 3 are devoted largely to this review. Sufficient dimension reduction methods are designed to estimate a population parameter called the central subspace, which is defined in §1b. Another goal of this article is to describe a new method of predicting quantitative responses following sufficient dimension reduction; categorical responses will be discussed only for contrast. The focus of this article shifts to prediction in §4, where we discuss four inverse regression models, describe the prediction methodology that stems from them, and give simulation results to illustrate their behaviour. Practical implementation issues are discussed in §5, along with additional simulation results.

(a) Dimension reduction

There are many methods available for estimating E(Y |X) based on a random sample (Yi, Xi), i = 1, ..., n, from the joint distribution of Y and X. If p is sufficiently small and n is sufficiently large, it may be possible to estimate E(Y |X) adequately by using nonparametric smoothing (see, for example, Wand & Jones 1995). Otherwise, nearly all techniques for estimating E(Y |X) employ some type of dimension reduction for X, either estimated or imposed as an intrinsic part of the model or method.
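When p is small, direct nonparametric estimation of E(Y |X) is straightforward. As a minimal sketch under assumptions of our own choosing (a single predictor, a simulated sin mean function, a Gaussian kernel with a fixed bandwidth), a Nadaraya-Watson smoother is one such estimator:

```python
import numpy as np

# Simulated data (our assumption, not the paper's): p = 1, true mean sin(x).
rng = np.random.default_rng(1)
n = 500
x = rng.uniform(-2, 2, n)
y = np.sin(x) + 0.1 * rng.standard_normal(n)

def nw_smoother(x0, x, y, h=0.2):
    """Nadaraya-Watson estimate of E(Y|X=x0): a kernel-weighted average of the
    responses, with a Gaussian kernel of bandwidth h (chosen arbitrarily here)."""
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)
    return np.sum(w * y) / np.sum(w)

est = nw_smoother(1.0, x, y)
print(est)  # close to the true value sin(1.0)
```

With larger p such smoothers degrade quickly (the curse of dimensionality), which is one motivation for reducing X first.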

Broadly viewed, dimension reduction has always been a central statistical concept. In the second half of the nineteenth century ‘reduction of observations’ was widely recognized as a core goal of statistical methodology, and principal components was emerging as a general method for the reduction of multivariate observations (Adcock 1878). Principal components was established as a first reductive method for regression by the mid 1900s.

Dimension reduction for regression is a prominent issue today because technological advances now allow scientists to routinely formulate regressions in which p is considerably larger than in the past. This has complicated the development and fitting of regression models. Experience has shown that the standard iterative paradigm for model development guided by diagnostics (Cook & Weisberg 1982, p. 7) can be imponderable when applied with too many predictors. An added complication arises when p is larger than the number of observations n, leading to the so-called ‘n ≪ p’ problem. Standard methods of fitting and corresponding inference procedures may no longer be applicable in such regressions. These and related issues have caused a shift in the applied sciences toward a different regression genre with the goal of reducing the dimensionality of the predictor vector as a first step in the analysis. Although large-p regressions are perhaps mainly responsible for renewed interest, dimension reduction methodology can be useful regardless of the size of p. For instance, it is often helpful to have an informative low-dimensional graphical summary of the regression to facilitate model building and gain insights.

For this goal p may be regarded as large when it exceeds 2 or 3 since these bounds represent the limits of our ability to view a data set in full using computer graphics.

Subsequent references to ‘large p’ in this article do not necessarily imply that n ≪ p.

Reduction by principal components is ubiquitous in the applied sciences, particularly in bioinformatics applications where principal components have been called ‘eigen-genes’ (Alter et al. 2000) in microarray data analyses and ‘meta-kmers’ in analyses involving DNA motifs. The 2006 Ad Hoc Committee Report on the ‘Hockey Stick’ Global Climate Reconstruction, authored by E. Wegman, D. Scott and Y. Said and commissioned by the U.S. House Energy Committee, reiterates and makes clear that past influential analyses of data on global warming are flawed because of an inappropriate use of principal component methodology.

While principal components seem to be the dominant method of dimension reduction across the applied sciences, there are many other established and recent statistical methods that might be used to address large p regressions, including factor analysis, inverse regression estimation (Cook & Ni 2005), partial least squares, projection pursuit, seeded reductions (Cook et al. 2007), kernel methods (Fukumizu et al. 2009) and sparse methods like the lasso (Tibshirani 1996) that are based on penalization.

(b) Sufficient Dimension Reduction

Dimension reduction is a rather amorphous concept in statistics, changing its character and goals depending on context. Formulated specifically for regression, the following definition (Cook 2007) of a sufficient reduction will help in our pursuit of methods for reducing the dimension of X while en route to estimating E(Y |X):

Definition 1.1. A reduction R : Rp → Rq, q ≤ p, is sufficient if it satisfies one of the following three statements:

(i) inverse reduction, X|(Y, R(X)) ∼ X|R(X),
(ii) forward reduction, Y |X ∼ Y |R(X),
(iii) joint reduction, X ⫫ Y |R(X),

where ⫫ indicates independence, ∼ means identically distributed and A|B refers to the random vector A given the vector B.

Each of the three conditions in this definition conveys the idea that the reduction R(X) carries all the information that X has about Y, and consequently all the information available to estimate E(Y |X). They are equivalent when (Y, X) has a joint distribution. In that case we are free to determine a reduction inversely or jointly and then pass it to the conditional mean without additional structure: E(Y |X) = E(Y |R(X)). In some cases there may be a direct connection between R(X) and E(Y |X). For instance, if (Y, X) follows a nonsingular multivariate normal distribution then R(X) = E(Y |X) is a sufficient reduction, E(Y |X) = E{Y |E(Y |X)}. This reduction is also minimal sufficient: if T(X) is any sufficient reduction then R is a function of T. Further, because of the nature of the multivariate normal distribution, it can be expressed as a linear combination of the elements of X: R = β^T X is minimal sufficient for some vector β.
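The multivariate normal case can be checked numerically. In the simulation below (the data-generating model and dimensions are our own assumptions), β is estimated as the solution of Σ̂β = Ĉ; the OLS fit of Y on the single index β^T X then reproduces the OLS fit of Y on all p predictors, illustrating that nothing relevant to E(Y |X) is lost:

```python
import numpy as np

# Simulated jointly normal (Y, X); n, p and the coefficient vector are assumptions.
rng = np.random.default_rng(2)
n, p = 2000, 5
X = rng.standard_normal((n, p))
Y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.standard_normal(n)

def ols_fitted(Z, Y):
    """Fitted values from an OLS regression of Y on Z (with intercept)."""
    Z1 = np.column_stack([np.ones(len(Z)), Z])
    coef, *_ = np.linalg.lstsq(Z1, Y, rcond=None)
    return Z1 @ coef

# beta solves var(X) beta = cov(X, Y); R(X) = beta^T X is the estimated reduction.
beta_hat = np.linalg.solve(np.cov(X, rowvar=False), np.cov(X.T, Y)[:p, p])
R = X @ beta_hat

full = ols_fitted(X, Y)      # regression on all p predictors
reduced = ols_fitted(R, Y)   # regression on the one-dimensional reduction
print(np.allclose(full, reduced))  # -> True
```

The equality is exact (up to rounding) because the fitted values of a linear regression depend on β^T X only through the direction of β.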

Inverse reduction by itself does not require the response Y to be random, and it is perhaps the only reasonable reductive route when Y is fixed by design. For instance, in discriminant analysis X|Y is a random vector of features observed in one of a number of subpopulations indicated by the categorical response Y, and no discriminatory information will be lost if classifiers are restricted to R.

If we consider a generic statistical problem and reinterpret X as the total data D and Y as the parameter θ, then the condition for inverse reduction becomes D|(θ, R) ∼ D|R, so that R is a sufficient statistic. In this way, the definition of a sufficient reduction encompasses Fisher’s (1922) classical definition of sufficiency.

One difference is that sufficient statistics are observable, while a sufficient reduction may contain unknown parameters and thus needs to be estimated. For example, if (X, Y) follows a nonsingular multivariate normal distribution then R(X) = β^T X and it is necessary to estimate β.

In some regressions R(X) may be a nonlinear function of X, and in extreme cases no reduction may be possible, so that all sufficient reductions are one-to-one functions of X and thus equivalent to R(X) = X. Most often we encounter multi-dimensional reductions consisting of several linear combinations R(X) = η^T X, where η is an unknown p × q matrix, q ≤ p, that must be estimated from the data. Linear reductions may be imposed to facilitate progress, as in the moment-based approach reviewed in §3a. They can also arise as a natural consequence of modelling restrictions, as we will see in §3b. If η^T X is a sufficient linear reduction then so is (ηA)^T X for any q × q full-rank matrix A. Consequently, only the subspace span(η) spanned by the columns of η can be identified – span(η) is called a dimension reduction subspace.
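The identifiability point can be made concrete with a small simulation (all quantities below, including the two-index model for Y, are our own assumptions): any procedure that uses the reduced predictors only through their span, such as OLS fitted values, is unchanged when η is replaced by ηA for full-rank A.

```python
import numpy as np

# Simulated two-index regression: Y depends on X only through eta^T X (q = 2).
rng = np.random.default_rng(3)
n, p, q = 300, 6, 2
X = rng.standard_normal((n, p))
eta = rng.standard_normal((p, q))
Y = np.sin(X @ eta[:, 0]) + (X @ eta[:, 1]) ** 2 + 0.1 * rng.standard_normal(n)

A = np.array([[2.0, 1.0], [0.0, -1.0]])   # arbitrary full-rank q x q matrix

def fitted(Z, Y):
    """OLS fitted values of Y on Z with an intercept."""
    Z1 = np.column_stack([np.ones(len(Z)), Z])
    coef, *_ = np.linalg.lstsq(Z1, Y, rcond=None)
    return Z1 @ coef

f1 = fitted(X @ eta, Y)          # reduction eta^T X
f2 = fitted(X @ (eta @ A), Y)    # reparameterized reduction (eta A)^T X
print(np.allclose(f1, f2))  # -> True: only span(eta) matters
```

The two reduced-predictor matrices have the same column space, so every span-based quantity (fitted values, residuals, R^2) coincides even though the individual linear combinations differ.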

If span(η) is a dimension reduction subspace then so is span(η, η1) for any p × q1 matrix η1. If span(η1) and span(η2) are both dimension reduction subspaces, then under mild conditions so is their intersection span(η1) ∩ span(η2) (Cook 1996, 1998). Consequently, the inferential target in sufficient dimension reduction is often taken to be the central subspace SY|X, defined as the intersection of all dimension reduction subspaces (Cook 1994, 1996, 1998). A minimal sufficient linear reduction is then of the form R(X) = η^T X, where the columns of η now form a basis for SY|X. We assume that the central subspace exists throughout this article, and use d = dim(SY|X) to denote its dimension.

The ideas of a sufficient reduction and the central subspace can be used to further our understanding of existing methodology and to guide the development of new methodology. In Sections 2 and 3 we consider how sufficient reductions arise in three contexts: forward linear regression, inverse moment-based reduction and inverse model-based reduction.

2. Reduction in Forward Linear Regression

The standard linear regression model Y = β0 + β^T X + ε, with ε ⫫ X and E(ε) = 0, implies that SY|X = span(β) and thus that R(X) = β^T X is minimal sufficient.

The assumption of a linear regression then automatically focuses our interest on β, which can be estimated straightforwardly using ordinary least squares (OLS) when n is sufficiently large, and it may appear that there is little to be gained from dimension reduction. However, dimension reduction has been used in linear regression to improve on the OLS estimator of β and to deal with n ≪ p regressions.

One approach consists of regressing Y on X in two steps. The first is the reduction step: reduce X linearly to G^T X using some methodology that produces G ∈ Rp×q, q ≤ p. The second step consists of using ordinary least squares to estimate the mean function E(Y |G^T X) for the reduced predictors. To describe the resulting estimator β_G of β and establish notation for later sections, let Y be the n × 1 vector of centred responses, let X̄ = Σᵢ Xi/n denote the sample mean vector, let 𝕏 be the n × p matrix with rows (Xi − X̄)^T, i = 1, ..., n, let Σ̂ = 𝕏^T 𝕏/n denote the usual estimator of Σ = var(X), let Ĉ = 𝕏^T Y/n, which is the usual estimator of C = cov(X, Y), and let β̂_ols = Σ̂^{-1} Ĉ be the vector of coefficients from the OLS fit.
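A minimal sketch of this two-step estimator, under assumptions of our own (simulated data, G taken to be the leading q principal component directions of Σ̂, i.e. principal component regression, and q fixed arbitrarily):

```python
import numpy as np

# Simulated regression data; n, p, q are arbitrary choices for illustration.
rng = np.random.default_rng(4)
n, p, q = 200, 10, 3
X = rng.standard_normal((n, p))
Y = X @ rng.standard_normal(p) + rng.standard_normal(n)

Xc = X - X.mean(axis=0)          # the n x p matrix with rows (X_i - Xbar)^T
Yc = Y - Y.mean()                # centred responses
Sigma_hat = Xc.T @ Xc / n        # usual estimator of var(X)
C_hat = Xc.T @ Yc / n            # usual estimator of cov(X, Y)

# Step 1 (reduction): G holds the top-q eigenvectors of Sigma_hat.
eigval, eigvec = np.linalg.eigh(Sigma_hat)
G = eigvec[:, np.argsort(eigval)[::-1][:q]]          # p x q

# Step 2 (forward fit): OLS of Y on the reduced predictors G^T X, mapped back
# to the original predictor scale: beta_G = G (G^T Sigma_hat G)^{-1} G^T C_hat.
beta_G = G @ np.linalg.solve(G.T @ Sigma_hat @ G, G.T @ C_hat)

beta_ols = np.linalg.solve(Sigma_hat, C_hat)         # full OLS, for comparison
print(beta_G.shape)  # -> (10,)
```

With q = p the two estimators coincide; for q < p, β_G trades bias for a reduction in variance relative to β̂_ols.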

The lasso estimator β_lasso (Tibshirani 1996) minimizes the penalized least-squares criterion ||Y − 𝕏β||^2 + λ Σj |βj|, where βj is the j-th element of β, j = 1, ..., p, and the tuning parameter λ is often chosen by cross validation. Several elements of β_lasso are typically zero, which corresponds to setting the rows of G to be the rows of the identity matrix Ip corresponding to the nonzero elements of β_lasso. However, with this G we do not necessarily have β_lasso = β_G, although the two estimators are often similar. Consequently, methodology based on penalization does not fit exactly the general form given in equation (2.1).
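The distinction between β_lasso and the corresponding β_G can be seen in a small simulation (our own sketch: a hand-rolled coordinate-descent lasso on simulated data, with λ fixed rather than cross-validated). Refitting OLS on the selected variables gives β_G, which differs from β_lasso because the lasso also shrinks the surviving coefficients:

```python
import numpy as np

# Simulated sparse regression; the design, coefficients and lambda are assumptions.
rng = np.random.default_rng(5)
n, p = 200, 8
beta_true = np.array([3.0, -2.0, 0, 0, 1.5, 0, 0, 0])
X = rng.standard_normal((n, p))
Y = X @ beta_true + rng.standard_normal(n)

Xc = X - X.mean(axis=0)
Yc = Y - Y.mean()

def lasso_cd(Xc, Yc, lam, n_iter=200):
    """Minimize (1/2n)||Yc - Xc b||^2 + lam * sum_j |b_j| by coordinate descent,
    cycling soft-threshold updates over the coordinates."""
    n, p = Xc.shape
    b = np.zeros(p)
    z = (Xc ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = Yc - Xc @ b + Xc[:, j] * b[j]          # partial residual
            rho = Xc[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z[j]
    return b

b_lasso = lasso_cd(Xc, Yc, lam=0.5)
active = np.flatnonzero(b_lasso)          # nonzero rows -> rows of I_p kept in G

coef, *_ = np.linalg.lstsq(Xc[:, active], Yc, rcond=None)
b_G = np.zeros(p)
b_G[active] = coef                        # OLS refit on the selected variables

print(active)                             # typically the truly nonzero coordinates
print(np.allclose(b_lasso, b_G))          # -> False: beta_lasso != beta_G
```

The selected support usually agrees, but the lasso coefficients are pulled toward zero by roughly λ, which is exactly why penalized estimators fall outside the two-step form.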

Pursuing dimension reduction based on linear regression may not produce useful results if the model is not accurate, particularly if the distribution of Y |X depends on more than one linear combination of the predictors. There are many diagnostic and remedial methods available to improve linear regression models when p is not too large. Otherwise, application of these methods can be quite burdensome.
