Analysis of variance∗

Andrew Gelman†

March 22, 2006

Abstract

Analysis of variance (ANOVA) is a statistical procedure for summarizing a classical linear model (a decomposition of sum of squares into a component for each source of variation in the model) along with an associated test (the F-test) of the hypothesis that any given source of variation in the model is zero. When applied to generalized linear models, multilevel models, and other extensions of classical regression, ANOVA can be extended in two different directions. First, the F-test can be used (in an asymptotic or approximate fashion) to compare nested models, to test the hypothesis that the simpler of the models is sufficient to explain the data. Second, the idea of variance decomposition can be interpreted as inference for the variances of batches of parameters (sources of variation) in multilevel regressions.

1 Introduction

Analysis of variance (ANOVA) represents a set of models that can be fit to data, and also a set of methods for summarizing an existing fitted model. We first consider ANOVA as it applies to classical linear models (the context for which it was originally devised; Fisher, 1925) and then discuss how ANOVA has been extended to generalized linear models and multilevel models. Analysis of variance is particularly effective for analyzing highly structured experimental data (in agriculture, multiple treatments applied to different batches of animals or crops; in psychology, multi-factorial experiments manipulating several independent experimental conditions and applied to groups of people; industrial experiments in which multiple factors can be altered at different times and in different locations).

At the end of this article, we compare ANOVA to simple linear regression.

2 Analysis of variance for classical linear models

2.1 ANOVA as a family of statistical methods

When formulated as a statistical model, analysis of variance refers to an additive decomposition of data into a grand mean, main effects, possible interactions, and an error term. For example, Gawron et al. (2003) describe a flight-simulator experiment that we summarize as a 5 × 8 array of measurements under 5 treatment conditions and 8 different airports. The corresponding two-way ANOVA model is $y_{ij} = \mu + \alpha_i + \beta_j + \epsilon_{ij}$. The data as described here have no replication, and so the two-way interaction becomes part of the error term.¹

∗ For the New Palgrave Dictionary of Economics, second edition. We thank Jack Needleman, Matthew Rafferty, David Pattison, Marc Shivers, Gregor Gorjanc, and several anonymous commenters for helpful suggestions and the National Science Foundation for financial support.

† Department of Statistics and Department of Political Science, Columbia University, New York, gelman@stat.columbia.edu, www.stat.columbia.edu/~gelman

¹ If, for example, each treatment × airport condition were replicated three times, then the 120 data points could be modeled as $y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk}$, with two sets of main effects, a two-way interaction, and an error term.

Figure 1: Classical two-way analysis of variance for data on 5 treatments and 8 airports with no replication. The treatment-level variation is not statistically distinguishable from noise, but the airport effects are statistically significant. This and the other examples in this article come from Gelman (2005) and Gelman and Hill (2006).

This is a linear model with 1 + 4 + 7 coefficients, which is typically identified by constraining $\sum_{i=1}^{5} \alpha_i = 0$ and $\sum_{j=1}^{8} \beta_j = 0$. The corresponding ANOVA display is shown in Figure 1:

• For each source of variation, the degrees of freedom represent the number of effects at that level, minus the number of constraints (the 5 treatment effects sum to zero, the 8 airport effects sum to zero, and each row and column of the 40 residuals sums to zero).

• The total sum of squares, that is, $\sum_{i=1}^{5} \sum_{j=1}^{8} (y_{ij} - \bar{y}_{..})^2$, is 0.078 + 3.944 + 1.417, which can be decomposed into these three terms corresponding to variance described by treatment, variance described by airport, and residuals.

• The mean square for each row is the sum of squares divided by degrees of freedom. Under the null hypothesis of zero row and column effects, their mean squares would, in expectation, simply equal the mean square of the residuals.

• The F-ratio for each row (excluding the residuals) is the mean square, divided by the residual mean square. This ratio should be approximately 1 (in expectation) if the corresponding effects are zero; otherwise we would generally expect the F-ratio to exceed 1. We would expect the F-ratio to be less than 1 only in unusual models with negative within-group correlations (for example, if the data y have been renormalized in some way, and this had not been accounted for in the data analysis).

• The p-value gives the statistical significance of the F-ratio with reference to the $F_{\nu_1,\nu_2}$ distribution, where $\nu_1$ and $\nu_2$ are the numerator and denominator degrees of freedom, respectively. (Thus, the two F-ratios in Figure 1 are being compared to $F_{4,28}$ and $F_{7,28}$ distributions, respectively.) In this example, the treatment mean square is lower than expected (an F-ratio of less than 1), but the difference from 1 is not statistically significant (a p-value of 82%), hence it is reasonable to judge this difference as explainable by chance, and consistent with zero treatment effects. The airport mean square is much higher than would be expected by chance, with an F-ratio that is highly statistically significantly larger than 1; hence we can confidently reject the hypothesis of zero airport effects.
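As a check on the arithmetic in the list above, the following Python sketch recomputes the mean squares and F-ratios from the sums of squares quoted in the text (0.078, 3.944, 1.417) and the degrees of freedom 4, 7, and 28. The function name `anova_rows` and the dictionary layout are illustrative choices, not part of the original article, and the sketch stops short of computing p-values.

```python
def anova_rows(sources, residual):
    """Build mean squares and F-ratios from (sum of squares, df) pairs."""
    res_ss, res_df = residual
    res_ms = res_ss / res_df  # residual mean square: the yardstick for every F-ratio
    table = {}
    for name, (ss, df) in sources.items():
        ms = ss / df
        table[name] = {"df": df, "ss": ss, "ms": ms, "F": ms / res_ms}
    table["residual"] = {"df": res_df, "ss": res_ss, "ms": res_ms, "F": None}
    return table


# Sums of squares and degrees of freedom as quoted in the text.
flight = anova_rows(
    {"treatment": (0.078, 4), "airport": (3.944, 7)},
    residual=(1.417, 28),
)

for name, row in flight.items():
    f_text = "" if row["F"] is None else f"{row['F']:.2f}"
    print(f"{name:10s} df={row['df']:2d} SS={row['ss']:.3f} MS={row['ms']:.4f} F={f_text}")
```

The treatment F-ratio comes out below 1 and the airport F-ratio far above 1, in line with the description of Figure 1.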

More complicated designs have correspondingly complicated ANOVA models, and complexities arise with multiple error terms. We do not intend to explain such hierarchical designs and analyses here, but we wish to alert the reader to such complications. Textbooks such as Snedecor and Cochran (1989) and Kirk (1995) provide examples of analysis of variance for a wide range of designs.

2.2 ANOVA to summarize a model that has already been fitted

We have just demonstrated ANOVA as a method of analyzing highly structured data by decomposing variance into different sources, and comparing the explained variance at each level to what would be expected by chance alone. Any classical analysis of variance corresponds to a linear model (that is, a regression model, possibly with multiple error terms); conversely, ANOVA tools can be used to summarize an existing linear model.

The key is the idea of “sources of variation,” each of which corresponds to a batch of coefficients in a regression. Thus, with the model $y = X\beta + \epsilon$, the columns of X can often be batched in a reasonable way (for example, from the previous section, a constant term, 4 treatment indicators, and 7 airport indicators), and the mean squares and F-tests then provide information about the amount of variance explained by each batch.

Such models could be fit without any reference to ANOVA, but ANOVA tools could then be used to make some sense of the fitted models, and to test hypotheses about batches of coefficients.
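To make the batching concrete, here is a minimal Python sketch of the additive two-way decomposition (grand mean, one batch of row effects, one batch of column effects, residuals) for a table with one observation per cell. The 3 × 4 data values are hypothetical and chosen only for illustration; the function names are likewise assumptions, not anything defined in the article.

```python
def two_way_decomposition(y):
    """Split a rows x columns table into grand mean, main effects, and residuals."""
    n_rows, n_cols = len(y), len(y[0])
    grand = sum(sum(row) for row in y) / (n_rows * n_cols)
    row_eff = [sum(row) / n_cols - grand for row in y]
    col_eff = [sum(y[i][j] for i in range(n_rows)) / n_rows - grand
               for j in range(n_cols)]
    resid = [[y[i][j] - grand - row_eff[i] - col_eff[j] for j in range(n_cols)]
             for i in range(n_rows)]
    return grand, row_eff, col_eff, resid


def sums_of_squares(y):
    """Sum of squares for each batch, plus the total about the grand mean."""
    grand, row_eff, col_eff, resid = two_way_decomposition(y)
    n_rows, n_cols = len(y), len(y[0])
    return {
        "rows": n_cols * sum(a * a for a in row_eff),
        "columns": n_rows * sum(b * b for b in col_eff),
        "residual": sum(e * e for row in resid for e in row),
        "total": sum((v - grand) ** 2 for row in y for v in row),
    }


# Hypothetical data: 3 row levels by 4 column levels, no replication.
y = [[4.1, 5.0, 6.2, 5.5],
     [3.8, 4.9, 6.0, 5.9],
     [4.4, 5.6, 6.9, 6.1]]

grand, row_eff, col_eff, resid = two_way_decomposition(y)
ss = sums_of_squares(y)
print(ss)
```

In a balanced table like this one the effects in each batch sum to zero and the three batch sums of squares add up to the total, which is the decomposition the ANOVA display reports.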

2.3 Balanced and unbalanced data

In general, the amount of variance explained by a batch of predictors in a regression depends on which other variables have already been included in the model. With balanced data, however, in which all groups have the same number of observations (for example, each treatment applied exactly eight times, and each airport used for exactly five observations), the variance decomposition does not depend on the order in which the variables are entered. ANOVA is thus particularly easy to interpret with balanced data. The analysis of variance can also be applied to unbalanced data, but then the sums of squares, mean squares, and F-ratios will depend on the order in which the sources of variation are considered.
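The order dependence can be seen in a small Python sketch. It uses hypothetical noise-free data on two two-level factors (effects of 1 and 2, both invented for illustration) and fits the additive model by simple backfitting, which is one convenient way to get the least-squares fit without a linear-algebra library. None of the function names come from the article.

```python
from collections import defaultdict


def group_means(values, labels):
    """Mean of `values` within each label."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, labels):
        sums[g] += v
        counts[g] += 1
    return {g: sums[g] / counts[g] for g in sums}


def residual_ss(y, factors, n_iter=200):
    """Residual sum of squares of an additive model, fit by backfitting."""
    if not factors:  # constant-only model
        mean = sum(y) / len(y)
        return sum((v - mean) ** 2 for v in y)
    effects = [defaultdict(float) for _ in factors]
    for _ in range(n_iter):
        for k, labels in enumerate(factors):
            # Partial residuals: remove every batch except batch k, then re-average.
            partial = [
                v - sum(effects[m][factors[m][i]]
                        for m in range(len(factors)) if m != k)
                for i, v in enumerate(y)
            ]
            effects[k] = group_means(partial, labels)
    fitted = [sum(effects[m][factors[m][i]] for m in range(len(factors)))
              for i in range(len(y))]
    return sum((v - f) ** 2 for v, f in zip(y, fitted))


def ss_for_a(y, a, b):
    """Sum of squares credited to factor a: entered first, and entered after b."""
    entered_first = residual_ss(y, []) - residual_ss(y, [a])
    entered_last = residual_ss(y, [b]) - residual_ss(y, [a, b])
    return entered_first, entered_last


def make_data(cell_counts):
    """Hypothetical data: y = 1*a + 2*b with no noise, n copies per cell."""
    y, a, b = [], [], []
    for (i, j), n in cell_counts.items():
        for _ in range(n):
            y.append(1.0 * i + 2.0 * j)
            a.append(i)
            b.append(j)
    return y, a, b


balanced = ss_for_a(*make_data({(0, 0): 2, (0, 1): 2, (1, 0): 2, (1, 1): 2}))
unbalanced = ss_for_a(*make_data({(0, 0): 3, (0, 1): 1, (1, 0): 1, (1, 1): 3}))
print("balanced  :", balanced)    # the two orders agree
print("unbalanced:", unbalanced)  # the two orders disagree
```

With equal cell counts the sum of squares for the first factor is the same whichever order is used; with unequal counts the two factors are correlated and the first one entered absorbs part of the other's variation.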

3 ANOVA for more general models

Analysis of variance represents a way of summarizing regressions with large numbers of predictors that can be arranged in batches, and a way of testing hypotheses about batches of coefficients. Both these ideas can be applied in settings more general than linear models with balanced data.

3.1 F tests

In a classical balanced design (as in the examples of the previous section), each F-ratio compares a particular batch of effects to zero, testing the hypothesis that this particular source of variation is not necessary to fit the data.

More generally, the F test can compare two nested models, testing the hypothesis that the smaller model fits the data adequately (so that the larger model is unnecessary). In a linear model, the F-ratio is $\frac{(SS_2 - SS_1)/(df_2 - df_1)}{SS_1/df_1}$, where $SS_1$, $df_1$ and $SS_2$, $df_2$ are the residual sums of squares and degrees of freedom from fitting the larger and smaller models, respectively.
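The nested-model formula can be checked against the flight-simulator numbers with a short Python sketch. The larger model (treatment and airport effects) has residual sum of squares 1.417 on 28 degrees of freedom; because the design is balanced, dropping the airport effects moves their 3.944 and 7 degrees of freedom into the residual of the smaller model. The function name is illustrative.

```python
def nested_f_ratio(ss_large, df_large, ss_small, df_small):
    """F-ratio comparing a smaller model to a larger model that contains it.

    ss_* and df_* are residual sums of squares and residual degrees of freedom.
    """
    numerator = (ss_small - ss_large) / (df_small - df_large)
    denominator = ss_large / df_large
    return numerator / denominator


# Larger model: treatment + airport. Smaller model: treatment only.
f_airport = nested_f_ratio(
    ss_large=1.417, df_large=28,
    ss_small=3.944 + 1.417, df_small=28 + 7,
)
print(round(f_airport, 2))
```

The result is the same airport F-ratio as in the classical display, which is the sense in which the classical F-test is a special case of the nested-model comparison.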

For generalized linear models, formulas exist using the deviance (the log-likelihood multiplied by −2) that are asymptotically equivalent to F-ratios. In general, such models are not balanced, and the test for including another batch of coefficients depends on which other sources of variation have already been included in the model.
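One standard instance of a deviance-based comparison is the likelihood-ratio test, sketched below in Python for two nested binomial models: a single common probability versus a separate probability for each group. This is an assumption about which formula to illustrate, not necessarily the one the article has in mind, and the counts are hypothetical. The p-value helper covers only the 1-degree-of-freedom case, where the chi-squared tail has a closed form in terms of `math.erfc`.

```python
import math


def binomial_loglik(successes, trials, p):
    """Binomial log-likelihood, dropping the constant that cancels in differences."""
    return successes * math.log(p) + (trials - successes) * math.log(1 - p)


def deviance_difference(groups):
    """Deviance drop from a common probability to one probability per group.

    groups: list of (successes, trials); each group needs 0 < successes < trials.
    Returns (difference in deviance, difference in number of parameters).
    """
    p_pooled = sum(s for s, _ in groups) / sum(n for _, n in groups)
    dev_small = -2 * sum(binomial_loglik(s, n, p_pooled) for s, n in groups)
    dev_large = -2 * sum(binomial_loglik(s, n, s / n) for s, n in groups)
    return dev_small - dev_large, len(groups) - 1


def chi2_tail_1df(x):
    """P(chi-squared with 1 df > x), the asymptotic reference distribution."""
    return math.erfc(math.sqrt(x / 2))


# Hypothetical counts: 30 of 100 in one group, 50 of 100 in the other.
diff, extra_params = deviance_difference([(30, 100), (50, 100)])
p_value = chi2_tail_1df(diff)
print(round(diff, 2), extra_params, round(p_value, 4))
```

As with the F-test, a large drop in deviance relative to the number of added parameters is evidence that the smaller model is not adequate.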

3.2 Inference for variance parameters

A different sort of generalization interprets the ANOVA display as inference about the variance of each batch of coefficients, which we can think of as the relative importance of each source of variation in predicting the data. Even in a classical balanced ANOVA, the sums of squares and mean squares do not exactly do this, but the information contained therein can be used to estimate the variance components (Cornfield and Tukey, 1956; Searle, Casella, and McCulloch, 1992). Bayesian simulation can then be used to obtain confidence intervals for the variance parameters. As illustrated below, we display inferences for standard deviations (rather than variances) because these are more directly interpretable. Compared to the classical ANOVA display, our plots emphasize the estimated variance parameters rather than testing the hypothesis that they are zero.
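A minimal Python sketch of the classical method-of-moments step, applied to the flight-simulator mean squares: in an additive two-way layout with one observation per cell, the expected mean square for a batch is the error variance plus (number of observations per level) times the batch variance. This gives point estimates only; it is not the Bayesian simulation the article uses for intervals, and truncating negative estimates at zero is a common convention assumed here rather than taken from the article.

```python
import math


def moment_sd(ms_effect, ms_resid, n_per_level):
    """Method-of-moments standard deviation for one batch of effects."""
    variance = (ms_effect - ms_resid) / n_per_level
    return math.sqrt(max(variance, 0.0))  # assumption: truncate negatives at zero


# Mean squares from the sums of squares and degrees of freedom in the text.
ms_treatment = 0.078 / 4
ms_airport = 3.944 / 7
ms_resid = 1.417 / 28

sd_treatment = moment_sd(ms_treatment, ms_resid, n_per_level=8)  # 8 airports per treatment
sd_airport = moment_sd(ms_airport, ms_resid, n_per_level=5)      # 5 treatments per airport
sd_error = math.sqrt(ms_resid)

print(sd_treatment, round(sd_airport, 3), round(sd_error, 3))
```

The treatment estimate is zero because its mean square falls below the residual mean square, while the airport and error standard deviations are clearly positive.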

Figure 2: ANOVA display for two logistic regression models of the probability that a survey respondent prefers the Republican candidate for the 1988 U.S. Presidential election, based on data from seven CBS News polls. Point estimates and error bars show median estimates, 50% intervals, and 95% intervals of the standard deviation of each batch of coefficients. The large coefficients for ethnicity, region, and state suggest that it might make sense to include interactions, hence the inclusion of ethnicity × region and ethnicity × state interactions in the second model.

3.3 Generalized linear models

The idea of estimating variance parameters applies directly to generalized linear models as well as unbalanced datasets. All that is needed is that the parameters of a regression model are batched into “sources of variation.” Figure 2 illustrates with a multilevel logistic regression model, predicting vote preference given a set of demographic and geographic variables.

3.4 Multilevel models and Bayesian inference

Analysis of variance is closely tied to multilevel (hierarchical) modeling, with each source of variation in the ANOVA table corresponding to a variance component in a multilevel model (see Gelman, 2005). In practice, this can mean that we perform ANOVA by fitting a multilevel model, or that we use ANOVA ideas to summarize multilevel inferences. Multilevel modeling is inherently Bayesian in that it involves a potentially large number of parameters that are modeled with probability distributions (see, for example, Goldstein, 1995; Kreft and De Leeuw, 1998; Snijders and Bosker, 1999). The differences between Bayesian and non-Bayesian multilevel models are typically minor except in settings with many sources of variation and little information on each, in which case some benefit can be gained from a fully Bayesian approach that models the variance parameters.

4 Related topics

4.1 Finite-population and superpopulation variances

So far in this article we have considered, at each level (that is, each source of variation) of a model, the standard deviation of the corresponding set of coefficients. We call this the finite-population standard deviation. Another quantity of potential interest is the standard deviation of the hypothetical superpopulation from which these particular coefficients were drawn. The point estimates of these two variance parameters are similar (with the classical method of moments, the estimates are identical, because the superpopulation variance is the expected value of the finite-population variance) but they will have different uncertainties. The inferences for the finite-population standard deviations are more precise, as they correspond to effects for which we actually have data.
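The statement that the superpopulation variance is the expected value of the finite-population variance can be illustrated with a small simulation in Python. The setup is hypothetical: batches of 8 coefficients drawn from a normal superpopulation with standard deviation 0.3, with the finite-population variance computed using the usual J − 1 divisor. The function names and the seed are arbitrary.

```python
import random


def finite_population_variance(coefs):
    """Variance of the coefficients actually in the batch (J - 1 divisor)."""
    mean = sum(coefs) / len(coefs)
    return sum((c - mean) ** 2 for c in coefs) / (len(coefs) - 1)


def simulate(sigma, n_coefs, n_reps, seed=0):
    """Finite-population variances for repeated batches from one superpopulation."""
    rng = random.Random(seed)
    return [
        finite_population_variance([rng.gauss(0.0, sigma) for _ in range(n_coefs)])
        for _ in range(n_reps)
    ]


sigma = 0.3  # hypothetical superpopulation standard deviation
draws = simulate(sigma, n_coefs=8, n_reps=20000)
average = sum(draws) / len(draws)

print("superpopulation variance:", sigma ** 2)
print("average finite-population variance:", round(average, 4))
print("range across batches:", round(min(draws), 4), "to", round(max(draws), 4))
```

Any single batch of 8 coefficients can have a finite-population variance well above or below the superpopulation value, which is why knowing the coefficients in hand says less about the superpopulation than about the batch itself.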

Figure 3 illustrates the finite-population and superpopulation inferences at each level of the model for the flight-simulator example. We know much more about the 5 treatments and 8 airports in our dataset than for the general populations of treatments and airports. (We similarly know more

Figure 3: Median estimates, 50% intervals, and 95% intervals for (a) finite-population and (b) superpopulation standard deviations of the treatment-level, airport-level, and data-level errors in the flight-simulator example from Figure 1. The two sorts of standard deviation parameters have essentially the same estimates, but the finite-population quantities are estimated much more precisely.

(We follow the general practice in statistical notation, using Greek and Roman letters for population and sample quantities, respectively.)

Figure 4: ANOVA displays for a 5 × 5 Latin square experiment (an example of a crossed three-way structure): (a) with no group-level predictors, (b) contrast analysis including linear trends for rows, columns, and treatments. See also the plots of coefficient estimates and trends in Figure 5.
