
Mach Learn (2007) 68: 1–33

DOI 10.1007/s10994-007-5006-x

Discovering Significant Patterns

Geoffrey I. Webb

Received: 23 May 2005 / Revised: 13 February 2007 / Accepted: 14 February 2007 / Published online: 14 April 2007

Springer Science+Business Media, LLC 2007


Abstract Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patterns that appear due to chance alone to satisfy the constraints on the sample data. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard pattern discovery techniques can discover numerous spurious patterns when applied to random data and, when applied to real-world data, result in large numbers of patterns that are rejected when subjected to sound statistical evaluation. They also reveal that a number of pragmatic choices about how such tests are performed can greatly affect their power.

Keywords Pattern discovery · Statistical evaluation · Association rules

Editor: Johannes Fürnkranz.

G.I. Webb
Faculty of Information Technology, Monash University, PO Box 75, Clayton, Vic., 3800, Australia
e-mail: webb@infotech.monash.edu

1 Introduction

This paper addresses the problem of using statistical hypothesis tests to screen individual patterns in the context of discovering large numbers of patterns from a single set of data. This problem arises in pattern discovery, as exemplified by association rule discovery (Agrawal et al. 1993), k-optimal or k-top rule discovery (Webb 1995; Scheffer and Wrobel 2002; Webb and Zhang 2005), contrast or emerging pattern discovery (Bay and Pazzani 2001; Dong and Li 1999), subgroup discovery (Klösgen 1996), interesting itemset discovery (Jaroszewicz and Simovici 2004) and impact or quantitative rule discovery (Aumann 1999; Webb 2001; Zhang et al. 2004). All these techniques search large spaces of possible patterns P and return all patterns that satisfy user-defined constraints. Such a process entails evaluating numerous patterns ρ ∈ P against a set of sample data D drawn from a distribution Θ, seeking those ρ that satisfy user-specified constraints φ with respect to Θ. Each time such an evaluation is performed there is a risk that the pattern ρ will satisfy φ with respect to the sample data D, and hence that it will be inferred that ρ satisfies φ with respect to Θ, even though this inference is incorrect. Even if this risk is tightly bounded for each ρ, by considering numerous patterns the risk of accepting at least one erroneous pattern can grow large. For example, suppose the risk is bounded by applying a statistical test for φ, accepting each pattern only if the significance level falls below critical value κ. If n independent patterns are assessed, the risk of accepting at least one erroneous pattern by chance is 1 − (1 − κ)^n, which, for even relatively small values of κ and n, rapidly approaches 1.0. This problem is closely related to the multiple comparisons problem (Jensen and Cohen 2000) and the problem of oversearch (Quinlan and Cameron-Jones 1995).
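The compounding of per-pattern risk can be illustrated numerically. This short sketch (the function name is ours, not from the paper) evaluates 1 − (1 − κ)^n for a conventional κ = 0.05:

```python
# Familywise risk of at least one false discovery among n independent
# tests, each performed at critical value kappa: 1 - (1 - kappa)^n.

def familywise_risk(kappa: float, n: int) -> float:
    return 1.0 - (1.0 - kappa) ** n

# Even a conventional kappa = 0.05 yields a near-certain false
# discovery once a few hundred patterns are assessed.
for n in (1, 10, 100, 1000):
    print(n, round(familywise_risk(0.05, n), 4))
```

With n = 1000 the risk is effectively 1.0, which is why uncorrected per-pattern testing offers no experimentwise guarantee.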

Most pattern discovery systems do little to control this risk. While a number of approaches to controlling this risk have been developed (for example, Megiddo and Srikant 1998; Bay and Pazzani 2001; Webb 2002), each has limitations as detailed in Sect. 4.

The current paper investigates two approaches to applying statistical tests in pattern discovery. The first applies a Bonferroni correction for multiple tests (Shaffer 1995), dividing the global significance level α by the number of patterns in the search space in order to obtain the critical value κ. The second divides the available data into exploratory and holdout sets. The exploratory data are used for pattern discovery. The holdout data are then used to assess the patterns so discovered, with any set of statistical hypothesis tests the user desires being applied to each pattern in turn. The critical value used with the hypothesis tests is adjusted for the number of patterns tested using a correction for multiple tests such as the Bonferroni correction.
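A minimal sketch of the two adjustments, under the assumption that a p-value is available for each pattern from some hypothesis test; the helper names and the dictionary-based p-value lookup are illustrative, not part of the paper:

```python
# Sketch of the two corrections described above (helper names assumed).

def bonferroni_critical_value(alpha: float, search_space_size: int) -> float:
    # Direct adjustment: divide the global significance level alpha by
    # the number of patterns in the search space to obtain kappa.
    return alpha / search_space_size

def holdout_evaluate(patterns, p_value, alpha=0.05):
    # Holdout evaluation: patterns were discovered on exploratory data;
    # each is now tested on the holdout data, with the critical value
    # Bonferroni-adjusted for the number of patterns actually tested.
    kappa = alpha / len(patterns)
    return [p for p in patterns if p_value(p) <= kappa]

# Illustrative use: two discovered patterns with holdout p-values.
pvals = {"A->y": 0.001, "B->y": 0.03}
print(holdout_evaluate(list(pvals), pvals.get))  # kappa = 0.025, so only "A->y" survives
```

Note the asymmetry: the direct adjustment divides by the size of the whole search space, whereas holdout evaluation divides only by the (typically far smaller) number of patterns actually discovered.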

These approaches

• can apply any type of statistical hypothesis test to each of the individual patterns discovered by a system;

• can apply any number of such hypothesis tests to each of the patterns; and

• provide precise control over the experimentwise risk of type-1 error. That is, under the assumption that D is an iid sample from Θ, they allow the experimenter to establish a precise upper-bound on the risk of any of the multiple hypothesis tests falsely rejecting a null hypothesis and thus on the risk of falsely accepting a pattern.

Further, the holdout-evaluation approach can be applied as a simple wrapper to any pattern discovery system.

These approaches to handling multiple comparisons are well established in statistical theory and practice (Shaffer 1995). The contribution of the current work is to recognize their applicability in the context of the massive search spaces frequently explored in pattern discovery, to highlight the critical need for such techniques, and to investigate their relative strengths and weaknesses in this context.

The use of holdout evaluation should be very familiar to most machine learning researchers, who frequently use a similar process, dividing data into training and test sets, forming models from the former and obtaining unbiased estimates of their expected performance by observing their performance against the latter. The differences are that rather than obtaining an unbiased estimate of the quality of a single model, such as its error or ROC curve, we are seeking to either accept or reject each of many patterns, and we are doing so in a manner that strictly controls the risk of false discoveries.

The paper is organized as follows. Section 2 describes pattern discovery. Section 3 provides a formal problem statement. Section 4 discusses previous approaches to statistical testing within pattern discovery and outlines their limitations. Section 5 presents the new approaches. Section 6 summarises a series of experiments that assess key attributes of the new technique and compare its performance to that of previous approaches. Details of the experiments are presented in Appendices 1 and 2. Section 7 provides a general discussion including directions for future research. Section 8 concludes the paper with a summary and concluding remarks.

The holdout-evaluation technique was first presented in Webb (2003) and the direct-adjustment technique was first presented in Webb (2006). Experiment 1 in the current paper, and Sect. 6.1 that discusses it, are reproduced from Webb (2006), and Experiments 2, 3 and 4 are more extensive variants of experiments in that previous paper. The discussions of previous approaches in Sect. 4 and the new approaches in Sect. 5 are substantially expanded from those in the earlier papers. The general discussion presented in Sect. 7 is entirely new.

2 Pattern Discovery

Most machine learning systems learn a single model from the available data. The model learned is usually that expected to maximize accuracy or some other metric on unseen future data. Many systems that learn explicit models, such as decision tree (Quinlan 1993) or decision rule (Michalski 1983) learners, do so by searching a space of alternative models to select the model that appears to perform best with respect to the available data.

Frequently during such a search through the space of models the learner will encounter alternatives that perform equally well or almost as well as each other on the available data.

For example, during decision tree induction it is common to find multiple alternative splits at a node all of which score as well or nearly as well on the split selection criterion.

A machine learning system must inevitably make arbitrary choices between such alternative models. Quite simply, it must choose one of these models and has no non-arbitrary basis on which to do so. This is acceptable if there is no other source of information about the potential models. If several alternatives appear equally good, there is no basis on which to prefer any one over the others and it does not matter which is selected.

However, in some contexts there are factors not available to the learning system that can help distinguish between the models.

• Experts may be able to bring to bear background knowledge of a form that would be difficult to encode and make available to a learning system.

• Alternative models may have different levels of desirability because they use different attributes that represent tests with differing costs to evaluate (Turney 2000). For example, in a medical diagnosis task it might be possible to form alternative models of equivalent power, one using a simple blood test and the other using an expensive MRI scan. Clearly the former model will be more desirable. However, most learners are unable to take account of the costs of tests during a learning process.

• Some models may be socially or politically unacceptable, for example because they do not fit into senior management’s conceptualization of the business.

• Some models may be unusable due to legislated restrictions.

• Alternative models may be more or less acceptable to a user population simply because one is consistent with existing knowledge and beliefs and the other is not.

• Finally, in some applications it may be possible to deploy multiple models and hence be unnecessary to derive just one.

For these reasons it is often desirable to find all models that satisfy some criteria with regard to the data. In such pattern discovery tasks, the user specifies constraints on the models to be discovered and the system discovers all models that satisfy those constraints. For many pattern discovery techniques the models take the form of rules. However, the same underlying techniques may be applied to other forms of model of regularities in the data, such as itemsets (Agrawal et al. 1993), sequential patterns (Agrawal and Srikant 1995) and sub-graphs (Kuramochi and Karypis 2001). The best known example of pattern discovery is association rule discovery (Agrawal et al. 1993).

3 Problem Statement

As outlined in the introduction, pattern discovery seeks to identify patterns ρ ∈ P that satisfy constraints φ with respect to distribution Θ. However, whether ρ satisfies φ with respect to Θ is assessed by reference to sample data D drawn from Θ. Although the principles extend directly to further contexts, the current research limits consideration to two types of data, transactional data and attribute-value data, and one type of pattern, rules.

For both data types, D is a multiset of n records and each record R ∈ D is a set of items R ⊆ I. For transactional data, items are atomic terms. For attribute-value data, there exists a set of a attributes A1, …, Aa; each attribute Ai has a domain of #Ai values, dom(Ai); each item is an attribute-value pair, denoted Ai = v, where v ∈ dom(Ai); and each record R ∈ D contains exactly one item for each attribute.

Rules take the form X → y, where X ⊆ I, |X| ≥ 1 and y ∈ I. X is called the antecedent and y the consequent of the rule. For attribute-value data, X ∪ {y} may contain no more than one item for any one attribute.

Association rule discovery finds all rules that satisfy constraints φ, specified as a minimum support (Agrawal et al. 1993), together with other constraints, if desired, such as minimum confidence (Agrawal et al. 1993), lift (International Business Machines 1996), or leverage (Piatetsky-Shapiro 1991). These terms are defined with respect to a rule X → y and dataset D as follows:

• coverage(X → y) is |{R ∈ D : X ⊆ R}|;

• support(X → y) is |{R ∈ D : X ∪ {y} ⊆ R}|;

• confidence(X → y) = support(X → y)/coverage(X → y). This can be viewed as a maximum likelihood estimate of the conditional probability P (y | X);

• lift(X → y) = confidence(X → y)/confidence(∅ → y);

• leverage(X → y) = support(X → y) − coverage(X → y) × |{R ∈ D : y ∈ R}| / |D|. This measure represents the difference between the support and the support that would be expected if X and y were independent.
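The five measures follow directly from the set-based definitions above. In this sketch records are Python sets of items, and the toy transactional dataset is purely illustrative:

```python
# coverage, support, confidence, lift and leverage, computed from the
# set-based definitions above. Each record R is a set of items.

def coverage(D, X):
    return sum(1 for R in D if X <= R)

def support(D, X, y):
    return sum(1 for R in D if X | {y} <= R)

def confidence(D, X, y):
    # Maximum likelihood estimate of P(y | X).
    return support(D, X, y) / coverage(D, X)

def lift(D, X, y):
    # Confidence relative to the empty-antecedent rule, i.e. the base
    # rate of y in D.
    return confidence(D, X, y) / (sum(1 for R in D if y in R) / len(D))

def leverage(D, X, y):
    # Observed support minus the support expected were X and y independent.
    return support(D, X, y) - coverage(D, X) * sum(1 for R in D if y in R) / len(D)

# Toy dataset: four transactions over items a, b, y.
D = [{"a", "b", "y"}, {"a", "y"}, {"a", "b"}, {"b", "y"}]
print(coverage(D, {"a"}), support(D, {"a"}, "y"), leverage(D, {"a"}, "y"))
```

For the rule a → y above, coverage is 3, support is 2, and leverage is 2 − 3 × 3/4 = −0.25, i.e. a and y co-occur slightly less often than independence would predict.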

Each assessment of whether a given ρ satisfies φ is accompanied by a risk that the pattern ρ will satisfy φ with respect to the sample data D but not with respect to Θ. Most pattern discovery systems fail to effectively control this risk.

Statistical hypothesis tests are applicable to such a scenario. To apply such a test it is necessary to specify a null hypothesis, in our context the hypothesis that the negation of φ is true. If the discovery process “discovers” a pattern ρ that satisfies the null hypothesis, ρ is considered to be a false discovery or equivalently, a type-1 error. Any pattern ρ that is not “discovered” and does not satisfy the null hypothesis is called a type-2 error.

The techniques presented herein allow arbitrary statistical hypothesis tests to be applied during pattern discovery in a manner that allows the user to place a strict upper bound on the risk of any pattern being accepted that is a false discovery.
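The framework deliberately leaves the choice of hypothesis test open. As one concrete instance, a one-sided Fisher exact test of independence between antecedent and consequent can be computed from a rule's 2×2 contingency table; this hand-rolled hypergeometric tail is our construction for illustration, not a method prescribed by the paper, and needs only the standard library:

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    # 2x2 contingency table for a rule X -> y on the evaluation data:
    #   a = |X and y|,     b = |X and not y|,
    #   c = |not X and y|, d = |not X and not y|.
    # P-value: probability, under independence with margins fixed, of an
    # X-and-y count at least as large as a (hypergeometric upper tail).
    n = a + b + c + d
    row = a + b          # records covered by X
    col = a + c          # records containing y
    denom = comb(n, col)
    p = 0.0
    for k in range(a, min(row, col) + 1):
        p += comb(row, k) * comb(n - row, col - k) / denom
    return p

# A pattern is retained only if its p-value falls below the
# multiplicity-adjusted critical value kappa.
print(round(fisher_one_sided(8, 2, 2, 8), 4))
```

In the holdout setting, this p-value would be computed on the holdout data for each discovered pattern and compared against the Bonferroni-adjusted κ.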


4 Previous Approaches to Avoiding False Discoveries in Pattern Discovery
