# Discovering Significant Patterns

Geoffrey I. Webb

Mach Learn (2007) 68: 1–33

DOI 10.1007/s10994-007-5006-x

Received: 23 May 2005 / Revised: 13 February 2007 / Accepted: 14 February 2007 / Published online: 14 April 2007

© Springer Science+Business Media, LLC 2007

**Abstract**

Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patterns that appear due to chance alone to satisfy the constraints on the sample data. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard pattern discovery techniques can discover numerous spurious patterns when applied to random data and, when applied to real-world data, result in large numbers of patterns that are rejected when subjected to sound statistical evaluation. They also reveal that a number of pragmatic choices about how such tests are performed can greatly affect their power.

Keywords Pattern discovery · Statistical evaluation · Association rules

Editor: Johannes Fürnkranz.

G.I. Webb, Faculty of Information Technology, Monash University, PO Box 75, Clayton, Vic., 3800, Australia. e-mail: webb@infotech.monash.edu

**1 Introduction**

This paper addresses the problem of using statistical hypothesis tests to screen individual patterns in the context of discovering large numbers of patterns from a single set of data. This problem arises in pattern discovery, as exemplified by association rule discovery (Agrawal et al. 1993), k-optimal or k-top rule discovery (Webb 1995; Scheffer and Wrobel 2002; Webb and Zhang 2005), contrast or emerging pattern discovery (Bay and Pazzani 2001; Dong and Li 1999), subgroup discovery (Klösgen 1996), interesting itemset discovery (Jaroszewicz and Simovici 2004) and impact or quantitative rule discovery (Aumann 1999; Webb 2001; Zhang et al. 2004). All these techniques search large spaces of possible patterns P and return all patterns that satisfy user-defined constraints. Such a process entails evaluating numerous patterns ρ ∈ P against a set of sample data D drawn from a distribution Θ, seeking those ρ that satisfy user-specified constraints φ with respect to Θ. Each time such an evaluation is performed there is a risk that the pattern ρ will satisfy φ with respect to the sample data D, and hence that it will be inferred that ρ satisfies φ with respect to Θ, even though this inference is incorrect. Even if this risk is tightly bounded for each ρ, by considering numerous patterns the risk of accepting at least one erroneous pattern can grow large. For example, suppose the risk is bounded by applying a statistical test for φ, accepting each pattern only if the significance level falls below critical value κ. If n independent patterns are assessed, the risk of accepting at least one erroneous pattern by chance is 1 − (1 − κ)^n, which, for even relatively small values of κ and n, rapidly approaches 1.0. This problem is closely related to the multiple comparisons problem (Jensen and Cohen 2000) and the problem of oversearch (Quinlan and Cameron-Jones 1995).
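As a quick illustration of how this risk compounds, the following sketch (illustrative code, not from the paper) computes 1 − (1 − κ)^n for growing numbers of independent tests:

```python
def experimentwise_risk(kappa: float, n: int) -> float:
    """Probability of at least one false discovery when n independent
    patterns are each tested at per-pattern significance level kappa."""
    return 1.0 - (1.0 - kappa) ** n

# Even a seemingly strict per-test level is quickly overwhelmed:
for n in (1, 100, 10_000):
    print(n, experimentwise_risk(0.05, n))
# With kappa = 0.05, the risk already exceeds 0.99 at n = 100.
```

Pattern discovery systems routinely evaluate millions of candidate patterns, so without correction the experimentwise risk is effectively 1.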

Most pattern discovery systems do little to control this risk. While a number of approaches to controlling this risk have been developed (for example, Megiddo and Srikant 1998; Bay and Pazzani 2001; Webb 2002), each has limitations as detailed in Sect. 4.

The current paper investigates two approaches to applying statistical tests in pattern discovery. The first applies a Bonferroni correction for multiple tests (Shaffer 1995), dividing the global significance level α by the number of patterns in the search space in order to obtain the critical value κ. The second divides the available data into exploratory and holdout sets. The exploratory data are used for pattern discovery. The holdout data are then used to assess the patterns so discovered, applying any set of statistical hypothesis tests the user desires to each pattern in turn. The critical value used with the hypothesis tests is adjusted for the number of patterns tested using a correction for multiple tests such as the Bonferroni correction.
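A minimal sketch of the two corrections just described; the function and argument names are illustrative, not the paper's API:

```python
def bonferroni_critical_value(alpha: float, n_patterns: int) -> float:
    """Direct adjustment: divide the experimentwise significance level
    alpha by the number of patterns in the search space."""
    return alpha / n_patterns

def holdout_screen(patterns, p_value, alpha=0.05):
    """Holdout evaluation: `patterns` were discovered on the exploratory
    data; `p_value(rho)` is any user-chosen hypothesis test evaluated on
    the holdout data.  The critical value is corrected only for the
    number of patterns actually tested."""
    kappa = alpha / len(patterns)
    return [rho for rho in patterns if p_value(rho) <= kappa]
```

A key difference visible in the sketch: holdout evaluation corrects for the (typically small) set of patterns carried forward to the holdout data, whereas direct adjustment must correct for the entire search space.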

These approaches

• can apply any type of statistical hypothesis test to each of the individual patterns discovered by a system;

• can apply any number of such hypothesis tests to each of the patterns; and

• provide precise control over the experimentwise risk of type-1 error. That is, under the assumption that D is an iid sample from Θ, they allow the experimenter to establish a precise upper bound on the risk of any of the multiple hypothesis tests falsely rejecting a null hypothesis and thus on the risk of falsely accepting a pattern.

Further, the holdout-evaluation approach can be applied as a simple wrapper to any pattern discovery system.

These approaches to handling multiple comparisons are well established in statistical theory and practice (Shaffer 1995). The contribution of the current work is to recognize their applicability in the context of the massive search spaces frequently explored in pattern discovery, to highlight the critical need for such techniques, and to investigate their relative strengths and weaknesses in this context.

The use of holdout evaluation should be very familiar to most machine learning researchers, who frequently use a similar process, dividing data into training and test sets, forming models from the former and obtaining unbiased estimates of their expected performance by observing their performance against the latter. The differences are that rather than obtaining an unbiased estimate of the quality of a single model, such as its error or ROC curve, we are seeking to either accept or reject each of many patterns, and we are doing so in a manner that strictly controls the risk of false discoveries.

The paper is organized as follows. Section 2 describes pattern discovery. Section 3 provides a formal problem statement. Section 4 discusses previous approaches to statistical testing within pattern discovery and outlines their limitations. Section 5 presents the new approaches. Section 6 summarises a series of experiments that assess key attributes of the new technique and compare its performance to that of previous approaches. Details of the experiments are presented in Appendices 1 and 2. Section 7 provides a general discussion including directions for future research. Section 8 concludes the paper with a summary and concluding remarks.

The holdout-evaluation technique was first presented in Webb (2003) and the direct-adjustment technique first presented in Webb (2006). Experiment 1 in the current paper, and Sect. 6.1 that discusses it, are reproduced from Webb (2006), and Experiments 2, 3 and 4 are more extensive variants of experiments in that previous paper. The discussions of previous approaches in Sect. 4 and the new approaches in Sect. 5 are substantially expanded from those in the earlier papers. The general discussion presented in Sect. 7 is entirely new.

**2 Pattern Discovery**

Most machine learning systems learn a single model from the available data. The model learned is usually that expected to maximize accuracy or some other metric on unseen future data. Many systems that learn explicit models, such as decision tree (Quinlan 1993) or decision rule (Michalski 1983) learners, do so by searching a space of alternative models to select the model that appears to perform best with respect to the available data.

Frequently during such a search through the space of models the learner will encounter alternatives that perform equally well or almost as well as each other on the available data.

For example, during decision tree induction it is common to find multiple alternative splits at a node, all of which score as well or nearly as well on the split selection criterion.

A machine learning system must inevitably make arbitrary choices between such alternative models. Quite simply, it must choose one of these models and has no non-arbitrary basis on which to do so. This is acceptable if there is no other source of information about the potential models. If several alternatives appear equally good, there is no basis on which to prefer any one over the others and it does not matter which is selected.

However, in some contexts there are factors not available to the learning system that can help distinguish between the models.

• Experts may be able to bring to bear background knowledge of a form that would be difficult to encode and make available to a learning system.

• Alternative models may have different levels of desirability because they use different attributes that represent tests with differing costs to evaluate (Turney 2000). For example, in a medical diagnosis task it might be possible to form alternative models of equivalent power, one using a simple blood test and the other using an expensive MRI scan. Clearly the former model will be more desirable. However, most learners are unable to take account of the costs of tests during the learning process.

• Some models may be socially or politically unacceptable, for example because they do not fit into senior management’s conceptualization of the business.

• Some models may be unusable due to legislated restrictions.

• Alternative models may be more or less acceptable to a user population simply because one is consistent with existing knowledge and beliefs and the other is not.

• Finally, in some applications it may be possible to deploy multiple models and hence be unnecessary to derive just one.

For these reasons it is often desirable to find all models that satisfy some criteria with regard to the data. In such pattern discovery tasks, the user specifies constraints on the models to be discovered and the system discovers all models that satisfy those constraints. For many pattern discovery techniques the models take the form of rules. However, the same underlying techniques may be applied to other forms of model of regularities in the data, such as itemsets (Agrawal et al. 1993), sequential patterns (Agrawal and Srikant 1995) and sub-graphs (Kuramochi and Karypis 2001). The best known example of pattern discovery is association rule discovery (Agrawal et al. 1993).

**3 Problem Statement**

As outlined in the introduction, pattern discovery seeks to identify patterns ρ ∈ P that satisfy constraints φ with respect to distribution Θ. However, whether ρ satisfies φ with respect to Θ is assessed by reference to sample data D drawn from Θ. Although the principles extend directly to further contexts, the current research limits consideration to two types of data, transactional data and attribute-value data, and one type of pattern, rules.

For both data types, D is a multiset of n records and each record R ∈ D is a set of items R ⊆ I. For transactional data, items are atomic terms. For attribute-value data, there exists a set of a attributes A1, ..., Aa; each attribute Ai has a domain dom(Ai) of #Ai values; each item is an attribute-value pair, denoted Ai = v, where v ∈ dom(Ai); and each record R ∈ D contains exactly one item for each attribute.

Rules take the form X → y, where X ⊆ I, |X| ≥ 1 and y ∈ I. X is called the antecedent and y the consequent of the rule. For attribute-value data, X ∪ {y} may contain no more than one item for any one attribute.

Association rule discovery finds all rules that satisfy constraints φ, specified as a minimum support (Agrawal et al. 1993) together with other constraints, if desired, such as minimum confidence (Agrawal et al. 1993), lift (International Business Machines 1996), or leverage (Piatetsky-Shapiro 1991). These terms are defined with respect to a rule X → y and dataset D as follows:

• coverage(X → y) is |{R ∈ D : X ⊆ R}|;

• support(X → y) is |{R ∈ D : X ∪ {y} ⊆ R}|;

• confidence(X → y) = support(X → y)/coverage(X → y). This can be viewed as a maximum likelihood estimate of the conditional probability P(y | X);

• lift(X → y) = confidence(X → y)/confidence(∅ → y);

• leverage(X → y) = support(X → y) − coverage(X → y) × |{R ∈ D : y ∈ R}| / |D|. This measure represents the difference between the support and the support that would be expected if X and y were independent.
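The count-based definitions above can be computed directly from a dataset represented as a list of item sets. The following is an illustrative sketch, not code from the paper:

```python
def coverage(D, X):
    # |{R in D : X ⊆ R}|
    return sum(1 for R in D if X <= R)

def support(D, X, y):
    # |{R in D : X ∪ {y} ⊆ R}|
    return sum(1 for R in D if X | {y} <= R)

def confidence(D, X, y):
    return support(D, X, y) / coverage(D, X)

def lift(D, X, y):
    # confidence(X -> y) / confidence(empty -> y)
    return confidence(D, X, y) / (sum(1 for R in D if y in R) / len(D))

def leverage(D, X, y):
    # support minus the support expected if X and y were independent
    return support(D, X, y) - coverage(D, X) * sum(1 for R in D if y in R) / len(D)

# A toy transactional dataset:
D = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}]
print(coverage(D, {"a"}), support(D, {"a"}, "b"))  # 3 2
```

Note that, following the paper's definitions, support and leverage are counts rather than proportions; dividing leverage by |D| would give the probability-scale variant seen elsewhere in the literature.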

Each assessment of whether a given ρ satisfies φ is accompanied by a risk that the pattern ρ will satisfy φ with respect to the sample data D but not with respect to Θ. Most pattern discovery systems fail to effectively control this risk.

Statistical hypothesis tests are applicable to such a scenario. To apply such a test it is necessary to specify a null hypothesis, in our context the hypothesis that the negation of φ is true. If the discovery process “discovers” a pattern ρ that satisfies the null hypothesis, ρ is considered to be a false discovery or, equivalently, a type-1 error. Any pattern ρ that is not “discovered” and does not satisfy the null hypothesis is called a type-2 error.
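As one concrete example of such a test (the framework permits any hypothesis test, so this particular choice is illustrative), a one-sided Fisher exact test of the null hypothesis that X and y are independent can be computed from the count statistics defined above:

```python
from math import comb

def fisher_p_value(support_xy, coverage_x, support_y, n):
    """One-sided Fisher exact test for a rule X -> y against the null
    hypothesis that X and y are independent.  Arguments are counts:
    support_xy = |{R : X ∪ {y} ⊆ R}|, coverage_x = |{R : X ⊆ R}|,
    support_y = |{R : y ∈ R}|, and n = |D|."""
    # Hypergeometric tail: probability that at least support_xy of the
    # coverage_x records covered by X also contain y, by chance alone.
    upper = min(coverage_x, support_y)
    total = comb(n, coverage_x)
    return sum(comb(support_y, k) * comb(n - support_y, coverage_x - k)
               for k in range(support_xy, upper + 1)) / total
```

A small p-value indicates that the observed co-occurrence of X and y would be unlikely under independence; the techniques of Sect. 5 determine the critical value against which such a p-value is compared.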

The techniques presented herein allow arbitrary statistical hypothesis tests to be applied during pattern discovery in a manner that allows the user to place a strict upper bound on the risk of any pattern being accepted that is a false discovery.


**4 Previous Approaches to Avoiding False Discoveries in Pattern Discovery**