
## CONSIDERING AUTOCORRELATION IN

## PREDICTIVE MODELS

Daniela Stojanova

Doctoral Dissertation

Jožef Stefan International Postgraduate School

Ljubljana, Slovenia, December 2012

**Evaluation Board:**

Prof. Dr. Marko Bohanec, Chairman, Jožef Stefan Institute, Ljubljana, Slovenia

Assoc. Prof. Dr. Janez Demšar, Member, Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia

Asst. Prof. Michelangelo Ceci, Member, Università degli Studi di Bari “Aldo Moro”, Bari, Italy

## MEDNARODNA PODIPLOMSKA ŠOLA JOŽEFA STEFANA

## JOŽEF STEFAN INTERNATIONAL POSTGRADUATE SCHOOL

Daniela Stojanova

## CONSIDERING AUTOCORRELATION

## IN PREDICTIVE MODELS

Doctoral Dissertation

## UPOŠTEVANJE AVTOKORELACIJE V

## NAPOVEDNIH MODELIH

Doktorska disertacija

Supervisor: Prof. Dr. Sašo Džeroski

Ljubljana, Slovenia, December 2012

## Abstract

Most machine learning, data mining and statistical methods rely on the assumption that the analyzed data are independent and identically distributed (i.i.d.). More specifically, the individual examples included in the training data are assumed to be drawn, independently of one another, from the same probability distribution. However, cases where this assumption is violated can easily be found: for example, species are distributed non-randomly across a wide range of spatial scales. The i.i.d. assumption is often violated because of the phenomenon of autocorrelation.

The cross-correlation of an attribute with itself is typically referred to as autocorrelation: this is the most general definition found in the literature. Specifically, in statistics, temporal autocorrelation is defined as the cross-correlation of an attribute of a process with itself at different points in time. In time-series analysis, temporal autocorrelation is defined as the correlation among time-stamped values due to their relative proximity in time. In spatial analysis, spatial autocorrelation has been defined as the correlation among data values that is strictly due to the relative spatial proximity of the objects that the data refer to. It is justified by Tobler’s first law of geography, according to which “everything is related to everything else, but near things are more related than distant things”. In network studies, autocorrelation is defined by the homophily principle as the tendency of nodes with similar values to be linked with each other.

In this dissertation, we ﬁrst give a clear and general deﬁnition of the autocorrelation phenomenon, which includes spatial and network autocorrelation for continuous and discrete responses. We then present a broad overview of the existing autocorrelation measures for the different types of autocorrelation and data analysis methods that consider them. Focusing on spatial and network autocorrelation, we propose three algorithms that handle non-stationary autocorrelation within the framework of predictive clustering, which deals with the tasks of classiﬁcation, regression and structured output prediction. These algorithms and their empirical evaluation are the major contributions of this thesis.

We first propose a data mining method called SCLUS that explicitly considers spatial autocorrelation when learning predictive clustering models. The method is based on the concept of predictive clustering trees (PCTs), according to which hierarchies of clusters of similar data are identified and a predictive model is associated with each cluster. In particular, our approach is able to learn predictive models for both a continuous response (regression task) and a discrete response (classification task). It properly deals with autocorrelation in the data and provides a multi-level insight into the spatial autocorrelation phenomenon.

The predictive models adapt to the local properties of the data, while at the same time providing spatially smoothed predictions. We evaluate our approach on several real-world problems of spatial regression and spatial classification.

The problem of “network inference” is known to be challenging. In this dissertation, we propose a data mining method called NCLUS that explicitly considers autocorrelation when building predictive models from network data. The algorithm is based on the concept of PCTs, which can be used for clustering, prediction and multi-target prediction, including multi-target regression and multi-target classification. We evaluate our approach on several real-world problems of network regression, coming from the areas of social and spatial networks. Empirical results show that our algorithm performs better than PCTs learned by completely disregarding network information, than CLUS*, which is tailored for spatial data but does not take autocorrelation into account, and than a variety of other existing approaches.

We also propose a data mining method called NHMC, for (Network) Hierarchical Multi-label Classification. This has been motivated by the recent development of several machine learning algorithms for gene function prediction that work under the assumption that instances may belong to multiple classes and that classes are organized into a hierarchy. Besides relationships among classes, it is also possible to identify relationships among examples. Although such relationships have been identified and extensively studied in the literature, in particular as defined by protein-protein interaction (PPI) networks, they have not received much attention in hierarchical and multi-class gene function prediction. Their use introduces the autocorrelation phenomenon and violates the i.i.d. assumption adopted by most machine learning algorithms. Besides improving the predictive capabilities of learned models, NHMC is helpful in obtaining predictions consistent with the network structure and in consistently combining two information sources (hierarchical collections of functional class definitions and PPI networks). We compare different PPI networks (DIP, VM and MIPS for yeast data) and their influence on the predictive capability of the models. Empirical evidence shows that explicitly taking network autocorrelation into account can increase the predictive capability of the models, especially when the PPI networks are dense.

NHMC outperforms CLUS-HMC (which disregards the network) for GO annotations, since these are more coherent with the PPI networks.

## Povzetek

## 1 Introduction

In this introductory chapter, we first place the dissertation within the broader context of its research area.

We then motivate the research performed within the scope of the dissertation. The major contributions of the thesis to science are described next. We conclude this chapter by giving an outline of the structure of the remainder of the thesis.

## 1.1 Outline

The research presented in this dissertation is placed in the area of artificial intelligence (Russell and Norvig, 2003), and more specifically in the area of machine learning. Machine learning is concerned with the design and the development of algorithms that allow computers to evolve behaviors based on empirical data, i.e., it studies computer programs that automatically improve with experience (Mitchell, 1997). A major focus of machine learning research is to extract information from data automatically by computational and statistical methods and make intelligent decisions based on the data. However, the difficulty lies in the fact that the set of all possible behaviors, given all possible inputs, is too large to be covered by the set of observed examples.

In general, there are two types of learning: inductive and deductive. Inductive machine learning (Bratko, 2000) is a very significant field of research in machine learning, where new knowledge is extracted from data that describe experience, given in the form of learning examples (instances). In contrast, deductive learning (Langley, 1996) explains a given set of rules by using specific information from the data.

Depending on the feedback the learner gets during the learning process, learning can be classiﬁed as supervised or unsupervised. Supervised learning is a machine learning technique for learning a function from a set of data. Supervised inductive machine learning, also called predictive modeling, assumes that each learning example includes some target property, and the goal is to learn a model that accurately predicts this property. On the other hand, unsupervised inductive machine learning, also called descriptive modeling, assumes no such target property to be predicted. Examples of machine learning methods for predictive modeling include decision trees, decision rules and support vector machines. In contrast, examples of machine learning methods for descriptive modeling include clustering, association rule modeling and principal-component analysis (Bishop, 2007).

In general, predictive and descriptive modeling are considered to be different machine learning tasks and are usually treated separately. However, predictive modeling can be seen as a special case of clustering (Blockeel, 1998). In this case, the goal of predictive modeling is to identify clusters that are compact in the target space (i.e., to group instances with similar values of the target variable). The goal of descriptive modeling, on the other hand, is to identify clusters that are compact in the descriptive space (i.e., to group instances with similar values of the descriptive variables).

Predictive modeling methods are used for predicting an output (i.e., a target property or target attribute) for an example. Typically, the output can be either a discrete variable (classification) or a continuous variable (regression). However, there are many real-life problems, such as text categorization, gene function prediction, image annotation, etc., where the input and/or the output are structured. Besides the


typical classification and regression tasks, we also consider predictive modeling tasks with structured outputs.

Predictive clustering (Blockeel, 1998) combines elements from both prediction and clustering. As in clustering, clusters of examples that are similar to each other are identified, but a predictive model is associated with each cluster. New instances are assigned to clusters based on the cluster descriptions. The associated predictive models provide predictions for the target property. The benefit of using predictive clustering methods, as in conceptual clustering (Michalski and Stepp, 2003), is that besides the clusters themselves, they also provide symbolic descriptions of the constructed clusters. However, in contrast to conceptual clustering, predictive clustering is a form of supervised learning.

Predictive clustering trees (PCTs) are tree-structured models that generalize decision trees. Key properties of PCTs are that i) they can be used to predict many or all attributes of an example at once (multi-target), ii) they can be applied to a wide range of prediction tasks (classification and regression) and iii) they can work with examples represented by means of a complex representation (Džeroski et al., 2007), which is achieved by plugging in a suitable distance metric for the task at hand. PCTs were first implemented in the context of first-order logical decision trees, in the system TILDE (Blockeel, 1998), where relational descriptions of the examples are used. The best-known implementation of PCTs, however, is the one that uses attribute-value descriptions of the examples and is implemented in the predictive clustering framework of the CLUS system (Blockeel and Struyf, 2002). The CLUS system is available for download at http://sourceforge.net/projects/clus/.
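To make the variance-reduction idea behind PCT induction concrete, the following minimal sketch scores candidate splits on a single numeric attribute; it illustrates the general principle only (all function names are ours, and this is not the CLUS implementation):

```python
from statistics import pvariance

def total_variance(targets):
    """Population variance of a list of target tuples, summed over the
    target columns; summing over columns covers the multi-target case."""
    columns = zip(*targets) if isinstance(targets[0], (tuple, list)) else [targets]
    return sum(pvariance(col) for col in columns)

def variance_reduction(targets, mask):
    """Var(E) - sum_k |E_k|/|E| * Var(E_k) for the binary split given by mask."""
    left = [t for t, m in zip(targets, mask) if m]
    right = [t for t, m in zip(targets, mask) if not m]
    n = len(targets)
    reduction = total_variance(targets)
    for part in (left, right):
        if part:
            reduction -= len(part) / n * total_variance(part)
    return reduction

def best_numeric_split(attribute, targets):
    """Scan candidate thresholds on one numeric attribute and return the
    (threshold, variance reduction) pair with the largest reduction."""
    best_t, best_r = None, float("-inf")
    for t in sorted(set(attribute))[:-1]:  # thresholds between distinct values
        r = variance_reduction(targets, [a <= t for a in attribute])
        if r > best_r:
            best_t, best_r = t, r
    return best_t, best_r
```

A full tree learner would apply `best_numeric_split` over all attributes at each node and recurse on the two subsets, which is exactly the top-down induction scheme PCTs inherit from decision trees.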

Here, we extend the predictive clustering framework to work in the context of autocorrelated data.

For such data, the independence assumption that typically underlies machine learning methods and multivariate statistics is no longer valid. Namely, the autocorrelation phenomenon directly violates the assumption that the data instances are drawn, independently of each other, from an identical distribution (i.i.d.). At the same time, it offers a unique opportunity to improve the performance of predictive models that take it into account.

Autocorrelation is very common in nature and has been investigated in different fields, from statistics and time-series analysis to signal processing and music recordings. Here we acknowledge the existence of four different types of autocorrelation: spatial, temporal, spatio-temporal and network (relational) autocorrelation, and we describe the existing autocorrelation measures and the data analysis methods that consider them. However, in the development of the proposed algorithms, we focus on spatial autocorrelation and network autocorrelation. In addition, we also deal with the complex case of predicting structured targets (outputs), where network autocorrelation is considered.
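As a concrete illustration of two such measures (a generic sketch of standard definitions, not code from the dissertation), lag-k temporal autocorrelation and global Moran's I for spatial or network data can be computed as follows:

```python
def temporal_autocorrelation(series, lag=1):
    """Lag-k autocorrelation: covariance of x_t with x_{t+k}, normalized by
    the variance of the whole series (the usual time-series convention)."""
    mean = sum(series) / len(series)
    z = [x - mean for x in series]
    denom = sum(v * v for v in z)
    return sum(z[i] * z[i + lag] for i in range(len(z) - lag)) / denom

def morans_i(values, weights):
    """Global Moran's I: autocorrelation of `values` under a spatial or
    network weight matrix, where weights[i][j] > 0 iff objects i and j
    are neighbors (or linked nodes)."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    num = n * sum(weights[i][j] * z[i] * z[j]
                  for i in range(n) for j in range(n))
    den = w_sum * sum(v * v for v in z)
    return num / den
```

Values of Moran's I near +1 indicate positive autocorrelation (linked or nearby objects carry similar values), values near -1 indicate negative autocorrelation, and values near the expectation -1/(n-1) indicate its absence.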

In the PCT framework (Blockeel, 1998), a tree is viewed as a hierarchy of clusters: the top-node contains all the data, which is recursively partitioned into smaller clusters while moving down the tree.

This structure allows us to estimate and exploit the effect of autocorrelation in different ways at different nodes of the tree. In this way, we are able to deal with non-stationary autocorrelation, i.e., autocorrelation whose effects may change over space or over the network structure.

PCTs are learned by extending the heuristic functions used in tree induction to include spatial/network autocorrelation. In this way, we obtain predictive models that are able to deal with autocorrelated data. More specifically, when evaluating candidate splits for adding a new node to the tree, besides maximizing the variance reduction, which minimizes the intra-cluster distance in the class labels associated with the examples, we also maximize cluster homogeneity in terms of autocorrelation. This results in improved predictive performance of the obtained models and in smoother predictions.
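Schematically, such an extended heuristic can be thought of as a weighted combination of variance reduction and the autocorrelation of each candidate cluster. The sketch below illustrates this idea only: the weighting parameter `alpha`, the choice of Moran's I as the autocorrelation term, and all names are our assumptions, not the exact heuristic used in the dissertation's algorithms:

```python
def pvar(xs):
    """Population variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def morans_i(values, weights):
    """Global Moran's I of `values` under a weight matrix; 0 if degenerate
    (constant values or no links inside the cluster)."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    den = w_sum * sum(v * v for v in z)
    if den == 0:
        return 0.0
    num = n * sum(weights[i][j] * z[i] * z[j]
                  for i in range(n) for j in range(n))
    return num / den

def split_score(targets, weights, mask, alpha=0.5):
    """Score a candidate split as alpha * variance reduction plus
    (1 - alpha) * size-weighted autocorrelation of the two clusters."""
    n = len(targets)
    groups = [[i for i in range(n) if mask[i]],
              [i for i in range(n) if not mask[i]]]
    # Variance-reduction term: rewards splits whose clusters are compact
    # in the target space.
    var_red = pvar(targets) - sum(len(g) / n * pvar([targets[i] for i in g])
                                  for g in groups if g)
    # Autocorrelation term: Moran's I restricted to each cluster's subnetwork.
    auto = sum(len(g) / n * morans_i([targets[i] for i in g],
                                     [[weights[i][j] for j in g] for i in g])
               for g in groups if g)
    return alpha * var_red + (1 - alpha) * auto
```

Because the autocorrelation term is recomputed on each cluster's own subnetwork, the heuristic can trade off differently at different nodes of the tree, which is what allows non-stationary autocorrelation to be handled.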