CONSIDERING AUTOCORRELATION IN PREDICTIVE MODELS




Daniela Stojanova

Doctoral Dissertation

Jožef Stefan International Postgraduate School

Ljubljana, Slovenia, December 2012

Evaluation Board:

Prof. Dr. Marko Bohanec, Chairman, Jožef Stefan Institute, Ljubljana, Slovenia

Assoc. Dr. Janez Demšar, Member, Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia

Asst. Prof. Michelangelo Ceci, Member, Università degli Studi di Bari “Aldo Moro”, Bari, Italy




Supervisor: Prof. Dr. Sašo Džeroski


Abstract

Most machine learning, data mining and statistical methods rely on the assumption that the analyzed data are independent and identically distributed (i.i.d.). More specifically, the individual examples included in the training data are assumed to be drawn independently of each other from the same probability distribution. However, cases where this assumption is violated are easy to find: for example, species are distributed non-randomly across a wide range of spatial scales. The i.i.d. assumption is often violated because of the phenomenon of autocorrelation.

The cross-correlation of an attribute with itself is typically referred to as autocorrelation. This is the most general definition found in the literature. Specifically, in statistics, temporal autocorrelation is defined as the cross-correlation between the values of an attribute of a process at different points in time. In time-series analysis, temporal autocorrelation is defined as the correlation among time-stamped values due to their relative proximity in time. In spatial analysis, spatial autocorrelation has been defined as the correlation among data values that is strictly due to the relative location proximity of the objects that the data refer to. It is justified by Tobler’s first law of geography, according to which “everything is related to everything else, but near things are more related than distant things”. In network studies, autocorrelation is defined by the homophily principle as the tendency of nodes with similar values to be linked with each other.

In this dissertation, we first give a clear and general definition of the autocorrelation phenomenon, which includes spatial and network autocorrelation for continuous and discrete responses. We then present a broad overview of the existing autocorrelation measures for the different types of autocorrelation and data analysis methods that consider them. Focusing on spatial and network autocorrelation, we propose three algorithms that handle non-stationary autocorrelation within the framework of predictive clustering, which deals with the tasks of classification, regression and structured output prediction. These algorithms and their empirical evaluation are the major contributions of this thesis.

We first propose a data mining method called SCLUS that explicitly considers spatial autocorrelation when learning predictive clustering models. The method is based on the concept of predictive clustering trees (PCTs), according to which hierarchies of clusters of similar data are identified and a predictive model is associated with each cluster. In particular, our approach is able to learn predictive models for both a continuous response (regression task) and a discrete response (classification task). It properly deals with autocorrelation in data and provides a multi-level insight into the spatial autocorrelation phenomenon. The predictive models adapt to the local properties of the data, providing at the same time spatially smoothed predictions. We evaluate our approach on several real-world problems of spatial regression and spatial classification.

The problem of “network inference” is known to be a challenging task. In this dissertation, we propose a data mining method called NCLUS that explicitly considers autocorrelation when building predictive models from network data. The algorithm is based on the concept of PCTs that can be used for clustering, prediction and multi-target prediction, including multi-target regression and multi-target classification. We evaluate our approach on several real-world problems of network regression, coming from the areas of social and spatial networks. Empirical results show that our algorithm performs better than PCTs learned by completely disregarding network information, better than CLUS* (which is tailored for spatial data but does not take autocorrelation into account), and better than a variety of other existing approaches.

We also propose a data mining method called NHMC for (Network) Hierarchical Multi-label Classification. This has been motivated by the recent development of several machine learning algorithms for gene function prediction that work under the assumption that instances may belong to multiple classes and that classes are organized into a hierarchy. Besides relationships among classes, it is also possible to identify relationships among examples. Although such relationships have been identified and extensively studied in the literature, in particular as defined by protein-to-protein interaction (PPI) networks, they have not received much attention in hierarchical and multi-class gene function prediction. Their use introduces the autocorrelation phenomenon and violates the i.i.d. assumption adopted by most machine learning algorithms. Besides improving the predictive capabilities of learned models, NHMC is helpful in obtaining predictions consistent with the network structure and consistently combining two information sources (hierarchical collections of functional class definitions and PPI networks). We compare different PPI networks (DIP, VM and MIPS for yeast data) and their influence on the predictive capability of the models. Empirical evidence shows that explicitly taking network autocorrelation into account can increase the predictive capability of the models, especially when the PPI networks are dense. NHMC outperforms CLUS-HMC (which disregards the network) for GO annotations, since these are more coherent with the PPI networks.



1 Introduction

In this introductory chapter, we first place the dissertation within the broader context of its research area. We then motivate the research performed within the scope of the dissertation. The major contributions of the thesis to science are described next. We conclude this chapter by giving an outline of the structure of the remainder of the thesis.

1.1 Outline

The research presented in this dissertation is placed in the area of artificial intelligence (Russell and Norvig, 2003), and more specifically in the area of machine learning. Machine learning is concerned with the design and the development of algorithms that allow computers to evolve behaviors based on empirical data, i.e., it studies computer programs that automatically improve with experience (Mitchell, 1997). A major focus of machine learning research is to extract information from data automatically by computational and statistical methods and make intelligent decisions based on the data. However, the difficulty lies in the fact that the set of all possible behaviors, given all possible inputs, is too large to be covered by the set of observed examples.

In general, there are two types of learning: inductive and deductive. Inductive machine learning (Bratko, 2000) is a very significant field of research in machine learning, where new knowledge is extracted out of data that describes experience and is given in the form of learning examples (instances). In contrast, deductive learning (Langley, 1996) explains a given set of rules by using specific information from the data.

Depending on the feedback the learner gets during the learning process, learning can be classified as supervised or unsupervised. Supervised learning is a machine learning technique for learning a function from a set of data. Supervised inductive machine learning, also called predictive modeling, assumes that each learning example includes some target property, and the goal is to learn a model that accurately predicts this property. On the other hand, unsupervised inductive machine learning, also called descriptive modeling, assumes no such target property to be predicted. Examples of machine learning methods for predictive modeling include decision trees, decision rules and support vector machines. In contrast, examples of machine learning methods for descriptive modeling include clustering, association rule modeling and principal-component analysis (Bishop, 2007).

In general, predictive and descriptive modeling are considered as different machine learning tasks and are usually treated separately. However, predictive modeling can be seen as a special case of clustering (Blockeel, 1998). In this case, the goal of predictive modeling is to identify clusters that are compact in the target space (i.e., group the instances with similar values of the target variable). The goal of descriptive modeling, on the other hand, is to identify clusters compact in the descriptive space (i.e., group the instances with similar values of the descriptive variables).

Predictive modeling methods are used for predicting an output (i.e., target property or target attribute) for an example. Typically, the output can be either a discrete variable (classification) or a continuous variable (regression). However, there are many real-life problems, such as text categorization, gene function prediction, image annotation, etc., where the input and/or the output are structured. Besides the typical classification and regression tasks, we also consider the latter, namely, predictive modeling tasks with structured outputs.

Predictive clustering (Blockeel, 1998) combines elements from both prediction and clustering. As in clustering, clusters of examples that are similar to each other are identified, but a predictive model is associated with each cluster. New instances are assigned to clusters based on cluster descriptions. The associated predictive models provide predictions for the target property. The benefit of using predictive clustering methods, as in conceptual clustering (Michalski and Stepp, 2003), is that besides the clusters themselves, they also provide symbolic descriptions of the constructed clusters. However, in contrast to conceptual clustering, predictive clustering is a form of supervised learning.
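The idea can be illustrated with a minimal sketch, assuming a single numeric descriptive attribute and a single numeric target. It learns two clusters, each with a symbolic description (a threshold test on the attribute) and a prototype (the mean target value) that serves as its predictive model. The names `learn_two_clusters` and `predict` and the toy data are hypothetical and are not part of the CLUS system.

```python
def variance(ys):
    """Population variance of a non-empty list of numbers."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def learn_two_clusters(xs, ys):
    """Pick the threshold on x whose two clusters are most compact in the target space.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for t in sorted(set(xs))[:-1]:                      # candidate thresholds
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        # Size-weighted within-cluster variance of the target
        cost = len(left) * variance(left) + len(right) * variance(right)
        if best is None or cost < best[0]:
            best = (cost, t, sum(left) / len(left), sum(right) / len(right))
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean

def predict(model, x):
    """Assign x to a cluster via its description and return that cluster's prototype."""
    threshold, left_mean, right_mean = model
    return left_mean if x <= threshold else right_mean

xs = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]         # descriptive attribute
ys = [10.0, 11.0, 12.0, 30.0, 31.0, 32.0]   # target attribute
model = learn_two_clusters(xs, ys)
```

Here the learned description is the test `x <= 3.0`, and a new instance such as `x = 2.5` receives the prototype of the cluster it falls into.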

Predictive clustering trees (PCTs) are tree-structured models that generalize decision trees. Key properties of PCTs are that i) they can be used to predict many or all attributes of an example at once (multi-target), ii) they can be applied to a wide range of prediction tasks (classification and regression) and iii) they can work with examples represented by means of a complex representation (Džeroski et al., 2007), which is achieved by plugging in a suitable distance metric for the task at hand. PCTs were first implemented in the context of first-order logical decision trees, in the system TILDE (Blockeel, 1998), where relational descriptions of the examples are used. The best-known implementation of PCTs, however, is the one that uses attribute-value descriptions of the examples and is implemented in the predictive clustering framework of the CLUS system (Blockeel and Struyf, 2002). The CLUS system is available for download at http://sourceforge.net/projects/clus/.

Here, we extend the predictive clustering framework to work in the context of autocorrelated data. For such data, the independence assumption, which typically underlies machine learning methods and multivariate statistics, is no longer valid. Namely, the autocorrelation phenomenon directly violates the assumption that the data instances are drawn independently of each other from an identical distribution (i.i.d.). At the same time, it offers a unique opportunity to improve the performance of predictive models that take it into account.

Autocorrelation is very common in nature and has been investigated in different fields, from statistics and time-series analysis, to signal-processing and music recordings. Here we acknowledge the existence of four different types of autocorrelation: spatial, temporal, spatio-temporal and network (relational) autocorrelation, describing the existing autocorrelation measures and the data analysis methods that consider them. However, in the development of the proposed algorithms, we focus on spatial autocorrelation and network autocorrelation. In addition, we also deal with the complex case of predicting structured targets (outputs), where network autocorrelation is considered.
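As an illustration of how spatial autocorrelation can be quantified, the sketch below computes global Moran's I, one widely used measure. The function name `morans_i`, the toy chain of four locations and the binary neighbor weights are assumptions made for this example; they are not taken from the dissertation.

```python
def morans_i(values, weights):
    """Global Moran's I: (n / W) * sum_ij w_ij z_i z_j / sum_i z_i^2,
    where z_i are deviations from the mean and W is the sum of all weights."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    w_total = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in z)
    return (n / w_total) * num / den

# Four locations on a line; each location neighbors the adjacent ones.
chain = [[0, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 0]]

smooth = morans_i([1, 2, 3, 4], chain)        # similar values are adjacent
alternating = morans_i([1, 4, 1, 4], chain)   # dissimilar values are adjacent
```

Positive values indicate that neighboring locations carry similar values (here `smooth` is 1/3), while negative values indicate that neighbors carry dissimilar values (here `alternating` is -1).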

In the PCT framework (Blockeel, 1998), a tree is viewed as a hierarchy of clusters: the top node contains all the data, which is recursively partitioned into smaller clusters while moving down the tree. This structure allows us to estimate and exploit the effect of autocorrelation in different ways at different nodes of the tree. In this way, we are able to deal with non-stationary autocorrelation, i.e., autocorrelation whose effects may change over the space or network structure.

PCTs are learned by extending the heuristic functions used in tree induction to include spatial/network autocorrelation. In this way, we obtain predictive models that are able to deal with autocorrelated data. More specifically, when evaluating candidate splits for adding a new node to the tree, we maximize not only the variance reduction, which minimizes the intra-cluster distance in the class labels associated with the examples, but also cluster homogeneity in terms of autocorrelation. This results in improved predictive performance of the obtained models and in smoother predictions.
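A split heuristic of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the heuristic defined in the dissertation: the linear weighting parameter `alpha`, the use of Moran's I as the autocorrelation term, the size-weighted averaging over the two child clusters and all function names are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def morans_i(ys, weights, idx):
    """Moran's I restricted to the examples in idx (0.0 where it is undefined)."""
    m = mean([ys[i] for i in idx])
    z = {i: ys[i] - m for i in idx}
    w_total = sum(weights[i][j] for i in idx for j in idx)
    den = sum(d * d for d in z.values())
    if w_total == 0 or den == 0:
        return 0.0
    num = sum(weights[i][j] * z[i] * z[j] for i in idx for j in idx)
    return (len(idx) / w_total) * num / den

def split_score(ys, weights, left, right, alpha=0.5):
    """Assumed heuristic: alpha * relative variance reduction
    + (1 - alpha) * size-weighted Moran's I of the two child clusters."""
    n = len(left) + len(right)
    parent_var = variance([ys[i] for i in left + right])
    children_var = sum(len(c) / n * variance([ys[i] for i in c])
                       for c in (left, right))
    var_reduction = (parent_var - children_var) / parent_var if parent_var > 0 else 0.0
    autocorr = sum(len(c) / n * morans_i(ys, weights, c) for c in (left, right))
    return alpha * var_reduction + (1 - alpha) * autocorr

# Eight locations on a line; neighbors are adjacent positions.
ys = [1, 2, 3, 4, 10, 11, 12, 13]
weights = [[1 if abs(i - j) == 1 else 0 for j in range(8)] for i in range(8)]

contiguous = split_score(ys, weights, [0, 1, 2, 3], [4, 5, 6, 7])
interleaved = split_score(ys, weights, [0, 2, 4, 6], [1, 3, 5, 7])
```

On this toy data, the spatially contiguous split scores higher than the interleaved one, because it both reduces the target variance and keeps neighboring locations with similar values in the same cluster.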
