«A MODULAR ONTOLOGY OF DATA MINING Panče Panov Doctoral Dissertation Jožef Stefan International Postgraduate School Ljubljana, Slovenia, July ...»
A MODULAR ONTOLOGY
OF DATA MINING
Jožef Stefan International Postgraduate School
Ljubljana, Slovenia, July 2012
Prof. Dr. Nada Lavrač, Chair, Jožef Stefan Institute, Ljubljana, Slovenia
Dr. Larisa Soldatova, Member, Brunel University, London, United Kingdom
Prof. Dr. Dunja Mladenić, Member, Jožef Stefan Institute, Ljubljana, Slovenia
Panče Panov
Doctoral Dissertation
Supervisor: Prof. Dr. Sašo Džeroski
Contents
Povzetek (Abstract in Slovene)
1 Introduction
1.1 Background
1.1.1 Challenges in the domain of data mining
1.1.2 Formalization of scientific investigations
1.1.3 Applied ontology
1.2 Motivation
1.3 Goals
1.4 Scientific contributions
1.5 Thesis structure
2 Ontology
2.1 What is an ontology?
2.1.1 Definitions of ontology in computer science
2.1.2 Ontology as a representational artifact
2.1.3 Roles of an ontology
Abstract
The domain of data mining (DM) deals with analyzing different types of data. The data typically used in data mining is in the format of a single table, with primitive datatypes as attributes. However, structured (complex) data, such as graphs, sequences, networks, text, image, multimedia and relational data, are receiving an increasing amount of interest in data mining. A major challenge is to treat and represent the mining of different types of structured data in a uniform fashion. A theoretical framework that unifies different data mining tasks on different types of data can help to formalize the knowledge about the domain and provide a base for future research, unification and standardization. In addition, automation and overall support of the Knowledge Discovery in Databases (KDD) process is an important challenge in the domain of data mining. A formalization of the domain of data mining is a solution that addresses these challenges. It can directly support the development of a general framework for data mining, support the representation of the process of mining structured data, and allow the representation of the complete process of knowledge discovery.
In this thesis, we propose a reference modular ontology for the domain of data mining, OntoDM, directly motivated by the need for formalization of the data mining domain. The OntoDM ontology is designed and implemented by following ontology best practices and design principles. Its distinguishing feature is that it uses Basic Formal Ontology (BFO) as an upper-level ontology and a template, uses a set of formally defined relations from Relational Ontology (RO) and other state-of-the-art ontologies, and reuses classes and relations from the Ontology of Biomedical Investigations (OBI), the Information Artifact Ontology (IAO), and the Software Ontology (SWO). This ensures compatibility and connections with other ontologies and allows cross-domain reasoning capabilities. The OntoDM ontology is composed of three modules covering different aspects of data mining: OntoDT, which supports the representation of knowledge about datatypes and is based on an accepted ISO standard for datatypes in computer systems; OntoDM-core, which formalizes the key data mining entities for representing the mining of structured data in the context of a general framework for data mining; and OntoDM-KDD, which formalizes the knowledge discovery process based on the Cross Industry Standard Process for Data Mining (CRISP-DM) process model.
The OntoDT module provides a representation of the datatype entity, and defines a taxonomy of datatype characterizing operations and a taxonomy of datatype qualities. Furthermore, it defines a datatype taxonomy comprising classes and instances of primitive datatypes, generated datatypes (non-aggregate and aggregate datatypes), subtypes, and defined datatypes.
With this structure, the module provides a generic mechanism for representing arbitrarily complex datatypes.
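The compositional idea behind such a mechanism can be illustrated informally. The following is a minimal sketch in Python, not part of OntoDT itself: the class names `Primitive`, `Record`, and `Sequence` and the helper `describe` are hypothetical and chosen only for illustration. It shows how a complex datatype can be obtained by nesting aggregate datatypes over primitive ones.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Datatype:
    """Base class for all datatypes in this sketch (hypothetical)."""


@dataclass(frozen=True)
class Primitive(Datatype):
    """A primitive datatype, identified only by its name."""
    name: str  # e.g. "real", "boolean"


@dataclass(frozen=True)
class Record(Datatype):
    """Aggregate datatype: named fields, each with its own datatype."""
    fields: Tuple[Tuple[str, Datatype], ...]


@dataclass(frozen=True)
class Sequence(Datatype):
    """Aggregate datatype: ordered collection of one element datatype."""
    element: Datatype


def describe(dt: Datatype) -> str:
    """Render a datatype as a nested, human-readable expression."""
    if isinstance(dt, Primitive):
        return dt.name
    if isinstance(dt, Sequence):
        return f"sequence({describe(dt.element)})"
    if isinstance(dt, Record):
        inner = ", ".join(f"{name}: {describe(t)}" for name, t in dt.fields)
        return f"record({inner})"
    raise TypeError(f"unknown datatype: {dt!r}")


# A structured datatype: a record holding a sequence of reals and a boolean.
real = Primitive("real")
boolean = Primitive("boolean")
example = Record((("measurements", Sequence(real)), ("label", boolean)))
print(describe(example))
```

Because every aggregate takes other datatypes as components, nesting can continue to any depth, which is the sense in which the representation is "arbitrarily complex" in this sketch.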
The OntoDM-core module formalizes the key data mining entities needed for the representation of mining structured data in the context of a general framework for data mining.
These include the entities dataset, data mining task, generalization, data mining algorithm, and others. More specifically, it provides a representation of datasets, and a taxonomy of datasets based on the type of data. Next, it provides a representation of data mining tasks, and proposes a taxonomy of data mining tasks, predictive modeling tasks and hierarchical classification tasks. Furthermore, it provides a representation for generalizations, and proposes a taxonomy of generalizations and predictive models based on the types of data and generalization language. Moreover, it provides a representation of data mining algorithms, proposes a taxonomy of data mining algorithms, predictive modeling algorithms, and hierarchical classification algorithms, and generalizes the mechanism for representing data mining algorithms to represent general algorithms in computer science. In addition, the OntoDM-core module provides a representation of constraints and constraint-based data mining tasks and proposes a taxonomy thereof. Finally, the module provides a representation of data mining scenarios that includes data mining scenarios as a specification, data mining workflows, and the process of executing a data mining workflow.
The OntoDM-KDD module supports the representation of data mining investigations.
It provides a representation of data mining investigation by directly extending classes from the OBI and IAO ontologies. Furthermore, it models each of the phases in a data mining investigation (such as application understanding, data understanding, data preparation, modeling, DM process evaluation, and deployment), and their inputs and outputs.
The OntoDM ontology and its three modules (OntoDT, OntoDM-core, and OntoDM-KDD) were evaluated in order to assess their quality. The evaluation was performed by assessing the ontology against a set of design principles and best practices, and assessing whether the competency questions posed in the design phase were implemented in the language of the ontology. In addition, we provided a domain coverage assessment by comparing the OntoDM data mining tasks taxonomy with the data mining topic ontology constructed in a semi-automatic fashion from abstracts of articles from data mining conferences and journals.
The developed ontology supports a large variety of applications. We demonstrate the use and the application of the ontology by describing six use cases. The OntoDM ontology is used for the annotation of data mining algorithms; for the representation of data mining scenarios; for the annotation of data mining investigations; in cross-domain applications to support ontology-based representation of QSAR modeling for drug discovery; as a mid-level ontology by the Expose ontology; and for the annotation of articles containing data mining terms in combination with text mining tools.
The novelties that the OntoDM ontology introduces, and that distinguish it from other related ontologies, are that it allows the representation of the mining of structured data and of the general process of data mining in a principled way, and that it is based on a theoretical ontological framework and can therefore be connected to other domain ontologies to support cross-domain applications. The OntoDM ontology is also the first ontology that supports the representation of the complete process of knowledge discovery.
In future developments of the OntoDM ontology, we plan to focus on several aspects. First, we would like to align and map our ontology to other upper-level ontologies.
Second, we plan to extend the established ontological framework to represent entities about components of data mining algorithms, such as distance functions and kernel functions.
Next, we plan to populate the ontology downward with instances. Furthermore, we plan to extend the representational framework for representing experiments for mining structured data in the context of experiment databases. Finally, we plan to include more contributors from the domain of data mining into the development of OntoDM and apply the OntoDM design principles to the development of ontologies for other areas of computer science.
Abbreviations
BFO = Basic Formal Ontology
CRISP-DM = Cross Industry Standard Process for Data Mining
CheTA = Chemistry using Text Annotations
CBDM = Constraint-based Data Mining
DM = Data Mining
DMO = Data Mining Ontologies
DMOP = Data Mining Optimization
DAG = Directed Acyclic Graph
DOLCE = Descriptive Ontology for Linguistic and Cognitive Engineering
DDI = Drug Discovery Investigations
EDM = Electric Discharge Machining
EXACT = Ontology of Experiment Actions
EXPO = Ontology of Scientific Experiments
FDL = Full Depth Labeling
GFO = General Formal Ontology
GDC = Generically Dependent Continuant
HC = Hierarchical Classification
HMC = Hierarchical Multi-label Classification
ICE = Information Content Entity
IAO = Information Artifact Ontology
ISO = International Organization for Standardization
KD = Knowledge Discovery
KDD = Knowledge Discovery in Databases
LABORS = Ontology of Automated Experimentation
MIREOT = Minimum Information to Reference an External Ontology Term
MPL = Multiple Paths Labeling
OBI = Ontology of Biomedical Investigations
OBO = Open Biomedical Ontologies
OSCAR = Open Source Chemistry Routines
PDL = Partial Depth Labeling
QSAR = Quantitative Structure-Activity Relationship
RO = Relational Ontology
RDF = Resource Description Framework
SWO = Software Ontology
SVM = Support Vector Machines
SPL = Single Path Labeling
SDC = Specifically Dependent Continuant
SUMO = Suggested Upper Merged Ontology
URI = Uniform Resource Identifier
YAMATO = Yet Another More Advanced Top-level Ontology
W3C = The World Wide Web Consortium
In this chapter, we first present the background of the work presented in this thesis (Section 1.1). Next, we state the motivation for this work (Section 1.2). In addition, we present the goals of the work presented in this thesis (Section 1.3). Furthermore, we list the original scientific contributions of this thesis (Section 1.4). We conclude this chapter with an overview of the thesis structure (Section 1.5).
1.1 Background
1.1.1 Challenges in the domain of data mining
Fayyad et al. (1996) define knowledge discovery in databases (KDD) as a “non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”. According to this definition, data mining (DM) is the central step in the KDD process concerned with the application of computational techniques (i.e., data mining algorithms implemented as computer programs) to find patterns in the data.¹
Data mining is concerned with analyzing different types of data. Besides data in the format of a single table (with primitive datatypes as attributes), most commonly used in data mining (Witten and Frank, 2005; Hand et al., 2001; Kaufman and Rousseeuw, 1990; Kononenko and Kukar, 2007; Han, 2005), structured (complex) data are receiving an increasing amount of interest (Bakir et al., 2007). These include graphs, sequences, networks, text, image, multimedia and relational data. Also, many data mining algorithms are designed to solve data mining tasks for specific types of data, most frequently defined for data represented in a single table. Examples of such tasks are classification, regression, and clustering. In essence, these tasks can be defined on an arbitrary datatype. A theoretical framework that unifies different data mining tasks on different types of data would help to formalize the knowledge about the domain and provide a base for future research, unification and standardization. A major challenge is to treat and represent the mining of different types of structured data in a uniform fashion. This was identified by Yang and Wu (2006) as one of the challenging problems in the domain. In addition, recent surveys of research challenges for data mining by Kriegel et al. (2007) and Dietterich et al. (2008) list the mining of complex data, the use of domain knowledge, and the support for complex knowledge discovery processes among the topmost open issues that have the best chance of providing the tools for building integrated artificial intelligence (AI) systems.
Džeroski (2007) addresses the ambitious task of formulating a general framework for data mining. In the paper, the author discusses the requirements that such a framework should fulfill. These include the elegant handling of different types of data, different data mining tasks, and different types of patterns and models. Furthermore, Džeroski discusses the design and implementation of data mining algorithms and their composition into scenarios for practical applications. In addition, the author develops his framework by laying out some basic concepts, such as structured data, patterns and models, and continues with data mining tasks and basic components of data mining algorithms, such as features, distances, kernels and refinement operators. Finally, these concepts are used to formulate constraint-based data mining tasks and the design of generic data mining algorithms.
¹ The patterns in this definition denote any kind of knowledge that is extracted in the process of data
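The role of a component such as a distance in a generic algorithm can be illustrated with a small example. The following is a minimal sketch in Python, not taken from Džeroski (2007): the function names `nearest_neighbor`, `euclidean`, and `hamming` are hypothetical, and the nearest-neighbor rule is used only because it is the simplest algorithm that depends on the data solely through a distance function.

```python
import math
from typing import Callable, Sequence, Tuple, TypeVar

X = TypeVar("X")  # arbitrary (possibly structured) datatype of the examples
Y = TypeVar("Y")  # datatype of the target


def nearest_neighbor(
    train: Sequence[Tuple[X, Y]],
    distance: Callable[[X, X], float],
    query: X,
) -> Y:
    """Predict the target of the training example closest to `query`."""
    if not train:
        raise ValueError("training set must not be empty")
    _, target = min(train, key=lambda pair: distance(pair[0], query))
    return target


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Distance on tuples of reals (single-table style data)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming(a: str, b: str) -> float:
    """Distance on character sequences: mismatches plus length difference."""
    mismatches = sum(c1 != c2 for c1, c2 in zip(a, b))
    return float(mismatches + abs(len(a) - len(b)))


# The same algorithm, applied to two different datatypes.
points = [((0.0, 0.0), "neg"), ((5.0, 5.0), "pos")]
strings = [("ACGT", "a"), ("TTTT", "b")]
print(nearest_neighbor(points, euclidean, (4.0, 4.5)))
print(nearest_neighbor(strings, hamming, "ACGA"))
```

The algorithm itself never inspects the examples; only the distance does. Swapping the distance is therefore enough to move from tuples of reals to sequences, which is the kind of uniform treatment of data types that a general framework aims for.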