
A MODULAR ONTOLOGY

OF DATA MINING

Panče Panov

Doctoral Dissertation

Jožef Stefan International Postgraduate School

Ljubljana, Slovenia, July 2012

Evaluation Board:

Prof. Dr. Nada Lavrač, Chair, Jožef Stefan Institute, Ljubljana, Slovenia

Dr. Larisa Soldatova, Member, Brunel University, London, United Kingdom

Prof. Dr. Dunja Mladenić, Member, Jožef Stefan Institute, Ljubljana, Slovenia


MODULARNA ONTOLOGIJA

PODATKOVNEGA RUDARJENJA

Doktorska disertacija

Supervisor: Prof. Dr. Sašo Džeroski

Contents

Abstract
Povzetek (abstract in Slovenian)
1 Introduction
  1.1 Background
    1.1.1 Challenges in the domain of data mining
    1.1.2 Formalization of scientific investigations
    1.1.3 Applied ontology
  1.2 Motivation
  1.3 Goals
  1.4 Scientific contributions
  1.5 Thesis structure
2 Ontology
  2.1 What is an ontology?
    2.1.1 Definitions of ontology in computer science
    2.1.2 Ontology as a representational artifact
    2.1.3 Roles of an ontology
...

Abstract

The domain of data mining (DM) deals with analyzing different types of data. The data typically used in data mining are in the format of a single table, with primitive datatypes as attributes. However, structured (complex) data, such as graphs, sequences, networks, text, image, multimedia and relational data, are receiving an increasing amount of interest in data mining. A major challenge is to treat and represent the mining of different types of structured data in a uniform fashion. A theoretical framework that unifies different data mining tasks on different types of data can help to formalize the knowledge about the domain and provide a base for future research, unification and standardization. Automating and supporting the overall Knowledge Discovery in Databases (KDD) process is another important challenge in the domain. A formalization of the domain of data mining addresses these challenges: it can directly support the development of a general framework for data mining, support the representation of the process of mining structured data, and allow the representation of the complete process of knowledge discovery.

In this thesis, we propose OntoDM, a reference modular ontology for the domain of data mining, directly motivated by the need for formalization of the data mining domain. The OntoDM ontology is designed and implemented by following ontology best practices and design principles. Its distinguishing feature is that it uses the Basic Formal Ontology (BFO) as an upper-level ontology and a template, uses a set of formally defined relations from the Relational Ontology (RO) and other state-of-the-art ontologies, and reuses classes and relations from the Ontology of Biomedical Investigations (OBI), the Information Artifact Ontology (IAO), and the Software Ontology (SWO). This ensures compatibility and connections with other ontologies and allows cross-domain reasoning. The OntoDM ontology is composed of three modules covering different aspects of data mining: OntoDT, which supports the representation of knowledge about datatypes and is based on an accepted ISO standard for datatypes in computer systems; OntoDM-core, which formalizes the key data mining entities for representing the mining of structured data in the context of a general framework for data mining; and OntoDM-KDD, which formalizes the knowledge discovery process based on the Cross Industry Standard Process for Data Mining (CRISP-DM) process model.

The OntoDT module provides a representation of the datatype entity and defines a taxonomy of datatype-characterizing operations and a taxonomy of datatype qualities. Furthermore, it defines a datatype taxonomy comprising classes and instances of primitive datatypes, generated datatypes (non-aggregate and aggregate datatypes), subtypes, and defined datatypes. With this structure, the module provides a generic mechanism for representing arbitrarily complex datatypes.
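To make the idea of building arbitrarily complex datatypes from simpler ones concrete, the following is a minimal Python sketch. It is not the OntoDT ontology itself, which is an ontology rather than program code; the class names `Datatype`, `PrimitiveDatatype` and `AggregateDatatype` and the helpers `describe` and `depth` are hypothetical names chosen for illustration, and only two of the taxonomy branches mentioned above (primitive and aggregate datatypes) are modelled.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Datatype:
    """Base class for all datatypes in this sketch (hypothetical name)."""
    name: str


@dataclass(frozen=True)
class PrimitiveDatatype(Datatype):
    """A datatype defined without reference to other datatypes."""


@dataclass(frozen=True)
class AggregateDatatype(Datatype):
    """A generated datatype whose values are built from component datatypes."""
    components: Tuple[Datatype, ...] = ()


def describe(dt: Datatype) -> str:
    """Render a datatype as a nested expression, e.g. sequence(record(...))."""
    if isinstance(dt, AggregateDatatype):
        inner = ", ".join(describe(c) for c in dt.components)
        return f"{dt.name}({inner})"
    return dt.name


def depth(dt: Datatype) -> int:
    """Nesting depth: 0 for primitives, 1 + deepest component otherwise."""
    if isinstance(dt, AggregateDatatype) and dt.components:
        return 1 + max(depth(c) for c in dt.components)
    return 0


# Illustrative instances (assumed for this example, not taken from OntoDT).
real = PrimitiveDatatype("real")
boolean = PrimitiveDatatype("boolean")
record = AggregateDatatype("record", (real, boolean))
sequence_of_records = AggregateDatatype("sequence", (record,))

print(describe(sequence_of_records))
print(depth(sequence_of_records))
```

Because an aggregate may contain other aggregates as components, the same two classes suffice to describe nested structures of any depth, which is the sense in which such a mechanism is generic.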

The OntoDM-core module formalizes the key data mining entities needed for the representation of mining structured data in the context of a general framework for data mining. These include the entities dataset, data mining task, generalization, data mining algorithm, and others. More specifically, it provides a representation of datasets and a taxonomy of datasets based on the type of data. Next, it provides a representation of data mining tasks and proposes a taxonomy of data mining tasks, predictive modeling tasks and hierarchical classification tasks. Furthermore, it provides a representation of generalizations and proposes a taxonomy of generalizations and predictive models based on the types of data and the generalization language. Moreover, it provides a representation of data mining algorithms, proposes a taxonomy of data mining algorithms, predictive modeling algorithms, and hierarchical classification algorithms, and generalizes the mechanism for representing data mining algorithms to represent general algorithms in computer science. In addition, the OntoDM-core module provides a representation of constraints and constraint-based data mining tasks and proposes a taxonomy thereof. Finally, the module provides a representation of data mining scenarios that includes data mining scenarios as a specification, data mining workflows, and the process of executing a data mining workflow.





The OntoDM-KDD module supports the representation of data mining investigations. It provides a representation of data mining investigations by directly extending classes from the OBI and IAO ontologies. Furthermore, it models each of the phases in a data mining investigation (application understanding, data understanding, data preparation, modeling, DM process evaluation, and deployment), together with their inputs and outputs.

The OntoDM ontology and its three modules (OntoDT, OntoDM-core, and OntoDM-KDD) were evaluated in order to assess their quality. The evaluation was performed by assessing the ontology against a set of design principles and best practices, and by assessing whether the competency questions posed in the design phase were implemented in the language of the ontology. In addition, we provided a domain coverage assessment by comparing the OntoDM data mining tasks taxonomy with a data mining topic ontology constructed in a semi-automatic fashion from abstracts of articles from data mining conferences and journals.

The developed ontology supports a large variety of applications. We demonstrate its use by describing six use cases. The OntoDM ontology is used for the annotation of data mining algorithms; for the representation of data mining scenarios; for the annotation of data mining investigations; in cross-domain applications, to support the ontology-based representation of QSAR modeling for drug discovery; as a mid-level ontology by the Expose ontology; and for the annotation of articles containing data mining terms in combination with text mining tools.

The OntoDM ontology introduces several novelties that distinguish it from related ontologies. It allows the mining of structured data and the general process of data mining to be represented in a principled way. It is also based on a theoretical ontological framework, and can therefore be connected to other domain ontologies to support cross-domain applications. Finally, the OntoDM ontology is the first ontology that supports the representation of the complete process of knowledge discovery.

In future developments of the OntoDM ontology, we plan to focus on several aspects. First, we would like to align and map our ontology to other upper-level ontologies. Second, we plan to extend the established ontological framework to represent components of data mining algorithms, such as distance functions and kernel functions. Next, we plan to populate the ontology downward with instances. Furthermore, we plan to extend the representational framework to represent experiments for mining structured data in the context of experiment databases. Finally, we plan to include more contributors from the domain of data mining in the development of OntoDM and to apply the OntoDM design principles to the development of ontologies for other areas of computer science.


Abbreviations

BFO = Basic Formal Ontology
CRISP-DM = Cross Industry Standard Process for Data Mining
CheTA = Chemistry using Text Annotations
CBDM = Constraint-based Data Mining
DM = Data Mining
DMO = Data Mining Ontologies
DMOP = Data Mining Optimization
DAG = Directed Acyclic Graph
DOLCE = Descriptive Ontology for Linguistic and Cognitive Engineering
DDI = Drug Discovery Investigations
EDM = Electric Discharge Machining
EXACT = Ontology of Experiment Actions
EXPO = Ontology of Scientific Experiments
FDL = Full Depth Labeling
GFO = General Formal Ontology
GDC = Generically Dependent Continuant
HC = Hierarchical Classification
HMC = Hierarchical Multi-label Classification
ICE = Information Content Entity
IAO = Information Artifact Ontology
ISO = International Organization for Standardization
KD = Knowledge Discovery
KDD = Knowledge Discovery in Databases
LABORS = Ontology of Automated Experimentation
MIREOT = Minimum Information to Reference an External Ontology Term
MPL = Multiple Paths Labeling
OBI = Ontology of Biomedical Investigations
OBO = Open Biomedical Ontologies
OSCAR = Open Source Chemistry Routines
PDL = Partial Depth Labeling
QSAR = Quantitative structure-activity relationship
RO = Relational Ontology
RDF = Resource Description Framework
SWO = Software Ontology
SVM = Support Vector Machines
SPL = Single Path Labeling
SDC = Specifically Dependent Continuant
SUMO = Suggested Upper Merged Ontology
URI = Universal Resource Identifier
YAMATO = Yet Another More Advanced Top-level Ontology
W3C = The World Wide Web Consortium

1 Introduction

In this chapter, we first present the background of the work presented in this thesis (Section 1.1). Next, we state the motivation for this work (Section 1.2) and present its goals (Section 1.3). Furthermore, we list the original scientific contributions of this thesis (Section 1.4). We conclude the chapter with an overview of the thesis structure (Section 1.5).

1.1 Background

1.1.1 Challenges in the domain of data mining

Fayyad et al. (1996) define knowledge discovery in databases (KDD) as a “non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”. According to this definition, data mining (DM) is the central step in the KDD process, concerned with the application of computational techniques (i.e., data mining algorithms implemented as computer programs) to find patterns in the data¹.

Data mining is concerned with analyzing different types of data. Besides data in the format of a single table (with primitive datatypes as attributes), which are most commonly used in data mining (Witten and Frank, 2005; Hand et al., 2001; Kaufman and Rousseeuw, 1990; Kononenko and Kukar, 2007; Han, 2005), structured (complex) data are receiving an increasing amount of interest (Bakir et al., 2007). These include graphs, sequences, networks, text, image, multimedia and relational data. Also, many data mining algorithms are designed to solve data mining tasks for specific types of data, most frequently for data represented in a single table. Examples of such tasks are classification, regression, and clustering. In essence, these tasks can be defined on an arbitrary datatype. A theoretical framework that unifies different data mining tasks on different types of data would help to formalize the knowledge about the domain and provide a base for future research, unification and standardization. A major challenge is to treat and represent the mining of different types of structured data in a uniform fashion. This was identified by Yang and Wu (2006) as one of the challenging problems in the domain. In addition, recent surveys of research challenges for data mining by Kriegel et al. (2007) and Dietterich et al. (2008) list the mining of complex data, the use of domain knowledge, and the support for complex knowledge discovery processes among the foremost open issues that have the best chance of providing the tools for building integrated artificial intelligence (AI) systems.

Džeroski (2007) addresses the ambitious task of formulating a general framework for data mining. In the paper, the author discusses the requirements that such a framework should fulfill. These include the elegant handling of different types of data, different data mining tasks, and different types of patterns and models. Furthermore, Džeroski discusses the design and implementation of data mining algorithms and their composition into scenarios for practical applications. The author develops his framework by first laying out some basic concepts, such as structured data, patterns and models, and continues with data mining tasks and basic components of data mining algorithms, such as features, distances, kernels and refinement operators. Finally, these concepts are used to formulate constraint-based data mining tasks and to design generic data mining algorithms.

¹ The patterns in this definition denote any kind of knowledge that is extracted in the process of data mining.
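The notion of a generic algorithm built from interchangeable components can be illustrated with a small Python sketch. This is not the framework of Džeroski (2007); it is only an assumed example in which one nearest-neighbour classifier is applied to two different datatypes by swapping the distance component. The function names `nearest_neighbour_predict`, `euclidean` and `jaccard_distance`, and the toy data, are hypothetical.

```python
from collections import Counter
from typing import Callable, FrozenSet, List, Sequence, Tuple, TypeVar

T = TypeVar("T")  # any datatype: numeric tuple, set, sequence, ...


def nearest_neighbour_predict(
    train: List[Tuple[T, str]],
    query: T,
    distance: Callable[[T, T], float],
    k: int = 1,
) -> str:
    """Predict a label by majority vote among the k closest training examples.

    The algorithm never inspects the examples directly; it only calls the
    supplied distance function, so it works for any datatype that has one.
    """
    ranked = sorted(train, key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Distance component for fixed-length numeric tuples (table rows)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def jaccard_distance(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Distance component for sets: one minus the Jaccard similarity."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Same algorithm, tabular data (toy examples assumed for illustration).
rows = [((0.0, 0.0), "neg"), ((1.0, 1.0), "pos")]
row_label = nearest_neighbour_predict(rows, (0.9, 0.8), euclidean)

# Same algorithm, set-valued data.
sets = [(frozenset({"a", "b"}), "x"), (frozenset({"c", "d"}), "y")]
set_label = nearest_neighbour_predict(sets, frozenset({"a", "b", "c"}), jaccard_distance)

print(row_label, set_label)
```

Keeping the distance as a parameter is the design choice that matters here: supporting a new type of structured data requires only a new distance function, while the learning procedure stays unchanged.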


