FREE ELECTRONIC LIBRARY - Dissertations, online materials

Pages:   || 2 | 3 | 4 | 5 |   ...   | 18 |


-- [ Page 1 ] --

Correlation-Based Methods for Biological Data



(Master of Technology, NUS)






2007 In long memory of my father and sister II Correlation-Based Methods for Biological Data Cleaning by JUDICE, LIE YONG KOH, M.Tech Dissertation Presented to the Faculty of the School of Computing of the National University of Singapore in Partial Fulfillment of the Requirements for the Degree of


National University of Singapore March 2007 III Acknowledgements I would like to express my gratitude to all those who have helped me complete this PhD thesis. First, I am deeply grateful to my supervisor, Dr. Mong Li Lee, School of Computing, National University of Singapore, for her guidance and teachings. This completion of the PhD thesis will not be possible without her consistent support and patience, as well as her wisdom which has been of utmost value to the project.

I would also like extend my gratitude to my mentor, Associate Prof Wynne Hsu, School of Computing, National University of Singapore, for her guidance and knowledge. I am fortunate to have learned from her, and have been greatly inspired by her wide knowledge and intelligence.

I have furthermore to thank my other mentor, Dr. Vladimir Brusic, University of Queensland for providing biological perspectives to the project. And my appreciation goes to the advisory committee members for beneficial discussions during my Qualifying and Thesis Proposal examinations.

In addition, I wish to extend my appreciation to my colleagues in the Institute for Infocomm Research (I2R) for their assistance, suggestions and friendship during the course of my part-time PhD studies. Special acknowledgement goes to Mr. Wee Tiong Ang and Ms.

Veeramani Anitha, Research Engineer for their helps and to Dr. See Kiong Ng, Manager of Knowledge Discovery Department for his understanding and encouragement.

Most importantly, I will like to thank my family for their love. I will also like to dedicate this thesis to my sister whose passing had driven me to retrospect my goals in life and to my father who died of heart attack and kidney failure in the midst of my study and whom I regretted for not spending enough time with during his last days. And to the one I respect most in life, my mother.

Last but not least, I wish to express my greatest appreciation to my husband, Soon Heng Tan for his continuous support, encouragement and for providing his biological

–  –  –

Data overload combine with widespread use of automated large-scale analysis and mining result in a rapid depreciation of the World’s data quality. Data cleaning is an emerging domain that aims at improving data quality through the detection and elimination of data artifacts. These data artifacts comprise of errors, discrepancies, redundancies, ambiguities, and incompleteness that hamper the efficacy of analysis or data mining.

Despite the importance, data cleaning remains neglected in certain knowledge-driven domains. One such example is Bioinformatics; biological data are often used uncritically without considering the errors or noises contained within, and research on both the “causes” of data artifacts and the corresponding data cleaning remedies are lacking. In this thesis, we conduct the an in-depth study of what constitutes data artifacts in real-world biological databases. To the best of our knowledge, this is the first complete investigation of the data quality factors in biological data.The result of our study indicates that biological data quality problem is by nature multifactorial and requires a number of different data cleaning approaches. While some existing data cleaning methods are directly applicable to certain artifacts, others such as annotation errors and multiple duplicate relations have not been studied. This provides the inspirations for us to devise new data cleaning methods.

Current data cleaning approaches derive observations of data artifacts from the values of independent attributes and records. On the other hand, the correlation patterns between the attributes provide additional information of the relationships embedded within a data set among the entities. In this thesis, we exploit the correlations between data entities to identify data artifacts that existing data cleaning methods fall short of addressing. We propose 3 novel data cleaning methods for detecting outliers and duplicates, and further apply them to realworld biological data as proof-of-concepts.

–  –  –

records. While rarity may be a good measure for class outliers, for attribute outliers, rarity may not equate abnormality. The ODDS (Outlier Detection from Data Subspaces) method utilizes deviating correlation patterns for the identification of common yet abnormal attributes. Experimental validation shows that it can achieve an accuracy of up to 88%.

The ODDS method is further extended to XODDS, an outlier detection method for semi-structured data models such as XML which is rapidly emerging as a new standard for data representation and exchange on the World Wide Web (WWW). In XODDS, we leverage on the hierarchical structure of the XML to provide addition context information enabling knowledge-based data cleaning. Experimental validation shows that the contextual information in XODDS elevates both efficiency and the effectiveness of detecting outliers.

Traditional duplicate detection methods regard duplicate relation as a boolean property. Moreover, different types of duplicates exists, some of which cannot be trivially merged. Our third contribution, the correlation-based duplicate detection method induced rules from associations between attributes in order to identify different types of duplicates.

Correlation-based methods aimed at resolving data cleaning problems are conceptually new. This thesis demonstrates they are effective in addressing some data artifacts that cannot be tackled by existing data cleaning techniques, with evidence of practical applications to real-world biological databases.

–  –  –

Table 1.1: Different records in database representing the same customer

Table 1.2: Customer bank accounts with personal information and monthly transactional averages

Table 2.1: Different types of data artifacts

Table 2.2: Different records from multiple databases representing the same customer.

......... 19 Table 3.1: The disulfide bridges in PDB records 1VNA, 1B3C and corresponding Entrez record GI 494705 and GI 4139618

Table 3.2: Summary of possible biological data cleaning remedies

Table 4.1: World Clock data set containing 4 attribute outliers

Table 4.2: Number of attribute outliers inserted into World-Clock data set

Table 4.3: Performance of ODDS/O-measure at varying number of CA-outliers per tuple.

.. 84 Table 4.4: Description of attributes in UniProt

Table 4.5: Frequencies of GO target attributes identified at projections of degree 5 of UniProt data set

Table 4.6: CA-outliers detected in UniProtKB/TrEMBL using ODDS/Of-measure.

............. 89 Table 4.7: Manual verification of Gene Ontology CA-outliers detected in UniProtKB/TrEMBL

Table 5.1: The 2☓2 contigency table of a target attribute and its correlated neighbourhood.


Table 5.2: Example contingency tables for monotone properties.

Table 5.3.

Properties of attribute outlier metrics

Table 5.4: Attribute subspaces derived in RBank using χ2

Table 5.5: Outliers detected from the UniProt/TrEMBL Gene Ontologies and Keywords annotations

–  –  –

Table 6.1: Multiple types of duplicates that exist in the protein databases

Table 6.2: Similarity scores of Entrez records 1910194A and P45639

Table 6.3: Different types of duplicate pair in training data set

Table 6.4: Examples of duplicate rules induced from CBA

–  –  –

Figure 1.1: Exponential growth of DNA records in GenBank, DDBJ and EMBL

Figure 2.1: Sorted Neighbourhood Method with sliding window of width 6

Figure 3.1: The central dogma of molecular biology.

Figure 3.2: The data warehousing framework of BioWare

Figure 3.3: The 4 levels physical classification of data artifacts in sequence databases.

........ 44 Figure 3.4: The conceptual classification of data artifacts in sequence databases.................. 45 Figure 3.5: Protein sequences recorded at UniProtKB/Swiss-Prot containing 5 to 15 synonyms

Figure 3.6: Undersized sequences in major protein databases

Figure 3.7: Undersized sequences in major nucleotide databases

Figure 3.8: Nucleotide sequence with the flanking vectors at the 3’ and 5’ ends

Figure 3.9: Structure of the eukaryotic gene containing the exons, introns, 5’ untranslated region and 3’ untranslated region

Figure 3.10: The functional descriptors of a UniProtKB/Swiss-Prot sequence map to the comment attributes in Entrez

Figure 3.11: Mis-fielded reference values in a GenBank record

Figure 4.1: Selected attribute combinations of the World Clock dataset and their supports.

.. 70 Figure 4.2: Example of a concept lattice of 4 tuples with 3 attributes F1, F2, and F3............ 77 Figure 4.3: Attribute combinations at projections of degree k with two attribute outliers - b and d

Figure 4.4: Rate-of-change for individual attributes in X1

Figure 4.5: Accuracy of ODDS converges in data subspaces of lower degrees in Mix3.

....... 85 Figure 4.6: Performance of ODDS compared with classifier-based attribute outlier detection..

–  –  –

Figure 5.2: The 4 subspaces in Bank Account XML

Figure 5.3: The XODDS outlier detection framework

Figure 5.5: Performance of XODDS of various metrics using ROC-derived thresholds.

..... 114 Figure 5.6: Performance of XODDS of various outlier metrics using Top-k

Figure 5.7: Performance of XODDS at varying noise levels

Figure 5.8: Performance of XODDS compared to the relational approach

Figure 5.9: Number of aggregate outliers in the account subspace across varying noise.

.... 119 Figure 5.10: Running time of XODDS at varying data size

Figure 5.11: Simplified UniProt XML

Figure 6.1: Extent of replication of scorpion toxin proteins across multiple databases.

....... 126 Figure 6.2: Duplicate detection framework

Figure 6.3: Matching criteria of an Entrez protein record

Figure 6.4: Field labels from each pair of duplicates in training dataset

Figure 6.5: Accuracy of detecting duplicates using different classifiers

Figure 6.6: F-score of detecting different types of duplicates

–  –  –



List of Tables

List of Figures

Chapter 1: Introduction

1.1 Background

1.1.1 Data Explosion, Data Mining, and Data Cleaning

1.1.2 Applications Demanding “Clean Data”

1.1.3 Importance of Data Cleaning in Bioinformatics

1.1.4 Correlation-based Data Cleaning Approaches

1.1.5 Scope of Data Cleaning

1.2 Motivation

1.3 Contribution

1.4 Organisation

Chapter 2: A Survey on Data Cleaning Approaches

2.1 Data Artifacts and Data Cleaning

2.2 Evolution of Data Cleaning Approaches

–  –  –

2.3.1 Duplicate Detection Methods

2.3.2 Outlier Detection Methods

2.3.3 Other Data Cleaning Methods

2.4 Data Cleaning Frameworks and Systems

2.4.1 Knowledge-based Data Cleaning Systems

2.4.2 Declarative Data Cleaning Applications

2.5 From Structured to Semi-structured Data Cleaning

2.5.1 XML Duplicate Detection

–  –  –

2.6 Biological Data Cleaning

2.6.1 BIO-AJAX

2.6.2 Classifier-based Cleaning of Sequences

2.7 Concluding Remarks

Chapter 3: A Classification of Biological Data Artifacts

3.1 Background

3.1.1 Central Dogma of Molecular Biology

3.1.2 Biological Database Systems

3.1.3 Sources of Biological Data Artifacts

3.2 Motivation

3.3 Classification

3.3.1 Attribute-level artifacts

3.3.2 Record-level artifacts

3.3.3 Single Database level artifacts

3.3.4 Multiple Database level artifacts

3.4 Applying Existing Data Cleaning Methods

3.5 Concluding Section

Chapter 4: Correlation-based Detection of Attribute Outliers using ODDS

4.1 Introduction

4.1.1 Attribute Outliers and Class Outliers

4.1.2 Contribution

4.2 Background

4.2.1 Motivating Example

4.3 Definitions

4.3.1 Preliminaries

4.3.2 Correlation-based Outlier Metrics

4.3.3 Rate-of-Change for Threshold Optimisation

–  –  –

4.4.1 Subspace Generation using Concept Lattice

4.4.2 The ODDS Algorithm

4.4.3 Pruning Strategies in ODDS

4.4.4 The prune-ODDS Algorithm

4.5 Performance Evaluation

4.5.1 World-Clock Data Set

4.5.2 UniProt Protein Data Set

4.6 Concluding Section

Chapter 5: Attribute Outlier Detection in XML using XODDS

5.1 Introduction

5.1.1 Motivating Example

5.1.2 Contributions

5.2 Preliminary Definitions

5.2.1 Correlated Subspaces

5.2.2 Aggregate Attributes

5.2.3 Correlation Neighbourhood

5.2.4 Outlier Scoring

5.2.5 Outlier Identification

5.3 Outlier Detection Framework

5.3.1 XODDS Framework

Pages:   || 2 | 3 | 4 | 5 |   ...   | 18 |

Similar works:

«Digestive Systems: The Anatomy of Representative Vertebrates Modified from: Biology in the laboratory. 3rd edition. Helms, Helms, Kosinski and Cummings. Biological Investigations: Form, Function, Diversity and Process. 7th Edition. W.D. Dolphin Helms, Helms, Kosinski, Cummings. Biology in the Laboratory, 3rd edition. Freeman Publishing. Harold M. Kaplan and Kathleen A. Jones Southern Illinois University OVERVIEW The digestive system participates in the procurement and metabolism of...»

«Sensors 2013, 13, 5937-5944; doi:10.3390/s130505937 OPEN ACCESS sensors ISSN 1424-8220 www.mdpi.com/journal/sensors Article DNA Hairpins as Temperature Switches, Thermometers and Ionic Detectors Anette Thyssen Jonstrup, Jacob Fredsøe and Anni Hangaard Andersen * Laboratory of Genome Research, Department of Molecular Biology and Genetics, Aarhus University, C.F. Moellers Allè 3, Building 1130, 8000 Aarhus C, Denmark; E-Mails: jo@rosborg-gym.dk (A.T.J.); jcf@mb.au.dk (J.F.) * Author to whom...»

«Case 4:04-cv-02688-JEJ Document 309 Filed 11/04/2005 Page 1 of 18 Appendix A: Excerpts from Of Pandas and People (2nd ed., 1993), the published version used by students, Quote A: “This book has a single goal: to present data from six areas of science that bear on the central question of biological origins. We don't propose to give final answers, nor to unveil The Truth. Our purpose, rather, is to help readers understand origins better, and to see why the data may be viewed in more than one...»

«Electrooxidative C–H Functionalization of Aromatic Compounds Based on Rational Design Tatsuya Morofuji 2015 Preface The studies presented in this thesis have been carried out under the direction of Professor Jun-ichi Yoshida at the Department of Synthetic Chemistry and Biological Chemistry of Kyoto University during 2010-2015. The studies are concerned with development of new electrochemical reactions. The author would like to express his sincerest gratitude to Professor Jun-ichi Yoshida for...»

«Studies on Macroscale Structuralization of Porous Coordination Polymers Kenji Hirai 2013 Preface The study in this thesis has been carried out under the direction of Professor Susumu Kitagawa at during April 2007 March 2013 at Department of Synthetic Chemistry and Biological Chemistry, Graduate School of Engineering, Kyoto University. The author is greatly indebted to Professor Susumu Kitagawa for his significant guidance, valuable suggestions and continuous encouragement. The author wishes to...»

«Reaction Integration Using Electrochemically Generated Cationic Species Yosuke Ashikari 2013 ii Preface The studies presented in this thesis have been carried out under the direction of Professor Jun-ichi Yoshida at the Department of Synthetic Chemistry and Biological Chemistry of Kyoto University during 2007−2013. The studies are concerned with reaction integration using electrochemically generated cationic species. The author would particularly like to express his sincerest gratitude to...»

«International Journal of Environmental & Science Education, 2015, 10(3), 301-318 A US-China Interview Study: Biology Students’ Argumentation and Explanation about Energy Consumption Issues Hui Jin Educational Testing Service, USA Hayat Hokayem Texas Christian University, USA Sasha Wang Boise State University, USA Xin Wei People's Education Press, CHINA Received 30 January 2015 Revised 09 February 2015 Accepted 09 February 2015 As China and the United States become the top two carbon...»

«THEA M. EDWARDS, Ph.D. Sewanee: The University of the South Office: Spencer Hall 176A Department of Biology Office Phone: (931) 598-3167 735 University Avenue Email: tmedward@sewanee.edu Sewanee, TN 37383-1000 EDUCATION B.S. 1991 Virginia Tech, Department of Horticulture, Blacksburg, VA Masters 1994 University of Florida, Department of Environmental Horticulture, Gainesville, FL. Thesis directed by Profs. Terril Nell and Jim Barrett, entitled “Effects of Ethylene on Postharvest Physiology of...»

«SOCIAL MIRRORS AND THE BRAIN including a functional imaging study of role-play and verse PhD Thesis Charles Whitehead Department of Anthropology University College London 2003 1 Abstract This thesis attempts to bridge the conceptual gulf dividing social from biological anthropology using three main lines of argument: 1. Enculturated human cooperation originated in expanded systems of kinship and reciprocity which require obfuscation of selfish bodies. Disembodied self perceptions characterize...»

«  1   Curriculum Vita John Edward Korstad June 1, 2011 Personal Data: Born: July 4, 1949, in Woodland, California, U.S.A. Marital Status: Married to Sally D. (Steffen) Korstad; 4 children and currently 5 grandchildren 301 E. 122nd Ct. S., Jenks, OK 74037 Home Address: Current Position: Professor of Biology (since Fall 1980) Department of Biology Oral Roberts University 7777 S. Lewis Tulsa, OK 74171 Phone Numbers: Home: (918) 853-2580 School: (918) 495-6942 Cell: (918) 853-3579 Fax: (918)...»

«Doctoral Thesis Improving de novo model quality and its application in ab initio phasing Rojan Shrestha A Dissertation Presented By Rojan Shrestha Submitted to The Graduate School of Frontier Sciences of the University of Tokyo in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY August 2014 Department of Computational Biology Abstract De novo models are computationally predicted three-dimensional models of the given proteins using only amino acids sequence...»

«DISSERTATIONES BIOLOGICAE UNIVERSITATIS TARTUENSIS 277 PRIIT ADLER Analysis and visualisation of large scale microarray data DISSERTATIONES BIOLOGICAE UNIVERSITATIS TARTUENSIS 277 DISSERTATIONES BIOLOGICAE UNIVERSITATIS TARTUENSIS 277 PRIIT ADLER Analysis and visualisation of large scale microarray data Institute of Molecular and Cellular Biology, University of Tartu, Estonia Dissertation is accepted for the commencement of the degree of Doctor of Philosophy in bioinformatics at University of...»

<<  HOME   |    CONTACTS
2016 www.dissertation.xlibx.info - Dissertations, online materials

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.