«JUDICE, LIE YONG KOH (Master of Technology, NUS) A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY DEPARTMENT OF COMPUTER SCIENCE SCHOOL OF ...»
Correlation-Based Methods for Biological Data
JUDICE, LIE YONG KOH
(Master of Technology, NUS)
A THESIS SUBMITTED
FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF COMPUTER SCIENCE
SCHOOL OF COMPUTING
NATIONAL UNIVERSITY OF SINGAPORE2007 In long memory of my father and sister II Correlation-Based Methods for Biological Data Cleaning by JUDICE, LIE YONG KOH, M.Tech Dissertation Presented to the Faculty of the School of Computing of the National University of Singapore in Partial Fulfillment of the Requirements for the Degree of
DOCTOR OF PHILOSOPHYNational University of Singapore March 2007 III Acknowledgements I would like to express my gratitude to all those who have helped me complete this PhD thesis. First, I am deeply grateful to my supervisor, Dr. Mong Li Lee, School of Computing, National University of Singapore, for her guidance and teachings. This completion of the PhD thesis will not be possible without her consistent support and patience, as well as her wisdom which has been of utmost value to the project.
I would also like extend my gratitude to my mentor, Associate Prof Wynne Hsu, School of Computing, National University of Singapore, for her guidance and knowledge. I am fortunate to have learned from her, and have been greatly inspired by her wide knowledge and intelligence.
I have furthermore to thank my other mentor, Dr. Vladimir Brusic, University of Queensland for providing biological perspectives to the project. And my appreciation goes to the advisory committee members for beneficial discussions during my Qualifying and Thesis Proposal examinations.
In addition, I wish to extend my appreciation to my colleagues in the Institute for Infocomm Research (I2R) for their assistance, suggestions and friendship during the course of my part-time PhD studies. Special acknowledgement goes to Mr. Wee Tiong Ang and Ms.
Veeramani Anitha, Research Engineer for their helps and to Dr. See Kiong Ng, Manager of Knowledge Discovery Department for his understanding and encouragement.
Most importantly, I will like to thank my family for their love. I will also like to dedicate this thesis to my sister whose passing had driven me to retrospect my goals in life and to my father who died of heart attack and kidney failure in the midst of my study and whom I regretted for not spending enough time with during his last days. And to the one I respect most in life, my mother.
Last but not least, I wish to express my greatest appreciation to my husband, Soon Heng Tan for his continuous support, encouragement and for providing his biological
Data overload combine with widespread use of automated large-scale analysis and mining result in a rapid depreciation of the World’s data quality. Data cleaning is an emerging domain that aims at improving data quality through the detection and elimination of data artifacts. These data artifacts comprise of errors, discrepancies, redundancies, ambiguities, and incompleteness that hamper the efficacy of analysis or data mining.
Despite the importance, data cleaning remains neglected in certain knowledge-driven domains. One such example is Bioinformatics; biological data are often used uncritically without considering the errors or noises contained within, and research on both the “causes” of data artifacts and the corresponding data cleaning remedies are lacking. In this thesis, we conduct the an in-depth study of what constitutes data artifacts in real-world biological databases. To the best of our knowledge, this is the first complete investigation of the data quality factors in biological data.The result of our study indicates that biological data quality problem is by nature multifactorial and requires a number of different data cleaning approaches. While some existing data cleaning methods are directly applicable to certain artifacts, others such as annotation errors and multiple duplicate relations have not been studied. This provides the inspirations for us to devise new data cleaning methods.
Current data cleaning approaches derive observations of data artifacts from the values of independent attributes and records. On the other hand, the correlation patterns between the attributes provide additional information of the relationships embedded within a data set among the entities. In this thesis, we exploit the correlations between data entities to identify data artifacts that existing data cleaning methods fall short of addressing. We propose 3 novel data cleaning methods for detecting outliers and duplicates, and further apply them to realworld biological data as proof-of-concepts.
records. While rarity may be a good measure for class outliers, for attribute outliers, rarity may not equate abnormality. The ODDS (Outlier Detection from Data Subspaces) method utilizes deviating correlation patterns for the identification of common yet abnormal attributes. Experimental validation shows that it can achieve an accuracy of up to 88%.
The ODDS method is further extended to XODDS, an outlier detection method for semi-structured data models such as XML which is rapidly emerging as a new standard for data representation and exchange on the World Wide Web (WWW). In XODDS, we leverage on the hierarchical structure of the XML to provide addition context information enabling knowledge-based data cleaning. Experimental validation shows that the contextual information in XODDS elevates both efficiency and the effectiveness of detecting outliers.
Traditional duplicate detection methods regard duplicate relation as a boolean property. Moreover, different types of duplicates exists, some of which cannot be trivially merged. Our third contribution, the correlation-based duplicate detection method induced rules from associations between attributes in order to identify different types of duplicates.
Correlation-based methods aimed at resolving data cleaning problems are conceptually new. This thesis demonstrates they are effective in addressing some data artifacts that cannot be tackled by existing data cleaning techniques, with evidence of practical applications to real-world biological databases.
Table 1.1: Different records in database representing the same customer
Table 1.2: Customer bank accounts with personal information and monthly transactional averages
Table 2.1: Different types of data artifacts
Table 2.2: Different records from multiple databases representing the same customer.
......... 19 Table 3.1: The disulfide bridges in PDB records 1VNA, 1B3C and corresponding Entrez record GI 494705 and GI 4139618
Table 3.2: Summary of possible biological data cleaning remedies
Table 4.1: World Clock data set containing 4 attribute outliers
Table 4.2: Number of attribute outliers inserted into World-Clock data set
Table 4.3: Performance of ODDS/O-measure at varying number of CA-outliers per tuple.
.. 84 Table 4.4: Description of attributes in UniProt
Table 4.5: Frequencies of GO target attributes identified at projections of degree 5 of UniProt data set
Table 4.6: CA-outliers detected in UniProtKB/TrEMBL using ODDS/Of-measure.
............. 89 Table 4.7: Manual verification of Gene Ontology CA-outliers detected in UniProtKB/TrEMBL
Table 5.1: The 2☓2 contigency table of a target attribute and its correlated neighbourhood.
Table 5.2: Example contingency tables for monotone properties.
Properties of attribute outlier metrics
Table 5.4: Attribute subspaces derived in RBank using χ2
Table 5.5: Outliers detected from the UniProt/TrEMBL Gene Ontologies and Keywords annotations
Table 6.1: Multiple types of duplicates that exist in the protein databases
Table 6.2: Similarity scores of Entrez records 1910194A and P45639
Table 6.3: Different types of duplicate pair in training data set
Table 6.4: Examples of duplicate rules induced from CBA
Figure 1.1: Exponential growth of DNA records in GenBank, DDBJ and EMBL
Figure 2.1: Sorted Neighbourhood Method with sliding window of width 6
Figure 3.1: The central dogma of molecular biology.
Figure 3.2: The data warehousing framework of BioWare
Figure 3.3: The 4 levels physical classification of data artifacts in sequence databases.
........ 44 Figure 3.4: The conceptual classification of data artifacts in sequence databases.................. 45 Figure 3.5: Protein sequences recorded at UniProtKB/Swiss-Prot containing 5 to 15 synonyms
Figure 3.6: Undersized sequences in major protein databases
Figure 3.7: Undersized sequences in major nucleotide databases
Figure 3.8: Nucleotide sequence with the flanking vectors at the 3’ and 5’ ends
Figure 3.9: Structure of the eukaryotic gene containing the exons, introns, 5’ untranslated region and 3’ untranslated region
Figure 3.10: The functional descriptors of a UniProtKB/Swiss-Prot sequence map to the comment attributes in Entrez
Figure 3.11: Mis-fielded reference values in a GenBank record
Figure 4.1: Selected attribute combinations of the World Clock dataset and their supports.
.. 70 Figure 4.2: Example of a concept lattice of 4 tuples with 3 attributes F1, F2, and F3............ 77 Figure 4.3: Attribute combinations at projections of degree k with two attribute outliers - b and d
Figure 4.4: Rate-of-change for individual attributes in X1
Figure 4.5: Accuracy of ODDS converges in data subspaces of lower degrees in Mix3.
....... 85 Figure 4.6: Performance of ODDS compared with classifier-based attribute outlier detection..
Figure 5.2: The 4 subspaces in Bank Account XML
Figure 5.3: The XODDS outlier detection framework
Figure 5.5: Performance of XODDS of various metrics using ROC-derived thresholds.
..... 114 Figure 5.6: Performance of XODDS of various outlier metrics using Top-k
Figure 5.7: Performance of XODDS at varying noise levels
Figure 5.8: Performance of XODDS compared to the relational approach
Figure 5.9: Number of aggregate outliers in the account subspace across varying noise.
.... 119 Figure 5.10: Running time of XODDS at varying data size
Figure 5.11: Simplified UniProt XML
Figure 6.1: Extent of replication of scorpion toxin proteins across multiple databases.
....... 126 Figure 6.2: Duplicate detection framework
Figure 6.3: Matching criteria of an Entrez protein record
Figure 6.4: Field labels from each pair of duplicates in training dataset
Figure 6.5: Accuracy of detecting duplicates using different classifiers
Figure 6.6: F-score of detecting different types of duplicates
List of Tables
List of Figures
Chapter 1: Introduction
1.1.1 Data Explosion, Data Mining, and Data Cleaning
1.1.2 Applications Demanding “Clean Data”
1.1.3 Importance of Data Cleaning in Bioinformatics
1.1.4 Correlation-based Data Cleaning Approaches
1.1.5 Scope of Data Cleaning
Chapter 2: A Survey on Data Cleaning Approaches
2.1 Data Artifacts and Data Cleaning
2.2 Evolution of Data Cleaning Approaches
2.3.1 Duplicate Detection Methods
2.3.2 Outlier Detection Methods
2.3.3 Other Data Cleaning Methods
2.4 Data Cleaning Frameworks and Systems
2.4.1 Knowledge-based Data Cleaning Systems
2.4.2 Declarative Data Cleaning Applications
2.5 From Structured to Semi-structured Data Cleaning
2.5.1 XML Duplicate Detection
2.6 Biological Data Cleaning
2.6.2 Classifier-based Cleaning of Sequences
2.7 Concluding Remarks
Chapter 3: A Classification of Biological Data Artifacts
3.1.1 Central Dogma of Molecular Biology
3.1.2 Biological Database Systems
3.1.3 Sources of Biological Data Artifacts
3.3.1 Attribute-level artifacts
3.3.2 Record-level artifacts
3.3.3 Single Database level artifacts
3.3.4 Multiple Database level artifacts
3.4 Applying Existing Data Cleaning Methods
3.5 Concluding Section
Chapter 4: Correlation-based Detection of Attribute Outliers using ODDS
4.1.1 Attribute Outliers and Class Outliers
4.2.1 Motivating Example
4.3.2 Correlation-based Outlier Metrics
4.3.3 Rate-of-Change for Threshold Optimisation
4.4.1 Subspace Generation using Concept Lattice
4.4.2 The ODDS Algorithm
4.4.3 Pruning Strategies in ODDS
4.4.4 The prune-ODDS Algorithm
4.5 Performance Evaluation
4.5.1 World-Clock Data Set
4.5.2 UniProt Protein Data Set
4.6 Concluding Section
Chapter 5: Attribute Outlier Detection in XML using XODDS
5.1.1 Motivating Example
5.2 Preliminary Definitions
5.2.1 Correlated Subspaces
5.2.2 Aggregate Attributes
5.2.3 Correlation Neighbourhood
5.2.4 Outlier Scoring
5.2.5 Outlier Identification
5.3 Outlier Detection Framework
5.3.1 XODDS Framework