Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science
Modelling and computing the quality of information in e-science
Paolo Missier, Suzanne Embury, Mark Greenwood
School of Computer Science
University of Manchester, UK
Alun Preece, Binling Jin
Department of Computing Science
University of Aberdeen, UK
http://www.qurator.org
Aberdeen, 24/1/07
Combining the strengths of UMIST and The Victoria University of Manchester
Quality of data
Main driver, historically: data cleaning for
• Integration: use of same IDs across data sources
• Warehousing, analytics:
– restore completeness,
– reconcile referential constraints
– cross-validation of numeric data by aggregation
Focus:
• Record de-duplication, reconciliation, “linkage”
– Ample literature – see eg Nov 2006 issue of IEEE TKDE
• Consistency of data across sources
• Managing uncertainty in databases (Trio - Stanford)
The need for data quality control is rooted in data management practice
Common quality issues
• Completeness: not missing any of the results
• Correctness: each data item should reflect the actual real-world entity that it is intended to model
– The actual address where you live, the correct balance in your bank account…
• Timeliness: delivered in time for use by a consumer process
– E.g. stock information
• …
Taxonomy for data quality dimensions
Our motivation: quality in public e-science data
GenBank, UniProt, EnsEMBL, Entrez, dbSNP
• Large volumes of data in many public repositories
• Increasingly creative uses for this data
Problem: using third party data of unknown quality may result in misleading scientific conclusions
Some quality issues in biology
“Quality” covers a broader spectrum of issues than traditional DQ
• “X% of database A may be wrong (unreliable) – but I have no easy way to test that”
• “This microarray data looks ok but is testing the wrong hypothesis”
• The output from this sequence matching algorithm produces false positives
• …
Each of these issues calls for a separate testing procedure
Difficult to generalize
Correctness in biology - examples
Data type | Creation process | Correctness
UniProt protein annotation | Manual curation | Functional annotation f for p is correct if function f can reliably be attributed to p
Qualitative proteomics: protein identification | Generate peptide peak lists, match peak lists (e.g. Imprint) | No false positives: every protein in the output is actually present in the cell sample
Transcriptomics: gene expression report (up/down-regulation) | Microarray data analysis | No false positives, no false negatives
Defining quality in e-science is challenging
• In-silico experiments express cutting-edge research
– Experimental data liable to change rapidly
– Definitions of quality are themselves experimental
• Scientists’ quality requirements often just a hunch
– Quality tests missing or based on experimental heuristics
– Definitions of quality criteria are personal and subjective
• Quality controls tightly coupled to data processing
– Often implicit and embedded in the experiment
– Not reusable
Research goals
1. Make personal definitions of quality explicit and formal
– Identify a common denominator for quality concepts
– Expressed as a conceptual model for Information Quality
2. Make existing data processing quality-aware
– Define an architectural framework that accommodates personal definitions of quality
– Compute quality levels and expose them to the user
Elicit “nuggets” of latent quality knowledge from the experts
Example: protein identification
Figure: “Wet lab” experiment → protein identification algorithm → protein hitlist (data output) → protein function prediction
Correct entry: true positive
Evidence:
• Mass coverage (MC) measures the amount of protein sequence matched
• Hit ratio (HR) gives an indication of the signal-to-noise ratio in a mass spectrum
• ELDP reflects the completeness of the digestion that precedes the peptide mass fingerprinting
This evidence is independent of the algorithm / SW package
It is readily available and inexpensive to obtain
Correctness of protein identification
Estimator function: (computes a score rather than a probability)
PMF score = (HR x 100) + MC + (ELDP x 10)
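A minimal Python sketch of the estimator above. The function name, the argument names and the two sample hits are illustrative assumptions; only the formula comes from the slide.

```python
def pmf_score(hit_ratio: float, mass_coverage: float, eldp: float) -> float:
    """PMF score = (HR x 100) + MC + (ELDP x 10), as stated on the slide."""
    return hit_ratio * 100 + mass_coverage + eldp * 10


# Hypothetical evidence values for two protein hits (not data from the talk).
hits = {
    "hit_a": {"hit_ratio": 0.5, "mass_coverage": 30.0, "eldp": 2.0},
    "hit_b": {"hit_ratio": 0.25, "mass_coverage": 12.0, "eldp": 1.0},
}

# Score every hit, then rank the hit names from highest to lowest score.
scores = {name: pmf_score(**evidence) for name, evidence in hits.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

The result is a score, not a probability: it is only meaningful for ordering hits relative to one another.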
Prediction performance – comparing 3 models:
ROC curve: true positives vs false positives
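As a sketch of how the points of such a ROC curve can be obtained: sweep a threshold over the scores and count true and false positives at each step. The function name and the toy scores and labels are made up for illustration and are not results from the talk.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    A hit is predicted correct when its score is >= the threshold.
    Assumes at least one correct and one incorrect hit in `labels`.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points


# Toy data: True marks a hit known to be a correct identification.
toy_scores = [100.0, 80.0, 60.0, 40.0]
toy_labels = [True, True, False, False]
curve = roc_points(toy_scores, toy_labels)
```

Comparing several score models then amounts to computing one such curve per model on the same labelled test set.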
Quality process components
Figure: “Wet lab” experiment → protein identification algorithm → protein hitlist (data output) → quality filtering → protein function prediction
Goal: to automatically add the additional filtering step in a principled way
Quality assertion: PMF score = (HR x 100) + MC + (ELDP x 10)
Evidence:
• Mass coverage (MC)
• Hit ratio (HR)
• ELDP
Quality Assertions
QA(D): any function of evidence (metadata for D) that computes a partial order on D
1. Score model (total or partial order)
2. Classification model with class ordering:
Figure: dataset D partitioned into regions reject, analyze, accept
reject < analyze < accept
Actions associated with regions
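A sketch of the second kind of assertion, a classification model with an ordering on its classes. The thresholds on the score and the action strings are hypothetical; the slide fixes only the ordering reject < analyze < accept.

```python
from enum import IntEnum


class QualityClass(IntEnum):
    """Ordered quality classes: REJECT < ANALYZE < ACCEPT."""
    REJECT = 0
    ANALYZE = 1
    ACCEPT = 2


def classify(score: float, analyze_from: float = 40.0,
             accept_from: float = 70.0) -> QualityClass:
    """Map a score to a class; both thresholds are assumptions."""
    if score >= accept_from:
        return QualityClass.ACCEPT
    if score >= analyze_from:
        return QualityClass.ANALYZE
    return QualityClass.REJECT


# One action per region (the action names are illustrative).
ACTIONS = {
    QualityClass.REJECT: "discard",
    QualityClass.ANALYZE: "flag for manual inspection",
    QualityClass.ACCEPT: "pass downstream",
}

action_for_hit = ACTIONS[classify(47.0)]
```

Using an ordered enumeration keeps the class ordering explicit, so "at least analyze" can be written as a plain comparison.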
Abstract quality views
An operational definition for personal quality:
1. Formulate a quality assertion on the dataset:
– i.e. a ranking of proteins by PMF score
2. Identify underlying evidence necessary to compute the assertion
– the variables used to compute the score (HR, MC, ELDP)
3. Define annotation functions that compute evidence values
– Functions that compute HR, MC, ELDP
4. Define quality regions on the ranked dataset
– In this case, intervals of acceptability
5. Associate actions with each region
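The five steps can be captured in one small data structure. This is a sketch under assumptions: the class name QualityView, its field names, the region bounds and the action strings are all invented here and are not the Qurator API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class QualityView:
    # Steps 2 and 3: one annotation function per evidence variable.
    annotators: Dict[str, Callable[[dict], float]]
    # Step 1: the quality assertion, here a score computed from evidence.
    assertion: Callable[[Dict[str, float]], float]
    # Step 4: regions as (name, lower bound), listed from highest to lowest.
    regions: List[Tuple[str, float]]
    # Step 5: one action per region.
    actions: Dict[str, str]

    def evaluate(self, item: dict) -> Tuple[str, str]:
        """Return the (region, action) pair for one data item."""
        evidence = {name: fn(item) for name, fn in self.annotators.items()}
        score = self.assertion(evidence)
        for region, lower_bound in self.regions:
            if score >= lower_bound:
                return region, self.actions[region]
        raise ValueError("no region covers this score")


pmf_view = QualityView(
    annotators={
        "HR": lambda hit: hit["hit_ratio"],
        "MC": lambda hit: hit["mass_coverage"],
        "ELDP": lambda hit: hit["eldp"],
    },
    assertion=lambda ev: ev["HR"] * 100 + ev["MC"] + ev["ELDP"] * 10,
    regions=[("accept", 70.0), ("analyze", 40.0), ("reject", float("-inf"))],
    actions={"accept": "keep", "analyze": "inspect", "reject": "drop"},
)

result = pmf_view.evaluate(
    {"hit_ratio": 0.5, "mass_coverage": 30.0, "eldp": 2.0}
)
```

Keeping the view declarative, with no reference to where the data comes from, is what would let the same definition be bound to different environments later.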
Computable quality views as commodities
Cost-effective quality-awareness for data processing:
• Reuse of high-level definitions of quality views
• Compilation of abstract quality views into quality components
Abstract quality views → binding and compilation → executable quality process
Qurator architectural framework:
– runtime environment
– data-specific quality services
Quality hypotheses discovery and testing
Figure labels: quality model definition; abstract quality view; targeted compilation; target-specific quality component; deployment; quality-enhanced user environment; execution on test data; quality model performance assessment
Multiple target environments:
• Workflow
• Query processor
Experimental quality
Making data processing quality-aware using Quality Views
– Query, browsing, retrieval, data-intensive workflows
Discovery and validation of “Quality nuggets”
Figure labels: Quality View; model testing; test datasets; embedding quality views and flow-through testing
Execution model for Quality views
Binding → compilation → executable component
– Sub-flow of an existing workflow
– Query processing interceptor
Figure labels: abstract quality view; QV compiler; embedded quality workflow; Qurator quality framework (services registry, services implementation); quality view on D’
Host workflow: D → D’
Example: original proteomics workflow
Taverna workflow
Quality flow embedding point
Example: embedded quality workflow
Interactive conditions / actions
Generic quality process pattern
Collect evidence
– Fetch persistent annotations (persistent evidence)
– Compute on-the-fly annotations
<variables>
  <var variableName="Coverage" evidence="q:Coverage"/>
  <var variableName="PeptidesCount" evidence="q:PeptidesCount"/>
</variables>
Compute assertions (classifier)
<QualityAssertion
  serviceName="PIScoreClassifier" serviceType="q:PIScoreClassifier"
  tagSemType="q:PIScoreClassification" tagName="ScoreClass"
Evaluate conditions, execute actions
<action>
  <filter>
    <condition>ScoreClass in {"q:high", "q:mid"} and Coverage > 12</condition>
  </filter>
</action>
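The pattern can be read as a four-stage loop over data entries: collect evidence, compute assertions, evaluate conditions, execute actions. The Python sketch below mirrors the XML fragments (variables Coverage and PeptidesCount, tag ScoreClass, and the filter condition). The classifier rule, the q:low class and the sample data are hypothetical stand-ins for the real services.

```python
from typing import Callable, Dict, List


def collect_evidence(entry: dict, persistent: Dict[str, dict],
                     on_the_fly: Dict[str, Callable[[dict], float]]) -> dict:
    """Fetch stored annotations, then compute any that are still missing."""
    evidence = dict(persistent.get(entry["id"], {}))
    for name, annotate in on_the_fly.items():
        if name not in evidence:
            evidence[name] = annotate(entry)
    return evidence


def score_class(evidence: dict) -> str:
    """Hypothetical stand-in for the PIScoreClassifier service."""
    count = evidence["PeptidesCount"]
    if count >= 10:
        return "q:high"
    if count >= 5:
        return "q:mid"
    return "q:low"


def keep(tags: dict, evidence: dict) -> bool:
    """The filter condition shown on the slide."""
    return tags["ScoreClass"] in {"q:high", "q:mid"} and evidence["Coverage"] > 12


def run_quality_process(entries: List[dict], persistent: Dict[str, dict],
                        on_the_fly: Dict[str, Callable[[dict], float]]) -> List[str]:
    """Return the ids of the entries that pass the filter action."""
    kept = []
    for entry in entries:
        evidence = collect_evidence(entry, persistent, on_the_fly)
        tags = {"ScoreClass": score_class(evidence)}
        if keep(tags, evidence):
            kept.append(entry["id"])
    return kept


# Made-up entries: Coverage is stored, PeptidesCount is computed on the fly.
entries = [
    {"id": "P1", "peptides": ["p"] * 12},
    {"id": "P2", "peptides": ["p"] * 6},
    {"id": "P3", "peptides": ["p"] * 2},
]
persistent = {"P1": {"Coverage": 20}, "P2": {"Coverage": 8}, "P3": {"Coverage": 30}}
on_the_fly = {"PeptidesCount": lambda e: len(e["peptides"])}
kept_ids = run_quality_process(entries, persistent, on_the_fly)
```

Here P2 fails on coverage and P3 fails on score class, so only P1 survives the filter.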
A semantic model for quality concepts
Quality “upper ontology” (OWL)
Quality evidence types
Evidence annotations are class instances
Evidence meta-data model (RDF)
Main taxonomies and properties
Class restriction:
MassCoverage is-evidence-for . ImprintHitEntry
Class restriction:
PIScoreClassifier assertion-based-on-evidence . HitScore
PIScoreClassifier assertion-based-on-evidence . MassCoverage
assertion-based-on-evidence: QualityAssertion → QualityEvidence
is-evidence-for: QualityEvidence → DataEntity
The ontology-driven user interface
Detecting inconsistencies: no annotators for this Evidence type
Detecting inconsistencies: unsatisfied input requirements for Quality Assertion
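Both inconsistency checks reduce to set comparisons once it is known which evidence each assertion needs. Below is a sketch with plain dictionaries standing in for the ontology. The function name and the annotator name are hypothetical, and the real interface presumably works from the OWL model rather than from code like this.

```python
from typing import Dict, List


def check_view(assertion_inputs: Dict[str, List[str]],
               annotators: Dict[str, List[str]],
               requested_assertions: List[str]) -> List[str]:
    """Return human-readable inconsistencies for a draft quality view."""
    problems = []
    available = set()
    # Check 1: every evidence type needs at least one annotator.
    for evidence_type, providers in annotators.items():
        if providers:
            available.add(evidence_type)
        else:
            problems.append(f"no annotators for evidence type {evidence_type}")
    # Check 2: every assertion needs all of its input evidence to be available.
    for assertion in requested_assertions:
        missing = sorted(set(assertion_inputs[assertion]) - available)
        if missing:
            problems.append(
                f"unsatisfied input requirements for {assertion}: "
                + ", ".join(missing)
            )
    return problems


# PIScoreClassifier is based on HitScore and MassCoverage (from the ontology).
assertion_inputs = {"PIScoreClassifier": ["HitScore", "MassCoverage"]}
# Hypothetical registry: MassCoverage has an annotator, HitScore has none.
annotators = {"MassCoverage": ["ImprintOutputAnnotator"], "HitScore": []}
problems = check_view(assertion_inputs, annotators, ["PIScoreClassifier"])
```

With this input, both kinds of inconsistency are reported for the missing HitScore annotator.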
Qurator architecture
Quality-aware query processing
Figure labels: query client; quality-aware query; query processor (SQL, XQuery); data; R; evidence; Quality View component (annotate, assert, act); dump; R’
Research issues
Quality modelling:
• Provenance as evidence
– Can data/process provenance be turned into evidence?
• Experimental elicitation of new Quality Assertions
– Seeking new collaborations with biologists!
• Classification with uncertainty
– Data elements belong to a quality class with some probability
• Computing Quality Assertions with limited evidence
– Evidence may be expensive and sometimes unavailable
– Robust classification / score models
Architecture:
• Metadata management model
– Quality Evidence is a type of metadata with known features…
Summary
For complex data types, often no single “correct” and agreed-upon definition of quality of data
• Qurator provides an environment for fast prototyping of quality hypotheses
– Based on the notion of “evidence” supporting a quality hypothesis
– With support for an incremental learning cycle
• Quality views offer an abstract model for making data processing environments quality-aware
– To be compiled into executable components and embedded
– Qurator provides an invocation framework for Quality Views
Publications: http://www.qurator.org
Qurator is registered with OMII-UK