Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science
![Page 1: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/1.jpg)
Modelling and computing the quality of information in e-science
Paolo Missier, Suzanne Embury, Mark Greenwood
School of Computer Science, University of Manchester, UK
Alun Preece, Binling Jin
Department of Computing Science, University of Aberdeen, UK
http://www.qurator.org
Aberdeen, 24/1/07
![Page 2: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/2.jpg)
Combining the strengths of UMIST and The Victoria University of Manchester
Quality of data
Main driver, historically: data cleaning for
• Integration: use of same IDs across data sources
• Warehousing, analytics:
– restore completeness,
– reconcile referential constraints
– cross-validation of numeric data by aggregation
Focus:
• Record de-duplication, reconciliation, “linkage”
– Ample literature: see, e.g., the Nov 2006 issue of IEEE TKDE
• Consistency of data across sources
• Managing uncertainty in databases (Trio - Stanford)
The need for data quality control is rooted in data management practice
![Page 3: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/3.jpg)
Common quality issues
• Completeness: not missing any of the results
• Correctness: each data item should reflect the actual real-world entity that it is intended to model
– The actual address where you live, the correct balance in your bank account…
• Timeliness: delivered in time for use by a consumer process
– E.g. stock information
• …
![Page 4: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/4.jpg)
Taxonomy for data quality dimensions
![Page 5: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/5.jpg)
Our motivation: quality in public e-science data
GenBank, UniProt, EnsEMBL, Entrez, dbSNP
• Large volumes of data in many public repositories
• Increasingly creative uses for this data
Problem: using third-party data of unknown quality may result in misleading scientific conclusions
![Page 6: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/6.jpg)
Some quality issues in biology
“Quality” covers a broader spectrum of issues than traditional DQ
• “X% of database A may be wrong (unreliable) – but I have no easy way to test that”
• “This microarray data looks ok but is testing the wrong hypothesis”
• The output from this sequence matching algorithm produces false positives
• …
Each of these issues calls for a separate testing procedure
Difficult to generalize
![Page 7: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/7.jpg)
Correctness in biology - examples
| Data type | Creation process | Correctness |
|---|---|---|
| UniProt protein annotation | Manual curation | Functional annotation f for p is correct if function f can reliably be attributed to p |
| Qualitative proteomics: protein identification | Generate peptide peak lists, match peak lists (e.g. Imprint) | No false positives: every protein in the output is actually present in the cell sample |
| Transcriptomics: gene expression report (up/down-regulation) | Microarray data analysis | No false positives, no false negatives |
![Page 8: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/8.jpg)
Defining quality in e-science is challenging
• In-silico experiments express cutting-edge research
– Experimental data liable to change rapidly
– Definitions of quality are themselves experimental
• Scientists’ quality requirements often just a hunch
– Quality tests missing or based on experimental heuristics
– Definitions of quality criteria are personal and subjective
• Quality controls tightly coupled to data processing
– Often implicit and embedded in the experiment
– Not reusable
![Page 9: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/9.jpg)
Research goals
1. Make personal definitions of quality explicit and formal
– Identify a common denominator for quality concepts
– Expressed as a conceptual model for Information Quality
2. Make existing data processing quality-aware
– Define an architectural framework that accommodates personal definitions of quality
– Compute quality levels and expose them to the user
Elicit “nuggets” of latent quality knowledge from the experts
![Page 10: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/10.jpg)
Example: protein identification
Data output
Protein identification algorithm
“Wet lab” experiment
Protein Hitlist
Protein function prediction
Correct entry → true positive
Evidence:
• Mass coverage (MC) measures the amount of protein sequence matched
• Hit ratio (HR) gives an indication of the signal-to-noise ratio in a mass spectrum
• ELDP reflects the completeness of the digestion that precedes the peptide mass fingerprinting
This evidence is independent of the algorithm / SW package
It is readily available and inexpensive to obtain
![Page 11: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/11.jpg)
Correctness of protein identification
Estimator function: (computes a score rather than a probability)
PMF score = (HR x 100) + MC + (ELDP x 10)
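As a minimal sketch, the estimator above can be written as a plain function. Only the weights come from the slide; the function name `pmf_score` and the sample evidence values are illustrative assumptions.

```python
def pmf_score(hit_ratio: float, mass_coverage: float, eldp: float) -> float:
    """PMF score = (HR x 100) + MC + (ELDP x 10), as stated on the slide."""
    return hit_ratio * 100 + mass_coverage + eldp * 10


# Hypothetical evidence values for a single protein hit (not real data).
example = pmf_score(hit_ratio=0.5, mass_coverage=30.0, eldp=2.0)
print(example)  # prints 100.0
```

The result is a score, not a probability, so it is only meaningful for ranking hits against one another.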
Prediction performance – comparing 3 models:
ROC curve: true positives vs false positives
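To illustrate how such a comparison could be set up, the sketch below computes ROC points for one score model by sweeping a threshold over the scores. The scores and ground-truth labels are made up, and the function assumes the test set contains at least one correct and one incorrect hit.

```python
from typing import List, Tuple


def roc_points(scores: List[float], labels: List[bool]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    A hit is predicted correct when its score is >= the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical scores and labels (True = the protein is really in the sample).
scores = [90.0, 80.0, 60.0, 40.0]
labels = [True, False, True, False]
print(roc_points(scores, labels))
```

Running the same function on the scores of each candidate model gives curves that can be compared directly.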
![Page 12: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/12.jpg)
Quality process components
Data output
Protein identification algorithm
“Wet lab” experiment
Protein Hitlist
Protein function prediction
Goal: to automatically add the additional filtering step in a principled way
PMF score = (HR x 100) + MC + (ELDP x 10)
Quality filtering
Quality assertion:
Evidence:
• Mass coverage (MC)
• Hit ratio (HR)
• ELDP
![Page 13: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/13.jpg)
Quality Assertions
QA(D): any function of evidence (metadata for D) that computes a partial order on D
1. Score model (total or partial order)
2. Classification model with class ordering:
Regions of D: reject, analyze, accept
reject < analyze < accept
Actions are associated with regions
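A classification model with ordered classes can be sketched as follows. The class ordering reject < analyze < accept is from the slide; the thresholds and the action descriptions are assumptions chosen only for illustration.

```python
from enum import IntEnum


class QualityClass(IntEnum):
    """Ordered quality classes: REJECT < ANALYZE < ACCEPT."""
    REJECT = 0
    ANALYZE = 1
    ACCEPT = 2


def classify(score: float, low: float = 50.0, high: float = 80.0) -> QualityClass:
    """Map a score to a class; the default thresholds are assumed values."""
    if score < low:
        return QualityClass.REJECT
    if score < high:
        return QualityClass.ANALYZE
    return QualityClass.ACCEPT


# Hypothetical actions associated with each region.
ACTIONS = {
    QualityClass.REJECT: "discard",
    QualityClass.ANALYZE: "flag for manual inspection",
    QualityClass.ACCEPT: "pass downstream",
}

for s in (30.0, 65.0, 95.0):
    c = classify(s)
    print(s, c.name, ACTIONS[c])
```

Using an `IntEnum` keeps the class ordering explicit, so regions can be compared with `<` just as scores can.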
![Page 14: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/14.jpg)
Abstract quality views
An operational definition for personal quality:
1. Formulate a quality assertion on the dataset
– e.g. a ranking of proteins by PMF score
2. Identify the underlying evidence necessary to compute the assertion
– the variables used to compute the score (HR, MC, ELDP)
3. Define annotation functions that compute evidence values
– functions that compute HR, MC, ELDP
4. Define quality regions on the ranked dataset
– in this case, intervals of acceptability
5. Associate actions with each region
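The five steps above can be sketched as a small data structure. This is not the Qurator quality view language; the `QualityView` class, its field names, the cut-off of 80 and the sample hits are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Item = Dict[str, float]  # raw data for one protein hit (hypothetical fields)


@dataclass
class QualityView:
    annotators: Dict[str, Callable[[Item], float]]  # step 3: evidence functions
    assertion: Callable[[Dict[str, float]], float]  # steps 1-2: score from evidence
    regions: Callable[[float], str]                 # step 4: score -> region name
    actions: Dict[str, Callable[[Item], bool]]      # step 5: region -> keep item?

    def apply(self, data: List[Item]) -> List[Item]:
        kept = []
        for item in data:
            evidence = {name: fn(item) for name, fn in self.annotators.items()}
            region = self.regions(self.assertion(evidence))
            if self.actions[region](item):
                kept.append(item)
        return kept


view = QualityView(
    annotators={k: (lambda item, k=k: item[k]) for k in ("HR", "MC", "ELDP")},
    assertion=lambda ev: ev["HR"] * 100 + ev["MC"] + ev["ELDP"] * 10,
    regions=lambda score: "accept" if score >= 80 else "reject",  # assumed cut-off
    actions={"accept": lambda item: True, "reject": lambda item: False},
)

hits = [
    {"HR": 0.5, "MC": 30.0, "ELDP": 2.0},  # score 100.0
    {"HR": 0.1, "MC": 10.0, "ELDP": 1.0},  # score 30.0
]
print(view.apply(hits))
```

Keeping annotators, assertion, regions and actions as separate fields mirrors the separation of concerns in the list: each can be swapped without touching the others.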
![Page 15: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/15.jpg)
Computable quality views as commodities
Cost-effective quality-awareness for data processing:
• Reuse of high-level definitions of quality views
• Compilation of abstract quality views into quality components
Abstract quality views → binding and compilation → executable quality process
Qurator architectural framework:
• runtime environment
• data-specific quality services
![Page 16: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/16.jpg)
Quality hypotheses discovery and testing
[Diagram labels: Quality model definition; Abstract quality view; Targeted compilation; Target-specific quality component; Deployment; Quality-enhanced user environment; Execution on test data; Quality model performance assessment]
Multiple target environments:
• Workflow
• Query processor
![Page 17: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/17.jpg)
Experimental quality
Making data processing quality-aware using Quality Views
– Query, browsing, retrieval, data-intensive workflows
Discovery and validation of “Quality nuggets”
[Diagram labels: Quality View; Model testing; Test datasets; Embedding quality views and flow-through testing]
![Page 18: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/18.jpg)
Execution model for Quality views
Binding → compilation → executable component
– Sub-flow of an existing workflow
– Query processing interceptor
[Diagram labels: Host workflow (D → D’); Abstract quality view; QV compiler; Embedded quality workflow; Quality view on D’; Qurator quality framework; Services registry; Services implementation]
![Page 19: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/19.jpg)
Example: original proteomics workflow
Taverna workflow
Quality flow embedding point
![Page 20: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/20.jpg)
Example: embedded quality workflow
![Page 21: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/21.jpg)
Interactive conditions / actions
![Page 22: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/22.jpg)
Generic quality process pattern
Collect evidence
– Fetch persistent annotations
– Compute on-the-fly annotations
<variables>
  <var variableName="Coverage" evidence="q:Coverage"/>
  <var variableName="PeptidesCount" evidence="q:PeptidesCount"/>
</variables>
Compute assertions (classifiers)
<QualityAssertion
  serviceName="PIScoreClassifier" serviceType="q:PIScoreClassifier" tagSemType="q:PIScoreClassification" tagName="ScoreClass"
Evaluate conditions, execute actions
<action>
  <filter>
    <condition>ScoreClass in {"q:high", "q:mid"} and Coverage > 12</condition>
  </filter>
</action>
Persistent evidence
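As a rough illustration of the last step of this pattern, the sketch below reads the variable bindings from the `<variables>` fragment and applies the filter condition shown on the slide to annotated items. It is not the Qurator runtime: the item layout, the `keep` function and the sample entries are assumptions.

```python
import xml.etree.ElementTree as ET

VARIABLES_XML = """
<variables>
  <var variableName="Coverage" evidence="q:Coverage"/>
  <var variableName="PeptidesCount" evidence="q:PeptidesCount"/>
</variables>
"""

# Map each condition variable to the evidence type it is bound to.
bindings = {
    var.get("variableName"): var.get("evidence")
    for var in ET.fromstring(VARIABLES_XML).findall("var")
}


def keep(item: dict) -> bool:
    """Slide condition: ScoreClass in {q:high, q:mid} and Coverage > 12."""
    coverage = item["evidence"][bindings["Coverage"]]
    return item["ScoreClass"] in {"q:high", "q:mid"} and coverage > 12


# Hypothetical annotated hit entries (ScoreClass is the assertion's tag).
items = [
    {"id": "P1", "ScoreClass": "q:high", "evidence": {"q:Coverage": 25, "q:PeptidesCount": 7}},
    {"id": "P2", "ScoreClass": "q:low", "evidence": {"q:Coverage": 40, "q:PeptidesCount": 9}},
    {"id": "P3", "ScoreClass": "q:mid", "evidence": {"q:Coverage": 10, "q:PeptidesCount": 3}},
]
print([i["id"] for i in items if keep(i)])  # prints ['P1']
```

P2 is dropped because of its class and P3 because of its coverage, which shows that the condition combines an assertion tag with a raw evidence value.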
![Page 23: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/23.jpg)
A semantic model for quality concepts
Quality “upper ontology” (OWL)
Evidence annotations are class instances
Quality evidence types
Evidence meta-data model (RDF)
![Page 24: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/24.jpg)
Main taxonomies and properties
Class restrictions:
MassCoverage is-evidence-for . ImprintHitEntry
PIScoreClassifier assertion-based-on-evidence . HitScore
PIScoreClassifier assertion-based-on-evidence . MassCoverage
Properties (domain → range):
assertion-based-on-evidence: QualityAssertion → QualityEvidence
is-evidence-for: QualityEvidence → DataEntity
![Page 25: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/25.jpg)
The ontology-driven user interface
Detecting inconsistencies: no annotators for this Evidence type
Detecting inconsistencies: unsatisfied input requirements for Quality Assertion
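The two kinds of inconsistency could be detected with simple set operations, as in the sketch below. It stands in for the ontology-based checks and does not use OWL reasoning; the registry dictionaries and the annotator name `ImprintOutputAnnotator` are hypothetical, while the class names PIScoreClassifier, HitScore and MassCoverage echo the slides.

```python
from typing import Dict, List, Set

# Hypothetical registry contents (assumptions, not the Qurator registry).
ANNOTATORS: Dict[str, Set[str]] = {
    # annotator service -> evidence types it can compute
    "ImprintOutputAnnotator": {"HitScore", "MassCoverage"},
}
ASSERTION_INPUTS: Dict[str, Set[str]] = {
    # quality assertion -> evidence types it is based on
    "PIScoreClassifier": {"HitScore", "MassCoverage"},
}


def check_view(evidence_types: Set[str], assertions: List[str]) -> List[str]:
    """Return human-readable inconsistencies for a quality view definition."""
    problems = []
    available = set().union(*ANNOTATORS.values())
    for ev in sorted(evidence_types - available):
        problems.append(f"no annotators for evidence type {ev}")
    for qa in assertions:
        for ev in sorted(ASSERTION_INPUTS[qa] - evidence_types):
            problems.append(f"{qa}: unsatisfied input requirement {ev}")
    return problems


print(check_view({"HitScore", "ELDP"}, ["PIScoreClassifier"]))
```

Here ELDP has no annotator and the classifier is missing MassCoverage, so both kinds of problem are reported; a consistent view yields an empty list.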
![Page 26: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/26.jpg)
Qurator architecture
![Page 27: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/27.jpg)
Quality-aware query processing
[Diagram labels: Data; Query processor (SQL, XQuery); Query client; Quality-aware query; Quality View component (annotate, assert, act); evidence; R; R’; dump]
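One way to picture the annotate, assert, act sequence applied to a query result R is the sketch below, which wraps an ordinary SQL query and returns the filtered result R’. It uses an in-memory SQLite database purely for illustration; the table, the `quality_aware_query` helper and the coverage threshold are assumptions, not part of Qurator.

```python
import sqlite3
from typing import Callable, Dict, List

Row = Dict[str, object]


def quality_aware_query(
    conn: sqlite3.Connection,
    sql: str,
    annotate: Callable[[Row], Dict[str, float]],
    assert_quality: Callable[[Dict[str, float]], str],
    act: Callable[[Row], bool],
) -> List[Row]:
    """Run a query, then annotate -> assert -> act on its result R, giving R'."""
    conn.row_factory = sqlite3.Row
    result = [dict(r) for r in conn.execute(sql)]         # R
    for row in result:
        row["evidence"] = annotate(row)                   # annotate
        row["quality"] = assert_quality(row["evidence"])  # assert
    return [row for row in result if act(row)]            # act -> R'


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (protein TEXT, coverage REAL)")
conn.executemany("INSERT INTO hits VALUES (?, ?)", [("P1", 25.0), ("P2", 8.0)])

r_prime = quality_aware_query(
    conn,
    "SELECT protein, coverage FROM hits ORDER BY protein",
    annotate=lambda row: {"Coverage": row["coverage"]},
    assert_quality=lambda ev: "high" if ev["Coverage"] > 12 else "low",
    act=lambda row: row["quality"] == "high",
)
print([row["protein"] for row in r_prime])  # prints ['P1']
```

The query itself is unchanged; the quality steps sit between the query processor and the client, which is the interceptor idea in the diagram.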
![Page 28: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/28.jpg)
Research issues
Quality modelling:
• Provenance as evidence
– Can data/process provenance be turned into evidence?
• Experimental elicitation of new Quality Assertions
– Seeking new collaborations with biologists!
• Classification with uncertainty
– Data elements belong to a quality class with some probability
• Computing Quality Assertions with limited evidence
– Evidence may be expensive and sometimes unavailable
– Robust classification / score models
Architecture:
• Metadata management model
– Quality Evidence is a type of metadata with known features…
![Page 29: Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science](https://reader033.fdocuments.us/reader033/viewer/2022052823/5550669bb4c905c0448b547f/html5/thumbnails/29.jpg)
Summary
For complex data types, there is often no single “correct”, agreed-upon definition of data quality
• Qurator provides an environment for fast prototyping of quality hypotheses
– Based on the notion of “evidence” supporting a quality hypothesis
– With support for an incremental learning cycle
• Quality views offer an abstract model for making data processing environments quality-aware
– To be compiled into executable components and embedded
– Qurator provides an invocation framework for Quality Views
Publications: http://www.qurator.org
Qurator is registered with OMII-UK