Big data in the research life cycle: technologies, infrastructures, policies
Prof. Yannis Ioannidis
“Athena” Research Center & University of Athens
BioMed
Oceans
Space & Earth
Culture Environment
OA Policies
Data Proc
OpenMinTeD
EXAREME MaDIS GRAPHOS
PAROS
CHESS
Optique
AITION/TopMod
KDD/ML
MDP
OpenAIRE
MaDgIK Systems
DCV ML
ResAnal
HBP Capsella
W-Dance
O-MinTeD
STE
G-kak^3
BB
EarthSrvr
V-Exhibit
EFG1914
Fut-TDM
OpenUP
WDAqua
RDA
StR-ESFRI
Data provision Layer: Extract, Transform, Load (ETL), Anonymization & pre-processing of existing resources
Middleware Layer: Distributed execution of complex dataflows & distributed querying Engine
Application Layer: Data (pre) processing and knowledge discovery platform
Imaging, Video
Streaming Data
Un/Semi/Structured Biomedical Data
Legacy Data
Simulation Models
Digital Libraries (PubMed etc)
Ontologies (UMLS, GO..)
Clinician knowledge
Upper level declarative language and extensible UDFs
MADRefine module: Data Preprocessing & Transformation, Curation & Validation
AITION clustering & general KDD: SoA Machine Learning Algorithms, Latent Variable & Topic Modelling
Distributed execution on clouds and ad-hoc clusters
Distributed Query Engine
AITION simulation: Graphical Probabilistic modelling for Statistical simulation
Ontology Based Data Access
Data Processing
• Distribution, Federation, Parallelism
• EXAREME
Data Analytics
• Cleaning & curation
• MADRefine
• Modeling, Mining
• AITION
Federated data Layer & (open) research data infrastructures: Semantic Data modelling, Provenance & Integration
Multi-modal, vertically integrated, distributed biomedical data
Biomedical Info
Registries & Metadata
Simulation Models
KDD Results
Data Infrastructures
• ESFRI Infrastructures
• ICOS, EMSO,
…
• E-Infrastructures
• OpenAIRE
WHAT, WHERE, HOW, WHY
OpenAIRE HUB
CERN zenodo
Visualize - Manage Enhanced Publications
Get support (NOADs)
Linked Content Statistics
+++
Search & Browse
Curate & collaborate
Deposit Publications & data
Research impact: Citations, usage statistics
+++
Link Classify
De-duplicate Cite
Text Mine APIs
Publication repositories (Institutional & Thematic), Open Access Journals
17,500,000 OA publications
700+ validated repositories
accessing >5K repos/OA journals
Data repositories, Data Journals
ResearchID (ORCID, ..)
OpenDOAR
…
CRIS Systems
National funding
EC funding
Usage data
Metadata on publications
Metadata on data
Guidelines for Data Providers & Open Data Pilot
Guidelines for Funding Info
Guidelines for Publications
OpenAIRE
ICOS
LIFEWATCH
EMSO
SIOS
EURO-ARGO
IAGOS
EPOS
EISCAT
COPAL ACTRIS
DANUBIUS_RI
ICOS: Integrated Carbon Observation System
Harmonized and High Precision Scientific Data on Carbon Cycle And Greenhouse Gas Budget and Perturbations
EMSO: European Multi-disciplinary Seafloor and water-column Observatory
Ocean observation systems for long-term, high-resolution, (near) real-time monitoring of environmental processes including natural hazards, climate change, and marine ecosystems
SIOS: Svalbard Integrated Earth Observing System
Arctic environmental and climate-related challenges
EURO-ARGO: European contribution to ARGO
Ocean observation for oceanography and climate
IAGOS: In-service Aircraft for a Global Observing System
Atmospheric composition, aerosol and cloud particles
EISCAT_3D: European Incoherent Scatter
Radar systems for the upper atmosphere, the ionosphere and the Aurora Borealis
EUFAR-COPAL: European Facility for Airborne Research
Airborne research for the environmental and geo sciences in Europe
ACTRIS: Aerosols, Clouds and Trace gases RI
Supports models and forecast systems by offering high-quality data on aerosols, clouds, and trace gases
DANUBIUS-RI: Int’l Center for Advanced Studies on River-Sea Systems
Addressing conflicts between society’s demands, environmental change and environmental protection in river–sea systems worldwide.
Data Infrastructures
• ESFRI Infrastructures
• ELIXIR
• E-Infrastructures
• OpenAIRE
Middleware Layer: Distributed execution of complex dataflows & distributed querying Engine
Upper level declarative language and extensible UDFs
Distributed execution on clouds and ad-hoc clusters
Distributed Query Engine
Ontology Based Data Access
Data Processing
• Distribution, Federation, Parallelism
• EXAREME
[Diagram labels: Gateway, Master, Workers, Execution Engine, Optimization Engine, Data Connector, Fast Local Net, P2P Net]
Parallel / distributed execution of complex data flows targeting data analysis and mining
Data remain at source (hospital) – dataflow / query travels
Privacy preserving: transmit only aggregated information from hospital (sufficient statistics)
Advanced data compression, on the data partitioning
Query Language: SQL + UDFs (in Python)
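As an illustration of the "SQL + UDFs (in Python)" idea, here is a minimal sketch that registers a Python function as a scalar SQL function. It uses Python's built-in sqlite3 module purely as a stand-in engine; it is not EXAREME's own API, and the `patients` table, its columns, and the `bmi` function are hypothetical.

```python
import sqlite3


def bmi(weight_kg, height_m):
    """Scalar UDF: body-mass index; returns NULL when an input is missing."""
    if weight_kg is None or not height_m:
        return None
    return weight_kg / (height_m ** 2)


conn = sqlite3.connect(":memory:")
# Make the Python function callable from SQL under the name "bmi" (2 arguments).
conn.create_function("bmi", 2, bmi)

# Hypothetical table and values, for illustration only.
conn.execute("CREATE TABLE patients (id INTEGER, weight_kg REAL, height_m REAL)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [(1, 70.0, 1.75), (2, 90.0, 1.80), (3, None, 1.60)],
)

rows = conn.execute(
    "SELECT id, bmi(weight_kg, height_m) FROM patients ORDER BY id"
).fetchall()
for row in rows:
    print(row)
```

The query stays declarative while the domain-specific computation lives in ordinary Python, which is the division of labour the slide describes.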
Query Federation
Decompose query into local and global parts
Sites 1 … N, each holding a table (id, m-name, m-value)
Requested query: “count, avg, std”, with result (m-name, N, avg, std)
L (local part): “Σx, Σx², N”
G (global part): “N, avg, std”
Run local queries at each site: partial aggregated results (m-name, Σx, Σx², N)
Run global queries over the partial aggregated results: N, avg, std
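The local/global decomposition can be sketched in a few lines of Python. Each site returns only the sufficient statistics (Σx, Σx², N), and the global step combines them into N, avg, and std. The function names and sample values are illustrative assumptions, and the sketch computes the population standard deviation; the slides do not say whether population or sample std is meant.

```python
import math


def local_query(values):
    """Runs at one site: return only the sufficient statistics (Σx, Σx², N)."""
    return (sum(values), sum(v * v for v in values), len(values))


def global_query(partials):
    """Runs centrally: combine partial aggregates into (N, avg, std)."""
    sx = sum(p[0] for p in partials)
    sx2 = sum(p[1] for p in partials)
    n = sum(p[2] for p in partials)
    avg = sx / n
    # Population variance E[x²] - E[x]²; clamp tiny negative rounding errors.
    var = max(sx2 / n - avg * avg, 0.0)
    return n, avg, math.sqrt(var)


# Hypothetical measurements held at two sites; raw values never leave a site.
site_a = [2.0, 4.0, 4.0, 4.0]
site_b = [5.0, 5.0, 7.0, 9.0]

n, avg, std = global_query([local_query(site_a), local_query(site_b)])
print(n, avg, std)
```

Only three numbers per site cross the network, which is what makes the scheme both cheap and privacy-friendly.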
• Distributed elastic execution
– Parallel aggregations, unions, and joins
– Resources are reserved dynamically
• Iterative dataflow execution
– Support machine learning algorithms
• Novel query optimization techniques
– SQL with User Defined Functions
– Arbitrary user code with unknown properties
– Privacy-aware query optimization
• Time and money
• 2-dimensional optimization
Quantum: 1 hour
• Simple map-reduce flow
– A: 1 hour, B: 10 minutes, C: 1 hour

Schedule                 Time (hours)   Money (resource hours)   Winner
One host for all ops     18.60          19                       5x cheaper
Different host per op    2.16           102                      9x faster
• Optimal dataflow scheduling
• Skyline of all Pareto optimal plans
[Plot axes: Time, Money]
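A skyline of Pareto-optimal plans keeps every schedule that no other schedule beats on both time and money. The sketch below is a simple quadratic filter, not the optimizer's actual algorithm; the first two plans use the figures from the table above, while plans X and Y are made up for illustration.

```python
def skyline(plans):
    """Return the plans not dominated in (time, money); lower is better in both."""
    front = []
    for name, t, m in plans:
        dominated = any(
            (t2 <= t and m2 <= m) and (t2 < t or m2 < m)
            for _, t2, m2 in plans
        )
        if not dominated:
            front.append((name, t, m))
    # Sort by time so the trade-off curve reads left to right.
    return sorted(front, key=lambda p: p[1])


plans = [
    ("one host for all ops", 18.60, 19),    # from the table
    ("different host per op", 2.16, 102),   # from the table
    ("hypothetical plan X", 20.00, 30),     # made up: slower and costlier than one host
    ("hypothetical plan Y", 6.00, 40),      # made up: an intermediate trade-off
]

front = skyline(plans)
for plan in front:
    print(plan)
```

Neither schedule from the table dominates the other, so both stay on the skyline; choosing between them depends on how the user weighs time against money.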
Application Layer: Data (pre) processing and knowledge discovery platform
MADRefine module: Data Preprocessing & Transformation, Curation & Validation
AITION clustering & general KDD: SoA Machine Learning Algorithms, Latent Variable & Topic Modelling
AITION simulation: Graphical Probabilistic modelling for Statistical simulation
Data Analytics
• Cleaning & curation
• MADRefine
• Modeling, Mining
• AITION
Data Mining
Disease signatures
Patient grouping & similarity
Raw data from biomarker-based personalized acquisition
Personalized Model Guided Medicine
For a particular patient
Unknown / missing data
Predict value of missing variable
Variable dependencies & causality
Simulation Models
Create Statistical Simulation Models
Individualized diagnosis, prognosis & treatment plan
Model & Verification, Knowledge Discovery, Reasoning & decision support
Data Preprocessing, Curation & Validation
Transformed & Validated Data
Domain knowledge & assumptions
Clinical workflows
BOTTOM-UP, TOP-DOWN
Big Data Analytics
• Capture: multi source, multi modal, multi system
Management
• Data provenance
• Sanitization (Anonymization)
• Process: aggregate, distributed
Analysis
• Privacy preserving: Algorithms, Mechanisms
Modeling
• Personalized
• De-identified
Practice
• Ethics
• Privacy
[Figure: dataset variables]
Legend: Predictors, Medication, Outcome
Data types: demographics, imaging, genetics, clinical, lab, OTHER
Variables: SEX, AgeOnSet, ILAR, JntActDis, GlbActDis, DisDur, JntLOM, GenEval, CHAQ, ESR, CRP, ANA, MEFN, IL2RA, Poznanski, NSAID, STEROID, DMARD, BIOLOGIC, JADI, Synovial volume
Differences: JntLOMDiff, CHAQDiff, ESRDiff, CRPDiff, JntActDisDiff, GlbActDisDiff, GenEvalDiff
Out variables: BOXValidatedOut, Adapted Sharp/van der Heijde Score Out, JADIOut, Extended BOX
Data Preprocessing, Curation & Validation
Extensible validation and data transformation engine
Interactive and efficient web-based interface
Data cleaning:
◦ Typographical error detection (numeric & alphanumeric)
◦ Data cleaning rules (functional dependencies, conditional functional dependencies, denial constraints)
◦ New/derived columns (discretization, computation of medical scores)
◦ Data visualisation (barcharts, piecharts, scatterplots, linecharts, etc.)
End-to-end data analysis workflow support (rerun experiments, reproduce results)
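To make the "data cleaning rules" item concrete, here is a minimal sketch of checking one functional dependency (lhs → rhs) over a set of records. It is not MADRefine's implementation; the column names `drug_code` and `drug_class` and the sample rows are hypothetical.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return the lhs values that map to more than one rhs value,
    i.e. the groups that violate the functional dependency lhs -> rhs."""
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[col] for col in lhs)
        seen[key].add(row[rhs])
    return {key: sorted(vals) for key, vals in seen.items() if len(vals) > 1}


# Hypothetical records: each drug code should determine exactly one drug class.
rows = [
    {"drug_code": "D1", "drug_class": "NSAID"},
    {"drug_code": "D2", "drug_class": "DMARD"},
    {"drug_code": "D1", "drug_class": "STEROID"},  # conflicts with the first row
]

violations = fd_violations(rows, lhs=["drug_code"], rhs="drug_class")
print(violations)
```

A cleaning tool would typically surface such groups to a curator rather than repair them automatically, since the rule alone cannot tell which value is wrong.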
Knowledge Discovery: Disease signatures, Patient grouping & similarity
Disease signatures: Latent factors (patterns) that characterize a disease
◦ Distribution of most relevant variables for disease (e.g., biomarkers)
◦ Multiple variables per signature, multiple signatures per disease
Patient Cluster: Homogeneous patient group with common characteristics
Patient Similarity: Patients “like” me or mine (patient or clinician role)
◦ “like” = according to different criteria (e.g., allocation on disease signatures)
Similarity & Graph clustering
Topics & allocations
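One possible reading of similarity "according to allocation on disease signatures" is to compare patients by the vector of weights each has on the signatures. The sketch below uses cosine similarity for that comparison, which is an assumption; the slides do not name the measure. Patient ids and allocation values are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(pid, allocations):
    """Return the id of the other patient whose allocation vector is closest."""
    others = (other for other in allocations if other != pid)
    return max(others, key=lambda o: cosine(allocations[pid], allocations[o]))


# Hypothetical allocations of three patients over three disease signatures.
allocations = {
    "p1": [0.8, 0.1, 0.1],
    "p2": [0.7, 0.2, 0.1],
    "p3": [0.1, 0.1, 0.8],
}

nearest = most_similar("p1", allocations)
print(nearest)
```

The pairwise similarities computed this way can also serve as edge weights for the graph clustering mentioned on the slide.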
Modelling: Variable dependencies & causality, Simulation Models
Bayesian Net: Directed Acyclic Graph + Conditional Prob Distributions
◦ Features (Nodes) & Dependencies (Edges)
◦ Compact representation of joint data distribution
Given: data + Domain Knowledge

Patient  X1  X2  X3  X4  X5  X6  X7  X8
1        Y   N   N   Y   Y   Y   N   Y
:
1000     N   N   Y   N   N   Y   N   N

Find: network structure over X1 … X8
Node labels in the example: Smoking, Genetic Factor, Allergy, Lung cancer, Chronic bronchitis
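The "compact representation of the joint distribution" means the joint probability factorizes into one conditional distribution per node given its parents. A minimal sketch follows, assuming a three-node structure (Smoking as the parent of both LungCancer and Bronchitis) and made-up probabilities; neither the structure nor the numbers come from the slides.

```python
from itertools import product

# Hypothetical DAG: each node lists its parents.
parents = {
    "Smoking": [],
    "LungCancer": ["Smoking"],
    "Bronchitis": ["Smoking"],
}

# Hypothetical CPTs: P(node = True | parent values).
cpt = {
    "Smoking": {(): 0.30},
    "LungCancer": {(True,): 0.10, (False,): 0.01},
    "Bronchitis": {(True,): 0.40, (False,): 0.10},
}


def joint(assignment):
    """P(assignment) as the product of P(node | parents) over all nodes."""
    p = 1.0
    for node, pa in parents.items():
        p_true = cpt[node][tuple(assignment[x] for x in pa)]
        p *= p_true if assignment[node] else 1.0 - p_true
    return p


def marginal(node, value=True):
    """P(node = value), by summing the joint over all full assignments."""
    names = list(parents)
    total = 0.0
    for vals in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, vals))
        if assignment[node] == value:
            total += joint(assignment)
    return total


p_lung_cancer = marginal("LungCancer")
print(p_lung_cancer)
```

Three small tables (1 + 2 + 2 numbers) replace the 7 free parameters of the full joint over three binary variables; the saving grows quickly with the number of variables.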
Final DAG (based on MCMC & DP, threshold = 0.5)
Variables: Age, ParCHD, Procedures, ExIntoler, Cyanosis, CPBP, CPArrhy, CPConcl, CPTermRsn, BSA, TPVRegurg, TriRegurg, RVD, RedRV, PSMotion, RestrPatt, AVBlock, SupravArrhy, VentricArrhy
[Figure scale: 0 to 1 in steps of 0.1]
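One way to read "threshold = 0.5" is that structure learning yields a probability per candidate edge and only edges at or above the threshold are kept. The sketch below assumes that reading, keeps edges with probability ≥ 0.5 (whether the bound is inclusive is also an assumption), and checks that the result is acyclic. Node names V1 to V4 and all probabilities are invented; they are not results from the study.

```python
def threshold_edges(edge_prob, threshold=0.5):
    """Keep the directed edges whose probability reaches the threshold."""
    return sorted(edge for edge, p in edge_prob.items() if p >= threshold)


def is_acyclic(nodes, edges):
    """Kahn's algorithm: the graph is a DAG iff every node can be removed
    in topological order."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for u, v in edges:
        children[u].append(v)
        indegree[v] += 1
    queue = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while queue:
        n = queue.pop()
        removed += 1
        for child in children[n]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return removed == len(nodes)


# Hypothetical edge probabilities (illustration only).
nodes = ["V1", "V2", "V3", "V4"]
edge_prob = {
    ("V1", "V2"): 0.82,
    ("V2", "V3"): 0.64,
    ("V2", "V4"): 0.47,
    ("V3", "V1"): 0.12,
}

edges = threshold_edges(edge_prob)
print(edges, is_acyclic(nodes, edges))
```

Thresholding independent edge probabilities does not by itself guarantee acyclicity, which is why the explicit check is included.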
Modelling Dependency Analysis
Inference
Reasoning & decision support: Individualized diagnosis, prognosis & treatment plan
Increased RVD is associated with worse values in every MR aspect (TPVRegurg, PSMotion, RedRV, AVBlock, TriRegurg)
Brussels – 6-7 May 2014
MyHealthMyData
Data: Raw Personal Data, Raw Anonymised, Summary Anonymised
Access: Private, Controlled Access, Public
Bioinformatics services for: All Users, Doctors (and Patients?), Researchers
Obtaining consent not straightforward
Anonymisation: necessary, rather complicated, ensuring neither privacy nor data value
“Blending in a crowd” and k-anonymity: privacy is a property, not an output, of sanitization
How do we define privacy?
data publishing: “Sanitization” (Anonymisation) hides individual info (k-anonymity) but preserves (sufficient) aggregated statistics
data mining: Specific algorithms (usually operating in two phases) for classification, clustering, association rules, …
mechanisms: Differential Privacy & Crowd-Blending Privacy perturb data or add noise, ensuring an ε-indistinguishable output distribution
encryption: Fully Homomorphic Encryption (FHE) lets computations and queries run over encrypted data
decentralization: Blockchain to protect personal data; decentralized personal data management in which users own and control their data
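Two of the approaches above are easy to sketch. The first function checks k-anonymity: every combination of quasi-identifier values must be shared by at least k records. The second releases a count with Laplace noise of scale 1/ε, the standard ε-differentially-private mechanism for a query of sensitivity 1. The records, column names, and parameter values are hypothetical.

```python
import random
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in at least k rows."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return all(count >= k for count in groups.values())


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: the difference of two independent
    exponential samples with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(rows, predicate, epsilon, rng):
    """Noisy count; a counting query has sensitivity 1, so the scale is 1/epsilon."""
    true_count = sum(1 for r in rows if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical, already generalized records.
rows = [
    {"age_band": "10-14", "zip3": "106", "dx": "JIA"},
    {"age_band": "10-14", "zip3": "106", "dx": "CHD"},
    {"age_band": "15-19", "zip3": "115", "dx": "JIA"},
    {"age_band": "15-19", "zip3": "115", "dx": "JIA"},
]

two_anonymous = is_k_anonymous(rows, ["age_band", "zip3"], k=2)
three_anonymous = is_k_anonymous(rows, ["age_band", "zip3"], k=3)
noisy = dp_count(rows, lambda r: r["dx"] == "JIA", epsilon=0.5, rng=random.Random(0))
print(two_anonymous, three_anonymous, noisy)
```

Note that the second group is 2-anonymous yet every member has the same diagnosis, so an attacker still learns it. This illustrates why anonymisation alone may ensure neither privacy nor data value.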
Big data is not only about size
Data is distributed, data is heterogeneous
Processing goes to data, not data to processing
ICT (Data management & processing) advances
◦ Data compression
◦ Federated / privacy-preserving processing
◦ Scalable parallel / distributed processing
◦ Data curation (otherwise: garbage in, garbage out)
◦ Text and data analytics
http://www.madgik.di.uoa.gr
https://www.humanbrainproject.eu
http://www.md-paedigree.eu/
http://www.openaire.eu
http://www.optique-project.eu