Aalborg Universitet Aspects of temporal and spatio-temporal ...
Extreme Spatio-Temporal Data Analysis
-
Upload
joel-saltz -
Category
Education
-
view
691 -
download
3
description
Transcript of Extreme Spatio-Temporal Data Analysis
Extreme Spatio Temporal Data Analysis in Biomedical Informatics
Joel Saltz MD, PhD
Director Center for Comprehensive Informatics
C
en
ter
for
Com
pre
hen
siv
e I
nfo
rmati
cs
Contributions
• Computer Science: Methods and middleware for analysis, classification of very large datasets from low dimensional spatio-temporal sensors; methods to carry out comparisons and change detection between sensor datasets
• Biomedical: Mine whole slide image datasets to better predict outcome and response to treatments, generate basic insights into pathophysiology and identify new treatment targets
C
en
ter
for
Com
pre
hen
siv
e I
nfo
rmati
cs
Outline of Talk
• Pathology: Analysis of Digitized Tissue for Research and Practice
• Feature Clustering: Morphologic Tumor Subtypes in GBM Brain Tumors and Relationship to “omic” classifications
• Whole Slide Image Analysis in Clinical Practice: Neuroblastoma
• Tissue Flow: Multiplex Quantum Dot • HPC/BIGDATA Feature Pipeline• Pathology data analytic tools and techniques
C
en
ter
for
Com
pre
hen
siv
e I
nfo
rmati
cs
Whole Slide Imaging: Scale
Data per slide: 500MB to 100GBRoughly 250-500M Slides/Year in USA
Total: 0.1-10 Exabytes/year
C
en
ter
for
Com
pre
hen
siv
e I
nfo
rmati
cs
Pathology Computer Assisted Diagnosis
Shimada, Gurcan, Kong, Saltz
Computerized Classification System for Grading Neuroblastoma
• Background Identification• Image Decomposition (Multi-
resolution levels)• Image Segmentation
(EMLDA)• Feature Construction (2nd
order statistics, Tonal Features)
• Feature Extraction (LDA) + Classification (Bayesian)
• Multi-resolution Layer Controller (Confidence Region)
No
YesImage Tile
InitializationI = L
Background? Label
Create Image I(L)
Segmentation
Feature Construction
Feature Extraction
Classification
Segmentation
Feature Construction
Feature Extraction
Classifier Training
Down-sampling
Training Tiles
Within ConfidenceRegion ?
I = I -1
I > 1?
Yes
Yes
No
No
TRAINING
TESTING
C
en
ter
for
Com
pre
hen
siv
e I
nfo
rmati
cs
C
en
ter
for
Com
pre
hen
siv
e I
nfo
rmati
cs
Direct Study of Relationship Between Image Features vs Clinical Outcome, Response to Treatment, Molecular Information
In Silico Brain Tumor Center
Anaplastic Astrocytoma(WHO grade III)
Glioblastoma(WHO grade IV)
C
en
ter
for
Com
pre
hen
siv
e I
nfo
rmati
cs
Morphological Tissue Classification
Nuclei Segmentation
Cellular Features
Lee Cooper,Jun Kong
Whole Slide Imaging
Cen
ter
for
Com
pre
hen
sive In
form
ati
cs
Consensus clustering of morphological signatures
Study includes 200 million nuclei taken from 480 slides corresponding to 167 distinct patients
Each possibility evaluated using 2000 iterations of K-means to quantify co-clustering
Nuclear Features Used to Classify GBMs
3 2 1
20 40 60 80 100 120 140 160
20
40
60
80
100
120
140
1602 3 4 5 6 725
30
35
40
45
50
# Clusters
Silh
ouet
te A
rea
0 0.5 1
1
2
3
Silhouette Value
Clu
ster
Cen
ter
for
Com
pre
hen
sive In
form
ati
cs
Clustering identifies three morphological groups• Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides)• Named for functions of associated genes:
Cell Cycle (CC), Chromatin Modification (CM),
Protein Biosynthesis (PB)• Prognostically-significant (logrank p=4.5e-4)
Featu
re I
ndic
es
CC CM PB
10
20
30
40
500 500 1000 1500 2000 2500 3000
0
0.2
0.4
0.6
0.8
1
Days
Sur
viva
l
CC
CM
PB
Cen
ter
for
Com
pre
hen
sive In
form
ati
cs
• Cox proportional hazards– Gene expression class not significant p=0.58– Morphology clustering p=5.0e-3
Gene Expression Class Associations
CC CM PB0
20
40
60
80
100
Cluster
Sub
type
Per
cent
age
(%)
Classical
MesenchymalNeural
Proneural
Cen
ter
for
Com
pre
hen
sive In
form
ati
cs
Clustering Validation
• Separate set of 84 GBMs from Henry Ford Hospital• ClusterRepro: CC p=7.2e-3, CM p=1.3e-2
Fea
ture
Ind
ices
CC Mixed CM
10
20
30
40
50
0 20 40 60 80 1000
0.2
0.4
0.6
0.8
1
Months
Sur
viva
l
CCMixedCM
Cen
ter
for
Com
pre
hen
sive In
form
ati
cs
Associations
Novel Pathology Modalities
ImagingExcellent Spatial Resolution
Limited Molecular Resolution
GenomicsExcellent Molecular Resolution
Limited Spatial Resolution
100
0’s
of g
en
es
Quantum Dots
Professor Robin Bostick
Imaging Pipeline – Feature Extraction
Example Application: Cancer Stem Cell Niche
• Cancer stem cells– Rare(?), proliferative cells, regenerative– Do they prefer to live near blood vessels, or necrosis?
Cen
ter
for
Com
pre
hen
sive In
form
ati
cs
• Leverage exascale data and computer resources to squeeze the most out of image, sensor or simulation data
• Run lots of different algorithms to derive same features
• Run lots of algorithms to derive complementary features
• Data models and data management infrastructure to manage data products, feature sets and results from classification and machine learning algorithms
Extreme Spatio-Temporal Sensor Data Analytics
Cen
ter
for
Com
pre
hen
sive In
form
ati
cs
Application Targets
• Multi-dimensional spatial-temporal datasets– Microscopy image analyses
– Biomass monitoring using satellite imagery
– Weather prediction using satellite and ground sensor data
– Large scale simulations
• Can we analyze 100,000+ microscopy images per hour?
• Correlative and cooperative analysis of data from multiple sensor modalities and sources
• What-if scenarios and multiple design choices or initial conditions
Cen
ter
for
Com
pre
hen
sive In
form
ati
cs
Biomass Monitoring (joint with ORNL)• Investigate changes in vegetation and land use• Hierarchical, multi-resolution coarse/fine-grained analytics into a
unified framework • Changes identified using high temporal/low spatial resolution MODIS
data• Segmentation and classification methods used to characterize
changes using higher resolution data (e.g. multitemporal AWiFS data)
• Segmentation and classification to identify man-made structures.
Cen
ter
for
Com
pre
hen
sive In
form
ati
cs
Core Transformations
• Data Cleaning and Low Level Transformations• Data Subsetting, Filtering, Subsampling• Spatio-temporal Mapping and Registration• Object Segmentation • Feature Extraction, Object Classification• Spatio-temporal Aggregation• Change Detection, Comparison, and Quantification
Extreme DataCutter
DataCutter Pipeline of filters connected though logical streams
In transit processing
Flow control between filters and streams
Developed 1990s-2000s; led to IBM System S
Extreme DataCutter Two level hierarchical pipeline framework
In transit processing
Coarse grained components coordinated by Manager that coordinates work on pipeline stages between nodes
Fine grained pipeline operations managed at the node level
Both levels employ filter/stream paradigm
Cen
ter
for
Com
pre
hen
sive In
form
ati
cs
Extreme DataCutter – Two Level Model
Coarse Grained Level
Cen
ter
for
Com
pre
hen
sive In
form
ati
cs
Node Level Work Scheduling
• Features of Node Level Architectures – Nodes contain CPUs, GPUs– Each CPU contains multiple cores– GPU has complex internal architecture– Data locality within node– Data paths between CPUs and GPUs
Keeneland Node
Cen
ter
for
Com
pre
hen
sive In
form
ati
cs
Node Level Work Scheduling
• Attempt to minimize data movement• Identify and assign operations that perform
well on GPU • Balance load between CPUs and GPUs• Prefetch data• Identify and use high bandwidth CPU/GPU
data paths• Schedule exclusive GPU access for
components (e.g. morphological reconstruction) requiring fine grained parallelism
Cen
ter
for
Com
pre
hen
sive In
form
ati
cs
Node Level Work Scheduling
Fine Grained Level
Cen
ter
for
Com
pre
hen
sive In
form
ati
cs
Brain Tumor Pipeline Scaling on Keeneland (100 Nodes)
Cen
ter
for
Com
pre
hen
sive In
form
ati
cs
Control Structures for Handling Fine Grained/Runtime Dependent Parallelism in GPUs
Morphological Reconstruction: 8-15 Fold speedup vis one CPU core (Intel i7 2.66 GHz) on NVIDIA C2070 and GTX580 GPUs
Cen
ter
for
Com
pre
hen
sive In
form
ati
cs
Large Scale Data Management
Implemented with IBM DB2 for large scale pathology image metadata (~million markups per slide)
Represented by a complex data model capturing multi-faceted information including markups, annotations, algorithm provenance, specimen, etc.
Support for complex relationships and spatial query: multi-level granularities, relationships between markups and annotations, spatial and nested relationships
Highly optimized spatial query and analyses
Spatial Centric – Pathology Imaging “GIS”Point query: human marked point inside a nucleus
.
Window query: return markups contained in a rectangle
Spatial join query: algorithm validation/comparison
Containment query: nuclear featureaggregation in tumor regions
PAIS (Pathology Analytical Imaging Standards)Supported by caBIG, R01 and ACTSI
PAIS Logical Model 62 UML classes
markups, annotations, imageReferences, provenance
PAIS Data Representation XML (compressed) or HDF5
PAIS Databases loading, managing and
querying and sharing data Native XML DBMS or
RDBMS + SDBMS
class Domain Mo...
Annotation
GeometricShape
CalculationObservation
Specimen
ImageReference
Provenance
User
PAIS
EquipmentGroup
AnatomicEntity
Subject
Field
Project
MicroscopyImageReference
DICOMImageReference
TMAImageReference
Markup
Inference
Region
WholeSlideImageReferencePatient
Surface
Collection
AnnotationReference
10..1
1
0..1
0..*
0..*
1
0..*1
0..1
1 0..*
1
0..1
1
0..1
10..1
10..*
1
0..*
0..*
0..*
1 0..11
0..1
1
0..*
0..1
0..*
1
0..*
1
0..1
1
0..*
10..1
10..1
1
0..*
10..*
1 0..*
1
0..*
PAIS
Cen
ter
for
Com
pre
hen
sive In
form
ati
cs
Example Query for Integrative Studies• Find mean nuclear feature vector and covariance on
tumor regions for each patient grouped by tumor subtype
PAIS: Example Queries
SELECT c.pais_uid, pc.subtype, AVG(area), AVG(perimeter), AVG(eccentricity), COVARIANCE(area, perimeter), COVARIANCE(area, eccentricity)FROM pais.calculation_flat c,TCGA.PATIENT_CHARACTERISTIC pc, pais.patient p WHERE p.patientid = pc.patient_id AND p.pais_uid = c.pais_uid GROUP BY c.pais_uid, pc.subtype;
2 1 3 4
50 100 150
20
40
60
80
100
120
140
160
0 0.2 0.4 0.6 0.8 1
1
2
3
4
Silhouette Value
Clu
ste
r
Fea
ture
Ind
ices
10
20
30
40
50
60
70
80
90
100
110
0 500 1000 1500 2000 2500 30000
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
DaysS
urvi
val
Cluster 1
Cluster 2Cluster 3
Cluster 4
Algorithm Validation: Intersection between Two Result Sets (Spatial Join)
PAIS: Example Queries
. .
Cen
ter
for
Com
pre
hen
sive In
form
ati
cs
VLDB 2012
Change Detection, Comparison, and Quantification
C
en
ter
for
Com
pre
hen
siv
e I
nfo
rmati
cs
Summary and Perspective
• Large scale integrative data analytic methods and tools to integrate clinical, molecular, Pathology, Radiology data
• Characterize new cancer subtypes and biomarkers, predict outcome, treatment response
• Algorithms to quantify Pathology classification• HPC/BIGDATA analysis pipelines
C
en
ter
for
Com
pre
hen
siv
e I
nfo
rmati
cs
Importance:
• Computer Science: general approaches to analysis and classification of very large datasets from low dimensional spatio-temporal sensors
• Biomedical: generate basic insights into pathophysiology, clues to new treatments, better ways of evaluating existing treatments and core infrastructure needed for comparative effectiveness research studies
Thanks to:• In silico center team: Dan Brat (Science PI), Tahsin Kurc, Ashish Sharma, Tony Pan, David
Gutman, Jun Kong, Sharath Cholleti, Carlos Moreno, Chad Holder, Erwin Van Meir, Daniel Rubin, Tom Mikkelsen, Adam Flanders, Joel Saltz (Director)
• caGrid Knowledge Center: Joel Saltz, Mike Caliguiri, Steve Langella co-Directors; Tahsin Kurc, Himanshu Rathod Emory leads
• caBIG In vivo imaging team: Eliot Siegel, Paul Mulhern, Adam Flanders, David Channon, Daniel Rubin, Fred Prior, Larry Tarbox and many others
• In vivo imaging Emory team: Tony Pan, Ashish Sharma, Joel Saltz• Emory ATC Supplement team: Tim Fox, Ashish Sharma, Tony Pan, Edi Schreibmann, Paul
Pantalone• Digital Pathology R01: Foran and Saltz; Jun Kong, Sharath Cholleti, Fusheng Wang, Tony
Pan, Tahsin Kurc, Ashish Sharma, David Gutman (Emory), Wenjin Chen, Vicky Chu, Jun Hu, Lin Yang, David J. Foran (Rutgers)
• NIH/in silico TCGA Imaging Group: Scott Hwang, Bob Clifford, Erich Huang, Dima Hammoud, Manal Jilwan, Prashant Raghavan, Max Wintermark, David Gutman, Carlos Moreno, Lee Cooper, John Freymann, Justin Kirby, Arun Krishnan, Seena Dehkharghani, Carl Jaffe
• ACTSI Biomedical Informatics Program: Marc Overcash, Tim Morris, Tahsin Kurc, Alexander Quarshie, Circe Tsui, Adam Davis, Sharon Mason, Andrew Post, Alfredo Tirado-Ramos
• NSF Scientific Workflow Collaboration: Vijay Kumar, Yolanda Gil, Mary Hall, Ewa Deelman, Tahsin Kurc, P. Sadayappan, Gaurang Mehta, Karan Vahi
Thanks!