
Transcript
Page 1: Data-Intensive Computing Symposium. Data-Rich Computing: Where It's At. Phillip B. Gibbons, Intel Research Pittsburgh. March 26, 2008.

Data-Intensive Computing Symposium

Data-Rich Computing: Where It's At

Phillip B. Gibbons, Intel Research Pittsburgh

Data-Intensive Computing Symposium, March 26, 2008

Some slides are borrowed from Jason Campbell, Shimin Chen, Suman Nath, and Steve Schlosser. Remaining slides are © Phillip B. Gibbons.

Page 2:

Phillip B. Gibbons, Data-Intensive Computing Symposium

– Particle Physics: Large Hadron Collider (15PB)
– Human Genomics (7000PB; 1GB/person): 200PB+ captured, 200% CAGR
– World Wide Web (~1PB)
– Wikipedia (10GB): 100% CAGR
– Internet Archive (1PB+)
– Typical Oil Company (350TB+)
– Estimated On-line RAM in Google (8PB)
– Personal Digital Photos (1000PB+): 100% CAGR
– 200 of London's Traffic Cams (8TB/day)
– 2004 Walmart Transaction DB (500TB)
– Annual Email Traffic, no spam (300PB+)
– Merck Bio Research DB (1.5TB/qtr)
– One Day of Instant Messaging in 2002 (750GB)
– Terashake Earthquake Model of LA Basin (1PB)
– MIT Babytalk Speech Experiment (1.4PB)
– UPMC Hospitals Imaging Data (500TB/yr)

Total digital data to be created this year: 270,000PB (IDC)

Page 3:

Data-Rich Computing: Thriving in a World Awash with Data. A sampling of the projects @ Intel Research:

– Everyday Sensing & Perception (ESP): 15MB today, 100s of GB soon
– Cardiac CT: 4GB per 3D scan, 1000s of scans/year
– Terashake Sims: ~1PB for LA basin
– Object Recognition: GBs today, TBs needed

Page 4:

Goal: Sample entire region at 10m resolution

6×10^4 × 3×10^4 × 1×10^4 = 18×10^12 sample points!

~1 PB of data uncompressed
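The sample-point count above follows directly from the region dimensions; the arithmetic can be checked in a few lines. The bytes-per-sample figure below is an assumption chosen to illustrate how ~18×10^12 points reach roughly a petabyte; the talk does not state the per-sample record size:

```python
# Region: 600 km x 300 km x 100 km deep, sampled every 10 m.
nx = 600_000 // 10   # 6 x 10^4 points along the 600 km axis
ny = 300_000 // 10   # 3 x 10^4 along the 300 km axis
nz = 100_000 // 10   # 1 x 10^4 along the 100 km (depth) axis
points = nx * ny * nz
print(f"{points:.1e} sample points")   # 1.8e+13, i.e. 18 x 10^12

# Hypothetical per-sample record size (not from the talk):
BYTES_PER_SAMPLE = 56
petabytes = points * BYTES_PER_SAMPLE / 1e15
print(f"~{petabytes:.2f} PB uncompressed")
```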

Image credit: Amit Chourasia, Visualization Services, SDSC

600 km × 300 km × 100 km deep (SCEC ground model)

Building ground models of Southern California

Steve Schlosser, Michael Ryan, Dave O’Hallaron (IRP)

Page 5:

Harvard ground model

Time to Build: SCEC model ~1 day; Harvard model ~6 hours

Cluster: 50 8-core blades, 8GB memory, 300GB disk

Page 6:

Data-Rich Computing: Where It’s At

Important, interesting, exciting research area

Cluster approach: computing is co-located where the storage is at

Memory hierarchy issues: where the (intermediate) data are at, over the course of the computation

Pervasive multimedia sensing: processing & querying must be pushed out of the data center to where the sensors are at

I know where it’s at, man!

Focus of this talk:

Page 7:

Memory Hierarchy (I): CMP Architecture

Shared H/W resources:
– On-chip cache
– Off-chip pin bandwidth

[Diagram: processor chip with several cores, each with a private L1 cache, connected by an interconnect to a (distributed) shared L2 cache; off-chip main memory has longer latency and lower bandwidth]

Page 8:

Memory Hierarchy (II): CMPs, Memories & Disks on a LAN

[Diagram: cluster nodes on a LAN, each with memory plus an SSD (flash) and/or a magnetic disk]

Cluster:
– Orders of magnitude differences in latency & bandwidth among the levels
– Differing access characteristics:
  – Quirks of disk
  – Quirks of flash
  – Quirks of cache coherence

Moreover, can have WAN of such Clusters

Page 9:

Hierarchy-Savvy Parallel Algorithm Design (HI-SPADE) project

Goal: Support a hierarchy-savvy model of computation for parallel algorithm design

Hierarchy-savvy:
– Hide what can be hidden
– Expose what must be exposed
– Sweet spot between ignorant and fully aware

Support:
– Develop the compilers, runtime systems, architectural features, etc. to realize the model
– Important component: fine-grain threading

Page 10:

HI-SPADE project: Initial Progress

Effectively Sharing a Cache among Threads [Blelloch & Gibbons, SPAA'04]

– First thread scheduling policy (PDF) with provably good shared-cache performance, relative to sequential cache performance, for any parallel computation
– Hierarchy-savvy: automatically get good shared-cache performance from good sequential cache performance

[Diagram: a single processor with an L2 cache and main memory vs. a CMP whose processors share an L2 cache and main memory, with PDF scheduling]
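The intuition behind PDF can be sketched in a few lines: among all ready tasks, PDF always runs the ones that come earliest in the sequential depth-first execution, so parallel threads touch data the sequential run would have touched close together in time (and hence that fits in the shared cache). The toy scheduler below illustrates only that priority rule; the task graph and names are invented, and it ignores everything a real runtime (or the SPAA'04 analysis) must handle:

```python
# Toy fork-join DAG: a complete binary tree of tasks; children become
# ready once their parent has executed.
def children(task, depth, max_depth):
    return [] if depth == max_depth else [2 * task + 1, 2 * task + 2]

def df_order(task=0, depth=0, max_depth=3, out=None):
    # Sequential depth-first execution order: the PDF priority.
    out = [] if out is None else out
    out.append(task)
    for c in children(task, depth, max_depth):
        df_order(c, depth + 1, max_depth, out)
    return out

def pdf_schedule(P, max_depth=3):
    rank = {t: i for i, t in enumerate(df_order(max_depth=max_depth))}
    depth_of = {0: 0}
    ready, steps = [0], []
    while ready:
        # PDF rule: of all ready tasks, run the P earliest in DF order.
        ready.sort(key=rank.get)
        batch, ready = ready[:P], ready[P:]
        steps.append(batch)
        for t in batch:
            for c in children(t, depth_of[t], max_depth):
                depth_of[c] = depth_of[t] + 1
                ready.append(c)
    return steps

for step in pdf_schedule(P=2):
    print(step)
```

Note how the schedule dives down the leftmost subtree before widening, mirroring the sequential order, where a work-stealing scheduler would instead spread processors across distant subtrees.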

Page 11:

Example: Parallel Merging in Merge Sort

[Figure: per-element cache behavior (cache miss / cache hit / mixed) for Parallel Depth First (PDF) vs. Work Stealing (WS), P=8, with the shared cache sized at 0.5 × (source array size + destination array size)]

Page 12:

HI-SPADE: Initial Progress (II)

Scheduling Threads for Constructive Cache Sharing on CMPs [Chen et al, SPAA’07]

– Exposes differences between theory result & practice

– Provides an automatic tool to select task granularity

[Figure: results for LU, Merge Sort, and Hash Join, Work Stealing (WS) vs. Parallel Depth First (PDF), on simulated CMPs]

Page 13:

HI-SPADE: Initial Progress (III)

Provably Good Multicore Cache Performance for Divide-and-Conquer Algorithms [Blelloch et al, SODA’08]

– First model considering both shared & private caches

– Competing demands: share vs. don’t share

– Hierarchy-savvy: Thread scheduling policy achieves provably-good private-cache & shared-cache performance, for divide-and-conquer algorithms

[Diagram: a single processor with an L2 cache and main memory vs. a CMP with per-core L1 caches, a shared L2 cache, and main memory]

Page 14:

HI-SPADE: Initial Progress (IV)

Online Maintenance of Very Large Random Samples on Flash Storage [Nath & Gibbons, submitted]

– Flash-savvy algorithm (B-File) is 3 orders of magnitude faster & more energy-efficient than previous approaches

– It is well known that random writes are slow on flash; we show that a subclass of "semi-random" writes is fast
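A rough way to picture "semi-random" writes (this is a paraphrase of the idea, not the actual B-File algorithm): a write may go to an arbitrarily chosen erase block, but within each block writes are strictly appends, an access pattern flash handles nearly as well as fully sequential I/O. A minimal sketch with invented names:

```python
import random

class SemiRandomLog:
    """Items are routed to a randomly chosen block (the 'random' part),
    but each block is append-only (the sequential part flash likes)."""

    def __init__(self, num_blocks):
        self.blocks = [[] for _ in range(num_blocks)]

    def append(self, item):
        b = random.randrange(len(self.blocks))  # random block choice...
        self.blocks[b].append(item)             # ...but append-only inside it

    def discard_block(self, b):
        # Whole-block erase: the one cheap "delete" flash offers.
        self.blocks[b] = []

log = SemiRandomLog(num_blocks=8)
for i in range(1000):
    log.append(i)
print(sum(len(b) for b in log.blocks))  # 1000 items stored across 8 blocks
```

The contrast is with fully random writes, which would overwrite arbitrary positions inside blocks and force expensive read-modify-erase-write cycles.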

Progress thus far is only the tip of the iceberg: still far from our HI-SPADE goal!

Springboard for a more general study of flash-savvy algorithms based on semi-random writes (in progress)

[Image: Lexar CF card]

Page 15:

Data-Rich Computing: Where It’s At

Important, interesting, exciting research area

Cluster approach: computing is co-located where the storage is at

Memory hierarchy issues: where the (intermediate) data are at, over the course of the computation

Pervasive multimedia sensing: processing & querying must be pushed out of the data center to where the sensors are at

I know where it’s at, man!

Page 16:

Pervasive Multimedia Sensing

Rich collection of (cheap) sensors:
– Cameras, microphones, RFID readers, vibration sensors, etc.

Internet-connected. Potentially Internet-scale:
– Tens to millions of sensor feeds over wide-area
– Pervasive broadband (wired & wireless)

Goal: Unified system for accessing, filtering, processing, querying, & reacting to sensed data
– Programmed to provide useful sensing services

Page 17:

Example Multimedia Sensing Services

Consumer services:

Parking Space Finder

Lost & Found / Lost pet

Watch-my-child / Watch-my-parent

Congestion avoidance

Page 18:

Example Multimedia Sensing Services

Health, Security, Commerce, and Science services:

• Internet-scale Sensor Observatories

• Homeland Security

• Asset/Supply Chain Tracking

Our prototype

• Low Atmosphere Climate Monitoring

• Epidemic Early Warning System

Page 19:

Data & Query Scaling Challenges

Data scaling

– Millions of sensors

– Globally-dispersed

– High volume feeds

– Historical data

Query scaling

– May want sophisticated data processing on all sensor feeds

– May aggregate over large quantities of data, use historical data, run continuously

– Want latest data, NOW

NetRad: 100Mb/s

Page 20:

IrisNet: Internet-scale Resource-intensive Sensor Network services

General-purpose architecture for wide-area sensor systems:
– A worldwide sensor web

Key Goal: Ease of service authorship
– Provides important functionality for all services

Intel Research Pittsburgh + many CMU collaborators:
– First prototype in late 2002
– In ACM Multimedia, BaseNets, CVPR, DCOSS, Distributed Computing, DSC, FAST, NSDI (2), Pervasive Computing, PODC, SenSys, SIGMOD (2), ToSN

Page 21:

Data & Query Scaling in IrisNet

Store sensor feeds locally:
– Too much data to collect centrally

Push data processing & filtering to sensor nodes:
– Reduce the raw data to derived info, in parallel near the source

Push (distributed) queries to sensor nodes:
– Data sampled ≫ data queried
– Tied to a particular place: queries often local

Exploit logical hierarchy of sensor data:
– Compute answers in-network

Processing & querying must be pushed out of the data center to where the sensors are at

Page 22:

IrisNet's Two-Tier Architecture

Two components:
– SAs (Sensing Agents): sensor feed processing, via downloaded senselets
– OAs (Organizing Agents): distributed XML database

[Diagram: sensors feed SAs running senselets; OAs each hold an XML database partition; a user's query arrives via the web server for the service's URL and is answered by the OAs]

Page 23:

Creating a New IrisNet Service

The service author provides:
– Senselet (program to filter sensor data)
– Extended code (application-specific aggregation)
– Hierarchy (XML schema)
– Front-end

Queries use a standard DB language against the OAs; senselets run image-processing steps at the SAs, send derived data to an OA, which updates the DB.

Only 500 lines of new code for Parking Space Finder, vs. 30K lines of IrisNet code

Research focus: Fault Tolerance

Page 24:

Data-Rich Computing: Where It’s At

Important, interesting, exciting research area

Cluster approach: computing is co-located where the storage is at

Memory hierarchy issues: [HI-SPADE] where the (intermediate) data are at, over the course of the computation

Pervasive multimedia sensing: [IrisNet] processing & querying must be pushed out of the data center to where the sensors are at

I know where it’s at, man!

Page 25:

Backup Slides

Page 26:

Techniques for Privacy Protection

Cameras raise huge privacy concerns:
– Used to it in London; protests in Chicago

Viewed by law enforcement vs. viewed by public

• IrisNet Goal: Exploit processing at the sensor node to implement privacy policies

• Privileged senselet that detects & masks faces

• All other senselets only see masked version

Only tip of the iceberg

Page 27:

Data Organized as Logical Hierarchy

<State id="Pennsylvania">
  <County id="Allegheny">
    <City id="Pittsburgh">
      <Neighborhood id="Oakland">
        <total-spaces>200</total-spaces>
        <Block id="1">
          <GPS>…</GPS>
          <pSpace id="1">
            <in-use>no</in-use>
            <metered>yes</metered>
          </pSpace>
          <pSpace id="2"> … </pSpace>
        </Block>
      </Neighborhood>
      <Neighborhood id="Shadyside"> … </Neighborhood>
      …
    </City>
  </County>
</State>

Example XML Hierarchy

IrisNet automaticallypartitions the hierarchy

among the OAs
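To make the hierarchy concrete, here is how a service front-end might query one fragment of it for free parking spaces, using Python's standard ElementTree and its limited XPath subset (the fragment and query are illustrative only; IrisNet itself used a distributed XML database spread across OAs):

```python
import xml.etree.ElementTree as ET

# A single-OA fragment of the parking hierarchy from the slide.
fragment = """
<Neighborhood id="Oakland">
  <Block id="1">
    <pSpace id="1"><in-use>no</in-use><metered>yes</metered></pSpace>
    <pSpace id="2"><in-use>yes</in-use><metered>no</metered></pSpace>
  </Block>
</Neighborhood>
"""

root = ET.fromstring(fragment)
# XPath-style predicate: all parking spaces not currently in use.
free = root.findall(".//pSpace[in-use='no']")
print([s.get("id") for s in free])  # ['1']
```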

Page 28:

In-Network Query Processing: Query-Evaluate-Gather (QEG)

/NE/PA/Allegheny/Pittsburgh/(Oakland | Shadyside) / rest of query

The Pittsburgh OA receives query Q:

1. Queries its XML DB: discovers Shadyside data is cached, but not Oakland; does a DNS lookup to find the IP address of the Oakland OA
2. Evaluates the result
3. Gathers the missing data by sending Q' to the Oakland OA, which runs QEG on Q'

Q': /NE/PA/Allegheny/Pittsburgh/Oakland/ rest of query

Combines results & returns. IrisNet's approach.
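The QEG flow on this slide can be sketched as a recursive function. All names below are invented; real IrisNet partitions an XML database across OAs, discovers owners via DNS, and caches remote data, none of which is modeled here:

```python
# Each OA owns some parts of the logical hierarchy; a query names the
# parts it needs. Query-Evaluate-Gather at one OA:
OWNERS = {                            # hypothetical placement of data at OAs
    "Pittsburgh": {"Shadyside"},
    "Oakland-OA": {"Oakland"},
}
LOOKUP = {"Oakland": "Oakland-OA"}    # stand-in for the DNS lookup

def qeg(oa, wanted):
    local = OWNERS[oa] & wanted              # 1. Query the local XML DB
    missing = wanted - local                 # 2. Evaluate: what is absent?
    gathered = {}
    for part in missing:                     # 3. Gather: forward a subquery
        gathered.update(qeg(LOOKUP[part], {part}))
    answer = {p: f"data({p})" for p in local}
    answer.update(gathered)                  # combine results & return
    return answer

print(qeg("Pittsburgh", {"Oakland", "Shadyside"}))
```

Running the example reproduces the slide's scenario: Pittsburgh answers the Shadyside part from its own DB and gathers the Oakland part from the Oakland OA.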