Data-Intensive Computing Symposium
Data-Rich Computing: Where It’s At
Phillip B. Gibbons
Intel Research Pittsburgh
Data-Intensive Computing Symposium
March 26, 2008
Some slides are borrowed from Jason Campbell, Shimin Chen, Suman Nath,
and Steve Schlosser. Remaining slides are © Phillip B. Gibbons
*
Data-Rich Computing: Where It’s At
Abstract: To use a phrase popularized in the sixties, data-rich (or data-intensive) computing is “where it’s at”. That is, it’s an important, interesting, exciting research area. Significant efforts are underway to understand the essential truths of data-rich computing, i.e., to know where it’s at. Google-style clusters ensure that computing is co-located where the storage is at. In this talk, we consider two further issues raised by “where it’s at”. First, we highlight our efforts to support a high-level model of computation for parallel algorithm design and analysis, with the goal of hiding most aspects of the cluster’s deep memory hierarchy (where the data is at, over the course of the computation) without unduly sacrificing performance. Second, we argue that the most compelling data-rich applications often involve pervasive multimedia sensing. The real-time, in situ nature of these applications reveals a fundamental limitation of the cluster approach: Computing must be pushed out of the machine room and into the world, where the sensors are at. We highlight our work addressing several key issues in pervasive sensing, including techniques for distributed processing and querying of a world-wide collection of multimedia sensors.
*
[Slide: examples of data-rich sources, with sizes ranging from hundreds of GB to hundreds of PB]
Total digital data to be created this year: 270,000 PB (IDC)
200 of London’s traffic cams: 8 TB/day
*
Data-Rich Computing @ Intel Research
AT: big data
Object Recognition: GB today, TB needed
Eyal Krupka (IRI): 1 video camera now, wants to do 10 cams in real time, based on 30K categories, 1,000-10,000 instances per training, 30M images per training set
2 sec/frame -> ms/frame
*
Building ground models: SCEC ground model, 600 km × 300 km × 100 km
6×10^4 × 3×10^4 × 1×10^4 = 18×10^12 sample points!
~1 PB of data uncompressed
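A quick check of the arithmetic (illustrative; the bytes-per-sample figure is inferred from the slide's numbers, not stated on it):

```python
# Sanity-check the slide's numbers (illustrative; bytes/sample is inferred).
samples = 6e4 * 3e4 * 1e4             # sample points spanning the 600 x 300 x 100 km region
print(f"{samples:.1e} sample points")                # 1.8e+13, i.e. 18x10^12

uncompressed = 1e15                    # ~1 PB uncompressed, per the slide
print(f"~{uncompressed / samples:.0f} bytes per sample point")   # ~56 bytes
```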
*
*
James Hays, Alexei Efros, Julio Lopez (CMU), Steve Schlosser (IRP)
Goal: geographically localize new images by matching against others with similar features
Processed 1TB corpus from Flickr.com (7M JPEGs)
All images were tagged with GPS coordinates
Generated several features for each image using Matlab
Color histograms, Texton histograms, Line features, gist descriptor, etc.
72 hours total using Maui/Torque on 50 8-core blades
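As a rough sketch of what one such per-image feature looks like, here is an illustrative color histogram in Python; the project's actual pipeline was in Matlab, and the bin count, normalization, and matching metric below are assumptions:

```python
# Illustrative color-histogram feature (a sketch, not the project's Matlab code).
import numpy as np

def color_histogram(image, bins=8):
    """image: H x W x 3 uint8 array. Returns a normalized joint RGB histogram
    (bins**3 values) usable as a feature vector for nearest-neighbor matching."""
    q = (image.astype(np.int64) * bins) // 256          # quantize each channel to 0..bins-1
    idx = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]
    hist = np.bincount(idx.ravel(), minlength=bins ** 3).astype(np.float64)
    return hist / hist.sum()

def l1_distance(h1, h2):
    """Simple histogram distance; candidate GPS locations would come from the
    most similar GPS-tagged images in the corpus."""
    return np.abs(h1 - h2).sum()
```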
*
Important, interesting, exciting
Focus of this talk:
Memory hierarchy issues: where the (intermediate) data are at, over the course of the computation
Pervasive multimedia sensing: processing & querying must be pushed out of the data center to where the sensors are at
*
Memory
Given fixed silicon real estate, there is a tradeoff between cache area and number of cores.
*
Cluster memory hierarchy: per node, memory plus magnetic disk and/or SSD (flash)
Orders of magnitude differences in latency & bandwidth among the levels
Differing access characteristics: quirks of disk, quirks of flash
Given fixed silicon real estate, there is a tradeoff between cache area and number of cores.
*
Hierarchy-savvy: sweet spot between hierarchy-ignorant and hierarchy-fully-aware
High-level model of computation for parallel algorithm design
Architectural features, etc., to realize the model
Important component: fine-grain threading
*
Effectively Sharing a Cache among Threads [Blelloch & Gibbons, SPAA’04]
First thread scheduling policy (PDF) with provably-good shared-cache performance, w.r.t. sequential cache performance, for any parallel computation
[Figure: shared-cache misses with PDF vs. number of processors P]
*
Parallel Depth First (PDF): example schedule with P=8
[Figure: cache hits vs. cache misses under the PDF schedule]
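The intuition can be sketched as a toy greedy scheduler (an illustration, not the paper's algorithm): ready tasks are prioritized by their position in the sequential depth-first execution, so the P processors work on tasks that are close together in that order and share the cache constructively.

```python
# Toy Parallel Depth First (PDF) scheduler (an illustration, not the paper's code).
import heapq

def pdf_schedule(tasks, deps, seq_order, P):
    """tasks: list of task ids; deps: dict task -> set of predecessors;
    seq_order: dict task -> index in the sequential depth-first execution;
    P: number of processors. Returns [(step, processor, task), ...]."""
    indeg = {t: len(deps.get(t, ())) for t in tasks}
    succs = {t: [] for t in tasks}
    for t, preds in deps.items():
        for p in preds:
            succs[p].append(t)
    ready = [(seq_order[t], t) for t in tasks if indeg[t] == 0]
    heapq.heapify(ready)
    schedule, step, done = [], 0, 0
    while done < len(tasks):
        # Each step, hand the P ready tasks earliest in sequential order to the P processors.
        batch = [heapq.heappop(ready) for _ in range(min(P, len(ready)))]
        for proc, (_, t) in enumerate(batch):
            schedule.append((step, proc, t))
            for s in succs[t]:
                indeg[s] -= 1
                if indeg[s] == 0:
                    heapq.heappush(ready, (seq_order[s], s))
        done += len(batch)
        step += 1
    return schedule
```

For a task tree, seq_order is simply the order in which a one-processor depth-first execution would visit the tasks.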
*
HI-SPADE: Initial Progress (II)
Scheduling Threads for Constructive Cache Sharing on CMPs [Chen et al, SPAA’07]
Exposes differences between theory result & practice
Provides an automatic tool to select task granularity
Work Stealing (WS) vs. Parallel Depth First (PDF); simulated CMPs
[Figures: cache behavior of Merge Sort and Hash Join under WS vs. PDF]
*
First model considering both shared & private caches
Competing demands: share vs. don’t share
Hierarchy-savvy: Thread scheduling policy achieves provably-good private-cache & shared-cache performance, for divide-and-conquer algorithms
[Figure: multicore with private caches, a shared L2 cache, and main memory]
*
HI-SPADE: Initial Progress (IV)
Online Maintenance of Very Large Random Samples on Flash Storage [Nath & Gibbons, submitted]
Flash-savvy algorithm (B-File) is 3 orders of magnitude faster & more energy-efficient than previous approaches
Well-known that random writes are slow on flash; we show a subclass of “semi-random” writes are fast
Progress thus far is only the tip of the iceberg:
Still far from our HI-SPADE goal!
Springboard for a more general study of flash-savvy algorithms
[Figure: energy comparison, B-File vs. previous approaches]
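A toy sketch of the "semi-random" write pattern (an illustration of the idea only, not the B-File algorithm from the paper): a write may go to any of a small set of open blocks, but pages within each block are written strictly in order, which flash handles nearly as well as purely sequential writes.

```python
# Illustrative "semi-random" writer (not the B-File algorithm itself).
import random

PAGES_PER_BLOCK = 64   # assumed flash geometry, for illustration only

class SemiRandomWriter:
    def __init__(self, num_open_blocks=16):
        # next free page within each currently open block
        self.next_page = [0] * num_open_blocks

    def write(self, record):
        b = random.randrange(len(self.next_page))   # block chosen at random...
        p = self.next_page[b]                       # ...but pages within it written in order
        self.next_page[b] += 1
        if self.next_page[b] == PAGES_PER_BLOCK:    # block full: stand-in for taking a fresh erased block
            self.next_page[b] = 0
        return (b, p)    # (block, page) this record would land on
```

A fully random writer would instead scatter pages across all blocks, which is the slow case on flash.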
*
Important, interesting, exciting
Memory hierarchy issues: where the (intermediate) data are at, over the course of the computation
Pervasive multimedia sensing: processing & querying must be pushed out of the data center to where the sensors are at
I know where it’s at, man!
*
Cameras, microphones, RFID readers, vibration sensors, etc.
Internet-connected. Potentially Internet-scale
Pervasive broadband (wired & wireless)
Goal: Unified system for accessing, filtering, processing, querying, & reacting to sensed data
Programmed to provide useful sensing services
*
*
Internet-scale Sensor Observatories
Asset/Supply Chain Tracking
*
May want sophisticated data processing on all sensor feeds
May aggregate over large quantities of data, use historical data, run continuously
Want latest data, NOW
*
IrisNet: A worldwide sensor web
First prototype in late 2002
Publications in ACM Multimedia, BaseNets, CVPR, DCOSS, Distributed Computing, DSC, FAST, NSDI (2), Pervasive Computing, PODC, SenSys, SIGMOD (2), ToSN
*
Store sensor feeds locally
Push data processing & filtering to sensor nodes
Reduce the raw data to derived info, in parallel near source
Push (distributed) queries to sensor nodes
Data sampled >> data queried
Exploit logical hierarchy of sensor data
Compute answers in-network
Processing & querying pushed out of the data center to where the sensors are at
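A rough illustration of computing answers in-network (a toy sketch, not IrisNet code): each node reduces its raw feed to derived info locally, and only small aggregates flow up the logical hierarchy.

```python
# Toy in-network aggregation over the logical hierarchy (not IrisNet code).
def aggregate(node):
    """node: {'local_count': value derived locally from raw sensor data,
              'children': list of child nodes}.
    Only one small number crosses each link, never the raw feeds."""
    total = node.get("local_count", 0)
    for child in node.get("children", []):
        total += aggregate(child)
    return total

oakland = {"local_count": 12}            # e.g., free spaces seen by local cameras
shadyside = {"local_count": 7}
pittsburgh = {"children": [oakland, shadyside]}
print(aggregate(pittsburgh))             # 19
```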
*
*
Senselet for Parking Space Finder
Research focus: Fault Tolerance
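A minimal sketch of what such a senselet might look like, assuming (as in the IrisNet papers) that a senselet is application code uploaded to the sensing node to turn the raw camera feed into application-level data; the function and parameter names below are hypothetical:

```python
# Hypothetical senselet sketch (names are illustrative, not IrisNet APIs).
def parking_senselet(frame, space_regions, is_occupied):
    """frame: one camera image; space_regions: pixel regions of known spaces;
    is_occupied(frame, region): vision routine returning True/False.
    Returns only the small derived record that is published upstream."""
    free = [i for i, region in enumerate(space_regions)
            if not is_occupied(frame, region)]
    return {"free-count": len(free), "free-spaces": free}
```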
*
Important, interesting, exciting
Memory hierarchy issues [HI-SPADE]: where the (intermediate) data are at, over the course of the computation
Pervasive multimedia sensing: [IrisNet] processing & querying must be pushed out of the data center to where the sensors are at
I know where it’s at, man!
*
*
Used to it in London; Chicago protest
Viewed by law enforcement vs. viewed by public
IrisNet Goal: Exploit processing at the sensor node to implement privacy policies
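A hedged sketch of the kind of viewer-dependent policy this enables (an illustration only; the roles and the blurring step are assumptions, not IrisNet's implementation):

```python
# Illustrative privacy policy enforced at the sensor node, before data leaves it.
def apply_privacy_policy(frame, viewer_role, blur_identifying_detail):
    """Decide what leaves the node, depending on who is asking."""
    if viewer_role == "law-enforcement":          # e.g., with proper authorization
        return frame                               # full feed
    if viewer_role == "public":
        return blur_identifying_detail(frame)      # identifying detail removed locally
    return None                                    # default: deny
```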
*
<State id="Pennsylvania">
  <County id="Allegheny">
    <City id="Pittsburgh">
      <Neighborhood id="Oakland">
        <total-spaces>200</total-spaces>
        <Block id="1">
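Queries are posed against this distributed XML hierarchy; the IrisNet papers use XPath. The snippet below is only a local, illustrative lookup with Python's ElementTree, not the system's distributed query path:

```python
# Illustrative XPath-style lookup over the hierarchy (local only, for exposition).
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<State id="Pennsylvania">
  <County id="Allegheny">
    <City id="Pittsburgh">
      <Neighborhood id="Oakland">
        <total-spaces>200</total-spaces>
      </Neighborhood>
    </City>
  </County>
</State>
""")

# How many parking spaces does Oakland have in total?
print(doc.find(".//Neighborhood[@id='Oakland']/total-spaces").text)   # 200
```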
*
Discovers IP addr for Oakland OA
2. Evaluate the result by sending Q’ to the Oakland OA
Discovers Shadyside data
Combines results & returns