Data-Intensive Computing Symposium

Data-Intensive Computing Symposium: Report Out

Phillip B. Gibbons

Intel Research Pittsburgh

Data-Intensive Computing Symposium

Held 3/26/08 at Yahoo! in Sunnyvale, CA

Sponsored by:

– Yahoo! Research

– Computing Community Consortium supports the computing research community in creating compelling research visions and the mechanisms to realize these visions (http://www.cra.org/ccc/)

~100 invited attendees, ~12 invited talks

Slides and video to be posted on CCC web site

Blog: http://dita.ncsa.uiuc.edu/xllora (thanks!)

Randy Bryant (CMU): Data-Intensive Scalable Computing

Local speaker; I’ll skip in interest of time

DISC has been renamed

ChengXiang Zhai (UIUC): Text Information Management

ChengXiang Zhai (UIUC): Proposal 1: Maximum Personalization

Dan Reed (Microsoft): Clouds and ManyCore: The Revolution

Big Data: Should focus more on the user experience

How to manage resources

Cloud computing can help organically orchestrate resources on demand

Initiative to bring academics, business, and users together under the big data problem (PCAST NITRD review)

Jill Mesirov (Broad Institute): Computational Paradigms for Genomic Medicine

Broad has 4.8K processors and 1.4 PB of storage on site

Big Data Problem: Mining genome expression arrays

– Row: patients; Column: genes; Value: expression values

– Example: classify leukemias based on expression arrays

– Solved by grad student over the weekend using web sources

Challenge: Computation/Analysis/Provenance infrastructure needed

– Developed GenePattern 3.1: Software infrastructure for interoperable informatics

– Usable by biologists
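
The expression-array layout on this slide (rows are patients, columns are genes, values are expression levels) can be illustrated with a small classifier. This is only a sketch under stated assumptions: the matrix, the labels `type_A`/`type_B`, and the nearest-centroid rule are all hypothetical and are not taken from the talk or from GenePattern.

```python
from math import dist  # Euclidean distance, Python 3.8+

def centroids(rows, labels):
    """Average the expression profiles of the patients in each class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def classify(profile, class_centroids):
    """Assign a new patient to the class whose centroid is closest."""
    return min(class_centroids, key=lambda label: dist(profile, class_centroids[label]))

# Hypothetical toy data: 4 patients (rows) x 3 genes (columns).
expression = [
    [9.1, 1.2, 0.8],
    [8.7, 0.9, 1.1],
    [1.0, 7.9, 8.3],
    [1.4, 8.2, 7.7],
]
labels = ["type_A", "type_A", "type_B", "type_B"]

model = centroids(expression, labels)
prediction = classify([8.0, 1.5, 1.0], model)
```

Real expression arrays have thousands of gene columns, so a practical pipeline would normally add feature selection and cross-validation before trusting any such classifier.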

Garth Gibson (CMU): Simplicity and Complexity in Data Systems at Scale

Petascale Data Storage Institute

Understanding disk failures, cfdr.usenix.org

Another local speaker, so I’ll skip in interest of time

Jeff Dean (Google): Handling Large Datasets at Google

GFS Usage

Jon Kleinberg (Cornell): Large-Scale Social Network Data

Diffusion in Social Networks

Why is chain letter diffusion so deep & narrow?

Iraq war authorization protest chain letter diffusion (18K nodes)

Marc Najork (Microsoft Research): Mining the Web Graph

Scalable Hyperlink Store: used internally within MSR, for web graphs

Query-dependent link-based ranking algorithms (HITS, SALSA)
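
As a rough illustration of what a query-dependent link-based ranker such as HITS computes, the sketch below runs the standard hub/authority iteration on a tiny graph. The edge list is a hypothetical stand-in for the neighborhood subgraph built around one query's results; nothing here reflects how the Scalable Hyperlink Store is implemented.

```python
from math import sqrt

def normalize(scores):
    """Scale scores to unit Euclidean length so they stay bounded."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}

def hits(edges, iterations=50):
    """Return (hub, authority) scores for a directed graph given as (src, dst) pairs."""
    nodes = {node for edge in edges for node in edge}
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        auth = normalize(auth)
        # Hub score: sum of authority scores of the pages linked to.
        hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        hub = normalize(hub)
    return hub, auth

# Hypothetical query-specific subgraph: "a" and "b" both link to "c"; "a" also links to "d".
edges = [("a", "c"), ("b", "c"), ("a", "d")]
hub, auth = hits(edges)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

The ranking is query-dependent because the graph itself is rebuilt per query, which is why fast access to a page's in-links and out-links matters at web scale.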

Joe Hellerstein (UC Berkeley): “What” Goes Around

1. Industrial revolution of data: sensors, logs, cameras

2. Hardware revolution: datacenters/virtualization, many-core

3. Industrial revolution in software? Declarative languages in some domains

Why “What”:

– Rapid prototyping

– Pocket-size code bases

– Independent from the runtime

– Ease of analysis and security

– Allow optimization and adaptability

Sensor Networks, Mobile Networks, Modular Robotics, computer games, program analysis

Distributive inference (junction trees and loopy belief propagation), graphs upon graphs

Evita Raced: Overlog Metacompiler (compiler is written declaratively)

– matches datalog optimizations (dynamic prog.), cycle tests

Datalog with known extensions and tweaks

Centrality of Rendezvous & graphs

Challenges:

– performance beyond number of messages (e.g., memory hierarchy), availability, real programs, not Turing complete
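
To make the "what, not how" point concrete, here is a sketch of a two-rule Datalog program for graph reachability, with one possible evaluation strategy (semi-naive evaluation) written out in Python. The rules appear in generic Datalog notation in the docstring, not in Overlog syntax, and the relation names `link` and `path` are assumptions chosen for illustration.

```python
def transitive_closure(links):
    """Evaluate the declarative program

        path(X, Y) :- link(X, Y).
        path(X, Z) :- link(X, Y), path(Y, Z).

    using semi-naive evaluation: each round joins link only with the
    path facts discovered in the previous round (the delta).
    """
    path = set(links)
    delta = set(links)
    while delta:
        derived = {(x, z) for (x, y) in links for (y2, z) in delta if y == y2}
        delta = derived - path  # keep only facts not seen before
        path |= delta
    return path

# Hypothetical link relation: a chain a -> b -> c -> d.
links = {("a", "b"), ("b", "c"), ("c", "d")}
paths = transitive_closure(links)
```

The two rules say nothing about join order or iteration, so a runtime is free to swap in a different strategy without touching the program; the loop above is just one choice.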

Raghu Ramakrishnan (Yahoo! Res.): Sherpa: Cloud Computing of the Third Kind

Alex Szalay (Johns Hopkins): Scientific Applications of Large Databases

Phillip Gibbons (Intel Research): Data-Rich Computing: Where It’s At

Important, interesting, exciting research area

Cluster approach: computing is co-located where the storage is at

Focus of this talk:

Memory hierarchy issues: where the (intermediate) data are at, over the course of the computation

Pervasive multimedia sensing: processing & querying must be pushed out of the data center to where the sensors are at

I know where it’s at, man!

Hierarchy-Savvy Parallel Algorithm Design (HI-SPADE) project

Goal: Support a hierarchy-savvy model of computation for parallel algorithm design

Hierarchy-savvy:

– Hide what can be hid

– Expose what must be exposed

– Sweet-spot between ignorant and fully aware

Support:

– Develop the compilers, runtime systems, architectural features, etc. to realize the model

– Important component: fine-grain threading

IrisNet’s Two-Tier Architecture

Two components:

– SAs: sensor feed processing

– OAs: distributed database

[Figure: two-tier architecture diagram. Labels: User, Query, Web Server for the url, OAs (each with an XML database), SAs (each running senselets over Sensors), Sensornet]

Jeannette Wing (CMU/NSF): NSF Plans for Supporting Data-Intensive Computing

Google/IBM Data Center

– ~2000 processors, large Hadoop cluster

– Allocate in units of rack weeks

– NSF will review proposals for use: Cluster Exploratory (CluE)

– Running Xen; Won’t open up performance monitoring

– Goal: Show applicability outside of computer science

Academic-Industry-Government partnership

Randy Bryant (CMU): Big Data Computing Study Group

Collection of ~20 people (looking for volunteers)

Goals:

– Fostering educational activities

– Advocacy

– Building community

CCC’s Big Data Computing Study Group seeks to foster collaborations between industry, academia, and the U.S. government to advance the state of the art in the development and application of large-scale computing systems for making intelligent use of the massive amounts of data being generated in science, commerce, and society