Software and Hardware Requirements for Next-Generation Data Analytics

John Feo, Center for Adaptive Supercomputing Software, Pacific Northwest National Laboratory, October 2010


Page 1: Software and Hardware Requirements for Next-Generation Data Analytics

Software and Hardware Requirements for Next-Generation Data Analytics

John Feo

Center for Adaptive Supercomputing Software

Pacific Northwest National Laboratory

October, 2010

Page 2: Software and Hardware Requirements for Next-Generation Data Analytics

Graphs are everywhere in science

Astrophysics. Problem: outlier detection. Challenges: massive datasets, temporal variations. Graph problems: clustering, matching.

Bioinformatics. Problem: identifying drug target proteins. Challenges: data heterogeneity, quality. Graph problems: centrality, clustering.

Social Informatics. Problem: discover emergent communities, model spread of information. Challenges: new analytics routines, uncertainty in data. Graph problems: clustering, shortest paths, flows.

Page 3: Software and Hardware Requirements for Next-Generation Data Analytics

… and in commerce

Sample queries:
Allegiance switching: identify entities that switch communities.
Community structure: identify the genesis and dissipation of communities.
Phase change: identify significant change in the network structure.
Thought leaders: identify influential individuals that drive events.

Graph features:
Topology: interaction graph is low-diameter and has no good separators.
Irregularity: communities are not uniform in size.
Overlap: individuals are members of one or more communities.

1000x growth in 3 years!

[Facebook] has more than 300 million active users

Page 4: Software and Hardware Requirements for Next-Generation Data Analytics

Small-world and scale-free

Low diameter (small-world): work explodes; difficult to partition/load-balance; high % of nodes are visited quickly

“Six degrees of separation”

Scale-free (power-law): difficult to partition/load-balance; work concentrates in a few nodes

[Figure: ratio of edges cut (axis ticks 0.25, 0.50, 1.00) versus number of partitions, for block and k-way partitioning of an RMAT graph with a million vertices.]
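To make the edge-cut measurement concrete, here is a minimal sketch that generates a small R-MAT-style graph and reports the fraction of edges cut by a plain block partition. The quadrant probabilities, graph size, and function names are assumptions chosen for illustration, not values from the slides, and only the block scheme is shown (the k-way curve would need a partitioner such as Metis).

```python
import random

def rmat_edges(scale, num_edges, a=0.57, b=0.19, c=0.19, seed=0):
    """Generate directed R-MAT-style edges on 2**scale vertices.

    a, b, c are assumed quadrant probabilities (d = 1 - a - b - c).
    """
    rng = random.Random(seed)
    edges = []
    for _ in range(num_edges):
        u = v = 0
        for _level in range(scale):
            r = rng.random()
            u <<= 1
            v <<= 1
            if r < a:
                pass                  # top-left quadrant: both bits stay 0
            elif r < a + b:
                v |= 1                # top-right quadrant
            elif r < a + b + c:
                u |= 1                # bottom-left quadrant
            else:
                u |= 1                # bottom-right quadrant
                v |= 1
        edges.append((u, v))
    return edges

def block_cut_ratio(edges, num_vertices, parts):
    """Fraction of edges whose endpoints fall in different contiguous blocks."""
    block = -(-num_vertices // parts)     # ceiling division
    cut = sum(1 for u, v in edges if u // block != v // block)
    return cut / len(edges)

SCALE = 12
edges = rmat_edges(SCALE, 20_000)
for parts in (2, 4, 16, 64):
    print(parts, round(block_cut_ratio(edges, 2 ** SCALE, parts), 3))
```

Because the blocks nest when the vertex count and part counts are powers of two, the printed ratio can only grow as the number of partitions increases.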

Page 5: Software and Hardware Requirements for Next-Generation Data Analytics

Grids, Erdős–Rényi, and Scale-Free Graphs

[Figure panels: USA Roadmap; Erdős–Rényi; Scale-Free. Communication trace from execution of ½-approx weighted matching (data distributed using Metis).]
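The slides do not say which ½-approximation algorithm produced the trace. One well-known option is the greedy method: scan edges from heaviest to lightest and keep an edge whenever both endpoints are still free. The sketch below is a sequential illustration of that idea only; the function name and the sample path graph are hypothetical.

```python
def greedy_matching(edges):
    """Greedy 1/2-approximate maximum-weight matching.

    edges: iterable of (weight, u, v) tuples.
    Returns (total_weight, chosen) with chosen as a list of (u, v, weight).
    """
    matched = set()
    chosen = []
    total = 0
    # Visit edges from heaviest to lightest.
    for w, u, v in sorted(edges, key=lambda e: e[0], reverse=True):
        if u != v and u not in matched and v not in matched:
            matched.update((u, v))
            chosen.append((u, v, w))
            total += w
    return total, chosen

# Hypothetical path a-b-c-d with weights 2, 3, 2.
path = [(2, "a", "b"), (3, "b", "c"), (2, "c", "d")]
total, chosen = greedy_matching(path)
print(total, chosen)
```

On this path the greedy choice takes the middle edge (weight 3), while the optimum takes the two outer edges (weight 4), so the result stays within the factor-of-two guarantee.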

Page 6: Software and Hardware Requirements for Next-Generation Data Analytics

Challenges

Problem size: ton of bytes, not ton of flops

Little data locality: have only parallelism to tolerate latencies

Low computation to communication ratio: single word access; threads limited by loads and stores

Synchronization points are simple elements: node, edge, record

Work tends to be dynamic and imbalanced: let any processor execute any thread

Page 7: Software and Hardware Requirements for Next-Generation Data Analytics

System requirements

Global shared memory: no simple data partitions; local storage for thread private data

Network support for single word accesses: transfer multiple words when locality exists

Multi-threaded processors: hide latency with parallelism; single cycle context switching; multiple outstanding loads and stores per thread

Full-and-empty bits: efficient synchronization; wait in memory

Message driven operations: dynamic work queues; hardware support for thread migration

Cray XMT
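As a rough software analogy for full-and-empty bits, the sketch below guards one word with a full/empty flag: a write waits until the word is empty and marks it full, and a read waits until it is full and marks it empty. The class and method names are hypothetical, and a condition variable stands in for what the hardware does in memory, so this illustrates the semantics only, not the cost.

```python
import threading

class FullEmptyWord:
    """One memory word guarded by a full/empty flag (software emulation)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def write_when_empty(self, value):
        """Block until the word is empty, store value, mark the word full."""
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read_and_empty(self):
        """Block until the word is full, take the value, mark the word empty."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            value = self._value
            self._full = False
            self._cond.notify_all()
            return value

def run_demo(n):
    """One producer hands n values to one consumer through a single word."""
    word = FullEmptyWord()
    received = []

    def producer():
        for i in range(n):
            word.write_when_empty(i)

    def consumer():
        for _ in range(n):
            received.append(word.read_and_empty())

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received

print(run_demo(5))
```

With a single producer and a single consumer the flag forces strict alternation, so values arrive in the order they were written and no separate lock or queue is needed.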

Page 8: Software and Hardware Requirements for Next-Generation Data Analytics

Center for Adaptive Supercomputing Software

Driving development of next-generation multithreaded architectures and methods for irregular problems

[Diagram labels: DATA; Scientific Simulations; Sensor Networks; Internet; Databases; Data Analytics; Knowledge Discovery; Trend Analysis; Science; Policy; Commerce]

Sponsored by DOD

Page 9: Software and Hardware Requirements for Next-Generation Data Analytics

Partners

Page 10: Software and Hardware Requirements for Next-Generation Data Analytics

Analytic methods and applications

Community thought leaders

Blog Analysis

Community Activities

Facebook: 300 M users

Connect-the-dots

[Figure labels: Bus; Hayashi; Zaire; Train; Anthrax; Money; Endo]

National Security

People, Places, & Actions

Semantic Web

Anomaly detection

Security

N-x contingency analysis

SmartGrid

Page 11: Software and Hardware Requirements for Next-Generation Data Analytics

Chapel for hybrid systems

Next generation multithreaded architectures

Communication software for hybrid systems

Performance analysis and tools

Compiler and runtime system

SmartGrid Sensor Networks

Mesh generation

N-x contingency analysis

Semantic Databases

Bayesian networks

Social networks

[Vertical axis labels: Architecture; Runtime System; Languages; Methods; Applications]

Research focus areas

MapReduce

Clustering

Computer Security

BioInformatics

Page 12: Software and Hardware Requirements for Next-Generation Data Analytics

Methods for data analytics

Paths: shortest path; betweenness; min/max flow
Structures: spanning trees; connected components; graph isomorphism
Groups: matching/coloring; partitioning; equivalence

Influential factors

Degree distribution: normal; scale-free
Planar or non-planar
Static or dynamic
Weighted or unweighted: weight distribution
Typed or untyped edges

[Callouts: Load imbalance; Non-planar; Concurrent inserts and deletions; Difficult to partition]

Page 13: Software and Hardware Requirements for Next-Generation Data Analytics

Systems for large-scale analytics

Cray XMT

Graph resides in XMT memory

RDBS runs on cluster

Netezza TwinFin

Page 14: Software and Hardware Requirements for Next-Generation Data Analytics

Dynamic Bayesian Network Model for Atmospheric Sensor Network Validation

[Figure: network over the variables vap, wspd_va, tbsky 31, sky ir temp, precip-tbrg, percent_opaque, radar7, radar13, radar19, drawn three times.]

Replicate per time step. Add dependencies across time steps (not shown).

Page 15: Software and Hardware Requirements for Next-Generation Data Analytics

DBN to Junction Tree Conversion

Convert dynamic Bayesian network to junction tree for inferencing. Each node in the junction tree is a clique or super node containing several nodes from the original Bayesian network. Junction-tree-based “Evidence Propagation” is an efficient method of propagating the effect of any variable’s state to every other variable in the BN.

[Figure: the variables vap, wspd_va, tbsky 31, sky ir temp, precip-tbrg, percent_opaque, radar7, radar13, radar19, drawn three times.]
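The two-pass flavor of evidence propagation can be sketched on a plain tree, treating each tree node as a single discrete variable instead of a clique. This simplification, the function name, and all potentials below are assumptions for illustration; building a real junction tree from a DBN is not shown. Messages are first collected from the leaves up to the root, then distributed from the root back down.

```python
def propagate(children, root, unary, pair):
    """Sum-product message passing on a rooted tree.

    children: dict node -> list of child nodes
    unary:    dict node -> list of non-negative weights, one per state
    pair:     dict (parent, child) -> table[parent_state][child_state]
    Returns dict node -> normalized marginal distribution.
    """
    up = {}                                   # child -> message to its parent
    down = {root: [1.0] * len(unary[root])}   # node -> message from its parent

    def local(n, skip=None):
        # Unary weights times upward messages from every child except `skip`.
        vec = list(unary[n])
        for c in children.get(n, []):
            if c != skip:
                vec = [v * m for v, m in zip(vec, up[c])]
        return vec

    def collect(n, parent):
        # Upward pass: leaves first, then their parents.
        for c in children.get(n, []):
            collect(c, n)
        if parent is not None:
            vec = local(n)
            table = pair[(parent, n)]
            up[n] = [sum(row[xc] * vec[xc] for xc in range(len(vec)))
                     for row in table]

    def distribute(n):
        # Downward pass: root first, then its descendants.
        for c in children.get(n, []):
            vec = [v * d for v, d in zip(local(n, skip=c), down[n])]
            table = pair[(n, c)]
            down[c] = [sum(table[xp][xc] * vec[xp] for xp in range(len(vec)))
                       for xc in range(len(unary[c]))]
            distribute(c)

    collect(root, None)
    distribute(root)
    beliefs = {}
    for n in unary:
        vec = [v * d for v, d in zip(local(n), down[n])]
        z = sum(vec)
        beliefs[n] = [v / z for v in vec]
    return beliefs

# Hypothetical 4-node tree with binary states; node 3 is observed in state 0.
children = {0: [1, 2], 1: [3]}
unary = {0: [0.6, 0.4], 1: [1.0, 1.0], 2: [0.3, 0.7], 3: [1.0, 0.0]}
pair = {(0, 1): [[0.8, 0.2], [0.3, 0.7]],
        (0, 2): [[0.5, 0.5], [0.1, 0.9]],
        (1, 3): [[0.6, 0.4], [0.2, 0.8]]}
for node, belief in sorted(propagate(children, 0, unary, pair).items()):
    print(node, [round(p, 3) for p in belief])
```

The per-node work depends on the number of children and the table sizes, which is one way to see why the load is uneven across nodes.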

Page 16: Software and Hardware Requirements for Next-Generation Data Analytics

Evidence Propagation is highly irregular

Compute per node is unbalanced
Degree per node is irregular
Data moves up and down

Loop parallelism intra-node
Task parallelism inter-node (recursion, futures)
Data flow scheduling
Data synchronization

SMALL SYSTEMS HAVE 100S OF MILLIONS OF NODES

Page 17: Software and Hardware Requirements for Next-Generation Data Analytics

Atmospheric Sensor Network Validation Framework

Page 18: Software and Hardware Requirements for Next-Generation Data Analytics

Semantic analysis

Understanding the relationships among data

Data intensive science

National security

Commerce

Data and relationships best expressed as triples and graphs

<John owns Dog>

PNNL, SNL, Cray

Patient | Blue bumps | Pink rash | High fever
John    | Yes        | _         | Yes
Alice   | _          | Yes       | _
Mary    | _          | _         | Yes

<John has symptom Blue bumps>
<Alice has symptom Pink rash>
<John has symptom High fever>
<Mary has symptom High fever>

Mayo Clinic’s patient database has 650K columns
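The patient example can be written down directly as subject-predicate-object triples. The sketch below stores only the cells that are present, which is why a very wide, mostly empty table maps naturally onto a graph. The dictionary layout and helper names are hypothetical; the data are the three patients from the slide.

```python
# Sparse view of the slide's table: only the "Yes" cells are stored.
table = {
    "John":  {"Blue bumps": True, "High fever": True},
    "Alice": {"Pink rash": True},
    "Mary":  {"High fever": True},
}

def to_triples(rows, predicate="has symptom"):
    """Turn each present cell into a (subject, predicate, object) triple."""
    return [(patient, predicate, symptom)
            for patient, cols in rows.items()
            for symptom, present in cols.items()
            if present]

def subjects_with(triples, predicate, obj):
    """All subjects linked to obj by predicate, in sorted order."""
    return sorted(s for s, p, o in triples if p == predicate and o == obj)

triples = to_triples(table)
for t in triples:
    print(t)
print(subjects_with(triples, "has symptom", "High fever"))
```

Adding a new symptom costs one new triple instead of one new column for every patient.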

Page 19: Software and Hardware Requirements for Next-Generation Data Analytics

XMT’s potential for semantic analysis

Machine | Programming model | Performance (inferences per sec) | Author
X86, 32 nodes, 128 cores | MPI | ~600K | Weaver and Hendler (ISWC 2009)
X86, 64 nodes, 256 cores | Hadoop | ~550K – 800K | Urbani et al. (ESWC 2010)
256 Threadstorm processors | C++ | ~2.2M with read time; ~13M without read time |

RDFS closure

Inferring new relationships and attributes

Rule based

Original Diagram from Urbani et al. "Scalable Distributed Reasoning using MapReduce" ISWC 2009

JOB 3: Delete Duplicates

JOB 0: Transitive Closure

<John studied under Jim Browne> + <Jim Browne teaches at UT Austin> implies <John attended UT Austin>

865 million triples
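A rule-based closure can be sketched as a fixpoint loop: apply the rules to the current triple set, add whatever is new, and stop when nothing changes. The sketch below uses two standard RDFS entailment rules (subclass transitivity and type inheritance) on hypothetical sample triples; it is a naive in-memory illustration, not the implementation measured in the table. Storing triples in a set handles duplicate removal implicitly.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"

def rdfs_closure(triples):
    """Apply two RDFS rules until no new triples appear.

    rdfs11: (a subClassOf b), (b subClassOf c) => (a subClassOf c)
    rdfs9:  (x type a), (a subClassOf b)       => (x type b)
    """
    closed = set(triples)                 # a set drops duplicates for free
    while True:
        sub = [(s, o) for s, p, o in closed if p == SUBCLASS]
        typ = [(s, o) for s, p, o in closed if p == TYPE]
        new = set()
        for a, b in sub:                  # rdfs11: transitivity
            for b2, c in sub:
                if b == b2:
                    new.add((a, SUBCLASS, c))
        for x, a in typ:                  # rdfs9: type inheritance
            for a2, b in sub:
                if a == a2:
                    new.add((x, TYPE, b))
        if new <= closed:                 # fixpoint reached
            return closed
        closed |= new

# Hypothetical input triples.
facts = [
    ("Student", SUBCLASS, "Person"),
    ("Person", SUBCLASS, "Agent"),
    ("John", TYPE, "Student"),
]
closure = rdfs_closure(facts)
for t in sorted(closure):
    print(t)
```

The nested loops are quadratic in the number of subclass triples, so a real system would index triples by subject or object before joining.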

Page 20: Software and Hardware Requirements for Next-Generation Data Analytics

Summary

The new HPC is irregular and sparse
Bad news: we need new architectures
Good news: there are commercial and consumer applications

Shared memory is necessary, but not sufficient
Need processors that can fill the memory system with requests
Need memory systems that support millions of simultaneous requests
Need fine-grain hardware synchronization in memory
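One way to see why the memory system must hold so many requests at once is Little's law, which relates the number of requests in flight N to the request rate λ and the latency L. The numbers in the example are illustrative assumptions, not measurements from the slides.

```latex
\[
  N = \lambda \, L,
  \qquad
  \text{e.g. } \lambda = 10^{9}\ \text{requests/s},\quad
  L = 10^{-6}\ \text{s}
  \;\Longrightarrow\;
  N = 10^{3}\ \text{requests in flight}.
\]
```

Sustaining a higher request rate at the same latency, or the same rate at a higher latency, raises the required concurrency in proportion.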