Page 1: Data Intensive Computing at Sandia

Data Intensive Computing at Sandia

September 15, 2010

Andy Wilson

Senior Member of Technical Staff

Data Analysis and Visualization

Sandia National Laboratories

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Page 2: Data Intensive Computing at Sandia

The Question

What is Data-Intensive Computing?

Page 3: Data Intensive Computing at Sandia

My Answer

What is Data-Intensive Computing?

Parallel computing where you design your algorithms and your software around efficient access and traversal of a data set; where hardware requirements are dictated by data size as much as by desired run times

Usually distilling compact results from massive data

Page 4: Data Intensive Computing at Sandia
Page 5: Data Intensive Computing at Sandia

Outline

• What is Data-Intensive Computing?

• Data-Intensive Computing at Sandia
  – Physics
  – Informatics
  – Architectures

• Into the Future

Page 6: Data Intensive Computing at Sandia

Spaghetti Plot (2)

Page 7: Data Intensive Computing at Sandia

Traditional Visualization Workflow

[Diagram: Solver, Disk Storage, Visualization; data exchanged: Full Mesh]

Page 8: Data Intensive Computing at Sandia

Traditional In-Situ Visualization

[Diagram: two pipelines, each with Solver, Disk Storage, Visualization; one exchanges Images, the other the Full Mesh]

Page 9: Data Intensive Computing at Sandia

Coprocessing

[Diagram: three pipelines, each with Solver, Disk Storage, Visualization; the first exchanges Images, the second the Full Mesh, the third Features & Statistics and Salient Data]
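A minimal sketch of the coprocessing idea, assuming a toy solver: the reduction runs inside the solver loop, so only features, statistics and salient cells would go to disk instead of the full mesh. The names `solver_step`, `coprocess` and `run`, the threshold and the mesh size are all hypothetical and are not taken from the talk.

```python
import math
import random


def solver_step(mesh, rng):
    """Toy 'solver': perturb every cell value (stand-in for a real physics update)."""
    return [v + rng.uniform(-0.1, 0.1) for v in mesh]


def coprocess(step, mesh, threshold):
    """Reduce the full mesh to a few statistics plus the cells above a threshold."""
    n = len(mesh)
    mean = sum(mesh) / n
    var = sum((v - mean) ** 2 for v in mesh) / n
    salient = [(i, v) for i, v in enumerate(mesh) if v > threshold]
    return {
        "step": step,
        "min": min(mesh),
        "max": max(mesh),
        "mean": mean,
        "std": math.sqrt(var),
        "salient": salient,
    }


def run(n_cells=10_000, n_steps=5, threshold=0.25, seed=0):
    rng = random.Random(seed)
    mesh = [0.0] * n_cells
    summaries = []
    for step in range(n_steps):
        mesh = solver_step(mesh, rng)
        # Only this compact summary would be written out, not the mesh itself.
        summaries.append(coprocess(step, mesh, threshold))
    return summaries


summaries = run()
```

Each summary holds a handful of numbers and a short list of cells, which is the "salient data" a later visualization step would read.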

Page 10: Data Intensive Computing at Sandia

Collision Movie

Page 11: Data Intensive Computing at Sandia

Outline

• What is Data-Intensive Computing?

• Data-Intensive Computing at Sandia
  – Physics
  – Informatics
  – Architectures

• Into the Future

Page 12: Data Intensive Computing at Sandia


Community Detection in Networks

• Find many small groups of vertices and/or edges
  – O(n) communities
  – overlaps may be allowed

• Hundreds of papers in physics and computer science

Lancichinetti, Fortunato, Radicchi 2008
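One quality score used widely in this literature is Newman's modularity Q, which compares the fraction of edges inside each community with what a random graph of the same degrees would give. The sketch below is illustrative only: the graph (two triangles joined by one edge) and the function name `modularity` are assumptions, and nothing here claims to be the method used in the talk.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Q = sum over communities c of  e_c / m - (d_c / (2 m))^2
    e_c: edges with both ends in c, d_c: total degree of c, m: edge count."""
    m = len(edges)
    intra = defaultdict(int)
    degree = defaultdict(int)
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            intra[cu] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Hypothetical toy graph: two triangles connected by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {v: "all" for v in range(6)}

q_two = modularity(edges, two_groups)  # 5/14, about 0.357
q_one = modularity(edges, one_group)   # 0.0
```

Splitting the graph at the bridge scores higher than lumping everything together, which is the behaviour a community-detection method driven by modularity relies on.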

Page 13: Data Intensive Computing at Sandia


Analysis of Massive Graphs

• Finding communities: a kernel of social network analysis

• “Dunbar’s number” from sociology: there is a size limit (~150) on stable social group size (from Neolithic farming village to academic sub-discipline)

Twitter social network (|V|≈200M)

[Akshay Java, 2007]

Page 14: Data Intensive Computing at Sandia


Collapsed Dendrograms and Statistical Confidence: wCNM

The wCNM partitioning is much deeper, resolving smaller communities

The statistically significant variation is visually close, but does not reproduce ground truth as well

Image credit: Titan

The (much better) wCNM solution also has a statistically significant variation.

Page 15: Data Intensive Computing at Sandia

LSA and LDA from 5 miles up

Image credit: Dave Robinson

(LDA)

Page 16: Data Intensive Computing at Sandia

LSA/LDA: Increasing Data Size, Single Processor

Straight Line = Linear Scaling, Lower = Faster

[Plot, log-log axes: CPU Time (sec.) vs. Number of Documents; Higher Lines = More Topics]
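On log-log axes a straight line is a power law, and a slope of 1 means run time grows linearly with the number of documents. A small sketch of how that slope could be checked from measured timings follows; the timing values are made up for illustration and are not read off the plot, and `loglog_slope` is a hypothetical helper.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log10(y) against log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den


# Hypothetical timings, NOT the values from the slide.
n_docs = [100, 1_000, 10_000, 100_000]
linear_times = [0.5, 5.0, 50.0, 500.0]          # time proportional to n
quadratic_times = [1e-6 * n * n for n in n_docs]  # time proportional to n^2

slope_linear = loglog_slope(n_docs, linear_times)
slope_quadratic = loglog_slope(n_docs, quadratic_times)
```

A fitted slope near 1 corresponds to linear scaling; a slope near 2 would indicate quadratic growth even though the line still looks straight.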

Page 17: Data Intensive Computing at Sandia

LSA/LDA: Weak Scaling (Bigger Problem, Same Time)

Flat Lines = Perfect Scaling

[Plot, log-log axes: CPU Time (sec.) vs. Number of Processors; Higher Lines = More Documents]
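In a weak-scaling run the work per processor is held fixed, so the ideal is a constant run time as processors are added. A common way to summarize this is the efficiency E(P) = T(1) / T(P). The sketch below uses hypothetical timings, not the measurements behind the plot, and `weak_scaling_efficiency` is an illustrative name.

```python
def weak_scaling_efficiency(times_by_procs):
    """Map processor count -> T(1) / T(P); 1.0 is perfect weak scaling."""
    base = times_by_procs[1]
    return {p: base / t for p, t in sorted(times_by_procs.items())}


# Hypothetical run times in seconds, with problem size growing in step with P.
timings = {1: 100.0, 10: 105.0, 100: 125.0, 1000: 200.0}

efficiency = weak_scaling_efficiency(timings)
```

With these made-up numbers the efficiency falls from 1.0 on one processor to 0.5 on 1000, which on the plot would show up as a line that bends upward instead of staying flat.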

Page 18: Data Intensive Computing at Sandia

Outline

• What is Data-Intensive Computing?

• Data-Intensive Computing at Sandia
  – Physics
  – Informatics
  – Architectures

• Into the Future

Page 19: Data Intensive Computing at Sandia

NGC System Diagram

Architectures | Algorithms | Web Services | Applications (Clients)

Titan, browser

Trilinos: Algebraic Methods (Clustering, Ranking, High Dimensional Mapping)

MTGL: Graph Methods (Subgraph searches, Connection sg’s, Shortest Path, etc.)

Specialized Distributed Data Operations

Titan: Analysis Pipelines, Capability Integration, Data Access, Lightweight analysis

“This project seeks to bring these two strengths – a solid reputation for excellence in computing, and our niche expertise in specific classes of intelligence analysis – to bear on a thorny problem: developing advanced informatics capabilities that are both usable and useful to analysts who are drowning in data.” NGC project proposal

Highly optimized | Iterative, flexible

Data

Page 20: Data Intensive Computing at Sandia

SQL Service

Enables Remote Access to Data Warehouse Appliances (DWA)

SQL Service*
  – Provides “bridge” between parallel apps and external DWA
  – Runs on Red Storm network nodes
  – Titan applications communicate with service through Portals
  – External resources (Netezza) communicate through standard interfaces (e.g. ODBC over TCP/IP)

The SQL service enables an HPC application to access a remote DWA

[Diagram labels: Analyst; HPC System (Red Storm) with Service Nodes (GUI and Database Services), High-Speed Network (Portals) and Compute Nodes (Titan Analysis Code); DWA with Netezza, LexisNexis and Other ODBC DWA. Locations: Tech Area 1, Anywhere, CSRI. Links: TCP/IP, SQL]

* Results of SQL access from parallel statistics code presented at CUG’2009.
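The bridge idea can be sketched as a thin function that forwards SQL to the warehouse and hands back only the compact result. This is an assumption-laden illustration: Python's built-in `sqlite3` stands in for the remote DWA (the talk names ODBC over TCP/IP), and the table `docs`, its columns, and the function `sql_service` are hypothetical.

```python
import sqlite3


def make_stand_in_warehouse():
    """In-memory SQLite database used here in place of a real DWA."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE docs (id INTEGER PRIMARY KEY, lang TEXT, n_words INTEGER)"
    )
    conn.executemany(
        "INSERT INTO docs (lang, n_words) VALUES (?, ?)",
        [("en", 120), ("en", 80), ("es", 200), ("ru", 50)],
    )
    conn.commit()
    return conn


def sql_service(conn, query, params=()):
    """Forward a query to the warehouse and return the result rows."""
    return conn.execute(query, params).fetchall()


warehouse = make_stand_in_warehouse()

# The aggregation runs where the data lives; only three small rows come back.
rows = sql_service(
    warehouse,
    "SELECT lang, COUNT(*), SUM(n_words) FROM docs GROUP BY lang ORDER BY lang",
)
```

The design point is that the analysis code asks for an aggregate and never pulls the raw table across the network.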

Additional Modifications for Multilingual
  – Tokenization support on Netezza (goal is to count unique words)
  – Developed a custom UTF-8 word splitter for SPU (snippet processing unit)
  – Allows parallel tokenization and counting at storage device
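The tokenize-and-count pattern can be sketched as follows: each partition of documents is split into words and counted independently, and the partial counts are then merged. This is only an illustration in Python, not the SPU word splitter itself; the sample documents and the names `count_words` and `merge` are hypothetical.

```python
import re
from collections import Counter

# \w is Unicode-aware for str patterns in Python 3, so Cyrillic and accented
# letters count as word characters (digits and underscore do too).
WORD = re.compile(r"\w+")


def count_words(raw_docs):
    """Tokenize one partition of UTF-8 encoded documents and count its words."""
    counts = Counter()
    for raw in raw_docs:
        text = raw.decode("utf-8").lower()
        counts.update(WORD.findall(text))
    return counts


def merge(partials):
    """Combine per-partition counts into one global count."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Each inner list stands for the documents held by one storage unit.
partitions = [
    ["the cat".encode("utf-8"), "el gato".encode("utf-8")],
    ["кот и cat".encode("utf-8")],
]

total = merge(count_words(p) for p in partitions)
unique_words = len(total)
```

Because `count_words` touches only its own partition, the calls could run side by side at each storage unit, with just the small counters sent back for merging.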


Page 21: Data Intensive Computing at Sandia

Outline

• What is Data-Intensive Computing?

• Data-Intensive Computing at Sandia
  – Physics
  – Informatics
  – Architectures

• Into the Future

Page 22: Data Intensive Computing at Sandia

Into the Future

• I don’t care about flops anymore. I care about mops.

• I want to send more complex requests to the storage system.

• There is no one perfect architecture.