Data Intensive Computing at Sandia


Transcript of Data Intensive Computing at Sandia


Data Intensive Computing at Sandia
September 15, 2010

Andy Wilson
Senior Member of Technical Staff
Data Analysis and Visualization
Sandia National Laboratories

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

The Question: What is Data-Intensive Computing?

My Answer:

Parallel computing where you design your algorithms and your software around efficient access and traversal of a data set, and where hardware requirements are dictated by data size as much as by desired run times.

Usually distilling compact results from massive data

Outline
What is Data-Intensive Computing?

Data-Intensive Computing at Sandia: Physics, Informatics, Architectures

Into the Future

Spaghetti Plot (2)

Traditional Visualization Workflow
Solver → Disk Storage → Visualization (full mesh)
This is the traditional visualization workflow. As we move to petascale, the visualization community is realizing that disk storage is a bottleneck.

Traditional In-Situ Visualization
Solver (with embedded visualization) → Disk Storage → Visualization (images)
The term "in-situ visualization" is fairly new, but the concept is old. The basic idea is to embed the visualization in the solver and write out images; image data sizes are comparatively tiny. Existing products focus on this solver-to-images answer: the libraries RVSLIB (commercial) and pV3 (open source), and Kwan-Liu Ma's in-situ library (Ma is a forerunner in in-situ visualization research), which is only a parallel volume renderer. However, this is a myopic view that stems from the misconception that visualization = graphics. It assumes you know what to look for a priori, and it presupposes that all you want is a visual representation.

Coprocessing
Solver (with coprocessing library) → Features & Statistics → Disk Storage → Visualization (salient data)
Our approach is a more general coprocessing library. In the solver we place a full-featured visualization and post-processing library, which acts as a coprocessing library to the solver. The solver can call the coprocessing library to perform data analysis in core. The post-processing is, by its nature, attempting to find salient data, something we call an extract. An extract can be an image, if that makes sense, but it can also be any number of other features: subgrids, isosurfaces, statistical quantities, or other geometric structures. The extracts are more information-rich than the original data, and they can generally be written at a much higher fidelity than the full mesh. Afterwards, a visualization tool can read the extracts from disk and allow the user to interactively explore the salient data.

Collision Movie

Outline
What is Data-Intensive Computing?

Data-Intensive Computing at Sandia: Physics, Informatics, Architectures

Into the Future
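The coprocessing pattern described in the previous section can be sketched in a few lines. This is a minimal illustration of the idea (write compact extracts, not the full mesh); every function name and number below is invented for the sketch, not Sandia's actual library API.

```python
# Sketch of in-core coprocessing: instead of writing the full mesh to disk
# every step, the solver hands its in-memory data to an analysis routine
# and keeps only small "extracts" (summary statistics here).
# All names are illustrative, not a real solver or library interface.

def compute_extract(field):
    """Distill a compact extract (summary statistics) from in-core data."""
    n = len(field)
    mean = sum(field) / n
    var = sum((x - mean) ** 2 for x in field) / n
    return {"n": n, "mean": mean, "variance": var}

def run_solver(num_steps, extract_every=2):
    extracts = []
    field = [0.0] * 8                               # stand-in for mesh data
    for step in range(num_steps):
        field = [x + step * 0.5 for x in field]     # stand-in for one solve step
        if step % extract_every == 0:               # coprocessing hook
            extracts.append(compute_extract(field))
    return extracts           # only these small records would go to disk

extracts = run_solver(num_steps=4)
print(len(extracts))   # 2 extracts instead of 4 full meshes
```

The point of the sketch is the hook inside the time loop: analysis runs while the data is still in core, so disk traffic scales with the size of the extracts rather than the mesh.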

Community Detection in Networks
Find many small groups of vertices and/or edges: O(n) communities, and overlaps may be allowed. Hundreds of papers in physics and computer science.

(Lancichinetti, Fortunato, Radicchi 2008)
No formal graph-theoretic definition of connectedness seems to capture what a human perceives to be the correct communities in all cases. You shouldn't find communities in random graphs.

Analysis of Massive Graphs
Finding communities: a kernel of social network analysis
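There are many algorithmic takes on community detection; below is a deliberately simple, deterministic label-propagation sketch on a toy graph. The toy graph and the tie-breaking rules are my own choices for illustration; this is not the wCNM method discussed later in the talk.

```python
from collections import Counter

def label_propagation(adj, max_iters=20):
    """Each vertex repeatedly adopts the label most common among its
    neighbors; on a tie it keeps its current label if possible, else
    takes the largest tied label. A simple deterministic variant."""
    labels = {v: v for v in adj}            # start: every vertex alone
    for _ in range(max_iters):
        changed = False
        for v in sorted(adj):
            counts = Counter(labels[u] for u in adj[v])
            top = max(counts.values())
            tied = {l for l, c in counts.items() if c == top}
            best = labels[v] if labels[v] in tied else max(tied)
            if best != labels[v]:
                labels[v], changed = best, True
        if not changed:
            break
    return labels

# Two 4-cliques joined by a single edge (3-4): expect two communities.
adj = {
    0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4],
    4: [3, 5, 6, 7], 5: [4, 6, 7], 6: [4, 5, 7], 7: [4, 5, 6],
}
labels = label_propagation(adj)
groups = {}
for v, l in labels.items():
    groups.setdefault(l, []).append(v)
print(sorted(sorted(g) for g in groups.values()))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Note how this illustrates the slide's caveat: label propagation has no formal definition of a "correct" community either, and on a random graph it would happily report whatever groups the tie-breaking happens to produce.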

Dunbar's number from sociology: there is a size limit (~150) on stable social group size (from the Neolithic farming village to the academic sub-discipline).

Twitter social network (|V| ≈ 200M) [Akshay Java, 2007]
But this kind of coarse-grained parallelism isn't always there.

Collapsed Dendrograms and Statistical Confidence: wCNM
The wCNM partitioning is much deeper, resolving smaller communities.

The statistically significant variation is visually close, but does not reproduce ground truth as well.

Image credit: Titan
The (much better) wCNM solution also has a statistically significant variation.

LSA and LDA from 5 Miles Up
Image credit: Dave Robinson (LDA)
The vectors in U and V^T form orthonormal bases. The D matrix is a set of stretch factors (interpretable as concept weights). For LDA, the concept weights are included in the theta matrix.

LSA/LDA: Increasing Data Size, Single Processor
Straight line = linear scaling; lower = faster.
Bakery analogy: adding more loaves to a single oven. How long does it take for them to bake?

Okay, since this is a log-log plot, straight lines don't *automatically* mean linear scaling. But take my word for it: it's linear.

LSA/LDA: Weak Scaling (Bigger Problem, Same Time)
Flat lines = perfect scaling.
Bakery analogy: if you can bake one loaf in one hour with one oven, you should be able to bake ten loaves in one hour with ten ovens.
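The weak-scaling reading of those plots boils down to a simple ratio: with the problem size growing in proportion to the processor count, ideal runtime stays flat, and efficiency is the single-processor time divided by the n-processor time. The runtimes below are invented for illustration, not Sandia's measurements.

```python
def weak_scaling_efficiency(t1, tn):
    """Weak scaling: problem size grows with processor count, so ideally
    runtime stays flat. Efficiency = t(1 processor) / t(n processors)."""
    return t1 / tn

# Illustrative runtimes (seconds) for 1, 10, and 100 processors with the
# problem size scaled up proportionally -- made-up numbers, not data.
runs = {1: 60.0, 10: 63.0, 100: 75.0}
for procs, t in runs.items():
    eff = weak_scaling_efficiency(runs[1], t)
    print(f"{procs:>4} procs: {eff:.0%} efficiency")
```

An efficiency below 100% is exactly the "line trending upward" case: time lost to overhead and communication rather than useful computation.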

If a line trends upward, away from horizontal, it means that you're losing computational efficiency to increasing overhead and communication.

Outline
What is Data-Intensive Computing?

Data-Intensive Computing at Sandia: Physics, Informatics, Architectures

Into the Future

NGC System Diagram

Architectures
Layers: Algorithms, Web Services, Applications (Clients)

Trilinos: algebraic methods (clustering, ranking, high-dimensional mapping)
MTGL: graph methods (subgraph searches, connection sgs, shortest path, etc.)
Specialized distributed data operations
Titan: analysis pipelines, capability integration, data access, lightweight analysis
Applications (clients): Titan, browser

"This project seeks to bring these two strengths (a solid reputation for excellence in computing, and our niche expertise in specific classes of intelligence analysis) to bear on a thorny problem: developing advanced informatics capabilities that are both usable and useful to analysts who are drowning in data." (NGC project proposal)


SQL Service: Enables Remote Access to Data Warehouse Appliances (DWA)
The SQL Service* provides a bridge between parallel apps and an external DWA. It runs on Red Storm network nodes. Titan applications communicate with the service through Portals; external resources (Netezza) communicate through standard interfaces (e.g., ODBC over TCP/IP).

The SQL service enables an HPC application to access a remote DWA.

[System diagram: on the HPC system (Red Storm) in Tech Area 1/CSRI, compute nodes running Titan analysis code talk to service nodes (GUI and database services) over the high-speed Portals network; the analyst's SQL flows over TCP/IP to external DWAs located anywhere: Netezza, LexisNexis, or other ODBC DWAs.]

* Results of SQL access from parallel statistics code presented at CUG 2009.

Additional Modifications for Multilingual Tokenization
Tokenization support on Netezza (the goal is to count unique words). We developed a custom UTF-8 word splitter for the SPU (snippet processing unit), which allows parallel tokenization and counting at the storage device.
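The SPU-side UTF-8 word splitter itself isn't shown in the talk, so here is a rough functional stand-in in Python: split text into runs of Unicode word characters, then count unique (case-folded) words across documents, which is the stated goal. The regex approach and the sample documents are my own, not the Netezza implementation.

```python
import re

def split_utf8_words(text):
    """Rough stand-in for a UTF-8-aware word splitter: extract runs of
    Unicode word characters (Python's re is Unicode-aware for str input)."""
    return re.findall(r"\w+", text)

def unique_word_count(docs):
    """The slide's goal: count unique words across a set of documents."""
    vocab = set()
    for doc in docs:
        vocab.update(w.casefold() for w in split_utf8_words(doc))
    return len(vocab)

# Mixed-language sample documents (invented for illustration).
docs = ["Data intensive computing", "données très intensives", "naïve data"]
print(unique_word_count(docs))   # 7 ("data" appears twice, counted once)
```

The appliance version does the same kind of work where the data lives: each SPU tokenizes and counts its own slice of the table in parallel, so only small per-word counts, not raw text, travel back to the host.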

Outline
What is Data-Intensive Computing?

Data-Intensive Computing at Sandia: Physics, Informatics, Architectures

Into the Future

Into the Future
I don't care about flops anymore. I care about mops (memory operations).

I want to send more complex requests to the storage system.

There is no one perfect architecture.

[Matrix diagrams]

Latent Semantic Analysis:
C (words × documents) ≈ U (words × concepts) · D (concepts × concepts) · V^T (concepts × documents)

Probabilistic Latent Semantic Analysis:
C (words × documents) ≈ (words × topics) · (topics × documents)
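The LSA factorization in the diagrams can be sketched with NumPy's SVD on a toy term-document matrix. The matrix and its word labels are invented for illustration; the structure (orthonormal U and V^T, diagonal D of concept weights) is exactly what the earlier slide describes.

```python
import numpy as np

# Toy words-by-documents count matrix C (rows: words, cols: documents).
C = np.array([
    [2, 1, 0, 0],   # "graph"
    [1, 2, 0, 0],   # "community"
    [0, 0, 3, 1],   # "solver"
    [0, 1, 1, 2],   # "mesh"
], dtype=float)

# LSA: C = U D V^T. Columns of U and rows of V^T form orthonormal bases;
# the diagonal of D holds the "stretch factors" (concept weights).
U, d, Vt = np.linalg.svd(C, full_matrices=False)

# A rank-k approximation keeps only the k strongest concepts.
k = 2
C2 = U[:, :k] @ np.diag(d[:k]) @ Vt[:k, :]

print(np.round(d, 2))                    # concept weights, descending
print(np.allclose(U.T @ U, np.eye(4)))   # orthonormal basis -> True
```

Truncating to the top k concepts is what "distilling compact results from massive data" looks like here: a words × documents matrix collapses to two small factors plus k weights.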