System Support for Data-Intensive Applications
Slide 1
System Support for Data-Intensive Applications
Katherine Yelick U.C. Berkeley, EECS
Slide 2
The “Post PC” Generation
Two technologies will likely dominate:
1) Mobile Consumer Electronic Devices
–e.g., PDA, Cell phone, wearable computers, with cameras, recorders, sensors
–make the computing “invisible” through reliability and simple interfaces
2) Infrastructure to Support such Devices
–e.g., successor to Big Fat Web Servers, Database Servers
–make these “utilities” with reliability and new economic models
Slide 3
Open Research Issues
• Human-computer interaction
– uniformity across devices
• Distributed computing
– coordination across independent devices
• Power
– low power designs and renewable power sources
• Information retrieval
– finding useful information amidst a flood of data
• Scalability
– Scaling devices down
– Scaling services up
• Reliability and maintainability
Slide 4
The problem space: big data
• Big demand for enormous amounts of data
– today: enterprise and internet applications
» online applications: e-commerce, mail, web, archives
» enterprise decision-support, data mining databases
– future: richer data and more of it
» computational & storage back-ends for mobile devices
» more multimedia content
» more use of historical data to provide better services
• Two key application domains:
– storage: public, private, and institutional data
– search: building static indexes, dynamic discovery
Slide 5
Reliability/Performance Trade-off
• Techniques for reliability:
– High level languages with strong types
» avoid memory leaks, wild pointers, etc.
» C vs. Java
– Redundant storage, computation, etc.
» adds storage and bandwidth overhead
• Techniques for performance:
– Optimize for a specific machine
» e.g., cache or memory hierarchy
– Minimize redundancy
• These two goals work against each other
Slide 6
Specific Projects
• ISTORE
– A reliable, scalable, maintainable storage system
• Data-intensive applications for “backend” servers
– Modeling the real world
– Storing and finding information
• Titanium
– A high level language (Java) with high performance
– A domain-specific language and optimizing compiler
• Sparsity
– Optimization using partial program input
Slide 7
ISTORE: Reliable Storage System
• 80-node x86-based cluster, 1.4TB storage
– cluster nodes are plug-and-play, intelligent, network-attached storage “bricks”
» a single field-replaceable unit to simplify maintenance
– each node is a full x86 PC w/256MB DRAM, 18GB disk
– 2-node system running now; full system in next quarter
• ISTORE chassis: 80 nodes, 8 per tray; 2 levels of switches (20 at 100 Mbit/s, 2 at 1 Gbit/s); environment monitoring: UPS, redundant power supplies, fans, heat and vibration sensors...
• Intelligent disk “brick”: portable PC CPU (Pentium II/266 + DRAM), redundant NICs (4 100 Mb/s links), diagnostic processor, and disk in a half-height canister
Slide 8
A glimpse into the future?
• System-on-a-chip enables computer, memory, redundant network interfaces without significantly increasing size of disk
• ISTORE HW in 5-7 years:
– building block: 2006 MicroDrive integrated with IRAM
» 9GB disk, 50 MB/sec from disk
» connected via crossbar switch
– 10,000 nodes fit into one rack!
• O(10,000) scale is our ultimate design point
Slide 9
Specific Projects
• ISTORE
– A reliable, scalable, maintainable storage system
• Data-intensive applications for “backend” servers
– Modeling the real world
– Storing and finding information
• Titanium
– A high level language (Java) with high performance
– A domain-specific language and optimizing compiler
• Sparsity
– Optimization using partial program input
Slide 10
Heart Modeling
• A computer simulation of a human heart
– Used to design artificial heart valves
– Simulations run for days on a C90 supercomputer
– Done by Peskin and MacQueen at NYU
• Modern machines are faster but harder to use
– working with NYU
– using Titanium
• Shown here: close-up of aortic valve during ejection
• Images from the Pittsburgh Supercomputing Center
Slide 11
Simulation of a Beating Heart
• Shown here:
– Aortic valve (yellow); Mitral valve (purple)
– Mitral valve closes when the left ventricle pumps
• Future: virtual surgery?
Slide 12
Earthquake Simulation
• Earthquake modeling
– Used for retrofitting buildings, emergency preparedness, construction policies
– Done by Bielak (CMU); also by Fenves (Berkeley)
– Problems: grid (graph) generation; using images
Slide 13
Earthquake Simulation
• Movie shows a simulated aftershock following the 1994 Northridge earthquake in California
• Future: sensors everywhere; tied to central system
Slide 14
Pollution Standards
• Simulation of ozone pollution
– Done by Russell (CMU) and McRae (MIT)
– Used to influence automobile emissions policy
Los Angeles Basin shown at 8am (left) and 2pm (right)
The “cloud” shows areas where ozone levels are above federal ambient air quality standards (0.12 parts per million)
Slide 15
Information Retrieval
• Finding useful information amidst huge data sets
– I/O intensive application
• Today’s example: web search engines
– 10 million documents in a typical matrix
– Web storage increasing 2x every 5 months
– One class of techniques based on sparse matrices
• Problem: can you make this run faster, without writing hand-optimized, non-portable code?

[Figure: sparse matrix-vector product y = A*x, where A is roughly 100K keywords by 10M documents. The matrix is compressed; memory access is “random”; roughly one cache miss per 2 flops; runs at 1-5% of machine peak.]
Slide 16
Image-Based Retrieval
• Digital library problem:
– retrieval on images
– content-based
• Computer vision problem
– uses sparse matrix
• Future: search in medical image databases; diagnosis; epidemiological studies
Slide 17
Object Based Image Description
Slide 18
Specific Projects
• ISTORE
– A reliable, scalable, maintainable storage system
• Data-intensive applications for “backend” servers
– Modeling the real world
– Storing and finding information
• Titanium
– A high level language (Java) with high performance
– A domain-specific language and optimizing compiler
• Sparsity
– Optimization using partial program input
Slide 19
Titanium Goals
• Help programmers write reliable software
– Retain safety properties of Java
– Extend to parallel programming constructs
• Performance
– Sequential code comparable to C/C++/Fortran
– Parallel performance comparable to MPI
• Portability
• How?
– Domain-specific language and compiler
– No JVM
– Optimizing compiler
– Explicit parallelism and other language constructs for high performance
Slide 20
Titanium Overview: Sequential
Object-oriented language based on Java with:
• Immutable classes
– user-definable non-reference types for performance (sketched below)
• Unordered loops
– compiler is free to run iterations in any order
– useful for cache optimizations and others
• Operator overloading
– by demand from our user community
• Multidimensional arrays
– points and index sets as first-class values
– specific to an application domain: scientific computing with block-structured grids
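Standard Java has no immutable-class construct, but the flavor can be suggested with a final, value-style class. A minimal sketch, assuming a hypothetical Complex type (not from the talk); in Titanium, declaring such a class immutable makes it a non-reference type, so arrays of it are stored without per-element pointers:

// A value-like complex number: all fields final, no mutation after
// construction. In Titanium an "immutable" class like this is stored
// inline in arrays (no pointer per element); plain Java still keeps
// each instance behind a reference.
public final class Complex {
    public final double re;
    public final double im;

    public Complex(double re, double im) {
        this.re = re;
        this.im = im;
    }

    // Operations return new values instead of mutating, as with
    // Titanium immutables (which also allow operator overloading).
    public Complex plus(Complex other) {
        return new Complex(re + other.re, im + other.im);
    }

    public Complex times(Complex other) {
        return new Complex(re * other.re - im * other.im,
                           re * other.im + im * other.re);
    }
}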
Slide 21
Titanium Overview: Parallel
Extensions of Java for scalable parallelism:
• Scalable parallelism
– SPMD model with global address space
• Global communication library
– E.g., broadcast, exchange (all-to-all)
– Used to build data structures in the global address space
• Parallel optimizations
– Pointer operations
– Communication (underway)
• Bulk asynchronous I/O
– speed with safety
Slide 22
Implementation
• Strategy
– Compile Titanium into C
– Communicate through shared memory on SMPs
– Lightweight communication for distributed memory
• Titanium currently runs on:
– Uniprocessors
– SMPs with POSIX or Solaris threads
– Berkeley NOW, SP2 (distributed memory)
– Tera MTA (multithreaded, hierarchical)
– Cray T3E (global address space)
– SP3 (cluster of SMPs, e.g., Blue Horizon at SDSC)
Slide 23
Sequential Performance
Ultrasparc:

                C/C++/FORTRAN   Java Arrays   Titanium Arrays   Overhead
DAXPY           1.4s                          1.5s              7%
3D multigrid    12s                           22s               83%
2D multigrid    5.4s            6.8s          6.2s              15%
EM3D            0.7s            1.8s          1.0s              42%

Pentium II:

                C/C++/FORTRAN   Java Arrays   Titanium Arrays   Overhead
DAXPY           1.8s                          2.3s              27%
3D multigrid    23.0s                         20.0s             -13%
2D multigrid    7.3s                          5.5s              -25%
EM3D            1.0s                          1.6s              60%

Performance results from ’98; new IR and optimization framework almost complete.
Slide 24
SPMD Execution Model
• Java programs can be run as Titanium, but the result will be that all processors do all the work
• E.g., parallel hello world:
class HelloWorld {
  public static void main (String [] argv) {
    System.out.println("Hello from proc " + Ti.thisProc());
  }
}
• Any non-trivial program will have communication and synchronization
Slide 25
SPMD Execution Model
• A common style is compute/communicate
• E.g., in each timestep within a particle simulation with gravitational attraction (a plain-Java analogue is sketched below):
read all particles and compute forces on mine
Ti.barrier();
write to my particles using new forces
Ti.barrier();
• This basic model is used in the large-scale parallel simulations described earlier
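Titanium's Ti.barrier() is not available in standard Java, but the compute/communicate rhythm can be imitated with java.util.concurrent.CyclicBarrier. A minimal runnable sketch under toy assumptions (1D positions, a made-up force rule; all names illustrative):

import java.util.concurrent.CyclicBarrier;

// Compute/communicate timesteps: every thread reads all positions and
// computes forces on its own particles, then all threads synchronize
// before writing, mirroring the Ti.barrier() pattern on the slide.
public class ComputeCommunicate {
    static final int PROCS = 4, N = 1024, STEPS = 10;
    static final double[] pos = new double[N];
    static final double[] force = new double[N];
    static final CyclicBarrier barrier = new CyclicBarrier(PROCS);

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[PROCS];
        for (int p = 0; p < PROCS; p++) {
            final int me = p;
            threads[p] = new Thread(() -> run(me));
            threads[p].start();
        }
        for (Thread t : threads) t.join();
    }

    static void run(int me) {
        int lo = me * (N / PROCS), hi = lo + N / PROCS;
        try {
            for (int step = 0; step < STEPS; step++) {
                // Phase 1: read all particles, compute forces on mine.
                for (int i = lo; i < hi; i++) {
                    double f = 0;
                    for (int j = 0; j < N; j++)
                        if (j != i) f += Math.signum(pos[j] - pos[i]);
                    force[i] = f;
                }
                barrier.await();  // Ti.barrier()
                // Phase 2: write to my particles using new forces.
                for (int i = lo; i < hi; i++)
                    pos[i] += 0.01 * force[i];
                barrier.await();  // Ti.barrier()
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}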
Slide 26
SPMD Model
• All processors start together and execute the same code, but not in lock-step
• Basic control done using
– Ti.numProcs(): total number of processors
– Ti.thisProc(): number of the executing processor
• Sometimes they do something independent:
if (Ti.thisProc() == 0) { ... do setup ... }
System.out.println("Hello from " + Ti.thisProc());
double [1d] a = new double [Ti.numProcs()];
Slide 27
Barriers and Single
• A common source of bugs is barriers or other global operations inside branches or loops
– barrier, broadcast, reduction, exchange
• A “single” method is one called by all procs
public single static void allStep(...)
• A “single” variable has the same value on all procs
int single timestep = 0;
• The compiler uses “single” type annotations to ensure there are no synchronization bugs with barriers (the deadlock sketched below is the kind of bug this rules out)
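To make the bug class concrete, here is a plain-Java sketch (illustrative, not Titanium) of a barrier inside a branch whose condition is not the same on all processors; Titanium's single annotations let the compiler reject this statically:

import java.util.concurrent.CyclicBarrier;

// A barrier inside a branch whose condition can differ per thread:
// thread 0 skips the barrier, so the others block forever. Titanium
// rejects this pattern because the branch condition is not provably
// "single" (the same on all processors).
public class BarrierBug {
    static final int PROCS = 4;
    static final CyclicBarrier barrier = new CyclicBarrier(PROCS);

    public static void main(String[] args) {
        for (int p = 0; p < PROCS; p++) {
            final int me = p;
            new Thread(() -> {
                try {
                    if (me != 0) {        // not a "single" condition!
                        barrier.await();  // only 3 of 4 arrive: deadlock
                    }
                    System.out.println("proc " + me + " done");
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }).start();
        }
    }
}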
Slide 28
Explicit Communication: Exchange
• To create shared data structures:
– each processor builds its own piece
– pieces are exchanged (for objects, just exchange pointers)
• Exchange primitive in Titanium (a plain-Java analogue is sketched below):
int [1d] single allData;
allData = new int [0:Ti.numProcs()-1];
allData.exchange(Ti.thisProc()*2);
• E.g., on 4 procs, each will have a copy of allData:
0 2 4 6
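In standard Java the same effect can be approximated with a shared array and a barrier; a minimal sketch (illustrative names, not the Titanium runtime):

import java.util.concurrent.CyclicBarrier;

// Plain-Java analogue of Titanium's exchange: each thread writes its
// contribution into its own slot of a shared array, then a barrier
// makes the whole array visible to everyone (CyclicBarrier.await
// provides the needed happens-before ordering).
public class Exchange {
    public static void main(String[] args) throws InterruptedException {
        final int procs = 4;
        final int[] allData = new int[procs];
        final CyclicBarrier barrier = new CyclicBarrier(procs);
        Thread[] threads = new Thread[procs];
        for (int p = 0; p < procs; p++) {
            final int me = p;
            threads[p] = new Thread(() -> {
                try {
                    allData[me] = me * 2;   // my piece
                    barrier.await();        // exchange point
                    System.out.println("proc " + me + " sees allData[3] = "
                                       + allData[3]);  // 6 everywhere
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            threads[p].start();
        }
        for (Thread t : threads) t.join();
    }
}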
Slide 29
Exchange on Objects
• More interesting example:
class Boxed {
  public Boxed (int j) {
    val = j;
  }
  public int val;
}
Object [1d] single allData;
allData = new Object [0:Ti.numProcs()-1];
allData.exchange(new Boxed(Ti.thisProc()));
Slide 30
Use of Global / Local
• As seen, references (pointers) may be remote
– easy to port shared-memory programs
• Global pointers are more expensive than local
– true even when the data is on the same processor
– use local declarations in critical sections
• Costs of global:
– space (processor number + memory address)
– dereference time (check to see if local)
• May declare references as local
Slide 31
Global Address Space
• Processes allocate locally
• References can be passed to other processes

class C { int val; ... }
C gv;        // global pointer
C local lv;  // local pointer

if (Ti.thisProc() == 0) {
  lv = new C();
}
gv = broadcast lv from 0;
gv.val = ...;
... = gv.val;

[Figure: process 0 and the other processes each hold lv and gv; after the broadcast, gv on every process points to the C object allocated on process 0.]
Slide 32
Local Pointer Analysis
• Compiler can infer many uses of local– “Local Qualification Inference” (Liblit’s work)
• Data structures must be well partitioned
[Chart: “Effect of LQI”: running time (sec), 0 to 250, for the applications cannon, lu, sample, gsrb, and poisson, comparing Original vs. After LQI.]
Slide 33
Bulk Asynchronous I/O Performance
[Chart: throughput (MB/sec, 0 to 1.4) vs. file size (MB, 0 to 60) for the six I/O methods listed below.]
External sort benchmark on NOW
• raf: random access file (Java)
• ds: unbuffered stream (Java)
• dsb: buffered stream (Java)
• bulkraf: bulk random access (Titanium)
• bulkds: bulk sequential (Titanium)
• async: asynchronous (Titanium)
Slide 34
Performance Heterogeneity
• System performance is limited by the weakest link
• Performance heterogeneity is the norm
– disks: inner vs. outer track (50%), fragmentation
– processors: load (1.5-5x)
• Virtual Streams: dynamically off-load I/O work from slower disks to faster ones
[Chart: minimum per-process bandwidth (MB/sec, 0 to 6) vs. efficiency of the single slow disk (100%, 67%, 39%, 29%), comparing Ideal, Virtual Streams, and Static.]
Slide 35
Parallel performance on an SMP
• Speedup on Ultrasparc SMP (shared memory multiprocessor)
• EM3D performance linear
– simple kernel
• AMR largely limited by
– problem size
– 2 levels, with top one serial
[Chart: speedup (0 to 8) vs. processors (1, 2, 4, 8) for em3d and amr.]
Slide 36
Parallel Performance on a NOW
• MLC for Finite Differences by Balls and Colella
• Poisson equation with infinite boundaries
– arises in astrophysics, some biological systems, etc.
• Method is scalable
– low communication
• Performance on
– SP2 (shown) and T3E
– scaled speedups
– nearly ideal (flat)
• Currently 2D and non-adaptive
[Chart: time per fine-patch iteration per processor (0 to 1.2) vs. processors (1, 4, 16) for problem sizes 129x129/65x65, 129x129/33x33, 257x257/129x129, and 257x257/65x65; curves are nearly flat.]
Slide 37
Performance on CLUMPs
• Clusters of SMPs (CLUMPs) have two levels of communication
– BH at SDSC has 144 nodes, each with 8 processors
– the 8th processor cannot be used effectively
GSRB performance with 700x700 patches
[Chart: time (s, 0 to 70) vs. processes (0 to 35) for 1, 2, 4, 7, and 8 processes per node.]
Slide 38
Cluster of SMPs
• Communication within a node is shared memory
• Communication between nodes uses LAPI
– for large messages, a separate thread is created by LAPI
– this interferes with computation performance
Aggregate bandwidth with multiple processes
[Chart: bandwidth (MB/s, 0 to 50) vs. data size (bytes, 0 to 70000) for 1, 2, 4, 7, and 8 processes per node.]
Slide 39
Optimizing Parallel Programs
• Would like the compiler to introduce asynchronous communication, which is a form of possible reordering
• Hardware also reorders
– out-of-order execution
– write buffers with read bypass
– non-FIFO write buffers
• Software already reorders too
– register allocation
– any code motion
• System provides enforcement primitives
– volatile: not well-defined at the language level
– tend to be heavyweight and unpredictable
• Can the compiler hide all this?
Slide 40
Semantics: Sequential Consistency
• When compiling sequential programs, reordering
x = expr1;
y = expr2;
into
y = expr2;
x = expr1;
is valid if y is not in expr1 and x is not in expr2 (roughly).
• When compiling parallel code, that test is not sufficient:
Initially flag = data = 0
Proc A                Proc B
data = 1;             while (flag != 1);
flag = 1;             ... = ... data ...;
If the writes on Proc A are reordered, Proc B can leave the loop and still read the old value of data.
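In Java, this producer/consumer publication is exactly what volatile is for; a minimal runnable sketch of the Proc A / Proc B example (illustrative, not from the talk):

// The flag/data example from the slide as Java threads. Declaring
// flag volatile forbids the compiler and hardware from reordering the
// write of data past the write of flag (and the reads on the consumer
// side), which is the sequential-consistency issue the slide raises.
// Without volatile, the reader may spin forever or read data == 0.
public class FlagData {
    static int data = 0;
    static volatile int flag = 0;

    public static void main(String[] args) throws InterruptedException {
        Thread procA = new Thread(() -> {
            data = 1;   // must become visible before flag
            flag = 1;
        });
        Thread procB = new Thread(() -> {
            while (flag != 1) { /* spin */ }
            System.out.println("data = " + data);  // guaranteed 1
        });
        procB.start();
        procA.start();
        procA.join();
        procB.join();
    }
}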
Slide 41
Cycle Detection: Dependence Analog
• Processors define a “program order” on accesses from the same thread
– P is the union of these total orders
• The memory system defines an “access order” on accesses to the same variable
– A is the access order (read/write & write/write pairs)
• A violation of sequential consistency is a cycle in P ∪ A
• Intuition: time cannot flow backwards

[Figure, for the flag/data example: program-order edges write data → write flag and read flag → read data, plus access-order edges between write flag / read flag and write data / read data, form a cycle.]
Slide 42
Cycle Detection
• Generalizes to arbitrary numbers of variables and processors
• Cycles may be arbitrarily long, but it is sufficient to consider only cycles with 1 or 2 consecutive stops per processor [Shasha & Snir]

[Figure: example cycle over the accesses write x, write y, read y on one processor and read y, write x on another.]
Slide 43
Static Analysis for Cycle Detection
• Approximate P by the control flow graph
• Approximate A by undirected “dependence” edges
• Let the “delay set” D be all edges from P that are part of a minimal cycle
• The execution order of D edges must be preserved; other P edges may be reordered (modulo the usual rules about serial code)
• Synchronization analysis is also critical [Krishnamurthy]

[Figure: three-processor example with accesses write z / read x, read y / write z, and write y / read x, showing program-order and dependence edges.]
Slide 44
Automatic Communication Optimization
• Implemented in a subset of C with limited pointers
• Experiments on the NOW; 3 synchronization styles
• Future: pointer analysis and optimizations

[Chart: time (normalized) for the benchmarks under the three synchronization styles.]
Slide 45
Specific Projects
• ISTORE
– A reliable, scalable, maintainable storage system
• Data-intensive applications for “backend” servers
– Modeling the real world
– Storing and finding information
• Titanium
– A high level language (Java) with high performance
– A domain-specific language and optimizing compiler
• Sparsity
– Optimization using partial program input
Slide 46
Sparsity: Sparse Matrix Optimizer
• Several data mining and web search algorithms use sparse matrix-vector multiplication
– used for documents, images, video, etc.
– irregular, indirect memory patterns perform poorly on memory hierarchies
• Performance improvements are possible, but depend on:
– sparsity structure, e.g., keywords within documents
– machine parameters, without analytical models
• Good news:
– operation repeated many times on similar matrices
– Sparsity: automatic code generator based on matrix structure and machine (the kernel is sketched below)
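For concreteness, the kernel being tuned: sparse matrix-vector multiply over a compressed sparse row (CSR) matrix. A minimal plain-Java sketch (illustrative; Sparsity itself emits specialized variants, e.g., register-blocked code, chosen from the matrix structure and machine):

// Sparse matrix-vector multiply y = A*x in compressed sparse row form.
// The indirect access x[colIdx[k]] is the "random" memory pattern the
// talk describes: it defeats caches, which is why generic code runs at
// a few percent of machine peak and why Sparsity specializes the loop
// to the matrix's structure (e.g., by blocking dense sub-blocks).
public class Spmv {
    public static void csrMultiply(int nRows,
                                   int[] rowPtr,   // length nRows+1
                                   int[] colIdx,   // length nnz
                                   double[] val,   // length nnz
                                   double[] x, double[] y) {
        for (int i = 0; i < nRows; i++) {
            double sum = 0.0;
            for (int k = rowPtr[i]; k < rowPtr[i + 1]; k++) {
                sum += val[k] * x[colIdx[k]];  // 2 flops, likely a cache miss
            }
            y[i] = sum;
        }
    }

    public static void main(String[] args) {
        // 2x3 matrix [[1, 0, 2], [0, 3, 0]] times x = [1, 1, 1]
        int[] rowPtr = {0, 2, 3};
        int[] colIdx = {0, 2, 1};
        double[] val = {1, 2, 3};
        double[] x = {1, 1, 1};
        double[] y = new double[2];
        csrMultiply(2, rowPtr, colIdx, val, x, y);
        System.out.println(y[0] + " " + y[1]);  // 3.0 3.0
    }
}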
Slide 47
Sparsity: Sparse Matrix Optimizer
Slide 48
Summary
• Future
– small devices + larger servers
– reliability increasingly important
• Reliability techniques include
– hardware: redundancy, monitoring
– software: better languages, many others
• Performance trades off against safety in languages
– use of domain-specific features (e.g., Titanium)
Slide 49
Backup Slides
Slide 50
The Big Motivators for Programming Systems Research
• Ease of Programming
– Hardware costs -> 0
– Software costs -> infinity
• Correctness
– Increasing reliance on software increases the cost of software errors (medical, financial, etc.)
• Performance
– Increasing machine complexity
– New languages and applications
» Enabling Java; network packet filters
Slide 51
The Real Scalability Problems: AME
• Availability
– systems should continue to meet quality of service goals despite hardware and software failures and extreme load
• Maintainability
– systems should require only minimal ongoing human administration, regardless of scale or complexity
• Evolutionary Growth
– systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded
• These are problems at today’s scales, and will only get worse as systems grow
Slide 52
Research Principles
• Redundancy everywhere, no single point of failure
• Performance secondary to AME
– performance robustness over peak performance
– dedicate resources to AME
» biological systems use > 50% of resources on maintenance
– optimizations viewed as AME-enablers
» e.g., use of (slower) safe languages like Java with static and dynamic optimizations
• Introspection
– reactive techniques to detect and adapt to failures, workload variations, and system evolution
– proactive techniques to anticipate and avert problems before they happen
Slide 53
Outline
• Motivation
• Hardware Techniques
– general techniques
– ISTORE projects
• Software Techniques
• Availability Benchmarks
• Conclusions
Slide 54
Hardware techniques
• Fully shared-nothing cluster organization
– truly scalable architecture, automatic redundancy
– tolerates partial hardware failure
• No central processor unit: distribute processing with storage
– most storage servers are limited by the speed of their CPUs; why does this make sense?
– amortize the sheet metal, power, and cooling infrastructure for the disk to add processor, memory, and network
• On-demand network partitioning/isolation
– applications must tolerate these anyway
– allows testing and repair of an online system
Slide 55
Hardware techniques
• Heavily instrumented hardware
– sensors for temperature, vibration, humidity, power
• Independent diagnostic processor on each node
– remote control of power, console, boot code
– collects, stores, and processes environmental data
– connected via an independent network
• Built-in fault injection capabilities
– used for proactive hardware introspection
» automated detection of flaky components
» controlled testing of error-recovery mechanisms
– important for AME benchmarking
Slide 56
ISTORE-2 Hardware Proposal
• Smaller disks
– replace 3.5” disks with 2.5” or 1” drives
» 340MB available now in 1”, 1 GB next year (?)
• Smaller, more highly integrated processors
– e.g., Transmeta Crusoe includes processor and Northbridge (interface) functionality in 1 Watt
– Xilinx FPGA for Southbridge, diagnostic processor, etc.
• Larger scale
– roughly 1000 nodes, depending on support
» ISTORE-1 built with donated disks, memory, processors
» paid for network, board design, enclosures (discounted)
Slide 57
Outline
• Motivation
• Hardware Techniques
• Software Techniques
– general techniques
– Titanium: a high performance Java dialect
– Sparsity: using dynamic information
– Virtual Streams: performance robustness
• Availability Benchmarks
• Conclusions
Slide 58
Software techniques
• Fault tolerant data structures
– application controls replication, checkpointing, and consistency policy
– self-scrubbing used to identify software errors that have corrupted application state (a minimal sketch follows below)
• Encourage use of safe languages
– type safety and automatic memory management avoid a host of application errors
– use of static and dynamic information to meet performance needs
• Runtime adaptation to performance heterogeneity
– e.g., outer vs. inner track (1.5X), fragmentation
– evolution of systems adds to this problem
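As one way to picture self-scrubbing (a sketch under assumptions, not the ISTORE implementation): store a checksum beside each record and periodically re-verify it, so state corrupted by a software error is caught before it is used.

import java.util.zip.CRC32;

// Minimal self-scrubbing record store: each record carries a CRC32
// checksum taken when it was last legitimately written. A background
// scrub pass recomputes checksums and flags records whose bytes no
// longer match, i.e., state corrupted by a software error.
public class ScrubbedStore {
    private final byte[][] records;
    private final long[] checksums;

    public ScrubbedStore(int n) {
        records = new byte[n][];
        checksums = new long[n];
    }

    public void put(int i, byte[] data) {
        records[i] = data.clone();
        checksums[i] = crc(data);
    }

    /** Scrub pass: returns the indices of corrupted records. */
    public java.util.List<Integer> scrub() {
        java.util.List<Integer> bad = new java.util.ArrayList<>();
        for (int i = 0; i < records.length; i++) {
            if (records[i] != null && crc(records[i]) != checksums[i]) {
                bad.add(i);  // candidate for repair from a replica
            }
        }
        return bad;
    }

    private static long crc(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data);
        return c.getValue();
    }
}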
Slide 59
Software Techniques
• Reactive introspection
– use statistical techniques to identify normal behavior and detect deviations from it
» e.g., network activity, response time, program counter (?)
– semi-automatic response to abnormal behavior
» initially, rely on a human administrator
» eventually, the system learns to set response parameters
• Proactive introspection
– continuous online self-testing
» in deployed systems!
» goal is to shake out bugs in failure response code on an isolated subset
» use of fault-injection and stress testing
Slide 60
Techniques for Safe Languages
Titanium: A high performance dialect of Java
• Scalable parallelism
– a global address space, but not shared memory
– for tightly-coupled applications, e.g., mining
– safe, region-based memory management
• Scalar performance enhancements, some specific to the application domain
– immutable classes (avoids indirection)
– multidimensional arrays with subarrays
• Application domains
– scientific computing on grids
» typically within +/-20% of C++/Fortran in this domain
– data mining in progress
Slide 61
Use of Static Information
• Titanium compiler performs parallel optimizations
– communication overlap (40%) and aggregation
• Uses two new analyses
– synchronization analysis: the parallel analog to control flow analysis
» identifies code segments that may execute in parallel
– shared variable analysis: the parallel analog to dependence analysis
» recognizes when reordering can be observed by another processor
» necessary for any code motion or use of relaxed memory models in hardware; otherwise, missed or illegal optimizations
Slide 62
Conclusions
• Two key application domains
– storage: loosely coupled
– search: tightly coupled, computation important
• Key challenges for future servers are:
– availability, maintainability, and evolutionary growth
• Use of self-monitoring to satisfy AME goals
– proactive and reactive techniques
• Use of static techniques for high performance and reliable software
– Titanium extension of Java
• Use of dynamic information for performance robustness
– Sparsity and Virtual Streams
• Availability benchmarks a powerful tool?
Slide 63
Projects and Participants
ISTORE: iram.cs.berkeley.edu/istore
With James Beck, Aaron Brown, Daniel Hettena, David Oppenheimer, Randi Thomas, Noah Treuhaft, David Patterson, John Kubiatowicz
Titanium: www.cs.berkeley.edu/projects/titanium
With Greg Balls, Dan Bonachea, David Gay, Ben Liblit, Chang-Sun Lin, Peter McCorquodale, Carleton Miyamoto, Geoff Pike, Alex Aiken, Phil Colella, Susan Graham, Paul Hilfinger
Sparsity: www.cs.berkeley.edu/~ejim/sparsity
With Eun-Jin Im
Slide 64
History of Programming Language Research
[Timeline figure, 70s / 80s / 90s / 2K: flop optimization, parsing theory, general-purpose language design, type systems theory, memory optimizations, garbage collection, threads, domain-specific language design, program verification, program checking tools, data and control analysis, type-based analysis.]