Galois Performance Mario Mendez-Lojo Donald Nguyen.

Date posted: 19-Dec-2015

Galois Performance

Mario Mendez-Lojo, Donald Nguyen

Overview

• Galois system is a test bed for exploring optimizations
  – Safe but not fast out of the box

• Important optimizations
  – Select the least transactional overhead
  – Select the right scheduling
  – Select an appropriate data structure

• Quantify optimizations on applications

Algorithms

Irregular algorithms
• Topology
  – General graph, grid, tree
• Operator
  – Morph, local computation, reader
• Ordering
  – Unordered, ordered

1. Barnes-Hut
2. Delaunay Mesh Refinement
3. Preflow-push

Methodology

[Figure: timeline of Threads vs. Time; regions labeled Idle, Serial, GC, and Compute]

• Abort ratio: aborted iterations / total iterations

• GC options
  – UseParallelGC
  – UseParallelOldGC
  – NewRatio=1
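The abort ratio defined above is a simple quotient. A minimal sketch follows; the class and method names (`AbortRatio`, `of`) are hypothetical, and returning 0 for an empty run is an assumption rather than something the slides specify.

```java
final class AbortRatio {
    /**
     * Abort ratio = aborted iterations / total iterations.
     * Assumption: a run with no iterations reports a ratio of 0.
     */
    static double of(long abortedIterations, long totalIterations) {
        if (abortedIterations < 0 || abortedIterations > totalIterations) {
            throw new IllegalArgumentException("need 0 <= aborted <= total");
        }
        if (totalIterations == 0) {
            return 0.0;
        }
        return (double) abortedIterations / totalIterations;
    }
}
```

For example, 25 aborted iterations out of 100 total give a ratio of 0.25.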

Terms

• Base
  – Default scheduling, default graph

• Serial
  – Galois classes replaced by classes with no concurrency control

• Speedup
  – Relative to the best mean performance of a serial variant

• Throughput
  – Number of serial iterations / time
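The speedup and throughput terms above can be sketched as two small functions. The names `PerfMetrics`, `speedup`, and `throughput` are hypothetical; the sample numbers in the usage note are the Barnes-Hut times reported later in the deck.

```java
final class PerfMetrics {
    /** Speedup: time of the best serial variant divided by the parallel time. */
    static double speedup(double bestSerialMs, double parallelMs) {
        if (parallelMs <= 0) {
            throw new IllegalArgumentException("parallel time must be positive");
        }
        return bestSerialMs / parallelMs;
    }

    /** Throughput: number of serial iterations per unit of time (here, per ms). */
    static double throughput(long serialIterations, double timeMs) {
        if (timeMs <= 0) {
            throw new IllegalArgumentException("time must be positive");
        }
        return serialIterations / timeMs;
    }
}
```

With a serial time of 10271 ms and a best parallel time of 1553 ms, `speedup(10271, 1553)` is about 6.6, which matches the reported 6.6X.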

Numbers

• Runtime
  – Last of 5 runs in the same VM
  – Ignores time to read and construct the initial graph

• Other statistics
  – Last of 5 runs
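A minimal sketch of the "last of 5 runs in the same VM" measurement, presumably chosen so that JIT compilation and other warm-up effects have settled by the run that is reported. `LastOfRuns` and `lastRunMillis` are hypothetical names; the task passed in is assumed to cover only the timed phase, with graph reading and construction done beforehand.

```java
final class LastOfRuns {
    /**
     * Runs the task `runs` times inside the current JVM and returns the
     * wall-clock time of the final run in milliseconds. Earlier runs are
     * executed but their timings are discarded.
     */
    static long lastRunMillis(Runnable task, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be >= 1");
        }
        long lastMillis = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            lastMillis = (System.nanoTime() - start) / 1_000_000;
        }
        return lastMillis;
    }
}
```

A call such as `LastOfRuns.lastRunMillis(benchmark, 5)` would then reproduce the reporting rule above for a given `benchmark` task.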

Test Environment

• 2 x Xeon X5570 (4 cores, 2.93 GHz)
• Java 1.6.0_0-b11
• Linux 2.6.24-27 x86_64
• 20 GB heap size

BARNES-HUT

[Image: Most Distant Galaxy Candidates in the Hubble Ultra Deep Field]

Barnes-Hut

• N-body algorithm
  – Oct-tree acceleration structure
  – Serial
    • Tree build, center of mass, particle update
  – Parallel
    • Force computation

• Structure
  – Reader on tree

• Variants
  – Splash2, Reader Galois

Reader Optimization

Before:
child = octree.getNeighbor(nn, 1);

After:
child = octree.getNeighbor(nn, 1, MethodFlag.NONE);

ParaMeter Profile

Barnes-Hut Results

100,000 points, 1 time step

Best serial: base
Serial time: 10271 ms
Best parallel time: 1553 ms
Best speedup: 6.6X

Barnes-Hut Scalability

DELAUNAY MESH REFINEMENT

Delaunay Mesh Refinement

• Refine “bad” triangles
  – Maintained in worklist

• Structure
  – Cautious operator on graph

• Variants
  – Flag optimized, locallifo

base: Priority.defaultOrder()
local lifo: Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class)

Cautious Optimization

Before:
mesh.contains(item);
...
mesh.remove(preNodes.get(i));
...
mesh.add(node);

After:
mesh.contains(item, MethodFlag.CHECK_CONFLICT);
...
mesh.remove(preNodes.get(i), MethodFlag.NONE);
...
mesh.add(node, MethodFlag.NONE);

• No need to save undo info
• Only check conflicts up to first write
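Both bullets follow from the operator being cautious: it reads its whole neighborhood before its first write. The toy sketch below illustrates why that removes the need for undo information. It is not the Galois implementation; `CautiousSketch`, `Cell`, `acquire`, and `incrementAll` are hypothetical names, and the ownership scheme is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class CautiousSketch {
    static final int FREE = -1;

    /** A shared element with an owner mark used for conflict detection. */
    static final class Cell {
        final AtomicInteger owner = new AtomicInteger(FREE);
        int value;
    }

    /** Conflict check: succeeds if the cell is free or already owned by this iteration. */
    static boolean acquire(Cell c, int iterationId) {
        return c.owner.get() == iterationId || c.owner.compareAndSet(FREE, iterationId);
    }

    /**
     * Cautious operator: phase 1 touches (and checks) every cell before
     * phase 2 writes any of them. A conflict can therefore only occur while
     * nothing has been modified, so an abort needs no undo log.
     * Returns true on commit, false on abort.
     */
    static boolean incrementAll(List<Cell> neighborhood, int iterationId) {
        List<Cell> held = new ArrayList<>();
        // Phase 1: reads with conflict checks (the CHECK_CONFLICT analogue).
        for (Cell c : neighborhood) {
            if (!acquire(c, iterationId)) {
                for (Cell h : held) h.owner.set(FREE); // abort: release only, nothing to roll back
                return false;
            }
            held.add(c);
        }
        // Phase 2: writes with no further checks and no undo info (the NONE analogue).
        for (Cell c : neighborhood) c.value++;
        for (Cell h : held) h.owner.set(FREE);
        return true;
    }
}
```

If any cell in the neighborhood is owned by another iteration, `incrementAll` returns false and every `value` is left unchanged.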

LIFO Optimization

Before:
GaloisRuntime.foreach(...,
    Priority.defaultOrder());

After:
GaloisRuntime.foreach(...,
    Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class));
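One plausible reading of `ChunkedFIFO` followed by a local `LIFO` is a global FIFO queue of chunks, with each worker processing its current chunk, and any work it creates, in stack order. The toy sketch below models a single worker under that assumption. `LocalLifoSketch`, `push`, and `poll` are hypothetical names and not the Galois worklist classes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class LocalLifoSketch {
    // Global part: chunks of work items handed out in FIFO order.
    private final Deque<List<Integer>> chunks = new ArrayDeque<>();
    // Local part: the worker's own items, processed in LIFO order.
    private final Deque<Integer> local = new ArrayDeque<>();

    LocalLifoSketch(List<Integer> initial, int chunkSize) {
        for (int i = 0; i < initial.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, initial.size());
            chunks.addLast(new ArrayList<>(initial.subList(i, end)));
        }
    }

    /** Newly created work goes on top of the local stack. */
    void push(int item) {
        local.push(item);
    }

    /** Next item to process, or null when all work is done. */
    Integer poll() {
        if (local.isEmpty()) {
            List<Integer> chunk = chunks.pollFirst(); // oldest chunk first
            if (chunk == null) {
                return null;
            }
            for (int item : chunk) {
                local.push(item);
            }
        }
        return local.pop(); // most recently pushed item first
    }
}
```

In this model, work created by an iteration is processed right after it, while the elements it touched are likely still in cache; whether that is the reason for the speedup in the Galois runtime is not stated in the slides.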

ParaMeter Profile

DMR Results

0.5M triangles, 0.25M bad triangles

Best serial: locallifo.flagopt
Serial time: 17002 ms
Best parallel time: 3745 ms
Best speedup: 4.5X

PREFLOW-PUSH

Preflow-push

• Max-flow algorithm
  – Nodes push flow downhill

• Structure
  – Cautious, local computation

• Variants
  – Flag optimized, local computation graph

base (discharge): Priority.first(Bucketed.class, numHeight+1, false, indexer).then(FIFO.class)
base (relabel): Priority.first(ChunkedFIFO.class, 8)

Local Computation Optimization

Before:
graph = ...

After:
graph = ...
b = new LocalComputationGraph.ObjectGraphBuilder();
graph = b.from(graph).create();

ParaMeter Profile

Preflow-push Results

From challenge problem (genmf-wide)
14 linearly connected grids (194x194), 526,904 nodes, 2,586,020 edges
http://avglab.com/andrew/CATS/maxflow_synthetic.htm

C: 11450 ms
Java: 30234 ms

Best serial: lc.flagopt
Serial time: 57121 ms
Best parallel time: 18242 ms
Best speedup: 3.1X

Preflow-push Scalability

What performance did we expect?

[Figure: timeline of Threads vs. Time; regions labeled Idle, Serial, GC, parallel Compute, and Miss-Speculation; annotations: "Measured Indirectly", "Synchronization, …", "Error"]

What performance did we expect?

• Naïve: r(x) = t_1 / x

• Amdahl: r(x) = t_p / x + t_s
  – t_1 = t_p + t_s
  – t_s = t_idle + t_gc + t_serial

• Simple: r(x) = (t_p · (i_x / i_1)) / x + t_s
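The three models above can be written down directly. In this sketch the symbols are read as follows: t1 is the single-thread runtime, tp and ts are its parallel and serial parts, and ix and i1 are assumed to be the iteration counts with x threads and with 1 thread (the slides do not define them). `ExpectedRuntime` and its method names are hypothetical.

```java
final class ExpectedRuntime {
    /** Naive model: the whole single-thread runtime scales with x threads. */
    static double naive(double t1, int x) {
        return t1 / x;
    }

    /** Amdahl model: only the parallel part tp scales; ts = tidle + tgc + tserial does not. */
    static double amdahl(double tp, double ts, int x) {
        return tp / x + ts;
    }

    /**
     * "Simple" model: as Amdahl, but the parallel part is first inflated by
     * ix / i1, the ratio of iterations with x threads to iterations with 1 thread.
     */
    static double simple(double tp, double ts, double ix, double i1, int x) {
        return (tp * (ix / i1)) / x + ts;
    }
}
```

With made-up inputs tp = 80, ts = 20 (so t1 = 100), and x = 4, the naive model predicts 25, Amdahl predicts 40, and the simple model with ix / i1 = 1.2 predicts 44.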

Barnes-Hut

Delaunay Mesh Refinement

Preflow-push

Summary

• Many profitable optimizations
  – Selecting among method flags, worklists, graph variants

• Open topics
  – Automation
  – Static, dynamic and performance analysis
  – Efficient ordered algorithms
