Parallel Disk-Based Computation and Computational Group...

24
Parallel Disk-Based Computation and Computational Group Theory Eric Robinson Gene Cooperman Daniel Kunkle urgen M¨ uller Northeastern University, Boston, USA RWTH, Aachen, Germany

Transcript of Parallel Disk-Based Computation and Computational Group...

Page 1: Parallel Disk-Based Computation and Computational Group ...moreno/HPCA-ACA-2009/aca09-RobinsonEtAl.pdfParallel Disk-Based Computation and Computational Group Theory ... – Heterogeneous

Parallel Disk-Based Computation and Computational Group Theory

Eric Robinson∗Gene Cooperman∗

Daniel Kunkle∗

Jurgen Muller†

∗ Northeastern University, Boston, USA† RWTH, Aachen, Germany

Page 2: Parallel Disk-Based Computation and Computational Group ...moreno/HPCA-ACA-2009/aca09-RobinsonEtAl.pdfParallel Disk-Based Computation and Computational Group Theory ... – Heterogeneous

Applications from Computational Group Theory (2003–2007)

Space State TotalGroup Size Size StorageFischerFi23 1.17×1010 100 bytes 1 TB“Baby Monster” 1.35×1010 548 bytes 7 TBJankoJ4 1.31×1011 64 bytes 8 TBRubik (sym. cosets)1.4×1012 6 bytes 8 TB

(joint with Eric Robinson, Dan Kunkle)

Page 3: Parallel Disk-Based Computation and Computational Group ...moreno/HPCA-ACA-2009/aca09-RobinsonEtAl.pdfParallel Disk-Based Computation and Computational Group Theory ... – Heterogeneous

Solution of Rubik’s Cube: Cosets in Mathematical Group Theory

Rubik’s Cube: 26 Moves Suffice to Solve Rubik’s Cube (2007)7 terabytes (temporary storage), 1012 states, 6 bytes per statePurposely chose a method requiring very large storage.Further details at:http://www.ccs.neu.edu/home/gene/papers/rubik-slides.pdf

GROUP:COSETS:

SUBGRP:

Page 4: Parallel Disk-Based Computation and Computational Group ...moreno/HPCA-ACA-2009/aca09-RobinsonEtAl.pdfParallel Disk-Based Computation and Computational Group Theory ... – Heterogeneous

History of Rubik’s Cube

• Invented in late 1970s in Hungary.

• 1982: “God’s Number” (number of moves needed) was known by authors of conjectureto be between 17 and 52.

• 1990: C., Finkelstein, and Sarawagi showed 11 moves suffice forRubik’s 2×2×2 cube(corner cubies only); but apparently known by hobbyists in 1982

• 1995: Reid showed 29 moves suffice (lower bound of 20 already known)

• 2006: Radu showed 27 moves suffice

• 2007 Kunkle and C. showed 26 moves suffice (and computation is still proceeding)

• D. Kunkle and G. Cooperman, “Twenty-Size Moves Suffice for Rubik’s Cube”,International Symposium on Symbolic and Algebraic Computation (ISSAC-07), 2007,ACM Press, pp. 235–242.

• 2008 Rokicki, 22 moves suffice (50 core-years, contributed by JohnWelborn and SonyPictures Imageworks)http://cubezzz.homelinux.org/drupal/?q=node/view/121

Page 5: Parallel Disk-Based Computation and Computational Group ...moreno/HPCA-ACA-2009/aca09-RobinsonEtAl.pdfParallel Disk-Based Computation and Computational Group Theory ... – Heterogeneous

The End of the World (as we know it) is Coming

• The world is changing, as we near the end of Moore’s Law.

– Memory chips are no longer twice as dense every 18 months.

– Large RAM is still available on server-class motherboards.

– But the commodity market doesn’t want to pay that premium.

– So, those of us doing large search and scientific computations arebeing left out in the cold. We still need those ever larger memories– especially as the trend toward multi-core CPUs places evermorepressure on RAM.

– Heterogeneous computing (cell processors, NVIDIA G8, etc.)onlymake it worse: The newfound power for regular computations (e.g.SIMD) are causing us to run out of RAM faster than ever before!!!

Page 6: Parallel Disk-Based Computation and Computational Group ...moreno/HPCA-ACA-2009/aca09-RobinsonEtAl.pdfParallel Disk-Based Computation and Computational Group Theory ... – Heterogeneous

What To Do When You Run out of RAM?

1. Shared-memory many-CPU computers usually have more RAM(expensive, limited to100 GB or 1 TB of RAM)

2. Use aggregate RAM of a computer cluster(needs parallel programming)

3. Use the disk on a single computer(much slower than RAM)

4. Use the many parallel local disks (or SAN) of a cluster(ALMOST FREE!!)

A Fourth Major Use for Disks:

1. File System

2. Database (relational or other)

3. Virtual Memory

4. Parallel Disks as extension of RAM(similar goal to virtual memory: minimallyinvasive modification of seqeuntial program, but good performance)

Page 7: Parallel Disk-Based Computation and Computational Group ...moreno/HPCA-ACA-2009/aca09-RobinsonEtAl.pdfParallel Disk-Based Computation and Computational Group Theory ... – Heterogeneous

Our Solution

• Disk is the New RAM

• Bandwidth of Disk: ˜ 100 MB/s

• Bandwidth of 50 Disks: 50×100 MB/s = 5 GB/s

• Bandwidth of RAM: approximately 5 GB/s

• Conclusion:

1. CLAIM: A computer cluster of 50 quad-core nodes, each with 500 GB of mostlyidle disk space, is a good approximation to a shared memory computer with 200CPU cores and asinglesubsystem with 25 TB of shared memory.(The arguments also work for a SAN with multiple access nodes,but we considerlocal disks for simplicity.)

2. The disks of a cluster can serve as if they were RAM.

3. The traditional RAM can then serve as if it were cache.

Page 8: Parallel Disk-Based Computation and Computational Group ...moreno/HPCA-ACA-2009/aca09-RobinsonEtAl.pdfParallel Disk-Based Computation and Computational Group Theory ... – Heterogeneous

What About Disk Latency?

• Unfortunately, putting 50 disks on it, doesn’t speed up the latency.

• So, re-organize the data structures and low-level algorithms.

• Our group has five years of case histories applying this to computational group theory— but each case requires months of development and debugging.

• We’re now developing both higher level abstractions for run-timelibraries, and alanguage extension that will make future development much faster.

Page 9: Parallel Disk-Based Computation and Computational Group ...moreno/HPCA-ACA-2009/aca09-RobinsonEtAl.pdfParallel Disk-Based Computation and Computational Group Theory ... – Heterogeneous

LONGER-TERM GOALS

• Why do it?

1. State space search occurs across a huge number of scientific disci-plines.

2. Because the world is running out of RAM!

– A commodity motherboard holds only 4 GB RAM.– We now have 4- and 8-core motherboards, but no one will be putting

eight times as much RAM on acommoditymotherboard.

Page 10: Parallel Disk-Based Computation and Computational Group ...moreno/HPCA-ACA-2009/aca09-RobinsonEtAl.pdfParallel Disk-Based Computation and Computational Group Theory ... – Heterogeneous

Applications Benefiting from Disk-Based Parallel Computation

Symbolic Algebra:Applications occur wherever there is intermediate swell!

Application1. Grobner bases2. Term Rewriting (Knuth-Bendix)3. Term Manipulations (differential operators, polynomial multiplications,. . .)4. Large Search Applications (widespread)5. Numerous examples from computational group theory6. Others

Page 11: Parallel Disk-Based Computation and Computational Group ...moreno/HPCA-ACA-2009/aca09-RobinsonEtAl.pdfParallel Disk-Based Computation and Computational Group Theory ... – Heterogeneous

Applications Benefiting from Disk-Based Parallel Computation

Discipline Example Application1. Verification Symbolic Computation using BDDs2. Verification Explicit State Verification3. Comp. Group Theory Search and Enumeration in Mathematical Structures4. Coding Theory Search for New Codes5. Security Exhaustive Search for Passwords6. Semantic Web RDF query language; OWL Web Ontology Language7, Artificial Intelligence Planning8. Proteomics Protein folding via a kinetic network model9. Operations Research Branch and Bound

10. Operations ResearchInteger Programming (applic. of Branch-and-Bound)11. Economics Dynamic Programming12. Numerical Analysis ATLAS, PHiPAC, FFTW, and other adaptive software13. Engineering Sensor Data14. A.I. Search Rubik’s Cube

Page 12: Parallel Disk-Based Computation and Computational Group ...moreno/HPCA-ACA-2009/aca09-RobinsonEtAl.pdfParallel Disk-Based Computation and Computational Group Theory ... – Heterogeneous

LONGER-TERM GOALS (cont.)

• The world is changing, as we near the end of Moore’s Law.

– Memory chips are no longer twice as dense every 18 months.

– Large RAM is still available on server-class motherboards.

– But the commodity market doesn’t want to pay that premium.

– So, those of us doing large search and scientific computations arebeing left out in the cold. We still need those ever larger memories– especially as the trend toward multi-core CPUs places evermorepressure on RAM.

• Our solution is to use disk as the new RAM! (See next slide.)

Page 13: Parallel Disk-Based Computation and Computational Group ...moreno/HPCA-ACA-2009/aca09-RobinsonEtAl.pdfParallel Disk-Based Computation and Computational Group Theory ... – Heterogeneous

Central Claim

Suppose one had a single computer with 25 terabytes of RAM and 200 CPU cores. Doesthat satisfy your need for computers with more RAM?

CLAIM: A computer cluster of 50 quad-core nodes, each with a 500 GB local disk, isa good approximation of the above computer. (The arguments also work for a SAN with

multiple access nodes, but we discuss local disks for simplicity.)

Page 14: Parallel Disk-Based Computation and Computational Group ...moreno/HPCA-ACA-2009/aca09-RobinsonEtAl.pdfParallel Disk-Based Computation and Computational Group Theory ... – Heterogeneous

When is a cluster like a 25 TB shared memory computer?

• Assume 500 GB/node of free disk space

• Assume 50 nodes,

• The bandwidth of 50 disks is 50×100MB/s= 5GB/s.

• The bandwidth of asingleRAM subsystem is about 5GB/s.

CLAIM: You probably have the 25 TB of temporary disk space lying idleon your ownrecent-modelcomputer cluster. You just didn’t know it.(Or were you just not telling other people about the space, so you could use if for yourself?)

The economics of disks are such that one saves very little by buying less than 500 GBdisk per node. It’s common to buy the 500 GB disk, and reserve the extra space forexpansion.

Page 15: Parallel Disk-Based Computation and Computational Group ...moreno/HPCA-ACA-2009/aca09-RobinsonEtAl.pdfParallel Disk-Based Computation and Computational Group Theory ... – Heterogeneous

When is a cluster NOT like a 25 TB shared memory computer?

1. We require aparallel program. (We must access the local disks of many cluster nodesin parallel.)

2. The latency problem of disk.

3. Can the network keep up with the disk?

Page 16: Parallel Disk-Based Computation and Computational Group ...moreno/HPCA-ACA-2009/aca09-RobinsonEtAl.pdfParallel Disk-Based Computation and Computational Group Theory ... – Heterogeneous

Solutions to the Latency of Disk Do Exist

1. For duplicates on frontier in state space search:Delayed Duplicate Detectionimplieswaiting until many nodes of the next frontier (and duplicates fromprevious iterations)have been discovered. Then remove duplicates.

2. For hash tables, wait until there are millions of hash queries. Then sort on the hashindex, and scan the disk to resolve queries.

3. For pointer-chasing, wait until millions of pointers are available for chasing. Then sortand scan the disk to dereference pointers.

Page 17: Parallel Disk-Based Computation and Computational Group ...moreno/HPCA-ACA-2009/aca09-RobinsonEtAl.pdfParallel Disk-Based Computation and Computational Group Theory ... – Heterogeneous

Example Times for Different Phases of Computation

Baby Monster group: 4.2×1033 elementsGoal: Construct 13,571,955,000 “points” of permutation representation

Manager Disk Time CPU/RAM Time Network TimeRead/Write 0.5 days 0 days —Computation 0 days 3 days 2 daysCheck 0 days 0 days —Hash ≪ 1 day ≪ 1 day —Formatting/Sorting ≪ 1 day ≪ 1 day —Duplicate Elimination ≪ 1 day < 1 day —Rebuilding ≪ 1 day 6 days —Approximate Total 2 days 10 days 2 days

NOTE: Between CPU time and RAM bandwidth, the computation is primarily limited byRAM bandwidth.(Disk time not the bottleneck.)

Using faster CPUs has almost no benefit!(Only faster RAM helps.)

“A Comparative Analysis of Parallel Disk-Based Methods for Enumerating Implicit Graphs”, Eric Robinson,

Daniel Kunkle and Gene Cooperman,Proc. of 2007 International Workshop on Parallel Symbolic and

Algebraic Computation(PASCO ’07), ACM Press, 2007, pp. 78–87

Page 18: Parallel Disk-Based Computation and Computational Group ...moreno/HPCA-ACA-2009/aca09-RobinsonEtAl.pdfParallel Disk-Based Computation and Computational Group Theory ... – Heterogeneous

SPDE: System for Parallel Disk-Based Enumeration

SPDE Software provides infrastructure

1. Free and Open Source Software (currently beta; see Eric Robinson forcopies)

2. Modules added for different search strategies

Page 19: Parallel Disk-Based Computation and Computational Group ...moreno/HPCA-ACA-2009/aca09-RobinsonEtAl.pdfParallel Disk-Based Computation and Computational Group Theory ... – Heterogeneous

Space-Time Tradeoffs using Additional Disk

• Use even more disk space in order to speed up the algorithm.

Page 20: Parallel Disk-Based Computation and Computational Group ...moreno/HPCA-ACA-2009/aca09-RobinsonEtAl.pdfParallel Disk-Based Computation and Computational Group Theory ... – Heterogeneous

Search Modules (not all implemented)

FROM: “A Comparative Analysis of Parallel Disk-Based Methods for Enumerating Implicit Graphs”, EricRobinson, Daniel Kunkle and Gene Cooperman,Proc. of 2007 Int. Workshop on Parallel Symbolic andAlgebraic Computation(PASCO ’07), ACM Press, 2007, pp. 78–87

1. Breadth-First Search (BFS) with perfect hash (traditional) —requires perfect hash function

2. Sorting-based Delayed Duplicate Detection (DDD) —uses external sort to compare frontier against known states

3. Hash-based Delayed Duplicate Detection (DDD) — hash high bitsto determines bin number; corresponding bins from frontierand known states loadedinto RAM; detect duplicates in bins with hashing

4. Level Modulo 3 — assumes perfect hash function;store search depth modulo three; can easily distinguish previous level(level - 1 mod 3)and next level(level + 1 mod 3)

5. Structured Duplicate Detection —requires search graph with good locality; local search in RAM for duplicates eliminatesmost duplicates

Page 21: Parallel Disk-Based Computation and Computational Group ...moreno/HPCA-ACA-2009/aca09-RobinsonEtAl.pdfParallel Disk-Based Computation and Computational Group Theory ... – Heterogeneous

Search Modules (cont.)

6. Implicit Open List —allows multiple passes at a given level; needed when initialfrontier (duplicates not yetdetected) is too large for available storage

7. Landmarks —store a fraction of states (e.g. states ending in three zeroes); these arethe landmarks; Given any other state, a local search finds a landmark state. new stateassociated with nearest landmark

8. Frontier Search —requires that inverses of neighbor generators themselves be neighborgenerators; need check for duplicates at current level, next level, or previous level; butby setting a generator used bit, one can avoid later searching the inverse generator(since it would only lead to the previous level)

9. Tiered Duplicate Detection —like a 1-bit Bloom filters; use lossy hash function; if hashes to empty hash slot, it’s new;if occupied slot, either a collision or a duplicate; save ambiguous state in second tierand use a different delayed duplicate detection method on this significantly smallersetof states

Page 22: Parallel Disk-Based Computation and Computational Group ...moreno/HPCA-ACA-2009/aca09-RobinsonEtAl.pdfParallel Disk-Based Computation and Computational Group Theory ... – Heterogeneous

LONGER-TERM GOAL: Roomy Mini-Language

Well-understood building blocks already exist: external sorting, B-trees, Bloom filters,Delayed Duplicate Detection, Distributed Hash Trees (DHT)

GOAL:1. Extend C/C++ with Roomy run-time library:

Support for extensible arrays, lists (support append, concatenate, merge, extract.. . .),hash arrays; Emphasize streaming access (using MapReduce, Reduce, Modify-in-Place,. . .)

2. Minimally invasive: common data structures in usersequentialcode are replaced byRoomy data structures

3. Add Roomy Modules for Common Tasks:

(a) Parallel Breadth-First and (approximate) Depth-First Search

(b) Priority Queues

(c) Table lookup/modify

First alpha release of Roomy expected by end of summer. (open source)

Page 23: Parallel Disk-Based Computation and Computational Group ...moreno/HPCA-ACA-2009/aca09-RobinsonEtAl.pdfParallel Disk-Based Computation and Computational Group Theory ... – Heterogeneous

Example Roomy Code (under development)

Page 24: Parallel Disk-Based Computation and Computational Group ...moreno/HPCA-ACA-2009/aca09-RobinsonEtAl.pdfParallel Disk-Based Computation and Computational Group Theory ... – Heterogeneous

QUESTIONS?