Efficient Data-Structures and Parallel Algorithms for ...cerin/slideshow.pdf · PCS - COLIMA, Sept...
PCS - COLIMA, Sept 20-21, 2004
Efficient Data-Structures and Parallel Algorithms for
Association Rules Discovery
Christophe Cérin, Michel Koskas, Gaël Le Mahec,
Jean-Sébastien Gay
[email protected]@[email protected]@laria.u-picardie.fr
Outline
✘ Context
• Large Scale Systems
• Challenging Applications
• Association Rule Discovery
✘ Data Mining
• New Data Structure
• New Algorithms
✘ Experimental results
• Our Library - Preliminary results
✘ Conclusion & further work
• Implementation choices - Fault tolerance (FT) - SQL service
CLASSIFICATION OF DISTRIBUTED SYSTEMS
[Figure: two types of distributed systems]
"large scale" (P2P, Internet Computing): 100,000 nodes, volatile, no trust, no identity
"grid computing": 1-100 nodes, stable, trust, identity
[Diagram: machines connected through switching hubs]
GRID INITIATIVE IN FRANCE
French National Project: Grid'5000
ACI Masse de données, 1M EUR
70 people, 3-year project
4 sub-projects
Good properties of our approach for mining:
✘ DB scanned once
✘ Few and short messages
PROBLEM DESCRIPTION
✘ INPUT: list of tickets
(A) champagne 13.45 EUR
(B) appetizer 4.15 EUR
(C) salmon 7.10 EUR
...
✘ Ex: if (AB) occurs 4000 times whereas (ABC) occurs 3000 times, produce: "if AB occurs, then there is a probability of 75% that C occurs too"
PROBLEM DESCRIPTION
➮ Alg. for generating rules:
for all frequent sequences β do
    for all subsequences α < β do
        conf = fr(β) / fr(α)
        if conf > min_conf then
            output α ⇒ β
            output conf
➮ Main problem: discovering frequent episodes, which are rare in huge files (no enumeration, please!)
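The rule-generation loop can be sketched in C++ as follows. This is a minimal illustration, not the authors' library: itemsets are assumed to be sorted strings of single-letter items, `fr` is a hypothetical frequency table, and subsets are enumerated naively (exactly the enumeration the slide warns against for large inputs). The confidence is taken as fr(β) / fr(α), which matches the 3000 / 4000 = 75% ticket example.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One rule "alpha => beta" with its confidence (names are illustrative).
struct Rule {
    std::string alpha;  // antecedent, a proper sub-itemset of beta
    std::string beta;   // frequent itemset the rule is derived from
    double conf;        // fr(beta) / fr(alpha)
};

// Enumerate the non-empty proper subsets of a sorted itemset string.
std::vector<std::string> proper_subsets(const std::string& beta) {
    std::vector<std::string> out;
    const unsigned n = static_cast<unsigned>(beta.size());
    for (unsigned mask = 1; mask + 1 < (1u << n); ++mask) {
        std::string s;
        for (unsigned i = 0; i < n; ++i)
            if (mask & (1u << i)) s += beta[i];
        out.push_back(s);
    }
    return out;
}

// For every frequent itemset beta and every counted subset alpha,
// output the rule when its confidence exceeds min_conf.
std::vector<Rule> generate_rules(const std::map<std::string, int>& fr,
                                 double min_conf) {
    std::vector<Rule> rules;
    for (const auto& entry : fr) {
        const std::string& beta = entry.first;
        for (const std::string& alpha : proper_subsets(beta)) {
            const auto it = fr.find(alpha);
            if (it == fr.end()) continue;  // alpha was not counted
            const double conf = static_cast<double>(entry.second) / it->second;
            if (conf > min_conf) rules.push_back(Rule{alpha, beta, conf});
        }
    }
    return rules;
}
```

With the table {AB: 4000, ABC: 3000} and a threshold of 0.5, the only rule produced is AB ⇒ ABC with confidence 0.75.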
PROBLEM DESCRIPTION
✘ KEY FACT of parallel / sequential algorithms
❄ Count the occurrences of each item (discard if support < XX → 1-itemsets)
❄ From the 1-itemsets, generate the 2-itemsets (discard if support < XX → 2-itemsets)
❄ Iterate until the last k-itemset is produced
✘ OUR APPROACH: efficient ADT for counting and candidate (k-itemset) generation
↓
Radix Trees
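The first two levels of this counting scheme can be sketched as below. It is a plain illustration that rescans the tickets with string searches and does not use radix trees; the names `frequent_items` and `frequent_pairs`, the single-letter item encoding, and the assumption that an item appears at most once per ticket are all hypothetical.

```cpp
#include <cassert>
#include <iterator>
#include <map>
#include <string>
#include <vector>

using Counts = std::map<std::string, int>;

// Level 1: count each item, then discard items whose support is too low.
Counts frequent_items(const std::vector<std::string>& tickets, int min_support) {
    Counts c;
    for (const std::string& t : tickets)
        for (char item : t) ++c[std::string(1, item)];
    for (Counts::iterator it = c.begin(); it != c.end();)
        it = (it->second < min_support) ? c.erase(it) : std::next(it);
    return c;
}

// Level 2: pair up the frequent items, count each pair, prune again.
Counts frequent_pairs(const std::vector<std::string>& tickets,
                      const Counts& l1, int min_support) {
    Counts c;
    for (Counts::const_iterator a = l1.begin(); a != l1.end(); ++a)
        for (Counts::const_iterator b = std::next(a); b != l1.end(); ++b) {
            const char x = a->first[0];
            const char y = b->first[0];
            int n = 0;  // number of tickets containing both x and y
            for (const std::string& t : tickets)
                if (t.find(x) != std::string::npos &&
                    t.find(y) != std::string::npos) ++n;
            if (n >= min_support) c[a->first + b->first] = n;
        }
    return c;
}
```

Calling `frequent_pairs(tickets, frequent_items(tickets, s), s)` gives the 2-itemsets; higher levels repeat the same count-then-prune pattern.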
RADIX TREES
✓ For each item, we keep the line numbers where it appears
[Figure: radix tree for the line numbers 0 1 3 4 7 11; bit strings shown on the slide:]
1111110111101011101010
11111101111010111010100000000000
11111101111010111000101000000
Size: 2·(2^h − 1) bits; h = 20 ⇒ 262144 bytes
✓ Advantages: search complexity is tied to the tree height; space; natural parallelism in tree management.
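The size figure can be checked with a short sketch: 2·(2^20 − 1) = 2,097,150 bits, which rounds up to 262,144 bytes. The class below also shows one possible flat layout that uses exactly that many bits. The layout is an assumption, not taken from the slides: an implicit complete binary tree in which node i has children 2i+1 and 2i+2, with two flags per node saying whether the 0-branch and the 1-branch lead to at least one stored line number.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of bits for a tree of height h: 2 * (2^h - 1).
std::size_t tree_bits(unsigned h) {
    return 2 * ((std::size_t(1) << h) - 1);
}

// Same size rounded up to whole bytes.
std::size_t tree_bytes(unsigned h) { return (tree_bits(h) + 7) / 8; }

// ASSUMED layout (hypothetical): two "branch used" flags per node of an
// implicit complete binary tree. Keys must be smaller than 2^h.
class FlatRadixTree {
public:
    explicit FlatRadixTree(unsigned h) : h_(h), bits_(tree_bits(h), false) {}

    // Walk the key from its most significant bit, marking each branch.
    void insert(std::size_t key) {
        std::size_t node = 0;
        for (unsigned level = h_; level-- > 0;) {
            const std::size_t bit = (key >> level) & 1u;
            bits_[2 * node + bit] = true;
            node = 2 * node + 1 + bit;
        }
    }

    // A key is present only if every branch on its path is marked.
    bool contains(std::size_t key) const {
        std::size_t node = 0;
        for (unsigned level = h_; level-- > 0;) {
            const std::size_t bit = (key >> level) & 1u;
            if (!bits_[2 * node + bit]) return false;
            node = 2 * node + 1 + bit;
        }
        return true;
    }

private:
    unsigned h_;
    std::vector<bool> bits_;
};
```

Both operations take h steps, which is the sense in which search cost depends only on the tree height.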
RADIX TREE OPERATIONS (support = intersect)
[Figure: two radix trees combined with ⋃ → Union of Radix Trees.]
[Figure: two radix trees combined with ⋂ → Intersection of Radix Trees.]
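A minimal sketch of the two operations on a pointer-based binary radix tree, assuming fixed-height keys of h bits (the line numbers). The node type and function names are illustrative only. The support of an itemset is then the number of keys left in the intersection of its items' trees.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

struct Node { std::unique_ptr<Node> child[2]; };
using Tree = std::unique_ptr<Node>;

// Insert an h-bit key, most significant bit first.
void insert(Tree& root, unsigned key, int h) {
    if (!root) root.reset(new Node);
    Node* n = root.get();
    for (int level = h - 1; level >= 0; --level) {
        Tree& c = n->child[(key >> level) & 1u];
        if (!c) c.reset(new Node);
        n = c.get();
    }
}

// Union: a branch exists in the result if it exists in either tree.
Tree unite(const Node* a, const Node* b) {
    if (!a && !b) return nullptr;
    Tree r(new Node);
    for (int i = 0; i < 2; ++i)
        r->child[i] = unite(a ? a->child[i].get() : nullptr,
                            b ? b->child[i].get() : nullptr);
    return r;
}

// Intersection: keep a branch only if it leads to a key present in both.
Tree intersect(const Node* a, const Node* b, int depth_left) {
    if (!a || !b) return nullptr;
    Tree r(new Node);
    if (depth_left == 0) return r;  // common leaf: a shared line number
    for (int i = 0; i < 2; ++i)
        r->child[i] = intersect(a->child[i].get(), b->child[i].get(),
                                depth_left - 1);
    if (!r->child[0] && !r->child[1]) return nullptr;  // nothing shared below
    return r;
}

// Number of keys stored below n (leaves at depth h).
std::size_t count_keys(const Node* n, int depth_left) {
    if (!n) return 0;
    if (depth_left == 0) return 1;
    return count_keys(n->child[0].get(), depth_left - 1) +
           count_keys(n->child[1].get(), depth_left - 1);
}
```

For example, if item A appears on lines {0, 1, 3, 4, 7} and item B on lines {1, 2, 4, 7}, the intersection holds {1, 4, 7}, so the support of AB is 3.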
MULTITHREADING RADIX TREE OPERATIONS
✘ Problems: load balancing; maximum number of allowed threads
[Figure: binary tree split between Thread T1 and the Main Thread.]
✘ Current implementation: constant k (maximum number of threads to be started) in the Radix-Tree class
MULTITHREADING RADIX TREE OPERATIONS
[Figure: binary tree split between the Main Thread, Thread T2 and Thread T1.]
Heuristic (# threads > 2): start a thread on a right child; the left thread runs until it encounters a leaf.
MULTITHREADING RADIX TREE OPERATIONS
[Figure: binary tree split between the Main Thread, Thread T2 and Thread T1.]
✘ Goal 1: limit busy waiting.
✘ Goal 2: limit pthread_create calls. Here: divide by 2 (complete binary tree).
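The right-child heuristic with a thread cap can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: it uses `std::thread` rather than raw `pthread_create`, it parallelises a simple key count rather than union or intersection, and the budget parameter plays the role of the constant k. The current thread keeps descending into the left child, a new thread is started on the right child only while the budget allows, and `join` blocks without busy waiting.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <thread>

struct Node { std::unique_ptr<Node> child[2]; };

// Insert an h-bit key, most significant bit first.
void insert(std::unique_ptr<Node>& root, unsigned key, int h) {
    if (!root) root.reset(new Node);
    Node* n = root.get();
    for (int level = h - 1; level >= 0; --level) {
        std::unique_ptr<Node>& c = n->child[(key >> level) & 1u];
        if (!c) c.reset(new Node);
        n = c.get();
    }
}

// Atomically take one thread from the budget, if any is left.
bool try_reserve(std::atomic<int>& budget) {
    int n = budget.load();
    while (n > 0)
        if (budget.compare_exchange_weak(n, n - 1)) return true;
    return false;
}

// Count keys; spawn a thread on the right child while the budget allows.
std::size_t count_keys(const Node* n, int depth_left, std::atomic<int>& budget) {
    if (!n) return 0;
    if (depth_left == 0) return 1;
    const Node* left = n->child[0].get();
    const Node* right = n->child[1].get();
    if (left && right && try_reserve(budget)) {
        std::size_t right_count = 0;
        std::thread worker([&] {
            right_count = count_keys(right, depth_left - 1, budget);
        });
        const std::size_t left_count = count_keys(left, depth_left - 1, budget);
        worker.join();  // blocking wait, no spinning
        return left_count + right_count;
    }
    // Budget exhausted or only one child: stay in the current thread.
    return count_keys(left, depth_left - 1, budget) +
           count_keys(right, depth_left - 1, budget);
}
```

Because the budget is never refilled, at most k threads are created over the whole traversal, whatever the shape of the tree.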
CANDIDATE GENERATION
✓ Two kinds of information stored in Radix Trees
A B C D
t1 t2 t4 t1 t4
CANDIDATE GENERATION
✓We focus on 2-itemset generation
A B C D
AA AB AC AD CA CB CC CD
CANDIDATE GENERATION
✓ Important note: operations on Radix Trees only!
A B C D
AB AC AD BC BD CD
ABC ABD ACD BCD
ABCD
MAIN PARALLEL ALGORITHM
Algorithm executed on each processor i, 0 ≤ i < p.
/* Initially, each processor holds locally n/p lines of the transaction database, where n is the total number of lines and p is the number of processors. */
1- In parallel for each processor:
    Scan the local database to construct the 1-itemset tree.
2- In parallel for each processor:
    do
        Broadcast supports.
        /* This part can be de-synchronized */
        /* to perform overlapping (see above) */
        Wait for all supports from others.
        Perform the sum reductions.
        Eliminate itemsets with insufficient support.
        Lk = rest of Ck
        Construct the new candidate set Ck+1.
    while (Ck+1 ≠ ∅)
3- frequent itemsets = ⋃ Lk
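The structure of this loop can be sketched with a sequential simulation, assuming no MPI: each "processor" is just a local vector of tickets, the broadcast and sum reduction become a loop that adds up local supports, and candidates are joined on a shared prefix. All names are hypothetical. Unlike the approach in the slides, this sketch rescans the local tickets at every level instead of intersecting radix trees.

```cpp
#include <cassert>
#include <iterator>
#include <map>
#include <string>
#include <vector>

using Itemset = std::string;          // sorted single-letter items
using Counts = std::map<Itemset, int>;
using DB = std::vector<std::string>;  // one string per ticket

bool contains_all(const std::string& ticket, const Itemset& s) {
    for (char c : s)
        if (ticket.find(c) == std::string::npos) return false;
    return true;
}

// Local step: support of every candidate in one processor's share.
Counts local_supports(const DB& db, const std::vector<Itemset>& cands) {
    Counts c;
    for (const Itemset& s : cands) {
        int n = 0;
        for (const std::string& t : db)
            if (contains_all(t, s)) ++n;
        c[s] = n;
    }
    return c;
}

// C(k+1): join two itemsets of L(k) that share their first k-1 items.
std::vector<Itemset> next_candidates(const Counts& lk) {
    std::vector<Itemset> out;
    for (Counts::const_iterator a = lk.begin(); a != lk.end(); ++a)
        for (Counts::const_iterator b = std::next(a); b != lk.end(); ++b) {
            const Itemset& x = a->first;
            const Itemset& y = b->first;
            if (x.compare(0, x.size() - 1, y, 0, y.size() - 1) == 0)
                out.push_back(x + y.back());
        }
    return out;
}

// Simulated main loop over all processors.
Counts mine(const std::vector<DB>& procs, std::vector<Itemset> cands,
            int min_support) {
    Counts frequent;
    while (!cands.empty()) {
        Counts global;  // stands in for broadcast + sum reduction
        for (const DB& db : procs) {
            const Counts local = local_supports(db, cands);
            for (const auto& kv : local) global[kv.first] += kv.second;
        }
        Counts lk;      // L(k) = candidates with sufficient global support
        for (const auto& kv : global)
            if (kv.second >= min_support) lk.insert(kv);
        frequent.insert(lk.begin(), lk.end());
        cands = next_candidates(lk);
    }
    return frequent;
}
```

Only the per-candidate supports cross processor boundaries, which illustrates the "few and short messages" property claimed earlier in the deck.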
Bench (1 thread) - SUN bi-Opteron v20z
[Figure: time (sec., axis from 10 to 110) versus # items (× 1000, axis from 40 to 150), comparing Lists and Radix Trees.]
MAIN LESSONS
➮ Two threads > one thread (http://www.boost.org thread impl.)
➮ Bitset, hierarchy of bitsets?
#include <bitset>
#include <iostream>
using namespace std;

// Three large bitsets; static storage keeps them off the stack.
static bitset<4294967> aaaa(0);
static bitset<4294967> bbbb(1);
static bitset<4294967> cccc(0);

// MAIN
int main() {
    int i;
    // Each pass ORs the two bitsets, then sets one more bit of aaaa.
    for (i = 0; i < 600; i++) {
        cccc = aaaa | bbbb;
        aaaa[i] = 1;
    }
    for (i = 0; i < 600; i++) {
        cccc = aaaa | bbbb;
        aaaa[2 * i] = 1;
    }
    for (i = 0; i < 600; i++) {
        cccc = aaaa | bbbb;
        aaaa[3 * i] = 1;
    }
    // cout << cccc << "\n";
    return 0;
}
Properties: TIME, MEMORY SPACE, EASY TO MANAGE, DISK SPACE
CONCLUSION
➽ New approaches for computing candidates
➽ Parallel algorithm + multithreading of Radix Tree operations
➽ MPI code available soon (+ MPI-V → FT MPI developed by F. Cappello in the GridExplorer Initiative)
➽ How to represent Radix Trees (vector of bits?) → efficient library for yet another application in the project: SQL service
CONCLUSION
Efficient Data-Structures and Parallel Algorithms for
Association Rules Discovery
www.laria.u-picardie.fr/~cerin/zeta.html
[email protected]@laria.u-picardie.fr