Efficient Document Analytics on Compressed Data: Method, Challenges, Algorithms, Insights ·...
![Page 1: Efficient Document Analytics on Compressed Data: Method, Challenges, Algorithms, Insights · 2018-09-23 · Efficient Document Analytics on Compressed Data: Method, Challenges, Algorithms,](https://reader030.fdocuments.us/reader030/viewer/2022040615/5f0d26397e708231d438eafd/html5/thumbnails/1.jpg)
Efficient Document Analytics on Compressed Data: Method, Challenges, Algorithms, Insights
Feng Zhang †⋄, Jidong Zhai ⋄, Xipeng Shen #, Onur Mutlu ⋆, Wenguang Chen ⋄
†Renmin University of China  ⋄Tsinghua University
#North Carolina State University  ⋆ETH Zurich
Motivation
• Every day, 2.5 quintillion bytes of data are created; 90% of the data in the world today has been created in the last two years alone [1].
[1] What is Big Data? https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
Outline
• Introduction
  • Motivation & Example
  • Compression-Based Direct Processing
  • Challenges
• Guidelines and Techniques
  • Solution Overview
  • Guidelines
• Evaluation
  • Benchmarks
  • Results
• Conclusion
Motivation
How to perform efficient document analytics when data are extremely large?
• Challenge 1: SPACE (large space requirement)
• Challenge 2: TIME (long processing time)
Motivation
• Observation
  • Using a hash table to check for redundant content in the Wikipedia dataset
[Figure: data redundancy ratio (%) versus chunk size (512B, 1KB, 4KB) for the 50 GB and 150 GB datasets]
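A hash-table redundancy check of this kind can be sketched as follows. This is a minimal illustration, not the authors' measurement code: the function name `redundancy_ratio`, the use of SHA-1 digests, and the definition of the ratio as "duplicate chunks divided by all chunks" are assumptions made for the example.

```python
import hashlib


def redundancy_ratio(data: bytes, chunk_size: int) -> float:
    """Return the percentage of fixed-size chunks that repeat an earlier chunk.

    Assumed definition: duplicates / total chunks * 100.
    """
    seen = set()      # hash table of chunk digests encountered so far
    total = 0
    duplicates = 0
    for start in range(0, len(data), chunk_size):
        digest = hashlib.sha1(data[start:start + chunk_size]).digest()
        total += 1
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)
    return 100.0 * duplicates / total if total else 0.0


if __name__ == "__main__":
    # Five 4-byte chunks, three of which repeat the first one.
    print(redundancy_ratio(b"abcd" * 4 + b"wxyz", 4))
```

Storing digests rather than the chunks themselves keeps the table small, at the cost of a negligible chance of hash collisions.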
Our Idea
• Compression-based direct processing
• Sequitur algorithm meets our requirement

(a) Original data
Input: a b c a b d a b c a b d a b a

(b) Sequitur compressed data
Rules:
R0 → R1 R1 R2 a
R1 → R2 c R2 d
R2 → a b

(c) DAG representation (each edge links a rule to a rule it references)
R0: R1 R1 R2 a
R1: R2 c R2 d
R2: a b

(d) Numerical representation
a: 0  b: 1  c: 2  d: 3  R0: 4  R1: 5  R2: 6

(e) Compressed data in numerical ID
4 → 5 5 6 0
5 → 6 2 6 3
6 → 0 1
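The example grammar can be checked with a short sketch. It only expands the rules from the slide and rebuilds the numerical form; it does not implement the Sequitur compression algorithm itself. The helper names `expand` and `to_numeric`, and the ID assignment order (words sorted first, then rules), are assumptions chosen to match the slide's numbering.

```python
# Rules copied from the slide; dictionary layout is an assumption for this sketch.
RULES = {
    "R0": ["R1", "R1", "R2", "a"],
    "R1": ["R2", "c", "R2", "d"],
    "R2": ["a", "b"],
}


def expand(symbol, rules):
    """Recursively expand a symbol into the word sequence it stands for."""
    if symbol not in rules:      # terminal: an actual word
        return [symbol]
    words = []
    for child in rules[symbol]:
        words.extend(expand(child, rules))
    return words


def to_numeric(rules):
    """Give IDs to words first, then to rules, and rewrite every rule with IDs."""
    words = sorted({s for body in rules.values() for s in body if s not in rules})
    ids = {name: i for i, name in enumerate(words + sorted(rules))}
    numeric = {ids[head]: [ids[s] for s in body] for head, body in rules.items()}
    return ids, numeric


if __name__ == "__main__":
    print(" ".join(expand("R0", RULES)))
    ids, numeric = to_numeric(RULES)
    print(ids)
    print(numeric)
```

Expanding R0 reproduces the original input, which confirms that the three rules are a lossless representation of the 15-word sequence.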
Double benefits
R0: R1 R1 R2 a
R1: R2 c R2 d
R2: a b
• Challenge 1 (Space): content that appears more than once is stored only once.
• Challenge 2 (Time): content that appears more than once is computed only once.
Optimization
• We can make it more compact.
• Some applications do not need to keep the sequence.
• In each rule, we may remove sequence info too.
• Further saves storage space and computation time.

Rules with sequence information:
R0: R1 R1 R2 a
R1: R2 c R2 d
R2: a b

Rules without sequence information (element × weight):
R0: R1 ×2, R2 ×1, a ×1
R1: R2 ×2, c ×1, d ×1
R2: a ×1, b ×1

[Figure: the DAG with its edges annotated by weights]
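Dropping the sequence information amounts to replacing each rule body with a multiset of its elements. The sketch below shows this for the example rules; the function name `drop_sequence` and the dictionary layout are assumptions for illustration, not the on-disk format used by the authors.

```python
from collections import Counter

# Rules copied from the slide; dictionary layout is an assumption for this sketch.
RULES = {
    "R0": ["R1", "R1", "R2", "a"],
    "R1": ["R2", "c", "R2", "d"],
    "R2": ["a", "b"],
}


def drop_sequence(rules):
    """Replace each ordered rule body with {element: weight} counts."""
    return {head: dict(Counter(body)) for head, body in rules.items()}


if __name__ == "__main__":
    for head, weights in drop_sequence(RULES).items():
        print(head, weights)
```

The weighted form is only valid for analytics that ignore word order, such as word count; order-sensitive tasks still need the original rule bodies.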
Example
• Word Count

R0: R1 R1 R2 a
R1: R2 c R2 d
R2: a b

Information propagates along the CFG relation; each rule keeps a word table of <w, i> entries.
Step 1 (R2): word table <a,1>, <b,1>
Step 2 (R1): <a, 1×2> = <a, 2>, <b, 1×2> = <b, 2>, <c, 1>, <d, 1>; word table <a,2>, <b,2>, <c,1>, <d,1>
Step 3 (R0): <a, 2×2 + 1 + 1> = <a, 6>, <b, 2×2 + 1> = <b, 5>, <c, 1×2> = <c, 2>, <d, 1×2> = <d, 2>; word table <a,6>, <b,5>, <c,2>, <d,2>
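The three propagation steps can be reproduced with a small bottom-up computation over the grammar. This is a sketch under stated assumptions: the function name `word_tables` is hypothetical, and memoized recursion stands in for whatever traversal the authors' library actually uses.

```python
from collections import Counter

# Rules copied from the slide; dictionary layout is an assumption for this sketch.
RULES = {
    "R0": ["R1", "R1", "R2", "a"],
    "R1": ["R2", "c", "R2", "d"],
    "R2": ["a", "b"],
}


def word_tables(rules, root="R0"):
    """Compute each rule's word table once and reuse it wherever the rule appears."""
    tables = {}

    def table(rule):
        if rule not in tables:
            counts = Counter()
            # Weight = how many times a symbol occurs in this rule's body.
            for symbol, weight in Counter(rules[rule]).items():
                if symbol in rules:
                    for word, n in table(symbol).items():
                        counts[word] += n * weight   # e.g. <a, 1×2> for R2 inside R1
                else:
                    counts[symbol] += weight
            tables[rule] = dict(counts)
        return tables[rule]

    table(root)
    return tables


if __name__ == "__main__":
    for rule, counts in sorted(word_tables(RULES).items(), reverse=True):
        print(rule, counts)
```

Each rule is visited once no matter how often it is referenced, which is the "compute once" benefit on the time side.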
Challenges
• Unit sensitivity: how to organize data.
• Parallelism barriers: how to perform parallelism on large datasets.
• Order sensitivity: how to accommodate the order for applications that are sensitive to the order.
• Data attributes: how to utilize the attributes of datasets.
• Reuse of results across nodes
• Overhead in saving and propagating
Outline
• Introduction
  • Motivation & Example
  • Compression-Based Direct Processing
  • Challenges
• Guidelines and Techniques
  • Solution Overview
  • Guidelines
• Evaluation
  • Benchmarks
  • Results
• Conclusion
Solution Overview
SOLUTION TECHNIQUES
• Adaptive traversal order and information to propagate
• Compression-time indexing
• Double compression
• Load-time coarsening
• Two-level table with depth-first traversal
• Coarse-grained parallel algorithm and automatic data partition
• Double-layered bit vector for footprint minimization
CHALLENGES
• Unit sensitivity
• Parallelism barriers
• Order sensitivity
• Data attributes
• Reuse of results across nodes
• Overhead in saving and propagating
[Figure: arrows link each solution technique to the challenges it addresses]
Data Attributes
Problem: How to utilize the attributes of datasets
• Guideline I: Minimize the footprint size
• Guideline II: Traversal order is essential for efficiency
• The best traversal order may depend on the data attributes of the input.
[Figure: decision tree for the traversal order. It tests Files # (threshold 800) and Average Size (threshold 2860) and selects among postorder, preorder using 2levBitMap, and preorder using regular BitMap]
Parallelism Barriers
Problem: How to perform parallelism on large datasets
• Guideline III: A coarse-grained distributed implementation is preferred
[Figure: input files f1, f2, f3, f4, f5, f6, … are grouped into partitions. RDD 1 holds Partition 1 (f1, f2, f3) and Partition 2 (f4); RDD 2 holds Partition 1 (f5:part1) and Partition 2 (f5:part2); RDD 3 holds Partition 1 (f5:part3) and Partition 2 (f6). Each RDD is passed through pipe() to a C++ program, and SparkContext collects the final results]
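The coarse-grained idea can be illustrated without Spark: each partition is processed as an independent unit and only the partial results are merged. In the sketch below a thread pool stands in for the Spark tasks and piped C++ programs, and every partition is assumed to have been compressed separately into its own grammar rooted at `R0`; the names `count_partition` and `count_all` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_partition(rules, root="R0"):
    """Word count for one partition, computed directly on its own grammar."""
    memo = {}

    def table(rule):
        if rule not in memo:
            counts = Counter()
            for symbol, weight in Counter(rules[rule]).items():
                if symbol in rules:
                    for word, n in table(symbol).items():
                        counts[word] += n * weight
                else:
                    counts[symbol] += weight
            memo[rule] = counts
        return memo[rule]

    return table(root)


def count_all(partitions, workers=4):
    """Process partitions independently, then merge the partial word counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_partition, partitions))
    total = Counter()
    for partial in partials:
        total.update(partial)   # adds counts word by word
    return dict(total)


if __name__ == "__main__":
    partitions = [
        {"R0": ["R1", "R1"], "R1": ["a", "b"]},   # expands to: a b a b
        {"R0": ["a", "c"]},                       # expands to: a c
    ]
    print(count_all(partitions))
```

Workers never exchange intermediate state; the only communication is the final merge, which is what makes the scheme coarse-grained.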
Order Sensitivity (detailed in paper)
Problem: Some applications are sensitive to the order
• Guideline IV: Use depth-first traversal and a two-level table design
[Figure: example CFG whose root node (R1 R2 w1 spt1 R2 w2 spt2 R3 … w4 …) spans file0, file1, and file2, with rules R1 to R5; each rule has a Local Sequence Table and the root has a Global Sequence Table]
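Why depth-first traversal suits order-sensitive tasks can be seen in a small sketch: visiting rule bodies depth-first, left to right, yields the words in their original order without decompressing the whole text first. The example counts two-word sequences on the grammar from the earlier slides. It deliberately omits the local and global sequence tables, so repeated rules are walked again rather than reused; the names `words_in_order` and `sequence_count` are assumptions.

```python
from collections import Counter, deque

# Rules copied from the earlier example slide; layout is an assumption.
RULES = {
    "R0": ["R1", "R1", "R2", "a"],
    "R1": ["R2", "c", "R2", "d"],
    "R2": ["a", "b"],
}


def words_in_order(rules, root="R0"):
    """Yield words in original order using an iterative depth-first traversal."""
    stack = [iter(rules[root])]
    while stack:
        try:
            symbol = next(stack[-1])
        except StopIteration:
            stack.pop()                       # finished this rule body
            continue
        if symbol in rules:
            stack.append(iter(rules[symbol]))  # descend into the sub-rule
        else:
            yield symbol


def sequence_count(rules, n=2, root="R0"):
    """Count every run of n consecutive words."""
    window = deque(maxlen=n)
    counts = Counter()
    for word in words_in_order(rules, root):
        window.append(word)
        if len(window) == n:
            counts[tuple(window)] += 1
    return dict(counts)


if __name__ == "__main__":
    print(sequence_count(RULES))
```

Sequences that straddle a rule boundary, such as (b, c) across R2 and the following word in R1, are handled naturally because the sliding window is kept across rule transitions.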
Unit Sensitivity (detailed in paper)
Problem: How to organize data
• Guideline V: Use a double-layered bitmap if unit information needs to be passed across the CFG
[Figure: double-layered bitmap. Level 1 holds a bit array (bit0, bit1, bit2, bit3, …) and a pointer array (P0, P1, NULL, P3, …); Level 2 holds bit vectors of N bits each (bit0, bit1, …, bit N-1)]
[Figure: example CFG whose root node (R1 R2 w1 spt1 R2 w2 spt2 R3 … w4 …) spans file0, file1, and file2, with rules R1 to R5 that reference further rules R6, R7, and R8]
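A double-layered bitmap along the lines of the figure can be sketched as follows. Level 1 keeps one bit and one pointer per chunk; a level-2 bit vector of N bits is allocated only when some bit in that chunk is set, so sparse bitmaps stay small. The class name `TwoLevelBitmap`, its method names, and the default chunk size of 64 bits are assumptions for illustration, not the layout used in CompressDirect.

```python
class TwoLevelBitmap:
    """Bitmap whose level-2 chunks are allocated lazily."""

    def __init__(self, size, chunk_bits=64):
        self.size = size
        self.chunk_bits = chunk_bits                  # N bits per level-2 vector
        n_chunks = (size + chunk_bits - 1) // chunk_bits
        self.level1_bits = [False] * n_chunks         # level 1: bit array
        self.level1_ptrs = [None] * n_chunks          # level 1: pointer array (None = NULL)

    def set(self, i):
        if not 0 <= i < self.size:
            raise IndexError(i)
        chunk, offset = divmod(i, self.chunk_bits)
        if self.level1_ptrs[chunk] is None:
            # Allocate the level-2 vector on first use.
            self.level1_ptrs[chunk] = bytearray((self.chunk_bits + 7) // 8)
            self.level1_bits[chunk] = True
        self.level1_ptrs[chunk][offset // 8] |= 1 << (offset % 8)

    def get(self, i):
        if not 0 <= i < self.size:
            raise IndexError(i)
        chunk, offset = divmod(i, self.chunk_bits)
        if not self.level1_bits[chunk]:               # whole chunk is empty
            return False
        return bool(self.level1_ptrs[chunk][offset // 8] & (1 << (offset % 8)))

    def allocated_chunks(self):
        return sum(self.level1_bits)


if __name__ == "__main__":
    files_with_word = TwoLevelBitmap(256)   # e.g. one bit per file (unit)
    files_with_word.set(5)
    files_with_word.set(200)
    print(files_with_word.get(5), files_with_word.get(6))
    print(files_with_word.allocated_chunks())
```

In the usage example only two of the four level-2 vectors are ever allocated, which is the footprint saving the guideline aims for when most units do not contain a given word.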
Short Summary for Six Guidelines
• Data Attributes Challenge
  • Guideline II
• Parallelism Barriers
  • Guideline III
• Order Sensitivity
  • Guideline IV
• Unit Sensitivity
  • Guideline V
• General insights and common techniques
  • Guideline I and Guideline VI
Outline
• Introduction
  • Motivation & Example
  • Compression-Based Direct Processing
  • Challenges
• Guidelines and Techniques
  • Solution Overview
  • Guidelines
• Evaluation
  • Benchmarks
  • Results
• Conclusion
Benchmarks
• Six benchmarks
  • Word Count, Inverted Index, Sequence Count, Ranked Inverted Index, Sort, Term Vector
• Five datasets
  • 580 MB ~ 300 GB
• Two platforms
  • Single node
  • Spark cluster (10 nodes on Amazon EC2)
Time Savings
• CompressDirect yields a 2X speedup, on average, over direct processing.
[Figure: speedup (y-axis, 0.00 to 5.00) of gzip and CompressDirect on wordCount, sort, invertedIndex, termVector, sequenceCount, rankedInvertedIndex (labeled rakdInvtdIdx), and AVG, for the 50 GB, 150 GB, and 300 GB datasets]
Space Savings
• CompressDirect achieves an 11.8X compression ratio, even higher than gzip does.
compression ratio = original size / compressed data size
[Figure: compression ratio (y-axis, 0 to 16) of direct-processing, gzip, and CompressDirect on the 50 GB, 150 GB, 300 GB, 580 MB, and 2.1 GB datasets, plus AVG]
Conclusion
• Our method, compression-based direct processing.
• How the concept can be materialized on Sequitur.
• Major challenges.
• Guidelines.
• Our library, CompressDirect, to help further ease the required development efforts.
Thanks!
• Any questions?