
Efficient Document Analytics on Compressed Data: Method, Challenges, Algorithms, Insights

Feng Zhang†⋄, Jidong Zhai⋄, Xipeng Shen#, Onur Mutlu⋆, Wenguang Chen⋄

†Renmin University of China  ⋄Tsinghua University  #North Carolina State University  ⋆ETH Zurich

Motivation

• Every day, 2.5 quintillion bytes of data are created; 90% of the data in the world today has been created in the last two years alone [1].

[1] What is Big Data? https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html

Outline

• Introduction
  • Motivation & Example
  • Compression-Based Direct Processing
  • Challenges
• Guidelines and Techniques
  • Solution Overview
  • Guidelines
• Evaluation
  • Benchmarks
  • Results
• Conclusion

Motivation

How to perform efficient document analytics when data are extremely large?

• Challenge 1. SPACE: large space requirement

• Challenge 2. TIME: long processing time

Motivation

• Observation: using a hash table to check for redundant content in the Wikipedia dataset.

[Figure: data redundancy ratio (%), from 0 to 100, versus chunk size (512B, 1KB, 4KB) for the 50 GB and 150 GB datasets.]

Our Idea

• Compression-based direct processing
• The Sequitur algorithm meets our requirement.

(a) Original data:
a b c a b d a b c a b d a b a

(b) Sequitur compressed data (rules):
R0 → R1 R1 R2 a
R1 → R2 c R2 d
R2 → a b

(c) DAG representation: R0 has edges to R1, R2, and a; R1 has edges to R2, c, and d; R2 has edges to a and b.

(d) Numerical representation:
a: 0, b: 1, c: 2, d: 3, R0: 4, R1: 5, R2: 6

(e) Compressed data in numerical IDs:
4 → 5 5 6 0
5 → 6 2 6 3
6 → 0 1
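As a sanity check, the rules in (b) can be expanded back into the original string in (a). A minimal sketch (the helper name `expand` is illustrative, not part of the paper's library):

```python
# Sequitur-style grammar from the slide: nonterminals map to their
# right-hand sides; anything not in RULES is a terminal symbol.
RULES = {
    "R0": ["R1", "R1", "R2", "a"],
    "R1": ["R2", "c", "R2", "d"],
    "R2": ["a", "b"],
}

def expand(symbol, rules=RULES):
    """Recursively expand a symbol back into the original terminal sequence."""
    if symbol not in rules:          # terminal symbol
        return [symbol]
    out = []
    for s in rules[symbol]:
        out.extend(expand(s, rules))
    return out

# Expanding the root rule reproduces the uncompressed input.
print(" ".join(expand("R0")))  # a b c a b d a b c a b d a b a
```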

Double Benefits

R0 → R1 R1 R2 a
R1 → R2 c R2 d
R2 → a b

• Space (Challenge 1): a substring that appears more than once (e.g., "a b" via R2) is stored only once.
• Time (Challenge 2): such a substring also needs to be computed only once.

Optimization

• We can make the representation more compact: some applications do not need to keep the sequence.

Instead of storing each rule's right-hand side as an ordered sequence, store each distinct element once together with a weight (its number of occurrences in that rule):

R0: R1 ×2, R2 ×1, a ×1
R1: R2 ×2, c ×1, d ×1
R2: a ×1, b ×1

• Within each rule, we may remove the sequence information too; this further saves storage space and computation time.

Example

• Word Count

Information propagates bottom-up through the CFG relation, building one word table of pairs <w, i> per rule:

Step 1 (R2 → a b):
<a,1>, <b,1>

Step 2 (R1 → R2 c R2 d), where R2 appears twice:
<a, 1×2> = <a,2>; <b, 1×2> = <b,2>; <c,1>; <d,1>

Step 3 (R0 → R1 R1 R2 a), where R1 appears twice and R2 once:
<a, 2×2 + 1 + 1> = <a,6>; <b, 2×2 + 1> = <b,5>; <c, 1×2> = <c,2>; <d, 1×2> = <d,2>

Final word table: <a,6>, <b,5>, <c,2>, <d,2>
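The three propagation steps amount to a memoized bottom-up traversal of the grammar, in which each shared rule is counted once and reused. A minimal sketch (the function name `word_table` is illustrative, not the library's API):

```python
from collections import Counter

# Sequitur rules from the slide; tuples are the rules' right-hand sides.
RULES = {
    "R0": ("R1", "R1", "R2", "a"),
    "R1": ("R2", "c", "R2", "d"),
    "R2": ("a", "b"),
}

def word_table(symbol, memo=None):
    """Bottom-up word count over the CFG: each rule's table is computed
    once and reused wherever the rule appears (Step 1: R2, Step 2: R1,
    Step 3: R0 in the slide's example)."""
    if memo is None:
        memo = {}
    if symbol not in RULES:            # terminal word: counts itself once
        return Counter([symbol])
    if symbol not in memo:             # shared rules are computed only once
        table = Counter()
        for s in RULES[symbol]:
            table.update(word_table(s, memo))
        memo[symbol] = table
    return memo[symbol]

print(dict(word_table("R0")))  # {'a': 6, 'b': 5, 'c': 2, 'd': 2}
```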

Challenges

• Unit sensitivity: how to organize data.
• Parallelism barriers: how to parallelize processing on large datasets.
• Order sensitivity: how to accommodate applications that are sensitive to order.
• Data attributes: how to utilize the attributes of datasets.
• Reuse of results across nodes.
• Overhead in saving and propagating results.

Outline

• Introduction
  • Motivation & Example
  • Compression-Based Direct Processing
  • Challenges
• Guidelines and Techniques
  • Solution Overview
  • Guidelines
• Evaluation
  • Benchmarks
  • Results
• Conclusion

Solution Overview

Solution techniques:
• Adaptive traversal order and information to propagate
• Compression-time indexing
• Double compression
• Load-time coarsening
• Two-level table with depth-first traversal
• Coarse-grained parallel algorithm and automatic data partition
• Double-layered bit vector for footprint minimization

These techniques address the challenges: unit sensitivity, parallelism barriers, order sensitivity, data attributes, reuse of results across nodes, and overhead in saving and propagating results.

Data Attributes

Problem: how to utilize the attributes of datasets.

• Guideline I: minimize the footprint size.
• Guideline II: traversal order is essential for efficiency.

The best traversal order depends on the data attributes of the input. A decision tree over the average size and the number of files (with thresholds 2860 and 800 on the slide) selects among postorder traversal, preorder traversal using the double-layered bitmap (2levBitMap), and preorder traversal using a regular bitmap.
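One plausible reading of the slide's decision tree, as a sketch (the function name is illustrative, the branch structure is reconstructed from the slide, and the units of the 2860 and 800 thresholds are defined in the paper, not here):

```python
def choose_traversal(avg_size, num_files):
    """Pick a traversal order from the input's data attributes.
    Thresholds (2860, 800) come from the slide; the branch structure
    here is one plausible reading of its decision tree."""
    if avg_size <= 2860:
        return "postorder"                 # small average size
    if num_files <= 800:
        return "preorder + 2levBitMap"     # double-layered bitmap
    return "preorder + regular bitmap"
```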

Parallelism Barriers

Problem: how to parallelize processing on large datasets.

• Guideline III: a coarse-grained distributed implementation is preferred.

[Figure: input files f1, f2, f3, f4, f5, f6, … are grouped into coarse partitions (the large file f5 is split into f5:part1, f5:part2, f5:part3), forming RDD 1, RDD 2, and RDD 3; each partition is passed through pipe() to a C++ program, and the SparkContext gathers the final results.]
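The partitioning step of this coarse-grained scheme can be sketched as a greedy stand-in (the function name and the balancing policy are illustrative, not the paper's API):

```python
def partition_files(file_sizes, num_partitions):
    """Greedy coarse-grained partitioning: assign each whole file
    (largest first) to the currently lightest partition, so each worker
    receives a balanced, coarse chunk of the input."""
    parts = [[] for _ in range(num_partitions)]
    loads = [0] * num_partitions
    for name, size in sorted(file_sizes.items(), key=lambda kv: -kv[1]):
        i = loads.index(min(loads))   # lightest partition so far
        parts[i].append(name)
        loads[i] += size
    return parts

print(partition_files({"f1": 5, "f2": 3, "f3": 2}, 2))  # [['f1'], ['f2', 'f3']]
```

In the slide's setup, each resulting partition becomes a Spark RDD partition and is fed to the native C++ program through pipe(); a very large file (like f5 in the figure) is additionally split into parts first.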

Order Sensitivity (detailed in paper)

Problem: some applications are sensitive to order.

• Guideline IV: depth-first traversal and a two-level table design.

[Figure: the root node holds the ordered sequence "R1 R2 w1 spt1 R2 w2 spt2 R3 … w4 …", where the spt markers separate file0, file1, and file2; subrules R1–R5 contain further rules and words (w5–w9). Each file keeps a local sequence table, linked together by a global sequence table.]

Unit Sensitivity (detailed in paper)

Problem: how to organize data.

• Guideline V: use a double-layered bitmap if unit information needs to be passed across the CFG.

[Figure: Level 1 is a bit array (bit0, bit1, bit2, bit3, …) with a parallel pointer array (P0, P1, NULL, P3, …); each non-NULL pointer references a Level-2 bit array of N bits (bit0 … bit N-1). The two-level file/rule structure from the order-sensitivity figure is annotated with this bitmap.]
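Guideline V's structure can be sketched as a small class that allocates Level-2 bit arrays lazily, so segments never touched stay as NULL pointers (the class name and default segment size are illustrative):

```python
class TwoLevelBitmap:
    """Double-layered bit vector: Level 1 is a pointer array over fixed-size
    segments; a Level-2 bit array of N bits is allocated only when its segment
    is first touched, minimizing the footprint for sparse unit sets."""

    def __init__(self, universe, segment_bits=64):
        self.n = segment_bits
        # One pointer slot per segment; NULL (None) until the segment is used.
        self.level1 = [None] * ((universe + segment_bits - 1) // segment_bits)

    def set(self, i):
        seg, off = divmod(i, self.n)
        if self.level1[seg] is None:          # lazy Level-2 allocation
            self.level1[seg] = bytearray((self.n + 7) // 8)
        self.level1[seg][off // 8] |= 1 << (off % 8)

    def test(self, i):
        seg, off = divmod(i, self.n)
        row = self.level1[seg]
        return row is not None and bool(row[off // 8] & (1 << (off % 8)))
```

For a sparse set of units, only the touched segments pay for Level-2 storage; the rest of Level 1 remains an array of NULL pointers.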

Short Summary of the Six Guidelines

• Data attributes challenge: Guideline II
• Parallelism barriers: Guideline III
• Order sensitivity: Guideline IV
• Unit sensitivity: Guideline V
• General insights and common techniques: Guideline I and Guideline VI

Outline

• Introduction
  • Motivation & Example
  • Compression-Based Direct Processing
  • Challenges
• Guidelines and Techniques
  • Solution Overview
  • Guidelines
• Evaluation
  • Benchmarks
  • Results
• Conclusion

Benchmarks

• Six benchmarks: Word Count, Inverted Index, Sequence Count, Ranked Inverted Index, Sort, Term Vector
• Five datasets: 580 MB ~ 300 GB
• Two platforms:
  • Single node
  • Spark cluster (10 nodes on Amazon EC2)

Time Savings

• CompressDirect yields a 2X speedup, on average, over direct processing.

[Figure: speedup (y-axis, 0.00–5.00) of gzip and CompressDirect over direct processing for wordCount, sort, invertedIndex, termVector, sequenceCount, and rankedInvertedIndex, plus the average (AVG), on the 50 GB, 150 GB, and 300 GB datasets.]

Space Savings

• CompressDirect achieves an 11.8X compression ratio, even higher than gzip achieves.

compression ratio = original size / compressed data size

[Figure: compression ratios (y-axis, 0–16) of direct-processing, gzip, and CompressDirect on the 50 GB, 150 GB, 300 GB, 580 MB, and 2.1 GB datasets, plus the average (AVG).]

Conclusion

• Our method: compression-based direct processing.
• How the concept can be materialized on Sequitur.
• Major challenges.
• Guidelines.
• Our library, CompressDirect, further eases the required development effort.

Thanks!

• Any questions?
