BigDataBench Tutorial
MICRO 2014, Cambridge, UK

Jianfeng Zhan, Zhen Jia, and Gang Lu
Institute of Computing Technology, Chinese Academy of Sciences and University of Chinese Academy of Sciences
http://prof.ict.ac.cn/BigDataBench
Acknowledgements

BigDataBench contributors: Lei Wang, Chunjie Luo, Zhen Jia, Wanling Gao, Dr. Rui Han, Dr. Yuqing Zhu, Qiang Yang, Xinlong Lin, Jingwei Li, Wei Zhu, Shujie Zhang, Dr. Chuliang Weng, Dr. Yongqiang He, Xiaona Li, Bizhu Qiu, Kent Zhan, and Zijian Ming.
Acknowledgements (cont.)

Many thanks to Prof. Jason Mars for inviting us to give this tutorial at MICRO'14.
BigDataBench Tutorial Program (1)

9:00-9:40   Jianfeng Zhan — What is BigDataBench? BigDataBench benchmarking methodology
9:40-10:00  Gang Lu — BigDataBench data sets and workloads
10:00-10:30 Coffee break
BigDataBench Tutorial Program (2)

10:30-11:00 Gang Lu — How to use BigDataBench data sets and workloads? How to generate large-scale data sets? Multi-tenancy version of BigDataBench
11:00-12:00 Zhen Jia — BigDataBench subsetting; how to use the simulator versions of BigDataBench?
BigDataBench?
Handbook of BigDataBench

Please feel free to download and distribute this handbook.
http://prof.ict.ac.cn/BigDataBench_micro_14/
Draft version, 93 pages.
BigDataBench handbook: 1st part

Section 1  Summary of BigDataBench 3.1
Section 2  Benchmarking methodology
Section 3  BigDataBench specification
Section 4  BigDataBench implementations
Section 5  BigDataBench subsetting
BigDataBench handbook: 2nd part

Section 6  BigDataBench simulator version
Section 7  Multi-tenancy version of BigDataBench
Section 8  User manual
Section 9  BigDataBench users
Section 10 Q&A
Outline

What is BigDataBench?
BigDataBench benchmarking methodology
Why Big Data Benchmarking?

Measuring big data systems and architectures quantitatively.
What is BigDataBench?

An open source big data benchmarking project.
• http://prof.ict.ac.cn/BigDataBench
• Search Google using "BigDataBench"
What is BigDataBench (cont.)?

BDGS (Big Data Generator Suite) for scalable data.

14 real-world data sets: Facebook Social Network, ImageNet, Wikipedia Entries, E-commerce Transaction, English broadcasting audio, Amazon Movie Reviews, ProfSearch Resumes, DVD Input Streams, Google Web Graph, SoGou Data, Image scene, MNIST, Genome sequence data, Assembly of the human genome.

33 workloads across five application domains — Search Engine, Social Network, E-commerce, Multimedia, Bioinformatics — on multiple software stacks: Impala, Shark, NoSQL, MPI, DataMPI, Hadoop RDMA.
BigDataBench evolution

BigDataBench 1.0 (2013.7): search engine, 6 workloads. Alongside CloudRank 1.0 (mixed data analytics workloads) and DCBench 1.0 (11 data analytics workloads).
BigDataBench 2.0 (2013.12): typical Internet service domains, an architectural perspective; 19 workloads & data generation tools.
BigDataBench 3.0 (2014.4): typical Internet service domains; 32 workloads with diverse implementations. A multidisciplinary effort.
BigDataBench 3.1 (2014.12): 5 application domains, 14 data sets and 33 workloads; same specifications, diverse implementations; multi-tenancy version; BigDataBench subset and simulator version.
Why BigDataBench?

| Benchmark | Specification | Application domains | Workload types | Workloads | Scalable data sets (from real data) | Multiple implementations | Multi-tenancy | Subsets | Simulator version |
| BigDataBench | Y | Five | Four | 33 | 8 | Y | Y | Y | Y |
| BigBench | Y | One | Three | 10 | 3 | N | N | N | N |
| CloudSuite | N | N/A | Two | 8 | 3 | N | N | N | Y |
| HiBench | N | N/A | Two | 10 | 3 | N | N | N | N |
| CALDA | Y | N/A | One | 5 | 1 | Y | N | N | N |
| YCSB | Y | N/A | One | 6 | N/A | Y | N | N | N |
| LinkBench | Y | One | One | 10 | 1 | Y | N | N | N |
| AMPBenchmarks | N | N/A | One | 4 | 1 | Y | N | N | N |
Observations from CloudSuite

Characteristics of big data workloads ("Clearing the Clouds", ASPLOS 2012, Best Paper):
• High L1I miss rate → frontend inefficiencies
• Low ILP → core inefficiencies
• LLC is good but overprovisioned → data-access inefficiencies
• Little off-chip bandwidth and MLP → bandwidth inefficiencies
System Behaviors of BigDataBench

[Figure: CPU utilization and weighted disk I/O time ratio across the BigDataBench workloads]

Diversified system-level behaviors.
System Behaviors of BigDataBench

[Figure: CPU utilization and weighted disk I/O time ratio across the BigDataBench workloads]

Diversified system-level behaviors:
• High CPU utilization & little I/O time
• Relatively low CPU utilization and lots of I/O time
• Medium CPU utilization and I/O
Workloads Classification of BigDataBench

Finding from system behaviors: system behaviors vary across different workloads, and the workloads can be divided into 3 categories.
IPC of BigDataBench vs. other benchmarks

[Figure: IPC of each BigDataBench workload compared with TPC-C, CloudSuite, HPCC, PARSEC, SPECfp, and SPECint]

The average IPC of the big data workloads is larger than that of CloudSuite, SPECFP and SPECINT, about the same as PARSEC, and slightly smaller than HPCC.
The average IPC of BigDataBench is 1.3 times that of CloudSuite.
Some workloads have high IPC (M-Kmeans, S-TPC-DS-Query8).
Instruction Mix of BigDataBench vs. other benchmarks

For big data workloads:
• More branch instructions — the percentage is 20% (1.5 times larger than the other benchmarks), except for TPC-C.
• The ratio of integer instructions to FP instructions is very high — 73 on average.
Cache Behaviors of BigDataBench vs. other benchmarks

L1I MPKI:
• Larger than traditional benchmarks, but lower than CloudSuite's (12 vs. 31).
• Different among big data workloads: CPU-intensive (8), I/O-intensive (22), and hybrid (9) workloads.
• MPI workloads have fewer instruction cache misses — only 3.4 on average.
Cache Behaviors of BigDataBench

L2 cache: the I/O-intensive workloads suffer more L2 MPKI.
L3 cache: the average L3 MPKI of the big data workloads is smaller than that of all the other workloads.
The underlying software stacks impact data locality: MPI workloads have better data locality and fewer cache misses.
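The MPKI figures above are misses per kilo-instruction; as a quick sanity check, here is the arithmetic as a one-line helper (the raw counts in the example are invented for illustration):

```python
def mpki(misses, instructions):
    """Misses per kilo-instruction: misses normalized to 1000 retired instructions."""
    return misses / instructions * 1000

# e.g. a run retiring 50 billion instructions with 170 million L1I misses:
print(round(mpki(170e6, 50e9), 2))  # 3.4
```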
Our observations from BigDataBench

Unique characteristics: more branch instructions & a higher ratio of integer to FP instructions.
Different behaviors among big data workloads: disparity of ILP and memory access behaviors.
• Several workloads achieve higher IPC.
• Several workloads achieve higher off-chip bandwidth.
CloudSuite is a subclass of big data: high front-end stall is not a unique characteristic of big data workloads; it is related to the software stacks.
BigDataBench Publications

• BigDataBench: a Big Data Benchmark Suite from Internet Services. 20th IEEE International Symposium on High Performance Computer Architecture (HPCA 2014).
• Characterizing and Subsetting Big Data Workloads. Zhen Jia, Jianfeng Zhan, Lei Wang, Rui Han, Sally A. McKee, Qiang Yang, Chunjie Luo, and Jingwei Li. 2014 IEEE International Symposium on Workload Characterization (IISWC 2014).
• Characterizing data analysis workloads in data centers. 2013 IEEE International Symposium on Workload Characterization (IISWC 2013). Best Paper Award.
• BigOP: generating comprehensive big data workloads as a benchmarking framework. 19th International Conference on Database Systems for Advanced Applications (DASFAA 2014).
• BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking. Fourth Workshop on Big Data Benchmarking (WBDB 2014).
BigDataBench users

More than 20 groups have published papers using BigDataBench.
http://prof.ict.ac.cn/BigDataBench/users/
Outline

What is BigDataBench?
BigDataBench benchmarking methodology
Five Steps

1. Investigate important application domains.
2. Identify typical workloads and data sets.
3. Propose the benchmark specification.
4. Provide diverse implementations.
5. Provide a multi-tenancy version or subset.
BigDataBench Methodology

Application Domains 1…N → data models of different types & semantics; data operations & workload patterns → Benchmark specifications 1…N → real-world data sets and data generation tools → workloads with diverse implementations → multi-tenancy version (mix with different percentages) and BigDataBench subset (reduce benchmarking cost).
Investigate Application Domains

What application domains should we pay attention to? Internet services.
Internet Services

Taking up 80% of Internet services according to page views and daily visitors (top 20 websites, http://www.alexa.com/topsites/global;0):
• Search Engine 40%
• Social Network 25%
• Electronic Commerce 15%
• Media Streaming 5%
• Others 15%
Multimedia Data
The Explosive Growth of Multimedia Data

[Figure: every minute — new videos on YouTube, new photos on Flickr, music streaming on Pandora, voice calls on Skype, and video feeds from surveillance cameras; most data growth is images, videos, documents, …]

http://www.oldcolony.us/wp-content/uploads/2014/11/whatisbigdata-DKB-v2.pdf
The Explosive Growth of Human Genome Data

http://www.osehra.org/sites/default/files/genomics_and_big_data_1_0.pdf
Application Domains We Choose

Internet services: Search Engine, Social Network, E-commerce.
Emerging but important: Multimedia, Bioinformatics.
BigDataBench Methodology

Application Domains 1…N → data models of different types & semantics; data operations & workload patterns.
Success story: the relational model of data

E. F. Codd, "A Relational Model of Data for Large Shared Data Banks." Communications of the ACM, vol. 13, no. 6, 1970.
The set concept has a general mathematical meaning:
• A general representation of data.
• The basis of relational algebra (the theoretical foundation of databases).
• 5 basic operations: Select, Project, Product, Union, Difference.
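The five basic operations above can be sketched over relations modeled as sets of tuples; a minimal illustration (the schema and sample data are invented):

```python
# Relational algebra's five basic operations over relations modeled as
# sets of tuples; a schema (list of attribute names) accompanies each relation.

def select(rel, pred):
    """SELECT: keep the rows satisfying a predicate."""
    return {row for row in rel if pred(row)}

def project(rel, schema, attrs):
    """PROJECT: keep only the named attribute columns."""
    idx = [schema.index(a) for a in attrs]
    return {tuple(row[i] for i in idx) for row in rel}

def product(r, s):
    """PRODUCT: Cartesian product of two relations."""
    return {a + b for a in r for b in s}

def union(r, s):
    """UNION: rows appearing in either relation (schemas must match)."""
    return r | s

def difference(r, s):
    """DIFFERENCE: rows in r but not in s."""
    return r - s

# Usage with an invented employees(name, dept) relation:
schema = ["name", "dept"]
emp = {("alice", "db"), ("bob", "os")}
print(select(emp, lambda row: row[1] == "db"))  # {('alice', 'db')}
print(project(emp, schema, ["dept"]))           # {('db',), ('os',)}
```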
Success story: parallel computing

By a multidisciplinary group of well-known researchers, e.g., Jim Gray, Michael Jordan, David A. Patterson.
Operations & patterns abstracted from 13 representative parallel computation patterns.
Parallel computation is an inherent demand of big data processing (volume & complexity).
Primitive Operations & Patterns in Big Data

3 categories of operations; 11 basic operations; 3 processing patterns.
BigOP: generating comprehensive big data workloads as a benchmarking framework. DASFAA 2014.
Some Examples (not exhaustive)

BigOP: generating comprehensive big data workloads as a benchmarking framework. 19th International Conference on Database Systems for Advanced Applications (DASFAA 2014).
Image Search

Perceptual Hash Algorithm:
Input image → sampling → basic information of the image; SIFT → image features; hash → image fingerprint → database; set operation and sort → similar images → output.
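A minimal sketch of a perceptual-hash style fingerprint (an average-hash variant, assumed here for illustration; not necessarily the exact algorithm behind this workload). The image matrices are invented:

```python
# Average hash: fingerprint a grayscale image (2-D list of pixel values)
# with one bit per pixel, set when the pixel exceeds the image mean.
# Near-duplicate images then collide under Hamming distance.

def average_hash(pixels):
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)

def hamming(a, b):
    """Similarity metric: number of differing fingerprint bits."""
    return sum(x != y for x, y in zip(a, b))

img1 = [[10, 200], [220, 30]]
img2 = [[12, 198], [210, 40]]   # slightly perturbed copy of img1
img3 = [[200, 10], [30, 220]]   # inverted layout
h1, h2, h3 = map(average_hash, (img1, img2, img3))
print(hamming(h1, h2))  # 0: the near-duplicates collide
print(hamming(h1, h3))  # 4: the dissimilar image is far away
```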
Feature Extraction — SIFT

Gaussian filter ⊗ input image (convolution) → image scale space; sampling → image pyramid; matrix subtraction → DoG images; sort → key points of the image; Gaussian window → count feature vectors → output.
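The scale-space/DoG step of this pipeline can be sketched in pure Python: blur the image at two scales with a separable Gaussian and subtract. This is an illustrative simplification, not the tutorial's implementation:

```python
# Difference-of-Gaussians (DoG): the band-pass response SIFT uses to
# locate key points. Separable 1-D kernels, clamped image borders.
import math

def gauss_kernel(sigma, radius=2):
    k = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]          # normalized to sum to 1

def convolve_rows(img, k):
    r = len(k) // 2
    return [[sum(k[j + r] * row[min(max(x + j, 0), len(row) - 1)]
                 for j in range(-r, r + 1))
             for x in range(len(row))] for row in img]

def blur(img, sigma):
    k = gauss_kernel(sigma)
    h = convolve_rows(img, k)                   # horizontal pass
    t = [list(c) for c in zip(*h)]              # transpose
    t = convolve_rows(t, k)                     # vertical pass
    return [list(c) for c in zip(*t)]

def dog(img, sigma1, sigma2):
    a, b = blur(img, sigma1), blur(img, sigma2)
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# A bright dot on a dark background gives a strong DoG response at the dot.
img = [[0.0] * 5 for _ in range(5)]
img[2][2] = 1.0
d = dog(img, 1.0, 1.6)
print(max(abs(v) for row in d for v in row) > 0)  # True
```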
Workload & Data Set in BigDataBench

Data model semantics: structured, semi-structured, unstructured — text, graph, table, multimedia.
Data operations: units of computation.
Workload patterns: different combinations of units of computation.
BigDataBench Methodology

Application Domains 1…N → data models of different types & semantics; data operations & workload patterns → Benchmark specifications 1…N.
Data management's tradition

Specification first. Functions of abstraction are units of computation that appear frequently in the application domain being benchmarked. They are expressed in a generic form that is independent of the underlying system implementation.
TPC-C examples
BigDataBench Methodology

Application Domains 1…N → data models of different types & semantics; data operations & workload patterns → Benchmark specifications 1…N → real-world data sets and data generation tools → workloads with diverse implementations.
Real-World Data Sets

14 real-world data sets.
Big Data Generation Tool — BDGS

Provides scalable data sets extracted from real-world data sets: text, graph, and table generators, each based on real data.
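BDGS's graph generation is commonly described as Kronecker-based, fitted to real graphs; under that assumption, a generic stochastic Kronecker edge sampler (with an invented, unfitted initiator matrix) might look like:

```python
# Stochastic Kronecker edge sampling: descend k levels of a 2x2 initiator
# matrix, choosing a quadrant at each level in proportion to its weight,
# to pick one (u, v) edge of a 2**k-node graph.
import random

QUADS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def kronecker_edge(initiator, k, rng):
    probs = [initiator[i][j] for i, j in QUADS]
    total = sum(probs)
    u = v = 0
    for _ in range(k):
        r = rng.random() * total
        for (i, j), p in zip(QUADS, probs):
            r -= p
            if r <= 0:
                break                      # quadrant chosen for this level
        u, v = 2 * u + i, 2 * v + j        # refine node indices
    return u, v

init = [[0.9, 0.5], [0.5, 0.1]]            # invented initiator matrix
rng = random.Random(7)
edges = {kronecker_edge(init, 10, rng) for _ in range(200)}
print(all(0 <= u < 2**10 and 0 <= v < 2**10 for u, v in edges))  # True
```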
Naïve Text Generator

Select words randomly from the vocabulary, following one multinomial distribution, to form a document.
Only models at the word level: all words are selected according to the same distribution.
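The naive scheme above, sketched (the vocabulary and probabilities are invented):

```python
# Naive text generation: every word of every document is drawn
# independently from a single multinomial over the vocabulary.
import random

vocab = ["big", "data", "cpu", "memory", "benchmarking", "learning"]
probs = [0.3, 0.3, 0.1, 0.1, 0.1, 0.1]   # one shared word distribution

def naive_document(n_words, rng=random):
    return [rng.choices(vocab, weights=probs)[0] for _ in range(n_words)]

doc = naive_document(20, random.Random(0))
print(len(doc), set(doc) <= set(vocab))  # 20 True
```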
Improved Text Generator

Select a topic randomly (topics follow a multinomial distribution), then select a word randomly from that topic's own multinomial distribution.
Models at the topic and word levels: words are drawn from the distribution under a particular topic, but topics are drawn from the same distribution, so every document has the same topic proportions.
Optimized Text Generator

Select topic proportions randomly (the topic distribution parameters follow a Dirichlet distribution), select a topic from those proportions, then select a word from that topic's multinomial distribution.
Models at the topic and word levels: words are drawn from the distribution under a particular topic, and topics are selected from per-document distributions whose parameters follow a Dirichlet distribution.
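The optimized scheme is essentially LDA's generative process; a sketch (the topic tables and hyperparameters are invented placeholders):

```python
# LDA-style generation: per-document topic proportions come from a
# Dirichlet prior, each word's topic from those proportions, and the
# word itself from that topic's multinomial.
import random

topics = {
    "systems": (["cpu", "memory", "architecture"], [0.5, 0.3, 0.2]),
    "ml":      (["learning", "mining", "data"],    [0.4, 0.3, 0.3]),
}
names = list(topics)

def dirichlet(alpha, rng):
    draws = [rng.gammavariate(a, 1.0) for a in alpha]  # standard Dirichlet sampler
    s = sum(draws)
    return [d / s for d in draws]

def generate_document(n_words, alpha=(0.5, 0.5), rng=random):
    theta = dirichlet(alpha, rng)                      # this document's topic mix
    words = []
    for _ in range(n_words):
        t = rng.choices(names, weights=theta)[0]       # pick a topic
        vocab, probs = topics[t]
        words.append(rng.choices(vocab, weights=probs)[0])  # pick a word
    return words

doc = generate_document(15, rng=random.Random(1))
print(len(doc))  # 15
```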
Workloads With Diverse Implementations

Software stacks: MapReduce, MPI, DataMPI, Spark.
BigDataBench Methodology

The full methodology: benchmark specifications per application domain, real-world data sets and data generation tools, workloads with diverse implementations, the multi-tenancy version (mix with different percentages), and the BigDataBench subset (reduce benchmarking cost).
Multi-tenancy version of BigDataBench

Scenarios of multiple tenants running heterogeneous applications in cloud datacenters:
• Latency-critical online services.
• Latency-insensitive offline batch applications.
Mining real-world workload traces (Google and Facebook) → workload matching using machine learning techniques → parametric workload generation tool → mixed workloads.
Benchmarking scenarios: mixed workloads in public clouds; data analytical workloads in private clouds.
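The "mix with different percentages" idea can be sketched as sampling a tenant schedule from target workload proportions (the proportions and workload names below are invented, not taken from the Google/Facebook traces mentioned above):

```python
# Multi-tenant mix: draw a sequence of jobs whose empirical proportions
# approach the configured per-workload percentages.
import random

mix = {"online-search": 0.5, "offline-sort": 0.3, "offline-pagerank": 0.2}

def sample_schedule(n_jobs, rng=random):
    names, weights = zip(*mix.items())
    return [rng.choices(names, weights=weights)[0] for _ in range(n_jobs)]

sched = sample_schedule(1000, random.Random(3))
# With 1000 draws, the observed share tracks the configured 50% closely.
print(abs(sched.count("online-search") / 1000 - 0.5) < 0.1)  # True
```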
BigDataBench Subset

Motivation: it is expensive to run all the benchmarks for system and architecture research.
• Workloads are multiplied by different implementations.
• BigDataBench 3.0 provides about 77 workloads.
Subsetting Methodology

• Identify a comprehensive set of workload characteristics from a specific perspective.
• Eliminate correlated data in those metrics.
• Map the high-dimension metrics to a low dimension.
• Use a clustering method to classify the workloads.
• Choose representative workloads from each category.

BigDataBench MICRO 2014
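The clustering and representative-selection steps above can be sketched with a tiny k-means over normalized workload metrics (the metric values here are invented, and the dimension-reduction step of the real methodology is omitted for brevity):

```python
# Cluster workloads by their characterization metrics, then keep the
# workload nearest each cluster center as the subset's representative.
import random

workloads = {                       # invented [CPU util, I/O wait, L1I MPKI] vectors
    "H-Grep":   [0.90, 0.10, 0.20],
    "H-Sort":   [0.30, 0.80, 0.60],
    "S-Kmeans": [0.85, 0.15, 0.25],
    "H-Read":   [0.35, 0.75, 0.55],
}

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: dist(p, centers[i]))].append(p)
        centers = [[sum(c) / len(g) for c in zip(*g)] if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

names, vecs = list(workloads), list(workloads.values())
centers = kmeans(vecs, k=2)
subset = {names[min(range(len(vecs)), key=lambda i: dist(vecs[i], c))]
          for c in centers}
print(len(subset))  # 2
```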