A Robust Partitioning Scheme for Ad-Hoc Query Workloads · Microsoft MIT QCRI Univ. Chicago. Today...
A Robust Partitioning Scheme for Ad-Hoc Query Workloads
Anil Shanbhag (MIT)
Joint work with Alekh Jindal (Microsoft), Sam Madden (MIT), Jorge Quiane (QCRI), Aaron J. Elmore (Univ. Chicago)
Today
Data collection is cheap => Lots of data!
Data Partitioning
Example query: find the average order size for all orders between Sept 10 and Sept 11, 2017
Data skipping: skip the data blocks that the query does not need
10% selectivity query => 10x faster if the data is partitioned on the selection predicate
Order date
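The data-skipping idea above can be sketched in a few lines. This is a minimal illustration, not the system's implementation: it assumes each block stores the minimum and maximum order date it contains, and the block ids, dates, and the `blocks_to_scan` helper are all made up for the example.

```python
from datetime import date

# Hypothetical per-block metadata: (block_id, min_orderdate, max_orderdate).
# The values are invented for illustration only.
BLOCKS = [
    ("b0", date(2017, 9, 1), date(2017, 9, 5)),
    ("b1", date(2017, 9, 6), date(2017, 9, 10)),
    ("b2", date(2017, 9, 11), date(2017, 9, 15)),
    ("b3", date(2017, 9, 16), date(2017, 9, 20)),
]


def blocks_to_scan(blocks, lo, hi):
    """Return ids of blocks whose [min, max] range overlaps the query range [lo, hi]."""
    return [bid for bid, bmin, bmax in blocks if bmin <= hi and bmax >= lo]


# Orders between Sept 10 and Sept 11, 2017: only two of the four blocks overlap.
needed = blocks_to_scan(BLOCKS, date(2017, 9, 10), date(2017, 9, 11))
print(needed)
```

Blocks whose date range does not overlap the predicate are never read, which is where the speedup for selective queries comes from.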
The Problem
Analytics: Ad-Hoc/Exploratory Analysis + Recurring Workloads
Focus of existing work: given a workload => return a partitioning layout
Problems:
1. Tedious to collect the workload
2. Workload may not be known upfront
3. Workload changes over time
How to get the benefits of partitioning in this case?
Our Approach
Do everything adaptively!
Two-step process:
1. Upfront, load the dataset partitioned
2. As users query, incrementally improve the partitioning of the data
In distributed storage systems like HDFS, files are broken into blocks (128 MB chunks)
A <= 5 and B <= 7
Upfront Partitioning
> Instead of partitioning by size, partition by attributes.
> The same number of blocks is created as in HDFS. Each block now has additional metadata.
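A minimal sketch of partitioning by attributes, assuming a binary partitioning tree whose inner nodes test `attribute <= cut` and whose leaves are block ids. The tree shape (root `A <= 5`, both children `B <= 7`) and the `Node` and `route` names are assumptions for illustration; the slides only show the two predicates.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Node:
    attr: str                    # attribute tested at this node
    cut: int                     # rows with row[attr] <= cut go left
    left: Union["Node", int]     # an int is a leaf, i.e. a block id
    right: Union["Node", int]


def route(tree, row):
    """Walk the tree from the root and return the block id for one row."""
    node = tree
    while isinstance(node, Node):
        node = node.left if row[node.attr] <= node.cut else node.right
    return node


# Assumed tree: A <= 5 at the root, B <= 7 on both sides, four blocks in total.
TREE = Node("A", 5, Node("B", 7, 0, 1), Node("B", 7, 2, 3))

rows = [{"A": 2, "B": 9}, {"A": 6, "B": 1}, {"A": 5, "B": 7}]
block_ids = [route(TREE, r) for r in rows]
print(block_ids)
```

Each block then covers a known range of `A` and `B`, which is the extra metadata a query can use to skip blocks.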
Adaptive Re-Partitioning
When a user submits a query, the optimizer tries to improve the partitioning by reorganizing the partitioning tree
Here, if queries ask for A <= 3 many times, replace the node B7 (B <= 7) with A3 (A <= 3)
Done on datasets that are O(1 TB) with ~8000-node partitioning trees.
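The effect of replacing B7 with A3 can be illustrated with a small lookup over two hand-built trees. This is a sketch under assumptions: the tree shape, the `Node` class, and the `leaves_for` helper are hypothetical, values are treated as integers, and the data movement needed to rewrite the affected blocks is not modelled.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Node:
    attr: str
    cut: int                     # rows with row[attr] <= cut go left
    left: Union["Node", int]     # an int is a leaf, i.e. a block id
    right: Union["Node", int]


def leaves_for(node, attr, upper):
    """Block ids that may hold rows satisfying row[attr] <= upper."""
    if not isinstance(node, Node):
        return [node]
    if node.attr != attr:
        # This node does not constrain attr, so both subtrees may match.
        return leaves_for(node.left, attr, upper) + leaves_for(node.right, attr, upper)
    result = leaves_for(node.left, attr, upper)
    if upper > node.cut:
        # Some qualifying values lie above the cut, so the right side is needed too.
        result += leaves_for(node.right, attr, upper)
    return result


before = Node("A", 5, Node("B", 7, 0, 1), 2)   # left child partitions on B
after = Node("A", 5, Node("A", 3, 0, 1), 2)    # B7 replaced by A3

blocks_before = leaves_for(before, "A", 3)
blocks_after = leaves_for(after, "A", 3)
print(blocks_before, blocks_after)
```

In this toy tree the query `A <= 3` reads two blocks before the change and one block after it.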
System Architecture
Predicated Scan Query
Example: FIND employees WITH Age < 30 AND 20k < Salary < 40k
1. Upfront Partitioner
Goal: Generate a partitioning tree WITHOUT an upfront query workload
> Generates a tree with heterogeneous branching
> Balances the partitioning benefit across all attributes
Allocation
Goal: Balance partitioning benefit across attributes
Allocation of attribute j ≈ average partitioning of attribute j = $\sum_{\text{all nodes } i} n_{ij}\, c_{ij}$
Attribute Allocations => Upfront Partitioning Algorithm => Partitioning Tree
Attribute allocations are uniform if there is no workload information, and weighted if we have prior workload information
2. Adaptive Query Executor
Goal: Return matching tuples + check if the partitioning layout can be improved
Alternatives are found via transformations on the partitioning tree:
1. Swap Rule
2. Pushup Rule
3. Rotate Rule
Getting a plan
Cost Model
The system maintains a window W of past queries
Compute the Benefit and the Repartitioning Cost for the best plan
Repartitioning ONLY happens when the reduction in the total cost of the query workload is greater than the re-partitioning cost.
This avoids constant re-partitioning due to random query sequences and bounds the worst-case impact.
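The decision rule above can be sketched as follows. This is an illustration under assumptions, not the system's cost model: costs are arbitrary units (for example, blocks read), and the `should_repartition` helper, the window size, and all numbers are made up.

```python
from collections import deque


def should_repartition(window, cost_current, cost_candidate, repartition_cost):
    """Repartition only if the total saving over the query window exceeds the one-off cost."""
    benefit = sum(cost_current[q] - cost_candidate[q] for q in window)
    return benefit > repartition_cost


# Hypothetical per-query costs under the current layout and under the best candidate plan.
current = {"A<=3": 2, "B<=7": 2}
candidate = {"A<=3": 1, "B<=7": 3}

# Window W holds the most recent queries; older ones fall out automatically.
mixed = deque(["A<=3", "A<=3", "B<=7"], maxlen=3)
skewed = deque(["A<=3", "A<=3", "A<=3"], maxlen=3)

decide_mixed = should_repartition(mixed, current, candidate, repartition_cost=2)
decide_skewed = should_repartition(skewed, current, candidate, repartition_cost=2)
print(decide_mixed, decide_skewed)
```

With a mixed window the saving (1 unit) does not cover the repartitioning cost, so the layout stays; once the window is dominated by `A<=3` the saving (3 units) exceeds it and the change goes ahead.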
Performance
4 metrics:
1) Load time
2) Time taken by first query
3) Aggregate runtime over a workload
4) Incremental improvement with workload hints
Load Time
TPC-H: Scale Factor 200 + de-normalized. Data size: 1.4 TB
Loading performance: 1.38 times slower than HDFS
Load time scales almost linearly with data size and is independent of the number of columns
Time taken by first query
On average:
45% better than full scan
20% better than k-d tree
Aggregate Workload Runtime
[Figure: time taken (in s) vs. query number (0 to 200), one panel each for full scan, range, range2, and Amoeba]
Workload: 200 queries generated from random initialization of 8 query templates of the TPC-H benchmark
full scan – baseline
range – partitions on orderdate (1 per date): 1.88x better
range2 – partitions on orderdate (64), r_name (4), c_mktsegment (4), quantity (8): 3.48x better
Amoeba – 3.84x better than baseline
Workload Hints
[Figure: time taken (in s) vs. query number (0 to 200), one panel each for default and better init]
Better Init: starts with a custom allocation to mimic range2
6.67x better than full scan
Filtering ratio: default: 0.81, better init: 0.9
Conclusion
• Amoeba is a distributed storage system based on an adaptive data partitioning scheme
  • Low loading overhead
  • Improved first-query performance
  • Adapts to changes and significantly improves workload runtime
  • Can exploit workload hints
• Allows analysts to get started right away and reap the benefits of partitioning without an upfront workload