Taming Massive Distributed Datasets: Data Sampling Using Bitmap Indices
description
Transcript of Taming Massive Distributed Datasets: Data Sampling Using Bitmap Indices
HPDC 2013
Taming Massive Distributed Datasets: Data Sampling Using Bitmap Indices
Yu Su*, Gagan Agrawal*, Jonathan Woodring#
Kary Myers#, Joanne Wendelberger#, James Ahrens#
*The Ohio State University#Los Alamos National Laboratory
HPDC 2013
Motivation• Science becomes increasingly data driven;• Strong requirement for efficient data analysis;• Challenges:
– Fast data generation speed– Slow disk IO and network speed – Some number from road-runner EC3 simulation
• 40003 particles, 36 bytes per particle => 2.3 TB/s• Network Bandwidth: 10 GB/s• 230 times different, and bigger in future
• Extremely hard to download and analyze entire data
HPDC 2013
Server-side Subsetting Methods
SimpleRequest
AdvancedRequest
Challenges?
No subsetting request?
Data subset is still big?
Server-side SubsettingClient-side Subsetting
HPDC 2013
Data Sampling and Challenges• Statistic Sampling Techniques:
– A subset of individuals to represent whole population– Example: Simple Random, Stratified Random
• Information Loss and Error Metrics: – Mean, Variance, Histogram, Q-Q Plot
• Challenges: – Sampling Accuracy Considering Data Features
• Value Distribution, Spatial Locality
– Error Calculation without High Overhead.– Combine Data Sampling with Data Subsetting– Data Sampling without Data Reorganization
HPDC 2013
Our Solution and Contribution• A server-side subsetting and sampling framework.
– Standard SQL Interface– Bitmap Indexing
• Server-side Subsetting: Dimensions, Values• Server-side Sampling
• Support Data Sampling over Bitmap Indices– Data samples has better accuracy; – Support error prediction before sampling the data– Support data sampling over flexible data subset– No data reorganization is needed
HPDC 2013
Background: Bitmap Indexing• Widely used in scientific data management
• Suitable for float value by binning small ranges• Run Length Compression(WAH, BBC)
– Compress bitvector based on continuous 0s or 1s
HPDC 2013
System Architecture
Parse the SQL expression
Parse the metadata file
Generate Query RequestFind all bitvectors which satisfies current query
Calculate Errors based on bitvectors
Perform sampling over bitvectors
Access the actual dataset
HPDC 2013
Data Sampling over Bitmap Indices
• Features of Bitmap Indexing: – Each bin(bitvector) corresponds to one value range;– Different bins reflect the entire value distribution;– Each bin keeps the data spatial locality;
• Contains all space IDs (0-bits and 1-bits)• Row Major, Column Major• Hilbert Curve, Z-Order Curve
• Method:– Perform stratified random sampling over each bin;– Multi-level indices generates multi-level samples;
HPDC 2013
Stratified Random Sampling over Bins
S1: Index Generation
S2: Divide Bitvector into Equal Strides
S3: Random Select certain % of 1’s out of
each stride
HPDC 2013
Error Prediction vs. Error Calculation
SamplingRequest
PredictRequest
Error PredictionError Calculation
Data Sampling
Error Calculation
Sample
Not Good?
Multi-TimesError Prediction
Error Metrics Feedback
Decide Sampling
Sampling Request
Sample
HPDC 2013
Error Prediction• Pre-estimate the error metrics before sampling
– Bitmap Indices classifies the data into bins• Each bin corresponds to one value or value range;• Find some representative values for each bin: Vi;
– Enforce equal sampling percentage for each bin• Extra Metadata: number of 1-bits of each bin: Ci;
• Compute number of samples of each bin: Si;
– Pre-calculate error metrics based on Vi and Si
• Representative Values: – Small Bin: mean value– Big Bin: lower-bound, upper-bound, mean value
HPDC 2013
Error Prediction Formula• Mean, Variance:
• Histogram:
• Q-Q Plot
HPDC 2013
Data Subsetting + Data Sampling
S3: Perform Stratified Sampling on Subset
S2: Find Spatial ID subset
S1: Find value subset Value = [2, 3)
RID = (9, 25)
HPDC 2013
Experiment Results• Goals:
– Data analysis efficiency with the help of sampling– Accuracy among different sampling methods– Compare Predicted Error with Actual Error – Efficiency among different sampling methods– Speedup for combining data sampling with subsetting
• Datasets: – Ocean Data – Multi-dimensional Arrays– Cosmos Data – Separate Points with 7 attributes
• Environment: – Darwin Cluster: 120 nodes, 48 cores, 64 GB memory
HPDC 2013
Improve Efficiency of Distributed Data Analysis with Sampling
• Data Sampling in server-side;• Data Transfer between client and server;• Data Visualization in client-side;
• Dataset: 11.2 GB Ocean Data
• No Sampling(100%): zero sampling cost, but huge data transfer and visualization cost
• Sampling: much smaller data transfer and visualization cost
• 100 MB/s Network: data sampling achieves a 2.61 – 19.62 speedup
• 10 MB/s Network: data sampling achieves a 4.82 – 47.59 total speedup
HPDC 2013
Sample Accuracy Comparison• Sampling Methods:
– Simple Random Method– Stratified Random Method– KDTree Stratified Random Method– Big Bin Index Random Method– Small Bin Index Random Method
• Error Metrics: – Means over 200 separate sectors– Histogram using 200 value intervals– Q-Q Plot with 200 quantiles
• Sampling Percentage: 0.1%
HPDC 2013
Sample Accuracy Comparison
• Traditional sampling methods can not achieve good accuracy;
• Small Bin method achieves best accuracy in most cases;
• Big Bin method achieves comparable accuracy to KDTree sampling method.
Mean
Histogram
Q-Q Plot
HPDC 2013
Predicted Error vs. Actual Error
Means, Histogram, Q-Q Plot for Small Bin Method
Means, Histogram, Q-Q Plot for Big Bin Method
HPDC 2013
Efficiency Comparison
• Index-based Sample Generation Time is proportional to the number of bins(1.10 to 3.98 times slower).
• The Error Calculation Time based on bins is much smaller than that based on data (>28 times faster).
Sample Generation Time Error Calculation Time
HPDC 2013
Total Time based on Resampling Times
Total Sampling Time
• Index-based Sampling: • Multi-time Error Calculations• One-time Sampling
• Other Sampling Methods: • Multi-time Samplings• Multi-time Error Calculations
• X axis: resampling times• Speedup of Small Bin:
• 0.91 – 20.12
HPDC 2013
Speedup of Sampling over Subset
• X axis: Data Subsetting Percentage (100%, 50%, 30%, 10%, 1%)• Y axis: Index Loading Time + Sampling Generation Time• 25% Sampling Percentage• Speedup :1.47 – 4.98 for Spatial Subsetting
2.25 - 21.54 for value Subsetting
Subset over Spatial IDs Subset over values
HPDC 2013
Conclusion
• ‘Big Data’ issue brings challenges for scientific data management;
• Data sampling is useful and necessary for data analysis;
• Perform server-side sampling over bitmap indices;• Pre-calculate errors before actually sampling data;• Combine data sampling with data subsetting;• Achieve good accuracy and efficiency.
HPDC 2013 23
Thanks