Post on 28-Jun-2015
SAVE BANDWIDTH (AND LEARN TO LOVE BLOBS)
CLOUD STORAGE 101
• DURABLE
• HIGHLY AVAILABLE
• ACCESS ANYWHERE (WITH CREDENTIALS)
• SCALABLE
• CHEAP!!
CLOUD STORAGE 101: BLOB BASICS
• USE BLOCKS OF DATA TO CONSTRUCT BLOB
• REPLACE BLOCKS IN EXISTING BLOBS
UPLOAD ENTIRE BLOB AGAIN
WHY?
TRY AGAIN
UPLOAD SINGLE BLOCK
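The difference between re-uploading a whole blob and uploading one block can be sketched with a toy in-memory model. This is not the Azure SDK: the class `BlockBlobModel` and its methods `put_block` and `commit` are hypothetical names, loosely modelled on the general "stage blocks, then commit a block list" idea, and the 100-byte blocks are made up for illustration.

```python
class BlockBlobModel:
    """Toy in-memory stand-in for a block blob (hypothetical, not a real SDK class)."""

    def __init__(self):
        self.blocks = {}         # block id -> bytes held by the pretend service
        self.block_list = []     # ordered block ids that make up the blob
        self.bytes_uploaded = 0  # running total of bytes sent "over the wire"

    def put_block(self, block_id, data):
        # Staging a block is the only operation that costs upload bandwidth.
        self.blocks[block_id] = data
        self.bytes_uploaded += len(data)

    def commit(self, block_ids):
        # Committing only sends a list of ids; every id must already be staged.
        missing = [b for b in block_ids if b not in self.blocks]
        if missing:
            raise KeyError(f"unknown block ids: {missing}")
        self.block_list = list(block_ids)

    def content(self):
        return b"".join(self.blocks[b] for b in self.block_list)


blob = BlockBlobModel()
for block_id, data in [("A", b"a" * 100), ("B", b"b" * 100),
                       ("C", b"c" * 100), ("D", b"d" * 100)]:
    blob.put_block(block_id, data)
blob.commit(["A", "B", "C", "D"])
initial_upload = blob.bytes_uploaded

# Replace block B only: stage one new block and commit a new block list.
blob.put_block("B2", b"x" * 100)
blob.commit(["A", "B2", "C", "D"])
single_block_upload = blob.bytes_uploaded - initial_upload

print(initial_upload, single_block_upload)
```

In this model the first upload costs 400 bytes and the update costs 100, because blocks A, C and D are reused by id.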
BLOBSYNC AWESOMESAUCE
• DETECTS CHANGES
• DOES NOT NEED ORIGINAL FILE TO DETECT CHANGES
• UPLOADS/DOWNLOADS CHANGES ONLY
• A TRANSPARENT BLACK BOX: OPEN SOURCE, BUT YOU CAN TREAT IT AS A BLACK BOX
THEORY VS REALITY
• THEORY
[FIGURE: AZURE BLOB STORAGE VS LOCAL MACHINE, AXIS 0 TO 400]
THEORY…
• IS ALL GOOD IN THEORY
THEORY VS REALITY
• REALITY
[FIGURE: AZURE BLOB STORAGE (A B C D) VS LOCAL MACHINE (A B’ C D), AXIS 0 TO 400]
FINDING COMMON GROUND
• HOW DO WE FIND MOVED BLOCKS?
• USE HASH/SIGNATURES FOR EACH BLOCK
• SEARCH FOR EACH SIGNATURE THROUGHOUT THE FILE
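As a rough illustration of "a signature for each block", the sketch below hashes fixed-size blocks with MD5. The function name `block_signatures`, the fixed 100-byte block size and the tuple layout are assumptions made for this example, not BlobSync's actual format.

```python
import hashlib


def block_signatures(data, block_size):
    """Return (offset, size, md5 hex digest) for each fixed-size block of data."""
    signatures = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        signatures.append((offset, len(block), hashlib.md5(block).hexdigest()))
    return signatures


# Four 100-byte blocks standing in for blocks A, B, C and D.
remote = b"A" * 100 + b"B" * 100 + b"C" * 100 + b"D" * 100
sigs = block_signatures(remote, 100)
for offset, size, digest in sigs:
    print(offset, size, digest)
```

Only this list of signatures needs to be available locally to search for the blocks; the block contents themselves can stay in the cloud.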
THEORY VS REALITY
• SEARCH LOCAL
• E.G. SEARCH FOR ‘C’
[FIGURE: AZURE BLOB STORAGE (A B C D) VS LOCAL MACHINE (A B’ C D), AXIS 0 TO 400]
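Searching the local file for block ‘C’ can be sketched naively by hashing a window at every byte offset. The function `find_block_naive` and the sample data (a local file where B’ is 30 bytes longer than B, so C has moved) are hypothetical and only illustrate the idea.

```python
import hashlib


def find_block_naive(data, block_size, wanted_md5):
    """Hash a window at every offset; return (offset or None, hashes computed)."""
    hashes = 0
    for offset in range(len(data) - block_size + 1):
        hashes += 1
        window = data[offset:offset + block_size]
        if hashlib.md5(window).digest() == wanted_md5:
            return offset, hashes
    return None, hashes


# Signature of block C as it is stored in the cloud.
wanted = hashlib.md5(b"c" * 100).digest()

# Local file: A, then B' (grown to 130 bytes), then C and D shifted along.
local = b"a" * 100 + b"b" * 130 + b"c" * 100 + b"d" * 100
found_at, hashes_computed = find_block_naive(local, 100, wanted)
print(found_at, hashes_computed)  # -> 230 231
```

The block is found at its new offset, but only after computing one full hash per byte position scanned.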
SUCCESS!
• CAN NOW FIND BLOCKS EVEN WHEN MOVED
• IF WE CAN FIND A BLOCK WE CAN DETERMINE IF WE CAN REUSE IT
• BUT…
• MD5/SHA ETC ARE TOO SLOW TO DO THIS
• TOO SLOW? NO WAY!
• E.G.
• 100MB FILE/BLOB
• BLOCK OF 100K
• > 104M HASH CALCULATIONS, JUST TO FIND THAT ONE BLOCK
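The figure above can be checked with a little arithmetic, assuming 1MB = 1024 × 1024 bytes, 1K = 1024 bytes, and one hash per byte offset at which a full block still fits.

```python
file_size = 100 * 1024 * 1024   # 100MB file/blob, in bytes
block_size = 100 * 1024         # 100K block, in bytes

# One hash for every offset where a whole block fits inside the file.
window_positions = file_size - block_size + 1

# Each of those hashes reads a full block, so the bytes fed to the
# hash function are far more than the file itself.
bytes_hashed = window_positions * block_size

print(window_positions)  # -> 104755201
print(bytes_hashed)
```

That is the worst case for a single block, when it sits at the very end of the file or is not there at all.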
YOU HAVE TO ROLL WITH IT.
• ROLLING SIGNATURE
• EXTREMELY QUICK.
• DUE TO FALSE POSITIVES USE MD5/SHA AS CONFIRMATION STEP
• SIG = FUNC( ARRAY[0 .. 4] )
• CALCULATE SIG OF 1..5 BASED OFF OLD SIG
• NEW SIG = OLDSIG - ARRAY[0] + ARRAY[5]
• CAN SEARCH ENTIRE FILE WITH MINIMAL CALCULATIONS, I.E. FAST!
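A minimal sketch of the rolling search, using the plain additive signature from the slide (drop the byte leaving the window, add the byte entering it) and MD5 as the confirmation step. The function name `rolling_search` is hypothetical, and the simple byte sum is used for illustration only; it is not a claim about the exact rolling function BlobSync uses, and a stronger rolling checksum would give fewer false positives.

```python
import hashlib


def rolling_search(data, block):
    """Find block in data; return (offset or None, number of MD5 confirmations run)."""
    n = len(block)
    if n == 0 or n > len(data):
        return None, 0
    target_weak = sum(block)                     # cheap rolling signature
    target_strong = hashlib.md5(block).digest()  # expensive confirmation hash
    weak = sum(data[:n])                         # signature of the first window
    strong_checks = 0
    for offset in range(len(data) - n + 1):
        if offset > 0:
            # NEW SIG = OLD SIG - byte leaving the window + byte entering it
            weak = weak - data[offset - 1] + data[offset + n - 1]
        if weak == target_weak:
            # The weak signature can collide, so confirm with MD5.
            strong_checks += 1
            if hashlib.md5(data[offset:offset + n]).digest() == target_strong:
                return offset, strong_checks
    return None, strong_checks


# "ba" has the same byte sum as "ab": a false positive that MD5 rejects.
print(rolling_search(b"baab", b"ab"))  # -> (2, 2)

block_c = bytes(range(50, 150))
local = b"a" * 100 + b"b" * 130 + block_c + b"d" * 100
found_at, confirmations = rolling_search(local, block_c)
print(found_at, confirmations)
```

Each step of the scan costs one subtraction and one addition, and the full hash runs only at offsets where the cheap signature already matches.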
SO WHAT NOW?
• CAN NOW SEARCH FILES QUICKLY FOR SIGNATURE MATCHES
• MEANS WE CAN FIGURE OUT WHAT IS COMMON BETWEEN CLOUD AND LOCAL
• CAN DOWNLOAD/UPLOAD ONLY THE DIFFERENCES.
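Putting the pieces together, the sketch below works out which parts of a local file can reuse blocks already in the cloud and which byte ranges still have to be uploaded. It assumes fixed-size remote blocks, an additive rolling signature with MD5 confirmation, and a hypothetical plan format of `("reuse", remote_index)` and `("upload", start, end)` steps; none of this is taken from BlobSync's source.

```python
import hashlib


def signature(block):
    """(weak additive signature, strong MD5 digest) for one block."""
    return sum(block), hashlib.md5(block).digest()


def plan_upload(local, remote_sigs, block_size):
    """Cover local with 'reuse' steps for known blocks and 'upload' steps for the rest."""
    by_weak = {}
    for index, (weak_sig, strong_sig) in enumerate(remote_sigs):
        by_weak.setdefault(weak_sig, []).append((strong_sig, index))

    plan = []
    pending = 0   # start of local bytes not yet covered by any step
    offset = 0
    weak = None   # rolling sum of the current window; None means recompute
    while offset + block_size <= len(local):
        if weak is None:
            weak = sum(local[offset:offset + block_size])
        match = None
        if weak in by_weak:
            strong = hashlib.md5(local[offset:offset + block_size]).digest()
            match = next((i for s, i in by_weak[weak] if s == strong), None)
        if match is not None:
            if pending < offset:
                plan.append(("upload", pending, offset))
            plan.append(("reuse", match))
            offset += block_size
            pending = offset
            weak = None
        else:
            if offset + block_size < len(local):
                # Roll the window forward by one byte.
                weak = weak - local[offset] + local[offset + block_size]
            offset += 1
    if pending < len(local):
        plan.append(("upload", pending, len(local)))
    return plan


remote_blocks = [b"a" * 100, b"b" * 100, b"c" * 100, b"d" * 100]
remote_sigs = [signature(b) for b in remote_blocks]
local = b"a" * 100 + b"x" * 130 + b"c" * 100 + b"d" * 100

plan = plan_upload(local, remote_sigs, 100)
upload_bytes = sum(step[2] - step[1] for step in plan if step[0] == "upload")
print(plan)
print(upload_bytes, "of", len(local), "bytes need uploading")
```

With this sample data only the 130 changed bytes are uploaded; blocks A, C and D are reused even though C and D have moved.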
PROVE IT!
FILE INTERNALS
• ADD
• DELETE
• REPLACE
LIES, MORE LIES AND STATISTICS
• SMALL DB (14M): CLEARED A SMALL TABLE. UPDATE 340K
• LARGE DB (555M): CLEARED A SMALL TABLE. UPDATE 720K
• VM (8G): DELETED SOME FILES. UPDATE 800M
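To put these numbers in proportion, the short calculation below expresses each update as a percentage of the full file size. It assumes K, M and G are binary units (1024-based) and that "UPDATE" is the amount transferred; the labels are just dictionary keys for this example.

```python
K, M, G = 1024, 1024 ** 2, 1024 ** 3

# (label, full size in bytes, update size in bytes), taken from the slide.
cases = [
    ("small DB", 14 * M, 340 * K),
    ("large DB", 555 * M, 720 * K),
    ("VM", 8 * G, 800 * M),
]

percent = {name: 100 * sent / total for name, total, sent in cases}
for name, value in percent.items():
    print(f"{name}: update is {value:.2f}% of the full size")
```

Under these assumptions both database updates come to well under 3% of a full upload, and the VM update to about a tenth.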
UPCOMING CHANGES
• DEFRAG
• DYNAMICALLY DETERMINE BLOCK SIZE
• BETTER PARALLEL UPLOAD/DOWNLOAD
• 32 BIT VERSION
LINKS
• BLOG ON BLOBSYNC:
• HTTPS://KPFAULKNER.WORDPRESS.COM/CATEGORY/BLOBSYNC/
• NUGET PACKAGE:
• HTTPS://WWW.NUGET.ORG/PACKAGES/BLOBSYNC/
• GITHUB WITH SOURCE:
• HTTPS://GITHUB.COM/KPFAULKNER/BLOBSYNC/