Semi-hierarchical Semantic- aware Storage Architecture · Big Data Processing Semantic Correlation...
Semi-hierarchical Semantic-aware Storage Architecture
Yu Hua
Huazhong University of Science and Technology
https://csyhua.github.io
Current and Future Storage
S M I L E
Current and Future Storage 1
• SMILE
• Scale: Big Data, Big Storage
Current and Future Storage 2
• SMILE
• NN(M)-Intelligent:
Current and Future Storage 3
• SMILE
• Integrated:
Near Data Processing:
– Processing in-memory (PIM)
– In-storage computing (ISC)
– QuantX (Micron), Optane (Intel), NDP (HUAWEI), …
Current and Future Storage 4
• SMILE
• Long-term:
Storage media and runtime context
Time-sensitive and value
Current and Future Storage 5
• SMILE
• Edge:
• Edge computing, fog computing, proximity computing, …
Challenge: Hierarchical Architecture
Heterogeneous Principle
Differentiated Performance
Management Complexity
• One-storey house -> Skyscraper
• More and more levels
Challenge: Storage Reliability
Volatile: SRAM, DRAM, …
Non-Volatile: HDD, Tape, …
NVM
Hierarchical Data Structure
Millions of files under each directory
…
This tree is too FAT! This tree is too HIGH!
Hierarchical and Vertical Architecture
• Idea: based on the locality principle, a small set of key data consumes a large share of system resources.
• However, in the era of big data, locality weakens, so the hit ratio becomes difficult to improve.
The essence behind Hierarchy
• Goal: identify the correlation
• In essence, the hierarchy is an approach to dynamically filter data to obtain correlated aggregation and on-demand allocation.
• If flat or semi-hierarchical schemes can achieve the same goal, they would be preferable, with significant performance improvements.
[Figure labels: Source data; Correlation; Hierarchical; Flat]
Semi-hierarchical Architecture
• Problem to be addressed:
How to store data in large-scale storage systems
• The idea:
Semantic storage is the new form of implementing storage systems.
Our related work
• Semantic Namespace: SANE (TPDS14)
• Semantic Aggregation: FAST (SC14), HAR (ATC14), SiLo (ATC11)
• Semantic Hash Computation: SmartCuckoo (ATC17), DLSH (SoCC17), SmartEye (INFOCOM15), NEST (INFOCOM13)
• Semantic On-line Service: ANTELOPE (TC14)
SANE: The namespace
"SANE: Semantic-Aware Namespace in Ultra-large-scale File Systems", IEEE Transactions on Parallel and Distributed Systems (TPDS), Vol.25, No.5, May 2014, pages:1328-1338.
SANE: The Semantic Namespace
Flat Addressing
• Hierarchy becomes the performance bottleneck
• Design goals: Searchable, Unique
• Construct the semantic-aware namespace
Comparisons with Conventional File Systems
Grouping Procedures
Node Vector
Mapping of Index Units
• Our mapping is based on a simple bottom-up approach that iteratively applies random selection and labeling operations.
[Figure labels: Index units; The first-level index units; The second-level index unit]
Components
[Figure: architecture diagram. Labels: Users; Access Requests; Enhanced POSIX I/O; Naming Service; FUSE; Locality-aware Identification; Per-File Namespace Construction; Semantic Grouping; File Attributes; Access Patterns; Metadata Fetching; Index Store; Data Management (Read, Write); VFS; Hierarchical File Systems; POSIX I/O; Conventional Access to File Systems]
Naming and Rename
Submodular Maximization
Select a subset of namespaces with distinct names
Maximization for Monotone Submodular functions
• Scoring Function is a monotone submodular function
– Greedy algorithm
– Constant-factor mathematical quality guarantee
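The greedy strategy named above can be sketched as follows. This is a minimal illustration, assuming a coverage-style scoring function (the number of distinct names covered), which is monotone and submodular. The names `coverage`, `greedy_select`, and the sample namespaces are hypothetical and are not taken from SANE. For this class of functions under a cardinality limit, greedy selection is known to reach at least a (1 - 1/e) fraction of the optimum.

```python
from typing import Dict, List, Set


def coverage(selected: List[str], candidates: Dict[str, Set[str]]) -> int:
    """Score = number of distinct names covered (monotone and submodular)."""
    covered: Set[str] = set()
    for key in selected:
        covered |= candidates[key]
    return len(covered)


def greedy_select(candidates: Dict[str, Set[str]], k: int) -> List[str]:
    """Pick up to k candidates, each time taking the largest marginal gain."""
    selected: List[str] = []
    for _ in range(k):
        base = coverage(selected, candidates)
        best_key, best_gain = None, 0
        for key in sorted(candidates):  # sorted for deterministic tie-breaking
            if key in selected:
                continue
            gain = coverage(selected + [key], candidates) - base
            if gain > best_gain:
                best_key, best_gain = key, gain
        if best_key is None:  # no remaining candidate adds a new name
            break
        selected.append(best_key)
    return selected


# Hypothetical candidate namespaces and the names each one contributes.
namespaces = {
    "ns_a": {"report", "draft", "final"},
    "ns_b": {"draft", "final"},
    "ns_c": {"photo", "video"},
    "ns_d": {"report"},
}
chosen = greedy_select(namespaces, k=2)
```

Each round re-evaluates the marginal gain of every remaining candidate, which is what makes the diminishing-returns property of the score matter.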
Example 1: New Deduplication Ecosystem
• Low-bandwidth network file system (LBFS), SOSP 2001
• Data domain file system (DDFS), FAST 2008
• Venti: Archival data storage, FAST 2002
The Synergization of Similarity and Locality: SiLo
Expose and exploit more similarity by grouping strongly correlated small files into a segment and segmenting large files
Leverage locality in the backup stream by grouping contiguous segments into blocks to capture similar and duplicate data missed by the probabilistic similarity detection.
“SiLo: A Similarity-Locality based Near-Exact Deduplication Scheme with Low RAM Overhead and High Throughput,” Proceedings of USENIX ATC, June 2011.
The Scalability of Deduplication Indexing
• Deduplicate 800 TB unique data.
• SHA-1 signature, avg. 8 KB chunk.
• 2 TB of fingerprints are generated.
• Global indexing: disk bottleneck.
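The 2 TB figure follows from simple arithmetic, sketched below. The sketch assumes binary units (1 TB = 2^40 bytes, 1 KB = 2^10 bytes) and one 20-byte SHA-1 digest per chunk with no extra index metadata; the function name `fingerprint_index_bytes` is hypothetical.

```python
SHA1_BYTES = 20                      # a SHA-1 digest is 160 bits
UNIQUE_DATA_BYTES = 800 * 2**40      # 800 TB of unique data (binary units assumed)
AVG_CHUNK_BYTES = 8 * 2**10          # 8 KB average chunk size


def fingerprint_index_bytes(data_bytes: int, chunk_bytes: int,
                            digest_bytes: int = SHA1_BYTES) -> int:
    """Total fingerprint volume: one digest per chunk."""
    num_chunks = data_bytes // chunk_bytes
    return num_chunks * digest_bytes


index_bytes = fingerprint_index_bytes(UNIQUE_DATA_BYTES, AVG_CHUNK_BYTES)
index_tb = index_bytes / 2**40       # about 1.95 TB, i.e. roughly 2 TB
```

An index of this size does not fit in the RAM of a typical backup server, which is why a global fingerprint index ends up on disk and becomes the bottleneck.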
Existing data stream
Input data stream
Locality Enhancement
Potential duplicate
| | Small files (≤ 64 KB) | Large files (≥ 2 MB) |
|---|---|---|
| Percentage of total file number | ≥ 80% | ≤ 20% |
| Percentage of total space | ≤ 20% | ≥ 80% |
• Grouping many highly correlated small files into a segment to minimize dedup overheads
• Dividing the large files into many small segments to expose more similarity characteristics
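The two rules above can be sketched as a single pass over the backup stream. This is an illustration only, not SiLo's implementation: it assumes that adjacency in the stream stands in for "highly correlated", and the 2 MB segment size, the `build_segments` name, and the sample files are all hypothetical.

```python
from typing import List, Tuple

KB = 1024
MB = 1024 * KB
SEGMENT_BYTES = 2 * MB               # assumed segment size, for illustration only

File = Tuple[str, int]               # (name, size in bytes)
Piece = Tuple[str, int, int]         # (name, offset, length)


def build_segments(files: List[File],
                   segment_bytes: int = SEGMENT_BYTES) -> List[List[Piece]]:
    """Group neighbouring small files into segments; cut large files into segments."""
    segments: List[List[Piece]] = []
    current: List[Piece] = []
    used = 0
    for name, size in files:
        if size >= segment_bytes:
            # Large file: close the open group, then split the file.
            if current:
                segments.append(current)
                current, used = [], 0
            for offset in range(0, size, segment_bytes):
                length = min(segment_bytes, size - offset)
                segments.append([(name, offset, length)])
        else:
            # Small file: keep it whole, grouped with its stream neighbours.
            if used + size > segment_bytes:
                segments.append(current)
                current, used = [], 0
            current.append((name, 0, size))
            used += size
    if current:
        segments.append(current)
    return segments


stream = [("a.txt", 16 * KB), ("b.txt", 32 * KB), ("c.iso", 5 * MB), ("d.txt", 8 * KB)]
segments = build_segments(stream)
```

Similarity detection would then run per segment rather than per file, so many small files share one index lookup and one large file exposes several.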
Fragmentation in Deduplication
• Fragmentation decreases restore performance and leaves invalid chunks physically scattered across different containers after users delete backups.
• History-Aware Rewriting algorithm (HAR)
• HAR exploits historical information of backup systems to identify and rewrite fragmented chunks more accurately.
"Accelerating Restore and Garbage Collection in Deduplication-based Backup Systems via Exploiting Historical Information", Proc. USENIX ATC, 2014.
Example 2: Application-level Approximate Methodology: FAST
[Figure labels: FFMPEG, SIFT, LSH; Cuckoo-driven: random kicking, semi-random kicking, last-step kicking]
"FAST: Near Real-time Searchable Data Analytics for the Cloud", Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), November 2014
Application-level Approximate Methodology: FAST
[Figure: FAST architecture. User perspective (interfaces): queries, add/delete/update. System perspective (performance optimization): caching, prefetching. Stack labels: File System Interface; FAST; File System; Operating Systems; Physical Devices. Processing labels: Image Sources; DoG based Detection of Interest Points; Interest Points; PCA-SIFT based Feature Extraction; Feature Vectors; Multi-hashing Summarization; Summarized Bloom Filter; LSH based Semantic Aggregation; Correlated Groups; Cuckoo Hashing-driven Storage Strategy; Manageable Flat Addressing. Side labels: Big Data Processing; Semantic Correlation Analysis]
Approximate Image Transmission in Networking: SmartEye
[Figure: SmartEye architecture. Labels: Software Defined Network (SDN); QoS-aware DiffServ; In-network Deduplication; Similarity; Label; Operations; Feature Representation; Feature Detection; Feature Summarization; Label Generation; Label Mapping; Label Matching; Label Switching; Classification; Indexing; Deletion; Caching Management; QoS Routing]
"SmartEye: Real-time and Efficient Cloud Image Sharing for Disaster Environments", Proceedings of INFOCOM, 2015, pages: 1616-1624
The design of SmartEye
• Compact Feature Representation
Locality-Sensitive Hashing (LSH)
• Close items will collide with high probability
• Distant items will have very little chance to collide
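The two properties above can be checked empirically with a small sketch. It assumes the standard p-stable form h(q) = floor((a · q + b) / w) with Gaussian projection vectors; the bucket width, trial count, seed, and sample points are hypothetical choices for illustration.

```python
import math
import random
from typing import Sequence


def lsh_hash(q: Sequence[float], a: Sequence[float], b: float, w: float) -> int:
    """Project q onto a, shift by b, and cut the line into buckets of width w."""
    dot = sum(ai * qi for ai, qi in zip(a, q))
    return math.floor((dot + b) / w)


def collision_rate(p: Sequence[float], q: Sequence[float],
                   w: float = 4.0, trials: int = 2000, seed: int = 7) -> float:
    """Fraction of random hash functions under which p and q share a bucket."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in p]   # Gaussian (2-stable) projection vector
        b = rng.uniform(0.0, w)                # random offset within one bucket
        if lsh_hash(p, a, b, w) == lsh_hash(q, a, b, w):
            hits += 1
    return hits / trials


origin = [0.0, 0.0]
near = [0.3, 0.0]    # close to origin
far = [30.0, 0.0]    # distant from origin
near_rate = collision_rate(origin, near)
far_rate = collision_rate(origin, far)
```

With these settings `near_rate` comes out far higher than `far_rate`, which is the gap that LSH-based grouping relies on.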
NEST: Efficient Cuckoo-driven LSH
A multi-choice LSH: available locations for item a
• Blue: hit position by LSH computation
• Green: neighbor bucket with data correlation
If all LSHi(a) are full, item a can choose an adjacent empty bucket.
Probing adjacent neighbors: the probability of endless "kicking out" in NEST is much smaller than in ordinary cuckoo hashing.
(1) Use cuckoo-driven LSH to reduce search time when a collision occurs
(2) Use neighbor buckets to further reduce the possibility of kick-outs
(3) Space efficiency due to neighbor probing and data locality
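The placement order described above (hit positions, then adjacent buckets, then kick-out) can be sketched as follows. This is a simplified illustration, not the NEST implementation: ordinary salted SHA-1 hashes stand in for the LSH functions, each bucket holds one item, and the class name, table size, and kick limit are hypothetical.

```python
import hashlib
from typing import List, Optional


class NeighborCuckooTable:
    def __init__(self, size: int = 16, num_hashes: int = 2, max_kicks: int = 32):
        self.size = size
        self.num_hashes = num_hashes
        self.max_kicks = max_kicks
        self.slots: List[Optional[str]] = [None] * size

    def _hits(self, key: str) -> List[int]:
        # Stand-in for LSH_i(key): salted hashes (an assumption of this sketch).
        return [
            int.from_bytes(hashlib.sha1(f"{i}:{key}".encode()).digest()[:8], "big")
            % self.size
            for i in range(self.num_hashes)
        ]

    def _probe_order(self, key: str) -> List[int]:
        # Hit positions first, then their left and right neighbours.
        hits = self._hits(key)
        neighbours = [(h + d) % self.size for h in hits for d in (-1, 1)]
        order: List[int] = []
        for pos in hits + neighbours:
            if pos not in order:
                order.append(pos)
        return order

    def insert(self, key: str) -> bool:
        for kick in range(self.max_kicks + 1):
            for pos in self._probe_order(key):
                if self.slots[pos] is None:
                    self.slots[pos] = key
                    return True
            # All candidates are full: evict one occupant and try to re-place it.
            hits = self._hits(key)
            pos = hits[kick % len(hits)]
            key, self.slots[pos] = self.slots[pos], key
        return False  # gave up; a real table would rehash or resize here

    def contains(self, key: str) -> bool:
        return any(self.slots[pos] == key for pos in self._probe_order(key))


table = NeighborCuckooTable(size=16)
results = [table.insert(f"file-{i}") for i in range(10)]
```

Because an item may sit in a neighbour of its hit position, lookups probe the same extended set of buckets that insertion used.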
"NEST: Locality-aware Approximate Query Service for Cloud Computing", Proceedings of INFOCOM, April 2013, pages: 1327-1335
NEST: Resolving Collisions If Insertion Still Fails
Hashing collisions when inserting item a
Moving item h to its other location
Note: adjacent probing significantly reduces, or even avoids, hashing failures (FAST INDEX)
Example 3: Pseudoforest
[Figure: cuckoo hashing example with hash tables T1 and T2, buckets numbered 0 to 7, and items a, b, c, d, e, x]
An endless loop is formed.
Endless kickouts for any insertion within the loop.
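An endless loop of this kind can be predicted before any kick-out by viewing the table as a graph. The sketch below is an illustration of that idea and not the SmartCuckoo code: buckets are vertices, each stored item is an edge between its two candidate buckets, and a connected component can hold at most as many items as it has buckets. The class and method names are hypothetical.

```python
class CuckooGraph:
    """Union-find over buckets; each component may contain at most one cycle."""

    def __init__(self, num_buckets: int):
        self.parent = list(range(num_buckets))
        self.has_cycle = [False] * num_buckets  # meaningful at component roots only

    def find(self, v: int) -> int:
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]  # path halving
            v = self.parent[v]
        return v

    def try_insert(self, b1: int, b2: int) -> bool:
        """Return False if an item with candidate buckets b1, b2 would loop forever."""
        r1, r2 = self.find(b1), self.find(b2)
        if r1 == r2:
            if self.has_cycle[r1]:
                return False           # component is full: items == buckets
            self.has_cycle[r1] = True  # this item closes the single allowed cycle
            return True
        if self.has_cycle[r1] and self.has_cycle[r2]:
            return False               # joining two full components would overflow
        self.parent[r2] = r1
        self.has_cycle[r1] = self.has_cycle[r1] or self.has_cycle[r2]
        return True


graph = CuckooGraph(num_buckets=4)
attempts = [(0, 1), (1, 2), (0, 2), (0, 1), (2, 3), (0, 3)]
outcomes = [graph.try_insert(b1, b2) for b1, b2 in attempts]
```

In the sample run, the fourth item is rejected because buckets 0, 1, and 2 already hold three items, and the last one is rejected once bucket 3 has been absorbed into that full component.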
"SmartCuckoo: A Fast and Cost-Efficient Hashing Index Scheme for Cloud Storage Systems", Proceedings of USENIX Annual Technical Conference (USENIX ATC), July 2017, pages: 553-566
DLSH: A Distribution-aware LSH
$$h(q) = \left\lfloor \frac{a \cdot q + b}{\omega} \right\rfloor$$
$$g(q) = a_1 * h_1(q) + a_2 * h_2(q) + \dots + a_k * h_k(q)$$
[Figure: DLSH workflow with steps ① to ④ and distance computation]
① Projection vector selection
② Weight quantization
③ Interval adjustment
④ Frequency recordation
• Due to distribution-unaware projection vectors: multiple hash tables are needed to maintain data locality and guarantee query accuracy.
• Design goal: decrease the number of hash tables; mitigate in-memory consumption.
• Approach: differentiate the aggregated data in a suitable direction; exhibit the data locality while decreasing hash collisions.
"DLSH: A Distribution-aware LSH Scheme for Approximate Nearest Neighbor Query in Cloud Computing", Proceedings of ACM Symposium on Cloud Computing (SoCC), 2017
Example 4: On-line Precomputation--Data Cube
"ANTELOPE: A Semantic-aware Data Cube Scheme for Cloud Data Center Networks", IEEE Transactions on Computers (TC), Vol.63, No.9, September 2014, pages: 2146-2159.
• Leverage precomputation based data cube to support online cloud services
• Use semantic-aware partial materialization to reduce the operation and space overheads
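Partial materialization can be sketched in a few lines. This is a generic data cube illustration, not the ANTELOPE scheme: the choice of which group-bys to precompute is made by hand here, whereas the slide describes a semantic-aware choice. The records, dimension names, and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical traffic records with two dimensions and one measure.
records: List[Dict[str, object]] = [
    {"region": "east", "service": "web", "bytes": 120},
    {"region": "east", "service": "db", "bytes": 80},
    {"region": "west", "service": "web", "bytes": 200},
    {"region": "west", "service": "web", "bytes": 50},
]


def build_cuboid(rows, dims: Tuple[str, ...], measure: str) -> Dict[tuple, int]:
    """Precompute SUM(measure) grouped by the given dimensions."""
    cuboid: Dict[tuple, int] = defaultdict(int)
    for row in rows:
        cuboid[tuple(row[d] for d in dims)] += row[measure]
    return dict(cuboid)


# Partial materialization: only some group-bys are precomputed.
materialized = {
    ("region",): build_cuboid(records, ("region",), "bytes"),
    ("region", "service"): build_cuboid(records, ("region", "service"), "bytes"),
}


def query(dims: Tuple[str, ...], key: tuple) -> int:
    """Answer from a precomputed cuboid when one exists, else scan the raw rows."""
    if dims in materialized:
        return materialized[dims].get(key, 0)
    return sum(r["bytes"] for r in records if tuple(r[d] for d in dims) == key)


east_total = query(("region",), ("east",))                   # served from a cuboid
west_web = query(("region", "service"), ("west", "web"))     # served from a cuboid
web_total = query(("service",), ("web",))                    # falls back to a scan
```

The trade-off is the one named in the bullets: every materialized cuboid costs space and update work, so only the group-bys expected to be queried together are worth precomputing.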
Open Source Codes (in GitHub)
• SmartCuckoo: a new cuckoo hashing scheme to support metadata query service.
https://github.com/syy804123097/SmartCuckoo
• SmartSA (E-STORE): supports near-deduplication for image sharing based on the energy availability in smartphones.
https://github.com/Pfzuo/SmartSA
• Real-time-Share: supports real-time image sharing in the cloud; an important component of SmartEye (INFOCOM 2015).
https://github.com/syy804123097/Real-time-Share
• MinCounter: the data structure proposed in the MSST 2015 paper.
https://github.com/syy804123097/MinCounter
• NEST: INFOCOM 2013 paper, source codes, manual, and trace data.
https://github.com/syy804123097/NEST
• LSBF (Locality-Sensitive Bloom Filter): TC 2012 paper, source codes, and manual.
https://github.com/syy804123097/LSBF
Thanks and Questions