Pyramid: A large-scale array-oriented active storage system
![Page 1: Pyramid: A large-scale array-oriented active storage system](https://reader033.fdocuments.us/reader033/viewer/2022060200/5597dd821a28ab64388b459f/html5/thumbnails/1.jpg)
Pyramid: A large-scale array-oriented active storage system
Viet-Trung TRAN, Nicolae Bogdan, Gabriel Antoniu, Luc Bougé
KerData Team
Inria, Rennes, France
02 09 2011
![Page 2: Pyramid: A large-scale array-oriented active storage system](https://reader033.fdocuments.us/reader033/viewer/2022060200/5597dd821a28ab64388b459f/html5/thumbnails/2.jpg)
02 09 2011 Viet-Trung Tran - 2
Outline
1. Motivation
2. Architecture
3. Preliminary evaluation
4. Conclusion
![Page 3: Pyramid: A large-scale array-oriented active storage system](https://reader033.fdocuments.us/reader033/viewer/2022060200/5597dd821a28ab64388b459f/html5/thumbnails/3.jpg)
1. Motivation: Why array-oriented storage?
![Page 4: Pyramid: A large-scale array-oriented active storage system](https://reader033.fdocuments.us/reader033/viewer/2022060200/5597dd821a28ab64388b459f/html5/thumbnails/4.jpg)
Context: Data-intensive large-scale HPC simulations
• The scalability of data management is becoming a critical issue
• Mismatch between the storage model and the application data model
• Application data model
- Multidimensional typed arrays, images, etc.
• Storage model
- Parallel file systems: simple, flat I/O model (a sequence of bytes)
- Relational model: ill-suited for scientific data
• Additional layers are needed to map the application model to the storage model
![Page 5: Pyramid: A large-scale array-oriented active storage system](https://reader033.fdocuments.us/reader033/viewer/2022060200/5597dd821a28ab64388b459f/html5/thumbnails/5.jpg)
[M. Stonebraker] The one-storage-fits-all-needs approach has reached its limits
• Parallel I/O stack:
- Performance of non-contiguous I/O vs. data atomicity
• Relational data model:
- Simulating arrays on top of tables performs poorly
- Scalability of join queries
• Need to specialize the I/O stack to match application requirements
- Array-oriented storage for the array data model
• Example: SciDB with ArrayStore
[Figure: parallel I/O stack, top to bottom: Application (VisIt, tornado simulation); Data model (HDF5, NetCDF); MPI-IO middleware; Parallel file systems]
![Page 6: Pyramid: A large-scale array-oriented active storage system](https://reader033.fdocuments.us/reader033/viewer/2022060200/5597dd821a28ab64388b459f/html5/thumbnails/6.jpg)
Our approach
• Multi-dimensional aware chunking
• Lock-free, distributed chunk indexing
• Array versioning
• Active storage support
• Versioning array-oriented access interface
![Page 7: Pyramid: A large-scale array-oriented active storage system](https://reader033.fdocuments.us/reader033/viewer/2022060200/5597dd821a28ab64388b459f/html5/thumbnails/7.jpg)
Multi-dimensional aware chunking
• Split the array into equal chunks distributed over the storage elements
- Simplify load balancing among storage elements
- Keep the neighbors of cells in the same chunk
• Shared nothing architecture
- Easier to handle data consistency
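The chunking idea can be sketched in a few lines of Python. This is a minimal illustration, not Pyramid's code: the function names are hypothetical, and the round-robin placement in `server_for` is an assumption standing in for whatever load-balancing strategy the system applies.

```python
from itertools import product


def chunk_of(cell, chunk_shape):
    """Return the coordinates of the chunk that contains the given cell."""
    return tuple(c // s for c, s in zip(cell, chunk_shape))


def chunks_overlapping(offsets, sizes, chunk_shape):
    """List the chunks touched by the sub-array [offsets, offsets + sizes)."""
    ranges = [
        range(o // s, (o + n - 1) // s + 1)
        for o, n, s in zip(offsets, sizes, chunk_shape)
    ]
    return list(product(*ranges))


def server_for(chunk, grid_shape, n_servers):
    """Assumed placement: round-robin over the row-major chunk number."""
    linear = 0
    for c, g in zip(chunk, grid_shape):
        linear = linear * g + c
    return linear % n_servers
```

Because chunks are multi-dimensional blocks rather than byte ranges of a flat file, a small sub-array request touches only the few chunks that hold its neighboring cells.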
![Page 8: Pyramid: A large-scale array-oriented active storage system](https://reader033.fdocuments.us/reader033/viewer/2022060200/5597dd821a28ab64388b459f/html5/thumbnails/8.jpg)
Lock-free, distributed chunk indexing
• Indexing multi-dimensional information
- R-tree, XD-tree, Quad-tree, etc
- Designed and optimized for centralized management
• A centralized metadata management scheme may not scale
- Bottleneck under high concurrency
• Our approach:
- Port quad-tree-like structures to a distributed environment
- Apply shadowing to the quad-tree to enable lock-free concurrent updates
![Page 9: Pyramid: A large-scale array-oriented active storage system](https://reader033.fdocuments.us/reader033/viewer/2022060200/5597dd821a28ab64388b459f/html5/thumbnails/9.jpg)
Array versioning
• Scientific applications need array versioning (VLDB 2009)
- Checkpointing
- Cloning
- Provenance
• Keep data and metadata immutable
- Updating a chunk is handled at the metadata level using the shadowing technique
![Page 10: Pyramid: A large-scale array-oriented active storage system](https://reader033.fdocuments.us/reader033/viewer/2022060200/5597dd821a28ab64388b459f/html5/thumbnails/10.jpg)
Active storage support
• Move data computation to storage elements
- Conserve bandwidth
- Better workload parallelization
• Allow users to send user-defined handlers to storage servers
![Page 11: Pyramid: A large-scale array-oriented active storage system](https://reader033.fdocuments.us/reader033/viewer/2022060200/5597dd821a28ab64388b459f/html5/thumbnails/11.jpg)
Versioning array-oriented access interface
• Basic primitives
- id = CREATE(n, sizes[], defval)
- READ(id, v, offsets[], sizes[], buffer)
- w = WRITE(id, offsets[], sizes[], buffer)
- w = SEND_COMPUTATION(id, v, offsets[], sizes[], f)
• Other primitives, such as cloning and filtering, can mostly be implemented on top of the primitives above
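To make the semantics of the four primitives concrete, here is a toy in-memory stand-in. It is a sketch under several assumptions, not the Pyramid API: the class name `ToyArrayStore` is hypothetical, `read` returns a list instead of filling a caller-supplied buffer, each version is stored as a full copy rather than through shadowing, and `send_computation` is assumed to apply `f` to each cell and store the result as a new version.

```python
from itertools import product


class ToyArrayStore:
    """In-memory illustration of CREATE / READ / WRITE / SEND_COMPUTATION."""

    def __init__(self):
        # array id -> {"sizes": tuple, "defval": value, "versions": [dict, ...]}
        self._arrays = {}

    def create(self, n, sizes, defval):
        assert len(sizes) == n
        array_id = len(self._arrays)
        # Version 0 is the empty snapshot: every cell reads as defval.
        self._arrays[array_id] = {
            "sizes": tuple(sizes),
            "defval": defval,
            "versions": [{}],
        }
        return array_id

    @staticmethod
    def _cells(offsets, sizes):
        # Row-major enumeration of the cells in the requested sub-array.
        return product(*(range(o, o + s) for o, s in zip(offsets, sizes)))

    def read(self, array_id, v, offsets, sizes):
        arr = self._arrays[array_id]
        snapshot = arr["versions"][v]
        return [snapshot.get(c, arr["defval"]) for c in self._cells(offsets, sizes)]

    def write(self, array_id, offsets, sizes, buffer):
        arr = self._arrays[array_id]
        # Copy the latest snapshot so that earlier versions stay untouched.
        snapshot = dict(arr["versions"][-1])
        for cell, value in zip(self._cells(offsets, sizes), buffer):
            snapshot[cell] = value
        arr["versions"].append(snapshot)
        return len(arr["versions"]) - 1

    def send_computation(self, array_id, v, offsets, sizes, f):
        # Assumed semantics: apply f cell by cell, store the result as a new version.
        values = [f(x) for x in self.read(array_id, v, offsets, sizes)]
        return self.write(array_id, offsets, sizes, values)
```

A short usage example: after `a = store.create(2, [4, 4], 0)` and `w = store.write(a, [1, 1], [2, 2], [1, 2, 3, 4])`, reading version `w` returns the written values while reading version 0 still returns the default value, which is the behavior that checkpointing and provenance rely on.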
![Page 12: Pyramid: A large-scale array-oriented active storage system](https://reader033.fdocuments.us/reader033/viewer/2022060200/5597dd821a28ab64388b459f/html5/thumbnails/12.jpg)
2. Pyramid: Architecture
![Page 13: Pyramid: A large-scale array-oriented active storage system](https://reader033.fdocuments.us/reader033/viewer/2022060200/5597dd821a28ab64388b459f/html5/thumbnails/13.jpg)
Architecture
• Pyramid is inspired by our previous work: BlobSeer [JPDC 2011]
• Version managers
- Ensure concurrency control
• Metadata managers
- Store index tree nodes
• Storage manager
- Monitors the storage servers
- Enforces a load-balancing strategy for chunks among storage servers
• Active storage servers
- Store chunks and execute handlers on chunks
• Clients
- Perform I/O accesses
![Page 14: Pyramid: A large-scale array-oriented active storage system](https://reader033.fdocuments.us/reader033/viewer/2022060200/5597dd821a28ab64388b459f/html5/thumbnails/14.jpg)
Read
• I: optionally ask the version manager for the latest published version
• II: fetch the corresponding metadata from the metadata managers
• III: contact storage servers in parallel and fetch the chunks into the local buffer
[Figure: read steps I to III between the client, version managers, metadata managers, and storage servers]
![Page 15: Pyramid: A large-scale array-oriented active storage system](https://reader033.fdocuments.us/reader033/viewer/2022060200/5597dd821a28ab64388b459f/html5/thumbnails/15.jpg)
Write
• I: get a list of storage servers able to store the chunks, one for each chunk
• II: contact the storage servers in parallel and write the chunks to the corresponding providers
• III: get a version number for the update
• IV: add new metadata to consolidate the new version
• V: report that the new version is ready for publication
[Figure: write steps I to V between the client, storage manager, storage servers, metadata managers, and version manager]
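The read and write steps can be traced with a small single-process sketch. Everything here is an assumption made for illustration: plain dictionaries stand in for the storage servers, metadata managers, and version manager; chunk placement is round-robin; the flat `(version, chunk)` metadata table replaces the index tree; and the version is published immediately because there is only one writer.

```python
import uuid


def write(chunks, servers, metadata, versions):
    """Run the five write steps for `chunks` (dict: chunk coordinate -> bytes)."""
    coords = sorted(chunks)
    # I: obtain one storage server per chunk (round-robin is an assumption).
    placement = {coord: i % len(servers) for i, coord in enumerate(coords)}
    # II: store each chunk under a fresh ID, so existing chunks are never
    #     overwritten (sequential here; done in parallel in the slides).
    chunk_ids = {}
    for coord in coords:
        chunk_ids[coord] = uuid.uuid4().hex
        servers[placement[coord]][chunk_ids[coord]] = chunks[coord]
    # III: get a version number for the update.
    versions["assigned"] += 1
    version = versions["assigned"]
    # IV: add metadata recording where the chunks of this version live.
    for coord in coords:
        metadata[(version, coord)] = (placement[coord], chunk_ids[coord])
    # V: report the version as ready (single-writer simplification).
    versions["published"] = version
    return version


def read(coords, servers, metadata, versions, version=None):
    """Run the three read steps and return a dict: chunk coordinate -> bytes."""
    # I: optionally ask for the latest published version.
    if version is None:
        version = versions["published"]
    result = {}
    for coord in coords:
        # II: newest metadata entry for this chunk that is not newer than `version`.
        for v in range(version, 0, -1):
            if (v, coord) in metadata:
                server, chunk_id = metadata[(v, coord)]
                # III: fetch the chunk from its storage server.
                result[coord] = servers[server][chunk_id]
                break
    return result
```

Note that the chunks are written in step II before a version number exists, which is why they are stored under version-independent IDs in this sketch.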
![Page 16: Pyramid: A large-scale array-oriented active storage system](https://reader033.fdocuments.us/reader033/viewer/2022060200/5597dd821a28ab64388b459f/html5/thumbnails/16.jpg)
Lock-free, distributed chunk indexing
• Organized as a Quad-tree to index 2D arrays
• Each tree node has at most 4 children, each covering one of the four quadrants
• The root node covers the whole array
• Each leaf corresponds to a chunk and holds information about its location
• Tree nodes are immutable and uniquely identified by the version number and the sub-domain they cover
• A DHT distributes the tree nodes over the metadata managers
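A sketch of how such node identifiers and their placement might look follows. The tuple layout `(version, x, y, size)`, the assumption that the chunk grid is `n x n` with `n` a power of two, and the SHA-1-modulo placement in `manager_for` are all illustrative choices, not details taken from the slides.

```python
import hashlib


def node_id(version, x, y, size):
    """Identify a node by its version and the square sub-domain it covers."""
    return (version, x, y, size)


def quadrants(x, y, size):
    """Return the four sub-domains covered by the children of a node."""
    half = size // 2
    return [
        (x, y, half),
        (x + half, y, half),
        (x, y + half, half),
        (x + half, y + half, half),
    ]


def descend(cx, cy, n):
    """Sub-domains visited from the root down to the leaf of chunk (cx, cy)."""
    x, y, size = 0, 0, n
    path = [(x, y, size)]
    while size > 1:
        half = size // 2
        if cx >= x + half:
            x += half
        if cy >= y + half:
            y += half
        size = half
        path.append((x, y, size))
    return path


def manager_for(node, n_managers):
    """Assumed DHT placement: hash the node ID modulo the number of managers."""
    digest = hashlib.sha1(repr(node).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_managers
```

Since an identifier depends only on the version number and the sub-domain, any client can compute where a node lives without consulting a central index.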
![Page 17: Pyramid: A large-scale array-oriented active storage system](https://reader033.fdocuments.us/reader033/viewer/2022060200/5597dd821a28ab64388b459f/html5/thumbnails/17.jpg)
Tree shadowing to update
• Write newly created chunks to storage servers
• Build the quad-tree associated with the new snapshot in a bottom-up fashion
- Write the leaves to the DHT
- Inner nodes may point to nodes of previous snapshots (which implies synchronizing quad-tree generation)
- Avoid this synchronization by giving each writer additional information about the other concurrent updaters (thanks to the computable IDs of tree nodes)
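The shadowing step can be illustrated with a single-writer sketch in which a dictionary plays the role of the DHT. The node layout, the function names, and the power-of-two `n x n` chunk grid are assumptions; the sketch also takes the previous root as an argument, so it does not model the concurrent case where writers learn about each other from the version manager.

```python
def build_version(dht, version, prev_root, updated, n):
    """Add the nodes of `version` to `dht` and return the key of the new root.

    updated:   dict mapping chunk (x, y) -> chunk location for rewritten chunks
    prev_root: root key of the previous snapshot, or None for the first one
    """
    def build(x, y, size, prev_key):
        touched = any(
            x <= cx < x + size and y <= cy < y + size for cx, cy in updated
        )
        if not touched:
            # Nothing changed below: share the subtree of the previous snapshot.
            return prev_key
        key = (version, x, y, size)
        if size == 1:
            dht[key] = {"chunk": updated[(x, y)]}
            return key
        half = size // 2
        prev_children = dht[prev_key]["children"] if prev_key else [None] * 4
        quads = [(x, y), (x + half, y), (x, y + half), (x + half, y + half)]
        kids = [build(qx, qy, half, pk) for (qx, qy), pk in zip(quads, prev_children)]
        # The node is stored after its children, i.e. bottom-up.
        dht[key] = {"children": kids}
        return key

    return build(0, 0, n, prev_root)


def lookup(dht, root, cx, cy, n):
    """Return the location of chunk (cx, cy) in the snapshot rooted at `root`."""
    key, x, y, size = root, 0, 0, n
    while key is not None and size > 1:
        half = size // 2
        ix = 1 if cx >= x + half else 0
        iy = 1 if cy >= y + half else 0
        key = dht[key]["children"][iy * 2 + ix]
        x, y, size = x + ix * half, y + iy * half, half
    return dht[key]["chunk"] if key is not None else None
```

Existing entries are never modified: a new version only adds nodes along the paths to the rewritten chunks and points to older nodes everywhere else, so readers of earlier snapshots are unaffected.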
![Page 18: Pyramid: A large-scale array-oriented active storage system](https://reader033.fdocuments.us/reader033/viewer/2022060200/5597dd821a28ab64388b459f/html5/thumbnails/18.jpg)
Efficient parallel updating
• Chunks are written concurrently
• Versions are assigned in the order in which clients finish writing
• Clients get additional information about the other concurrent writers
• Tree nodes are written in a lock-free manner
• Versions are published in the order in which they were assigned
[Figure: clients #1 and #2 writing concurrently to the storage servers, metadata managers, and version manager, each ending with a publish step]
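The ordering rule for publication can be sketched as a small bookkeeping class. The class and method names are hypothetical, and the sketch is not thread-safe; a real service would have to serialize these calls.

```python
class VersionManager:
    """Assigns version numbers and publishes them in assignment order."""

    def __init__(self):
        self._assigned = 0
        self._ready = set()
        self.published = 0  # highest version visible to readers

    def assign(self):
        """Hand out the next version number to a client that finished writing chunks."""
        self._assigned += 1
        return self._assigned

    def report_ready(self, version):
        """Record that the metadata of `version` is complete, then publish.

        A version becomes visible only once every lower version is visible,
        so a fast writer cannot expose a snapshot that depends on an
        unfinished earlier one.
        """
        self._ready.add(version)
        while self.published + 1 in self._ready:
            self.published += 1
            self._ready.discard(self.published)
        return self.published
```

For example, if client #2 reports version 2 before client #1 reports version 1, nothing is published until version 1 arrives, at which point both become visible.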
![Page 19: Pyramid: A large-scale array-oriented active storage system](https://reader033.fdocuments.us/reader033/viewer/2022060200/5597dd821a28ab64388b459f/html5/thumbnails/19.jpg)
Some more I/O primitives
• Easily implemented thanks to immutable data and metadata blocks
• Cheap I/O operators
• Clone a sub-domain
- Follow the metadata tree of a specific snapshot
- Create a new metadata tree and publish it as a newly created array
• Filtering and compression can be done locally, in parallel, at the active storage servers by introducing user-defined handlers
![Page 20: Pyramid: A large-scale array-oriented active storage system](https://reader033.fdocuments.us/reader033/viewer/2022060200/5597dd821a28ab64388b459f/html5/thumbnails/20.jpg)
3. Preliminary evaluation: experiments on G5K (www.grid5000.fr)
![Page 21: Pyramid: A large-scale array-oriented active storage system](https://reader033.fdocuments.us/reader033/viewer/2022060200/5597dd821a28ab64388b459f/html5/thumbnails/21.jpg)
Experimental setup
Simulate a common access pattern exhibited by scientific applications: array dicing
• Using at most 130 nodes of the Graphene cluster on G5K
- 1 Gbps Ethernet interconnect
- 49 nodes used to deploy Pyramid and the competitor system, PVFS
• Array dicing
- Each client accesses a dedicated sub-array
- 1 GB per client, consisting of 32x32 chunks (chunk size: 1024x1024 bytes)
- Concurrent reading/writing
• Measure performance and scalability
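The per-client data volume follows directly from the chunk layout, as this short check shows. The helper `aggregate_bytes` is a hypothetical convenience for scaling the figure to any number of clients; the slides do not state how many clients were used.

```python
# Chunk layout stated on the slide.
CHUNK_BYTES = 1024 * 1024          # one chunk: 1024 x 1024 bytes
CHUNKS_PER_CLIENT = 32 * 32        # each client's sub-array: 32 x 32 chunks

BYTES_PER_CLIENT = CHUNK_BYTES * CHUNKS_PER_CLIENT
SIDE_BYTES = 32 * 1024             # side length of one client's sub-array


def aggregate_bytes(n_clients):
    """Total volume accessed when n_clients each handle one sub-array."""
    return n_clients * BYTES_PER_CLIENT


assert BYTES_PER_CLIENT == 2**30           # exactly 1 GiB per client
assert SIDE_BYTES * SIDE_BYTES == 2**30    # a 32768 x 32768 byte sub-array
```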
![Page 22: Pyramid: A large-scale array-oriented active storage system](https://reader033.fdocuments.us/reader033/viewer/2022060200/5597dd821a28ab64388b459f/html5/thumbnails/22.jpg)
Aggregated throughput achieved under concurrency
• PVFS suffers from the non-contiguous access pattern caused by serialization to a flat file
• Pyramid
- Throughput increases steadily
- Promising scalability of both data and metadata organization
![Page 23: Pyramid: A large-scale array-oriented active storage system](https://reader033.fdocuments.us/reader033/viewer/2022060200/5597dd821a28ab64388b459f/html5/thumbnails/23.jpg)
4. Conclusion
![Page 24: Pyramid: A large-scale array-oriented active storage system](https://reader033.fdocuments.us/reader033/viewer/2022060200/5597dd821a28ab64388b459f/html5/thumbnails/24.jpg)
Conclusion
• Pyramid is an array-oriented active storage system
• Proposed a system offering support for
- Parallel array processing for both read and write workloads
- Versioning data
- Distributed metadata management, shadowing to reflect updates
• Preliminary evaluation shows promising scalability
• Future work
- Planned integration with HDF5
- Pyramid as a storage engine for SciDB?
- Investigate keeping data at quad-tree nodes
  Could be used to store arrays at different resolutions (map applications)
![Page 25: Pyramid: A large-scale array-oriented active storage system](https://reader033.fdocuments.us/reader033/viewer/2022060200/5597dd821a28ab64388b459f/html5/thumbnails/25.jpg)
Thank you
INRIA – KerData Research Team
www.irisa.fr/kerdata