Overview of SciDB: Large Scale Array Storage, Processing...
Transcript of Overview of SciDB: Large Scale Array Storage, Processing...
![Page 1: Overview of SciDB: Large Scale Array Storage, Processing ...pavlo/courses/fall2013/static/slides/scidb.pdf · Overview of SciDB: Large Scale Array Storage, Processing and Analysis](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f4544f3f159de455e1399df/html5/thumbnails/1.jpg)
Overview of SciDB:Large Scale Array Storage, Processing and Analysis
Presenter: Kevin Chang
15-799
10/30/2013
1
![Page 2: Overview of SciDB: Large Scale Array Storage, Processing ...pavlo/courses/fall2013/static/slides/scidb.pdf · Overview of SciDB: Large Scale Array Storage, Processing and Analysis](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f4544f3f159de455e1399df/html5/thumbnails/2.jpg)
Executive Summary• Problem:
– Modern databases are primarily built for businessapplications
– Scientific data have different characteristics and are becoming more data-intensive
• Solution: SciDB
– A new open-source DBMS to support VERY large (PB) ARRAYdata
2
![Page 3: Overview of SciDB: Large Scale Array Storage, Processing ...pavlo/courses/fall2013/static/slides/scidb.pdf · Overview of SciDB: Large Scale Array Storage, Processing and Analysis](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f4544f3f159de455e1399df/html5/thumbnails/3.jpg)
Background and Motivation• Scientific data differ from business data in 3 ways
• 1. Science makes use of sensor arrays
– Rectangular array of individual sensors
– Example: Hubble Space Telescope’s camera
– A block of collocated pixels
– Implicit ordering
3
![Page 4: Overview of SciDB: Large Scale Array Storage, Processing ...pavlo/courses/fall2013/static/slides/scidb.pdf · Overview of SciDB: Large Scale Array Storage, Processing and Analysis](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f4544f3f159de455e1399df/html5/thumbnails/4.jpg)
Background and Motivation• 2. Require sophisticated data processing methods
– Sensor data needs filtering
– Clean data are fed into complex analytic processing
• 3. Extremely LARGE data
– Produced at large scale and reused frequently
– Example: LHC -> 15PB data annually
4
![Page 5: Overview of SciDB: Large Scale Array Storage, Processing ...pavlo/courses/fall2013/static/slides/scidb.pdf · Overview of SciDB: Large Scale Array Storage, Processing and Analysis](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f4544f3f159de455e1399df/html5/thumbnails/5.jpg)
SciDB Features• Goal: provide scalability to store, process and analyze
data
• Array data model:
– Implicit ordering
– Organized as collections of n-dimensional arrays
– Each cell contains a tuple of values with attribute names
5
(A::INTEGER, B::FLOAT) -> (2, 0.7)
![Page 6: Overview of SciDB: Large Scale Array Storage, Processing ...pavlo/courses/fall2013/static/slides/scidb.pdf · Overview of SciDB: Large Scale Array Storage, Processing and Analysis](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f4544f3f159de455e1399df/html5/thumbnails/6.jpg)
Array Definition
6
CREATE ARRAY EXAMPLE(A::INTEGER, B::FLOAT) [ I=0:2, J=0:2];
(2, 0.7) (1, 0.5) (3, 0.1)
(6, 0.3) (7, 0.1) (0, 0.1)
(1, 0.4) (4, 0.6) (8, 0.1)
I = 0 : 2
J = 0 : 2
(2, 0.7) (1, 0.5) (3, 0.1)
(7, 0.1) (0, 0.1)
(1, 0.4) (4, 0.6) (8, 0.1)
I = 0 : 2
J = 0 : 2
Empty cells
• SciDB will support extensible type system (PDF)• Nested model (an array within an array)
![Page 7: Overview of SciDB: Large Scale Array Storage, Processing ...pavlo/courses/fall2013/static/slides/scidb.pdf · Overview of SciDB: Large Scale Array Storage, Processing and Analysis](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f4544f3f159de455e1399df/html5/thumbnails/7.jpg)
Data Manipulation• Declarative query language
7
Slice (Example, I=2);
(2, 0.7) (1, 0.5) (3, 0.1)
(6, 0.3) (7, 0.1) (0, 0.1)
(1, 0.4) (4, 0.6) (8, 0.1)
(3, 0.1)
(0, 0.1)
(8, 0.1)
Subsample (Example,I BETWEEN 0 AND 1 ANDJ BETWEEN 0 AND 1);
(2, 0.7) (1, 0.5)
(6, 0.3) (7, 0.1)
![Page 8: Overview of SciDB: Large Scale Array Storage, Processing ...pavlo/courses/fall2013/static/slides/scidb.pdf · Overview of SciDB: Large Scale Array Storage, Processing and Analysis](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f4544f3f159de455e1399df/html5/thumbnails/8.jpg)
Data Manipulation
8
Sjoin (Slice (Example, I=2),Subsample (Example,I BETWEEN 0 AND 1 ANDJ BETWEEN 0 AND 1));
(2, 0.7) (1, 0.5) (3, 0.1)
(6, 0.3) (7, 0.1) (0, 0.1)
(1, 0.4) (4, 0.6) (8, 0.1)
(3, 0.1)
(0, 0.1)
(8, 0.1)
(2, 0.7) (1, 0.5)
(6, 0.3) (7, 0.1)
![Page 9: Overview of SciDB: Large Scale Array Storage, Processing ...pavlo/courses/fall2013/static/slides/scidb.pdf · Overview of SciDB: Large Scale Array Storage, Processing and Analysis](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f4544f3f159de455e1399df/html5/thumbnails/9.jpg)
Data Manipulation• Manipulate data content
• Operations are composable
9
Filter (Example, A > 2);
Apply (Slice (Example, I = 2), A * B);
Structural operator Content manipulator
![Page 10: Overview of SciDB: Large Scale Array Storage, Processing ...pavlo/courses/fall2013/static/slides/scidb.pdf · Overview of SciDB: Large Scale Array Storage, Processing and Analysis](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f4544f3f159de455e1399df/html5/thumbnails/10.jpg)
Extensibility• User-defined data type (UDT)
• User-defined function (UDF)
– Highly specific algorithm
– C/C++
– Example: Gaussian smoothing
10
Smooth (Subsample (Example,I BETWEEN 0 AND 1 ANDJ BETWEEN 0 AND 1));
![Page 11: Overview of SciDB: Large Scale Array Storage, Processing ...pavlo/courses/fall2013/static/slides/scidb.pdf · Overview of SciDB: Large Scale Array Storage, Processing and Analysis](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f4544f3f159de455e1399df/html5/thumbnails/11.jpg)
Architecture• Shared nothing over a network of computers
• One centralized system catalog database
– Stores info about data distribution, UDF, etc
• Storage manager:
– Distributed and no-overwrite storage manager• Data can only be appended
– ACID: Atomicity and durability• Does not talk about its implementation
11
![Page 12: Overview of SciDB: Large Scale Array Storage, Processing ...pavlo/courses/fall2013/static/slides/scidb.pdf · Overview of SciDB: Large Scale Array Storage, Processing and Analysis](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f4544f3f159de455e1399df/html5/thumbnails/12.jpg)
Data storage• Mapping of logical array data into physical storage
• 1. Vertical partitioning like column store:
– Users tend to focus on a single attribute
– Each cell only stores a single value
12
(3, 0.1)
(0, 0.1)
3
0
0.1
0.1
![Page 13: Overview of SciDB: Large Scale Array Storage, Processing ...pavlo/courses/fall2013/static/slides/scidb.pdf · Overview of SciDB: Large Scale Array Storage, Processing and Analysis](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f4544f3f159de455e1399df/html5/thumbnails/13.jpg)
Data Storage• 2. Chunking:
– Operate on a small area
– Split into equal-sized (and overlapping) chunks
13
(2, 0.7) (1, 0.5) (3, 0.1)
(6, 0.3) (7, 0.1) (0, 0.1)
(1, 0.4) (4, 0.6) (8, 0.1)(6, 0.3) (7, 0.1)
(1, 0.4) (4, 0.6)
(1, 0.5) (3, 0.1)
(7, 0.1) (0, 0.1)
(2, 0.7) (1, 0.5)
(6, 0.3) (7, 0.1)
(7, 0.1) (0, 0.1)
(4, 0.6) (8, 0.1)
![Page 14: Overview of SciDB: Large Scale Array Storage, Processing ...pavlo/courses/fall2013/static/slides/scidb.pdf · Overview of SciDB: Large Scale Array Storage, Processing and Analysis](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f4544f3f159de455e1399df/html5/thumbnails/14.jpg)
Data Storage• Overlapping chunks:
– Some methods require computation on neighboring cells
– Example: Gaussian smoothing
– Advantage: enables parallel computation
– Storage overhead
– Pipelined operators
14
Apply (Slice (Example, I = 2), A * B);
![Page 15: Overview of SciDB: Large Scale Array Storage, Processing ...pavlo/courses/fall2013/static/slides/scidb.pdf · Overview of SciDB: Large Scale Array Storage, Processing and Analysis](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f4544f3f159de455e1399df/html5/thumbnails/15.jpg)
Other Features• Data distribution: Some chunks are collocated on the
same node for locality
• Compression: Reduce bandwidth consumption
– Sparse matrix compression?
15
![Page 16: Overview of SciDB: Large Scale Array Storage, Processing ...pavlo/courses/fall2013/static/slides/scidb.pdf · Overview of SciDB: Large Scale Array Storage, Processing and Analysis](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f4544f3f159de455e1399df/html5/thumbnails/16.jpg)
Query Processing• Complicate parsing due to user-defined operators
• Optimization: Tries to parallelize queries
• Scatter/Gather operations to dynamically redistribute data across nodes
16
![Page 17: Overview of SciDB: Large Scale Array Storage, Processing ...pavlo/courses/fall2013/static/slides/scidb.pdf · Overview of SciDB: Large Scale Array Storage, Processing and Analysis](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f4544f3f159de455e1399df/html5/thumbnails/17.jpg)
Conclusion• A new ARRAY DBMS
– Data management
– Math operations
• Open source
• SciDB outperforms Postgres by 2 orders of magnitude
– Astronomy-style workloads
17
![Page 18: Overview of SciDB: Large Scale Array Storage, Processing ...pavlo/courses/fall2013/static/slides/scidb.pdf · Overview of SciDB: Large Scale Array Storage, Processing and Analysis](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f4544f3f159de455e1399df/html5/thumbnails/18.jpg)
18
THE END