Cloud Analytics
Data Warehousing
Marco Serafini
COMPSCI 590S
Lecture 19
Trivia
• How does Amazon make money?
  • Selling books?
  • Entertainment?
Cloud Computing
• Shared resources
  • Multiple tenants sharing resources (with isolation)
  • Economy of scale
• Elastic provisioning
  • Can easily add and remove resources on the fly
  • Pay as you go only when used
• Different flavors
  • IaaS, PaaS, SaaS
  • Public, private cloud
Cloud Offerings
• Computing nodes
  • Example: AWS EC2
  • Full nodes with local storage and pre-installed OS
  • Very large number of instance types: compute optimized, memory optimized, storage optimized, with GPUs, burstable…
• Storage services
  • Example: AWS S3
  • Key-value stores (put/get), file systems
• Higher-level services
  • Example: DBMS
Other Variants
• Spot instances
  • Allocated in real time based on live bidding
  • Can be revoked at any time (with notice)
• Serverless computing
  • Example: AWS Lambda
• Each of these services comes with its own pricing
Storage Disaggregation
• Use remote storage instead of local storage
  • Network is fast
  • Remote and local storage can have same throughput
• Advantages: can use cloud storage services like S3
  • No configuration or provisioning needed
  • Cheaper
• Cost of disaggregated storage
  • Storage nodes can have weak CPUs and limited memory
  • Storage is cheap
Remote vs. Local Storage
Goals
• Easily parallelize single-threaded code
• Eliminate cluster management overhead
  • Deployment of nodes
  • Installation
  • Configuration
• Even cloud offerings have their complexities
  • Many instance types
  • Many services
• Solution: Serverless functions
Serverless Functions
• Single-threaded code
• Invoked through HTTP requests
• Cloud platform takes care of
  • Deployment
  • Load balancing
  • Performance isolation
• No need to
  • Deploy servers
  • Configure clusters
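As a rough illustration of the slide above, a serverless function can be as small as one single-threaded handler. The sketch below follows the `handler(event, context)` shape used by AWS Lambda's Python runtime. The `body` and `numbers` fields are hypothetical and chosen only for this example, and the last lines fake one HTTP-style invocation locally instead of going through the cloud platform.

```python
import json

def handler(event, context):
    """Entry point in the style of an AWS Lambda Python handler."""
    # Assumption: the request body arrives as a JSON string under "body".
    payload = json.loads(event.get("body") or "{}")
    numbers = payload.get("numbers", [])
    # Single-threaded, stateless work: nothing survives after the return.
    result = {"sum": sum(numbers)}
    return {"statusCode": 200, "body": json.dumps(result)}

# Local stand-in for one invocation; the platform would normally build `event`.
response = handler({"body": json.dumps({"numbers": [1, 2, 3]})}, None)
print(response)
```

Deployment, load balancing, and isolation do not appear anywhere in the code, which is the point of the model: the function author writes only the handler.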
State and Fault Tolerance
• State is lost after execution
• Inputs and outputs need to be persisted
• Fault tolerance
  • Re-execute function
  • Require atomic writes to check what has succeeded
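One way to picture "re-execute plus atomic writes" is the sketch below. It is only an illustration under stated assumptions: a local directory stands in for remote storage, `run_task` and the `task-0` identifier are hypothetical names, and atomicity comes from writing a temporary file and renaming it with `os.replace`.

```python
import json
import os
import tempfile

def run_task(task_id, func, arg, out_dir):
    """Run func(arg) unless a committed output for task_id already exists."""
    final_path = os.path.join(out_dir, f"{task_id}.json")
    if os.path.exists(final_path):
        # A previous attempt already committed its output: skip re-execution.
        with open(final_path) as f:
            return json.load(f)
    result = func(arg)
    # Write to a temporary file, then rename. The rename is atomic, so a reader
    # sees either no output or the complete output, never a partial one.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir)
    with os.fdopen(fd, "w") as f:
        json.dump(result, f)
    os.replace(tmp_path, final_path)
    return result

calls = []

def square(x):
    calls.append(x)  # record every real execution
    return x * x

with tempfile.TemporaryDirectory() as out_dir:
    first = run_task("task-0", square, 7, out_dir)
    # Simulate a retry after a suspected failure: the committed output is reused.
    second = run_task("task-0", square, 7, out_dir)
```

Because the output file either exists in full or not at all, checking for it is enough to decide whether a task still needs to be re-executed.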
Registering Functions
• Registering a new Lambda function is slow
• Solution
  • Register a single generic Lambda function
  • Serialize the code that needs to be executed
  • Store the code (and the input data) on S3
  • Generic Lambda function loads code and executes it
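The four steps of the solution can be sketched in a few lines. This is a toy under loud assumptions: the dictionary `fake_s3` stands in for an S3 bucket, `submit` and `generic_handler` are hypothetical names, and the code is serialized with the standard-library `marshal` module, which only works for simple functions without closures and only between identical Python versions. A real system would likely use a more capable serializer.

```python
import builtins
import marshal
import pickle
import types

fake_s3 = {}  # hypothetical stand-in for an S3 bucket: key -> bytes

def submit(key, func, arg):
    """Client side: serialize the function's code and its input, store both."""
    fake_s3[key + "/code"] = marshal.dumps(func.__code__)
    fake_s3[key + "/input"] = pickle.dumps(arg)

def generic_handler(event, context=None):
    """The single registered function: load code and input, run, store output."""
    key = event["key"]
    code = marshal.loads(fake_s3[key + "/code"])
    # Rebuild a callable from the code object with a minimal global namespace.
    func = types.FunctionType(code, {"__builtins__": builtins})
    arg = pickle.loads(fake_s3[key + "/input"])
    fake_s3[key + "/output"] = pickle.dumps(func(arg))

def word_count(text):
    return len(text.split())

submit("job-1", word_count, "serverless functions are stateless")
generic_handler({"key": "job-1"})
output = pickle.loads(fake_s3["job-1/output"])
```

Only `generic_handler` would ever be registered with the platform; new user code travels as data, so the slow registration step is paid once.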
Remote Storage Scalability
Semantics
• Map is easy
  • Execute one function per element of the list
• Map + single Reducer
  • E.g. parallel featurization + single-server ML
• MapReduce
  • Many Lambdas needed, many small intermediate files
  • Use Redis, an in-memory key-value store
• Parameter server
  • Use Redis
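The "map + single reducer" pattern above can be sketched as follows. Everything here is a local stand-in chosen for illustration: threads play the role of Lambda invocations, the dictionary `intermediate` plays the role of Redis, and `map_task` / `reduce_task` are hypothetical names.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

intermediate = {}  # hypothetical stand-in for Redis: key -> value

def map_task(task_id, line):
    """Mapper: count words in one input element and store a partial result."""
    intermediate[f"partial:{task_id}"] = Counter(line.split())

def reduce_task():
    """Single reducer: merge all partial results from the shared store."""
    total = Counter()
    for key, partial in intermediate.items():
        if key.startswith("partial:"):
            total += partial
    return total

lines = ["a b a", "b c", "a c c"]
# One function invocation per element of the list; threads stand in for Lambdas.
with ThreadPoolExecutor(max_workers=3) as pool:
    list(pool.map(map_task, range(len(lines)), lines))
counts = reduce_task()
```

With many mappers and many reducers, the number of small intermediate entries grows quickly, which is why the slide suggests a low-latency in-memory store for them.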
The Cost of Scaling Up
• Using more nodes does not always imply higher cost
• Lower latency → lower cost per node
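A small worked example makes the argument concrete. It assumes pay-as-you-go billing per node-hour and a job that parallelizes perfectly; the prices and durations are made up, and `parallel_efficiency` is a hypothetical knob for less ideal jobs.

```python
def job_cost(nodes, single_node_hours, price_per_node_hour, parallel_efficiency=1.0):
    """Return (total cost, wall-clock hours) for a job split across `nodes`."""
    # Wall-clock time shrinks with more nodes, scaled by how well the job parallelizes.
    hours = single_node_hours / (nodes * parallel_efficiency)
    return nodes * hours * price_per_node_hour, hours

cost_1, hours_1 = job_cost(1, 8.0, 0.5)   # one node for eight hours
cost_8, hours_8 = job_cost(8, 8.0, 0.5)   # eight nodes for one hour each
print(cost_1, hours_1, cost_8, hours_8)
```

Under these assumptions both runs cost the same, while the eight-node run finishes in an eighth of the time. Each node is billed for less time, so its individual cost drops. With an efficiency below 1.0 the larger cluster does become more expensive.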
Shared-Nothing and the Cloud
• Shared-nothing architecture
  • Each node has its own disk and memory
  • All nodes are “symmetric”
• Challenges
  • Heterogeneous workloads
    • No one-size-fits-all hardware configuration
  • Membership changes
    • Large data shuffles when a node fails/is removed
  • Online upgrade
    • It is similar to changing all the nodes in the system
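The cost of a membership change can be illustrated with a toy placement function. The sketch assumes the simplest possible scheme, hashing each key modulo the number of nodes; the slides do not say which scheme a given system uses, and `owner` and the `row-…` keys are hypothetical.

```python
import zlib

def owner(key, num_nodes):
    """Shared-nothing placement: each key lives on exactly one node."""
    return zlib.crc32(key.encode()) % num_nodes

keys = [f"row-{i}" for i in range(10_000)]
before = {k: owner(k, 4) for k in keys}
after = {k: owner(k, 3) for k in keys}   # one of the four nodes is removed
moved = sum(1 for k in keys if before[k] != after[k])
fraction_moved = moved / len(keys)
print(f"{fraction_moved:.0%} of keys change owner")
```

With a uniform hash, a key keeps its owner only when the hash gives the same remainder modulo 4 and modulo 3, which happens for about a quarter of keys. Removing a single node therefore moves most of the data, not only the share that the lost node held.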
Architecture
• Data Storage
  • Based on S3: high throughput, high latency
  • Also used for intermediate data
• Virtual Warehouses
  • Responsible for query execution
  • Stateless (restarted in their entirety)
  • Shared cache (low latency on hot data, most data cold)
• Cloud Services
  • Query parsing, access control, optimization
  • Snapshot isolation with multi-versioning
  • Metadata on external key-value store
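The split between durable remote storage and stateless workers with a cache can be sketched as a read-through LRU cache. This is a toy, not the system's actual design: `RemoteStore` stands in for S3, `WorkerCache` for the local cache of one worker, and the capacity and keys are made up.

```python
from collections import OrderedDict

class RemoteStore:
    """Hypothetical stand-in for S3: durable, but every read counts as slow."""
    def __init__(self, objects):
        self.objects = dict(objects)
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self.objects[key]

class WorkerCache:
    """LRU cache local to one worker; losing it loses no data."""
    def __init__(self, store, capacity):
        self.store, self.capacity = store, capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)      # hot data: served locally
            return self.entries[key]
        value = self.store.get(key)            # cold data: fetched remotely
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict least recently used
        return value

store = RemoteStore({"t1": b"block-1", "t2": b"block-2", "t3": b"block-3"})
cache = WorkerCache(store, capacity=2)
for key in ["t1", "t1", "t2", "t1", "t3", "t2"]:
    cache.get(key)

# A replacement worker starts with an empty cache and can still serve any key.
replacement = WorkerCache(store, capacity=2)
recovered = replacement.get("t1")
```

Since the cache holds only copies, a worker can be restarted or replaced without any data movement beyond re-warming the cache.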
Advantages
• Storage on S3 is cheaper
• Use expensive local disk only for hot data
• All services (except storage) are stateless
  • Simpler fault tolerance and membership change
  • Example: online upgrade
SparkSQL: Spark + DBMS
• Extend Spark with
  • Simple, high-level SQL-like operators
  • Query optimization
• No need to transfer data across systems
  • ETL, query processing, complex analytics in one system
DataFrames
• Collection of rows with homogeneous schema
  • Like a table in a DBMS
  • Can be manipulated like an RDD
• DataFrame operations
  • Similar to Python Pandas or R data frames
  • Evaluated lazily (query planning is postponed)
  • Can optimize across multiple queries
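Lazy evaluation can be illustrated without Spark itself. The class below is a toy and not the Spark API: `LazyFrame`, `where`, `select`, and `collect` are hypothetical names that only mimic the idea that operations build up a plan and nothing runs until a result is requested.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: operations are recorded, not executed."""
    def __init__(self, rows, plan=()):
        self.rows, self.plan = rows, plan

    def where(self, predicate):
        return LazyFrame(self.rows, self.plan + (("filter", predicate),))

    def select(self, *columns):
        return LazyFrame(self.rows, self.plan + (("project", columns),))

    def collect(self):
        """Only here is the recorded plan executed, in one pass over the rows."""
        out = []
        for row in self.rows:
            keep = True
            for op, arg in self.plan:
                if op == "filter":
                    if not arg(row):
                        keep = False
                        break
                else:  # "project": keep only the requested columns
                    row = {c: row[c] for c in arg}
            if keep:
                out.append(row)
        return out

users = LazyFrame([{"name": "ann", "age": 31}, {"name": "bob", "age": 17}])
adults = users.where(lambda r: r["age"] >= 18).select("name")
plan_length = len(adults.plan)   # two recorded steps, no data touched yet
names = adults.collect()
```

Because the whole plan is visible before execution, a planner gets the chance to reorder or fuse steps across what the user wrote as separate operations.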
Advantages
• Relational structure enables query optimization
• In-memory caching using columnar representation
  • Better compression
• Mix SQL-like operators and arbitrary code
  • More flexible than UDFs in DBMSs
  • Can optimize across multiple SQL operations
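Why a columnar layout compresses better can be seen with a small example. Run-length encoding is used here purely as an illustration of one columnar compression scheme; the rows and the `run_length_encode` helper are hypothetical.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive equal values into (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

rows = [("US", 1), ("US", 2), ("US", 3), ("DE", 4), ("DE", 5)]
# Column layout: all values of one attribute are stored together.
country_column = [country for country, _ in rows]
encoded = run_length_encode(country_column)
print(encoded)
```

In a row layout the repeated country values are interleaved with the other attribute, so no long runs of equal values exist to collapse.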
Catalyst
• Query optimizer of SparkSQL
• Rule-based optimization
  • Rule: find pattern and transform
  • Used for both logical and physical plans
  • Can customize rules
• Code generation
  • Directly outputs bytecode (as opposed to interpreting a plan)
  • Much more CPU efficient
• Flexible data sources
  • Can change the physical representation of DataFrames
  • Still use the optimizer
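The "find pattern and transform" idea behind rule-based optimization can be sketched on a tiny expression tree. This is a Python toy and not Catalyst's own API: the node classes `Lit`, `Col`, `Add`, the two rules, and the `transform` driver are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Lit:
    value: int

@dataclass(frozen=True)
class Col:
    name: str

@dataclass(frozen=True)
class Add:
    left: object
    right: object

def fold_constants(node):
    """Rule: Add(Lit, Lit) -> Lit. Returns the node unchanged if no match."""
    if isinstance(node, Add) and isinstance(node.left, Lit) and isinstance(node.right, Lit):
        return Lit(node.left.value + node.right.value)
    return node

def drop_zero(node):
    """Rule: x + 0 -> x."""
    if isinstance(node, Add) and node.right == Lit(0):
        return node.left
    return node

def transform(node, rules):
    """Apply rules bottom-up, repeating until the tree stops changing."""
    while True:
        new = node
        if isinstance(new, Add):
            new = Add(transform(new.left, rules), transform(new.right, rules))
        for rule in rules:
            new = rule(new)
        if new == node:
            return node
        node = new

expr = Add(Col("x"), Add(Lit(1), Lit(-1)))   # x + (1 + -1)
optimized = transform(expr, [fold_constants, drop_zero])
```

Each rule only describes one local pattern, and the driver handles traversal and repetition, so adding a custom rule means adding one more small function to the list.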