15-319 / 15-619 Cloud Computingmsakr/15619-f16/recitations/F16...Distributed Databases In 2004,...


15-319 / 15-619 Cloud Computing

Recitation 8

October 18, 2016


Overview

• Administrative issues
  - Office Hours, Piazza guidelines
• Last week’s reflection
  - Project 3.2, OLI Unit 3, Module 13, Quiz 6
• This week’s schedule
  - Quiz 7 - Thursday, October 20th
  - Unit 4, Module 14
  - Project 3.3 - October 23
• Team Project: Phase 1


Last Week: A Reflection

• Content, Unit 3 - Module 13:
  - Storage and Network Virtualization
  - Quiz 6 completed
• P3.2: You explored consistency models
  - Sharding and Replication
  - Multithreaded programming
  - Implemented Strong consistency model
  - Bonus Task: Eventual Consistency


This Week: Content

UNIT 4: Cloud Storage
● Module 14: Cloud Storage
  ○ Quiz 7 - Introduction to Cloud Storage
    ● Thursday, October 20, 2016
● Module 15: Case Studies: Distributed File Systems
  ○ Quiz 8: Distributed File Systems Checkpoint
● Module 16: Case Studies: NoSQL Databases
● Module 17: Case Studies: Cloud Object Storage
  ○ Quiz 9: NoSQL and Object Stores


Project 3.2 Feedback

https://goo.gl/qpz7OU

Please leave us feedback


Project 3 Weekly Modules

● P3.1: Files, SQL and NoSQL
  ○ Primer: Storage Benchmarking
● P3.2: Replication and Consistency models
  ○ Primer: Intro. to Java Multithreading
  ○ Primer: Thread-safe programming
  ○ Primer: Intro. to Consistency Models
● P3.3: Social network with heterogeneous backend storage


Distributed Databases

● In 2004, Amazon.com began to experience the limits of scale on a traditional web-scale system
● The response was a highly available key-value structured storage system called Dynamo (2007)
● Used in S3, DynamoDB, Cassandra

Article on DynamoDB - By Werner Vogels

Problem | Technique used as solution
Data Sharding | Consistent Hashing
Transient Fault Handling | Sloppy Quorum / Hinted Handoff
Permanent Failure Recovery | Anti-entropy using Merkle trees
Membership and Health Checks | Gossip protocols
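The first row of the table, consistent hashing for data sharding, can be illustrated with a small sketch. This is not Dynamo's implementation: the class name `ConsistentHashRing`, the use of MD5, and the default of 100 virtual nodes per server are all assumptions chosen for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring using MD5 (stable across runs)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    """Hash ring with virtual nodes; names and defaults are illustrative."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> physical node
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each physical node is placed at several positions on the ring.
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def get_node(self, key: str) -> str:
        """Return the owner of the first ring position clockwise from the key."""
        if not self._points:
            raise ValueError("ring is empty")
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.get_node(k) for k in keys}
ring.remove_node("node-c")
after = {k: ring.get_node(k) for k in keys}
# Only keys that lived on the removed node change owner.
moved = [k for k in keys if before[k] != after[k]]
```

The point of the design is visible in `moved`: removing a server reassigns only the keys that server held, whereas a plain `hash(key) % n` scheme would reshuffle most keys when `n` changes.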


Distributed Databases

● In 2006, Google published details about their implementation of BigTable

● Designed as a “sparse, distributed multi-dimensional sorted map”

● HBase, an open-source store modeled on BigTable, stores members of “column families” adjacent to each other on the file system - columnar data store
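To make the "sparse, multi-dimensional sorted map" idea concrete, here is a toy single-process model in which a value is addressed by row key, column (written as `family:qualifier`) and timestamp. The class `SparseSortedMap` and its methods are hypothetical; nothing here is the BigTable or HBase API, and distribution across servers is left out entirely.

```python
import bisect


class SparseSortedMap:
    """Toy model of (row, "family:qualifier", timestamp) -> value."""

    def __init__(self):
        # (row, column) -> list of (timestamp, value), kept sorted by timestamp.
        # Values are assumed to be strings in this sketch.
        self._cells = {}
        # Sorted row keys, so that range scans are cheap.
        self._rows = []

    def put(self, row, column, timestamp, value):
        versions = self._cells.setdefault((row, column), [])
        bisect.insort(versions, (timestamp, value))
        i = bisect.bisect_left(self._rows, row)
        if i == len(self._rows) or self._rows[i] != row:
            self._rows.insert(i, row)

    def get(self, row, column):
        """Return the newest value of a cell, or None if the cell is absent."""
        versions = self._cells.get((row, column))
        return versions[-1][1] if versions else None

    def scan(self, start_row, stop_row):
        """Return row keys in sorted order within [start_row, stop_row)."""
        lo = bisect.bisect_left(self._rows, start_row)
        hi = bisect.bisect_left(self._rows, stop_row)
        return self._rows[lo:hi]


table = SparseSortedMap()
table.put("com.cnn.www", "contents:html", 1, "<html>v1</html>")
table.put("com.cnn.www", "contents:html", 2, "<html>v2</html>")
table.put("com.example", "anchor:cnnsi.com", 1, "CNN")

newest = table.get("com.cnn.www", "contents:html")   # latest version wins
missing = table.get("com.example", "contents:html")  # sparse: absent cells cost nothing
in_range = table.scan("com.a", "com.d")               # sorted rows allow range scans
```

Only cells that were written occupy space, which is what "sparse" refers to, and keeping row keys sorted is what makes range scans over adjacent keys efficient.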


Project 3.3

Review


Project 3.3: Introduction

• Build a social network about movies:


High Fanout and Multiple Rounds of Data Fetching

A single Facebook page requires many data fetch operations

Nishtala, R., Fugal, H., Grimm, S., Kwiatkowski, M., Lee, H., Li, H. C., ... & Venkataramani, V. (2013, April). Scaling Memcache at Facebook. In NSDI (Vol. 13, pp. 385-398).


P3.3 Data Set

1. User Profiles
   1. User Authentication System (such as a Single-Sign-On or SSO) - RDS MySQL
   2. User Info / Profile - RDS MySQL
   3. Action Log
   4. Social Graph of the User: follower, followee, family etc. - HBase
2. User Activity System - All user generated media - MongoDB
3. Big Data Analytics System
   1. Search System
   2. Recommender System
   3. User Behaviour Analysis


Project 3.3: Architecture

• Build a social network about movies:

[Architecture diagram: Front-end Server and Back-end Server, with MySQL (RDS), HBase, MongoDB and S3 as storage backends]


MongoDB

● Document Database

○ Schema-less model

● Scalable

○ Automatically shards data among multiple servers

○ Does load-balancing

● Complex Queries

○ MapReduce style filter and aggregations

○ Geo-spatial queries
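The combination of a schema-less model with filter-and-aggregate queries can be sketched in plain Python, without a MongoDB server. The sample documents and the helper `aggregate` are made up for illustration; the helper loosely mirrors a `$match` stage followed by a `$group` stage with `$sum` in MongoDB's aggregation pipeline, but it is not the driver API.

```python
from collections import defaultdict

# Schema-less documents: fields may differ between records (sample data is made up).
posts = [
    {"user": "alice", "type": "review", "movie": "Up", "likes": 3},
    {"user": "bob", "type": "review", "movie": "Up", "likes": 5, "tags": ["pixar"]},
    {"user": "alice", "type": "photo", "likes": 7},
    {"user": "carol", "type": "review", "movie": "Heat"},
]


def aggregate(docs, match, group_key, sum_field):
    """Filter docs by equality on `match`, then sum `sum_field` per `group_key`."""
    totals = defaultdict(int)
    for doc in docs:
        # Filter step: keep documents whose fields equal the requested values.
        if all(doc.get(k) == v for k, v in match.items()):
            # Aggregate step: a missing field counts as 0 instead of failing.
            totals[doc.get(group_key)] += doc.get(sum_field, 0)
    return dict(totals)


likes_per_movie = aggregate(posts, {"type": "review"}, "movie", "likes")
# likes_per_movie == {"Up": 8, "Heat": 0}
```

Note how the document without a `likes` field and the one with an extra `tags` field are both handled without any schema change, which is the practical meaning of "schema-less" here.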


Project 3.3: Tasks

• Build a social network about movies:


Project 3.3: Task 5

• Friend recommendation


Twitter Analytics Team Project


TWITTER DATA ANALYTICS: 15619 PROJECT


Team Project System Architecture

● Web server architectures
● Dealing with large scale real world tweet data
● HBase and MySQL optimization


Team Project

● Phase 1
  ○ Q1
  ○ Q2 (MySQL AND HBase)
● Phase 2
  ○ Q1
  ○ Q2 & Q3 (MySQL AND HBase)
● Phase 3
  ○ Q1
  ○ Q2, Q3 & Q4 (MySQL OR HBase)

CONFIRM YOUR AWS ACCOUNT AND TEAM INFO


Team Project Time Table

Note:
● There will be a report due at the end of each phase, where you are expected to discuss optimizations
● WARNING: Check your AWS instance limits on the new account (should be > 10 instances)

Phase (and query due) | Start | Deadline | Code and Report Due
Phase 1: Q1 | Monday 10/10/2016 00:00:01 EST | Sunday 10/23/2016 23:59:59 ET |
Phase 1: Q2 | | Sunday 10/30/2016 23:59:59 ET | Tuesday 11/01/2016 23:59:59 ET
Phase 2: Q1, Q2, Q3 | Monday 10/31/2016 00:00:01 ET | Sunday 11/13/2016 15:59:59 ET |
Phase 2 Live Test (HBase/MySQL): Q1, Q2, Q3 | Sunday 11/13/2016 18:00:01 ET | Sunday 11/13/2016 23:59:59 ET | Tuesday 11/15/2016 23:59:59 ET
Phase 3: Q1, Q2, Q3, Q4 | Monday 11/14/2016 00:00:01 ET | Sunday 12/04/2016 15:59:59 ET |
Phase 3 Live Test: Q1, Q2, Q3, Q4 | Sunday 12/04/2016 18:00:01 ET | Sunday 12/04/2016 23:59:59 ET | Tuesday 12/06/2016 23:59:59 ET


Team Project Phase 1

• Two queries
  – Q1: Pure front end
  – Q2: ETL + back end + front end, do both MySQL (relational DBMS) and HBase (NoSQL)
• Grading
  – Submit on TPZ, you will get several numbers:
    • Error Rate, Correctness and RPS
  – Higher RPS, higher correctness, lower error rate ⇒ higher grade
  – Q1 is 25% of phase 1, Q2 MySQL is 25% of phase 1, Q2 HBase is 25% of phase 1, report is 25% of phase 1


Team Project, Phase 1, Q1

● Step 1: Compare different front-end frameworks
● Step 2: Deploy the front-end
● Step 3: Perform decryption of a secret message

Pure front end, no database needed. Need to consider scaling horizontally.


Team Project, Phase 1, Q2

● Step 1: Extract tabular data from raw tweets
  ○ Input file: JSON Tweets (approx. 1 TB)
  ○ Consider using a MapReduce Job for ETL
    ■ ETL is expensive and there’s the potential for errors, so plan carefully and test on smaller data sets
    ■ Start early, or there will be no time to optimize the backend
● Step 2: Load the data into HBase and MySQL (both!)
● Step 3: Deploy
  ○ a web service that handles HTTP requests and responds with data from the backend
  ○ an optimized backend (MySQL and HBase)

Higher throughput = More points
Winner gets grades, fame (?), job (?)
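For Step 1, a small local script is a cheap way to test extraction logic on a sample before paying for a full MapReduce run. The sketch below is only an illustration: the chosen columns, the rule that duplicates and incomplete tweets are dropped, and the helper names `clean` and `extract_rows` are assumptions, and the real rules are the ones in the project write-up. The field names `id_str`, `text`, `created_at` and `user` follow the usual tweet JSON layout.

```python
import json


def clean(text: str) -> str:
    """Escape characters that would break a one-record-per-line TSV file."""
    return (text.replace("\\", "\\\\")
                .replace("\t", "\\t")
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def extract_rows(lines):
    """Yield (tweet_id, user_id, created_at, text) tuples; skip unusable lines."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed JSON is one of the corner cases to expect
        if not isinstance(tweet, dict):
            continue
        user = tweet.get("user")
        tweet_id = tweet.get("id_str")
        text = tweet.get("text")
        if not isinstance(user, dict) or not user.get("id_str"):
            continue
        if not tweet_id or text is None:
            continue
        if tweet_id in seen:
            continue  # assumption: duplicate tweet ids are dropped
        seen.add(tweet_id)
        yield tweet_id, user["id_str"], tweet.get("created_at", ""), clean(text)


sample = [
    '{"id_str": "1", "created_at": "Mon Oct 10 00:00:01 +0000 2016", '
    '"text": "云计算\\tis fun", "user": {"id_str": "42"}}',
    '{"id_str": "1", "text": "duplicate", "user": {"id_str": "42"}}',
    'not json',
    '{"id_str": "2", "text": "no user"}',
]
rows = list(extract_rows(sample))
```

In an actual MapReduce job the in-memory `seen` set would not work across mappers, so deduplication would have to move to the reduce side, keyed on the tweet id.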


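Common Q2 issues

● Unicode: tweets come in many scripts, for example “cloud computing” written as
  ○ Arabic: الحوسبة السحابیة
  ○ Hindi: बादल कंप्यूटिंग
  ○ Chinese: 云计算
  ○ Japanese: クラウドコンピューティング
  ○ Kannada: ಕ್ಲೌಡ್ ಕಂಪ್ಯೂಟಿಂಗ್
  ○ Yiddish: ווָאלקן קַאמּפיוטינג
  ○ Russian: облачных вычислений
  ○ Emojis
● Remember to eliminate short URLs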


Hints

• Read the write-up carefully (read more than once)
• You can test only if you have a front end
• ETL has many corner cases, can be time consuming and expensive
  – Start early (from the first day), your backend will be meaningless if you have incorrect data
  – The reference server and the reference ETL file are your friends
• Big data challenge will easily eat up your time and money if you are careless. Think, calculate, & test before you launch an EMR cluster with 20 machines


Reminder

• Changes in Team Project writeup. Refer @1616
• Updated banned word list. Refer @1729
• You have a total budget of $50 for Phase 1
• Your system should not cost more than $0.95 per hour, this includes (see write-ups for details):
  – EC2 on demand instance cost
    • even if you use spot instances, we will calculate your cost using the on-demand instance price
  – EBS cost
  – ELB cost
• Target: Q2 - 3000 rps (for both MySQL and HBase)
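The hourly limit and the phase budget lend themselves to a quick calculation before launching anything. In the sketch below only the $0.95 and $50 figures come from the slide; every price in `PRICES`, the resource names, and the example deployment are hypothetical placeholders, and EBS is left out for brevity. Look up the actual on-demand prices for the instance types you use.

```python
HOURLY_LIMIT = 0.95   # dollars per hour, from the reminder
PHASE_BUDGET = 50.00  # dollars for Phase 1, from the reminder

# Hypothetical on-demand prices in dollars per hour; replace with real ones.
PRICES = {"frontend": 0.10, "backend": 0.20, "elb": 0.025}


def hourly_cost(counts, prices=PRICES):
    """Sum price * count over every billed resource type."""
    return sum(prices[name] * n for name, n in counts.items())


def hours_affordable(cost_per_hour, budget=PHASE_BUDGET):
    """How many hours the deployment can run before the budget is spent."""
    return budget / cost_per_hour


deployment = {"frontend": 2, "backend": 3, "elb": 1}
cost = hourly_cost(deployment)      # about 0.825 dollars per hour
within_limit = cost <= HOURLY_LIMIT
hours = hours_affordable(cost)      # roughly 60 hours if nothing else is billed
```

The same arithmetic applies to ETL: a 20-machine EMR cluster costs 20 times the per-machine price for every hour it runs, and that spend comes out of the same budget.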


Start early!

Team Project Q1 Also Due Sunday


Upcoming Deadlines

• Quiz 7: Unit 4 - Module 14 - Cloud Storage
  Due: Thursday, 10/20/2016 11:59PM Pittsburgh
• Project 3.3: Social Networking Timeline with Heterogeneous Backends
  Due: 10/23/2016 11:59PM Pittsburgh
• Team Project: Phase 1 - Query 1 (This Sunday, Oct 23!)
  Due: 10/23/2016 11:59PM Pittsburgh
• Team Project: Phase 1 - Query 2
  Due: 10/30/2016 11:59PM Pittsburgh


Q&A