Sunnie Chung - cis.csuohio.educis.csuohio.edu/~sschung/CIS433/NoSQLBigData.pdf · Massively...

Sunnie Chung

• Volume: an unprecedented volume of data, fueled by web-based business, social networking, and microblogs (e.g., click streams captured in web server logs)

e.g., eBay processes 8 petabytes of data per night

• Variety: various structures of data (or no structure):

Structured (Database)

Semi-structured (Web pages) and

Unstructured (Web Server Log, Sensor Data)

• Velocity: new data is generated at an unprecedentedly high rate

e.g., streaming Twitter messages

• Numerous new analytic and business intelligence opportunities like:

• Fraud detection

• Customer profiling

• Customer loyalty analysis

• All of which directly affect revenue of business and critical business decisions.

Massively Parallel Processing (MPP)

Systems:

• Parallel Data Warehouse (PDW) System

Oracle, IBM, Teradata, Microsoft

• Hadoop System with Map Reduce

Google, Yahoo, Facebook, Twitter, LinkedIn

• Hybrid of Both

• Cloud

Amazon Elastic D W

Google Cloud

Microsoft Azure

• MPP Systems

• PDW Based Systems : Microsoft PDW

• Hadoop/MapReduce Based Systems

• MongoDB

• Pig Latin

• HBase

• Hive

http://blogs.the451group.com/opensource/2011/04/15/nosql-newsql-and-beyond-the-answer-to-sprained-relational-databases/

• A software framework of distributed computing on large datasets and large clusters of machines.

• First published by Google at OSDI '04

• Open source implementations, e.g. Apache Hadoop

• Has become a focus of attention in both academia and industry

• Universities teach it in computer science classes (e.g. Berkeley)

• Companies use it for data analysis

• Controversial debate raised by database researchers

• “A major step backwards” – David J. DeWitt and Michael Stonebraker

• A single file system distributed across multiple physical computer nodes

• Google File System (GFS) – specializes in fault tolerance, high throughput, and scalability.

• Target workloads are reads and appends on large data

• A big file is split into small pieces (default 64 MB)

• Each piece has multiple copies on different machines (default 3)

• The master node does bookkeeping only

• Client applications connect to slave nodes for data

• Hadoop Distributed File System (HDFS) is an open source implementation of GFS.
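
The splitting and replication described above can be illustrated with a minimal Python sketch. It uses the 64 MB piece size and 3 copies from the slide; the node names and the round-robin placement rule are assumptions made for illustration only, not the placement policy GFS or HDFS actually uses.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB default piece size, from the slide
REPLICAS = 3                   # default number of copies, from the slide


def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    return [(offset, min(chunk_size, file_size - offset))
            for offset in range(0, file_size, chunk_size)]


def place_replicas(num_chunks, nodes, replicas=REPLICAS):
    """Assign each chunk to `replicas` distinct nodes.

    Round-robin placement is a simplifying assumption for this sketch.
    """
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        chunk_id: [nodes[(chunk_id + r) % len(nodes)] for r in range(replicas)]
        for chunk_id in range(num_chunks)
    }


# A hypothetical 200 MB file stored on four hypothetical nodes.
chunks = split_into_chunks(200 * 1024 * 1024)
placement = place_replicas(len(chunks), ["node-a", "node-b", "node-c", "node-d"])
for chunk_id, holders in placement.items():
    print(chunk_id, chunks[chunk_id], holders)
```

The master only needs to keep a table like `placement`; the bytes themselves live on the nodes that clients contact directly.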

• An open source implementation of both MapReduce and Google File System

• Yahoo is the biggest contributor

• Full Java implementation

• Easy installation and setup

• On top of existing file system

• No root account is needed

• Used by companies:

• Yahoo, Facebook, Rackspace/Mailtrust, etc

• http://wiki.apache.org/hadoop/PoweredBy

MapReduce word count example (diagram):

• Input file from the distributed file system (DFS), e.g. GFS: “You jump, I jump.” Both jump.

• The master assigns tasks to the worker map instances and the worker reduce instances

• Map output: (You, 1), (jump, 1), (I, 1), (jump, 1), (Both, 1), (jump, 1)

• The intermediate result is stored on the mapper’s local disk

• The reducer pulls the data

• Reduce output: (You, 1), (I, 1), (Both, 1), (jump, 3)

• The final output is written to the DFS
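
The word count flow in the diagram can be sketched in a single Python process. This is a toy illustration, not the Hadoop API: the function names `map_phase`, `shuffle`, and `reduce_phase` are made up for this example, and the two input strings stand in for splits of a DFS file.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input split."""
    words = line.replace(",", " ").replace(".", " ").split()
    return [(word, 1) for word in words]


def shuffle(pairs):
    """Group intermediate values by key, as the reducers do when they pull data."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Sum all the counts collected for one word."""
    return word, sum(counts)


splits = ["You jump, I jump.", "Both jump."]
intermediate = [pair for line in splits for pair in map_phase(line)]
result = dict(reduce_phase(word, counts)
              for word, counts in shuffle(intermediate).items())
print(result)
```

In a real cluster each call to `map_phase` and `reduce_phase` would run on a different worker, with the master handing out the tasks.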

• Hadoop MapReduce tolerates hardware and network failures as well as input and program errors.

• Slave nodes send heartbeat messages to the master node periodically.

• The master considers a node dead when its heartbeat messages stop arriving.

• No further requests are sent to dead nodes.

• Three things to recover:

• Data block on failed machine [by DFS]

• Map work/result on failed machine [by MapReduce]

• Reduce work/result on failed machine [by MapReduce]
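
The heartbeat mechanism above can be sketched as follows. This is a hedged, minimal model rather than Hadoop's implementation: the class name `HeartbeatMonitor`, the timeout value, and the worker names are all assumptions, and timestamps are passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Toy master-side bookkeeping for worker heartbeats."""

    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence before a node counts as dead
        self.last_seen = {}      # node name -> time of its last heartbeat

    def heartbeat(self, node, now):
        """Record that `node` reported in at time `now`."""
        self.last_seen[node] = now

    def dead_nodes(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen > self.timeout)

    def pick_live_node(self, now):
        """Choose a node for new work, never a dead one; None if all are dead."""
        dead = set(self.dead_nodes(now))
        live = sorted(node for node in self.last_seen if node not in dead)
        return live[0] if live else None


monitor = HeartbeatMonitor(timeout=10)
monitor.heartbeat("worker-1", now=0)
monitor.heartbeat("worker-2", now=0)
monitor.heartbeat("worker-1", now=8)   # worker-2 has gone silent
print(monitor.dead_nodes(now=12))
print(monitor.pick_live_node(now=12))
```

Once a node is declared dead, the master would reschedule its map and reduce tasks on a live node, while the file system re-replicates its data blocks.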

http://blogs.the451group.com/opensource/2011/04/15/nosql-newsql-and-beyond-the-answer-to-sprained-relational-databases/

• NoSQL = Not only SQL

• Broad class of database management systems

• Non-adherence to the relational database model

• Generally do not use SQL for data manipulation

http://www.indeed.com/jobanalytics/jobtrends?q=cassandra,+redis,+voldemort,+simpleDB,+couchDB,+mongoDb,+hbase,+Riak&l=

• Relational databases cannot cope with massive amounts of data (like datasets at Google, Amazon, Facebook, etc.)

• Many application scenarios don’t use a fixed schema.

• Many applications don’t require full ACID guarantees.

• NoSQL database systems are able to manage large volumes of data that do not necessarily have a fixed schema.

• NoSQL databases do not necessarily provide full ACID guarantees. They commonly provide eventual consistency.

When should we use NoSQL?

• When we need to manage large amounts of data, and

• Performance and real-time behavior are more important than consistency

• Indexing a large number of documents

• Serving pages on high-traffic web sites

• Delivering streaming media

• NoSQL usually has a distributed, fault-tolerant architecture.

• Data is partitioned among different machines

• Performance

• Size limitations

• Data is replicated

• Tolerates failures

• Can easily scale out by adding more machines

• NoSQL databases commonly provide eventual consistency

• Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system
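
Replication with eventual consistency can be illustrated with a small Python sketch. It is a toy model under stated assumptions: the `Replica` class, the integer version numbers, and the last-write-wins rule are invented for this example and do not describe any particular NoSQL product.

```python
class Replica:
    """One copy of the data; each key stores a (version, value) pair."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, version):
        """Accept the write only if it is newer than what is stored (last write wins)."""
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(source, target):
    """Propagate every entry from source that is newer than the target's copy."""
    for key, (version, value) in source.data.items():
        target.write(key, value, version)


replicas = [Replica("r1"), Replica("r2"), Replica("r3")]

# A client writes to one replica only.
replicas[0].write("cart", ["book"], version=1)
stale_read = replicas[2].read("cart")   # r3 has not seen the update yet

# With no further writes, background syncs carry the update to every replica.
sync(replicas[0], replicas[1])
sync(replicas[1], replicas[2])
converged = [replica.read("cart") for replica in replicas]
print(stale_read, converged)
```

The stale read before the syncs is the price paid for accepting writes without coordinating all replicas first.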

• Document store

• Store documents that contain data in some format (XML, JSON, binary, etc.)

• Examples: MongoDB, SimpleDB, CouchDB, Oracle NoSQL Database, etc.

• Key-Value store

• Store the data in a schema-less way (commonly key-value pairs). Data items could be stored in a data type of a programming language or an object.

• Examples: Cassandra, Dynamo, Riak, MemcacheDB, etc.

• Graph databases

• Stores graph data. For instance: social relations, public transport links, road maps or network topologies.

• Examples: AllegroGraph, InfiniteGraph, Neo4j, OrientDB, etc.

• Tabular

• Examples: HBase, BigTable, Hypertable, etc.

• Object databases

• Examples: db4o, ObjectDB, Objectivity/DB, ObjectStore, etc.

• Others: Multivalue databases, RDF databases, etc.

http://hbase.apache.org/

• HBase is an open source NoSQL distributed database

• Modeled after Google's BigTable and written in Java

• Runs on top of HDFS (Hadoop Distributed File System)

• Provides a fault-tolerant way of storing large amounts of sparse data

• Provides random reads and writes (HDFS does not support random writes)

• Adobe

• Facebook

• Meetup

• Stumbleupon

• Twitter

• Yahoo!

• and many more…

• HBase is not ACID compliant

• However, it guarantees certain properties, e.g., all mutations are atomic within a row.

• Strongly consistent reads/writes

• HBase is not an "eventually consistent" DataStore. This makes it very suitable for tasks such as high-speed counter aggregation.

• Automatic sharding

• HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as your data grows

• Automatic RegionServer failover

• Hadoop/HDFS Integration

• HBase supports HDFS out of the box as its distributed file system

• MapReduce

• HBase supports massively parallelized processing via MapReduce for using HBase as both source and sink

• Java Client API

• HBase supports an easy to use Java API for programmatic access.

• Block Cache and Bloom Filters

• HBase supports a Block Cache and Bloom Filters for high volume query optimization

• Operational Management

• HBase provides built-in web pages for operational insight as well as JMX metrics.

Apache HBase Reference Guide: http://hbase.apache.org/book/architecture.html#arch.overview
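
The automatic sharding idea listed above, where a table is divided into regions that split as data grows, can be sketched with a toy Python model. This is not the HBase client API: `ToyTable`, the row-count threshold, and the split-at-the-median rule are assumptions chosen to keep the example short (HBase itself splits on region size, and places regions on RegionServers).

```python
import bisect


class ToyTable:
    """A table kept as key-ordered regions that split when they grow too large."""

    def __init__(self, max_rows_per_region=4):
        self.max_rows = max_rows_per_region
        self.start_keys = [""]   # start_keys[i] is the lowest key region i may hold
        self.regions = [{}]      # regions[i] maps row key -> value

    def _region_index(self, row_key):
        # The owning region is the last one whose start key is <= row_key.
        return bisect.bisect_right(self.start_keys, row_key) - 1

    def put(self, row_key, value):
        index = self._region_index(row_key)
        self.regions[index][row_key] = value
        if len(self.regions[index]) > self.max_rows:
            self._split(index)

    def get(self, row_key):
        return self.regions[self._region_index(row_key)].get(row_key)

    def _split(self, index):
        """Split one region at its median key into a lower and an upper region."""
        keys = sorted(self.regions[index])
        middle = keys[len(keys) // 2]
        rows = self.regions[index]
        self.regions[index] = {k: v for k, v in rows.items() if k < middle}
        self.regions.insert(index + 1, {k: v for k, v in rows.items() if k >= middle})
        self.start_keys.insert(index + 1, middle)


table = ToyTable(max_rows_per_region=4)
for i in range(10):
    table.put(f"row-{i:02d}", i)
print(table.start_keys)
print([len(region) for region in table.regions])
```

Because every row key belongs to exactly one region, a read or write for a given row always goes to a single place, which fits the row-level atomicity noted in the list.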

• NewSQL is a class of database systems that aims to provide the same scalable performance as NoSQL systems while still maintaining the ACID guarantees of a traditional single-node database system.

• When should you use NewSQL?

• When the application needs to handle very large datasets or a very large number of transactions

• When ACID guarantees are required

• When the application can significantly benefit from the use of the relational model and SQL

• Related Article (Communications of the ACM)

• http://cacm.acm.org/blogs/blog-cacm/109710-new-sql-an-alternative-to-nosql-and-old-sql-for-new-oltp-apps/fulltext

1. Support the relational data model

2. Use SQL as the primary mechanism for application interaction

3. ACID support for transactions

4. A non-locking concurrency control mechanism so real-time reads will not conflict with writes, and thereby cause them to stall

5. A scale-out, shared-nothing architecture, capable of running on a large number of nodes without bottlenecking

6. An architecture providing much higher per-node performance than available from traditional databases

Modified from http://cacm.acm.org/blogs/blog-cacm/109710-new-sql-an-alternative-to-nosql-and-old-sql-for-new-oltp-apps/fulltext

• New Architectures

• New database platforms designed to operate in a distributed cluster of shared-nothing nodes

• Examples: VoltDB, NuoDB, Clustrix, and VMware's SQLFire

• MySQL Engines

• Highly optimized storage engines for MySQL.

• Use the same programming interface as MySQL but scale better

• Examples: TokuDB, MemSQL, and Akiban

• Transparent Sharding

• These systems provide a sharding middleware layer to automatically split databases across multiple nodes

• Examples: dbShards, ScaleBase and ScaleDB

Modified from http://en.wikipedia.org/wiki/NewSQL
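
The transparent sharding category above can be illustrated with a minimal Python sketch of a routing layer. This is an assumption-laden toy, not the design of dbShards, ScaleBase, or ScaleDB: the `ShardRouter` class, the hash-modulo routing rule, and the in-memory dictionaries standing in for database nodes are all invented for illustration.

```python
import hashlib


class ShardRouter:
    """Toy middleware that hides which shard holds a given row."""

    def __init__(self, shard_names):
        self.names = list(shard_names)
        # Each shard is an in-memory dict standing in for a separate database node.
        self.shards = {name: {} for name in self.names}

    def shard_for(self, key):
        """Map a key to a shard with a stable hash, so routing is repeatable."""
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        return self.names[int.from_bytes(digest[:8], "big") % len(self.names)]

    def insert(self, key, row):
        self.shards[self.shard_for(key)][key] = row

    def select(self, key):
        return self.shards[self.shard_for(key)].get(key)


router = ShardRouter(["shard-0", "shard-1", "shard-2"])
for customer_id in range(100):
    router.insert(customer_id, {"id": customer_id})

# The application sees one logical table and never names a shard.
print(router.select(42))
print({name: len(rows) for name, rows in router.shards.items()})
```

A simple modulo rule like this forces most keys to move when a shard is added, which is one reason real systems use more elaborate partitioning schemes.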

https://voltdb.com

• In-memory database

• ACID-compliant RDBMS

• Uses a shared nothing architecture

• Written in Java and C++

• Supported operating systems: Linux and Mac OS X

• Provides client libraries for Java, C++, C#, PHP, Python and Node.js

• Atomicity

• VoltDB defines a transaction as a stored procedure, which either succeeds or rolls back on failure

• Consistency

• VoltDB enforces schema and datatype constraints in all database queries

• Isolation

• VoltDB transactions are globally ordered and run to completion on all affected partitions without interleaving

• Durability

• VoltDB provides replication of partitions, and periodic database snapshots combined with command logging to ensure high availability and database durability

http://voltdb.com/dig-deeper/faq.php
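
The atomicity and isolation points above can be sketched with a toy single-partition executor in Python. This is not the VoltDB API: the `Partition` class, the `transfer` procedure, and the account data are hypothetical, and copying the whole state before each call is a simplification that only makes sense for a small example.

```python
import copy


class Partition:
    """Runs stored procedures one at a time, each to completion."""

    def __init__(self):
        self.tables = {"accounts": {}}

    def run(self, procedure, *args):
        """Execute one procedure; restore the previous state if it raises."""
        snapshot = copy.deepcopy(self.tables)
        try:
            return True, procedure(self.tables, *args)
        except Exception as error:
            self.tables = snapshot   # roll back every change the procedure made
            return False, str(error)


def transfer(tables, source, target, amount):
    """Hypothetical stored procedure: move money between two accounts."""
    accounts = tables["accounts"]
    accounts[source] -= amount
    if accounts[source] < 0:
        raise ValueError("insufficient funds")
    accounts[target] += amount
    return accounts[source], accounts[target]


partition = Partition()
partition.tables["accounts"].update({"alice": 100, "bob": 20})

ok_first, first = partition.run(transfer, "alice", "bob", 30)
ok_second, second = partition.run(transfer, "alice", "bob", 500)
print(ok_first, first)
print(ok_second, second)
print(partition.tables["accounts"])
```

Because calls to `run` never overlap, no procedure can observe another one's partial changes, which is the sense in which serial execution gives isolation without locks.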
