Big Data and NoSQL · pages.di.unipi.it/ghelli/bd2/15.bigdata.pdf

Big Data and NoSQL

Sources

• P. J. Sadalage, M. Fowler, NoSQL Distilled, Addison-Wesley

Very short history of DBMSs

• The seventies:

– IMS (end of the sixties, built for the Apollo program; today at Version 15) and IDS (later IDMS): hierarchical and network DBMSs, navigational

• The eighties – for twenty years:

– Relational DBMSs

• The nineties: client/server computing, three tiers, thin clients

Object Oriented Databases

• In the nineties, Object Oriented databases were proposed to overcome the impedance mismatch

• They influenced Relational Databases, and disappeared

Big Data

• Mid 2000s, Big Data:

– Volume:

• DBMSs do not scale enough for some applications

– Velocity:

• Computational speed

• Development velocity:

– DBMSs require upfront schema design and data cleaning

– Variety:

• Schemas conflict with variety

Big Data Examples

• Managing and analysing:

– Google searches

– Twitter feeds

– Facebook posts

– Amazon sales

– Connection data for a mobile phone company

– Location data for a car-black-box company

Big Data platforms

• The Google stack:

– Hardware: each Google Modular Data Center houses 1,000 Linux servers with AC and disks

– GFS: distributed and redundant FS

– MapReduce

– BigTable, on top of GFS

• Hadoop – open source

– HDFS, Hadoop MapReduce

– HBase

– SQL on Hadoop: Apache Hive, IBM Jaql, Apache Pig, Cloudera Impala

Big Data systems: NoSQL systems

• NoSQL: Giving up something to get something more

• Giving up:

– ACID transactions, to gain distribution

– Upfront schema, to gain

• Velocity

• Variety

– First normal form, to reduce the need for joins

• Different from NewSQL

Types of NoSQL systems

• Key-value stores (Amazon Dynamo, Riak, Voldemort…)

• Document databases:

– XML databases: MarkLogic, eXist

– JSON databases:

• CouchDB, Membase, Couchbase

• MongoDB

• Sparse table databases:

– HBase

• Graph databases (not really about BigData):

– Neo4j

NewSQL

• NewSQL is a different approach to Velocity, much less disruptive than NoSQL

– Column databases

– In memory databases

NoSQL

Why NoSQL

• Impedance mismatch

• The schema problem:

– Restrictive

– Heavy to set up

• Integration databases -> application databases

• Cluster architecture

– Google BigTable

– Amazon Dynamo

NoSQL: reasons of success

• Support cluster architecture (Velocity, Volume)

– Google BigTable

– Amazon Dynamo

• Remove schema restriction (Variety, Velocity)

• Simple for simple tasks

NoSQL

• A set of ill-defined systems that are not RDBMSs

• Usually do not support SQL

• Are usually Open Source (not always)

• Often cluster-oriented (not always), hence no ACID

• Recent (after 2000)

• Schema free

• Oriented toward a single application

• It is more a ‘movement’ than a technology

Aggregate data models

• From many simple tables -> to just one collection of aggregated objects (simplified object data model)

• Aggregate data model is essential in order to work without transactions and without joins

Aggregate data models

• NoSQL data models:

– Aggregate data models:

• Key-value

• Document

• Column family

– Graph model

Graph model

• Set of triples <nodeid, property, nodeid> (FlockDB, Neo4j)

Aggregate orientation

http://martinfowler.com/bliki/AggregateOrientedDatabase.html

Aggregate data models

• Key value stores: the database is a collection of <key,value> pairs, where the value is opaque (Dynamo, Riak, Voldemort)

• Document database: a collection of documents (XML or JSON) that can be searched by content (MarkLogic, MongoDB)

• Column-family stores: a set of <key, record> pairs (BigTable, HBase, Cassandra)

– Columns are grouped in ‘column families’
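The three aggregate models above can be contrasted with a small in-memory sketch. It is only an illustration: the customer record, the key names and the `find` helper are hypothetical, and plain Python dictionaries stand in for the actual stores.

```python
import json

# Hypothetical customer record, used only for illustration.
customer = {"name": "Ann", "city": "Pisa", "orders": [{"product": "book", "qty": 2}]}

# Key-value store: the value is an opaque blob; the store can only get/put by key.
kv_store = {"cust:1": json.dumps(customer).encode("utf-8")}

# Document database: the store sees the structure and can search by content.
documents = [dict(customer, _id=1), {"_id": 2, "name": "Bob", "city": "Rome"}]


def find(collection, **criteria):
    """Return the documents whose top-level fields match all the criteria."""
    return [d for d in collection if all(d.get(k) == v for k, v in criteria.items())]


# Column-family store: key -> record, with columns grouped into column families.
column_family_store = {
    "cust:1": {
        "profile": {"name": "Ann", "city": "Pisa"},
        "orders": {"order:1": "book x2"},
    }
}

blob = kv_store["cust:1"]                      # bytes: must be decoded by the client
in_pisa = find(documents, city="Pisa")         # content-based search
profile_city = column_family_store["cust:1"]["profile"]["city"]
```

In the key-value case the client has to decode the blob itself, while the other two models expose some structure to the store.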

Key-value stores implementation

• Implementation model:

– Key-based distribution of the pairs on a huge farm of inexpensive machines

– Constant time access

– Constant time parallel execution on all the pairs

– Flexible fault-tolerance

– MapReduce execution model

– Amazon Dynamo, Riak, Voldemort

Schemaless databases

• Schema first vs. schema later

• Homogeneous vs. non homogeneous

Materialized views

• OLAP applications greatly benefit from materialized views

• Materialized views can be used to regain the flexibility of the relational model

Key-Value distribution

• Sharding + replication

• Sharding: splitting data among nodes according to a key

• Master-slave replication

– No update conflict

– Read resilience

– Master election

• P2P replication

– No single point of failure

• The distributed consistency problem
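Sharding plus replication can be sketched as follows. This is a minimal illustration, not how any particular system does it: the node names, the `shard_for` and `replicas_for` helpers and the modulo placement rule are all assumptions (production systems such as Dynamo use consistent hashing instead, so that adding a node moves few keys).

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard with a stable hash (built-in hash() is salted per process)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def replicas_for(key: str, nodes: list, replication_factor: int) -> list:
    """Place the copies on consecutive nodes, starting at the key's shard."""
    start = shard_for(key, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
placement = replicas_for("user:42", nodes, replication_factor=3)
```

Every client that applies the same function to the same key reaches the same nodes, which is what allows access by key without a central directory.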

Levels of Consistency

• With respect to write-write conflicts: avoid losing an update

• Read consistency:

– Fresh data

– No intermediate data

– Session consistency

• Transactional consistency

– Only write values that are based on currently valid data

The CAP Theorem: example

• Would like: Consistency + Availability + Partition tolerance

• Store three copies of a value for Availability

• 1 read 3 writes:

– Read from any 1 node

– Before committing an update wait for three writes to be completed

• 1 write 3 reads:

– As soon as one write is ok, commit

– Always read 3 copies and return newest value

• 2 writes 2 reads:

– If you read 2, at least one is current

• Consistency + Availability + Partition tolerance = Impossible

The CAP Theorem

• You cannot have all of:

– Consistency

– Availability

– Partition tolerance

• A trade-off between consistency and latency

• Relaxing consistency

– Two writes in the same cart

• Relaxing durability

Consistency: single operation atomicity

• The problem: avoiding r/w and w/w conflicts on a single operation

• Quorum: in a P2P system, an operation is successful if it gets a quorum of confirmations

– The write quorum:

• W > N/2

– The read quorum:

• R+W > N
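The two quorum conditions can be checked mechanically. The sketch below applies them to the three configurations of the CAP example with N = 3 copies; the function names and the `configs` table are illustrative assumptions.

```python
def is_write_quorum(w: int, n: int) -> bool:
    """W > N/2: two concurrent writes cannot both reach a majority."""
    return 2 * w > n


def is_read_quorum(r: int, w: int, n: int) -> bool:
    """R + W > N: every read set overlaps every write set in at least one replica."""
    return r + w > n


N = 3
configs = {
    "1 read, 3 writes": (1, 3),
    "3 reads, 1 write": (3, 1),
    "2 reads, 2 writes": (2, 2),
}
# For each configuration: (read quorum satisfied, write quorum satisfied)
report = {
    name: (is_read_quorum(r, w, N), is_write_quorum(w, N))
    for name, (r, w) in configs.items()
}
```

All three configurations satisfy R+W > N, so a read always sees the latest committed write, but with W = 1 the write quorum fails, so two parallel writes can both succeed on different copies.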

Consistency: update consistency

• The problem: only update a value if nobody else changed it in the meantime

• Optimistic approach:

– You read the data item with a version stamp

– Every time you update, you change the version

– The update operation has the previous-version parameter, and fails if the stamp changed: Compare-And-Set (CAS)
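The optimistic approach can be sketched with a small in-memory store. This is an assumption-laden illustration: the `VersionedStore` class, its method names and the use of an integer counter as version stamp are hypothetical, and a real store would make the check and the write atomic.

```python
class VersionedStore:
    """In-memory store where each value carries a version stamp (an integer counter here)."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); a missing key reads as version 0 with value None."""
        return self._data.get(key, (0, None))

    def compare_and_set(self, key, expected_version, new_value):
        """Write only if nobody changed the item since expected_version was read."""
        current_version, _ = self.read(key)
        if current_version != expected_version:
            return False  # stale read: the caller must re-read and retry
        self._data[key] = (current_version + 1, new_value)
        return True


store = VersionedStore()
version, _ = store.read("cart")
first = store.compare_and_set("cart", version, ["book"])     # stamp unchanged: accepted
second = store.compare_and_set("cart", version, ["laptop"])  # stamp changed: rejected
```

The second writer is not blocked; it simply learns that its read is stale and has to read again before retrying.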

Consistency: replication optimism

• Assume we do not have quorum, and two copies with versionid are updated in parallel: what happens?

– When version is a counter

– When version is a random GUID

• P2P consistency problem: deciding the temporal relationship between two different versions

• Local counter or GUID does not help

• The vector clock:

– Assume nodes A, B, C; the version stamp is [A:7; B:5; C:9]
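A vector clock makes the temporal relationship between two versions decidable. The sketch below uses the standard comparison rule (one clock precedes another if it is less than or equal on every node); the function names and the dictionary representation are assumptions.

```python
def compare(a: dict, b: dict) -> str:
    """Compare two vector clocks: 'equal', 'before', 'after' or 'concurrent'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock after `node` performs an update."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


base = {"A": 7, "B": 5, "C": 9}   # the version stamp [A:7; B:5; C:9]
on_a = increment(base, "A")       # update accepted by node A
on_b = increment(base, "B")       # parallel update accepted by node B

relation_to_base = compare(base, on_a)
relation_parallel = compare(on_a, on_b)
```

Two versions that come out as concurrent are a genuine conflict: neither descends from the other, so the application (or a rule such as "newest write wins") has to reconcile them.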

Parallelism: Map-Reduce

• Map(m): apply m in parallel to each object, to get a set of <key, value> pairs

• Shuffle-sort: collect all pairs with the same <key> to the same node, get sets with shape {<k,v1>,…,<k,vn>}

• Reduce(r): apply r to each set {<k,v1>,…,<k,vn>} to produce a result

Map-Reduce

[Figure: Map-Reduce dataflow. INPUT (Data 1 … Data 4) → Map (m: d → seq(k,v)) → Shuffle and Sort (pairs grouped by key, K1 … K5) → Reduce (r: seq(k,v) → k,v) → OUTPUT (one result per key, e.g. K4 r(v33,v31,v21))]

Example: word count

• Problem: counting the number of occurrences for each word in a big collection of documents

• Map:

– takes a couple (k, document), ignores k, returns a pair (w,1) for each word w in document

• Shuffle&Sort:

– groups the Map output by w and produces pairs of the form (w, [1, …,1])

• Reduce:

– takes a pair (w, [1, …,1]), and outputs (w, 1+…+1)

Example: word count

[Figure: word count dataflow. INPUT: four documents (“NoSQL Parallel NoSQL”, “Velocity NoSQL DBMS”, “Velocity Map Velocity”, “NoSQL Parallel NoSQL”) → Map(m): one pair (word, 1) per word → Shuffle and Sort: pairs grouped by word → Reduce(r) → OUTPUT: DBMS 1, NoSQL 5, Velocity 3, Parallel 2, Map 1]

Pseudo code

• Map( _, v ):

– for each w in v do emit(w, 1)

• Reduce(k, v):

– c=0;

– for x in v do c = c +1;

– emit(k, c)
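The pseudo code can be turned into a runnable sketch. The `map_reduce` driver below is a sequential stand-in written for this example, not the Hadoop API; it only mimics the three phases on a single machine, and the input is the four documents of the word count example.

```python
from collections import defaultdict


def map_fn(_key, document):
    """Map: emit (w, 1) for each word w in the document; the input key is ignored."""
    for word in document.split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: count the values collected for a word, as in the pseudo code."""
    count = 0
    for _ in values:
        count += 1
    return key, count


def map_reduce(inputs, mapper, reducer):
    """Sequential stand-in for a Map-Reduce run: map, shuffle and sort, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # shuffle: group pairs by key
    return dict(reducer(k, groups[k]) for k in sorted(groups))


documents = [
    (1, "NoSQL Parallel NoSQL"),
    (2, "Velocity NoSQL DBMS"),
    (3, "Velocity Map Velocity"),
    (4, "NoSQL Parallel NoSQL"),
]
counts = map_reduce(documents, map_fn, reduce_fn)
```

In a real cluster each call to `map_fn` and `reduce_fn` may run on a different node; only the grouping step requires moving data between nodes.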

Exercises

• Sales(Date,StoreId,ProdId,Amount)

• How to compute group_by({Date},{sum(Amount)})?

• Sales+Stores(StoreId,Region)

• How to compute join(Sales,Stores)?
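One possible way to approach both exercises is sketched below: a group-by keys each row by the grouping attribute and aggregates in Reduce, while a join tags each row with its table, keys it by the join attribute, and pairs the rows in Reduce (a reduce-side join). The sample rows, the helper names and the sequential `map_reduce` driver are all assumptions made for illustration.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Sequential stand-in for Map-Reduce: map, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    results = []
    for key in sorted(groups):
        results.extend(reducer(key, groups[key]))
    return results


# Hypothetical rows: Sales(Date, StoreId, ProdId, Amount) and Stores(StoreId, Region).
sales = [
    ("2024-01-01", "s1", "p1", 10),
    ("2024-01-01", "s2", "p2", 5),
    ("2024-01-02", "s1", "p1", 7),
]
stores = [("s1", "North"), ("s2", "South")]


# group_by({Date}, {sum(Amount)}): key each sale by Date, sum the amounts in Reduce.
def sales_by_date(sale):
    date, _store, _prod, amount = sale
    yield date, amount


def sum_amounts(date, amounts):
    yield date, sum(amounts)


totals = map_reduce(sales, sales_by_date, sum_amounts)


# join(Sales, Stores): key every tagged row by StoreId, then pair the rows in Reduce.
def tag_by_store(tagged_row):
    table, row = tagged_row
    store_id = row[1] if table == "sales" else row[0]
    yield store_id, (table, row)


def join_rows(_store_id, tagged_rows):
    regions = [row[1] for table, row in tagged_rows if table == "stores"]
    for table, row in tagged_rows:
        if table == "sales":
            for region in regions:
                yield row + (region,)


tagged = [("sales", s) for s in sales] + [("stores", s) for s in stores]
joined = map_reduce(tagged, tag_by_store, join_rows)
```

The join works because the shuffle phase brings all rows with the same StoreId, from both tables, to the same reducer.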

Implementing map-reduce: Hadoop

• Input and output of each phase are stored in a distributed file system that manages the partitioning and the replication

• Spark approach: when possible, input and output are just kept in main memory

• The computation is divided among many small tasks

• A task manager assigns the tasks and, when a task fails, re-executes it

Dataflow systems

• Dataflow systems are similar to map-reduce systems but they implement a wider range of parallel patterns, with vertices that generalize the map and reduce vertices and edges that generalize the key-based communication between map and reduce

Key-Value Databases

• Basically, a persistent hash table

• Sharding + replication

• Consistency

– Single object

– Riak: for each bucket (data space):

• Newest write wins / create siblings

• Setting read / write quorum

• Query

– By key

– Full store scan (not always provided)

• Uses: session information, user profiles, shopping cart data by userid…

Document Databases: MongoDB

• One instance, many databases, many collections

• JSON documents with _id field

• Sharding + replication

Consistency

• Master/slave replication

– Automated failover, server maintenance, disaster recovery, read scaling

• A new master is dynamically elected on failure

• One can specify a write quorum

• One can specify whether reads can be directed to slaves

Querying

• CouchDB:

– query via views (virtual or materialized)

• MongoDB:

– Selection, projection, aggregation

Column-family Stores

• A ‘column-family’ (similar to a ‘table’ in relational databases) is a set of <key,record> pairs

• It can be vertically divided in keyspaces

• Records are not necessarily homogeneous

Consistency

• In Cassandra:

– The DBA fixes the number of replicas for each keyspace

– the programmer decides the quorum for read and write operations (1, majority, all…)

– Transactions:

• Atomicity at the row level

• Possibility to use external transactional libraries

Queries (Cassandra)

• Row retrieval:

– GET Customer['johnsmith00012']

• Field (column) retrieval:

– GET Customer['johnsmith00012']['age']

• After you create an index on age:

– GET Customer WHERE age = 35

• Cassandra supports CQL:

– Select-project (no join) SQL

Graph Databases

• A graph database stores a graph

• A graph is, essentially, a database with one ternary table:

– Edges(NodeId1, NodeId2, EdgeAttributes)

– You may also have Nodes(NodeId, NodeAttributes) (optional)

• Example: Neo4j

Graph model

Consistency

• Graph databases are usually transactional and not sharded

• Neo4j supports master-slave replication

• Data can be sharded at the application level with no database support, which is quite hard

Querying: Cypher

MATCH (me {name:"Giorgio"})

RETURN me

Querying: Cypher

MATCH (expert)

-[:WORKED_WITH]->

(neodb:Database{name:"Neo4j"})

RETURN neodb, expert

Querying: Cypher

MATCH (me {name:"Giorgio"})

MATCH (expert)

-[:WORKED_WITH]->

(neodb:Database {name:"Neo4j"})

MATCH path = shortestPath( (me)-[:FRIEND*..5]-(expert) )

RETURN neodb, expert, path
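What the shortestPath pattern asks for can be approximated outside the database with a breadth-first search over FRIEND edges, followed in both directions and limited to five hops. This is only a sketch of the idea, not how Neo4j evaluates the query; the people, the edges and the helper names are hypothetical.

```python
from collections import deque


def undirected(edges):
    """Build an adjacency map that can be followed in both directions."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def shortest_path(friends, start, goal, max_hops=5):
    """Breadth-first search over FRIEND edges, using at most max_hops edges."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        if len(path) - 1 == max_hops:
            continue  # do not extend paths that already use max_hops edges
        for neighbour in sorted(friends.get(node, ())):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Hypothetical data: FRIEND edges, and the people who WORKED_WITH the Neo4j database.
friends = undirected(
    [("Giorgio", "Anna"), ("Anna", "Luca"), ("Luca", "Marta"), ("Giorgio", "Paolo")]
)
experts = ["Marta"]
paths = {expert: shortest_path(friends, "Giorgio", expert) for expert in experts}
```

The declarative query leaves the traversal strategy to the database, whereas here the strategy is spelled out by hand.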

Querying: Cypher

MATCH pattern matches

WHERE filtering conditions

RETURN what to return

ORDER BY properties to order by

SKIP nodes to skip from the top

LIMIT limit results

NoSQL systems advantages

• Support for cluster architecture:

– Volume and Velocity

• Aggregate model, schemaless architecture:

– Velocity of development for simple applications

• Schemaless architecture:

– Supports Variety

• Flexible consistency:

– Supports Velocity

NoSQL systems problems

• Transactional support is limited to a single aggregate

• Flexible consistency is hard to manage

• No SQL, no optimization:

– Complex data needs to be pre-aggregated

– Different queries require the construction of different re-aggregations of the same data

Big Data architectural trends

• The data lake

• Polyglot systems

The Data Lake

• Standard Data Warehouse architecture:

– Long phase of data design to decide the schema

– Complex phase of data cleaning to get high quality data

– Ready to play

• The Data Lake:

– Just collect all data you have in the Data Lake

– Run ML algorithms on the Lake

Polyglot systems

• Combine transactional RDBMSs, DSSs and NoSQL systems

• Advantages: pay the price of schemas and transactions only where they are needed

• Problems: maintenance and security

SQL on top of MapReduce

• Serdar Yegulalp compiled this list in 2014 (ask Google):

– Apache Hive: The original SQL-on-Hadoop solution

– Stinger: Hortonworks development of Apache Hive

– Apache Drill: An open source implementation of Google's Dremel (aka BigQuery), to access multiple types of data stores

– Spark SQL: Apache's Spark project is for real-time, in-memory, parallelized processing of Hadoop data.

– Apache Phoenix: Its developers call it a "SQL skin for HBase".

– Cloudera Impala: another implementation of Dremel/Apache Drill for Hadoop.

– HAWQ for Pivotal HD: Pivotal version for its own Hadoop distribution

– Presto: Built by Facebook's engineers, reminiscent of Apache

– Oracle Big Data SQL

– IBM BigSQL

Conclusion

• There is no ‘winner’: DBMSs, DSSs, parallel and distributed DBs, NoSQL systems: they are all here to stay

• There is a terrible trend of moving everything to NoSQL and Machine Learning due to hype: great occasion for consultants, and for waste

• The only way of making a good choice is having a real understanding of:

– The business problem to be solved

– The current state of the technology