Polyglot metadata for Hadoop
Polyglot Metadata for Hadoop
Jim Dowling, Associate Prof @ KTH
Senior Researcher @ SICS, CEO @ Logical Clocks AB
www.hops.io @hopshadoop
s9s Polyglot Persistence Meetup, 23rd August 2016 @ Spotify/Stockholm
Polyglot Persistence: the Right Tool for the Job
2
3
Minimize Impedance Mismatch
Data Access Paradigm: SQL, Static Object-Oriented, Dynamic Object-Oriented, Functional, Free-Text Search, Matrix Operations
Database Paradigm: Relational (Row-based), Relational (Columnar), Document-Based, Graph-Based, Key-Value, Object-Oriented / Hierarchical / Network
Which database paradigm fits which data access paradigm?
4
In the Good Old Dayz….
[http://martinfowler.com/]
5
Led to Big Fights over Choice of Database
6
The Advent of Microservices
[http://martinfowler.com/]
7
Should we give teams freedom to choose?
• Ericsson - 1000s of build systems
• Google - one build system, Blaze
8
Monolithic vs Polyglot Persistence
[http://martinfowler.com/]
9
Problems: Polyglot Persistence/Microservices
• What if the same data is updated by different microservices? E.g., user data in subscriber systems.
• Transactions (or agreement protocols) across microservices are very hard.
• Where possible, try to store such data in a single database (single source of truth)…
10
Problems: Polyglot Persistence/Microservices
• …but if we minimize the number of databases…
• …how do we handle different data access patterns for the same data?
- OLTP SQL
- OLAP SQL
- Free-Text Search
- Etc.
11
Multi-functional Databases
SAP Hana, MemSQL combine OLTP and OLAP in a single DB
12
If OLAP doesn’t mutate your data…..
[http://severalnines.com/]
Replication protocols from the SSOT (single source of truth) to other databases work for immutable data.
13
If you have Big Data…
[Diagram: Sqoop imports/exports data between an RDBMS (MySQL, Postgres, …) and Hadoop]
Hadoop as a Polyglot Storage System
14
Hive as Polyglot Storage in Hadoop
15
16
Hive/Spark simplifies development

Hive-on-Spark:

sqlContext = HiveContext(sc)
f1_df = sqlContext.sql("SELECT id, count(*) AS nb_entries FROM my_db.log \
    WHERE ts = '20160515' \
    GROUP BY id")

SparkSQL:

sqlContext = SQLContext(sc)
# Split each raw log line into its fields before applying the schema
f0 = sc.textFile('logfile').map(lambda line: line.split('\t'))
fpFields = [StructField('ts', StringType(), True),
            StructField('id', StringType(), True),
            StructField('it', StringType(), True)]
fpSchema = StructType(fpFields)
df_f0 = sqlContext.createDataFrame(f0, fpSchema)
df_f0.registerTempTable('log')
f1_df = sqlContext.sql("SELECT log.id, count(*) AS nb_entries FROM log \
    WHERE ts = '20160515' \
    GROUP BY id")
17
Hive as Metadata for HDFS Files
[Diagram: a Hive SQL table maps to a file (e.g., hive/warehouse/database/Table.hive) in the Hadoop Distributed Filesystem (HDFS)]
Hive does NOT maintain integrity between the metadata table and the HDFS file! Remove the Table.hive file and Hive doesn't complain…
Let’s Dig Down and Understand Why…
18
Metadata in HDFS: Paths, Blocks, Replicas
[Diagram: HDFS metadata architecture - an active NameNode and a Standby NameNode share the Edit Log via Journal Nodes, with Zookeeper for failover and the FSImage holding the namespace]
HDFS metadata cannot be easily extended (with a schema, for example)
Metadata Totem Poles in Hadoop
20
How do they ensure the consistency of the metadata and the data?
Hops Hadoop
21
Hops Hadoop Data Integration Journey
• Principled and safe mechanisms for adding metadata to HDFS and YARN
• Free-text search for files in HDFS - instead of Pig jobs to search HDFS, use Elasticsearch
• Integrate Kafka metadata into Hops Hadoop - Kafka uses Zookeeper for ACLs, metadata, etc.; Hops Hadoop uses MySQL Cluster
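Free-text search over file metadata needs little more than an inverted index; Elasticsearch provides exactly that at scale. A toy sketch of the idea in plain Python (illustrative only, not the actual Hops/Elasticsearch integration):

```python
from collections import defaultdict

# Toy inverted index: term -> set of HDFS paths, playing the role
# Elasticsearch plays over the replicated file metadata.
index = defaultdict(set)

def index_file(path, description):
    """Index a file's free-text metadata under each of its terms."""
    for term in description.lower().split():
        index[term].add(path)

def search(query):
    """Return paths whose metadata contains every query term."""
    results = [index[t.lower()] for t in query.split()]
    return set.intersection(*results) if results else set()

index_file("/data/genomics/run1.csv", "genome sequencing run KTH")
index_file("/data/logs/app.log", "application log spotify")
hits = search("genome KTH")
```

With the index maintained outside the NameNode, finding files becomes an interactive query rather than a full HDFS scan with a Pig job.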
HopsFS Architecture
23
[Diagram: HopsFS architecture - HDFS clients talk to multiple NameNodes (one elected Leader), which store metadata in NDB (> 12 TB); DataNodes store the file blocks]
[HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases, Niazi et al., arXiv 2016]
24
HopsYARN Architecture
[Diagram: HopsYARN architecture - YARN clients talk to multiple ResourceMgrs (one Scheduler); Resource Trackers and NodeManagers report state, stored in NDB]
25
Strongly Consistent Metadata for HDFS
[Diagram: files, directories, and extended schemas stored in a single database for HDFS metadata, updated with 2-phase commit (transactions)]
Metadata integrity with transactions and foreign keys.
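Because all HopsFS metadata lives in one transactional database, ordinary foreign-key constraints can guarantee that extended metadata never outlives its file. A minimal sketch of that idea using Python's built-in sqlite3 in place of MySQL Cluster/NDB (the table and column names here are illustrative, not the real HopsFS schema):

```python
import sqlite3

# Illustrative schema: extended-metadata rows reference inode rows.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce FK constraints
conn.execute("CREATE TABLE inodes (id INTEGER PRIMARY KEY, path TEXT)")
conn.execute("""CREATE TABLE metadata (
    inode_id INTEGER REFERENCES inodes(id) ON DELETE CASCADE,
    doc TEXT)""")

conn.execute("INSERT INTO inodes VALUES (1, '/user/jim/log')")
conn.execute("""INSERT INTO metadata VALUES (1, '{"owner": "jim"}')""")

# Orphan metadata is rejected up front...
try:
    conn.execute("INSERT INTO metadata VALUES (99, '{}')")
    orphan_allowed = True
except sqlite3.IntegrityError:
    orphan_allowed = False

# ...and deleting the file removes its metadata in the same transaction.
conn.execute("DELETE FROM inodes WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM metadata").fetchone()[0]
```

This is exactly the guarantee Hive's metastore cannot give, since Hive's metadata and the HDFS files live in two unrelated systems.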
26
HDFS Metadata API
Schema-less API:
attach(path, metadataJSON)
detach(hdfsPath)
Schema-based API:
register(path, schemaJSON)
add(path, schema, metadataJSON)
remove(path, schema, metadataJSON)
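The schema-less calls read as "attach a JSON document to an HDFS path". A hypothetical in-memory mock of that contract (the names mirror the slide; this is not actual Hops client code, where the store is the distributed database):

```python
import json

class MetadataStore:
    """Toy stand-in for the Hops schema-less metadata API."""

    def __init__(self):
        self._docs = {}  # path -> parsed JSON metadata

    def attach(self, path, metadata_json):
        """Attach a JSON metadata document to an HDFS path."""
        self._docs[path] = json.loads(metadata_json)

    def detach(self, hdfs_path):
        """Remove any metadata attached to the path."""
        self._docs.pop(hdfs_path, None)

    def get(self, path):
        return self._docs.get(path)

store = MetadataStore()
store.attach("/user/jim/genome.csv", '{"study": "kth", "format": "csv"}')
fmt = store.get("/user/jim/genome.csv")["format"]
store.detach("/user/jim/genome.csv")
gone = store.get("/user/jim/genome.csv")
```

The schema-based variant adds a register step so that attached documents can be validated against a declared schema before being stored.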
Hops Metadata services
[Diagram: a Metadata API fronting Elasticsearch, the Database (HDFS/YARN), and Kafka/Zookeeper]
The distributed database is the single source of truth for metadata.
28
Good Metadata: Elasticsearch
[Diagram: files, directories, and metadata in the database are replicated one-way into Elasticsearch search indexes]
Eventual consistency for metadata. Metadata integrity maintained by asynchronous replication and metadata immutability.
[ePipe Tutorial, BOSS Workshop, VLDB 2016]
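Because the replicated metadata is immutable, the database-to-Elasticsearch pipeline only needs at-least-once delivery: replaying an event overwrites an entry with identical data, so duplicates are harmless. A minimal sketch of that idempotent one-way replication (the structures are illustrative; see ePipe for the real system):

```python
# Append-only changelog in the database (the single source of truth).
changelog = [
    (1, "/user/jim/a.csv", {"owner": "jim"}),
    (2, "/user/jim/b.csv", {"owner": "seif"}),
]

# Eventually-consistent replica, playing the Elasticsearch role.
search_index = {}

def replicate(events):
    """Apply changelog events; safe to replay since entries are immutable."""
    for _seq, path, meta in events:
        search_index[path] = meta  # rewriting identical data is a no-op

replicate(changelog)
replicate(changelog)  # duplicate delivery changes nothing
size = len(search_index)
```

Mutable metadata would break this scheme: a replayed stale update could overwrite a newer value, which is why immutability is doing the real work here.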
29
Not-so-Good Metadata: Kafka
[Diagram: topics, partitions, and ACLs live in Zookeeper/Kafka; the Hops database mirrors them through its Metadata API by polling]
Eventual consistency for metadata. Metadata integrity maintained by custom recovery logic and polling.
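With no shared transactional store, the database copy of Kafka's metadata can only be kept consistent by periodically polling Zookeeper and repairing any differences. A toy sketch of one reconciliation pass (the dict structures are illustrative, not real Zookeeper or Hops data models):

```python
def reconcile(zk_acls, db_acls):
    """One polling pass: make the DB copy match the Zookeeper source."""
    for topic, acl in zk_acls.items():
        if db_acls.get(topic) != acl:
            db_acls[topic] = acl   # add missing or repair stale entries
    for topic in list(db_acls):
        if topic not in zk_acls:
            del db_acls[topic]     # drop ACLs for deleted topics

# Zookeeper is authoritative; the DB copy has drifted.
zk = {"clicks": {"alice": "read"}, "logs": {"bob": "write"}}
db = {"clicks": {"alice": "write"}, "stale_topic": {"eve": "read"}}
reconcile(zk, db)
```

Between polls the two copies can disagree, which is why this slide calls Kafka's Zookeeper-based metadata "not-so-good" compared with the single-database approach.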
Demo
30
www.hops.site
31
A 2 MW datacenter research and test environment
5 lab modules, planned for up to 3,000-4,000 servers and 2,000-3,000 square meters
[Slide by Prof. Tor Björn Minde, CEO SICS North Swedish ICT AB]
Demo
32
Summing Up
• Multiple storage engines are here to stay
• Picking the right tool for the right job is not easy
• Do not be religious about your tool of choice!
• Hops shows how you can combine multiple storage engines to give an improved user experience for Hadoop
33
The Hops Team
Active: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Johan Svedlund Nordström, Vasileios Giannokostas, Ermias Gebremeskel, Antonios Kouzoupis, Misganu Dessalegn.
Alumni: Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, K "Sri" Srijeyanthan, Steffen Grohsschmiedt, Alberto Lorente, Andre Moré, Ali Gholami, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Jude D'Souza, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.