Polyglot metadata for Hadoop

Polyglot Metadata for Hadoop. Jim Dowling, Associate Prof @ KTH, Senior Researcher @ SICS, CEO @ Logical Clocks AB. www.hops.io @hopshadoop. s9s Polyglot Persistence Meetup, 23rd August 2016 @ Spotify/Stockholm

Page 1: Polyglot metadata for Hadoop

Polyglot Metadata for Hadoop

Jim Dowling Associate Prof @ KTH

Senior Researcher @ SICS

CEO @ Logical Clocks AB

www.hops.io @hopshadoop

s9s Polyglot Persistence Meetup, 23rd August 2016 @ Spotify/Stockholm

Page 2: Polyglot metadata for Hadoop

Polyglot Persistence: the Right Tool for the Job

Page 3: Polyglot metadata for Hadoop

Minimize Impedance Mismatch

Data Access Paradigm
• SQL
• Static Object Oriented
• Dynamic Object Oriented
• Functional
• Free-Text Search
• Matrix Operations

Database Paradigm
• Relational (Row-based)
• Relational (Columnar)
• Document-Based
• Graph-Based
• Key-Value
• Object-Oriented / Hierarchical / Network

?

Page 4: Polyglot metadata for Hadoop

In the Good Old Dayz….

[http://martinfowler.com/]

Page 5: Polyglot metadata for Hadoop

Led to Big Fights over Choice of Database

Page 6: Polyglot metadata for Hadoop

The Advent of Microservices

[http://martinfowler.com/]

Page 7: Polyglot metadata for Hadoop

Should we give teams freedom to choose?

• Ericsson
- 1000s of build systems

• Google
- One build system, Blaze

Page 8: Polyglot metadata for Hadoop

Monolithic vs Polyglot Persistence

[http://martinfowler.com/]

Page 9: Polyglot metadata for Hadoop

Problems: Polyglot Persistence/Microservices

• What if the same data is updated by different microservices?
- E.g., user data in subscriber systems

• Transactions (or agreement protocols) across microservices are very hard

• Where possible, try to store such data in a single database (single source of truth) ….

Page 10: Polyglot metadata for Hadoop

Problems: Polyglot Persistence/Microservices

• …but, if we minimize the number of databases……

• …how do we handle different data access patterns for the same data?
- OLTP SQL
- OLAP SQL
- Free-Text Search
- Etc.

Page 11: Polyglot metadata for Hadoop

Multi-functional Databases

SAP Hana, MemSQL combine OLTP and OLAP in a single DB

Page 12: Polyglot metadata for Hadoop

If OLAP doesn’t mutate your data…..

[http://severalnines.com/]

OLTP OLAP

Replication protocols from the single source of truth (SSOT) to other databases work for immutable data

Page 13: Polyglot metadata for Hadoop

If you have Big Data….

RDBMS(MySQL, Postgres,..)

Import

Export

SQOOP

Page 14: Polyglot metadata for Hadoop

Hadoop as a Polyglot Storage System

Page 15: Polyglot metadata for Hadoop

Hive as Polyglot Storage in Hadoop

Page 16: Polyglot metadata for Hadoop

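Hive/Spark simplifies development

from pyspark.sql import HiveContext, SQLContext
from pyspark.sql.types import StructType, StructField, StringType

# `sc` is the SparkContext provided by the PySpark shell.

# With a Hive table, the schema comes from the Hive metastore.
sqlContext = HiveContext(sc)
f1_df = sqlContext.sql(
    "SELECT id, count(*) AS nb_entries FROM my_db.log "
    "WHERE ts = '20160515' "
    "GROUP BY id")

# Without Hive, the schema has to be declared by hand.
sqlContext = SQLContext(sc)
# Assumes comma-separated lines; the slide does not show the parsing step.
f0 = sc.textFile('logfile').map(lambda line: line.split(','))
fpFields = [
    StructField('ts', StringType(), True),
    StructField('id', StringType(), True),
    StructField('it', StringType(), True)]
fpSchema = StructType(fpFields)
df_f0 = sqlContext.createDataFrame(f0, fpSchema)
df_f0.registerTempTable('log')
f1_df = sqlContext.sql(
    "SELECT log.id, count(*) AS nb_entries FROM log "
    "WHERE ts = '20160515' GROUP BY id")

SparkSQL

Hive-on-Spark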

Page 17: Polyglot metadata for Hadoop

Hive as Metadata for HDFS Files

hive

warehouse

database

Table.hive

Hadoop Distributed Filesystem (HDFS)

Table

Hive DOESN’T maintain the integrity of the metadata table and the HDFS file!! Remove the Table.hive file and Hive doesn’t complain….

SQL
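The integrity gap on this slide can be illustrated with a toy model. This is only a sketch: `ToyFileSystem` and `ToyMetastore` are hypothetical stand-ins rather than Hive or HDFS code, and the path is an assumption assembled from the labels in the diagram. The point is that when the table catalogue and the file namespace live in two independent stores, nothing stops one from changing without the other.

```python
# Toy model: a "metastore" and a "filesystem" kept in two independent stores.
# All names here are hypothetical; this is not Hive or HDFS code.

class ToyFileSystem:
    def __init__(self):
        self.paths = set()

    def create(self, path):
        self.paths.add(path)

    def delete(self, path):
        # Deleting a file knows nothing about any table that points at it.
        self.paths.discard(path)


class ToyMetastore:
    def __init__(self):
        self.tables = {}  # table name -> path of its backing file

    def create_table(self, name, path):
        self.tables[name] = path

    def dangling_tables(self, fs):
        # Tables whose backing file no longer exists in the filesystem.
        return sorted(n for n, p in self.tables.items() if p not in fs.paths)


fs = ToyFileSystem()
metastore = ToyMetastore()

# Assumed path, built from the labels in the slide's diagram.
path = "/hive/warehouse/database/Table.hive"
fs.create(path)
metastore.create_table("Table", path)
before = metastore.dangling_tables(fs)

# Remove the file directly, bypassing the metastore: no error is raised.
fs.delete(path)
after = metastore.dangling_tables(fs)

print(before)
print(after)
```

The table entry survives the delete and is only discovered as dangling by an explicit check, which is the behaviour the slide complains about.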

Page 18: Polyglot metadata for Hadoop

Let’s Dig Down and Understand Why…

Page 19: Polyglot metadata for Hadoop

Metadata in HDFS: Paths, Blocks, Replicas

Journal Nodes

NameNode

Zookeeper

StandbyNameNode

FSImage

Edit Log

HDFS Metadata cannot be easily extended (with a Schema, for example)

HDFS Metadata Architecture

Page 20: Polyglot metadata for Hadoop

Metadata Totem Poles in Hadoop

How do they ensure the consistency of the metadata and the data?

Page 21: Polyglot metadata for Hadoop

Hops Hadoop

Page 22: Polyglot metadata for Hadoop

Hops Hadoop Data Integration Journey

• Principled and safe mechanisms for adding metadata to HDFS and YARN

• Free-text search for files in HDFS
- Instead of Pig jobs to search HDFS, use Elasticsearch

• Integrate Kafka metadata into Hops Hadoop
- Kafka uses Zookeeper for ACLs, metadata, etc.
- Hops Hadoop uses MySQL Cluster

Page 23: Polyglot metadata for Hadoop

HopsFS Architecture

NameNodes

NDB

Leader

HDFS Client

DataNodes

> 12 TB

[HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases, Niazi et al., arXiv 2016]

Page 24: Polyglot metadata for Hadoop

HopsYARN Architecture

ResourceMgrs

NDB

Scheduler

YARN Client

NodeManagers

Resource Trackers

Page 25: Polyglot metadata for Hadoop

Strongly Consistent Metadata for HDFS

Files

Directories

Schema….

Single Database for HDFS Metadata

2-phase commit (transactions)

Metadata Integrity with Transactions and Foreign Keys.
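A minimal sketch of what "metadata integrity with transactions and foreign keys" buys you. SQLite stands in here for the distributed database named in the deck, the table and column names are invented for illustration, and the two-phase commit across database nodes is not modelled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute(
    "CREATE TABLE inodes (id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL)"
)
conn.execute(
    "CREATE TABLE file_metadata ("
    " inode_id INTEGER NOT NULL REFERENCES inodes(id) ON DELETE CASCADE,"
    " meta_key TEXT NOT NULL, meta_value TEXT NOT NULL)"
)

# One transaction: the file and its metadata appear together or not at all.
with conn:
    conn.execute("INSERT INTO inodes (id, path) VALUES (1, '/data/log')")
    conn.execute("INSERT INTO file_metadata VALUES (1, 'owner', 'alice')")

# Metadata for a file that does not exist is rejected by the foreign key.
try:
    with conn:
        conn.execute("INSERT INTO file_metadata VALUES (99, 'owner', 'bob')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Deleting the file removes its metadata as part of the same transaction.
with conn:
    conn.execute("DELETE FROM inodes WHERE id = 1")
remaining = conn.execute("SELECT count(*) FROM file_metadata").fetchone()[0]

print(orphan_rejected, remaining)
```

Because the file row and its metadata rows share one database, orphaned metadata cannot be created and cannot be left behind, in contrast to the Hive case earlier in the deck.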

Page 26: Polyglot metadata for Hadoop

HDFS Metadata API

Schema-less API
- attach(path, metadataJSON)
- detach(hdfsPath)

Schema-based API
- register(path, schemaJSON)
- add(path, schema, metadataJSON)
- remove(path, schema, metadataJSON)
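To make the two API styles concrete, here is an in-memory sketch that mirrors the call names on the slide. Everything beyond those names is an assumption: the `ToyMetadataStore` class, the shape of the schema JSON (`name` plus `fields`), and the validation rule are hypothetical and do not describe how Hops implements the API.

```python
import json


class ToyMetadataStore:
    """In-memory stand-in that mirrors the call names on the slide."""

    def __init__(self):
        self.free_form = {}  # path -> metadata dict
        self.schemas = {}    # (path, schema name) -> set of allowed fields
        self.records = {}    # (path, schema name) -> list of metadata dicts

    # Schema-less API: any JSON object can be attached to a path.
    def attach(self, path, metadata_json):
        self.free_form[path] = json.loads(metadata_json)

    def detach(self, path):
        self.free_form.pop(path, None)

    # Schema-based API: a schema is registered first, then records are checked.
    def register(self, path, schema_json):
        schema = json.loads(schema_json)  # assumed shape: {"name", "fields"}
        key = (path, schema["name"])
        self.schemas[key] = set(schema["fields"])
        self.records.setdefault(key, [])

    def add(self, path, schema, metadata_json):
        record = json.loads(metadata_json)
        allowed = self.schemas[(path, schema)]  # KeyError if not registered
        unknown = set(record) - allowed
        if unknown:
            raise ValueError("fields not in schema: %s" % sorted(unknown))
        self.records[(path, schema)].append(record)

    def remove(self, path, schema, metadata_json):
        self.records[(path, schema)].remove(json.loads(metadata_json))


store = ToyMetadataStore()
store.attach("/data/log", '{"note": "raw click log"}')
store.register("/data/log", '{"name": "provenance", "fields": ["source", "owner"]}')
store.add("/data/log", "provenance", '{"source": "web", "owner": "alice"}')

try:
    store.add("/data/log", "provenance", '{"colour": "red"}')
    rejected = False
except ValueError:
    rejected = True

store.detach("/data/log")
```

The schema-less calls accept anything, while the schema-based calls can reject a record that does not fit, which is the trade-off between the two halves of the slide.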

Page 27: Polyglot metadata for Hadoop

Hops Metadata services

Elasticsearch

Database [HDFS/YARN]

Kafka

Zookeeper

Metadata API

Distributed Database is the Single Source-of-Truth for Metadata

Page 28: Polyglot metadata for Hadoop

Good Metadata: Elasticsearch

Files

Directories

Metadata

Search Indexes

Database

Elasticsearch

one-way replication

Eventual Consistency for Metadata. Metadata Integrity maintained by Asynchronous Replication and Metadata Immutability.

[ePipe Tutorial, BOSS Workshop, VLDB 2016]

immutable data
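One-way asynchronous replication can be sketched as an ordered log of immutable change events that the search side replays. This is a toy model under stated assumptions: `Database`, `SearchIndex` and `replicate` are hypothetical names, the "index" is a plain dictionary, and nothing here reflects how ePipe or Elasticsearch work internally.

```python
from collections import deque


class Database:
    """Source of truth: every write also appends an immutable change event."""

    def __init__(self):
        self.rows = {}
        self.change_log = deque()

    def put(self, path, metadata):
        self.rows[path] = metadata
        self.change_log.append(("put", path, metadata))

    def delete(self, path):
        self.rows.pop(path, None)
        self.change_log.append(("delete", path, None))


class SearchIndex:
    """Replica: it is only ever changed by replayed events, never directly."""

    def __init__(self):
        self.docs = {}

    def apply(self, event):
        op, path, metadata = event
        if op == "put":
            self.docs[path] = metadata
        else:
            self.docs.pop(path, None)

    def search(self, word):
        return sorted(p for p, m in self.docs.items() if word in m)


def replicate(db, index, max_events=None):
    """Replay pending events in order; returns how many were applied."""
    applied = 0
    while db.change_log and (max_events is None or applied < max_events):
        index.apply(db.change_log.popleft())
        applied += 1
    return applied


db, index = Database(), SearchIndex()
db.put("/data/log", "raw click log")
db.put("/data/users", "subscriber records")

stale = index.search("log")  # replication has not run yet
replicate(db, index)
fresh = index.search("log")  # the index has caught up

print(stale, fresh)
```

The index lags behind the database until the events are replayed, then converges. Because writes flow in one direction only and events are never edited after they are logged, there is no conflict to resolve.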

Page 29: Polyglot metadata for Hadoop

Not-so-Good Metadata: Kafka

Topics

Partitions

ACLs

Zookeeper/Kafka

Database

Eventual Consistency for Metadata. Metadata integrity maintained by custom recovery logic and polling.

Metadata API

polling
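Polling-based integrity amounts to a reconciliation pass that runs repeatedly. The sketch below assumes the database is the source of truth, as stated earlier in the deck, and uses plain dictionaries in place of the database and Zookeeper. The `reconcile` function and the topic names are hypothetical.

```python
def reconcile(desired, actual):
    """Make `actual` match `desired`; return the changes that were needed."""
    changes = []
    for topic, partitions in desired.items():
        if actual.get(topic) != partitions:
            actual[topic] = partitions
            changes.append(("upsert", topic))
    for topic in list(actual):
        if topic not in desired:
            del actual[topic]
            changes.append(("delete", topic))
    return changes


# topic name -> partition count
database = {"clicks": 4, "orders": 2}         # assumed source of truth
zookeeper = {"clicks": 4, "stale-topic": 1}   # has drifted

# A real poller would call this on a timer; two passes are enough to show it.
first = reconcile(database, zookeeper)
second = reconcile(database, zookeeper)

print(first)
print(second)
```

Between polls the two stores can disagree, and every kind of drift needs its own repair rule, which is why the slide rates this approach below log-based replication.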

Page 30: Polyglot metadata for Hadoop

Demo

Page 31: Polyglot metadata for Hadoop

www.hops.site

A 2 MW datacenter research and test environment

5 lab modules, planned up to 3-4000 servers, 2-3000 square meters

[Slide by Prof. Tor Björn Minde, CEO SICS North Swedish ICT AB]

Page 32: Polyglot metadata for Hadoop

Demo

Page 33: Polyglot metadata for Hadoop

Summing Up

Multiple storage engines are here to stay

Picking the right tool for the right job is not easy

Do not be religious about your tool of choice!

Hops shows how you can combine multiple storage engines to give an improved user experience for Hadoop

Page 34: Polyglot metadata for Hadoop

The Hops Team

Active: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Johan Svedlund Nordström, Vasileios Giannokostas, Ermias Gebremeskel, Antonios Kouzoupis, Misganu Dessalegn.

Alumni: Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, K “Sri” Srijeyanthan, Steffen Grohsschmiedt, Alberto Lorente, Andre Moré, Ali Gholami, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Jude D’Souza, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.

Page 35: Polyglot metadata for Hadoop

Join us!

http://github.com/hopshadoop

www.hops.io @hopshadoop