Polyglot metadata for Hadoop
Polyglot Metadata for Hadoop
Jim Dowling, Associate Prof @ KTH
Senior Researcher @ SICS, CEO @ Logical Clocks AB
www.hops.io @hopshadoop
s9s Polyglot Persistence Meetup, 23rd August 2016 @ Spotify/Stockholm
Polyglot Persistence: the Right Tool for the Job
2
3
Minimize Impedance Mismatch
Data Access Paradigm: SQL, Static Object-Oriented, Dynamic Object-Oriented, Functional, Free-Text Search, Matrix Operations
Database Paradigm: Relational (Row-based), Relational (Columnar), Document-Based, Graph-Based, Key-Value, Object-Oriented / Hierarchical / Network
Which database paradigm fits which data access paradigm?
4
In the Good Old Dayz….
[http://martinfowler.com/]
5
Led to Big Fights over Choice of Database
6
The Advent of Microservices
[http://martinfowler.com/]
7
Should we give teams freedom to choose?
• Ericsson - 1000s of build systems
• Google - one build system, Blaze
8
Monolithic vs Polyglot Persistence
[http://martinfowler.com/]
9
Problems: Polyglot Persistence/Microservices
• What if the same data is updated by different microservices? E.g., user data in subscriber systems.
• Transactions (or agreement protocols) across microservices are very hard.
• Where possible, try to store such data in a single database (single source of truth)…
10
Problems: Polyglot Persistence/Microservices
• …but if we minimize the number of databases…
• …how do we handle different data access patterns for the same data?
- OLTP SQL
- OLAP SQL
- Free-Text Search
- Etc.
11
Multi-functional Databases
SAP Hana, MemSQL combine OLTP and OLAP in a single DB
12
If OLAP doesn’t mutate your data…..
[http://severalnines.com/]
Replication protocols from the SSOT (single source of truth) to other databases work for immutable data.
13
If you have Big Data…
[Diagram: Sqoop imports/exports data between an RDBMS (MySQL, Postgres, …) and Hadoop]
Hadoop as a Polyglot Storage System
14
Hive as Polyglot Storage in Hadoop
15
16
Hive/Spark simplifies development

Hive-on-Spark:

sqlContext = HiveContext(sc)
f1_df = sqlContext.sql("SELECT id, count(*) AS nb_entries FROM my_db.log \
    WHERE ts = '20160515' \
    GROUP BY id")

SparkSQL:

sqlContext = SQLContext(sc)
# Split each raw log line into its fields before applying the schema
f0 = sc.textFile('logfile').map(lambda line: line.split('\t'))
fpFields = [StructField('ts', StringType(), True),
            StructField('id', StringType(), True),
            StructField('it', StringType(), True)]
fpSchema = StructType(fpFields)
df_f0 = sqlContext.createDataFrame(f0, fpSchema)
df_f0.registerTempTable('log')
f1_df = sqlContext.sql("SELECT log.id, count(*) AS nb_entries FROM log \
    WHERE ts = '20160515' \
    GROUP BY id")
17
Hive as Metadata for HDFS Files
[Diagram: a Hive SQL table maps to a file (e.g., hive/warehouse/database/Table.hive) in the Hadoop Distributed Filesystem (HDFS)]
Hive does NOT maintain integrity between the metadata table and the HDFS file! Remove the Table.hive file and Hive doesn't complain…
Let’s Dig Down and Understand Why…
18
Metadata in HDFS: Paths, Blocks, Replicas
[Diagram: HDFS metadata architecture - an active NameNode and a Standby NameNode share the Edit Log via Journal Nodes, with Zookeeper for failover and the FSImage holding the namespace]
HDFS metadata cannot be easily extended (with a schema, for example)
Metadata Totem Poles in Hadoop
20
How do they ensure the consistency of the metadata and the data?
Hops Hadoop
21
Hops Hadoop Data Integration Journey
• Principled and safe mechanisms for adding metadata to HDFS and YARN
• Free-text search for files in HDFS - instead of Pig jobs to search HDFS, use Elasticsearch
• Integrate Kafka metadata into Hops Hadoop - Kafka uses Zookeeper for ACLs, metadata, etc.; Hops Hadoop uses MySQL Cluster
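Free-text search over file metadata needs little more than an inverted index; Elasticsearch provides exactly that at scale. A toy sketch of the idea in plain Python (illustrative only, not the actual Hops/Elasticsearch integration):

```python
from collections import defaultdict

# Toy inverted index: term -> set of HDFS paths, playing the role
# Elasticsearch plays over the replicated file metadata.
index = defaultdict(set)

def index_file(path, description):
    """Index a file's free-text metadata under each of its terms."""
    for term in description.lower().split():
        index[term].add(path)

def search(query):
    """Return paths whose metadata contains every query term."""
    results = [index[t.lower()] for t in query.split()]
    return set.intersection(*results) if results else set()

index_file("/data/genomics/run1.csv", "genome sequencing run KTH")
index_file("/data/logs/app.log", "application log spotify")
hits = search("genome KTH")
```

With the index maintained outside the NameNode, finding files becomes an interactive query rather than a full HDFS scan with a Pig job.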
HopsFS Architecture
23
[Diagram: HopsFS architecture - HDFS clients talk to multiple NameNodes (one elected Leader), which store metadata in NDB (> 12 TB); DataNodes store the file blocks]
[HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases, Niazi et al., arXiv 2016]
24
HopsYARN Architecture
[Diagram: HopsYARN architecture - YARN clients talk to multiple ResourceMgrs (one Scheduler); Resource Trackers and NodeManagers report state, stored in NDB]
25
Strongly Consistent Metadata for HDFS
[Diagram: files, directories, and extended schemas stored in a single database for HDFS metadata, updated with 2-phase commit (transactions)]
Metadata integrity with transactions and foreign keys.
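Because all HopsFS metadata lives in one transactional database, ordinary foreign-key constraints can guarantee that extended metadata never outlives its file. A minimal sketch of that idea using Python's built-in sqlite3 in place of MySQL Cluster/NDB (the table and column names here are illustrative, not the real HopsFS schema):

```python
import sqlite3

# Illustrative schema: extended-metadata rows reference inode rows.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce FK constraints
conn.execute("CREATE TABLE inodes (id INTEGER PRIMARY KEY, path TEXT)")
conn.execute("""CREATE TABLE metadata (
    inode_id INTEGER REFERENCES inodes(id) ON DELETE CASCADE,
    doc TEXT)""")

conn.execute("INSERT INTO inodes VALUES (1, '/user/jim/log')")
conn.execute("""INSERT INTO metadata VALUES (1, '{"owner": "jim"}')""")

# Orphan metadata is rejected up front...
try:
    conn.execute("INSERT INTO metadata VALUES (99, '{}')")
    orphan_allowed = True
except sqlite3.IntegrityError:
    orphan_allowed = False

# ...and deleting the file removes its metadata in the same transaction.
conn.execute("DELETE FROM inodes WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM metadata").fetchone()[0]
```

This is exactly the guarantee Hive's metastore cannot give, since Hive's metadata and the HDFS files live in two unrelated systems.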
26
HDFS Metadata API
Schema-less API:
attach(path, metadataJSON)
detach(hdfsPath)
Schema-based API:
register(path, schemaJSON)
add(path, schema, metadataJSON)
remove(path, schema, metadataJSON)
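The schema-less calls read as "attach a JSON document to an HDFS path". A hypothetical in-memory mock of that contract (the names mirror the slide; this is not actual Hops client code, where the store is the distributed database):

```python
import json

class MetadataStore:
    """Toy stand-in for the Hops schema-less metadata API."""

    def __init__(self):
        self._docs = {}  # path -> parsed JSON metadata

    def attach(self, path, metadata_json):
        """Attach a JSON metadata document to an HDFS path."""
        self._docs[path] = json.loads(metadata_json)

    def detach(self, hdfs_path):
        """Remove any metadata attached to the path."""
        self._docs.pop(hdfs_path, None)

    def get(self, path):
        return self._docs.get(path)

store = MetadataStore()
store.attach("/user/jim/genome.csv", '{"study": "kth", "format": "csv"}')
fmt = store.get("/user/jim/genome.csv")["format"]
store.detach("/user/jim/genome.csv")
gone = store.get("/user/jim/genome.csv")
```

The schema-based variant adds a register step so that attached documents can be validated against a declared schema before being stored.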
Hops Metadata services
[Diagram: a Metadata API fronting Elasticsearch, the Database (HDFS/YARN), and Kafka/Zookeeper]
The distributed database is the single source of truth for metadata.
28
Good Metadata: Elasticsearch
[Diagram: files, directories, and metadata in the database are replicated one-way into Elasticsearch search indexes]
Eventual consistency for metadata. Metadata integrity maintained by asynchronous replication and metadata immutability.
[ePipe Tutorial, BOSS Workshop, VLDB 2016]
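Because the replicated metadata is immutable, the database-to-Elasticsearch pipeline only needs at-least-once delivery: replaying an event overwrites an entry with identical data, so duplicates are harmless. A minimal sketch of that idempotent one-way replication (the structures are illustrative; see ePipe for the real system):

```python
# Append-only changelog in the database (the single source of truth).
changelog = [
    (1, "/user/jim/a.csv", {"owner": "jim"}),
    (2, "/user/jim/b.csv", {"owner": "seif"}),
]

# Eventually-consistent replica, playing the Elasticsearch role.
search_index = {}

def replicate(events):
    """Apply changelog events; safe to replay since entries are immutable."""
    for _seq, path, meta in events:
        search_index[path] = meta  # rewriting identical data is a no-op

replicate(changelog)
replicate(changelog)  # duplicate delivery changes nothing
size = len(search_index)
```

Mutable metadata would break this scheme: a replayed stale update could overwrite a newer value, which is why immutability is doing the real work here.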
29
Not-so-Good Metadata: Kafka
[Diagram: topics, partitions, and ACLs live in Zookeeper/Kafka; the Hops database mirrors them through its Metadata API by polling]
Eventual consistency for metadata. Metadata integrity maintained by custom recovery logic and polling.
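With no shared transactional store, the database copy of Kafka's metadata can only be kept consistent by periodically polling Zookeeper and repairing any differences. A toy sketch of one reconciliation pass (the dict structures are illustrative, not real Zookeeper or Hops data models):

```python
def reconcile(zk_acls, db_acls):
    """One polling pass: make the DB copy match the Zookeeper source."""
    for topic, acl in zk_acls.items():
        if db_acls.get(topic) != acl:
            db_acls[topic] = acl   # add missing or repair stale entries
    for topic in list(db_acls):
        if topic not in zk_acls:
            del db_acls[topic]     # drop ACLs for deleted topics

# Zookeeper is authoritative; the DB copy has drifted.
zk = {"clicks": {"alice": "read"}, "logs": {"bob": "write"}}
db = {"clicks": {"alice": "write"}, "stale_topic": {"eve": "read"}}
reconcile(zk, db)
```

Between polls the two copies can disagree, which is why this slide calls Kafka's Zookeeper-based metadata "not-so-good" compared with the single-database approach.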
Demo
30
www.hops.site
31
A 2 MW datacenter research and test environment
5 lab modules, planned for up to 3,000-4,000 servers and 2,000-3,000 square meters
[Slide by Prof. Tor Björn Minde, CEO SICS North Swedish ICT AB]
Demo
32
Summing Up
• Multiple storage engines are here to stay
• Picking the right tool for the right job is not easy
• Do not be religious about your tool of choice!
• Hops shows how you can combine multiple storage engines to give an improved user experience for Hadoop
33
The Hops Team
Active: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Johan Svedlund Nordström, Vasileios Giannokostas, Ermias Gebremeskel, Antonios Kouzoupis, Misganu Dessalegn.
Alumni: Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, K "Sri" Srijeyanthan, Steffen Grohsschmiedt, Alberto Lorente, Andre Moré, Ali Gholami, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Jude D'Souza, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.