Real-time Hadoop: The Ideal Messaging System for Hadoop
Transcript of Real-time Hadoop: The Ideal Messaging System for Hadoop
![Page 1: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/1.jpg)
© 2016 MapR Technologies 1© 2014 MapR Technologies
Real-time Hadoop: The Ideal Messaging System for Hadoop
Ted Dunning
![Page 2: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/2.jpg)
Contact Information
Ted Dunning
Chief Applications Architect at MapR Technologies
Committer & PMC for Apache’s Drill, ZooKeeper & others
VP of Incubator at Apache Foundation
Email [email protected] [email protected]
Twitter @ted_dunning
Hashtags today: #stratahadoop #ojai
![Page 3: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/3.jpg)
Streaming Architecture
by Ted Dunning and Ellen Friedman © 2016 (published by O’Reilly)
Free copies at book signing today, 3:40 PM @ MapR booth
http://bit.ly/mapr-ebook-streams
![Page 4: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/4.jpg)
Goals
• Real-time or near-time
  – Includes situations with deadlines
  – Also includes situations where delay is simply undesirable
  – Even includes situations where delay is just fine
• Micro-services
  – Streaming is a convenient idiom for design
  – Micro-services … you know we wanted it
  – Service isolation is a key requirement
![Page 5: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/5.jpg)
Real-time or Near-time?
• The real point is flow versus state (see talk later today)
• One consequence of flow-based computing is real-time and near-time become relatively easy
• Life may be a bitch, but it doesn’t happen in batches!
![Page 6: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/6.jpg)
Agenda
• Background / micro-services
• Global requirements
• Scale
![Page 7: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/7.jpg)
A microservice is loosely coupled with bounded context
![Page 8: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/8.jpg)
How to Couple Services and Break micro-ness
• Shared schemas, relational stores
• Ad hoc communication between services
• Enterprise service busses
• Brittle protocols
• Poor protocol versioning
Don’t do this!
![Page 9: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/9.jpg)
How to Decouple Services
• Use self-describing data
• Private databases
• Infrastructural communication between services
• Use modern protocols
• Adopt future-proof protocol practices
• Use shared storage where necessary due to scale
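One way to read "self-describing data" and "future-proof protocol practices" is that every message carries its own field names and consumers tolerate fields they do not know about. The following is a minimal sketch under that reading, assuming JSON encoding; the field names (`user_id`, `region`, `plan`) and function names are hypothetical and do not come from the talk.

```python
import json

def encode_event(fields):
    """Serialize an event as JSON so the field names travel with the data."""
    return json.dumps(fields, sort_keys=True)

def decode_profile_event(raw):
    """Tolerant reader: take what this service needs, default what is
    missing, and ignore fields added by newer producers."""
    data = json.loads(raw)
    return {
        "user_id": data["user_id"],               # required field
        "region": data.get("region", "unknown"),  # optional, added later
    }

# An old producer and a newer producer can share one consumer.
old_message = encode_event({"user_id": 17})
new_message = encode_event({"user_id": 18, "region": "eu", "plan": "pro"})

print(decode_profile_event(old_message))
print(decode_profile_event(new_message))
```

Because the consumer neither requires nor rejects the extra `plan` field, producers can evolve their messages without a coordinated release.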
![Page 10: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/10.jpg)
What is the Right Structure for Flow Compute?
• Traditional message queues?
  – Message queues are the classic answer
  – Key feature/bug is out-of-order acknowledgement
  – Many implementations
  – You pay a huge performance hit for persistence
• Kafka-esque Logs?
  – Logs are like queues, but with ordering
  – Out of order consumption is possible, acknowledgement not so much
  – Canonical base implementation is Kafka
  – Performance plus persistence
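The difference between a queue and a log can be sketched in a few lines. This is an in-memory illustration only, not Kafka or MapR Streams code; the `Log` and `Consumer` classes are hypothetical names. It shows the idea that messages stay in order, reading does not delete them, and each consumer tracks its progress as a single offset instead of acknowledging messages one by one.

```python
class Log:
    """Append-only, ordered sequence of messages (a single partition)."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Store a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_count=10):
        """Return up to max_count messages starting at offset.
        Reading never removes anything from the log."""
        return self._messages[offset:offset + max_count]


class Consumer:
    """Tracks progress as one offset, not as per-message acknowledgements."""

    def __init__(self, log):
        self._log = log
        self.offset = 0

    def poll(self, max_count=10):
        batch = self._log.read(self.offset, max_count)
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        """Rewind or skip ahead; the log itself is unchanged."""
        self.offset = offset


log = Log()
for event in ["a", "b", "c"]:
    log.append(event)

fast, slow = Consumer(log), Consumer(log)
print(fast.poll())    # reads all three messages
print(slow.poll(1))   # reads only the first message
fast.seek(1)
print(fast.poll())    # replays from offset 1
```

Independent consumers reading at their own pace from the same retained log is what makes it easy to add new services later without touching the producer.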
![Page 11: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/11.jpg)
Scenarios: Profile Database
![Page 12: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/12.jpg)
The task
![Page 13: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/13.jpg)
Traditional Solution
![Page 14: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/14.jpg)
What Happens Next?
![Page 15: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/15.jpg)
What Happens Next?
![Page 16: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/16.jpg)
How to Get Service Isolation
![Page 17: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/17.jpg)
New Uses of Data
![Page 18: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/18.jpg)
Scaling Through Isolation
![Page 19: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/19.jpg)
Lessons
• De-coupling and isolation are key
• Private data stores/tables are important,
  – but local storage of private data is a bug
• Propagate events, not table updates
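"Propagate events, not table updates" can be illustrated with two services that each derive their own private table from the same event stream. This is a sketch under assumed event shapes; the event types, field names, and both functions are hypothetical.

```python
from collections import defaultdict

# A shared stream of business events (hypothetical shapes).
events = [
    {"type": "login", "user": "alice", "country": "NL"},
    {"type": "purchase", "user": "alice", "amount": 30},
    {"type": "login", "user": "bob", "country": "US"},
    {"type": "purchase", "user": "alice", "amount": 12},
]

def build_profile_table(stream):
    """Profile service: private table of the last known country per user."""
    table = {}
    for e in stream:
        if e["type"] == "login":
            table[e["user"]] = e["country"]
    return table

def build_spend_table(stream):
    """Billing service: private table of total spend per user."""
    totals = defaultdict(int)
    for e in stream:
        if e["type"] == "purchase":
            totals[e["user"]] += e["amount"]
    return dict(totals)

print(build_profile_table(events))
print(build_spend_table(events))
```

Neither service reads the other's table, so either one can change its storage layout, or be rebuilt from the stream, without coordinating with the other.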
![Page 20: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/20.jpg)
Scenarios: IoT Data Aggregation
![Page 21: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/21.jpg)
Basic Situation
Each location has many pumps
Multiple locations
![Page 22: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/22.jpg)
What Does a Pump Look Like
[Diagram labels: Temperature, Pressure, Flow (shown at two points on the pump); Winding temperature; Voltage; Current]
![Page 23: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/23.jpg)
Basic Situation
Each location has many pumps
Multiple locations
![Page 24: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/24.jpg)
Basic Architecture Reflects Business Structure
![Page 25: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/25.jpg)
Lessons
• Data architecture should reflect business structure
• Even very modest designs involve multiple data centers
• Schemas cannot be frozen in the real world
• Security must follow data ownership
![Page 26: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/26.jpg)
Scenarios: Global Data Recovery
![Page 27: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/27.jpg)
![Page 28: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/28.jpg)
![Page 29: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/29.jpg)
![Page 30: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/30.jpg)
![Page 31: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/31.jpg)
Lessons
• Arbitrary number of topics important for simplicity + performance
• Updates happen in many places
• Mobility implies change in replication patterns
• Multi-master updates simplify design massively
![Page 32: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/32.jpg)
Converged Requirements
![Page 33: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/33.jpg)
What Have We Learned?
• Need persistence and performance
  – Possibly for years and to 100’s of millions t/s
• Must have convergence
  – Need files, tables AND streams
  – Need volumes, snapshots, mirrors, permissions and …
• Must have platform security
  – Cannot depend on perimeter
  – Must follow business structure
• Must have global scale and scope
  – Millions of topics for natural designs
  – Multi-master replication and update
![Page 34: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/34.jpg)
The Importance of Common APIs
• Commonality and interoperability are critical
  – Compare the Hadoop ecosystem and the NoSQL world
• Table stakes
  – Persistence
  – Performance
  – Polymorphism
• Major trend so far is to adopt the Kafka API
  – 0.9 API and beyond remove major abstraction leaks
  – Kafka API supported by all major Hadoop vendors
![Page 35: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/35.jpg)
What we do
![Page 36: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/36.jpg)
Evolution of Data Storage
[Chart: functionality and compatibility versus scalability; points shown: Linux, POSIX]
Over decades of progress, Unix-based systems have set the standard for compatibility and functionality
![Page 37: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/37.jpg)
Evolution of Data Storage
[Chart: functionality and compatibility versus scalability; points shown: Linux, POSIX, Hadoop]
Hadoop achieves much higher scalability by trading away essentially all of this compatibility
![Page 38: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/38.jpg)
Evolution of Data Storage
[Chart: functionality and compatibility versus scalability; points shown: Linux, POSIX, Hadoop]
MapR enhanced Apache Hadoop by restoring the compatibility while increasing scalability and performance
![Page 39: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/39.jpg)
Evolution of Data Storage
[Chart: functionality and compatibility versus scalability; points shown: Linux, POSIX, Hadoop]
Adding tables and streams enhances the functionality of the base file system
![Page 40: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/40.jpg)
http://bit.ly/fastest-big-data
![Page 41: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/41.jpg)
How we do this with MapR
• MapR Streams is a C++ reimplementation of Kafka API
  – Advantages in predictability, performance, scale
  – Common security and permissions with entire MapR converged data platform
• Semantic extensions
  – A cluster contains volumes, files, tables … and now streams
  – Streams contain topics
  – Can have default stream or can name stream by path name
• Core MapR capabilities preserved
  – Consistent snapshots, mirrors, multi-master replication
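Naming a stream by path, with topics inside it, can be sketched as a small parsing helper. This assumes the `/path/to/stream:topic` form for a fully qualified topic name; the helper `split_topic` and the example paths are hypothetical and are not part of any MapR client library.

```python
def split_topic(full_name, default_stream=None):
    """Split '/path/to/stream:topic' into (stream_path, topic).

    A bare topic name falls back to default_stream, mirroring the
    "default stream" option mentioned on the slide (an assumption
    about how that default would be applied).
    """
    if ":" in full_name:
        stream, topic = full_name.rsplit(":", 1)
        return stream, topic
    if default_stream is None:
        raise ValueError("no stream path given and no default stream set")
    return default_stream, full_name


print(split_topic("/apps/pumps:pressure"))
print(split_topic("pressure", default_stream="/apps/pumps"))
```

Since a stream lives at a path, it can sit next to files and tables in the same directory tree and share their permissions, which fits the convergence point made earlier in the talk.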
![Page 42: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/42.jpg)
MapR Original Innovations
• Volumes
  – Distributed management
  – Data placement
• Read/write random access file system
  – Allows distributed meta-data
  – Improved scaling
  – Enables NFS access
• Application-level NIC bonding
• Transactionally correct snapshots and mirrors
![Page 43: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/43.jpg)
MapR's Containers
Files/directories are sharded into blocks, which are placed into containers on disks
Containers are 16-32 GB segments of disk, placed on nodes
• Each container contains
  – Directories & files
  – Data blocks
• Replicated on servers
• No need to manage directly
![Page 44: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/44.jpg)
MapR's Containers
• Each container has a replication chain
• Updates are transactional
• Failures are handled by rearranging replication
![Page 45: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/45.jpg)
Container locations and replication
Container location database (CLDB) keeps track of nodes hosting each container and replication chain order
[Diagram: the CLDB lists the replication chains (N1, N2), (N3, N2), (N1, N2), (N1, N3), (N3, N2) for containers spread over nodes N1, N2, N3]
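The bookkeeping described here can be sketched as a toy lookup table. The class below is hypothetical and is not MapR's implementation; in particular, the policy of appending the first available spare node to a shortened chain is an assumption made only to show what "rearranging replication" could look like.

```python
class ContainerLocationDB:
    """Toy CLDB: maps container id -> ordered replication chain of nodes."""

    def __init__(self):
        self._chains = {}

    def register(self, container_id, chain):
        self._chains[container_id] = list(chain)

    def locate(self, container_id):
        """Return a copy of the chain; the first node is the chain head."""
        return list(self._chains[container_id])

    def handle_node_failure(self, failed, spare_nodes):
        """Drop the failed node from every chain and append one spare
        that is not already in that chain, keeping the remaining order."""
        for chain in self._chains.values():
            if failed not in chain:
                continue
            chain.remove(failed)
            for node in spare_nodes:
                if node != failed and node not in chain:
                    chain.append(node)
                    break


cldb = ContainerLocationDB()
cldb.register(1, ["N1", "N2"])
cldb.register(2, ["N3", "N2"])
cldb.register(3, ["N1", "N3"])

cldb.handle_node_failure("N1", spare_nodes=["N1", "N2", "N3"])
print(cldb.locate(1), cldb.locate(2), cldb.locate(3))
```

Clients only ever ask where a container lives, so a failure changes entries in this table and not anything the client has to manage directly.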
![Page 46: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/46.jpg)
MapR Scaling
• Containers represent 16 - 32GB of data
  – Each can hold up to 1 billion files and directories
  – 100M containers = ~2 exabytes (a very large cluster)
• 250 bytes DRAM to cache a container
  – 25GB to cache all containers for 2EB cluster
  – But not necessary, can page to disk
  – Typical large 10PB cluster needs 2GB
• Container-reports are 100x - 1000x smaller than HDFS block-reports
  – Serve 100x more data-nodes
  – Increase container size to 64G to serve 4EB cluster
  – Map/reduce not affected
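The two headline figures on this slide follow from simple multiplication. The sketch below redoes that arithmetic with decimal units (1 GB = 10^9 bytes, 1 EB = 10^18 bytes), which is an assumption about how the slide rounds; the variable names are illustrative.

```python
GB = 10**9
EB = 10**18

containers = 100_000_000          # 100M containers (from the slide)
bytes_per_cached_container = 250  # DRAM per cached container (from the slide)

# DRAM needed to cache the location of every container.
cache_bytes = containers * bytes_per_cached_container

# Cluster capacity at the small and large ends of the container size range.
low_capacity = containers * 16 * GB
high_capacity = containers * 32 * GB

print(cache_bytes / GB)                       # DRAM in GB
print(low_capacity / EB, high_capacity / EB)  # capacity range in EB
```

The cache works out to 25 GB, and the capacity range of 1.6 to 3.2 EB brackets the "~2 exabytes" quoted on the slide.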
![Page 47: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/47.jpg)
But Wait, There’s More
• Directories and files are implemented in terms of B-trees
  – Key is offset, value is data blob
  – Internal transactional semantics guarantees safety and consistency
  – Layout algorithms give very high layout linearization
• Tables are implemented in terms of B-trees
  – Twisted B-tree implementation allows virtues of log-structured merge tree without the compaction delays
  – Tablet splitting without pausing, integration with file system transactions
• Common security and permissions scheme
![Page 48: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/48.jpg)
Similar to LSM implementations, tables are decomposed by key ranges
Unlike HBase and LevelDB, MapR tables use a fixed number (greater than 1) of decompositions
Very unusually, relative to LSM and cousins, data structures at the leaf are mutable
![Page 49: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/49.jpg)
Re-use of Proven Technology
Partitions are distributed just like file chunks
Same replication and transaction technology
![Page 50: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/50.jpg)
And More …
• Streams are implemented in terms of B-trees as well
  – Topics and consumer offsets are kept in stream, not ZK
  – Similar splitting technology as MapR DB tables
  – Consistent permissions, security, data replication
• Standard Kafka 0.9 API
• Plans to add OJAI for high-level structuring
• Performance is very high
![Page 51: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/51.jpg)
Example
![Page 52: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/52.jpg)
![Page 53: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/53.jpg)
Lessons
• APIs matter more than implementations
• There is plenty of room to innovate ahead of the community
• POSIX, HDFS, HBase all define useful APIs
• Kafka 0.9+ does the same
![Page 54: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/54.jpg)
Call to action:
Support the common APIs
![Page 55: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/55.jpg)
Call to action:
Support the Kafka APIs
And come by the MapR booth to check out MapR Streams
![Page 56: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/56.jpg)
![Page 57: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/57.jpg)
Streaming Architecture
by Ted Dunning and Ellen Friedman © 2016 (published by O’Reilly)
Free copies at book signing today
http://bit.ly/mapr-ebook-streams
![Page 58: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/58.jpg)
![Page 59: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/59.jpg)
Thank you for coming today!
![Page 60: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/60.jpg)
…helping you put data technology to work
● Find answers
● Ask technical questions
● Join on-demand training course discussions
● Follow release announcements
● Share and vote on product ideas
● Find Meetup and event listings
Connect with fellow Apache Hadoop and Spark professionals
community.mapr.com
![Page 61: Real-time Hadoop: The Ideal Messaging System for Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022012914/587c18d81a28abb5068b4c1f/html5/thumbnails/61.jpg)
Q & A
@mapr maprtech
Engage with us!
MapR
maprtech
mapr-technologies