Post on 15-Jan-2015
Distributed, fault-tolerant, transactional
Real-Time Integration: MongoDB and SQL Databases
Eugene Dvorkin, Architect, WebMD
WebMD: A lot of data; a lot of traffic
~900 million page views a month
~100 million unique visitors a month
How We Use MongoDB
User Activity
Why Move Data to RDBMS?
Preserve existing investment in BI and data warehouse
To use an analytical database such as Vertica
To use SQL
Why Move Data In Real-time?
Batch processing is slow
No ad-hoc queries
No real-time reports
Challenge in moving data
Transform documents to a relational structure
Insert into the RDBMS at a high rate
Scale easily as data volume and velocity increase
Our Solution to move data in Real-time: Storm
Storm – an open-source, distributed real-time computation system
Developed by Nathan Marz; later acquired by Twitter
Hadoop vs. Storm
Why Storm?
JVM-based framework
Guaranteed data processing
Supports development in multiple languages
Scalable and transactional
Overview of Storm cluster
Master node (Nimbus)
Cluster coordination (ZooKeeper)
Worker nodes run worker processes
Storm Abstractions
Tuples, Streams, Spouts, Bolts and Topologies
Tuples
An ordered list of elements
Example: ("ns:events", "email:edvorkin@gmail.com")
Stream
Unbounded sequence of tuples
Example: Stream of messages from message queue
Spout
Source of streams
Reads from a stream of data – queues, web logs, API calls, the MongoDB oplog
Emits documents as tuples
Bolts
Process tuples and create new streams
Bolts
Apply functions / transforms
Calculate and aggregate data (word count!)
Access DBs, APIs, etc.
Filter data
Map/Reduce
Topology
Storm transforms and moves the data
MongoDB
How to read all incoming data from MongoDB?
Use the MongoDB oplog
What is OpLog?
The replication mechanism in MongoDB
It is a capped collection
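For orientation, an oplog entry carries a handful of fixed fields; a representative entry (values are illustrative, shown in mongo-shell form) looks like:

```
{
  "ts" : Timestamp(1369861833, 1),            // when the op was applied (seconds, increment)
  "op" : "i",                                 // i = insert, u = update, d = delete
  "ns" : "test.people",                       // database.collection
  "o"  : { "_id" : 1, "name" : "John Backus" }  // the document itself
}
```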
Spout: reading from OpLog
Located in the local database, in the oplog.rs collection
Operations: insert (i), update (u), delete (d)
Namespace (ns): the database and collection name – this maps to a table
Data object (o): the document the operation applies to
Sharded cluster
Automatic discovery of the sharded cluster
Example: shard vs. replica set discovery
Example: shard discovery
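The discovery example itself isn't reproduced in this transcript. As a hedged sketch (class and method names are hypothetical): discovery usually means asking a mongos for the config database's shards collection, where each entry's host field encodes either a standalone shard ("host1:27017") or a replica-set shard ("rs0/host1:27017,host2:27017"). Parsing that field tells the spout which replica sets to tail:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: parse the "host" field of a config.shards entry.
// "rs0/host1:27017,host2:27017" -> replica set "rs0" plus its members;
// a standalone shard has no "setName/" prefix.
final class ShardHostParser {
    static String setName(String host) {
        int slash = host.indexOf('/');
        return slash < 0 ? null : host.substring(0, slash);
    }

    static List<String> members(String host) {
        int slash = host.indexOf('/');
        String hosts = slash < 0 ? host : host.substring(slash + 1);
        return Arrays.asList(hosts.split(","));
    }
}
```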
Spout: reading data from OpLog
How to read data continuously from the oplog?
Use a tailable cursor
Example: a tailable cursor behaves like tail -f
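The cursor example isn't reproduced in the transcript; below is a minimal sketch, assuming the 2.x-era MongoDB Java driver that the deck's other snippets use (BasicDBObject) and a locally running replica set. The connection details and checkpoint handling are placeholders:

```java
import com.mongodb.BasicDBObject;
import com.mongodb.Bytes;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;
import org.bson.types.BSONTimestamp;

// Sketch only: tail local.oplog.rs the way `tail -f` follows a file.
// Requires a live replica set; error handling and reconnection elided.
class OplogTailer {
    public static void main(String[] args) throws Exception {
        MongoClient mongo = new MongoClient("localhost", 27017);
        DBCollection oplog = mongo.getDB("local").getCollection("oplog.rs");
        BSONTimestamp lastProcessedTs = new BSONTimestamp(0, 0); // load from a checkpoint store
        DBObject query = new BasicDBObject("ts", new BasicDBObject("$gt", lastProcessedTs));
        DBCursor cursor = oplog.find(query)
                .addOption(Bytes.QUERYOPTION_TAILABLE)   // cursor stays open at end of data
                .addOption(Bytes.QUERYOPTION_AWAITDATA); // block briefly waiting for new entries
        while (cursor.hasNext()) {
            DBObject entry = cursor.next();
            // emit the entry as a tuple; record entry.get("ts") as the new checkpoint
        }
    }
}
```

Because the oplog is a capped collection, the tailable cursor keeps returning newly appended entries instead of closing when it reaches the end.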
Manage timestamps
Use the ts field (the timestamp in each oplog entry) to track processed records
If the system restarts, resume from the last recorded ts
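The restart rule can be made concrete with a small sketch (names are hypothetical, not from the deck): an oplog ts is a (seconds, increment) pair, and an entry counts as unprocessed only if its pair is strictly greater than the recorded one.

```java
// Hypothetical sketch of the checkpoint logic: an oplog timestamp is a
// (seconds, increment) pair; compare seconds first, then the increment.
final class OplogCheckpoint {
    private int seconds;
    private int inc;

    // true if the entry at (ts, i) has not been processed yet
    boolean isNew(int ts, int i) {
        return ts > seconds || (ts == seconds && i > inc);
    }

    // remember the last processed entry's timestamp
    void record(int ts, int i) {
        seconds = ts;
        inc = i;
    }
}
```

On restart, the spout loads the stored pair and issues its oplog query with ts strictly greater than it.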
Spout: reading from OpLog
Spout – Code Example
Topology
Working With Embedded Arrays
An array represents a one-to-many relationship in an RDBMS
Example: working with embedded arrays
{"_id": 1, "ns": "person_awards", "o": {"award": "National Medal of Science", "year": 1975, "by": "National Science Foundation"}}
{"_id": 1, "ns": "person_awards", "o": {"award": "Turing Award", "year": 1977, "by": "ACM"}}
public void execute(Tuple tuple) {
    .........
    if (field instanceof BasicDBList) {
        BasicDBObject arrayElement = processArray(field);
        ......
        outputCollector.emit("documents", tuple, arrayElement);
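The bolt above is fragmentary; as a self-contained sketch of the same idea (plain collections instead of driver types, names hypothetical), each array element becomes its own child document under a derived namespace:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: split an embedded array into one child document per
// element, tagged with the parent _id and a derived namespace
// (person + awards -> person_awards), mirroring the slides' output.
final class ArrayExtractor {
    static List<Map<String, Object>> extract(Object parentId, String ns,
                                             String field, List<?> elements) {
        List<Map<String, Object>> children = new ArrayList<>();
        for (Object element : elements) {
            Map<String, Object> child = new LinkedHashMap<>();
            child.put("_id", parentId);          // keep the foreign key to the parent row
            child.put("ns", ns + "_" + field);   // child table name
            child.put("o", element);             // the array element itself
            children.add(child);
        }
        return children;
    }
}
```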
Parse documents with Bolt
{"ns": "people", "op": "i", "o": {"_id": 1, "name": {"first": "John", "last": "Backus"}, "birth": "Dec 03, 1924"}}
["ns": "people", "op": "i", ["_id": 1, "name_first": "John", "name_last": "Backus", "birth": "Dec 03, 1924"]]
@Override
public void execute(Tuple tuple) {
    ......
    final BasicDBObject oplogObject = (BasicDBObject) tuple.getValueByField("document");
    final BasicDBObject document = (BasicDBObject) oplogObject.get("o");
    ......
    outputValues.add(flattenDocument(document));
    outputCollector.emit(tuple, outputValues);
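flattenDocument itself is not shown in the deck; the sketch below reconstructs the likely idea over plain java.util maps (the real bolt operates on BasicDBObject), joining nested keys with an underscore so name.first becomes name_first:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical reconstruction of flattenDocument: collapse nested maps
// into column-style keys ({name: {first: ...}} -> name_first).
final class DocFlattener {
    static Map<String, Object> flatten(Map<String, Object> doc) {
        Map<String, Object> out = new LinkedHashMap<>();
        flattenInto("", doc, out);
        return out;
    }

    @SuppressWarnings("unchecked")
    private static void flattenInto(String prefix, Map<String, Object> doc,
                                    Map<String, Object> out) {
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            String key = prefix.isEmpty() ? e.getKey() : prefix + "_" + e.getKey();
            if (e.getValue() instanceof Map) {
                // recurse into embedded documents
                flattenInto(key, (Map<String, Object>) e.getValue(), out);
            } else {
                out.put(key, e.getValue());
            }
        }
    }
}
```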
Write to SQL with SQLWriter Bolt
["ns": "people", "op": "i", ["_id": 1, "name_first": "John", "name_last": "Backus", "birth": "Dec 03, 1924"]]

insert into people (_id, name_first, name_last, birth)
values (1, 'John', 'Backus', 'Dec 03, 1924');

insert into people_awards (_id, awards_award, awards_year, awards_by)
values (1, 'Turing Award', 1977, 'ACM');

insert into people_awards (_id, awards_award, awards_year, awards_by)
values (1, 'National Medal of Science', 1975, 'National Science Foundation');
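createInsertStatement is referenced on the next slide but never shown; a naive sketch over a flat column map (hypothetical names; string concatenation is used only for illustration, and production code should bind values through PreparedStatement):

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch: build an INSERT statement from a table name and a
// flat column map. Strings are quoted naively; prefer PreparedStatement
// parameters in real code.
final class InsertBuilder {
    static String createInsertStatement(String table, Map<String, Object> row) {
        StringJoiner cols = new StringJoiner(", ");
        StringJoiner vals = new StringJoiner(", ");
        for (Map.Entry<String, Object> e : row.entrySet()) {
            cols.add(e.getKey());
            Object v = e.getValue();
            vals.add(v instanceof Number ? v.toString()
                    : "'" + v.toString().replace("'", "''") + "'");
        }
        return "insert into " + table + " (" + cols + ") values (" + vals + ")";
    }
}
```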
@Override
public void prepare(.....) {
    ....
    Class.forName("com.vertica.jdbc.Driver");
    con = DriverManager.getConnection(dBUrl, username, password);

@Override
public void execute(Tuple tuple) {
    String insertStatement = createInsertStatement(tuple);
    try {
        Statement stmt = con.createStatement();
        stmt.execute(insertStatement);
        stmt.close();
Topology Definition

TopologyBuilder builder = new TopologyBuilder();
// define our spout
builder.setSpout(spoutId, new MongoOpLogSpout("mongodb://", opslog_progress));
builder.setBolt(arrayExtractorId, new ArrayFieldExtractorBolt(), 5).shuffleGrouping(spoutId);
builder.setBolt(mongoDocParserId, new MongoDocumentParserBolt()).shuffleGrouping(arrayExtractorId, documentsStreamId);
builder.setBolt(sqlWriterId, new SQLWriterBolt(rdbmsUrl, rdbmsUserName, rdbmsPassword)).shuffleGrouping(mongoDocParserId);

// run locally for development:
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("test", conf, builder.createTopology());

// or submit to a production cluster:
StormSubmitter.submitTopology("OfflineEventProcess", conf, builder.createTopology());
Lesson learned
By leveraging the MongoDB oplog (or another capped collection), tailable cursors, and the Storm framework, you can build a fast, scalable, real-time data processing pipeline.
Resources
Book: Getting Started with Storm
Storm project wiki
Storm starter project
Storm contributions project
Running a Multi-Node Storm Cluster tutorial
Implementing a real-time trending topic
A Hadoop Alternative: Building a real-time data pipeline with Storm
Storm use cases
Resources (cont’d)
Understanding the Parallelism of a Storm Topology
Trident – a high-level Storm abstraction
A practical Storm Trident API
Storm online forum
Mongo connector from 10gen Labs
MoSQL streaming translator in Ruby
Project source code
New York City Storm Meetup
Questions
Eugene Dvorkin, Architect, WebMD
edvorkin@webmd.net
Twitter: @edvorkin
LinkedIn: eugenedvorkin
Next Sessions at 2:50
5th Floor:
West Side Ballroom 3&4: Data Modeling Examples from the Real World
West Side Ballroom 1&2: Growing Up MongoDB
Juilliard Complex: Business Track: MetLife Leapfrogs Insurance Industry with MongoDB-Powered Big Data Application
Lyceum Complex: Ask the Experts: MongoDB Monitoring and Backup Service Session
7th Floor:
Empire Complex: How We Fixed Our MongoDB Problems
SoHo Complex: High Performance, High Scale MongoDB on AWS: A Hands On Guide