Couchbase Data Pipeline
Uploaded by justin-michaels
Category: Data & Analytics
Phoenix Hadoop Users Group
Couchbase Data Pipeline
Agenda
• Couchbase Technology
  • What is Couchbase
  • Couchbase and Hadoop Ecosystem
  • Architecture (Node/SDK/Cluster)
• Couchbase at PayPal
  • Couchbase Deployment
  • Use Case Overview
• Kafka Connector Demo
What is Couchbase?
Data management for a broad range of use cases
• Couchbase Server: high availability cache, key-value store, document database
• Couchbase Lite: embedded database
• Couchbase Sync Gateway: sync management
Couchbase Tenets
• Flexible data model
• Consistent performance at scale
• High availability (24x365)
• Easy, affordable scalability
Couchbase and Hadoop Ecosystem
Couchbase View
• Overlap: NoSQL or Hadoop?
• Complement: NoSQL and Hadoop.
Couchbase View
                    | Couchbase                 | Spark                       | Hadoop (Hive)
Use cases           | Operational; Web / Mobile | Analytics; Machine Learning | Analytics; Machine Learning
Processing mode     | Online; Ad Hoc (New!)     | Streaming; Ad Hoc; Batch    | Batch; Ad Hoc
Low latency =       | < 1ms ops                 | Seconds                     | Minutes
Users are typically | Millions of customers     | 100s of analysts            | 100s of analysts
Big data =          | 10s of Terabytes          | Petabytes(?)                | Petabytes
OPERATIONAL (Couchbase) vs. ANALYTICAL (Spark, Hadoop)
Lambda Architecture
[Diagram: New Data Stream flows through the Batch, Speed and Serving layers to Analysis]
1. DATA: New Data Stream
2. BATCH (Batch Layer): All Data, Precompute Views (Map Reduce), Batch Recompute
3. SPEED (Speed Layer): Process Stream, Incremental Views, Real-Time Increment
4. SERVE (Serving Layer)
5. QUERY: Analysis
Couchbase and Hadoop
[Diagram: Lambda Architecture, New Data Stream to Merged View]
• Batch Layer: All Data, Precompute Views (Map Reduce), Batch Recompute, Batch Views
• Speed Layer: Process Stream, Incremental Views, Real-Time Increment, Real-Time Data, Real-Time Views
• Serving Layer: Partial Aggregates, Merge, Merged View
• Couchbase Hadoop Connector (Sqoop)
Couchbase and Hadoop
[Diagram: same Lambda Architecture diagram as the previous slide, with the Couchbase Hadoop Connector (Sqoop) and these callouts added]
• Stream / Data Ingestion
• Store
• Incremental Data / Stream processing
• Serving merged results / responses
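The Merge step in the serving layer can be illustrated with a minimal sketch. It assumes each view maps a key to a count (one kind of partial aggregate); the names `merge_views`, `batch_view` and `realtime_view` and the sample numbers are hypothetical and do not come from Couchbase or Hadoop.

```python
from collections import Counter


def merge_views(batch_view, realtime_view):
    """Merge a precomputed batch view with the incremental real-time view.

    Both views map a key to a partial aggregate (here, a count). The merged
    view is the per-key sum of the partial aggregates.
    """
    merged = Counter(batch_view)
    merged.update(realtime_view)  # adds counts key by key
    return dict(merged)


# Hypothetical page-view counts: the batch layer covers data up to the last
# recompute, the speed layer covers events that arrived since then.
batch_view = {"home": 1200, "checkout": 310}
realtime_view = {"home": 7, "login": 3}

merged_view = merge_views(batch_view, realtime_view)
print(merged_view)
```

After each batch recompute the real-time view would be reset, so the same event is never counted in both inputs.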
Couchbase Connectors
[Diagram: App, ETL, BI and Visualization tools each connect through xDBC to the Couchbase (CB) nodes]
Integrations, partnerships
• COMPLEX EVENT PROCESSING
• REAL-TIME REPOSITORY
• PERPETUAL STORE
• ANALYTICAL DB
• BUSINESS INTELLIGENCE
• MONITORING
• CHAT/VOICE SYSTEM
• BATCH TRACK
• REAL-TIME TRACK
• DASHBOARD
Architecture
Couchbase Node
Couchbase Server Node
• Single-node type means easier administration and scaling
• Single installation
• Two major components/processes: data manager and cluster manager
• Data manager:
  • C/C++
  • Layer consolidation of caching and persistence
• Cluster manager:
  • Erlang/OTP
  • Administration UIs
  • Out-of-band for data requests
Couchbase Read Operation
[Diagram: the application server sends GET DOC 1 and receives DOC 1 from the managed cache. Node components: managed cache, replication queue, disk queue, disk]
• Single-node type means easier administration and scaling
• Reads out of cache are extremely fast
• No other process/system to communicate with
• Data connection is a TCP binary protocol
Couchbase Write Operation
[Diagram: the application server writes DOC 1 into the managed cache; DOC 1 then passes through the replication queue and the disk queue to disk]
• Single-node type means easier administration and scaling
• Writes are async by default
• Application gets acknowledgement when the write is successfully in RAM, and can trade off waiting for replication or persistence per write
• Replication to 1, 2 or 3 other nodes
• Replication is RAM-based so extremely fast
• Off-node replication is primary level of HA
• Disk written to as fast as possible, with no waiting
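The per-write trade-off can be sketched with a toy, single-threaded model of one node. This is only an illustration under stated assumptions: `NodeSketch` and the flags `wait_replicated` and `wait_persisted` are hypothetical names, not Couchbase SDK parameters, and the queues are drained synchronously to keep the example deterministic.

```python
from collections import deque


class NodeSketch:
    """Toy model of one node's write path: RAM first, queues drained later."""

    def __init__(self):
        self.cache = {}                  # managed cache (RAM)
        self.disk = {}                   # persisted documents
        self.replicas = {}               # stand-in for replica nodes' RAM
        self.disk_queue = deque()
        self.replication_queue = deque()

    def drain_replication(self):
        """Copy every queued mutation to the replica stand-in."""
        while self.replication_queue:
            key = self.replication_queue.popleft()
            self.replicas[key] = self.cache[key]

    def drain_disk(self):
        """Write every queued mutation to the disk stand-in."""
        while self.disk_queue:
            key = self.disk_queue.popleft()
            self.disk[key] = self.cache[key]

    def write(self, key, value, wait_replicated=False, wait_persisted=False):
        """Store in RAM and queue the rest; optionally wait before acking."""
        self.cache[key] = value
        self.replication_queue.append(key)
        self.disk_queue.append(key)
        # Per-write trade-off: the caller may choose to block until the
        # mutation has been replicated and/or persisted.
        if wait_replicated:
            self.drain_replication()
        if wait_persisted:
            self.drain_disk()
        return "ack"


node = NodeSketch()
node.write("doc1", {"n": 1})                 # acknowledged once in RAM
doc1_on_disk_at_ack = "doc1" in node.disk    # not persisted yet
node.write("doc2", {"n": 2}, wait_persisted=True)
```

In this sketch the second write returns only after the disk queue has been drained, which also flushes the earlier mutation.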
Couchbase Cache Ejection
[Diagram: DOC 1 through DOC 5 are written by the application server; documents that are already on disk are removed from the managed cache to free RAM]
• Single-node type means easier administration and scaling
• Layer consolidation means read-through and write-through cache
• Couchbase automatically removes data that has already been persisted from RAM
Couchbase Cache Miss
[Diagram: GET DOC 1 arrives while DOC 2 through DOC 5 are in the managed cache; DOC 1 is read from disk into the cache and returned to the application server]
• Single-node type means easier administration and scaling
• Layer consolidation means one single interface for the app to talk to and get its data back as fast as possible
• Separation of cache and disk allows for fastest access out of RAM while pulling data from disk in parallel
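Ejection and the cache-miss path can be sketched together. This is a simplified model, not Couchbase's actual ejection policy: `ManagedCacheSketch` is a hypothetical class, capacity is counted in documents rather than bytes, and the oldest persisted entry is ejected first.

```python
class ManagedCacheSketch:
    """Toy read-through / write-through cache in front of a disk store."""

    def __init__(self, capacity):
        self.capacity = capacity  # max documents kept in RAM (assumed unit)
        self.cache = {}           # dicts keep insertion order
        self.disk = {}
        self.dirty = set()        # keys written but not yet persisted

    def set(self, key, value):
        self.cache.pop(key, None)
        self.cache[key] = value
        self.dirty.add(key)
        self._eject_if_needed()

    def persist(self):
        """Flush dirty documents to disk, then free RAM if over capacity."""
        for key in list(self.dirty):
            self.disk[key] = self.cache[key]
        self.dirty.clear()
        self._eject_if_needed()

    def _eject_if_needed(self):
        # Only documents that are already on disk may leave RAM.
        for key in list(self.cache):
            if len(self.cache) <= self.capacity:
                break
            if key not in self.dirty:
                del self.cache[key]

    def get(self, key):
        if key in self.cache:
            return self.cache[key], "hit"
        value = self.disk[key]    # cache miss: fetch from disk
        self.cache[key] = value   # and bring it back into RAM
        self._eject_if_needed()
        return value, "miss"


cache = ManagedCacheSketch(capacity=4)
for i in range(1, 6):
    cache.set(f"doc{i}", {"id": i})
cache.persist()                        # doc1 is ejected, doc2..doc5 stay
first_read = cache.get("doc1")         # served from disk
second_read = cache.get("doc1")        # now served from RAM
```

The application only ever calls `get` and `set`; whether the value came from RAM or disk is visible here only because the sketch returns a label for illustration.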
Architecture
Couchbase SDK
• Documents are integral to the SDKs.
• All SDKs support JSON format
• In addition: serialized objects, unquoted strings, binary pass-through
• A Document contains:
Property | Description
ID       | The bucket-unique identifier
Content  | The value that is stored
Expiry   | An expiration time
CAS      | Check-and-Set identifier
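A minimal sketch of how the four document properties fit together, with CAS used for optimistic concurrency. `Document`, `BucketSketch` and `CasMismatch` are hypothetical names for illustration and are not the classes of any official SDK; the counter-based CAS value is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Document:
    id: str         # bucket-unique identifier
    content: Any    # the value that is stored
    expiry: int = 0 # expiration time (0 = never, by assumption)
    cas: int = 0    # check-and-set identifier, changes on every write


class CasMismatch(Exception):
    """Raised when a document changed since the caller read it."""


class BucketSketch:
    def __init__(self):
        self._docs = {}
        self._next_cas = 1

    def upsert(self, doc_id, content, expiry=0):
        doc = Document(doc_id, content, expiry, self._next_cas)
        self._next_cas += 1
        self._docs[doc_id] = doc
        return doc

    def get(self, doc_id):
        return self._docs[doc_id]

    def replace(self, doc_id, content, cas):
        """Write only if the stored CAS still matches the caller's CAS."""
        current = self._docs[doc_id]
        if current.cas != cas:
            raise CasMismatch(doc_id)
        return self.upsert(doc_id, content, current.expiry)


bucket = BucketSketch()
bucket.upsert("user::1", {"visits": 1})
reader_a = bucket.get("user::1")
reader_b = bucket.get("user::1")
bucket.replace("user::1", {"visits": 2}, cas=reader_a.cas)  # succeeds
conflict = False
try:
    bucket.replace("user::1", {"visits": 99}, cas=reader_b.cas)
except CasMismatch:
    conflict = True  # reader_b must re-read and retry
```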
Couchbase SDK
What does it mean to be a Couchbase SDK?
• Cluster
  • Function: Hold on to cluster information such as topology.
  • API: Reference Cluster Management, openBucket(), info(), disconnect()
• Bucket
  • Function: Manage connections to the bucket within the cluster for different services. Provide a core layer where IO can be managed and optimized. Provide a way to manage buckets.
  • API: insertDesignDocument(), flush(), listDesignDocuments()
• CRUD
  • Function: Give the application developer a concurrent API for basic (k-v) or document management.
  • API: get(), insert(), upsert(), remove()
• View Query
  • Function: Allow for view querying, building of queries and reasonable error handling from the cluster.
  • API: abucket.NewViewQuery().Limit().Stale()
• N1QL Query
  • Function: Allow for querying, execution of other directives such as defining indexes and checking on index state.
  • API: abucket.NewN1QLQuery("SELECT * FROM default LIMIT 5").Consistency(gocouchbase.RequestPlus);
Couchbase SDK
• Official SDKs: Java, .NET, Node.js, Python, PHP, C / C++, Go, Ruby
• For each of these we have:
  • Full Document support
  • Interoperability
  • Common yet idiomatic programming model
• Others: Erlang, Perl, TCL, Clojure, Scala
• JDBC and ODBC
Architecture
Couchbase Cluster: Node and SDK Interaction
Basic Operation
[Diagram: Couchbase Servers 1 to 3, each holding active and replica shards. Active: Server 1 has shards 5, 2, 9; Server 2 has 4, 7, 8; Server 3 has 1, 3, 6. Replica: Server 1 has shards 4, 1, 8; Server 2 has 6, 3, 2; Server 3 has 7, 9, 5]
• Application has single logical connection to cluster (client object)
• Data is automatically sharded, resulting in even document data distribution across the cluster
• Each vBucket replicated 1, 2 or 3 times ("peer-to-peer" replication)
• Docs are automatically hashed by the client to a shard
• Cluster map provides the location of the server a shard is on
• Every read/write/update/delete goes to the same node for a given key
• Strongly consistent data access ("read your own writes")
• A single Couchbase node can achieve hundreds of thousands of ops/sec, so no need to scale reads
Auto sharding: Bucket and vBuckets
[Diagram: data buckets divided into virtual buckets (vB) 1 ... 1024]
• A bucket is a logical, unique key space
• Multiple buckets can exist within a single cluster of nodes
• Each bucket has active and replica data sets (1, 2 or 3 extra copies)
• Each data set has 1024 Virtual Buckets (vBuckets)
• Each vBucket contains 1/1024th portion of the data set
• vBuckets do not have a fixed physical server location
• Mapping between the vBuckets and physical servers is called the cluster map
• Document IDs (keys) always get hashed to the same vBucket
• Couchbase SDKs look up the vBucket -> server mapping
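The two-step lookup (key to vBucket, vBucket to server) can be sketched as follows. CRC32 is used here only as a convenient deterministic hash and the round-robin layout is an assumption; the slides do not specify the exact hash function or how vBuckets are assigned to servers. The function names are hypothetical.

```python
import zlib
from collections import Counter

NUM_VBUCKETS = 1024


def vbucket_for_key(key):
    """Hash a document ID to a vBucket number in the range 0..1023.

    CRC32 is an illustrative choice; the same key always yields the
    same vBucket, which is the property the slides rely on.
    """
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS


def build_cluster_map(servers):
    """Assign vBuckets to servers round-robin (an assumed layout)."""
    return [servers[vb % len(servers)] for vb in range(NUM_VBUCKETS)]


def server_for_key(key, cluster_map):
    """Client-side routing: no central lookup, just hash and index."""
    return cluster_map[vbucket_for_key(key)]


servers = ["server1", "server2", "server3"]
cluster_map = build_cluster_map(servers)
per_server = Counter(cluster_map)      # vBuckets owned by each server
target = server_for_key("user::1", cluster_map)
print(vbucket_for_key("user::1"), target)
```

Because the vBucket for a key never changes, moving data between servers only requires updating the cluster map, not rehashing keys.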
Cluster Map
Cluster Map: 2 nodes added
Rebalance
[Diagram: Couchbase Servers 4 and 5 join Servers 1 to 3; active and replica shards are redistributed across all five servers while the application continues READ/WRITE/UPDATE operations]
• Application has single logical connection to cluster (client object)
• Multiple nodes added or removed at once
• One-click operation
• Incremental movement of active and replica vBuckets and data
• Client library updated via cluster map
• Fully online operation, no downtime or loss of performance
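A rough sketch of what a rebalance computes: a new cluster map that is evenly spread over the new server list while moving as few vBuckets as possible. This is an assumed greedy strategy for illustration, not Couchbase's actual rebalance algorithm, and it covers active vBuckets only. The function name `rebalance` is hypothetical.

```python
from collections import Counter


def rebalance(cluster_map, servers):
    """Return a new cluster map spread evenly over `servers`.

    Only vBuckets on removed or over-full servers are moved, so most
    vBuckets keep their current location.
    """
    base, extra = divmod(len(cluster_map), len(servers))
    # Target vBucket count per server (the first `extra` servers get one more).
    target = {s: base + (1 if i < extra else 0) for i, s in enumerate(servers)}
    load = Counter()
    new_map = list(cluster_map)
    to_move = []
    for vb, server in enumerate(cluster_map):
        if server in target and load[server] < target[server]:
            load[server] += 1        # stays where it is
        else:
            to_move.append(vb)       # owner removed or above its target
    for vb in to_move:
        dest = next(s for s in servers if load[s] < target[s])
        new_map[vb] = dest
        load[dest] += 1
    return new_map


old_servers = ["server1", "server2", "server3"]
old_map = [old_servers[vb % 3] for vb in range(1024)]
new_servers = old_servers + ["server4", "server5"]
new_map = rebalance(old_map, new_servers)
moved = sum(1 for a, b in zip(old_map, new_map) if a != b)
print(moved, Counter(new_map))
```

Clients keep using the old map until they receive the updated one, which is why the operation can stay online.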
Fail Over Node
[Diagram: a server in the five-node cluster goes down and replica shards on the remaining servers are promoted to active]
• Application has single logical connection to cluster (client object)
• When a node goes down, some requests will fail
• Failover is either automatic or manual
• Client library is automatically updated via cluster map
• Replicas not recreated, to preserve stability
• Best practice is to replace the node and rebalance
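Failover can be sketched as a pure cluster-map update: replicas of the failed node's active shards are promoted, and lost replicas are not recreated. The shard layout below is the three-server layout from the Basic Operation diagram; the function `fail_over` is hypothetical and the sketch assumes exactly one replica per shard.

```python
def fail_over(active_map, replica_map, failed):
    """Promote replicas for shards whose active copy was on `failed`.

    Replicas are not recreated here; a shard without a replica keeps
    None until a later rebalance.
    """
    new_active, new_replica = {}, {}
    for shard, server in active_map.items():
        replica = replica_map[shard]
        if server == failed:
            new_active[shard] = replica   # promote the replica copy
            new_replica[shard] = None
        else:
            new_active[shard] = server
            new_replica[shard] = None if replica == failed else replica
    return new_active, new_replica


# Shard -> server, as drawn on the Basic Operation slide.
active_map = {5: "server1", 2: "server1", 9: "server1",
              4: "server2", 7: "server2", 8: "server2",
              1: "server3", 3: "server3", 6: "server3"}
replica_map = {4: "server1", 1: "server1", 8: "server1",
               6: "server2", 3: "server2", 2: "server2",
               7: "server3", 9: "server3", 5: "server3"}

new_active, new_replica = fail_over(active_map, replica_map, "server3")
unprotected = sorted(s for s, r in new_replica.items() if r is None)
print(new_active, unprotected)
```

The shards listed in `unprotected` have a single copy until the node is replaced and the cluster is rebalanced, which is why the slide recommends doing so.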
Couchbase at PayPal
Kafka Integration
Couchbase at PayPal
• Footprint Overview
  • Seven use cases (more going live at a later date)
  • Each cluster has 10 to 20 nodes
  • Three data center locations per use case
• Global Cookie Service
  • Three clusters (two handle traffic, one for DR)
  • Bi-directional replication
  • Billions of documents
  • TB of data (maximum of 10 TB over time)
• Challenge
  • Data analytics
Couchbase at PayPal
• Couchbase Solution
  • Couchbase Server deployed to capture and serve global cookies
  • Integrates with Hadoop to pass data for additional offline analytics via Kafka
• Results
  • Consistent low latency
    • SLA 10ms application
    • SLA 1ms Couchbase
  • High availability enabled by distributed cache and data center replication
  • Kafka integration for analytics within Hadoop cluster
© 2015 PayPal Inc. All rights reserved. Confidential and proprietary.
Requirements for Database
• Data volume / Scalability
  • Online system; >1B documents
  • 4-10k size; 5-10TB total storage
  • Linearly scalable
• Availability
  • Multi data center, DR
  • Availability requirement of 99.99%
• Data Structure
  • Flexible and schema-less; document based
• Performance
  • 50% read / 50% write
  • Low latency < 10 msec (5)
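A quick back-of-the-envelope check of these requirements. It assumes "4-10k" means 4 to 10 kilobytes per document, 1 KB = 1000 bytes, exactly one billion documents (the slide says more than 1B), and a 365-day year; replication overhead is ignored.

```python
DOCS = 1_000_000_000            # lower bound from the slide (>1B)
BYTES_PER_TB = 10 ** 12

# Raw data size for the smallest and largest document sizes.
low_tb = DOCS * 4 * 1000 / BYTES_PER_TB
high_tb = DOCS * 10 * 1000 / BYTES_PER_TB

# Downtime budget implied by a 99.99% availability requirement.
minutes_per_year = 365 * 24 * 60
downtime_minutes = round(minutes_per_year * (1 - 0.9999), 2)

print(f"storage: {low_tb:.0f} to {high_tb:.0f} TB")
print(f"allowed downtime: {downtime_minutes} minutes per year")
```

Under these assumptions the document count and size range are consistent with the stated 5-10TB of total storage, and 99.99% availability leaves under an hour of downtime per year.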
Couchbase TAP
• Snapshot entire database
• Export future mutations
• TAP observes data changes in the memcached server
Kafka
• A high-throughput distributed messaging system
• Kafka Producer: fast, scalable, durable, distributed
Couchbase Kafka Adapter
• Based on Couchbase TAP and Kafka Producer
• https://github.com/paypal/couchbasekafka
Stream data out of database
https://github.com/paypal/couchbasekafka
[Diagram: TAP Stream feeds the Couchbase Kafka Adapter {TAP Client + Kafka Producer}, which feeds Camus and MR Jobs; steps are numbered [1] to [7]]
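The adapter idea (a TAP client feeding a Kafka producer) can be sketched with in-memory stand-ins. Nothing below uses the real TAP protocol or a real Kafka client: `tap_stream`, `ListProducer` and `run_adapter` are hypothetical names, and the JSON encoding and topic name are assumptions.

```python
import json


def tap_stream(snapshot, future_mutations):
    """Yield (key, value) mutations: the full snapshot first, then new changes."""
    for key, value in snapshot.items():
        yield key, value
    for key, value in future_mutations:
        yield key, value


class ListProducer:
    """In-memory stand-in for a Kafka producer (not a real client API)."""

    def __init__(self):
        self.messages = []

    def send(self, topic, key, value):
        self.messages.append((topic, key, value))


def run_adapter(stream, producer, topic):
    """Forward every mutation from the stream to the producer as JSON bytes."""
    count = 0
    for key, value in stream:
        producer.send(topic,
                      key.encode("utf-8"),
                      json.dumps(value).encode("utf-8"))
        count += 1
    return count


snapshot = {"cookie::a": {"v": 1}}
future_mutations = [("cookie::b", {"v": 2}), ("cookie::a", {"v": 3})]
producer = ListProducer()
sent = run_adapter(tap_stream(snapshot, future_mutations), producer, "cookies")
print(sent, producer.messages)
```

Downstream consumers such as Camus or MapReduce jobs would then read the topic independently of the database.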
Deployment Model
[Diagram: three CookieApp instances read from and write to the active clusters. XDCR is bi-directional between active clusters and uni-directional from active to passive]
Demo
• Connector: http://blog.couchbase.com/introducing-the-couchbase-kafka-connector
• Bits: https://github.com/couchbase/couchbase-kafka-connector
Thank You