Data Infrastructure at LinkedIn
Shirshanka Das
XLDB 2011
Me
UCLA Ph.D. 2005 (Distributed protocols in content delivery networks)
PayPal (Web frameworks and Session Stores)
Yahoo! (Serving Infrastructure, Graph Indexing, Real-time Bidding in Display Ad Exchanges)
@ LinkedIn (Distributed Data Systems team): Distributed data transport and storage technology (Kafka, Databus, Espresso, ...)
Outline
LinkedIn Products
Data Ecosystem
LinkedIn Data Infrastructure Solutions
Next Play
LinkedIn By The Numbers
120,000,000+ users in August 2011
2 new user registrations per second
4 billion People Searches expected in 2011
2+ million companies with LinkedIn Company Pages
81+ million unique visitors monthly*
150K domains feature the LinkedIn Share Button
7.1 billion page views in Q2 2011
1M LinkedIn Groups
* Based on comScore, Q2 2011
Member Profiles
Signal - faceted stream search
People You May Know
Three Paradigms : Simplifying the Data Continuum
Online: activity that should be reflected immediately
• Member Profiles
• Company Profiles
• Connections
• Communications
Nearline: activity that should be reflected soon
• Signal
• Profile Standardization
• News
• Recommendations
• Search
• Communications
Offline: activity that can be reflected later
• People You May Know
• Connection Strength
• News
• Recommendations
• Next best idea
Data Infrastructure Toolbox (Online)
Capabilities
• Key-value access
• Rich structures (e.g. indexes)
• Change capture capability
• Search platform
• Graph engine
[Table columns "Systems" and "Analysis": contents not recoverable]
Data Infrastructure Toolbox (Nearline)
Capabilities
• Change capture streams
• Messaging for site events, monitoring
• Nearline processing
[Table columns "Systems" and "Analysis": contents not recoverable]
Data Infrastructure Toolbox (Offline)
Capabilities
• Machine learning, ranking, relevance
• Analytics on Social gestures
[Table columns "Systems" and "Analysis": contents not recoverable]
Laying out the tools
Focus on four systems in Online and Nearline
Data Transport
– Kafka
– Databus
Online Data Stores
– Voldemort
– Espresso
Kafka: High-Volume Low-Latency Messaging System
Kafka: Architecture
[Diagram: the Web Tier pushes events to the Broker Tier (Topic 1, Topic 2, ... Topic N), which uses sequential writes and sendfile. Consumers pull events through the Kafka Client Lib (Iterator 1 ... Iterator n), tracking a topic offset. Zookeeper handles offset management and topic/partition ownership. Throughput labels: 100 MB/sec, 200 MB/sec.]
Scale
• Billions of Events
• TBs per day
• Inter-colo: few seconds
• Typical retention: weeks
Guarantees
• At least once delivery
• Very high throughput
• Low latency
• Durability
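The pull model and the at-least-once guarantee can be illustrated with a small in-memory sketch. This is not Kafka's client API: `PartitionLog`, `Consumer`, and `poll` are hypothetical names, and the committed offset is kept in a plain attribute where the slide places offset management in Zookeeper. The point it shows is that committing the offset only after a batch is processed means a failure causes redelivery, never loss.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    """Append-only log for one topic partition (stand-in for a broker)."""
    events: list = field(default_factory=list)

    def append(self, event):
        self.events.append(event)
        return len(self.events) - 1  # offset of the appended event

    def fetch(self, offset, max_events=10):
        # Consumers pull: they name the offset they want to read from.
        return self.events[offset:offset + max_events]


class Consumer:
    """Pull consumer that advances its offset only after processing a batch."""

    def __init__(self, log):
        self.log = log
        self.committed = 0  # assumption: stored locally, not in Zookeeper

    def poll(self, handler, max_events=10):
        batch = self.log.fetch(self.committed, max_events)
        for event in batch:
            handler(event)  # if this raises, the offset is not advanced
        self.committed += len(batch)
        return len(batch)
```

If `handler` raises partway through a batch, the next `poll` starts again from the last committed offset, so events processed before the failure are delivered a second time. Handlers therefore need to tolerate duplicates.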
Databus : Timeline-Consistent Change Data Capture
Databus at LinkedIn
[Diagram: changes are captured from the source DB into the Relay (Event Win). The Relay serves on-line changes to clients and to the Bootstrap service, which keeps its own DB and serves a consistent snapshot at U. Each Client embeds the Databus Client Lib and feeds Consumer 1 ... Consumer n.]
Features
Transport independent of data source: Oracle, MySQL, …
Portable change event serialization and versioning
Start consumption from arbitrary point
Guarantees
Transactional semantics
Timeline consistency with the data source
Durability (by data source)
At-least-once delivery
Availability
Low latency
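One way to read "start consumption from arbitrary point" together with the Relay and Bootstrap boxes is sketched below. It is an illustration under stated assumptions, not Databus code: `Relay`, `Bootstrap`, and `catch_up` are hypothetical names, sequence numbers are assumed to be consecutive integers, and the bootstrap store is assumed to have applied every event the relay has seen. A consumer whose position is still inside the relay's window replays events in commit order; one that has fallen too far behind first takes a consistent snapshot and then continues from the snapshot's sequence number.

```python
from collections import deque


class Relay:
    """Bounded in-memory window of change events, ordered by sequence number."""

    def __init__(self, capacity):
        self.window = deque(maxlen=capacity)  # oldest events fall off the front

    def publish(self, scn, key, value):
        self.window.append((scn, key, value))

    def events_since(self, scn):
        """Events with sequence number > scn, or None if the window has a gap."""
        if self.window and self.window[0][0] > scn + 1:
            return None
        return [e for e in self.window if e[0] > scn]


class Bootstrap:
    """Latest value per key, plus the sequence number that state reflects."""

    def __init__(self):
        self.state = {}
        self.scn = 0

    def apply(self, scn, key, value):
        self.state[key] = value
        self.scn = scn

    def snapshot(self):
        return dict(self.state), self.scn


def catch_up(relay, bootstrap, last_scn):
    """Return (snapshot or None, events to apply afterwards, in commit order)."""
    events = relay.events_since(last_scn)
    if events is not None:
        return None, events
    snapshot, u = bootstrap.snapshot()  # consistent snapshot at U
    tail = relay.events_since(u)
    return snapshot, tail if tail is not None else []
```

Either path hands the consumer changes in the order the source committed them, which is the timeline-consistency property listed under Guarantees.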
Voldemort: Highly-Available Distributed Data Store
Voldemort: Architecture
Highlights
• Open source
• Pluggable components
• Tunable consistency / availability
• Key/value model, server side “views”
In production
• Data products
• Network updates, sharing, page view tracking, rate-limiting, more…
• Future: SSDs, multi-tenancy
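"Tunable consistency / availability" is commonly realized with replica quorums: with N replicas, a write must reach W of them and a read consults R of them, and choosing R + W > N makes every read overlap the latest successful write. The sketch below illustrates that trade-off only. `QuorumStore` and its parameters are hypothetical names, and a plain integer version stands in for real conflict detection (Voldemort itself uses vector clocks).

```python
class QuorumStore:
    """Toy replicated key/value store with configurable read/write quorums."""

    def __init__(self, n, r, w):
        if not (1 <= r <= n and 1 <= w <= n):
            raise ValueError("require 1 <= r <= n and 1 <= w <= n")
        self.n, self.r, self.w = n, r, w
        self.replicas = [{} for _ in range(n)]  # key -> (version, value)

    def put(self, key, value, version, up=None):
        """Write to reachable replicas; report success if at least W stored it."""
        targets = range(self.n) if up is None else up
        acks = 0
        for i in targets:
            current = self.replicas[i].get(key)
            if current is None or current[0] < version:
                self.replicas[i][key] = (version, value)
            acks += 1
        return acks >= self.w

    def get(self, key, up=None):
        """Read from R reachable replicas and return the newest value seen."""
        targets = list(range(self.n) if up is None else up)
        if len(targets) < self.r:
            raise RuntimeError("not enough replicas for a read quorum")
        answers = [self.replicas[i][key]
                   for i in targets[:self.r] if key in self.replicas[i]]
        if not answers:
            return None
        return max(answers, key=lambda a: a[0])[1]
```

With `QuorumStore(3, 2, 2)` a write acknowledged by replicas 0 and 1 is visible to a read from replicas 1 and 2. With `QuorumStore(3, 1, 1)` the same read can miss it, which buys availability and latency at the cost of stale reads.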
Espresso: Indexed Timeline-Consistent Distributed Data Store
Espresso: Key Design Points
Hierarchical data model
– InMail, Forums, Groups, Companies
Native Change Data Capture Stream
– Timeline consistency
– Read after Write
Rich functionality within a hierarchy
– Local Secondary Indexes
– Transactions
– Full-text search
Modular and Pluggable
– Off-the-shelf: MySQL, Lucene, Avro
Application View
Partitioning
Partition Layout: Master, Slave
3 Storage Engine nodes, 2 way replication
Cluster Manager
• Database: Partition: P.1 → Node: 1, …, Partition: P.12 → Node: 3
• Node: 1: M: P.1 – Active, …, S: P.5 – Active, …
Cluster
• Node 1: Master P.1, P.2, P.3, P.4; Slave P.5, P.6, P.9, P.10
• Node 2: Master P.5, P.6, P.7, P.8; Slave P.1, P.2, P.11, P.12
• Node 3: Master P.9, P.10, P.11, P.12; Slave P.3, P.4, P.7, P.8
Espresso: API
REST over HTTP
Get Messages for bob
– GET /MailboxDB/MessageMeta/bob
Get MsgId 3 for bob
– GET /MailboxDB/MessageMeta/bob/3
Get first page of Messages for bob that are unread and in the inbox
– GET /MailboxDB/MessageMeta/bob/?query="+isUnread:true +isInbox:true"&start=0&count=15
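The three requests above follow one pattern: /database/table/partition key, then an optional sub-key or a query string. A client could build them as in the sketch below, using only Python's standard `urllib.parse`. The host name and the helper names `resource_url` and `query_url` are assumptions for illustration, and the slide does not say how the query value must be escaped on the wire, so standard form encoding is assumed.

```python
from urllib.parse import quote, urlencode

BASE = "http://espresso.example.com"  # hypothetical host, not from the talk


def resource_url(database, table, *key_parts):
    """Build /<database>/<table>/<partition key>[/<sub-key>...]."""
    path = "/".join(quote(str(p), safe="")
                    for p in (database, table, *key_parts))
    return f"{BASE}/{path}"


def query_url(database, table, key, query, start=0, count=15):
    """Build a paged index query against one partition key."""
    params = urlencode({"query": query, "start": start, "count": count})
    return f"{resource_url(database, table, key)}/?{params}"


print(resource_url("MailboxDB", "MessageMeta", "bob"))
print(resource_url("MailboxDB", "MessageMeta", "bob", 3))
print(query_url("MailboxDB", "MessageMeta", "bob",
                "+isUnread:true +isInbox:true"))
```

`urlencode` escapes `+` as `%2B` and the space as `+`, so the query terms survive the round trip through an HTTP server's form decoding.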
Espresso: API Transactions
• Add a message to bob’s mailbox
• Transactionally update mailbox aggregates, insert into metadata and details.
POST /MailboxDB/*/bob HTTP/1.1
Content-Type: multipart/binary; boundary=1299799120
Accept: application/json

--1299799120
Content-Type: application/json
Content-Location: /MailboxDB/MessageStats/bob
Content-Length: 50

{"total":"+1", "unread":"+1"}
--1299799120
Content-Type: application/json
Content-Location: /MailboxDB/MessageMeta/bob
Content-Length: 332

{"from":"…","subject":"…",…}
--1299799120
Content-Type: application/json
Content-Location: /MailboxDB/MessageDetails/bob
Content-Length: 542

{"body":"…"}
--1299799120--
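A request body of this shape can be assembled mechanically: one part per table, each carrying its own Content-Location, terminated by the closing boundary. The sketch below builds only the body text. `multipart_body` is a hypothetical helper, the document values in the usage lines are placeholders, and the Content-Length values it emits are computed from the actual payloads, so they will not match the illustrative numbers on the slide.

```python
import json


def multipart_body(parts, boundary):
    """Serialize (content_location, document) pairs as a multipart body."""
    lines = []
    for location, document in parts:
        payload = json.dumps(document)
        lines += [
            f"--{boundary}",
            "Content-Type: application/json",
            f"Content-Location: {location}",
            f"Content-Length: {len(payload.encode('utf-8'))}",
            "",  # blank line separates part headers from the payload
            payload,
        ]
    lines.append(f"--{boundary}--")  # closing boundary
    return "\r\n".join(lines) + "\r\n"


body = multipart_body(
    [
        ("/MailboxDB/MessageStats/bob", {"total": "+1", "unread": "+1"}),
        # Placeholder documents; the slide elides the real fields.
        ("/MailboxDB/MessageMeta/bob", {"from": "alice", "subject": "hi"}),
        ("/MailboxDB/MessageDetails/bob", {"body": "hello"}),
    ],
    boundary="1299799120",
)
print(body)
```

All three parts address the same partition key (bob), which is what lets the store apply them as one transaction within a single hierarchy.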
Espresso: System Components
Espresso @ LinkedIn
First applications
– Company Profiles
– InMail
Next
– Unified Social Content Platform
– Member Profiles
– Many more…
Espresso: Next steps
Launched first application Oct 2011
Open source 2012
Multi-Datacenter support
Log-structured storage
Time-partitioned data
The Specialization Paradox in Distributed Systems
Good: Build specialized systems so you can do each thing really well
Bad: Rebuild distributed routing, failover, cluster management, monitoring, tooling
Generic Cluster Manager: Helix
• Generic Distributed State Model
• Centralized Config Management
• Automatic Load Balancing
• Fault tolerance
• Health monitoring
• Cluster expansion and rebalancing
• Open Source 2012
• Espresso, Databus and Search
Stay tuned for
Innovation
– Nearline processing
– Espresso eco-system
– Storage / indexing
– Analytics engine
– Search
Convergence
– Building blocks for distributed data management systems
Thanks!
Appendix
Espresso: Routing
Router is a high-performance HTTP proxy
Examines URL, extracts partition key
Per-db routing strategy
– Hash Based
– Route To Any (for schema access)
– Range (future)
Routing function maps partition key to partition
Cluster Manager maintains mapping of partition to hosts:
– Single Master
– Multiple Slaves
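The routing steps above (extract the partition key from the URL, map it to a partition, look up the partition's master) can be sketched as follows. The hash function, the partition count, and the `masters` table are assumptions for illustration; the talk only says the strategy is hash based and that the Cluster Manager maintains the partition-to-host mapping.

```python
import zlib
from urllib.parse import urlsplit

NUM_PARTITIONS = 12  # assumption, matching the 12-partition layout slide


def partition_key(url):
    """Extract (database, partition key) from /<db>/<table>/<key>[/...]."""
    parts = [p for p in urlsplit(url).path.split("/") if p]
    if len(parts) < 3:
        raise ValueError(f"no partition key in {url!r}")
    return parts[0], parts[2]


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Hash-based routing function; CRC32 is a stand-in, not Espresso's hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(url, masters):
    """Return the host that currently masters the partition for this URL."""
    _, key = partition_key(url)
    return masters[partition_for(key)]


# Hypothetical mapping a cluster manager might publish (partitions 0..11).
masters = {p: f"node{p // 4 + 1}" for p in range(NUM_PARTITIONS)}
print(route("/MailboxDB/MessageMeta/bob/3", masters))
```

Because only the partition key is hashed, every resource under the same key (`/MailboxDB/MessageMeta/bob`, `/MailboxDB/MessageMeta/bob/3`, `/MailboxDB/*/bob`) lands on the same master, which is what makes per-hierarchy transactions and local secondary indexes possible.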
Espresso: Storage Node
Data Store (MySQL)
– Stores document as Avro serialized blob
– Blob indexed by (partition key {, sub-key})
– Row also contains limited metadata: Etag, Last modified time, Avro schema version
Document Schema specifies per-field index constraints
Lucene index per partition key / resource
Espresso: Replication
MySQL replication of mastered partitions
MySQL “Slave” is MySQL instance with custom storage engine
– custom storage engine just publishes to databus
Per-database commit sequence number
Replication is Databus
– Supports existing downstream consumers
Storage node consumes from Databus to update secondary indexes and slave partitions
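Since Databus delivers at least once, a storage node applying the stream has to tolerate repeats. One common approach, sketched here as an assumption rather than as Espresso's implementation, is to remember the last applied commit sequence number and skip anything at or below it. `SlavePartition` and the indexed field `isUnread` are hypothetical; the sketch does not handle gaps in the sequence.

```python
class SlavePartition:
    """Applies change events idempotently, keyed by commit sequence number."""

    def __init__(self):
        self.rows = {}        # document key -> document
        self.index = {}       # secondary index: isUnread value -> set of keys
        self.applied_scn = 0  # highest commit sequence number applied so far

    def apply(self, scn, key, document):
        """Apply one event; return False if it was already applied."""
        if scn <= self.applied_scn:
            return False  # duplicate delivery, safe to ignore
        old = self.rows.get(key)
        if old is not None:
            # Remove the stale index entry before writing the new one.
            self.index[old["isUnread"]].discard(key)
        self.rows[key] = document
        self.index.setdefault(document["isUnread"], set()).add(key)
        self.applied_scn = scn
        return True
```

Updating the row and its index entry in the same step keeps the slave's secondary index consistent with its data at every sequence number.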