The Evolution of Data Infrastructure at LinkedIn
LinkedIn Confidential ©2013 All Rights Reserved
The Evolution of Data Infrastructure at LinkedIn
Lei Gao, http://www.linkedin.com/in/gaolei
Outline
1. Company and Mission
2. Products and Science
3. Data Infrastructure
4. Conclusion
The World’s Largest Professional Network
• 225M+ Members Worldwide
• 2+ New Members Per Second
• 132M+ Monthly Unique Visitors
• 2.9M+ Company Pages
Connecting the world’s professionals to make them more productive and successful
Member Profiles
Large dataset
Medium writes
Very high reads
Freshness <1s
People You May Know
Large dataset
Compute intensive
High reads
Freshness ~hrs
LinkedIn Today
Moving dataset
High writes
High reads
Freshness ~mins
LinkedIn Data Infrastructure: Three-Phase Abstraction
Diagram labels: Users, Application, Online Data Infra, Near-Line Infra, Offline Data Infra
Infrastructure | Latency & Freshness Requirements | Products
Online | Activity that should be reflected immediately | Member Profiles, Company Profiles, Connections, Messages, Endorsements, Skills
Near-Line | Activity that should be reflected soon | Activity Streams, Profile Standardization, News, Recommendations, Search, Messages
Offline | Activity that can be reflected later | People You May Know, Connection Strength, News, Recommendations, Next best idea…
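The three phases can be read as a routing rule on how stale a product's data is allowed to be. The sketch below is only an illustration of that idea: the `Tier` enum, the `tier_for` function, and both numeric cut-offs are assumptions chosen to match the freshness figures quoted for Member Profiles (<1s), LinkedIn Today (~mins), and People You May Know (~hrs). They are not values stated in the talk.

```python
from enum import Enum


class Tier(Enum):
    ONLINE = "online"
    NEAR_LINE = "near-line"
    OFFLINE = "offline"


# Hypothetical cut-offs, picked only so the three example products land in
# the tiers the slides place them in.
ONLINE_MAX_SECONDS = 1
NEAR_LINE_MAX_SECONDS = 60 * 60


def tier_for(freshness_seconds: float) -> Tier:
    """Map an acceptable staleness (in seconds) to an infrastructure tier."""
    if freshness_seconds <= ONLINE_MAX_SECONDS:
        return Tier.ONLINE
    if freshness_seconds < NEAR_LINE_MAX_SECONDS:
        return Tier.NEAR_LINE
    return Tier.OFFLINE
```

With these assumed thresholds, a sub-second requirement maps to the online tier, a few minutes to near-line, and several hours to offline.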
The Big-Data Feedback Loop
Diagram labels: Member, Product, Science, Data, Infrastructure, Analytics; Value, Insights, Scale; Engagement, Virality, Signals, Refinement
LinkedIn Data Infrastructure: Sample Stack
Infrastructure challenges in the three-phase ecosystem are diverse, complex, and specific.
Some components are off-the-shelf, but there is significant investment in home-grown, deep, and interesting platforms.
Databus
The Original RDBMS Model
Streaming Transactions for Search/Connections
Databus: Timeline-Consistent Change Data Capture
LinkedIn Data Infrastructure Solutions
Streaming Transactions for Search/Connections
Diagram labels: RO, RO, RO
Databus at LinkedIn
• Transport independent of data source: Oracle, MySQL, …
• Transactional semantics
• In-order, at-least-once delivery
• Tens of relays
• Hundreds of sources
• Low latency (milliseconds)
Diagram labels: DB, Capture Changes, Relay, Event Win, Bootstrap, On-line Changes, Compressed Delta Since T, Consistent Snapshot at U, Client, Databus Client Lib, Consumer 1 … Consumer n
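A toy model may help make the relay and bootstrap paths on this slide concrete. It assumes the relay holds only a bounded window of recent changes, and that a consumer whose checkpoint has fallen out of that window first loads a consistent snapshot and then resumes from the relay. Every name here (`ChangeEvent`, `Relay`, `Bootstrap`, `Consumer`, `events_since`, the `scn` sequence field) is hypothetical and is not the Databus client API.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    scn: int    # assumed: monotonically increasing commit sequence number
    key: str
    value: str


class Relay:
    """Keeps only the most recent changes in a bounded in-memory window."""

    def __init__(self, capacity):
        self._window = deque(maxlen=capacity)
        self._evicted_up_to = 0  # highest scn that has dropped out of the window

    def capture(self, event):
        if len(self._window) == self._window.maxlen:
            self._evicted_up_to = self._window[0].scn
        self._window.append(event)

    def events_since(self, scn):
        """Events after the checkpoint in commit order, or None if it is too old."""
        if scn < self._evicted_up_to:
            return None
        return [e for e in self._window if e.scn > scn]


class Bootstrap:
    """Long-term store that can serve a consistent snapshot as of some scn."""

    def __init__(self):
        self._latest = {}
        self._scn = 0

    def capture(self, event):
        self._latest[event.key] = event.value
        self._scn = event.scn

    def snapshot(self):
        return dict(self._latest), self._scn


class Consumer:
    def __init__(self, relay, bootstrap):
        self.relay, self.bootstrap = relay, bootstrap
        self.state, self.checkpoint, self.bootstraps = {}, 0, 0

    def poll(self):
        events = self.relay.events_since(self.checkpoint)
        if events is None:  # fell behind the relay window: take the snapshot path
            self.state, self.checkpoint = self.bootstrap.snapshot()
            self.bootstraps += 1
            events = self.relay.events_since(self.checkpoint) or []
        for e in events:
            # Applying an event is idempotent, so a redelivered event
            # (at-least-once delivery) leaves the state unchanged.
            self.state[e.key] = e.value
            self.checkpoint = e.scn
```

The checkpoint advances only after an event has been applied, which is one simple way to get in-order, at-least-once behaviour in a single-threaded sketch like this one.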
Scaling Core Databases
Diagram labels: RO, RO, RO
Voldemort: Highly-Available Distributed KV Store
LinkedIn Data Infrastructure Solutions
Scaling Core Databases
Voldemort: Architecture
• Pluggable components
• Tunable consistency / availability
• Highly scalable key/value store
• 14 clusters, 400 nodes
• 400K peak QPS
• 100TB data
• 2 to 3 ms avg latency
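"Tunable consistency / availability" in a Dynamo-style store is commonly expressed through three numbers: the replication factor N, the number of replicas R that must answer a read, and the number W that must acknowledge a write. The sketch below shows the usual overlap rule (R + W > N) and a generic consistent-hash ring that picks N distinct replicas for a key. The `Ring` class, the virtual-node count, and the MD5-based hash are illustrative assumptions, not Voldemort's actual routing code.

```python
import hashlib
from bisect import bisect_right


def quorums_overlap(n, r, w):
    """True when every read quorum shares at least one replica with every write quorum."""
    return r + w > n


def _hash(text):
    # Assumption: first 8 bytes of MD5 as the position on the ring.
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:8], "big")


class Ring:
    """Generic consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes, vnodes=16):
        points = sorted((_hash(f"{node}#{i}"), node)
                        for node in nodes for i in range(vnodes))
        self._hashes = [h for h, _ in points]
        self._owners = [node for _, node in points]
        self._node_count = len(set(nodes))

    def preference_list(self, key, n):
        """Walk clockwise from the key's position and collect n distinct nodes."""
        if n > self._node_count:
            raise ValueError("replication factor exceeds number of nodes")
        start = bisect_right(self._hashes, _hash(key))
        chosen = []
        for i in range(len(self._owners)):
            node = self._owners[(start + i) % len(self._owners)]
            if node not in chosen:
                chosen.append(node)
                if len(chosen) == n:
                    break
        return chosen
```

For example, N=3 with R=2 and W=2 gives overlapping quorums, while R=1 and W=1 trades that guarantee for lower latency and higher availability.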
Scaling Core Databases
Secondary Index
Espresso: Indexed Timeline-Consistent Distributed Data Store
LinkedIn Data Infrastructure Solutions
Storage with Richer Data Model
Espresso
Application View
Hierarchical data model
Rich functionality on resources
• Conditional updates
• Partial updates
• Atomic counters
Rich functionality within resource groups
• Transactions
• Secondary index
• Text search
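The resource-level operations listed above can be illustrated with a small in-memory store. This is a sketch only: `ToyDocumentStore`, its method names, the tuple-shaped hierarchical paths, and the integer version tags are all assumptions made for the example and do not reflect Espresso's real interface. The store is single-threaded, so the counter is trivially atomic here.

```python
import itertools


class ToyDocumentStore:
    """Documents addressed by hierarchical paths, each with a version tag."""

    def __init__(self):
        self._docs = {}                   # path tuple -> (etag, document dict)
        self._etags = itertools.count(1)  # hypothetical version generator

    def get(self, path):
        """Return (etag, document) or None if the path is absent."""
        return self._docs.get(path)

    def put(self, path, doc, if_match=None):
        """Write a document; with if_match, succeed only if the etag still matches."""
        current = self._docs.get(path)
        if if_match is not None and (current is None or current[0] != if_match):
            return None                   # conditional update rejected
        etag = next(self._etags)
        self._docs[path] = (etag, dict(doc))
        return etag

    def patch(self, path, fields):
        """Partial update: merge fields into an existing document (KeyError if absent)."""
        _, doc = self._docs[path]
        return self.put(path, {**doc, **fields})

    def increment(self, path, field, delta=1):
        """Counter update; creates the document if it does not exist yet."""
        _, doc = self._docs.get(path, (None, {}))
        new_value = doc.get(field, 0) + delta
        self.put(path, {**doc, field: new_value})
        return new_value

    def children(self, prefix):
        """All paths strictly below a prefix, e.g. every message of one member."""
        return sorted(p for p in self._docs
                      if p[:len(prefix)] == prefix and len(p) > len(prefix))
```

A path such as `("MailboxDB", "Messages", "member42", "msg1")` is a made-up example of the hierarchy: everything under the `member42` prefix forms one group that could be queried or updated together.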
Espresso: System Components
• Partitioning/replication
• Timeline consistency
• Change propagation
Generic Cluster Manager: Helix
• Generic Distributed State Model
• Config Management
• Automatic Load Balancing
• Fault tolerance
• Cluster expansion and rebalancing
• Used by Espresso, Databus and Search
• Open sourced April 2012
• https://github.com/linkedin/helix
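As a rough illustration of what a cluster manager computes, the sketch below assigns each partition one MASTER and the remaining replicas as SLAVE across the live nodes, and is simply re-run when the node list changes. The function name `ideal_state` and the round-robin placement are assumptions for this example; this is not the Helix API, and a production rebalancer would try to move far fewer replicas than this naive recomputation does.

```python
def ideal_state(num_partitions, nodes, replicas):
    """Return {partition: {node: "MASTER" | "SLAVE"}} for the given live nodes."""
    live = sorted(nodes)
    if replicas > len(live):
        raise ValueError("not enough live nodes for the requested replica count")
    assignment = {}
    for partition in range(num_partitions):
        # Round-robin placement: consecutive nodes hold one partition's replicas.
        owners = [live[(partition + i) % len(live)] for i in range(replicas)]
        states = {owners[0]: "MASTER"}
        for node in owners[1:]:
            states[node] = "SLAVE"
        assignment[partition] = states
    return assignment
```

Calling the function again with a shorter node list models fault tolerance (a node dropped out); calling it with a longer list models cluster expansion.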
Streaming Non-transactional Events
Hadoop/DW
Espresso
Kafka: High-Volume Low-Latency Messaging System
LinkedIn Data Infrastructure Solutions
Ingress – Offline Data Analytics
Secured Hadoop/DW
Kafka Architecture
Diagram labels: Producer, Consumer, Zookeeper, Broker 1 to Broker 4, with partitions topic1-part1, topic1-part2, topic2-part1, and topic2-part2 replicated across brokers
Key features
• Scale-out architecture
• High throughput
• Automatic load balancing
• Intra-cluster replication
Per-day stats
• Writes: 10+ billion messages
• Reads: 50+ billion messages
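The topic and partition layout in the diagram can be pictured as a set of append-only logs that consumers read by offset. The in-memory sketch below is an assumption-laden toy: `ToyTopic`, `send`, and `fetch` are made-up names rather than the Kafka client API, and routing by CRC32 of the key is chosen only for simplicity. It leaves out brokers, replication, and Zookeeper coordination entirely.

```python
import zlib


class ToyTopic:
    """In-memory stand-in for one topic split into append-only partitions."""

    def __init__(self, num_partitions):
        self._partitions = [[] for _ in range(num_partitions)]

    def send(self, key, value):
        """Append a message and return (partition, offset)."""
        # Assumption: hash the key so that one key always maps to one partition,
        # which preserves per-key ordering.
        partition = zlib.crc32(key.encode("utf-8")) % len(self._partitions)
        log = self._partitions[partition]
        log.append(value)
        return partition, len(log) - 1

    def fetch(self, partition, offset, max_messages=100):
        """Read messages starting at an offset; the log itself is never modified."""
        return self._partitions[partition][offset:offset + max_messages]
```

Because reading does not consume messages, several independent consumers can each keep their own offset and read the same data, which is one plausible reason the daily read count can exceed the write count.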
Egress – Analytics Results for Online Serving
Secured Hadoop/DW
WebHDFS + Faust
LinkedIn Data Infrastructure Solutions
Egress – Getting Data Out from Offline
Diagram labels: Secured Hadoop/DW, WebHDFS, Kafka, Faust
Batch Environment Data Flow
Workflow management: Azkaban
Read-only Data Generation and Transfer
• Map-reduce jobs generate RO files
• The entire index fits in memory for fast reads
• File system cache for data
• Data transferred in parallel via WebHDFS
• Authentication always required for each file transfer out of Hadoop
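One way to picture a read-only file with an in-memory index is a sorted list of (key hash, offset) pairs searched by binary search, with the values stored back to back in a separate blob. The layout below (8-byte MD5 prefix, 4-byte big-endian length before each value) and the function names are assumptions for illustration, not the on-disk format used in production; hash collisions are ignored for brevity.

```python
import bisect
import hashlib
import struct


def _key_hash(key):
    # Assumption: an 8-byte MD5 prefix identifies a key.
    return hashlib.md5(key.encode("utf-8")).digest()[:8]


def build_read_only_store(records):
    """Build (index, data) from a dict of str -> str, as a batch job might."""
    index, data = [], bytearray()
    for key in sorted(records, key=_key_hash):
        value = records[key].encode("utf-8")
        index.append((_key_hash(key), len(data)))   # where this value starts
        data += struct.pack(">I", len(value)) + value
    return index, bytes(data)


def lookup(index, data, key):
    """Binary-search the in-memory index, then read one value from the data blob."""
    h = _key_hash(key)
    i = bisect.bisect_left(index, (h, 0))
    if i == len(index) or index[i][0] != h:
        return None
    offset = index[i][1]
    (length,) = struct.unpack_from(">I", data, offset)
    return data[offset + 4:offset + 4 + length].decode("utf-8")
```

Since the files never change after generation, the index needs no locking, and the data blob can be served straight from the file system cache.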
Modifiable Data Generation and Transfer
• Map-reduce jobs generate records
• In Avro format
• Annotated key and value fields
• Records published from Hadoop to Kafka
• Faust consumes records from Kafka
• Faust streams records into Voldemort, Espresso, and other serving platforms
Diagram labels: Hadoop, Teradata/DWH, Kafka, Other Data Sources; Faust (Monitoring, Throttling, Scheduling); Plug-ins: Kafka Plug-in, Databus Plug-in, V. Plug-in, E. Plug-in; Voldemort, Espresso, Other Data Sources
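The flow on this slide (records with annotated key and value fields, pulled from a source and pushed into a serving store through a plug-in, with throttling) can be sketched as follows. Faust's real interfaces are not shown in the talk, so everything here is hypothetical: `DictSink` stands in for a destination plug-in, `stream_records` for the transfer loop, and the per-second cap for whatever throttling policy is actually used.

```python
import time


class DictSink:
    """Stand-in for a destination plug-in; a real one would call a store client."""

    def __init__(self):
        self.rows = {}

    def write(self, key, value):
        self.rows[key] = value


def stream_records(records, key_field, value_field, sink,
                   max_per_second, sleep=time.sleep):
    """Push records into a sink, pausing after every max_per_second writes."""
    written = 0
    for record in records:
        # The annotated fields say which part of the record is the key
        # and which part is the value to store under it.
        sink.write(record[key_field], record[value_field])
        written += 1
        if written % max_per_second == 0:
            sleep(1.0)  # crude throttle so the serving store is not overwhelmed
    return written
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; swapping `DictSink` for another object with a `write` method models adding a new destination plug-in.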
Summary
Read more @ data.linkedin.com
1. E2E: The Big-Data feedback loop is essential for product design
2. Infrastructure
   1. Data Infra needs continuous innovation and iteration to scale out
   2. Fast moving, Big, Clean Data + Agile Metadata = Goodness
   3. Data-driven products need agile feedback infrastructure and measurement methodology
3. Methodology
   1. Data-Driven experimentation enables insights and agile products
   2. Recommendation-driven products have big impact
Help us. Come Have Fun with Us!
Info: data.linkedin.com
1. Science and Data Mining: Recommendation and Optimization Problems
2. Next-generation ad-hoc and OLAP query processing on Hadoop
3. Graph Computations: Off-line mining and On-line integration loops
4. nRT Data Streams in Near-line infrastructure
5. And much more…