Using Scalding for Data Driven Product Development at LinkedIn
Using Scalding for Data-Driven Product Development
Sasha Ovsankin, LinkedIn
Presented to Scala By The Bay, Aug 9, 2014
/summary
Data-Driven Product
Development
Scalding = Hadoop + Scala
/data-driven
Your Service
Your Amazing Service
Value Data
/data-driven/linkedin
[Diagram: data flow between the “Online” World and the “Offline” World (Hadoop). Labels: Web Applications, NoSQL Data Stores, Databases, Messaging, Message delivery, Tracking/logging, ETL, HDFS, Hadoop Jobs, Analytics, Data Products.]
/linkedin/big-data/links
• “LinkedIn Big Data Ecosystem”
  – http://lnkd.in/big-data-ecosystem
• Grid Operations
  – http://lnkd.in/gridops2013
/scalding
http://github.com/twitter/scalding
• Scala-based DSL for Map/Reduce jobs
• Built on Cascading, a stable and mature Hadoop framework
• Uses an API similar to Scala collections:

    import com.twitter.scalding._

    // Counts how often each whitespace-separated word occurs in the input.
    class WordCountJob(args : Args) extends Job(args) {
      TextLine( args("input") )
        .flatMap('line -> 'word) { line : String => line.split("""\s+""") }
        .groupBy('word) { _.size }
        .write( Tsv( args("output") ) )
    }

• Succinct and powerful
• High level of abstraction
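To make the "similar to Scala collections" point concrete, here is a rough sketch of the same word count written against plain in-memory Scala collections, with no Hadoop or Scalding involved. The object name `LocalWordCount` and the sample sentences are made up for illustration; they are not from the talk.

```scala
object LocalWordCount {
  // Same shape as the Scalding job: flatMap lines into words, group, count.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("""\s+"""))
      .filter(_.nonEmpty) // split can yield an empty leading token
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample input, standing in for args("input").
    val counts = wordCount(Seq("to be or not to be", "that is the question"))
    counts.toSeq.sortBy(_._1).foreach { case (word, n) => println(s"$word\t$n") }
  }
}
```

The flatMap / groupBy / size chain lines up step for step with the job above; the difference is that Scalding plans those steps as Map/Reduce stages instead of running them in one JVM.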
/data-driven/problem/scaling
• Problem: Scaling
• Solution
  – Distributed processing
  – High-level description of algorithms
  – Functional programming
…/solution/scalding
../problem/complexity
• Problem: Complexity
• Solution
  – Consistent way of organizing data
    • Self-describing data formats (Avro)
    • File organization
  – Type safety
  – Modularization
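One way to picture the type-safety and modularization bullets is the plain-Scala sketch below. Everything in it is hypothetical: the record types `PageView` and `PageCount`, the step names, and the idea that such records would mirror Avro schemas are assumptions for illustration, not the pipeline described in the talk.

```scala
// Hypothetical record types; a real pipeline might mirror Avro schemas instead.
final case class PageView(memberId: Long, page: String)
final case class PageCount(page: String, views: Int)

object PageViewPipeline {
  // Each step is a small typed function that can be tested on its own.
  def dropAnonymous(views: Seq[PageView]): Seq[PageView] =
    views.filter(_.memberId > 0)

  def countByPage(views: Seq[PageView]): Seq[PageCount] =
    views
      .groupBy(_.page)
      .toSeq
      .map { case (page, vs) => PageCount(page, vs.size) }
      .sortBy(_.page)

  // Steps compose only when their types line up; a mismatch fails at compile time.
  val pipeline: Seq[PageView] => Seq[PageCount] =
    views => countByPage(dropAnonymous(views))
}
```

Calling `PageViewPipeline.pipeline` on a handful of `PageView` values returns per-page counts; swapping a step for one with a different input type would be rejected by the compiler rather than discovered in a failed job.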
…/solution/scalding
/linkedin/hadoop/practices
• All online data end up in HDFS
  – Avro encoding is standard
• Production Process
  – CI/Automatic Build
    • More info forthcoming
  – Production Review
  – Operations and Monitoring
    • More info at http://lnkd.in/gridops2013
• Result: Thousands of jobs running in production
• More info at http://lnkd.in/big-data-ecosystem
../solution/scala/killer-argument
• Map & reduce -- primitives

    scala> import scala.math.pow
    scala> (1 to 1000) map { pow(_, 2).toInt } reduce { _ + _ }
    res20: Int = 333833500
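A small sketch of why these two primitives matter for distributed processing: because addition is associative, the same sum of squares can be computed chunk by chunk and the partial results combined afterwards. The object name `PartitionedSum` and the chunk size are illustrative assumptions; the chunks here are processed in one JVM, where Hadoop would spread them across machines.

```scala
object PartitionedSum {
  def square(x: Int): Int = x * x

  // Sequential version: map, then reduce over the whole input.
  def sequential(xs: Seq[Int]): Int =
    xs.map(square).reduce(_ + _)

  // Partitioned version: map and reduce each chunk independently,
  // then reduce the per-chunk partial sums.
  def partitioned(xs: Seq[Int], chunkSize: Int): Int =
    xs.grouped(chunkSize)
      .map(chunk => chunk.map(square).reduce(_ + _))
      .reduce(_ + _)

  def main(args: Array[String]): Unit = {
    val xs = 1 to 1000
    println(sequential(xs))       // 333833500
    println(partitioned(xs, 100)) // 333833500
  }
}
```

Both calls agree regardless of the chunk size, which is the property a Map/Reduce framework relies on when it splits the input.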
/linkedin/scalding/status
• Started >1 year ago
• Thousands of production LOC written in Scalding by our team
  – Pretty happy with readability, maintainability and tooling support
• Dozens of flows are currently in production, and counting
• Created Scalding user group
• Growing interest
• Learning:
  – Scala[Scalding] < Scala[ _ ]
/summary
Data-Driven Product
Development
Scalding = Hadoop + Scala
/linkedin/join-us
• Work on unique and interesting problems
• Be part of a great engineering community
• Use the latest tools and technologies
• Help connect the world’s professionals to help them become more productive and successful
• We are looking for amazing people interested in Software Engineering and Data Science
  – http://linkedin.com/careers
Questions?