DataEngConf: Building Satori, a Hadoop tool for Data Extraction at LinkedIn
Uploaded by: hakka-labs
Category: Technology
Transcript of DataEngConf: Building Satori, a Hadoop tool for Data Extraction at LinkedIn
![Page 1: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/1.jpg)
Building Satori: Web Data Extraction On Hadoop
Nikolai Avteniev, Sr. Staff Software Engineer, LinkedIn
![Page 2: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/2.jpg)
Building Opportunity from the Empire State Building
LinkedIn NYC
![Page 3: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/3.jpg)
The Team
Nikita Lytkin, Staff Software Engineer
Pi-Chuan Chang, Sr. Software Engineer
David Astle, Sr. Software Engineer
Nikolai Avteniev, Sr. Staff Software Engineer
Eran Leshem, Sr. Staff Software Engineer
![Page 4: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/4.jpg)
THE ECONOMIC GRAPH
![Page 5: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/5.jpg)
Connecting talent with opportunity at massive scale
Members, Companies, Jobs, Skills, Schools, Updates
![Page 6: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/6.jpg)
What we thought we needed
The BIG Idea
Inspired by Hsieh, Jonathan M., Steven D. Gribble, and Henry M. Levy. "The Architecture and Implementation of an Extensible Web Crawler." NSDI. 2010.
![Page 7: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/7.jpg)
Questions we wanted to answer
Focused our Vision
Who would use this tool?
Do we need to crawl the entire web?
Do we need to process the pages nearline?
Where would we store this data?
How would we correct mistakes in the flow?
![Page 8: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/8.jpg)
Identity Team
![Page 9: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/9.jpg)
Virtually All Member Value Relies On Identity Data
Susan Kaplan, Sr. Marketing Manager at Weblo
SEARCH: Research & Contact
AD TARGETING: Market Products & Services
PYMK: Build Your Network
RECRUITER: Recruit & Hire
FEED: Get Daily News
NETWORK: Keep in Touch
RECOMMENDATIONS: Get a Job/Gig
WVMP: Establish Yourself as Expert
![Page 10: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/10.jpg)
Identity Use Case
A smarter way to build your profile
• Suggest 1-click profile updates to members
• Using this, we can help members easily fill in profile gaps & get credit for certificates, patents, publications…
![Page 11: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/11.jpg)
Kafka/Samza Team
![Page 12: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/12.jpg)
• Avg. HTML document is 6K; 37% are < 10K [1]
• Samza can handle 1.2M messages per second per node [2]
• Data is retained for a limited time, between 7 and 30 days
• Most of the data is filtered out
• Need to bootstrap Samza stores
Not a perfect fit
1. HTML Document Transfer size. http://httparchive.org/interesting.php?a=All&l=Oct%2015%202015#bytesHtmlDoc
2. Feng, Tao. "Benchmarking Apache Samza: 1.2 million messages per second on a single node." https://engineering.linkedin.com/performance/benchmarking-apache-samza-12-million-messages-second-single-node
![Page 13: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/13.jpg)
Help 400M members fully realize their professional identity on LinkedIn.
Find sources of professional content on the public internet.
Fetch the content, extract structured data and match it to member profiles
The Project: Satori
![Page 14: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/14.jpg)
Web Data Extraction HOW TO:
![Page 15: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/15.jpg)
• Enterprise vs. Social Web use cases
• Web Sources
• Wrappers
Web Data Extraction System
3. Ferrara, Emilio, et al. "Web data extraction, applications and techniques: A survey." Knowledge-Based Systems 70 (2014): 301-323.
![Page 16: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/16.jpg)
What is a Wrapper?
Candy Wrapper Web Wrapper
![Page 17: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/17.jpg)
• Induce wrappers based on data [4]
• Build wrappers that are robust [5]
• Cluster similar pages by URL [6]
• The web is huge and there are interesting things in the long tail [7]
Industrial Web Data Extraction
4. Dalvi, Nilesh, Ravi Kumar, and Mohamed Soliman. "Automatic wrappers for large scale web extraction." Proceedings of the VLDB Endowment 4.4 (2011): 219-230.
5. Dalvi, Nilesh, Philip Bohannon, and Fei Sha. "Robust web extraction: an approach based on a probabilistic tree-edit model." Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, 2009.
6. Blanco, Lorenzo, Nilesh Dalvi, and Ashwin Machanavajjhala. "Highly efficient algorithms for structural clustering of large websites." Proceedings of the 20th international conference on World wide web. ACM, 2011.
7. Dalvi, Nilesh, Ashwin Machanavajjhala, and Bo Pang. "An analysis of structured data on the web." Proceedings of the VLDB Endowment 5.7 (2012): 680-691.
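As a rough illustration of clustering similar pages by URL, the sketch below groups URLs that share a host and path shape. It is a deliberately simple stand-in, not the structural clustering algorithm of [6]: the `{num}` placeholder, the function names, and the example URLs are all assumptions made for this example.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def url_pattern(url):
    """Reduce a URL to host + path shape, replacing numeric segments.

    Hypothetical heuristic: purely numeric path segments are treated as
    record identifiers and replaced with the placeholder "{num}".
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    generalized = ["{num}" if s.isdigit() else s for s in segments]
    return parts.netloc.lower() + "/" + "/".join(generalized)


def cluster_by_url(urls):
    """Group URLs whose generalized patterns are identical."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_pattern(url)].append(url)
    return dict(clusters)


if __name__ == "__main__":
    example_urls = [
        "https://example.org/paper/123",
        "https://example.org/paper/456",
        "https://example.org/about",
    ]
    for pattern, members in cluster_by_url(example_urls).items():
        print(pattern, len(members))
```

Pages that fall into the same cluster are likely to share a template, which is what makes a single wrapper reusable across them.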
![Page 18: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/18.jpg)
Picking a Crawler
![Page 19: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/19.jpg)
HERITRIX powers archive.org
NUTCH powers Common Crawl
BUbiNG part of LAW
Scrapy used within LinkedIn
The Contestants
8. Web crawling, C Olston, M Najork - Foundations and Trends in Information Retrieval, 2010
9. An Introduction to Heritrix: An Open Source Archival Quality Web Crawler, A Dan, K Michele - 2004
10. BUbiNG: massive crawling for the masses, P Boldi, A Marino, M Santini, S Vigna, 2014
11. Nutch: A Flexible and Scalable Open-Source Web Search Engine. CommerceNet Labs, R Khare, D Cutting, K Sitaker, A Rifkin - 2004 - CN-TR-04-04, November
![Page 20: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/20.jpg)
And the winner is …
![Page 21: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/21.jpg)
Satori
![Page 22: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/22.jpg)
• Built on Nutch 1.9
• Runs on Hadoop 2.3
• Scheduled to run every 5 hours
• Respects robots.txt
• Default crawl delay of 5 seconds
Crawl Flow
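The politeness rules on this slide (respect robots.txt, fall back to a 5-second crawl delay) can be sketched with Python's standard `urllib.robotparser`. This is only an illustration of the idea, not how Nutch or Satori implement it; the user agent name, the example robots.txt, and the function names are assumptions.

```python
import urllib.robotparser

# Default from the slide; used when robots.txt does not set a Crawl-delay.
DEFAULT_CRAWL_DELAY_SECONDS = 5

# Hypothetical robots.txt content, inlined so the sketch runs offline.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def fetch_policy(parser, user_agent, url):
    """Return (allowed, delay_seconds) for one URL."""
    allowed = parser.can_fetch(user_agent, url)
    delay = parser.crawl_delay(user_agent)  # None if the site sets none
    if delay is None:
        delay = DEFAULT_CRAWL_DELAY_SECONDS
    return allowed, float(delay)


if __name__ == "__main__":
    parser = build_parser(EXAMPLE_ROBOTS_TXT)
    print(fetch_policy(parser, "example-bot", "https://example.com/papers/1"))
    print(fetch_policy(parser, "example-bot", "https://example.com/private/x"))
```

A crawler would sleep for the returned delay between requests to the same host and skip any URL that is not allowed.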
![Page 23: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/23.jpg)
• Output into target schema
• Apply XPath wrappers
• Wrappers are hierarchical mappings of schema field to XPath expression
• Indexed by data domain and data source
Extract Flow
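To make the wrapper idea concrete, here is a minimal sketch of a hierarchical mapping from schema fields to XPath expressions, indexed by data domain and data source. The wrapper layout, field names, `example.org` source, and sample page are all hypothetical; Satori's actual wrapper format is not shown in the slides. The sketch uses the limited XPath subset in `xml.etree.ElementTree`, so it assumes well-formed markup; real-world HTML would need a more lenient parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper: "record" selects repeating nodes, "fields" maps
# schema field names to XPath expressions relative to each record.
# A nested dict describes a nested (hierarchical) list of sub-records.
PUBLICATION_WRAPPER = {
    "record": ".//div[@class='publication']",
    "fields": {
        "title": "./h2",
        "year": "./span[@class='year']",
        "authors": {
            "record": "./ul[@class='authors']/li",
            "fields": {"name": "./span[@class='name']"},
        },
    },
}

# Wrappers indexed by (data domain, data source), as on the slide.
WRAPPERS = {("publication", "example.org"): PUBLICATION_WRAPPER}

EXAMPLE_PAGE = """
<html><body>
  <div class="publication">
    <h2>An Example Paper</h2>
    <span class="year">2015</span>
    <ul class="authors">
      <li><span class="name">A. Author</span></li>
      <li><span class="name">B. Writer</span></li>
    </ul>
  </div>
</body></html>
"""


def apply_wrapper(node, wrapper):
    """Apply a wrapper to a node and return a list of extracted records."""
    records = []
    for match in node.findall(wrapper["record"]):
        record = {}
        for field, spec in wrapper["fields"].items():
            if isinstance(spec, dict):
                # Nested wrapper: recurse relative to the current record.
                record[field] = apply_wrapper(match, spec)
            else:
                found = match.find(spec)
                has_text = found is not None and found.text
                record[field] = found.text.strip() if has_text else None
        records.append(record)
    return records


def extract(data_domain, data_source, page_xml):
    """Look up the wrapper for a domain/source pair and apply it."""
    wrapper = WRAPPERS[(data_domain, data_source)]
    return apply_wrapper(ET.fromstring(page_xml), wrapper)


if __name__ == "__main__":
    print(extract("publication", "example.org", EXAMPLE_PAGE))
```

Keeping the wrapper as data rather than code means a new source can be supported by adding one entry to the index, without changing the extraction job.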
![Page 24: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/24.jpg)
Crawl rate is bounded by the number of sites and the per-site crawl delay
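A back-of-the-envelope version of this bound: with one request per site per crawl delay, a run can fetch at most (number of sites) × (run length / crawl delay) pages. The sketch below works that out; the 100-site figure is a made-up example, and treating the full 5-hour schedule interval as crawl time is an assumption.

```python
def max_pages_per_run(num_sites, crawl_delay_seconds, run_seconds):
    """Upper bound on pages fetched in one run under per-site politeness.

    Each site can be hit at most once per crawl delay, so the bound grows
    with the number of sites, not with the size of the cluster.
    """
    pages_per_site = run_seconds // crawl_delay_seconds
    return num_sites * pages_per_site


if __name__ == "__main__":
    five_hours = 5 * 3600  # schedule interval from the Crawl Flow slide
    # Hypothetical: 100 sites at the default 5-second crawl delay.
    print(max_pages_per_run(100, 5, five_hours))
```

Adding Hadoop capacity does not raise this ceiling; only more sites or shorter crawl delays do, which is one reason to bootstrap from bulk sources.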
![Page 25: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/25.jpg)
Common Crawl: Great Source, https://commoncrawl.org/
Gobblin: Great Ingestion Framework, https://github.com/linkedin/gobblin
Bootstrap From Bulk Sources
![Page 26: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/26.jpg)
XPath extractors can be challenging on sites with rich data
![Page 27: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/27.jpg)
It is easy to exceed the Hadoop quota
![Page 28: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/28.jpg)
Match[in]
![Page 29: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/29.jpg)
Matching authors and publications to members to power profile edit experiences
![Page 30: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/30.jpg)
Overview
![Page 31: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/31.jpg)
Match using global identifiers, email or full name.
The data might not be clean after extraction
Start with a small set of data and get it to the users quickly
Start Simple
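A minimal sketch of the "start simple" matcher: try an exact match on email first, then fall back to full name. The record layout (`id`, `name`, `email`), the normalization rules, and the choice to skip ambiguous names are assumptions for illustration, not details from the talk.

```python
def normalize_name(name):
    """Lowercase and collapse whitespace, since extracted data may be messy."""
    return " ".join(name.lower().split())


def normalize_email(email):
    return email.strip().lower()


def match_authors(authors, members):
    """Return {author_id: (member_id, matched_on)} using exact matches only."""
    by_email = {}
    by_name = {}
    for member in members:
        if member.get("email"):
            by_email[normalize_email(member["email"])] = member["id"]
        by_name.setdefault(normalize_name(member["name"]), []).append(member["id"])

    matches = {}
    for author in authors:
        email = author.get("email")
        if email and normalize_email(email) in by_email:
            matches[author["id"]] = (by_email[normalize_email(email)], "email")
            continue
        candidates = by_name.get(normalize_name(author["name"]), [])
        # Assumption: skip ambiguous names rather than guess between members.
        if len(candidates) == 1:
            matches[author["id"]] = (candidates[0], "full_name")
    return matches


if __name__ == "__main__":
    members = [
        {"id": "m1", "name": "Jane Doe", "email": "jane@example.com"},
        {"id": "m2", "name": "John Smith", "email": None},
    ]
    authors = [
        {"id": "a1", "name": "J. Doe", "email": "Jane@Example.com "},
        {"id": "a2", "name": "john  smith", "email": None},
    ]
    print(match_authors(authors, members))
```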
![Page 32: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/32.jpg)
Narrow the candidates with LSH[1]
Use the simple model to generate the ground truth
Train using a simple algorithm and a few hundred features
Keep It Simple
1. https://en.wikipedia.org/wiki/Locality-sensitive_hashing
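One common way to narrow candidates with LSH is MinHash with banding: names whose character shingles overlap heavily are likely to share at least one band bucket, so only those pairs go on to the expensive model. The sketch below shows that technique under assumed parameters (16 hashes, 8 bands, 3-character shingles); the talk does not say which LSH family or settings were used.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 16  # signature length (assumed)
BANDS = 8        # 8 bands of 2 rows each (assumed)


def shingles(text, k=3):
    """Character k-grams of the lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < k:
        return {normalized}
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def _hash(seed, shingle):
    """Seeded 64-bit hash built from MD5 (used for mixing, not security)."""
    digest = hashlib.md5(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text):
    """One minimum per seeded hash function over the shingle set."""
    grams = shingles(text)
    return tuple(min(_hash(seed, g) for g in grams) for seed in range(NUM_HASHES))


def candidate_pairs(left, right):
    """Return (left_id, right_id) pairs sharing at least one band bucket.

    `left` and `right` map record ids to name strings.
    """
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(set)
    for right_id, name in right.items():
        sig = minhash_signature(name)
        for band in range(BANDS):
            buckets[(band, sig[band * rows:(band + 1) * rows])].add(right_id)

    pairs = set()
    for left_id, name in left.items():
        sig = minhash_signature(name)
        for band in range(BANDS):
            key = (band, sig[band * rows:(band + 1) * rows])
            for right_id in buckets.get(key, ()):
                pairs.add((left_id, right_id))
    return pairs


if __name__ == "__main__":
    authors = {"a1": "Jane  Doe"}
    members = {"m1": "jane doe", "m2": "Zhang Wei"}
    print(candidate_pairs(authors, members))
```

More bands with fewer rows each makes the filter more permissive; fewer bands with more rows makes it stricter, so the split is a recall versus cost trade-off.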
![Page 33: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/33.jpg)
Current Status
[Chart: Extractor Objects, Total vs. Processed, for Publications and Companies; values shown: 5.3, 2.3, 3.9, 0.6]
[Chart: Crawler Objects, Unfetched vs. Fetched vs. Gone, for Publication and Company]
![Page 34: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/34.jpg)
Target a data source which has data that will be easy to fetch, extract and match.
![Page 35: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/35.jpg)
Add tracking to the entire flow
![Page 36: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/36.jpg)
Do it all offline if you can
![Page 37: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/37.jpg)
Get the product to the customers early to validate the process and value proposition
![Page 38: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/38.jpg)
Most important of all, write it all down and share it with everyone
![Page 39: DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn](https://reader036.fdocuments.us/reader036/viewer/2022070516/5872f37f1a28ab8c718b50e9/html5/thumbnails/39.jpg)
©2014 LinkedIn Corporation. All Rights Reserved.