©2013 LinkedIn Corporation. All Rights Reserved. ENGINEERING
Hive at LinkedIn
Mohammad Islam, Mark Wagner, Karthik Ramasamy
Agenda
LinkedIn Data and its Ecosystem
Performance Improvements – Avro
User experiences
LinkedIn Data Sources
Event Data
– Page Views
– Clicks
– Search queries
Database Data
– Profile (Users & Companies)
– Connections
External Data
– Salesforce, DoubleClick
Data Ecosystem at LinkedIn
[Diagram built up over five slides; surviving labels: Member Facing Systems, Data]
Data in Hadoop
Almost all LinkedIn data is stored in Hadoop
Tools used
– Hive/HCatalog
– Pig
– Java MapReduce
– Azkaban
Hive Usage
Use-cases
– Ad-hoc queries
– Reporting
– Building Platforms
    Segmentation Engine
    Experimentation Engine
Users
– Data Scientists
– Business Analytics
– Security team
– Product team
Hive Challenges
Performance
– Faster query execution
    Efficient MR* execution plan
– Effective resource usage
– Ensure cluster stability
LinkedIn Hive Initiatives
Make HCatalog work and deploy [Ongoing]
Hive Performance Improvement (Avro data reading) [Ongoing]
Stabilize Hive Server 2 at LI [About to Start]
Expand the scope of HCatalog metadata [Planning]
HCatalog Initiatives
Expand scope of meta-data
– Who creates this data?
– What are the inputs?
    Helpful to create data lineage
– Who is the maintainer of data?
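To illustrate why recording inputs helps with data lineage, here is a minimal sketch. It does not use the HCatalog API; the `DatasetMeta` record, its field names, the `lineage` helper, and the dataset names are all hypothetical and chosen only to mirror the three questions above.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetMeta:
    """Hypothetical metadata record; not an HCatalog type."""
    name: str
    creator: str                 # who creates this data?
    maintainer: str              # who is the maintainer of the data?
    inputs: List[str] = field(default_factory=list)  # what are the inputs?


def lineage(name: str, catalog: Dict[str, DatasetMeta]) -> List[str]:
    """Return all upstream datasets of `name`, nearest first, no duplicates."""
    seen: List[str] = []
    queue = deque(catalog[name].inputs)
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.append(current)
        meta = catalog.get(current)
        if meta is not None:     # inputs without metadata end the trail
            queue.extend(meta.inputs)
    return seen


# Illustrative catalog; the dataset and team names are made up.
catalog = {
    "page_views": DatasetMeta("page_views", "tracking", "data-infra"),
    "profiles": DatasetMeta("profiles", "db-export", "data-infra"),
    "daily_engagement": DatasetMeta(
        "daily_engagement", "etl-job", "analytics",
        inputs=["page_views", "profiles"]),
    "weekly_report": DatasetMeta(
        "weekly_report", "report-job", "analytics",
        inputs=["daily_engagement"]),
}

upstream = lineage("weekly_report", catalog)
```

Once every dataset records its inputs, walking those references transitively yields the lineage; the creator and maintainer fields then say whom to contact at each step.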
What is the Problem?
Reading an Avro record takes a long time
– 52 μs/record
Found the hotspot using VisualVM
Improvement #1
Reduce the number of Schema.equals() calls
Schema equality checks are required primarily for evolved schemas
Solution includes caching to avoid unnecessary expensive calls
Results
– Trunk read overhead: 52 μs/record
– After this patch read overhead: 32 μs/record
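The caching idea can be sketched as follows. This is not the actual Hive patch and not the Avro API: the `Schema` stand-in, its `deep_equals` method, and the `SchemaEqualityCache` class are hypothetical. The sketch assumes the same pair of schema objects is compared for every record, so the expensive comparison only needs to run once per pair.

```python
class Schema:
    """Stand-in for a record schema; NOT the Avro Schema class."""
    deep_compare_calls = 0  # counts expensive comparisons, for illustration

    def __init__(self, fields):
        self.fields = tuple(fields)

    def deep_equals(self, other):
        # Models an expensive structural comparison.
        Schema.deep_compare_calls += 1
        return self.fields == other.fields


class SchemaEqualityCache:
    """Remembers the comparison result for each pair of schema objects."""

    def __init__(self):
        self._results = {}

    def same(self, a, b):
        key = (id(a), id(b))
        hit = self._results.get(key)
        if hit is None:
            # Keep references to a and b so their ids cannot be reused.
            hit = (a, b, a.deep_equals(b))
            self._results[key] = hit
        return hit[2]


# Hypothetical writer schema and an evolved reader schema.
writer_schema = Schema([("member_id", "long"), ("page", "string")])
reader_schema = Schema([("member_id", "long"), ("page", "string"),
                        ("referrer", "string")])

cache = SchemaEqualityCache()
needs_resolution = 0
for _ in range(1000):  # one check per record read
    if not cache.same(writer_schema, reader_schema):
        needs_resolution += 1
```

In this sketch the per-record check still happens, but the deep comparison runs once instead of 1000 times.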
Improvement #2
Reduce extra data transformations
Solution is to provide custom object inspectors
Results
– Current read overhead: 52 μs/record
– After this patch read overhead: 30 μs/record
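The slide does not say which transformations were removed, so the following is only a rough sketch of the general object-inspector idea: instead of copying every record into an intermediate row, the engine asks an inspector to read fields from the record as stored. The `DictRecordInspector` class, the `eager_convert` function, and the sample record are hypothetical and are not Hive's ObjectInspector interfaces.

```python
class DictRecordInspector:
    """Hypothetical inspector: reads a field directly from the stored record."""

    def __init__(self, field_names):
        self.field_names = list(field_names)

    def get_field(self, record, name):
        # No intermediate copy: look the value up in place.
        return record[name]


def eager_convert(record, field_names):
    """The path being avoided: copy every field into a new row object."""
    return [record[name] for name in field_names]


fields = ["member_id", "page", "referrer"]
record = {"member_id": 42, "page": "/jobs", "referrer": "/home"}

# A query that needs only one column.
inspector = DictRecordInspector(fields)
page_via_inspector = inspector.get_field(record, "page")

# The same value through the conversion path, which copies all three fields.
row = eager_convert(record, fields)
page_via_conversion = row[fields.index("page")]
```

Both paths return the same value; the inspector path simply skips building the intermediate row.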
Final Results
[Bar chart, axis 0 to 60; read overhead, presumably in μs/record as on the preceding slides]
– Trunk: 55
– Improvement #1: 32
– Improvement #2: 30
– Combined: 11
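As a worked example, the relative gains can be computed from the chart's values (55, 32, 30 and 11). Note that the chart shows 55 for trunk, not the 52 quoted on the earlier slides; the sketch uses the chart's figure and assumes the unit is μs/record. The `reduction_pct` helper is illustrative only.

```python
# Values read from the Final Results chart (unit assumed: μs/record).
overhead_us = {
    "Trunk": 55,
    "Improvement #1": 32,
    "Improvement #2": 30,
    "Combined": 11,
}


def reduction_pct(before, after):
    """Percentage reduction relative to the baseline, to one decimal place."""
    return round(100 * (before - after) / before, 1)


baseline = overhead_us["Trunk"]
reductions = {
    name: reduction_pct(baseline, value)
    for name, value in overhead_us.items()
    if name != "Trunk"
}
speedup = baseline / overhead_us["Combined"]
```

On these numbers the combined patches cut the per-record read overhead by 80%, a 5x improvement over trunk.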
Hive User Base at LinkedIn
Out of all our Hadoop users:
– 56% never used Hive
– 44% use Hive
– 27% primarily use Hive
32% of Hive jobs were from ad-hoc queries
Who uses Hive and who doesn't
[Chart: Hive adoption among Hadoop users by job title, axis 0% to 100%; values not recoverable]
– Data Scientists
– Engineers
– Product Managers
– Customer Support Specialists
– Analysts
Top concerns about Hive
Not friendly for long/complex workflows
Performance, especially for ad-hoc queries
Steep learning curve for tuning
Data/UDFs unavailability