Hadoop: A View from the Trenches

Milind Bhandarkar
Chief Scientist, Pivotal
Twitter: @techmilind

© Copyright 2013 Pivotal. All rights reserved.

About Me
• http://www.linkedin.com/in/milindb
• Founding member of Hadoop team at Yahoo! [2005-2010]
• Contributor to Apache Hadoop since v0.1
• Built and led Grid Solutions Team at Yahoo! [2007-2010]
• Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)
• Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems, Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and Pivotal (formerly EMC-Greenplum)

First, technology is good. Then it gets bad. Then it gets stable. - Alistair Croll (http://strata.oreilly.com/2013/01/data-warefare.html)

History (2003-2010)

Google Papers

Yahoo! Search

W-1-W
• WebMap: Graph processing for WWW
• Dreadnaught: Infrastructure for WebMap
• Juggernaut: Infrastructure for W-1-W
• JFS, JMR, Condor: Abandoned for Hadoop

Lucene, Nutch

Kryptonite

Lessons Learned
• Multi-Tenancy from ground-up
• Agility in lieu of Performance
• Provisioning vs Procurement
• “Weird” use cases as learning experience
• Academic collaboration

Who Uses Hadoop? (From Hadoop Summit 2010)

Big Data Landscape (June 2012)
http://www.forbes.com/sites/davefeinleib/2012/06/19/the-big-data-landscape/

Hadoop Ecosystem (January 2013)
http://www.datameer.com/blog/perspectives/hadoop-ecosystem-as-of-january-2013-now-an-app.html

Hadoop Economics is Game Changer

[Chart: Big Data Platform Price/TB, 2008-2013. Y-axis: $0 to $80,000. Series: Big Data DB, Hadoop.]

“Typical” Hadoop Use-Case
• “User” Modeling
• Objective: Determine User-Interests by mining user-activities
• Large dimensionality of possible user activities
• Typical user has sparse activity vector
• Event attributes change over time
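
To make the "sparse activity vector" point concrete, here is a minimal Python sketch. It is an illustration only, not something shown in the talk: the activity names in `ACTIVITY_TYPES`, the event layout, and the helper names are all hypothetical. The idea is that each user stores only the activities they actually performed, while the full space of possible activities can be very large.

```python
from collections import Counter

# Hypothetical activity vocabulary (an assumption for illustration);
# a real system would have a far larger set of dimensions.
ACTIVITY_TYPES = ["purchase", "ad_click", "like", "search", "comment", "return"]


def activity_vector(events):
    """Build a sparse vector {activity_type: count} from one user's events.

    Only activities the user actually performed appear as keys.
    """
    return dict(Counter(event["type"] for event in events))


def to_dense(sparse, vocabulary=ACTIVITY_TYPES):
    """Expand a sparse vector to a dense list; mostly zeros for a typical user."""
    return [sparse.get(name, 0) for name in vocabulary]


# Hypothetical events for a single user.
events = [
    {"user": "u1", "type": "search"},
    {"user": "u1", "type": "ad_click"},
    {"user": "u1", "type": "search"},
]
vec = activity_vector(events)
dense = to_dense(vec)
```

The sparse form grows with what a user did rather than with the size of the vocabulary, which is why it is usually the more practical representation when the dimensionality is large.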

Domain: Retail
• User = Customer
• Activities
  – Online: Purchase, Ad click, FB Likes
  – Offline: Brick-and-mortar purchases, returns, coupon clipping, gift cards
• Personalized Product Recommendation

Domain: IT Infrastructure
• “User” = HW & SW Components
• Activities
  – Log messages, Metrics, connectivity, communication events
• Goal: Proactive alerting of imminent failures

Domain: Healthcare
• User = Patient
• Activities
  – Doctor Visits, Medicine refills, Medical History
  – 3G/WiFi-enabled Pillbox...
• Goal: Prevent Hospital Readmissions

Domain: Telecom
• User: Subscriber
• Activities
  – Calls made, duration, calls dropped, locations, ...
  – “social” graph, status updates
• Goal: Reduce customer churn

Domain: Ad-Supported Web
• User = User :-)
• Activities
  – Clicks on content, Likes, Repost
  – Search Queries, Comments, Participation
• Goal: Increase Engagement, Increase Clicks on revenue-generating content (ads/premium content)

User-Modeling Pipeline
• Sessionization
• Feature and Target Generation
• Model Training
• Offline Scoring & Evaluation
• Batch Scoring & Upload to serving
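
The first stage of the pipeline above, sessionization, can be sketched in a few lines of Python. This is a minimal single-machine illustration under stated assumptions, not the speaker's implementation: the `(user, timestamp, action)` event layout, the 30-minute inactivity threshold, and the function name `sessionize` are all hypothetical. On Hadoop, the sort-and-group step would typically be done by the MapReduce shuffle keyed on the user, rather than by an in-memory sort.

```python
from itertools import groupby
from operator import itemgetter

# Assumed inactivity threshold: a gap longer than this starts a new session.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (user, timestamp, action) tuples into per-user sessions.

    Returns {user: [session, ...]}, where each session is a list of
    that user's events in timestamp order.
    """
    sessions = {}
    # Sort by user, then timestamp, so each user's events are contiguous
    # and chronologically ordered.
    ordered = sorted(events, key=itemgetter(0, 1))
    for user, user_events in groupby(ordered, key=itemgetter(0)):
        user_sessions = []
        last_ts = None
        for event in user_events:
            ts = event[1]
            # Start a new session on the first event or after a long gap.
            if last_ts is None or ts - last_ts > gap:
                user_sessions.append([])
            user_sessions[-1].append(event)
            last_ts = ts
        sessions[user] = user_sessions
    return sessions


# Hypothetical events; timestamps are seconds since an arbitrary origin.
events = [
    ("u1", 600, "click"),
    ("u1", 0, "search"),
    ("u1", 4000, "purchase"),
    ("u2", 100, "search"),
]
result = sessionize(events)
```

Here user `u1` ends up with two sessions, because the gap between the events at 600 and 4000 seconds exceeds the assumed threshold. The sessions produced this way would then feed the feature and target generation stage.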

What’s Next?

Trough of Disillusionment?

Or, Hadoop Everywhere?

Storage Wars
• HDFS
• KosmosFS, LocalFS, Quantcast FS, S3
• MapR
• GPFS, Isilon, Atmos, Swift, NetApp
• Lustre, Gluster, Ceph, PanFS, PVFS
• EMC ViPR

NoSQL = Not Yet SQL?
• Pivotal HAWQ
• Cloudera Impala
• Apache Drill, Spire (Drawn to Scale)
• Cascading Lingual, Optiq
• Hortonworks Stinger
• More to come...

Prepare for Convergence
• HPC: Cache Coherence, Prefetching, Zero-copy, Low-contention locks
• “Big Data”: Caching, Mirroring, Sharding (various flavors), relaxed consistency
• Databases: Indexing, MVCC, Columnar storage/processing, Cost-based optimization

Convergence
• Resource Allocation, Scheduling, Lifecycle Management
• Compute, Storage, and Communication isolation, Multi-tenancy, Performance SLAs
• Authentication & Authorization, Data/System Provisioning and Management, Monitoring, Metadata Management, Metering

Hadoop As A Service
• Hadoop Platform-As-A-Service
  – EMR competitor proliferation
  – OpenStack, CloudStack, Joyent...
• Application-As-A-Service (Hadoop Inside)
  – Cetas, Continuuity, Causata, Claritics, Tresata, Wibidata,…
• Pivotal One
  – CloudFoundry, Hadoop, HAWQ, Analytics
  – Spring, Redis, RabbitMQ

New Hardware Platforms
• Mellanox - Hadoop Acceleration through Network Levitated Merge
• RoCE - Brocade, Cisco, Extreme, Arista...
• ARM - Low power Hadoop servers
• SSD - Velobit, Violin, FusionIO, Samsung...
• Niche - Compression, Encryption…

IAAS as the new Hardware
• AWS, GCE, Azure
• vSphere, OpenStack
• Easy Provisioning
• Scalable
• Elastic
• Ubiquitous
• Needs bundling with Data & Analytics as Services

Big Data Platform of Future?

[Diagram: deploy to Public Cloud, Private Cloud, On Premise]

Questions?

A NEW PLATFORM FOR A NEW ERA