R on Hadoop

R ON HADOOP Kostiantyn Kudriavtsev Lviv Hadoop User Group, June 19, 2014


R on Hadoop Lviv Hadoop user group presentation, meetup #1 http://hug-lviv.blogspot.com/


Page 1: R on Hadoop

R ON HADOOP

Kostiantyn Kudriavtsev

Lviv Hadoop User Group, June 19, 2014

Page 2: R on Hadoop

Agenda

• What is R?

• Linear Regression

• R on Hadoop

• Summary

Page 3: R on Hadoop

What is R?

An object-oriented and functional language for statistics, mathematics and data science, created by statisticians, with comprehensive data visualisation and statistical modelling capabilities;

5000+ (and growing) freely available specialised algorithms for finance, economics, genomics, linguistics and so on;

2M+ users with specialised domain skills;

… but some drawbacks are:

- limited by RAM (data must fit in memory)

- single-threaded

Page 4: R on Hadoop

R development environment

RStudio is the de facto standard IDE for R development and is available in local or server mode. It can be used not only for coding but also for visualisation, and it is suitable for developing R solutions on top of Hadoop.

Page 5: R on Hadoop

What is Hadoop?

Apache Hadoop is a software framework that supports data-intensive distributed applications based on the MapReduce (MR) algorithm. Main idea: move computation to the data.

MR idea:

- Map step: Map(k1, v1) → list(k2, v2)

- Shuffle and sort, where the “magic” happens (sort by k2, data transfer between nodes, etc.)

- Reduce step: Reduce(k2, list(v2)) → (k3, v3)
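The map, shuffle and reduce steps can be imitated in plain R, without Hadoop, to see how the data flows. This is only a toy sketch: the input records, `map_fn` and `reduce_fn` are made up for illustration and are not part of any Hadoop API.

```r
# Toy, in-memory illustration of the MapReduce data flow (no Hadoop involved).
# Hypothetical input records: product_page,visited,purchased
records <- c("phone_1,10,1", "phone_2,20,3", "phone_1,30,2")

# Map step: Map(k1, v1) -> list(k2, v2)
# k1 is the record number (ignored), v1 is the CSV line.
map_fn <- function(k1, v1) {
  fields <- strsplit(v1, ",", fixed = TRUE)[[1]]
  list(list(key = fields[1], value = as.numeric(fields[3])))
}

# Reduce step: Reduce(k2, list(v2)) -> (k3, v3)
reduce_fn <- function(k2, values) {
  list(key = k2, value = sum(unlist(values)))
}

# Run the map step over every record and flatten the emitted pairs.
pairs <- do.call(c, Map(map_fn, seq_along(records), records))

# "Magic" step: group the values by key (split() also orders the keys).
keys    <- vapply(pairs, function(p) p$key, character(1))
values  <- lapply(pairs, function(p) p$value)
grouped <- split(values, keys)

# Run the reduce step once per key.
result <- Map(reduce_fn, names(grouped), grouped)
for (r in result) cat(r$key, r$value, "\n")
```

On a real cluster the grouping step is what Hadoop does between the mappers and the reducers, across nodes.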

Page 6: R on Hadoop

Linear regression

A web store might use linear regression to predict sales of goods or to discover trends.

sale(Product) ~ visitors(Product)

Linear regression might be used here:

sale = α * visitors + β
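For a single predictor, the least-squares values of α and β have a closed form: α is the covariance of visitors and sales divided by the variance of visitors, and β follows from the means. A minimal sketch, using made-up numbers rather than real store data:

```r
# Hypothetical observations for one product (not real data).
visitors <- c(100, 150, 200, 250, 300)
sale     <- c(12, 17, 22, 27, 32)   # constructed as 0.1 * visitors + 2

# Closed-form least-squares estimates for sale = alpha * visitors + beta.
alpha <- cov(visitors, sale) / var(visitors)
beta  <- mean(sale) - alpha * mean(visitors)

cat("alpha =", alpha, " beta =", beta, "\n")

# lm() should arrive at the same coefficients.
print(coef(lm(sale ~ visitors)))
```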

Page 7: R on Hadoop

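Linear regression in R

library(ggplot2)  # provides qplot() and geom_smooth()

df <- read.csv("Phone.csv", header=TRUE)

qq <- qplot(visited, purchased, colour=product_page, data=df)  # scatter plot, one colour per product page

qq + geom_smooth(method='lm', formula=y~x)  # overlay a fitted regression line per product page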

Page 8: R on Hadoop

Linear regression in R

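df.p2 <- df[df$product_page == 'phone_2', ]  # keep only the rows for the phone_2 page

m <- lm(purchased ~ visited, data=df.p2)  # fit purchased = α * visited + β

summary(m)  # coefficients, standard errors, R-squared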
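Once a model is fitted, it can be used for the prediction task mentioned earlier. The sketch below assumes a small made-up data frame in place of the phone_2 rows of Phone.csv, since the real file is not shown here; `new_traffic` is likewise hypothetical.

```r
# Hypothetical stand-in for the phone_2 rows of Phone.csv.
df.p2 <- data.frame(
  visited   = c(100, 150, 200, 250, 300),
  purchased = c(11, 18, 21, 28, 32)
)

m <- lm(purchased ~ visited, data = df.p2)

# Intercept (beta) and slope (alpha) of the fitted line.
print(coef(m))

# Expected purchases for visitor counts that have not been observed yet.
new_traffic <- data.frame(visited = c(350, 400))
forecast <- predict(m, newdata = new_traffic)
print(forecast)
```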

Page 9: R on Hadoop

R on Hadoop

Several options:

• Hadoop streaming

• RHadoop

• RHIPE

• RSpark

• Oracle R Advanced Analytics for Hadoop

• etc.

Page 10: R on Hadoop

R Hadoop streaming

Hadoop was designed mainly for Java and provides a comprehensive Java API.

Other languages can be used through the “Streaming API”. The Streaming API relies on the operating system's standard input (for reading) and standard output (for writing). It provides a lightweight API for MapReduce compared to the Java API.

Streaming requires writing two separate scripts (one for the mapper and one for the reducer) in any language (Python, Ruby, R, C#, Go, OCaml, Lisp, etc.)
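Submitting a streaming job comes down to calling `hadoop jar` with the streaming jar and the two scripts. The sketch below only builds and prints the command; the jar location, the HDFS paths and the script names `mapper.R` and `reducer.R` are placeholders, because they depend on the installation. The scripts are assumed to be executable and to start with an `Rscript` shebang line.

```r
# Build the argument list for a Hadoop Streaming job.
# The jar location differs between distributions, so it is passed in.
streaming_args <- function(streaming_jar, input, output,
                           mapper = "mapper.R", reducer = "reducer.R") {
  c("jar", streaming_jar,
    "-input", input,
    "-output", output,
    "-mapper", mapper,
    "-reducer", reducer,
    "-file", mapper,    # ship the mapper script to the worker nodes
    "-file", reducer)   # ship the reducer script to the worker nodes
}

# All paths below are hypothetical.
job_args <- streaming_args("/path/to/hadoop-streaming.jar",
                           "/data/phone", "/data/phone_out")
cat("hadoop", job_args, "\n")

# To actually submit the job (requires a Hadoop client on the PATH):
# system2("hadoop", job_args)
```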

Page 11: R on Hadoop

R Hadoop streaming

Streaming API drawbacks:

● while the inputs to the reducer are grouped by key, they are still iterated over line by line, and the boundaries between keys must be detected by the user

● no way to use different mappers in one MapReduce job

● no way to create multiple outputs from the reducer

● counters are updated through stderr

Additional disadvantage of implementing streaming in R:

• output of R functions must be strictly controlled, because they are “noisy”, while only meaningful data may be written to stdout
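Two of the points above can be sketched in R: updating a counter by writing a `reporter:counter:<group>,<counter>,<amount>` line to stderr, and keeping stray output of R functions away from stdout. The helper names `update_counter`, `quietly` and `noisy_mean` are invented for this example.

```r
# Format the line that Hadoop Streaming interprets as a counter update.
counter_line <- function(group, counter, amount = 1) {
  sprintf("reporter:counter:%s,%s,%d", group, counter, as.integer(amount))
}

# Counters are updated by writing that line to stderr, not stdout.
update_counter <- function(group, counter, amount = 1) {
  cat(counter_line(group, counter, amount), "\n", sep = "", file = stderr())
}

# Evaluate an expression while discarding anything it prints and any
# messages it raises, so that nothing unexpected reaches stdout.
quietly <- function(expr) {
  result <- NULL
  capture.output(result <- suppressMessages(expr))
  result
}

# A deliberately "noisy" function, standing in for a chatty R routine.
noisy_mean <- function(x) {
  print("starting")      # would pollute the job output
  message("computing")   # diagnostic message
  mean(x)
}

value <- quietly(noisy_mean(c(1, 2, 3)))
cat("phone_1", value, sep = "\t")   # only the meaningful record is emitted
cat("\n")
update_counter("Demo", "records_written")
```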

Page 12: R on Hadoop

R Hadoop streaming: Mapper
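The mapper code from this slide is not preserved in the transcript. As an illustration only, a streaming mapper for the regression example could look like the sketch below. It assumes input lines of the form `product_page,visited,purchased` (matching the column names used earlier) and emits `key<TAB>value` lines; the function names are hypothetical.

```r
#!/usr/bin/env Rscript
# Sketch of a streaming mapper.
# Assumed input:  product_page,visited,purchased   (CSV, one record per line)
# Output:         product_page<TAB>visited,purchased

# Turn one input line into one output record, or NULL if it is unusable.
map_line <- function(line) {
  fields <- strsplit(line, ",", fixed = TRUE)[[1]]
  if (length(fields) != 3) return(NULL)
  visited   <- suppressWarnings(as.numeric(fields[2]))
  purchased <- suppressWarnings(as.numeric(fields[3]))
  # Non-numeric rows (including a header row) are skipped.
  if (is.na(visited) || is.na(purchased)) return(NULL)
  paste(fields[1], paste(visited, purchased, sep = ","), sep = "\t")
}

# Read records line by line and write the mapped records to stdout.
run_mapper <- function(input = file("stdin"), output = stdout()) {
  if (!isOpen(input)) {
    open(input, "r")
    on.exit(close(input))
  }
  while (length(line <- readLines(input, n = 1)) > 0) {
    record <- map_line(line)
    if (!is.null(record)) writeLines(record, output)
  }
}

# In a real mapper script the last line would simply be: run_mapper()
```

Keeping the per-line logic in `map_line` makes it possible to test the mapper locally without a cluster.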

Page 13: R on Hadoop

R Hadoop streaming: Reducer
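The reducer code from this slide is not preserved in the transcript either. The sketch below is one possible reducer for the regression example, not the presenter's code. It assumes input lines of the form `product_page<TAB>visited,purchased`, already sorted by key, keeps running sums per key, and detects key boundaries itself, as streaming requires. For each key it emits the least-squares α and β of purchased = α * visited + β.

```r
#!/usr/bin/env Rscript
# Sketch of a streaming reducer.
# Assumed input (sorted by key): product_page<TAB>visited,purchased
# Output per key:                product_page<TAB>alpha,beta

# Running sums needed for a one-variable least-squares fit.
new_state <- function() c(n = 0, sx = 0, sy = 0, sxx = 0, sxy = 0)

# Compute alpha and beta from the sums; NULL if the fit is undefined.
fit_line <- function(key, s) {
  denom <- s[["n"]] * s[["sxx"]] - s[["sx"]]^2
  if (denom == 0) return(NULL)   # fewer than two distinct x values
  alpha <- (s[["n"]] * s[["sxy"]] - s[["sx"]] * s[["sy"]]) / denom
  beta  <- (s[["sy"]] - alpha * s[["sx"]]) / s[["n"]]
  paste(key, paste(alpha, beta, sep = ","), sep = "\t")
}

run_reducer <- function(input = file("stdin"), output = stdout()) {
  if (!isOpen(input)) {
    open(input, "r")
    on.exit(close(input))
  }
  current <- NULL
  state <- new_state()
  # Emit the result for the key collected so far, if there is one.
  flush_key <- function() {
    if (is.null(current)) return(invisible())
    record <- fit_line(current, state)
    if (!is.null(record)) writeLines(record, output)
  }
  while (length(line <- readLines(input, n = 1)) > 0) {
    parts <- strsplit(line, "\t", fixed = TRUE)[[1]]
    xy <- as.numeric(strsplit(parts[2], ",", fixed = TRUE)[[1]])
    if (!identical(parts[1], current)) {   # key boundary detected
      flush_key()
      current <- parts[1]
      state <- new_state()
    }
    state <- state + c(1, xy[1], xy[2], xy[1]^2, xy[1] * xy[2])
  }
  flush_key()   # do not forget the last key
}

# In a real reducer script the last line would simply be: run_reducer()
```

Accumulating sums instead of collecting all values keeps memory use constant per key, which matters given R's RAM limit.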

Page 14: R on Hadoop

RHadoop

RHadoop is a set of libraries (written in R) that aim to make it easier to use R with Hadoop streaming to develop MR jobs. As a result, it shares the general drawbacks of Hadoop streaming.

Page 15: R on Hadoop

RHadoop

RHadoop is still R through Hadoop Streaming

Advantages compared to Streaming:

● no need to manage key changes in the reducer

● no need to control function output manually

● a simple R API wraps the Streaming API

● R code can be run in a local environment or on Hadoop without changes
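As an illustration of these points, the sketch below uses the `rmr2` package from the RHadoop project to sum purchases per product page. It assumes `rmr2` is installed (it is distributed separately from CRAN), and the data frame and the `sales_by_page` helper are made up. With `backend = "local"` the job runs without a cluster; the same code is meant to run on Hadoop with `backend = "hadoop"`.

```r
# Sum purchases per product page with rmr2 (sketch, hypothetical data).
sales_by_page <- function(df, backend = "local") {
  library(rmr2)
  rmr.options(backend = backend)   # "local" for testing, "hadoop" for a cluster
  job <- mapreduce(
    input  = to.dfs(df),
    # The map function receives a chunk of rows as a data frame in v.
    map    = function(k, v) keyval(v$product_page, v$purchased),
    # The reduce function receives one key and all of its values;
    # key boundaries are handled by the library.
    reduce = function(k, vv) keyval(k, sum(vv))
  )
  from.dfs(job)
}

# Run the example only when rmr2 is available.
if (requireNamespace("rmr2", quietly = TRUE)) {
  df <- data.frame(product_page = c("phone_1", "phone_2", "phone_1"),
                   purchased = c(1, 3, 2),
                   stringsAsFactors = FALSE)
  print(sales_by_page(df))
}
```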

Page 16: R on Hadoop

Demo time

Page 17: R on Hadoop

R on Hadoop in Real Life

Several steps are required to achieve the goal:

1. Data ingestion

2. Data preparation

3. R processing

4. Postprocessing

Page 18: R on Hadoop

Lessons Learned

R is slow… for millions of calculations it is slow even with Hadoop!

How to improve the speed? Rewrite the flow: do as much preprocessing as possible before the R step.

Hadoop streaming supports mappers and reducers written in different languages.

Think twice. R is great for exploratory analysis and research, but in production it might incur a performance penalty.

Page 19: R on Hadoop

Q&A

• Thank you for your attention