Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013
![Page 1: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/1.jpg)
Workshop – Hadoop + R
Carlos Gil Bellosta
![Page 2: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/2.jpg)
Big Data Analytics
Carlos J. Gil Bellosta
Intro to Hadoop & R
All about Hadoop
Hadoop FS
Hadoop & mapreduce
All about R
Counting (& Graphics)
Graphics & big data
Let’s count... hexagons
Details of mapreduce
Scoring, sampling & simulating
Data modelling
Linear Regression
Logistic Regression
Trees & Random Forests
Final remarks
Big Data Analytics
R & Hadoop
Carlos J. Gil Bellosta
November 2013
![Page 3: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/3.jpg)
Table of Contents
1 Intro to Hadoop & R
  • All about Hadoop
  • Hadoop FS
  • Hadoop & mapreduce
  • All about R
2 Counting (& Graphics)
3 Details of mapreduce
4 Scoring, sampling & simulating
5 Data modelling
6 Final remarks
![Page 4: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/4.jpg)
File system: manages everything about files
• Examples: diskettes, hard disks, RAIDs,... magnetic tapes!
• Combination of hardware and software that hides boring activities from users:
  • Find space to write the files
  • Read/write files
  • Manage fragmentation
  • Etc.
• How many devices per FS?
  • 1-to-1: diskettes, CD-ROMs, HDDs,...
  • n-to-1: partitioned HDDs,...
  • 1-to-n: RAIDs, Hadoop
![Page 5: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/5.jpg)
Hadoop goodies (as a FS)
• Splits (large) files into chunks spread across machines
• Replicates each chunk (3 copies by default)
• Balances data
• Robust to hardware failures
• It is rack aware
Obviously, it requires some system to keep track of:
• Which servers/racks are up/down
• Where each chunk is located
• ...
![Page 6: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/6.jpg)
How to work with data in Hadoop?
• Provides a shell (ls, cp, etc.)
• You can put/get data between your local FS and the Hadoop FS
• That is:
  • You can dump your data to your local machine
  • You can run your programs on your local machine
  • You can put results back into Hadoop
• But what if the file is too large?
Solution
Rather than bringing the data to the code, why not move the code to the data?
One of the ways to move code to data is known as mapreduce.
![Page 7: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/7.jpg)
Mapreduce
• Two-step process:
  • Map: run your code on chunks all over the cluster
  • Reduce: reshape the output into the desired format
• Hadoop manages issues:
  • System failures
  • Threads that do not return
  • And all (?) that made the life of OpenMP, MPI, etc. users miserable
• Slotted approach: mapreduce provides slots where you put the mappers/reducers code
• The code is for you to provide!
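The two steps can be mimicked on a single machine in plain R. The sketch below is only an illustration, not Hadoop code: the chunks live in an ordinary list, `map_chunk` and `reduce_pair` are names made up for this example, and `lapply`/`Reduce` stand in for the work Hadoop would distribute across machines.

```r
# Toy data, split into 4 "chunks" as a distributed file system would do
x <- 1:100
chunks <- split(x, rep(1:4, each = 25))

# Map: runs independently on every chunk and returns a small summary
map_chunk <- function(chunk) c(sum = sum(chunk), n = length(chunk))

# Reduce: combines two partial summaries into one
reduce_pair <- function(a, b) a + b

partials <- lapply(chunks, map_chunk)   # the "map" step
total <- Reduce(reduce_pair, partials)  # the "reduce" step

# The global mean is recovered from the pooled summaries only
global_mean <- unname(total["sum"] / total["n"])
```

The mapper never sees more than one chunk, and the reducer only handles summaries, which is what lets the same pattern scale to files that do not fit on one machine.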
![Page 8: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/8.jpg)
What is R?
• R is a
  • software package?
  • programming language?
  • environment?
  for data analysis and graphics.
• R users are (should be?) used to the mapreduce approach:
    library(plyr)  # provides ddply, .() and summarize
    # split dfx by group and sex, then summarize each piece
    ddply(dfx, .(group, sex), summarize,
          mean = mean(age),
          sd = sd(age))
![Page 9: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/9.jpg)
2 Counting (& Graphics)
  • Graphics & big data
  • Let’s count... hexagons
![Page 10: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/10.jpg)
Visualizing a million
![Page 11: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/11.jpg)
Fluctuation plot
![Page 12: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/12.jpg)
Table plot
![Page 13: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/13.jpg)
• Non-trivial counting exercise (no, we are not counting words today!)
• Good visualization features for big datasets
• Fits in the mapreduce framework:
  • Map: assigns points to hexagons
  • Reduce: aggregates counts on hexagons
  • The output is small and can be plotted locally
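A minimal local sketch of the hexagon count, in base R. It is not the workshop's code: `assign_hexagon`, `map_hex` and `reduce_hex` are hypothetical names, the chunks are simulated with `split`, and the binning rule (nearest centre over two interleaved rectangular grids) is one standard way to bin into regular hexagons, assumed here for illustration.

```r
# Assign each point to the centre of the hexagon that contains it.
# Hexagon centres form two interleaved rectangular grids; the nearest
# centre over both grids identifies the hexagon.
assign_hexagon <- function(x, y, width = 1) {
  dy <- width * sqrt(3)                       # row spacing within one grid
  x1 <- round(x / width) * width              # nearest centre, first grid
  y1 <- round(y / dy) * dy
  x2 <- (round(x / width - 0.5) + 0.5) * width  # nearest centre, shifted grid
  y2 <- (round(y / dy - 0.5) + 0.5) * dy
  use_first <- (x - x1)^2 + (y - y1)^2 <= (x - x2)^2 + (y - y2)^2
  data.frame(hx = ifelse(use_first, x1, x2),
             hy = ifelse(use_first, y1, y2))
}

# Map: assign the points of one chunk to hexagons and count them locally
map_hex <- function(chunk) {
  hex <- assign_hexagon(chunk$x, chunk$y)
  hex$count <- 1L
  aggregate(count ~ hx + hy, data = hex, FUN = sum)
}

# Reduce: add up the per-chunk counts of each hexagon
reduce_hex <- function(partials) {
  aggregate(count ~ hx + hy, data = do.call(rbind, partials), FUN = sum)
}

set.seed(1)
points <- data.frame(x = rnorm(10000), y = rnorm(10000))
chunks <- split(points, rep(1:10, each = 1000))
hex_counts <- reduce_hex(lapply(chunks, map_hex))
```

`hex_counts` has one row per non-empty hexagon, so it is far smaller than the input and can be drawn locally, for instance with `symbols()` or any plotting package.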
![Page 14: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/14.jpg)
3 Details of mapreduce
![Page 15: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/15.jpg)
What you see: input/output, map, reduce
• input:
  • Type: text, csv, R object,...
  • Options: separator,...
• output: similar to input
• map & reduce:
  • Functions with a (k,v) argument (k, key; v, value)
  • They return a (k,v) list
  • Thus, mapreduces can be chained together (the output of the first one is the input for the second)
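The (k,v) contract and the chaining can be illustrated with a tiny local stand-in. This is a sketch under stated assumptions, not the interface of any Hadoop package: `kv` and `local_mapreduce` are names invented for this example, and everything runs in a single R session.

```r
# A key-value collection is a list with parallel vectors of keys and values
kv <- function(k, v) list(key = k, val = v)

# Hypothetical local stand-in for a mapreduce job:
# map sees the input (k, v); reduce is called once per distinct key
local_mapreduce <- function(input, map, reduce) {
  mapped <- map(input$key, input$val)
  groups <- split(mapped$val, mapped$key)   # group values by key
  reduced <- lapply(names(groups), function(k) reduce(k, groups[[k]]))
  kv(sapply(reduced, function(p) p$key),
     sapply(reduced, function(p) p$val))
}

input <- kv(NULL, 1:20)

# Job 1: key each number by its remainder modulo 3, then sum within each key
job1 <- local_mapreduce(
  input,
  map    = function(k, v) kv(v %% 3, v),
  reduce = function(k, vv) kv(k, sum(vv))
)

# Job 2: takes the output of job 1 as its input and adds everything up
job2 <- local_mapreduce(
  job1,
  map    = function(k, v) kv(rep("total", length(v)), v),
  reduce = function(k, vv) kv(k, sum(vv))
)
```

Because both the input and the output of a job are (k,v) collections, `job1` can be passed straight into the second call. Note that in this toy version keys come back as character strings after grouping.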
![Page 16: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/16.jpg)
What you don’t see
$HADOOP jar $HADOOP_STREAMING -D stream.map.input=typedbytes
-D stream.map.output=typedbytes
-D stream.reduce.input=typedbytes
-D stream.reduce.output=typedbytes
-D mapred.reduce.tasks=0
-input /tmp/RtmpUUrNMj/file68c0185e60c
-output /tmp/RtmpUUrNMj/file68c04c25d5f0
-mapper \"Rscript rmr-streaming-map68c018acf680 \"
-file /tmp/RtmpUUrNMj/rmr-local-env68c0101c8e8a
-file /tmp/RtmpUUrNMj/rmr-global-env68c03abb4080
-file /tmp/RtmpUUrNMj/rmr-streaming-map68c018acf680
-inputformat org.apache.hadoop.streaming.AutoInputFormat
-outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat 2>&1
![Page 17: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/17.jpg)
4 Scoring, sampling & simulating
![Page 18: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/18.jpg)
Scoring
• External consultants build a model (using R and small data)
• Models in R should have a predict method
• You can then score your huge database (in batch)
• No need to rewrite the model into your systems!
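As a rough local illustration of batch scoring: a model is fitted once on a small sample, and its `predict` method is then applied chunk by chunk, as a mapper would do. The data are simulated, and `score_chunk` is a hypothetical helper, not part of any package.

```r
set.seed(42)
# Small sample used to build the model (done once, on one machine)
small <- data.frame(x = runif(200))
small$y <- 1 + 2 * small$x + rnorm(200, sd = 0.1)
model <- lm(y ~ x, data = small)

# "Huge" table to score, processed chunk by chunk as mappers would do
big <- data.frame(id = 1:10000, x = runif(10000))
chunks <- split(big, ceiling(big$id / 1000))

# Map: score one chunk with the fitted model; no reduce step is needed
score_chunk <- function(chunk, fit) {
  data.frame(id = chunk$id, score = predict(fit, newdata = chunk))
}

scores <- do.call(rbind, lapply(chunks, score_chunk, fit = model))
```

The only thing shipped to the data is the fitted `model` object, which is small regardless of the size of the table being scored.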
![Page 19: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/19.jpg)
The case for sampling
• Sampling works!
• Sampled datasets can be used to build small data models
• You can use R (& mapreduce) to sample data, but you had better not
![Page 20: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/20.jpg)
Running simulations on Hadoop
• Some (many?) people say it is not the right tool
• Mapreduce needs input data, but simulations often have none
• You want to control the number of mappers (which run your simulations)
• Still mapreduce is nice for simulations...
• ... so let an old dog try its dirty trick!
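The slide does not spell out the trick. One common workaround, assumed here and not necessarily the speaker's, is to feed mapreduce a dummy input with one record per simulation task, so that the number of records controls how many simulations run. The local sketch below uses seeds as the dummy records; `simulate_task` is a hypothetical name and the Monte Carlo estimate of pi is just a placeholder simulation.

```r
# Dummy input: one record per simulation task; the value is only a seed
seeds <- 1:8

# Map: each task runs its own simulation (a Monte Carlo estimate of pi)
simulate_task <- function(seed, n = 10000) {
  set.seed(seed)                 # makes every task reproducible
  x <- runif(n)
  y <- runif(n)
  c(hits = sum(x^2 + y^2 <= 1), n = n)
}

# Reduce: pool the hit counts from all tasks
partials <- lapply(seeds, simulate_task)
pooled <- Reduce(`+`, partials)
pi_estimate <- unname(4 * pooled["hits"] / pooled["n"])
```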
![Page 21: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/21.jpg)
5 Data modelling
  • Linear Regression
  • Logistic Regression
  • Trees & Random Forests
![Page 22: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/22.jpg)
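Linear regression can be parallelized
Simple linear regression: y ∼ α + βx
$$
\beta
= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
= \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} x_i \sum_{j=1}^{n} y_j}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2}
$$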
Operations are case by case!
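Since every term in the formula is a sum over cases, each chunk can compute its own partial sums and the reducer only has to add them. A minimal local sketch with simulated data follows; `map_sums` is a hypothetical name, and the intercept is recovered as α = ȳ − β x̄, which is the standard least-squares result rather than something stated on the slide.

```r
set.seed(1)
n <- 1000
dat <- data.frame(x = rnorm(n))
dat$y <- 3 + 2 * dat$x + rnorm(n)
chunks <- split(dat, rep(1:10, each = 100))

# Map: per-chunk sums that appear in the formula for beta
map_sums <- function(d) {
  c(n = nrow(d), sx = sum(d$x), sy = sum(d$y),
    sxy = sum(d$x * d$y), sxx = sum(d$x^2))
}

# Reduce: partial sums simply add up
s <- Reduce(`+`, lapply(chunks, map_sums))

beta  <- (s[["sxy"]] - s[["sx"]] * s[["sy"]] / s[["n"]]) /
         (s[["sxx"]] - s[["sx"]]^2 / s[["n"]])
alpha <- s[["sy"]] / s[["n"]] - beta * s[["sx"]] / s[["n"]]
```

Only five numbers per chunk travel to the reducer, and `c(alpha, beta)` should agree with `coef(lm(y ~ x, data = dat))` up to rounding.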
![Page 23: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/23.jpg)
Multiple linear regression
• Based on $X'X$ and $X'y$:
$$
\beta = (X'X)^{-1} X'y
$$
• If $X' = [X_1 | \dots | X_n]$ (by blocks), then $X'X = \sum_i X_i X_i'$.
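The block decomposition can be checked locally. In the sketch below, which uses simulated data and hypothetical helper names (`map_block`, `add_blocks`), each block holds some rows of X (that is, $X_i'$ in the slide's notation), the map step computes its cross-products, and the reduce step adds them. The same additivity is assumed for $X'y$, which follows from the same block argument.

```r
set.seed(2)
n <- 1000
X <- cbind(1, matrix(rnorm(n * 3), ncol = 3))   # intercept + 3 predictors
y <- drop(X %*% c(1, -2, 0.5, 3) + rnorm(n))

# Row blocks of X and y, as they would sit on different nodes
block_id <- rep(1:5, each = n / 5)
blocks <- lapply(1:5, function(b) {
  list(X = X[block_id == b, , drop = FALSE], y = y[block_id == b])
})

# Map: each block contributes its own cross-products
map_block <- function(b) {
  list(XtX = crossprod(b$X), Xty = crossprod(b$X, b$y))
}

# Reduce: cross-products add up across blocks
add_blocks <- function(a, b) {
  list(XtX = a$XtX + b$XtX, Xty = a$Xty + b$Xty)
}
total <- Reduce(add_blocks, lapply(blocks, map_block))

# Solve the normal equations with the pooled cross-products
beta <- drop(solve(total$XtX, total$Xty))
```

With p predictors, each block ships a p × p matrix and a vector of length p, independently of how many rows it holds.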
![Page 24: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/24.jpg)
Can logistic regression be parallelized? Yes and no.
• Fitting logistic regression models is iterative, and the iterations cannot run in parallel with each other (each one needs the result of the previous one).
• However, each iteration can be parallelized (these are not unlike fitting linear models as before)
• We will explore two big data alternatives:
  • Parallelize iterations using mapreduce (see http://goo.gl/ftx36r)
  • Split your data meaningfully and do standard logistic regression in the nodes
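The first alternative can be sketched locally. The slide does not name an algorithm; the example below assumes Newton-Raphson, the usual choice for this model, in which every iteration needs a gradient and a Hessian that are sums over cases. Data are simulated and `map_block` and `add_blocks` are hypothetical names. The second alternative (separate models on meaningful splits) is not shown.

```r
set.seed(3)
n <- 2000
X <- cbind(1, rnorm(n), rnorm(n))
y <- rbinom(n, 1, plogis(drop(X %*% c(-0.5, 1, 2))))
block_id <- rep(1:4, each = n / 4)

# Map: gradient and Hessian contributions of one block at the current beta
map_block <- function(Xb, yb, beta) {
  mu <- plogis(drop(Xb %*% beta))            # fitted probabilities
  list(grad = crossprod(Xb, yb - mu),
       hess = crossprod(Xb, Xb * (mu * (1 - mu))))
}

# Reduce: contributions add up across blocks
add_blocks <- function(a, b) {
  list(grad = a$grad + b$grad, hess = a$hess + b$hess)
}

beta <- rep(0, ncol(X))
for (iter in 1:25) {                   # iterations run one after another...
  parts <- lapply(1:4, function(b) {   # ...but each one is a map + reduce
    map_block(X[block_id == b, , drop = FALSE], y[block_id == b], beta)
  })
  total <- Reduce(add_blocks, parts)
  step <- drop(solve(total$hess, total$grad))
  beta <- beta + step
  if (max(abs(step)) < 1e-10) break    # stop once the update is negligible
}
```

Each pass over the data corresponds to one mapreduce job, so the cost of the approach is driven by the number of iterations needed to converge.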
![Page 25: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/25.jpg)
How many bytes make knowledge? (aka the fractal nature of big data)
![Page 26: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/26.jpg)
Split logistic regression
![Page 27: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/27.jpg)
Viable alternatives to logistic models
• Trees
  • High interpretability
  • But unstable and tend to miss out details
• Random forests
  • Black boxes
  • Superb performance
  • These are collections of trees that can be built in parallel
• Both can be parallelized in different ways:
  • Similar to partitioned logistic models above
  • Within training
![Page 28: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/28.jpg)
6 Final remarks
![Page 29: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/29.jpg)
Forget most of what you learned today, seriously
• People strive to extend small data models to big data (as we did today)...
• ... but is it the way to go?
• Watch out: microlocal structure
  • Small data people know microlocal structure as outliers
  • Global models (linear, logistic,...) cannot (easily?) exploit microlocal structure
  • But the promises of big data lie precisely there
  • (Otherwise, just sample and you will be fine)
• Areas to watch for insights on big data modelling:
  • SNA (social network analysis)
  • Text analysis
![Page 30: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/30.jpg)
Thank you very much and...
... questions?
![Page 31: Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013](https://reader034.fdocuments.us/reader034/viewer/2022042814/554a40d3b4c9055a408b4e90/html5/thumbnails/31.jpg)