RANDOM FORESTS
R vs PYTHON
R & PYTHON
Having fun when starting out in data analysis
WHO
LINDA URUCHURTU
@lindauruchurtu
Consultant at DBi Web Analytics & Data Consultancy
Physicist by training
OUTLINE OF THIS TALK
• Motivation
• Random Forests: R & Python
• Example: EMI music set
• Concluding remarks
MOTIVATION
STARTING OUT IN DATA ANALYSIS
• Online: blogs, GitHub, MOOCs, Kaggle, Data Tau, Cross Validated, Stack Overflow...
• Books
• School work
TOO MANY RESOURCES
WHICH LANGUAGE SHOULD I USE?
POPULAR QUESTION
LET’S ASK GOOGLE
• Programmed in C
• Used MATLAB at Uni
• Spent a long time playing with symbolic langs Mathematica & Maple
START BY WHAT YOU KNOW & ASK YOUR FRIENDS
MY EXPERIENCE
P.S. I had not met the iPython notebook.
BIG REVEAL: I AM AN AVID R USER
MY EXPERIENCE (cont)
• Don’t have a web dev background
• Surrounded by people doing Stats
• Pick the right tool for the task at hand
TL;DR - CAN BE CONFUSING FOR A NEWBIE
LANGUAGE WARS
Too many articles about:
• “Python Displacing R As The Programming Language For Data Analysis”
• “Is Python really supplanting R for data work?”
• “10 Reasons Python Rocks for Research”
• “Why Python is steadily eating other languages' lunch”
• “Why I’m betting on Julia”
• “What are the advantages of using Python over R?”
• “Why Python with Coffee is better than R with Ice Cream”
[FAVE LANG] is BETTER BECAUSE I SAY SO
LANGUAGE WARS
However, it is good to have a general understanding of the + and - of the various data analysis tools, in order to pick the right tool for the job.
• R has EVERYTHING you need for performing statistical analysis.
• R / MATLAB / Python are great for prototyping
• Python is a full featured programming language
• Easier to incorporate Python outcomes into a full data product workflow
DEFINE THE PROBLEM
Time is better spent defining the problem and determining the best way to solve it
GOOD TO HAVE A BIG BAG OF TRICKS
Re-do the R analysis using the Python data analysis stack
WILL IT PYTHON? CREDIT: SLENDER MEANS
PYTHON SCIKIT LEARN
IT IS PRETTY AWESOME
• Library of Machine Learning Algorithms
• Open source
• API
• Python, Numpy & Co
• Accessible, many models, documentation & examples
EXAMPLE
CHOOSING A PROBLEM
1 Always a good idea to look for a data set that is interesting to you.
2 Formulate a question
3 Formulate an hypothesis
4 Build a model to answer the question and test
SCIENTIFIC METHOD FTW
CHOOSING A DATA SET
STEP 1
EMI MUSIC “ONE MILLION INTERVIEW SET”
• One of the largest preference data sets in the world.
• Extract used in the Data Science London hackathon and available in KAGGLE as four separate data sets.
FOUR DATA SETS
• TRAIN / TEST - artist, track, userID, time & ratings
• WORDS - userID, heard_of, own_artist_music, like_artist, 82 adjectives
• USERS - userID, gender, age, working status, region, music, list_own (hours per day), list_back (hours per day), 19 user habits questions (0-100)
USERS
KEY  STRING
1    “Music is important to me but not necessarily most important”
2    “I like music but it does not feature heavily in my life”
3    “Music means a lot to me and it is a passion of mine”
4    “Music has no particular interest to me”
5    “Music is important to me but not necessarily more important than other hobbies”
6    “Music is no longer as important as it used to be”
WORDS DATASET
UNINSPIRED, AGGRESSIVE, UNATTRACTIVE, BORING, CHEAP, IRRELEVANT, WAY OUT, ANNOYING, CHEESY, UNORIGINAL, OUTDATED, UNAPPROACHABLE...
82 ADJECTIVES
WHOLESOME, LEGENDARY, OLD, PIONEER, DARK, WORLDLY, NOSTALGIC, PROGRESSIVE, ICONIC
USERS
19 MUSIC HABIT QUESTIONS: Rate (0-100) whether the user agrees with the statements:
“I enjoy actively searching for and discovering music that I have never heard before”
“I am not willing to pay for music”
“I like to be at the cutting edge of new music”
“I love tech”
FORMULATE A QUESTION
STEP 2
MOTIVATION
• PRODUCTION - Cheaper to produce (lower barriers to entry for budding artists).
• DISTRIBUTION - Internet has made music more accessible. Artists can decide where and how to sell.
• CONSUMPTION - People’s listening habits have changed due to the internet and to the change in devices.
TECHNOLOGY HAS BEEN A DISRUPTIVE FORCE IN THE MUSIC INDUSTRY.
PROBLEMS
• ARTISTS - Easier to produce music, harder to make themselves known or earn a living.
• RECORD COMPANIES - People buy per song, easy for listener to consume without paying. Wider competition field.
• LISTENERS - Too many choices. Discovery is difficult.
QUESTIONS
• Can one predict the rating of a song?
• What factors are important to determine how much a person likes a song?
• What is the minimal set of factors that are needed to determine how much a person likes a song?
FORMULATE AN HYPOTHESIS
STEP 3
FIRST ATTEMPT
• Regression problem
• Turn categorical variables into numeric variables
• Consider ALL features and pick a machine learning algorithm to do the job.
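The step of turning categorical variables into numeric ones can be sketched in Python as below. This is only an illustration under an assumption: the slides do not say which encoding was used, so simple integer coding is shown, and the function name and the sample values are hypothetical.

```python
def encode_categories(values):
    """Map each distinct category to an integer code, in order of first appearance."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes


# Hypothetical values for a categorical column such as working status.
working_status = ["Employed", "Student", "Employed", "Retired"]
encoded, mapping = encode_categories(working_status)
print(encoded)   # [0, 1, 0, 2]
print(mapping)
```

Integer codes impose an arbitrary order on the categories, which tree-based models usually tolerate better than linear models do.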
CAN ONE PREDICT THE RATING OF A SONG?
FIRST ATTEMPT
• Because exploratory analysis revealed that ratings are highly clustered, we can look at five different scores and formulate the problem as a classification one.
CAN ONE PREDICT THE RATING OF A SONG?
We split the ratings (0-100) into 5 intervals, so each becomes a class, and we label these.
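A minimal Python sketch of that split, assuming five equal-width intervals of 20 points each. The slides do not give the actual boundaries, so the cut points and the function name are assumptions.

```python
def rating_to_class(rating, n_classes=5, max_rating=100):
    """Return a class label 1..n_classes for a rating in [0, max_rating]."""
    if not 0 <= rating <= max_rating:
        raise ValueError("rating out of range")
    width = max_rating / n_classes  # 20 points per class with the defaults
    # The top rating falls in the last class rather than opening a new one.
    return min(int(rating // width) + 1, n_classes)


ratings = [0, 19, 20, 55, 99, 100]
labels = [rating_to_class(r) for r in ratings]
print(labels)  # [1, 1, 2, 3, 5, 5]
```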
BUILD A MODEL
STEP 4
RANDOM FORESTS
• Random Forests are built by aggregating trees.
• Can be used for regression & classification problems.
• They do not overfit and can handle a large number of features
• They also output a list of features that are believed to be important in predicting the variable
Highly versatile ensemble method - combines several models into one.
A.K.A. BEST “BLACK-BOX” METHOD EVER (BREIMAN / CUTLER)
RANDOM FORESTS
THE LAYMAN’S INTRO (E. CHEN’s BLOG - 2011)
MOVIES
20 QUESTIONS
WILL JAMIE LIKE X?
BRIENNE IS THE DECISION TREE FOR JAMIE’S MOVIES PREFERENCES
RANDOM FORESTS
THE LAYMAN’S INTRO (E. CHEN’s BLOG - 2011)
Ask Tywin, Cersei, Tyrion... Jamie gives each of them slightly different info.
THEY FORM A BAGGED FOREST OF JAMIE’S MOVIES PREFERENCES
Jamie demands getting different questions every time.
THEY NOW FORM A RANDOM FOREST OF JAMIE’S MOVIES PREFERENCES
RANDOM FORESTS
• A tree of maximal depth is grown on a bootstrap sample of size n drawn from the training set. There is no pruning.
• A number m << p is specified such that at each node, m variables are sampled at random out of the p available. The best split on these m variables is used to split the node into two subnodes.
• Final classification is given by majority voting of the ensemble of trees in the forest.
• Only two “free” parameters: number of trees and number of variables in the random subset at each node.
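The three ingredients in the bullets above (bootstrap sample, random subset of variables at each node, majority vote) can be sketched in plain Python. This is an illustration only, not the randomForest or scikit-learn code: tree growing is left out, the function names are hypothetical, and the bootstrap sample is assumed to be as large as the training set.

```python
import random
from collections import Counter


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def candidate_features(p, m, rng):
    """Pick m of the p feature indices at random for one node split."""
    return rng.sample(range(p), m)


def majority_vote(tree_predictions):
    """Combine the class predicted by each tree into the forest's answer."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)
rows = bootstrap_indices(10, rng)
features = candidate_features(p=9, m=3, rng=rng)
print(rows, features)
winner = majority_vote([3, 1, 3, 2, 3])
print(winner)  # 3
```

Sampling rows with replacement gives each tree a different view of the data, and sampling features at each node decorrelates the trees further.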
RANDOM FORESTS
OUT-OF-BAG (OOB) ERROR
For each tree, the observations left out of its bootstrap sample become a test set. The OOB error estimate is given by the misclassification error (MSE for regression), averaged over all samples.
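A rough Python sketch of the OOB idea for classification. The votes are made up for illustration and the function names are hypothetical. Each row is scored only by trees that did not see it during training.

```python
import random
from collections import Counter


def out_of_bag_rows(n_rows, rng):
    """Return (in_bag, out_of_bag) row indices for one tree's bootstrap sample."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag


def oob_error_rate(true_labels, oob_votes):
    """Misclassification rate using only votes from trees that never saw the row.

    oob_votes[i] lists the classes predicted for row i by the trees for which
    row i was out of bag; rows with no such tree are skipped.
    """
    scored = wrong = 0
    for truth, votes in zip(true_labels, oob_votes):
        if not votes:
            continue
        scored += 1
        if Counter(votes).most_common(1)[0][0] != truth:
            wrong += 1
    if scored == 0:
        raise ValueError("no row has an out-of-bag vote")
    return wrong / scored


in_bag, out_of_bag = out_of_bag_rows(10, random.Random(1))
true_labels = [1, 2, 3, 1]
oob_votes = [[1, 1, 2], [3, 3], [], [1]]  # made-up votes, third row never out of bag
error = oob_error_rate(true_labels, oob_votes)
print(out_of_bag, error)
```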
VARIABLE IMPORTANCE
Determined by looking at how much prediction error increases when (OOB) data for that variable is permuted while all others are left unchanged.
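The permutation idea can be sketched in Python as follows. This is a simplified illustration: a stand-in prediction function replaces the forest, held-out rows replace the per-tree OOB data, and all names and values are hypothetical.

```python
import random


def permutation_importance(predict, rows, targets, column, rng):
    """Increase in mean squared error after shuffling one column of the rows."""
    def mse(data):
        return sum((predict(r) - t) ** 2 for r, t in zip(data, targets)) / len(targets)

    baseline = mse(rows)
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    # Rebuild each row with the shuffled value in place of the original one.
    permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
    return mse(permuted) - baseline


def predict(row):
    """Stand-in model that uses only the first column."""
    return 2 * row[0]


rows = [[1, 5], [2, 7], [3, 1], [4, 9]]
targets = [2, 4, 6, 8]
rng = random.Random(0)
used = permutation_importance(predict, rows, targets, column=0, rng=rng)
ignored = permutation_importance(predict, rows, targets, column=1, rng=rng)
print(used, ignored)
```

Shuffling a column the model ignores leaves the error unchanged, so its importance is zero, while shuffling a column the model relies on can only keep or raise the error here.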
RANDOM FORESTS IN R & PYTHON
randomForest PACKAGE
• Various implementations - randomForest, CARET, PARTY, BIGRF
• We follow the KISS procedure - KEEP IT SIMPLE S.
• One can test various values of mtry and the number of trees.
Used randomForest package 4.6-7 with R 2.15. Defaults are ntree = 500 trees & mtry = p/3 for regression & sqrt(p) for classification.
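As a small worked example of those defaults, written in Python for illustration: the helper name and the value p = 100 are hypothetical, and the rounding down mirrors how the randomForest package computes its default mtry.

```python
import math


def default_mtry(p, task):
    """Default number of candidate variables per split for p predictors."""
    if task == "regression":
        return max(p // 3, 1)
    if task == "classification":
        return math.floor(math.sqrt(p))
    raise ValueError("task must be 'regression' or 'classification'")


print(default_mtry(100, "regression"))      # 33
print(default_mtry(100, "classification"))  # 10
```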
RANDOM FORESTS IN R & PYTHON
SCIKIT LEARN
Used SCIKIT LEARN 0.14.1 running Python version 2.7.5.
COMPUTER: Macbook Pro 2.53 GHz Intel Core 2 Duo with 4 GB 1067 MHz DDR3 running OS X 10.6.8
For the comparison we will build “small” forests and focus on the following simple metrics:
• Training Time
• RSQ & RMSE (Regression)
• Accuracy (Classification)
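The regression and classification metrics named on this slide can be written out in a few lines of Python. This is a sketch with made-up numbers, and it assumes RSQ means the usual coefficient of determination, which the slides do not spell out.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """1 minus the ratio of residual to total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def accuracy(actual, predicted):
    """Fraction of labels predicted exactly."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


ratings = [10, 30, 50, 70]       # made-up true ratings
estimates = [20, 30, 40, 70]     # made-up predictions
print(rmse(ratings, estimates), r_squared(ratings, estimates))
print(accuracy([1, 2, 3, 3], [1, 2, 2, 3]))
```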
RANDOM FORESTS IN R
RESULTS REGRESSION
Split data into training and test sets. Each dataframe has 82,714 rows and 114 columns.
Parameters: 60 trees, sample of 50,000.
Training time: 39.39 min
RMSE: 14.587
RSQ: 0.581
rf <- randomForest(training, ratings_train, ntree = 60, sampsize = 50000, importance = TRUE)
RANDOM FORESTS IN PYTHON
RESULTS REGRESSION
Split data into training and test sets. Each dataframe has 82,714 rows and 114 columns.
Parameters: 60 trees, sample of 50,000.
Training time: 3 min 7 sec
RMSE: 14.687
RSQ: 0.575
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=60, max_features='sqrt')
RANDOM FORESTS IN R & PYTHON
[Figure panels: R; PYTHON / SCIKIT LEARN]
RANDOM FORESTS IN R
FEATURE IMPORTANCE
FEATURE (% INC MSE)    FEATURE (% INC NODE PURITY)
Beautiful              Talented
Boring                 Like Artist
Q16                    Catchy
Catchy                 Beautiful
Talented               Boring
Q9                     Track
Q19                    Distinctive
None of these          Cool
Age                    Q11
Track                  Q12
Q16 - I would be willing to pay for the opp to buy new music pre-release
Q9 - I am out of touch with new music
Q19 - I like to know about music before other people
Q11 - Pop music is fun
Q12 - Pop music helps me escape
Like artist - To what extent do you like or dislike listening to this artist?
RANDOM FORESTS IN PYTHON
FEATURE IMPORTANCE
FEATURE             IMPORTANCE IN R RANDOM FOREST
Distinctive         7
Catchy              3
Like Artist         2
Fun                 -
Talented            1
Beautiful           4
Original            -
Unoriginal          -
Q11                 9
Own Artist Music    -
Own Artist Music - Do you have this artist in your music collection?
Q11 - Pop music is fun
RANDOM FORESTS IN R & PYTHON
RESULTS REGRESSION
Model                                RMSE
R Random Forest                      14.587
Python Scikit Learn Random Forest    14.687
Linear Regression                    16.23
Multiple Linear Regs                 15.53
RANDOM FORESTS IN R
RESULTS CLASSIFICATION
Training time: 8.75 min
OOB error rate: 44.01%
Accuracy: 0.567
ratings_train <- as.factor(ratings_train)
rf <- randomForest(training, ratings_train, ntree = 60, sampsize = 50000, importance = TRUE)
1 2 3 4 5
1 16777 4863 1633 139 37
2 5760 12411 6213 504 89
3 1485 5559 13144 1880 329
4 176 888 4094 2592 625
5 59 204 1008 856 1388
RANDOM FORESTS IN PYTHON
RESULTS CLASSIFICATION
Training time: 2.56 min
OOB Score: 0.1964
Accuracy: 0.566
rf = sk.RandomForestClassifier(n_estimators=60, compute_importances=True, oob_score=True)
1 2 3 4 5
1 16930 4682 1758 129 53
2 5517 12369 6475 506 106
3 1500 5367 13448 1737 275
4 186 791 4171 2598 561
5 48 161 999 880 1466
Precision: 0.564
Recall: 0.5653
F1 Score: 0.5611
RANDOM FORESTS IN R
FEATURE IMPORTANCE
FEATURE (% INC MSE)    FEATURE (% INC NODE PURITY)
Q9                     Track
Q7                     Q11
Q5                     Q12
Q6                     Age
Age                    Q6
Q10                    Q17
listBACK               Q9
Q19                    Q16
listOWN                Q4
Q16                    Q13
Q16 - I would be willing to pay for the opp to buy new music pre-release
Q9 - I am out of touch with new music
Q19 - I like to know about music before other people
Q11 - Pop music is fun
Q12 - Pop music helps me escape
Q7 - I enjoy music primarily from going out to dance
Q5 - I used to know where to find music
Q6 - I am not willing to pay for music
Q10 - My music collection is a source of pride
Q4 - I would like to buy new music but I don’t know what to buy
Q17 - I find seeing a new artist a useful way of discovering new music
RANDOM FORESTS IN PYTHON
FEATURE IMPORTANCE
FEATURE    IMPORTANCE IN R RANDOM FOREST
Q11        2
Q12        3
Age        4
Q6         5
Q17        6
Q5         -
Q4         9
Q10        -
Q16        7
Q7         -
Q16 - I would be willing to pay for the opp to buy new music pre-release
Q11 - Pop music is fun
Q12 - Pop music helps me escape
Q5 - I used to know where to find music
Q6 - I am not willing to pay for music
Q10 - My music collection is a source of pride
Q4 - I would like to buy new music but I don’t know what to buy
Q17 - I find seeing a new artist a useful way of discovering new music
RANDOM FORESTS IN R
CONFUSION MATRIX
        1      2      3     4     5   CLASS ERROR
1   16777   4863   1633   139    37   28.45%
2    5760  12411   6213   504    89   50.31%
3    1485   5559  13144  1880   329   41.31%
4     176    888   4094  2592   625   69.09%
5      59    204   1008   856  1388   60.51%
RANDOM FORESTS IN PYTHON
CONFUSION MATRIX
        1      2      3     4     5   CLASS ERROR
1   16930   4682   1758   129    53   28.12%
2    5517  12369   6475   506   106   50.47%
3    1500   5367  13448  1737   275   39.77%
4     186    791   4171  2598   561   68.73%
5      48    161    999   880  1466   58.75%
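The last column and the overall accuracy can be recomputed from the counts of the Python confusion matrix. The sketch below assumes that rows are the true classes and columns the predicted ones, a reading that the row-wise percentages support; the function names are hypothetical.

```python
confusion = [
    [16930, 4682, 1758, 129, 53],
    [5517, 12369, 6475, 506, 106],
    [1500, 5367, 13448, 1737, 275],
    [186, 791, 4171, 2598, 561],
    [48, 161, 999, 880, 1466],
]


def class_errors(matrix):
    """Share of each row (true class) that falls off the diagonal."""
    return [1 - row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Diagonal count divided by the total count."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


errors = [f"{e:.2%}" for e in class_errors(confusion)]
print(errors)  # ['28.12%', '50.47%', '39.77%', '68.73%', '58.75%']
print(round(overall_accuracy(confusion), 3))  # 0.566
```

Classes 4 and 5 have far fewer rows than classes 1 to 3, and they also show the highest error rates.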
(Re)FORMULATE AN HYPOTHESIS
STEP 2
FEATURE SELECTION
PRINCIPAL COMPONENT ANALYSIS - WORDS
Determine which features account for most of the variance.
FEATURE        PC1       PC2
Distinctive    0.20      -0.059
Authentic      0.19      -0.046
Talented       0.19      -0.083
Credible       0.19      -0.084
Stylish        0.18      -0.094
Annoying       -0.06     -0.065
Intrusive      -0.06     -0.058
Irrelevant     -0.059    -0.087
Uninspired     -0.056    -0.092
Noisy          -0.053    -0.13
FEATURE SELECTION
Make a simple model choosing meaningful variables
WORDS - Annoying, Depressing, Boring, Catchy, Talented, Distinctive, Beautiful, Superstar, Soulful and Popular.
QUESTIONS - Q4, Q5, Q6, Q9, Q10, Q11 and Q19.
RESULTS
• Running time in R ~ 15 min.
• RMSE = 14.791 / Public leader board 13.076
[Figure panels: FULL MODEL; REDUCED MODEL]
COMMENTS
It is well known that Random Forests have been shown to be biased towards highly correlated variables. Using conditional inference trees ameliorates that bias (see the party package in R).
Scikit-learn’s implementation has an n_jobs parameter to parallelise training. For a similar feature in R, see the bigRF package.
CONCLUDING REMARKS
We solved a problem using both R and PYTHON (via Scikit learn). Clearly, the constraints for addressing a given problem might differ and would dictate the implementation of choice.
PICK THE TOOL THAT IS BEST FOR THE JOB
WORTH LEARNING ABOUT BOTH IMPLEMENTATIONS
Both R and PYTHON (via SCIKIT LEARN) implementations have added functions that allow the user to explore the resulting model and its performance.
CONCLUDING REMARKS
RANDOM FORESTS ARE GREAT
They give great accuracy, can handle many features, do not require cross validation and even estimate which variables are important.
KEEP AN EYE OUT FOR INTERESTING DATA
Having data that you are interested in leads to more interesting questions and reasons to explore new methods and add a new trick to your bag.
CONCLUDING REMARKS
EMI DATASET IS GREAT TO TEST RIDE
The set has a lot of behavioural information on a subject about which everyone has some intuition.
TO DO’s - WILL IT PYTHON?
Prediction using SVMs and Matrix Factorisation techniques. Full factor analysis, etc.
THANKS!