Building Data Products

Post on 18-Jul-2015

Transcript of Building Data Products

1

Building Data Products

Josh Wills, Senior Director of Data Science

About  Me  

2  

3

What Do Data Scientists Do?

What  I  Think  I  Do  

4

What  Other  People  Think  I  Do  

5

What  I  Actually  Do  

6  

Data  Science  and  Data  Products  

7

8

Thinking  About  Data  Products  

The  Best  Way  To  Find  Insights  

9

Build  A  Team  

10

Measure  Everything  

11

Solve  the  Right  Problem  

12  

13

Building  Data  Products  with  Hadoop  

Hadoop as a Platform for Data Products

14

ETL,  Data  Science,  and  Machine  Learning  

15  

Changing  the  Unit  of  Analysis  

16

Machine  Learning  and  You  

17

The Five Questions

1. When should I use it?

2. What does the input look like?

3. What does the output look like?

4. How many parameters do I have to tune?

5. Why will it fail?

18

1. Collaborative Filtering

19

Collaborative Filtering (cont.)

1. To see things that are hidden.

2. <user_id>,<item_id>,<weight>

3. <item1>,<item2>,<score>

4. The distance metric and the weight calculations.

5. If the input data is too sparse.
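
The input and output shapes above can be sketched in a few lines. This is a minimal single-machine illustration, not the Hadoop job from the talk: it assumes cosine similarity as the distance metric and treats the weight as given, both of which the slide lists as tunable choices. The function name `item_similarities` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarities(ratings):
    """Turn (user_id, item_id, weight) triples into (item1, item2, score) triples.

    The score is cosine similarity over users who touched both items.
    Cosine is an assumed metric; the slides do not name one.
    """
    by_user = defaultdict(dict)
    for user, item, weight in ratings:
        by_user[user][item] = weight  # a repeated (user, item) pair keeps the last weight

    norms = defaultdict(float)  # sum of squared weights per item
    dots = defaultdict(float)   # dot product per co-occurring item pair
    for items in by_user.values():
        for item, weight in items.items():
            norms[item] += weight * weight
        for a, b in combinations(sorted(items), 2):
            dots[(a, b)] += items[a] * items[b]

    results = []
    for (a, b), dot in sorted(dots.items()):
        denom = sqrt(norms[a] * norms[b])
        if denom > 0:  # skip items whose weights are all zero
            results.append((a, b, dot / denom))
    return results
```

Item pairs that share no user never appear in the output, which is one concrete way the "input data is too sparse" failure shows up.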

20

Collaborative Filtering on Hadoop

21

2. K-Means Clustering

22

K-Means Clustering (cont.)

1. To find anomalous events.

2. Vectors of normally distributed values.

3. Cluster centroids.

4. The choice(s) of K.

5. The points aren’t even remotely normally distributed.
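
A rough sketch of how the answers above fit together: vectors go in, centroids come out, and a point far from every centroid is a candidate anomaly. This assumes plain Lloyd's iterations with Euclidean distance and caller-supplied starting centroids (so K is the number of starting centroids). The names `kmeans` and `anomaly_score` are hypothetical, and nothing here is Hadoop-specific.

```python
from math import dist


def kmeans(points, initial_centroids, iterations=20):
    """Run Lloyd's algorithm and return the final centroids as tuples."""
    centroids = [tuple(c) for c in initial_centroids]
    k = len(centroids)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids


def anomaly_score(point, centroids):
    """Distance to the nearest centroid; larger means more unusual."""
    return min(dist(point, c) for c in centroids)
```

The result depends on the starting centroids as well as on K, so a poor start can leave the centroids in a bad local optimum even on well-separated data.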

23

K-Means on Hadoop

24

3.  Random  Forests  

25

Random Forests (cont.)

1. To classify and predict.

2. A dependent variable and many independent variables.

3. Lots and lots of little trees.

4. The number of variables to consider at each level.

5. Too many independent variables.

26

Random  Forests  on  Hadoop  

• R’s randomForest and rhadoop tools

• Map: partition the input data among the reducers

• Reduce: fit the random forests to each partition

• Re-combine the resulting trees in the client
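
The partition, fit, and re-combine flow above can be illustrated on one machine. This sketch makes several loud assumptions: it uses Python rather than R's randomForest, it runs in-process rather than on Hadoop, and each "tree" is a one-split decision stump on a single randomly chosen variable, standing in for a real decision tree. All function names are hypothetical.

```python
import random
from collections import Counter


def partition(rows, n_partitions):
    """'Map' step: spread (features, label) rows across partitions round-robin."""
    parts = [[] for _ in range(n_partitions)]
    for i, row in enumerate(rows):
        parts[i % n_partitions].append(row)
    return parts


def fit_stump(rows, feature):
    """Fit a one-split tree: (feature, threshold, left_label, right_label)."""
    best = None
    for threshold in sorted({x[feature] for x, _ in rows}):
        left = [y for x, y in rows if x[feature] <= threshold]
        right = [y for x, y in rows if x[feature] > threshold]
        if not left or not right:
            continue
        l_label, l_hits = Counter(left).most_common(1)[0]
        r_label, r_hits = Counter(right).most_common(1)[0]
        if best is None or l_hits + r_hits > best[0]:
            best = (l_hits + r_hits, feature, threshold, l_label, r_label)
    if best is None:  # no usable split: always predict the majority label
        label = Counter(y for _, y in rows).most_common(1)[0][0]
        return (feature, float("inf"), label, label)
    return best[1:]


def fit_forest(rows, n_trees, rng):
    """'Reduce' step: fit stumps on bootstrap samples of one partition."""
    n_features = len(rows[0][0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # bootstrap sample
        forest.append(fit_stump(sample, rng.randrange(n_features)))
    return forest


def combine(forests):
    """Client step: concatenate the trees fitted on every partition."""
    return [tree for forest in forests for tree in forest]


def predict(forest, x):
    """Majority vote across all trees."""
    votes = Counter(l if x[f] <= t else r for f, t, l, r in forest)
    return votes.most_common(1)[0][0]
```

A caller would run `partition`, call `fit_forest` once per partition with a seeded `random.Random`, pass the results to `combine`, and then use `predict` on the merged list. Because each partition sees only part of the data, the trees are fitted on less information than a single forest trained on everything.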

27  

The  Art  of  Model  Design  

28

Caution: Mind the Gap

29  

The  Joy  of  Experiments  

30

31

Introduction to Data Science: Building Recommender Systems http://university.cloudera.com/

Josh Wills, Director of Data Science, Cloudera @josh_wills

 

Thank  you!