Posted on 20-Aug-2015
Josh Wills | Senior Director of Data Science
Training a New Generation of Data Scientists
About Me
What Do Data Scientists Do?
What I Think I Do
What Other People Think I Do
What I Actually Do
The Emergence of Data Science
Data Storage in 2001: Databases
• Structured schemas
• Intensive processing done where data is stored
• Somewhat reliable
• Expensive at scale
Data Storage in 2001: Filers
• No schemas, stores any kind of file
• No data processing capability
• Reliable
• Expensive at scale
And Then, This Happened
Data Economics, Return on Byte
Big Data Economics
• No individual record is particularly valuable
• Having every record is incredibly valuable
• Web index
• Recommendation systems
• Sensor data
• Market basket analysis
• Online advertising
Enter Hadoop
The Hadoop Distributed File System
• Based on the Google File System
• Data stored in large files
• Large block size: 64MB to 256MB per block
• Blocks are replicated to multiple nodes in the cluster
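To make the block and replication idea concrete, here is a minimal sketch in Python. It is not HDFS code. The 128MB block size is simply a value inside the 64MB to 256MB range on the slide, the replication factor of 3 and the node names are assumptions, and the round-robin placement is a stand-in for illustration only (real HDFS placement is rack-aware and more involved).

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the sizes (in bytes) of the blocks a file would be cut into.

    block_size is an assumed value within the 64MB-256MB range.
    """
    if file_size <= 0:
        return []
    full_blocks, remainder = divmod(file_size, block_size)
    # Every block is full-sized except possibly the last one.
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is a simplification for illustration,
    not the policy HDFS actually uses.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


# Hypothetical cluster of five nodes holding one 1000MB file.
nodes = ["node1", "node2", "node3", "node4", "node5"]
blocks = split_into_blocks(1000 * MB)
placement = place_replicas(len(blocks), nodes)

for block_id, holders in placement.items():
    print(block_id, blocks[block_id] // MB, "MB", holders)
```

Because every block lives on several nodes, losing any single node in this sketch still leaves copies of each block elsewhere, which is the point of the replication bullet.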
Simple, Reliable, Distributed Processing: MapReduce
• Map Stage: Embarrassingly parallel
• Shuffle Stage: Large-scale distributed sort
• Reduce Stage: Process all the values that have the same key in a single step
• Process the data where it is stored
• Write once and you’re done.
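The three stages can be sketched on a single machine in plain Python. This is a toy word count, not the Hadoop API: the function names are hypothetical, everything runs in one process, and an in-memory sort stands in for the distributed shuffle.

```python
from collections import defaultdict


def map_stage(records, mapper):
    """Apply the mapper to each record independently.

    Each call depends only on its own record, which is what makes
    this stage embarrassingly parallel.
    """
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle_stage(pairs):
    """Group values by key and sort by key (a stand-in for the distributed sort)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_stage(grouped, reducer):
    """Process all the values that share a key in a single step."""
    return {key: reducer(key, values) for key, values in grouped}


def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    return [(word.lower(), 1) for word in line.split()]


def count_reducer(key, values):
    """Sum the counts collected for one word."""
    return sum(values)


def run_word_count(lines):
    pairs = map_stage(lines, word_mapper)
    grouped = shuffle_stage(pairs)
    return reduce_stage(grouped, count_reducer)


print(run_word_count(["big data", "Big value"]))
```

In a real cluster the mapper calls would run on the nodes that already hold the input blocks, and the shuffle would move each key's values across the network to the reducer responsible for that key.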
Thinking Like a Data Scientist
Solving Problems vs. Finding Insights
Parallelize Everything
Abundance vs. Scarcity
Building Data Products
Create a Data Science Team
Choose Good Problems
Design the Model
Mind the Gap
Amortize Costs
Measure Everything
Rinse and Repeat
Work Like a Data Scientist
Train Like a Data Scientist
Hadoop Developer Training
Hive and Pig Training
Introduction to Data Science
Introduction to Data Science: Building Recommender Systems
http://university.cloudera.com/
• Submit questions in the Q&A panel
• Watch on-demand video of this webinar at http://cloudera.com
• Follow Josh on Twitter @josh_wills
• Follow Cloudera University @ClouderaU
• Thank you for attending!
Register now for Cloudera training at http://university.cloudera.com
Use discount code DSvideo_10 to save 10% on new enrollments in Cloudera-delivered training classes until June 1
Use discount code 15off2 to save 15% on enrollments in two or more Cloudera-delivered training classes until June 1