The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine...
Revolution Confidential
The Rise of Data Science in the Age of Big Data Analytics: Why Data Distillation and Machine Learning Aren’t Enough
David M Smith
VP Marketing and Community
Revolution Analytics
Today, we’ll discuss:
What is Data Science?
Why machine learning isn’t enough
Why Data Science works
The Data Scientist’s Toolkit
The Future of Big Data Analytics
Closing thoughts and resources
© Dov Harrington, CC BY 2.0: http://www.flickr.com/photos/idovermani/4110546683/
Where is it safe to fish near San Francisco?
San Francisco Estuary Institute: http://www.sfei.org/tools/wqt
Hurricane Sandy
Bob Rudis: http://rud.is/b/2012/10/28/watch-sandy-in-r-including-forecast-cone/
Hurricane Sandy
Ed Chen: http://blog.echen.me/hurricane-sandy-outages/
When did Michael Jackson have his biggest hits?
New York Times, June 25 2009 (3 hours after Michael Jackson’s death): http://www.nytimes.com/interactive/2009/06/25/arts/0625-jackson-graphic.html
Three Essential Skills of Data Scientists
Drew Conway: http://www.dataists.com/2010/09/the-data-science-venn-diagram/
[Venn diagram labels: Data Integration, Mashups, Applications; Models, Visualization, Predictions, Uncertainty; Problems, Data Sources, Credibility; Effective Data Applications]
Image © Abode of Chaos, CC BY 2.0: http://www.flickr.com/photos/home_of_chaos/6418989233/
Machine learning (ML) for predictions
Building the Model: Features and Responses are fed to ML, which produces scoring rules
Validating the Model: scoring rules applied to the Validation set give Predictions, checked against the known Responses (“Accuracy”)
Scoring new data: scoring rules applied to New Data give Predictions (scores)
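The three stages on this slide (building, validating, scoring) can be sketched in a few lines. This is a minimal illustration in Python rather than R, using synthetic data and a one-feature threshold rule as a stand-in for a real ML algorithm. The names fit_stump, score, and accuracy are hypothetical and do not come from the talk.

```python
import random


def fit_stump(features, responses):
    """Building the model: learn one scoring rule, 'predict 1 if feature > threshold'."""
    best_threshold, best_correct = None, -1
    for threshold in sorted(set(features)):
        correct = sum((x > threshold) == bool(y) for x, y in zip(features, responses))
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def score(threshold, features):
    """Apply the scoring rule to a list of feature values."""
    return [int(x > threshold) for x in features]


def accuracy(predictions, responses):
    """Fraction of predictions that match the known responses."""
    return sum(p == y for p, y in zip(predictions, responses)) / len(responses)


# Synthetic data (an assumption for illustration): the response is 1 when the feature exceeds 5.
random.seed(0)
features = [random.uniform(0, 10) for _ in range(200)]
responses = [int(x > 5) for x in features]

# Hold out the last 50 rows as the validation set.
train_x, valid_x = features[:150], features[150:]
train_y, valid_y = responses[:150], responses[150:]

rule = fit_stump(train_x, train_y)                        # building the model
valid_accuracy = accuracy(score(rule, valid_x), valid_y)  # validating the model
new_scores = score(rule, [1.0, 9.0])                      # scoring new data
```

The validation step only says how well the rule reproduces known responses on held-out rows. It says nothing about whether the right question was asked, which is the gap the following slides address.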
Problem: A lack of perspective
Image © 2010 David M Smith. Some rights reserved, CC BY 2.0
Problem: Lack of credibility
Problem: Complexity
Data Science to the Rescue!
Answer Unasked Questions
Revolutions blog: “The Uncanny Valley of Big Data”: http://blog.revolutionanalytics.com/2012/02/the-uncanny-valley-of-big-data.html
Fill in knowledge gaps
“More data beats better algorithms, every time” -- Google
“Companies that have massive amounts of data without massive amounts of clue are going to be displaced by startups that have less data but more clue.” -- Tim O’Reilly
Google Research, “The Unreasonable Effectiveness of Data”: http://googleresearch.blogspot.com/2009/03/unreasonable-effectiveness-of-data.html
Tim O’Reilly on Google+: https://plus.google.com/107033731246200681024/posts/4Xa76AtxYwd
TechnoCalifornia: http://technocalifornia.blogspot.com/2012/07/more-data-or-better-models.html
Avoid ineffective reactions
Stupid Data Miner Tricks: http://nerdsonwallstreet.typepad.com/my_weblog/files/dataminejune_2000.pdf
[Chart label: S&P 500]
© Henricks Photos, CC BY-ND 2.0: http://www.flickr.com/photos/hendricksphotos/3240667626/
0. Data (Big & Messy)
1. A language for programming with data
Download the White Paper
R is Hot: bit.ly/r-is-hot
Grant awards to homeless veterans FY09
Data: Data.gov
Analysis: Drew Conway
User-defined functions
Internet API interface
XML parsing
Custom graphics
Data import and pre-processing
Iterative data processing
2. Speed. Lots and lots of speed.
Variable Transformation
Model Estimation
Model Refinement
Model Comparison / Benchmarking
Feature Selection
Sampling
Aggregation
Data
Predictions
Use all available computing cycles
[Diagram: a multicore processor (4, 8, 16+ cores), with Core 0 (Thread 0) through Core n (Thread n) working on data from disk in shared memory]
3. Algorithms that don’t choke on Big Data
PEMAs: Parallel External-Memory Algorithms
[Diagram: Big Data split into data partitions, each handled by a compute node and coordinated by a master node]
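One way to read "parallel external-memory algorithm" is: process the data one chunk at a time, keep only a small summary per chunk, and merge the summaries at the end. The Python sketch below computes a mean and variance this way, using the standard pairwise merge of (count, mean, sum of squared deviations). It illustrates the general idea only and is not Revolution Analytics' implementation; read_chunks, summarize, and merge are hypothetical names, and an in-memory list stands in for data on disk.

```python
from functools import reduce


def read_chunks(values, chunk_size):
    """Yield fixed-size chunks; stands in for reading blocks of rows from disk."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def summarize(chunk):
    """Reduce one chunk to (count, mean, sum of squared deviations from its mean)."""
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return n, mean, m2


def merge(a, b):
    """Combine two chunk summaries into the summary of their union."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2


data = [float(x) for x in range(1, 101)]

# Only one chunk plus a three-number summary needs to be in memory at a time.
partials = [summarize(chunk) for chunk in read_chunks(data, chunk_size=7)]
count, mean, m2 = reduce(merge, partials)
variance = m2 / count  # population variance
```

Because each summarize call depends only on its own chunk, the calls could be handed to separate cores or compute nodes, with the master node doing the merge. That dispatch step is left out here to keep the sketch short.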
Drink less coffee!
Single-threaded, non-optimized algorithms
Optimized, parallelized algorithms
4. Move code to data (not vice versa)
Map-Reduce
RHadoop: http://bit.ly/RHadoop
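The idea of moving code to data can be illustrated with a toy map-reduce word count in plain Python. Each partition is summarized where it lives, and only the small per-partition counts are moved and merged. This does not use Hadoop or the RHadoop packages; the example partitions and the function names map_partition and reduce_counts are assumptions made for illustration.

```python
from collections import Counter
from functools import reduce


def map_partition(records):
    """Map step: runs next to one data partition and returns local word counts."""
    counts = Counter()
    for record in records:
        counts.update(word.lower() for word in record.split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two sets of partial counts."""
    return left + right


# Two partitions standing in for blocks of a distributed file (made-up data).
partitions = [
    ["big data", "Big R"],
    ["data science", "R for big data"],
]

partial_counts = [map_partition(p) for p in partitions]
word_counts = reduce(reduce_counts, partial_counts, Counter())
```

Here word_counts holds one total per distinct word. The raw records never leave their partition; only the Counter objects do, which is the point of the slide.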
Big Data Appliances
More info: http://bit.ly/R-Netezza
Play Nice with Others
Presentation Layer
• Business Intelligence Tools
• Web-based data apps
• Reporting / Spreadsheets
Analytics Layer
• R
Data Layer
• Relational datastores
• Unstructured datastores
What every data scientist needs
Columns: Open-Source R | Revolution R Enterprise
Interface with multiple data sources
Exploratory data analysis
Wide range of statistical methods
High-speed computation
Big Data support
Data/code locality (Hadoop, etc.)
Print-quality data visualization
Scheduled batch production
Works in a multi-tool ecosystem
Integration into Data Apps
Revolution R Enterprise: Big-Data R
Columns: Open-Source R | Revolution R Enterprise
Interface with multiple data sources
Exploratory data analysis
Wide range of statistical methods
High-speed computation
Big Data support
Data/code locality (Hadoop, etc.)
Print-quality data visualization
Scheduled batch production
Works in a multi-tool ecosystem
Integration into Data Apps
www.revolutionanalytics.com/products
Image © www.tinyplanetphotography.com
And … the future?
Even more data
Cloud computing
Demand for Data Scientists
Diverging paradigms for data analytics
http://www.indeed.com/jobtrends
Diverging data paradigms
Hadoop, NoSQL
Files, Clusters
Data Appliances
More data, better fault tolerance
Easier programming, better performance
Exploration, Modeling
Storage, Preprocessing
Production
Data Science in Production
Real-time Big Data Analytics: From Deployment to Production
Thursday, November 29, 2012, 10:00AM - 11:00AM Pacific Time
www.revolutionanalytics.com/news-events/free-webinars/
Building Data Science Teams
DJ Patil in O’Reilly Radar: http://oreil.ly/I3H5fI
Statistics and Data Science graduates
Kaggle and Chorus
Revolution Analytics R Training: http://www.revolutionanalytics.com/services/training/
Closing Thoughts
The Data Science process leads to more powerful and more useful models
Data Scientists need a technology platform to think about, explore, and model data
Revolution R Enterprise is R for Big Data
Resources
Revolution R Enterprise: R for Big Data: www.revolutionanalytics.com/products
RHadoop: Connecting R and Hadoop: bit.ly/r-hadoop
Contact David Smith [email protected] @revodavid blog.revolutionanalytics.com
Thank you.
www.revolutionanalytics.com 650.646.9545 Twitter: @RevolutionR
The leading commercial provider of software and support for the popular open source R statistics language.