Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of Data Science
by @DataFellas / @Noootsab, 12th Nov. ‘15 @Devoxx
Outline
● (Legacy) Data Science Pipeline/Product
● What changed since then
● Distributed Data Science (today)
● Challenges
● Going beyond (productivity)
Data Fellas, a 6-month-old Belgian startup
Andy Petrella @noootsab
Maths, Geospatial, Distributed Computing
@SparkNotebook, Spark/Scala trainer, Machine Learning
Xavier Tordoir @xtordoir
Physics, Bioinformatics, Distributed Computing
Scala (& Perl), Spark trainer, Machine Learning
(Legacy) Data Science Pipeline, or so-called Data Product
Static Results
Lots of information lost in translation
Sounds like Waterfall
ETL look and feel
Sampling → Modelling → Tuning → Report → Interpret
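As a toy illustration (all names and data here are hypothetical stand-ins, not from the talk), the legacy single-machine flow can be sketched as a chain of in-memory steps:

```scala
import scala.util.Random

// Toy sketch of the legacy flow: sample, model, report -- all on one
// machine, all in memory. The "model" is just a mean, a stand-in for
// whatever estimator the team would really fit.
object LegacyPipeline {
  // Sampling: information is lost right here
  def sample(data: Seq[Double], n: Int): Seq[Double] =
    new Random(42).shuffle(data).take(n)

  // "Modelling": fit the simplest possible estimator
  def model(sampled: Seq[Double]): Double =
    sampled.sum / sampled.size

  // Report: a static string, frozen the moment it is produced
  def report(estimate: Double): String =
    s"Estimated mean: $estimate"

  def run(data: Seq[Double]): String =
    report(model(sample(data, n = 500)))
}
```

Each arrow in the slide is a hand-off; once `report` has run, the result is static and anything lost during `sample` is gone for good.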
(Legacy) Data Science Pipeline, or so-called Data Product
Single machine!
CPU-bound
Memory-bound
Or resampling, because of small-ish data
Sampling → Modelling → Tuning → Report → Interpret
Facts
Data gets bigger, or, more precisely, the number of available sources explodes
Data gets faster (and faster): just consider watching Netflix on 4G
Our World Today (no, it wasn’t better before)
Consequences
● Sampling: HARD (or will be too big…)
● Report: ephemeral, restricted view
Interpretation ⇒ Too SLOW to get real ROI out of the overall system
How to work around that?
Needs are:
● Alerting systems over descriptive charts
● More accurate results
● More or harder models (e.g. Deep Learning)
● More data
● Constant data flow
● Online interactions under control (e.g. direct feedback)
So, we need…
Distributed Systems
Distributed Data Science: System/Platform/SDK/Pipeline/Product/… whatever you call it
● “Create” Cluster
● Find available sources (context, content, quality, semantics, …)
● Connect to sources (structure, schema/types, …)
● Create distributed data pipeline/model
● Tune accuracy
● Tune performance
● Write results to sinks
● Access Layer
● User Access
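To make the shape of those steps concrete, here is a schematic sketch in plain Scala; every type and value is a hypothetical stand-in (in a real system each stage would target the cluster, e.g. as a Spark job, rather than run locally):

```scala
// Schematic of the steps above as composed stages. Every name here is a
// hypothetical stand-in, chosen only to keep the pipeline's shape visible.
object DistributedPipelineSketch {
  case class Source(name: String, schema: Map[String, String]) // context + types
  case class Model(features: Seq[String], accuracyPct: Int)
  case class Sink(name: String)

  // Find and connect to available sources (context, content, schema, ...)
  def findSources(): Seq[Source] =
    Seq(Source("events", Map("user" -> "string", "ts" -> "long")))

  // Create the distributed data pipeline / model over those sources
  def buildModel(sources: Seq[Source]): Model =
    Model(sources.flatMap(_.schema.keys), accuracyPct = 80)

  // Tune accuracy (and, in practice, performance too)
  def tuneAccuracy(m: Model): Model = m.copy(accuracyPct = m.accuracyPct + 5)

  // Write results to sinks, behind which sit the access layer and users
  def writeToSink(m: Model): Sink = Sink(s"predictions(acc=${m.accuracyPct}%)")

  def run(): Sink = writeToSink(tuneAccuracy(buildModel(findSources())))
}
```

The point of the composition is that nothing is interpreted by hand between stages: the output of one step is typed input to the next.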
Yo! Aren’t we talking about “Big” Data? Fast Data?
So, could the results really all be neither big nor fast?
Actually, the results are becoming “Big” Data and Fast Data themselves!
How have we accessed data since the ’90s? Remember SOA? → SERVICES!
Nowadays, we’re talking about microservices.
Here we are: one service for one result.
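For illustration only (this is not Shar3’s actual wiring), “one service for one result” can be as small as a single HTTP endpoint exposing one computed value. This sketch uses only the JDK’s built-in server; the path and payload are made up:

```scala
import com.sun.net.httpserver.{HttpExchange, HttpServer}
import java.net.InetSocketAddress

// "One service for one result": a single endpoint that serves one
// precomputed result as JSON. Path and payload are hypothetical.
object ResultService {
  val resultJson = """{"model":"churn","score":0.42}"""

  // Start the server on the given port (0 = pick an ephemeral port)
  def start(port: Int): HttpServer = {
    val server = HttpServer.create(new InetSocketAddress(port), 0)
    server.createContext("/result", (exchange: HttpExchange) => {
      val bytes = resultJson.getBytes("UTF-8")
      exchange.getResponseHeaders.add("Content-Type", "application/json")
      exchange.sendResponseHeaders(200, bytes.length)
      exchange.getResponseBody.write(bytes)
      exchange.close()
    })
    server.start()
    server
  }
}
```

Because each result gets its own tiny service, consumers depend on one stable endpoint and schema rather than on the pipeline that produced it.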
C’mon, charts and tables cannot be the only views offered to customers/clients, right?
We need to open the capabilities to UIs (dashboards), connectors (third parties), other services (“SOA”), … and OTHER pipelines!!!
What about Productivity? Streamlining the development lifecycle is most welcome
“Create” Cluster
Find available sources (context, content, quality, semantic, …)
Connect to sources (structure, schema/types, …)
Create distributed data pipeline/Model
Tune accuracy
Tune performances
Write results to Sinks
Access Layer
User Access
[Diagram: each pipeline step above annotated with the roles involved: ops, data, sci, web]
What about Productivity? Streamlining the development lifecycle is most welcome
➔ Longer production line
➔ More constraints (resource sharing, time, …)
➔ More people
➔ More skills
Overlook these points and, soon or sooner, you’ll be kicked.
So, how to have:
● results coming fast enough whilst keeping the accuracy level high?
● responsiveness to external/unpredictable events?
WHEN…
Warning, Team Fight: as seen by members
Warning, Team Fight: as seen by managers
Warning, Team Fight: as seen by employers
Warning, Team Fight: as seen by customers
What about Productivity? Streamlining the development lifecycle is most welcome
At Data Fellas, we think we need Interactivity and Reactivity to tighten the frontiers (within the team and in time).
Hence, Data Fellas
● extends the Spark Notebook (interactivity)
● builds the Shar3 product around it (integrated reactivity)
Concepts of Data Fellas’ Shar3: Shareable and Streamlined Data Science
● Analysis
● Production
● Distribution / Rendering
● Discovery
● Catalog / Project
● Generator
● Micro service / binary format
● Schema for output
● Metadata
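As an illustration of the “schema for output” idea, here is a hypothetical Avro record schema for one result type, kept as a plain string (a real pipeline would hand it to the Avro library and publish it to downstream consumers); the record name and fields are made up:

```scala
// Hypothetical Avro schema describing one service's output. Publishing
// this next to the result is what lets other tools -- and other
// pipelines -- consume it without guessing at its structure.
object OutputSchema {
  val avro: String =
    """{
      |  "type": "record",
      |  "name": "ScoredUser",
      |  "namespace": "fellas.shar3.example",
      |  "fields": [
      |    {"name": "userId", "type": "string"},
      |    {"name": "score",  "type": "double"},
      |    {"name": "ts",     "type": "long"}
      |  ]
      |}""".stripMargin
}
```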
Using Shar3 (yeah \o/)
Let’s take this example where some buddies from:
● Datastax: Joel Jacobson @joeljacobson, Simon Ambridge @stratman1958
● Mesosphere: Michael Hausenblas @mhausenblas
● Typesafe: Iulian Dragos @jaguarul
● Data Fellas: Xavier Tordoir @xtordoir (and me)
What do we need to do now?
● Deploy
● Connect the dots
● Track
● Scale
Both the jobs and the services
Using Shar3 (yeah \o/)
From notebook, to SBT project, to Docker, to Marathon:
SNB → SBT/JAR → Docker → Marathon
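To make the last hop concrete: a Marathon application definition for such a generated Docker image might look like the following (the image name, ports, and values are hypothetical, not Shar3’s actual output):

```json
{
  "id": "/shar3/variant-analysis-service",
  "cpus": 1,
  "mem": 2048,
  "instances": 1,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "datafellas/variant-analysis:0.1.0",
      "network": "BRIDGE",
      "portMappings": [{ "containerPort": 8080, "hostPort": 0 }]
    }
  }
}
```

Posting such a definition to Marathon’s REST API is what turns the packaged notebook into a supervised, scalable service on the Mesos cluster.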
Using Shar3 (yeah \o/)
From Notebook
● to output
● to Avro
Using Shar3 (yeah \o/)
From Notebook
● to Avro
● to service
● to SBT
● to Docker
● to Marathon
SNB → SBT/JAR → Docker → Marathon
Using Shar3 (yeah \o/)
From Notebook
● to Avro
● to Tableau, or QlikView, or D3.js, or …
Using Shar3 (yeah \o/)
So we have all this information available:
● notebook’s markdown text
● notebook’s code/model
● data sources
● output/sinks
● output/services
● Avro schema
Shouldn’t they all be reused???
Using Shar3 (yeah \o/)
Variant Analysis: there is a service!
Let’s use it…
What was the process?
Fine, and the output is in C*
Let’s check what’s in there
Not what I need; let’s ADAPT
Poke us on:
@DataFellas
@Shar3_Fellas
@SparkNotebook
@Xtordoir & @Noootsab
Now @TypeSafe: http://t.co/o1Bt6dQtgH
If you wanna learn more about the different tools… join us @ O’Reilly
Follow-up soon on http://NoETL.org (HI5 to @ChiefScientist for that)
That’s all folks, thanks for listening/staying