Dataiku - From Big Data To Machine Learning
![Page 1: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/1.jpg)
1Dataiku04/10/2023
![Page 2: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/2.jpg)
Hi!
Florian Douetteau
Current life: CEO, Dataiku
Past life: Criteo, IsCool Entertainment, Exalead
Tweet about this: @dataiku @club_dsi_gun
Available on SlideShare: http://www.slideshare.net/Dataiku
Goals today:
• Concrete feedback on data analytics projects
• Data team in practice and key technologies
• Motivate you to start a data science project
Slide deck allergic? Check: https://github.com/dataiku
![Page 3: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/3.jpg)
Dataiku
“Dataiku: an open source platform to help you build your data lab”
![Page 4: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/4.jpg)
Motivation
![Page 5: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/5.jpg)
Collocation
Collocation: a familiar grouping of words, especially words that habitually appear together and thereby convey meaning by association.
Big Apple
Big Mama
Big Data
![Page 6: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/6.jpg)
“Big” Data in 1999
struct Element { Key key; void* stat_data; } ….
• C optimized data structures
• Perfect hashing
• HP-UX servers, 4 GB RAM
• 100 GB data
• Web crawler: socket reuse, HTTP 0.9
1 Month
![Page 7: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/7.jpg)
Big Data in 2013
• Hadoop
• Java / Pig / Hive / Scala / Clojure / …
• A dozen NoSQL data stores
• MPP databases
• Real-time
1 Hour
![Page 8: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/8.jpg)
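Data Analytics: The Stakes
[Chart: data volume versus value at stake, by project]
Projects: Web Search (1999), Logistics (2004), Banking CRM (2008), Web Search (2010), Social Gaming (2011), Online Advertising (2012), E-Commerce (2013)
Data points (not matched to projects): 1 TB / ? $, 1 TB / 100M $, 1 TB / 1B $, 100 TB / ? $, 10 TB / 10M $, 1000 TB / 500M $, 50 TB / 1B $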
![Page 9: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/9.jpg)
Meet Hal Alowne
Dataiku - Data Tuesday
Hal Alowne, BI Manager, Dim’s Private Showroom
Dim Sum, CEO & Founder, Dim’s Private Showroom
“Hey Hal! We need a big data platform like the big guys. Let’s just do as they do!”
Big Guys
• 10B$+ revenue
• 100M+ customers
• 100+ data scientists
European e-commerce web site
• 100M$ revenue
• 1 million customers
• 1 data analyst (Hal himself)
Big Data Copy Cat Project
![Page 10: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/10.jpg)
Technology is complex
[Map of the technology landscape]
Regions: Machine Learning Mystery Land, Scalability Central, NoSQL-Slavia, SQL Columnar Republic, Visualization County, Data Clean Wasteland, Statistician Old House
Tools: Hadoop, Ceph, Sphere, Cassandra, Spark, Scikit-Learn, Mahout, WEKA, MLBase, RapidMiner, Pandas, D3, Crossfilter, InfiniDB, LucidDB, Impala, Elastic Search, SOLR, MongoDB, Riak, Membase, Pig, Hive, Cascading, Talend, R
![Page 11: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/11.jpg)
Statistics and Machine Learning are complex!
Try to understand myself
![Page 12: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/12.jpg)
(Some books you might want to read)
![Page 13: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/13.jpg)
Plumbing is not complex (but difficult)
Implicit User Data(Views, Searches…)
Content Data(Title, Categories, Price, …)
Explicit User Data(Click, Buy, …)
User Information(Location, Graph…)
500TB
50TB
1TB
200GB
Transformation Matrix
Transformation Predictor
Per User Stats
Per Content Stats
User Similarity
Rank Predictor
Content Similarity
![Page 14: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/14.jpg)
MERIT = TIME + ROI
TIME: 6 MONTHS
Build a lab in 6 months (rather than 18 months)
• Find the right people (6 months?)
• Choose the technology (6 months?)
• Make it work (6 months?)
Build the lab (6 months)
• Train people
• Reuse working patterns
ROI: APPS
Deploy apps that actually deliver value
• Targeted newsletter
• Recommender systems
• Adapted product / promotions
Timeline: 2013 to 2014
![Page 15: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/15.jpg)
The Problem
It’s utterly complex and unreasonable
![Page 16: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/16.jpg)
Our Goal
Change his perspective on data science projects
(sorry, we couldn’t find a picture of Hal smiling)
![Page 17: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/17.jpg)
Agenda
Why and for what?
◦ Business theory
◦ Concrete projects
How: people and project?
◦ How to start
◦ Dedicated team?
What technologies?
◦ Machine learning
◦ Architecture
![Page 18: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/18.jpg)
Embodiment of Knowledge
Find your core business advantage
![Page 19: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/19.jpg)
Product Success driven by Quality!
Margin / Customer Value / Traffic / Acquisition
Example: Launching an App on the App Store
![Page 20: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/20.jpg)
Is your business really scalable?
you continue growing…
Margin for new customers might decline…
Margin for new features might decline…
![Page 21: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/21.jpg)
Existing Customers Profiles
Existing Product Assets
Existing Specific Business Model
And your KNOWLEDGE of it
Where is your core business advantage ?
![Page 22: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/22.jpg)
Data Driven Business: What is your value?
Number of Customers
Customer Knowledge
Increases over time with:
- Time spent in your app
- User relationships (network effect)
- Partner / other apps interactions
Your Value
Figures shown: 1,409,540; $1,03; $2,57; $4,081,710,239; 2,534,123
![Page 23: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/23.jpg)
Data Impact: Not all businesses are equal
Online Advertising
Telecommunication
Insurance
Ability to Acquire
Margin New Services Overall
Subscription Market
Infrastructure Driver
Selling Data
Risk / Price Optimization
Subscription Market
Subscription Market
![Page 24: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/24.jpg)
From Theory To Practice
Concrete Projects
![Page 25: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/25.jpg)
Freemium Application
• What should be free in the application?
• How to optimize conversion?
• How to plan and create a business model?
Main pain point: How to plan and optimize pricing in the application?
![Page 26: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/26.jpg)
Example (Freemium Application): Freemium Model Optimization
Business Model
User Cluster
Simulation
Optimized pricing: margin +23%
Business planning capability: 1 month → 9 months
• R + Python + InfiniDB
• On-premise
• 1 TB dataset
• 5-week project
![Page 27: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/27.jpg)
Large E-Retailer
• Business Intelligence stack has scalability and maintenance issues
• Backoffice implements business rules that are challenged
• Existing infrastructure cannot cope with per-user information
Main pain point: 23 hours 52 minutes to compute Business Intelligence aggregates for one day.
![Page 28: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/28.jpg)
Large E-Retailer: The Datalab
• Relieve their current DWH and accelerate production of some aggregates/KPIs
• Be the backbone for new personalized user experience on their website: more recommendations, more profiling, etc.
• Train existing people around machine learning and segmentation experience
1h12 to perform the aggregate, available every morning
New home page personalization deployed in a few weeks
• Hadoop cluster (24 cores)
• Google Compute Engine
• Python + R + Vertica
• 12 TB dataset
• 6-week project
![Page 29: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/29.jpg)
Large Photo Bank
• BI performed directly on production databases
• New reports required direct work from the CTO for design and implementation
• Each photo tag manually validated and completed
Main pain point: No visibility on new users’ behaviours
![Page 30: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/30.jpg)
Large Photo Bank: The Datalab
Implementing a cloud-based data lab to:
• centralize all available data, previously scattered between SQL DB and file systems,
• improve web tracking granularity to enhance customer knowledge via behavior modeling and segmentation,
• create content-based recommendation engines with keywords clustering and association.
Automated content filtering and recommendation
• R + Vertica + Hadoop
• Amazon Web Services
• 8-week project
![Page 31: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/31.jpg)
Large Online Directory
• Large set of manually crafted linguistic resources for interpreting user queries
• New brands, rare terms… hard to maintain
Main pain point: Ability to maintain a very large ontological knowledge set, with more than 100k concepts
![Page 32: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/32.jpg)
Large Online Directory: The Data Lab
• Analyze clicks, rephrasing, and navigation to detect queries that require specific processing
• Gather web and external data to enrich the existing index
• Train the team on Hadoop and Machine Learning
Continuous relevance monitoring
Automated enrichment: 2x more productivity
• Hadoop (48 cores)
• Python
• On-premise
• 10-week project
![Page 33: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/33.jpg)
Example (E-Application): Marketing Campaign Prediction
Launch a marketing campaign
After a few days, PREDICT based on behaviours:
◦ Total ARPU for users after 3 months
◦ Efficiency of a campaign
◦ Continue or not?
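One very simple way to picture the "predict after a few days" idea is a straight-line fit between an early signal and the 3-month ARPU observed on past campaigns. The sketch below is only an illustration under assumptions: the figures are made up, `fit_line` is a hypothetical helper, and ordinary least squares is our choice here, not necessarily the model used in the project.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical past campaigns: revenue per user after 3 days
# versus ARPU observed after 3 months.
early_revenue = [0.10, 0.20, 0.30, 0.40]
arpu_3_months = [1.2, 2.2, 3.2, 4.2]

slope, intercept = fit_line(early_revenue, arpu_3_months)

# New campaign, 3 days in: estimate its 3-month ARPU.
predicted = slope * 0.25 + intercept
print(f"predicted 3-month ARPU: {predicted:.2f}")
```

The "continue or not?" decision then amounts to comparing the predicted ARPU with the acquisition cost per user.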
![Page 34: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/34.jpg)
Example (Social Gaming): Social Gaming Communities
• A very large community
• Some mid-size communities
• Lots of small clusters (mostly 2 players)
Correlation
◦ between community size and engagement / virality
Meaningful patterns
◦ 2 players / Family / Group
What is the minimum number of friends to have in the application to get additional engagement?
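A minimal sketch of how community sizes can be measured, assuming a "community" is simply a connected component of the friendship graph (the original analysis may have used a finer community-detection method). Player names and the `community_sizes` helper are hypothetical.

```python
from collections import Counter

def community_sizes(players, friendships):
    """Sizes of the connected components of the friendship graph, largest first."""
    parent = {p: p for p in players}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in friendships:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two groups

    sizes = Counter(find(p) for p in players)
    return sorted(sizes.values(), reverse=True)

players = ["ann", "bob", "cat", "dan", "eve", "fay", "gus"]
friendships = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("eve", "fay")]

sizes = community_sizes(players, friendships)
print(sizes)
```

Joining these sizes with a per-player engagement metric is what lets you look for the correlation mentioned on the slide.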
![Page 35: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/35.jpg)
Agenda
What others do?
◦ Concrete projects
How: people and project?
◦ How to start
◦ Dedicated team?
What technologies?
◦ Machine learning
◦ Architecture
![Page 36: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/36.jpg)
First Steps
![Page 37: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/37.jpg)
(1) Be Data Driven
A/B testing (or its equivalent for your business) is the first step to get into a “data-driven” mindset
No advanced analytics required; some existing tools can help
Changing a button color: +21%
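Before trusting a lift like +21%, it helps to check that it is not noise. Here is a minimal sketch using a standard two-proportion z-test; the visitor and conversion counts are invented for illustration, and `ab_test_z` is a hypothetical helper, not part of any tool named in this deck.

```python
import math

def ab_test_z(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test.

    Returns (relative lift of B over A, z statistic, two-sided p-value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    lift = (p_b - p_a) / p_a
    return lift, z, p_value

# Hypothetical experiment: old button (A) versus new button color (B).
lift, z, p_value = ab_test_z(conv_a=1000, n_a=10000, conv_b=1210, n_b=10000)
print(f"lift={lift:+.1%}  z={z:.2f}  p={p_value:.2g}")
```

With smaller samples the same +21% could easily fail the test, which is why sample size matters as much as the headline number.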
![Page 38: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/38.jpg)
People Microsoft Excel
(2) Use Excel
![Page 39: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/39.jpg)
(3) Build a team
Data Team, Data Tools
• The business expert who knows maths
• The analyst who reveals patterns
• The coding guy who is enthusiastic
![Page 40: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/40.jpg)
data lab, (n. m): a small group with all the expertise, including business minded people, machine learning knowledge and the right technology
A proven organization used by successful data-driven companies over the past few years (eBay, LinkedIn, Walmart…)
TEAM + TOOLS = LAB
![Page 41: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/41.jpg)
Organization
Targeted campaigns, price optimization
Personalized experience
Quality assurance, workload and yield management
User feedback (A/B test), continuous improvement
Data
Product Designer
Business & Marketing
Engineers
User Voice
![Page 42: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/42.jpg)
It’s just a new team

| | Short Term Focus | Long Term Drive |
| --- | --- | --- |
| Business People | Optimize margin, … | Create new business revenue streams |
| Marketing People | Optimize click ratio | Brand awareness and impact |
| IT People | Make IT work | Clean and efficient architecture |
| Data People | Get stats right, make predictions | Create data-driven features |
![Page 43: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/43.jpg)
Super Intern
What is your ability to integrate a new smart guy and give him any data he would need and any computing power he would need to enhance your product?
![Page 44: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/44.jpg)
Agenda
What others do?
◦ Concrete projects
How: people and project?
◦ How to start
◦ Dedicated team?
What technologies?
◦ Machine learning
◦ Architecture
![Page 45: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/45.jpg)
An oversimplified view of big data architecture
Architecture Patterns
![Page 46: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/46.jpg)
Database Business Layer Application
![Page 47: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/47.jpg)
(What it really looks like)
![Page 48: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/48.jpg)
What kind of scale?
Database Business Layer Application
Or
Data Science App
Or ?
![Page 49: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/49.jpg)
What kind of interaction ?
Database Business Layer Application
Data Science App
?
?
? ? ?
?
![Page 50: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/50.jpg)
Classic Columnar Architecture
• Some data
• Some place to pour it in
• Some tool to do some maths and graphs
![Page 51: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/51.jpg)
Classic Columnar Architecture
• Lots of data
  ◦ Web tracking logs
  ◦ Raw server logs
  ◦ Order / Product / Customer
  ◦ Facebook info
  ◦ Open data (weather, currency…)
• Some place to pour it in
• Some tool to do some maths and graphs
![Page 52: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/52.jpg)
The Corinthian Architecture
• Lots of data
• Some place to pour it in and clean / prepare it
• Some place to perform rapid calculations
• Some tools to do some maths and charts
![Page 53: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/53.jpg)
Data Storage And Preparation
Large Scale:
• Hadoop Cluster
• Cassandra
• MPP SQL Columnar
Medium/Large Scale:
• CouchBase
• MongoDB
• ….
Selection Drivers:
• Volume
• Scalability
![Page 54: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/54.jpg)
Calculations
Classic Database
• PostgreSQL
• MySQL
• ….
MPP SQL Database
• Vertica, Vectorwise, InfiniDB, GreenplumHD….
Hadoop New Databases
• Impala
• …
Selection Drivers:
• Speed (Interactivity)
• Expressivity
![Page 55: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/55.jpg)
The Corinthian Architecture
• Lots of data
• Some place to pour it in and clean / prepare it
• Some place to perform rapid calculations
• Some tools to do some maths and charts
  ◦ Statistics
  ◦ Cohorts
  ◦ Regressions
  ◦ Bar charts for marketing
  ◦ Nice infographics for your company board
![Page 56: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/56.jpg)
The Corinthian Architecture
• Lots of data
• Some place to pour it in and clean / prepare it
• Some database to perform rapid calculations
• Some tools to do some maths, some others to do some charts
![Page 57: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/57.jpg)
Statistical Tools
Open Source:
• IPython
• RStudio
Commercial:
• RapidMiner
• SAS
• RevolutionR
Selection Drivers:
• Existing know-how
• Scalability
![Page 58: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/58.jpg)
What is a statistical tool ?
Interact and explore data
Some stats capabilities
Some Graph Capabilities
![Page 59: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/59.jpg)
Visualization Tools
Commercial:
• SpotFire
• Tableau
• QlikView
SaaS:
• BIME
• ChartIO
• RevolutionR
HTML5 / Ad hoc:
• D3
• GraphViz
Selection Drivers:
• How many contributors / readers?
• Scalability
![Page 60: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/60.jpg)
The “one database won’t do it all” problem
• Lots of data
• Some place to pour it in and clean / prepare it
• Some database to perform rapid calculations
• Some tools to do some maths, some others to do some charts
JOIN / aggregate
Rapid GROUP BY computations
Direct access to the computed results for production, etc.
![Page 61: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/61.jpg)
The Roman Social Forum
• Lots of data
• Some place to pour it in and clean / prepare it
• Some database to perform rapid calculations, and some database for graphs
• Some tools to do some maths, some others to do some charts
![Page 62: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/62.jpg)
Graph
Databases:
• Neo4J
• Titan
• OrientDB
• InfiniteGraph
Analytic / Visualization:
• Gephi
Selection Drivers:
• Scalability
• What algorithms?
• Licensing constraints
![Page 63: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/63.jpg)
The Key Value Store
• Lots of data
• Some place to pour it in and clean / prepare it
• Some database to perform rapid calculations, and some database for graphs, and some distributed key value store
• Some tools to do some maths, some others to do some charts
![Page 64: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/64.jpg)
NoSQL
Search:
• SOLR
• ElasticSearch
Document:
• MongoDB
• CouchDB
Key-value:
• Redis
• HBase
…
Selection Drivers:
• Durability / availability…
• Performance
• Ease of use and API
• Indexing
![Page 65: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/65.jpg)
Action requires Prediction
• Lots of data
• Some place to pour it in and clean / prepare it
• Some database to perform rapid calculations, and some database for graphs, and some distributed key value store
• Some tools to do some maths, some others to do some charts
Draw a line for the future
What are my real user groups?
Should I launch a discount offering or not? To everybody or to specific users only?
![Page 66: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/66.jpg)
The Medieval Fairy Land
• Lots of data
• Some place to pour it in and clean / prepare it
• Some database to perform rapid calculations, and some database for graphs, and some distributed key value store
• Some tools to do some maths, some others to do some charts, and some MACHINE LEARNING
![Page 67: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/67.jpg)
Predictions
Java:
• Mahout (Hadoop)
• WEKA
Python:
• Scikit-Learn
• PyML
R
Commercial:
• KXEN
• SAS
• SPSS…
…
Selection Drivers:
• Scalability
• Black box / white box?
• Data management integration
![Page 68: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/68.jpg)
Machine Learning
Can be fun
![Page 69: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/69.jpg)
Classes of Machine Learning Problems
Exploratory Data Analysis
◦ Identifying and visualizing key patterns and correlations within the dataset
Unsupervised Learning
◦ Creating groups of similar observations sharing the same patterns (aka Clustering, Segmentation)
Supervised Learning
◦ Modeling a variable using independent features (aka Scoring, Predictive Modeling, Classification)
Time Series Forecasting
◦ Predicting a time-dependent variable using its own history, and sometimes other covariates (variables)
Graph Analysis
◦ Analyzing relationships between a set of “nodes”, linked by “edges”
Associations / Sequences Mining
◦ Identifying frequently associated items within transaction/event databases, sometimes ordered over time
And many more…
10/04/2023Dataiku - Innovation Services 69
![Page 70: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/70.jpg)
Mapping ML to Business Questions

| Class | Sample Business Questions |
| --- | --- |
| Exploratory Data Analysis | What does my dataset look like? What are the key correlations in my data? |
| Unsupervised Learning | Can I create groups of users who share the same purchasing behavior? The same navigation behavior? |
| Supervised Learning | Which users are likely to click on ad X? Which users are likely to convert to paying users? Who is going to leave my service? What is the profile of the users who do X? |
| Time Series Forecasting | What is the forecast of my revenue next month? Given the weather forecast, can I also forecast my sales? Product sales forecast (for overbooking) |
| Graph Analysis | Can I identify influencers in my user community? Can I recommend new friends to my users? |
| Association & Sequences Mining | Which products are frequently bought together? What is the typical navigation path on my website? |
![Page 71: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/71.jpg)
Machine Learning Methods Detailed

| Analytical Task | ML Task | Sample Algorithms | Shape of Dataset |
| --- | --- | --- | --- |
| Exploratory Data Analysis | Univariate Analysis | Distribution, frequencies, histogram, boxplots, fit tests... | N obs. (1 row per obs.) * P features |
| | Bivariate Analysis | Scatterplots, correlations (Pearson, Spearman), GLM, Chi Square... | N obs. (1 row per obs.) * P features |
| | Multivariate Analysis | Principal components analysis, multi-dimensional scaling, correspondence analysis, factor analysis… | N obs. (1 row per obs.) * P features |
| “Oriented” Data Analysis | Unsupervised Learning | K-means, K-medoids, hierarchical clustering, Gaussian mixture models, mean shift, DBSCAN, spectral clustering... | N obs. (1 row per obs.) * P features |
| | Supervised Learning | Linear & logistic regression, decision trees, neural networks, SVM, naïve Bayes, K-NN, random forests… | N obs. (1 row per obs.) * P features |
| | Time Series Forecasting | ARMA, VARMAX, ARIMA… | Time series (rows: time period, columns: measures) |
| | Graph Analysis | Centrality (closeness, betweenness, PageRank, HITS), modularity (Louvain)… | Nodes and edges lists (+ attributes) |
| | Associations & Sequences | Frequent itemsets, Apriori, market basket… | (Timestamped) events or transactions |
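To make the "events or transactions" dataset shape concrete, here is a minimal sketch that counts which pairs of products are bought together. It is a brute-force pair count with a support threshold, not the Apriori pruning algorithm itself; the baskets and the `frequent_pairs` helper are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Return {pair: support} for item pairs present in at least
    min_support (a fraction between 0 and 1) of the transactions."""
    counts = Counter()
    for basket in transactions:
        # Deduplicate and sort so each pair has one canonical form.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(transactions)
    return {
        pair: count / total
        for pair, count in counts.items()
        if count / total >= min_support
    }

transactions = [
    ["bread", "butter", "jam"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["butter", "milk"],
]

pairs = frequent_pairs(transactions, min_support=0.5)
print(pairs)
```

On real volumes, counting every pair becomes expensive, which is the problem Apriori-style pruning addresses.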
![Page 72: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/72.jpg)
Unsupervised Method: K-Means
Cluster a dataset into K buckets by assigning each point to its “closest” cluster center
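A minimal pure-Python sketch of the K-means loop: assign each point to its closest center, then move each center to the mean of its points. The naive initialization (first K points) is an assumption made here for reproducibility; real libraries use random or k-means++ starts. The sample points are made up.

```python
def kmeans(points, k, iterations=20):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    assignment = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for each point.
        assignment = [
            min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

points = [
    (1.0, 1.0), (8.0, 8.0),
    (1.2, 0.8), (8.2, 7.9),
    (0.9, 1.1), (7.8, 8.1),
]

centroids, assignment = kmeans(points, k=2)
print(centroids, assignment)
```

Choosing K itself is a separate question, usually answered with criteria such as the gap statistic or silhouette score.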
![Page 73: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/73.jpg)
Supervised: K-Nearest-Neighbours
Predict the color of a point depending on the colors of its K closest neighbours
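The color-voting idea fits in a few lines. This is a sketch with made-up points and a hypothetical `knn_predict` helper, using squared Euclidean distance and a simple majority vote.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (point, color). Returns the majority color
    among the k training points closest to query."""
    by_distance = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    votes = Counter(color for _, color in by_distance[:k])
    return votes.most_common(1)[0][0]

train = [
    ((1.0, 1.0), "red"), ((1.5, 2.0), "red"), ((2.0, 1.0), "red"),
    ((6.0, 6.0), "blue"), ((7.0, 6.5), "blue"), ((6.5, 7.0), "blue"),
]

prediction = knn_predict(train, (2.0, 2.0), k=3)
print(prediction)
```

There is no training phase: all the work happens at prediction time, which is why K-NN gets slow on large datasets.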
![Page 74: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/74.jpg)
Supervised: Decision Tree
• Find the most “significant” input variable and split value
• Split the dataset recursively
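A minimal sketch of the first step, finding the best variable and split value. We assume here that "significant" means the lowest weighted Gini impurity, which is one common criterion among several; the data and the `best_split` helper are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best
    binary split 'feature <= threshold', or None if no split exists."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [lab for r, lab in zip(rows, labels) if r[f] <= threshold]
            right = [lab for r, lab in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue  # a split must leave points on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, threshold, score)
    return best

# Hypothetical users: (number of visits, average basket) -> converted?
rows = [(1, 20.0), (2, 35.0), (3, 15.0), (8, 30.0), (9, 10.0), (10, 25.0)]
labels = ["no", "no", "no", "yes", "yes", "yes"]

split = best_split(rows, labels)
print(split)
```

The recursive part then applies `best_split` again to the rows on each side, until the groups are pure or too small.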
![Page 75: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/75.jpg)
Several Paths to Machine Learning
[Flowchart: from an analytical dataset to a family of methods, through yes/no questions]
Starting point: Analytical Dataset
• “I just want to explore” → Exploratory Data Analysis
  ◦ Questions: “All my variables are numeric”, “I have a distance matrix”
  ◦ Methods: PCA…, CA…, MDS..., Data Viz...
• “I’m looking variable by variable, or pairs”
  ◦ Methods: Correlation Analysis, GLM, parametric and non-parametric stat. tests
• “I’m looking for clusters” → Unsupervised Learning
  ◦ Questions: “I know how many groups to look for”, “Small Dataset (<<1K)”, “Medium Dataset (<<100K)”, “I can sample”
  ◦ Methods: HCA…, Partitioning (K-means…), GMM…, DP GMM…, K-means + Gap | Silhouette | …, 2-steps clustering, Affinity Propagation, Mean Shift…
• “I want to predict a variable” → Supervised Learning*
  ◦ Question: “I value interpretability” (Yes / Not Only)
  ◦ Methods: Generalized Linear Model, Simple Decision Tree, Support Vector Machines, Neural Networks, K-Nearest Neighbors, Ensembles (Random Forest, Gradient Boosted Tree), MARS, Generalized Additive Model
* Methods generally working for both classification & regression
![Page 76: Dataiku - From Big Data To Machine Learning](https://reader033.fdocuments.us/reader033/viewer/2022050920/54c6fa594a795931168b45ea/html5/thumbnails/76.jpg)
Questions?
Take Away
◦ There are new ways to perform data analytics that are within your reach and can bring business value
Some Additional Resources
◦ Open source projects
  Dataiku Cloud Transport Client: http://dctc.io
  Dataiku Web Tracker: https://github.com/dataiku/wt1
◦ Our technical blog: http://www.dataiku.com/blog