Open source for customer analytics
Matthias Funke, Business & Technology Consultant
Agenda Topics
Open Source Software
Data Products
The “Data Process”
Tying it together
Open Source Software
Examples: Linux, LibreOffice, Eclipse, Hadoop
Source Code open, e.g. github.com (>3M users, 6.8M repos)
Governed by foundations, e.g. Apache Software Foundation, Free Software Foundation
Contributors / committers: Academia, start-ups, corporations, specialised OSS companies
Popular Apache Software Projects
Project      Donated by...
Cassandra    Facebook (2008)
Storm        Twitter (2013)
Hadoop       Yahoo (2008)
Kafka        LinkedIn
Apache Software Foundation Sponsors
Google, Yahoo, Microsoft, Facebook, Citrix…
HP, IBM, Hortonworks, Cloudera, Comcast
Auto & General, Huawei, Pivotal, …
Talend, Twitter
Benefits, Drawbacks & Facts
Benefits
● No licence cost
● Huge amount of knowledge in the community
● High speed of innovation
● Funny names
Drawbacks
● Overwhelming choices
● Varying maturity
● Skills challenge (for newer projects)
Facts of Life
● Professional services / support not free
“Data Products”
Core: valuable data. Tools to display and manipulate.
Good: live, visual, searchable
Types:
● Exploratory
● Internal production
● Publicly facing (but free)
● Commercial = monetised
VOLUME
VARIETY
VELOCITY
VERACITY
Popular Data Products
Google Flights (not a booking engine!)
CIA World Fact Book (simple presentation)
Inside AirBnB (“activist”)
data.gov.uk
The Data Process
1. Obtain data
2. Explore & clean data
3. Analyse & model
4. Visualise
5. Productionise & automate Data Pipeline
a. How and where to distribute?
b. How to scale?
c. How to secure?
d. How to manage day-to-day?
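The five steps above can be sketched end to end on a toy dataset. This is only an illustration in Python using the standard library: the sample CSV, the column names and the function names are assumptions made for the example, not part of the talk.

```python
import csv
import io
from statistics import mean

# Hypothetical sample standing in for a downloaded listings file.
RAW_CSV = """neighbourhood,price,availability_365
Camden,120,200
Camden,,365
Hackney,80,90
Hackney,95,not available
Islington,150,30
"""


def obtain(text):
    """Step 1: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Step 2: drop rows with missing or non-numeric values."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append({
                "neighbourhood": row["neighbourhood"],
                "price": float(row["price"]),
                "availability_365": int(row["availability_365"]),
            })
        except ValueError:
            continue  # skip rows that cannot be parsed
    return cleaned


def analyse(rows):
    """Step 3: mean price per neighbourhood."""
    groups = {}
    for row in rows:
        groups.setdefault(row["neighbourhood"], []).append(row["price"])
    return {name: mean(prices) for name, prices in groups.items()}


def visualise(summary):
    """Step 4: crude text bar chart, one '#' per 10 units of price."""
    return [
        f"{name:<10} {'#' * int(value // 10)} {value:.2f}"
        for name, value in sorted(summary.items())
    ]


def run_pipeline(text):
    """Step 5: chain the steps so the whole run can be automated."""
    return visualise(analyse(clean(obtain(text))))


if __name__ == "__main__":
    print("\n".join(run_pipeline(RAW_CSV)))
```

Questions a. to d. (distribution, scaling, security, operations) start once a chain like `run_pipeline` has to run unattended on real data volumes; the sketch deliberately leaves them open.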
Data Exploration on One PC
Using ggplot2 for exploratory graphs
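library(ggplot2)  # provides qplot()

# Histogram of days available per year; `host` is the listings data frame
qplot(host$availability_365,
      geom = "histogram",
      binwidth = 5,
      main = "Histogram for Availability",
      xlab = "AirBnB in London",
      fill = I("blue"))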
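To make the `binwidth = 5` argument concrete, here is a minimal Python sketch of the counting a histogram performs. The availability values are made up, and the bin edges are simply floored to multiples of the bin width; ggplot2 may place its bin edges differently, so treat this as an illustration of the idea rather than a reproduction of the plot.

```python
from collections import Counter


def histogram_counts(values, binwidth=5):
    """Count values per bin; each bin is labelled by its lower edge."""
    return Counter((v // binwidth) * binwidth for v in values)


# Hypothetical availability_365 values (days available per year).
availability = [0, 3, 4, 5, 9, 12, 364, 365, 365]
counts = histogram_counts(availability, binwidth=5)

# Print one text bar per non-empty bin.
for lower in sorted(counts):
    print(f"[{lower:>3}, {lower + 5:>3})  {'#' * counts[lower]}")
```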
Statistical Analysis
SIMPLE
● Sum, Count, Mean / Median
● Variance / Standard Deviation
E.g. Average Revenue per User per Neighbourhood (by Month of the Year)
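As a rough sketch of these simple statistics, the Python below computes the summary figures and the average revenue per user per neighbourhood and month. The booking records, field order and function name are hypothetical, chosen only to make the example self-contained.

```python
from collections import defaultdict
from statistics import mean, median, pstdev, pvariance

# Hypothetical bookings: (user_id, neighbourhood, month, revenue)
BOOKINGS = [
    ("u1", "Camden", 1, 100.0),
    ("u1", "Camden", 1, 50.0),
    ("u2", "Camden", 1, 250.0),
    ("u3", "Hackney", 1, 80.0),
    ("u3", "Hackney", 2, 120.0),
]


def arpu_by_neighbourhood_and_month(bookings):
    """Average revenue per user for each (neighbourhood, month) pair."""
    revenue = defaultdict(float)
    users = defaultdict(set)
    for user, hood, month, amount in bookings:
        revenue[(hood, month)] += amount
        users[(hood, month)].add(user)  # count each user once per group
    return {key: revenue[key] / len(users[key]) for key in revenue}


amounts = [booking[3] for booking in BOOKINGS]
summary = {
    "sum": sum(amounts),
    "count": len(amounts),
    "mean": mean(amounts),
    "median": median(amounts),
    "variance": pvariance(amounts),  # population variance
    "std_dev": pstdev(amounts),      # population standard deviation
}
arpu = arpu_by_neighbourhood_and_month(BOOKINGS)

if __name__ == "__main__":
    print(summary)
    print(arpu)
```

Note that revenue is divided by the number of distinct users in each group, not by the number of bookings, which is what distinguishes average revenue per user from a plain mean.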
MORE COMPLEX
● Clustering
● Covariance matrix (dependencies between variables)
● Predictive Models
● Machine Learning
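Of the more complex techniques, the covariance matrix is the easiest to show in a few lines. The sketch below uses plain Python and invented listing data (`price`, `reviews`); in practice a statistics library would normally do this, so read it as an illustration of the calculation only.

```python
from statistics import mean


def covariance_matrix(columns):
    """Sample covariance matrix for a dict of equally long numeric columns."""
    names = list(columns)
    n = len(columns[names[0]])
    means = {name: mean(columns[name]) for name in names}
    matrix = {}
    for a in names:
        for b in names:
            total = sum(
                (x - means[a]) * (y - means[b])
                for x, y in zip(columns[a], columns[b])
            )
            matrix[(a, b)] = total / (n - 1)  # n - 1: sample covariance
    return matrix


# Hypothetical listings: nightly price and number of reviews.
listings = {
    "price": [50.0, 100.0, 150.0],
    "reviews": [30.0, 20.0, 10.0],
}
cov = covariance_matrix(listings)

if __name__ == "__main__":
    for pair, value in cov.items():
        print(pair, value)
```

A negative off-diagonal entry, as for price and reviews in this toy data, means that one variable tends to fall as the other rises.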
Big Data Architectures (simplified)
“Big” Database Hadoop Cluster / File System
Query Engine (Data Access)
Execution Engine (Business Logic)
Search Engine (Accessibility)
Visualisation Layer
Visualisation using Kibana
Trusted Analytics Platform - Brand New OSS
Interactive Notebooks
New breed of software to work interactively on data
Spark/Scala Notebook
Apache Zeppelin
Databricks: cloud (proprietary but built on Spark)