Open source for customer analytics


Matthias Funke, Business & Technology Consultant

http://www.linkedin.com/in/mfunke


Agenda Topics

Open Source Software

Data Products

The “Data Process”

Tying it together

Open Source Software

Examples: Linux, LibreOffice, Eclipse, Hadoop

Source Code open, e.g. github.com (>3M users, 6.8M repos)

Governed by foundations, e.g. Apache Software Foundation, Free Software Foundation

Contributors / committers: Academia, start-ups, corporations, specialised OSS companies

Popular Apache Software Projects

Project      Donated by...
Cassandra    Facebook (2008)
Storm        Twitter (2013)
Hadoop       Yahoo (2008)
Kafka        LinkedIn

Apache Software Foundation Sponsors

Google, Yahoo, Microsoft, Facebook, Citrix…

HP, IBM, Hortonworks, Cloudera, Comcast

Auto & General, Huawei, Pivotal, …

Talend, Twitter

Benefits, Drawbacks & Facts

Benefits

● No Licence Cost
● Huge amount of knowledge in the community
● High speed of innovation
● Funny names

Drawbacks

● Overwhelming choices
● Varying maturity
● Skills challenge (for newer projects)

Facts of Life

● Professional Services / Support not free

https://pixelastic.github.io/pokemonorbigdata/


“Data Products”

Core: valuable data. Tools to display and manipulate.

Good: live, visual, searchable

Types:

● Exploratory
● Internal production
● Publicly facing (but free)
● Commercial = monetised

VOLUME

VARIETY

VELOCITY

VERACITY

Popular Data Products

Google Flights (not a booking engine!)

CIA World Fact Book (simple presentation)

Inside AirBnB (“activist”)

data.gov.uk

http://insideairbnb.com/


The Data Process

1. Obtain data
2. Explore & clean data
3. Analyse & model
4. Visualise
5. Productionise & automate Data Pipeline

a. How and where to distribute?

b. How to scale?

c. How to secure?

d. How to manage day-to-day?

Data Exploration on One PC

Using ggplot2 for exploratory graphs

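library(ggplot2)  # provides qplot()

# Histogram of yearly availability for the listings in the data frame `host`
qplot(host$availability_365,
      geom = "histogram",
      binwidth = 5,
      main = "Histogram for Availability",
      xlab = "AirBnB in London",
      fill = I("blue"))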

Statistical Analysis

SIMPLE

● Sum, Count, Mean / Median

● Variance / Standard Deviation

E.g. Average Revenue per User per Neighbourhood (by Month of the Year)
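The simple statistics above map directly onto base R functions. The following is a minimal sketch on a small made-up data frame. The name `bookings`, its columns (`user`, `neighbourhood`, `month`, `revenue`) and all values are assumptions for illustration, not part of the Inside AirBnB data.

```r
# Hypothetical bookings data: every name and value here is made up for illustration.
bookings <- data.frame(
  user          = c("u1", "u1", "u2", "u3", "u3", "u4"),
  neighbourhood = c("Camden", "Camden", "Camden", "Hackney", "Hackney", "Hackney"),
  month         = c(1, 1, 1, 1, 2, 2),
  revenue       = c(100, 50, 90, 80, 120, 60)
)

# Simple statistics on a single column
total    <- sum(bookings$revenue)
n_rows   <- nrow(bookings)
avg      <- mean(bookings$revenue)
med      <- median(bookings$revenue)
variance <- var(bookings$revenue)   # sample variance (divides by n - 1)
std_dev  <- sd(bookings$revenue)    # square root of the sample variance

# Step 1: total revenue per user within each neighbourhood and month
per_user <- aggregate(revenue ~ user + neighbourhood + month,
                      data = bookings, FUN = sum)

# Step 2: average of those per-user totals per neighbourhood and month
arpu <- aggregate(revenue ~ neighbourhood + month,
                  data = per_user, FUN = mean)
print(arpu)
```

Aggregating in two steps (sum per user first, then mean across users) keeps a user with many small bookings from being counted several times in the average.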

MORE COMPLEX

● Clustering

● Covariance matrix (dependencies between variables)

● Predictive Models

● Machine Learning
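Two of the more complex analyses listed above, the covariance matrix and clustering, can be tried with base R alone. This is a minimal sketch on made-up data. The data frame `listings` and its values are assumptions for illustration, and k-means is used here only as one common clustering method, not necessarily the one meant on the slide.

```r
# Hypothetical listing data: names and values are made up for illustration.
listings <- data.frame(
  price            = c(50, 60, 55, 200, 220, 210),
  availability_365 = c(300, 320, 310, 40, 30, 50),
  reviews          = c(10, 12, 11, 80, 85, 90)
)

# Covariance matrix: entry [i, j] is the covariance between columns i and j
cov_matrix <- cov(listings)

# Correlation matrix: the same dependencies rescaled to the range [-1, 1]
cor_matrix <- cor(listings)

# k-means clustering into two groups, on standardised columns so that
# no single variable dominates the distance calculation
set.seed(42)   # k-means starts from random centres; fix the seed for repeatability
clusters <- kmeans(scale(listings), centers = 2, nstart = 10)

print(round(cor_matrix, 2))
print(clusters$cluster)
```

In this toy data, price and availability move in opposite directions, so their covariance is negative, and the two groups of three listings end up in separate clusters.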

Big Data Architectures (simplified)

“Big” Database Hadoop Cluster / File System

Query Engine (Data Access)

Execution Engine (Business Logic)

Search Engine (Accessibility)

Visualisation Layer

Visualisation using Kibana

Trusted Analytics Platform - Brand New OSS

Interactive Notebooks

New breed of software to work interactively on data

Spark/Scala Notebook

Apache Zeppelin

Databricks: cloud (proprietary but built on Spark)
