Big data solutions for advanced marketing analytics

Post on 28-Nov-2014

description

Now more than ever, the retail banking market demands that we stay close to our customers and understand which services, products, and wishes are relevant to each customer at any given time. This sort of marketing research is often beyond the capacity of traditional BI reporting frameworks. In this talk, we illustrate how we team up data scientists and big data engineers to create and scale distributed analyses on a big data platform. By using Hadoop and open source statistical languages and tools such as R and Python, we can execute a variety of machine learning algorithms and scale them out on a distributed computing framework.

Transcript of Big data solutions for advanced marketing analytics

Big Data Solutions for Marketing Analytics

Natalino Busa @natalinobusa

Parallelism Hadoop Cassandra Akka

Machine Learning Statistics Big Data

Algorithms Cloud Computing Scala Spray

www.natalinobusa.com

Humanize Data

The bank statements

Back to routine. Grocery, broken washing machine.

After-vacation fun. Pancake house.

Traveling back.

Just back home. Pizza.

Shopping in Sicily. Vacation!

The bank statements. How I read the bank bills. What happened those days.

Data is the fabric of our lives. Let’s give more meaning and context to data.

Abraham Harold Maslow (April 1, 1908 – June 8, 1970) was an American psychologist who was best known for creating Maslow's hierarchy of needs

Physiology: breathing, food, water, sleep

Contractual: security of body, resources, health, employment, property

Love & Caring: friends, family, partner; security of love and belonging

Esteem: self-esteem, confidence, achievements, respect

Self-actualization: spontaneity, creativity, acceptance, freedom, ethics

Very human needs

How caring can technology be?

Physiology: connectivity, electricity, hardware / infra

Contractual: security of basic operations; REST APIs, encryption, authentication

Love & Caring: notifications, alerts, social bonding, predictions

Esteem: set goals, planning, achievements, advisory role

Self-actualization: freedom, trusted companion

Technology is reaching out

Data science top 3

Dimensionality Reduction

Predictive Analytics

Clustering and Segmentation
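To make clustering and segmentation concrete, here is a minimal k-means sketch in plain Python. The customer tuples are invented for illustration, and the function is a teaching version rather than what the talk used; a real analysis would more likely rely on a library such as sklearn.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

# Hypothetical customers: (transactions per month, average amount in euros)
customers = [(5, 20.0), (6, 22.0), (4, 18.0), (40, 250.0), (42, 260.0), (38, 240.0)]
centroids, segments = kmeans(customers, k=2)
print(segments)  # one segment label per customer
```

The two features have very different scales here; on real data one would normally standardize them first so that no single feature dominates the distance.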

Data science: what’s working?

- Random Forests

- Artificial Neural Networks

- Clustering Algorithms

- Pattern Recognition

- Time-series analysis

- Regression

Most actual models are a combination of these.

Data science ^.^/

keep it scientific

cross-validate your models

keep it measurable

play with it

create new features

explore the available data
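As an illustration of "cross-validate your models" and "keep it measurable", the sketch below runs k-fold cross-validation on a one-variable least-squares fit using only the standard library. The data is synthetic and the helper names (`fit_line`, `cross_validate`) are made up for this example.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def cross_validate(xs, ys, k=5, seed=0):
    """Return the mean squared error measured on each of k held-out folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # Fit only on the training part, then score on unseen points.
        intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - (intercept + slope * xs[i])) ** 2 for i in held_out]
        scores.append(sum(errors) / len(errors))
    return scores

# Synthetic data: y = 3 + 2x plus small Gaussian noise (invented for illustration).
noise = random.Random(1)
xs = [float(i) for i in range(50)]
ys = [3.0 + 2.0 * x + noise.gauss(0, 1) for x in xs]
scores = cross_validate(xs, ys, k=5)
print("MSE per fold:", [round(s, 2) for s in scores])
```

Reporting the per-fold errors rather than a single number shows how stable the model is across different subsets of the data.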

How to code data science?

# Multiple Linear Regression Example

fit <- lm(y ~ x1 + x2 + x3, data=mydata)

summary(fit) # show results

● Language for statistics

● Easy to analyze and shape data

● Advanced statistical package

● Fueled by academia and professionals

● Very clean visualization packages

Packages for machine learning: time-series forecasting, clustering, classification, decision trees, neural networks

Remote procedure calls (RPC): from Scala/Java via RProcess and Rserve

Data Science: R

>>> from sklearn.datasets import load_iris
>>> from sklearn import tree
>>> iris = load_iris()
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(iris.data, iris.target)

● Flexible, concise language

● Quick to code and prototype

● Portable, visualization libraries

Machine learning libraries: scipy, statsmodels, sklearn, matplotlib, ipython

Web libraries: flask, tornado, (no)SQL clients

Data Science: Python

Earn the trust

The customer’s context

Personal history: number of transactions ever done

Long-term interaction: how the user’s actions correlate with those of others

Real-time events: trends and recent events

The customer’s context

context is related to time:

slow changing: the defining characteristic of a person

fast changing: events which influence our lives, trends

These require very different technology solutions!

Challenges

Not much time to react

Events must be delivered fast to the new machine APIs

It’s web and mobile apps: the latency budget is limited

Loads of information to process

Understand the user history well

Access a larger context

Big Data and Fast data

ranking and preference

segmentation and clustering

short term trending topics

rule-based recommendations

Tens of terabytes of data. This can take hours …

Hundreds of events per second. This must be fast …
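The fast-data side, such as short-term trending topics, can be sketched as a sliding-window counter: each event is counted on arrival and discounted once it falls out of the window, so a query never rescans history. The class name, window length, and event stream below are assumptions made for this example.

```python
from collections import Counter, deque

class TrendingWindow:
    """Counts events per topic over the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()   # (timestamp, topic), oldest first
        self.counts = Counter()

    def add(self, timestamp, topic):
        self.events.append((timestamp, topic))
        self.counts[topic] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that fell out of the window; each event is removed once.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_topic = self.events.popleft()
            self.counts[old_topic] -= 1
            if self.counts[old_topic] == 0:
                del self.counts[old_topic]

    def top(self, n):
        """Return the n most frequent topics currently inside the window."""
        return self.counts.most_common(n)

# Hypothetical stream of (timestamp in seconds, topic) events.
window = TrendingWindow(window_seconds=60)
for ts, topic in [(0, "travel"), (10, "pizza"), (20, "pizza"), (70, "grocery"), (75, "pizza")]:
    window.add(ts, topic)
print(window.top(1))  # [('pizza', 2)]
```

Each event is appended once and removed once, so the cost per event stays constant on average, which is what a tight latency budget needs.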

Back to the drawing board

core banking systems

SOAP services and DBs

System BUS

customer-facing apps

channels

A high-level bank schematic

Higher separation!

Fewer silos

Interactions with core systems

Bigger and Faster

Human-centric applications

Some techs

Hadoop: Distributed Data OS

Reliable: distributed, replicated file system

Low cost: ↓ cost vs ↑ performance/storage

Computing powerhouse: all the cluster’s CPUs work in parallel to run queries
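The idea of all CPUs working in parallel on a query can be sketched with a map step over data partitions followed by a reduce step that merges the partial results. This is plain Python with a thread pool standing in for cluster nodes; it does not use any Hadoop API, and the transaction data is invented.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_partition(records):
    """Map step with local combine: partial totals per category for one block."""
    partial = defaultdict(float)
    for category, amount in records:
        partial[category] += amount
    return dict(partial)

def reduce_partials(partials):
    """Reduce step: merge the partial totals coming from every block."""
    total = defaultdict(float)
    for partial in partials:
        for category, amount in partial.items():
            total[category] += amount
    return dict(total)

# Hypothetical transaction blocks, as if stored on three different nodes.
partitions = [
    [("grocery", 42.5), ("travel", 310.0)],
    [("grocery", 17.5), ("restaurant", 28.0)],
    [("travel", 90.0), ("restaurant", 12.0)],
]

# Each worker processes its own block independently of the others.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(map_partition, partitions))

totals = reduce_partials(partials)
print(totals)  # {'grocery': 60.0, 'travel': 400.0, 'restaurant': 40.0}
```

Because the map step only touches local data, adding nodes adds both storage and compute; only the small partial results have to travel over the network.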

Cassandra: A low-latency 2D store

Reliable: distributed, replicated data store

Low latency: sub-millisecond read/write operations

Tunable CAP: define your level of consistency

Data model: hashed rows, sorted wide columns

Architecture model: no SPOF, ring of nodes, homogeneous system
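The "hashed rows, sorted wide columns" data model can be illustrated with a toy in-memory store: the row key is hashed to pick a node, and columns inside a row are kept sorted so range slices are cheap. This is only a sketch under simplifying assumptions (SHA-256 with modulo placement instead of a token ring, no replication), not Cassandra's implementation or API.

```python
import bisect
import hashlib

class WideColumnStore:
    """Toy model: row keys are hashed to a node, columns stay sorted per row."""

    def __init__(self, node_count):
        self.nodes = [dict() for _ in range(node_count)]

    def _node_for(self, row_key):
        # Hash the row key to choose the node that owns the row.
        digest = hashlib.sha256(row_key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, row_key, column, value):
        row = self._node_for(row_key).setdefault(row_key, [])
        position = bisect.bisect_left(row, (column,))
        if position < len(row) and row[position][0] == column:
            row[position] = (column, value)        # overwrite existing column
        else:
            row.insert(position, (column, value))  # keep columns sorted

    def slice(self, row_key, start, end):
        """Return columns with start <= name < end, in sorted order."""
        row = self._node_for(row_key).get(row_key, [])
        low = bisect.bisect_left(row, (start,))
        high = bisect.bisect_left(row, (end,))
        return row[low:high]

# Hypothetical usage: one row per customer, one column per transaction date.
store = WideColumnStore(node_count=3)
store.put("customer-42", "2014-11-03", "pizza")
store.put("customer-42", "2014-11-01", "flight")
store.put("customer-42", "2014-11-10", "grocery")
first_week = store.slice("customer-42", "2014-11-01", "2014-11-08")
print(first_week)  # [('2014-11-01', 'flight'), ('2014-11-03', 'pizza')]
```

Using the date as the column name means a customer's recent history is one contiguous slice on a single node, which is what keeps reads fast.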

Scala / Akka / Spray: a WEB API reactive framework

[Diagram: Actor A, Actor B, and Actor C exchanging messages msg 1 to msg 4]

● it scales horizontally (can run in cluster mode)

● maximum use of the available cores/memory

● processing is non-blocking, threads are re-used

● can parallelize computing power across many actors

Very fast: thousands of messages per second

Very reliable: auto recovery

Lazy: compute only when required
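The actor idea behind Akka can be sketched in a few lines of Python: an actor owns private state and a mailbox, and handles one message at a time, so senders never block and no locks are needed around the state. This is a stand-in for illustration only. Akka itself runs on the JVM and schedules many actors on a shared thread pool, whereas this sketch uses one thread per actor; the class name and messages are hypothetical.

```python
import queue
import threading

class CounterActor:
    """A tiny actor: private state, a mailbox, one message at a time."""

    _STOP = object()  # sentinel message that ends the processing loop

    def __init__(self):
        self.mailbox = queue.Queue()
        self.totals = {}
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def tell(self, message):
        """Fire-and-forget send; the caller never waits for processing."""
        self.mailbox.put(message)

    def _run(self):
        while True:
            message = self.mailbox.get()
            if message is self._STOP:
                break
            customer, amount = message
            # Only this thread touches self.totals, so no lock is required.
            self.totals[customer] = self.totals.get(customer, 0) + amount

    def stop(self):
        """Process everything already in the mailbox, then shut down."""
        self.mailbox.put(self._STOP)
        self._thread.join()

actor = CounterActor()
for msg in [("alice", 10), ("bob", 5), ("alice", 7)]:
    actor.tell(msg)
actor.stop()
print(actor.totals)  # {'alice': 17, 'bob': 5}
```

Messages are processed in mailbox order, which is why the final totals are deterministic even though sending and processing happen on different threads.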

Putting it all together

[Diagram: data queues, Hadoop, and an actor-based application; labels: λ (lambda), "millions of millions of conversions"]
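Assuming the λ on this slide refers to the lambda architecture, the combination of batch and real-time processing can be sketched as two views merged at query time: a batch view recomputed from the full history (the Hadoop side) and a real-time view updated per event (the actor side). The class and method names are hypothetical and the sketch keeps everything in memory.

```python
from collections import Counter

class LambdaView:
    """Serves counts as batch view + real-time increments since the last batch."""

    def __init__(self):
        self.batch_view = Counter()     # rebuilt periodically from all history
        self.realtime_view = Counter()  # updated per event, reset after a batch

    def on_event(self, customer):
        """Speed layer: cheap incremental update for each incoming event."""
        self.realtime_view[customer] += 1

    def run_batch(self, full_history):
        """Batch layer: recompute from scratch; assumes full_history is complete."""
        self.batch_view = Counter(full_history)
        self.realtime_view.clear()

    def query(self, customer):
        """Serving layer: merge both views to answer with up-to-date numbers."""
        return self.batch_view[customer] + self.realtime_view[customer]

# Hypothetical event history, one entry per transaction.
history = ["alice", "bob", "alice"]
view = LambdaView()
view.run_batch(history)

# A new event arrives between two batch runs.
history.append("alice")
view.on_event("alice")
print(view.query("alice"))  # 3

# The next batch run absorbs the event, and the real-time view is reset.
view.run_batch(history)
print(view.query("alice"))  # 3
```

The slow batch job can take hours without hurting freshness, because the real-time view covers the gap until the next recomputation.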

Science & Engineering

Statistics, Data Science

Python, R, Visualization

IT Infra, Big Data

Java, Scala, SQL

Hadoop: Big Data Infrastructure, Data Science on large datasets

Big Data and Fast Data require different profiles to achieve the best results

Some lessons learned

● Mixing and matching technologies is a good thing

● Fast Data must complement Big Data

● Ease integration among teams

● Hadoop, Cassandra, and Akka

● Data Science takes time to figure out

Parallelism Mathematics Programming

Languages Machine Learning Statistics

Big Data Algorithms Cloud Computing

Natalino Busa @natalinobusa

www.natalinobusa.com

Thanks! Any questions?