Telecom datascience master_public

Post on 17-Jan-2017



Data Science in the E-commerce Industry
Telecom Paris - Séminaires Big Data, 2016/06/09
Vincent Michel

Big Data Europe, BDD, Rakuten Inc. / PriceMinister

vincent.michel@rakuten.com @HowIMetYourData

2

Short Bio

ESPCI: engineer in Physics / Biology

ENS Cachan: MVA Master Mathematics Vision and Learning

INRIA Parietal team: PhD in Computer Science (understanding the visual cortex using classification techniques)

Logilab: development and data science consulting
• Data.bnf.fr (French National Library open-data platform)
• Brainomics (platform for heterogeneous medical data)

Education / Experience

Rakuten PriceMinister: Senior Developer and data scientist, data engineering and data science consulting

Software engineering
Lessons learned from (painful) experiences

4

Do not redo it yourself!

Lots of interesting open-source libraries for all your needs:
• Test first on a small POC, then contribute/develop
• Scikit-learn, pandas, Caffe, Scikit-image, OpenCV, …
• Be careful: it is easy to do something wrong!

Open data:
• More and more open data for catalogs, …
• E.g. data.bnf.fr

~ 2,000,000 authors
~ 200,000 works
~ 200,000 topics

Contribute to open-source:
• Is there a need / a pool of potential developers?
• Do it well (documentation / tests), unless you are doing some kind of super magical algorithm
• May bring you help, bug fixes, and engineers! But it takes time and energy

5

Quality in data science software engineering

Never underestimate integration cost:
• It is easy to write 20 lines of Python doing some fancy Random Forests…
• …that could be hard to deploy (data pipeline, packaging, monitoring)
• Developer != DevOps != Sys admin

Make it clean from the start (> 2 days of dev or > 100 lines of code):
• Tests, tests, tests, tests, tests, tests, tests, …
• Documentation
• Packaging / supervision / monitoring
• Release early, release often
• Agile development, pull requests, code versioning

Choose the right tool:
• Do you really need this super fancy NoSQL database to store your transactions?

6

Monitoring and metrics

Always monitor:
• Your development: continuous integration (Jenkins)
• Your service: Nagios/Shinken
• Your business data (BI): Kibana
• Your users: tracker
• Your data science process: e.g. A/B tests

Evaluation:
• Choose the right metric: prediction accuracy, precision-recall, …
• Always A/B test rather than relying on personal intuition
• A good question leads to a good answer: define your problem
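As an illustration, the outcome of an A/B test on conversion rates can be checked with a two-proportion z-test; a minimal sketch using only the standard library (all counts are made-up numbers):

```python
import math

def ab_test_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test on conversion counts."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled conversion rate
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Hypothetical numbers: 1,000 conversions out of 50,000 vs 1,100 out of 50,000
p = ab_test_pvalue(1000, 50000, 1100, 50000)
print('significant at 5%:', p < 0.05)
```

The point is that a seemingly clear lift (2.0% vs 2.2%) still needs a test before replacing personal intuition with a decision.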

Hiring remarks
Selling yourself as a (good) data scientist

8

Few remarks on hiring – my personal opinion

Be careful of CVs with buzzwords!
• E.g. "IT skills: SVM (linear, non-linear), Clustering (K-means, hierarchical), Random Forests, Regularization (L1, L2, Elastic Net…), …"
• It is like someone saying "IT skills: Python (for loops, if/else, …)"

Often found in Junior CVs (ok), but huge warning in Senior CVs

Hungry for data?
• Loving data is the most important thing to check
• Open data? Personal projects? Curious about data? (Hackathons?)
• Pluridisciplinary == knowing how to handle various datasets

Check for IT skills:
• Should be able to install/develop new libraries/algorithms
• A huge part of the job can be formatting / cleaning up the data
• Experience vs. education -> autonomy

Recommendations @Rakuten
Data science use-case

10

Rakuten Group Worldwide

Recommendation challenges

• Different languages
• User behavior
• Business areas

11

Rakuten Group in Numbers

Rakuten in Japan

> 12,000 employees
> 48 billion euros of GMS
> 100,000,000 users
> 250,000,000 items
> 40,000 merchants

Rakuten Group

Kobo: 18,000,000 users
Viki: 28,000,000 users
Viber: 345,000,000 users

12

Rakuten Ecosystem

Rakuten global ecosystem:
• Member-based business model that connects Rakuten services
• Rakuten ID common to the various Rakuten services
• Online shopping and services

Main business areas:
• E-commerce
• Internet finance
• Digital content

Recommendation challenges:
• Cross-services
• Aggregated data
• Complex user features

13

Rakuten’s e-commerce: B2B2C Business Model

Business to Business to Consumer:
• Merchants located in different regions / online virtual shopping mall
• Main profit sources:

• Fixed fees from merchants
• Fees based on each transaction and other services

Recommendation challenges

• Many shops
• Items references
• Global catalog

14

Big Data Department @ Rakuten

Big Data Department
150+ engineers, Japan / Europe / US

Missions

Development and operations of internal systems for:

• Recommendations
• Search
• Targeting
• User behavior tracking

Average traffic

> 100,000,000 events / day
> 40,000,000 item views / day
> 50,000,000 searches / day
> 750,000 purchases / day

Technology stack:
• Java / Python / Ruby
• Solr / Lucene
• Cassandra / Couchbase
• Hadoop / Hive / Pig
• Redis / Kafka

15

Recommendations on Rakuten Marketplaces

Non-personalized recommendations:
• All-shop recommendations: item-to-item, user-to-item
• In-shop recommendations
• Review-based recommendations

Personalized recommendations:
• Purchase history recommendations
• Cart add recommendations
• Order confirmation recommendations

System status and scale:
• In production in over 35 services of the Rakuten Group worldwide
• Several hundred servers running Hadoop, Cassandra, and APIs

Recommendations
The big picture

17

Challenges in Recommendations

Pipeline: items catalogue → items similarity → recommendations engine → evaluation process

• Items catalogues: a catalogue for multiple shops with different item references?
• Items similarity / distances: cross-services aggregation? Lots of parameters?
• Recommendations engine: best / optimal recommendations logic?
• Evaluation process: offline / online evaluation? Long tail? KPIs?

18

Recommendations Architecture: Constantly Evolving

(Architecture diagram: browsing events and purchase events feed a co-counts storage, combined with the catalogue(s) and served through a distribution layer.)

• Recommendations: offline / materialized
• Recommendations: online algebra / multi-arm

19

Items Catalogues

Use different levels of aggregation to improve recommendations:
• Category level (e.g. food, soda, clothes, …)
• Product level (manufactured items)
• Item-in-shop level (a specific product sold by a specific shop)

Benefits:
• Increased statistical power in co-events computation
• Easier business handling (picking the right item)

20

Enriching Catalogues using Record Linkage

Marketplace 1 and Marketplace 2 aligned against a reference database.

Record linkage:
• Use external sources (e.g., Wikidata) to align marketplaces' products
• Fuzzy matching of 600K vs 350K items for the movies alignment use case
• Blocking algorithm

Cross recommendations:
• Global catalog / items aggregation
• Helps with cold-start issues
• Improved navigation
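The blocking idea can be sketched as follows: candidate pairs are generated only inside a block (here a hypothetical key built from a title prefix and the year), then scored with a fuzzy string match (difflib stands in for whatever similarity measure is actually used):

```python
from collections import defaultdict
from difflib import SequenceMatcher

def blocking_key(title, year):
    # Hypothetical key: first 4 letters of the normalized title + release year
    return (title.lower().replace(' ', '')[:4], year)

def link_records(catalog_a, catalog_b, threshold=0.85):
    """Return (id_a, id_b) pairs whose titles fuzzy-match within a block."""
    blocks = defaultdict(list)
    for id_b, (title, year) in catalog_b.items():
        blocks[blocking_key(title, year)].append((id_b, title))
    matches = []
    for id_a, (title, year) in catalog_a.items():
        # Only compare against items sharing the same blocking key
        for id_b, title_b in blocks.get(blocking_key(title, year), []):
            score = SequenceMatcher(None, title.lower(), title_b.lower()).ratio()
            if score >= threshold:
                matches.append((id_a, id_b))
    return matches

catalog_a = {1: ('Terminator 2: Judgment Day', 1991), 2: ('Alien', 1979)}
catalog_b = {'x': ('Terminator 2 - Judgment Day', 1991), 'y': ('Aliens', 1986)}
print(link_records(catalog_a, catalog_b))  # [(1, 'x')]
```

Blocking keeps the 600K × 350K comparison tractable: instead of all pairs, only pairs within a block are scored.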

21

Semantic-web and RDF format

Triples: <subject> <relation> <object>
URI: a unique identifier, e.g.:

http://dbpedia.org/page/Terminator_2:_Judgment_Day

Recommendations
Co-counts and matrices

23

Recommendation datatypes

Ratings:
• Numerical feedback from the users
• Sources: stars, reviews, …
✔ Qualitative and valuable data
✖ Hard to obtain; needs scaling and normalization!

(Example: a sparse users × items matrix of ratings from 1 to 5.)

Unitary data:
• Only 0/1, without any quality feedback
• Sources: clicks, purchases, …
✔ Easy to obtain (e.g. tracker)
✖ No direct rating

(Example: a sparse users × items binary matrix, 1 where the user interacted with the item.)

24

Collaborative filtering

User-user:
• #items < #users
• Items are changing quickly

(Users × items ratings matrix with a missing rating "?" to predict.)

1 – Compute users similarities(cosine-similarity, Pearson)

2 – Weighted average of ratings
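A minimal sketch of these two steps on a toy ratings dictionary (the data is made up; a real system would work on sparse matrices):

```python
import math

ratings = {  # user -> {item: rating}, hypothetical toy data
    'u1': {'a': 1, 'b': 3, 'c': 2},
    'u2': {'b': 5, 'd': 2},
    'u3': {'a': 2, 'c': 4, 'd': 1},
}

def cosine(r1, r2):
    """Step 1: cosine similarity between two users' rating vectors."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    num = sum(r1[i] * r2[i] for i in common)
    den = math.sqrt(sum(v * v for v in r1.values())) * \
          math.sqrt(sum(v * v for v in r2.values()))
    return num / den

def predict(user, item):
    """Step 2: similarity-weighted average of the other users' ratings."""
    sims = [(cosine(ratings[user], r), r[item])
            for u, r in ratings.items() if u != user and item in r]
    norm = sum(s for s, _ in sims)
    return sum(s * v for s, v in sims) / norm if norm else None

pred = predict('u1', 'd')  # lies between the neighbours' ratings for 'd'
```

Pearson correlation can be substituted for cosine by centering each user's ratings first.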

Item-item:
• #items >> #users

25

Matrix factorization

(Diagram: the users × items ratings matrix is approximated by the product of a users × latent-factors matrix and a latent-factors × items matrix.)

• Choose a number of latent variables to decompose the data
• Predict new ratings using the product of latent vectors
• Use gradient descent techniques (e.g. SGD)
• Add some regularization
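These steps can be sketched as a plain SGD loop on toy data (the latent dimension k, learning rate, and regularization strength are arbitrary choices for illustration, not production values):

```python
import random

random.seed(0)
# toy observed ratings: (user, item, rating) triples
observed = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0), (2, 2, 2.0)]
n_users, n_items, k = 3, 3, 2        # k = number of latent variables
lr, reg = 0.02, 0.02                 # learning rate, L2 regularization strength

U = [[random.random() for _ in range(k)] for _ in range(n_users)]
V = [[random.random() for _ in range(k)] for _ in range(n_items)]

for epoch in range(500):
    for u, i, r in observed:
        # predicted rating = product of the two latent vectors
        pred = sum(U[u][f] * V[i][f] for f in range(k))
        err = r - pred
        for f in range(k):           # SGD step with L2 regularization
            u_f = U[u][f]
            U[u][f] += lr * (err * V[i][f] - reg * U[u][f])
            V[i][f] += lr * (err * u_f - reg * V[i][f])
```

After training, each observed rating is reproduced (up to the small bias introduced by the regularization term) by the dot product of the corresponding latent vectors.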

26

Matrix factorization – MovieLens example

Read files:

    import csv

    movies_fname = '/path/ml-latest/movies.csv'
    with open(movies_fname) as fobj:
        movies = dict((r[0], r[1]) for r in csv.reader(fobj))

    ratings_fname = '/path/ml-latest/ratings.csv'
    with open(ratings_fname) as fobj:
        header = next(fobj)  # skip the header line
        ratings = [(r[0], movies[r[1]], float(r[2])) for r in csv.reader(fobj)]

Build sparse matrix:

    import scipy.sparse as sp

    user_idx, item_idx = {}, {}
    data, rows, cols = [], [], []
    for u, i, s in ratings:
        rows.append(user_idx.setdefault(u, len(user_idx)))
        cols.append(item_idx.setdefault(i, len(item_idx)))
        data.append(s)
    ratings = sp.csr_matrix((data, (rows, cols)))
    reverse_item_idx = dict((v, k) for k, v in item_idx.items())
    reverse_user_idx = dict((v, k) for k, v in user_idx.items())

27

Matrix factorization – MovieLens example

Fit Non-negative Matrix Factorization:

    from sklearn.decomposition import NMF

    nmf = NMF(n_components=50)
    user_mat = nmf.fit_transform(ratings)
    item_mat = nmf.components_

Print results:

    component_ind = 3
    component = [(reverse_item_idx[i], s)
                 for i, s in enumerate(item_mat[component_ind, :]) if s > 0.]
    for movie, score in sorted(component, key=lambda x: x[1], reverse=True)[:10]:
        print(movie, round(score))

Terminator 2: Judgment Day (1991) 24.0
Terminator, The (1984) 23.0
Die Hard (1988) 19.0
Aliens (1986) 17.0
Alien (1979) 16.0

Exorcist, The (1973) 8.0
Halloween (1978) 7.0
Nightmare on Elm Street, A (1984) 7.0
Shining, The (1980) 7.0
Carrie (1976) 7.0

Star Trek II: The Wrath of Khan (1982) 10.0
Star Trek: First Contact (1996) 10.0
Star Trek IV: The Voyage Home (1986) 9.0
Contact (1997) 8.0
Star Trek VI: The Undiscovered Country (1991) 8.0
Blade Runner (1982) 8.0

28

Binary / Unitary data

Only occurrences of item views/purchases/…

• Jaccard distance

• Cosine similarity

• Conditional probability
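On unitary data, each of these measures reduces to set operations on the users who interacted with each item; a minimal sketch on made-up view sets (Jaccard is shown as a similarity, the distance being one minus it):

```python
import math

# Hypothetical sets of users who viewed each item
viewers = {
    'a': {1, 2, 3, 4},
    'b': {2, 3, 4, 5},
    'c': {1, 5},
}

def jaccard(x, y):
    """|A ∩ B| / |A ∪ B| (symmetric)."""
    return len(viewers[x] & viewers[y]) / len(viewers[x] | viewers[y])

def cosine(x, y):
    """|A ∩ B| / sqrt(|A| * |B|) (symmetric)."""
    return len(viewers[x] & viewers[y]) / math.sqrt(len(viewers[x]) * len(viewers[y]))

def conditional(x, y):
    """P(y | x) = |A ∩ B| / |A| (asymmetric: conditional(x, y) != conditional(y, x))."""
    return len(viewers[x] & viewers[y]) / len(viewers[x])

print(jaccard('a', 'b'), cosine('a', 'b'), conditional('a', 'b'))
```

Note that conditional probability is the only asymmetric one, which is why the choice interacts with the symmetric/asymmetric co-occurrence question below.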

29

Co-occurrences and Similarities Computation

Only access to unitary data (purchase / browsing)

Use co-occurrences for computing items similarity

Multiple possible parameters:
• Size of the time window to be considered: does browsing and purchase data reflect similar behavior?
• Threshold on co-occurrences: is one co-occurrence significant enough to be used? Two? Three?
• Symmetric or asymmetric: is the order important in the co-occurrence? A then B == B then A?
• Similarity metrics: which similarity metric to use based on the co-occurrences?
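A sketch of how such co-occurrences could be counted, with the time window, the threshold, and the symmetry as explicit parameters (the event format and all values are hypothetical):

```python
from collections import Counter
from itertools import combinations

def cooccurrences(events, window=3600, symmetric=True, min_count=2):
    """events: (user, item, timestamp) triples. Count item pairs seen by the
    same user within `window` seconds, keeping pairs above `min_count`."""
    per_user = {}
    for user, item, ts in events:
        per_user.setdefault(user, []).append((ts, item))
    counts = Counter()
    for hits in per_user.values():
        hits.sort()                      # chronological order, so t1 <= t2 below
        for (t1, i1), (t2, i2) in combinations(hits, 2):
            if i1 != i2 and t2 - t1 <= window:
                # symmetric: A-B == B-A; asymmetric: keep the "i1 then i2" order
                key = tuple(sorted((i1, i2))) if symmetric else (i1, i2)
                counts[key] += 1
    return {pair: c for pair, c in counts.items() if c >= min_count}

events = [('u1', 'a', 0), ('u1', 'b', 100), ('u2', 'a', 0),
          ('u2', 'b', 50), ('u3', 'a', 0), ('u3', 'c', 999999)]
print(cooccurrences(events))  # {('a', 'b'): 2}
```

Each of the questions above maps to one parameter: `window` for the session size, `min_count` for the significance threshold, and `symmetric` for whether order matters.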

30

Co-occurrences Example

(Timeline example: browsing and purchase events with their dates, grouped into candidate sessions and two time windows.)

31

Co-occurrences Computation

Classical co-occurrences:
• Co-purchases → complementary items ("You may also want…")
• Co-browsing → substitute items ("Similar items…")

Other possible co-occurrences:
• Items browsed and bought together
• Items browsed and not bought together

(Timeline example: browsing and purchase events with their dates.)

Recommendations
Development and evaluation

33

Recommendations Algebra

Algebra for defining and combining recommendations engines

Key ideas:
• Reuse already existing logics and combine them easily
• Write business logic, not code!
• Handle multiple input/output formats

Available logics:
• Content-based
• Collaborative filtering: item-item, user-item (personalization)

Available backends:
• In-memory
• HDF5 files
• Cassandra
• Couchbase

Available hybridizations:
• Linear algebra / weighting
• Mixed
• Cascade engines
• Meta-level

34

Python Algebra Example

Composite engine = purchase-based engine (top-20, asymmetric, conditional probability) + 0.2 × browsing-based engine (similarity > 0.01, symmetric, cosine similarity):

>>> engine1 = RecommendationsEngine(nb_recos=20, datatype='purchase',
...                                 asymmetric=True,
...                                 distance='conditional_probability')
>>> engine2 = RecommendationsEngine(similarity_th=0.01, datatype='browsing',
...                                 asymmetric=False,
...                                 distance='cosine_similarity')
>>> composite_engine = engine1 + 0.2 * engine2

Get recommendations from items (item-to-item)

>>> recos = composite_engine.recommendations_by_items([123, 456, 789, …])
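RecommendationsEngine is part of an internal system; purely as an illustration, here is a hypothetical sketch of how such an algebra can be built with Python operator overloading (all class names, the toy similarity tables, and the scoring logic are invented):

```python
class Engine:
    """Any engine maps a list of items to {candidate_item: score}."""
    def __add__(self, other):
        return Composite(self, other)
    def __rmul__(self, weight):              # enables `0.2 * engine`
        return Weighted(self, weight)

class Lookup(Engine):
    """Toy engine backed by a precomputed similarity table."""
    def __init__(self, table):
        self.table = table
    def recommendations_by_items(self, items):
        scores = {}
        for item in items:
            for cand, s in self.table.get(item, {}).items():
                scores[cand] = scores.get(cand, 0.0) + s
        return scores

class Weighted(Engine):
    def __init__(self, engine, weight):
        self.engine, self.weight = engine, weight
    def recommendations_by_items(self, items):
        return {c: self.weight * s
                for c, s in self.engine.recommendations_by_items(items).items()}

class Composite(Engine):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def recommendations_by_items(self, items):
        scores = self.left.recommendations_by_items(items)
        for c, s in self.right.recommendations_by_items(items).items():
            scores[c] = scores.get(c, 0.0) + s
        return scores

purchase = Lookup({123: {456: 0.9, 789: 0.4}})   # invented similarity scores
browsing = Lookup({123: {789: 0.5}})
composite = purchase + 0.2 * browsing
print(composite.recommendations_by_items([123]))
```

The same pattern extends to cascades and the other hybridizations by adding further operator overloads, which is what makes "write business logic, not code" possible.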

35

Python Algebra with Personalization

Same composite engine as before, plus a purchase-history engine (time window of 180 days, time decay 0.01):

>>> history = HistoryEngine(datatype='purchase', time_window=180, time_decay=0.01)
>>> engine1.register_history_engine(history)

Same code as previously, now user-to-item:

>>> recos = composite_engine.recommendations_by_user('userid')

36

Python Algebra – Complete Example

A cascade (×) of two composite engines:
• Engine A: purchase-based (top-20, asymmetric, conditional probability) + 0.2 × browsing-based (similarity > 0.01, symmetric, cosine similarity), with a purchase-history engine (time window 180 days, time decay 0.01)
• Engine B (category level): purchase-based (similarity > 0.01, asymmetric, conditional probability) + 0.1 × browsing-based (similarity > 0.1, symmetric, cosine similarity)

37

Recommendation Quality Challenges

Recommendations categories

Cold start issue:
• External data?
• Cross-services?

• Hot products (A): top-N items?
• Short tail (B)
• Long tail (C + D)

(Figure: items mapped by age (new/old product) and popularity (major/minor product), with zones (A) popular, (B) short tail, (C) and (D) long tail.)

38

Long Tail is Fat

Long tail numbers

• Most of the items are long tail
• They still represent a large portion of the traffic

Long tail approaches

• Content-based
• Aggregation / clustering
• Personalization

(Figure: browsing share vs. number of items, split into popular, short tail, and long tail.)

39

Recommendations Offline Evaluation

Pros/Cons

• Convenient way to try new ideas
• Fast and cheap
• But hard to align with online KPIs

Approaches

• Rescoring
• Prediction game
• Business simulator

40

Public Initiative – Viki Recommendation Challenge

567 submissions from 132 participants
http://www.dextra.sg/challenges/rakuten-viki-video-challenge

41

Data science everywhere!

Rakuten provides marketplaces worldwide

Specific challenges for recommendations

Items catalogue: reinforce the statistical power of co-occurrences across shops and services.

Items similarities: find the right parameters for the different use cases.

Recommendations models: what are the best models for in-shop, all-shops, personalization?

Evaluation: how to handle the long tail? How to compare different models?

43

We are Hiring!

Big Data Department – team in Parishttp://global.rakuten.com/corp/careers/bigdata/

http://www.priceminister.com/recrutement/?p=197  

Data Scientist / Software Developer

• Build algorithms for recommendations, search, targeting
• Predictive modeling, machine learning, natural language processing
• Working close to the business
• Python, Java, Hadoop, Couchbase, Cassandra…

Also hiring: search engine developers, big data system administrators, etc.