Powered by Python - PyCon Germany 2016


Transcript of Powered by Python - PyCon Germany 2016

Page 1: Powered by Python - PyCon Germany 2016

Powered by Python

Summarizing hotel reviews for 100 million travelers

Steffen Wenz, CTO

[email protected]

Page 2: Powered by Python - PyCon Germany 2016

10,000 hotels use TrustYou Analytics to analyze their guest reviews.

100 million travelers see our data on Google, Hotels.com, Kayak … actually it’s probably more.

Page 3: Powered by Python - PyCon Germany 2016
Page 4: Powered by Python - PyCon Germany 2016

Architecture ;-)

Hadoop Cluster

(Hortonworks Distribution)

Big Data Python

Machine Learning

NLP

Scraping API

Magic

Love

Page 5: Powered by Python - PyCon Germany 2016

Hadoop:

… slow & massive

Page 6: Powered by Python - PyCon Germany 2016

Python on Hadoop:

… possible, but not natural

Page 7: Powered by Python - PyCon Germany 2016
Page 8: Powered by Python - PyCon Germany 2016

Let’s try Spark!

$ # how old is the C code in CPython?
$ git clone https://github.com/python/cpython && cd cpython
$ find . -name "*.c" -exec git blame {} \; > blame
$ head blame
dc5dbf61 (Guido van Rossum 1991-02-19 12:39:46 +0000 1)
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 2) /* List a node on a file */
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 3)
badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 4) #include "pgenheaders.h"
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 5) #include "token.h"
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 6) #include "node.h"
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 7)
badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 8) /* Forward */
20f6f686 (Tim Peters 2000-07-09 03:09:57 +0000 9) static void list1node(FILE *, node *);
20f6f686 (Tim Peters 2000-07-09 03:09:57 +0000 10) static void listnode(FILE *, node *);

Page 9: Powered by Python - PyCon Germany 2016

Let’s try Spark!

import operator as op, re

# sc: SparkContext, connection to cluster
year_re = r"(\d{4})-\d{2}-\d{2}"

years_hist = sc.textFile("blame") \
    .flatMap(lambda line: re.findall(year_re, line)) \
    .map(lambda year: (year, 1)) \
    .reduceByKey(op.add)

output = years_hist.collect()

Page 10: Powered by Python - PyCon Germany 2016

What happened here?
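One way to answer the question is to replay the same three steps in plain Python on a handful of lines. This is only a local sketch, not how Spark executes the job: the `lines` list stands in for the "blame" file (four lines copied from the output shown earlier), and `reduce_by_key` is a hypothetical helper written for this illustration. Like Spark's transformations, the generators below do no work until something consumes them, which in Spark is the job of `collect()`.

```python
import operator as op
import re
from itertools import chain

year_re = r"(\d{4})-\d{2}-\d{2}"

# Stand-in for sc.textFile("blame"): a few lines from the git blame output.
lines = [
    "dc5dbf61 (Guido van Rossum 1991-02-19 12:39:46 +0000 1)",
    "daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 2) /* List a node on a file */",
    'badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 4) #include "pgenheaders.h"',
    "20f6f686 (Tim Peters 2000-07-09 03:09:57 +0000 9) static void list1node(FILE *, node *);",
]

# flatMap: each line yields zero or more years, flattened into one stream.
years = chain.from_iterable(re.findall(year_re, line) for line in lines)

# map: turn every year into a (key, value) pair.
pairs = ((year, 1) for year in years)


def reduce_by_key(pairs, func):
    """Combine all values that share a key using func (hypothetical helper)."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


# reduceByKey + collect: consuming the generators is what triggers the work.
years_hist = reduce_by_key(pairs, op.add)
output = sorted(years_hist.items())
print(output)  # [('1990', 2), ('1991', 1), ('2000', 1)]
```

On a cluster, the difference is that the lines are spread over many machines, and the per-key sums are combined across them.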

Page 11: Powered by Python - PyCon Germany 2016
Page 12: Powered by Python - PyCon Germany 2016

Grammars & Parsing

Or: Why you should have paid attention in compilers class

Page 13: Powered by Python - PyCon Germany 2016

Grammars and Parsing

$ less Grammar/Grammar
...
compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcdef | classdef | decorated | async_stmt
async_stmt: ASYNC (funcdef | with_stmt | for_stmt)
if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]
while_stmt: 'while' test ':' suite ['else' ':' suite]
for_stmt: 'for' exprlist 'in' testlist ':' suite ['else' ':' suite]
...

Parsing: Given an input string, determine/guess the grammar production rules that generate it
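To see the result of this process for Python itself, the standard library's `ast` module can parse a snippet and show which statement rule matched. A small sketch; the source string is made up for this illustration:

```python
import ast

# A made-up snippet that should match the if_stmt rule, including the else part.
source = (
    "if rating > 4:\n"
    "    print('great')\n"
    "else:\n"
    "    print('terrible')\n"
)

tree = ast.parse(source)
stmt = tree.body[0]

print(type(stmt).__name__)       # If
print(type(stmt.test).__name__)  # Compare
print(len(stmt.orelse))          # 1 statement in the else suite
```

The `If` node carries the parts named in the rule: `test`, the `body` suite, and the optional `orelse` suite.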

Page 14: Powered by Python - PyCon Germany 2016

Grammars and Parsing

>>> import nltk
>>> grammar = nltk.CFG.fromstring("""
... OPINION -> NOUN COP ADJ
... OPINION -> ADJ NOUN
... NOUN -> 'hotel' | 'rooms'
... COP -> 'is' | 'are'
... ADJ -> 'great' | 'terrible'
... """)
>>> parser = nltk.ChartParser(grammar)
>>> sent = nltk.word_tokenize("great rooms")
>>> for tree in parser.parse(sent):
...     print(tree)
...
(OPINION (ADJ great) (NOUN rooms))
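For intuition about what a parser does with such a grammar, here is a naive top-down sketch in plain Python. It is not the chart-parsing algorithm that NLTK uses; `GRAMMAR`, `parse` and `parse_sentence` are hypothetical names for this illustration, and the approach only works because the toy grammar has no left recursion.

```python
# The toy grammar from the slide: each nonterminal maps to its alternatives.
GRAMMAR = {
    "OPINION": [["NOUN", "COP", "ADJ"], ["ADJ", "NOUN"]],
    "NOUN": [["hotel"], ["rooms"]],
    "COP": [["is"], ["are"]],
    "ADJ": [["great"], ["terrible"]],
}


def parse(symbol, tokens, pos):
    """Yield (tree, next_pos) for every way symbol can match tokens[pos:]."""
    if symbol not in GRAMMAR:
        # Terminal: it must equal the token at the current position.
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for production in GRAMMAR[symbol]:
        # Extend partial matches one right-hand-side symbol at a time.
        partials = [([], pos)]
        for part in production:
            partials = [
                (children + [subtree], nxt)
                for children, cur in partials
                for subtree, nxt in parse(part, tokens, cur)
            ]
        for children, end in partials:
            yield (symbol, *children), end


def parse_sentence(sentence):
    """Return all OPINION trees that cover the whole sentence."""
    tokens = sentence.split()
    return [tree for tree, end in parse("OPINION", tokens, 0) if end == len(tokens)]


print(parse_sentence("great rooms"))
# [('OPINION', ('ADJ', 'great'), ('NOUN', 'rooms'))]
print(parse_sentence("rooms great"))
# []
```

Trying every production in turn is the "guess" part of parsing; real parsers avoid redoing work by caching partial results.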

Page 15: Powered by Python - PyCon Germany 2016

Word2Vec

● Map words to vectors

● “Step up” from bag-of-words model

● ‘Cats’ and ‘dogs’ should be similar - because they occur in similar contexts

>>> m["python"]
array([-0.1351, -0.1040,
-0.0823, -0.0287, 0.3709,
-0.0200, -0.0325, 0.0166,
0.3312, -0.0928, -0.0967,
-0.0199, -0.2498, -0.4445,
-0.0445,
# ...
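"Similar" for word vectors is usually measured with cosine similarity, the cosine of the angle between two vectors. A minimal sketch with hand-made three-dimensional vectors; the numbers in `toy` are invented for this illustration and do not come from any trained model:

```python
import math


def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy vectors: "cats" and "dogs" point in nearly the same direction.
toy = {
    "cats": [0.9, 0.8, 0.1],
    "dogs": [0.8, 0.9, 0.2],
    "python": [0.1, 0.2, 0.9],
}

print(cosine(toy["cats"], toy["dogs"]))    # close to 1
print(cosine(toy["cats"], toy["python"]))  # noticeably smaller
```

A trained model works the same way, only with learned vectors of a few hundred dimensions.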

Page 16: Powered by Python - PyCon Germany 2016

Fun with Word2Vec

>>> import gensim
>>> # trained from 100k meetup descriptions!
>>> m = gensim.models.Word2Vec.load("data/word2vec")
>>> m.most_similar(positive=["python"])[:3]
[(u'javascript', 0.83), (u'php', 0.82), (u'django', 0.81)]
>>> m.doesnt_match(["python", "c++", "javascript"])
'c++'
>>> m.most_similar(positive=["ladies"])[:3]
[(u'girls', 0.81), (u'mamas', 0.74), (u'gals', 0.73)]

Page 17: Powered by Python - PyCon Germany 2016

ML @ TrustYou

● gensim doc2vec model to create hotel embeddings

● Used, together with other features, for various classifiers

Page 18: Powered by Python - PyCon Germany 2016
Page 19: Powered by Python - PyCon Germany 2016

Luigi

● Build complex pipelines of batch jobs

○ Dependency resolution

○ Parallelism

○ Resume failed jobs
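The dependency resolution and resume behavior can be pictured with a few lines of plain Python. This is a sketch of the idea only, not Luigi's scheduler: the task names in `REQUIRES` are hypothetical, the `done` set stands in for "the output file already exists", and parallelism is left out.

```python
# Hypothetical task graph: each task lists the tasks it requires.
REQUIRES = {
    "crawl_reviews": [],
    "extract_text": ["crawl_reviews"],
    "train_model": ["extract_text"],
    "summarize": ["extract_text", "train_model"],
}


def build(task, done, log):
    """Run task after its requirements, skipping anything already done."""
    if task in done:
        return
    for dependency in REQUIRES[task]:
        build(dependency, done, log)
    log.append(task)  # stands in for the task's run()
    done.add(task)    # stands in for "output now exists"


# Fresh run: everything is executed in dependency order.
first_run = []
build("summarize", set(), first_run)
print(first_run)  # ['crawl_reviews', 'extract_text', 'train_model', 'summarize']

# Resume after a failure: the finished tasks are not repeated.
resumed = []
build("summarize", {"crawl_reviews", "extract_text"}, resumed)
print(resumed)  # ['train_model', 'summarize']
```

Checking for existing outputs rather than remembering state is what makes a rerun pick up where the failed run stopped.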

Page 20: Powered by Python - PyCon Germany 2016

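import luigi

class MyTask(luigi.Task):

    def output(self):
        # LocalTarget is a concrete file target; luigi.Target itself is abstract
        return luigi.LocalTarget("/to/make/this/file")

    def requires(self):
        # INeedThisTask and AndAlsoThisTask are placeholders for upstream tasks
        return [
            INeedThisTask(),
            AndAlsoThisTask("with_some arg")
        ]

    def run(self):
        # ... then ...
        # I do this to make it!
        with self.output().open("w") as out:
            out.write("...")  # placeholder for the real result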

Page 21: Powered by Python - PyCon Germany 2016
Page 22: Powered by Python - PyCon Germany 2016

[email protected] or www.trustyou.com/careers

We’re hiring web developers & data engineers!