Powered by Python - PyCon Germany 2016
Powered by Python
Summarizing hotel reviews for 100 million travelers
Steffen Wenz, CTO
10,000 hotels use TrustYou Analytics to analyze their guest reviews.
100 million travelers see our data on Google, Hotels.com, Kayak … actually, it’s probably more.
Architecture ;-)
Hadoop Cluster (Hortonworks Distribution)
Big Data, Python, Machine Learning, NLP, Scraping, API, Magic, Love
Hadoop:
… slow & massive
Python on Hadoop:
… possible, but not natural
Let’s try Spark!
$ # how old is the C code in CPython?
$ git clone https://github.com/python/cpython && cd cpython
$ find . -name "*.c" -exec git blame {} \; > blame
$ head blame
dc5dbf61 (Guido van Rossum 1991-02-19 12:39:46 +0000 1)
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 2) /* List a node on a file */
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 3)
badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 4) #include "pgenheaders.h"
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 5) #include "token.h"
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 6) #include "node.h"
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 7)
badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 8) /* Forward */
20f6f686 (Tim Peters 2000-07-09 03:09:57 +0000 9) static void list1node(FILE *, node *);
20f6f686 (Tim Peters 2000-07-09 03:09:57 +0000 10) static void listnode(FILE *, node *);
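Let’s try Spark!
import operator as op
import re

# sc: SparkContext, connection to cluster
year_re = r"(\d{4})-\d{2}-\d{2}"
years_hist = (
    sc.textFile("blame")                              # one record per line of the blame file
    .flatMap(lambda line: re.findall(year_re, line))  # every year found in a line
    .map(lambda year: (year, 1))                      # (year, 1) pairs
    .reduceByKey(op.add)                              # sum the ones per year
)
output = years_hist.collect()  # bring the result back to the driver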
What happened here?
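One way to read the pipeline above: flatMap pulls zero or more year strings out of each line, map turns each year into a (year, 1) pair, and reduceByKey sums the ones per year. Below is a minimal single-machine sketch of the same computation in plain Python. The helper name years_histogram and the four sample lines (copied from the blame output shown earlier) are only there for illustration.

```python
import re
from collections import Counter

YEAR_RE = r"(\d{4})-\d{2}-\d{2}"

def years_histogram(lines):
    """Count how often each year appears, like flatMap + map + reduceByKey."""
    counts = Counter()
    for line in lines:
        # flatMap step: a line may contribute zero or more years
        for year in re.findall(YEAR_RE, line):
            # map + reduceByKey step: emit (year, 1) and sum per key
            counts[year] += 1
    return dict(counts)

sample = [
    "dc5dbf61 (Guido van Rossum 1991-02-19 12:39:46 +0000 1)",
    "daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 2) /* List a node on a file */",
    "20f6f686 (Tim Peters 2000-07-09 03:09:57 +0000 9) static void list1node(FILE *, node *);",
    "daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 3)",
]
print(years_histogram(sample))
```

On a cluster, Spark applies the per-line steps in parallel on partitions of the file and only merges the per-year counts at the end. Nothing is computed until collect() is called.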
Grammars & Parsing
Or: Why you should have paid attention in compilers class
Grammars and Parsing
$ less Grammar/Grammar
...
compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcdef | classdef | decorated | async_stmt
async_stmt: ASYNC (funcdef | with_stmt | for_stmt)
if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]
while_stmt: 'while' test ':' suite ['else' ':' suite]
for_stmt: 'for' exprlist 'in' testlist ':' suite ['else' ':' suite]
...
Parsing: Given an input string, determine/guess the grammar production rules that generate it
>>> import nltk
>>> grammar = nltk.CFG.fromstring("""
... OPINION -> NOUN COP ADJ
... OPINION -> ADJ NOUN
... NOUN -> 'hotel' | 'rooms'
... COP -> 'is' | 'are'
... ADJ -> 'great' | 'terrible'
... """)
>>> parser = nltk.ChartParser(grammar)
>>> sent = nltk.word_tokenize("great rooms")
>>> for tree in parser.parse(sent):
...     print(tree)
...
(OPINION (ADJ great) (NOUN rooms))
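To make "determine the production rules that generate it" concrete, here is a rough sketch of a top-down, backtracking parser for the same toy grammar in plain Python. This is not how nltk.ChartParser works internally. The dictionary layout and the function names parse and parse_sentence are assumptions made for this illustration.

```python
# Toy grammar from the example above: nonterminal -> list of alternative right-hand sides
GRAMMAR = {
    "OPINION": [["NOUN", "COP", "ADJ"], ["ADJ", "NOUN"]],
    "NOUN": [["hotel"], ["rooms"]],
    "COP": [["is"], ["are"]],
    "ADJ": [["great"], ["terrible"]],
}

def parse(symbol, tokens, pos=0):
    """Yield (tree, next_pos) for every way `symbol` can derive tokens[pos:next_pos]."""
    if symbol not in GRAMMAR:
        # Terminal: must match the current token exactly
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in GRAMMAR[symbol]:
        # Try each production; expand its symbols left to right
        partial = [([], pos)]
        for part in rhs:
            partial = [
                (children + [subtree], end)
                for children, start in partial
                for subtree, end in parse(part, tokens, start)
            ]
        for children, end in partial:
            yield (symbol, children), end

def parse_sentence(sentence):
    tokens = sentence.split()
    # Keep only parses that consume the whole input
    return [tree for tree, end in parse("OPINION", tokens) if end == len(tokens)]

print(parse_sentence("great rooms"))        # one parse, via OPINION -> ADJ NOUN
print(parse_sentence("hotel is terrible"))  # one parse, via OPINION -> NOUN COP ADJ
print(parse_sentence("rooms great"))        # no parse: empty list
```

This naive approach loops forever on left-recursive rules and redoes work when backtracking. Chart parsers avoid both problems by storing intermediate results.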
Word2Vec
● Map words to vectors
● “Step up” from bag-of-words model
● ‘Cats’ and ‘dogs’ should be similar, because they occur in similar contexts
>>> m["python"]
array([-0.1351, -0.1040,
-0.0823, -0.0287, 0.3709,
-0.0200, -0.0325, 0.0166,
0.3312, -0.0928, -0.0967,
-0.0199, -0.2498, -0.4445,
-0.0445,
# ...
Fun with Word2Vec
>>> import gensim
>>> # trained from 100k meetup descriptions!
>>> m = gensim.models.Word2Vec.load("data/word2vec")
>>> m.most_similar(positive=["python"])[:3]
[(u'javascript', 0.83), (u'php', 0.82), (u'django', 0.81)]
>>> m.doesnt_match(["python", "c++", "javascript"])
'c++'
>>> m.most_similar(positive=["ladies"])[:3]
[(u'girls', 0.81), (u'mamas', 0.74), (u'gals', 0.73)]
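The scores returned by most_similar are cosine similarities between word vectors. The sketch below shows that ranking on hand-made vectors in plain Python. The three-dimensional toy_vectors are invented for illustration and do not come from the model in the talk. The function names are also assumptions, not gensim's implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors, for illustration only
toy_vectors = {
    "python": [0.9, 0.1, 0.0],
    "javascript": [0.8, 0.3, 0.1],
    "ladies": [0.0, 0.2, 0.9],
}

def most_similar(word, vectors):
    """Rank all other words by cosine similarity to `word`, best first."""
    target = vectors[word]
    scores = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

print(most_similar("python", toy_vectors))
```

With these toy values, "javascript" ranks far above "ladies" for the query "python", because its vector points in nearly the same direction.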
ML @ TrustYou
● gensim doc2vec model to create hotel embeddings
● Used, together with other features, for various classifiers
Luigi
● Build complex pipelines of batch jobs
  ○ Dependency resolution
  ○ Parallelism
  ○ Resume failed jobs
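import luigi

class MyTask(luigi.Task):

    def output(self):
        # LocalTarget is a concrete file target; luigi.Target itself is abstract
        return luigi.LocalTarget("/to/make/this/file")

    def requires(self):
        # INeedThisTask and AndAlsoThisTask stand for other luigi.Task subclasses
        return [
            INeedThisTask(),
            AndAlsoThisTask("with_some arg")
        ]

    def run(self):
        # ... then ...
        # I do this to make it!
        pass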
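The idea behind "dependency resolution" and "resume failed jobs" can be sketched without Luigi: a task counts as done when its output file exists, and building a task first builds whatever it requires. The Task class, the build function, and the file names below are hypothetical stand-ins for illustration, not Luigi's API.

```python
import os
import tempfile

class Task:
    """Hypothetical stand-in for a pipeline task; not Luigi's API."""
    def __init__(self, path, deps=()):
        self.path = path
        self.deps = list(deps)

    def complete(self):
        # A task is considered done once its output file exists
        return os.path.exists(self.path)

    def run(self):
        with open(self.path, "w") as f:
            f.write("done\n")

def build(task, log):
    """Run `task` after its dependencies, skipping anything already complete."""
    if task.complete():
        return
    for dep in task.deps:
        build(dep, log)
    task.run()
    log.append(os.path.basename(task.path))

workdir = tempfile.mkdtemp()
scrape = Task(os.path.join(workdir, "scrape.txt"))
clean = Task(os.path.join(workdir, "clean.txt"), deps=[scrape])
report = Task(os.path.join(workdir, "report.txt"), deps=[clean])

first_run = []
build(report, first_run)    # runs scrape, clean, report in dependency order

os.remove(report.path)      # simulate a failure of the last step
second_run = []
build(report, second_run)   # only report is redone
print(first_run, second_run)
```

Because completion is derived from the outputs on disk, rerunning the pipeline after a crash picks up where it left off. A real scheduler such as Luigi adds parallel workers and failure handling on top of this.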
[email protected] or www.trustyou.com/careers
We’re hiring web developers & data engineers!