Franta Polach - Exploring Patent Data with Python

Post on 27-Jan-2015




PyData Berlin 2014. Experiences from building a recommendation engine for patent search using Pythonic NLP and topic modelling tools such as Gensim.

Transcript of Franta Polach - Exploring Patent Data with Python

Exploring patent space with Python

Franta Polach, @FrantaPolach

PyData 2014



● Why patents
● Data kung fu
● Topic modelling
● Future


Why patents


● The system is broken
● Messy, slow & costly process
● USPTO data freely available
● Data structured, mostly consistent
● A chance to learn


Data kung fu

Kung fu or Gung fu (/ˌkʌŋˈfuː/ or /ˌkʊŋˈfuː/; 功夫 , Pinyin: gōngfu)

– a Chinese term referring to any study, learning, or practice that requires patience, energy, and time to complete


USPTO Data

● XML, SGML key-value store
● 1975 – present
● eight different formats
● > 70 GB (compressed)
● patent grants
● patent applications
● How to parse?
● Parsed data available?

– Harvard Dataverse Network

– Coleman Fung Institute for Engineering Leadership, UC Berkeley

– PATENT SEARCH TOOL by Fung Institute


Coleman Fung Institute for Engineering Leadership, UC Berkeley

patent data process flow

The code is written in Python 2 and is available on GitHub.


Fung Institute SQL database schema


Entity-relationship diagram

Patents with citations, claims, applications and classes


Descriptive statistics


Topic modelling

● Goal: build a topic space of the patent documents, i.e. compute semantic similarity
● Tools: nltk, gensim
● Data: patent abstracts, claims, descriptions
● Usage: have an invention description, find semantically similar patents


Text preprocessing

● Have: parsed data in a relational database
● Want: data ready for semantic analysis
● Do:

– lemmatization, stemming

– collocations, Named Entity Recognition


Text preprocessing

Lemmatization, stemming

print(gensim.utils.lemmatize("Changing the way scientists, engineers, and analysts perceive big data"))

['change/VB', 'way/NN', 'scientist/NN', 'engineer/NN', 'analyst/NN', 'perceive/VB', 'big/JJ', 'datum/NN']

i.e. group together different inflected forms of a word so they can be analysed as a single item

Collocations, Named Entity Recognition

detect a sequence of words that co-occur more often than would be expected by chance

import nltk
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures

e.g. an entity such as "General Electric" stays a single token
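As a rough illustration of what "co-occur more often than would be expected by chance" means, the sketch below scores adjacent word pairs by pointwise mutual information (PMI) using only the standard library. The toy sentences, the `min_count` threshold and the helper name `score_bigrams` are assumptions made for this example; they are not taken from the talk, and NLTK's collocation finders provide this and other association measures.

```python
import math
from collections import Counter


def score_bigrams(sentences, min_count=2):
    """Return {(w1, w2): PMI} for adjacent word pairs seen at least min_count times."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_words = sum(unigrams.values())
    n_pairs = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_pairs
        p_w1 = unigrams[w1] / n_words
        p_w2 = unigrams[w2] / n_words
        # PMI > 0 means the pair occurs more often than independence predicts.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy corpus (hypothetical), already lower-cased and tokenized.
sentences = [
    "general electric filed a patent".split(),
    "the patent was assigned to general electric".split(),
    "general electric makes a turbine".split(),
    "a general method for the turbine".split(),
]
scores = score_bigrams(sentences)
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```

A pair that scores highly could then be merged into one token (for example `general_electric`) before the text is handed to the topic model.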


Stop words: generic words, such as "six", "then", "be", "do", ..., are filtered out

from gensim.parsing.preprocessing import STOPWORDS


Data streaming

Why? The data is too large to fit into RAM

Itertools are your friend

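import itertools
import gensim

class PatCorpus(object):
    """Stream patents from a tab-separated file, one document at a time."""

    def __init__(self, fname):
        self.fname = fname

    def __iter__(self):
        with open(self.fname) as f:
            for line in f:
                patent = line.lower().split('\t')
                # column 5 holds the text to tokenize, column 6 the title
                tokens = gensim.utils.tokenize(patent[5], lower=True)
                title = patent[6]
                yield title, list(tokens)

corpus_tokenized = PatCorpus('in.tsv')
# look at the first two documents without reading the whole file
print(list(itertools.islice(corpus_tokenized, 2)))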

[('easy wagon/easy cart/bicycle wheel mounting brackets system', [u'a', u'specific', u'wheel', u'mounting', u'bracket', u'and', u'a', u'versatile', u'method', u'of', u'using', u'these', u'brackets', u'or', u'similar', u'items', u'to', u'attach', u'bicycle', u'wheels', u'to', u'various', u'vehicle', u'frames', u'primarily', u'made', u'of', u'wood', u'and', u'a', u'general', u'vehicle', u'structure', u'or', u'frame', u'design', u'using', u'the', u'brackets', u'the', u'brackets', u'are', u'flat', …
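The streaming idea can be tried without gensim or a real patent dump. The following is a minimal sketch, assuming the same column layout as the slide (text in column 5, title in column 6); the class name `StreamedCorpus`, the two sample rows and the regular-expression tokenizer are hypothetical stand-ins, not the author's code.

```python
import io
import itertools
import re

TOKEN_RE = re.compile(r"[a-z]+")  # simplistic stand-in for gensim.utils.tokenize


class StreamedCorpus:
    """Yield (title, tokens) one document at a time instead of loading everything."""

    def __init__(self, open_source, text_col=5, title_col=6):
        self.open_source = open_source  # callable returning a fresh file-like object
        self.text_col = text_col
        self.title_col = title_col

    def __iter__(self):
        with self.open_source() as handle:
            for line in handle:
                fields = line.rstrip("\n").lower().split("\t")
                yield fields[self.title_col], TOKEN_RE.findall(fields[self.text_col])


# Hypothetical two-row TSV held in memory; a real run would open a file instead.
rows = [
    "1\tx\tx\tx\tx\tA wheel mounting bracket.\tEasy wagon",
    "2\tx\tx\tx\tx\tA memory cell array.\tMemory device",
]
data = "\n".join(rows) + "\n"
corpus = StreamedCorpus(lambda: io.StringIO(data))

# islice pulls only the first document; the second row is never parsed here.
first = list(itertools.islice(corpus, 1))
print(first)
```

Because `__iter__` reopens the source on every call, the corpus can be iterated several times, which multi-pass training needs.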



● First we create a dictionary, i.e. index text tokens by integers

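# the corpus yields (title, tokens) pairs; the dictionary needs only the tokens
id2word = gensim.corpora.Dictionary(tokens for _, tokens in corpus_tokenized)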

● Create bag-of-words vectors using a streamed corpus and a dictionary

text = "A community for developers and users of Python data tools."
bow = id2word.doc2bow(tokenize(text))
print(bow)

[(12832, 1), (28124, 1), (28301, 1), (32835, 1)]

def tokenize(text):
    return [t for t in simple_preprocess(text) if t not in …
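To make the two steps above concrete, here is a standard-library sketch of what a token-to-integer dictionary and a bag-of-words vector look like. It imitates the idea behind `gensim.corpora.Dictionary` and `doc2bow` and does not reproduce gensim's implementation or its id assignment; the tiny stop-word set, the whitespace tokenizer and the helper names are assumptions made for this example.

```python
from collections import Counter

STOPWORDS_SAMPLE = frozenset({"a", "and", "for", "of", "the"})  # tiny hypothetical list


def tokenize(text):
    """Lower-case, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,;:()").lower() for w in text.split())
    return [w for w in words if w and w not in STOPWORDS_SAMPLE]


def build_dictionary(documents):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for tokens in documents:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc2bow(token2id, tokens):
    """Return sorted (token_id, count) pairs; unknown tokens are ignored."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


docs = [
    tokenize("A community for developers and users of Python data tools."),
    tokenize("Python tools for patent data."),
]
token2id = build_dictionary(docs)
bow = doc2bow(token2id, tokenize("Data tools and more data tools"))
print(bow)
```

Words that never appeared while the dictionary was built ("more" in this query) simply drop out of the vector.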



Semantic transformations

● A transformation takes a corpus and outputs another corpus

● Choice: Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), Random Projections (RP), etc.

model = gensim.models.LdaModel(corpus, num_topics=100, id2word=id2word, passes=4, alpha=None)

_ = model.print_topics(-1)

INFO:gensim.models.ldamodel:topic #0 (0.010): 0.116*memory + 0.090*cell + 0.063*plurality + 0.054*array + 0.052*each + 0.044*bit + 0.039*cells + 0.032*address + 0.022*logic + 0.017*row

INFO:gensim.models.ldamodel:topic #1 (0.010): 0.101*speed + 0.092*lines + 0.060*performance + 0.045*characteristic + 0.036*skin + 0.028*characteristics + 0.025*suspension + 0.024*enclosure + 0.023*transducer + 0.022*loss

INFO:gensim.models.ldamodel:topic #2 (0.010): 0.141*portion + 0.049*housing + 0.031*portions + 0.028*end + 0.024*edge + 0.020*mounting + 0.018*has + 0.017*each + 0.016*formed + 0.016*arm

INFO:gensim.models.ldamodel:topic #3 (0.010): 0.224*signal + 0.099*output + 0.075*input + 0.057*signals + 0.043*frequency + 0.034*phase + 0.024*clock + 0.020*circuit + 0.016*amplifier + 0.014*reference


Transforming unseen documents

text = "A method of configuring the link maximum transmission unit (MTU) in a user equipment."

1) transform text into the bag-of-words space

bow_vector = id2word.doc2bow(tokenize(text))
print([(id2word[token_id], count) for token_id, count in bow_vector])

[(u'method', 1), (u'configuring', 1), (u'link', 1), (u'maximum', 1), (u'transmission', 1), (u'unit', 1), (u'user', 1), (u'equipment', 1)]

2) transform text into our LDA space

vector = model[bow_vector]

[(0, 0.024384265946835323), (1, 0.78941547921042373),...

3) find the document's most significant LDA topic

model.print_topic(max(vector, key=lambda item: item[1])[0])

0.022*network + 0.021*performance + 0.018*protocol + 0.015*data + 0.009*system + 0.008*internet + ...



● Topic modelling is an unsupervised task, so evaluation is tricky

● We need to evaluate the improvement on the intended task

● Our goal is to retrieve semantically similar documents, so we tag a set of similar documents and compare it with the results of a given semantic model

● "Word intrusion" method: for each trained topic, take its first ten words, substitute one of them with a randomly chosen word (the intruder) and let a human detect the intruder

● Method without human intervention: split each document into two halves and check that the topics of the first half are similar to the topics of the second half, while halves of different documents are mostly dissimilar
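The split-half check from the last bullet can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code from the talk: `vectorize` is a parameter that would normally return a document's topic weights from a trained model, and here a plain word-count vector stands in for it so the example runs without gensim. The function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def split_half_scores(documents, vectorize):
    """Return (mean similarity between halves of the same document,
    mean similarity between first and second halves of different documents)."""
    halves = []
    for tokens in documents:
        middle = len(tokens) // 2
        halves.append((vectorize(tokens[:middle]), vectorize(tokens[middle:])))
    same = [cosine(a, b) for a, b in halves]
    different = [
        cosine(halves[i][0], halves[j][1])
        for i in range(len(halves))
        for j in range(len(halves))
        if i != j
    ]
    return sum(same) / len(same), sum(different) / len(different)


def count_vector(tokens):
    """Stand-in vectorizer: raw word counts instead of topic weights."""
    return dict(Counter(tokens))


documents = [
    "memory cell array memory cell bit address array".split(),
    "signal output input signal frequency output input phase".split(),
]
same, different = split_half_scores(documents, count_vector)
print(same, different)
```

A useful model should give a clearly higher first number than second; the gap between the two can be tracked while tuning the number of topics or the preprocessing.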


The topic space

● a topic is a distribution over a fixed vocabulary of terms

● the idea behind Latent Dirichlet Allocation is to statistically model documents as containing multiple hidden semantic topics
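One way to read the two statements above is through the generative story that LDA assumes: each document draws its own mixture of topics, and each word is produced by first picking a topic from that mixture and then picking a word from that topic's distribution. The sketch below simulates this with the standard library only. The two topics, their word weights and the `alpha` value are made up for illustration, and the code shows the assumed generative process, not how gensim trains a model.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw one point from a symmetric Dirichlet via normalised gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Pick a topic for every word position, then a word from that topic."""
    topic_mix = sample_dirichlet(alpha, len(topics), rng)
    words = []
    for _ in range(length):
        topic = rng.choices(topics, weights=topic_mix)[0]
        vocabulary, weights = zip(*topic.items())
        words.append(rng.choices(vocabulary, weights=weights)[0])
    return topic_mix, words


# Two hypothetical topics; words not listed have probability zero in that topic.
topics = [
    {"memory": 0.5, "cell": 0.3, "array": 0.2},
    {"signal": 0.6, "output": 0.25, "input": 0.15},
]
rng = random.Random(0)
topic_mix, words = generate_document(topics, alpha=0.5, length=8, rng=rng)
print(topic_mix)
print(words)
```

Training runs this story in reverse: given only the words, it infers topic distributions and per-document mixtures that make the observed documents likely.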


Exploring topic space

memory: 188, cell: 146, plurality: 102, array: 86, bit: 71, address: 51

speed: 178, line: 163, performance: 107, characteristic: 79, skin: 63, suspension: 45

signal: 324, output: 142, input: 108, frequency: 62, phase: 49, clock: 35

portion: 310, housing: 109, end: 62, edge: 53, mounting: 43, form: 35


Topics distribution

There are many topics in total, but each document contains just a few of them, so the model is sparse


Semantic distance in topic space

● Semantic distance queries

from scipy.spatial import distance
pairwise = distance.squareform(distance.pdist(matrix))

>> MemoryError

● Document indexing

from gensim.similarities import Similarity

index = Similarity('tmp/index', corpus, num_features=corpus.num_terms)

The Similarity class splits the index into several smaller sub-indexes, so it scales well
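The MemoryError is easy to anticipate with a little arithmetic: a dense all-pairs matrix holds n × n values, so its size grows quadratically with the number of documents. The slides do not state how many patents were in the matrix, so the corpus sizes below are hypothetical, and 8 bytes per value assumes float64.

```python
def pairwise_matrix_gib(n_documents, bytes_per_value=8):
    """Memory needed for a dense n x n matrix of float64 values, in GiB."""
    return n_documents ** 2 * bytes_per_value / 2 ** 30


for n in (10_000, 100_000, 1_000_000):  # hypothetical corpus sizes
    print(f"{n:>9,} documents -> {pairwise_matrix_gib(n):,.1f} GiB")
```

At 100,000 documents the matrix already needs roughly 75 GiB, which is why an index that is queried one document at a time is the more practical route.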


Semantic distance queries

query = "A method of configuring the link maximum transmission unit (MTU) in a user equipment."

1) vectorize the text into bag-of-words space

bow_vector = id2word.doc2bow(tokenize(query))

2) transform the text into our LDA space

query_lda = model[bow_vector]

3) query the LDA index, get the top 3 most similar documents

index.num_best = 3
print(index[query_lda])


[(2026, 0.91495784099521484), (32384, 0.8226358470916238), (11525, 0.80638835174553156)]
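Each pair in a result like the one above is a document number and a similarity score. What such a query computes can be sketched in a few lines: compare the query's topic vector with every indexed vector by cosine similarity (the measure gensim's similarity indexes use) and keep the best few. The document ids, topic weights and the `most_similar` helper below are hypothetical, and the brute-force loop ignores the sharding that makes the real index scale.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two sparse {topic_id: weight} vectors."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(query, indexed_vectors, num_best=3):
    """Return the num_best (doc_id, similarity) pairs, most similar first."""
    scored = ((doc_id, cosine(query, vec)) for doc_id, vec in indexed_vectors.items())
    return heapq.nlargest(num_best, scored, key=lambda pair: pair[1])


# Hypothetical topic vectors for four indexed documents.
indexed = {
    "doc_a": {0: 0.9, 3: 0.1},
    "doc_b": {1: 0.8, 2: 0.2},
    "doc_c": {1: 0.5, 3: 0.5},
    "doc_d": {2: 1.0},
}
query = {1: 0.8, 3: 0.2}  # hypothetical topic weights for the query text
result = most_similar(query, indexed, num_best=3)
print(result)
```

Documents that share the query's dominant topic rank first, and documents with no topic in common score zero.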



Future

● Graph of USPTO data (Neo4j)
● Elasticsearch search and analytics
● Recommendation engine (for applications)
● Drawings analysis
● Blockchain-based smart contracts
● Artificial patent lawyer