Exploring patent space with Python
Franta Polach (@FrantaPolach)
IPberry.com
PyData 2014
Outline
● Why patents
● Data kung fu
● Topic modelling
● Future
Why patents
Why patents
● The system is broken
● Messy, slow & costly process
● USPTO data freely available
● Data structured, mostly consistent
● A chance to learn
Data kung fu
Kung fu or Gung fu (/ˌkʌŋˈfuː/ or /ˌkʊŋˈfuː/; 功夫 , Pinyin: gōngfu)
– a Chinese term referring to any study, learning, or practice that requires patience, energy, and time to complete
USPTO Data
● XML, SGML key-value store
● 1975 – present
● eight different formats
● > 70 GB (compressed)
● patent grants
● patent applications
● How to parse? (a minimal parsing sketch follows below)
● Parsed data available?
– Harvard Dataverse Network
– Coleman Fung Institute for Engineering Leadership, UC Berkeley
– Patent Search Tool by the Fung Institute
– http://funginstitute.berkeley.edu/tools-and-data
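As a reference for the "How to parse?" point above, here is a minimal sketch (not from the talk) of streaming titles out of one raw grant file with lxml. The element names are assumptions based on the USPTO XML v4.x grant format, and the weekly files typically concatenate several XML documents, so they may need splitting first.

# Minimal sketch with assumed element names ('us-patent-grant', 'invention-title',
# USPTO XML v4.x grant format): stream-parse a grant file without loading the
# whole >70 GB corpus into RAM.
from lxml import etree

def iter_titles(path):
    for _, elem in etree.iterparse(path, tag='us-patent-grant'):
        title = elem.findtext('.//invention-title')
        if title:
            yield title
        elem.clear()   # release parsed subtrees to keep memory usage flat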
Coleman Fung Institute for Engineering Leadership, UC Berkeley
patent data process flow
The code is in Python 2 on GitHub.
Fung Institute SQL database schema
Entity-relationship diagram: patents with citations, claims, applications and classes
Descriptive statistics
Topic modelling
● Goal: build a topic space of the patent documents
● i.e. compute semantic similarity
● Tools: nltk, gensim
● Data: patent abstracts, claims, descriptions
● Usage: given an invention description, find semantically similar patents
Text preprocessing
● Have: parsed data in a relational database
● Want: data ready for semantic analysis
● Do:
– lemmatization, stemming
– collocations, Named Entity Recognition
Text preprocessing
Lemmatization, stemming
print(gensim.utils.lemmatize("Changing the way scientists, engineers, and analysts perceive big data"))
['change/VB', 'way/NN', 'scientist/NN', 'engineer/NN', 'analyst/NN', 'perceive/VB', 'big/JJ', 'datum/NN']
i.e. group together different inflected forms of a word so they can be analysed as a single item
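The slide also lists stemming; a minimal sketch with nltk's PorterStemmer (my addition, not in the original code) would be:

# Stemming (illustrative): chop word endings by rule instead of mapping each
# word to a dictionary form, trading accuracy for speed and simplicity.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["scientists", "engineers", "analysts", "changing"]])
# ['scientist', 'engin', 'analyst', 'chang']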
Collocations, Named Entity Recognition
detect a sequence of words that co-occur more often than would be expected by chance
import nltk
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures
e.g. entity such as "General Electric" stays a single token
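A minimal sketch of how those imports might be used (not from the slides; the input file is a stand-in):

# Illustrative usage: rank trigrams that co-occur more often than chance,
# scored by pointwise mutual information. 'abstracts.txt' is an assumed
# stand-in for a dump of patent text.
tokens = nltk.word_tokenize(open('abstracts.txt').read().lower())
finder = TrigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)                           # ignore trigrams seen fewer than 3 times
print(finder.nbest(TrigramAssocMeasures.pmi, 10))     # top 10 candidate collocations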
Stopwords
generic words, such as "six", "then", "be", "do"....
from gensim.parsing.preprocessing import STOPWORDS
Data streaming
Why? The data is too large to fit into RAM.
Itertools are your friend
import itertools
import gensim

class PatCorpus(object):
    """Stream (title, tokens) pairs from a tab-separated file, one patent per line."""
    def __init__(self, fname):
        self.fname = fname

    def __iter__(self):
        for line in open(self.fname):
            patent = line.lower().split('\t')                       # tab-separated fields
            tokens = gensim.utils.tokenize(patent[5], lower=True)   # text field
            title = patent[6]                                       # title field
            yield title, list(tokens)

corpus_tokenized = PatCorpus('in.tsv')
print(list(itertools.islice(corpus_tokenized, 2)))
[('easy wagon/easy cart/bicycle wheel mounting brackets system', [u'a', u'specific', u'wheel', u'mounting', u'bracket', u'and', u'a', u'versatile', u'method', u'of', u'using', u'these', u'brackets', u'or', u'similar', u'items', u'to', u'attach', u'bicycle', u'wheels', u'to', u'various', u'vehicle', u'frames', u'primarily', u'made', u'of', u'wood', u'and', u'a', u'general', u'vehicle', u'structure', u'or', u'frame', u'design', u'using', u'the', u'brackets', u'the', u'brackets', u'are', u'flat', …
Vectorization
● First we create a dictionary, i.e. index text tokens by integers
id2word = gensim.corpora.Dictionary(tokens for title, tokens in corpus_tokenized)  # feed only the token lists
● Create bag-of-words vectors using a streamed corpus and a dictionary
text = "A community for developers and users of Python data tools."bow = id2word.doc2bow(tokenize(text))print(bow)
[(12832, 1), (28124, 1), (28301, 1), (32835, 1)]
from gensim.utils import simple_preprocess

def tokenize(text):
    return [t for t in simple_preprocess(text) if t not in STOPWORDS]
Semantic transformations
● A transformation takes a corpus and outputs another corpus
● Choice: Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), Random Projections (RP), etc.
model = gensim.models.LdaModel(corpus, num_topics=100, id2word=id2word, passes=4, alpha=None)
_ = model.print_topics(-1)
INFO:gensim.models.ldamodel:topic #0 (0.010): 0.116*memory + 0.090*cell + 0.063*plurality + 0.054*array + 0.052*each + 0.044*bit + 0.039*cells + 0.032*address + 0.022*logic + 0.017*row
INFO:gensim.models.ldamodel:topic #1 (0.010): 0.101*speed + 0.092*lines + 0.060*performance + 0.045*characteristic + 0.036*skin + 0.028*characteristics + 0.025*suspension + 0.024*enclosure + 0.023*transducer + 0.022*loss
INFO:gensim.models.ldamodel:topic #2 (0.010): 0.141*portion + 0.049*housing + 0.031*portions + 0.028*end + 0.024*edge + 0.020*mounting + 0.018*has + 0.017*each + 0.016*formed + 0.016*arm
INFO:gensim.models.ldamodel:topic #3 (0.010): 0.224*signal + 0.099*output + 0.075*input + 0.057*signals + 0.043*frequency + 0.034*phase + 0.024*clock + 0.020*circuit + 0.016*amplifier + 0.014*reference
Transforming unseen documents
text = "A method of configuring the link maximum transmission unit (MTU) in a user equipment."
1) transform text into the bag-of-words space
bow_vector = id2word.doc2bow(tokenize(text))
print([(id2word[id], count) for id, count in bow_vector])
[(u'method', 1), (u'configuring', 1), (u'link', 1), (u'maximum', 1), (u'transmission', 1), (u'unit', 1), (u'user', 1), (u'equipment', 1)]
2) transform text into our LDA space
vector = model[bow_vector]
[(0, 0.024384265946835323), (1, 0.78941547921042373),...
3) find the document's most significant LDA topic
model.print_topic(max(vector, key=lambda item: item[1])[0])
0.022*network + 0.021*performance + 0.018*protocol + 0.015*data + 0.009*system + 0.008*internet + ...
Evaluation
● Topic modelling is an unsupervised task ->> evaluation tricky
● Need to evaluate the improvement on the intended task
● Our goal is to retrieve semantically similar documents, so we tag a set of similar documents and compare them with the results of the given semantic model
● "word intrusion" method: for each trained topic, take its first ten words, substitute one of them with a randomly chosen word (intruder!) and let a human detect the intruder
● Method without human intervention: split each document into two parts, and check that topics of the first half are similar to topics of the second; halves of different documents are dissimilar
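A minimal sketch of that second, human-free check (my addition; helper names are illustrative, and it reuses PatCorpus, model and id2word from the earlier slides): halves of the same document should be closer in topic space than halves of different documents.

# Illustrative split-document evaluation (assumed helpers, not from the talk).
import itertools
import numpy as np
from gensim.matutils import cossim

def lda_vec(tokens):
    return model[id2word.doc2bow(tokens)]

docs = [tokens for title, tokens in itertools.islice(PatCorpus('in.tsv'), 100)]
intra, inter = [], []
for i, doc in enumerate(docs):
    first, second = doc[:len(doc) // 2], doc[len(doc) // 2:]
    intra.append(cossim(lda_vec(first), lda_vec(second)))
    other = docs[(i + 1) % len(docs)]                # a half from a different document
    inter.append(cossim(lda_vec(first), lda_vec(other[:len(other) // 2])))

print(np.mean(intra), np.mean(inter))                # intra similarity should be the larger one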
The topic space
● a topic is a distribution over a fixed vocabulary of terms
● the idea behind Latent Dirichlet Allocation is to statistically model documents as containing multiple hidden semantic topics
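In gensim this distribution can be read directly off a trained model; a small illustrative call (reusing the model from earlier; the tuple order of the result differs across gensim versions) is:

# Each topic is a probability distribution over the dictionary's terms;
# show_topic returns the topn most probable term/weight pairs for topic 0.
print(model.show_topic(0, topn=5))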
Exploring topic space

Top words in four example topics (term: weight):
● memory: 188, cell: 146, plurality: 102, array: 86, bit: 71, address: 51
● speed: 178, line: 163, performance: 107, characteristic: 79, skin: 63, suspension: 45
● signal: 324, output: 142, input: 108, frequency: 62, phase: 49, clock: 35
● portion: 310, housing: 109, end: 62, edge: 53, mounting: 43, form: 35
Topics distribution
many topics in total, but each document contains just a few of them ->> sparse model
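A quick illustrative check of that sparsity (reusing model and bow_vector from the earlier slides): the LDA representation of one document lists only the topics above gensim's minimum-probability threshold.

# Illustrative: of the 100 trained topics, a single patent usually gets
# non-negligible probability on only a handful.
vector = model[bow_vector]
print(len(vector), 'of', model.num_topics, 'topics active')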
Semantic distance in topic space
● Semantic distance queries
from scipy.spatial import distance
pairwise = distance.squareform(distance.pdist(matrix))
>> MemoryError
● Document indexing
from gensim.similarities import Similarity
index = Similarity('tmp/index', corpus, num_features=corpus.num_terms)
The Similarity class splits the index into several smaller sub-indexes ->> scales well
Semantic distance queries
query = "A method of configuring the link maximum transmission unit (MTU) in a user equipment."
1) vectorize the text into bag-of-words space
bow_vector = id2word.doc2bow(tokenize(query))
2) transform the text into our LDA space
query_lda = model[bow_vector]
3) query the LDA index, get the top 3 most similar documents
index.num_best = 3
print(index[query_lda])
[(2026, 0.91495784099521484), (32384, 0.8226358470916238), (11525, 0.80638835174553156)]
Future
● Graph of USPTO data (Neo4j)
● Elasticsearch search and analytics
● Recommendation engine (for applications)
● Drawings analysis
● Blockchain-based smart contracts
● Artificial patent lawyer