Disambiguating Patent Inventors: A Non-Name-Matching Approach
1
Disambiguating Patent Inventors: A Non-Name-Matching Approach
Presenter: Hsini Huang
Co-authors: Li Tang and John P. Walsh
Georgia Institute of Technology
ESF-APE-INV 2nd “Name Game” workshop, Dec 9, 2010
Madrid, Spain
2
Authorship identification has been the Achilles' heel of bibliometric analyses at the individual level, e.g. citation impact analysis (Tang and Walsh, 2010).
Raffo and Lhuillery (2009) also warned that the reliability of statistical results on patenting inventors depends heavily on the accuracy rate of the matching heuristic.
A challenge to undertake
3
Several reasons why name-matching is probably not a good idea:
◦ Cleaning typos in names (inventor, assignee, etc.) is a difficult task
◦ The matching criteria are often used as dependent variables too, e.g. co-authorship, knowledge flows and geographical spillover (Singh, 2004)
◦ “Name plus affiliation plus address” could be effective if inventors are not mobile
Why solve the “John Smith” problem differently?
4
Cognitive map
◦ A process consisting of a series of psychological transformations by which an individual acquires, codes, stores, recalls, and decodes information in a spatial/information environment
Structural equivalence
◦ In a single-relation network, two actors are structurally equivalent if they have identical ties to and from all the other actors
Approximate Structural Equivalence (A.S.E.)
◦ Actors within a structurally equivalent cluster are more similar to each other than to those outside the cluster
ASE Method: Key Concepts
5
The references in a publication or patent should reflect the cognitive map of the author or inventor
If two documents share one or more references, they are more likely to be by the same creator
◦ This is especially true if they share a rare reference
Therefore, ASE of reference networks should partition documents by creator, especially if we weight the matrix by how rare the references are and by how many references the documents contain
Validated on publication data (70-80% accuracy) (Tang and Walsh, 2010)
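The intuition above can be sketched in a few lines of Python. This is only an illustration, not the scoring formula from Tang and Walsh (2010): the inverse-citation-count rarity weight, the geometric-mean normalization by reference-list length, and the names `reference_counts` and `similarity` are all assumptions made for the example.

```python
import math
from collections import Counter

def reference_counts(patents):
    """Count how many patents cite each reference (its 'popularity')."""
    return Counter(ref for refs in patents.values() for ref in set(refs))

def similarity(refs_a, refs_b, counts):
    """Assumed score: shared references, weighted by rarity and list length."""
    refs_a, refs_b = set(refs_a), set(refs_b)
    shared = refs_a & refs_b
    if not shared:
        return 0.0
    # Assumed rarity weight: a reference cited by few patents counts more.
    rarity = sum(1.0 / counts[ref] for ref in shared)
    # Assumed size weight: long reference lists share references more easily,
    # so divide by the geometric mean of the two list lengths.
    size = math.sqrt(len(refs_a) * len(refs_b))
    return rarity / size

# Toy data: R_pop is cited by every patent, R_rare by only two.
patents = {
    "P1": {"R_rare", "R_pop"},
    "P2": {"R_rare", "R_pop"},
    "P3": {"R_pop", "R_x"},
    "P4": {"R_pop", "R_y"},
}
counts = reference_counts(patents)
print(similarity(patents["P1"], patents["P2"], counts))  # shares a rare reference
print(similarity(patents["P3"], patents["P4"], counts))  # shares only a popular one
```

In this toy data, P1 and P2 score higher than P3 and P4 because they also share the rarely cited reference, which is the behaviour the slide describes.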
ASE Method: Intuition
6
Graphically, Approximate Structural Equivalence (A.S.E.) is illustrated as: [figure not included in the transcript]
Source: Tang and Walsh (2010)
7
Mathematically, the score of similarity between authors is calculated as: [equation not included in the transcript]
- w1 and w2 are two weights
  w1 = Popularity of the cited references
  w2 = Number of references in a patent document
- D[i, j] is the patent-reference matrix, defined as [inventor IDs × cited references]
The measurement of cognitive homogeneity
8
At the EPO, patent references are added by patent examiners. The purpose of citation is to indicate the most technically relevant information with a “minimum” of references
At the USPTO, inventors or applicants should provide a complete list of all prior art they are aware of
Thus, USPTO data should more accurately reflect the cognitive maps of inventors
H: The A.S.E. algorithm performs better on US patents than on EPO patents. In fact, it should perform poorly in the EPO case
A comparison of different citation governances in EPO and USPTO
9
The gold-standard dataset: the French Benchmark Dataset (APE-INV project, Lissoni et al., 2009)
Exp1&2: EPO citation vs. USPTO citation
◦ We retrieved reference data from PATSTAT
Exp3: A.S.E. vs. multi-stage matching method
◦ Thanks to the open-access dataset provided by Lai and his colleagues (2009), “careers and co-authorship networks of U.S. patent-holders since 1975”
Data and experiment strategy
10
The flow chart of our experiment
11
Inventor     Correct group   Predicted group   False group
John Smith   1               1                 0
John Smith   1               2                 False negative (misclassified as a singleton)
John Smith   1               1                 0
Joan Smith   2               3                 0
Joan Smith   2               1                 False positive (over-clustering)
Joan Smith   2               3                 0
Joe Smith    3               4                 0
Accuracy rate: ((7 - 2) / 7) * 100 = 71%
Calculation of the accuracy rate
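The accuracy calculation on this slide can be reproduced with a short Python sketch. The slide gives only the worked example, so the scoring rule here is an assumption: a record counts as correct when its predicted group matches the majority predicted group of its true inventor. Otherwise it is a false positive if it landed in another inventor's majority group, and a false negative if not. The function names are hypothetical.

```python
from collections import Counter, defaultdict

def score_records(true_groups, predicted_groups):
    """Flag each record as 'ok', 'false positive', or 'false negative'."""
    # Collect the predicted labels seen for every true inventor.
    by_true = defaultdict(list)
    for true, pred in zip(true_groups, predicted_groups):
        by_true[true].append(pred)
    # Assumed rule: the most common predicted label represents the inventor.
    majority = {t: Counter(p).most_common(1)[0][0] for t, p in by_true.items()}
    claimed = set(majority.values())

    flags = []
    for true, pred in zip(true_groups, predicted_groups):
        if pred == majority[true]:
            flags.append("ok")
        elif pred in claimed:
            flags.append("false positive")   # merged into another inventor
        else:
            flags.append("false negative")   # split off from its own inventor
    return flags

def accuracy_rate(flags):
    """Share of correctly grouped records, in percent."""
    return 100.0 * sum(flag == "ok" for flag in flags) / len(flags)

# The seven records from the slide: three John, three Joan, one Joe Smith.
true_groups = [1, 1, 1, 2, 2, 2, 3]
predicted_groups = [1, 2, 1, 3, 1, 3, 4]
flags = score_records(true_groups, predicted_groups)
print(flags)
print(round(accuracy_rate(flags)))  # 71, matching ((7 - 2) / 7) * 100
```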
12
Among all 1,850 patents in the French Benchmark dataset (incl. patents with no citations):
◦ Using EPO reference data, the A.S.E. method can reach 77% accuracy
◦ Using USPTO reference data, the A.S.E. method can reach 78% accuracy
Experiment 1: For all the records
13
Among all the patents with at least one patent reference:
◦ Using EPO reference data, the A.S.E. method can reach 79% accuracy (N=1051)
◦ Using USPTO reference data, the A.S.E. method can reach 82% accuracy (N=361)
Experiment 2: Patents with at least one patent reference
14
Among the 361 US patents, 299 records were found in Lai, D’Amour and Fleming’s inventor dataset
◦ The A.S.E. method can reach 80% accuracy (on either EPO or USPTO data)
◦ The multistage name-matching method reaches 61% accuracy
Experiment 3: A.S.E vs. Multistage method
15
Sensitivity analysis: Accuracy by Threshold, EPO vs. USPTO
16
The findings do not fully support our hypothesis: the A.S.E. method performs only slightly better on the US patents than on the European patents.
◦ The French Benchmark dataset has many singletons
◦ Did the EPO examiners do a very good job of reviewing each inventor’s prior work?
The A.S.E. method reaches a higher accuracy rate than the more elaborate multistage method
Thus, our method works, but perhaps not for the reasons we think. Company benchmark data should be used to double-check this method in the future.
Summary of results
17
Advantages:
1. Researchers using the A.S.E. method need to worry less about the mobility issue, because the algorithm is insensitive to changes of address and/or affiliation. The only thing A.S.E. captures is the trajectory of the knowledge footprint
2. It is less time-consuming and uses fewer computational resources. The A.S.E. method requires only a few pieces of information, i.e. patent no., patent references and the popularity of the cited references
3. A.S.E. does not use affiliation or co-inventors in the disambiguation, so these can be used to track mobility or collaboration
Discussions
18
Negatives:
1. The A.S.E. method can only be applied if the inventor’s patent has at least one linkage with the rest of his or her patents. Patents with no references will be treated as singletons automatically
2. EPO examiners cite fewer references. Around 50% of the EPO patents in this study are singletons (vs. 5% in the USPTO). (In this experiment, even including these, the result still yields nearly 80% accuracy, since many records in the French scientist data are in fact singletons.)
Discussion-cont.
19
Limitations:
1. The A.S.E. method may not be able to relate an inventor’s patents if he or she radically changes project from one technical field to another (although if the shift happens gradually over time, the method will capture this with a transitivity rule)
2. Although the A.S.E. method requires fewer parameters in the algorithm, it might be hard to apply to an X million by X million matrix. Some level of simple classification could help.
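The transitivity rule and the automatic singletons mentioned in the discussion can be illustrated with a small Python sketch. It is not the presenters' implementation: plain Jaccard overlap stands in for the A.S.E. score, union-find is one assumed way to apply transitivity, and the threshold values and the names `jaccard` and `cluster` are hypothetical.

```python
from itertools import combinations

def jaccard(refs_a, refs_b):
    """Stand-in similarity (assumption): share of references in common."""
    union = refs_a | refs_b
    return len(refs_a & refs_b) / len(union) if union else 0.0

def cluster(patents, threshold):
    """Link patents scoring at or above the threshold, then merge transitively."""
    parent = {pid: pid for pid in patents}

    def find(pid):
        # Follow parent links to the cluster representative, compressing paths.
        while parent[pid] != pid:
            parent[pid] = parent[parent[pid]]
            pid = parent[pid]
        return pid

    for a, b in combinations(patents, 2):
        if jaccard(patents[a], patents[b]) >= threshold:
            parent[find(a)] = find(b)   # merge the two clusters

    groups = {}
    for pid in patents:
        groups.setdefault(find(pid), []).append(pid)
    return sorted(sorted(members) for members in groups.values())

# P1 and P3 share nothing directly, but both overlap with P2 (a gradual shift).
# P4 has no references at all, so it can never link to anything.
patents = {
    "P1": {"A", "B"},
    "P2": {"B", "C"},
    "P3": {"C", "D"},
    "P4": set(),
    "P5": {"X"},
}
print(cluster(patents, threshold=0.3))  # [['P1', 'P2', 'P3'], ['P4'], ['P5']]
print(cluster(patents, threshold=0.5))  # every patent becomes a singleton
```

The pairwise loop also shows why scale is a concern: it compares every pair of patents, so a simple pre-classification that limits comparisons to patents within the same class would shrink the work considerably.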
Discussion-cont.
20
Thanks for your attention.
Comments or suggestions?