Disambiguating Patent Inventors: A Non-Name-Matching Approach

1

Disambiguating Patent Inventors: A Non-Name-Matching Approach

Presenter: Hsini Huang

Co-authors: Li Tang and John P. Walsh
Georgia Institute of Technology

ESF-APE-INV 2nd “Name Game” workshop, Dec 9, 2010
Madrid, Spain

2

Authorship identification has been the Achilles' heel of bibliometric analyses at the individual level, e.g. citation impact analysis (Tang and Walsh, 2010).

Raffo and Lhuillery (2009) also warned that the reliability of statistical results on patenting inventors depends heavily on the accuracy rate achieved by a fine matching heuristic.

A challenge to undertake

3

Several reasons why name-matching is probably not a good idea:

◦ Cleaning typos in names (inventor, assignee, etc.) is a difficult task

◦ The matching criteria are often used as dependent variables too, e.g. co-authorship, knowledge flows and geographical spillover (Singh, 2004)

◦ “Name plus affiliation plus address” could be effective if inventors are not mobile

Why solve the “John Smith” problem differently?

4

Cognitive map
◦ A process consisting of a series of psychological transformations by which an individual acquires, codes, stores, recalls, and decodes information in a spatial/information environment

Structural equivalence
◦ In a single-relation network, two actors are structurally equivalent if they have identical ties to and from all the other actors

Approximate Structural Equivalence (A.S.E.)
◦ Actors within a structurally equivalent cluster are more similar to each other than to those outside the cluster

ASE Method: Key Concepts
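The definition of structural equivalence can be checked directly on a small adjacency matrix. The sketch below is illustrative only: the toy network, the function names, and the use of a Jaccard overlap as a stand-in for "approximate" equivalence are assumptions, not part of the A.S.E. method as published.

```python
def ties(adj, i, exclude):
    """Outgoing and incoming ties of actor i, ignoring the actors in `exclude`."""
    n = len(adj)
    out_ties = frozenset(k for k in range(n) if k not in exclude and adj[i][k])
    in_ties = frozenset(k for k in range(n) if k not in exclude and adj[k][i])
    return out_ties, in_ties


def structurally_equivalent(adj, i, j):
    """Exact case: i and j have identical ties to and from all other actors."""
    return ties(adj, i, {i, j}) == ties(adj, j, {i, j})


def tie_overlap(adj, i, j):
    """Jaccard overlap of tie profiles (assumed proxy for approximate equivalence)."""
    out_i, in_i = ties(adj, i, {i, j})
    out_j, in_j = ties(adj, j, {i, j})
    a = {("out", k) for k in out_i} | {("in", k) for k in in_i}
    b = {("out", k) for k in out_j} | {("in", k) for k in in_j}
    if not a and not b:
        return 1.0  # two isolated actors have identical (empty) profiles
    return len(a & b) / len(a | b)


# Hypothetical directed network: actors 0 and 1 both point to 2 and 3;
# actor 4 points only to 3.
adj = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
]

print(structurally_equivalent(adj, 0, 1))  # actors 0 and 1 share every tie
print(tie_overlap(adj, 0, 4))              # actors 0 and 4 share only part of their ties
```

Exact equivalence is rare in real data, which is why a graded overlap score of this kind is the more practical quantity to cluster on.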

5

The references in a publication or patent should reflect the cognitive map of the author or inventor

If two documents share one or more references, they are more likely to be by the same creator
◦ This is especially true if they share a rare reference

Therefore, ASE of reference networks should partition documents by creators, especially if we weight the matrix by how rare the references are, and by how many references are in the documents

Validated on publication data (70-80% accuracy; Tang and Walsh, 2010)

ASE Method: Intuition
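The intuition of weighting shared references by rarity and by reference-list length can be sketched as follows. The weighting used here (each shared reference counts 1 / its citation count, and the sum is divided by the geometric mean of the two list lengths) is an assumption chosen for illustration, not the formula from Tang and Walsh (2010); the document and reference IDs are hypothetical.

```python
from collections import Counter
from math import sqrt


def reference_similarity(docs):
    """Pairwise similarity of documents based on shared references.

    docs maps a document ID to the set of references it cites.
    Assumed weighting: rare shared references count more, and long
    reference lists are discounted.
    """
    popularity = Counter(ref for refs in docs.values() for ref in refs)
    ids = sorted(docs)
    scores = {}
    for pos, a in enumerate(ids):
        for b in ids[pos + 1:]:
            shared = docs[a] & docs[b]
            if not shared:
                continue  # no shared reference, no link
            rarity = sum(1.0 / popularity[ref] for ref in shared)
            scores[(a, b)] = rarity / sqrt(len(docs[a]) * len(docs[b]))
    return scores


# Hypothetical documents: P1 and P2 share a rarely cited reference,
# while every document cites the common one.
docs = {
    "P1": {"R_rare", "R_common"},
    "P2": {"R_rare", "R_common"},
    "P3": {"R_common", "R_x"},
    "P4": {"R_common", "R_y"},
}
scores = reference_similarity(docs)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 3))
```

In this toy input the pair sharing the rare reference scores higher than any pair that shares only the widely cited one, which is the behaviour the slide describes.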

6

[Figure: graphical illustration of Approximate Structural Equivalence (A.S.E.)]

Source: Tang and Walsh (2010)

7

Mathematically, the similarity score between authors is calculated as: [equation image not preserved in the transcript]

- w1 and w2 are two weights:
  w1 = popularity of the cited references
  w2 = number of references in a patent document

- D[i, j] is the patent-reference matrix, defined as [inventor IDs × cited references]

The measurement of cognitive homogeneity
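The inputs named on this slide can be built from raw citation lists as sketched below. This covers only the matrix D and the two quantities behind w1 and w2; how they are combined into the final score is not shown, since the equation is not preserved here. The record IDs and patent numbers are hypothetical.

```python
def build_reference_matrix(citations):
    """Return (row_ids, ref_ids, D), where D[i][j] = 1 if record i cites reference j."""
    row_ids = sorted(citations)
    ref_ids = sorted({ref for refs in citations.values() for ref in refs})
    col = {ref: j for j, ref in enumerate(ref_ids)}
    D = [[0] * len(ref_ids) for _ in row_ids]
    for i, rid in enumerate(row_ids):
        for ref in citations[rid]:
            D[i][col[ref]] = 1
    return row_ids, ref_ids, D


# Hypothetical inventor records and the patents they cite
citations = {
    "rec1": ["US100", "US200"],
    "rec2": ["US100", "US300"],
    "rec3": [],  # a record with no references
}
row_ids, ref_ids, D = build_reference_matrix(citations)

# Popularity of each cited reference: column sums of D
w1 = [sum(row[j] for row in D) for j in range(len(ref_ids))]
# Number of references in each record: row sums of D
w2 = [sum(row) for row in D]

print(ref_ids, w1)
print(row_ids, w2)
```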

8

In the EPO, patent references are added by patent examiners. The purpose of a citation is to indicate the most technically relevant information with a “minimum” number of references

In the USPTO, inventors or applicants should provide a complete list of all prior art they are aware of

Thus, USPTO data should more accurately reflect the cognitive maps of inventors

H: The A.S.E. algorithm performs better on US patents than on EPO patents. In fact, it should perform poorly in the EPO case

A comparison of different citation governances in EPO and USPTO

9

The gold-standard dataset: the French Benchmark Dataset (APE-INV project, Lissoni et al., 2009)

Exp1&2: EPO citations vs. USPTO citations
◦ We retrieved reference data from PATSTAT

Exp3: A.S.E. vs. multistage matching method
◦ We used the open-access dataset provided by Lai and his colleagues (2009), the “careers and co-authorship networks of U.S. patent-holders since 1975”

Data and experiment strategy

10

The flow chart of our experiment

11

Inventors     Correct group   Predicted group   False group
John Smith    1               1                 0
John Smith    1               2                 False negative (misclassified as a singleton)
John Smith    1               1                 0
Joan Smith    2               3                 0
Joan Smith    2               1                 False positive (over-clustering)
Joan Smith    2               3                 0
Joe Smith     3               4                 0

Accuracy rate: ((7 - 2) / 7) * 100 = 71%

Calculation of the accuracy rate
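The accuracy calculation on this slide can be reproduced with a short script. The slide does not state the rule used to flag a record as false, so the rule below is an assumption: a record is correct when its predicted group is the one holding most records of its true inventor; otherwise it is a false negative if it was split off alone, and a false positive if it was merged into another cluster. The function name is hypothetical.

```python
from collections import Counter, defaultdict


def score_records(correct, predicted):
    """Flag each record and return (flags, accuracy in percent)."""
    by_true = defaultdict(list)
    for c, p in zip(correct, predicted):
        by_true[c].append(p)
    # Predicted label that holds most records of each true inventor
    dominant = {c: Counter(ps).most_common(1)[0][0] for c, ps in by_true.items()}
    cluster_size = Counter(predicted)

    flags = []
    for c, p in zip(correct, predicted):
        if p == dominant[c]:
            flags.append("ok")
        elif cluster_size[p] == 1:
            flags.append("false negative")  # misclassified as a singleton
        else:
            flags.append("false positive")  # over-clustering
    errors = sum(flag != "ok" for flag in flags)
    return flags, (len(flags) - errors) / len(flags) * 100


# The seven records from the slide: three John Smiths, three Joan Smiths, one Joe Smith
correct = [1, 1, 1, 2, 2, 2, 3]
predicted = [1, 2, 1, 3, 1, 3, 4]

flags, accuracy = score_records(correct, predicted)
print(flags)
print(round(accuracy))  # (7 - 2) / 7 * 100, rounded
```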

12

Among all 1,850 patents in the French Benchmark dataset (incl. patents with no citations):

◦ Using EPO reference data, the A.S.E. method can reach 77% accuracy

◦ Using USPTO reference data, the A.S.E. method can reach 78% accuracy

Experiment 1: For all the records

13

Among all the patents with at least one patent reference:

◦ Using EPO reference data, the A.S.E. method can reach 79% accuracy (N=1051)

◦ Using USPTO reference data, the A.S.E. method can reach 82% accuracy (N=361)

Experiment 2: Patents with at least one patent reference

14

Among the 361 US patents, 299 records were found in Lai, D’Amour and Fleming’s inventor dataset:

◦ The A.S.E. method can reach 80% accuracy (on either EPO or USPTO data)

◦ The multistage name-matching method reaches 61% accuracy

Experiment 3: A.S.E vs. Multistage method

15

Sensitivity analysis: Accuracy by Threshold, EPO vs. USPTO

16

The findings do not completely support our hypothesis: the A.S.E. method performs only slightly better for US patents than for European patents.
◦ The French Benchmark dataset has many singletons
◦ Did the EPO examiners do a very good job of reviewing each inventor’s prior work?

The A.S.E. method reaches a higher accuracy rate than the more elaborate multistage method

Thus, our method works, but perhaps not for the reasons we think. Company benchmark data should be used to double-check this method in the future.

Summary of results

17

Advantages:

1. Researchers using the A.S.E. method need to worry less about the mobility issue, because the algorithm is insensitive to changes of address and/or affiliation. The only thing A.S.E. captures is the trajectory of the knowledge footprint

2. It is less time-consuming and needs fewer computational resources. The A.S.E. method requires only a few pieces of information, i.e. patent number, patent references and the popularity of the cited references

3. A.S.E. does not use affiliation or co-inventors in the disambiguation, so these can be used to track mobility or collaboration

Discussions

18

Negatives:

1. The A.S.E. method can only be applied if an inventor’s patent has at least one linkage with the rest of his or her patents. Patents with no references are automatically treated as singletons

2. EPO examiners cite fewer references. Around 50% of the EPO patents in this study are singletons (vs. 5% in the USPTO). In this experiment, even when these are included, the result still yields nearly 80% accuracy, since many records in the French scientist data are in fact singletons

Discussion-cont.

19

Limitations:

1. The A.S.E. method may not be able to relate an inventor’s patents if the inventor radically changes projects from one technical field to another (although if the shift happens gradually over time, the method will capture it through a transitivity rule)

2. Although the A.S.E. method requires fewer parameters in the algorithm, it might be hard to apply to an X million by X million matrix. Some level of simple classification could help.

Discussion-cont.
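The transitivity rule in limitation 1 can be sketched with a union-find structure: records are linked when their similarity meets a threshold, and links are then merged transitively. The similarity scores, the threshold values, and the function name are hypothetical; this is one simple way to realise transitive linking, not necessarily how the A.S.E. implementation does it.

```python
def cluster(n_records, scores, threshold):
    """Link records whose score meets the threshold, then merge links transitively."""
    parent = list(range(n_records))

    def find(x):
        # Follow parent pointers to the cluster root, shortening the path on the way
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)
    return [find(i) for i in range(n_records)]


# Hypothetical inventor whose work drifts gradually: record 0 resembles record 1,
# record 1 resembles record 2, but records 0 and 2 share no references.
# Record 3 has no references at all.
scores = {(0, 1): 0.4, (1, 2): 0.3}

loose = cluster(4, scores, threshold=0.25)
strict = cluster(4, scores, threshold=0.35)
print(loose)   # records 0, 1 and 2 end up in one cluster; record 3 stays alone
print(strict)  # the weaker 1-2 link is dropped, so record 2 is split off
```

A record with no references never receives a link and therefore remains a singleton, and raising the threshold trades over-clustering for more split-off records.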

20

Thanks for your attention.
Comments or suggestions?
