Daniela Ichim, Giulio Perani, Giovanni Seri Italian National Statistical Institute (Istat)

Designing Linkage between Patents and Business Registers: the Italian Experience
Daniela Ichim, Giulio Perani, Giovanni Seri, Italian National Statistical Institute (Istat), {ichim,perani,seri}@istat.it
EESW European Establishment Statistics Workshop 2011, Neuchatel, 12-14 September 2011


Page 1: Designing Linkage between Patents and Business Registers: the Italian Experience

Daniela Ichim, Giulio Perani, Giovanni Seri
Italian National Statistical Institute (Istat)
{ichim,perani,seri}@istat.it

EESW European Establishment Statistics Workshop 2011
Neuchatel, 12-14 September 2011

Page 2: Outline

Project description

Data sets

Linkage approach

Pre-processing of the input files

Choice of the matching variables

Choice of the similarity function

Creation of the search space of link candidate pairs

Choice of the decision model and selection of unique links

Record linkage evaluation

Preliminary results

Future work

Page 3: Project description

Aim: profiling the Italian patenting enterprises

Linking economic data and technological information on patenting enterprises in order to identify the key drivers of patenting propensity

Evaluating the economic impact of the patenting activity

Identifying and collecting additional information on enterprises to be surveyed as R&D performers

Investigating specific sub-populations of enterprises (e.g. biotech enterprises)


Page 4: Project description

Source of data: PATSTAT - EPO Worldwide Patent Statistical Database

Target data: applicants based in Italy

Period: patent applications from 1985 to 2010

Subject classification criterion:
A) individuals
B) establishments: business enterprises, public institutions, non-profit institutions, universities


Page 5: Data sets: patents

PATSTAT (1): 299769 applications
• Application number (by year)
• International Patent Classification (IPC) code (each application can be classified under several IPC codes)

PATSTAT (2): 72034 applications
• Application number (by year)
• Applicant name
• Applicant code
• Postal/Zip code
• Applicant country (= IT)

Additional information to be retrieved from the above database:
• Year of first/last application by applicant
• Number of patent applications filed by applicant
• Region of residence of the applicants
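The per-applicant items listed on this slide (year of first/last application, number of applications) amount to a simple aggregation over application records. A minimal sketch, assuming each record is a hypothetical (applicant code, year) pair; the field layout and the sample rows are illustrative and are not taken from PATSTAT. Region of residence is not covered here.

```python
from collections import defaultdict


def summarise_applicants(applications):
    """Aggregate (applicant_code, year) records into per-applicant summaries."""
    years_by_applicant = defaultdict(list)
    for applicant_code, year in applications:
        years_by_applicant[applicant_code].append(year)
    return {
        code: {
            "first_year": min(years),
            "last_year": max(years),
            "n_applications": len(years),
        }
        for code, years in years_by_applicant.items()
    }


# Illustrative records only (hypothetical applicant codes).
sample = [("A1", 1990), ("A1", 2004), ("A2", 1999), ("A1", 1987)]
summary = summarise_applicants(sample)
```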

Page 6: Data sets: enterprises

Italian business register: ASIA (Archivio Statistico Imprese Attive). It is the frame for Istat surveys, built as a logical and physical combination of data from both surveys and administrative sources (Tax Register, Register of Enterprises and Local Units, Social Security Register, Work Accident Insurance Register, Register of the Electric Power Board).

ASIA variables:
• Enterprise identification number
• Enterprise name
• Postal/Zip code
• NACE code
• Address, municipality, province, region
• Legal form
• Fiscal code
• Enterprise size variables: number of employees, turnover

ASIA 1998-2008 (size in 2008: ~4.5 million records)

Page 7: Data sets: linkage output

Applicant identification number

Enterprises identification number

Surveys

Shared variables: name, Postal/Zip code

Page 8: Pre-processing of the input files

Standardisation:

• Accents
• Symbols and special characters
• Double spaces
• Dots (e.g. L.T.D. becomes LTD), punctuation
• Known abbreviations (about 150 ways to say “in short”)
• Most frequent words (more than 1000 and 100)
• Lower/upper case
• Deduplication of words
• Known legal forms (reduced to 6 main categories)
• Universities/public administrations dropped
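Several of the standardisation steps listed on this slide can be chained into a single function. A minimal sketch under stated assumptions: the lookup tables `LEGAL_FORMS` and `ABBREVIATIONS` are hypothetical and deliberately tiny (the slide mentions about 150 abbreviations and 6 legal-form categories without listing them), and removing the most frequent words and dropping universities/public administrations are left out.

```python
import re
import unicodedata

# Hypothetical lookup tables, for illustration only.
LEGAL_FORMS = {"SPA", "SRL", "SNC", "SAS", "COOP"}
ABBREVIATIONS = {"F.LLI": "FRATELLI"}


def standardise_name(raw):
    """Return (standardised name, legal form or None) for a raw name."""
    text = raw.upper()
    # Expand known abbreviations before dots are removed.
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    # Strip accents: decompose characters, then drop combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove dots so that S.P.A. becomes SPA.
    text = text.replace(".", "")
    # Turn remaining symbols and punctuation into spaces.
    text = re.sub(r"[^A-Z0-9 ]", " ", text)
    # split() also collapses double spaces.
    words = text.split()
    legal_form = next((w for w in words if w in LEGAL_FORMS), None)
    # Drop the legal form from the name and deduplicate words, keeping order.
    kept = []
    for word in words:
        if word not in LEGAL_FORMS and word not in kept:
            kept.append(word)
    return " ".join(kept), legal_form
```

For example, `standardise_name("Officine  Rossi & Rossi S.p.A.")` returns `("OFFICINE ROSSI", "SPA")`. The order of the steps matters: abbreviations written with dots have to be expanded before the dots are stripped.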

Page 9: Choice of the matching variables

1. Std name in upper case, with words in alphabetical order
2. Postal/Zip code
3. Legal form

                      Freq
Application number   72037
Applicant code       26509
Std applicants       23833

Legal form   (unlabelled)   COOP    SAS    SNC     SPA     SRL   Total
Freq                 8979     63    501    756    6164    7370   23833
%                   37.7%   0.3%   2.1%   3.2%   25.9%   30.9%    100%
% ASIA               >70%    ~1%   5/7%   6/8%   ~0.5%   7/13%
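The three matching variables can be combined into a single comparison key. A minimal sketch, assuming exact equality on all three variables (the slides do not state how the variables are combined, so this is an assumption); the function names and the record layout are hypothetical.

```python
def matching_key(std_name, zip_code, legal_form):
    """Build a key: upper-case name with words sorted, plus zip and legal form."""
    sorted_name = " ".join(sorted(std_name.upper().split()))
    return (sorted_name, zip_code, legal_form)


def exact_links(applicants, enterprises):
    """Map each applicant id to the enterprise ids sharing its key.

    Both arguments are dicts: id -> (std_name, zip_code, legal_form).
    """
    index = {}
    for ent_id, record in enterprises.items():
        index.setdefault(matching_key(*record), []).append(ent_id)
    return {
        app_id: index.get(matching_key(*record), [])
        for app_id, record in applicants.items()
    }


# Illustrative records only.
applicants = {"P1": ("ROSSI OFFICINE", "00100", "SPA")}
enterprises = {
    "E1": ("OFFICINE ROSSI", "00100", "SPA"),
    "E2": ("OFFICINE ROSSI", "20100", "SPA"),
}
links = exact_links(applicants, enterprises)
```

Sorting the words makes the key insensitive to word order, so "ROSSI OFFICINE" and "OFFICINE ROSSI" compare equal.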

Page 10: Search space reduction

Patent applicants: establishments (enterprises) vs. individuals

- Several words in a name (OK only for enterprises, not for individuals)

Individuals: the Std applicant name does not contain
- a legal form
- a name not included in the database of Italian first names ("List of Italian first names"*)
- special terms: “enterprise”, “construction”, “hotel”, “systems”, “group”, … (63 values)

Std applicant enterprises: 16132 (~65% of Std applicants)

* http://www.nomix.it/nomi-italiani-maschili-e-femminili.php
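The rule for separating individuals from enterprises can be sketched as a small classifier. This is one reading of the slide, not a confirmed specification: a name is treated as an individual when it contains no legal form, no special term, and at least one known first name. The three sets below are hypothetical samples (the slide refers to 63 special terms and an external list of first names).

```python
# Hypothetical sample lists, for illustration only.
LEGAL_FORMS = {"SPA", "SRL", "SNC", "SAS", "COOP"}
SPECIAL_TERMS = {"ENTERPRISE", "CONSTRUCTION", "HOTEL", "SYSTEMS", "GROUP"}
FIRST_NAMES = {"MARIO", "GIULIA", "LUCA"}


def looks_like_individual(std_name):
    """Heuristic: True when the name looks like a person, not an enterprise."""
    words = set(std_name.upper().split())
    # Any legal form or special term marks the applicant as an enterprise.
    if words & LEGAL_FORMS or words & SPECIAL_TERMS:
        return False
    # Otherwise require at least one known first name.
    return bool(words & FIRST_NAMES)
```

Names that match none of the three lists (for example a single unknown word) are not classified as individuals here; how such cases were handled is not stated in the slides.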

Page 11: Search space reduction

• Blocking by year of application (reduces only the size of the patent applicants archive: ineffective)

• Blocking by Postal/Zip Code-Region (ineffective)

• Partition of ASIA 2008 (more than 10 employees, 1 employee with legal form)

• ASIA 2007-1998 (recursively removing the enterprises included in more recent ASIA archives)

• R&D survey frame (as a subset of ASIA archive)
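The recursive removal across ASIA years can be sketched as a pass from the newest archive to the oldest, keeping in each year only the enterprises not already seen. A minimal sketch, assuming each archive is available as a set of enterprise identifiers; the function name and the sample identifiers are hypothetical.

```python
def newest_first_partitions(archives):
    """Keep, for each year, only the ids absent from every more recent year.

    archives: dict mapping year -> set of enterprise ids.
    """
    seen = set()
    partitions = {}
    for year in sorted(archives, reverse=True):
        partitions[year] = archives[year] - seen
        seen |= archives[year]
    return partitions


# Illustrative identifiers only.
archives = {
    2008: {"E1", "E2"},
    2007: {"E2", "E3"},
    2006: {"E1", "E3", "E4"},
}
partitions = newest_first_partitions(archives)
```

Each enterprise then appears in exactly one partition, the one for the most recent year in which it is registered, so older archives add only enterprises that no longer exist in later ones.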

Page 12: Search space reduction

Neighbourhoods of words: the set of ASIA enterprises having at least one word in common with the patent applicant name

Huge number of small problems!!!!
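A neighbourhood of words can be computed with an inverted index from words to enterprises. A minimal sketch, assuming names are already standardised and split on spaces; the function names and the sample records are hypothetical.

```python
from collections import defaultdict


def build_word_index(enterprises):
    """Map each word to the set of enterprise ids whose name contains it."""
    index = defaultdict(set)
    for ent_id, name in enterprises.items():
        for word in name.split():
            index[word].add(ent_id)
    return index


def neighbourhood(applicant_name, index):
    """Enterprises sharing at least one word with the applicant name."""
    result = set()
    for word in applicant_name.split():
        result |= index.get(word, set())
    return result


# Illustrative records only.
enterprises = {
    "E1": "OFFICINE ROSSI",
    "E2": "ROSSI MECCANICA",
    "E3": "BIANCHI TESSILE",
}
index = build_word_index(enterprises)
```

With this index, `neighbourhood("MECCANICA ROSSI", index)` returns `{"E1", "E2"}`: each applicant gets its own small candidate set instead of being compared with the whole register.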

Page 13: Search space reduction

Neighbourhoods of words:

Hypothesis:
- at least one word of a name is registered in the same way in both registers

Problems:
- very short words (1-2 letters) generate huge neighbourhoods
- very common words generate huge neighbourhoods
- names without a neighbourhood
- not applicable in a probabilistic approach

* 23338 patent applicants ~ ASIA 2008 (10+ employees)
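One possible way to detect the problem cases listed on this slide is to look at word length and word frequency before building a neighbourhood. This mitigation is not described in the slides; it is a sketch under assumptions, and the thresholds `min_length` and `max_count` are hypothetical.

```python
from collections import Counter


def word_frequencies(enterprise_names):
    """Count in how many enterprise names each word occurs."""
    counts = Counter()
    for name in enterprise_names:
        counts.update(set(name.split()))
    return counts


def informative_words(name, word_counts, min_length=3, max_count=1000):
    """Words that are long enough, present in the register, and not too common."""
    return [
        word
        for word in name.split()
        if len(word) >= min_length and 0 < word_counts.get(word, 0) <= max_count
    ]


# Illustrative names only.
counts = word_frequencies(["OFFICINE ROSSI", "ROSSI AB", "ROSSI MECCANICA"])
```

An empty result flags a name whose neighbourhood would be either empty or built only from very short or very common words.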

Page 14: Preliminary results

Still under expert clerical check (~hundreds)

No duplicated enterprise codes

Clerical review on 190 randomly chosen links: 97.5%

Classes of employees   Neuchatel first phase   Neuchatel second phase
                           Freq       %            Freq       %
1                           793     8.1            2459    19.3
(1-10)                     1995    20.4            2502    19.6
10+                        6985    71.5            7809    61.2
Total                      9773                   12770
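The percentages in the table follow from the frequencies and the column totals. A short check, using only the figures reported on this slide; the function name is hypothetical.

```python
def shares(freqs):
    """Percentage share of each frequency in the column total, to 1 decimal."""
    total = sum(freqs)
    return [round(100 * f / total, 1) for f in freqs]


# Frequencies by class of employees, as reported on the slide.
first_phase = [793, 1995, 6985]
second_phase = [2459, 2502, 7809]

first_shares = shares(first_phase)
second_shares = shares(second_phase)
```

The frequencies sum to the reported totals (9773 and 12770), and the recomputed shares agree with the published percentages.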

Page 15: Preliminary results

Patent applicants by year: lost and found (black and red)


Page 16: Preliminary results

Patenting enterprises in ASIA 2008 by economic activity (NACE 2007): the 5 most frequent NACE divisions

NACE  Description                                                                Freq     %   Cum  Cum%
28    Manufacture of machinery and equipment n.e.c.                              2203  22.0  2203  22.0
25    Manufacture of fabricated metal products, except machinery and equipment    899   9.0  3102  31.0
46    Wholesale trade, except of motor vehicles and motorcycles                   736   7.3  3838  38.3
22    Manufacture of rubber and plastic products                                  607   6.1  4445  44.3
27    Manufacture of electrical equipment                                         465   4.7  4910  49.0

Page 17: Future work

Methods
• Neighbourhood based on similarity instead of equality
• Probabilistic approach (using the R&D survey frame)

Units
• Names containing only 2-letter words
• Individuals (names without legal form)
• List of companies’ owners and partners
• List of university professors/researchers
• Names without a neighbourhood

Analyses
• Produce analytical evidence on specific technological areas (e.g. biotech) using IPC codes
• Overall classification of patent applicants
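A neighbourhood based on similarity instead of equality could compare words with a string-similarity score rather than exact matching. A minimal sketch using `difflib.SequenceMatcher` from the Python standard library; the choice of this similarity function and the threshold of 0.85 are assumptions, since the slides do not name a similarity function.

```python
from difflib import SequenceMatcher


def similar(word_a, word_b, threshold=0.85):
    """True when the two words are similar enough (1.0 means identical)."""
    return SequenceMatcher(None, word_a, word_b).ratio() >= threshold


def similarity_neighbourhood(applicant_name, enterprises, threshold=0.85):
    """Enterprises with at least one word similar to a word of the applicant."""
    applicant_words = applicant_name.split()
    return {
        ent_id
        for ent_id, name in enterprises.items()
        if any(
            similar(a, e, threshold)
            for a in applicant_words
            for e in name.split()
        )
    }


# Illustrative records only; "MECANICA" is a deliberate misspelling.
enterprises = {"E1": "MECANICA ROSSI", "E2": "BIANCHI TESSILE"}
found = similarity_neighbourhood("MECCANICA VERDI", enterprises)
```

This version compares every applicant word with every enterprise word, which is only feasible on a reduced register; on the full archive some form of indexing would still be needed.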


Page 18: Thank you for your attention!