© Federal Statistical Office Germany, IV A2
Federal Statistical Office Germany
Application of Regular Expressions in the German Business Register
Session 5: Projects on Improvements for Business Registers
Wiesbaden Group on Business Registers, Paris, November 26th 2007, Patrizia Moedinger
Example 1: Improving legal form coding by using regular expressions
Background
information on legal forms mainly from VAT records
not all administrative sources provide information on legal forms
sources use different, incompatible legal form codings or different aggregation levels
special requirements for other purposes like the coding of institutional sectors
Background
enterprises (legal units) with certain legal forms are legally obliged to carry their legal form in the enterprise name:
  incorporated firms
  non-incorporated firms
  cooperatives
  merchants that are registered in the German Commercial Register
enterprise names can be used for legal form coding
Definition of search patterns
patterns from nomenclature, abbreviations and notations (tax authorities)
  GmbH, AG & Co.KG, Limited, Ltd.
patterns from real BR data
  mistakes in writing, missing blanks, ..
construction of regular expressions
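A minimal sketch of how such search patterns might be built, assuming Python's `re` module. The pattern table, the labels and the function name `detect_legal_form` are hypothetical illustrations, not the patterns used in the German Business Register. Each pattern tolerates some notation variants (optional dots, missing blanks), and composite forms are tested before their components.

```python
import re

# Hypothetical pattern table: (legal form label, compiled regular expression).
# Composite forms such as "AG & Co. KG" come first so the longer notation wins.
LEGAL_FORM_PATTERNS = [
    ("AG & Co. KG", re.compile(r"\bAG\s*(?:&|und|u\.)\s*Co\.?\s*KG\b", re.IGNORECASE)),
    ("GmbH", re.compile(r"\bG\.?\s?m\.?\s?b\.?\s?H\b\.?", re.IGNORECASE)),
    ("Limited", re.compile(r"\b(?:Limited|Ltd)\b\.?", re.IGNORECASE)),
    ("AG", re.compile(r"\bAG\b")),  # case-sensitive to limit false hits
]


def detect_legal_form(name):
    """Return the label of the first matching pattern, or None if none matches."""
    for label, pattern in LEGAL_FORM_PATTERNS:
        if pattern.search(name):
            return label
    return None


if __name__ == "__main__":
    for example in ["Muster GmbH", "Muster G.m.b.H.", "Muster AG & Co.KG", "Hans Muster"]:
        print(example, "->", detect_legal_form(example))
```

A real pattern set would be extended from the notations actually observed in the register data, which is why the evaluation step that follows matters.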
Evaluation of search patterns
completeness of coding
  legal obligation: high share of legal forms found in enterprise names
degree of reliability: evaluation of coding results
  drawing a sample after legal form coding
  classification of the coding results
  calculation of sensitivity, specificity, positive predictive value, negative predictive value
Completeness of coding
[Figure: stacked bar chart, share of units in % (axis 0 to 100) by legal form group]
legal form group                                    legal form detected   no legal form detected
                                                    from enterprise name  from enterprise name
sole proprietors                                     6.3                  93.7
non-incorporated firms                              90.1                   9.9
incorporated firms                                  96.8                   3.2
miscellaneous legal forms (including cooperatives)  10.3                  89.7
Evaluation of Type I and II errors
N = 4,000
regular expression detects    enterprise name contains    enterprise name contains
                              legal form                  no or wrong legal form
legal form                    1,009                       4
no or wrong legal form        26                          2,961
PPV (positive predictive value) = 1,009 / (1,009 + 4) = 99.6 %
NPV (negative predictive value) = 2,961 / (2,961 + 26) = 99.1 %
Sensitivity = 1,009 / (1,009 + 26) = 97.5 %
Specificity = 2,961 / (4 + 2,961) = 99.9 %
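The four measures can be recomputed from the cell counts of the table. The sketch below assumes the usual reading of the table (rows: what the regular expression detects; columns: what the enterprise name actually contains). The function name `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the four evaluation measures from a 2x2 classification table."""
    return {
        "ppv": tp / (tp + fp),          # detected legal forms that are correct
        "npv": tn / (tn + fn),          # non-detections that are correct
        "sensitivity": tp / (tp + fn),  # existing legal forms that were found
        "specificity": tn / (tn + fp),  # names without legal form left uncoded
    }


# Cell counts taken from the sample of N = 4,000 enterprise names.
metrics = classification_metrics(tp=1009, fp=4, fn=26, tn=2961)

if __name__ == "__main__":
    for name, value in metrics.items():
        print(f"{name}: {value:.1%}")
```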
Example 2: Data pre-processing as a preliminary for record linkage
Background
no common unique identifiers available
data from different sources are initially linked by names and addresses
different or no address standards
different notations: "BMW", "Bayerische Motorenwerke" or "Bay. Motorenwerke"
the German BR is technically limited in the number of addresses it can store (only dispatch and domicile)
Problem of non-standardized notations
matching by administrative identifiers
dependent variable = match by administrative identifiers + no change in the postal code
independent variable = differences between enterprise names, street names and town names (Levenshtein edit distance)
  same (administrative) source
  different sources (administrative source – BR)
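The independent variable can be illustrated with a short sketch of the Levenshtein edit distance, divided by the length of the longer string so that values fall between 0 and 1. This is the textbook dynamic-programming formulation; the function names are hypothetical, and the matching program actually used is not specified in the slides.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Edit distance divided by the maximum string length (0 = identical)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


if __name__ == "__main__":
    print(normalized_distance("BMW", "Bayerische Motorenwerke"))
    print(normalized_distance("Bay. Motorenwerke", "Bayerische Motorenwerke"))
```

The first pair shows why unstandardized notations are a problem: the same enterprise can produce a distance close to 1.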
Matching probability against string similarity within an administrative source (Employment Agency) (Model: Logistic regression)
[Figure: predicted matching probability (y-axis, 0 to 1) against Levenshtein edit distance / maximum string length (x-axis, 0 to 1); curves for enterprise name, street name and town name; regions labelled Match and No Match]
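The curves in such a plot follow the logistic function. The sketch below shows only the functional form; the coefficients are hypothetical values chosen to produce a falling curve and are not the estimates from the presentation.

```python
import math


def match_probability(normalized_distance, intercept, slope):
    """Logistic model: P(match) = 1 / (1 + exp(-(intercept + slope * d)))."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * normalized_distance)))


# Hypothetical coefficients (assumption, not estimated from real data).
INTERCEPT, SLOPE = 4.0, -10.0

if __name__ == "__main__":
    for d in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
        print(f"d = {d:.1f}  P(match) = {match_probability(d, INTERCEPT, SLOPE):.3f}")
```

With a negative slope the predicted probability of a match falls as the normalized distance grows, and the point where it crosses 0.5 separates the "Match" from the "No Match" region.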
Matching probability against string similarity between an administrative source (Employment Agency) and BR (Model: Logistic regression)
[Figure: predicted matching probability (y-axis, 0 to 1) against Levenshtein edit distance / maximum string length (x-axis, 0 to 1); curves for enterprise name, street name and town name; regions labelled Match and No match]
Pre-processing of administrative data for record linkage
high level of similarity between two strings indicates identical units
high level of disparity between two strings indicates different units
[Diagram: scale of differences in name or address from low to high; low = identical unit, high = different unit]
Pre-processing of administrative data for record linkage
conversion into specific variables for string matching
enterprise address: BMW AG Branch Munich Mr Mueller
  enterprise name: BMW
  legal form: AG
  other elements: Branch Munich Mr Mueller
simplify comparison strings
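The conversion into separate variables might be sketched as follows, again assuming Python's `re` module. The short list of legal form notations, the field names and both function names are hypothetical; a production pattern set would depend on the data source and the later use.

```python
import re

# Hypothetical, deliberately short list of legal form notations.
LEGAL_FORM = re.compile(r"\b(GmbH|AG|KG|Ltd\.?|Limited)(?=\s|$)")


def split_name_field(text):
    """Split a free-text field into enterprise name, legal form and remainder."""
    match = LEGAL_FORM.search(text)
    if match is None:
        return {"enterprise_name": text.strip(), "legal_form": None, "other_elements": ""}
    return {
        "enterprise_name": text[:match.start()].strip(),
        "legal_form": match.group(1),
        "other_elements": text[match.end():].strip(),
    }


def comparison_string(name):
    """Simplify a string for matching: upper case, no punctuation, single blanks."""
    cleaned = re.sub(r"[^\w\s]", " ", name.upper())
    return " ".join(cleaned.split())


if __name__ == "__main__":
    print(split_name_field("BMW AG Branch Munich Mr Mueller"))
    print(comparison_string("Bay. Motorenwerke"))
```

Separating the legal form and other elements first means that the string comparison is applied only to the part of the field that identifies the enterprise.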
Methods for evaluation
evaluate the link between string similarity and match before and after pre-processing the data
evaluation of matching results (drawing a sample after the matching process)
  classification of the matching results
  calculation of sensitivity, specificity, positive predictive value, negative predictive value
controlling for effects caused by the matching program used
Synopsis
BR text data needs special treatment in data processing
applications for regular expressions
  simple application: legal form coding (limited set of search patterns)
  more complex application: pre-processing (set of patterns depends on data source and later use)
application of regular expressions should always be evaluated