An Ensemble Approach to Financial Entity Matching
IE @ FEIII Challenge 2016, DSMM Workshop 2016
Enrico Palumbo, ISMB, Italy
Giuseppe Rizzo, ISMB, Italy
Raphaël Troncy, EURECOM, France
Introduction
The FEIII Challenge requires finding matching financial entities:
Task 1: from the Federal Financial Institutions Examination Council (FFIEC) dataset to Legal Entity Identifiers (LEI)
Task 2: from the Federal Financial Institutions Examination Council (FFIEC) dataset to the Securities and Exchange Commission (SEC) dataset
Dataset   Num. of entities   Fields   Format
FFIEC     6652               15       .csv
LEI       53958              39       .csv
SEC       129312             24       .csv
Baseline
Duke (https://github.com/larsga/Duke) implements Naive Bayes classification:
a. Select a number of fields for the comparison
b. Index records in a Lucene database
c. Cleaners preprocess strings before comparison
d. Comparators define the metrics used to compare strings
e. Field-wise similarities s are turned into probabilities p = P(Match | s)
f. Probabilities are aggregated and turned into a match decision by thresholding, as sketched below
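The aggregation rule itself did not survive extraction; what follows is a minimal sketch of the naive Bayes combination of field-wise probabilities with a thresholded decision, which is how Duke is commonly described (function names and example values are illustrative, Duke's exact internals may differ):

```python
from math import prod

def combine(field_probs):
    """Naive Bayes combination of field-wise match probabilities:
    p = prod(p_i) / (prod(p_i) + prod(1 - p_i)).
    (Standard formulation; Duke's exact internals may differ.)"""
    match = prod(field_probs)
    no_match = prod(1.0 - p for p in field_probs)
    return match / (match + no_match)

def decide(field_probs, threshold):
    """Binary decision: match iff the combined probability
    reaches the configured threshold."""
    return combine(field_probs) >= threshold

# Example: three fields agree strongly, one weakly.
print(decide([0.9, 0.8, 0.85, 0.4], threshold=0.89))  # -> True (p ≈ 0.99)
```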
Development set
Semi-automatic process:
● Start from the small sample data released by FEIII
● Configure Duke for high recall
● Run it in interactive mode
● Annotate matches, including true and false positives
● Save the dev set as a Duke test file: +, id1, id2, 1.0
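A minimal sketch of writing such a dev set file; the ids, file name, and the "-" prefix for non-matches are assumptions, only the "+, id1, id2, 1.0" layout comes from the slide:

```python
import csv

# Hypothetical pairs annotated during Duke's interactive mode:
# (ffiec_id, other_id, is_match)
annotated = [
    ("FFIEC_0001", "LEI_0042", True),
    ("FFIEC_0002", "LEI_0077", False),
]

# Write a Duke test file: "+" marks a true match,
# "-" (assumed convention) a non-match.
with open("devset.txt", "w", newline="") as f:
    writer = csv.writer(f)
    for id1, id2, match in annotated:
        writer.writerow(["+" if match else "-", id1, id2, "1.0"])
```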
Properties and cleaners
FFIEC          SEC              LEI                    Property   Cleaner
Name cleaned   CONFORMED-NAME   Name cleaned           NAME       LowCase + FinancialInstitutionNameCleaner
Address        B-STREET         Address line cleaned   ADDRESS    LowCase
City           B-CITY           Address city           CITY       LowCase
State          B-STPR           Address region         STATE      LowCase
Zipcode        B-POSTAL         Address postal code    ZIPCODE    DigitsOnly
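Rough Python equivalents of the cleaners above (the real ones are Java classes in Duke and in the sfem repository; this is only a sketch):

```python
import re

def low_case(s: str) -> str:
    """LowCase: normalize case and surrounding whitespace."""
    return s.strip().lower()

def digits_only(s: str) -> str:
    """DigitsOnly: keep digits, drop everything else (used for ZIPCODE)."""
    return re.sub(r"\D", "", s)

def financial_institution_name_cleaner(s: str) -> str:
    """Sketch of the custom name cleaner: lower-case, strip punctuation,
    collapse whitespace. Illustrative only; the real cleaner lives in
    https://github.com/enricopal/sfem."""
    return " ".join(re.sub(r"[^\w\s]", " ", s.lower()).split())

print(digits_only("17325-0129"))                         # -> 173250129
print(financial_institution_name_cleaner("ACNB Corp."))  # -> acnb corp
```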
Comparators
Property   Comparator
NAME       Semantic Financial Institution Comparator
ADDRESS    Jaro-Winkler
CITY       Jaro-Winkler
STATE      Exact Comparator
ZIPCODE    Exact Comparator
The Semantic Financial Institution Comparator aims at reducing false positives such as “acnb bank” vs “acnb corp”, increasing precision. It allows a match only if certain keywords, such as “corp” or “bancorp”, are present in both names. If none of the keywords appear in either name, it falls back to the Jaro-Winkler distance, as sketched below.
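A minimal Python sketch under one plausible reading of the rule (a keyword must be shared by both names); the keyword list and the jellyfish library are assumptions, the actual Java implementation is in https://github.com/enricopal/sfem:

```python
import jellyfish  # third-party string metrics library (pip install jellyfish)

# Entity-type keywords; the actual list used by the comparator may differ.
KEYWORDS = {"corp", "bancorp", "bank", "company", "trust"}

def semantic_fi_comparator(name1: str, name2: str) -> float:
    """Sketch of the Semantic Financial Institution Comparator:
    if entity-type keywords occur, a match is allowed only when at least
    one keyword is shared; otherwise fall back to Jaro-Winkler.
    Assumes names are already cleaned and lower-cased."""
    k1 = {w for w in name1.split() if w in KEYWORDS}
    k2 = {w for w in name2.split() if w in KEYWORDS}
    if (k1 or k2) and not (k1 & k2):
        return 0.0  # e.g. "acnb bank" vs "acnb corp"
    return jellyfish.jaro_winkler_similarity(name1, name2)

print(semantic_fi_comparator("acnb bank", "acnb corp"))  # -> 0.0
```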
Threshold
Crucial parameter
Manually set to optimize performance on the dev set:
Task 1: 0.890, 0.895
Task 2: 0.870, 0.865
Problem: trade-off between precision and recall.
High threshold -> low recall, high precision
Low threshold -> high recall, low precision
Hard to find an optimal value -> ensemble
Ensemble approach
1. Blocking: limit the comparison to relevant candidates and avoid the naive N*M comparison, using an inverted index (Lucene database).
2. Duke: aggregate field-wise similarities into a final probability of match, turned into a binary decision through a threshold.
3. Ensemble: combine the decisions of a collection of classifiers corresponding to different decision thresholds (parameters: (N, a)); see the sketch below.
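A minimal sketch of the ensemble vote; the threshold spacing (cutoffs t, t + a, ..., t + (N - 1) * a) is an assumption, since the slides only name the parameters (N, a):

```python
def ensemble_decision(p: float, t: float, a: float, n: int,
                      mode: str = "majority") -> bool:
    """Combine n thresholded classifiers over the same match probability p.
    Assumed spacing: cutoffs t, t + a, ..., t + (n - 1) * a."""
    votes = [p >= t + i * a for i in range(n)]
    if mode == "majority":
        return sum(votes) > n / 2   # more than half of the classifiers agree
    if mode == "union":
        return any(votes)           # at least one classifier says "match"
    raise ValueError(f"unknown mode: {mode}")

# Task 1 ensemble parameters from the results table: t=0.830, a=0.02, N=10
print(ensemble_decision(0.95, t=0.830, a=0.02, n=10, mode="majority"))  # -> True
```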
Results Task 1
Method              t       a      N    p      r      F1
Duke                0.890   -      -    95.45  80.44  87.31
Duke                0.895   -      -    96.29  78.43  86.44
Ensemble majority   0.830   0.02   10   96.46  77.02  85.65
Ensemble union      0.830   0.02   10   96.32  79.23  86.95
FEIII avg           -       -      -    80.49  86.90  79.78
Results Task 2
Method              t       a      N    p      r      F1
Duke                0.870   -      -    86.67  56.52  68.42
Duke                0.865   -      -    82.21  58.26  68.19
Ensemble majority   0.865   0.01   10   86.18  56.96  68.59
Ensemble union      0.865   0.01   10   84.81  58.26  69.07
FEIII avg           -       -      -    63.86  71.80  62.36
Conclusions
On both tasks, we obtain an F1 score above the average of the challenge participants.
The Duke baseline yields the best results for Task 1.
The ensemble union yields the best results for Task 2.
On both tasks, this is due to higher precision and lower recall than the average.
Error analysis: in both tasks, the Exact Comparator on the ZIPCODE property is too strict and significantly reduces the recall of the algorithm.
Further development: remove the zip code property.
Enrico Palumbo, ISMB, Turin, [email protected]
https://github.com/enricopal/sfem
http://www.slideshare.net/EnricoPalumbo2
Thank you!