An ensemble approach to financial entity matching

IE @ FEIII Challenge 2016, DSMM workshop 2016

Enrico Palumbo, ISMB, Italy

Giuseppe Rizzo, ISMB, Italy

Raphaël Troncy, EURECOM, France

Introduction

The FEIII Challenge requires finding matching financial entities between:

Task 1: the Federal Financial Institutions Examination Council (FFIEC) dataset and the Legal Entity Identifiers (LEI) dataset

Task 2: the FFIEC dataset and the Securities and Exchange Commission (SEC) dataset

Dataset   Num. of entities   Fields   Format
FFIEC     6652               15       .csv
LEI       53958              39       .csv
SEC       129312             24       .csv

Baseline

Duke (https://github.com/larsga/Duke) implements Naive Bayes classification

a. Select a number of fields for the comparison

b. Index records with a Lucene database

c. Cleaners to process strings before comparison

d. Comparators define the metrics to compare strings

e. Field-wise similarities s are turned into probabilities p = P(Match | s)

f. Probabilities are aggregated into a final match probability, which is turned into a decision by comparing it with a threshold
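Steps e and f can be illustrated with a minimal sketch. It assumes a uniform prior and conditionally independent fields, in which case the naive Bayes aggregate of two probabilities is p1*p2 / (p1*p2 + (1 - p1)*(1 - p2)). The linear similarity-to-probability map and its `low`/`high` bounds are hypothetical placeholders, not Duke's actual mapping, and the function names are invented for illustration.

```python
from functools import reduce


def similarity_to_probability(sim, low=0.3, high=0.9):
    """Hypothetical linear map from a similarity in [0, 1] to P(Match | s) in [low, high]."""
    return low + (high - low) * sim


def combine(p1, p2):
    """Naive Bayes combination of two match probabilities (uniform prior assumed).

    Inputs are expected strictly between 0 and 1, otherwise the denominator can vanish.
    """
    return (p1 * p2) / (p1 * p2 + (1.0 - p1) * (1.0 - p2))


def match_probability(field_probabilities):
    """Aggregate field-wise probabilities; 0.5 is the neutral starting point."""
    return reduce(combine, field_probabilities, 0.5)


def is_match(field_probabilities, threshold):
    """Binary decision: match when the aggregate probability exceeds the threshold."""
    return match_probability(field_probabilities) > threshold
```

In this sketch a field probability of 0.5 leaves the aggregate unchanged, values above 0.5 push it towards a match, and values below 0.5 push it away.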

Development set

Semi-automatic process:

● Start from the small sample data released from FEIII

● Configure Duke to have a high recall

● Run it in interactive mode

● Annotate matches, including true and false positives

● Create a dev set as Duke test file: +,id1,id2,1.0
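The last step can be sketched as a small serializer that turns annotated pairs into lines of the form shown above. The slide only shows the "+" line; marking rejected pairs (annotated false positives) with "-" is an assumption, and the identifiers in the usage note are made up.

```python
import csv
import io


def dev_set_lines(annotations):
    """Serialize (id1, id2, is_true_match) annotations as '+,id1,id2,1.0' lines.

    Using '-' for annotated false positives is an assumption, not taken from the slides.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for id1, id2, is_true_match in annotations:
        writer.writerow(["+" if is_true_match else "-", id1, id2, "1.0"])
    return buffer.getvalue()
```

For example, `dev_set_lines([("ffiec-1", "lei-9", True)])` produces a single line `+,ffiec-1,lei-9,1.0`.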

Properties and cleaners

FFIEC          SEC              LEI                    Property   Cleaner
Name cleaned   CONFORMED-NAME   Name cleaned           NAME       LowCase + FinancialInstitutionNameCleaner
Address        B-STREET         Address line cleaned   ADDRESS    LowCase
City           B-CITY           Address city           CITY       LowCase
State          B-STPR           Address Region 2       STATE      LowCase
Zipcode        B-POSTAL         Address postal code    ZIPCODE    DigitsOnly
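A minimal sketch of how the two simplest cleaners in the table could be applied per property. The functions are simplified stand-ins written for illustration, not Duke's own cleaner classes, and the FinancialInstitutionNameCleaner is left out because the slides do not describe its rules.

```python
import re


def low_case(value):
    """Simplified stand-in for the LowCase cleaner: lowercase the string."""
    return value.lower()


def digits_only(value):
    """Simplified stand-in for the DigitsOnly cleaner: drop every non-digit character."""
    return re.sub(r"\D", "", value)


# Property -> cleaner, following the table above (NAME is only lowercased here).
CLEANERS = {
    "NAME": low_case,
    "ADDRESS": low_case,
    "CITY": low_case,
    "STATE": low_case,
    "ZIPCODE": digits_only,
}


def clean_record(record):
    """Apply the cleaner registered for each property; unknown properties pass through."""
    return {prop: CLEANERS.get(prop, lambda v: v)(value) for prop, value in record.items()}
```

Usage: `clean_record({"CITY": "Gettysburg", "ZIPCODE": "17325-0000"})` lowercases the city and keeps only the digits of the zip code.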

Comparators

Property   Comparator
NAME       Semantic Financial Institution Comparator
ADDRESS    Jaro-Winkler
CITY       Jaro-Winkler
STATE      Exact Comparator
ZIPCODE    Exact Comparator

Semantic Financial Institution Comparator: aimed at reducing false positives such as “acnb bank” vs. “acnb corp”, and thus at increasing precision.

It allows a match only if certain keywords, such as “corp” or “bancorp”, are present in both names.

If none of the keywords appears in either name, it measures the Jaro-Winkler similarity.
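The comparator logic can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the keyword set contains only the two examples named on the slide, a keyword mismatch is scored as 0.0, and names that share the same keywords are assumed to fall back to Jaro-Winkler as well. The Jaro-Winkler function is a plain textbook implementation.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, scaling=0.1):
    """Jaro similarity boosted by the length of the common prefix (up to 4 characters)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1.0 - base)


KEYWORDS = frozenset({"corp", "bancorp"})  # only the examples named on the slide


def semantic_name_similarity(name1, name2, keywords=KEYWORDS):
    """Return 0.0 when the keyword sets differ, otherwise the Jaro-Winkler similarity."""
    n1, n2 = name1.lower(), name2.lower()
    if set(n1.split()) & keywords != set(n2.split()) & keywords:
        return 0.0
    return jaro_winkler(n1, n2)
```

With this sketch, “acnb bank” vs. “acnb corp” scores 0.0 because only one of the two names contains a keyword, whereas plain Jaro-Winkler would rate the pair as highly similar.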

Threshold

Crucial parameter

Manually set to optimize performance on dev set

Task 1: 0.890, 0.895

Task 2: 0.870, 0.865

Problem: trade-off between precision and recall.

High threshold -> low recall, high precision

Low threshold -> high recall, low precision

Hard to find optimal value -> ensemble
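The trade-off can be made concrete with a small sketch that computes precision, recall and F1 at different thresholds. The scored pairs and gold matches in the usage note are invented toy data, not challenge data, and the function name is hypothetical.

```python
def precision_recall_f1(scored_pairs, gold, threshold):
    """Evaluate the decision 'score > threshold' against a set of gold matching pairs."""
    predicted = {pair for pair, score in scored_pairs.items() if score > threshold}
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): pair -> match probability, plus the true matches.
toy_scores = {("a", "x"): 0.95, ("b", "y"): 0.88, ("c", "z"): 0.86, ("d", "w"): 0.80}
toy_gold = {("a", "x"), ("b", "y"), ("d", "w")}
```

On this toy data, a threshold of 0.90 keeps only the correct pair ("a", "x"), so precision is perfect but two of the three gold matches are missed. A threshold of 0.75 accepts all four pairs, which recovers every gold match at the cost of one false positive.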

Ensemble approach

1. Blocking: limit the comparison to relevant candidates and avoid the naive N*M pairwise comparison, using an inverted index (Lucene database).

2. Duke: aggregates field-wise similarities into a final match probability, which is turned into a binary decision through a threshold.

3. Ensemble: combine the decisions of a collection of classifiers corresponding to different decision thresholds (parameters: (N, a)).
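Step 3 can be sketched as a vote over several thresholds. The slides do not define N and a beyond the values in the results tables, so this sketch assumes that N is the number of classifiers and that a is the half-width of an interval around a central threshold t from which the N thresholds are taken at even spacing. The "majority" and "union" rules correspond to the two ensemble variants in the results; everything else is hypothetical.

```python
def ensemble_thresholds(t, a, n):
    """n evenly spaced thresholds in [t - a, t + a] (the spacing scheme is an assumption)."""
    if n == 1:
        return [t]
    step = 2 * a / (n - 1)
    return [t - a + i * step for i in range(n)]


def ensemble_decision(match_probability, thresholds, rule="majority"):
    """Combine the binary decisions of one thresholded classifier per threshold."""
    votes = sum(match_probability > threshold for threshold in thresholds)
    if rule == "union":
        return votes >= 1  # a match if any classifier accepts the pair
    if rule == "majority":
        return votes > len(thresholds) / 2  # a match if more than half accept it
    raise ValueError(f"unknown rule: {rule}")
```

Usage, with the Task 1 parameters from the results table: `ensemble_decision(0.82, ensemble_thresholds(0.830, 0.02, 10), rule="union")` accepts the pair because the lowest threshold is below 0.82, while the majority rule rejects it.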

Results Task 1

Method              t       a      N    p (%)   r (%)   F1 (%)
Duke                0.890   -      -    95.45   80.44   87.31
Duke                0.895   -      -    96.29   78.43   86.44
Ensemble majority   0.830   0.02   10   96.46   77.02   85.65
Ensemble union      0.830   0.02   10   96.32   79.23   86.95
FEIII avg           -       -      -    80.49   86.90   79.78

Results Task 2

Method              t       a      N    p (%)   r (%)   F1 (%)
Duke                0.870   -      -    86.67   56.52   68.42
Duke                0.865   -      -    82.21   58.26   68.19
Ensemble majority   0.865   0.01   10   86.18   56.96   68.59
Ensemble union      0.865   0.01   10   84.81   58.26   69.07
FEIII avg           -       -      -    63.86   71.80   62.36

Conclusions

On both tasks, we obtain an F1 score above the average of the participants in the challenge

The Duke baseline yields the best results for Task 1

The ensemble union yields the best results for Task 2

On both tasks, this is due to higher precision and lower recall than the average

Error analysis: in both tasks, the exact comparator on the “zip code” property is too strict and significantly reduces the recall of the algorithm

Further development:

Removed the zip code property

Enrico Palumbo, ISMB, Turin, [email protected]

https://github.com/enricopal/sfem

http://www.slideshare.net/EnricoPalumbo2

Thank you!
