
Fuzzy Matching in Fraud Analytics

Grant Brodie, President, Arbutus Software


Outline

What Is Fuzzy?

Causes

Effective Implementation

Application to Specific Products

Demonstration

Q&A


Why Is Fuzzy Important?

Big data

Too many transactions

User-entered data (web sites)

E-Commerce

Less manual oversight


What Is Fuzzy?

Subset of duplicates testing

Find specific keywords in text (FCPA, PCard)

Close, but not the same

Two reasonable definitions

Proximity

Looks similar


Proximity

Sorts close together

Characters

“Albert” vs. “Albertson”

Numbers

123,456.78 vs. 123,792.16

Dates

Jan 19, 2014 vs. Jan 20, 2014


Looks Similar

Characters

Microsoft vs. Wicrosoft

Numbers

127,894.63 vs. 12,894.63

Dates

Jan 13, 2014 vs. Jan 31, 2014


Traditional Approach to “Close”

Pronunciation based

Soundex

NYSIIS

Designed for names

Many false positives

Not useful for numbers or dates
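The false-positive problem is easy to see with a minimal sketch of classic American Soundex. This is an illustration in Python, not the implementation used by any audit tool; the function name `soundex` and the helper table are assumptions made for the example.

```python
# Letter groups used by classic American Soundex.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Return the 4-character Soundex code for a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue              # H and W do not separate equal codes
        code = CODES.get(ch, "")  # vowels map to "" and reset the run
        if code and code != prev:
            digits.append(code)
        prev = code
    return (first + "".join(digits) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # two different names, one code
```

Because the code keeps only the first letter and three consonant classes, unrelated names collapse to the same value, and digits or dates carry no information at all.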


Fuzzy Today

Based on physical string matching

Levenshtein (ACL)

Damerau-Levenshtein (Arbutus)

N-Gram

Jaro-Winkler

And many more…

Differences expressed as a “distance” or percentage


Quick Lesson: Damerau-Levenshtein

Minimum number of changes to turn one string into another

Insert, delete, replace, transpose

‘123 Main Street’ vs. ‘123 Main St’ = 4

34567 vs. 34576 = 1 (Levenshtein: 2)

‘Rob’ vs.‘Robert’ = 3

‘Gary’ vs.‘Mary’ = 1

‘Gary’ vs.‘gary’ = 1
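The distances above can be reproduced with a short sketch. It uses the "optimal string alignment" form of Damerau-Levenshtein, a common restricted variant; whether any particular product uses this exact variant is an assumption. The function name `edit_distance` and the `transpose` switch are made up for the example.

```python
def edit_distance(a, b, transpose=True):
    """Edit distance counting insert, delete, replace and, optionally,
    transposition of two adjacent characters (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete everything in a
    for j in range(cols):
        d[0][j] = j                      # insert everything in b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # replace or match
            if (transpose and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transpose
    return d[-1][-1]

print(edit_distance("123 Main Street", "123 Main St"))   # 4
print(edit_distance("34567", "34576"))                   # 1
print(edit_distance("34567", "34576", transpose=False))  # 2 (plain Levenshtein)
```

Turning `transpose` off gives plain Levenshtein, which is why the swapped digits cost 2 instead of 1.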


Problems with String Matching

Very literal

Doesn’t apply any context

“John Smith” vs. “John  Smith” (extra blank) (1)

“Smith John” vs. “Smith, John” (1)

“John Smith” vs. “john smith” (2)

México vs. Mexico (1)

“John Smith” vs. “john smith” scores the same as “John Smith” vs. “John Hmitz” (2)


What Do You Use?

Whatever your tool offers

Almost impossible to implement manually

VERY compute intensive


Causes

Accidental errors

Carelessness/mistyping

Transpositions

Blurry source

Punctuation

Extra blanks

1 vs. I, 0 vs. O (particularly with OCR)


Errors vs. Fraud

All of the causes above are most likely “errors”

Fraud uses intentional errors to mask activity

Obscure duplicates

Obscure relationships

Trick through similarity

Disparate systems make comparison even harder


Practical Issues

Generally hard to “target” fuzzy tests

Forced to use broad tests

Most findings will be errors

Even so, the finding is still valuable

Need a process to address errors found


“Our System Catches Duplicates”

Exact matches only

Strict application (i.e. company, vendor, invoice)

May only warn

Not all duplicates are payments

Most only test document numbers


Types of Duplicates

Names

Personal

Corporate

Addresses

Document numbers (e.g., invoice)

Contact information

Phone numbers

Emails


Issues

Very compute intensive (wait times)

Quadratic relationship (every record is compared with every other)

1000x data = 1,000,000x more work

False positives

Ease of use
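The "1000x data = 1,000,000x more work" figure follows from pairwise comparison: n records give n(n-1)/2 pairs, so the work grows with the square of the volume. A small sketch (the name `pair_count` is made up for the example):

```python
def pair_count(n):
    """Number of record pairs a brute-force fuzzy comparison must test."""
    return n * (n - 1) // 2

small = pair_count(1_000)       # 499,500 pairs
large = pair_count(1_000_000)   # 499,999,500,000 pairs
print(small, large, large // small)
```

The ratio comes out at roughly one million, which is why wait times climb so quickly and why restricting comparisons to records that share a context pays off.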


False Positives

Easily the most challenging aspect

Any time spent on a false positive is wasted

Can easily outnumber the true positives by 10, 100, or 1,000 to 1

If too many, can remove any cost effectiveness

How does this happen?

Only one way to get an exact match

Virtually unlimited ways to get close


False Positive Examples

Matching to “12345” with a single difference:

Missing (1245): 5, Transposition (12435): 4

Incorrect (12745): min 45 (175 if alpha, 1,000+ if any char)

Extra (123345): min 60 (200+ if alpha, 1,000+ if any char)

Hundreds/thousands of ways that differ by just 1

Not just errors, all close values

Vastly more with a distance of 2

A bad actor relies on being the needle in a haystack
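The counts for “12345” can be checked by brute force. The sketch below enumerates every distinct string one edit away, using a digits-only alphabet; the function name and return layout are assumptions for the example. The distinct insertion count comes out slightly below 6 positions × 10 digits = 60, because inserting a digit next to an identical digit yields the same string from either side (for example, 123345).

```python
import string

def single_edit_variants(s, alphabet):
    """Return sets of distinct strings exactly one edit away from s."""
    deletes = {s[:i] + s[i + 1:] for i in range(len(s))}
    transposes = {s[:i] + s[i + 1] + s[i] + s[i + 2:]
                  for i in range(len(s) - 1) if s[i] != s[i + 1]}
    replaces = {s[:i] + c + s[i + 1:]
                for i in range(len(s)) for c in alphabet if c != s[i]}
    inserts = {s[:i] + c + s[i:]
               for i in range(len(s) + 1) for c in alphabet}
    return deletes, transposes, replaces, inserts

counts = [len(v) for v in single_edit_variants("12345", string.digits)]
print(counts)  # [5, 4, 45, 55]
```

Passing `string.digits + string.ascii_uppercase` as the alphabet raises the replacement count to 175 and the insertion count past 200, in line with the figures above.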


How to Address the Issues

Data preparation

Utilize “context”

Use “tight” specifications

Choose software that meets needs

Rank your results


Choose Your Software

Has the capabilities you need

Can process your data volumes

Easy to implement

Easy to automate

ACL, Arbutus, IDEA, fraud-specific, non-audit tools


Data Preparation

Remove immaterial differences first (i.e., normalization)

Text manipulation

Upper case

Punctuation

Extra blanks

Foreign characters (México vs. Mexico, Québec vs. Quebec)
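The four text-manipulation steps can be sketched in a few lines of Python. This is a generic illustration, not the ACL or Arbutus implementation; the function name `normalize_text` is made up. Accents are folded before punctuation is stripped so that accented letters are converted instead of being discarded by the character filter.

```python
import re
import unicodedata

def normalize_text(value):
    """Upper-case, fold accents, drop punctuation, collapse blanks."""
    decomposed = unicodedata.normalize("NFKD", value)
    no_accents = "".join(c for c in decomposed
                         if not unicodedata.combining(c))
    upper = no_accents.upper()
    cleaned = re.sub(r"[^0-9A-Z ]", " ", upper)   # punctuation -> blank
    return " ".join(cleaned.split())              # collapse extra blanks

print(normalize_text("México"))            # MEXICO
print(normalize_text("  Québec,   QC. "))  # QUEBEC QC
```

Replacing punctuation with a blank (instead of deleting it) is a design choice in this sketch: it keeps “200-1234” as two words.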


Data Preparation (Cont.)

(Remove immaterial differences first, normalization)

Eliminate “noise” words

Different by type of data

Address: Suite, Unit

Corporate name: Company, Co, Inc

Personal name: Mr, Ms, Dr, Prof


Data Preparation (Cont.)

(Remove immaterial differences first, normalization)

Common misspellings/typos

Common vocabulary (chair vs. silla)

Different by data type

Avenue: Av, Ave, Aven, Avenu

First vs. 1st…

West vs. W…

Richard, Rick, Dick, Ricky, Rich


Data Preparation (Cont.)

(Remove immaterial differences first, normalization)

Word order

“123 W Main St.” vs. “123 Main St. W”


Data Preparation: Result

Well-implemented data prep minimizes the need for fuzzy matching

Consider the two addresses:

“#200-1234 Main Street West”

“1234 W MAIN ST, Suite 200”

Levenshtein distance is 20

Applying data prep can make both strings identical

W ST MAIN 200 1234
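A minimal sketch of how the two addresses collapse to the same string. The vocabulary table holds only the entries this example needs (STREET, WEST, SUITE, UNIT), punctuation is turned into blanks, and words are sorted in descending order to match the result shown above; all three choices are assumptions for illustration, not a description of any product's internals.

```python
import re

# Tiny illustrative vocabulary: replacement word, or "" to omit a noise word.
VOCAB = {"STREET": "ST", "WEST": "W", "SUITE": "", "UNIT": ""}

def prep_address(addr):
    """Upper-case, strip punctuation, apply vocabulary, sort the words."""
    words = re.sub(r"[^0-9A-Z ]", " ", addr.upper()).split()
    words = [VOCAB.get(w, w) for w in words]
    words = [w for w in words if w]            # drop omitted noise words
    return " ".join(sorted(words, reverse=True))

a = prep_address("#200-1234 Main Street West")
b = prep_address("1234 W MAIN ST, Suite 200")
print(a)        # W ST MAIN 200 1234
print(a == b)   # True
```

Once both values are identical, an exact duplicates test finds the pair and no fuzzy threshold is needed.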


Text Manipulation: ACL

Create a computed field

Upper case: Upper(field) (FUZZYDUP ignores case, but data prep is simpler)

Punctuation: Include(field, “ 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ”), but…

Extra blanks: (replace 2 with 1) Replace(Replace(field, “  ”, “ ”), “  ”, “ ”)…

Foreign characters: Replace(Replace(field, “É”, “E”), “Á”, “A”)…

Replace(Replace(Replace(Replace(Include(Upper(field), “ 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ”), “  ”, “ ”), “  ”, “ ”), “  ”, “ ”), “É”, “E”)…

In practice, many more replace calls

May break up into multiple fields for clarity


Text Manipulation: Arbutus

Create a computed field

Upper case: Upper(field)

Punctuation: Include(field, “ 0~9A~Z”), but…

Extra blanks: Compact(field)

Foreign characters: Replace(field, “É”, “E”, “Á”, “A”,…)

Replace(Compact(Include(Upper(field), “ 0~9A~Z”)), “É”, “E”…)

May break up into multiple fields for clarity

Needed only for unusual situations (otherwise use the Normalize function)


Eliminate “Noise” Words: ACL

Use “whole words”

Omit(field+“ ”, “INCORPORATED ,INC ,LIMITED ,LTD…”, F), but…

Omit(field, “INC”): CINCH INDUSTRIES becomes CH INDUSTRIES

Problem is, many noise words to eliminate—two solutions:

Long list

Alltrim(Omit(field+“ ”, “INCORPORATED ,INC ,LIMITED ,LTD ,CORPORATION ,CORP ,…”))

Sequential omits of a variable in a group

v_field=Omit(field…

v_field=Omit(v_field…


Common Vocabulary: ACL

Similar to noise words, only Replace instead of Omit

Use “whole words”

Replace(field+“ ”, “ROAD ”, “RD ”)

Otherwise, “BROADWAY” becomes “BRDWAY”

Don’t omit, as Peachtree Lane is not the same as Peachtree Court

Problem is, MANY vocabulary words to potentially normalize

USPS 400 street terms, 500+ male names, 700+ female names

Nested functions (with Replace instead of Omit)

Sequential replaces of a variable in a group
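The “whole words” point is easy to demonstrate outside ACL. The Python sketch below contrasts a plain substring replace with a word-boundary replace; the regex `\b` boundary is this example's stand-in for the trailing-blank trick, and the function name is made up.

```python
import re

def replace_whole_word(text, word, replacement):
    """Replace word only where it stands alone; "" omits it entirely."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return " ".join(re.sub(pattern, replacement, text).split())

print("123 BROADWAY ROAD".replace("ROAD", "RD"))              # 123 BRDWAY RD
print(replace_whole_word("123 BROADWAY ROAD", "ROAD", "RD"))  # 123 BROADWAY RD
print(replace_whole_word("CINCH INDUSTRIES INC", "INC", ""))  # CINCH INDUSTRIES
```

A word boundary checks both sides of the word, so it also protects values that merely end in the search term.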


Word Order: ACL

No practical way to address this


Noise Words and Common Vocabulary: Arbutus

If you choose, all of the ACL syntax works

Instead: Use Normalize() or SortNormalize()

Automatically implements ALL of the data prep described

(Upper case, punctuation, blanks, foreign, noise, vocabulary)

Normalize(address, “addr.txt”)

Norm(“Suite 200-1234 Main Street West”, “addr.txt”) = “200 1234 MAIN ST W”

SortNormalize has the same syntax, but = “W ST MAIN 200 1234”

Normalize can use a separate vocabulary file (addr.txt)

Replaces or omits any word, on a “whole word” basis

User configurable and selectable, by data type


Noise Words and Common Vocabulary: Arbutus

Substitution file (addr.txt, for example)

FIRST 1ST

SEVENTH 7TH

AV AVE

AVENU AVE

AVENUE AVE

AVN AVE

PARKWAY PKWY

PARKWY PKWY

PKWAY PKWY

PKY PKWY

SUITE

UNIT
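A file laid out like this is simple to consume in any language. The Python sketch below assumes the layout implied by the listing: a two-word line means "replace the first word with the second", and a one-word line means "omit this word". That reading, the sample text, and the function names are assumptions for illustration, not a specification of the product's file format.

```python
SAMPLE = """FIRST 1ST
AVENUE AVE
AV AVE
PARKWAY PKWY
SUITE
UNIT
"""

def load_substitutions(text):
    """Map each word to its replacement ("" means omit the word)."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if parts:
            table[parts[0]] = parts[1] if len(parts) > 1 else ""
    return table

def apply_substitutions(value, table):
    """Apply the table on a whole-word basis to an upper-cased value."""
    words = (table.get(w, w) for w in value.upper().split())
    return " ".join(w for w in words if w)

table = load_substitutions(SAMPLE)
print(apply_substitutions("Suite 200 1234 First Avenue", table))
# 200 1234 1ST AVE
```

Keeping the vocabulary in a plain text file means it can differ by data type (addresses, corporate names, personal names) without touching the script.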


False Positive Reduction: Utilize Context

Data elements always have a “context”

Names or addresses: location (e.g., city, state, ZIP, country)

Documents: vendor, employee, etc.

Reference the similarities to minimize the ambiguity

Same state, city, similar address

“123 Main St.”, Springfield, IL/MA

Same vendor, date, amount, similar invoice number


Utilize Context: Application

ACL FUZZYDUP: Only supports one key field

Concatenate fields into a single expression/computed field

State+City+Address

Other data types require conversion: vendor+date(dt)+str(amount, 16)+invno

Arbutus DUPLICATES: Supports multiple key fields

Specify each key separately

Last key can be fuzzy
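The idea of exact context keys plus one fuzzy key can be sketched generically: group records on the exact keys, then fuzzy-compare only within each group. The Python below is an illustration with made-up field names and sample records; it uses plain Levenshtein and is not how either product is implemented.

```python
from collections import defaultdict
from itertools import combinations

def levenshtein(a, b):
    """Plain Levenshtein distance, keeping only two rows of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def fuzzy_dups_in_context(records, exact_keys, fuzzy_key, max_dist=1):
    """Compare the fuzzy key only among records sharing all exact keys."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in exact_keys)].append(rec)
    hits = []
    for group in groups.values():
        for r1, r2 in combinations(group, 2):
            dist = levenshtein(r1[fuzzy_key], r2[fuzzy_key])
            if dist <= max_dist:
                hits.append((r1, r2, dist))
    return hits

records = [  # hypothetical sample data
    {"state": "IL", "city": "SPRINGFIELD", "address": "123 MAIN ST"},
    {"state": "IL", "city": "SPRINGFIELD", "address": "128 MAIN ST"},
    {"state": "MA", "city": "SPRINGFIELD", "address": "123 MAIN ST"},
]
hits = fuzzy_dups_in_context(records, ["state", "city"], "address")
print(len(hits), hits[0][2])
```

Only the two Illinois records are paired; the Massachusetts address never enters the comparison, which removes both a false positive and a share of the compute cost.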


False Positive Reduction: Use “Tight” Specs

Levenshtein distance 1, or 2 max

Looser specifications = more false positives

Avoid Soundex and similar approaches

There is no substitute for good data prep


False Positives: Rank Your Results

Order based on exposure

Size of item

Degree of inherent risk (cash)

Order based on degree of similarity

Distance (1 vs. 2)

Number of matching “same” elements
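One possible way to combine the two orderings, sketched in Python: closest matches first, and within the same distance, largest exposure first. The field names (`distance`, `amount`) and the sample findings are hypothetical.

```python
def rank_findings(findings):
    """Sort by similarity (smaller distance first), then by exposure."""
    return sorted(findings, key=lambda f: (f["distance"], -f["amount"]))

findings = [  # hypothetical fuzzy-duplicate findings
    {"invoice": "A-1001", "distance": 2, "amount": 90_000.00},
    {"invoice": "B-2002", "distance": 1, "amount": 1_250.00},
    {"invoice": "C-3003", "distance": 1, "amount": 48_000.00},
]
for f in rank_findings(findings):
    print(f["invoice"], f["distance"], f["amount"])
```

Reviewers then work from the top of the list and can stop when the remaining items no longer justify the effort.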


Execution: ACL

Separate menu item

Analyze/fuzzy duplicates

Choose your (concatenated) key

Choose diff. threshold (1 or 2)

Select other fields to use in investigation

Select the output table name

Be patient


Execution: Arbutus

Included with duplicates testing

Analyze/duplicates

Choose your key fields (any type)

Choose either near or similar processing

Choose max. difference (0, 1, or 2)

Select other fields to use in investigation

Select output location and name


“Similar” Processing: Arbutus

Specifically designed to work with document IDs

Uses Damerau-Levenshtein, but automatically pre-processes the values

Removes all blanks and punctuation, upper cases

Matches similar characters: O=0, I=1, 5=S, etc.

Works on all data types

127,894.63 vs. 12,894.63 (diff. 1)

I-12345 vs. 112345 (diff. 0)

Particularly useful with OCR
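The pre-processing described here can be approximated with a short sketch: upper-case the value, keep only letters and digits, and map look-alike characters to one canonical form. Mapping letters to digits (O to 0, I to 1, S to 5) is a choice made for this example, as are the names `LOOKALIKES` and `canonical_id`; the product's own rules may differ.

```python
# Look-alike pairs named above; each letter is mapped to its digit twin.
LOOKALIKES = str.maketrans({"O": "0", "I": "1", "S": "5"})

def canonical_id(value):
    """Upper-case, drop blanks and punctuation, unify look-alike characters."""
    kept = "".join(c for c in str(value).upper() if c.isalnum())
    return kept.translate(LOOKALIKES)

print(canonical_id("I-12345"), canonical_id("112345"))         # same value
print(canonical_id("127,894.63"), canonical_id("12,894.63"))   # one digit apart
```

After this step an exact comparison catches the difference-0 cases, and an edit-distance test on the canonical forms catches the rest.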


“Similar” Processing: ACL

Not explicitly supported

Pre-process the data to create a computed field

Upper case

Include only numbers and letters (no blanks, punctuation)

Convert numbers and dates to strings (Date or String functions)

Use the FUZZYDUP command as before


Manual Duplicates Testing: ACL

Data prep is still important

LevDist(string1, string2 <, case sensitive>)

Case sensitive by default

Filter: LevDist(name1, name2, F) < 3

IsFuzzyDup(string1, string2, distance <, diff%> )

Automatically case insensitive

Filter: IsFuzzyDup(name1, name2, 2)

Can also be used as a join test


Manual Duplicates Testing: Arbutus

All case sensitive, by default (assumes normalized inputs)

Difference(string1, string2 <, case sensitive>)

Filter: difference(name1, name2, F) < 3

Near(field1, field2, difference)

Filter: near(name1, name2, 2)

Applies to all data types

Char: Damerau-Levenshtein; numbers and dates: proximity (4799 vs 4803)

Similar(field1, field2, difference)

Applies to all data types, always uses Damerau-Levenshtein

Char: prepared data; numbers and dates: 123,456 vs. 12,456


Find Specific Keywords in Text: ACL

Very common for purchase card reviews, FCPA

Use the Find function:

Filter: IF Find(“Exotic”, desc)

Multiple words: IF Find(“Exotic”, desc) OR Find(“IPad”, desc)…

Not case sensitive, not whole word

Create a Logical computed field (say “Exception”):

T IF Find(“Exotic”, desc)

T IF Find(“IPad”, desc)

F

Filter: IF Exception


Find Specific Keywords in Text: Arbutus

Find function works the same as ACL

Use the ListFind function instead:

Filter: IF ListFind(“exceptions.txt”, desc)

Simple text file

Easily maintained in Notepad

Unlimited entries

Supports an external reference file or an internal array

Like Find function, not case sensitive, not whole word
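The same kind of list-driven keyword test is easy to sketch in Python. Like the functions described above, it is not case sensitive and not whole word. The function name and the sample keywords are illustrative; in practice the list could be read from a maintained text file.

```python
def find_keywords(description, keywords):
    """Return the keywords that occur anywhere in the description."""
    text = description.lower()
    return [k for k in keywords if k.lower() in text]

keywords = ["Exotic", "iPad"]   # hypothetical exception list
print(find_keywords("EXOTIC CAR RENTAL", keywords))   # ['Exotic']
print(find_keywords("2 ipads for office", keywords))  # ['iPad']
print(find_keywords("Office supplies", keywords))     # []
```

Because the match is not whole word, short keywords will also hit inside longer words, so the list is best kept specific.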


Continuous Monitoring

Mostly errors

“Test” vs. “control”

Ownership of the process

May relate to frequency

Detective vs. Preventative

This entire presentation covers detective testing

Opportunity to run against documents before committing

Preventative almost certainly a “control”


Fuzzy Testing in action

Demonstration