Clarity Solution Group presentation at the Chief Data Officer Insurance 2016

Driving Improved Customer Experience via Entity Resolution and Machine Learning: Resolving Entities with Machine Learning

Transcript of Clarity Solution Group presentation at the Chief Data Officer Insurance 2016

Page 1: Clarity Solution Group presentation at the Chief Data Officer Insurance 2016

Driving Improved Customer Experience via Entity Resolution and Machine Learning: Resolving Entities with Machine Learning

Page 2: Clarity Solution Group presentation at the Chief Data Officer Insurance 2016


Agenda

Proprietary and Confidential - ©2016 Clarity Solution Group, LLC

Why? What Value?

Business Context

ML Examples

Cleanse

Entity Resolution

HDFS

Many moving pieces

Collaboration with business user is key

E.g.:
- Consider only features of interest
- Move to upper case
- Remove punctuation
- Standardization (e.g., truncate US zip codes to 5 characters)
- ….


Entity Resolution

[Diagram panels: Standardized Data; creating all pairs and computing each distance (Dist. = 0 same, Dist. = 1 not the same); Training the Machine: visually ID a subset as match or non-match; Logistic Regression: determining the best separating line (Trained Threshold); Hierarchical Clustering: grouping entries based on their distance]

Key Considerations

Business Problem

Page 3: Clarity Solution Group presentation at the Chief Data Officer Insurance 2016


The Business Problem

Order from chaos: a common definition of Partners and Partners' clients drives improved growth, service, and innovation

Client Business Partners

Business Partner Clients

Financial services firm with ~1,000,000 direct partner relationships and significant duplication

Classic entity resolution issue

Problem multiplies exponentially with duplication within Partner’s Clients

Page 4: Clarity Solution Group presentation at the Chief Data Officer Insurance 2016

The Business Opportunity

Resolving duplication “noise” has many positive impacts on customer experience and business measures

Client Business Partners

Business Partner Clients

Improved Service

Decreased repetitive, labor-intensive activity

Accelerated Client-onboarding

Increased revenue

Improved network visibility driving cross-selling

Improved bottom-line

Page 5: Clarity Solution Group presentation at the Chief Data Officer Insurance 2016

Machine Learning – Some Notorious Examples

Page 6: Clarity Solution Group presentation at the Chief Data Officer Insurance 2016

The Overall Process

Defining the end-to-end solution scope is key!

Cleanse

Entity Resolution

Storage

Cleanse, e.g.:
- Remove punctuation
- Standardization (e.g., truncate US zip codes to 5 characters)
- ….

Data Points:
- Integration: ~8 weeks
- Machine learning dev.: ~6 weeks
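The cleansing examples listed in the deck (consider only features of interest, move to upper case, remove punctuation, truncate US zip codes to 5 characters) could be sketched as follows. This is only an illustration under stated assumptions, not the presenters' implementation: the function name `cleanse`, the field names, and the sample record are hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_NO_PUNCT = str.maketrans("", "", string.punctuation)

def cleanse(record, features=("name", "address", "zip")):
    """Standardize one record (a dict of strings) before entity resolution."""
    cleaned = {}
    for field in features:                       # consider only features of interest
        value = record.get(field, "").upper()    # move to upper case
        value = value.translate(_NO_PUNCT)       # remove punctuation
        value = " ".join(value.split())          # collapse repeated whitespace
        if field == "zip":
            value = value[:5]                    # truncate US zip codes to 5 characters
        cleaned[field] = value
    return cleaned

# Hypothetical input record; "phone" is dropped because it is not a feature of interest.
sample = {
    "name": "Spacca Pizzeria, Inc.",
    "address": "123 W. Sunnyside Ave.",
    "zip": "60640-1234",
    "phone": "555-0100",
}
cleaned = cleanse(sample)
# cleaned == {"name": "SPACCA PIZZERIA INC", "address": "123 W SUNNYSIDE AVE", "zip": "60640"}
```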

Page 7: Clarity Solution Group presentation at the Chief Data Officer Insurance 2016

The Overall Process

Defining the end-to-end solution scope is key!

Cleanse

Entity Resolution

Storage

Page 8: Clarity Solution Group presentation at the Chief Data Officer Insurance 2016

Entity Resolution via Machine Learning

Standardized Data (example entries: O, G, P)

Creating all pairs and computing each distance:
- Dist. = 0: same
- Dist. = 1: not the same

Training the Machine: Visually ID a subset as match or non-match

Subset (Dist., match?): 0.9 No; 0 Yes; 0.2 Yes; … No

Logistic Regression: Determining the best separating line (Trained Threshold)

Full Dataset

Hierarchical Clustering: Grouping entries based on their distance

[Figure: pairwise distance matrix with values such as 0.9, 0.2, 0, 0.1, annotated with the symbols ≠, ~, =]
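The "trained threshold" step can be illustrated with a one-feature logistic regression fitted on a hand-labeled subset of pair distances. The sketch below is an assumption-laden stand-in: it uses plain gradient descent rather than whatever fitting routine the presenters used, the names `train_threshold` and `is_match` are hypothetical, and only the labeled values 0.9 (No), 0 (Yes), and 0.2 (Yes) come from the slide; the remaining labeled points are invented for illustration.

```python
import math

def train_threshold(labeled, steps=10000, lr=1.0):
    """Fit p(match) = sigmoid(w * dist + b) by gradient descent on the mean
    log-loss, then return the distance at which p(match) equals 0.5."""
    w, b = 0.0, 0.0
    n = len(labeled)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for dist, label in labeled:
            p = 1.0 / (1.0 + math.exp(-(w * dist + b)))
            grad_w += (p - label) * dist
            grad_b += (p - label)
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return -b / w  # the "best separating line" for a single feature

# Hand-labeled subset: (distance, 1 = match / 0 = non-match).
# Only 0.9, 0.0 and 0.2 appear on the slide; the other points are hypothetical.
labeled_subset = [(0.0, 1), (0.1, 1), (0.2, 1), (0.3, 1),
                  (0.7, 0), (0.8, 0), (0.9, 0), (1.0, 0)]
threshold = train_threshold(labeled_subset)

def is_match(dist):
    """Apply the trained threshold to a pair distance from the full dataset."""
    return dist <= threshold
```

With a threshold in hand, every pair distance in the full dataset can be turned into a Yes/No decision, which is the input the clustering step needs.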

Page 9: Clarity Solution Group presentation at the Chief Data Officer Insurance 2016

Entity Resolution: Behind the Scenes

1.2 million entries!

- Filter out exact matches
- Use common features to parallelize the calculation (in the diagram, entries are split by country: US, IT, CA)
- Label unique entries without clustering (in the diagram, the single CA entry)

[Diagram panels: Standardized Data; Subset (Dist., match?): 0.9 No; 0 Yes; 0.2 Yes; … No; Logistic Regression (Trained Threshold); Hierarchical Clustering]
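To see why these three shortcuts matter at 1.2 million entries, the sketch below counts candidate pairs and then partitions records by a shared feature. It is a rough illustration only: `pair_count` and `plan_blocks` are hypothetical names, the sample records are invented, and blocking on the country column is an assumption based on the US / IT / CA labels in the diagram.

```python
from collections import defaultdict

def pair_count(n):
    """Number of unordered pairs among n entries."""
    return n * (n - 1) // 2

def plan_blocks(records, key_index=2):
    """Split (name, address, country) tuples into work that needs clustering
    and entries that can be labeled directly."""
    distinct = list(dict.fromkeys(records))      # filter out exact matches, keep order
    blocks = defaultdict(list)
    for rec in distinct:
        blocks[rec[key_index]].append(rec)       # common feature used as the block key
    unique = {k: v[0] for k, v in blocks.items() if len(v) == 1}
    to_cluster = {k: v for k, v in blocks.items() if len(v) > 1}
    return unique, to_cluster

all_pairs = pair_count(1_200_000)                # 719,999,400,000 pairs without blocking

# Hypothetical cleansed records.
records = [
    ("A CO", "1 MAIN ST", "US"),
    ("A COMPANY", "1 MAIN STREET", "US"),
    ("A CO", "1 MAIN ST", "US"),                 # exact duplicate of the first record
    ("B SRL", "2 VIA ROMA", "IT"),
    ("B S R L", "2 VIA ROMA", "IT"),
    ("C LTD", "3 KING ST", "CA"),
]
unique, to_cluster = plan_blocks(records)
blocked_pairs = sum(pair_count(len(v)) for v in to_cluster.values())
```

Each entry of `to_cluster` is independent of the others, so the blocks could be handed to separate workers (for example with `concurrent.futures.ProcessPoolExecutor`), while entries in `unique` skip the distance calculation entirely.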

Page 10: Clarity Solution Group presentation at the Chief Data Officer Insurance 2016

An Example

Input

Name                   Address               Country  Group
SPACCANAPOLI PIZZERIA  123 WEST SUNNYSIDE    USA      ????
SPACCA PIZZERIA        123 W SUNNYSIDE AVE   USA      ????
SPACCANAPOLI PIZZERIA  123 WEST SUNNYSIDE    IT       ????
SPACCA PIZZERIA        123 W SUNNYSIDE AVE   IT       ????
PIZZERIA LIBRETTO      221 OSSINGTON AVENUE  CA       ????

Output

Name                   Address               Country  Group
SPACCANAPOLI PIZZERIA  123 WEST SUNNYSIDE    USA      USA_1
SPACCA PIZZERIA        123 W SUNNYSIDE AVE   USA      USA_1
SPACCANAPOLI PIZZERIA  123 WEST SUNNYSIDE    IT       IT_1
SPACCA PIZZERIA        123 W SUNNYSIDE AVE   IT       IT_1
PIZZERIA LIBRETTO      221 OSSINGTON AVENUE  CA       CA_1
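The example's group labels can be reproduced with a small sketch that compares entries only within the same country and merges pairs whose distance falls under a threshold. Several choices here are assumptions rather than details from the deck: the distance is a Jaccard distance on 3-character shingles of name plus address, the clustering is single linkage cut at a fixed threshold (implemented with a simple union-find), the threshold value 0.5 is hypothetical, and `assign_groups` is an illustrative name.

```python
from collections import defaultdict
from itertools import combinations

def shingles(text, k=3):
    """Set of overlapping k-character substrings."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard_distance(a, b):
    sa, sb = shingles(a), shingles(b)
    union = sa | sb
    return 1 - len(sa & sb) / len(union) if union else 0.0

def assign_groups(rows, threshold=0.5):
    """rows: list of (name, address, country). Returns one group label per row."""
    by_country = defaultdict(list)
    for idx, row in enumerate(rows):
        by_country[row[2]].append(idx)           # only compare within a country
    labels = [None] * len(rows)
    for country, members in by_country.items():
        parent = {i: i for i in members}         # union-find forest for this block

        def find(i):
            while parent[i] != i:
                i = parent[i]
            return i

        for i, j in combinations(members, 2):    # all pairs inside the block
            text_i = " ".join(rows[i][:2])
            text_j = " ".join(rows[j][:2])
            if jaccard_distance(text_i, text_j) <= threshold:
                parent[find(j)] = find(i)        # merge the two clusters
        group_ids = {}
        for i in members:
            root = find(i)
            group_ids.setdefault(root, len(group_ids) + 1)
            labels[i] = f"{country}_{group_ids[root]}"
    return labels

rows = [
    ("SPACCANAPOLI PIZZERIA", "123 WEST SUNNYSIDE", "USA"),
    ("SPACCA PIZZERIA", "123 W SUNNYSIDE AVE", "USA"),
    ("SPACCANAPOLI PIZZERIA", "123 WEST SUNNYSIDE", "IT"),
    ("SPACCA PIZZERIA", "123 W SUNNYSIDE AVE", "IT"),
    ("PIZZERIA LIBRETTO", "221 OSSINGTON AVENUE", "CA"),
]
groups = assign_groups(rows)
# groups == ["USA_1", "USA_1", "IT_1", "IT_1", "CA_1"]
```

The two SPACCA entries share 25 of 46 distinct shingles, so their distance (about 0.457) sits just under the assumed threshold; a trained threshold would replace the fixed 0.5 in practice.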

Page 11: Clarity Solution Group presentation at the Chief Data Officer Insurance 2016

Findings and Data Points

Business Value

• Reduction in Effort: ~3 FTEs

• Increased Client Onboarding: ~30%

• Individual / anecdotal evidence of increased cross-selling and loyalty

Technical Measures

• ~500K duplicates identified from ~1.2MM total customer records

• Job parallelism reduced run-time from >> 24 hours to 15 minutes

• Run-time enabled overnight process, with capability to run intra-day if needed

Page 12: Clarity Solution Group presentation at the Chief Data Officer Insurance 2016

Key Considerations

Consideration: Machine Learning is not a stand-alone exercise
Implication: Outline end-to-end process with business application integration points

Consideration: Business Collaboration in “training” process is critical
Implication: Ensure heavy degree of subject matter expert involvement

Consideration: Recognize the importance of technique in the solution
Implication: Leverage a data science process: Problem to hypothesis to technique selection

Consideration: Underlying technology is not “one size fits all”
Implication: Machine Learning / Big Data solutions require customization and corresponding investment in people

Page 13: Clarity Solution Group presentation at the Chief Data Officer Insurance 2016

Key Considerations


Questions?

Page 14: Clarity Solution Group presentation at the Chief Data Officer Insurance 2016


Appendix

Page 15: Clarity Solution Group presentation at the Chief Data Officer Insurance 2016

Entity Resolution: Jaccard Distance on Shingles

Clean Data (example entries: O, G, P)

Jaccard Distance

1) ‘CLARITY SOLUTION GROUP’:
['CLA', 'LAR', 'ARI', 'RIT', 'ITY', 'TY ', 'Y S', ' SO', 'SOL', 'OLU', 'LUT', 'UTI', 'TIO', 'ION', 'ON ', 'N G', ' GR', 'GRO', 'ROU', 'OUP']

2) ‘CLARITY SOL GR’:
['CLA', 'LAR', 'ARI', 'RIT', 'ITY', 'TY ', 'Y S', ' SO', 'SOL', 'OL ', 'L G', ' GR']

$d_J(S_1, S_2) = 1 - \frac{|S_1 \cap S_2|}{|S_1| + |S_2| - |S_1 \cap S_2|}$, ranging from 0 to 1, where $S_1$ and $S_2$ are the shingle sets of strings 1) and 2)

[Diagram panels: pairwise distance matrix (values such as 0.9, 0.2, 0, 0.1; symbols ≠, ~, =); Subset (Dist., match?): 0.9 No; 0 Yes; 0.2 Yes; … No; Logistic Regression (Trained Threshold); Hierarchical Clustering]
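The two shingle lists and the distance between them can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the shingle length of 3 is inferred from the lists on the slide.

```python
def shingles(text, k=3):
    """Overlapping k-character substrings, in order of appearance.
    Assumes the text has at least k characters."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]

def jaccard_distance(a, b, k=3):
    """1 - |S1 & S2| / (|S1| + |S2| - |S1 & S2|) on the two shingle sets."""
    s1, s2 = set(shingles(a, k)), set(shingles(b, k))
    shared = len(s1 & s2)
    return 1 - shared / (len(s1) + len(s2) - shared)

full = shingles("CLARITY SOLUTION GROUP")        # 20 shingles, as on the slide
short = shingles("CLARITY SOL GR")               # 12 shingles, as on the slide
common = sorted(set(full) & set(short))          # 10 shingles appear in both lists
dist = jaccard_distance("CLARITY SOLUTION GROUP", "CLARITY SOL GR")
# 1 - 10 / (20 + 12 - 10) = 12/22, roughly 0.545
```

A distance of roughly 0.545 places this pair in the middle of the 0 to 1 range, which is exactly the zone where a trained threshold, rather than a hand-picked cutoff, decides whether the two names refer to the same entity.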

Page 16: Clarity Solution Group presentation at the Chief Data Officer Insurance 2016

Entity Resolution via Machine Learning

Standardized Data (example entries: P, A, E)

Creating all pairs and computing each distance:
- Dist. = 0: same
- Dist. = 1: not the same

Training the Machine: Visually ID a subset as match or non-match

Subset (Dist., match?): 0.9 No; 0 Yes; 0.2 Yes; … No

Logistic Regression: Determining the best separating line (Trained Threshold)

Full Dataset

Hierarchical Clustering: Grouping entries based on their distance

[Figure: pairwise distance matrix with values such as 0.9, 0.2, 0, 0.1, annotated with the symbols ≠, ~, =]

Page 17: Clarity Solution Group presentation at the Chief Data Officer Insurance 2016

Entity Resolution via Machine Learning

1.2 million entries!

- Filter out exact matches
- Use common features to parallelize the calculation (in the diagram, entries are split by country: US, IT, CA)
- Label unique entries without clustering

[Diagram panels: Standardized Data (example entries: P, A, E); Subset (Dist., match?): 0.9 No; 0 Yes; 0.2 Yes; … No; Logistic Regression (Trained Threshold); Hierarchical Clustering]