Data Mining for Biosecurity Regulation - Meetupfiles.meetup.com/14535342/Data Mining for...

Data Mining for Biosecurity Regulation Andrew Robinson CEBRA University of Melbourne August 10, 2016 Centre of Excellence for Biosecurity Risk Analysis

Data Mining for Biosecurity Regulation

Andrew Robinson

CEBRA, University of Melbourne

August 10, 2016

Centre of Excellence for Biosecurity Risk Analysis

Outline

Biosecurity

CEBRA

Data-Mining Examples

Failures & Lessons Learned

Biosecurity

Biosecurity is Important

- Tree snakes in Guam: 12 native bird species now extinct.
- Annual cost of invasives: $1.4 trillion; 5% of global GDP.

- 2001 FMD in UK: cost 8 billion pounds; 6 M sheep & cattle were slaughtered (2030 tested positive!)
- Modelled impact in Australia: $7 B or $16 B; now $50 B.

Biosecurity is Expensive

Department of Agriculture and Water Resources, 2014–15 Annual Report.

 17 900 000  Air passengers
146 100 000  Mail articles
     18 000  Vessel first-port arrivals
    611 000  Air freight consignments (< $1000)
    450 000  Cargo units referred from Customs (in 2014)

Biosecurity is Difficult

Now, here, you see, it takes all the running you can do, to keep in the same place.

— The Red Queen, Through the Looking Glass.

CEBRA

Centre of Excellence for Biosecurity Risk Analysis

- CEBRA established in the University of Melbourne
- Four year contract, started July 1 2013
- Jointly funded by
  - Department of Agriculture and Water Resources, and
  - New Zealand’s Ministry for Primary Industries.
- CEBRA curates proposal development inside departments.

http://www.cebra.unimelb.edu.au

Data-Mining Examples

- Border data case studies
- Geolocating dirty mail
- Text mining
- Pooling passenger data
- Hunting brokers
- Profiling international vessels
- Performance indicators for compliance monitoring
- Predicting hitch-hikers

ACERA 0806: 2001 IQI — rollback

ULD (External Inspection)

CEBRA provided a spreadsheet tool to the Department.

Table: Predicted 95% risk rate and tentative future sampling rate for 2007 for a risk cutoff of 1%.

Region      Insp.    Cont.   p (%)   f (%)   π        n
Brisbane    37743    58      0.154   0.190   1.86     701
Far North   2957     33      1.116   1.470   100.00   2957
NSW         207764   137     0.066   0.076   0.19     389
SA          17510    59      0.337   0.415   9.31     1630
VIC         91491    24      0.026   0.036   0.43     389
WA          14067    0       0.000   0.014   1.36     191
National    371532   311     0.084   0.092   0.15     552
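The table columns can be cross-checked with simple arithmetic. The sketch below assumes, as a reading of the table rather than something the slides state, that p is the observed contamination rate (Cont. / Insp.) and that π is the sampling rate n / Insp., both in percent. The f column is taken as given, since the slides do not say how it was derived. The function name `recompute` is hypothetical.

```python
# Rows transcribed from the table:
# (region, inspected, contaminated, p %, f %, pi %, n)
ROWS = [
    ("Brisbane", 37743, 58, 0.154, 0.190, 1.86, 701),
    ("Far North", 2957, 33, 1.116, 1.470, 100.00, 2957),
    ("NSW", 207764, 137, 0.066, 0.076, 0.19, 389),
    ("SA", 17510, 59, 0.337, 0.415, 9.31, 1630),
    ("VIC", 91491, 24, 0.026, 0.036, 0.43, 389),
    ("WA", 14067, 0, 0.000, 0.014, 1.36, 191),
    ("National", 371532, 311, 0.084, 0.092, 0.15, 552),
]

RISK_CUTOFF_PCT = 1.0  # the 1% risk cutoff named in the table caption


def recompute(inspected, contaminated, n):
    """Return (p %, pi %) under the assumed column definitions."""
    p_pct = round(100 * contaminated / inspected, 3)
    pi_pct = round(100 * n / inspected, 2)
    return p_pct, pi_pct


for region, insp, cont, p, f, pi, n in ROWS:
    p_calc, pi_calc = recompute(insp, cont, n)
    note = "f above cutoff" if f > RISK_CUTOFF_PCT else ""
    print(f"{region:10s} p={p_calc:.3f} pi={pi_calc:.2f} {note}")
```

Under these assumptions every row reproduces the printed p and π values. Far North is the only region whose f exceeds the 1% cutoff, and it is also the only region with a 100% sampling rate.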

The Benefits

- Monitoring ULDs: 370,000 in 2008; 14,000 in 2014.
- Monitoring reportable documents: 2.7 million in 2008; 16,000 in 2014.
- Sea containers: 2 million in 2008; expanded CAL, huge reduction in non-CAL inspection; 370,000 in 2014.

NB: imperfect inspection data.
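For scale, the relative reductions implied by these figures can be worked out directly. This is a minimal sketch that assumes the quoted numbers are comparable annual inspection counts for 2008 and 2014. The slide figures are rounded, so the results are indicative only. The names `VOLUMES` and `percent_reduction` are hypothetical.

```python
# (count in 2008, count in 2014), as quoted on the slide
VOLUMES = {
    "ULDs": (370_000, 14_000),
    "Reportable documents": (2_700_000, 16_000),
    "Sea containers": (2_000_000, 370_000),
}


def percent_reduction(before, after):
    """Relative drop from `before` to `after`, in percent."""
    return 100 * (before - after) / before


for pathway, (y2008, y2014) in VOLUMES.items():
    print(f"{pathway}: {percent_reduction(y2008, y2014):.1f}% fewer")
```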

CEBRA 1301A1: Spatial Analysis of Intercepted Mail

International mail is monitored by DDU, X-ray, and manual inspection in Gateway Facilities.

- Delivery address is recorded for all articles intercepted with biosecurity risk material (BRM).
- Addresses can be geolocated to ABS census region.

CEBRA used data-mining tools to identify patterns.

- Spatial analysis: spatial patterns in intercepted goods?
- Statistical analysis: any correlation with census-measured characteristics at the ABS statistical unit level 2 or 3?
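The two analysis questions can be illustrated with a toy calculation: count interceptions per census region, convert to a rate, and correlate the rate with a census characteristic. This is only a sketch of the general idea, not the method used in the project. All region codes, populations, and covariate values below are invented for illustration, and the covariate itself is a placeholder.

```python
from collections import Counter
from math import sqrt

# Hypothetical records: one census region code per intercepted article,
# assumed to be already geolocated from the delivery address.
interceptions = ["R1", "R1", "R2", "R3", "R3", "R3", "R4"]

# Hypothetical census table: region -> (population, some census covariate)
census = {
    "R1": (10_000, 0.20),
    "R2": (8_000, 0.10),
    "R3": (12_000, 0.35),
    "R4": (9_000, 0.15),
}


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


counts = Counter(interceptions)
regions = sorted(census)
# Interceptions per 1000 residents in each region
rates = [1000 * counts.get(r, 0) / census[r][0] for r in regions]
covariate = [census[r][1] for r in regions]
r_value = pearson(rates, covariate)
print(dict(zip(regions, rates)), r_value)
```

A rate per resident is used here because raw counts would mostly track population. With real data the denominator would more naturally be mail volume per region, if that were available.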

Greater Melbourne seizures — 2008

Potential Future Directions

- Postcode profiling (SGF mail counter, urban/rural)
- Case studies, e.g.,
  - Locale: Sydney area.
  - Infrastructure: Universities.
  - Interceptions: khat, tea, seeds, finfish.
- Other sources
  - Air cargo
  - Customs analysis.
- Address analysis: postboxes?

CEBRA 1401C/D: SAC Text Mining

SAC: self-assessed clearance, < $1000 declared value for a range of goods. Cf. FID.

Brief: to assess automated prediction of economic tariff codes from free-text goods descriptions in SAC.

In particular, is the desired accuracy of 80% or more feasible?

SAC comprises 1304 tariff codes.

Text Mining: Data & Analysis

Data:

- 3830 goods descriptions with tariff codes assigned by Department staff.
- 278 unique tariff codes.
- Dictionary of tariff codes and their descriptions.
- Highly uneven distribution: 75% of tariff codes have < 10 entries.

Strategy:

- Random forest using the RTextTools package in R
- 5-fold cross-validation
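The evaluation strategy can be sketched in a few lines. The example below shows the 5-fold cross-validation loop only. It does not reproduce the random forest or RTextTools: the classifier is a deliberately simple stand-in that votes by token counts, chosen so the sketch needs nothing beyond the standard library. The goods descriptions and the codes "A", "B", "C" are invented.

```python
import random
from collections import Counter, defaultdict

# Hypothetical (description, tariff code) pairs; real data had 3830 rows.
EXAMPLES = [
    ("cotton t-shirt", "A"), ("mens cotton shirt", "A"),
    ("ladies shirt", "A"), ("cotton dress", "A"),
    ("mobile phone case", "B"), ("phone charger", "B"),
    ("usb phone cable", "B"), ("dried tea leaves", "C"),
    ("green tea", "C"), ("herbal tea bags", "C"),
]


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def train(examples):
    """Stand-in model: per-token counts of the codes seen in training."""
    token_counts, code_counts = defaultdict(Counter), Counter()
    for text, code in examples:
        code_counts[code] += 1
        for tok in text.lower().split():
            token_counts[tok][code] += 1
    return token_counts, code_counts


def predict(model, text):
    """Sum token votes; fall back to the most common training code."""
    token_counts, code_counts = model
    votes = Counter()
    for tok in text.lower().split():
        votes.update(token_counts.get(tok, {}))
    best = votes if votes else code_counts
    return best.most_common(1)[0][0]


def cross_validate(examples, k=5, seed=0):
    """Each example is scored once, by a model that never saw it."""
    correct = 0
    for held_out in k_fold_indices(len(examples), k, seed):
        held = set(held_out)
        model = train([ex for i, ex in enumerate(examples) if i not in held])
        correct += sum(
            predict(model, examples[i][0]) == examples[i][1] for i in held_out
        )
    return correct / len(examples)


print(f"cross-validated accuracy: {cross_validate(EXAMPLES):.2f}")
```

With a class distribution as uneven as the one described above, many codes will have too few examples to appear in every training fold, which is one reason overall accuracy can fall well short of per-code accuracy for the common codes.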

Text Mining: Results

Overall accuracy 53.0% (95% CI: 51.4%, 54.5%).

Specific tariffs: e.g. XXXX 88.9% (95% CI: 83.7%, 92.9%).

Conclusion: could be ok for triage.
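The width of the overall interval is about what one would expect from the sample size. The sketch below uses the normal-approximation (Wald) interval for a proportion, and assumes that all 3830 descriptions were each scored once as held-out cases. The slides do not say which interval method was used, so this is a plausibility check rather than a reconstruction.

```python
from math import sqrt


def wald_interval(p_hat, n, z=1.959964):
    """Normal-approximation 95% interval for a proportion."""
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


lower, upper = wald_interval(0.530, 3830)
print(f"({100 * lower:.1f}%, {100 * upper:.1f}%)")
```

This gives roughly (51.4%, 54.6%), within about 0.1 percentage point of the reported interval. The small difference could come from rounding of the point estimate or from a different interval method.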

Failures & Lessons Learned

Failures

- Tried too hard.
- Too many ideas, not enough structure.
- Ideas began outside, not inside.
- Great ideas, poor fit.

Key Lessons Learned 1/2

Bromides.

1. Operational utility is not the same as statistical significance.
   - Sensitivity and sometimes specificity trump p-values.
2. The outcome of data-mining might (should?) not be a statistical model.
   - Statistical models are half-way there.
3. Start small: solve case studies.
   - Individually: non-threatening low-bar concrete outcomes.
   - Swarm.
4. Analyze the data that you have now.
   - Delay doesn’t compensate for shortcomings; action does.
5. Failures, done right, aren’t failures.
   - Critique thoroughly, including when to try again.
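Lesson 1 refers to sensitivity and specificity, which are computed from a profiling rule's confusion matrix rather than from a significance test. A minimal sketch follows, with entirely hypothetical counts for a rule that flags consignments for inspection.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = share of contaminated items that were flagged.
    Specificity = share of clean items that were not flagged."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical outcome: 50 contaminated items (40 flagged, 10 missed)
# and 950 clean items (900 passed, 50 flagged unnecessarily).
sens, spec = sensitivity_specificity(tp=40, fn=10, tn=900, fp=50)
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

In operational terms, sensitivity relates to how much risk material gets through, and specificity relates to how much inspection effort is spent on clean items.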

Key Lessons Learned 2/2

Facing the Organization

6. Visit & sustain engagement.
   - Be in the room.
7. Deliver useful, usable outcomes but operationalise lightly.
   - Statistical models are half-way there.
8. Build bridges inside and outside the organization.
   - Prepare for the new normal. Network.
9. Identify, cultivate, & reward champions.
   - How can you help them to think differently about what you can possibly do?
10. Manage expectations carefully.
    - Under-promise and over-deliver.

Be patient!

Grateful Thanks

Matt Chisholm, Sandy Clarke, Greg Hood, Richard Gao, Chris Woodland, Nyree Stenekes, Tarik Zaman, Wayne Atkinson

Outline

Biosecurity

CEBRA

Data-Mining Examples
  Risk–Return Case Studies
  Spatial Analysis of Intercepted Mail
  Text Mining for Profiling

Failures & Lessons Learned