Mining Unstructured Data:Practical Applications, from the Strata O'Reilly Making Data Work...

Mining Unstructured Data: Practical Applications Alyona Medelyan @zelandiya Anna Divoli @annadivoli

description

Alyona Medelyan (Pingar) and Anna Divoli (Pingar) presented at the Strata O'Reilly Making Data Work Conference on March 1, 2012.

The challenge of unstructured data is a top priority for organizations that are looking for ways to search, sort, analyze and extract knowledge from the masses of documents they store and create daily. Text mining uses knowledge-driven algorithms to make sense of documents in much the same way a person would by reading them. Text mining and analytics tools have recently become available via APIs, meaning that organizations can take immediate advantage of these tools. We discuss three examples of how such APIs were used to solve key business challenges.

Most organizations dream of a paperless office, but still generate and receive millions of printed documents. Digitizing these documents and sharing them intelligently is a universal enterprise challenge. Major scanning providers offer solutions that analyze scanned and OCR’d documents and then store the detected information in document management systems. This works well with pre-defined forms, but human interaction is required when scanning unstructured text. We describe a prototype built for the legal vertical that scans stacks of paper documents and, on the fly, categorizes them and generates meaningful metadata.

In the area of forensics, intelligence and security, manual monitoring of masses of unstructured data is not feasible. The ability to automatically identify people’s names, addresses, credit card and bank account numbers and other entities is key. We will briefly describe a case study of how a major international financial institution is taking advantage of text mining APIs in order to comply with recent legislation.

In healthcare, although Electronic Health Records (EHRs) have become increasingly available over the past two decades, patient confidentiality and privacy concerns have been obstacles to using the incredibly valuable information they contain to further medical research. Several approaches to assigning unique encrypted identifiers to patients’ IDs have been reported, but each comes with drawbacks. For a number of medical studies, consistent uniform ID mapping is not necessary, and automated text sanitization can serve as a solution. We will demonstrate how sanitization has practical use in a medical study.

Read a full interview with Alyona and Anna at http://radar.oreilly.com/2012/02/unstructured-data-analysis-tools.html

Transcript of Mining Unstructured Data:Practical Applications, from the Strata O'Reilly Making Data Work...

Page 1: Mining Unstructured Data:Practical Applications, from the Strata O'Reilly Making Data Work Conference March 2012

Mining Unstructured Data: Practical Applications

Alyona Medelyan @zelandiya Anna Divoli @annadivoli

Page 2:

New York London

Problem 1

Images: Ambro / FreeDigitalPhotos.net

How do lawyers scan, file, store & share clients’ case documents efficiently?

Page 3:

Images: slambo_42@flickr, Anoto AB@flickr

 EHR  EMR  PHR  

How do doctors, patients & researchers distribute & share medical records efficiently?

Page 4:

Foreign Financial Institution

with IRS agreement

annual report

30% withholding tax

waiver

with waiver

without waiver

U.S. account holders, U.S. ownership entities

30% withholding tax

Custodian bank without IRS agreement

The FATCA Legislation: takes effect 1 January 2013

Problem 3

How can a financial institution find U.S. citizens in masses of paperwork efficiently?

Page 5:

How much time do we actually spend on …

IDC: Hidden cost of information (average hours / week)

Searching, gathering info    17
Writing emails               14
Creating docs                13
Analyzing info               10
Reviewing docs                9
Organizing docs               7
Creating presentations        7
Editing images                6
Entering data                 6
Approving docs                4
Publishing docs               4
Translating docs              1

Translates to annual costs: Search: 17 h / week = $37,000 / year

Page 6:

introduction

unstructured data real life problems

unstructured data & text analytics

metadata in legal domain

healthcare records issues

conclusions

compliance in finance

Page 7:

Videos  

Emails  

Literature  

Audio  

News  

Images  

Social  Media  

Databases  

Blogs  

Page 8:

Text Mining Natural Language Processing

unstructured data

Opinion Mining

Business Intelligence

Document Organization

Data Extraction

Search

Machine Learning

Text Processing

Statistics Linguistics

Page 9:

What can one mine from unstructured data?

sentiment

keywords, tags

genre

categories, taxonomy terms

entities

names, patterns

biochemical entities …

Page 10:

Videos  

Emails  

Literature  

Audio  

News  

Images  

Social  Media  

Databases  

Blogs  

Page 11:

People, U.S. politicians, News about U.S. politicians

News

Structured biological data

Unique identifiers

Literature references

Experts’ annotation (free text)

Structured & unstructured data interplay

Page 12:

introduction

unstructured data real life problems

unstructured data & text analytics

metadata in legal domain

healthcare records issues

conclusions

compliance in finance

Page 13:

scan  

ocr  

metadata  

dms  

save  

Legal document processing pipeline

Images: Ambro / FreeDigitalPhotos.net

New York London

Page 14:

Assigning metadata (approximation)

15 docs per day × 3 min per doc = 0.75 h per day

0.75 h per day × 240 working days per year × $200 hourly charge

= $36,000 per year per lawyer

Keyword extraction: 0.0027 min per doc

About 10 min for a year's worth of docs
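The approximation on this slide can be checked with a few lines of arithmetic. The sketch below uses only the figures shown on the slide; the constant and function names are made up for illustration.

```python
# Figures taken from the slide; names are illustrative only.
DOCS_PER_DAY = 15
MANUAL_MIN_PER_DOC = 3
WORKING_DAYS_PER_YEAR = 240
HOURLY_CHARGE_USD = 200
AUTO_MIN_PER_DOC = 0.0027  # keyword extraction time per document


def manual_hours_per_day():
    """Hours a lawyer spends per day assigning metadata by hand."""
    return DOCS_PER_DAY * MANUAL_MIN_PER_DOC / 60


def manual_cost_per_year():
    """Yearly cost of manual metadata assignment, per lawyer, in USD."""
    return manual_hours_per_day() * WORKING_DAYS_PER_YEAR * HOURLY_CHARGE_USD


def auto_minutes_per_year():
    """Minutes of automatic keyword extraction for a year's worth of docs."""
    return DOCS_PER_DAY * WORKING_DAYS_PER_YEAR * AUTO_MIN_PER_DOC


if __name__ == "__main__":
    print(f"Manual: {manual_hours_per_day():.2f} h/day, "
          f"${manual_cost_per_year():,.0f}/year")
    print(f"Automatic: {auto_minutes_per_year():.2f} min/year")
```

The yearly volume is 15 × 240 = 3,600 documents, so automatic extraction takes 3,600 × 0.0027 = 9.72 minutes, which the slide rounds to 10.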

Image: jacockshaw@flickr

Page 15:

Integrating metadata extraction with scanning

http://www.youtube.com/watch?v=kluVp25upag

Page 16:

metadata  

dms  

Efficient (legal) document processing pipeline

keywords tags
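The slides do not describe how the keywords and tags are produced, so the following is only a minimal stand-in: a frequency-based keyword extractor applied to OCR'd text, with the result bundled as metadata for a document management system. The stopword list, the function names and the metadata fields are all assumptions made for illustration, not the Pingar API.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger ones).
STOPWORDS = {
    "the", "a", "an", "of", "and", "to", "in", "for", "is", "on", "by",
    "with", "that", "this", "be", "as", "at", "or", "was", "are", "it",
    "its", "from", "under",
}


def extract_keywords(text, top_n=5):
    """Return the top_n most frequent non-stopword tokens, most frequent first."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(top_n)]


def build_metadata(doc_id, ocr_text, top_n=5):
    """Bundle hypothetical metadata fields for storage alongside the scan."""
    return {
        "id": doc_id,
        "keywords": extract_keywords(ocr_text, top_n),
        "word_count": len(ocr_text.split()),
    }
```

For example, `build_metadata("case-001", ocr_text)` would return a dictionary that a DMS connector could store with the scanned file. A production system would more likely rank multi-word phrases against a taxonomy than count single words.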

Page 17:

introduction

unstructured data real life problems

unstructured data & text analytics

metadata in legal domain

healthcare records issues

conclusions

compliance in finance

Page 18:

 EHR  EMR  PHR  

Images: slambo_42@flickr, Anoto AB@flickr

Page 19:

EMR, EHR, PHR

National Alliance for Health Information Technology (NAHIT) definitions

Discontinued!

?

1. Name, birth date, blood type
2. Emergency contact(s)
3. Primary caregiver/phone number
4. Medicines, dosages, and how long taken
5. Allergies/allergic reactions
6. Date of last physical
7. Dates/results of tests and screenings
8. Major illnesses/surgeries and their dates
9. Chronic diseases
10. Family illness history
11. …

http://www.nlm.nih.gov/medlineplus/magazine/

PHI

de-identification process

 


Page 20:

Medical researchers use patient records for discoveries…

… records with removed PHI: information from structured fields but mostly from free text!

AMIA  2012  

Page 21:

 

www.hcpro.com

siliconangle.com/blog/

www.information-age.com

“The Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy and Security Rules”

“The Patient Safety and Quality Improvement Act of 2005 (PSQIA) Patient Safety Rule”

   

Page 22:

PHI: 18 identifiers!

Names

Geographic subdivisions smaller than a State: street address, city, county, precinct, zip code…

Dates (except year): birth, admission, discharge…

Phone / Fax numbers

Email addresses

Social security #, Medical records #, Health plan beneficiary #, Accounts #

Vehicle identifiers & serial numbers, incl. license plate numbers

Device identifiers & serial numbers

URLs / IP addresses

Biometric identifiers, including finger and voice prints

Face photo images & any comparable images

Any other unique IDs etc.
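Some of the identifiers above have a recognizable shape and can be caught with patterns. The sketch below is a minimal, assumption-laden illustration of text sanitization for those pattern-shaped identifiers only (emails, URLs, IP addresses, U.S.-style social security and phone numbers, numeric dates). The patterns, placeholder labels and function name are invented for this example and are not the presenters' method; names, addresses and other free-form identifiers need entity extraction rather than regular expressions.

```python
import re

# Order matters: specific patterns run first so that later, looser
# patterns do not consume part of an identifier already handled.
PHI_PATTERNS = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("URL", re.compile(r"\bhttps?://\S+")),
    ("IP", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(
        r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}\b")),
    # Numeric dates such as 03/14/2011; the year is captured so it can be kept.
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/(\d{4})\b")),
]


def sanitize(text):
    """Replace pattern-shaped identifiers with bracketed placeholders."""
    for label, pattern in PHI_PATTERNS:
        if label == "DATE":
            # The list says "Dates (except year)", so the year survives.
            text = pattern.sub(lambda m: f"[DATE {m.group(1)}]", text)
        else:
            text = pattern.sub(f"[{label}]", text)
    return text
```

Replacing each hit with a typed placeholder, rather than deleting it, keeps the sentence readable for researchers while removing the identifying value.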

Page 23:

Thanks for discussions:

Nigam Shah, Stanford

Eneida Mendonca, U Wisconsin, Madison

Irena Spasic, Cardiff University

keywords tags

Images: slambo_42@flickr, Anoto AB@flickr

Page 24:

introduction

unstructured data real life problems

unstructured data & text analytics

metadata in legal domain

healthcare records issues

conclusions

compliance in finance

Page 25:

Foreign Financial Institution

with IRS agreement

annual report

30% withholding tax

waiver

with waiver

without waiver

U.S. account holders, U.S. ownership entities

30% withholding tax

Custodian bank without IRS agreement

The FATCA Legislation: takes effect 1 January 2013

Page 26:

FATCA COMPLIANCE – STEP 1 Detect U.S. citizenship indicators

Page 27:

Recommended Solution from FATCA Legislation:

•  “Query an electronic database using standard queries in programming languages”

•  “Adopt similar approaches as used for the Anti-money-laundering and Know-your-customer requirements”

•  “Note that information, data, or files are not electronically searchable if they are stored as images”

Page 28:

Images: walmink, thomwatson@flickr

FATCA COMPLIANCE – STEP 2 Contact client for additional info or a waiver

Page 29:

Actual Solution for the FATCA Legislation:

ocr

link analysis

entity extraction

analysis

gather the trail, client’s data

convert all images to text

detect locations, bank numbers

auto-categorize

check & resolve inconsistencies
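As a rough illustration of the "detect locations, bank numbers" step feeding the search for U.S. citizenship indicators, the sketch below scans OCR'd text for a few U.S.-looking patterns. The indicator names, the patterns and the function are assumptions made for this example; they are neither the legal definition of the indicators nor the solution the presenters built, and a real system would rely on entity extraction instead of a handful of regular expressions.

```python
import re

US_STATE_CODES = (
    "AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN "
    "MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA "
    "WA WV WI WY DC"
).split()

# Hypothetical indicator patterns, for illustration only.
US_INDICATORS = {
    # Phone number with the +1 country code, e.g. "+1 617 555 0100".
    "us_phone": re.compile(
        r"\+1[\s.-]?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
    # Two-letter state code followed by a ZIP code, e.g. "MA 02108".
    "us_state_zip": re.compile(
        r"\b(?:" + "|".join(US_STATE_CODES) + r")\s+\d{5}(?:-\d{4})?\b"),
    # A "place of birth" field that mentions the United States.
    "us_birthplace": re.compile(
        r"\bplace of birth:\s*[^\n]*?(?:USA|United States)", re.IGNORECASE),
}


def find_us_indicators(text):
    """Return {indicator_name: [matched snippets]} for each pattern that hits."""
    hits = {}
    for name, pattern in US_INDICATORS.items():
        found = [m.group(0) for m in pattern.finditer(text)]
        if found:
            hits[name] = found
    return hits
```

Documents with at least one hit would be routed to the review step, where inconsistencies between sources are checked and resolved before the client is contacted.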

Page 30:

Efficient FATCA Compliance

Page 31:

introduction

unstructured data real life problems

unstructured data & text analytics

metadata in legal domain

healthcare records issues

conclusions

compliance in finance

Page 32:

Alyona Medelyan, PhD @zelandiya

Natural Language Processing, Text Mining, Wikipedia Mining, Machine Learning

Anna Divoli, PhD @annadivoli

Biomedical Text Mining, Search User Interfaces, Human Factors, Knowledge Discovery

Try out text analytics provided by the Pingar API!

Online demo: apidemo.pingar.com

Free Sandbox account: pingar.com/get-the-api