Mining Unstructured Data: Practical Applications Presentation

  • 8/12/2019 Mining Unstructured Data: Practical Applications Presentation

    1/32

    2/32

    New York London

    Problem 1

    Images: Ambro / FreeDigitalPhotos.net

    How do lawyers scan, file, store & share clients' case documents efficiently?

    3/32

    slambo_42@flickr

    AnotoAB@flickr

    EHR

    EMR

    PHR

    How do doctors, patients & researchers distribute & share medical records efficiently?

    4/32

    Foreign Financial Institution with IRS agreement

    annual report

    30% withholding tax

    wai[…] [U.]S. ownership entities

    30% withholding tax

    Custodian Bank without IRS agreement

    The FATCA Legislation

    Takes effect 1 January 2013

    Problem 3

    How can a financial institution find U.S. citizens in masses of paperwork efficiently?

    5/32

    How much time do we actually spend on

    Searching, gathering info

    Writing emails

    Creating docs

    Analyzing info

    Re[…]

    6/32

    introduction

    unstructured data: real-life problems

    unstructured data & text analytics

    metadata in legal domain

    healthcare records issues

    compliance in finance

    conclusions

    7/32

    Videos

    Emails

    Literature

    Audio

    News

    Images

    Social Media

    Data

    8/32

    Text Mining

    Natural Language Processing

    unstructured data

    Opinion Mining

    Business Intelligence

    Document Organization

    Data Extraction

    Search

    Machine Learning

    Text Processing

    Statistics

    Linguistics

    9/32

    What can one mine from unstructured data?

    sentiment

    keywords, tags

    genre

    categories, taxonomy terms

    entities: names, patterns, biochemical entities
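To make the "keywords" item concrete, here is a minimal Python sketch of frequency-based keyword extraction. It is only an illustration under stated assumptions: the stopword list, the function name `extract_keywords`, and the sample sentence are invented for this example and are not taken from the slides, and production text analytics would use far richer methods.

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "by"}


def extract_keywords(text, top_n=2):
    """Return the top_n most frequent non-stopword tokens in text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]


# Hypothetical sample text, standing in for a scanned legal document.
sample = (
    "The tenant signed the contract. The landlord countersigned the contract, "
    "and the tenant kept a copy of the contract."
)
keywords = extract_keywords(sample)
print(keywords)
```

The same counting idea extends to tags and categories once tokens are mapped onto a taxonomy; sentiment and entity detection need dedicated models or rules.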

    10/32

    Videos

    Emails

    Literature

    Audio

    News

    Images

    Social Media

    Data

    11/32

    People

    U.S. politicians

    News about U.S. politicians

    News

    Structured biological data

    Unique identifiers

    Literature references

    Experts' annotation (free text)

    Structured & unstructured data interplay

    12/32

    introduction

    unstructured data: real-life problems

    unstructured data & text analytics

    metadata in legal domain

    healthcare records issues

    compliance in finance

    conclusions

    13/32

    scan

    OCR

    metadata

    DMS

    sa[ve]

    14/32

    Assigning metadata (approximation)

    15 docs per day

    3 min per doc

    0.75 h per day

    240 working days per year

    $200 hourly charge

    $36,000 per year per lawyer

    Keyword extraction

    0.0027 min per doc

    10 min for a year's worth of docs
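The approximation on this slide can be checked with a few lines of Python. The figures are the ones given on the slide; the variable names are chosen for this sketch only.

```python
# Figures from the slide.
DOCS_PER_DAY = 15
MANUAL_MIN_PER_DOC = 3
WORKING_DAYS_PER_YEAR = 240
HOURLY_CHARGE_USD = 200
AUTO_MIN_PER_DOC = 0.0027

# Manual metadata assignment by a lawyer.
hours_per_day = DOCS_PER_DAY * MANUAL_MIN_PER_DOC / 60        # 0.75 h per day
hours_per_year = hours_per_day * WORKING_DAYS_PER_YEAR        # 180 h per year
manual_cost_per_year = hours_per_year * HOURLY_CHARGE_USD     # $36,000 per year

# Automatic keyword extraction over the same yearly volume.
docs_per_year = DOCS_PER_DAY * WORKING_DAYS_PER_YEAR          # 3,600 docs
auto_minutes_per_year = docs_per_year * AUTO_MIN_PER_DOC      # roughly 10 min

print(manual_cost_per_year, round(auto_minutes_per_year, 2))
```

In other words, the slide compares about 180 billable hours per lawyer per year with roughly ten minutes of machine time for the same 3,600 documents.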

    jacockshaw@flickr

    15/32

    Integrating metadata extraction with scanning

    http://www.youtube.com/watch?

    16/32

    metadata

    DMS

    Efficient (legal) document processing pipeline

    keywords, tags

    17/32

    introduction

    unstructured data: real-life problems

    unstructured data & text analytics

    metadata in legal domain

    healthcare records issues

    compliance in finance

    conclusions

    18/32

    EHR

    EMR

    PHR

    slambo_42@flickr

    AnotoAB@flickr

    19/32

    EMR

    EHR

    PHR

    National Alliance for Health Information Technology (NAHIT) definitions

    Discontinued

    1. Name, birth date, blood type
    2. Emergency contact(s)
    3. Primary caregiver
    4. Medicines, dosages, and how long taken
    5. Allergies/allergic reactions
    6. Date of last physical
    7. Dates/results of tests and screenings
    8. Major illnesses/surgeries and their dates
    9. Chronic diseases
    10. Family illness history
    11. …

    http://www.nlm.nih.gov/medlineplus/magazine/

    PHI de-identification process

    1. Name, birth date, blood type
    2. Emergency contact(s)
    3. Primary caregiver
    4. Medicines, dosages, and how long taken
    5. Allergies/allergic reactions
    6. Date of last physical
    7. Dates/results of tests and screenings
    8. Major illnesses/surgeries and their dates
    9. Chronic diseases
    10. Family illness history
    11. …

    http://www.nlm.nih.gov/medlineplus/magazine/

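As a rough illustration of what a de-identification step does to free-text record fields, the Python sketch below masks two kinds of identifying patterns with placeholders. It is a simplified stand-in, not the process shown on the slide: the patterns, the placeholder labels, the function name `deidentify`, and the sample note are all assumptions made for this example, and real de-identification has to cover many more identifier types (names, addresses, record numbers and so on).

```python
import re

# Illustrative patterns only (assumptions for this sketch).
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),   # e.g. 4/12/1961
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),      # e.g. 555-123-4567
}


def deidentify(text):
    """Replace each matched identifier with a bracketed placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


# Hypothetical record snippet.
note = "Born 4/12/1961, emergency contact 555-123-4567. Chronic disease: asthma."
print(deidentify(note))
```

Clinical content such as the chronic disease stays readable for researchers, while the masked fields no longer point to an individual.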
    20/32

    Medical researchers use patient records for disco[very]

    21/32

    www.hcpro.com

    siliconangle.com/

    22/32

    23/32

    Thanks for discussions:

    Nigam Shah, Stanford

    Eneida Mendonca, U Wisconsin, Madison

    Irena Spasic, Cardiff Uni

    24/32

    introduction

    unstructured data: real-life problems

    unstructured data & text analytics

    metadata in legal domain

    healthcare records issues

    compliance in finance

    conclusions

    25/32

    Foreign Financial Institution with IRS agreement

    annual report

    30% withholding tax

    wai[…] [U.]S. ownership entities

    30% withholding tax

    Custodian Bank without IRS agreement

    The FATCA Legislation

    Takes effect 1 January 2013

    26/32

    FATCA COMPLIANCE STEP 1: Detect U.S. citizenship indicators

    27/32

    Recommended Solution from FATCA Legislation:

    Query an electronic database using standard queries in programming languages

    Adopt approaches similar to those used for the anti-money-laundering and know-your-customer requirements

    Note that information, data, or files are not electronically searchable if they are stored as images
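The "standard queries" recommendation works only for data that is already structured. A minimal sketch of such a query, using Python's built-in sqlite3 module, might look like the following. The table layout, column names, sample rows, and the two indicator rules are hypothetical and chosen only for illustration; note that the "+1" prefix also covers non-U.S. numbers, so hits are candidates for review, not confirmed U.S. persons.

```python
import sqlite3

# Hypothetical client table held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clients ("
    "id INTEGER PRIMARY KEY, name TEXT, birth_country TEXT, phone TEXT)"
)
conn.executemany(
    "INSERT INTO clients (name, birth_country, phone) VALUES (?, ?, ?)",
    [
        ("A. Example", "US", "+1 212 555 0100"),
        ("B. Example", "NZ", "+64 9 555 0100"),
        ("C. Example", "UK", "+1 415 555 0199"),
    ],
)


def find_candidates(connection):
    """Return names of clients whose structured fields show a possible U.S. link."""
    rows = connection.execute(
        "SELECT name FROM clients "
        "WHERE birth_country = 'US' OR phone LIKE '+1 %' "
        "ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


candidates = find_candidates(conn)
print(candidates)
```

A query like this cannot see anything stored as a scanned image, which is the limitation the last point on the slide calls out.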

    28/32

    walmink, thomwatson@flickr

    FATCA COMPLIANCE STEP 2: Contact client for additional info or a waiver

    29/32

    Actual Solution for the FATCA Legislation:

    OCR

    link analysis

    entity extraction

    analysis

    gather the trail client's data

    con[…]
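Once scanned paperwork has been through OCR, the resulting text can be searched for indicators. The Python sketch below uses plain regular expressions as a simple stand-in for the entity extraction step named on the slide; it is not the presenters' implementation. The indicator names, the patterns, the function name `find_indicators`, and the sample strings are assumptions made for this example.

```python
import re

# Illustrative indicator patterns (assumptions for this sketch).
INDICATORS = {
    # A "place of birth" line that mentions the USA.
    "us_birthplace": re.compile(
        r"place of birth:\s*[^\n]*\b(USA|United States)\b", re.IGNORECASE
    ),
    # A phone number written with the +1 country code.
    "us_phone": re.compile(r"\+1[\s-]\d{3}[\s-]\d{3}[\s-]\d{4}"),
    # A two-letter state code followed by a five-digit ZIP code.
    "us_zip_address": re.compile(r"\b[A-Z]{2} \d{5}\b"),
}


def find_indicators(ocr_text):
    """Return the sorted names of all indicators found in OCR output."""
    return sorted(
        name for name, pattern in INDICATORS.items() if pattern.search(ocr_text)
    )


# Hypothetical OCR output from two client documents.
doc_a = "Place of birth: Boston, USA\nPhone: +64 9 555 0100\n"
doc_b = "Address: 1 Main St, Springfield, IL 62701\nPhone: +1 217 555 0142\n"
print(find_indicators(doc_a), find_indicators(doc_b))
```

Documents that trigger any indicator can then be linked to the client record and queued for the follow-up contact described in step 2.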

    30/32

    Efficient FATCA Compliance

    31/32

    introduction

    unstructured data: real-life problems

    unstructured data & text analytics

    metadata in legal domain

    healthcare records issues

    compliance in finance

    conclusions

    32/32

    Alyona Medelyan, PhD @zelandiya

    Natural Language Processing

    Text Mining

    Wikipedia Mining

    Machine Learning

    Anna Divoli, PhD @annadivoli

    Biomedical Text Mining

    Search User Interfaces

    Human Factors

    Knowledge Discovery

    Try out text analytics provided by the Pingar API!

    Online demo: apidemo.pingar.com

    Free Sandbox account: pingar.com/get-the-api