Catching the Criminals Rahman, Emmanuel Mona,Sarasswathi ... · law company can process 4TB in week...

Post on 14-Jul-2020


Catching the Criminals

Mona, Sarasswathi, Jonathan
Rahman, Emmanuel

Parallelization

16 PB in 3 weeks, BUT a law company can process 4 TB in a week.

SO we use TACC or another national supercomputing center.

We need 1000/64 nodes, each having 64 cores: basically 16 nodes.
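The node count above can be checked with a few lines of arithmetic. This is only a sketch: the slides do not say where the figure of 1000 comes from, so it is taken as a given core count. The second function is an assumption-laden sanity check (decimal units, 1 PB = 1000 TB, and perfectly linear scaling) of how much faster than the 4 TB/week baseline the job would have to run.

```python
import math

def nodes_needed(total_cores: int, cores_per_node: int) -> int:
    """Round up, since a fraction of a node cannot be allocated."""
    return math.ceil(total_cores / cores_per_node)

def required_speedup(total_tb: float, deadline_weeks: float,
                     baseline_tb_per_week: float) -> float:
    """Ratio of the throughput needed to the baseline throughput.

    Assumes linear scaling, which real workloads rarely achieve.
    """
    return (total_tb / deadline_weeks) / baseline_tb_per_week

# Figures from the slide: 1000 cores, 64 cores per node.
print(nodes_needed(1000, 64))

# 16 PB taken as 16,000 TB (decimal units, an assumption).
print(round(required_speedup(16_000, 3, 4)))
```

Under these assumptions the required speedup is somewhat above 1000, so the 16-node estimate is best read as a rough lower bound.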

Data Pre-processing

● Data cleaning

● Data redundancy

● Data integrity

● Data quality measures

● Data checksums
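As one possible illustration of the checksum, integrity, and redundancy items, the sketch below hashes each document with SHA-256 from Python's standard `hashlib`. The function names and the sample documents are hypothetical; the slides do not specify a hash function or a tool.

```python
import hashlib

def sha256_hex(data):
    """Return the SHA-256 digest of a bytes object as a hex string."""
    return hashlib.sha256(data).hexdigest()

def is_intact(data, expected_digest):
    """Integrity check: does the data still match its recorded checksum?"""
    return sha256_hex(data) == expected_digest

def find_duplicates(docs):
    """Redundancy check: group document names that share identical content.

    `docs` maps a document name to its raw bytes.
    """
    by_digest = {}
    for name, data in docs.items():
        by_digest.setdefault(sha256_hex(data), []).append(name)
    return {d: names for d, names in by_digest.items() if len(names) > 1}

# Hypothetical sample data.
docs = {"a.txt": b"invoice 1", "b.txt": b"invoice 1", "c.txt": b"invoice 2"}
print(list(find_duplicates(docs).values()))
```

Recording the digest when a file is first ingested makes it possible to detect later corruption and to skip exact duplicates before any expensive processing.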

Tag Data Upon Creation

Bill, Transactions, Payment, Receipt Checks
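A minimal sketch of what tagging at creation time could look like, assuming the tag names on the slide (with the spelling of "Receipt" corrected). The `TaggedRecord` class and its fields are hypothetical and only illustrate the idea that a record cannot exist without a valid tag.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Tag names taken from the slide; how they group is an assumption.
ALLOWED_TAGS = {"Bill", "Transactions", "Payment", "Receipt Checks"}

@dataclass(frozen=True)
class TaggedRecord:
    content: str
    tag: str
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def __post_init__(self):
        # Reject untagged or mis-tagged data at the moment it is created.
        if self.tag not in ALLOWED_TAGS:
            raise ValueError(f"unknown tag: {self.tag!r}")

record = TaggedRecord("Invoice for March", "Bill")
print(record.tag)
```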

Data Quality

Data Cleaning

Using Amazon Mechanical Turk

YES! We need crowdsourcing for accurate manual review!

● Low cost, high accuracy, and fast
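One common way to get accurate labels from crowd workers is to give each document to several workers and keep the majority answer. The slides do not say how the AMT answers are combined, so the sketch below is an assumption; `majority_label` and the agreement threshold are hypothetical.

```python
from collections import Counter

def majority_label(votes, min_agreement=0.5):
    """Return the label chosen by more than `min_agreement` of the workers.

    Returns None when there are no votes or no label has enough support,
    so that the document can be sent back for another round of review.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > min_agreement else None

print(majority_label(["illegal", "illegal", "legal"]))
print(majority_label(["legal", "illegal"]))
```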

AMT

Using decision tree for classification

● First, create the training dataset from the AMT results (~20%-30% of the whole data)

● Train the decision tree on the training dataset and use the cleaned data as the test dataset

● The leaf nodes represent legal and illegal documents
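The steps above can be sketched with a very small decision tree written in plain Python. This is an illustrative sketch, not the presenters' implementation: the Gini-based splitting, the two numeric features, and the sample rows are all assumptions. Leaves hold the labels "legal" or "illegal", and internal nodes are `(feature_index, threshold, left, right)` tuples.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature, threshold) that lowers weighted Gini, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively build the tree; a leaf is the majority label."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (
        f, t,
        build_tree([r for r, _ in left], [y for _, y in left],
                   depth + 1, max_depth),
        build_tree([r for r, _ in right], [y for _, y in right],
                   depth + 1, max_depth),
    )

def predict(tree, row):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Hypothetical AMT-labeled training rows: [feature_a, feature_b].
train_rows = [[1, 0], [2, 0], [1, 1], [8, 3], [9, 4], [7, 5]]
train_labels = ["legal", "legal", "legal", "illegal", "illegal", "illegal"]
tree = build_tree(train_rows, train_labels)

# Classify documents that were not labeled by AMT workers.
print(predict(tree, [1, 0]), predict(tree, [10, 6]))
```

In practice a library implementation would normally be preferred over a hand-written tree; the point here is only to show how the AMT labels become training data and how a leaf yields the final class.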

DT in general

A sample DT for finding illegal cases