Catching the Criminals
Mona, Sarasswathi, Jonathan, Rahman, Emmanuel
Parallelization
16 PB in 3 weeks, BUT the law company can process only 4 TB in a week.
SO we use TACC or another national supercomputing center.
We need 1000/64 nodes, each having 64 cores: basically 16 nodes.
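The node estimate is back-of-envelope arithmetic, and a minimal sketch of it might look like the following. It assumes the slide means 16 PB has to be finished in 3 weeks, that 1 PB = 1000 TB, and that the figure 1000 is a rounded core count; none of this is stated in the slides, and the function names are illustrative only.

```python
import math

TB_PER_PB = 1000  # decimal units; an assumption, the slide does not say


def required_speedup(total_pb, deadline_weeks, current_tb_per_week):
    """How many times faster than the current rate the processing must run."""
    target_tb_per_week = total_pb * TB_PER_PB / deadline_weeks
    return target_tb_per_week / current_tb_per_week


def nodes_needed(cores, cores_per_node=64):
    """Whole nodes needed to supply the given number of cores."""
    return math.ceil(cores / cores_per_node)


speedup = required_speedup(16, 3, 4)  # target rate divided by 4 TB per week
slide_nodes = nodes_needed(1000)      # the slide's 1000/64, rounded up
print(round(speedup), slide_nodes)    # prints: 1333 16
```

Under these assumptions the required speedup is on the order of the 1000 used on the slide, and 1000 cores at 64 cores per node rounds up to 16 nodes.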
Data Pre-processing
● Data cleaning
● Data redundancy
● Data integrity
● Data quality measures
● Data checksums
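The checksum and redundancy items above could be sketched as follows. This is only an illustration: the slides do not name a hash function, so SHA-256 is an assumption, and the file names and helper functions are hypothetical.

```python
import hashlib
from collections import defaultdict


def checksum(data):
    """Return the SHA-256 hex digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def find_duplicates(documents):
    """Group document names whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for name, content in documents.items():
        by_digest[checksum(content)].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


def verify(data, expected_digest):
    """Integrity check: does the data still match the stored checksum?"""
    return checksum(data) == expected_digest


# Hypothetical sample documents
docs = {"a.txt": b"invoice 42", "b.txt": b"invoice 42", "c.txt": b"receipt 7"}
duplicates = find_duplicates(docs)
print(duplicates)
```

Storing the digest when a file is first ingested allows both checks later: redundancy (two names, one digest) and integrity (content no longer matches its digest).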
Tag Data Upon Creation
Bill, Transactions, Payment, Receipt, Checks
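One way to enforce tagging at creation time is to make the tag a required, validated field of every record. The sketch below assumes the tag set is the five categories listed on this slide; the `Record` class and `create_record` helper are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Tag(Enum):
    BILL = "bill"
    TRANSACTION = "transaction"
    PAYMENT = "payment"
    RECEIPT = "receipt"
    CHECK = "check"


@dataclass(frozen=True)
class Record:
    content: str
    tag: Tag
    # Creation time is recorded automatically, in UTC
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def create_record(content, tag):
    """Refuse to create a record that does not carry a known tag."""
    if not isinstance(tag, Tag):
        raise TypeError("a record must be tagged when it is created")
    return Record(content=content, tag=tag)


record = create_record("Invoice #1 for consulting", Tag.BILL)
print(record.tag.value)
```

Because the record is frozen, the tag cannot be silently changed after creation.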
Data Quality
Data Cleaning
Using Amazon Mechanical Turk
YES! We need crowdsourcing for accurate manual review!
● Low cost, high accuracy, and fast
AMT
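Crowdsourced review usually collects several worker judgments per document and then combines them. The slides do not say how the answers are combined, so the majority vote below is an assumption, and the document IDs and votes are made-up sample data. No real AMT API call is shown.

```python
from collections import Counter


def majority_label(votes):
    """Return the most common label, or None if empty or tied for first."""
    counts = Counter(votes).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: send the document back for another review
    return counts[0][0]


# Hypothetical worker answers, three or fewer per document
worker_votes = {
    "doc-1": ["legal", "legal", "illegal"],
    "doc-2": ["illegal", "illegal", "illegal"],
    "doc-3": ["legal", "illegal"],
}
consensus = {doc: majority_label(v) for doc, v in worker_votes.items()}
print(consensus)
```

Documents with no clear majority are returned as `None` so they can be re-queued instead of being labeled by chance.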
Using decision tree for classification
● First, create the training dataset from the AMT results (~20%-30% of the whole data)
● Train the decision tree on the training dataset and use the cleaned data as the test dataset
● The leaf nodes represent legal and illegal documents
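The steps above can be sketched with a very small decision tree written in plain Python. The slides do not specify the tree algorithm, so the Gini-based binary splitting here is an assumption; the two numeric features (amount in thousands, number of offshore transfers) and all sample rows are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build(rows, labels, depth=0, max_depth=3):
    """Build a tree: inner nodes are tuples, leaves are class labels."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: "legal" or "illegal"
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            build([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))


def predict(tree, row):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


# Training set: stands in for the AMT-labeled 20%-30% (made-up values)
train_rows = [[5, 0], [12, 0], [8, 1], [90, 4], [150, 6], [70, 5]]
train_labels = ["legal", "legal", "legal", "illegal", "illegal", "illegal"]
tree = build(train_rows, train_labels)

# Test set: stands in for the remaining cleaned data
predictions = [predict(tree, row) for row in [[10, 0], [120, 5]]]
print(predictions)
```

In practice a library implementation would likely replace this hand-written tree; the point here is only that each leaf ends up holding one of the two classes.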
DT in general
A sample DT for finding illegal cases