Catching the Criminals Rahman, Emmanuel Mona,Sarasswathi ... · law company can process 4TB in week...

Catching the Criminals Mona,Sarasswathi,Jonathan Rahman, Emmanuel

Page 1: Catching the Criminals Rahman, Emmanuel Mona,Sarasswathi ... · law company can process 4TB in week SO we use TACC or other national supercomputing ... Step 1: Find work Search or

Catching the Criminals
Mona, Sarasswathi, Jonathan
Rahman, Emmanuel

Parallelization

16 PB -- 3 weeks, BUT the law company can process 4 TB in a week
SO we use TACC or other national supercomputing centers
We need 1000/64 nodes, each having 64 cores: basically 16 nodes
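
The arithmetic on this slide can be checked with a short sketch. It assumes that "16PB -- 3 weeks" means 16 PB must be processed within 3 weeks, that 1 PB = 1000 TB, and that the slide's 1000 is a round count of single-core workers; none of these readings is stated explicitly, and the function names are illustrative only.

```python
import math

# Figures from the slide. The unit conversion and the reading of
# "1000" as a number of single-core workers are assumptions.
TOTAL_TB = 16 * 1000          # 16 PB, assuming 1 PB = 1000 TB
DEADLINE_WEEKS = 3            # assumed target time for the whole data set
BASELINE_TB_PER_WEEK = 4      # what the law company can process alone
CORES_PER_NODE = 64
PARALLEL_WORKERS = 1000       # the slide's round figure


def required_speedup(total_tb, weeks, baseline_tb_per_week):
    """How many times faster than the baseline the job must run."""
    return (total_tb / weeks) / baseline_tb_per_week


def nodes_needed(workers, cores_per_node):
    """Whole nodes needed to host one worker per core."""
    return math.ceil(workers / cores_per_node)


speedup = required_speedup(TOTAL_TB, DEADLINE_WEEKS, BASELINE_TB_PER_WEEK)
nodes = nodes_needed(PARALLEL_WORKERS, CORES_PER_NODE)
print(f"required speed-up: about {speedup:.0f}x")
print(f"nodes for {PARALLEL_WORKERS} workers: {nodes}")
```

Under these assumptions the required speed-up is on the order of a thousand, and 1000/64 = 15.625 rounds up to the 16 nodes quoted on the slide.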

Data Pre-processing

● Data cleaning
● Data redundancy
● Data integrity
● Data quality measures
● Data checksums
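
As a minimal sketch of how checksums could support the redundancy and integrity items above: the slides do not name a hash function or a storage layout, so SHA-256, the helper names, and the sample documents below are all assumptions made for illustration.

```python
import hashlib


def checksum(data: bytes) -> str:
    """SHA-256 digest of a document's raw bytes (hash choice is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def deduplicate(documents):
    """Map each distinct checksum to the first document name that produced it."""
    seen = {}
    for name, data in documents:
        seen.setdefault(checksum(data), name)
    return seen


def verify(data: bytes, expected_digest: str) -> bool:
    """Integrity check: does the data still match its recorded checksum?"""
    return checksum(data) == expected_digest


# Hypothetical sample documents; the second is a byte-for-byte copy.
docs = [
    ("bill_1.txt", b"Invoice 42: 100 USD"),
    ("bill_1_copy.txt", b"Invoice 42: 100 USD"),
    ("check_7.txt", b"Check 7: 250 USD"),
]
unique = deduplicate(docs)
print(sorted(unique.values()))
```

Exact-match hashing only catches identical copies; near-duplicates (rescans, reformatted text) would need a different technique.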

Tag Data Upon Creation

Bill, Transactions, Payment, Receipt, Checks
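
One way to picture tagging at creation time is to make the tag a required field of the record itself. This is only a sketch: the `DocTag` and `Document` types and the `create_document` helper are hypothetical, and the tag set is one reading of the labels on this slide.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class DocTag(Enum):
    """Tag set read off the slide; the exact categories are an assumption."""
    BILL = "bill"
    TRANSACTION = "transaction"
    PAYMENT = "payment"
    RECEIPT = "receipt"
    CHECK = "check"


@dataclass(frozen=True)
class Document:
    content: str
    tag: DocTag  # required, so an untagged document cannot be created
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def create_document(content: str, tag: str) -> Document:
    """Build a document; raises ValueError if the tag is not a known one."""
    return Document(content=content, tag=DocTag(tag.lower()))


doc = create_document("Paid 250 USD to ACME", "Payment")
print(doc.tag.name, doc.created_at.isoformat())
```

Rejecting unknown tags at creation keeps the later cleaning and classification steps from having to guess a document's type.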

Data Quality

Data Cleaning

Using Amazon Mechanical Turk

YES! We need crowdsourcing for accurate manual review!
● Low cost, high accuracy, and fast
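
The slides do not say how answers from several workers are combined. A common approach, assumed here purely for illustration, is to send each document to multiple workers and take a majority vote, setting ties aside for further review. The document IDs and votes below are made up.

```python
from collections import Counter


def majority_label(votes):
    """Return the most common label, or None for no votes or a tie."""
    counts = Counter(votes).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: needs another reviewer
    return counts[0][0]


# Hypothetical worker answers per document.
worker_votes = {
    "doc_001": ["legal", "legal", "illegal"],
    "doc_002": ["illegal", "illegal", "illegal"],
    "doc_003": ["legal", "illegal"],
}
labels = {doc: majority_label(v) for doc, v in worker_votes.items()}
print(labels)
```

Using an odd number of workers per document avoids most ties when there are only two labels.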

AMT

Using decision tree for classification

● First, create the training dataset from the AMT results (~20%-30% of the whole data)

● Train the decision tree on the training dataset and use the cleaned data as the test dataset

● The leaf nodes represent legal and illegal documents
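
The three steps above can be sketched with a small hand-written tree. The slides do not say which algorithm or library was used, so the Gini-based splitting here is an assumption, and the two numeric features (an amount and a count of flagged keywords) and all data values are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Internal nodes are (feature, threshold, left, right); leaves are labels."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (
        f, t,
        build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    )


def predict(tree, row):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


# Hypothetical AMT-labelled training rows: [amount_usd, flagged_keywords].
train_rows = [[120, 0], [80, 1], [9500, 4], [200, 0], [15000, 6], [7000, 5]]
train_labels = ["legal", "legal", "illegal", "legal", "illegal", "illegal"]
tree = build_tree(train_rows, train_labels)

# The remaining cleaned documents are then classified by the trained tree.
print(predict(tree, [150, 0]), predict(tree, [12000, 3]))
```

Each leaf holds either "legal" or "illegal", matching the last bullet; a production setup would more likely use an established library and hold out part of the labelled data to measure accuracy.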

DT in general

A sample DT for finding illegal cases