
Utilizing Technology Assisted Review (TAR)

on the Enron Data Set in Eclipse

by Amanda L. Best, JD LightSpeed, LLC

January 2015


Contents

Overview
Initializing the TAR Project
Phases of the TAR Project
Training Phase
Validation Phase
    Discrepancy Reports
    Project Score Distribution Reports
    Project Summary Reports
    Validation Details Reports
    Findings from the Validation Rounds
Certification Phase
Summary
    Disadvantages and Negatives
    Advantages and Positives
Conclusion


Overview

Technology Assisted Review (“TAR”) combines computer analytics with human expertise to cull a set of responsive documents from a much larger set of data. The computer gradually learns what is responsive through an iterative process wherein the human subject matter expert (“SME”) codes and validates documents. This is otherwise known as predictive coding.1

The goal of TAR or predictive coding is to train the computer to identify a larger set of responsive documents based on the coding of a smaller set of documents so that the user can efficiently review the most important, responsive documents rather than performing a laborious, costly linear review of all the documents in the set.

For this example project, I used the Enron data set, which has been released to the public under a Creative Commons license. The data set totals 482,576 documents. By far, .MSG files were the most common file type in the set.

1 “Predictive coding” is defined as a “technology-assisted review process involving the use of a machine learning algorithm to distinguish relevant from non-relevant documents, based on a subject matter expert’s coding of a training set of documents.” The Grossman-Cormack Glossary of Technology-Assisted Review, http://www.edrm.net/resources/glossaries/grossman-cormack.

[Figure: Top 10 File Types by document count (MSG, URL, DOC, XLS, DAT, PDF, JPG, PPT, WMF, EML)]


Initializing the TAR Project

Prior to beginning the TAR project, the project’s goals must be defined. Decide which issues or types of documents you are seeking from the data, and what parameters must be met before a reviewer or SME declares a document responsive. In this example case, I decided to search the Enron data set for data related to contracts, agreements, or legal department documentation. I expected that my responsive documents would largely be emails, Word documents, Adobe/.PDF files, or similar file types. For the purpose of this example, I kept my parameters fairly broad and did not limit responsive documents to any particular individuals, groups, or a narrow set of key words or phrases.

An analytics index and clusters2 must be created prior to beginning a TAR project in Eclipse. Although this data set had been clustered into sixteen (16) sub-clusters, I selected the entire case cluster as the basis for the TAR project.

2 “Clustering” refers to “[a]n unsupervised learning method in which documents are segregated into categories or groups so that the documents in any group are more similar to one another than to those in other groups. Clustering involves no human intervention, and the resulting categories may or may not reflect distinctions that are valuable for the purpose of a search or review effort.” The Grossman-Cormack Glossary of Technology-Assisted Review, http://www.edrm.net/resources/glossaries/grossman-cormack.


Phases of the TAR Project

A TAR project in Eclipse has three (3) distinct phases: training, validation, and certification.

The training set is built from the Eclipse platform’s search of the case cluster, which identifies the documents to be reviewed. The reviewer or SME reviews the training set documents and codes them as responsive or non-responsive. This tells the platform which types of documents, in which clusters, are most likely to be responsive. Based on the SME’s coding, Eclipse’s algorithm infers how to distinguish between responsive and non-responsive documents beyond those contained in the training set.

The validation sets consist of several random samples of the remaining documents, which are reviewed to verify Eclipse’s decisions. They validate the algorithm’s results. Documents within validation sets have not yet been reviewed and were not included in the training set.

Certification is optional but useful when the goal is to limit the number of documents reviewed by humans.

Analytics categories must be built after the training phase is complete and must be re-built after each validation round. Within Eclipse TAR projects, the platform builds a “Responsive” category and a “Non-Responsive” category based on the training and validation examples.

[Figure: TAR project workflow, from Ipro Eclipse Administrator’s Guide, Eclipse 2014.1.0, Q3 2014]


Training Phase

The only setting the user determines for the training phase is the size of each batch. Here, Eclipse determined that 3,079 documents should be reviewed for the training phase. These documents comprised less than 1% (approximately 0.64%) of the total 482,576 documents in the database. They were batched out in sets of 100 documents per batch (with the last batch containing 79 documents). I coded 393 of the training phase documents as “responsive” and 2,686 of the documents as “non-responsive.” Based on my selections, the platform’s algorithm predicted that 13% of the total data set—or 61,577 documents—would be responsive.
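Eclipse does not disclose its projection formula, but the projected counts are consistent with a straight extrapolation of the training-set responsive rate to the whole data set. A minimal sketch of that arithmetic, with the extrapolation assumption being mine and the figures taken from this project:

```python
# Sketch: extrapolating the training-set responsive rate to the full data set.
# The extrapolation assumption is mine; Eclipse's internal weighting may differ slightly.
total_docs = 482_576          # documents in the Enron data set
training_docs = 3_079         # documents Eclipse selected for the training phase
training_responsive = 393     # documents I coded responsive during training

responsive_rate = training_responsive / training_docs          # ~0.128, i.e. about 13%
projected_responsive = round(responsive_rate * total_docs)     # ~61,600; Eclipse reported 61,577
projected_non_responsive = total_docs - projected_responsive

print(f"training responsive rate: {responsive_rate:.1%}")
print(f"projected responsive:     {projected_responsive:,}")
print(f"projected non-responsive: {projected_non_responsive:,}")
```

The small gap between this back-of-the-envelope figure and Eclipse’s reported 61,577 is most likely rounding.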

[Figures: Training Set Responsive (393) vs. Training Set Non-Responsive (2,686); Projected Responsive (61,577, 13%) vs. Projected Non-Responsive (420,999, 87%)]

Once all the training batches were completed, I built the analytics categories. The computer did not categorize all 482,576 documents in the data set. Rather, it categorized approximately 96% (roughly 463,000) of those documents. The remaining 4% (roughly 19,000) were not categorized as they were determined to be unusable for categorization. This is common in media files like MP4, database files, and other non-standard file types that have no real valid conceptual data. Items with a coherency3 score of less than 50 will not be categorized.

Of the 96% of the documents that were categorized, the computer categorized 13% (approximately 60,000 documents) as responsive and 87% (approximately 403,000 documents) as non-responsive.

Validation Phase

You can run as many validation rounds as you desire. For the purposes of this TAR example, I completed 15 validation rounds; a total of 1,802 documents were batched and reviewed during this phase. When you move forward to the validation phase, you have the option of telling Eclipse to select either a fixed-size sample of documents or a statistical sample of documents. A fixed sample simply designates the number of random documents to review per batch. A statistical sample creates a sample using statistical calculations based on a confidence level and a margin of error. As defined by Eclipse,

Confidence level represents the reliability of an estimate, that is, how likely it is that the sample returned will be representative of all documents in the set. The larger the confidence level is, the narrower the range of documents will be that are considered to be representative.4

The margin of error refers to the amount of error to be allowed in the sample, where 1 is the least amount of error (and returns the most documents) and 5 is the highest amount of error (and returns the fewest documents).

3 “Coherency” refers to the conceptual similarity between documents. A high coherency score indicates that a cluster contains a highly-related group of documents.

4 Ipro Eclipse Administrator’s Guide, Eclipse 2014.1.0, Q3 2014, p. 15-7.


For the first three rounds, I used a statistical sample. I chose a confidence level of 95 with a margin of error of 5.

For the remaining twelve rounds, I used a fixed sample. One was a fixed sample of 100 documents (the 9th validation round) while the remaining rounds were fixed samples of 50 documents each. I decided to vary the sample size in the 9th round for two reasons: (1) I was spending more time re-building categories than reviewing documents when the batches were a small set of 50 each; and (2) I wanted to see how the sample size affected the results. What I eventually discovered was that the discrepancy percentage spiked when I transitioned from a 50-document sample set to a 100-document sample set, and then decreased when I returned to a 50-document sample set.
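Eclipse’s guide does not spell out the sample-size arithmetic, but the 95/5 statistical batches in this project each contained 384 documents, which is what the standard sample-size formula for estimating a proportion produces. A sketch, with the standard-formula assumption flagged:

```python
import math

# Assumption (mine): Eclipse's statistical sample follows the textbook sample-size
# formula for a proportion; the 384-document batches observed here are consistent with it.
def sample_size(confidence_z: float, margin_of_error: float, population: int) -> int:
    """Sample size at worst-case proportion p = 0.5, with finite-population correction."""
    n0 = (confidence_z ** 2) * 0.25 / (margin_of_error ** 2)
    n = n0 / (1 + (n0 - 1) / population)          # finite-population correction
    return math.ceil(n)

# 95% confidence (z ~ 1.96), 5% margin of error, ~480,000 documents
print(sample_size(1.96, 0.05, 482_576))   # -> 384, matching the observed batch size
```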

Discrepancy Reports

After each round, Eclipse provided a discrepancy report showing the number of documents it had predicted as non-responsive that I decided were actually responsive, and the number of documents it had predicted as responsive that I later coded as non-responsive. The platform had incorrectly predicted the designation of 138 of the 1,802 documents (or 7.66%) reviewed during the validation phase.

[Figure: Discrepancies per validation round (1st through 15th): number incorrectly categorized as non-responsive vs. number incorrectly categorized as responsive]


The number of discrepancies began to level out after the tenth round, decreasing to 6% per batch, and then dropped sharply in the last two rounds, with only 2% of each batch containing discrepancies.

[Figure: Discrepancies as a percentage of each validation batch]

Project Score Distribution Reports

Eclipse provided a project score distribution report which reflected the category concept relevancy score for each round. This category score (on a range from 0 to 100) expands or narrows the evaluation for conceptually matching documents. A higher number represents higher precision but less recall, and a lower number represents lower precision but higher recall. The default value is 50. In this example TAR project, there were no documents scored between 0 and 50. Of the responsive documents, a fairly consistent 5% were scored at 76 or above (eventually totaling 24,827 documents), while the responsive documents scored between 51 and 75 dropped from 7% to 6% (28,546 documents) during the validation phase.
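To see why a higher category score trades recall for precision, consider a purely illustrative sketch that treats the score as a cutoff over a handful of hypothetical documents. The scores and labels below are made up for illustration and are not from this project, nor is this how Eclipse necessarily applies the score internally:

```python
# Illustrative only: hypothetical concept-relevancy scores, showing why a higher
# cutoff raises precision but lowers recall.
docs = [  # (category score, actually responsive?) -- made-up values
    (95, True), (88, True), (81, True), (77, False),
    (70, True), (63, False), (58, False), (52, False),
]

def precision_recall(cutoff):
    predicted = [(score, resp) for score, resp in docs if score >= cutoff]  # treated as responsive
    true_positives = sum(resp for _, resp in predicted)
    total_responsive = sum(resp for _, resp in docs)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / total_responsive
    return precision, recall

for cutoff in (50, 76):
    p, r = precision_recall(cutoff)
    print(f"cutoff {cutoff}: precision {p:.2f}, recall {r:.2f}")
    # cutoff 50: precision 0.50, recall 1.00
    # cutoff 76: precision 0.75, recall 0.75
```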

[Figure: Responsive documents by score band (51–75 vs. 76–100) across validation rounds]


Of the non-responsive documents, those scored between 76 and 100 increased significantly from 44% after the first validation round to 51% after the fifteenth validation round. The non-responsive documents scoring between 51 and 75 decreased from 44% to 37%.

After the first validation round, Eclipse had determined that 53,055 documents were responsive and 407,940 documents were non-responsive (for a total of 460,995 documents). After the last validation round, Eclipse had adjusted its predictions to 53,373 responsive documents and 411,292 non-responsive documents (for a total of 464,665 documents).

[Figure: Non-responsive documents by score band (51–75 vs. 76–100) across validation rounds]

[Figure: Total responsive vs. total non-responsive documents, 1st validation round vs. 15th validation round]


Project Summary Reports

After each validation round, Eclipse provided a project summary report that detailed the round’s precision, recall, and FMeasure, along with Eclipse’s projected total responsive and non-responsive documents.

Precision measures the accuracy of the program’s identification of documents as responsive. A higher precision value means a more accurate result, in that the documents the program identified as responsive were reviewed and found to actually be responsive. In this example TAR project, Eclipse’s precision increased substantially beginning with the eighth validation round, even attaining a precision score of 1.0 in two rounds (with 0.0 being the lowest and 1.0 being the highest score). The average precision value across all validation rounds was 0.7011.

Recall is a measure of completeness that is based on the number of documents tagged as responsive compared to the total number of truly responsive documents.5

5 Ipro Eclipse Administrator’s Guide, Eclipse 2014.1.0, Q3 2014, p. 15-18.

[Figure: Precision by validation round (1st through 15th)]

[Figure: Recall by validation round (1st through 15th)]


The average recall rate across all validation rounds was 0.6843, but the last four rounds scored at 0.75 or above, including one that scored at 1.0.

FMeasure is a method for factoring both precision and recall to indicate how well the TAR process has identified truly responsive documents.6 Here, the average FMeasure across all validation rounds was 0.6787, with the highest individual scores in the last three rounds.
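FMeasure is conventionally computed as the harmonic mean of precision and recall. Assuming Eclipse uses that standard definition, the figures in the final Project Summary report (precision 0.8, recall 1, FMeasure 0.8889, shown in the Certification Phase section) are consistent with it:

```python
# Assumption: Eclipse's FMeasure is the standard F1 score, i.e. the harmonic mean
# of precision and recall. The final Project Summary values agree with this.
def f_measure(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(precision=0.8, recall=1.0), 4))   # -> 0.8889, as reported
```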

After each validation round, Eclipse projected the total responsive and total non-responsive documents from the entire database. Here, the total projected responsive documents only varied by 1,762 documents, from 52,802 at the lowest (ninth round) to 54,564 at the highest (second round).

6 Ipro Eclipse Administrator’s Guide, Eclipse 2014.1.0, Q3 2014, p. 15-18.

[Figure: FMeasure by validation round (1st through 15th)]

[Figure: Estimated responsive document counts by validation round]


Eclipse’s projections of the total non-responsive documents varied by 3,236 documents, from 410,304 at the lowest (second round) to 413,540 at the highest (tenth round).

[Figure: Estimated non-responsive document counts by validation round]

Validation Details Reports

Eclipse also provides a validation details report after each validation round. It supplies information found in other reports, including precision, recall, and FMeasure rates. Additionally, it shows the number of responsive documents both the human reviewer and the computer tagged as responsive (i.e., responsive – true), along with the number of non-responsive documents both the human reviewer and the computer tagged as non-responsive (i.e., non responsive – true). Also, it reflects the number of documents which the reviewer tagged as responsive that the TAR analysis projected as non-responsive (i.e., responsive – false), and the number of documents the reviewer tagged as non-responsive that the computer projected as responsive (i.e., non responsive – false).

Review Round            Responsive - True   Non Responsive - True   Responsive - False   Non Responsive - False
1st Validation Round           28                  324                     11                    21
2nd Validation Round           34                  315                     23                    12
3rd Validation Round           35                  324                     16                     9
4th Validation Round            4                   41                      2                     3
5th Validation Round            2                   45                      1                     2
6th Validation Round            4                   41                      4                     1
7th Validation Round            3                   40                      4                     3
8th Validation Round            1                   48                      0                     1
9th Validation Round            7                   83                      4                     6
10th Validation Round           4                   43                      1                     2
11th Validation Round           3                   44                      1                     2
12th Validation Round           3                   44                      2                     1
13th Validation Round           6                   40                      3                     1
14th Validation Round           7                   42                      0                     1
15th Validation Round           4                   45                      1                     0
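As a quick consistency check, totaling the four columns of the table reproduces the aggregate figures cited earlier: 1,802 documents reviewed during validation and 138 discrepancies (7.66%). A short sketch:

```python
# Totals from the Validation Details table above; each row is
# (responsive_true, non_responsive_true, responsive_false, non_responsive_false).
rounds = [
    (28, 324, 11, 21), (34, 315, 23, 12), (35, 324, 16, 9),   # rounds 1-3: 384-doc statistical samples
    (4, 41, 2, 3), (2, 45, 1, 2), (4, 41, 4, 1), (3, 40, 4, 3),
    (1, 48, 0, 1), (7, 83, 4, 6),                             # round 9 was the 100-document batch
    (4, 43, 1, 2), (3, 44, 1, 2), (3, 44, 2, 1),
    (6, 40, 3, 1), (7, 42, 0, 1), (4, 45, 1, 0),
]

total_reviewed = sum(sum(row) for row in rounds)                               # 1,802
discrepancies = sum(r_false + nr_false for _, _, r_false, nr_false in rounds)  # 138

print(total_reviewed, discrepancies, f"{discrepancies / total_reviewed:.2%}")  # 1802 138 7.66%
```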

Findings from the Validation Rounds

I noticed that the discrepancy percentages fluctuated significantly until the 10th validation round, when the percentages flattened and then dropped to a mere 2% in the 14th and 15th rounds. Similarly, precision, recall, and FMeasure all increased and became more stable beginning with the 10th round. I decided to discontinue validation rounds after the 15th round given the stability of the last 5 rounds and the relatively small variation in the projected number of responsive documents.

Certification Phase

Once the validation phase is complete, you have the option of re-building the analytics categories and proceeding with a final review of the documents. Alternatively, you may re-build the analytics categories and create a final certification batch of documents when the goal is to limit the number of documents to be reviewed by humans and where documents found to be non-responsive will not be reviewed. As with validation batches, you have the option of either a fixed sample or a statistical sample for the certification batch.

I selected a statistical sample with a confidence level of 95 and a margin of error of 5. Eclipse then batched 384 documents in one batch for the certification phase. Of those 384 documents, I tagged 366 (or 95.31%) as non-responsive.

As the margin of error was 5, Eclipse determined the certification error percentage as -5, and further estimated that 40,019 responsive documents remained in the TAR project that had not been categorized as responsive. The formula used to make that determination relies on:

• % Combined Non Responsive (percentage of documents in the certification round that were tagged as non-responsive);

• Certification Error % (the margin of error defined for the certification round, expressed as a plus/minus range); and

• Number of documents in the TAR project categorized as non-responsive.

Eclipse provided a Project Summary report:


Documents in Categorization Set             482,576
Training Document Set                       3,079
Training Set Responsive                     393
Training Set Non Responsive                 2,686
Training Set Responsive Percentage          13%
Projected Responsive                        61,577 (13%)
Projected Non Responsive                    420,999 (87%)
Current Validation Round Precision          0.8
Current Validation Round Recall             1
Current Validation Round FMeasure           0.8889
Current Estimated Responsive Counts         53,628
Current Estimated Non Responsive Counts     412,950
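The report lists the inputs to the remaining-responsive estimate without showing how they combine. One plausible reconstruction, which is my assumption and not Eclipse’s documented formula, multiplies the non-responsive count by the combined non-responsive shortfall plus the certification error; it lands within a few documents of the reported 40,019, with the small gap likely due to rounding or a slightly different base count:

```python
# Assumption (mine): the remaining-responsive estimate multiplies the non-responsive
# count by (1 - combined_non_responsive + certification_error). This is a plausible
# reconstruction, not Eclipse's documented formula.
non_responsive_count = 412_950        # estimated non-responsive count from the summary above
combined_non_responsive = 0.9531      # 366 of 384 certification documents tagged non-responsive
certification_error = 0.05            # margin of error defined for the certification round

estimated_remaining = non_responsive_count * (1 - combined_non_responsive + certification_error)
print(round(estimated_remaining))     # ~40,015; Eclipse reported 40,019
```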

Therefore, I reviewed a total of 5,265 documents out of 482,576 (or 1.09% of the total), and Eclipse ultimately determined that 53,628 (or 11.11%) are responsive and 412,950 (or 85.57%) are non-responsive, while the remaining 15,998 (or 3.32%) are unusable for categorization. After the TAR project completed, I reviewed random documents from the responsive and non-responsive categories and found that nearly all of the Word documents and many of the emails and Adobe/.PDF documents marked “responsive” were, indeed, responsive. Other file types—including Excel spreadsheets, PowerPoint presentations, or anything largely comprised of numbers—that were marked “responsive” were sometimes, but not often, responsive.

[Figure: Documents reviewed (5,265) vs. total documents in the data set (482,576)]


Summary

Disadvantages and Negatives

First, the Enron data set contains hundreds of different file types and quite a bit of data debris. For example, there are 70,000 documents in this database that are nothing but URLs and internet shortcuts; they were completely useless and non-responsive, yet appeared frequently in the TAR batches as random samples.

Second, the time to review a batch may have been short, but re-building the analytics categories between batches took at least 20 minutes each time, and I had to wait for the re-building to complete before moving on to another task in the TAR project. This significantly impedes the SME’s review rate: it takes 5 total hours just to re-build the categories for 15 batches of documents, not including the time to actually review the documents.

Third, the program seemed to have difficulty determining the responsiveness of files that contained little to no text. This is not surprising, as TAR analytics are based on concept clustering and it is difficult—if not impossible—to cluster numerical concepts.

Fourth, there are no 100% guarantees on anything, including analytics and TAR. A margin of error is built into the program’s algorithm, and it is a bit disconcerting to see a final report indicating there may be another 40,000 responsive documents (or 8% of the original 482,576) remaining in the TAR project that were not categorized as responsive. Nevertheless, a straight linear review conducted by humans also carries a significant margin of error. Neither method is perfect.

[Figure: Final categorization: Responsive 53,628; Non-Responsive 412,950; Unusable 15,998]


Advantages and Positives

The primary advantage is that I reviewed only 1% of the documents in order to exclude 89% (or 428,948 documents) as non-responsive and/or unusable.

Although document review rates vary, let us make a reasonable assumption that a reviewer can review 50 documents per hour for 8 hours per day (400 documents per day), 5 days per week (2,000 documents per week), at a cost of $30.00 per hour. It would take a team of 40 reviewers 6 weeks to review all 482,576 documents at a cost of $288,000. Even if you double the review rate so that a reviewer can review 100 documents per hour, it would still take a team of 40 reviewers 3 weeks to review all the documents at a cost of $144,000. That is a significant expenditure of time and money.

However, if the firm used a TAR process similar to the example herein, it could have its SME review 1% of the total documents for the TAR process. Because the TAR review considers only responsiveness, the SME will most likely move through the documents quickly. Indeed, I was able to review a batch of 50 documents for responsiveness at a rate of 20 minutes per batch. If a firm’s SME could review the same 5,265 documents at the same rate, the SME would be able to review 1,200 documents per 8-hour day and complete their portion of the review in less than 4.5 days. Assuming the SME’s time is billed at $300 per hour, the resulting cost is $10,800.

Then, the firm could prioritize the 53,628 “responsive” documents, which could be reviewed by 40 reviewers within 3.5 days (at a rate of 50 documents per hour) at a cost of $33,600. If those reviewers could double their review rate to 100 documents per hour, the top-priority “responsive” documents would be reviewed within 13.4 hours at a cost of $16,080. Even including the SME’s time and expense, the review of the “responsive” documents could be completed within 8 days for less than $45,000. Additionally, if the firm took steps to quickly exclude data debris early on, it could further reduce the total number of documents used in the TAR process and further cut down on the total time and cost of reviewing the “non-responsive” documents, assuming it chooses to review them.
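The cost comparison above follows directly from its stated assumptions; the reviewer and SME rates, hours, and billing figures are all taken from the text, and the exact dollar amounts differ only slightly from the rounded day and week counts used above. A sketch that reproduces the headline numbers:

```python
# Reproducing the cost comparison from its stated assumptions:
# 40 reviewers, 8-hour days, 5-day weeks, $30/hour; SME at 150 docs/hour, billed at $300/hour.
TOTAL_DOCS, RESPONSIVE_DOCS, SME_DOCS = 482_576, 53_628, 5_265
REVIEWERS, HOURS_PER_DAY, DAYS_PER_WEEK, REVIEWER_RATE = 40, 8, 5, 30.00

def linear_review(docs_per_hour):
    hours = TOTAL_DOCS / (REVIEWERS * docs_per_hour)      # reviewer-hours per person
    weeks = hours / (HOURS_PER_DAY * DAYS_PER_WEEK)
    cost = REVIEWERS * hours * REVIEWER_RATE
    return round(weeks, 1), round(cost)

print(linear_review(50))    # -> (6.0, 289546): ~6 weeks; the text rounds the cost to $288,000
print(linear_review(100))   # -> (3.0, 144773): ~3 weeks; the text rounds the cost to $144,000

# TAR path: the SME reviews the training, validation, and certification documents...
sme_hours = SME_DOCS / 150                            # ~35 hours, i.e. under 4.5 eight-hour days
sme_cost = sme_hours * 300.00                         # ~$10,500 (the text rounds up to $10,800)

# ...then 40 reviewers handle only the projected responsive documents at 50 docs/hour.
resp_hours = RESPONSIVE_DOCS / (REVIEWERS * 50)       # ~26.8 hours, i.e. about 3.5 days
resp_cost = REVIEWERS * resp_hours * REVIEWER_RATE    # ~$32,200 (the text rounds up to $33,600)

print(round(sme_cost + resp_cost))                    # ~42,700 -- "less than $45,000" as cited
```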

If the Enron TAR project were a real case, I would make an effort to cull out data debris by building a narrower analytics index that excluded certain document types and “noise” or “stop” words (commonly used words that should be ignored, such as the, and, in, or of). I would consider running the index off certain sub-clusters rather than the entire case cluster. Then, I would perform the TAR process and, finally, batch out the “responsive” emails and Word documents as high priority, the rest of the “responsive” documents as medium priority, and all the “non-responsive” documents as low priority.


Conclusion

Overall, a firm faced with the time and costs associated with reviewing several hundred thousand documents would find the TAR process helpful. The time spent on TAR is minimal compared to the time and expense that would be spent performing a linear human review on nearly half a million documents. TAR saves an enormous amount of time and costs when 89% of the documents are excluded as non-responsive or low priority. Even if those non-responsive documents are eventually reviewed, there is great benefit in quickly identifying and reviewing the small number of top priority documents first.