The Ultimate Predictive Coding Handbook - Kroll … · The Ultimate Predictive Coding Handbook by...

Ediscovery White Paper US

The Ultimate Predictive Coding Handbook

A comprehensive guide to predictive coding fundamentals and methods.

The Ultimate Predictive Coding Handbook by KLDiscovery

Copyright © 2018 LDiscovery, LLC. All rights reserved. All other brands and product names are trademarks or registered trademarks of their respective owners.

This document is neither designed nor intended to provide legal or other professional advice, but is intended to be a starting point for research and information on the subject of electronic discovery. While every attempt has been made to ensure the accuracy of this information, no responsibility can be accepted for errors or omissions. Recipients of information or services provided by KLDiscovery shall maintain full, professional and direct responsibility to their clients for any information or services rendered by KLDiscovery.


Contents

What Is Predictive Coding?

Training the Predictive Coding System

Predictive Coding Workflows in Ediscovery

Validating Predictive Coding with Sampling

Conclusion


What Is Predictive Coding?

In simple terms, predictive coding is the use of a computer system to help determine which documents are representative of a defined category. The system performs this classification based on training it receives via human input (i.e., “machine learning”). By utilizing machine learning, the system can classify documents with remarkable accuracy - even documents humans have not yet seen.

In legal matters, predictive coding is most commonly used to identify documents that are relevant to a legal proceeding.

Predictive Coding in Ediscovery

For decades, litigants have relied on combinations of text/metadata searching and costly attorney review as a means for dealing with large volumes of data in ediscovery. As the growth of “big data” continues to exceed the economic feasibility of such approaches, the legal industry has turned to computers for assistance. More recently, predictive coding has been a secret weapon for advanced legal teams looking for an edge. Predictive coding is now widely accepted as a critical tool in the ediscovery process.

Predictive coding works for ediscovery by solving the following key problems:

■ Finding the right document as fast as possible

■ Sorting and grouping documents more efficiently

■ Validating work performed before production

[Figure: the predictive coding cycle (Train, Predict, Evaluate), with documents sorted into Responsive and Non-responsive.]


Busting Predictive Coding Myths

There is still apprehension toward the use of predictive coding by a minority of legal professionals. Below are the most common myths associated with predictive coding:

Myth: Predictive coding requires large time investment from expensive subject matter expert(s) (SMEs).

Busted: Relying on SMEs is a highly effective approach to predictive coding, but it is far from the only approach. In fact, some alternative approaches do away with SMEs entirely.

Myth: You need at least 10,000 (or 25,000 or 50,000...) documents to use predictive coding effectively.

Busted: There is no minimum data set size required for modern predictive coding systems. The benefit of predictive coding can be seen even on extremely small data sets.

Myth: Document review and production must wait until the predictive coding training is complete.

Busted: The concept of “complete” is something of a myth in and of itself. The benefit of predictive coding can be realized almost immediately, and the point at which the training process should end is a matter of cost-benefit, not a technical requirement.

Myth: Using predictive coding is expensive.

Busted: A properly managed predictive coding workflow utilizing modern systems will always yield a net savings over the technology cost. A vastly improved work product should also be considered, as that leads to substantial indirect savings over the life of a case.

Myth: Predictive coding cannot be used without understanding algorithms, learning strategies and everything else that happens inside the “black box.”

Busted: While a detailed understanding of technicalities may boost some users’ confidence, it is not necessary in order to have a successful project. Predictive coding works and there are very simple ways to prove it on every case.

Myth: All data must be present at the outset of predictive coding training.

Busted: The machine learning process will adjust to changing data sets without issue. Modern predictive coding systems are designed to be flexible and agile to the fluid nature of ediscovery.


Training the Predictive Coding System

For a predictive coding system to make accurate categorizations on its own, the system needs direction from humans. Understanding how a system “gets smart” is vital to proper use of predictive coding in ediscovery. This process is commonly referred to as the “training” phase.

During training, humans will review documents and select the issues that apply. Documents that are reviewed during training impact the quality of the classifications made by the system. It is therefore important that the trainers understand basic principles of how the system works.

Focus on the Text

Most predictive coding systems only work with text; therefore, trainers should only consider the text of the document when performing review.

Consistency Counts

Not all documents have to be coded perfectly; however, more consistent training will always yield better results.

The Four Corners Rule

Trainers should make decisions based on the contents of a single document and nothing else. Content contained in other documents should not be considered, even if those documents are related.

Selecting Training Documents

Using contextually valuable documents for training is key to a successful predictive coding project. This is also known as developing an effective learning strategy. The optimal learning strategy maximizes machine learning with the least possible amount of human effort.

Training documents can be identified either by humans or by the system itself. It is generally not recommended to rely solely on human selection; however, both options can be used together with great success.

User Selection (Seeds)

What: Training documents that are identified by humans outside of the predictive coding process. Often referred to as “seeds” or “seed sets.”

When: Generally used at beginning of training to kick-start (i.e., “seed”) the process, but can be utilized at any time.

Considerations: Not usually a random sample, contrary to popular belief.

Machine Selection

What: Training documents that are identified by the predictive coding system.

When: Almost always. When used in conjunction with seeds, machine selection helps to find the documents that seeds did a poor job of classifying.

Considerations: In general, there are four different types of machine selection. These are detailed in the next section.


Understanding the Types of Machine Selection

In general, there are four different types of machine selection used for predictive coding sampling. While these go by different names depending on the conversation, we will refer to them as: Active Learning, Prioritization, Stratified Sampling and Simple Random Sampling.

Active Learning

Focuses training on documents that have a high degree of uncertainty.

■ AKA: Focus Training, Uncertainty Sampling, TAR 1.0

■ Designed explicitly to reduce the burden of labeling (i.e., training)

■ Rankings/classifications are generally not adjusted once training completes

■ Advanced systems may incorporate automatic error correction and blind verifications
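As a rough illustration of uncertainty-based selection, the sketch below picks the documents whose predicted scores sit closest to the middle of the scale. The 0-to-1 score scale, the document IDs and the function name `select_uncertain` are assumptions made for this example and do not describe any particular predictive coding system.

```python
def select_uncertain(scores, batch_size):
    """Return the IDs of the documents whose scores are closest to 0.5.

    scores: dict mapping a document ID to a hypothetical relevance score
    between 0.0 (non-responsive) and 1.0 (responsive).
    """
    # Distance from 0.5 stands in for uncertainty: the smaller the
    # distance, the less sure the model is about the document.
    ranked = sorted(scores, key=lambda doc_id: abs(scores[doc_id] - 0.5))
    return ranked[:batch_size]


# Illustrative scores only; a real system would produce these from its model.
example_scores = {
    "DOC-1": 0.97,
    "DOC-2": 0.52,
    "DOC-3": 0.08,
    "DOC-4": 0.41,
    "DOC-5": 0.66,
}
uncertain_batch = select_uncertain(example_scores, batch_size=2)
print(uncertain_batch)
```

In this example the two documents sent to the trainer are the ones the model is least sure about, rather than the ones most likely to be responsive.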

Prioritization

The highest scoring documents are escalated to maximize the value of ongoing review.

■ AKA: Continuous Active Learning (CAL); TAR 2.0+

■ The easiest method to understand; takes the emphasis off statistics and difficult concepts

■ Rankings/classifications are updated on a regular basis

■ Advanced systems may incorporate some degree of Active Learning or Random Sampling
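Prioritization can be sketched in a few lines: rank the unreviewed documents by score and hand the top of the list to reviewers. The score scale, document IDs and the function name `prioritize` below are hypothetical and are shown only to make the idea concrete.

```python
def prioritize(scores, already_reviewed, batch_size):
    """Return the highest-scoring unreviewed document IDs, best first.

    scores: dict mapping a document ID to a hypothetical relevance score.
    already_reviewed: set of document IDs that humans have already coded.
    """
    candidates = [doc_id for doc_id in scores if doc_id not in already_reviewed]
    # Highest score first, so likely-responsive documents reach reviewers sooner.
    candidates.sort(key=lambda doc_id: scores[doc_id], reverse=True)
    return candidates[:batch_size]


# Illustrative scores only; a real system would refresh these on each update.
example_scores = {
    "DOC-1": 0.97,
    "DOC-2": 0.52,
    "DOC-3": 0.08,
    "DOC-4": 0.41,
    "DOC-5": 0.66,
}
next_batch = prioritize(example_scores, already_reviewed={"DOC-1"}, batch_size=2)
print(next_batch)
```

Because rankings are updated on a regular basis, a function like this would be rerun after each model update so that the next batch reflects the latest scores.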

Stratified Sampling

Selective random sampling among a manually created subset of documents.

■ AKA: Advanced Random Sampling

■ Less efficient than Active Learning or Prioritization and complicated to understand

■ Requires an understanding of how strata should be created

■ Useful for specific probability sampling, such as Quality Control and targeted validations
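A minimal sketch of stratified sampling follows, assuming the strata have already been created manually (here, hypothetically, by custodian). The stratum names, document IDs and the function name `stratified_sample` are illustrative assumptions.

```python
import random


def stratified_sample(strata, per_stratum, seed=0):
    """Draw a simple random sample of up to per_stratum IDs from each stratum.

    strata: dict mapping a stratum name to a list of document IDs.
    """
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    sample = {}
    for name, doc_ids in strata.items():
        # Never ask for more documents than the stratum contains.
        count = min(per_stratum, len(doc_ids))
        sample[name] = rng.sample(doc_ids, count)
    return sample


# Hypothetical strata, created by hand before sampling.
example_strata = {
    "custodian_a": ["A-1", "A-2", "A-3", "A-4", "A-5"],
    "custodian_b": ["B-1", "B-2", "B-3"],
    "custodian_c": ["C-1"],
}
strata_sample = stratified_sample(example_strata, per_stratum=2)
print(strata_sample)
```

How the strata are defined drives what the sample can say, which is why this method requires an understanding of how strata should be created.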

Simple Random Sampling

Random sampling among the entire data set or a randomly selected subset of the data set.

■ AKA: Simple Passive Learning/SPL, Random Sampling

■ The least efficient type of sampling, requiring a large time commitment from the trainers

■ When positive documents are rare, this method can fail completely

■ Useful for basic estimation, control sets and validation processes

■ Provides estimations on subsets of data; useful for quality control and targeted validations
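To see why random sampling can struggle when positive documents are rare, consider the chance that a random sample contains no positive documents at all. The sketch below assumes a population large enough that each draw can be treated as independent; the function name `prob_no_positives` and the figures used are illustrative only.

```python
def prob_no_positives(prevalence, sample_size):
    """Chance that a random sample contains zero positive documents.

    Assumes independent draws, which is a reasonable approximation when
    the sample is small relative to the population.
    """
    return (1.0 - prevalence) ** sample_size


# Hypothetical matter: 0.1 percent of documents are positive.
chance_of_empty_sample = prob_no_positives(prevalence=0.001, sample_size=500)
print(f"{chance_of_empty_sample:.2f}")
```

Under these assumptions, a 500-document sample has roughly a 61 percent chance of containing no positive documents, leaving the system with nothing positive to learn from.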


Predictive Coding Workflows in Ediscovery

There are many effective ways to improve ediscovery processes through the use of predictive coding. Depending on the requirements of the matter, however, certain approaches may be better suited than others. In this guide, we will focus on four applications of predictive coding that are not only common, but also extremely effective.

Predictive Coding Workflows

■ SME Training

■ Prioritized Review (aka CAL)

■ Hybrid Multimodal Review

■ Quality Control


SME Training

SME (Subject Matter Expert) Training is the traditional approach to predictive coding in ediscovery and most effectively utilizes Active Learning as the sampling methodology. This is what most people think of when they hear “predictive coding” or “TAR 1.0.”

PROS

■ Has the highest ceiling for efficiency and cost savings;

■ Creates a small yet very accurate training set, which can yield very good results with minimal effort;

■ Helps to quash logistic and cost concerns associated with very large databases;

■ Can be used as an alternative to review, a supplemental culling mechanism, or for more targeted review;

■ High benefit potential from seeding.

CONS

■ Most effective with one or more Subject Matter Experts (SME), who generally have limited bandwidth and high associated cost;

■ Moderate upfront effort can delay the start of a larger process;

■ Return on investment (ROI) can be an issue for small-to-medium matters;

■ Validation is strongly recommended and can add to the SME burden.

Training

A SME trains a number of documents until comfortable with the results. Typically, Active Learning will be used to select training documents; however, it is not uncommon to supplement with seeds or even random/stratified samples, especially when richness is low.

Process

The system will build a model to identify documents that have a high probability of correct classification into pre-defined categories. Once the model is created, legal teams incorporate the downstream process that best suits the case. This is often additional culling and/or automated/semi-automated review.

Validation

Validation is strongly recommended when using this approach in most, but not all, cases. Generally, a control set is used to get visibility into the efficacy of the model. Post-categorization “null set” sampling is also common to ensure the existence of false negatives is minimal.

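[Figure: SME Training workflow. Predictive coding sorts documents into four groups: Privileged + Relevant, Privileged + Irrelevant, Non-privileged + Relevant and Non-privileged + Irrelevant. Other labels in the figure: Confirmed and logged, Sampling, Production.]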


Prioritized Review (aka CAL)

Prioritized Review (aka CAL) is becoming a very popular predictive coding workflow in ediscovery. Its simplicity and very low learning curve essentially eliminate the barrier to entry often encountered with SME Training. Prioritized Review is very popular in scenarios where there is a desire or requirement to put human eyes on all or most documents at issue in a case. This is occasionally referred to as TAR 2.0 since it is a more recently adopted workflow.

PROS

■ Escalates important documents very rapidly and keeps irrelevant documents at bay;

■ Is the most efficient way to organize a human review of any data set, assuming that data set needs at least some amount of human review;

■ Can be used more effectively on multiple issues simultaneously than SME Training;

■ Enables even small review teams to be extremely productive;

■ Almost no barrier to entry - any legal team can take advantage at any time.

CONS

■ Potential for cost savings is reduced compared to SME Training;

■ The nature of legal review often requires that “families” (emails with attachments) be reviewed together, which inevitably interferes with a purely prioritized review;

■ Validation is strongly recommended if review stops before all documents are reviewed, which introduces SME burden to the workflow;

■ May create an unwieldy model composed of an excessive number of documents.

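[Figure: Prioritized Review workflow. Known relevant documents are an optional starting point; the highest rated documents are escalated to the front of review, and the system updates as review proceeds. Review all documents: no validation required. Stop early: sampling before production.]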


Training

The vast majority of training is simply the act of reviewing prioritized documents over time. The model is rebuilt at specified intervals to take advantage of new knowledge gained since the previous classification. Training usually commences with some combination of Active Learning, Random Sampling and seeds, but the impact of those samples diminishes as more documents are trained.

Process

A legal team starts review and the system learns from human decisions when the model is built during specified intervals. The highest scoring documents are placed at the front of the review, so they get assigned earlier. The ratio of responsive documents to nonresponsive documents is tracked and reviewed (usually daily) to determine when relevant documents have been exhausted. Most teams stop review once the rate reaches a point where further review is not justifiable.
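The daily tracking described above can be sketched as follows. As a simplification, this example tracks the fraction of each batch coded responsive rather than a responsive-to-nonresponsive ratio, and the threshold, the number of consecutive low batches and the function names are all hypothetical choices that a legal team would set for its own matter.

```python
def responsive_rate(batch_codes):
    """Fraction of documents in a reviewed batch that were coded responsive."""
    if not batch_codes:
        return 0.0
    return sum(1 for is_responsive in batch_codes if is_responsive) / len(batch_codes)


def should_consider_stopping(daily_batches, threshold, consecutive=2):
    """True when the most recent batches all fall below the threshold rate."""
    if len(daily_batches) < consecutive:
        return False
    recent = daily_batches[-consecutive:]
    return all(responsive_rate(batch) < threshold for batch in recent)


# Illustrative coding decisions, one list per day (True = responsive).
daily_batches = [
    [True] * 16 + [False] * 4,   # day 1: rate 0.80
    [True] * 8 + [False] * 12,   # day 2: rate 0.40
    [True] * 1 + [False] * 19,   # day 3: rate 0.05
    [False] * 20,                # day 4: rate 0.00
]
print(should_consider_stopping(daily_batches, threshold=0.10))
```

A check like this only signals that the question is worth asking; whether further review is justifiable remains a cost-benefit decision for the legal team.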

Validation

Validation is strongly recommended only if a legal team decides to stop before all documents are reviewed. In this scenario, a random sample of all remaining documents should be taken, often referred to as the null set or elusion set, to confirm there is no significant percentage of relevant documents remaining. Validation passes when a strong proportionality (diminishing returns) argument can justify discarding the remaining documents.
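As a worked illustration of null set sampling, the sketch below turns the results of a random sample of unreviewed documents into an estimated rate and count of relevant documents left behind. The sample size, counts and the function name `elusion_estimate` are hypothetical, and the result is an estimate that carries its own margin of error.

```python
def elusion_estimate(sample_codes, null_set_size):
    """Estimate how many relevant documents remain in the null set.

    sample_codes: list of booleans for a random sample drawn from the
    remaining documents (True = relevant on human review).
    null_set_size: total number of remaining, unreviewed documents.
    """
    elusion_rate = sum(1 for is_relevant in sample_codes if is_relevant) / len(sample_codes)
    estimated_remaining = round(elusion_rate * null_set_size)
    return elusion_rate, estimated_remaining


# Hypothetical sample: 12 relevant documents found among 1,500 sampled.
sample_codes = [True] * 12 + [False] * 1488
rate, remaining = elusion_estimate(sample_codes, null_set_size=200_000)
print(f"elusion rate {rate:.3%}, about {remaining} relevant documents remaining")
```

Figures like these feed the proportionality argument: the team weighs the estimated number of remaining relevant documents against the cost of reviewing everything that is left.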


Hybrid Multimodal Review

Hybrid Multimodal Review combines a Prioritized Review with ongoing SME input. The SME in this workflow utilizes many tools at her disposal to drive important and beneficial documents to review in an effort to minimize the review of irrelevant and redundant documents. There is no pre-defined script with this approach; instead, it is a fluid process that adjusts based on SME leadership.

PROS

■ Has most of the benefits of Prioritized Review, plus a heightened ceiling for efficiency;

■ When led by an experienced SME, this approach yields the highest quality of output of any method (as defined by Recall & Precision) on most cases.

CONS

■ A highly skilled SME - beyond the skill of a traditional SME - is required to drive the process for most of the project’s duration;

■ ROI can be an issue for small matters.

Training

Training is done through the entire review, much like in a Prioritized Review. A SME determines which documents are trained by using his or her best judgment and a combination of tools.

Process

Whether working alone or with a team, the SME is in charge of the process. Since the process will vary from case-to-case and SME-to-SME, it is both variable and subjective, but highly effective when done correctly. Generally, the SME will halt review before all documents are reviewed.

Validation

Validation is strongly recommended when using this approach to permanently eliminate documents from review. In this scenario, a random sample of all remaining documents should be taken, often referred to as the null set or elusion set, to confirm there is no significant percentage of relevant documents remaining. Validation passes when a strong proportionality (diminishing returns) argument can justify discarding the remaining documents.

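[Figure: Hybrid Multimodal Review workflow. Documents are escalated to the front of review by the SME, with optional inputs; review stops once all predictive coding and TAR tools have been utilized; a null set sample provides proof of whether further review might be needed before production. Scale shown in the figure: 80, 85, 90, 95, 99.]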


Quality Control

Predictive coding can also be used as a Quality Control (QC) mechanism, either exclusively or in conjunction with another workflow. Because the model will create a probabilistic ranking for every document, Review Managers can use that information to assess human performance and take remedial action when appropriate. Predictive coding can also help identify widespread problems, providing an opportunity to evaluate instructions and consistency between teams.

Responsive QC

Any human-coded documents drastically conflicting with the system’s suggestion are sent to a defined QC stage in the review workflow. Document scores allow Review Managers to pinpoint which documents are most likely incorrectly coded. Examining these documents as an ongoing QC check can also surface coding errors, which can be invaluable for ensuring complete productions.
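One simple way to find drastic conflicts is to compare each human decision with the document's score and flag the cases at opposite extremes. The cutoffs (0.9 and 0.1), the record layout and the function name `flag_conflicts` are assumptions for this sketch; a real workflow would tune the cutoffs to the matter.

```python
def flag_conflicts(records, high=0.9, low=0.1):
    """Return IDs where the human call sharply disagrees with the model score.

    records: list of (doc_id, human_responsive, score) tuples, where score
    is a hypothetical value between 0.0 and 1.0.
    """
    flagged = []
    for doc_id, human_responsive, score in records:
        if human_responsive and score <= low:
            # Coded responsive, but the model scored it very low.
            flagged.append(doc_id)
        elif not human_responsive and score >= high:
            # Coded non-responsive, but the model scored it very high.
            flagged.append(doc_id)
    return flagged


# Illustrative review records only.
review_records = [
    ("DOC-1", True, 0.95),
    ("DOC-2", False, 0.93),
    ("DOC-3", True, 0.04),
    ("DOC-4", False, 0.30),
]
print(flag_conflicts(review_records))
```

A flagged document is a candidate for a second look, not a confirmed error: either the reviewer or the model may be the one that is wrong.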

Privilege QC

Review Managers can utilize predictive coding to help identify privileged documents at risk of being inadvertently produced. Additionally, predictive coding can help reduce the burden of privilege logging by reducing the number of false positives.

Individual reviewer QC

Reporting on reviewer decisions measured against predictive rankings can help Review Managers identify problematic individuals on their teams. This is a useful tactic for both remediation and prevention, and is especially effective when used in conjunction with existing QC methodologies designed to flag performance issues.
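A report of this kind can be sketched as a per-reviewer disagreement rate. The 0.5 cutoff used to turn a score into a suggested coding, the record layout and the function name `disagreement_rates` are hypothetical.

```python
from collections import defaultdict


def disagreement_rates(decisions, cutoff=0.5):
    """Share of each reviewer's decisions that differ from the model's suggestion.

    decisions: list of (reviewer, human_responsive, score) tuples.
    """
    totals = defaultdict(int)
    conflicts = defaultdict(int)
    for reviewer, human_responsive, score in decisions:
        totals[reviewer] += 1
        model_suggests_responsive = score >= cutoff
        if human_responsive != model_suggests_responsive:
            conflicts[reviewer] += 1
    return {reviewer: conflicts[reviewer] / totals[reviewer] for reviewer in totals}


# Illustrative decisions only.
reviewer_decisions = [
    ("Reviewer A", True, 0.90),
    ("Reviewer A", False, 0.20),
    ("Reviewer B", True, 0.10),
    ("Reviewer B", False, 0.30),
]
print(disagreement_rates(reviewer_decisions))
```

A reviewer whose rate stands well apart from the rest of the team is a reasonable starting point for the remediation and prevention steps described above.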


Validating Predictive Coding with Sampling

Although sampling concepts and practices can be confusing, they are often necessary for the validation of the predictive coding project. Understanding sampling methods and derived metrics is an essential skill for evaluating success. Sampling can also be used in a number of other scenarios in ediscovery, even those unrelated to predictive coding.

Statistical Sampling vs. Judgmental Sampling

For predictive coding, there are two types of samples that are most commonly used:

■ A statistical sample is one in which every item was selected at random.

■ A judgmental sample is one in which any non-random factor was used in selecting the sample set. If a sample is not a statistical sample, then it is a judgmental sample.

Legal teams can draw conclusions on an entire document population from a random sample but not from a judgmental sample.

Statistical Sample: documents picked at random from the entire batch; reflective of the population.

Judgmental Sample: a group of relevant documents picked from the batch; not reflective of the population.


Key Terminology

To understand the benefits of sampling, a user must first grasp a few key terms. Learning these terms can help interpret results and assess how well a document review is progressing. The following terms are helpful to know for an effective review.

Population Size is the total number of items from which a random sample is taken.

Margin of Error (MOE) is the maximum amount by which an estimate based on the sample results is expected to deviate from the actual amount at the stated Confidence Level. Note that the Confidence Interval is a related measurement: it runs from the estimate minus the MOE to the estimate plus the MOE, so its width is equal to 2 x MOE.

Confidence Level refers to the probability that the range formed by the sample estimate plus or minus the margin of error includes the actual amount.

Prevalence (also called Richness) is the number of positive items in a sample divided by the total number of items in the same sample.
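The relationship between these terms can be made concrete with the standard normal-approximation formula for a proportion. This is a sketch under stated assumptions: the z-value of 1.96 corresponds to a 95 percent Confidence Level, a prevalence of 0.5 is the most conservative choice, the population is treated as large enough to ignore any finite population correction, and the function names are illustrative.

```python
import math


def margin_of_error(sample_size, prevalence=0.5, z=1.96):
    """Normal-approximation margin of error for a sampled proportion."""
    return z * math.sqrt(prevalence * (1.0 - prevalence) / sample_size)


def sample_size_for(moe, prevalence=0.5, z=1.96):
    """Smallest sample size whose margin of error does not exceed moe."""
    return math.ceil(z ** 2 * prevalence * (1.0 - prevalence) / moe ** 2)


needed = sample_size_for(moe=0.025)
print(needed, round(margin_of_error(needed), 4))
```

Under these assumptions, a sample of about 1,537 documents supports a margin of error of roughly 2.5 percentage points at a 95 percent Confidence Level; a tighter margin requires a larger sample.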

Applications of Sampling

Sampling has numerous applications in document review. The methods discussed below can be used to check quality, compare methods of coding and verify that all relevant documents are being queued for review.

A Point Estimate is created by applying the percentage of certain documents in a sample to the entire population. The number of positive (i.e., relevant) documents in the sample is used to estimate the number of positive documents in the population.
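The arithmetic behind a point estimate is short enough to show directly. The counts below are hypothetical, as is the function name `point_estimate`.

```python
def point_estimate(positives_in_sample, sample_size, population_size):
    """Project the sample's share of positive documents onto the population."""
    prevalence = positives_in_sample / sample_size
    return round(prevalence * population_size)


# Hypothetical sample: 150 relevant documents among 1,500 sampled,
# drawn at random from a population of 100,000 documents.
estimated_positives = point_estimate(150, 1500, 100_000)
print(estimated_positives)
```

Here 10 percent of the sample is relevant, so the point estimate is about 10,000 relevant documents in the population. The estimate is only meaningful when the sample was drawn at random.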

Control Sets are random samples that are used to conduct an efficacy assessment that is representative of the entire population. Control Sets are extremely useful for determining success of a classification effort without reviewing a larger portion of the population.

Null Sets (also called Elusion samples) are random samples taken from the entire population of negatively classified items.

[Figure: sampling concepts. Labels recovered from the figure: Sample; Control Set (Method 1, Method 2); Estimate; Margin of error; Percent chance result falls into circle; Proof further review may be needed.]


Deriving Metrics to Assess Predictive Coding Results

There are specific metrics that can help quantify the efficacy of a predictive coding project. Many predictive coding systems generate a report with these metrics to provide visibility into classifier performance. To understand these metrics, legal teams must first understand how they are derived.

Consider the scenario below showing classification of “responsive” vs. “non-responsive” documents (a common task in ediscovery). Legal teams can understand classifier performance by analyzing the accuracy of these decisions across a sample that is representative of the larger data set.

TRUE POSITIVE: document correctly suggested “responsive”

FALSE POSITIVE: document incorrectly suggested “responsive”

TRUE NEGATIVE: document correctly suggested “non-responsive”

FALSE NEGATIVE: document incorrectly suggested “non-responsive”

                            Actually responsive    Actually non-responsive
Suggested responsive        TRUE POSITIVE          FALSE POSITIVE
Suggested non-responsive    FALSE NEGATIVE         TRUE NEGATIVE


Precision, Recall, F-Measure

The most common metrics used in predictive coding projects are recall, precision and f-measure.

Precision is a measure of exactness. Precision measures the percentage of truly positive items within the subset of items classified as positive by the model.

1. EXACT, BUT NOT COMPLETE

This image shows 100 percent precision, but low recall. The only documents classified as positive were, in fact, positive, but many positive documents were left out.

Recall is the measure of completeness. Recall measures the percentage of all truly positive items in the sample that were classified as positive by the model.

2. COMPLETE, BUT NOT EXACT

This image shows 100 percent recall, but low precision. All documents were classified as positive, but many of these documents are actually negative.

F-measure is the weighted harmonic mean of Recall and Precision.

3. EXACT AND COMPLETE


These results are not perfect, but represent very good performance based on high recall and precision.
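The three metrics can be computed directly from the counts of true positives, false positives and false negatives. The sketch below uses the standard definitions, with `beta` as the weighting parameter (a value of 1.0 weights Recall and Precision equally); the counts and the function name `precision_recall_f` are illustrative.

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Return (precision, recall, f_measure) from classification counts.

    tp: true positives, fp: false positives, fn: false negatives.
    beta > 1 weights recall more heavily; beta < 1 favors precision.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    beta_squared = beta ** 2
    f_measure = (1 + beta_squared) * precision * recall / (beta_squared * precision + recall)
    return precision, recall, f_measure


# Hypothetical control set: 80 true positives, 20 false positives,
# 20 false negatives.
precision, recall, f_measure = precision_recall_f(tp=80, fp=20, fn=20)
print(f"precision {precision:.2f}, recall {recall:.2f}, F-measure {f_measure:.2f}")
```

In this example all three values come out to 0.80. True negatives do not appear in any of the three formulas, which is why these metrics stay informative even when most documents are non-responsive.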


Conclusion

Predictive Coding is a valuable tool, in some form or fashion, for any document review. Users of Predictive Coding are able to minimize costs and maximize efficiency during document reviews. The goal of this guide was to dispel the common myths around Predictive Coding and present the knowledge needed to use Predictive Coding in an effective manner. With data continuing to grow at rapid rates, the use of Predictive Coding will play a growing role in litigation and other document review. Organizations and individuals with systems in place to utilize Predictive Coding in anticipation of litigation will be best positioned to thrive, while unprepared organizations will face consequences.
