Auto-grouping Emails for Faster eDiscovery

India Research Lab

Sachindra Joshi, Danish Contractor, Kenney Ng*, Prasad M Deshpande, and Thomas Hampp*

IBM Research – India *IBM Software Group

Outline of the Talk

eDiscovery Process

A new way of eDiscovery Review: Group Level Review

Creating Syntactic Groups

Creating Semantic Groups

Experiments and Conclusion

eDiscovery Process

Discovery: process in the pre-trial phase
- Produce relevant information

eDiscovery: FRCP 2006 amendment
- Produce relevant Electronically Stored Information (ESI)
- Emails, chats, Word documents, presentations, etc.

Huge volumes of ESI
- Process is expensive
- 60% of cases warrant some form of eDiscovery
- A 4.8 billion dollar industry in 2011

eDiscovery Process

High cost is due to the review stage
- Lawsuit between the Clinton administration and tobacco companies (U.S. v. Philip Morris)

Apply Text Mining Techniques to reduce high costs involved in eDiscovery Process

Architecture of eDiscovery Review Systems

[Figure: architecture diagram; labeled components include the Named Entity Annotator, Language Annotator, and Signature Annotator]

Group Level Review

Review groups of documents that are “related” instead of individual documents
- Mark the whole group as responsive/unresponsive or privileged
- Efficient and consistent

- Syntactically similar documents: automated messages, near and exact duplicates

- Semantically similar documents: threads, semantic categories

Detecting Syntactic Groups: Automated Messages

Detecting Near Duplicates

S1: I am away from 17/2/2011 to 19/2/2011. Please mail xyz@in.ibm.com in case of any need

S2: I am away from 26/7/2011 to 31/7/2011. Please mail abc@us.ibm.com in case of any need

Notion of Similarity: Resemblance

$S_i$ = set of all chunks for sentence Si with window $k$

$$r(S_1, S_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}$$

Use fingerprinting (Rabin) instead of actual chunks.
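The resemblance of the two out-of-office messages S1 and S2 can be sketched as follows. This is a minimal illustration, not the talk's implementation: the function names are hypothetical, the window size of 3 words is an arbitrary choice, and a truncated MD5 digest stands in for a Rabin fingerprint, which the Python standard library does not provide.

```python
import hashlib


def chunks(text: str, k: int) -> set:
    """Return the set of all k-word windows (chunks) of the text."""
    words = text.split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def fingerprint(chunk: tuple) -> int:
    """Stand-in for a Rabin fingerprint: first 8 bytes of an MD5 digest."""
    digest = hashlib.md5(" ".join(chunk).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def resemblance(a: str, b: str, k: int = 3) -> float:
    """Size of the intersection of the fingerprint sets over their union."""
    fa = {fingerprint(c) for c in chunks(a, k)}
    fb = {fingerprint(c) for c in chunks(b, k)}
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / len(fa | fb)


s1 = ("I am away from 17/2/2011 to 19/2/2011. "
      "Please mail xyz@in.ibm.com in case of any need")
s2 = ("I am away from 26/7/2011 to 31/7/2011. "
      "Please mail abc@us.ibm.com in case of any need")
print(resemblance(s1, s2))
```

Comparing fingerprints rather than the chunks themselves keeps each set element a fixed-size integer, which is what makes the signature scheme on the following slides practical.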

Efficient Detection of Near Duplicates

A document of n words yields n - K + 1 chunks with a window size of K

It suffices to keep for each document a relatively small fixed size signature

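Let $S_n$ be the set of permutations of $[n]$, and let $\pi$ be chosen uniformly at random over $S_n$

$$S_D \subseteq \{0, \ldots, n-1\} = [n]$$

$$\Pr\big(\min\{\pi(A)\} = \min\{\pi(B)\}\big) = \frac{|A \cap B|}{|A \cup B|} = r(A, B)$$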
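The min-wise property stated on this slide (the probability that the minimum of a random permutation over A equals the minimum over B is the resemblance of A and B) can be checked exactly on a small universe. The sketch below enumerates every permutation of [n] for n = 5; the sets A and B and the function names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import permutations


def min_agreement_probability(a: set, b: set, n: int) -> Fraction:
    """Exact Pr(min pi(A) == min pi(B)) over all permutations pi of [n]."""
    agree = 0
    total = 0
    for perm in permutations(range(n)):  # perm[x] plays the role of pi(x)
        total += 1
        if min(perm[x] for x in a) == min(perm[x] for x in b):
            agree += 1
    return Fraction(agree, total)


def resemblance(a: set, b: set) -> Fraction:
    """Size of the intersection over size of the union."""
    return Fraction(len(a & b), len(a | b))


A = {0, 1, 2}
B = {1, 2, 3}
print(min_agreement_probability(A, B, 5), resemblance(A, B))
```

The two values agree because the element of A ∪ B that receives the smallest image is equally likely to be any element of the union, and the minima coincide exactly when that element lies in A ∩ B.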

Signature Annotator

In practice, choosing permutations uniformly at random is hard

Use a set of n one-to-one functions fi and keep only the smallest value for each fi

Keep only the j least significant bits of each value
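A signature built this way might look like the sketch below. The choice of functions is an assumption, not the talk's: each fi is taken to be x -> (a*x + b) mod P for a prime P, which is one-to-one on [0, P), and fingerprints are reduced mod P before use. The names `make_functions`, `signature`, and `estimated_resemblance` are hypothetical.

```python
import random

P = (1 << 61) - 1  # Mersenne prime; (a*x + b) % P is one-to-one on [0, P)


def make_functions(n: int, seed: int = 0) -> list:
    """Draw n (a, b) pairs, each defining one function f_i."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(n)]


def signature(fingerprints: set, functions: list, j: int = 8) -> list:
    """For each f_i keep the smallest value, truncated to its j lowest bits.

    Assumes a non-empty fingerprint set.
    """
    mask = (1 << j) - 1
    sig = []
    for a, b in functions:
        smallest = min((a * (x % P) + b) % P for x in fingerprints)
        sig.append(smallest & mask)
    return sig


def estimated_resemblance(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions on which two documents agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


funcs = make_functions(16, seed=1)
sig_a = signature({10, 20, 30}, funcs, j=8)
sig_b = signature({30, 20, 10}, funcs, j=8)
print(estimated_resemblance(sig_a, sig_b))
```

Truncating to j bits shrinks the signature but lets unrelated minima collide by chance, so a smaller j trades storage for some loss of precision.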

Discovering Automated Messages

Generating groups of near duplicates: index-based clustering
- For each document d in index I do
  - If d is not covered
    - Let S = {S1, S2, …, Sn} be the signature of document d
    - D = Query(I, atleast(S, k))
    - For each document d’ in D, mark d’ as covered
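The clustering loop can be sketched with an in-memory inverted index over signature values. This is an illustration under assumptions: the real system queries a search index, whereas here `build_index` and `query_at_least` are hypothetical helpers that emulate `Query(I, atleast(S, k))`, and already-covered documents are excluded from later groups so that each document lands in exactly one group.

```python
from collections import Counter, defaultdict


def build_index(signatures: dict) -> dict:
    """Map each (position, value) pair to the documents that carry it."""
    index = defaultdict(set)
    for doc_id, sig in signatures.items():
        for pos, value in enumerate(sig):
            index[(pos, value)].add(doc_id)
    return index


def query_at_least(index: dict, sig: list, k: int) -> set:
    """Return documents that share at least k signature values with sig."""
    counts = Counter()
    for pos, value in enumerate(sig):
        for doc_id in index.get((pos, value), ()):
            counts[doc_id] += 1
    return {doc_id for doc_id, c in counts.items() if c >= k}


def cluster(signatures: dict, k: int) -> list:
    """Group near duplicates, visiting only documents not yet covered."""
    index = build_index(signatures)
    covered = set()
    groups = []
    for doc_id, sig in signatures.items():
        if doc_id in covered:
            continue
        group = query_at_least(index, sig, k) - covered
        group.add(doc_id)
        covered |= group
        groups.append(sorted(group))
    return groups


signatures = {"a": [1, 2, 3, 4], "b": [1, 2, 3, 9], "c": [7, 7, 7, 7]}
print(cluster(signatures, k=3))
```

Because each query marks its whole result set as covered, the number of index queries is bounded by the number of groups rather than the number of documents.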

Discovering groups of automated messages
- Automated messages, groups of bulk emails, groups of forwarded emails

Use MD5 to detect bulk emails. Emails with one segment are automated messages.
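Exact-duplicate detection with MD5 can be sketched as below. The whitespace and case normalization step is an assumption added for illustration; the slide only states that MD5 is used to detect bulk emails.

```python
import hashlib
from collections import defaultdict


def normalize(body: str) -> str:
    """Collapse whitespace and lowercase (an illustrative normalization)."""
    return " ".join(body.split()).lower()


def group_exact_duplicates(emails: dict) -> list:
    """Group email ids whose normalized bodies have the same MD5 digest."""
    by_digest = defaultdict(list)
    for email_id, body in emails.items():
        digest = hashlib.md5(normalize(body).encode("utf-8")).hexdigest()
        by_digest[digest].append(email_id)
    return [ids for ids in by_digest.values() if len(ids) > 1]


emails = {
    "e1": "Town hall at 3 pm",
    "e2": "Town  hall at 3 PM",
    "e3": "Lunch?",
}
print(group_exact_duplicates(emails))
```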

Detecting Semantic Groups: Email Threads

A tree like structure

A link denotes that the child node was written as a reply to the parent node.

Capture the context in which an email was written

Detecting Email Threads

Metadata-based methods
- Headers are not consistently used

Content of the old mail remains in the new mail
- A segment contains the text of only one communication

An email ei contains ej iff ei approximately contains all the segments of ej

Method for Thread Detection

Email Segment Generator (ESG)

– Splits an email into segments, where each segment contains the content of only one email.

Segment Signature Generator (SSG):

– Generates a signature for a segment

• Use near duplicate signatures

For a practical implementation, we limit the number of segment signatures (N) that can be associated with an email, e.g. 20 segments.
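The two components might be approximated as follows. Everything here is a simplifying assumption: the separator patterns are hypothetical heuristics for quoted replies, and an MD5 of the normalized text stands in for the near-duplicate signature that the talk uses, so this sketch matches segments exactly rather than approximately.

```python
import hashlib
import re

# Hypothetical separators that often precede quoted earlier messages.
SEPARATOR = re.compile(
    r"^[ \t>]*(?:-+[ \t]*Original Message[ \t]*-+|On .+ wrote:)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_segments(body: str) -> list:
    """ESG sketch: split an email so each segment holds one communication."""
    cleaned = []
    for part in SEPARATOR.split(body):
        lines = [line.lstrip("> ").strip() for line in part.splitlines()]
        text = " ".join(line for line in lines if line)
        if text:
            cleaned.append(text)
    return cleaned


def segment_signature(segment: str) -> str:
    """SSG sketch: exact hash standing in for a near-duplicate signature."""
    normalized = " ".join(segment.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def email_signatures(body: str, max_segments: int = 20) -> list:
    """Signatures of at most N = max_segments segments of an email."""
    return [segment_signature(s) for s in split_segments(body)[:max_segments]]


original = "Can we meet at 3?"
reply = "Sounds good.\n-----Original Message-----\n> Can we meet at 3?"
print(split_segments(reply))
```

With these signatures, the quoted segment of the reply and the original email map to the same value, which is what later allows containment between emails to be tested.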

Method: Processing at Indexing Time

[Figure: indexing-time processing diagram showing the word index, meta index, and signature index]

Method: Processing at Query Time

[Figure: query-time processing diagram showing the word index, meta index, and signature index]

Generating Candidate Thread Set
- Use the signature of the first segment

Detecting Email Threads

Given a candidate thread set
- Identify the email with only the root segment
- An email ec is a child of an email ep if ec minimally contains ep
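Building the thread tree from a candidate set can be sketched as below. Each email is represented as a set of segment signatures, containment is modeled as a strict subset test (the talk uses approximate containment), and "minimally contains" is interpreted as choosing the largest contained email, so that no other candidate sits between parent and child. The names are hypothetical.

```python
from typing import Dict, FrozenSet, Optional


def build_thread(
    candidates: Dict[str, FrozenSet[str]],
) -> Dict[str, Optional[str]]:
    """Map each email id to its parent id, or None for the root."""
    parents: Dict[str, Optional[str]] = {}
    for child_id, child_segments in candidates.items():
        # Emails whose segments are all present in the child.
        contained = [
            (other_id, segments)
            for other_id, segments in candidates.items()
            if other_id != child_id and segments < child_segments
        ]
        if not contained:
            parents[child_id] = None  # only the root segment: thread root
            continue
        # The largest contained email has no intermediate email above it.
        parents[child_id] = max(contained, key=lambda item: len(item[1]))[0]
    return parents


candidates = {
    "e1": frozenset({"s1"}),
    "e2": frozenset({"s1", "s2"}),
    "e3": frozenset({"s1", "s2", "s3"}),
    "e4": frozenset({"s1", "s4"}),
}
print(build_thread(candidates))
```

Here e2 and e4 are both replies to the root e1, while e3 replies to e2, giving the tree structure described for email threads.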

Creating Semantic Categories

Focus categories
- Documents that are likely to be responsive
- Legal content, financial communication, intellectual property
- High recall

Filter categories
- Documents that are likely to be unresponsive
- Bulk emails, private communication, jokes
- High precision

Creating Semantic Categories

Email Segmentation

Pattern-based annotation: use a SystemT-based method

Consolidation
- Each concept is independent
- Apply additional constraints over concepts
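The annotate-then-consolidate flow can be illustrated with plain regular expressions. This is not SystemT and not the talk's rule set: the patterns, the concept names, and the consolidation constraint (drop filter concepts whenever a focus concept also fires, favoring recall for focus categories) are all hypothetical.

```python
import re

# Hypothetical patterns; each concept is matched independently.
CONCEPT_PATTERNS = {
    "legal_content": re.compile(
        r"\b(attorney|privileged|litigation)\b", re.IGNORECASE
    ),
    "financial_communication": re.compile(
        r"\b(invoice|revenue|earnings)\b", re.IGNORECASE
    ),
    "joke": re.compile(r"\b(joke|lol)\b", re.IGNORECASE),
}
FOCUS = {"legal_content", "financial_communication"}


def annotate(text: str) -> set:
    """Return every concept whose pattern matches the text."""
    return {
        name
        for name, pattern in CONCEPT_PATTERNS.items()
        if pattern.search(text)
    }


def consolidate(concepts: set) -> set:
    """Hypothetical constraint: focus concepts override filter concepts."""
    focus = concepts & FOCUS
    return focus if focus else concepts


found = annotate("This memo is privileged. lol")
print(sorted(found), sorted(consolidate(found)))
```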

Experiments – Near Duplicate Detection

Enron corpus
- 517K emails from 150 users

Measuring precision
- Manually evaluated the near-duplicate set for 500 queries
- With more bits, precision is 100% even with a 40% similarity threshold

Only 33.3% of emails are unique

Experiments – Email Thread Detection

No ground truth for threads

Subject approximation method: based on “Re:”, “Fw:”, etc. in the subject

Manually verified the thread results for our method and the subject approximation method
- The union of correct emails in a thread for both approaches is treated as ground truth
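Scoring both methods against this pooled ground truth can be sketched as follows. The email ids and the function name are made up for illustration and are not results from the talk.

```python
def precision_recall(returned: set, ground_truth: set) -> tuple:
    """Precision and recall of one method against the pooled ground truth."""
    correct = returned & ground_truth
    precision = len(correct) / len(returned) if returned else 0.0
    recall = len(correct) / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Illustrative data: emails each method returned for one thread.
ours = {"e1", "e2", "e3"}
subject_based = {"e1", "e2", "e4", "e5"}
# Emails a reviewer judged to truly belong to the thread.
verified = {"e1", "e2", "e3", "e4"}

# Ground truth: union of the correct emails found by either approach.
ground_truth = (ours | subject_based) & verified

print(precision_recall(ours, ground_truth))
print(precision_recall(subject_based, ground_truth))
```

Because the ground truth pools only what the two methods found, recall computed this way is relative: emails missed by both approaches are not counted.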

Experiments – Semantic Group

Ground truth: Sampled 2200 emails using generic keywords and then manually labeled

Conclusions

We developed a framework that allows group-level review of documents

We developed methods for finding syntactic groups, such as automated messages

We developed methods for finding email threads and semantic groups

We showed a significant reduction in review time from group-level review and integrated the proposed techniques with the IBM InfoSphere eDiscovery Analyzer product
