
Data Mining of E-Mails to Support Periodic & Continuous Assurance

Glen L. Gray, California State University at Northridge

Roger Debreceny, University of Hawai`i at Mānoa

5th Symposium on Information Systems Assurance

Toronto: October 2007

In this Presentation

Continuous monitoring of emails – why?
Technologies
    Social Network Analysis
    Text analysis
Challenges
Opportunities

Continuous Monitoring of Emails – Why?

Increased focus on forensic approaches to auditing

Increased interest in continuous assurance and monitoring of business processes

Emails = Organization’s DNA

Evidential matter on:
    Employee & management fraud (overrides)
    Compliance (e.g., HIPAA)
    Loss of intellectual property
    Corporate policies

Enron Email Archive

Released by Federal Energy Regulatory Commission

500K emails
151 Enron employees
Cleaned version at Carnegie Mellon: www.cs.cmu.edu/~enron/
Relational DB version at USC: www.isi.edu/~adibi/Enron/Enron_Dataset_Report.pdf

Email Mining Targets

[Diagram: Email Data Mining, with branches Content Analysis (Key Word Queries, Deception Clues), Log Analysis (Volume & Velocity), and Social Network Analysis]

Content Analysis

Key Word Queries

Yes, people do say self-incriminating things in their emails
    Fraud
    Corporate dysfunction
Overwhelming false positives
    Need “smart” compound queries
Good continuous auditing (CA) candidate
    Already scanning for spam, porn, etc.
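One way to picture a “smart” compound query is a rule that fires only when a risk term and a business-context term appear in the same email, which should cut down on the false positives a single key word produces. The sketch below is a minimal illustration; the term lists and the function names `matches` and `compound_hit` are hypothetical and not taken from the presentation.

```python
import re

# Hypothetical term lists, chosen only for illustration.
RISK_TERMS = {"write-off", "backdate", "off the books"}
CONTEXT_TERMS = {"audit", "auditor", "quarter", "invoice"}


def matches(text, terms):
    """Return the terms that occur in text as whole words or phrases."""
    lowered = text.lower()
    return {
        term
        for term in terms
        if re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", lowered)
    }


def compound_hit(text):
    """Flag an email only when a risk term AND a context term co-occur."""
    return bool(matches(text, RISK_TERMS)) and bool(matches(text, CONTEXT_TERMS))
```

A single key word such as "backdate" would flag both a suspicious invoice request and a harmless social message; the compound rule flags only the former.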

Sender Deception -- Content

Deceptive emails include:
    Fewer first-person pronouns to dissociate themselves from their own words
    Fewer exclusive words, such as but and except, to indicate a less complex story
    More negative emotion words because of the sender’s underlying feeling of guilt
    More action verbs to, again, indicate a less complex story
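The cues listed above can be turned into simple per-email rates. The sketch below counts three of them; the word lists are tiny hypothetical samples, `cue_rates` is an illustrative name, and the action-verb cue is left out because it would need a part-of-speech tagger.

```python
import re
from collections import Counter

# Hypothetical, deliberately small word lists for illustration only.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
EXCLUSIVE = {"but", "except", "without", "although"}
NEGATIVE_EMOTION = {"hate", "worried", "afraid", "angry", "sorry"}


def cue_rates(text):
    """Return the share of words in text that fall into each cue category."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid division by zero on empty text
    counts = Counter(words)

    def rate(vocabulary):
        return sum(counts[word] for word in vocabulary) / total

    return {
        "first_person": rate(FIRST_PERSON),
        "exclusive": rate(EXCLUSIVE),
        "negative_emotion": rate(NEGATIVE_EMOTION),
    }
```

Rates like these only become meaningful when compared against a baseline, for example the same sender's earlier emails.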

Sender Deception -- Identification

Writeprint features
    Lexical -- characters & words
        Function words
        Root words
    Syntactic -- sentences
    Structural -- paragraphs
    Content-specific

Sender Deception -- Identification

Number of potential features unlimited
Optimum number can vary by context and language
Developing user profiles and comparing new emails to profiles would be challenging for real-time CA
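To make the writeprint idea concrete, the sketch below extracts a handful of lexical and structural features from one email body. It is only an illustration: the function-word subset and the name `writeprint` are hypothetical, and a real profile would use far more features.

```python
import re

# Hypothetical subset of function words; real feature sets are much larger.
FUNCTION_WORDS = ("the", "of", "and", "to", "in")


def writeprint(text):
    """Return a small dictionary of stylistic features for one email body."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    lowered = [word.lower() for word in words]
    n_words = len(words) or 1  # avoid division by zero on empty text

    features = {
        "avg_word_len": sum(len(word) for word in words) / n_words,
        "avg_sentence_len": len(words) / (len(sentences) or 1),
        "paragraphs": len(paragraphs),
    }
    # Relative frequency of each function word.
    for function_word in FUNCTION_WORDS:
        features["fw_" + function_word] = lowered.count(function_word) / n_words
    return features
```

Comparing a new email's feature vector with a sender's stored profile is the step the slide flags as challenging for real-time CA.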

Temporal/Log Analysis

Volume & Velocity

Volume = number of emails a person sends and/or receives over a period of time.

Velocity = how quickly the volume changes.
Many external factors (e.g., vacations, seasonal activities, etc.) impact these numbers
Need “rolling histogram”

Volume & Velocity

Key issue -- determining the optimum time intervals to sample the data

Continuous monitoring cannot be continuous in terms of sampling in real time

Comparing hourly, daily, and even weekly volumes and velocities will result in many false positives

Optimum time interval could vary by job title
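The volume and velocity definitions can be sketched with a rolling window whose length is a parameter, so it could differ by job title. The function names and the choice of whole days as the unit are assumptions made for illustration.

```python
from datetime import date, timedelta


def rolling_volume(sent_dates, end, window_days):
    """Count emails dated within the window_days ending on `end` (inclusive)."""
    start = end - timedelta(days=window_days - 1)
    return sum(1 for sent in sent_dates if start <= sent <= end)


def velocity(sent_dates, end, window_days):
    """Change in volume between the current window and the one just before it."""
    current = rolling_volume(sent_dates, end, window_days)
    previous_end = end - timedelta(days=window_days)
    previous = rolling_volume(sent_dates, previous_end, window_days)
    return current - previous


# Example: five emails sent by one person in January 2007.
sent = [date(2007, 1, 1), date(2007, 1, 2),
        date(2007, 1, 8), date(2007, 1, 9), date(2007, 1, 10)]
weekly_volume = rolling_volume(sent, date(2007, 1, 14), 7)
weekly_velocity = velocity(sent, date(2007, 1, 14), 7)
```

A short window reacts quickly but is noisy, while a long window smooths out vacations and seasonal swings at the cost of slower detection.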

Social Network Analysis

Social Network Analysis

Social relationships as an undirected graph
Importance of understanding relationships within the flow of email exchanges

Social Network Analysis in Emails

Emails are semi-structured data:
    sender
    primary recipient(s)
    copied recipient(s)
    date
    subject line
Social groups and cliques
CA = who doesn’t belong?
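The header fields above are enough to build an undirected graph of who exchanges email with whom, and a first pass at “who doesn’t belong?” is to list correspondents outside a known group. The sketch below assumes each email is a dictionary with hypothetical keys `sender`, `to`, and `cc`; the function names are illustrative.

```python
from collections import defaultdict
from itertools import chain


def build_graph(emails):
    """Build an undirected, weighted adjacency map from email headers."""
    graph = defaultdict(lambda: defaultdict(int))
    for message in emails:
        sender = message["sender"]
        for other in chain(message["to"], message["cc"]):
            if other != sender:
                # Undirected edge: record the exchange on both endpoints.
                graph[sender][other] += 1
                graph[other][sender] += 1
    return graph


def outsiders(graph, group):
    """Return addresses that exchange email with the group but are not in it."""
    members = set(group)
    return {
        peer
        for member in members
        for peer in graph.get(member, {})
        if peer not in members
    }
```

Edge weights count exchanges, so the same structure could also support thresholds such as ignoring one-off contacts.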

Thread Analysis – This?

[Diagram: an email thread laid out along a time axis, with nodes labeled S, R, and C]

Thread Analysis – Or this?

[Diagram: an alternative layout of an email thread along a time axis, with nodes labeled S, R, and C]

Integrating Content Analysis and Social Network Analysis

[Diagram: Email Data Mining, with branches Content Analysis (Key Word Queries, Deception Clues), Log Analysis (Volume & Velocity), and Social Network Analysis]

Challenges of Email Mining

Textual
    Inconsistent use of abbreviations
    Misspelled words
    Smileys etc. etc.
    Replies, replies, and more replies…
Inability to identify:
    Identities of email participants (e.g., anon@anon.mail.sender.net)
    Roles and responsibilities

What Do the Enron Emails Show?

People do say the darnedest things
What did he know and when did he know it?
Verified numerous bodies of email data mining research
    Content analysis
    Social network analysis

Tools

Content monitoring
    eSoft Corporation’s ThreatWall
    Symantec’s Mail Security 8x00 Series
    Vericept Corporation’s Vericept Content 360°
    Reconnex Corporation’s iGuard Appliance
    InBoxer, Inc. Anti-Risk Appliance
Social networks
    Microsoft SNARF
    Heer’s Vizster

Research Opportunities

Research Questions

Role of email monitoring in overall CA environment?

Join SNA with examination of textual patterns.
Link SNA with control environment
Frauds/control overrides footprint?
What email cleaning is required for CA purposes?
Privacy and policy issues?
Lessons from existing commercial products?

Your Questions

Thank You

glen.gray@csun.edu

rogersd@hawaii.edu