Use of Machine Learning in Big Data Analytics for Insider Threat … · 2017-04-21 · Use of...

Distribution A. Approved for public release; distribution unlimited. (# 88ABW-2015-4593)

Integrity - Service - Excellence

Use of Machine Learning in Big Data Analytics for Insider Threat Detection

October 26, 2015

Mr. Michael Jay Mayhew (AFRL)

Mr. Michael Atighetchi (BBN)

Dr. Aaron Adler (BBN)

Dr. Rachel Greenstadt (Drexel University)


Outline

• General concepts underlying BBAC

• The BBAC Web Service Prototype

• Scalable Compute Platform

• Machine Learning Techniques

• Experimental results

• Future work and conclusion


Problem Overview

Urgent Need in the DoD Enterprise:

• Cyber operation in the DoD enterprise incurs undue risk.

• No systematic way to determine trustworthiness of information, its sources, and consumers.

Technical Challenges:

• At mission speed

• At enterprise scale

• With high accuracy

[Figure: Low-level Observables (Access Control Logs, Provenance Trails, Audit Trails, Behavior) → Actionable Trustworthiness of Documents, Actors, and Services]

Novelty of Research:

• Synergistic combination of rule-based techniques with statistical learning

• Strategic integration with existing access control schemes

• Multi-layered analysis to achieve scale and timeliness

Impact:

Diminishes the risk of misplaced trust, increases mission reliability and assurance, and deters abuse of authorized privileges

Scientific Methodology:

• Quantitative Metrics

• Experimentation


Cyber Attack Context

Advanced Persistent Threats:

An extremely proficient, patient, determined, and capable adversary, including two or more of such adversaries working together.*

Insider Threat:

A person, known or suspected, who uses their authorized access to DoD facilities, systems, equipment, information or infrastructure to damage, disrupt operations, commit espionage on behalf of a foreign intelligence entity or support international terrorist organizations.*

* TERMS & DEFINITIONS OF INTEREST FOR DoD COUNTERINTELLIGENCE PROFESSIONALS, OFFICE OF COUNTERINTELLIGENCE (DXC) DEFENSE CI & HUMINT CENTER DEFENSE INTELLIGENCE AGENCY, 2 May 2011

Mitigation through BBAC:

Analysis of behaviors at multiple layers

Inputs | Features
Network Flows | Counts per time: # connections (to intra/inter/DMZ, by port range); size and duration
HTTP Requests | URL; HTTP Headers
Text and Micro-text (Email, Blogs) | Various Stylometric Features

[Figure label: BBAC Classifier]
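The network-flow features listed on this slide (connection counts per time window, split by destination zone and port range) can be illustrated with a small aggregation sketch. This is not the BBAC implementation: the record keys (`src`, `ts`, `zone`, `port`), the 60-second window, and the port buckets are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def port_range(port):
    # Hypothetical bucketing; the slides do not say how ports are grouped.
    if port < 1024:
        return "well_known"
    if port < 49152:
        return "registered"
    return "dynamic"

def count_features(flows, window_seconds=60):
    """Count connections per (host, time window), split by zone and port range.

    Each flow is a dict with assumed keys: 'src' (host), 'ts' (seconds),
    'zone' ('intra', 'inter', or 'dmz'), and 'port'.
    """
    features = defaultdict(Counter)
    for flow in flows:
        window = int(flow["ts"] // window_seconds)
        counts = features[(flow["src"], window)]
        counts["connections"] += 1
        counts["zone_" + flow["zone"]] += 1
        counts["ports_" + port_range(flow["port"])] += 1
    return features

# Made-up flow records for one host.
flows = [
    {"src": "10.0.0.5", "ts": 3, "zone": "intra", "port": 443},
    {"src": "10.0.0.5", "ts": 42, "zone": "inter", "port": 8080},
    {"src": "10.0.0.5", "ts": 75, "zone": "dmz", "port": 51000},
]
features = count_features(flows)
```

Each `(host, window)` entry then acts as one feature vector: here the first window holds two connections and the second holds one.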


Cyber Defense Context (1/2)

Layered Defense-in-Depth: BBAC works in conjunction with existing capabilities

[Figure: Cyber Security Monitoring and Enforcement layers facing the Insider Threat. Mechanisms shown: Firewalls, Proxy Filters, BBAC, PDP/PEP for ABAC, HIDS/NIDS. Functions shown: Network Access Controls, Application Access Controls, Behavior Within Privilege Realm, Content Filtering, Security Monitoring.]


Cyber Defense Context (2/2)

Compensation Controls: Sophisticated analysis at higher layers avoids the need for draconian deny rules at lower layers

[Figure: two request paths. First path: Requests → Proxy Filters → Block All. Second path: Requests → Proxy Filters (Accept / Block) → BBAC (Accept / Block).]


Prototype Web Service System

• BBAC provides a web-based user interface:

– Reports of suspicious behaviors

– Assessment of training & classification

– Performance statistics

[Figure components: J2EE WebApp (Tomcat Webserver), Processing Framework, Storm, Accumulo, Zookeeper, HDFS]


BBAC High-Level Architecture

[Figure: five topologies: Ingest Topology, Cluster Topology, Attack Topology, Build Classifier Topology, Classify Topology. Component labels: BRO Data Files, Country IP Info, Ingestor, Context Merging, Accumulo Aggregation, Accumulo Clustering, Attack Generator, Build Classifier, Classifier, UI / Actuators. Other labels: Training, Execution, Classification, Order, Task.]


K-Means++ Clustering

• Use of Clustering

– Separate features into spaces that contain similar behaviors

– Train one classifier per cluster

• Goals / Benefits

– Enable parallelization of training processes across clusters

– Increase accuracy per cluster by grouping similar behaviors

• BBAC uses the K-Means++ clustering algorithm as implemented in Weka
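To make the clustering step concrete, here is a minimal pure-Python sketch of k-means with K-Means++ seeding. BBAC itself uses the Weka implementation; this stand-in only illustrates the idea, and the function names, the two-dimensional toy points, and the fixed iteration count are assumptions.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_seed(points, k, rng):
    """Pick k initial centers. After the first, each center is drawn with
    probability proportional to its squared distance from the nearest
    center chosen so far. Assumes at least k distinct points."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centers = kmeans_pp_seed(points, k, rng)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels, centers

# Two well-separated groups of made-up host feature vectors.
points = [(0, 0), (0, 1), (1, 0), (100, 100), (100, 101), (101, 100)]
labels, centers = kmeans(points, k=2)
```

In the per-cluster scheme described on the slide, each resulting group of hosts would then get its own classifier, which is what allows training to run in parallel.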


Support Vector Machine (SVM)

• BBAC uses Support Vector Machines (SVM) for classification

– Commonly used supervised learning algorithm

– Trains on a set of labeled input samples (Training Data)

– Once trained, an SVM labels / classifies incoming samples (Test Data)

• Our samples are labeled as NORMAL or SUSPICIOUS

• BBAC uses an SVM provided by the Weka Machine Learning Framework
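The train-then-classify workflow can be sketched with a tiny linear SVM trained by Pegasos-style stochastic subgradient descent on the hinge loss. This is a stand-in for illustration only, not the Weka SVM that BBAC uses. The function names, hyperparameters, and toy data are assumptions, and the sketch omits the bias term, so it assumes features centered around the origin.

```python
import random

def train_linear_svm(samples, labels, lam=0.01, epochs=50, seed=0):
    """Pegasos-style training of a linear SVM without a bias term.

    labels are +1 (SUSPICIOUS) or -1 (NORMAL).
    """
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    order = list(range(len(samples)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)      # decaying step size
            shrink = 1.0 - 1.0 / t     # equals 1 - eta * lam
            x, y = samples[i], labels[i]
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            # Regularization shrinks w; a margin violation adds a step toward y * x.
            w = [shrink * wj for wj in w]
            if margin < 1:
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w

def classify(w, x):
    score = sum(wj * xj for wj, xj in zip(w, x))
    return "SUSPICIOUS" if score > 0 else "NORMAL"

# Made-up, centered training data: three NORMAL and three SUSPICIOUS samples.
samples = [(-2.0, -1.0), (-1.0, -2.0), (-2.0, -2.0),
           (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
labels = [-1, -1, -1, 1, 1, 1]
w = train_linear_svm(samples, labels)
```

With these weights, `classify(w, (3.0, 3.0))` returns `"SUSPICIOUS"` and `classify(w, (-3.0, -1.0))` returns `"NORMAL"`.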


Intelligent Clustering

Raw Data: Every host is an individual

Clustering: Grouping of the hosts based on proximity in space

Decision Tree: Dividing the space into subspaces

• Clustering enables training high-quality classifiers efficiently. However:

– Clusters are neither intuitive nor descriptive

– Assigning a cluster to a new host is problematic

• Solution: Extract simple rules for assigning a group to a host
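One way to picture "simple rules" is a one-level decision tree (a decision stump) fitted to existing cluster labels, so that a new host is assigned a group by a single readable threshold test. The slides do not describe the actual rule-extraction procedure; the sketch below is a hypothetical illustration, and the feature meanings and values are made up.

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def learn_stump(rows, clusters):
    """Find the rule 'feature f <= threshold' that best reproduces the labels.

    Assumes at least one feature takes two or more distinct values.
    """
    best = None
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        # Candidate thresholds are midpoints between neighboring values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [c for r, c in zip(rows, clusters) if r[f] <= threshold]
            right = [c for r, c in zip(rows, clusters) if r[f] > threshold]
            left_label, right_label = majority(left), majority(right)
            errors = (sum(c != left_label for c in left)
                      + sum(c != right_label for c in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]

def assign_cluster(stump, row):
    f, threshold, left_label, right_label = stump
    return left_label if row[f] <= threshold else right_label

# Made-up hosts: (connections per minute, average bytes) with cluster labels.
rows = [(5, 200), (8, 900), (10, 400), (50, 300), (60, 800), (75, 500)]
clusters = [0, 0, 0, 1, 1, 1]
stump = learn_stump(rows, clusters)
```

The learned rule reads "hosts with at most 30 connections per minute belong to cluster 0, all others to cluster 1", which can be applied to a new host without rerunning the clustering.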


Experimental Results

Metric | Measured | Phase II Target | Met | Status | Details
Accuracy | Correctness of characterization (CC); TP = % of attacks correctly identified, FP = % of normal traffic incorrectly identified | TP ≥ 80%, FP ≤ 1% | Yes | TP = 96%, FP = 0.45% | TCP: [92.9%, 0.5%]*; HTTP: [99.6%, 0.9%]
Accuracy | Correctness of attribution (CA) | TP ≥ 80%, FP ≤ 1% | Yes | TP = 88%, FP = 0.53% | Wiki: [76%, 1%]; Twitter: [96%, 0.18%]; Email: [93%, 0.4%]
Precision | Positive Predictive Value (PPV) for CC; PPV = # TP / (# TP + # FP) | ≥ 90% | Yes | 97% | TCP: 94.2%; HTTP: 99.1%
Precision | PPV for CA | ≥ 75% | Yes | 99.6% | Wikipedia Edits*: 99%; Email: 99.99%; Twitter: 99.8%
Timeliness | Classification latency | < 100 ms | Yes | 0.7 ms | TCP: m=0.74ms, s=0.52ms; HTTP: m=19ms, s=5.4ms; Wiki: m=1.3µs, s=30.2µs

* reported as [TP, FP]

• High accuracy for HTTP classifier

• Classification is fast and IO bound, training is slow and CPU bound
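The accuracy and precision metrics in the table follow directly from confusion-matrix counts. The sketch below works through the arithmetic with made-up counts (not the experiment's data) and checks them against the Phase II targets for correctness of characterization.

```python
def rates(tp, fn, fp, tn):
    """Return (TP rate, FP rate, PPV) from raw confusion-matrix counts."""
    tp_rate = tp / (tp + fn)   # share of attacks correctly identified
    fp_rate = fp / (fp + tn)   # share of normal traffic incorrectly flagged
    ppv = tp / (tp + fp)       # PPV = # TP / (# TP + # FP)
    return tp_rate, fp_rate, ppv

# Hypothetical counts: 1,000 attack samples and 10,000 normal samples.
tp_rate, fp_rate, ppv = rates(tp=950, fn=50, fp=50, tn=9950)

# Phase II targets for CC: TP >= 80%, FP <= 1%, PPV >= 90%.
meets_targets = tp_rate >= 0.80 and fp_rate <= 0.01 and ppv >= 0.90
```

Because PPV is computed from counts rather than rates, it depends on the mix of attack and normal samples in the evaluation set as well as on the classifier itself.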


Cyber Security Data Sets

Problems:

• Value Proposition Bootstrap

• Granularity Mismatch

• Lack of Ground Truth

• Accidental Complexities

• Lack of Correlated Data Sets

• Relevance

Solutions:

• Simulated Ground Truth for Attacks

• Focus on Realistically Observable Information

• Find Reduced Size Data Sets

• Establish Correlation Through Attack Splicing


Conclusions / Next Steps

• The DoD collects large amounts of audit data but lacks capabilities for performing real-time analysis to enable decision makers to respond to evolving cyber events

• BBAC continuously assesses trustworthiness of actors, documents, and services by virtue of classifying behaviors at multiple layers

• Current prototype analyzes TCP, HTTP, Text, and provenance records and scales using a streaming cloud paradigm

• Technical Enhancements

– Implement functional clustering

– Perform extended experimentation

– Implement security controls

• Transition

– BBN’s Inter Departmental Service Group (IDSG)

– Application to Cross Domain Information Sharing


Publications

1. Michael Atighetchi, Jonathan Webb, Rachel Greenstadt, Michael Mayhew, "Behavior-Based Access Control - Needs, Benefits, and Concepts", Military Communications Conference (MILCOM), Orlando, Florida, October 29-November 1, 2012.

2. Jeffrey Segall, Michael Mayhew, Michael Atighetchi, Rachel Greenstadt, "Assessing Trustworthiness in Collaborative Environments", Proceedings of the Cyber Security and Information Intelligence Research Workshop, Oak Ridge National Labs, 2012.

3. Michael Atighetchi, Michael Jay Mayhew, Rachel Greenstadt, Aaron Adler, "Developing and Validating Statistical Cyber Defenses", 25th Annual IEEE Software Technology Conference (STC), Salt Lake City, Utah, April 8-11, 2013.

4. Aaron Adler, Michael Jay Mayhew, Jeffrey Cleveland, Michael Atighetchi, "User Selection of Clusters and Classifiers in BBAC", International Conference on Intelligent User Interfaces (IUI) 2013 Workshop on Interactive Machine Learning, Santa Monica, CA, USA, March 19-22, 2013.

5. Jeffrey Segall and Rachel Greenstadt, "The Illiterate Editor: Metadata-driven Revert Detection in Wikipedia", Proceedings of the 2013 Joint International Symposium on Wikis and Open Collaboration (WikiSym + OpenSym 2013).

6. Jeffrey Cleveland, Michael Jay Mayhew, Aaron Adler, Michael Atighetchi, "Scalable Machine Learning Framework for Behavior-Based Access Control", 1st IEEE International Symposium on Resilient Cyber Systems (ISRCS) 2013, August 13-15, 2013.

7. Aaron Adler, Michael Jay Mayhew, Jeffrey Cleveland, Michael Atighetchi, Rachel Greenstadt, "Using Machine Learning for Behavior-Based Access Control: Scalable Anomaly Detection on TCP Connections and HTTP Requests", 32nd Military Communications Conference (MILCOM 2013), San Diego, CA, November 18-20, 2013.

8. Michael Atighetchi, Michael Jay Mayhew, Rachel Greenstadt, Aaron Adler, "Problems and Mitigation Strategies for Developing and Validating Statistical Cyber Defenses", CrossTalk - The Journal of Defense Software Engineering, Vol. 27 No. 2, March/April 2014.

9. Rebekah Overdorf, Travis Dutko, Rachel Greenstadt, "Blogs and Twitter Feeds: A Stylometric Environmental Impact Study", Privacy Enhancing Technologies Symposium, Philadelphia, PA, June 30 - July 2, 2014.


Michael J. Mayhew, AFRL/RIEBA

BBAC Program Manager

[email protected]

315-330-2898 (DSN = 587)

Michael Atighetchi, BBN

BBAC Principal Investigator

[email protected]

617-873-1679

Dr. Rachel Greenstadt

Statistical Machine Learning Expert

[email protected]

Project Website:

https://dist-systems.bbn.com/tech/Cross/
