Distribution A. Approved for public release; distribution unlimited. (# 88ABW-2015-4593)
Integrity - Service - Excellence
Use of Machine Learning in Big Data Analytics for
Insider Threat Detection
October 26, 2015
Mr. Michael Jay Mayhew (AFRL)
Mr. Michael Atighetchi (BBN)
Dr. Aaron Adler (BBN)
Dr. Rachel Greenstadt (Drexel University)
Outline
• General concepts underlying BBAC
• The BBAC Web Service Prototype
• Scalable Compute Platform
• Machine Learning Techniques
• Experimental results
• Future work and conclusion
Problem Overview
Urgent Need in the DoD Enterprise:
• Cyber operation in the DoD enterprise incurs undue risk.
• No systematic way to determine trustworthiness of information, its sources, and consumers.
[Figure labels: Low-level Observables (Access Control Logs, Provenance Trails, Audit Trails); Behavior; Actionable Trustworthiness of Documents, Actors, and Services]
Technical Challenges:
• At mission speed
• At enterprise scale
• With high accuracy
Novelty of Research:
• Synergistic combination of rule-based techniques with statistical learning
• Strategic integration with existing access control schemes
• Multi-layered analysis to achieve scale and timeliness
Impact:
Diminishes the risk of misplaced trust, increases mission reliability and assurance, and deters abuse of authorized privileges
Scientific Methodology:
• Quantitative Metrics
• Experimentation
Cyber Attack Context
Advanced Persistent Threats:
An extremely proficient, patient, determined, and capable adversary, including two or more of such adversaries working together.*
Insider Threat:
A person, known or suspected, who uses their authorized access to DoD facilities, systems, equipment, information or infrastructure to damage, disrupt operations, commit espionage on behalf of a foreign intelligence entity or support international terrorist organizations.*
* Terms & Definitions of Interest for DoD Counterintelligence Professionals, Office of Counterintelligence (DXC), Defense CI & HUMINT Center, Defense Intelligence Agency, 2 May 2011
Mitigation through BBAC:
Analysis of behaviors at multiple layers
[Figure label: BBAC Classifier]
Inputs and Features:
• Network Flows: counts per time of # connections (to intra/inter/DMZ, by port range); size and duration
• HTTP Requests: URL, HTTP headers
• Text and Micro-text (Email, Blogs): various stylometric features
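The network-flow features listed on this slide (connection counts per time window, split by destination zone and port range, plus size and duration) could be computed along the following lines. This is only an illustrative sketch and not the BBAC implementation: the flow tuple layout, the subnet definitions, the 60-second window, and the use of the three IANA port ranges are all assumptions made for the example.

```python
from collections import Counter
from ipaddress import ip_address, ip_network

# Hypothetical zone definitions; a real deployment would load its own subnets.
INTRANET = ip_network("10.0.0.0/8")
DMZ = ip_network("192.168.100.0/24")

def zone(dst: str) -> str:
    """Map a destination address to 'dmz', 'intra', or 'inter' (everything else)."""
    addr = ip_address(dst)
    if addr in DMZ:
        return "dmz"
    if addr in INTRANET:
        return "intra"
    return "inter"

def port_range(port: int) -> str:
    """Bucket a port into the well-known, registered, or dynamic range."""
    if port < 1024:
        return "well_known"
    if port < 49152:
        return "registered"
    return "dynamic"

def flow_features(flows, window_s=60):
    """Aggregate (timestamp, src, dst, dst_port, n_bytes, duration) tuples
    into one feature Counter per (source host, time window)."""
    feats = {}
    for ts, src, dst, port, n_bytes, duration in flows:
        key = (src, int(ts // window_s))
        f = feats.setdefault(key, Counter())
        f["connections"] += 1
        f["to_" + zone(dst)] += 1
        f["port_" + port_range(port)] += 1
        f["bytes"] += n_bytes
        f["duration"] += duration
    return feats

# Made-up flow records for one host.
flows = [
    (0.0, "10.1.1.5", "10.2.0.9", 443, 1200, 0.5),
    (12.0, "10.1.1.5", "8.8.8.8", 53, 80, 0.1),
    (30.0, "10.1.1.5", "192.168.100.7", 8080, 5000, 2.0),
    (75.0, "10.1.1.5", "10.2.0.9", 50000, 300, 0.2),
]
features = flow_features(flows)
```

Each resulting Counter is one feature vector for a host in a given time window, which is the kind of input a per-host behavior classifier would consume.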
Cyber Defense Context (1/2)
Layered Defense-in-Depth: BBAC works in conjunction with existing capabilities
[Figure: Cyber Security Monitoring and Enforcement against the Insider Threat. Labels: Firewalls; Proxy Filters; BBAC; PDP/PEP for ABAC; Network Access Controls; Application Access Controls; Behavior Within Privilege Realm; Content Filtering; HIDS/NIDS Security Monitoring]
Cyber Defense Context (2/2)
Compensation Controls: Sophisticated analysis at higher layers avoids the need for draconian deny rules at lower layers
[Figure: two request pipelines. First: Requests, Proxy Filters, Block All. Second: Requests, Proxy Filters (Accept / Block), BBAC (Accept / Block)]
Prototype Web Service System
• BBAC provides a web-based user interface:
– Reports of suspicious behaviors
– Assessment of training & classification
– Performance statistics
[Figure labels: J2EE WebApp (Tomcat Webserver); Processing Framework; Storm; Accumulo; Zookeeper; HDFS]
BBAC High-Level Architecture
[Figure: processing topologies and their components.
Topologies: Ingest Topology; Cluster Topology; Attack Topology; Build Classifier Topology; Classify Topology.
Component labels: BRO Data Files; Country IP Info; Ingestor; Context Merging; Attack Generator; Clustering; Build Classifier; Classifier; UI / Actuators.
Accumulo stores: Aggregation; Clustering.
Legend labels: Training; Execution; Classification; Order; Task]
K-Means++ Clustering
• Use of Clustering
– Separate features into spaces that contain similar behaviors
– Train one classifier per cluster
• Goals / Benefits
– Enable parallelization of training processes across clusters
– Increase accuracy per cluster by grouping similar behaviors
• BBAC uses the K-Means++ clustering algorithm as implemented in Weka
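To make the clustering step concrete, here is a minimal pure-Python sketch of k-means++ seeding followed by standard Lloyd iterations, ending with samples grouped by cluster so that one classifier could be trained per group. BBAC itself uses the Weka implementation; the function names, the toy host feature vectors, and the fixed random seed below are assumptions for illustration only.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_seeds(points, k, rng):
    """k-means++ seeding: the first center is drawn uniformly; each later
    center is drawn with probability proportional to its squared distance
    from the nearest center chosen so far. Assumes at least k distinct points."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm started from k-means++ seeds."""
    rng = random.Random(seed)
    centers = kmeanspp_seeds(points, k, rng)
    for _ in range(iters):
        labels = assign(points, centers)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assign(points, centers)

# Toy host feature vectors (hypothetical): two well-separated behavior groups.
hosts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(hosts, k=2)

# Group samples by cluster; each group would get its own classifier.
groups = {}
for host, lab in zip(hosts, labels):
    groups.setdefault(lab, []).append(host)
```

Because the groups are disjoint, the per-cluster training jobs are independent of each other, which is what makes the parallelization benefit listed on the slide possible.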
Support Vector Machine (SVM)
• BBAC uses Support Vector Machines (SVM) for classification
– Commonly used supervised learning algorithm
– Trains on a set of labeled input samples (Training Data)
– Once trained, an SVM labels / classifies incoming samples (Test Data)
• Our samples are labeled as NORMAL or SUSPICIOUS
• BBAC uses an SVM provided by the Weka Machine Learning Framework
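The train-then-classify workflow described above can be sketched with a small linear SVM trained by subgradient descent on the regularized hinge loss. This is a stand-in for illustration, not Weka's SVM and not the BBAC code: the learning rate, regularization strength, epoch count, and the two-feature toy samples are all assumed values.

```python
def score(x, w, b):
    """Signed distance-like score of sample x under the linear model (w, b)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_linear_svm(samples, labels, lam=0.01, eta=0.01, epochs=1000):
    """Minimize lam/2 * ||w||^2 + hinge loss with fixed-step subgradient descent.
    Labels are -1 (NORMAL) or +1 (SUSPICIOUS)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * score(x, w, b)
            # Regularization step shrinks w slightly on every sample.
            w = [wi * (1 - eta * lam) for wi in w]
            if margin < 1:
                # Sample is inside the margin: move the boundary away from it.
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b

def classify(x, w, b):
    """Label an incoming sample using the trained model."""
    return "SUSPICIOUS" if score(x, w, b) > 0 else "NORMAL"

# Hypothetical labeled training data: two made-up behavior features per sample.
samples = [(1, 1), (2, 1), (1, 2), (8, 9), (9, 8), (9, 9)]
labels = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(samples, labels)
predictions = [classify(x, w, b) for x in samples]
```

A production system would normally rely on a library solver and may use non-linear kernels; the sketch only shows the shape of the training and classification steps.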
Intelligent Clustering
Raw Data: Every host is an individual
Clustering: Grouping of the hosts based on proximity in space
Decision Tree: Dividing the space into subspaces
• Clustering enables training high-quality classifiers efficiently. However:
– Clusters are neither intuitive nor descriptive
– Assigning a cluster to a new host is problematic
• Solution: Extract simple rules for assigning a group to a host
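One way to picture rule extraction is a one-level decision tree (a decision stump) fitted to the cluster labels: it searches for the single feature threshold that best reproduces the clustering, and that threshold then serves as a readable rule for placing a new host. The slides do not say which tree algorithm BBAC uses, so the stump, the feature names, and the sample values below are assumptions for illustration.

```python
from collections import Counter

def best_stump(rows, labels):
    """Find the single-feature threshold that best reproduces the labels.
    Returns (feature_index, threshold, left_label, right_label, errors),
    or None if no feature has two distinct values."""
    best = None
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        # Candidate thresholds are midpoints between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = Counter(l for r, l in zip(rows, labels) if r[f] <= thr)
            right = Counter(l for r, l in zip(rows, labels) if r[f] > thr)
            l_lab, l_n = left.most_common(1)[0]
            r_lab, r_n = right.most_common(1)[0]
            errors = len(rows) - l_n - r_n
            if best is None or errors < best[4]:
                best = (f, thr, l_lab, r_lab, errors)
    return best

def assign_group(row, stump):
    """Apply the extracted rule to a new host's feature vector."""
    f, thr, l_lab, r_lab, _ = stump
    return l_lab if row[f] <= thr else r_lab

# Hypothetical host features: (connections per minute, mean bytes per flow),
# with cluster labels as they might come out of a prior clustering step.
rows = [(5, 200), (7, 180), (6, 220), (90, 210), (110, 190), (95, 205)]
clusters = [0, 0, 0, 1, 1, 1]

stump = best_stump(rows, clusters)
names = ["connections_per_min", "mean_bytes"]
rule = f"if {names[stump[0]]} <= {stump[1]} then group {stump[2]} else group {stump[3]}"
```

A full decision tree would apply the same split search recursively to each side, yielding a short chain of such rules per group.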
Experimental Results

| Metric | Measured | Phase II Target | Met | Status | Details |
|---|---|---|---|---|---|
| Accuracy | Correctness of characterization (CC); TP = % of attacks correctly identified; FP = % of normal traffic incorrectly identified | TP ≥ 80%, FP ≤ 1% | Yes | TP = 96%, FP = 0.45% | TCP: [92.9%, 0.5%]*; HTTP: [99.6%, 0.9%] |
| Accuracy | Correctness of attribution (CA) | TP ≥ 80%, FP ≤ 1% | Yes | TP = 88%, FP = 0.53% | Wiki: [76%, 1%]; Twitter: [96%, 0.18%]; Email: [93%, 0.4%] |
| Precision | Positive Predictive Value (PPV) for CC; PPV = # TP / (# TP + # FP) | ≥ 90% | Yes | 97% | TCP: 94.2%; HTTP: 99.1% |
| Precision | PPV for CA | ≥ 75% | Yes | 99.6% | Wikipedia Edits*: 99%; Email: 99.99%; Twitter: 99.8% |
| Timeliness | Classification latency | < 100 ms | Yes | 0.7 ms | TCP: m=0.74ms, s=0.52ms; HTTP: m=19ms, s=5.4ms; Wiki: m=1.3µs, s=30.2µs |

* reported as [TP, FP]

• High accuracy for HTTP classifier
• Classification is fast and IO bound, training is slow and CPU bound
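The accuracy and precision metrics above follow directly from confusion-matrix counts. The sketch below applies the slide's definitions (TP rate over attacks, FP rate over normal traffic, PPV = # TP / (# TP + # FP)) and checks them against the Phase II targets for correctness of characterization. The counts are made up for illustration and are not the experimental data.

```python
def detection_metrics(tp, fn, fp, tn):
    """Compute (TP rate, FP rate, PPV) from confusion-matrix counts.
    tp/fn count attacks that were / were not flagged;
    fp/tn count normal samples that were / were not flagged."""
    tp_rate = tp / (tp + fn)   # share of attacks correctly identified
    fp_rate = fp / (fp + tn)   # share of normal traffic incorrectly identified
    ppv = tp / (tp + fp)       # positive predictive value
    return tp_rate, fp_rate, ppv

def meets_cc_targets(tp_rate, fp_rate, ppv):
    """Phase II targets for characterization: TP >= 80%, FP <= 1%, PPV >= 90%."""
    return tp_rate >= 0.80 and fp_rate <= 0.01 and ppv >= 0.90

# Hypothetical counts: 100 attack samples and 1000 normal samples.
tp_rate, fp_rate, ppv = detection_metrics(tp=96, fn=4, fp=5, tn=995)
ok = meets_cc_targets(tp_rate, fp_rate, ppv)
```

Note that PPV depends on the ratio of attack to normal samples in the test set, so the same TP and FP rates can yield a different PPV on a differently balanced data set.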
Cyber Security Data Sets
Problems:
• Value Proposition Bootstrap
• Granularity Mismatch
• Lack of Ground Truth
• Accidental Complexities
• Lack of Correlated Data Sets
• Relevance
Solutions:
• Simulated Ground Truth for Attacks
• Focus on Realistically Observable Information
• Find Reduced Size Data Sets
• Establish Correlation Through Attack Splicing
Conclusions / Next Steps
• The DoD collects large amounts of audit data but lacks capabilities for performing real-time analysis to enable decision makers to respond to evolving cyber events
• BBAC continuously assesses the trustworthiness of actors, documents, and services by classifying behaviors at multiple layers
• The current prototype analyzes TCP, HTTP, text, and provenance records and scales using a streaming cloud paradigm
• Technical Enhancements
– Implement functional clustering
– Perform extended experimentation
– Implement security controls
• Transition
– BBN’s Inter Departmental Service Group (IDSG)
– Application to Cross Domain Information Sharing
Publications
1. Michael Atighetchi, Jonathan Webb, Rachel Greenstadt, Michael Mayhew, "Behavior-Based Access Control - Needs, Benefits, and Concepts", Military Communications Conference (MILCOM), Orlando, Florida, October 29-November 1, 2012.
2. Jeffrey Segall, Michael Mayhew, Michael Atighetchi, Rachel Greenstadt, "Assessing Trustworthiness in Collaborative Environments", Proceedings of the Cyber Security and Information Intelligence Research Workshop, Oak Ridge National Labs, 2012.
3. Michael Atighetchi, Michael Jay Mayhew, Rachel Greenstadt, Aaron Adler, "Developing and Validating Statistical Cyber Defenses", 25th Annual IEEE Software Technology Conference (STC), Salt Lake City, Utah, April 8-11, 2013.
4. Aaron Adler, Michael Jay Mayhew, Jeffrey Cleveland, Michael Atighetchi, "User Selection of Clusters and Classifiers in BBAC", International Conference on Intelligent User Interfaces (IUI) 2013 Workshop on Interactive Machine Learning, Santa Monica, CA, USA, March 19-22, 2013.
5. Jeffrey Segall and Rachel Greenstadt, "The Illiterate Editor: Metadata-driven Revert Detection in Wikipedia", Proceedings of the 2013 Joint International Symposium on Wikis and Open Collaboration (WikiSym + OpenSym 2013).
6. Jeffrey Cleveland, Michael Jay Mayhew, Aaron Adler, Michael Atighetchi, "Scalable Machine Learning Framework for Behavior-Based Access Control", 1st IEEE International Symposium on Resilient Cyber Systems (ISRCS) 2013, August 13-15, 2013.
7. Aaron Adler, Michael Jay Mayhew, Jeffrey Cleveland, Michael Atighetchi, Rachel Greenstadt, "Using Machine Learning for Behavior-Based Access Control: Scalable Anomaly Detection on TCP Connections and HTTP Requests", 32nd Military Communications Conference (MILCOM 2013), San Diego, CA, November 18-20, 2013.
8. Michael Atighetchi, Michael Jay Mayhew, Rachel Greenstadt, Aaron Adler, "Problems and Mitigation Strategies for Developing and Validating Statistical Cyber Defenses", CrossTalk - The Journal of Defense Software Engineering, Vol. 27 No. 2, March/April 2014.
9. Rebekah Overdorf, Travis Dutko, Rachel Greenstadt, "Blogs and Twitter Feeds: A Stylometric Environmental Impact Study", Privacy Enhancing Technologies Symposium, Philadelphia, PA, June 30 – July 2, 2014.
Michael J. Mayhew, AFRL/RIEBA
BBAC Program Manager
315-330-2898 (DSN = 587)
Michael Atighetchi, BBN
BBAC Principal Investigator
617-873-1679
Dr. Rachel Greenstadt
Statistical Machine Learning Expert
Project Website:
https://dist-systems.bbn.com/tech/Cross/