Explaining and Aggregating Anomalies to Detect Insider Threats
Henry G. Goldberg, William T. Young, Alex Memory, Ted E. Senator
Leidos, Inc., Arlington, VA, USA
{goldberghg,youngwil,memoryac,senatort}@leidos.com
Abstract—Anomalies in computer usage data may be indicative of insider threats. Distinguishing actual malicious activities from unusual but justifiable activities requires not only a sophisticated anomaly detection system but also the expertise of human analysts with access to additional data sources. Because any anomaly detection system for extremely rare events will generate many false positives, human analysts must decide which anomalies are worth their time and effort for follow-up investigations. Providing a ranked or scored list of users – the typical output of an anomaly detection system – is necessary but far from sufficient for this purpose. Anomalies indicative of insider threats can be distinguished from those that arise from legitimate activity by explaining why they are anomalous, and high-risk users may be identified by their repeated appearance near the top of the ranked and scored lists. This paper describes results of experiments that show the utility of these techniques of explaining and aggregating anomalies to detect insider threats with greater accuracy than is achieved solely with anomaly detection methods.
Keywords-Anomaly Detection; Outlier Detection; Explanation; Temporal Aggregation
I. INTRODUCTION
An automated insider threat detection system that detects
anomalies in a single data source such as computer usage
data generates leads for investigation by human analysts.
These leads consist of ranked or scored lists of users
whose activities on particular days may warrant further
investigation. The next step in the investigative process is
for a counter-intelligence analyst to review these leads and
determine which leads are worth his/her time to investigate
further. We propose and evaluate two complementary methods
to help determine which anomalies should be investigated
further and which should not. (1) Explaining anomalies
by providing more details than the simple risk score can
help an analyst understand why the system considered a
particular user’s activity on a particular day unusual and
provides insight as to whether such activity requires further
investigation or has a legitimate explanation. (2) Because
insider threat scenarios typically are executed over multiple
days, and because malicious users may engage in repeated
improper activities, aggregating anomalous user-days by
user can help identify users who are likely to be engaging
in improper activity. Further, such aggregation allows an
analyst to focus on malicious users who are the ultimate
target of real investigations – rather than just on malicious
activities. This paper describes experiments and analyses
we have performed using these two techniques of anomaly
explanation and anomaly aggregation to more accurately dis-
criminate between improper malicious activities and unusual
but innocent activities.
We conduct our research using an anomaly detection
system called PRODIGAL that is described in Section II.
The structure of this paper is as follows. First, we describe
the PRODIGAL system including the type of explanations
it computes and how they are used in the analyst interface.
Next we describe our experiments, methods, results and
analyses of PRODIGAL’s explanation generation ability
involving combinations of features from individual detection
algorithms. The next section of the paper discusses our experiments on how best to aggregate single-day anomaly scores.
II. THE PRODIGAL SYSTEM
PRODIGAL comprises data processing and anomaly de-
tection components that are described in reference [1].
PRODIGAL has been configured to explore methods for
unsupervised and semi-supervised anomaly detection as the
first step in a multistage detection process for insider threats
[2]. As such it represents one of several approaches to the
problem. (See [3] for a comprehensive survey of methods
for anomaly detection, while [4] surveys approaches to
insider threat analysis and prediction.) PRODIGAL uses an
ensemble technique to combine results from multiple diverse
detectors to identify anomalous user-days in a database
of real computer usage activity [5]. PRODIGAL has been
tested and evaluated against realistic independent red-team
inserted scenarios. PRODIGAL’s unsupervised anomaly de-
tection ensemble combines scores from multiple diverse
detectors into single user-day scores each month, resulting in
a ranked and scored list of user-days ordered by the degree
of anomalousness. This technique consistently achieves a
level of performance on unknown inserted scenarios com-
parable to the performance of its best component detector
as determined after the answer key has been provided,
as described in reference [5]. This gives us confidence in
PRODIGAL's ability to detect not only known and suspected
insider threat scenarios but also variants and combinations
of such scenarios, and, more important, previously unknown
scenarios.
2016 49th Hawaii International Conference on System Sciences
1530-1605/16 $31.00 © 2016 IEEE
DOI 10.1109/HICSS.2016.344
Figure 1. Overall user-day scores compared to all users over the month. User #410400's scores compared to the baseline population for the month. The black squares represent the user's scores. The green box highlights the day in question for the user, which is September 9, 2014.
Figure 2. The individual feature scores for user 410400 on September 9, 2014. The user is in the less than 0.0001% percentile of all users on this day with respect to multiple file features.
A. Background: Insider Threat
Malicious insider activity on real computer networks is
carried out by a small number of authorized users, and,
more important, represents only a small fraction of their
overall activities on their computers. Our anomaly detectors,
which we distinguish from algorithms, comprise not only the
algorithms but also the specifications of the entity-extents
whose activity is being examined and the contexts against
which their activities are being compared. For example, an
entity-extent may be an individual user or a group of users,
defined by common projects, locations, organizations, or job
functions, by relationships such as communications patterns
or shared resource usage patterns, or by community member-
ship, where communities are identified by combining aspects
of these definitions. These entity-extents may be defined over different time periods as well. The context against which activities by entity-extents are compared includes various
choices for peer groups and community memberships as
well as various choices for time periods. For example, an
individual user’s activities on a given day may be compared
with his/her activities on all days in a month or longer; they
Figure 3. Detailed drill-down of events. Individual observations and select attributes associated with the top-ranked individual feature score, Distinct Files Count.
may be compared with activities of other users or groups
of users on the same day; with activities of other users
or groups of users during a month; or any variant thereof.
Explanations, therefore, must provide to a human analyst not
only the activity of the entity-extent being examined, but
also the context of the activities of other entity-extents with
which it is being compared.
B. Using Explanations to Support the Analyst
Analysts need explanations that illustrate, in terms relating
to user activity, why such activities are anomalous. Analysts
are not interested in algorithmic mechanisms used to identify
anomalies. (See [6] for a similar approach to explaining
document classification, which is also performed in a high
dimension feature space.) While analysts would benefit from
explanations of users’ overall plans and intentions, this is
infeasible because of the inaccessibility of data needed to
explain such hypotheses. Furthermore, and perhaps more
important, it is infeasible due to the unbounded amount of
diverse domain knowledge that would be required to explain
even a portion of the full set of plans that a computer user
might be undertaking at any given time.
Three levels of explanation meet these needs. The first
level consists of pre-computed single-feature outlier detec-
tion scores that are available for examination by analysts.
Section III-A describes their computation. This provides the
ability to examine the actual activity data of a user in the
context of similar activities of other users over similar time
periods. The second level is a collection of features or sets
of features that contribute most to the anomaly score from
an individual detector for an individual user. The third level
combines the contribution from diverse detectors that are
incorporated into the ensemble computation of the overall
anomaly scores.
C. PRODIGAL Analyst Interface
This section presents examples of the use of single feature
outlier scores in the PRODIGAL Analyst Interface (AI).
Analysis starts with a list of entity-extents (user-days) sorted
by highest ensemble anomaly score. The analyst compares
these scores with others for the date or the entire month
using the display shown in Figure 1. The AI presents
user-day scores using a box plot, with the upper whisker
representing the top 5 percent of scores in order to highlight
the most anomalous behaviors. The black squares represent
the selected user's scores for every day. User 410400 has scores (as shown in Figure 1) that are in the top 5% for
several days in the month. The day highlighted by the green
box is the highest-ranked day for the entire month.
For a given day of interest, the AI enables the analyst to
view the individual features associated with the ensemble
score, allowing the analyst to focus on specific anomalous
behaviors while investigating a particular scored entity, as
shown in Figure 2. The AI lists the data type associated
with the feature score (e.g., file, email, URL, printer, lo-
gon), a summary description of the feature name, and the
normalized score for that feature.
Finally, drill-down to underlying user-computer transac-
tions is included to let the analyst view the behavior from
which features and ultimately anomaly scores were com-
puted (Figure 3). (Note: to preserve privacy in the research
database, numerical hash keys replace unique user names,
file names, domain names, and email addresses. A live
implementation of PRODIGAL would present these to the
analyst.) Inclusion of the underlying observations associated
with the feature scores enables the analyst to visually inspect
the data and assess whether the user’s unusual behavior
is concerning and merits further exploration (outside of
PRODIGAL).
Figure 4. Histograms of values (left) and Logit outlier scores (right) for the feature: Uploads / Distinct URL domain ratio
III. EXPLANATIONS OF ANOMALIES
An important motivation for creating and refining ways
of explaining anomaly detection is that explanations are
needed to support an analyst in distinguishing malicious
from non-malicious user-days. Recently, studies such as [7]
have confirmed the benefits of explanations for analysts. In
lieu of direct utility judgments by analysts (which would
be costly), our domain expert labeled possibly malicious
activities and has prepared ground truth for a test of feature
utility.
Specifically, he considered the known malicious activities
inserted by a red team into two months of live computer usage data, totaling 81 user-days. He labeled each user-day as either containing or not containing each of 12
activities of concern that he would be likely to cite to another
analyst to explain why a particular user-day is worth further
investigation (see Table I).
A. Single Feature Outlier Scores
This section explains how we calculate single feature
outlier scores. Each entity receiving a score in PRODIGAL
has associated with it a large number of feature values,
V(U, D, F), where U is the user ID, D is the date, and F is the feature ID. PRODIGAL computes a statistical outlier score for each value by comparing it against all other users for that date, the comparison population being V(x, D, F). We compute an outlier score using the cumulative distribution function (CDF) of the logistic distribution with mean
and variance of this population. This score is normalized to
[0, 1] and is easily compared with other features’ scores. This
score is a “marginal” explanation of the anomaly of the user-
day, because it estimates the likelihood that the feature value
is greater than those of other users from the base population.
An explanation of a scored entity is a list of features plus
outlier scores.
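To make the computation concrete, the marginal score can be sketched as follows; the function name, the method-of-moments fit of the logistic scale parameter, and the toy population are our illustrative assumptions, not PRODIGAL's actual implementation:

```python
import math

def logistic_outlier_score(value, population):
    """Score one feature value against the same feature across all users
    for the date, using the CDF of a logistic distribution fitted to the
    population mean and variance (method of moments)."""
    n = len(population)
    mu = sum(population) / n
    var = sum((v - mu) ** 2 for v in population) / n
    # For a logistic distribution, variance = (s * pi)^2 / 3, so the
    # scale parameter is s = sqrt(3 * var) / pi.
    s = math.sqrt(3.0 * var) / math.pi
    return 1.0 / (1.0 + math.exp(-(value - mu) / s))

# A user whose daily count is far above the day's population scores near 1.
population = [2, 0, 1, 3, 2, 1, 0, 2, 4, 1]
print(logistic_outlier_score(40, population))
```

A value far above the population mean maps to a score near 1, a typical value to roughly 0.5, so scores from different features are directly comparable on [0, 1].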
Pre-computed features have been selected from a wide
range of user behaviors identified by counter-intelligence
analysts. Some examples include: URL upload count, email
event count, average recipient count per email sent, fixed
drive file event count, and upload/distinct URL domain ratio.
A sample of the values and outlier scores computed for the
last example is shown in Figure 4.
Figure 5. Histograms of values (left) and Gamma outlier scores (right) for the feature: Uploads / Distinct URL domain ratio
Table I. Labels of activities of concern to an analyst.

No. Description
1 Copies lots of files
2 Copies to removable
3 Other file activity
4 Searches networked drives
5 Unusual web upload
6 Other unusual browsing activity
7 Excessive email attachments
8 Unusual email send activity
9 Unusual email received activity
10 Prints a lot (jobs and pages)
11 Unusual printer activity
12 Unusual logons (events and distinct WS)
For these examples of pre-computed features, the first
histogram shows the raw feature value with the density
(red) and CDF (green) of the fitted logistic distribution. The
second histogram shows the resulting outlier scores. Other
distributions may fit this data better, such as the Gamma
distribution (shown in Figure 5).
B. Drop-Out Explanations
In addition to the single feature outlier scores, we have
modified several general purpose anomaly detection algo-
rithms, e.g., [8], to produce explanations based on a “drop-
out” method of sensitivity analysis [9]. These explanations
also comprise a set of the same features as previously
described, plus weights representing how much impact re-
moving each feature has on the score the detector computes
for any user-day. Other methods of generating explanations
are possible; in [10], Dang et al. propose a technique for
finding anomalies and explanations simultaneously.
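A minimal sketch of the drop-out idea under our own simplifying assumptions (a toy scoring function and a population-mean baseline stand in for a real detector; the names are ours):

```python
def dropout_explanation(score_fn, x, baseline):
    """Weight each feature by how much the anomaly score changes when
    that feature is 'dropped out', i.e., replaced by a baseline value,
    with all other features held fixed."""
    full = score_fn(x)
    weights = []
    for i in range(len(x)):
        x_drop = list(x)
        x_drop[i] = baseline[i]
        weights.append(full - score_fn(x_drop))
    return weights

# Toy detector: anomaly score is the largest absolute deviation from baseline.
baseline = [1.0, 1.0, 1.0]
score_fn = lambda x: max(abs(a - b) for a, b in zip(x, baseline))
print(dropout_explanation(score_fn, [9.0, 1.0, 2.0], baseline))  # [7.0, 0.0, 0.0]
```

With this toy detector only the first feature moves the score, so it receives all of the explanation weight; two highly correlated essential features, by contrast, would each receive little weight when dropped individually.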
C. Evaluating Explanations
To evaluate explanations in terms of the 100 PRODIGAL
features against a ground truth vector of 12 labels, we tried
two approaches: (1) direct prediction of analyst-assigned labels by the feature vectors, and (2) transformation of the feature vectors
into label vectors for comparison with a metric such as
cosine similarity. To measure how well any particular single
feature outlier score predicts a specific label, we treat the
set of outlier scores as a detector over the collection of
labeled user-days and compute the AUC (Area under the
Figure 6. Performance (measured by AUC) of specific features (vertical axis) at predicting each label (horizontal axis). Since features and labels are derived from a full day of user activity, correlation between several activities can produce non-intuitive predictions, such as printer features predicting file activity or web features predicting excessive email attachments.
ROC Curve). This results in a measure of how well each
feature distinguishes user days with and without the label.
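As an illustration, the AUC over labeled user-days can be computed with the rank-sum formulation; this sketch and its toy data are ours, not the evaluation code used in the experiments:

```python
def auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney) formulation: the probability
    that a randomly chosen positive user-day outscores a randomly chosen
    negative one, counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Outlier scores of one feature over six labeled user-days.
scores = [0.95, 0.80, 0.40, 0.90, 0.10, 0.30]
labels = [1, 1, 0, 1, 0, 0]
print(auc(scores, labels))  # 1.0: this feature separates the label perfectly
```

An AUC of 0.5 means the feature is uninformative for that label; values near 1.0 mean its outlier scores rank labeled user-days above unlabeled ones.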
The results of this experiment are presented in Figure 6 as a
color map. Rows are individual features, but we have labeled
groups of features that address aspects of various types of
computer usage. Columns are the analyst’s labels. We can
see a number of areas of the grid where features involving
particular activities (e.g., file access of various types) predict
labels involving the same activities well. However, there are
other places which surprise us. For example, several file
activity features appear to predict Unusual Printer Activity.
This may be a result of the way the sensor measuring
file activities (on network drives) picks up movement of
print jobs. A more puzzling case is where Unusual Email
Send/Received labels are predicted by ratios derived from
web activity, such as the ratio of uploads to distinct URLs.
This may be due to the fact that features are derived over
an entire day, and users in the test and training set tend to
perform both types of activity.
To allow direct comparison of our feature-based explana-
tions (both single feature and multi-feature drop-out) to the
ground truth labels, we learn a transformation from feature
space to label space. This is done by deriving the correlation
matrix from a training set of user-days. A cell of this matrix,
M(i, j), contains the Pearson Correlation coefficient of all
values of feature i (in [0, 1]) with all values of label j (0 or
1). Multiplying a feature vector by this matrix produces a
predicted label vector, which we then compare to the ground
truth for that user-day. Figure 7 displays the correlation
coefficients of a matrix derived from single feature outlier
scores over all labeled user-days. We avoid over-fitting via
repeated random sub-sampling validation (RRSSV), in which we select some user-days to test and derive a correlation matrix from the remainder. We compute
two metrics, cosine similarity and Euclidean distance,
and find the average of each metric over all user-days in the
test sample.
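The transformation and comparison steps can be sketched as follows; the toy training data, the 0.7 threshold, and the helper names are our illustrative assumptions:

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def feature_to_label_matrix(features, labels):
    """M[i][j] = Pearson correlation of feature i (in [0, 1]) with
    label j (0 or 1) over the training user-days."""
    F, L = len(features[0]), len(labels[0])
    return [[pearson([row[i] for row in features],
                     [row[j] for row in labels]) for j in range(L)]
            for i in range(F)]

def predict_labels(M, feature_vec):
    """Multiply a feature vector by the matrix to get a predicted label vector."""
    return [sum(feature_vec[i] * M[i][j] for i in range(len(M)))
            for j in range(len(M[0]))]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy training set: label j fires exactly when feature j exceeds 0.7.
random.seed(0)
train_feats = [[random.random() for _ in range(6)] for _ in range(111)]
train_labs = [[1.0 if row[j] > 0.7 else 0.0 for j in range(3)] for row in train_feats]
M = feature_to_label_matrix(train_feats, train_labs)
pred = predict_labels(M, train_feats[0])
print(cosine(pred, train_labs[0]))
```

Because each label here is driven by a single feature, the predicted label vector lands close to the ground-truth vector, i.e., the cosine similarity is high.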
D. Results and Analysis
The chart in Figure 8 shows the results of running 100
iterations where 20 cases are used for testing and 111 to
derive the matrix. We see that, using a correlation matrix
Figure 7. Transformation Matrix (using Pearson Correlation) of Features to Labels. The matrix shown was derived from 131 labeled user-days. The fact that the matrix is relatively sparse, with many low-correlation cells, suggests that our feature set is relatively well aligned with the labels. Columns with no strong correlations point out areas where we need additional features.
learned from relatively few labeled cases, we can derive analyst labels whose average similarity to ground truth is cos(41°).
We also tested explanation weights generated at random
and found that they performed nearly as well as single
feature explanations. This is likely due to the ability of the
learned matrix to capture a model of prior likelihoods of
label occurrence, especially over a small sample size.
Finally, the drop-out explanations derived from our IForest
anomaly detector perform roughly as well as the single
feature explanations whether using a matrix learned from the
single feature explanations or from the IForest explanation
weights themselves, yielding similarity scores of cos(45°)
and cos(39°), respectively. We would have expected drop-out explanations to do better: single feature explanations are independent of one another, and their score, or explanation weight, is a comparison to other instances of the same feature from other users, whereas the drop-out weights depend on the entire anomaly score, which is derived from the full feature vectors. One possible explanation is inter-feature correlation. In the case of two essential features
that are highly correlated with one another, neither would
receive a high drop-out score, since the method tests them
individually. This suggests a path to improving the drop-out
methods by learning inter-feature correlations and testing
entire groups for sensitivity.
IV. TEMPORAL AGGREGATION OF ANOMALIES
A. Background and Introduction
PRODIGAL scores user-days; however, we do not systematically apply the user-day scores to find users who repeatedly display the most unusual behavior. For example, ranking user-days does not identify users who had multiple high-scoring days in a time period, whereas an analyst, visually reviewing the output from PRODIGAL's
ensemble, would likely recognize patterns (e.g., a user who
exhibits anomalous behavior on consecutive days, or a week
apart on the same day). Our goals in temporal aggregation
experimentation were to develop a detector, D, that (1) used
output from the ensemble (user ID, rank, and day) to find
the most unusual users in a time period and (2) could serve
as another detector in the PRODIGAL system.
Figure 8. Cross Validation Results of Label Vectors Derived from Features (error bars show 95% confidence intervals).
Table II. Parameters used for temporal aggregation.

τ1: the rank of a user-day score, interpreted as a cutoff point. Possible values: the number of user-day ranks in the top 5, 10, 20, 50, 100, 200, 500, 1000, 5000, and 10000 (10 values in total).
τ2: the count of the number of user-days a specific user has at rank r. Possible values: the number of days in the time period that a user has at or above a given rank cutoff point; for a month, 1–31.
Our experiments differ from research focused on temporal
aggregation techniques in the context of time series analysis.
Traditionally, this research has focused on topics in eco-
nomics and finance such as modeling interest and exchange
rates [11], [12], [13], [14], [15], [16]. Recent research has
extended the previous work in temporal aggregation time
series analysis to agronomy and meteorology [17], [18] and
some in the social sciences (e.g., traffic patterns) [19], [20].
B. Methodology
1) Designing the temporal aggregation model: The
PRODIGAL system consists of over one hundred features
and detectors, whose output is combined into a single
ensemble detector score. We used the date and rank position
of these scores in a family of temporal aggregation detectors
parametrized by rank cutoff (τ1) and the number of days
(τ2) that a user has at a given rank cutoff. Table II below
describes how we specified these parameters.
We selected several values of τ1 which correspond to what
analysts expect to see in operational environments. Detector
parameter τ2 — the number of times that a user has a day at
a specific rank — covers the time period of our operational
surveillance. Thus, in a month, there are between 280 and
310 possible detectors.
2) Specifying the detector: Figure 9 depicts our temporal aggregation model development methodology. For each month, we (1) obtain the count of all distinct user IDs (including RT users), ranks, and days from the ensemble detector. Using those inputs, we (2) find the count of all users at each value of τ1 and the count of days for all users at each rank cutoff point (τ2). We then (3) form combinations of each parameter and (4) develop the detectors (D) for the period of analysis; examples of D include: Top 5, 1 Day; Top 10, 3 Days; and Top 50, 4 Days. We denote a detector with thresholds τ1 and τ2 as D(τ1, τ2).
3) Data: We used 21 months of test data (approximately 165,000 total user-day scores per month) from September 2012 to July 2014 to populate the model. We aggregated all user
behavior for the month and did not distinguish between a
Red Team user’s days with and without inserted events. In
an operational context, an analyst could discover even low-
signal malicious behavior by starting with a higher-ranked
day or a pattern of lower-ranked days. Ensemble ranks by
user-day were the inputs for temporal aggregation, and the
Red Team’s answer key allows calculation of lift.
4) Metrics and evaluation: We used lift to measure the value of the detector. Lift characterizes the improvement offered by a
classifier over random choice, and is an appropriate method
to apply in our temporal aggregation. As lift measures the
amount of data enrichment offered by a classifier, it enables
us to assess the improvement in detecting malicious insiders
by looking at focused subsets (e.g., the number of users
who have a rank at or above 50 three days in the month) of
the overall population. In our experiments, we define lift at
thresholds τ1 and τ2 as

L(τ1, τ2) = [nR(τ1, τ2) / n(τ1, τ2)] / [NR / N]    (1)
where
• nR(τ1, τ2) is the number of RT users at D(τ1, τ2),
• n(τ1, τ2) is the total number of users at D(τ1, τ2),
• NR is the total number of RT users in the data set, and
• N is the total number of all users.
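Under an assumed (user, day, rank) layout for the ensemble output, the detector D(τ1, τ2) and the lift of Equation (1) can be sketched as:

```python
from collections import Counter

def lift(user_day_ranks, rt_users, all_users, tau1, tau2):
    """L(tau1, tau2): the red-team fraction among detected users divided
    by the red-team base rate. A user fires the detector D(tau1, tau2)
    if it has at least tau2 days ranked at or above cutoff tau1."""
    days_at_cutoff = Counter(u for (u, d, r) in user_day_ranks if r <= tau1)
    detected = {u for u, c in days_at_cutoff.items() if c >= tau2}
    if not detected:
        return 0.0
    n_rt = len(detected & rt_users)
    base_rate = len(rt_users) / len(all_users)
    return (n_rt / len(detected)) / base_rate

# Toy month of 100 users: red-team user 'a' is in the top 10 on four days.
ranks = [("a", 1, 3), ("a", 2, 7), ("a", 3, 9), ("a", 4, 2),
         ("b", 1, 8), ("b", 2, 500), ("c", 1, 4000)]
everyone = {"a", "b", "c"} | set(map(str, range(97)))
print(lift(ranks, {"a"}, everyone, tau1=10, tau2=4))  # (1/1) / (1/100) = 100.0
```

Only user "a" satisfies Top 10, 4 Days, so the detected set is all red team and the lift is the inverse of the 1% base rate.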
We evaluated the temporal aggregation detectors' performance in two ways: (1) average lift across all months
and (2) average lift by specific Red Team scenario. In
the first approach we calculated the lift by data month
(across multiple and different scenarios) and averaged lift of
each classifier across all months in the set (i.e., 21 months
from September 2012 through July 2014). In the second
approach, we calculated lift by each scenario type, averaging
lift of each classifier across scenario instance. There are 36
scenarios and 74 distinct data sets in the data range. For
example, we averaged lift by classifier for the five instances
of the Snowed In scenario, spanning multiple data months
(July and October 2013 and July 2014).
5) Experiment results: Table III shows the final results
of the model across all months. Table IV shows the top ten,
Figure 9. Our temporal aggregation model development methodology.
Table IV. The top ten most frequently occurring temporal aggregation classifiers across all months.

Rank Lift Detector
1 114.50 Top 10, 4 Days
2 91.64 Top 5, 3 Days
3 42.96 Top 20, 7 Days
4 42.96 Top 20, 6 Days
5 31.15 Top 5, 2 Days
6 34.37 Top 20, 5 Days
7 31.24 Top 50, 10 Days
8 29.84 Top 10, 3 Days
9 20.83 Top 50, 9 Days
10 20.83 Top 50, 8 Days
most frequently occurring temporal aggregation classifiers
across all months. A review of the most frequently-occurring
temporal aggregation classifiers suggests that analysts focus
on users who are often highly unusual within a given time
period. Table V shows the performance of the temporal
aggregation detector by scenario, with lift averaged across
the distinct instances within a scenario.
Table VI lists the temporal aggregation detectors that produced the highest and lowest lift values by scenario, along with the number of red team users and of all users at each detector, and the number of all users in the month of the best detector.
C. Discussion
We have implemented a temporal aggregation filter in
our analyst interface (AI) and intend to present highly-
anomalous users identified by the temporal aggregation to
counter-intelligence analysts from the data provider and
determine the number of those users whose actions are of
interest. Also, in reviewing our results in the data laboratory,
we noticed that a high percentage of frequently anomalous
users (e.g., users who have multiple days in the top 20 user
days) appear to perform tasks associated with job roles and
functions categorized as high-risk for insider threat (e.g.,
system administrators).
V. CONCLUSIONS AND ONGOING RESEARCH
Our results to date suggest that analysts will find explana-
tions useful to discriminate between malicious and legitimate
activities with similar high anomaly detection scores, and
that useful explanations can be generated from vectors of
single feature outlier scores. We are pursuing several lines of
research involving improvements to explanation generation
that take into account inter-feature dependence as well as use
of explanations generated by individual anomaly detectors
to guide the ensemble process that produces PRODIGAL’s
overall scores [9], [21].
Our experiments with temporal aggregation also suggest
simple approaches to detect users who are likely to warrant
further investigation by finding users who frequently appear
towards the top of the anomaly detection score list on
multiple days. As we configure PRODIGAL to operate over
various time periods we will refine these approaches to
fit the operational requirements of specific insider threat
surveillance enterprises.
Table III. Final results of the temporal aggregation model across all months. (– indicates an empty cell.)

Days Top5 Top10 Top20 Top50 Top100 Top200 Top500 Top1000 Top5000 Top10000
31 – – – – – – – – – 0
30 – – – – – – – 0 0 0
29 – – – – – 0 0 0 0 0
28 – – – – 0 0 0 0 0 0
27 – – – – 0 0 0 0 0 0
26 – – – – 0 0 0 0 0 0
25 – – – – 0 0 0 0 0 0
24 – – – – 0 0 0 0 0 0
23 – – – – 0 0 0 0 2.03 3.03
22 – – – – 0 0 0 4.39 1.72 1.61
21 – – – 0 0 0 0 3.99 1.62 1.97
20 – – – 0 0 0 0 4.80 1.31 1.45
19 – – – 0 0 0 0 2.69 1.00 1.85
18 – – – 0 0 0 0 1.88 0.88 3.58
17 – – – 0 0 0 0 1.57 0.72 2.87
16 – – – 0 0 0 0 1.31 1.52 2.54
15 – – – 0 0 0 0 1.16 1.42 2.46
14 – – – 0 0 0 0 0.89 3.02 2.98
13 – – – 0 0 0 7.39 4.00 2.62 2.45
12 – – – 0 0 0 5.21 5.20 3.43 2.23
11 – – – 0 20.83 11.46 6.83 4.75 3.10 2.05
10 – – – 31.24 17.62 8.49 6.83 3.58 2.87 1.86
9 – – – 20.83 15.80 7.05 6.55 3.14 2.53 2.06
8 – – 0 20.83 11.46 7.05 8.07 2.51 2.83 1.99
7 – – 42.96 18.33 8.33 6.64 5.73 1.89 2.97 2.01
6 – – 42.96 13.89 6.94 12.91 4.03 1.46 2.71 1.95
5 – 0 34.37 8.81 6.64 11.90 3.05 1.94 2.88 2.35
4 – 114.56 20.83 7.64 6.64 7.78 2.03 3.08 2.75 2.29
3 91.64 29.88 14.32 6.64 4.98 3.60 1.57 2.97 2.70 2.12
2 37.15 12.27 10.11 5.05 4.12 1.86 2.55 4.30 2.34 1.77
1 6.84 4.10 4.24 2.02 3.44 1.92 3.23 3.10 1.88 1.56
Table VI. The temporal aggregation detectors that produced the highest and lowest lift values by scenario, with the number of red team and all users for each detector and the number of all users in the month of the best detector.

Scenario name | Average lift, best detector | Best detector (D) for this scenario | Num. RT users at D | Num. all users at D | Num. all users for the month
Snowed In | 1347.67 | Top 5, 3 days | 1 | 1 | 4124
Anomalous Encryption | 147.84 | Top 1000, 7 days | 1 | 19 | 5618
Exfiltration Prior to Termination | 146.84 | Top 20, 1 day | 1 | 13 | 4372
Selling Login Credentials | 109.44 | Top 10000, 23 days | 1 | 4 | 5691
Czech Mate | 51.47 | Top 500, 2 days | 1 | 83 | 4272
Hiding Undue Affluence | 2.22 | Top 5000, 2 days | 1 | 867 | 5721
Parting Shot 2 - Deadly Aim | 2.22 | Top 5000, 2 days | 1 | 635 | 4230
From Belarus With Love | 2.21 | Top 10000, 5 days | 1 | 723 | 4392
Manning Up | 1.99 | Top 10000, 4 days | 1 | 959 | 5729
Panic Attack | 0.86 | Top 10000, 5 days | 1 | 700 | 4286
ACKNOWLEDGMENT
The authors wish to thank the researchers and engineers
of the PRODIGAL team. Funding was provided by the
U.S. Army Research Office (ARO) and Defense Advanced
Research Projects Agency (DARPA) under Contract Number
W911NF-11-C-0088. The content of the information in this
document does not necessarily reflect the position or the
policy of the Government, and no official endorsement
should be inferred. The U.S. Government is authorized to
reproduce and distribute reprints for Government purposes
notwithstanding any copyright notation hereon.
REFERENCES
[1] T. E. Senator, H. G. Goldberg, A. Memory, W. T. Young, B. Rees, R. Pierce, D. Huang, M. Reardon, D. A. Bader, E. Chow et al., “Detecting insider threats in a real corporate database of computer usage activity,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2013, pp. 1393–1401.
[2] T. E. Senator, “Multi-stage classification,” in Data Mining,
Table V. The performance of the temporal aggregation detector by scenario, with lift averaged across the distinct instances within a scenario.

Scenario name | Count of scenario instances | Average lift of D across scenario instances
Snowed In | 5 | 1374.67
Anomalous Encryption | 2 | 147.84
Exfil. Prior to Termination | 2 | 146.84
Selling Login Credentials | 1 | 109.44
Czech Mate | 1 | 51.47
Manning Up Redux | 1 | 36.04
Byte Me | 2 | 30.21
Breaking the Stovepipe | 3 | 25.66
Credit Czech | 1 | 23.89
Blinded Me With Science | 1 | 23.39
Survivor's Burden | 3 | 21.52
Job Hunter | 1 | 21.44
What's the Big Deal | 1 | 16.03
The Big Goodbye | 1 | 12.47
Insider Startup | 7 | 11.49
Bona Fides | 2 | 9.93
Conspiracy Theory | 2 | 9.57
Bollywood Breakdown | 1 | 7.92
Layoff Logic Bomb | 2 | 7.72
Parting Shot | 1 | 6.91
Masquerading 2 | 2 | 6.49
Circumventing Sureview | 2 | 6.05
Strategic Tee Time | 1 | 4.75
Indecent RFP 2 | 2 | 4.08
Indecent RFP | 1 | 4.08
Passed Over | 4 | 3.76
Exfil...Using Screenshots | 3 | 2.88
Gift Card Bonanza | 1 | 2.82
Byte Me Middleman | 2 | 2.63
Naughty by Proxy | 4 | 2.45
Outsourcer's Apprentice | 3 | 2.45
Hiding Undue Affluence | 2 | 2.22
Parting Shot 2 | 1 | 2.22
From Belarus With Love | 2 | 2.21
Manning Up | 2 | 1.99
Panic Attack | 2 | 0.86
Fifth IEEE International Conference on. IEEE, 2005, 8 pp.
[3] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM Computing Surveys, vol. 41, no. 3, pp. 15:1–15:58, 2009.
[4] A. Azaria, A. Richardson, S. Kraus, and V. S. Subrahmanian, “Behavioral analysis of insider threat: A survey and bootstrapped prediction in imbalanced data,” IEEE Transactions on Computational Social Systems, vol. 1, no. 2, pp. 135–153, 2014.
[5] W. T. Young, A. Memory, H. G. Goldberg, and T. E. Senator, “Detecting unknown insider threat scenarios,” in Security and Privacy Workshops (SPW), 2014 IEEE. IEEE, 2014, pp. 277–288.
[6] D. Martens and F. Provost, “Explaining data-driven document classifications,” MIS Quarterly, vol. 38, no. 1, pp. 73–99, 2014.
[7] K. L. Wagstaff, N. L. Lanza, D. R. Thompson, T. G. Dietterich, and M. S. Gilmore, “Guiding scientific discovery with explanations using DEMUD,” in AAAI, 2013.
[8] F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation forest,” in Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on. IEEE, 2008, pp. 413–422.
[9] M. A. Siddiqui, A. Fern, T. G. Dietterich, and W.-K. Wong, “Sequential feature explanations for anomaly detection,” in KDD15 Digital Proceedings, Outlier Definition, Detection, and Description (ODDx3) Workshop [to be published 2015].
[10] X. H. Dang, I. Assent, R. T. Ng, A. Zimek, and E. Schubert, “Discriminative features for identifying and interpreting outliers,” in Data Engineering (ICDE), 2014 IEEE 30th International Conference on. IEEE, 2014, pp. 88–99.
[11] F. C. Drost and T. E. Nijman, “Temporal aggregation of GARCH processes,” Econometrica: Journal of the Econometric Society, pp. 909–927, 1993.
[12] D. Geltner, “Temporal aggregation in real estate return indices,” Real Estate Economics, vol. 21, no. 2, pp. 141–166, 1993.
[13] M. Marcellino, “Some consequences of temporal aggregation in empirical analysis,” Journal of Business & Economic Statistics, vol. 17, no. 1, pp. 129–136, 1999.
[14] R. J. Rossana and J. J. Seater, “Temporal aggregation and economic time series,” Journal of Business & Economic Statistics, vol. 13, no. 4, pp. 441–451, 1995.
[15] W. W.-S. Wei, Time Series Analysis. Addison-Wesley, 1994.
[16] A. A. Weiss, “Systematic sampling and temporal aggregation in time series models,” Journal of Econometrics, vol. 26, no. 3, pp. 271–281, 1984.
[17] T. A. Buishand, M. V. Shabalova, and T. Brandsma, “On the choice of the temporal aggregation level for statistical downscaling of precipitation,” Journal of Climate, vol. 17, no. 9, pp. 1816–1827, 2004.
[18] L. Van Bussel, C. Muller, H. Van Keulen, F. Ewert, and P. Leffelaar, “The effect of temporal aggregation of weather input data on crop growth models results,” Agricultural and Forest Meteorology, vol. 151, no. 5, pp. 607–619, 2011.
[19] E. Vlahogianni and M. Karlaftis, “Temporal aggregation in traffic data: implications for statistical characteristics and model choice,” Transportation Letters, vol. 3, no. 1, pp. 37–49, 2011.
[20] T. Usman, L. Fu, and L. Miranda-Moreno, “Accident prediction models for winter road safety: Does temporal aggregation of data matter?” Transportation Research Record: Journal of the Transportation Research Board, no. 2237, pp. 144–151, 2011.
[21] A. Memory and T. Senator, “Towards robust anomaly detection ensembles using explanations,” in KDD15 Digital Proceedings, Outlier Definition, Detection, and Description (ODDx3) Workshop [to be published 2015].