Anomaly Detection with Apache Spark
Anomaly Detection with Apache Spark: Workshop
Sean Owen / Director of Data Science / Cloudera
Anomaly Detection
• What is “unusual”?
  • Server metrics
  • Access patterns
  • Transactions
• Labeled, or not
  • Sometimes we know examples of “unusual”, sometimes not
• Applications
  • Network security
  • IT monitoring
  • Fraud detection
  • Error detection
Clustering
• Identify dense clusters of data points
• Unusual = far from any cluster
• What is “far”?
• Unsupervised learning
• Can “supervise” with some labels to improve or interpret

en.wikipedia.org/wiki/Cluster_analysis
k-means++ clustering
• Simple, well-known, parallel algorithm
• Iteratively assign points, update centers (“means”); see the sketch below
• Goal: points close to nearest cluster center
• Must choose k, the number of clusters

mahout.apache.org/users/clustering/fuzzy-k-means.html
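To make the iteration concrete, here is a minimal, non-distributed sketch of one assign-and-update pass (plain Scala, not from the talk; points and centers are hypothetical inputs):

// One k-means iteration: assign each point to its nearest center,
// then recompute each center as the mean of its assigned points
def distance(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

def kmeansStep(points: Seq[Array[Double]],
               centers: Seq[Array[Double]]): Seq[Array[Double]] = {
  // Assignment step: group points by their nearest current center
  val assigned = points.groupBy(p => centers.minBy(c => distance(c, p)))
  // Update step: each new center is the per-coordinate mean of its cluster
  assigned.values.map { cluster =>
    val sums = cluster.foldLeft(new Array[Double](cluster.head.length)) {
      (acc, p) => acc.zip(p).map { case (s, v) => s + v }
    }
    sums.map(_ / cluster.size)
  }.toSeq
}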
Anomaly Detection in KDD Cup ’99
KDD Cup 1999
• Annual ML competition: www.sigkdd.org/kddcup/index.php
• ’99: computer network intrusion detection
• 4.9M connections
• Most normal, many known to be attacks
• Not a realistic sample!
0,tcp,http,SF,215,45076,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0,0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.

Callouts on the slide point to the service (http), the bytes received (45076), the % SYN errors columns, and the label (normal.) in the last field.
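As a quick illustration (mine, not from the talk), the called-out fields can be pulled from a line by position, following the KDD Cup ’99 schema:

// Extract the annotated fields from one CSV line (positions per the KDD schema)
val line = "0,tcp,http,SF,215,45076,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0,0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal."
val fields = line.split(',')
val service = fields(2)       // "http"
val bytesReceived = fields(5) // "45076" (dst_bytes in the KDD schema)
val label = fields.last       // "normal."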
Apache Spark: Something For Everyone
• From MS Dryad, UC Berkeley, Databricks
• Scala-based
  • Expressive, efficient
  • JVM-based
• Scala-like API
  • Distributed works like local, works like streaming
  • Like Apache Crunch is Collection-like
• Interactive REPL
• Distributed
• Hadoop-friendly
  • Integrates with where the data and cluster already are
  • ETL no longer separate
• MLlib
Clustering, Take #0
val rawData = sc.textFile("/user/srowen/kddcup.data", 120)
rawData: org.apache.spark.rdd.RDD[String] = MappedRDD[13] at textFile at <console>:15

rawData.count
...
res1: Long = 4898431
0,tcp,http,SF,215,45076,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0,0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.
import scala.collection.mutable.ArrayBuffer

val dataAndLabel = rawData.map { line =>
  val buffer = ArrayBuffer[String]()
  buffer.appendAll(line.split(","))
  buffer.remove(1, 3)                          // drop the 3 categorical columns
  val label = buffer.remove(buffer.length - 1) // the label is the last field
  val vector = buffer.map(_.toDouble).toArray
  (vector, label)
}

val data = dataAndLabel.map(_._1).cache()
import org.apache.spark.mllib.clustering._

val kmeans = new KMeans()
val model = kmeans.run(data)

model.clusterCenters.foreach(centroid =>
  println(java.util.Arrays.toString(centroid)))

// Count how often each (cluster, label) pair occurs
val clusterAndLabel = dataAndLabel.map {
  case (data, label) => (model.predict(data), label)
}
val clusterLabelCount = clusterAndLabel.countByValue

clusterLabelCount.toList.sorted.foreach {
  case ((cluster, label), count) =>
    println(f"$cluster%1s$label%18s$count%8s")
}
cluster  label               count
0        back.                2203
0        buffer_overflow.       30
0        ftp_write.              8
0        guess_passwd.          53
0        imap.                  12
0        ipsweep.            12481
0        land.                  21
0        loadmodule.             9
0        multihop.               7
0        neptune.          1072017
0        nmap.                2316
0        normal.            972781
0        perl.                   3
0        phf.                    4
0        pod.                  264
0        portsweep.          10412
0        rootkit.               10
0        satan.              15892
0        smurf.            2807886
0        spy.                    2
0        teardrop.             979
0        warezclient.         1020
0        warezmaster.           20
1        portsweep.              1
Terrible: virtually every point, attacks and normal alike, lands in a single cluster.
Clustering, Take #1: Choose k
import scala.math._
import org.apache.spark.rdd._

// Euclidean distance between two points
def distance(a: Array[Double], b: Array[Double]) =
  sqrt(a.zip(b).map(p => p._1 - p._2).map(d => d * d).sum)

// Mean distance from each point to its nearest cluster center;
// lower means tighter clusters
def clusteringScore(data: RDD[Array[Double]], k: Int) = {
  val kmeans = new KMeans()
  kmeans.setK(k)
  val model = kmeans.run(data)
  val centroids = model.clusterCenters
  data.map(datum =>
    distance(centroids(model.predict(datum)), datum)).mean
}

// Try several values of k concurrently via a parallel collection
val kScores = (5 to 40 by 5).par.map(k => (k, clusteringScore(data, k)))
(5,  1938.8583418059309)
(10, 1614.7511288131)
(15, 1406.5960973638971)
(20, 1111.5970245349558)
(25,  905.536686115762)
(30,  931.7399112938756)
(35,  550.3231624120361)
(40,  443.10108628017787)
// More runs per k and a tighter convergence threshold find better optima;
// then rescore over a wider range of k
kmeans.setRuns(10)
kmeans.setEpsilon(1.0e-6)
(30 to 100 by 10)

(30,  886.974050712821)
(40,  747.4268153420192)
(50,  370.2801596900413)
(60,  325.883722754848)
(70,  276.05785104442657)
(80,  193.53996444359856)
(90,  162.72596475533814)
(100, 133.19275833671574)
Clustering, Take #2: Normalize
data.unpersist(true)

val numCols = data.take(1)(0).length
val n = data.count
val sums = data.reduce((a, b) =>
  a.zip(b).map(t => t._1 + t._2))
// Per-column sum of squares; squaring per element and reducing avoids the
// pitfall of fold, whose combine step would square partial sums again
val sumSquares = data.map(_.map(d => d * d)).reduce((a, b) =>
  a.zip(b).map(t => t._1 + t._2))
// Per-column stdev = sqrt(n * sum(x^2) - sum(x)^2) / n
val stdevs = sumSquares.zip(sums).map {
  case (sumSq, sum) => sqrt(n * sumSq - sum * sum) / n
}
val means = sums.map(_ / n)

// Standardize each feature: subtract its mean, divide by its stdev
// (columns with zero stdev are only centered)
val normalizedData = data.map(datum =>
  (datum, means, stdevs).zipped.map((value, mean, stdev) =>
    if (stdev <= 0) (value - mean) else (value - mean) / stdev)).cache()
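As a quick sanity check (mine, not in the original), each standardized column should now have a mean of about 0:

// Hypothetical check: per-column means of the standardized data, all ~0
val colMeans = normalizedData
  .reduce((a, b) => a.zip(b).map(t => t._1 + t._2))
  .map(_ / n)
colMeans.foreach(m => println(f"$m%.6f"))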
val kScores = (50 to 120 by 10).par.map(k =>
  (k, clusteringScore(normalizedData, k)))
(50,  0.008184436460307516)
(60,  0.005003794119180148)
(70,  0.0036252446694127255)
(80,  0.003448993315406253)
(90,  0.0028508261816040984)
(100, 0.0024371619202127343)
(110, 0.002273862516438719)
(120, 0.0022075535103855447)
Clustering, Take #3: Categoricals
// Map each distinct protocol string to an index, e.g. tcp -> 0, udp -> 1, ...
val protocols = rawData.map(
  _.split(",")(1)).distinct.collect.zipWithIndex.toMap
...

val dataAndLabel = rawData.map { line =>
  val buffer = ArrayBuffer[String]()
  buffer.appendAll(line.split(","))
  val protocol = buffer.remove(1)
  val vector = buffer.map(_.toDouble)

  // One-hot encode the protocol: all zeros except a 1 at its index
  val newProtocolFeatures = new Array[Double](protocols.size)
  newProtocolFeatures(protocols(protocol)) = 1.0
  ...
  vector.insertAll(1, newProtocolFeatures)
  ...
  (vector.toArray, label)
}
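For intuition, the same one-hot idea in isolation (the mapping below is hypothetical; the real indices depend on distinct/zipWithIndex order):

// Hypothetical index assignment for the three KDD protocol values
val protocols = Map("tcp" -> 0, "udp" -> 1, "icmp" -> 2)

def oneHot(protocol: String): Array[Double] = {
  val features = new Array[Double](protocols.size)
  features(protocols(protocol)) = 1.0 // 1.0 at the protocol's index
  features
}

oneHot("tcp")  // Array(1.0, 0.0, 0.0)
oneHot("icmp") // Array(0.0, 0.0, 1.0)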
(50,  0.09807063330707691)
(60,  0.07344136010921463)
(70,  0.05098421746285664)
(80,  0.04059365147197857)
(90,  0.03647143491690264)
(100, 0.02384443440377552)
(110, 0.016909326439972006)
(120, 0.01610738339266529)
(130, 0.014301399891441647)
(140, 0.008563067306283041)
Clustering, Take #4: Labels, Entropy
0,tcp,http,SF,215,45076,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0,0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.

The label, normal., is the last field.
Using Labels with Entropy
• Measures mixed-ness
• Bad clusters have very mixed labels
• A function of the cluster’s label frequencies, p(x)
• Good clustering = low-entropy clusters

entropy = −Σ p(x) log p(x)
// Shannon entropy of a collection of counts: -sum over labels of p * log(p)
def entropy(counts: Iterable[Int]) = {
  val values = counts.filter(_ > 0)
  val sum: Double = values.sum
  values.map { v =>
    val p = v / sum
    -p * log(p)
  }.sum
}

// Score a clustering by the average entropy of the label counts within each
// cluster, weighted by cluster size; lower means purer clusters
def clusteringScore(data: RDD[Array[Double]], labels: RDD[String], k: Int) = {
  ...
  val labelsInCluster = data.map(model.predict(_)).zip(labels).
    groupByKey.values
  val labelCounts = labelsInCluster.map(
    _.groupBy(l => l).map(t => t._2.length))
  val n = data.count
  labelCounts.map(m => m.sum * entropy(m)).sum / n
}
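To get a feel for the numbers (my examples, natural log):

entropy(List(50, 50)) // ≈ 0.693: a 50/50 mix of two labels, maximally impure
entropy(List(99, 1))  // ≈ 0.056: almost pure, so near zero
entropy(List(100))    // = 0.0: one label only, perfectly pure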
(30,  1.0266922080881913)
(40,  1.0226914826265483)
(50,  1.019971839275925)
(60,  1.0162839563855304)
(70,  1.0108882243857347)
(80,  1.0076114958062241)
(95,  0.4731290640152461)
(100, 0.5756131018520718)
(105, 0.9090079450132587)
(110, 0.8480807836884104)
(120, 0.3923520444828631)
Detecting an Anomaly
// Final model: k = 95, chosen by the entropy score above
val kmeans = new KMeans()
kmeans.setK(95)
kmeans.setRuns(10)
kmeans.setEpsilon(1.0e-6)
val model = kmeans.run(normalizedData)

def distance(a: Array[Double], b: Array[Double]) =
  sqrt(a.zip(b).map(p => p._1 - p._2).map(d => d * d).sum)

// Pair each point with its distance to its nearest cluster center;
// the most anomalous points are the farthest ones
val centroids = model.clusterCenters
val distances = normalizedData.map(datum =>
  (distance(centroids(model.predict(datum)), datum), datum))

distances.top(5)(Ordering.by[(Double, Array[Double]), Double](_._1))
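To turn this into a detector rather than a top-5 report, one simple extension (my sketch, not from the talk) is to fix a distance threshold on the training data and flag anything beyond it:

// Hypothetical threshold: the 100th-largest training distance, so roughly
// the 100 most extreme training points would have been flagged
val threshold = distances.top(100)(
  Ordering.by[(Double, Array[Double]), Double](_._1)).last._1

// Flag a new, already-normalized point that is farther than the threshold
// from every cluster center
def isAnomaly(datum: Array[Double]): Boolean =
  distance(centroids(model.predict(datum)), datum) > threshold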
From Here to Production?
• Real data set!
• Algorithmic improvements
  • Other distance metrics
  • k-means|| init (see the sketch below)
  • Use data point IDs
• Real-time
  • Spark Streaming?
  • Storm?
• Continuous pipeline
• Visualization
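On the k-means|| point: MLlib’s KMeans can be asked for this smarter initialization directly; a minimal sketch (the k and data simply reuse the values from above):

import org.apache.spark.mllib.clustering.KMeans

val kmeans = new KMeans()
kmeans.setK(95)
// k-means|| (a parallel analogue of k-means++) spreads the initial centers
// out, which typically converges faster and to better clusterings than
// random initialization
kmeans.setInitializationMode(KMeans.K_MEANS_PARALLEL)
val model = kmeans.run(normalizedData)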