BOTs or not? A case study on bot recognition from web session log
Alberto Cabri - email: [email protected] student
DIBRIS - University of Genoa
June 6, 2017
Our aim. . .
- analyse the usage logs of a web site to verify whether it is possible to distinguish legitimate human users from bots using computational intelligence
- identify bot sessions by taking the earliest possible decision on the online (real-time) HTTP requests of a still incomplete session
What is a BOT?
- Bots are programs that perform specific actions on computers connected to a network, without any intervention by a human user
- Bot is short for Web Robot, also known as Autonomous Internet Agent
- Statistics report that more than half of the traffic of a web site is due to bots [Zeifman - 2016]
- They can be good or malicious (the latter account for more than half of the bot traffic)
Other definitions
- A session is a sequence of web server requests that can be associated with a single IP address or user
- HTTP is the HyperText Transfer Protocol, defined at the application level and used to exchange network resources in a client-server computing model
- A logfile, or simply log, is a file that records specific events that occur on a system
Bot types and operation modes - 1
- The primary application for bots is web crawling, typically for indexing purposes
- Social bots are used to post automatically generated messages on platforms like Twitter or Telegram
- Conversational bots are aimed at automatically interacting with humans using natural language
- Wikipedia bots perform routine maintenance tasks, such as adding templates and replacing text
[Microsoft bot framework]
Bot types and operation modes - 2
- All these categories are collaborative agents, so-called ethical bots, and usually comply with the directives in the robots.txt file
- The main drawback of ethical bots is the increase in network traffic, which is usually kept under control by the bots themselves
Bot types and operation modes - 3
- Malicious bots are used to perform harmful or fraudulent activities
- They can impersonate different user-agents or mimic human behaviour
- Botnets can be created to operate from different IP addresses, with control handed over to a supervisor, called the herder [Goodman - 2017]
- They are increasingly used to gain an undue advantage in online business [Invalid Clicks]
The recognition problem
Telling bots apart from humans can be formalized as the following two problems:
- Offline Bot Recognition: given a set of HTTP requests from a web session, label it as BOT or NON-BOT [Suchacka - 2014 and 2015]
- Online Bot Recognition: given an incoming stream of HTTP requests from a web session, detect BOTs as soon as possible, before the sequence ends (if doable)
The offline problem
- it is basically a classification problem, as the sessions can be regarded as sets that are entirely available when the decision is taken
- the analysis is based on a set of descriptive summary features, extracted from web site logs
Our dataset consists of more than 13500 sessions
The online problem
- the requests must be considered as time-ordered sequences of descriptive features
- a correlation between two subsequent requests is likely to exist
- the shortest possible decision time is required to minimize the negative impact of bots
We’re now getting into the online problem.
Features pre-processing - 1
In the log file, a request record is a set of features as shown below:
Feature                  Sample value   Description
Interarrival time (ms)   38             time interval between two subsequent requests of the same session
HTTP method              GET            indicates the desired action to be performed for a given resource
HTTP code                200            response status code; codes are divided into five classes
Size (KB)                1.43           volume of data transferred in the response, in kilobytes
Empty referrer           True           boolean value indicating whether the web page requesting the resource is unknown
is embedded              False          boolean value indicating if the requested resource is an embedded object
is graphic               False          boolean value indicating if the requested resource is a graphic file
is style                 False          boolean value indicating if the requested resource is a stylesheet
is datafile              False          boolean value indicating if the requested resource is a file with data
is script                False          boolean value indicating if the requested resource is a script
session #                2              incremental value for session identification: not used in BOT detection
Table: Online request features
Features pre-processing - 2
To improve the classification results, the original features must be transformed as follows:
- for each boolean feature, True becomes 1 and False becomes 0
- the categorical features (namely HTTP method and HTTP code) are one-hot encoded
After encoding, the initial 10 feature columns (excluding the session id) become 25 input features, which are fed to the neural network.
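The encoding step can be illustrated with a minimal Python sketch. The category vocabularies, the choice of encoding the HTTP code by its class, and all field names below are assumptions made for illustration only: the slides do not list the categories actually used, so the resulting vector length here (16) differs from the 25 inputs reported above.

```python
from typing import Dict, List, Sequence

# Hypothetical vocabularies: the real category lists are not given in the slides.
HTTP_METHODS: Sequence[str] = ("GET", "POST", "HEAD")
HTTP_CODE_CLASSES: Sequence[int] = (1, 2, 3, 4, 5)  # 1xx .. 5xx (assumed encoding by class)
BOOLEAN_FEATURES: Sequence[str] = (
    "empty_referrer", "is_embedded", "is_graphic",
    "is_style", "is_datafile", "is_script",
)


def one_hot(value: object, vocabulary: Sequence[object]) -> List[float]:
    """Return a vector with 1.0 at the position of `value`, 0.0 elsewhere."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def encode_request(record: Dict[str, object]) -> List[float]:
    """Turn one request record into a flat numeric feature vector."""
    # Numeric features are passed through unchanged.
    features = [float(record["interarrival_ms"]), float(record["size_kb"])]
    # Categorical features are one-hot encoded.
    features += one_hot(record["method"], HTTP_METHODS)
    features += one_hot(int(record["code"]) // 100, HTTP_CODE_CLASSES)
    # Boolean features: True -> 1, False -> 0.
    features += [1.0 if record[name] else 0.0 for name in BOOLEAN_FEATURES]
    return features


# Sample values taken from the feature table (the session id is left out).
sample = {
    "interarrival_ms": 38, "size_kb": 1.43, "method": "GET", "code": 200,
    "empty_referrer": True, "is_embedded": False, "is_graphic": False,
    "is_style": False, "is_datafile": False, "is_script": False,
}
vector = encode_request(sample)
```

With these assumed vocabularies the vector has 2 numeric, 3 + 5 one-hot and 6 boolean entries.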
The challenge
Online bot classification is complex because:
- sessions have variable length
- bots may change their navigation style
- there is no a-priori information on user-agent strings to identify bots
- a reliable decision should be taken as-you-go, without acquiring the entire sequence at the beginning of the decision process
- samples must be processed one at a time, sequentially, therefore we can assume they are correlated and time-dependent
The approach - 1
Q. Is it really a sequential classification problem?
A. Some heuristic tests were performed on the input dataset to consolidate our understanding of the data structure; experimental results show that a simple MLP can classify individual samples with an accuracy above 95% and up to 99.80%.
This suggests that the core information about BOT requests is intrinsic to each request and largely independent of the sequence.
Great result!
The approach - 2
- consider the sequential nature of the samples to improve classification results
- three outputs are possible for each observed sample:
  1: the crawler is a BOT; no further observations are sampled
  0: the crawler is human; no further observations are sampled
  None: no decision is taken at present; it is deferred to future samples
- learning is supervised with an MLP, using a leave-one-out training scheme
- the decision is taken on the posterior probability of each class, according to a sequential probability ratio criterion [Wald - 1945]
The sequential probability ratio - 1
At step $t$, $f_1(x_t)$ and $f_0(x_t)$ are the class-conditional probabilities of the current observation $x_t$ for BOTs and humans respectively.
Assumptions
- known probabilities (output by the MLP)
- mutual independence of the observations (relaxed constraint)
The sequential probability ratio is:
\[
\frac{p_1(t)}{p_0(t)}
= \frac{f_1(x_1)\, f_1(x_2) \cdots f_1(x_t)}{f_0(x_1)\, f_0(x_2) \cdots f_0(x_t)}
= \prod_{i=1}^{t} \frac{f_1(x_i)}{f_0(x_i)}
\qquad (1)
\]
Equation (1) has been transformed into logarithmic form to avoid numerical problems (a long product of probabilities can underflow).
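Taking the logarithm of Equation (1) turns the product into a running sum, which is presumably the transformation meant here. The symbol $\Lambda_t$ is introduced only for this sketch and does not appear in the slides.

```latex
\[
\Lambda_t = \log \frac{p_1(t)}{p_0(t)}
          = \sum_{i=1}^{t} \bigl[ \log f_1(x_i) - \log f_0(x_i) \bigr]
          = \Lambda_{t-1} + \log f_1(x_t) - \log f_0(x_t),
\qquad \Lambda_0 = 0
\]
```

In this form each new request only adds one term to the accumulated value, so the ratio can be updated online.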
The sequential probability ratio - 2
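Assigning two thresholds, $C_0$ and $C_1$, for humans and BOTs respectively, the classifier outputs:
\[
\text{output}(t) =
\begin{cases}
1 & \text{if } \dfrac{p_1(t)}{p_0(t)} \ge C_1 \\[1ex]
0 & \text{if } \dfrac{p_1(t)}{p_0(t)} \le C_0 \\[1ex]
\text{None} & \text{if } C_0 < \dfrac{p_1(t)}{p_0(t)} < C_1
\end{cases}
\qquad (2)
\]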
If None is still the output when the session ends, the decision is based on the higher of the two sequential probabilities, $p_1(t)$ or $p_0(t)$.
Note: our implementation uses symmetrical thresholds, $C_1 > 0$ and $C_0 = 1/C_1$.
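The decision rule of Equation (2) can be sketched in Python, working in logarithmic form. This is an illustrative sketch, not the implementation used for the slides: the function name `sprt_decide`, the default threshold, the `eps` clipping and the tie-breaking at session end are assumptions, and the sketch requires $C_1 > 1$ so that $C_0 = 1/C_1 < C_1$.

```python
import math
from typing import Iterable, Optional, Tuple


def sprt_decide(
    probs: Iterable[Tuple[float, float]],
    c1: float = 20.0,      # assumed value; the slides do not report the threshold used
    eps: float = 1e-12,    # assumed clipping to avoid log(0)
) -> Tuple[Optional[int], int]:
    """Run the sequential test on a stream of (f1, f0) pairs.

    Returns (decision, number_of_requests_consumed), where decision is
    1 for BOT, 0 for human, or None if the stream was empty.
    """
    if c1 <= 1.0:
        raise ValueError("this sketch assumes c1 > 1")
    upper = math.log(c1)   # log C1
    lower = -upper         # log C0, with C0 = 1 / C1
    log_ratio = 0.0
    consumed = 0
    for f1, f0 in probs:
        consumed += 1
        # Accumulate the log of the sequential probability ratio.
        log_ratio += math.log(max(f1, eps)) - math.log(max(f0, eps))
        if log_ratio >= upper:
            return 1, consumed   # BOT: stop sampling
        if log_ratio <= lower:
            return 0, consumed   # human: stop sampling
    if consumed == 0:
        return None, 0
    # Session ended without crossing a threshold: pick the more likely class
    # (ties are resolved as human here, which is an arbitrary choice).
    return (1 if log_ratio > 0.0 else 0), consumed
```

For example, `sprt_decide([(0.9, 0.1)] * 3, c1=20.0)` returns `(1, 2)`: the second request already pushes the ratio to 81, above the threshold of 20, so the third request is never examined.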
The classifier architecture
Figure: MLP with SPRT
Modified multi-layer perceptron from Andrej Karpathy, with an additional SPRT output module, as shown in the figure.
- MLP geometry defined by successive refinements to find the optimal model
- cross-entropy loss function for training
- tanh as activation function
- softmax for the output probabilities
- adaptive learning rate
Classification results
Initial tests over 1000 sessions with MLP, without SPRT, with session request grouping and initial learning rate LR = 0.0001:
#hlayers   #hunits   final LR         Session accuracy   Sample accuracy
2          13        3.3 · 10^-6      80.96%             99.07%
2          25        1.56 · 10^-6     74.46%             99.12%
2          50        7.8 · 10^-7      85.41%             99.80%
3          50        7.8 · 10^-7      87.04%             99.19%
Initial tests over 1000 sessions with MLP, SPRT and leave-one-out:
Sample Accuracy 100% TOO GOOD TO BE TRUE
Preliminary results to be verified on all sessions available!
The end
Thank you for your attention.
Contact: [email protected]
Cross Entropy Loss
It is an objective function well suited to evaluating neuron learning.
If $y$ is the target value and $\hat{y}$ the estimated output, then
\[
H(y, \hat{y}) = -\sum \bigl[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\bigr]
\]
Properties
- non-negative
- if the neuron output is close to the target, the cross-entropy is close to zero
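The two properties can be checked numerically with a short Python sketch of the formula above. The function name and the `eps` clipping, which keeps the logarithm finite when an output is exactly 0 or 1, are assumptions added for this illustration.

```python
import math
from typing import Sequence


def binary_cross_entropy(
    targets: Sequence[float],
    outputs: Sequence[float],
    eps: float = 1e-12,  # assumed clipping, not part of the formula on the slide
) -> float:
    """Sum of -[y log(y_hat) + (1 - y) log(1 - y_hat)] over all outputs."""
    total = 0.0
    for y, y_hat in zip(targets, outputs):
        # Keep y_hat strictly inside (0, 1) so both logarithms are defined.
        y_hat = min(max(y_hat, eps), 1.0 - eps)
        total -= y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat)
    return total


# Outputs close to the targets give a loss close to zero ...
good = binary_cross_entropy([1.0, 0.0], [0.99, 0.01])
# ... while outputs far from the targets give a large loss.
bad = binary_cross_entropy([1.0, 0.0], [0.01, 0.99])
```

Here `good` is roughly 0.02 and `bad` is roughly 9.2, and both are non-negative.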
Hyperbolic Tangent
It can be expressed as:
\[
\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}
\]
and its derivative is:
\[
\tanh'(z) = 1 - \tanh^{2}(z)
\]
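As a quick sanity check, the closed-form derivative can be compared with a numerical one. The sketch below is only illustrative; the function names and the step size `h` are assumptions.

```python
import math
from typing import Callable


def tanh(z: float) -> float:
    """Hyperbolic tangent written out from its exponential definition."""
    ez, emz = math.exp(z), math.exp(-z)
    return (ez - emz) / (ez + emz)


def tanh_prime(z: float) -> float:
    """Closed-form derivative: 1 - tanh^2(z)."""
    return 1.0 - tanh(z) ** 2


def numerical_derivative(f: Callable[[float], float], z: float, h: float = 1e-6) -> float:
    """Central finite difference, used here only to check the closed form."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


z0 = 0.5
gap = abs(tanh_prime(z0) - numerical_derivative(tanh, z0))
```

The value of `gap` should be tiny, which confirms that the two expressions agree at `z0`. For large `|z|` the explicit exponentials overflow, so `math.tanh` is preferable in practice.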
Softmax
Activation function for the MLP output layer that squashes a K-dimensional vector $z$ of arbitrary real values into a K-dimensional vector $\sigma(z)$ of real values in the range (0, 1] that add up to 1.
\[
\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}
\]
where $z_j = \sum_i w_{ji} x_i + b_j$, with $x$ the neuron inputs, $w$ the input weights and $b$ the bias.
The $\sigma(z)$ values can be interpreted as output class probabilities.
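A minimal Python sketch of softmax follows. Subtracting the maximum before exponentiating is a common numerical safeguard and is an addition of this sketch, not something stated on the slide; it does not change the result because the common factor cancels in the ratio.

```python
import math
from typing import List, Sequence


def softmax(z: Sequence[float]) -> List[float]:
    """Map K real values to K positive values that sum to 1."""
    shift = max(z)  # assumed stabilisation step: avoids overflow in exp()
    exps = [math.exp(value - shift) for value in z]
    total = sum(exps)
    return [e / total for e in exps]


# Two-class case, as in BOT vs human: the larger input gets the larger probability.
p_bot, p_human = softmax([2.0, 0.0])
```

With two outputs, `p_bot` and `p_human` can play the role of the class probabilities used in the sequential probability ratio.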