Rule-Based On-the-fly Web Spambot Detection Using Action Strings

Post on 12-Jul-2015

RULE-BASED ON-THE-FLY WEB SPAMBOT DETECTION USING ACTION STRINGS

Pedram Hayati (p.hayati@curtin.edu.au), Vidyasagar Potdar, Alex Talevski & William F. Smyth

What is Spam 2.0

“Propagation of unsolicited, anonymous, mass content to infiltrate legitimate web 2.0 applications”

www.antispamresearchlab.com

How does Spam 2.0 work

Web Spambots (Spambot): a web crawler/tool that navigates the WWW with the sole purpose of planting unsolicited content on external Web 2.0 applications.

How is Spam 2.0 currently managed

- Flood control, nonce, Hash-Cash, email validation
- Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA)

Problem

CAPTCHA
- Decreases human users' convenience
- Computers are becoming powerful enough to decipher it

Content-based solutions (opinion spam, social spam, video spam, etc.)
- Each is focussed on one particular form of spam
- Do not yield satisfactory results

Solution Idea

Main assumption: human web usage behaviour is intrinsically different from spambot behaviour.

Web usage data
- User click-stream
- Widely used
- Two additional attributes: session ID and username

Solution: Action

Action
- Models web usage data as a behavioural model
- A set of user efforts to achieve a certain purpose
- A suitable discriminative feature for modelling user behaviour
- Extendible to many other Web 2.0 platforms

Example: the "register a new user account" action
1. User navigates to the registration page
2. User fills in the registration form fields
3. User clicks the submit button

Solution: Action String

Action String: a sequence of actions in alphabetical format, i.e. each action is represented by a letter.
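The step from click-stream records to action strings could be sketched roughly as follows. This is only an illustration under stated assumptions: the action names, the letter assigned to each action, and the `build_action_strings` helper are hypothetical and do not come from the slides; the only details taken from the talk are that actions are written as letters and that records carry a session ID and a username.

```python
from collections import defaultdict

# Hypothetical action-to-letter alphabet; the slides do not list the real mapping.
ACTION_ALPHABET = {
    "view_registration_page": "A",
    "submit_registration_form": "B",
    "view_topic": "C",
    "post_reply": "D",
}

def build_action_strings(events):
    """Group (session_id, username, action) events and encode each group as a string.

    Events are assumed to arrive in chronological order.
    """
    strings = defaultdict(str)
    for session_id, username, action in events:
        strings[(session_id, username)] += ACTION_ALPHABET[action]
    return dict(strings)

# Two interleaved sessions (made-up data).
events = [
    ("s1", "alice", "view_registration_page"),
    ("s2", "bot42", "view_registration_page"),
    ("s1", "alice", "submit_registration_form"),
    ("s2", "bot42", "submit_registration_form"),
    ("s2", "bot42", "post_reply"),
    ("s1", "alice", "view_topic"),
]
action_strings = build_action_strings(events)
```

Keying on the pair of session ID and username keeps interleaved visitors apart, so each visitor gets one growing string.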

Solution: Trie

Trie
- A way to store and retrieve information in the form of a tree structure
- Ease of updating and handling
- Shorter access time
- Removes redundancies

We store action strings in a Trie data structure for fast on-the-fly pattern matching.
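A minimal sketch of a trie holding labelled action strings is shown below. It is not the authors' implementation: the class names and the per-node spambot/human counters are assumptions made for illustration. It shows why shared prefixes are stored once and why a lookup costs one step per action.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # letter -> TrieNode
        self.spam_count = 0   # training spambot strings passing through this node
        self.human_count = 0  # training human strings passing through this node

class ActionTrie:
    """Hypothetical trie of action strings with per-prefix class counts."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, action_string, is_spambot):
        node = self.root
        for letter in action_string:
            node = node.children.setdefault(letter, TrieNode())
            if is_spambot:
                node.spam_count += 1
            else:
                node.human_count += 1

    def lookup(self, action_string):
        """Return (spam_count, human_count) for the prefix, or None if unseen."""
        node = self.root
        for letter in action_string:
            node = node.children.get(letter)
            if node is None:
                return None
        return node.spam_count, node.human_count

trie = ActionTrie()
trie.insert("ABD", is_spambot=True)
trie.insert("ABDD", is_spambot=True)
trie.insert("ABC", is_spambot=False)
```

Here the prefix "AB" is stored once and shared by all three strings, which is the redundancy removal the slide refers to.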

Solution: Framework

Performance Measurement

Matthews Correlation Coefficient (MCC)
- Among the best performance measures for binary classification
- Considers true and false positives and negatives, and returns a value between -1 and +1
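For reference, the standard MCC definition can be computed from the four confusion-matrix counts as sketched below. The function name `mcc` and the choice to return 0.0 when the denominator is zero are conventions assumed here, not details taken from the slides.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero (a common convention,
    assumed here because the ratio is otherwise undefined).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0
```

A value of +1 corresponds to perfect prediction, 0 to a prediction no better than random, and -1 to total disagreement between prediction and observation.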

Experiment

Data set
- No publicly available collection
- Spambot data from our HoneySpam 2.0 project
- Human data from an active forum
- 16,594 entries: 11,039 spambot records and 5,555 human records

Test
- Five random datasets (DS1 to DS5)
- 2/3 for building the Trie structure
- 1/3 for testing
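One way such a split could be produced is sketched below. The slides do not say how DS1 to DS5 were drawn, so the reshuffle-and-cut approach, the `make_random_splits` helper, the seed, and the toy records are all assumptions for illustration.

```python
import random

def make_random_splits(records, n_datasets=5, seed=0):
    """Return n_datasets (train, test) pairs, each a fresh random 2/3 : 1/3 split."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    splits = []
    for _ in range(n_datasets):
        shuffled = records[:]
        rng.shuffle(shuffled)
        cut = (2 * len(shuffled)) // 3  # first two thirds build the trie
        splits.append((shuffled[:cut], shuffled[cut:]))
    return splits

# Toy records: (action_string, is_spambot); the values are made up.
records = [(f"string{i}", i % 3 != 0) for i in range(30)]
splits = make_random_splits(records)
```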

Experiment: On-The-Fly Detection

- Simulates real-world practice, where user action strings grow over time
- The system creates action strings as the actions happen
- Make a window over the test action strings
  - Run our classifier
  - Increase the window's size
- Aim: identify a spambot in the least number of actions

Figure (not reproduced): a window over the action string A B C D E F G
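The growing-window procedure on this slide might look roughly like the sketch below. Several things are assumed rather than taken from the talk: the window is treated as a prefix of the action string, a score above the threshold is read as "spambot", and the scoring function is a hypothetical stand-in (here a plain dictionary lookup) for whatever the classifier actually computes.

```python
def detect_on_the_fly(action_string, score_prefix, threshold=0.0,
                      min_window=2, max_window=10):
    """Return the smallest window size at which the prefix is flagged, else None.

    score_prefix is a hypothetical callable mapping a prefix to a score,
    or to None when the prefix was never seen in training.
    """
    last = min(max_window, len(action_string))
    for window in range(min_window, last + 1):
        score = score_prefix(action_string[:window])
        if score is not None and score > threshold:
            return window  # flagged after this many actions
    return None

# Toy scores for illustration only.
TOY_SCORES = {"AB": 0.0, "ABD": 0.4, "ABC": -0.3}
first_flag = detect_on_the_fly("ABDD", TOY_SCORES.get)
```

Returning as soon as one window is flagged reflects the stated aim of identifying a spambot in as few actions as possible.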

Experiment: Results

- Window size ranges from 2 to 10 characters
- Threshold ranges from -0.05 to 0.05

Experiment: Discussion

- The system predicts better as the user uses the system over time
- Performance remains the same beyond a certain window size
- Datasets are randomly selected
- The same happens for the accuracy of results

Conclusion

- A quite young area of research
- Current work: focussed on one particular type of spam
- Our aim: detect web spambots as a source of spam problems on the Web 2.0 platform
  - Based on web usage behaviour
  - Formulated into Actions => Action Strings
  - On-the-fly detection using a Trie
- Result: average accuracy of 93%

THANK YOU!

Pedram Hayati (p.hayati@curtin.edu.au)
Vidyasagar Potdar (v.potdar@curtin.edu.au)
Alex Talevski (a.talevski@curtin.edu.au)
William F. Smyth (smyth@mcmaster.ca)

http://www.antispamresearchlab.com
http://debii.curtin.edu.au
http://www.curtin.edu.au

Appendix: Related Works

- Tan et al.: web robot navigational patterns, such as session length and the set of visited webpages, are different from those of humans.
- Park et al.: malicious web robot detection based on the types of requests for web objects and the existence of mouse/keyboard activity.
- Göbel et al.: interaction with spam botnet controllers.
- Yu et al. and Yiqun et al.: distinguish spam webpages from legitimate webpages by employing user web access logs.

Appendix: Future Works

- Compare different performance measurement techniques.
- Develop an adaptive solution.
- Experiment on different platforms (e.g. datasets).