The Research on Analyzing Time-Series Data and Anomaly Detection in Internet Flow

Yoshiaki HARADA

Graduate School of Information Science and Electrical Engineering (ISEE)

Kyushu University

Contents

Background
Purpose
Background Knowledge
  AS and Internet routing
  Property of Internet Flow
Analysis method
Progress of this research
Conclusion and Future Work

Background

The Internet is growing into a global information infrastructure
  Always-on connections from laptop PCs, cellular phones, etc.
  Many services, such as music and video delivery
  Telemedicine and distance learning

A reliable Internet system is required

To manage a reliable Internet infrastructure, we need to grasp the trends of flows in the Internet

Background

It is difficult to grasp the trends of Internet flows
  The amount of flow is increasing as the Internet develops
  A lot of garbage traffic, such as DDoS attacks and illegal accesses, flows through the Internet
  Physical hazards occur, such as electrical power failures and router failures

Expert engineers are required to manage the Internet system
  This takes a great deal of time and effort

Purpose

A method is required that automatically detects anomalies and trends in Internet flows
  There is much research on macro analysis of Internet flows
  Because Internet flows are complicated, it is difficult to grasp detailed biases and anomalies

I propose a micro analysis method that segments network flows by port number, AS number, area information, country, etc.
  Flow Data can be analyzed in detail
  Fewer false alarms can reduce management cost

I propose detecting anomalies in network traffic and visualizing them

Background knowledge

AS (Autonomous System)
  A collection of IP networks and routers under the control of one entity (or sometimes more) that presents a common routing policy to the Internet.
  Examples: an Internet Service Provider (ISP), a very large organization

AS numbers are currently 16-bit integers, which allow for a maximum of 65536 assignments.

[Figure: four autonomous systems (AS:1 to AS:4) and a router]

BGP table

BGP
  BGP is the core routing protocol of the Internet
  It works by maintaining a table of IP networks, or 'prefixes', which designate network reachability among autonomous systems (AS).

We find the destination AS number by looking up the prefix

   Network           Next Hop         Metric LocPrf Weight Path
*>i3.0.0.0           210.138.15.145              300      0 2497 2497 701 703 80 i
*>i4.0.0.0           210.138.15.145              300      0 2497 2497 3356 i
*>i4.23.112.0/22     210.138.15.145              300      0 2497 2497 174 21889 i
*>i4.23.180.0/24     210.138.15.145              300      0 2497 2497 701 6128 30576 i

Network column: reachable prefix (IP address)
Last AS number in Path: destination AS number
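
The prefix lookup can be sketched in Python as a longest-prefix match. This is only an illustration under stated assumptions: the table holds just the four entries from the excerpt, entries printed without a mask are assumed to be classful /8 networks, and `lookup_destination_as` is a hypothetical name, not the author's program.

```python
import ipaddress

# (prefix, AS path) pairs copied from the BGP table excerpt.
# Assumption: entries shown without a mask (3.0.0.0, 4.0.0.0) are /8 networks.
BGP_TABLE = [
    ("3.0.0.0/8", [2497, 2497, 701, 703, 80]),
    ("4.0.0.0/8", [2497, 2497, 3356]),
    ("4.23.112.0/22", [2497, 2497, 174, 21889]),
    ("4.23.180.0/24", [2497, 2497, 701, 6128, 30576]),
]


def lookup_destination_as(ip, table=BGP_TABLE):
    """Return the last AS of the longest matching prefix, or None if no prefix matches."""
    addr = ipaddress.ip_address(ip)
    best_net, best_path = None, None
    for prefix, path in table:
        net = ipaddress.ip_network(prefix)
        # Keep the most specific (longest) prefix that contains the address.
        if addr in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_path = net, path
    return best_path[-1] if best_path else None
```

A linear scan is enough for a sketch; a full BGP table would normally call for a trie or a sorted prefix index.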

Flow-Data

Flow Data
  is a collection of unidirectional packets used by the same application
  is exported by routers
  includes information such as source and destination IP addresses, port numbers, and number of packets
  is enormous in quantity, so we use sampled data

[Figure: example of Flow Data (Kyushu University)]

Analysis method

We propose building the database hierarchically to improve scalability

I export the Flow Data and the BGP routing information held on the server, and derive the AS number for each flow. I build a database containing the necessary data (AS number, port number, number of packets, etc.).

I categorize the database by country, area, and port number. I sort the database and calculate correlations for the data whose trends we want to see.

I refer to the categorized database and visualize it. I run calculations on the database and detect anomalies.

[Figure: processing stages: analyzing traffic, categorize, visualize, anomaly detection]
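
One possible reading of the hierarchical database is a fine-grained table of flow counts that is rolled up to coarser levels on demand. The sketch below assumes that reading. The AS-to-country and country-to-area tables are hypothetical placeholders (the slides do not say how these mappings are obtained), and `build_database` and `roll_up` are illustrative names.

```python
from collections import Counter

# Hypothetical lookup tables, for illustration only.
AS_TO_COUNTRY = {2497: "JP", 3356: "US", 4766: "KR"}
COUNTRY_TO_AREA = {"JP": "Asia", "KR": "Asia", "US": "North America"}


def build_database(flows):
    """Count flows per (area, country, port); flows is an iterable of (dest_as, port)."""
    db = Counter()
    for dest_as, port in flows:
        country = AS_TO_COUNTRY.get(dest_as, "unknown")
        area = COUNTRY_TO_AREA.get(country, "unknown")
        db[(area, country, port)] += 1
    return db


def roll_up(db, level):
    """Aggregate the fine-grained counts by one key: 0 = area, 1 = country, 2 = port."""
    out = Counter()
    for key, count in db.items():
        out[key[level]] += count
    return out
```

Keeping only the finest-grained counts and deriving the coarser views from them avoids rescanning the raw Flow Data for every new question.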

Analysis method – BGP table and Flow Data

I use a BGP table collected from QGPOP and Flow Data collected from Kyushu University

Flow Data
  For each sampled day, I analyze the data collected during minutes 0-5 of every hour
  The sampling rate is 10%

[Figure: network topology connecting Kyushu University, QGPOP, SINET (information communication network dedicated to academic research), KOREN (Korea Advanced Research Network), IIJ (Internet Initiative Japan), and universities and research institutes; the BGP table and the Flow Data collection points are marked]

Analysis method 1

Detailed analysis and categorization
  I assign an AS number to each IP address by referring to the BGP table and the Flow Data.
  I categorize the Flow Data by port number (purpose of communication), country, and area information (Asia, Europe, etc.).

I analyze the distribution of port numbers in each country.
  The distribution may be unbiased (spread out) for countries that frequently make illegal accesses, because illegal accesses use various (random) port numbers.
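
One way to put a number on how spread out a port distribution is, offered here as an assumption rather than the measure used in this research, is Shannon entropy. A country whose flows concentrate on a few ports scores low, and one whose flows scatter over many ports scores high. `port_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter


def port_entropy(ports):
    """Shannon entropy (in bits) of the observed port numbers; 0.0 for an empty list."""
    counts = Counter(ports)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum -p * log2(p) over the relative frequency p of each observed port.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

For example, eight flows all on port 80 give an entropy of 0 bits, while four flows on four different ports give 2 bits.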

Time change of number of flows in Asia

Most of the traffic is exchanged with Japan, and the number of flows for Japan increases over the year.

This figure shows the change over time in the number of flows for the top 5 countries, in decreasing order of volume

Time change of number of flows in Asia

This figure shows the change over time in the number of flows for the top 4 countries, in decreasing order of volume, excluding Japan.

The number of flows for China increases over the year.

Analyzing the distribution of port numbers
  I analyze the distribution of port numbers used together with port 53 flows.
  I analyze which destination port numbers are accessed by a host that has accessed the DNS server
    The host is identified by its IP address in the Flow Data

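[Figure: a host accesses the DNS server on port 53 and other destinations on other ports (port:??, port:XX); the observed ports are stored in the database]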

            port number
Date        20    22    25      53      80      443    well-known  registered  private and dynamic
2007/01/04  504   76    21757   27179   25066   1294   51077       15011       3519
...         ...   ...   ...     ...     ...     ...    ...         ...         ...
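
The column grouping of this table can be reproduced with the IANA port ranges (well-known 0-1023, registered 1024-49151, dynamic/private 49152-65535). The sketch below assumes that the "well-known" column counts the well-known ports that do not have a column of their own. `port_column` and `tally_ports` are hypothetical names.

```python
from collections import Counter

# Ports that have their own column in the table.
NAMED_PORTS = (20, 22, 25, 53, 80, 443)


def port_column(port):
    """Map a port number to the table column it would be counted under."""
    if not 0 <= port <= 65535:
        raise ValueError(f"invalid port number: {port}")
    if port in NAMED_PORTS:
        return str(port)
    if port <= 1023:
        return "well-known"          # assumption: remaining well-known ports
    if port <= 49151:
        return "registered"
    return "private and dynamic"


def tally_ports(ports):
    """Count how many observed ports fall into each column."""
    return Counter(port_column(p) for p in ports)
```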

The distribution of port 53 flows and port 25 flows

Flow Data from every Wednesday, 2007/01/04 to 02/22 (hourly)

Horizontal axis: number of port 25 flows
Vertical axis: number of port 53 flows

The number of port 53 flows increases with the number of port 25 flows (positive correlation)
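
The strength of such a relationship can be summarized with the Pearson correlation coefficient. The slides do not state which coefficient is used, so this is a minimal sketch under that assumption, taking two equally long lists of hourly flow counts. `pearson` is a hypothetical helper name.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A value near +1 corresponds to the positive correlation described on this slide; a value near 0 would indicate no linear relationship.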

Analysis method 2

Anomaly detection
  We work on the database compiled from the Flow Data
  We smooth the data with an exponential smoothing method to make visualization easier
  Flow Data are periodic (daily or weekly), so we use the Holt-Winters method

Anomaly detection

Data smoothing
  For long-term analysis of Flow Data, I use the Exponentially Weighted Moving Average (EWMA) method.
    EWMA applies weighting factors that decrease exponentially: the weight of each older data point decreases exponentially
  Flow Data are periodic, so we adopt the Holt-Winters method for short-term analysis.
    The Holt-Winters method extends EWMA to periodic data

EWMA, where $Y_i$ is the observed value and $\hat{Y}_i$ the smoothed prediction:

$\hat{Y}_i = \alpha Y_{i-1} + (1 - \alpha)\,\hat{Y}_{i-1}$

Holt-Winters, with level $a_t$, trend $b_t$, seasonal component $c_t$, and period $m$:

$\hat{Y}_{t+1} = a_t + b_t + c_{t+1-m}$

$a_t = \alpha\,(Y_t - c_{t-m}) + (1 - \alpha)(a_{t-1} + b_{t-1})$

$b_t = \beta\,(a_t - a_{t-1}) + (1 - \beta)\,b_{t-1}$

$c_t = \gamma\,(Y_t - a_t) + (1 - \gamma)\,c_{t-m}$
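
Both smoothers can be sketched in a few lines of Python. The code uses the standard additive Holt-Winters recurrences, in which the level update subtracts the seasonal term and the trend update uses the change in level. The initialization (first prediction seeded with the first observation; level set to the mean of the first cycle, zero trend, seasonal terms set to deviations from that mean) is an assumption, since the slides do not specify it, and the function names are illustrative.

```python
def ewma_forecast(series, alpha):
    """One-step-ahead EWMA predictions; pred[i] depends only on observations before i."""
    pred = [series[0]]  # assumption: seed with the first observation
    for i in range(1, len(series)):
        pred.append(alpha * series[i - 1] + (1 - alpha) * pred[i - 1])
    return pred


def holt_winters_forecast(series, m, alpha, beta, gamma):
    """Additive Holt-Winters one-step-ahead predictions (None for the first cycle)."""
    if len(series) < m:
        raise ValueError("need at least one full cycle of data")
    # Assumed initialization from the first cycle.
    level = sum(series[:m]) / m
    trend = 0.0
    season = [y - level for y in series[:m]]
    preds = [None] * m
    for t in range(m, len(series)):
        # season[t % m] still holds c_{t-m} here, so this is a_{t-1} + b_{t-1} + c_{t-m}.
        preds.append(level + trend + season[t % m])
        y = series[t]
        prev_level = level
        level = alpha * (y - season[t % m]) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        season[t % m] = gamma * (y - level) + (1 - gamma) * season[t % m]
    return preds
```

For hourly flow counts with a daily cycle, `m` would be 24. On a perfectly periodic series the predictions reproduce the series from the second cycle onward.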

Anomaly detection

I smooth the Flow Data using the EWMA or Holt-Winters method, and calculate thresholds.
  When a value exceeds the threshold, I consider that point an anomaly

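[Figure: number of flows (vertical axis) against time (horizontal axis) over 1 cycle (one day), showing the high threshold level, the low threshold level, the threshold area between them, and an anomaly outside it]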
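
The slides do not say how the threshold is calculated. As one hedged possibility, the band can be centered on the smoothed (or predicted) value with a half-width of `k` times the mean absolute residual. Both the rule and the default `k = 3.0` are assumptions, and `threshold_band` and `find_anomalies` are hypothetical names.

```python
def threshold_band(series, baseline, k=3.0):
    """Return (low, high) threshold pairs around the baseline values."""
    residuals = [abs(y - b) for y, b in zip(series, baseline)]
    # Assumed rule: half-width is k times the mean absolute residual.
    width = k * sum(residuals) / len(residuals)
    return [(b - width, b + width) for b in baseline]


def find_anomalies(series, band):
    """Indices of the points that fall outside the threshold area."""
    return [i for i, (y, (low, high)) in enumerate(zip(series, band))
            if y < low or y > high]
```

Because the residuals of the anomalies themselves widen the band, a more careful version would estimate the width from a period known to be normal.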

Visualization

I develop a tool that detects anomalies and visualizes them
  The tool should be able to analyze only the specific Flow Data selected by the user (port number, country, etc.)
    Some kinds of Internet traffic carry a large number of packets, such as port 8000 (DVTS)

We want to grasp the trends not only of all Flow Data but also of Flow Data restricted to a certain country, AS, or port number. The tool should be versatile.

Conclusion and future work

Implementation of Flow Data analysis
  The program that categorizes Flow Data by country, AS number, and port number is complete
  I will develop a program to find the correlation between port numbers

Anomaly detection and visualization
  Smooth the database produced by the analysis program, calculate the thresholds, and detect anomalies in the Flow Data
  Develop a tool that visualizes not only all data and anomalies, but also the data selected by the user
  Conduct a verification experiment on Flow Data that includes an electrical power failure