    Analysis and Design for Intrusion Detection System Based on Data Mining

    Duanyang Zhao, Qingxiang Xu, Zhilin Feng

    Zhijiang College of Zhejiang University of Technology

    Hangzhou, Zhejiang Province, 310024, China

    {sunny, xqx, fengzl}@zjc.zjut.edu.cn

    Abstract: Network and host Intrusion Detection Systems (IDS) have become a standard component in security infrastructures. Because intrusive behavior is variable, complicated, and uncertain, these systems still face many unresolved problems. Host-based and network-based approaches each have strengths and weaknesses, and a truly effective intrusion detection system will employ both technologies. We discuss the differences between host- and network-based intrusion detection techniques to demonstrate how the two can work together to provide more effective intrusion detection and protection. We propose a hybrid IDS that combines network and host IDS with anomaly and misuse detection modes, uses auditing programs to extract an extensive set of features describing each network connection or host session, and applies data mining programs to learn rules that accurately capture the behavior of intrusions and normal activities.

    Keywords: intrusion detection; hybrid IDS; data mining; analysis engine; Apriori algorithm

    I. INTRODUCTION

    The Apriori algorithm in data mining finds attribute values that frequently appear together in a given data set. It can mine the relationships between attribute values in a database table, which makes it a suitable method for intrusion detection systems.

    The most representative research worldwide is that of Wenke Lee's research group at Columbia University, beginning in 1998 [1][2]. Supported by funding from the Defense Advanced Research Projects Agency (DARPA) and the National Science Foundation (NSF), the group focused its research on this area. Since then, the IDS research group of the Department of Computer Science, led by Professor Salvatore J. Stolfo, has carried out extensive studies on data mining-based IDS, dividing the research into twelve sub-topics. Their research is among the most advanced in the world. The SANS (SysAdmin, Audit, Network, Security) Institute has also done outstanding work in this area [3].

    In recent years, both the Chinese Academy of Sciences (CAS) and key universities and colleges in China have been actively carrying out research in this area [4][5]. With support from the Development Project of National Key Basic Research and the Major Projects Fund of the CAS Knowledge Innovation Project, Dr. Xu Jing of the Computing Center, Research Institute of High Energy Physics, CAS, made a preliminary implementation of an intrusion detection system based on data mining. With support from the National Natural Science Fund, doctoral students in the computer science departments of Nanjing University of Science, Wuhan University, Northern Jiaotong University, and other key universities carried out similar research.

    Analysis of backdoor programs, which hackers use to control target hosts, shows that they cause unexpected connection records on the network. Because of the huge amount of data processed by the network, the number of connection records that remain after filtering is still very large, and every newly established connection adds another record. Therefore, we cannot achieve intrusion detection simply by comparing connection records.

    In recent years, the use of data mining in intrusion detection systems has attracted more and more attention, but many problems remain. For example, it is difficult to set a clear standard for selecting test data; the results mined from the experimental data contain large amounts of useless information; and it is unclear how to express the rules mined from the experimental data for use in an intrusion detection system.

    The rest of this paper is organized as follows. The second section describes the framework of the hybrid intrusion detection system. The third section presents the experimental design and the results of the Apriori algorithm. Finally, we draw conclusions and discuss future work.

    II. THE FRAMEWORK OF HYBRID IDS

    Intrusion detection technology is a new security support mechanism that monitors the network system, without affecting network performance, to prevent internal and external attacks and misuse. Intrusion detection systems can be classified in several ways. By the object being monitored, they are divided into host-based, network-based, and hybrid IDS; by system architecture, into centralized and distributed IDS; and by detection type, into anomaly-based and misuse-based IDS.

    The hybrid IDS in this paper combines misuse and anomaly detection engines, uses data mining algorithms to process vast amounts of security audit data, and generates detection models and test models separately from the network data and the host system calls, as shown in Fig. 1.

    2010 Second International Workshop on Education Technology and Computer Science

    978-0-7695-3987-4/10 $26.00 2010 IEEE

    DOI 10.1109/ETCS.2010.478

    Figure 1. The hybrid IDS based on data mining algorithms

    The hybrid IDS consists of four parts: the data warehouse, the sensors, the analysis engine, and the alarm system.

    A. Data warehouse

    A data warehouse is a subject-oriented, integrated, time-related collection of data that supports the management decision-making process; the technology also supports multi-process and multi-threaded operation. Many commercial DBMSs provide this function. The project uses the data warehouse technology of SQL Server 2005, which includes Analysis Services. It makes it easy to set up a data warehouse and achieve distributed computing, provides OLE DB controls and ADO (ActiveX Data Objects) technology, and has a flexible data model. These features can improve the speed of data mining and of the analysis engine.

    A benefit of data warehouse technology is that different components can asynchronously process the same piece of data stored in the database. The data warehouse is therefore the heart of the whole system, holding its data and models.

    B. Sensors

    Sensors are closely tied to the operating system, typically Windows or UNIX/Linux. This paper uses the Windows system as the example when describing the sensor techniques.

    1) Host Sensors

    They gather information on monitored hosts from a variety of sources, such as application logs, security logs and event logs, running applications, and registry changes.

    Once the audit features are set up, Windows Server monitors various states of the system and writes them to logs. Using Windows API functions, we develop programs that monitor the system logs, running applications, and registry changes and send this information to the Host Sensor Manager of the analysis engine for analysis.

    We use hook functions to intercept API calls.

    Hooks are an important part of the Windows message-processing mechanism. By installing hooks, an application can register subroutines that monitor system message traffic. Before messages reach their destinations, these subroutines intercept them and analyze them according to the user's requirements. Hooks are divided into thread-specific hooks and global hooks. A thread-specific hook monitors a specified thread, while a global hook monitors all threads in the system. For global hooks, the hook functions must reside in a separate dynamic-link library (DLL) so that they can be called from the associated applications.

    A hook function is a mechanism that lets an application monitor message flows and process certain types of messages before they reach the destination window. For example:

    The process installs a WH_GETMESSAGE hook to check each window message in the system. It installs the hook by calling the SetWindowsHookEx function as follows:

    HHOOK hHook = SetWindowsHookEx(WH_GETMESSAGE, GetMsgProc, hinstDLL, 0);

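    Here the parameter WH_GETMESSAGE indicates the type of hook to be installed, GetMsgProc is the address of the hook procedure that the system calls when a message is processed, and hinstDLL is the handle of the DLL that contains the GetMsgProc function. The final argument is the thread identifier; passing 0 associates the hook with all threads, which makes it a global hook.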

    2) Network Sensors

    Using the netstat tool of the Windows system, network sensors collect information about the network connections established between computers. The netstat command can list all open ports on a computer. We could design a program that runs netstat at regular intervals and outputs the results, but this approach adds to the load on the system: on a relatively busy system, one day of records may reach several GB.

    Therefore, we optimize the program so that it captures network connection information more cheaply. It first lists all open ports, then monitors whether a port is newly opened and when it is closed, records only the port information that has changed, and outputs the records to the Network Sensor Manager in the Analysis Engine. The records include the port service, port number, activation time, time stamp, and so on.
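
The change-only recording described above can be sketched as a comparison of two successive port snapshots. This is a minimal Python illustration, not the authors' sensor: the function name `diff_ports`, the `(protocol, port)` tuple, and the event dictionary layout are all assumptions, and the snapshots are passed in directly rather than parsed from netstat output.

```python
from typing import Dict, List, Set, Tuple

Port = Tuple[str, int]  # hypothetical representation: (protocol, port number)


def diff_ports(previous: Set[Port], current: Set[Port], timestamp: float) -> List[Dict]:
    """Return one event per port that was opened or closed between two snapshots."""
    events = []
    # Ports present now but not before were newly opened.
    for proto, port in sorted(current - previous):
        events.append({"event": "opened", "protocol": proto, "port": port, "time": timestamp})
    # Ports present before but not now were closed.
    for proto, port in sorted(previous - current):
        events.append({"event": "closed", "protocol": proto, "port": port, "time": timestamp})
    return events


# Example: port 25 opens while port 80 stays open, so only one record is emitted.
example_events = diff_ports({("tcp", 80)}, {("tcp", 80), ("tcp", 25)}, 100.0)
```

Because unchanged ports produce no events, the volume of records grows with the number of port changes and not with the polling frequency.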

    [Figure 1 labels: Network Sensor; Host Sensor; Sensor-1, Sensor-2, ..., Sensor-m; Sensor-1, Sensor-2, ..., Sensor-n; Analysis Engine (Network Sensor Manager, Host Sensor Manager, Misuse Detector, Anomaly Detector, Pattern Mining, Mining Algorithm Library); Data Warehouse; Alarm Message; Alarm System (Alarm Manager, Alarm Strategy, System Protection Strategy, Intruder Tracing, Archive Information).]

    C. Analysis Engine

    The Analysis Engine consists of three parts: the Network/Host Sensor Managers, the Misuse and Anomaly Detectors, and the Mining Algorithm Library with Pattern Mining.

    1) The Sensor Manager receives data from the sensors, analyzes the data, translates them into database records, and stores them in the data warehouse.

    2) Misuse and Anomaly Detection detects intrusions by matching against the patterns stored in the data warehouse.

    Traditional IDS is divided into two separate types: misuse detection and anomaly detection. Anomaly detection, also known as behavior-based detection, builds behavioral models of users under normal circumstances in a learning phase, then compares the current user behavior with the existing behavioral models and reports an intrusion if the deviation is greater than a credibility threshold. The basic principle is that any behavior that is not consistent with the known normal behaviors is treated as an intrusion.

    Misuse detection, also called knowledge-based intrusion detection, builds intrusion patterns for the known intrusions, then matches the current user behavior and system status against the existing intrusion patterns. The basic principle is that any behavior that is consistent with a known intrusion pattern is treated as an intrusion.

    We integrate these two models into the hybrid IDS, which yields new basic principles of intrusion detection: a behavior is normal if it is consistent with the normal behavior model; a behavior is an intrusion if it is consistent with the anomaly behavior model; and any other behavior is passed to the Pattern Mining module, which uses the Mining Algorithm Library to generate a new detection model and adds it to the detection models in the data warehouse. When comparing an unknown behavior with the normal/anomaly behavior models, the detectors decide whether it is normal or anomalous by comparing the calculated support and confidence with a given minimum support and minimum confidence.
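
The three-way decision described above can be sketched as follows. This is a hedged Python illustration under several assumptions that are not in the paper: a behavior is modeled as a set of attribute values, each model is a list of `(items, support, confidence)` rules, a rule "matches" when all of its items occur in the record and its support and confidence reach the minimums, and the names `Rule`, `matches`, `decide`, and the sample rules are hypothetical.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical rule layout: (items that must all be present, support, confidence).
Rule = Tuple[FrozenSet[str], float, float]


def matches(record: FrozenSet[str], rules: List[Rule],
            min_supp: float, min_conf: float) -> bool:
    """True if some sufficiently supported and confident rule is contained in the record."""
    return any(items <= record and supp >= min_supp and conf >= min_conf
               for items, supp, conf in rules)


def decide(record: FrozenSet[str], normal_rules: List[Rule], intrusion_rules: List[Rule],
           min_supp: float = 0.05, min_conf: float = 1.0) -> str:
    """Classify a behavior as 'normal', 'intrusion', or 'unknown' (sent on to pattern mining)."""
    if matches(record, normal_rules, min_supp, min_conf):
        return "normal"
    if matches(record, intrusion_rules, min_supp, min_conf):
        return "intrusion"
    return "unknown"


# Illustrative models only; the intrusion rule is invented for the example.
normal_model = [(frozenset({"192.168.2.10", "80", "active"}), 0.232, 1.0)]
intrusion_model = [(frozenset({"31337", "sf"}), 0.06, 1.0)]
```

Records that come back as "unknown" are the ones the text hands to the Pattern Mining module to produce new detection models.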

    3) The Mining Algorithm Library and Pattern Mining are used for mining unknown intrusions.

    From the point of view of the data warehouse, data mining can be regarded as an advanced stage of online analytical processing (OLAP). We apply data mining technology to IDS, using its association analysis and sequential pattern analysis algorithms to extract security-related features, generate classification models based on them, and identify security incidents automatically. The analytical methods of data mining can be divided into three parts:

    a) Association analysis

    Its purpose is to uncover hidden relationships among the data. Based on the correlation among a set of items, association analysis can be used to identify correlations between intrusion behaviors.

    The basic definitions of association analysis are as follows.

    Let $I=\{i_1, i_2, \ldots, i_m\}$ be a set of binary attributes whose elements are called items. Let $D$ be a set of transactions, where each transaction $T$ is a set of items with $T \subseteq I$. For a set of items $X \subseteq I$, transaction $T$ contains $X$ if $X \subseteq T$.

    An associational rule is an implication of the form $X \Rightarrow Y$, where $X \subset I$, $Y \subset I$, and $X \cap Y = \emptyset$. The support of the rule $X \Rightarrow Y$ in the transaction set $D$ is the ratio of the number of transactions that contain both $X$ and $Y$ to the number of all transactions, denoted by $\mathrm{Support}(X \Rightarrow Y)$, that is:

    $$\mathrm{Support}(X \Rightarrow Y) = \frac{|\{T : X \cup Y \subseteq T,\ T \in D\}|}{|D|}$$

    The confidence of the rule $X \Rightarrow Y$ in the transaction set $D$ is the ratio of the number of transactions that contain both $X$ and $Y$ to the number of transactions that contain $X$, denoted by $\mathrm{Confidence}(X \Rightarrow Y)$, that is:

    $$\mathrm{Confidence}(X \Rightarrow Y) = \frac{|\{T : X \cup Y \subseteq T,\ T \in D\}|}{|\{T : X \subseteq T,\ T \in D\}|}$$
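
As a worked illustration of the two ratios just defined, the following Python sketch computes them over a small set of transactions. The function names and the four toy transactions are assumptions made for the example; they are not data from the experiment.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], x: FrozenSet[str], y: FrozenSet[str]) -> float:
    """Fraction of all transactions that contain every item of X and of Y."""
    both = x | y
    return sum(1 for t in transactions if both <= t) / len(transactions)


def confidence(transactions: List[Transaction], x: FrozenSet[str], y: FrozenSet[str]) -> float:
    """Among the transactions that contain X, the fraction that also contain Y."""
    both = x | y
    n_x = sum(1 for t in transactions if x <= t)
    n_xy = sum(1 for t in transactions if both <= t)
    return n_xy / n_x if n_x else 0.0


# Toy transactions: each is the set of attribute values of one connection record.
toy_db = [
    frozenset({"80", "sf", "normal"}),
    frozenset({"80", "sf", "normal"}),
    frozenset({"25", "sf", "normal"}),
    frozenset({"80", "rej", "attack"}),
]
X = frozenset({"80", "sf"})
Y = frozenset({"normal"})
```

Here two of the four transactions contain both X and Y, so the support of the rule is 2/4, and both transactions that contain X also contain Y, so its confidence is 2/2.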

    Given a transaction set D, the task of association analysis is to generate the associational rules whose support and confidence reach the minimum support (minsupp) and the minimum confidence (minconf) given by the users.

    Agrawal et al. posed this problem in 1993 and designed a basic algorithm for it (Apriori). The algorithm has been improved considerably in recent years, and the project applies the latest algorithms for pattern mining.
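
For orientation, the basic level-wise Apriori search for frequent itemsets can be sketched as below. This is a textbook-style Python illustration and not the improved algorithms the project applied, which the paper does not detail; the function name `apriori` and the toy transactions are assumptions.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def apriori(transactions: List[FrozenSet[str]], min_support: float) -> Dict[FrozenSet[str], float]:
    """Return every itemset whose support is at least min_support, with its support."""
    n = len(transactions)
    # Level 1 candidates: every single item that occurs in the data.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent: Dict[FrozenSet[str], float] = {}
    k = 1
    while current:
        # Count each candidate and keep those that reach the minimum support.
        level = {}
        for candidate in current:
            supp = sum(1 for t in transactions if candidate <= t) / n
            if supp >= min_support:
                level[candidate] = supp
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-itemset candidates.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {c for c in joined
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent


# Same style of toy data as a connection-record table (illustrative only).
toy_transactions = [
    frozenset({"80", "sf", "normal"}),
    frozenset({"80", "sf", "normal"}),
    frozenset({"25", "sf", "normal"}),
    frozenset({"80", "rej", "attack"}),
]
frequent_sets = apriori(toy_transactions, min_support=0.5)
```

Rules are then read off the frequent itemsets by splitting each one into a left and a right part and keeping the splits whose confidence reaches minconf.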

    b) Sequence pattern analysis

    Like association analysis, its purpose is to uncover relationships among the data, but it focuses on the order in which the data occur. Many hacker intrusion behaviors are ordered, and some actions must occur after others. For example, a hacker generally scans the ports of a system before attacking it.
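
The ordering idea can be illustrated with a small Python check for whether the steps of a pattern occur in a given order within an event stream. This only tests one known pattern and is not a sequence-mining algorithm; the function name and the event labels are hypothetical.

```python
from typing import Iterable, Sequence


def occurs_in_order(events: Iterable[str], pattern: Sequence[str]) -> bool:
    """True if every step of the pattern appears in the events in the same relative order."""
    remaining = iter(events)
    # Each membership test consumes the iterator up to the matched event,
    # so later steps can only match events that come after earlier ones.
    return all(step in remaining for step in pattern)


# Illustrative event stream: a port scan followed later by an exploit attempt.
stream = ["login", "port_scan", "idle", "exploit"]
```

With this stream, the pattern `["port_scan", "exploit"]` is found, while the reversed pattern is not, because no port scan follows the exploit.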

    c) Classification analysis

    Assume a collection of records and a set of tags, where each tag denotes a category with distinct characteristics. We assign a tag to each record, that is, we classify the records by tag. We then examine the tagged records and describe their characteristics. For example, intrusions can be divided into three categories by how harmful they are: fatal intrusions, general intrusions, and weak intrusions. Classification analysis examines previous intrusions, assigns each a risk level, and then describes each level according to the classification standards.

    The Bayesian classification algorithm is as follows.

    Each connection record is described by an n-dimensional feature vector $X=(x_1, x_2, \ldots, x_n)$, where the n attributes describe n characteristics of the connection record.

    Assume that there are m categories $C_1, C_2, \ldots, C_m$. Given an unknown connection record $X$ (that is, one with no tag), the classifier predicts that $X$ belongs to the category with the highest posterior probability. Namely, the Bayesian classifier assigns the unknown connection record $X$ to the category $C_i$ if and only if $P(C_i|X) > P(C_j|X)$, $1 \le j \le m$, $j \ne i$. According to Bayes' theorem, $P(C_i|X)=P(X|C_i)P(C_i)/P(X)$.
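
A classifier of this kind can be sketched in Python as below. The sketch relies on the frequency-count estimates and the attribute-independence assumption that the text goes on to introduce, uses no smoothing (the paper mentions none), and all names and the four training records are assumptions for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, ...]


def train(records: List[Record], labels: List[str]):
    """Count records per category (s_i) and attribute values per category (s_ik)."""
    class_counts = Counter(labels)
    value_counts: Dict[Tuple[str, int], Counter] = defaultdict(Counter)
    for record, label in zip(records, labels):
        for k, value in enumerate(record):
            value_counts[(label, k)][value] += 1
    return class_counts, value_counts


def classify_record(record: Record, class_counts, value_counts) -> str:
    """Return the category that maximizes P(X|Ci) * P(Ci) under attribute independence."""
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, s_i in class_counts.items():
        score = s_i / total  # prior P(Ci) = s_i / s
        for k, value in enumerate(record):
            score *= value_counts[(label, k)][value] / s_i  # P(x_k|Ci) = s_ik / s_i
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data: (protocol, port, flag) records with hypothetical category tags.
train_records = [("tcp", "80", "sf"), ("tcp", "80", "sf"), ("tcp", "25", "sf"), ("tcp", "80", "rej")]
train_labels = ["normal", "normal", "normal", "attack"]
model = train(train_records, train_labels)
```

Because $P(X)$ is the same for every category, the sketch compares only the products $P(X|C_i)P(C_i)$; an attribute value never seen in a category drives that category's score to zero.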

    For any category, $P(X)$ is a constant, so we only need to find the greatest value of $P(X|C_i)P(C_i)$. The prior probability of a category is $P(C_i)=s_i/s$, where $s_i$ is the number of connection records in the category $C_i$ and $s$ is the total number of connection records.

    To reduce the overhead of calculating $P(X|C_i)$, we assume that the attributes are independent given the category, so that $P(X|C_i)=\prod_{k=1}^{n}P(x_k|C_i)$, where $P(x_k|C_i)=s_{ik}/s_i$, $s_{ik}$ is the number of connection records in the category $C_i$ that have the value $x_k$, and $s_i$ is the number of connection records in the category $C_i$.

    To classify an unknown connection record, we calculate $P(X|C_i)P(C_i)$ for each category $C_i$ and assign the connection record $X$ to category $C_i$ if and only if $P(X|C_i)P(C_i) > P(X|C_j)P(C_j)$, $1 \le j \le m$, $j \ne i$.

    Although the algorithms suit different scenarios, we use them in combination in this project to improve the performance of the IDS.

    D. Alarm system

    The main function of the alarm system is to take emergency measures based on the alarm strategies, such as appropriate system protection, archiving, and intrusion tracing when necessary. The details are omitted here.

    III. THE EXPERIMENT FOR ASSOCIATION ANALYSIS

    A. The design of associational rule detector

    Association analysis in data mining is divided into two parts: the learning phase of the rules and the detecting phase, in which the learnt rules are applied.

    1) In the learning phase, the Analysis Engine applies association analysis to the connection records from the Network Sensor Manager to mine the associations between the values of data items under the normal state of the network and obtain the associational rule set. The rule set is then filtered by some manually defined rules so that it is suitable as the detection rule set when detecting intrusions.

    2) In the detecting phase, the Analysis Engine gets the connection records from the Network Sensor Manager and matches them against the detection rule set to determine whether an intrusion has taken place. The process of matching detection rules is the association analysis of the detecting phase.

    The detection rule set produced in the learning phase is the core of the Analysis Engine. Its validity has a direct impact on the accuracy of the detection results.

    B. The experimental results for associational rules

    Our experimental data come from a file of network packets recorded by the TCPdump tool.

    We convert the network packets into the format of connection records and save them as a text file in which the variables are separated by spaces.

    Association analysis builds the rule sets from the connection records, with the minimum support set to 5% and the minimum confidence set to 100%. However, the rule sets contain a large number of useless rules, which do not express meaningful associations between the values of connection attributes. If we used them as the standard for monitoring network intrusions, the system would make misdetections. Therefore, we have to remove the useless rules as follows:

    First, filter out the rules in which the right items do not contain the categories;

    Then filter out the rules in which the left items do not contain the IP and ports.

    After these filtering steps, the final associational rules are closer to ideal. A few of the associational rules are shown in Table I.

    TABLE I. ASSOCIATIONAL RULES

    The left items of rule        | The right items of rule | Support (%) | Confidence (%)
    192.168.7.13 80 sf            | normal                  | 7.8         | 100.0
    192.168.4.16 25 passive exter | normal                  | 8.3         | 100.0
    192.168.2.10 80 active        | normal                  | 23.2        | 100.0
    192.168.4.18 tcp 25 sf        | normal                  | 8.5         | 100.0
    192.168.7.23 80 active        | normal                  | 19.4        | 100.0
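
The two filtering steps of Section III.B can be sketched in Python as follows. The paper does not say how categories, IP addresses, or ports are recognized, so the `CATEGORIES` set, the dotted-quad pattern, and the "all digits means port" test are assumptions made for this illustration; the sample rules reuse left and right items from Table I.

```python
import re
from typing import List, Tuple

CATEGORIES = {"normal", "intrusion"}  # hypothetical set of category tags
IP_PATTERN = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")  # simple dotted-quad check

RuleItems = Tuple[List[str], List[str]]  # (left items, right items)


def keep_rule(left: List[str], right: List[str]) -> bool:
    """Keep a rule only if its right side names a category and its left side has an IP and a port."""
    has_category = any(item in CATEGORIES for item in right)
    has_ip = any(IP_PATTERN.fullmatch(item) for item in left)
    has_port = any(item.isdigit() for item in left)  # assumption: a bare number is a port
    return has_category and has_ip and has_port


def filter_rules(rules: List[RuleItems]) -> List[RuleItems]:
    """Apply both filtering steps to a mined rule set."""
    return [(left, right) for left, right in rules if keep_rule(left, right)]


# Two rules taken from Table I plus two invented rules that should be discarded.
mined_rules = [
    (["192.168.7.13", "80", "sf"], ["normal"]),
    (["192.168.4.18", "tcp", "25", "sf"], ["normal"]),
    (["tcp", "sf"], ["normal"]),            # no IP or port on the left
    (["192.168.7.13", "80"], ["sf"]),       # no category on the right
]
kept_rules = filter_rules(mined_rules)
```

Only the first two rules survive, which matches the shape of the rules reported in Table I: an IP address and a port on the left, a category on the right.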

    IV. CONCLUSIONS

    The hybrid IDS is efficient at detecting known and unknown intrusions. Research on intrusion detection based on data mining is one of the hot research topics at home and abroad. There is still a series of theoretical and practical problems to be resolved, and a number of key technologies require further in-depth study. The experiment shows that designing and implementing an efficient and accurate IDS based on data mining is a large, complex project.

    When applying data mining algorithms to the original connection records, the key question is how to obtain the corresponding frequent patterns effectively. In the future, we will focus on how to select appropriate and representative original data and how to filter out useless rules precisely.

    ACKNOWLEDGMENT

    The work has been supported by the Natural Science Foundation of Zhejiang Province, China (No. Y1080343).

    REFERENCES

    [1] W. Lee and S. J. Stolfo. Data mining approaches for intrusion detection, In Proceedings of the 7th USENIX Security Symposium, San Antonio, TX, January 1998.

    [2] W. Lee and S. J. Stolfo. A data mining framework for building intrusion detection models, In Proceedings of the 1999 IEEE Symposium on Security and Privacy, Oakland, CA, May 1999.

    [3] http://www.sans.org/resources/idfaq/data_mining.php?printer=Y, 2003.4

    [4] Chinese Academy of Sciences (CAS). Network IDS technology in CAS reached the international advanced level, in Chinese. Retrieved from http://www.cas.cn/jzd/jcx/jcxlc/200204/t20020403_1034832.shtml

    [5] Xu Jing, Liu Baoxu and Xu Rongsheng. Design and implementation of data mining-based IDS, in Chinese, Computer Engineering, Beijing. 2002, 28(6), pp. 9-10, 169.