A statistical approach to reduce malware inside an Information System in Banking Sector

Gustavo A Valencia-Zapata, M.Sc. Candidate in Statistics, and Juan C Salazar-Uribe, Ph.D.

School of Statistics, Universidad Nacional de Colombia-Sede Medellín

[email protected] * www.gustavovalencia.com, [email protected]

WORLDCOMP'12

Paper: A statistical approach to reduce malware inside an Information System in Banking Sector

Paper: CART for Handling Missing Values in a CMDB. Application: Malware inside an Information System in Banking Sector

The research question

How can malware incidence be decreased in an Information System (IS)?

Just as treatments (medicine, vaccines, therapies, etc.) must be applied in human epidemiology, the equivalent treatment in a computer environment is the antivirus scan.

How can antivirus scans (medical tests) be dosed in our population (computer network) to reduce the incidence of malware (diseases) in a banking IS?

Currently the bank scans all computers once a week. This research aims to change that policy: for example, some computers would be scanned once a week, others twice a week or once a month.

First stages

In this research the first stages in building the model are information extraction (IE), handling of missing values, and statistical analysis. The main information source is the bank's antivirus software. The secondary information sources are web filtering, HCM (Human Capital/Resource Management), and the CMDB (Configuration Management Database).

CMDB

TABLE I
CMDB PARAMETERS

Variable           Meaning/value                     Type      Unit
Class              Laptop, desktop or server         Nominal   NA
Brand              Computer brand                    Nominal   NA
Computer_Age       Operating time                    Scale     Week
Processor_Type     Type of computer processor        Nominal   NA
Processor_Clock    Speed of the computer processor   Scale     GHz
Processors         Number of processors              Integer   Count
Memory (RAM)       Memory size                       Scale     GB
Operation_System   Operating system (OS)             Nominal   NA
Service_Pack       Updates to an OS                  Nominal   NA
Hard_Disk          Hard disk size                    Scale     GB

About 18.22% of the CMDB data (for infected computers) are missing. Classification and Regression Trees (CART) are used to handle these missing values (imputation) and so avoid losing valuable information.

CART

The CART (Classification and Regression Trees) method was proposed by Breiman et al. [1]. The decision trees produced by CART are strictly binary, with exactly two branches at each decision node. CART recursively partitions the records into subsets with similar values for the target attribute.

Classification tree for the target variable Class. Each node lists its share of all records, its size n, and the percentage and count of each category:

Node 0 (root; 100.00%, n = 3151): CPU 70.74% (2229), Laptop 28.94% (912), Server 0.32% (10)
Split on Processor_Clock (Improvement = 0.31)
    Node 1 (Processor_Clock <= 2.56; 34.43%, n = 1085): CPU 16.41% (178), Laptop 83.50% (906), Server 0.09% (1)
    Split on Processor_Type (Improvement = 0.06)
        Node 3 (Processor_Type in P01, P02, P03, P04, P05, P06, P07, P09, P11, P12, P13, P14, P15, P16, P17, P18, P19, P20, P21, P22, P24, P25, P26, P27, P28, P30, P31, P32, P33, P34, P35, P36; 26.98%, n = 850): CPU 1.06% (9), Laptop 98.94% (841), Server 0.00% (0)
        Node 4 (Processor_Type in P08, P10, P23, P29, P37; 7.46%, n = 235): CPU 71.91% (169), Laptop 27.66% (65), Server 0.43% (1)
        Split on Memory (Improvement = 0.03)
            Node 9 (Memory <= 3.36; 5.43%, n = 171): CPU 98.83% (169), Laptop 0.58% (1), Server 0.58% (1)
            Node 10 (Memory > 3.36; 2.03%, n = 64): CPU 0.00% (0), Laptop 100.00% (64), Server 0.00% (0)
    Node 2 (Processor_Clock > 2.56; 65.57%, n = 2066): CPU 99.27% (2051), Laptop 0.29% (6), Server 0.44% (9)

[1] Breiman et al., "Classification and Regression Trees", 1984.

Table II shows the variables for computer number 0022. Three of its ten variables have missing values.

TABLE II
COMPUTER 0022 - CMDB PARAMETERS

Variable           Value     Unit
Class              Missing   NA
Brand              Missing   NA
Computer_Age       Missing   Week
Processor_Type     P27       NA
Processor_Clock    2.19      GHz
Processors         2         Count
Memory (RAM)       2.14      GB
Operation_System   SO_7      NA
Service_Pack       SP_3      NA
Hard_Disk          80.02     GB


Variable: Class

Node 0 indicates that the CPU (desktop) category has the highest probability (0.71) of being selected if a random imputation is conducted. The Laptop category has a smaller probability (0.29), and the Server category has a negligible probability (about 0.003).



Computer 0022 has Processor_Clock = 2.19 GHz (<= 2.56), so it falls into Node 1, where the Laptop category has the highest probability (0.835) of being selected.



The Processor_Type of computer 0022 (P27) belongs to the Node 3 branch, where the Laptop category has the highest probability (0.989) of being selected.

As a consequence, the Class variable of computer number 0022 is imputed as Laptop.
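The path followed by computer 0022 can be sketched in a few lines of Python. This is only an illustration: the split points and category lists are copied from the tree on the slides, while the function name, the argument names, and the choice to return the most frequent category of the terminal node are assumptions made here. Any processor type outside the Node 4 list is routed to Node 3, which is a simplification.

```python
# Processor types that lead to Node 4 in the tree shown on the slides.
NODE4_PROCESSOR_TYPES = {"P08", "P10", "P23", "P29", "P37"}


def impute_class(processor_clock, processor_type, memory):
    """Return (terminal node, most frequent Class in that node).

    Hypothetical helper: it hard-codes the splits of the published tree
    instead of fitting CART to data.
    """
    if processor_clock > 2.56:
        return "Node 2", "CPU"      # 99.27% CPU
    if processor_type not in NODE4_PROCESSOR_TYPES:
        return "Node 3", "Laptop"   # 98.94% Laptop
    if memory <= 3.36:
        return "Node 9", "CPU"      # 98.83% CPU
    return "Node 10", "Laptop"      # 100.00% Laptop


# Computer 0022 from Table II: 2.19 GHz, processor type P27, 2.14 GB of RAM.
print(impute_class(2.19, "P27", 2.14))  # ('Node 3', 'Laptop')
```

With the values of Table II the sketch ends in Node 3 and returns Laptop, the same imputation reached on the slides.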

Evaluating model prediction

A two-sided McNemar test [2] was used to evaluate the imputation. The null hypothesis is that CART does not change the Class values after imputation.

TABLE III
CHI-SQUARE TEST

                  Value   Exact Sig. (two-sided)
McNemar Test      1.058   0.392
No. Valid Cases   7049

Note: binomial distribution used.

According to this analysis we cannot reject the null hypothesis; that is, CART does not change the Class values after imputation (p-value = 0.392).

[2] Conover, "Practical Nonparametric Statistics", 1999.

In this case E_Class is the imputed value and Class is the real value. For instance, 5013 (99.6%) computers with Class equal to CPU (desktop) were classified correctly by CART, and 2002 (99.3%) computers with Class equal to Laptop were classified correctly by the same tree.

TABLE IV
CONTINGENCY TABLE: CLASS BY E_CLASS

                  E_Class
Class      CPU     Laptop   Total
CPU        5013    20       5033
Laptop     14      2002     2016
Total      5027    2022     7049
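The McNemar test depends only on the two discordant cells of Table IV (20 CPU computers imputed as Laptop and 14 Laptop computers imputed as CPU). The following minimal sketch, with a hypothetical function name, recomputes the chi-square value without continuity correction and the exact two-sided p-value from the binomial distribution; it is not the software used by the authors.

```python
from math import comb


def mcnemar(b, c):
    """McNemar test from the two discordant counts of a 2x2 table."""
    n = b + c
    # Chi-square statistic without continuity correction.
    statistic = (b - c) ** 2 / n
    # Exact two-sided p-value: under H0 the smaller discordant count
    # follows a Binomial(n, 0.5) distribution.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return statistic, min(1.0, 2 * tail)


statistic, p_value = mcnemar(b=20, c=14)
print(round(statistic, 3), round(p_value, 3))  # 1.059 0.392
```

The result agrees with Table III: the p-value is 0.392 and the statistic matches up to rounding of the last digit.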

Antivirus Scanning Dosage Statistics Model

For example, according to the chi-squared test for independence, Malware_Level and USB are independent; that is, in our case computers have the same levels of malware whether USB ports are disabled or enabled. Nevertheless, disabling USB ports remains a recommended security measure, because it is an effective strategy for preventing information leakage.

The Kaplan-Meier method is used to estimate the survival function from lifetime data. To use it we define the following outcome: elapsed time to the first malware infection in a computer.

Survival curves show, for each time plotted on the X axis, the proportion of all computers surviving (not yet infected) at that time.
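As an illustration of how such a curve is obtained, the sketch below implements the Kaplan-Meier product-limit estimator in plain Python. The weeks and infection flags are invented for the example and are not the bank's data; a computer that was never infected during the observation period is treated as censored.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(time, S(time))] at each time with at least one event.

    durations: weeks until first infection or until observation ended.
    observed:  True if the infection was observed, False if censored.
    """
    events = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)          # events plus censored, per time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            survival *= 1 - d / at_risk  # product-limit step
            curve.append((t, survival))
        at_risk -= exits[t]              # leave the risk set after time t
    return curve


# Hypothetical data: 8 computers, 3 of them censored.
weeks = [2, 3, 3, 5, 8, 8, 10, 12]
infected = [True, True, False, True, True, True, False, False]
curve = kaplan_meier(weeks, infected)
for week, s in curve:
    print(week, round(s, 3))
```

Each printed pair is one step of the survival curve: the week of an infection and the estimated proportion of computers still uninfected just after it.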

Hosmer Jr, D.W. and Lemeshow, "Applied Survival Analysis: Regression Modeling of Time to Event Data", 1999.

Figure: Survival function, Kaplan-Meier curves by USB status (X axis: Week; Y axis: Cum Survival)

Group 0 = computers with USB disabled

Group 1 = computers with USB enabled

The log-rank test indicates that there are no significant differences between these groups; that is, computers have the same levels of malware whether USB ports are disabled or enabled.
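A two-group log-rank comparison can be sketched as follows. The function name and the example data are hypothetical, and the p-value uses the chi-square distribution with one degree of freedom, written with the complementary error function; this is a teaching sketch rather than the procedure run on the bank's data.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    event_times = sorted(
        {t for t, e in zip(times_a, events_a) if e}
        | {t for t, e in zip(times_b, events_b) if e}
    )
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)   # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)   # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if e and x == t)
        d_b = sum(1 for x, e in zip(times_b, events_b) if e and x == t)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n        # observed minus expected in A
        if n > 1:
            variance += n_a * n_b * d * (n - d) / (n * n * (n - 1))
    statistic = obs_minus_exp ** 2 / variance if variance > 0 else 0.0
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical weeks to first infection; every infection is observed.
same = logrank_test([2, 4, 6], [True] * 3, [2, 4, 6], [True] * 3)
different = logrank_test([1, 2, 3, 4, 5], [True] * 5,
                         [10, 11, 12, 13, 14], [True] * 5)
print(same, different)
```

Identical groups give a statistic of zero, while groups whose infections occur at clearly different times give a small p-value.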

Figure: Survival function, Kaplan-Meier curves by Computer_Age (X axis: Week; Y axis: Cum Survival)

Group 0 = "Young" computers (1 to 165 weeks)

Group 1 = "Adult" computers (166 to 248 weeks)

Group 2 = "Old" computers (> 248 weeks)

Group 2 (computers with more than 248 weeks of operating time) showed statistically significant differences when compared with the other groups. That is, the "Old" computers were infected more slowly than the other groups.

Conclusion and Future work

In this study we believe that malware level depends on variables such as:

Processors (number of processors in the computer)

Computer_Age (operating time)

Browse_Time (web surfing time)

Class (laptop, desktop or server)

Future directions of this work include performing additional statistical analyses, such as recurrence analysis, and formulating survival models through Cox models. This will also make it possible to identify significant variables for optimizing the malware scanning policy in an IS, as well as to measure their effect sizes.

Acknowledgment

The authors thank Juan Carlos Correa from the School of Statistics of the Universidad Nacional de Colombia at Medellín for helpful feedback that contributed to improving this research. The authors also thank the Security Team of the Bank Company for their continuous encouragement and support.

Many thanks to Universidad Nacional de Colombia-Sede Medellín for helping us to achieve these goals.

References

Weiguo J, "Applying Epidemiology in Computer Virus Prevention: Prospects and Limitations", 2010. Thesis, Computer Science, University of Auckland.

Bailey, N.J.T, "The Mathematical Theory of Infectious Diseases and Its Applications", 1975. New York: Oxford University Press.

Kephart J, and White S, "Directed-Graph Epidemiological Models of Computer Viruses", IEEE Computer Symposium on Research in Security and Privacy, Proceedings, pp. 343-359, May 1991.

Kephart J, and White S, "Measuring and Modeling Computer Virus Prevalence", Proceedings, 1993 IEEE Computer Society Symposium on Research in Security and Privacy, pp. 2-15, May 1993.

Kephart, J, "How Topology Affects Population Dynamics", in Langton, C.G. (ed.) Artificial Life III. Reading, MA: Addison-Wesley, 1994.

Pastor-Satorras, R. and Vespignani, A, "Epidemic Dynamics and Endemic States in Complex Networks". Barcelona, Spain: Universitat Politecnica de Catalunya, 2001.

Rishikesh P, "Using Plant Epidemiological Methods To Track Computer Network Worms", 2004. Thesis, Computer Science, Virginia Polytechnic and State University.

Daniel T. Larose, "Discovering Knowledge in Data. An introduction to data mining", 2005. John Wiley & Sons, Inc.

Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone, "Classification and Regression Trees", 1984. Chapman & Hall/CRC Press.

Vipin Kumar, "The Top Ten Algorithms in Data Mining", 2009. Chapman & Hall/CRC.

Quinlan, R, "Unknown attribute values in induction". In Proceedings of the Sixth International Workshop on Machine Learning, 1989, pp. 164-168.

Conover, "Practical Nonparametric Statistics", 1999. John Wiley & Sons, Inc.

Hosmer Jr, D.W. and Lemeshow, "Applied Survival Analysis: Regression Modeling of Time to Event Data", 1999. John Wiley & Sons.
