
Integrating Cybersecurity Log Data Analysis in Hadoop
Bryan Stearns, Susan Urban, Sindhuri Juturu

Texas Tech University 2014 NSF Research Experience for Undergraduates Site Program

Abstract

In cybersecurity, the growth of Advanced Persistent Threats (APTs) and other sophisticated attacks makes it increasingly important to analyze network and system activities from all event sources. If logs are recorded through different software packages, the resulting big data can take a form known as dirty data that is difficult to merge and cross-analyze. Additionally, storing logs in a big data architecture like Apache Hadoop can make data joins time-consuming. Merging data as it is stored, rather than at access time, can greatly simplify unified analysis, yet doing so requires advance knowledge of what kinds of merges will be required. Otherwise, services wishing to holistically analyze dirty data must merge information externally each time data is pulled. This research is developing a file system called HVID (Hadoop Value-oriented Integration of Data) that represents dirty log data as a single table while maintaining its raw form, using a novel variant of column-oriented storage. The system utilizes the open-source big data store Apache HBase to enable fast access to unified views without the need for predetermined joins. This design allows more natural and efficient holistic analysis of stored cybersecurity data, both with external mining applications and with local MapReduce tools.

Introduction

o The amount of unstructured or semi-structured data recorded in cybersecurity endeavors is growing every day [1].

o Increasingly sophisticated cybersecurity threats require large-scale holistic pattern analysis for proper detection [1][2].

o Heterogeneous “dirty” data generated from multiple network sources needs customized unification to be useful for such purposes [3].

o MapReduce is desirable to analyze unstructured data, but existing unification methods structure data externally from unstructured storage platforms, requiring additional I/O to feed merged data back to storage [3].

o A better method is needed to unify and manipulate disparate information from services across a network.

Methods

Equipment

o 64-bit single-node virtual Linux machine

o IBM BigInsights V3.0.0.0

o Station with an 8-thread 1.87 GHz processor and 8 GB RAM

Process

o Design – The design proceeded on two fronts: abstract and physical. Abstract design focused on the data to be merged, while physical design focused on the selection, implementation, and optimization of features available in Hadoop.

o Implement – The designed data architecture was created along with basic access features. Java was used for system creation and interfacing.

o Test – Basic speed tests were performed on working features, using system timestamps recorded at the start and completion of each operation.
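The timing approach described in the Test step can be sketched as a simple wall-clock harness. This is a minimal illustration under assumptions: the poster does not show its actual test code, and the class and method names below are hypothetical.

```java
// Minimal wall-clock timing harness in the spirit of the Test step:
// record the system time before and after an operation, report the delta.
public class TimingHarness {

    // Times a single operation and returns the elapsed wall-clock milliseconds.
    static long timeOperation(Runnable op) {
        long start = System.currentTimeMillis();
        op.run();
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        long elapsed = timeOperation(() -> {
            // Placeholder workload standing in for a load or select operation.
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i;
        });
        System.out.println("Operation took " + elapsed + " ms");
    }
}
```

Wall-clock deltas like this are coarse (hence the poster's caveat that cluster-scale tests are needed before timings carry much meaning), but they suffice for the basic comparisons reported here.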

Objectives

Design and prototype a file system that:

o Runs in Hadoop

o Supports structured, semi-structured, and unstructured data

o Provides quick access to merged tables

o Does not restrict what columns are used for merging

o Supports MapReduce operations upon merged data

Implications

o Faster value-based retrieval of data

o Eliminate need to individually merge tables containing shared features

o Reduce I/O needed for holistic unstructured MapReduce analysis

Conclusion:

The HVID design:

o Unifies data into a common format via a unique value-based structure

o Allows heterogeneous datasets to be merged by any field

o Supports external or internal data mining and manipulation

o Supports internal MapReduce on merged data

o Should allow value-based merge and join queries to run in time comparable to plain select queries

o Requires less space than comparable row-oriented solutions*

o Can utilize backup copies for improved data interconnectivity

o Requires further research and development

o Provides a means to unite dirty cybersecurity data in storage without the need to explicitly outline how information should be merged until it is needed.

References:

[1] A. A. Cárdenas, P. K. Manadhata, and S. P. Rajan, "Big Data Analytics for Security," IEEE Security & Privacy, vol. 11, pp. 74-76, 2013.

[2] A. K. Sood and R. J. Enbody, "Targeted Cyberattacks: A Superset of Advanced Persistent Threats," IEEE Security & Privacy, vol. 11, p. 7, 2013.

[3] T.-F. Yen, A. Oprea, K. Onarlioglu, T. Leetham, W. Robertson, A. Juels, et al., "Beehive: large-scale log analysis for detecting suspicious activity in enterprise networks," presented at the Proceedings of the 29th Annual Computer Security Applications Conference, New Orleans, Louisiana, 2013.

Design:

o An inverted variant of column-oriented storage was developed in which values are the primary key and row IDs are the dependent value.

o This physical association of rows with shared values allows dynamic views on relations without lengthy scans and comparisons.

o HBase was chosen as the HVID platform for its flexibility, support for read-intensive applications, and its ability to integrate with Hadoop MapReduce.

o Value-oriented data reside in row-key byte arrays for quick scanning.

o Rows are sorted for fast collection of values from contiguous ranges.

o Large unstructured data is stored in a separate row-oriented table.

o This row-oriented form can be used to store backups of value-oriented data, while enhancing row ID resolution of value-based queries.
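The inverted layout above can be sketched without a live HBase cluster. In the sketch below, a Java TreeSet stands in for HBase's lexicographically sorted row-key space, and the composite key shape value:table:rowID follows Figure 1; the class and method names are illustrative, not part of the actual HVID implementation.

```java
import java.util.NavigableSet;
import java.util.TreeSet;

// Sketch of HVID's inverted, value-oriented layout (illustrative only; the
// real system stores these entries as sorted HBase row-key byte arrays).
// Each key is a composite "value:table:rowID", so rows that share a value,
// from any source table, sort into one contiguous range.
public class ValueOrientedSketch {

    // Collects every table:rowID pair holding the given value by reading a
    // single contiguous key range -- no full scan-and-compare is needed.
    static NavigableSet<String> rowsWithValue(NavigableSet<String> keys, String value) {
        return keys.subSet(value + ":", true, value + ":\uffff", false);
    }

    public static void main(String[] args) {
        // A TreeSet stands in for HBase's sorted row keys; the entries
        // mirror field B of the two source tables in Figure 1.
        NavigableSet<String> fieldB = new TreeSet<>();
        fieldB.add("b1:T1:row1");
        fieldB.add("b1:T2:row1");
        fieldB.add("b2:T1:row2");
        fieldB.add("b2:T2:row2");
        fieldB.add("b3:T1:row3");
        fieldB.add("b4:T2:row3");

        // Rows from T1 and T2 that share value b1 come back together.
        System.out.println(rowsWithValue(fieldB, "b1")); // [b1:T1:row1, b1:T2:row1]
    }
}
```

This is why the sorted-rows design bullet matters: because keys sharing a value prefix are physically adjacent, collecting all rows holding a value is a range read rather than a table scan.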

Future Work:

o Complete working implementation for data lookup

o Compare functionality with and without backup row-oriented records

o Modify BulkLoad to support multi-table output from a single MapReduce

o Analyze row key structure for region balancing

o Load implementation onto cluster

o Benchmark various operations in various cluster configurations

o Create Hive interface for increased functionality and ease-of-use

o Create automatic upload system and interface

o Create web-interface for database access

o Explore support for varying data-types within a field via qualifiers

o Explore inverted clustering techniques based on inverted data

DISCLAIMER: This material is based upon work supported by the National Science Foundation and the Department of Defense under Grant No. CNS-1263183. Any opinions, findings, and conclusions or recommendation expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or the Department of Defense.

Results:

o Custom database generation and access tools were created for HBase using Java. Full functionality is not yet complete, but load and basic data retrieval have been implemented for the row-oriented segment of HVID.

o Whether row-oriented versions of data should be kept alongside value-oriented tables remains unknown until a full prototype is built and merge speeds are tested under various configurations.

o While results are inconclusive, selection of rows through value-oriented storage shows promise.

Figure 1: Value-Based storage

Preliminary Testing:

o Collective storage consumption by value-based HVID tables was found to be 86% of that required for a classic HBase table.

o Further compression can be realized when employing numeric data.

o Value-based access is not yet fully implemented for testing.

o Results show little difference between HVID and classic HBase for basic row-oriented retrieval (when using rows as a backup form).

o The system must be tested on a full Hadoop cluster before any speed tests can contain significant meaning.

Figure 2: Preliminary Space and Time Comparisons

[Bar charts, summarized from the poster:]

Space taken for database storage* (MB; series labeled Row Based and Value Based, compared across Classic HBase and HVID). Recorded bar values: 116, 108, and 128 MB.
* Tests used a 25 MB .tsv text file as source data; when using non-duplicated value-oriented usage.

Time to select 360k rows by value** (ms): Classic HBase 11,500 ms; HVID 5,500 ms.
** Only row IDs were selected; pulling content remains to be implemented.

Figure 1 diagram: Source Data vs. Classic HBase Row-Oriented Storage vs. HVID Storage

Source Data:

  T1      A    B    C        T2      B    C    D
  row1    a1   b1   c1       row1    b1   c2   d1
  row2    a2   b2   c2       row2    b2   c4   d2
  row3    a3   b3   c3       row3    b4   c5   d3

Classic HBase Row-Oriented Storage:

  T1  row1 → {("A", "a1"), ("B", "b1"), ("C", "c1")}
      row2 → {("A", "a2"), ("B", "b2"), ("C", "c2")}
      row3 → {("A", "a3"), ("B", "b3"), ("C", "c3")}
  T2  row1 → {("B", "b1"), ("C", "c2"), ("D", "d1")}
      row2 → {("B", "b2"), ("C", "c4"), ("D", "d2")}
      row3 → {("B", "b4"), ("C", "c5"), ("D", "d3")}

HVID Storage (value-oriented; row keys are value:table:rowID):

  A: a1:T1:row1   a2:T1:row2   a3:T1:row3
  B: b1:T1:row1   b1:T2:row1   b2:T1:row2   b2:T2:row2   b3:T1:row3   b4:T2:row3
  C: c1:T1:row1   c2:T1:row2   c2:T2:row1   c3:T1:row3   c4:T2:row2   c5:T2:row3
  D: d1:T2:row1   d2:T2:row2   d3:T2:row3

  Backup row-oriented "rows" table:
  T1:row1 → {(A, a1), (B, b1), (C, c1)}
  T1:row2 → {(A, a2), (B, b2), (C, c2)}
  T1:row3 → {(A, a3), (B, b3), (C, c3)}
  T2:row1 → {(B, b1), (C, c2), (D, d1)}
  T2:row2 → {(B, b2), (C, c4), (D, d2)}
  T2:row3 → {(B, b4), (C, c5), (D, d3)}
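The merge behavior implied by Figure 1 can be sketched in plain Java: because every occurrence of a value is keyed by value:table:rowID, rows from different source tables that share a value pair up in one pass, with no join columns declared ahead of time. The class and method names below are illustrative; the deployed system would scan HBase row keys instead of an in-memory set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Sketch of how a value-based merge falls out of the layout in Figure 1.
public class ValueMergeSketch {

    // Groups table:rowID pairs by shared value in one pass over sorted
    // composite keys of the form "value:table:rowID".
    static Map<String, List<String>> groupByValue(Iterable<String> keys) {
        Map<String, List<String>> byValue = new LinkedHashMap<>();
        for (String key : keys) {
            String[] parts = key.split(":", 2);  // -> value, "table:rowID"
            byValue.computeIfAbsent(parts[0], v -> new ArrayList<>()).add(parts[1]);
        }
        return byValue;
    }

    public static void main(String[] args) {
        // Field B of the value-oriented store, mirroring Figure 1.
        TreeSet<String> fieldB = new TreeSet<>(Arrays.asList(
            "b1:T1:row1", "b1:T2:row1", "b2:T1:row2",
            "b2:T2:row2", "b3:T1:row3", "b4:T2:row3"));

        // Values appearing under more than one table:rowID define the merge.
        for (Map.Entry<String, List<String>> e : groupByValue(fieldB).entrySet()) {
            if (e.getValue().size() > 1) {
                System.out.println(e.getKey() + " joins " + e.getValue());
            }
        }
        // prints: b1 joins [T1:row1, T2:row1]
        //         b2 joins [T1:row2, T2:row2]
    }
}
```

Note how b3 and b4, which occur in only one source table, drop out automatically: the shared-value grouping itself is the join condition, which is what lets HVID defer the choice of merge columns until query time.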