
Transcript of U.S. Army Intelligence and Security...

INSCOM … Vigilance Always!

U.S. Army Intelligence and Security Command

OVERALL CLASSIFICATION OF THIS BRIEFING IS UNCLASSIFIED

(U) Big Data Introduction

INSCOM ORSA Cell

November 2016


Agenda

• Define Big Data

• History of Big Data

• Basics of Networking, HDFS, and MapReduce

• Uses and limitations of Hadoop

• Military Applications and Concerns

• Summary and Review of Definitions


The Three V’s of Big Data

[Graphic: Volume, Velocity, Variety]


How big is “Big Data?”

• Doesn’t fit in a spreadsheet (Excel tops out at 1,048,576 rows and 16,384 columns)?

• Doesn’t fit in memory (the constraining factor for in-memory tools such as R)?

• Doesn’t fit on a single machine (starts at ~1TB)?

• Requires distributed storage and processing (starting around 5-10TB)?


Why bother with Big Data?

– To know (what happened?)

• Basic analytics + visualizations (descriptive statistics, histograms, time series, bar charts, box plots, etc.)

• Interactive drill-down

• Implemented with MapReduce or queries

• Examples: forensics, assessments, historical data/reports/trends

– To explain (why did it happen?)

• Data mining, classification, model building, clustering

• Correlation

• Examples: find similar items, find hubs and authorities in a graph, find frequent item sets

• Possibly implemented with Apache Mahout

– To predict (what will happen?)

• Neural networks, decision models, unsupervised learning

• Examples: translation, weather forecasting, user profiles, traffic models, economic models


Why is Big Data Hard?

– Storage: At 1TB per machine, it takes 1,000 computers to store 1PB

– Movement: Even on a 10Gb network, at a realistic effective throughput of ~1Gb/s it takes about 2 hours to copy 1TB, or 83 days to copy 1PB

– Searching: Assuming each record is 1KB and one machine can process 1,000 records per second, it takes about 277 CPU hours (~11.6 days) to process 1TB and about 32 CPU years to process 1PB

– Processing:

• How do we convert existing algorithms to work on large data?

• How do we create new algorithms?
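A minimal sketch (plain Python; the throughput constant is an assumption chosen to match the slide's 2-hour figure) that reproduces the arithmetic behind these estimates:

# Back-of-the-envelope math behind the estimates above
TB = 10**12   # bytes
PB = 10**15

# Movement: assume ~1.1 Gb/s effective throughput (~139 MB/s)
throughput = 1.39e8                                  # bytes/sec
print(TB / throughput / 3600, "hours")               # ~2 hours per TB
print(PB / throughput / 86400, "days")               # ~83 days per PB

# Searching: 1KB records at 1,000 records/sec = 1 MB/s per machine
scan_rate = 1e6                                      # bytes/sec
print(TB / scan_rate / 3600, "CPU hours")            # ~278 per TB
print(PB / scan_rate / (86400 * 365), "CPU years")   # ~31.7 per PB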


Understanding Traditional Data Storage

• Often requires big/expensive hardware

• Requires an expensive database management system (Oracle, Teradata, etc.)

• Not necessarily fault tolerant

• Back-up can be difficult and expensive

• Doesn’t scale horizontally (high marginal cost)

• SQL is unsuited for some analytics

– Complex analysis (like ranking Internet Pages)

– Unstructured data


Google’s Problem

• In 1999 Google wanted to index the web, which even then comprised hundreds of millions of pages

– Crawl all the pages

– Rank pages based on relevant metrics

– Build a search index of keywords to pages

– Do this in real-time!


Google’s Solution

• Google designed its own storage and processing infrastructure

– Google File System and MapReduce

• Goals:

– Cheap

– Scalable

– Reliable

Image from http://infolab.stanford.edu


Google Product

• It worked!

• Powered Google Search for many years

• General framework for large-scale batch computation tasks

• Still used internally at Google to this day


Google Shares Its Ideas

2003: Google publishes its paper on the Google File System (GFS)

2004: Google publishes its paper on MapReduce

At this point, these are already mature technologies...

...but it took 2-3 years for people to “get it”!


The Elephant in the Room

• Doug Cutting and Mike Cafarella attempted to develop an open-source search platform called Nutch

• Ran into same problem Google did

• Decided to “reverse engineer” GFS and MapReduce from the 2003 and 2004 papers

• 2006: spun their product out into Apache Hadoop


Hadoop Goes Mainstream

• Today Hadoop is used across the Fortune 500, by the majority of internet and social media companies, and by an increasing number of government agencies

• Facebook has a 20PB/4,000-node cluster

• Many big tech companies are betting on Hadoop

• Experts predict that within 5-10 years, the vast majority of servers will contain Hadoop clusters


Hadoop Primer

• Networking

• Hadoop Distributed File System (HDFS)

• MapReduce


Networking Primer

[Photos: Google data center in Council Bluffs, Iowa; central cooling plant in Google’s data center in Douglas County, Georgia]

**Graphic from Cloudera Training Material


Hadoop File Systems

• Same concepts as the file system on your personal computer

– Directory Tree

– Create, read, write, and delete files

• Filesystems store metadata and data

– Metadata: filename, size, permissions, location

– Data: contents of the file

16

UNCLASSIFIED

UNCLASSIFIED

Understanding HDFS

HDFS Design assumptions

• Failures are common

– Massive scale means more failures

– Disks, network, node

• Files are append-only

• Files are large (GBs to TBs)

– Works better with few large files than many small files

• Accesses are large and sequential


HDFS Block Replication

[Diagram: a very large data file is split into blocks 1-5; each block is stored on three different data nodes (e.g., one node holds blocks 2, 4, 5; another holds 1, 2, 5; another holds 1, 3, 4; and so on)]

Name Node: metadata information about files and blocks

3-fold replication is baked into the process
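A toy sketch in Python (illustrative only; real HDFS placement is rack-aware and managed by the Name Node) of assigning each block to three distinct data nodes:

import random

def place_blocks(num_blocks, data_nodes, replication=3):
    # Toy model: pick `replication` distinct nodes for each block
    return {block: random.sample(data_nodes, replication)
            for block in range(1, num_blocks + 1)}

nodes = ["node1", "node2", "node3", "node4", "node5", "node6"]
for block, replicas in sorted(place_blocks(5, nodes).items()):
    print("block", block, "->", replicas)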


MapReduce

[Diagram: the same replicated blocks as the previous slide; map tasks run on the data nodes, where Map produces intermediate values and Reduce combines intermediate values into one or more final values]

Name Node: metadata information about files and blocks


Word Count Example

Mapper Input:
The cat sat on the mat
The aardvark sat on the sofa

Mapping:
(The, 1) (cat, 1) (sat, 1) (on, 1) (the, 1) (mat, 1)
(The, 1) (aardvark, 1) (sat, 1) (on, 1) (the, 1) (sofa, 1)

Shuffling:
(aardvark, [1]) (cat, [1]) (mat, [1]) (on, [1,1]) (sat, [1,1]) (sofa, [1]) (the, [1,1,1,1])

Reducing (Final Result):
aardvark, 1  cat, 1  mat, 1  on, 2  sat, 2  sofa, 1  the, 4

~100 lines of Java code to accomplish in Hadoop (a single-machine sketch of the same flow appears below)
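A minimal single-machine sketch in Python (illustrative only, not Hadoop's Java API) of the map, shuffle, and reduce steps shown above:

from collections import defaultdict

def map_phase(lines):
    # Map: emit (word, 1) for every word in every input line
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all values for the same key into one list
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's intermediate values into a final value
    return {key: sum(values) for key, values in groups.items()}

lines = ["The cat sat on the mat", "The aardvark sat on the sofa"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'the': 4, 'cat': 1, 'sat': 2, 'on': 2, 'mat': 1, 'aardvark': 1, 'sofa': 1}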


Reasons to avoid Hadoop

• Use cases that may not be best in Hadoop

– Analysis cannot be adapted to parallel processing environment

– Real-time analytics or fast access (e.g., 30 milliseconds to look up a record in a database of 300 million people)

– When your intermediate processes need to talk to each other

– When processing requires significant data to be shuffled over the network



Big Data Ecosystem


Big Data Landscape 2014


Big Data Landscape 2016


Linux : Hadoop

You don’t need to download and compile it yourself! As with Linux, packaged distributions exist: Cloudera and Hortonworks.


Hadoop Ecosystem (1 of 2)

• Hive: Relational database abstraction using a SQL-like dialect (but executed as MapReduce jobs). Developed by Facebook

SELECT s.word, s.freq, k.freq FROM shakespeare s JOIN kjv k ON (s.word = k.word) WHERE s.freq >= 5;

• Pig: High-level scripting language for executing one or more MapReduce jobs. Developed by Yahoo

emps = LOAD 'people.txt' AS (id, name, salary);
rich = FILTER emps BY salary > 200000;
sorted_rich = ORDER rich BY salary DESC;
STORE sorted_rich INTO 'rich_people.txt';


Hadoop Ecosystem (2 of 2)

• Sqoop: Performs bidirectional data transfers between Hadoop and almost any SQL database with a JDBC driver

• Flume: A streaming data collection and aggregation system for massive volumes of data

• HBase: Hadoop's NoSQL database. Patterned after Google BigTable, HBase is designed to provide fast, tabular access to high-scale data stored on HDFS

• Accumulo: Similar to HBase but developed by the National Security Agency with cell-based access control (adds a new element to the key called Column Visibility) (https://accumulo.apache.org/)


Two Big Data Tools to Watch Closely

• Spark (fast, general engine for large-scale data processing)

– Runs up to 100x faster than MapReduce in memory and 10x faster on disk

– Easy to use: write applications in Java, Scala, R, or Python

– Generality: combine SQL, streaming, and complex analytics (see the sketch below)

• Apache Drill (schema-free SQL query engine for Hadoop, NoSQL, and cloud storage)

– Query almost any non-relational data store with SQL (can point at a directory of JSON files on your laptop or in S3)

– With Drill’s ODBC drivers, you can connect any existing BI tool (Excel, R, SAS, Tableau, etc.)
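A small PySpark sketch (Spark 2.x API; “people.json” is a hypothetical sample file with name and salary fields) of mixing SQL with programmatic analytics in one program:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()
df = spark.read.json("people.json")      # schema inferred automatically
df.createOrReplaceTempView("people")

# SQL and DataFrame operations interoperate freely
spark.sql("SELECT name, salary FROM people WHERE salary > 200000").show()
print(df.count(), "records")
spark.stop()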



Military “Big Data” Applications


Hadoop Use Cases

• Hadoop is the platform of choice for:

– Clickstream data

– Sentiment data (Twitter and social media)

– Telematics, such as vehicle tracking data

– Sensor and Machine-generated data

– Geo tracking and location data

– Server and network logs

– Document and text repositories

– Digitized images, voice, video and other media.

[Callout: each of the above is a possible military application]


Sampling of Army “Big Data” Projects

• Gabriel Nimbus: ARCYBER instance of the Big Data Platform used to aggregate and enrich cyber data as well as provide a platform to develop rapid analytics for defensive cyber operations (DCO)

• Tactical Cloud Reference Implementation (TCRI): TCRI intends to deliver a joint warfighting tactical/deployed data and analytic platform that enables all-source analysis, rapid decision making, and optimization of force employment

• Person-Event Data Environment (PDE): a cloud-based business intelligence platform and virtual data repository for housing digitized personnel information. Functionally, the PDE serves two central purposes: (1) acquire, integrate, and securely store data for Army-approved research projects, and (2) provide a secure, virtual workspace where approved researchers can access “sensitive” though unclassified Army military service, performance, manpower, and health data. PDE is hosted by the Army Analytics Group.


Military Application

Today:

• Intelligence

• Cyber

Tomorrow:

• Mobile Technology

• Sensors

[Images from sustainablesecurity.org, www.enocean.com, www.techweekeurope.co.uk, www.geek.com]



Final Thoughts


“Big” vs. “Bigger” Data

– “Bigger” Data

• Size: 1GB up to 1TB; fits on one machine

• Often doesn’t fit into memory

• Doesn’t require Hadoop (unless you need a production application with lightning-fast query/analytics)

– Some recommendations for “Bigger” Data (tools that I’ve found help):

– SQLite

– Elasticsearch, Logstash, Kibana (ELK) server

– Apache Drill

– Unix terminal (query many CSV files with grep)

• Batch process (store and aggregate interim statistics; see the sketch below)

• “Poor man’s parallelization” (multiple instances running simultaneously)

• Understand how your code uses RAM (e.g., the difference between data frames and lists in R)
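A minimal sketch (pandas; “big.csv” and its “salary” column are hypothetical) of the batch-process idea: read a file too big for RAM in chunks and keep only running interim statistics:

import pandas as pd

total, count = 0.0, 0
for chunk in pd.read_csv("big.csv", chunksize=1000000):
    total += chunk["salary"].sum()    # interim statistic per chunk
    count += len(chunk)

print("mean salary:", total / count)  # aggregate of the interim statistics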


Big Data Storage vs. Big Data Analytics

– Big Data Storage → distributed storage

Data is duplicated and stored across many different nodes (computers)

– Big Data Analytics → distributed analytics

Analytics are conducted across multiple nodes; a master node collects and aggregates interim solutions (see the sketch below)
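A toy Python illustration (invented data; a real cluster would run node_task on separate machines) of distributed analytics: each node computes an interim (sum, count) and a master aggregates them into a global mean:

node_partitions = [
    [2.0, 4.0, 6.0],        # data held on node 1
    [1.0, 3.0],             # data held on node 2
    [5.0, 7.0, 9.0, 11.0],  # data held on node 3
]

def node_task(values):
    # Interim solution computed locally on each node
    return sum(values), len(values)

partials = [node_task(part) for part in node_partitions]
total, count = map(sum, zip(*partials))  # master aggregates interim results
print("global mean:", total / count)     # 48 / 9 = 5.33...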


What is Data Science?

Math and Statistics:
• Machine learning
• Statistical modeling
• Bayesian inference
• Optimization
• Simulation
• Network science
• Model development

Programming & Database:
• Computer science fundamentals
• Scripting language (Python)
• Statistical computing package (e.g., R)
• Databases: SQL and NoSQL
• Parallel databases and parallel query processing
• MapReduce concepts
• Hadoop and Hive/Pig
• Experience with XaaS like AWS
• Basic tools development
• Understands sources of data

Ops/Intel Expertise & Leader Skills:
• Background in intel/ops/cyber
• Curious about data
• Influence with leaders
• Problem solver
• Creates narratives with data
• Visual design and communication
• Strategic, proactive, creative, innovative, and collaborative

Data science = the ability to extract knowledge and insights from large and complex data sets
--DJ Patil, the U.S. Government’s first Chief Data Scientist

Operationalize Data for Decision Makers


The Gartner Hype Cycle

Graph from “Big Data for Defence and Security” by Neil Couch and Bill Robins


Final “Big Data” Considerations

• Big Data Ethics
• Big Data Security

Image from www.tbitsglobal.com


“A human must turn information into intelligence or knowledge. We’ve tended to forget that no computer will ever ask a new question.”
--Grace Hopper

http://en.wikiquote.org/wiki/Grace_Hopper


Questions

• For more information on “Big Data”, visit my research site at http://data-analytics.net

• For data science collaboration, visit https://dscoe.army.mil/

• For insomnia, visit https://dmbeskow.github.io/


Contact Info:

Major David Beskow

Office: 703.706.1255

NIPR: [email protected]

SIPR: [email protected]

JWICS: [email protected]


References

Executive Office of the President. “Big Data: Seizing Opportunities, Preserving Values,” May 2014.

Couch, Neil and Robins, Bill. “Big Data for Defence and Security,” Royal United Services Institute, September 2013.

Olson, Mike. “HADOOP: Scalable, Flexible Data Storage and Analysis,” IQT Quarterly, Vol. 1, No. 3, pp. 14-18.

Jacobs, Bill and Dinsmore, Thomas. “Delivering Value from Big Data with Revolution R Enterprise and Hadoop,” Revolution Analytics Executive White Paper, October 2014.

Jacobs, Bill. “Maximizing the Value of Big Data,” Revolution Analytics White Paper, April 2014.

Cloudera training materials were a primary resource in creating this presentation.


Back-up Slides


AWS Primer

• Elastic Cloud Compute (EC2):

– Virtual Machines (Computers) that Reside in the Cloud (just like a real computer, you choose RAM and Storage size)

– Choose a Linux or Microsoft image

• Simple Storage Solution (S3)

– “Buckets” that can store files

– Think of this as an infinitely expandable Dropbox in which you only pay for the storage used
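A short sketch (boto3; the bucket and file names are hypothetical, and AWS credentials must already be configured) of using S3 programmatically:

import boto3

s3 = boto3.client("s3")
# Upload a local file into a bucket "folder" (key prefix)
s3.upload_file("report.csv", "my-analytics-bucket", "data/report.csv")

# List what is stored under the data/ prefix
response = s3.list_objects_v2(Bucket="my-analytics-bucket", Prefix="data/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"], "bytes")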


The Four V’s of Big Data

Volume

• Scale of Data

• 40 Zettabytes of data will be created by 2020

Velocity

• Analysis of Streaming Data

• The NY Stock Exchange captures 1TB of trade information during each trading session

Variety

• Different forms of data

• By the end of 2014, it’s anticipated there will be 420 million wearable wireless health monitors

Veracity

• Uncertainty of Data

• Poor data quality costs the US economy $3.1 Trillion a year

Data from IBM Graphic Visualization


HDFS Fault-tolerance

• Many different failure modes

– Disk corruption, node failure, switch failure

• Primary concern: Data is safe!!

• Secondary concerns

– Keep accepting reads and writes

– Do it transparently to clients/users


Hadoop Cost Considerations

• Traditional Storage:

Teradata is ~$20K per TB per year

• Hadoop Storage:

$1K-2K per TB per year

Note: Hadoop Storage costs assume you have the technical expertise in-house. Hiring/contracting Hadoop programmers increases costs significantly.

***Cloudera training, NYC, 2014


File Systems (cont)

• Disk does a seek for every I/O operation

• Seeks are expensive (~10ms)

• Throughput tradeoff: Input/Output Operations per second (IOPS)

– 100 MB/s and 10 IOPS

– 10MB/s and 100 IOPS

• Big I/Os mean better throughput (illustrated in the sketch below)
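A quick Python check (using the slide's ~10ms seek and an assumed 100MB/s raw transfer rate) of why big I/Os mean better throughput:

SEEK = 0.010   # seconds per seek (~10ms)
RATE = 100e6   # bytes/sec raw transfer rate

for io_size in [4e3, 1e6, 100e6]:   # 4KB, 1MB, 100MB I/Os
    # Effective throughput = bytes moved / (seek time + transfer time)
    t = SEEK + io_size / RATE
    print(int(io_size / 1e3), "KB I/O ->", round(io_size / t / 1e6, 1), "MB/s")
# 4KB I/Os achieve ~0.4 MB/s; 100MB I/Os achieve ~99 MB/s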


Summary

• GFS and MR co-design

– Cheap, simple, effective at scale

• Fault-tolerance baked in

– Replicate data 3x

– Incrementally re-execute computation

– Avoid single points of failure


Networking Primer

[Diagram: a six-host cluster. Hosts 1 and 2 run Name Nodes; Hosts 3-6 run Data Nodes]


HDFS Block Replication

**Graphic from Cloudera Training Material


MapReduce—Map

• The framework reads records from the data source (lines out of files, rows out of a database, etc.) and feeds them into the map function as key/value pairs: e.g., (filename, line)

• Map() produces one or more intermediate values along with an output key from the input

[Diagram: text inputs → Map Tasks emit {key, values} → Shuffle Phase groups them into {key, intermediate values} → Reduce Task produces final {key, values}]


Hierarchy of Data Scientists

Tool-maker: generates algorithms from scratch; full understanding of when an algorithm will break.

High-end tool user: uses products that require a deeper understanding of the question and tools (e.g., executing and debugging a few lines of code).

Tool user: uses products generated by other analysts to answer well-known questions.


MapReduce—Reduce

• After the map phase is over, all the intermediate values for a given output key are combined together into a list

• Reduce() combines those intermediate values into one or more final values for that same output key

[Diagram: same flow as the Map slide, from Map Tasks through the Shuffle Phase to the Reduce Task's final {key, values}]
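A compact Python sketch in the Hadoop Streaming style (the framework sorts mapper output by key before the reducer sees it; the file and script names are hypothetical):

#!/usr/bin/env python
# Run as: ./wordcount.py map < input.txt | sort | ./wordcount.py reduce
import sys

def mapper():
    # Emit "word<TAB>1" for every word on stdin
    for line in sys.stdin:
        for word in line.strip().lower().split():
            print(word + "\t1")

def reducer():
    # Input arrives sorted by key, so counts can be summed per key-run
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rsplit("\t", 1)
        if word != current and current is not None:
            print(current + "\t" + str(count))
            count = 0
        current = word
        count += int(value)
    if current is not None:
        print(current + "\t" + str(count))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()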


Where does data science fit in?

[Diagram: Data-Driven Decision Making (across the organization) and Automated DDD draw on Data Science, which in turn rests on Data Engineering and Processing (including “Big Data” technologies); data processing has other positive effects as well]


The Data Scientist

[Venn diagram: “The Data Scientist” sits at the intersection of three skill sets]

• “The Hacker”: Computer Science

• “The Nerd”: Statistics & Math, Modeling

• “The Expert”: SME on the field of interest

Organizations can achieve data science through a team approach


How Big Data Fits into a Data Science Minor at USMA

UC Berkeley:

1. Research Design and Application for Data and Analysis

2. Exploring and Analyzing Data

3. Storing and Retrieving Data

4. Applied Machine Learning

5. Visualizing and Communicating Data

Proposed WP Curriculum:

1. Engineering Statistics

2. Databases & Big Data

3. Network Analysis

4. Machine Learning and Data Mining

5. Visualizing and Communicating Data


Near-term Future of Big Data

• HDFS remains industry standard

• Spark replaces MapReduce

Spark Code (PySpark):

text_file = sc.textFile("hdfs://…")
errors = text_file.filter(lambda line: "ERROR" in line)
# Count all errors
errors.count()
# Count errors mentioning MySQL
errors.filter(lambda line: "MySQL" in line).count()

• Keep your eye on Apache Drill
