Big data

1

Big Data

Guy Lion

April 5, 2013

2

Table of Contents

1) Big Data trends.

2) How Big is your Data?

3) Big Data Potential.

4) Big technologies. New databases.

5) Big quantitative methods. New stats.

6) Big Data temperaments.

7) Is Big always better?

3

1) Big Data Trends

4

Cost of Data storage has dropped

5

Social networks are creating a huge volume of live Unstructured Data.

Social Media (Facebook & Twitter) has grown exponentially.

• Twitter started in March 2006. It has 500 million users.

• Facebook started in Feb 2004. It has 1 billion active users.

[Chart: Facebook vs Twitter, number of active users in thousands (000), January 2008 to January 2013. Vertical axis from 0 to 1,200,000. Annotation: exponential growth.]

6

Unstructured Data is taking over…

7

2) How Big is your Data?

• How Tall is it? How large is your sample (rows)?

• How Wide is it? How many variables (columns)?

• What is its Velocity? How frequently is it updated?

• Does it include unstructured data (documents, emails, Social Media)?

8

3) Big Data Potential

9

4) Big Technologies. New Databases

11

Database: Structured vs Unstructured

Data type

• Structured: customers, transactions, numbers in rows.

• Unstructured: social media, text documents, web services.

Database language

• Structured: SQL (structured query language).

• Unstructured: NoSQL (not only SQL).

Database type

• Structured: relational database.

• Unstructured: non-relational database.

Database structure

• Structured: Data Warehouse, Data Marts.

• Unstructured: Hadoop.

Reporting tool

• Structured: Business Intelligence reporting (Oracle Essbase & IBM Cognos).

• Unstructured: Hadoop Connectors.

12

5) Big quantitative methods. New Stats

13

New Stats Map

[Diagram: Predictive Analytics, drawing on Statistics & Regression and on Data Mining & Machine Learning (formerly Artificial Intelligence).]

Methods shown on the map:

• A/B Testing (hypothesis testing)

• Regression

• Time Series Analysis

• Spatial Analysis

• Signal Processing

• Association Rule Learning

• Cluster Analysis

• Classification

• Neural Networks

• Natural Language Processing

• Sentiment Analysis

• Optimization

• Genetic Algorithms

• Pattern Recognition

14

Definitions. Part I

Association Rule Learning: a method to uncover interesting relationships by generating and testing possible rules. One application is “market basket analysis”, where a retailer figures out which products are frequently bought together. A frequently cited example is that shoppers who buy diapers often buy beer.
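
As a rough illustration of market basket analysis, the sketch below generates every pairwise rule from a handful of baskets and tests each one by its support (the share of baskets containing both items) and confidence (the share of baskets with the first item that also contain the second). The baskets, the `pair_rules` helper and the 0.3 support threshold are all invented for this example; real retailers work with far larger data and dedicated algorithms.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration only.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def pair_rules(baskets, min_support=0.3):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue  # rule rejected: the pair is too rare
        # Confidence: of the baskets with the antecedent, how many have the consequent?
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules


rules = pair_rules(baskets)
for antecedent, consequent, support, confidence in sorted(rules):
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

With this toy data only the diapers and beer pair clears the support threshold, so it is the only relationship reported.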

Classification: identifies the categories to which new data belong, based on an existing data set already grouped into predefined categories. It differs from Cluster Analysis, which starts without predefined categories.

Genetic algorithms: an optimization method inspired by the “survival of the fittest” process. Potential solutions are encoded as “chromosomes” that can combine and mutate. The chromosomes are selected for survival within a modeled “environment.” Example: optimizing the performance of an investment portfolio.
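
To make the chromosome idea concrete, here is a minimal sketch of a genetic algorithm. It is not a portfolio optimizer: the objective (`count_ones`, which simply counts the 1s in a bit string) is a toy stand-in, and the `evolve` function, population size, mutation rate and seed are all assumptions chosen for illustration.

```python
import random


def evolve(fitness, n_genes=20, pop_size=30, generations=40,
           mutation_rate=0.02, seed=0):
    """Toy genetic algorithm over bit-string chromosomes."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    initial_best = max(population, key=fitness)
    for _ in range(generations):
        # Selection: only the fitter half survives the modeled "environment".
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            # Combination: splice two surviving parents at a random cut point.
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_genes)
            child = mother[:cut] + father[cut:]
            # Mutation: occasionally flip a gene.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = survivors + children
    return initial_best, max(population, key=fitness)


def count_ones(chromosome):
    """Toy fitness function: the more 1s, the fitter the chromosome."""
    return sum(chromosome)


initial_best, final_best = evolve(count_ones)
print(count_ones(initial_best), "->", count_ones(final_best))
```

Because survivors are carried over unchanged, the best fitness in the population can never get worse from one generation to the next.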

15

Definitions. Part II

Natural language processing (NLP): uses algorithms to analyze text data. Sentiment Analysis is a common application; for example, it can measure customers’ reaction to a product campaign by analyzing social media.
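
One very simple way to approximate Sentiment Analysis is to count positive and negative words in each post. The sketch below assumes tiny hand-made word lists (`POSITIVE`, `NEGATIVE`) and invented posts; production systems typically rely on much richer lexicons or trained models.

```python
import re

# Tiny hypothetical word lists, invented for illustration only.
POSITIVE = {"love", "great", "good", "awesome"}
NEGATIVE = {"hate", "bad", "terrible", "awful"}


def sentiment_score(text):
    """Return (# positive words) - (# negative words) found in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


# Hypothetical social media posts about a product campaign.
posts = [
    "I love the new phone, great camera!",
    "Terrible battery. I hate it.",
    "It arrived on Tuesday.",
]
scores = [sentiment_score(p) for p in posts]
for post, score in zip(posts, scores):
    print(f"{score:+d}  {post}")
```

A positive score suggests a favorable reaction, a negative score an unfavorable one, and zero a neutral or undetected one.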

Neural networks: models inspired by the workings of neurons and synapses within the brain. Used for finding nonlinear patterns. They can be used for Pattern Recognition and Optimization. Examples of neural network applications include identifying customers who may leave and identifying fraudulent insurance claims.

Signal processing: an electrical engineering method to analyze signals (radio, etc.) and distinguish signal from noise. It is used to extract the signal from a set of noisy, less precise data [Signal Detection Theory].
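
A basic way to separate signal from noise is to smooth the data with a moving average. The sketch below is only one of many signal processing techniques; the sine-wave signal, the noise level, the window of 21 points and the helper names are assumptions made for this example.

```python
import math
import random


def centered_moving_average(values, window):
    """Average each point with its neighbors; window must be odd."""
    half = window // 2
    return [
        sum(values[i - half : i + half + 1]) / window
        for i in range(half, len(values) - half)
    ]


def rmse(estimate, truth):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth))


rng = random.Random(42)
# Hypothetical slow-moving signal buried in random noise.
true_signal = [math.sin(2 * math.pi * t / 200) for t in range(400)]
noisy = [s + rng.gauss(0, 0.5) for s in true_signal]

window = 21
half = window // 2
smoothed = centered_moving_average(noisy, window)
aligned_truth = true_signal[half : len(true_signal) - half]

print("error before smoothing:", round(rmse(noisy[half : len(noisy) - half], aligned_truth), 3))
print("error after smoothing: ", round(rmse(smoothed, aligned_truth), 3))
```

Averaging neighboring points cancels much of the random noise while leaving the slower-moving signal largely intact, so the smoothed series sits closer to the true signal.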

16

Definitions. Part III

Spatial Analysis: analyzes geographic location encoded within the data. The location information comes from GPS. Applications include spatial regression to estimate a consumer’s willingness to purchase a product given his or her location.

17

6) Big Data Temperaments

Source: Harvard Business Review, April 2012, by Shvetank Shah, Andrew Horne and Jaime Capella.

18

7) Is Big always better?

19

No! says Nate Silver

• He refers to John P. A. Ioannidis’s 2005 paper, “Why Most Published Research Findings Are False.” Two-thirds of scientific papers’ results can’t be replicated!

• “… numbers have no way of speaking for themselves. We speak for them.”

• “I came to realize that prediction in the era of Big Data was not going very well.”

• “If the quantity of information is increasing [exponentially]… Most of it is just noise.”

20

Nate’s targets

• Political pundits. Their “intuitive” election predictions have been disastrous. Granted, the cause was not Big Data but No Data. He showed them how to do it using Small Data (polls with samples < 1,000);

• Economic forecasters. They have used Big Data with poor results. The majority of them can’t forecast a recession that is already underway. ECRI predicted with certainty a double-dip recession in 2011 using dozens of variables it did not understand. Instead, the economy improved;

• Stock market & financial market forecasters. Their performance has been similar to that of economic forecasters;

• Earthquake forecasters. The field is not well understood.

“… Statistical inferences are much stronger when backed up by theory… about their root causes.”

21

No! says Vincent Granville

• Big Data is huge, but information is very sparse;

• Storing and processing the entire data set is very inefficient;

• You can do better by smartly sampling only 5% of the data;

You don’t need Big Data; you need Smart Data.
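
The sampling point can be illustrated with a small simulation. This is a sketch using plain random sampling, which is an assumption on our part and not necessarily the “smart” sampling Granville has in mind; the simulated population of 100,000 values is invented.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical "big" data set: 100,000 simulated measurements.
population = [rng.gauss(100, 20) for _ in range(100_000)]

# Draw a simple random sample of 5% of the rows.
sample = rng.sample(population, k=len(population) // 20)

full_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(f"mean of all {len(population):,} rows: {full_mean:.2f}")
print(f"mean of the {len(sample):,}-row sample: {sample_mean:.2f}")
```

For a summary statistic such as the mean, the 5% sample lands very close to the full-data answer while touching one twentieth of the rows.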

22

Yes! says Chris Anderson

• He quotes Peter Norvig, Google’s research director: “All models are wrong, and increasingly you can succeed without them.”

• “… with massive data, [the scientific method] is becoming obsolete.”

• “We can throw the numbers into the biggest computing clusters … and let statistical algorithms find patterns where science cannot.” He mentions examples such as J. Craig Venter’s gene sequencing, Google Search, and Google Translate, among other successes.

“Correlation supersedes causation, and science can advance without coherent models, unified theories, or … any … explanation at all.”

“With enough data, the numbers speak for themselves.”

23

Big Data Effectiveness Map

Field needing causal understanding, theory not well understood

• Tall data: more data, more noise. Oversampling.

• Wide data: more variables, more false positives. Multicollinearity. Model overfitting.

• Examples: Economics, Financial markets, Earthquake forecasting.

Field needing causal understanding, theory well understood

• Tall data: more data, more signal. Oversampling.

• Wide data: more variables, more explanation. Multicollinearity. Model overfitting.

• Examples: Weather forecasting, Customer behavior.

Rule Based

• More data, better model performance.

• Examples: Games & Sports [Chess, Baseball, etc.], Politics.

Field not needing causal understanding

• More data, better model performance.

• Examples: Google Search, Google Translate, Google Flu Trends, Customer behavior.
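
The “more variables, more false positives” entry in the map can be made concrete with a little arithmetic. The sketch below assumes that every candidate variable is truly unrelated to the outcome, that each is tested independently, and that the significance level is 5%; the function name is ours.

```python
def chance_of_false_positive(n_variables, alpha=0.05):
    """Probability that at least one unrelated variable looks significant.

    Assumes independent tests, each with false-positive rate alpha.
    """
    return 1 - (1 - alpha) ** n_variables


for n in (1, 10, 100):
    expected = n * 0.05  # expected number of spurious "findings"
    print(f"{n:>3} variables: P(at least one false positive) = "
          f"{chance_of_false_positive(n):.3f}, expected count = {expected:.1f}")
```

Under these assumptions, widening a data set from 1 to 100 unrelated variables makes at least one spurious “significant” result close to a certainty, which is why wide data is risky when the underlying theory is weak.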
