Big Data Analytics - Introduction

Big Data Analytics

What Is Big Data Analytics?

Big Data:

Buzzword

Two definitions:

Data sets too large for modern relational databases

Semi-structured/unstructured data sets

Analytics:

The science of measuring and discovering patterns and trends with data

Source: http://www.socialtalent.co/blog/big-data-whats-the-big-deal

Data, Data, Everywhere...

In 2004:

Internet traffic: 1 exabyte (that's 134,217,728 8 GB flash drives)

A lot of other media:

Newspapers/books/magazines

DVDs

Data, Data, Everywhere...

Today:

Internet traffic: 1.3 zettabytes (roughly 178.7 billion 8 GB sticks); 110.3 exabytes per month

Even more media:

Mobile devices (phones/tablets/MP3 players/etc.)

The Internet of Things

Streaming Media
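The flash-drive comparisons on these slides are plain unit arithmetic. A minimal sketch, assuming binary units (1 GB = 2**30 bytes) and rounding down to whole drives; the constant and function names are made up for illustration:

```python
# Sketch: how many 8 GB flash drives hold a given traffic volume?
# Assumption: binary units (1 GB = 2**30 bytes), which reproduces the 2004 figure.
GB = 2**30
EB = 2**60
ZB = 2**70

DRIVE_BYTES = 8 * GB

def drives_needed(total_bytes: int) -> int:
    """Number of whole 8 GB drives that fit in total_bytes (rounded down)."""
    return total_bytes // DRIVE_BYTES

drives_2004 = drives_needed(1 * EB)
# 1.3 ZB is written as 13/10 in integer arithmetic to avoid float rounding.
drives_today = drives_needed(13 * ZB // 10)

print(f"{drives_2004:,}")   # 134,217,728
print(f"{drives_today:,}")  # 178,670,639,513
```

With decimal units (1 GB = 10**9 bytes) the counts would differ, so the comparison is only as exact as the unit convention behind it.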

The Internet of Things

How many of you have...

Fitness trackers?

E-readers?

iPods?

Tie them to social sites (e.g., Facebook)?

The Internet of Things

You're being tracked!

So what?

Marketing

Medical

Government

Building a fuller picture of what's tracked.

Social Network Integration

Six Degrees of Separation

Source: http://www.83toinfinity.com

Source: http://www.math.cornell.edu/~numb3rs/blanco/social_net.jpg
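One way to make "degrees of separation" concrete is to count the fewest hops between two people in a friendship graph. A minimal sketch using breadth-first search; the graph and the names in it are hypothetical, not data from the talk:

```python
from collections import deque

# Hypothetical friendship graph; the names are made up for illustration.
friends = {
    "Ann": ["Bob", "Cara"],
    "Bob": ["Ann", "Dev"],
    "Cara": ["Ann", "Dev"],
    "Dev": ["Bob", "Cara", "Eli"],
    "Eli": ["Dev"],
}

def degrees_of_separation(graph, start, goal):
    """Return the fewest hops between start and goal, or None if unconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, hops = queue.popleft()
        for friend in graph.get(person, []):
            if friend == goal:
                return hops + 1
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, hops + 1))
    return None

print(degrees_of_separation(friends, "Ann", "Eli"))  # 3
```

Real social graphs are far too large for an in-memory dictionary, which is part of why this kind of analysis ends up on big data platforms.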

Data Storage

Relational Databases:

Structured data

Can scale to huge volumes of data

Hadoop:

Semi-structured/unstructured data

Massively parallel storage and processing
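The idea behind "massively parallel storage and processing" can be sketched as a map step that works on each block of a file independently, followed by a reduce step that merges the partial results. This is a toy illustration in plain Python, not the Hadoop API; the block contents and function names are assumptions:

```python
from collections import Counter
from functools import reduce

# Hypothetical "blocks" of an unstructured text file, as if split across nodes.
blocks = [
    "big data is big",
    "data is everywhere",
]

def map_block(text):
    """Map step: count words within one block, independently of the others."""
    return Counter(text.split())

def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right

# Each map call touches only its own block, so the calls could run in parallel.
partials = [map_block(b) for b in blocks]
totals = reduce(merge_counts, partials, Counter())

print(totals["big"], totals["data"], totals["is"])  # 2 2 2
```

Because no map call depends on another block, adding more machines lets more blocks be processed at once.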

Relational Database

Source: http://www.ntu.edu.sg/home/ehchua/programming/sql/images/ManyToOne.png
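A many-to-one relationship of the kind the diagram depicts can be sketched with Python's built-in sqlite3 module. The table and column names below are hypothetical, chosen only to show many rows pointing at one parent row through a foreign key:

```python
import sqlite3

# In-memory database; the schema and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        department_id INTEGER NOT NULL REFERENCES departments(id)
    );
    INSERT INTO departments VALUES (1, 'Analytics'), (2, 'Marketing');
    INSERT INTO employees VALUES (1, 'John', 1), (2, 'Anna', 1), (3, 'Peter', 2);
""")

# Many employees map to one department; the join follows the foreign key.
rows = conn.execute("""
    SELECT d.name, COUNT(*) FROM employees e
    JOIN departments d ON d.id = e.department_id
    GROUP BY d.name ORDER BY d.name
""").fetchall()
print(rows)  # [('Analytics', 2), ('Marketing', 1)]
```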

Unstructured Data

Source: http://storagegaga.com/2011/12/

Semi-structured

Source: http://www.stylusstudio.com/images/figures/sql_xml_xml_fragment.gif

What Solution to Pick?

Data Volume and Speed:

Relational databases will cap out

Big data stores scale (for now):

Hadoop

Spark

Lucene

Alternative Modeling Techniques:

Hyper Normalized (6-8NF)

Inmon's Textual Disambiguation

Anchor Modeling

Data Vault

Hadoop

Version 1:

Giant data store

File distribution

File parsing tools

Generic security

Version 2:

Giant data store

Replaced foundation work

Unified security: LDAP/Kerberos support

Tools

Oozie

Hive

NoSQL Databases:

HBase

MongoDB

JSON

{"employees": [{ "firstName":"John" , "lastName":"Doe" },{ "firstName":"Anna" , "lastName":"Smith" },{ "firstName":"Peter" , "lastName":"Jones" }]}

Source: http://www.w3schools.com/json/json_syntax.asp
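Semi-structured data like this carries its own field names, so a program can walk it without a predefined schema. A minimal sketch using Python's standard json module on the same document; the variable names are assumptions:

```python
import json

# The JSON document from the slide, held as a string.
doc = '''{"employees": [{ "firstName":"John" , "lastName":"Doe" },{ "firstName":"Anna" , "lastName":"Smith" },{ "firstName":"Peter" , "lastName":"Jones" }]}'''

data = json.loads(doc)  # parse into nested dicts and lists

# Each employee is a dict keyed by the field names embedded in the data.
full_names = [f'{e["firstName"]} {e["lastName"]}' for e in data["employees"]]
print(full_names)  # ['John Doe', 'Anna Smith', 'Peter Jones']
```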

How to Analyze?

Performance

Timeliness

Accuracy

Feedback

Big Data Solutions

Search the entire data set

Great performance

Highly accurate

Integrates into analytics tools

Only some of the tools are able to support Hadoop, etc.

Statistics

Designed for all sizes of data sets

Decreases time to results

As accurate as needed

Fully supported by analytics tools

Supported by most Big Data tools
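The trade-off between "decreases time to results" and "as accurate as needed" can be illustrated by estimating a mean from a random sample instead of scanning every record. A minimal sketch on synthetic data; the data set, sample size, and the rough 95% margin formula are assumptions for illustration:

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical "large" data set: 200,000 synthetic measurements.
population = [random.uniform(0, 100) for _ in range(200_000)]

def estimate_mean(data, sample_size):
    """Estimate the mean from a simple random sample, with a rough 95% margin."""
    sample = random.sample(data, sample_size)
    mean = statistics.fmean(sample)
    margin = 1.96 * statistics.stdev(sample) / sample_size ** 0.5
    return mean, margin

full_mean = statistics.fmean(population)       # full scan of every record
est, margin = estimate_mean(population, 2_000)  # reads 1% of the records

print(f"full scan: {full_mean:.2f}")
print(f"sample:    {est:.2f} +/- {margin:.2f}")
```

A larger sample shrinks the margin, so the sample size can be tuned to the accuracy the question actually needs.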

Analytics Tools

Can access data of most sizes

Most can handle Hadoop and some NoSQL databases

Built for Predictive Modeling

Starting to handle social/network modeling

How to Get Started

Grab some tools!

RapidMiner (http://rapidminer.com/)

R (http://www.r-project.org/)

Weka (http://www.cs.waikato.ac.nz/ml/weka/)

Grab some data!

http://www.kdnuggets.com/datasets/index.html

http://aws.amazon.com/publicdatasets/

http://www.reddit.com/r/datasets

Prizes/Challenges

Kaggle - https://www.kaggle.com/

MIT - http://bigdata.csail.mit.edu/challenge

Heritage Health Prize - http://www.heritagehealthprize.com/c/hhp

Twitter - @OpenDataAlex

LinkedIn - alexmeadows

Github - dbaAlex

Questions? Comments?