Big Data

Introduction to Big Data – Dr. Putchong Uthayopas, Department of Computer Engineering, Faculty of Engineering, Kasetsart University. [email protected]

Description

Things worth knowing about Big Data. A talk given at the Chulabhorn Institute, 23 April 2013.

Transcript of Big Data

Page 1: Big Data

Introduction to Big Data

Dr. Putchong Uthayopas

Department of Computer Engineering, Faculty of Engineering, Kasetsart University.

[email protected]

Page 2: Big Data

Agenda

• Introduction and Motivation

• Big Data Characteristics

• Big Data Technology

• Using Big Data

• Trends

Page 3: Big Data

Introduction and Motivation

Page 4: Big Data

We are living in the world of Data

Geophysical Exploration

Medical Imaging

Video Surveillance

Mobile Sensors

Gene Sequencing

Smart Grids

Social Media

Page 5: Big Data

Some data sizes: ~40 × 10⁹ web pages at ~300 kilobytes each ≈ 10 petabytes

YouTube: 48 hours of video uploaded per minute; in two months of 2010, more video was uploaded than the total output of NBC, ABC, and CBS; roughly 2.5 petabytes uploaded per year?

LHC 15 petabytes per year

Radiology 69 petabytes per year

Square Kilometre Array telescope will produce ~100 terabits/second

Earth Observation becoming ~4 petabytes per year

Earthquake Science – few terabytes total today

PolarGrid – hundreds of terabytes/year

Exascale simulation data dumps – terabytes/second


Page 6: Big Data

http://www.touchagency.com/free-twitter-infographic/

Page 7: Big Data
Page 8: Big Data

Information as an Asset

• The cloud will enable larger and larger data sets to be easily collected and used

• People will deposit information into the cloud – a bank or personal warehouse for data

• New technology will emerge – Larger and scalable storage technology

– Innovative and complex data analysis/visualization for multimedia data

– Security technology to ensure privacy

• The cloud will become mankind's intelligence and memory!

Page 9: Big Data

“Data is the new oil.”

Andreas Weigend, Stanford (ex Amazon)

Data is more like soup – it's messy and you don't know what's in it…

Page 10: Big Data

The Coming of Data Deluge

• In the past, most scientific disciplines could be described as small data, or even data poor. Most experiments or studies had to contend with just a few hundred or a few thousand data points.

• Now, thanks to massively complex new instruments and simulators, many disciplines are generating correspondingly massive data sets that are described as big data, or data rich.

– Consider the Large Hadron Collider, which will eventually generate about 15 petabytes of data per year. A petabyte is about a million gigabytes, so that qualifies as a full-fledged data deluge.

The Coming Data Deluge: As science becomes more data intensive, so does our language

BY PAUL MCFEDRIES / IEEE SPECTRUM, FEBRUARY 2011

Page 11: Big Data

Particle physics data: “Herculean” and “Heroic”

Page 12: Big Data

Scale: an explosion of data

http://www.phgfoundation.org/reports/10364/

“A single sequencer can now generate in a day what it took 10 years to collect for the Human Genome Project”

Page 13: Big Data

Creating a connectome

• Neuroscientists have set the goal of creating a connectome, a complete map of the brain's neural circuitry.

– An image of a cubic-millimeter chunk of the brain would comprise about 1 petabyte of data (at a 5-nanometer resolution).

– There are about a million cubic millimeters of neural matter to map, making a total of about a thousand exabytes (an exabyte is about a thousand petabytes).

– This qualifies as what Jim Gray once called an exaflood of data.

The Coming Data Deluge: As science becomes more data intensive, so does our language

BY PAUL MCFEDRIES / IEEE SPECTRUM, FEBRUARY 2011

Page 14: Big Data

“The new model is for the data to be captured by instruments or generated by simulations before being processed by software and for the resulting information or knowledge to be stored in computers. Scientists only get to look at their data fairly late in this pipeline. The techniques and technologies for such data-intensive science are so different that it is worth distinguishing data-intensive science from computational science as a new, fourth paradigm for scientific exploration.”

—Jim Gray, computer scientist

Page 15: Big Data
Page 16: Big Data
Page 17: Big Data
Page 18: Big Data

• The White House today announced a $200 million big-data initiative to create tools to improve scientific research by making sense of the huge amounts of data now available.

• Grants and research programs are geared at improving the core technologies around managing and processing big data sets, speeding up scientific research with big data, and encouraging universities to train more data scientists and engineers.

• The emergent field of data science is changing the direction and speed of scientific research by letting people fine-tune their inquiries by tapping into giant data sets.

• Medical research, for example, is moving from broad-based treatments to highly targeted pharmaceutical testing for a segment of the population or people with specific genetic markers.

Page 19: Big Data
Page 20: Big Data

So, what is big data?

Page 21: Big Data

Big Data

“Big data is data that exceeds the processing

capacity of conventional database systems. The

data is too big, moves too fast, or doesn’t fit the

strictures of your database architectures. To gain

value from this data, you must choose an

alternative way to process it.”

Reference: “What is big data? An introduction to the big data

landscape.”, Edd Dumbill, http://radar.oreilly.com/2012/01/what-is-big-

data.html

Page 22: Big Data

Amazon View of Big Data

'Big data' refers to a collection of tools, techniques and technologies which make it easy to work with data at any scale. These distributed, scalable tools provide flexible programming models to navigate and explore data of any shape and size, from a variety of sources.

Page 23: Big Data

The Value of Big Data

• Analytical use

– Big data analytics can reveal insights previously hidden by data too costly to process, e.g. peer influence among customers, revealed by analyzing shoppers' transactions, social and geographical data.

– Being able to process every item of data in reasonable time removes the troublesome need for sampling and promotes an investigative approach to data.

• Enabling new products

– Facebook has been able to craft a highly personalized user experience and create a new kind of advertising business.

Page 24: Big Data

Big Data Characteristics

Page 25: Big Data

3 Characteristics of Big Data

Volume

• Volumes of data are larger than conventional relational database infrastructures can cope with.

Velocity

• The rate at which data flows in is much faster.

• Mobile events and interactions by users.

• Video, image, and audio from users.

Variety

• The source data is diverse and doesn't fall into neat relational structures, e.g. text from social networks, image data, or a raw feed directly from a sensor source.

Page 26: Big Data

Big Data Challenge

Volume

• How to process data so big that it cannot be moved or stored.

Velocity

• A lot of data arrives so fast that it cannot all be stored, such as web usage logs, Internet traffic, and mobile messages. Stream processing is needed to filter out unused data or extract knowledge in real time (see the sketch after this list).

Variety

• So many types of unstructured data formats make conventional databases ineffective.
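As a minimal illustration of the velocity point above, the following sketch (plain Java, with a hypothetical "ERROR" filter) inspects each record once as it streams past and keeps only what matters, so nothing needs to be stored in full:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

/**
 * Minimal stream-filtering sketch: events arrive continuously on stdin,
 * each line is examined exactly once, and only matching lines are kept.
 * The "ERROR" predicate and counters are illustrative assumptions.
 */
public class StreamFilter {
    public static void main(String[] args) throws IOException {
        long seen = 0, kept = 0;
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {   // unbounded stream
                seen++;
                if (line.contains("ERROR")) {          // keep only what matters
                    kept++;
                    System.out.println(line);          // forward downstream
                }
            }
        }
        System.err.printf("processed %d events, kept %d%n", seen, kept);
    }
}
```

Real deployments would use a dedicated stream-processing engine, but the principle is the same: process in flight, store only the extract.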

Page 27: Big Data

Big Data Technology

Page 28: Big Data

From “Data Driven Discovery in Science: The Fourth Paradigm”, Alex Szalay, Johns Hopkins University

Page 29: Big Data

What is needed for big data

• Your data

• Storage infrastructure

• Computing infrastructure

• Middleware to handle BIG Data

• Data Analysis

– Statistical analysis

– Data Mining

• People

Page 30: Big Data

How to deal with big data

• Integration of: storage, processing, analysis algorithms, and visualization

[Diagram: massive data streams pass through stream processing and parallel processing stages, feeding storage, analysis, and visualization.]

Page 31: Big Data

How can we store and process massive data

• Beyond the capability of a single server

• Basic infrastructure

– Cluster of servers

– High-speed interconnect

– High-speed storage cluster

• Incoming data is spread across the server farm

• Processing is quickly distributed to the farm

• Results are collected and sent back

Page 32: Big Data

NoSQL (Not Only SQL)

• Next-generation databases mostly addressing some of these points:

– being non-relational, distributed, open-source and horizontally scalable.

– Used to handle a huge amount of data

– The original intention has been modern web-scale databases.

Reference: http://nosql-database.org/

Page 33: Big Data

MongoDB

• MongoDB is a general purpose, open-source database.

• MongoDB features (a minimal usage sketch follows this list):

– Document data model with dynamic schemas

– Full, flexible index support and rich queries

– Auto-Sharding for horizontal scalability

– Built-in replication for high availability

– Text search

– Advanced security

– Aggregation Framework and MapReduce

– Large media storage with GridFS
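A minimal sketch of the document model and querying listed above, using the MongoDB Java driver; the connection string, database name ("demo"), collection ("posts"), and fields are hypothetical:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;
import static com.mongodb.client.model.Filters.eq;

public class MongoSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("demo");
            MongoCollection<Document> posts = db.getCollection("posts");

            // Dynamic schema: each document can carry different fields.
            posts.insertOne(new Document("user", "alice")
                    .append("text", "hello big data")
                    .append("likes", 3));

            // Secondary index plus a simple query.
            posts.createIndex(new Document("user", 1));
            for (Document d : posts.find(eq("user", "alice"))) {
                System.out.println(d.toJson());
            }
        }
    }
}
```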

Page 34: Big Data

What is Hadoop?

- Hadoop, or Apache Hadoop

- An open-source software framework

- Supports data-intensive distributed applications

- Developed by the Apache Software Foundation

- Derived from Google's MapReduce and Google File System (GFS) papers

- Implemented in Java (a minimal MapReduce example follows below)
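A minimal sketch of a Hadoop MapReduce job using the standard Java API: the classic word count, where the mapper emits (word, 1) pairs and the reducer sums them. Input and output paths are given on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    /** Map phase: split each line into tokens and emit (word, 1). */
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    /** Reduce phase: sum the counts for each word. */
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The framework handles the hard parts described on the previous slide: splitting the input across the server farm, running mappers near the data, shuffling intermediate pairs, and collecting the reduced results.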

Page 35: Big Data

Overview

[Diagram: a master node coordinating multiple worker nodes.]

Page 36: Big Data

HDFS
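HDFS, the Hadoop Distributed File System, splits files into blocks replicated across the cluster while clients see a single logical file. A minimal sketch of writing and reading a file through the Hadoop FileSystem Java API follows; the namenode address and path are assumptions, not values from the slides.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

/** Minimal HDFS client sketch: one logical file, replicated blocks underneath. */
public class HdfsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");        // assumed cluster address
        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/user/demo/hello.txt");         // hypothetical path
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello big data\n".getBytes(StandardCharsets.UTF_8));
            }
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.copyBytes(in, System.out, conf, false);    // read it back
            }
        }
    }
}
```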

Page 37: Big Data

Google Cloud Platform

• App Engine – mobile and web apps

• Cloud SQL – MySQL on the cloud

• Cloud Storage – Data storage

• BigQuery – data analysis (a small query sketch follows this list)

• Google Compute Engine – Processing of large data
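As one hedged example of the analysis service, a small sketch that runs a SQL query on BigQuery through its Java client library. It assumes application default credentials are configured, and the project, dataset, and table names are hypothetical:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

/** Minimal BigQuery sketch: SQL over a large table without managing servers. */
public class BigQuerySketch {
    public static void main(String[] args) throws Exception {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        // Hypothetical web-log table; replace with a real project.dataset.table.
        QueryJobConfiguration query = QueryJobConfiguration.newBuilder(
                "SELECT country, COUNT(*) AS views "
              + "FROM `my-project.weblogs.page_views` "
              + "GROUP BY country ORDER BY views DESC LIMIT 10").build();
        TableResult result = bigquery.query(query);
        for (FieldValueList row : result.iterateAll()) {
            System.out.println(row.get("country").getStringValue()
                    + " " + row.get("views").getLongValue());
        }
    }
}
```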

Page 38: Big Data

Amazon

• Amazon EC2

– Computation Service using VM

• Amazon DynamoDB

– Large, scalable NoSQL database (a small usage sketch follows this list)

– Fully distributed, shared-nothing architecture

• Amazon Elastic MapReduce (Amazon EMR)

– Hadoop based analysis engine

– Can be used to analyse data from DynamoDB
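A minimal sketch of writing and reading one DynamoDB item, assuming the AWS SDK for Java v2, configured credentials, and a hypothetical "events" table with partition key "id":

```java
import java.util.Map;

import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.GetItemRequest;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

/** Minimal DynamoDB sketch: key-value put and get against an assumed table. */
public class DynamoSketch {
    public static void main(String[] args) {
        try (DynamoDbClient ddb = DynamoDbClient.builder().region(Region.US_EAST_1).build()) {
            String table = "events";   // hypothetical table, partition key "id"

            ddb.putItem(PutItemRequest.builder()
                    .tableName(table)
                    .item(Map.of(
                            "id", AttributeValue.builder().s("evt-1").build(),
                            "type", AttributeValue.builder().s("click").build()))
                    .build());

            Map<String, AttributeValue> item = ddb.getItem(GetItemRequest.builder()
                    .tableName(table)
                    .key(Map.of("id", AttributeValue.builder().s("evt-1").build()))
                    .build()).item();
            System.out.println(item);
        }
    }
}
```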

Page 39: Big Data

Issues

• The I/O capability of a single computer is limited; how do we handle massive data?

• Big data cannot be moved

– Careful planning must be done to handle big data

– Processing capability must be there from the start

Page 40: Big Data

Using Big Data

Page 41: Big Data

WHAT FACEBOOK KNOWS

http://www.facebook.com/data

Cameron Marlow calls himself Facebook's "in-house sociologist." He and his team can analyze essentially all the information the site gathers.

Page 42: Big Data

Study of Human Society

• Facebook, in collaboration with the University of Milan, conducted an experiment that involved

– the entire social network as of May 2011

– more than 10 percent of the world's population.

• Analyzing the 69 billion friend connections among those 721 million people showed that

– four intermediary friends are usually enough to introduce anyone to a random stranger.

Page 43: Big Data

The links of Love

• Often young women specify that they are “in a relationship” with their “best friend forever”.

– Roughly 20% of all relationships for the 15-and-under crowd are between girls.

– This number dips to 15% for 18-year-olds and is just 7% for 25-year-olds.

• For anonymous US users who were over 18 at the start of the relationship:

– the average of the shortest number of steps to get from any one U.S. user to any other individual is 16.7.

– This is much higher than the 4.74 steps you’d need to go from any Facebook user to another through friendship, as opposed to romantic, ties.

http://www.facebook.com/notes/facebook-data-team/the-links-of-love/10150572088343859

Graph showing the relationships of anonymous US users who were over 18 at the start of the relationship.

Page 44: Big Data

Why?

• Facebook can improve the user experience – make useful predictions about users' behavior

– make better guesses about which ads you might be more or less open to at any given time

• Right before Valentine's Day this year, a blog post from the Data Science Team listed the songs most popular with people who had recently signaled on Facebook that they had entered or left a relationship.

Page 45: Big Data

How does Facebook handle Big Data?

• Facebook built its data storage system using open-source software called Hadoop.

– Hadoop spreads the data across many machines inside a data center.

– Facebook uses Hive, an open-source layer that acts as a translation service, making it possible to query vast Hadoop data stores using relatively simple code (a sketch follows below).

• Much of Facebook's data resides in one Hadoop store more than 100 petabytes (a million gigabytes) in size, says Sameet Agarwal, a director of engineering at Facebook who works on data infrastructure, and the quantity is growing exponentially. "Over the last few years we have more than doubled in size every year."
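A minimal sketch of the Hive idea described above: a SQL-like query over data that actually lives in Hadoop, submitted through Hive's JDBC interface. The server address, table, and columns are hypothetical, not Facebook's actual schema.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Minimal Hive-over-JDBC sketch: simple SQL, executed against Hadoop data. */
public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "demo", "");
             Statement stmt = conn.createStatement();
             // Hive translates this query into jobs that scan the underlying
             // Hadoop files, so the analyst never writes MapReduce by hand.
             ResultSet rs = stmt.executeQuery(
                     "SELECT country, COUNT(*) AS users FROM profiles GROUP BY country")) {
            while (rs.next()) {
                System.out.println(rs.getString("country") + " " + rs.getLong("users"));
            }
        }
    }
}
```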

Page 46: Big Data

Google Flu

• A pattern emerges when all the flu-related search queries are added together (a toy estimation sketch follows below).

• We compared our query counts with traditional flu surveillance systems and found that many search queries tend to be popular exactly when flu season is happening.

• By counting how often we see these search queries, we can estimate how much flu is circulating in different countries and regions around the world.

http://www.google.org/flutrends/about/how.html
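To make the counting idea concrete, here is a toy sketch, not Google's actual model: it computes the weekly share of queries containing flu-related terms and maps it to an activity estimate with purely illustrative coefficients.

```java
import java.util.List;
import java.util.Locale;

/**
 * Toy flu-activity estimate from search queries (hypothetical data, terms,
 * and coefficients). Illustrates the idea of counting flu-related queries,
 * not the regression model Google actually fitted.
 */
public class FluQueryEstimate {
    private static final List<String> FLU_TERMS = List.of("flu", "influenza", "fever", "cough");

    /** Fraction of the week's queries that mention a flu-related term. */
    static double fluQueryFraction(List<String> weeklyQueries) {
        long hits = weeklyQueries.stream()
                .map(q -> q.toLowerCase(Locale.ROOT))
                .filter(q -> FLU_TERMS.stream().anyMatch(q::contains))
                .count();
        return (double) hits / weeklyQueries.size();
    }

    public static void main(String[] args) {
        List<String> queries = List.of("flu symptoms", "buy shoes", "fever remedy", "weather");
        double fraction = fluQueryFraction(queries);
        // Illustrative linear mapping from query share to an estimate of
        // influenza-like-illness (ILI) activity.
        double estimate = 0.05 + 12.0 * fraction;
        System.out.printf("flu query share=%.2f, estimated ILI activity=%.2f%n", fraction, estimate);
    }
}
```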

Page 47: Big Data

From “Data Driven Discovery in Science: The Fourth Paradigm”, Alex Szalay, Johns Hopkins University

Page 48: Big Data

From “Data Driven Discovery in Science: The Fourth Paradigm”, Alex Szalay, Johns Hopkins University

Page 49: Big Data
Page 50: Big Data
Page 51: Big Data
Page 52: Big Data
Page 53: Big Data
Page 54: Big Data
Page 55: Big Data
Page 56: Big Data

Preparing for Big Data

• Understanding and preparing your data

– To effectively analyse and, more importantly, cross-analyse your data sets – this is often where the most insightful results come from – you need to have a rigorous knowledge of what data you have.

• Getting staff up to scratch

– Finding people with data analysis experience

• Defining the business objectives

– Once the end goal has been decided, a strategy can be created for implementing big data analytics to support the delivery of this goal.

• Sourcing the right suppliers and technology

– In terms of storage, hardware, and data warehousing, you will need to make a range of decisions to make sure you have all the capabilities and functionality required to meet your big data needs.

http://www.thebigdatainsightgroup.com/site/article/preparing-big-data-revolution

Page 57: Big Data

Trends

Page 58: Big Data

Trends

• A move toward a large and scalable virtual infrastructure

– Providing computing services

– Providing basic storage services

– Providing scalable large databases (NoSQL)

– Providing analysis services

• All these services have to come together – big data cannot be moved!

Page 59: Big Data

Issues

• Security

– Will you let important data accumulate outside your organization? (If the data is not important, why analyze it?)

– Who owns the data? If you discontinue the service, is the data destroyed properly?

– Protection in a multi-tenant environment

• Big data cannot be moved easily

– Processing has to be near the data; you just cannot ship data around.

– So you ultimately have to select the same cloud for your processing. Is it available, easy, fast?

• New learning and development costs

– Will new programming or porting be needed? Are the tools mature enough?

Page 60: Big Data

When to use Big data on the Cloud

• When data is already on the cloud – virtual organizations, cloud-based SaaS services

• For startups – CAPEX becomes OPEX, no need to maintain large infrastructure, focus on scalability and pay as you go, and the data is on the cloud anyway

• For experimental projects – pilots for new services

Page 61: Big Data

Summary

• Big data is coming.

– Changing the way we do science

– Big data is being accumulated anyway

– Knowledge is power.

• Better understand your customers so you can offer better services

• Tools and technology are available

– Still being developed rapidly

Page 62: Big Data

Thank you