Big data presentation

Big Data. Trịnh Phong Nhã, Võ Hoàng Trôvi, Võ Đình Chinh. Instructor: Dr. Nguyễn Đức Thái

Transcript of Big data presentation

Big Data

Trịnh Phong Nhã, Võ Hoàng Trôvi, Võ Đình Chinh

Instructor: Dr. Nguyễn Đức Thái

Memory storage…

Computer Memory: 640K Ought to be Enough for Anyone

How much data?

• 7 billion people
• Google processes 100 PB/day; 3 million servers
• Facebook has 300 PB + 500 TB/day; 35% of world's photos
• YouTube: 1000 PB video storage; 4 billion views/day
• Twitter processes 124 billion tweets/year
• SMS messages: 6.1T per year
• US cell calls: 2.2T minutes per year
• US credit cards: 1.4B cards; 20B transactions/year

Contents

1. Big Data Overview
2. Big Data Technology Today
3. SQL vs NoSQL
4. Big Data Security
5. Big data trends
6. Demo with MongoDB & Ref docs

1. Big Data Overview (cont.)

“Big data is not a single technology but a combination of old and new technologies that helps companies gain actionable insight.” (Big Data For Dummies, John Wiley & Sons, Inc.)

1. Big Data Overview (cont.)

Characteristics of Big Data

Sources of Big Data

ERP

RFID

Website

Network Switches

Social Media

Examining Big Data Types

Structured Data

Structured Data (…)

Computer- or machine-generated: data that is created by a machine without human intervention (sensor data, web log data, point-of-sale data, financial data…).

Human-generated: data that humans supply in interaction with computers (input data, click-stream data, gaming-related data…).

Examining Big Data Types

Unstructured Data

Unstructured Data (…)

Unstructured data is everywhere.

Machine-generated unstructured data: satellite images, scientific data, photographs and video, radar or sonar data…

Human-generated unstructured data: text internal to your company, social media data, mobile data…

Managing different data types


Integrating data types into a big data environment requires:

Connectors: enable you to pull data in from various big data sources.

Metadata: the definitions, mappings, and other characteristics used to describe how to find, access, and use a company’s data (and software) components.

Analysis & Processing

Analysis
• Querying
• Statistics
• Modeling
• Data mining
• Text analytics

Processing
• Data storage
• Data transfer
• Data monitoring

What will we do with Big Data?

Quiz….?

How to store and handle Big Data?

2. Big Data Technology Today

Storage… NoSQL Database

2. Big Data Technology Today (cont.)

Processing

2. Big Data Technology Today (cont.)

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
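The "simple programming model" that Hadoop is best known for is map/reduce. The sketch below imitates that model in a single Python process so the three stages (map, shuffle, reduce) are visible. It does not use Hadoop itself; the function names and the two sample input splits are made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values that share one key."""
    return key, sum(values)


def word_count(documents):
    # On a real cluster each split would be mapped on a different node;
    # here the splits are simply processed one after another.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input splits, standing in for blocks of a large file.
splits = ["big data is big", "data is everywhere"]
counts = word_count(splits)
print(counts)
```

Because each map call only sees its own split and each reduce call only sees one key, both stages can in principle be spread across many machines, which is the point of the model.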

2. Big Data Technology Today (cont.)

Instead of treating memory as a cache, why not treat it as a primary data store?

• Facebook keeps 80% of its data in memory (Stanford research)
• RAM is 100-1000x faster than disk (random seek)
  • Disk: 5-10 ms
  • RAM: x0.001 msec

[Diagram labels: Events, Facebook, Memory Grid, Data Grid]

2. Big Data Technology Today (cont.)

Transfer data:

2. Big Data Technology Today (cont.)

Hadoop: open-source software framework from Apache, modeled on two Google designs:
• Google MapReduce, which corresponds to Hadoop Map/Reduce
• GFS (Google File System), which corresponds to HDFS

3. SQL vs NoSQL

Data storage

File

SQL DBMS

NoSQL

3. SQL vs NoSQL (…)

A relational database is a set of tables containing data fitted into predefined categories.

Each table contains one or more data categories in columns.

Each row contains a unique instance of data for the categories defined by the columns.
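The table, column, and row vocabulary above can be tried out with Python's built-in sqlite3 module. This is a minimal sketch; the students table and its contents are invented for illustration and are not part of the original slides.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")

# The table fixes its categories (columns) in advance.
conn.execute(
    "CREATE TABLE students ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " major TEXT NOT NULL)"
)

# Each row is one unique instance of data for those columns.
conn.executemany(
    "INSERT INTO students (id, name, major) VALUES (?, ?, ?)",
    [(1, "An", "Computer Science"), (2, "Binh", "Mathematics")],
)

rows = conn.execute("SELECT name, major FROM students ORDER BY id").fetchall()
print(rows)
conn.close()
```

Every row must fit the predefined columns; inserting a row with an extra, undeclared field is rejected, which is the main contrast with the NoSQL models that follow.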

3. SQL vs NoSQL (…)

Key-value stores. As the name implies, a key-value store is a system that stores values indexed for retrieval by keys.

Some of the market leaders:

Riak, Amazon Dynamo, Voldemort
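The key-value idea can be sketched with an ordinary dictionary. The class below is a toy, single-process stand-in and not the API of Riak, Dynamo, or Voldemort; the class name, method names, and sample key are hypothetical.

```python
class KeyValueStore:
    """Toy key-value store: values are opaque and found only by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The store does not inspect the value; it just files it under the key.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("session:42", b'{"user": "nha"}')
print(store.get("session:42"))
print(store.get("session:99"))
```

There is no query language here: the only way to reach a value is to know its key, which is what makes this model easy to partition across many servers.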

3. SQL vs NoSQL (…)

Column-oriented databases. Column-oriented databases contain one extendable column of closely related data.

Some of the market leaders:

HBase, Cassandra
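One rough way to picture this model is a table where each row key maps to column families, and each family can hold a different, growing set of columns per row. The sketch below is a simplified illustration under that assumption. It is not the HBase or Cassandra API, it leaves out details such as value timestamps, and all names and sample data in it are hypothetical.

```python
from collections import defaultdict


class ColumnFamilyTable:
    """Toy wide-column table: row key -> column family -> {column: value}."""

    def __init__(self, families):
        # Column families are declared up front; columns inside them are not.
        self.families = set(families)
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self._rows[row_key][family][column] = value

    def get_family(self, row_key, family):
        # Return a copy of the closely related columns stored together.
        return dict(self._rows.get(row_key, {}).get(family, {}))


table = ColumnFamilyTable(["profile", "activity"])
table.put("user1", "profile", "name", "Alice")
table.put("user1", "profile", "city", "Hanoi")
table.put("user2", "profile", "name", "Bob")
table.put("user2", "activity", "last_login", "2013-05-01")

print(table.get_family("user1", "profile"))
print(table.get_family("user2", "profile"))
```

Note that user1 and user2 carry different columns in the same family; no schema change was needed to add them.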

3. SQL vs NoSQL (…)

Document-based stores. These databases store and organize data as collections of documents, rather than as structured tables with uniform-sized fields for each record.

Some of the market leaders:

MongoDB, CouchDB, SimpleDB
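To make "collections of documents" concrete, the sketch below keeps documents as Python dictionaries whose fields may differ from one document to the next. TinyCollection and its insert and find methods are hypothetical helpers written for this example; they are not the MongoDB, CouchDB, or SimpleDB API, and the sample documents are invented.

```python
import copy
import itertools


class TinyCollection:
    """Toy document collection: a list of dicts with no fixed schema."""

    def __init__(self):
        self._docs = []
        self._ids = itertools.count(1)

    def insert(self, doc):
        stored = copy.deepcopy(doc)
        # Assign an identifier if the caller did not supply one.
        stored.setdefault("_id", next(self._ids))
        self._docs.append(stored)
        return stored["_id"]

    def find(self, query):
        # Match documents whose fields equal every key/value in the query.
        return [
            copy.deepcopy(doc)
            for doc in self._docs
            if all(doc.get(key) == value for key, value in query.items())
        ]


books = TinyCollection()
books.insert({"title": "Big Data For Dummies", "year": 2013})
books.insert({"title": "Hadoop notes", "tags": ["hdfs", "mapreduce"]})

print(books.find({"year": 2013}))
```

The second document has a tags field and no year, yet it lives in the same collection; a relational table would need a schema change or a separate table for that.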

3. SQL vs NoSQL (…)

SQL 2008 Data storage capacity

3. SQL vs NoSQL (…)

GridFS stores files in two collections:
• chunks stores the binary chunks. For details, see The chunks Collection.
• files stores the file’s metadata. For details, see The files Collection.
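The two-collection layout can be mimicked without a database. The sketch below splits a byte string into chunk documents plus one metadata document, then reassembles it. The field names (files_id, n, data, length, chunkSize, filename) follow the GridFS documentation, but the helper functions are hypothetical, the uploadDate field is left out, and nothing here talks to MongoDB.

```python
DEFAULT_CHUNK_SIZE = 255 * 1024  # GridFS default chunk size: 255 KiB


def split_into_gridfs_docs(file_id, filename, data, chunk_size=DEFAULT_CHUNK_SIZE):
    """Return (files_doc, chunk_docs) shaped like GridFS documents."""
    files_doc = {
        "_id": file_id,
        "filename": filename,
        "length": len(data),      # total size in bytes
        "chunkSize": chunk_size,  # size of every chunk except possibly the last
    }
    chunk_docs = [
        {"files_id": file_id, "n": n, "data": data[offset:offset + chunk_size]}
        for n, offset in enumerate(range(0, len(data), chunk_size))
    ]
    return files_doc, chunk_docs


def reassemble(chunk_docs):
    """Rebuild the original bytes by ordering chunks on their sequence number n."""
    ordered = sorted(chunk_docs, key=lambda chunk: chunk["n"])
    return b"".join(chunk["data"] for chunk in ordered)


payload = bytes(range(256)) * 10  # 2560 bytes of sample data
files_doc, chunk_docs = split_into_gridfs_docs(1, "sample.bin", payload, chunk_size=1000)
print(files_doc["length"], len(chunk_docs))
```

A small chunk_size is passed in the usage line only to force several chunks from a short payload.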

3. SQL vs NoSQL (…)

BSON Types

The chunks Collection

The files Collection

3. SQL vs NoSQL (…)

4. Big Data Security

• Secure computations in distributed programming frameworks
• Security best practices for non-relational data stores
• Secure data storage and transaction logs
• Cryptographically enforced access control and secure communication
• Granular access control
• Real-time security/compliance monitoring

4. Big Data Security (…)

Technical recommendations for security:
• Use Kerberos for node authentication
• Use file layer encryption
• Data anonymization
• Use key management
• Deployment validation
• Use secure communication
• Tokenization
• Cloud database controls
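As one small illustration of the data anonymization item, the sketch below replaces sensitive fields with a keyed hash (HMAC-SHA256) before a record is stored. This is pseudonymization under stated assumptions, not a complete anonymization or vault-based tokenization scheme. The function names, field names, and the demo key are hypothetical; a real key would come from the key management system the list also calls for.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Replace a value with a keyed hash; the same input and key give the same token."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize_record(record, sensitive_fields, secret_key):
    """Return a copy of the record with the sensitive fields pseudonymized."""
    return {
        field: pseudonymize(str(value), secret_key) if field in sensitive_fields else value
        for field, value in record.items()
    }


# Hypothetical record and key, for illustration only.
record = {"name": "Alice", "card": "4111-1111", "amount": 20}
masked = anonymize_record(record, {"name", "card"}, b"demo-key")
print(masked["amount"], len(masked["name"]))
```

Because the token is deterministic for a given key, records belonging to the same person can still be joined for analysis without exposing the original value.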

5. Big data trends

• Big data: of the people, by the people, for the people
• Big data and social computing
• Cloud computing
• In-memory computing
• Mobile applications and HTML5
• Internet and big data

6. Demo with MongoDB & Ref docs

Ref docs:

• Judith Hurwitz, Alan Nugent, Dr. Fern Halper, and Marcia Kaufman: Big Data For Dummies. John Wiley & Sons, Inc. 2013.
• “Technology Trends for 2013”, prepared by Kaushal Amin, Chief Technology Officer, KMS Technology, Atlanta, GA, USA.
• Website: http://hadoop.apache.org/

Demo with MongoDB