Page 1:

BIG DATA: OVERVIEW OF STORAGE AND PROCESSING

David Gibbs and Govardhan Tanniru

Georgia State University, Department of Computer Science

P.O. Box 3965 Atlanta, GA 30302-3965.

Page 2:

Big Data

Big Data does not relate only to the size of the data:

- Complexity: missing information, dummy data, organization
- Processing: software, processing power, parallel and distributed computing
- Data Transfer: limitations of current systems, CPU intensive
- Storage: data sets beyond relational databases, clusters, data centers, distributed data
- User Interaction: non-programmers need to work with complex information in real time; GUI interfaces, visualization of data

Page 3:

Where the Field is

Primary sources of big data:

- Meteorology
- Complex physics simulations
- Biology
- Business
- Web searching
- Social networking
- Telecommunications

Many programs for storage and processing:

- Most popular: HDFS, GFS, Hadoop, and MapReduce
- No standard for processing/storing data
- No common "off the shelf" software
- Increases the difficulty in mining data within a field or industry

Page 4:

Difficulties

Storage
- Developing a system in which very large amounts of data can be stored securely and accessed quickly

Transfer
- Transfer from the storage site to the processing site
- Moving large amounts of data over TCP is costly

Processing
- How powerful of a system is needed?
- "There is a lot of data but no information"
- Processing the data in an efficient manner and obtaining the correct information

Page 5:

The Direction of Data Storage

NoSQL
- Allows storage of massive data sets without the need for overwhelming tables and indexing
- Each cluster stores part of the data and replicates it on other clusters
- Master/slave architecture: HDFS (Hadoop Distributed File System)
- P2P architecture: Cassandra (ColumnFamily data model)

Increased difficulty for data mining
- No join operations
- Pulling in more data than needed
- Increased transfer times, processing power
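As a rough illustration of those last points (my own sketch, not from the slides), the plain Python below models a ColumnFamily-style store: data is denormalized under a row key, and because there are no join operations, a read pulls back the whole row even when only one column is needed. All names here (store, the user row, the order keys) are invented for the example.

from collections import defaultdict

# Minimal in-memory stand-in for a ColumnFamily data model:
# row key -> column family -> column -> value.  (Illustrative only;
# real stores such as Cassandra add partitioning, replication, timestamps.)
store = defaultdict(lambda: defaultdict(dict))

def put(row_key, family, column, value):
    store[row_key][family][column] = value

def get_row(row_key):
    # No joins: the application reads the whole denormalized row,
    # often pulling in more data than it actually needs.
    return store[row_key]

# Denormalized write: order data lives with the user instead of being
# joined in from a separate table at query time.
put("user:42", "profile", "name", "Ada")
put("user:42", "orders", "2012-01-03", "book #17")
put("user:42", "orders", "2012-02-11", "lamp #3")

print(get_row("user:42"))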

Page 6:

More About NoSQL (ACID)

The key advantage of schema-free design is that it enables applications to quickly upgrade the structure of data without table rewrites.

Data validity and integrity are enforced at the data management layer.

NoSQL typically does not maintain complete consistency across distributed servers because of the burden this places on databases, particularly in distributed systems.

The Consistency, Availability, Partition tolerance (CAP) theorem states that of consistency, availability, and partition tolerance, only two can be guaranteed at any time.

Traditional relational databases enforce strict transactional semantics to preserve consistency, but many NoSQL databases have more scalable architectures that relax the consistency requirement.

Some NoSQL databases put objects into a conflict state when concurrent updates disagree; it is ultimately the responsibility of the application to resolve these conflicts.
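To make the last point concrete, here is a small hypothetical Python sketch (not from the slides) of application-level conflict handling: two replicas return different versions of the same object, and the application merges them, in this case with a naive last-write-wins rule keyed on a version counter.

# Hypothetical replica responses for the same key; in a real NoSQL store
# these would come back from different servers after a network partition.
replica_a = {"key": "user:42", "email": "ada@example.com", "version": 7}
replica_b = {"key": "user:42", "email": "ada@example.edu", "version": 9}

def resolve_conflict(*versions):
    # Application-level policy: last-write-wins on the version counter.
    # Other applications might merge fields or keep both values instead.
    return max(versions, key=lambda v: v["version"])

print(resolve_conflict(replica_a, replica_b))  # keeps the version-9 write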

Page 7:

Important Papers by Google

- Google File System
- MapReduce
- BigTable

Page 8:

Google File System

Google has reexamined traditional choices and assumptions and explored radically different points in the design space.

First, component failures are the norm rather than the exception.

The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.

Second, files are huge by traditional standards. Multi-GB files are common.

Third, most files are mutated by appending new data rather than overwriting existing data.

Fourth, co-designing the applications and the file system API benefits the overall system by increasing flexibility.

Page 9:

Consistency Model

Random writes within a file are practically non-existent. Once written, the files are only read, and often only sequentially.

A variety of data share these characteristics. Appending becomes the focus of performance optimization and atomicity guarantees, while caching data blocks in the client loses its appeal.

Google has introduced an atomic append operation so that multiple clients can append concurrently to a file without extra synchronization between them.

Snapshot: creates a copy of a file or a directory tree at low cost.

Record append: allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client's append (without additional locking).
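A loose, hypothetical sketch of the record append idea (my own illustration, not GFS code): several writer threads append to one shared log, and the append call serializes each record so it lands intact at an offset chosen by the log rather than by the client.

import threading

class AppendOnlyLog:
    # Toy stand-in for GFS record append: the log, not the client,
    # picks where each record goes, and each append is atomic.
    def __init__(self):
        self.records = []
        self._lock = threading.Lock()

    def record_append(self, data):
        with self._lock:                # serializes concurrent appends
            offset = len(self.records)
            self.records.append(data)
        return offset                   # offset is chosen by the log

log = AppendOnlyLog()

def client(name):
    for i in range(3):
        log.record_append("%s-%d" % (name, i))

threads = [threading.Thread(target=client, args=("client%d" % c,)) for c in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(log.records))  # 12 records, none torn or lost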

Page 10:

GFS Architecture

Master servers keep metadata on the various data files. Chunk servers store the actual data on disk. Each chunk is replicated across three different chunk servers to create redundancy in case of server crashes.

Once directed by a master server, a client application retrieves files directly from chunk servers.
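A rough sketch of that read path (hypothetical code, not the real GFS interfaces): the master holds only metadata mapping files to chunk replicas, the chunk servers hold the actual bytes on three replicas each, and the client asks the master where a chunk lives and then reads it directly from a chunk server.

import random

# Toy master metadata: file name -> list of (chunk id, replica servers).
master_metadata = {
    "logs/web.dat": [("chunk-001", ["cs1", "cs2", "cs3"]),
                     ("chunk-002", ["cs2", "cs4", "cs5"])],
}

# Toy chunk servers: server -> chunk id -> bytes (the actual data lives here).
chunk_servers = {
    "cs1": {"chunk-001": b"first 64 MB ..."},
    "cs2": {"chunk-001": b"first 64 MB ...", "chunk-002": b"second 64 MB ..."},
    "cs3": {"chunk-001": b"first 64 MB ..."},
    "cs4": {"chunk-002": b"second 64 MB ..."},
    "cs5": {"chunk-002": b"second 64 MB ..."},
}

def read_chunk(filename, chunk_index):
    # 1. Ask the master for metadata only; no file data flows through it.
    chunk_id, replicas = master_metadata[filename][chunk_index]
    # 2. Read the bytes directly from any of the three replicas.
    server = random.choice(replicas)
    return chunk_servers[server][chunk_id]

print(read_chunk("logs/web.dat", 1))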

Page 11:

MapReduce Operation

MapReduce is a programming model and an associated implementation for processing and generating large data sets.

Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The MapReduce system has three different types of servers:

- The master server assigns user tasks to map and reduce servers. It also tracks the state of the tasks.
- The map servers accept user input and perform map operations on it. The results are written to intermediate files.
- The reduce servers accept the intermediate files produced by the map servers and perform the reduce operation on them.

The steps look like: GFS -> Map -> Shuffle -> Reduce -> store results back into GFS. In MapReduce, a map maps one view of the data to another, producing key/value pairs.

Data transferred between map and reduce servers is compressed. The idea is that because the servers aren't CPU bound, it makes sense to spend cycles on compression and decompression in order to save on bandwidth and I/O.
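The whole pipeline can be sketched in a few lines of plain Python as a single-process stand-in (an illustration under simplified assumptions, not Google's implementation): map each input record to intermediate key/value pairs, shuffle them by key, then reduce each group; in the real system the inputs and results live in GFS.

from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map: every (key, value) input record yields intermediate (key, value) pairs.
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))

    # Shuffle: group all intermediate values by their intermediate key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce: merge the values for each key; results would be stored back into GFS.
    return {key: reduce_fn(key, values) for key, values in groups.items()}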

Page 12:

Map and Reduce (contd.)

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
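Translated into runnable Python (a sketch in which EmitIntermediate and Emit simply become return values), these two functions plug straight into the run_mapreduce helper sketched on the previous page:

def word_count_map(key, value):
    # key: document name, value: document contents
    return [(word, 1) for word in value.split()]

def word_count_reduce(key, values):
    # key: a word, values: a list of counts
    return sum(values)

docs = [("doc1", "big data big storage"), ("doc2", "big compute")]
print(run_mapreduce(docs, word_count_map, word_count_reduce))
# {'big': 3, 'data': 1, 'storage': 1, 'compute': 1}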

Page 13:

BigTable

BigTable is a large-scale, fault-tolerant, self-managing system that includes terabytes of memory and petabytes of storage. It can handle millions of reads/writes per second.

BigTable is a distributed hash mechanism built on top of GFS. It is not a relational database. It doesn't support joins or SQL type queries.

It provides a lookup mechanism for accessing structured data by key. GFS stores opaque data, and many applications need data with structure.

Machines can be added and deleted while the system is running and the whole system just works.

Each data item is stored in a cell which can be accessed using a row key, column key, and timestamp.

BigTable has three different types of servers: master, tablet, and lock servers.
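A very small sketch of that data model (my own illustration, not BigTable's actual API): each cell is keyed by (row key, column key, timestamp), and a read can ask for the latest version of a cell.

# Toy cell store keyed by (row key, column key, timestamp) -> value.
table = {}

def put(row, column, timestamp, value):
    table[(row, column, timestamp)] = value

def get_latest(row, column):
    # Return the value with the newest timestamp for this row/column pair.
    versions = [(ts, v) for (r, c, ts), v in table.items() if r == row and c == column]
    return max(versions)[1] if versions else None

put("com.cnn.www", "contents:html", 1, "<html>version 1</html>")
put("com.cnn.www", "contents:html", 2, "<html>version 2</html>")
print(get_latest("com.cnn.www", "contents:html"))  # newest version wins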

Page 14:

Hardware strategy

Use ultra-cheap commodity hardware and build software on top to handle its failures.

A 1,000-fold computer power increase can be had for a 33 times lower cost if you use a failure-prone infrastructure rather than an infrastructure built on highly reliable components. You must build reliability on top of unreliability for this strategy to work.

Page 15:

Mixed Architectures

Many papers focus on the integration of traditional and big data architectures.

We need architectures that handle both types of data. Below is the diagram from an Oracle white paper.

Page 16:

Other Areas of Focus

- Knowledge discovery in databases: bringing the big data and big compute communities together is an active area of research.
- Hybrid ways of storing unstructured data (file systems and DBMS).
- Efficient data transfer protocols for big data (high-performance network data movement).
- Use of cloud computing for big data.
- Compression aspects: I/O performance analysis for big data clustering.
- Privacy implications on social networking sites (friends tagging another person).
- Faults with Hadoop might help our research.

Page 17:

Questions