
Big data: Issues, challenges, tools and Good practices

John Lenhart


Motivation

Data stores are growing by 50% each year, and that rate of increase is accelerating[1]

In 2010, total online data crossed the zettabyte (ZB) barrier. This year, we will produce 4 ZB of data worldwide[1]

The type of data is also changing. Over 80% of it will be unstructured data, which does not work well with relational databases[1]


Big Data Defined

“Big data is defined as large amount of data which requires new technologies and architectures so that it becomes possible to extract value from it…”

“Big data” is something of a misnomer: the name points out only the size of the data, paying little attention to its other defining properties


Big Data’s Properties

Variety - the stored data is not all of the same type or category (the three kinds are contrasted in the sketch after this list)

Structured data - data organized in a defined structure so that it is identifiable, e.g. SQL tables

Semi-structured data - data with a self-describing structure that nonetheless does not conform to the formal structure of a relational database, e.g. XML

Unstructured data - data with no identifiable structure, e.g. an image
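To make the three categories concrete, here is a minimal sketch (not from the original slides) that represents a similar record each way, using only the Python standard library; the city/temperature record is borrowed from the MapReduce example later in the deck.

import sqlite3
import xml.etree.ElementTree as ET

# Structured: a fixed schema makes every field identifiable by name and type.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE readings (city TEXT, temp INTEGER)")
db.execute("INSERT INTO readings VALUES ('Toronto', 20)")
print(db.execute("SELECT city, temp FROM readings").fetchall())

# Semi-structured: XML is self-describing, but enforces no relational schema;
# elements may repeat, nest, or be omitted freely.
doc = ET.fromstring("<reading><city>Toronto</city><temp>20</temp></reading>")
print(doc.find("city").text, doc.find("temp").text)

# Unstructured: free text (or an image) has no identifiable structure, so
# extracting the city and temperature would need custom parsing or analytics.
note = "It hit about twenty degrees in Toronto this afternoon."
print(note)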


Big Data’s Properties…

Volume - the “Big” in Big data; represents the large volume, or size, of the data

At present the data exists at the petabyte scale and is expected to grow to zettabytes in the near future

For example, large social networking sites produce data on the order of terabytes every day, an amount that is difficult to handle using traditional systems


Big Data’s Properties…

Velocity - represents not only the speed at which the data is incoming, but also the speed at which the data is outgoing

Traditional systems are not capable of performing analytics on data that is constantly in motion

Variability - represents the inconsistency of the data flow

The flow of data can be highly inconsistent, leading to periodic peaks and lows

Daily, seasonal, and event-triggered peak data loads can be challenging to manage, especially for unstructured data[2]

For example, a large natural disaster would spike page visits for cnn.com


Big Data’s Properties…

Complexity - represents the difficulty of linking, matching, cleansing, and transforming data from multiple sources (a small cleansing sketch follows below)

Value - systems must not only be designed to handle Big data efficiently and effectively, but also be able to filter the most important data from all of the data collected

This filtered data is what helps add value to a business
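As a hypothetical illustration of the complexity property, the following Python sketch links and cleanses records from two imaginary sources that format the same city names differently (the sources and the cleanse() helper are invented for this example).

source_a = [("Toronto", 20), ("New York", 22)]
source_b = [("toronto ", 4), ("NEW YORK", 18)]  # inconsistent casing and whitespace

def cleanse(city):
    # Normalize a city name so records from different sources can be matched.
    return " ".join(city.strip().lower().split()).title()

merged = {}
for city, temp in source_a + source_b:
    merged.setdefault(cleanse(city), []).append(temp)

print(merged)  # {'Toronto': [20, 4], 'New York': [22, 18]}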


Big Data in The Real World

Log Storage in IT Industries

IT companies store large amounts of log data so that rarely occurring problems can be diagnosed and solved

Big data analytics is applied to these logs to pinpoint points of failure

Traditional systems are unable to handle these logs because of their volume, their raw and semi-structured nature, and their high rate of change

Sensor Data

Massive amounts of sensor data are also a big challenge for Big data

Example: The Large Hadron Collider (LHC) is the world's largest and highest-energy particle accelerator. The data flow in its experiments consists of 25 to 200 petabytes of data that needs to be processed and stored


Big Data in The Real World…

Risk Analysis

Financial institutions must model their data to calculate risk and keep it within acceptable thresholds

A lot of potentially useful data is underutilized because of its volume; integrating it into the models would reveal risk patterns more accurately

Social Media

The largest use of Big data is for social media and customer sentiment

Monitoring what customers are saying about their products gives business organizations a form of customer feedback

That feedback can then be used to make decisions and add value to the business


Big Data Challenges and Issues

Privacy and Security

The most important issue with Big data, with conceptual, technical, and legal significance

A person's personal information, when combined with large external data sets, can lead to the inference of new private facts about that person

Big data used by law enforcement increases the chances that certain tagged people will suffer adverse consequences without the ability to fight back, or even the knowledge that they are being discriminated against


Big Data Challenges and Issues…

Data Access and Sharing of Information

If data is to be used to make accurate and timely decisions, it must be available in an accurate, complete, and timely manner

Storage and Processing Issues

Many companies are struggling to store the large amount of data they are producing

Outsourcing storage to the cloud may seem like an option, but long upload times and constant updates to the data preclude it

Processing a large amount of data also takes a lot of time


Tools and Techniques available

Hadoop - an open source project hosted by the Apache Software Foundation for managing Big data

Hadoop consists of two main components

The Hadoop Distributed File System (HDFS), a distributed file system that stores the data across multiple separate servers, each with its own processor(s) (a small loading sketch follows this list)

MapReduce, the framework that understands and assigns work to the nodes in a cluster[3]
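As a small illustration, the sketch below loads a local file into HDFS by shelling out to the standard hadoop fs commands. It assumes a working Hadoop installation with the hadoop CLI on the PATH; the file and directory names are hypothetical.

import subprocess

def hdfs(*args):
    # Run a `hadoop fs` sub-command, raising an error if it fails.
    subprocess.run(["hadoop", "fs", *args], check=True)

hdfs("-mkdir", "-p", "/data/temperatures")           # create a directory in HDFS
hdfs("-put", "readings1.txt", "/data/temperatures")  # copy a local file in; HDFS
                                                     # replicates its blocks
                                                     # across the cluster
hdfs("-ls", "/data/temperatures")                    # confirm the upload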


Tools and Techniques available…

Hadoop provides the following advantages[3]

Data read/write performance is increased by distributing the data across the cluster, allowing each processor to do its work in parallel

It's scalable: new nodes can be added as needed without making changes to the existing system

It's cost-effective: it brings parallel computing to commodity servers

It's flexible: it can absorb any type of data, structured or not, from any number of sources

It's fault tolerant: it handles failures intrinsically by always storing multiple copies of the data and automatically loading a copy when a fault is detected


Tools and Techniques available…

How do you use Hadoop?

The developer writes a program that conforms to the MapReduce programming model

The developer specifies the format of the data to be processed in their program

How does MapReduce work?[4]

Each Hadoop program performs two tasks:

Map - breaks all of the data down into key/value pairs

Reduce - takes the output of the map step as input and combines those key/value pairs into a smaller set of key/value pairs


Tools and Techniques available…

MapReduce example[4]: Assume you have five files, and each file contains two columns representing a city and the temperature recorded in that city on various measurement days. One file might contain:

Toronto, 20; New York, 22; Rome, 32; Toronto, 4; Rome, 33; New York, 18

We want to find the maximum temperature for each city across all of the data files

We create five map tasks, one per file; each mapper works through its file and returns the maximum temperature for each city. For the file above, the result is: (Toronto, 20) (New York, 22) (Rome, 33)

Let's assume the other four map tasks (working on the four files not shown here) produce the following intermediate results:

(Toronto, 18) (New York, 32) (Rome, 37)
(Toronto, 32) (New York, 33) (Rome, 38)
(Toronto, 22) (New York, 20) (Rome, 31)
(Toronto, 31) (New York, 19) (Rome, 30)

All five of these output streams are fed into the reduce tasks, which combine the input results and output a single value for each city, producing the final result set: (Toronto, 32) (New York, 33) (Rome, 38)
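The whole example can be reproduced in plain Python, with one map_task call standing in for each mapper and a reduce_task call standing in for the reducers; no Hadoop is required. Only the first file's contents come from the slide, so the other four lists below are stand-ins chosen to yield the intermediate results shown above.

files = [
    [("Toronto", 20), ("New York", 22), ("Rome", 32),
     ("Toronto", 4), ("Rome", 33), ("New York", 18)],
    [("Toronto", 18), ("New York", 32), ("Rome", 37)],
    [("Toronto", 32), ("New York", 33), ("Rome", 38)],
    [("Toronto", 22), ("New York", 20), ("Rome", 31)],
    [("Toronto", 31), ("New York", 19), ("Rome", 30)],
]

def map_task(records):
    # One mapper: emit the maximum temperature seen per city in one file.
    out = {}
    for city, temp in records:
        out[city] = max(out.get(city, temp), temp)
    return out

def reduce_task(partials):
    # Reduce: combine all mappers' outputs into a single maximum per city.
    final = {}
    for partial in partials:
        for city, temp in partial.items():
            final[city] = max(final.get(city, temp), temp)
    return final

intermediate = [map_task(f) for f in files]  # the five map tasks (run in parallel on Hadoop)
print(reduce_task(intermediate))             # {'Toronto': 32, 'New York': 33, 'Rome': 38}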


Questions?


References

Big data: Issues, challenges, tools and Good practices: http://ieeexplore.ieee.org.ezp.scranton.edu/xpls/icp.jsp?arnumber=6612229&tag=1#references

1. Why Every Database Must Be Broken Soon: https://blogs.vmware.com/vfabric/2013/03/why-every-database-must-be-broken-soon.html

2. Big Data: What it is and why it matters: http://www.sas.com/en_us/insights/big-data/what-is-big-data.html

3. What is Hadoop? http://www-01.ibm.com/software/data/infosphere/hadoop/

4. What is MapReduce? http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/