Avita Katal , Mohammad Wazid, RH Goudar
Dept. of CSE, Graphic Era University, Dehradun, India
Published in: Contemporary Computing (IC3), 2013 Sixth International Conference on
Big Data : Issues, Challenges, Tools and Good Practices
Ravi
Motivation
Data stores are growing by 50% each year, and that rate of increase is accelerating [8].
The type of data is also changing. Over 80% of it will be unstructured data, which does not work well with relational databases [8].
The main difficulty is that data volume is growing much faster than the computing resources available to process it.
Defining Big Data
Big data is defined as a large amount of data that requires new technologies and architectures so that value can be extracted from it through capture and analysis.
It is an emerging technology that can bring huge benefits to business organizations.
Properties of Big Data
Big data can be defined by the following properties:
Variety: The data being produced is not only traditional data but also semi-structured data from various sources.
Volume: Data volumes are expected to grow into the zettabyte range in the near future.
Velocity: The speed at which data arrives from various sources.
Properties of Big Data...
Variability: Accounts for inconsistencies in the data flow.
Complexity: It is difficult to link, match, cleanse, and transform data coming from various sources across systems.
Value: Queries can be run against the stored data to deduce important results.
Related Work
Collaborative research on methodologies for big data analysis and design [1]
Databases required for big data [2]
Architectural considerations for big data [3]
Concept of big data with market solutions [4]
Scientific Data Infrastructure (SDI) generic architectural model [5]
Related Work...
How big data analytics is different from traditional analytics [6]
Analysis of social media sites like Facebook, Flickr, and Google+ [7]
Importance of Big Data
Log Storage in IT Industries: IT industries store large amounts of data as logs to deal with problems that occur rarely.
Big data analytics is run on these logs to pinpoint the points of failure.
Traditional systems are not able to handle these logs.
Sensor Data: The massive amount of sensor data being generated is also a big challenge for big data.
Importance of Big Data...
Risk Analysis: It is important for financial institutions to model data in order to calculate risk. A lot of potentially useful data is underutilized because of its volume; integrating it would allow risk patterns to be determined more accurately.
Social Media: The largest use of big data is for social media and customer sentiment. Keeping an eye on what customers are saying is a form of feedback. That feedback can then be used to make decisions and add value to the business.
Big Data Challenges and Issues
Privacy and Security
This is the most important issue with big data, with conceptual, technical, and legal significance.
When a person's personal information is combined with large external data sets, new private facts about that person can be inferred.
The use of big data by law enforcement increases the chance that certain tagged people will suffer adverse consequences.
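The inference risk described above can be illustrated with a small linkage sketch. Everything here is hypothetical and made up for illustration: the two data sets, the field names, and the choice of quasi-identifiers (`zip`, `birth_year`, `gender`). It only shows how joining a "de-identified" data set with an external one can re-attach names to private facts.

```python
# Hypothetical "anonymized" records: names removed, quasi-identifiers kept.
health_records = [
    {"zip": "10001", "birth_year": 1980, "gender": "F", "diagnosis": "asthma"},
    {"zip": "10002", "birth_year": 1975, "gender": "M", "diagnosis": "diabetes"},
]

# Hypothetical external data set that still carries names.
public_roll = [
    {"name": "Alice Example", "zip": "10001", "birth_year": 1980, "gender": "F"},
    {"name": "Bob Example", "zip": "10003", "birth_year": 1975, "gender": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")


def link(private_rows, public_rows):
    """Join the two data sets on the quasi-identifiers and return
    (name, diagnosis) pairs wherever the match is unique."""
    index = {}
    for row in public_rows:
        key = tuple(row[field] for field in QUASI_IDENTIFIERS)
        index.setdefault(key, []).append(row["name"])

    inferred = []
    for row in private_rows:
        key = tuple(row[field] for field in QUASI_IDENTIFIERS)
        names = index.get(key, [])
        if len(names) == 1:  # a unique match re-identifies the record
            inferred.append((names[0], row["diagnosis"]))
    return inferred


inferred_facts = link(health_records, public_roll)
print(inferred_facts)
```

Neither data set alone ties a name to a diagnosis; the combination does.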
Big Data Challenges and Issues...
Data Access and Sharing of Information: If data is to be used to make accurate decisions in time, it must be available in an accurate, complete, and timely manner.
Storage and Processing Issues: Many companies are struggling to store the large amount of data they are producing.
Outsourcing storage to the cloud may seem like an option, but long upload times and constant updates to the data preclude it.
Processing a large amount of data also takes a lot of time.
Big Data Challenges and Issues...
Analytical Challenges: What if data volume gets so large that we don’t know how to deal with it?
Does all data need to be stored?
Does all data need to be analyzed?
Which data points are really important?
How can data be used to best advantage?
Skill Requirement: Being a new and emerging technology, big data needs to attract organizations and young people with diverse new skill sets.
Big Data Challenges and Issues...
Technical Challenges
Fault Tolerance
Scalability
Quality of Data
Heterogeneous Data
Tools and Techniques Available
Hadoop is an open source project hosted by the Apache Software Foundation for managing big data.
Hadoop consists of two main components:
The Hadoop Distributed File System (HDFS), a distributed file system that stores the data on multiple separate servers, each of which has its own processor(s)
MapReduce, the framework that understands and assigns work to the nodes in a cluster [9]
Advantages of Hadoop
Hadoop provides the following advantages [9]:
Data read/write performance is increased by distributing the data across the cluster, allowing each processor to work in parallel.
It is scalable: new nodes can be added as needed without making changes to the existing system.
It is cost effective because it brings parallel computing to commodity servers.
Advantages of Hadoop…
It is flexible: it can absorb any type of data, structured or not, from any number of sources.
It is fault tolerant: it handles failures intrinsically by always storing multiple copies of the data and automatically loading a copy when a fault is detected.
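The fault-tolerance idea can be sketched with a toy in-memory model. This is not HDFS code: the `ToyCluster` class, the node names, the placement rule, and the replication factor of 3 are all assumptions chosen for illustration. It only shows why keeping several copies lets a read succeed after some nodes fail.

```python
REPLICATION = 3  # assumed number of copies kept per block in this sketch


class ToyCluster:
    """Hypothetical in-memory stand-in for a cluster of storage nodes."""

    def __init__(self, node_names):
        self.storage = {name: {} for name in node_names}
        self.alive = set(node_names)

    def write(self, block_id, data):
        """Store the block on REPLICATION distinct nodes; return their names."""
        names = sorted(self.storage)
        # Deterministic, purely illustrative placement rule.
        start = sum(block_id.encode()) % len(names)
        targets = [names[(start + i) % len(names)] for i in range(REPLICATION)]
        for name in targets:
            self.storage[name][block_id] = data
        return targets

    def fail(self, name):
        """Mark a node as failed; its copies become unreachable."""
        self.alive.discard(name)

    def read(self, block_id):
        """Return the block from any surviving node that holds a copy."""
        for name in sorted(self.alive):
            if block_id in self.storage[name]:
                return self.storage[name][block_id], name
        raise IOError(f"all replicas of {block_id} are lost")


cluster = ToyCluster(["node-a", "node-b", "node-c", "node-d"])
replicas = cluster.write("block-0001", b"temperature readings")

# Two of the three nodes holding the block fail; the third still serves it.
cluster.fail(replicas[0])
cluster.fail(replicas[1])
data, served_by = cluster.read("block-0001")
print(data, "served by", served_by)
```

With three copies, the block stays readable until all three holding nodes are down at once.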
Hadoop
How do you use Hadoop?
The developer writes a program that conforms to the MapReduce programming model
The developer specifies the format of the data to be processed in their program
Hadoop
How does MapReduce work? [10]
Each Hadoop program performs two tasks:
Map - Breaks all of the data down into key/value pairs
Reduce - Takes the output from the map step as input and combines those key/value pairs into a smaller set of key/value pairs
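The two tasks can be sketched in plain Python. This is a single-process illustration of the programming model, not Hadoop's API: the function names `map_phase`, `shuffle`, and `reduce_phase`, and the sample log lines, are assumptions made for this sketch. The grouping step between the phases is shown explicitly here; in a real cluster the framework performs it.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped, reducer):
    """Combine each key's values into one result, giving a smaller set of pairs."""
    return {key: reducer(key, values) for key, values in grouped.items()}


# Hypothetical input: log lines of the form "<level> <message>".
log_lines = [
    "ERROR disk full",
    "INFO started",
    "ERROR timeout",
    "INFO stopped",
    "WARN slow response",
]


def level_mapper(line):
    """Emit one (level, 1) pair per log line."""
    level = line.split(" ", 1)[0]
    return [(level, 1)]


def count_reducer(key, values):
    """Sum the counts emitted for one level."""
    return sum(values)


counts = reduce_phase(shuffle(map_phase(log_lines, level_mapper)), count_reducer)
print(counts)  # {'ERROR': 2, 'INFO': 2, 'WARN': 1}
```

Five input records become five key/value pairs in the map step and three pairs after the reduce step.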
Map Reduce - Example
MapReduce example [10]: Assume you have five files, and each file contains two columns that represent a city and the corresponding temperature recorded in that city on various measurement days. One of the files contains:
Toronto, 20
New York, 22
Rome, 32
Toronto, 4
Rome, 33
New York, 18
We want to find the maximum temperature for each city across all of the data files.
We create five map tasks, where each mapper works on one of the five files, goes through its data, and returns the maximum temperature for each city. For the file above, this results in: (Toronto, 20) (New York, 22) (Rome, 33)
Map Reduce – Example…
Let’s assume the other four mapper tasks (working on the other four files not shown here) produced the following intermediate results:
(Toronto, 18) (New York, 32) (Rome, 37)
(Toronto, 32) (New York, 33) (Rome, 38)
(Toronto, 22) (New York, 20) (Rome, 31)
(Toronto, 31) (New York, 19) (Rome, 30)
All five of these output streams would be fed into the reduce tasks, which combine the input results and output a single value for each city, producing a final result set as follows: (Toronto, 32) (New York, 33) (Rome, 38)
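The example can be reproduced with a short Python sketch. The readings and intermediate results are the ones given in the example; the function names `map_max` and `reduce_max` are assumptions for this illustration, and everything runs in one process rather than on a Hadoop cluster.

```python
# Readings from the one file shown in the example.
first_file = [
    ("Toronto", 20), ("New York", 22), ("Rome", 32),
    ("Toronto", 4), ("Rome", 33), ("New York", 18),
]


def map_max(readings):
    """Mapper: return the maximum temperature per city within one file."""
    local_max = {}
    for city, temperature in readings:
        if city not in local_max or temperature > local_max[city]:
            local_max[city] = temperature
    return list(local_max.items())


# Intermediate results of the other four mappers, as stated in the example.
other_mapper_outputs = [
    [("Toronto", 18), ("New York", 32), ("Rome", 37)],
    [("Toronto", 32), ("New York", 33), ("Rome", 38)],
    [("Toronto", 22), ("New York", 20), ("Rome", 31)],
    [("Toronto", 31), ("New York", 19), ("Rome", 30)],
]


def reduce_max(mapper_outputs):
    """Reducer: combine all mapper outputs into one maximum per city."""
    overall = {}
    for output in mapper_outputs:
        for city, temperature in output:
            overall[city] = max(temperature, overall.get(city, temperature))
    return overall


first_output = map_max(first_file)
result = reduce_max([first_output] + other_mapper_outputs)
print(first_output)  # [('Toronto', 20), ('New York', 22), ('Rome', 33)]
print(result)        # {'Toronto': 32, 'New York': 33, 'Rome': 38}
```

Because taking a maximum is associative, each mapper can shrink its file to one value per city before the reduce step, so only fifteen pairs reach the reducer instead of every raw reading.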
Big Data – Good Practices
Creating dimensions of all the data being stored is good practice.
All the dimensions should have durable surrogate keys that can’t be changed and are unique.
Expect to integrate structured and unstructured data.
Generality of the technology is needed; building it around key/value pairs works.
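The surrogate-key practice can be sketched as follows. The `Dimension` class, its method name, and the customer identifiers are hypothetical and chosen only for illustration. The point is that each natural key is assigned a unique integer once, and that integer never changes even when the descriptive attributes do.

```python
import itertools


class Dimension:
    """Hypothetical dimension table that hands out durable surrogate keys."""

    def __init__(self):
        self._next_key = itertools.count(1)
        self._by_natural_key = {}
        self.rows = {}  # surrogate key -> attributes stored as key/value pairs

    def key_for(self, natural_key, **attributes):
        """Return the surrogate key for natural_key, creating it on first use.

        Attributes may be updated later; the surrogate key itself never changes.
        """
        if natural_key not in self._by_natural_key:
            surrogate = next(self._next_key)
            self._by_natural_key[natural_key] = surrogate
            self.rows[surrogate] = {"natural_key": natural_key}
        surrogate = self._by_natural_key[natural_key]
        self.rows[surrogate].update(attributes)
        return surrogate


customers = Dimension()
first = customers.key_for("cust-001", city="Toronto")
second = customers.key_for("cust-002", city="Rome")

# The customer moves: the attribute changes, the surrogate key does not.
first_again = customers.key_for("cust-001", city="New York")
print(first, second, first_again)
```

Facts that reference a customer store only the surrogate key, so they stay valid when source-system identifiers or attributes are revised.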
Big Data – Good Practices…
As value of big data becomes more apparent, privacy concerns grow.
Data quality needs to be better.
Limit on scalability of records.
Business and IT leaders should work together to create more value from data.
Investment in data quality and metadata reduces processing time.
Conclusions
The concept of big data, its importance, and existing projects were introduced.
Many challenges and issues exist that need to be brought up and addressed.
Big data will help businesses grow.
Hadoop is a tool for managing big data.
References
[1] Stephen Kaisler, Frank Armour, J. Alberto Espinosa, William Money, “Big Data: Issues and Challenges Moving Forward”, IEEE, 46th Hawaii International Conference on System Sciences, 2013.
[2] Sam Madden, “From Databases to Big Data”, IEEE, Internet Computing, May-June 2012.
[3] Kapil Bakshi, “Considerations for Big Data: Architecture and Approach”, IEEE, Aerospace Conference, 2012.
[4] Sachchidanand Singh, Nirmala Singh, “Big Data Analytics”, IEEE, International Conference on Communication, Information & Computing Technology (ICCICT), Oct. 19-20, 2012.
[5] Yuri Demchenko, Zhiming Zhao, Paola Grosso, Adianto Wibisono, Cees de Laat, “Addressing Big Data Challenges for Scientific Data Infrastructure”, IEEE, 4th International Conference on Cloud Computing Technology and Science, 2012.
References...
[6] Martin Courtney, “The Larging-up of Big Data”, IEEE, Engineering & Technology, September 2012.
[7] Matthew Smith, Christian Szongott, Benjamin Henne, Gabriele von Voigt, “Big Data Privacy Issues in Public Social Media”, IEEE, 6th International Conference on Digital Ecosystems Technologies (DEST), 18-20 June 2012.
[8] “Why Every Database Must Be Broken Soon”, https://blogs.vmware.com/vfabric/2013/03/why-every-database-must-be-broken-soon.html
[9] “What is Hadoop?”, http://www-01.ibm.com/software/data/infosphere/hadoop/
[10] “What is MapReduce?”, http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce
Thank You.