Big Data: Knowledge Discovery in Big Data and Cloud Computing Environments - Nelson...
Intensive Data Processing (Big Data)
Nelson F. F. Ebecken
NTT/COPPE/UFRJ
Your Big Data Is Worthless if You Don't Bring It Into the Real World
http://www.wired.com/2014/04/your-big-data-is-worthless-if-you-dont-bring-it-into-the-real-world/
Big Data
Big Data refers to data that is too big to fit on a single server, too unstructured to fit into a
row-and-column database, or too continuously flowing to fit into a static data
warehouse (Thomas H. Davenport)
Big Data and traditional analytics

                    Big Data                          Traditional analytics
Type of data        Unstructured formats              Formatted in rows and columns
Volume of data      100 terabytes to petabytes        Tens of terabytes or less
Flow of data        Constant flow of data             Static pool of data
Analysis methods    Machine learning                  Hypothesis-based
Primary purpose     Data-based products and services  Internal decision support
A menu of big data possibilities
Style of data Source of data Industry affected Function affected
Large volume Online Financial services Marketing
Unstructured Video Health care Supply chain
Continuous flow Sensor Manufacturing Human resources
Multiple formats Genomic Travel/transport Finance
Terminology for using and analyzing data

Term                                  Time frame    Specific meaning
Decision support                      1970-1985     Use of data analysis to support decision making
Executive support                     1980-1990     Focus on data analysis for decisions by senior executives
Online analytical processing (OLAP)   1990-2000     Software for analysing multidimensional data tables
Business intelligence                 1989-2005     Tools to support data-driven decisions, with emphasis on reporting
Analytics                             2005-2010     Focus on statistical and mathematical analysis for decisions
Big Data                              2010-present  Focus on very large, unstructured, fast-moving data
How important is Big Data to You and Your Organization?
· Has your management team considered some of the new types of data that may affect your business and industry, both now and in the next several years?
· Have you discussed the term big data and whether it's a good description of what your organization is doing with data and analytics?
· Are you beginning to change your decision-making processes toward a more continuous approach driven by the continuous availability of data?
· Has your organization adopted faster and more agile approaches to analyzing and acting on important data and analysis?
· Are you beginning to focus more on external information about business and market environments?
· Have you made a big bet on big data?
Big data is going to reshape a lot of different businesses and industries
· Every industry that moves things
· Every industry that sells to consumers
· Every industry that employs machinery
· Every industry that sells or uses content
· Every industry that provides service
· Every industry that has physical facilities
· Every industry that involves money
Responsibility locus for big data projects

                             Discovery                                   Production
Cost savings                 IT innovation group                         IT architecture and operations
Faster decisions             Business unit or function analytics group   Business unit or function executive
Better decisions             Business unit or function analytics group   Business unit or function executive
Product/service innovation   R&D or product development group            Product development or product management
Overview of technologies for big data

Technology                          Definition
Hadoop                              Open source software for processing big data across multiple parallel servers
MapReduce                           The architectural framework on which Hadoop is based
Scripting languages                 Programming languages that work well with big data (Python, Pig, Hive...)
Machine learning                    Algorithms for rapidly finding the model that best fits a data set
Visual analytics                    Display of analytical results in visual or graphic formats
Natural language processing (NLP)   Algorithms for analyzing text, frequencies, meanings,...
In-memory analytics                 Processing big data in computer memory for greater speed
MapReduce
MapReduce is a programming model for expressing distributed computations on massive amounts of data and an execution framework for large-scale data processing on clusters of commodity servers.
· It was originally developed by Google
· In 2003, Google described its distributed file system, called GFS
· In 2004, Google published the paper that introduced MapReduce
MapReduce has since enjoyed widespread adoption via an open-source implementation called Hadoop, whose development was led by Yahoo (an Apache project).
Programming Model
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
· map(in_key, in_value) -> list(out_key, intermediate_value)
  Processes an input key/value pair and produces a set of intermediate pairs
· reduce(out_key, list(intermediate_value)) -> list(out_value)
  Produces a set of merged output values (usually just one)
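The two signatures above can be exercised with a toy, single-machine simulation of the programming model, here counting words. This is a sketch of the semantics only; in a real deployment (e.g. Hadoop) the map and reduce tasks run in parallel on a cluster, and the grouping step is the framework's shuffle.

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    # Processes one input pair: emit an intermediate (word, 1) pair
    # for every word in the line of text.
    for word in in_value.split():
        yield (word.lower(), 1)

def reduce_fn(out_key, intermediate_values):
    # Merges all intermediate values for one key into a single count.
    yield sum(intermediate_values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input key/value pair.
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, inter_value in map_fn(in_key, in_value):
            groups[out_key].append(inter_value)  # "shuffle": group by key
    # Reduce phase: one reduce_fn call per distinct intermediate key.
    return {k: list(reduce_fn(k, vs)) for k, vs in groups.items()}

lines = [(0, "big data is big"), (1, "data flows")]
print(run_mapreduce(lines, map_fn, reduce_fn))
# → {'big': [2], 'data': [2], 'is': [1], 'flows': [1]}
```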
Map-Reduce
· Parallel programming for large masses of data
[Diagram: input → Map/Combine/Partition → Shuffle → Sort/Reduce → output; each mapper emits key/value pairs, each reducer produces one output]
Why learn models in MapReduce?
· High data throughput: stream about 100 TB per hour using 500 mappers
· Framework provides fault tolerance: monitors mappers and reducers, and restarts tasks on other machines should one of the machines fail
· Excels in counting patterns over data records
· Built on relatively cheap, commodity hardware; no special-purpose computing hardware
· Large volumes of data are being increasingly stored on Grid clusters running MapReduce, especially in the internet domain
Why learn models in MapReduce?
· Learning can become limited by computation time and not data volume, given large enough data and number of machines
· Reduces the need to down-sample data
· More accurate parameter estimates compared to learning on a single machine for the same amount of time
Learning models in MapReduce
· A primer for learning models in MapReduce (MR)
  - Illustrate techniques for distributing the learning algorithm in a MapReduce framework
  - Focus on the mapper and reducer computations
· Data parallel algorithms are most appropriate for MapReduce implementations
· Not necessarily the most optimal implementation for a specific algorithm; other specialized non-MapReduce implementations exist for some algorithms, which may be better
· MR may not be the appropriate framework for exact solutions of non-data-parallel/sequential algorithms; approximate solutions using MapReduce may be good enough
Types of learning in MapReduce
· Three common types of learning models using the MapReduce framework:
1. Parallel training of multiple models – train either in mappers or reducers
2. Ensemble training methods – train multiple models and combine them
3. Distributed learning algorithms – learn using both mappers and reducers
· Use the Grid as a large cluster of independent machines (with fault tolerance)
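Type 2, ensemble training, can be sketched in miniature: each mapper fits a model on its own data shard, and a reducer combines the shard models. The shard learner (a one-parameter least-squares fit) and the combination rule (averaging) are illustrative assumptions, not Hadoop's or Mahout's API; any learner that fits in one mapper's memory would do.

```python
def train_on_shard(shard):
    # Toy learner: least-squares slope w of y = w*x, fitted on one shard
    # of (x, y) pairs.
    num = sum(x * y for x, y in shard)
    den = sum(x * x for x, y in shard)
    return num / den

def mapper(shard_id, shard):
    # Each mapper trains one model on its local data shard and emits it
    # under a common key so all models meet at one reducer.
    yield ("model", train_on_shard(shard))

def reducer(key, weights):
    # The reducer combines the per-shard models, here by simple averaging.
    yield sum(weights) / len(weights)

shards = [[(1, 2.0), (2, 4.0)], [(3, 6.0), (4, 8.0)]]
intermediate = [w for i, s in enumerate(shards) for _, w in mapper(i, s)]
(combined,) = reducer("model", intermediate)
print(combined)  # each shard fits w = 2.0, so the averaged ensemble is 2.0
```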
Parallel training of multiple models
·Train multiple models simultaneously using a learning algorithm that can be learnt in memory ·Useful when individual models are trained using a subset, filtered or modification of raw data ·Can train 1000 `s of models simultaneously ·Essentially, treat Grid as a large cluster of machines – Leverage fault tolerance of Hadoop ·Train 1 model in each reducer – Map:
·Input: All data ·Filters subset of data relevant for each model training ·Output: <model_index, subset of data for training this model>
– Reduce ·Train model on data corresponding to that model_index
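The Map/Reduce split described above can be simulated on one machine: the mapper routes each record to a model_index, and one model is trained per index in the reduce step. The region-based routing rule and the mean-of-y "learner" are hypothetical stand-ins for whatever filter and in-memory algorithm a real job would use.

```python
from collections import defaultdict

def map_filter(record):
    # Map: emit <model_index, training record> for the model this record
    # belongs to. Here, hypothetically, model k trains on region k's data.
    region, x, y = record
    yield (region, (x, y))

def reduce_train(model_index, records):
    # Reduce: train one in-memory model on the subset for this model_index.
    # Toy learner: the mean of y (any learner fitting in one reducer works).
    ys = [y for x, y in records]
    yield sum(ys) / len(ys)

records = [("north", 1, 10.0), ("south", 2, 4.0), ("north", 3, 14.0)]
buckets = defaultdict(list)
for r in records:
    for k, v in map_filter(r):
        buckets[k].append(v)          # shuffle: one bucket per model_index
models = {k: next(reduce_train(k, vs)) for k, vs in buckets.items()}
print(models)  # → {'north': 12.0, 'south': 4.0}
```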
Apache Mahout
Scalable to large data sets: the core algorithms for clustering, classification and collaborative filtering are implemented on top of scalable, distributed systems. However, contributions that run on a single machine are welcome as well.
Scalable to support your business case: Mahout is distributed under a commercially friendly Apache Software license.
Scalable community: the goal of Mahout is to build a vibrant, responsive, diverse community to facilitate discussions not only on the project itself but also on potential use cases. Come to the mailing lists to find out more.
Currently Mahout supports mainly three use cases:
· Recommendation mining takes users' behavior and from that tries to find items users might like.
· Clustering takes e.g. text documents and groups them into groups of topically related documents.
· Classification learns from existing categorized documents what documents of a specific category look like, and is able to assign unlabelled documents to the (hopefully) correct category.

25 April 2014 - Goodbye MapReduce
The Mahout community decided to move its codebase onto modern data processing systems that offer a richer programming model and more efficient execution than Hadoop MapReduce. Mahout will therefore reject new MapReduce algorithm implementations from now on. We will however keep our widely used MapReduce algorithms in the codebase and maintain them.
We are building our future implementations on top of a DSL for linear algebraic operations which has been developed over the last months. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark. Furthermore, there is an experimental contribution underway which aims to integrate the H2O platform into Mahout.
Apache Spark™ is a fast and general engine for large-scale data processing. H2O is the open source in-memory solution from 0xdata for predictive analytics on big data.
Matrix Methods with Hadoop Slides bit.ly/10SIe1A
Code github.com/dgleich/matrix-hadoop-tutorial
DAVID F. GLEICH
ASSISTANT PROFESSOR COMPUTER SCIENCE PURDUE UNIVERSITY
ACM KDD 2014, 24-27/08
New environments: Microsoft Azure ML Studio, Google Prediction API,…
2 Research Sessions + Industry & Government:
· Statistical Techniques for Big Data
· Scaling-up Methods for Big Data
Topic Modeling
Big data & machine learning
This is a huge field, growing very fast
Many algorithms and techniques: can be seen as a giant toolbox with wide-ranging applications
Ranging from the very simple to the extremely sophisticated
Difficult to see the big picture
Huge range of applications
Math skills are crucial