Big Data: Knowledge Discovery in Big Data and Cloud Computing Environments - Nelson...
Intensive Data Processing (Big Data)
Nelson F. F. Ebecken
NTT/COPPE/UFRJ
Your Big Data Is Worthless if You Don't Bring It Into the Real World
http://www.wired.com/2014/04/your-big-data-is-worthless-if-you-dont-bring-it-into-the-real-world/
Big Data
Big Data refers to data that is too big to fit on a single server, too unstructured to fit into a
row-and-column database, or too continuously flowing to fit into a static data
warehouse (Thomas H. Davenport)
Big Data and traditional analytics

                    Big Data                          Traditional analytics
Type of data        Unstructured formats              Formatted in rows and columns
Volume of data      100 terabytes to petabytes        Tens of terabytes or less
Flow of data        Constant flow of data             Static pool of data
Analysis methods    Machine learning                  Hypothesis-based
Primary purpose     Data-based products and services  Internal decision support
A menu of big data possibilities
Style of data Source of data Industry affected Function affected
Large volume Online Financial services Marketing
Unstructured Video Health care Supply chain
Continuous flow Sensor Manufacturing Human resources
Multiple formats Genomic Travel/transport Finance
Terminology for using and analyzing data

Term                                  Time frame    Specific meaning
Decision support                      1970-1985     Use of data analysis to support decision making
Executive support                     1980-1990     Focus on data analysis for decisions by senior executives
Online analytical processing (OLAP)   1990-2000     Software for analysing multidimensional data tables
Business intelligence                 1989-2005     Tools to support data-driven decisions, with emphasis on reporting
Analytics                             2005-2010     Focus on statistical and mathematical analysis for decisions
Big Data                              2010-present  Focus on very large, unstructured, fast-moving data
How important is Big Data to You and Your Organization?
· Has your management team considered some of the new types of data that may affect your business and industry, both now and in the next several years?
· Have you discussed the term big data and whether it's a good description of what your organization is doing with data and analytics?
· Are you beginning to change your decision-making processes toward a more continuous approach driven by the continuous availability of data?
· Has your organization adopted faster and more agile approaches to analyzing and acting on important data and analysis?
· Are you beginning to focus more on external information about business and market environments?
· Have you made a big bet on big data?
Big data is going to reshape a lot of different businesses and industries
· Every industry that moves things
· Every industry that sells to consumers
· Every industry that employs machinery
· Every industry that sells or uses content
· Every industry that provides service
· Every industry that has physical facilities
· Every industry that involves money
Responsibility locus for big data projects

                             Discovery                                   Production
Cost savings                 IT innovation group                         IT architecture and operations
Faster decisions             Business unit or function analytics group   Business unit or function executive
Better decisions             Business unit or function analytics group   Business unit or function executive
Product/service innovation   R&D or product development group            Product development or product management
Overview of technologies for big data

Technology                          Definition
Hadoop                              Open source software for processing big data across multiple parallel servers
MapReduce                           The architectural framework on which Hadoop is based
Scripting languages                 Programming languages that work well with big data (Python, Pig, Hive...)
Machine learning                    Algorithms for rapidly finding the model that best fits a data set
Visual analytics                    Display of analytical results in visual or graphic formats
Natural language processing (NLP)   Algorithms for analyzing text, frequencies, meanings,...
In-memory analytics                 Processing big data in computer memory for greater speed
MapReduce
MapReduce is a programming model for expressing distributed computations on massive amounts of data and an execution framework for large-scale data processing on clusters of commodity servers.
· It was originally developed by Google
· In 2003, Google described its distributed file system, called GFS
· In 2004, Google published the paper that introduced MapReduce
MapReduce has since enjoyed widespread adoption via an open-source implementation called Hadoop, whose development was led by Yahoo (an Apache project).
Programming Model
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
· map(in_key, in_value) -> list(out_key, intermediate_value)
  Processes an input key/value pair and produces a set of intermediate pairs
· reduce(out_key, list(intermediate_value)) -> list(out_value)
  Produces a set of merged output values (usually just one)
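The two signatures above can be exercised with a toy, single-machine simulation of the programming model, here counting words. This is a sketch of the semantics only; in a real deployment (e.g. Hadoop) the map and reduce tasks run in parallel on a cluster, and the grouping step is the framework's shuffle.

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    # Processes one input pair: emit an intermediate (word, 1) pair
    # for every word in the line of text.
    for word in in_value.split():
        yield (word.lower(), 1)

def reduce_fn(out_key, intermediate_values):
    # Merges all intermediate values for one key into a single count.
    yield sum(intermediate_values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input key/value pair.
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, inter_value in map_fn(in_key, in_value):
            groups[out_key].append(inter_value)  # "shuffle": group by key
    # Reduce phase: one reduce_fn call per distinct intermediate key.
    return {k: list(reduce_fn(k, vs)) for k, vs in groups.items()}

lines = [(0, "big data is big"), (1, "data flows")]
print(run_mapreduce(lines, map_fn, reduce_fn))
# → {'big': [2], 'data': [2], 'is': [1], 'flows': [1]}
```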
Map-Reduce
· Parallel programming for large masses of data
[Diagram: input → Map/Combine/Partition → Shuffle → Sort/Reduce → output; each mapper emits key/value pairs, each reducer produces one output]
Why learn models in MapReduce?
· High data throughput: stream about 100 TB per hour using 500 mappers
· Framework provides fault tolerance: monitors mappers and reducers, and restarts tasks on other machines should one of the machines fail
· Excels in counting patterns over data records
· Built on relatively cheap, commodity hardware; no special-purpose computing hardware
· Large volumes of data are being increasingly stored on Grid clusters running MapReduce, especially in the internet domain
Why learn models in MapReduce?
· Learning can become limited by computation time and not data volume, given large enough data and number of machines
· Reduces the need to down-sample data
· More accurate parameter estimates compared to learning on a single machine for the same amount of time
Learning models in MapReduce
· A primer for learning models in MapReduce (MR)
  - Illustrate techniques for distributing the learning algorithm in a MapReduce framework
  - Focus on the mapper and reducer computations
· Data parallel algorithms are most appropriate for MapReduce implementations
· Not necessarily the most optimal implementation for a specific algorithm; other specialized non-MapReduce implementations exist for some algorithms, which may be better
· MR may not be the appropriate framework for exact solutions of non-data-parallel/sequential algorithms; approximate solutions using MapReduce may be good enough
Types of learning in MapReduce
· Three common types of learning models using the MapReduce framework:
1. Parallel training of multiple models – train either in mappers or reducers
2. Ensemble training methods – train multiple models and combine them
3. Distributed learning algorithms – learn using both mappers and reducers
· Use the Grid as a large cluster of independent machines (with fault tolerance)
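Type 2, ensemble training, can be sketched in miniature: each mapper fits a model on its own data shard, and a reducer combines the shard models. The shard learner (a one-parameter least-squares fit) and the combination rule (averaging) are illustrative assumptions, not Hadoop's or Mahout's API; any learner that fits in one mapper's memory would do.

```python
def train_on_shard(shard):
    # Toy learner: least-squares slope w of y = w*x, fitted on one shard
    # of (x, y) pairs.
    num = sum(x * y for x, y in shard)
    den = sum(x * x for x, y in shard)
    return num / den

def mapper(shard_id, shard):
    # Each mapper trains one model on its local data shard and emits it
    # under a common key so all models meet at one reducer.
    yield ("model", train_on_shard(shard))

def reducer(key, weights):
    # The reducer combines the per-shard models, here by simple averaging.
    yield sum(weights) / len(weights)

shards = [[(1, 2.0), (2, 4.0)], [(3, 6.0), (4, 8.0)]]
intermediate = [w for i, s in enumerate(shards) for _, w in mapper(i, s)]
(combined,) = reducer("model", intermediate)
print(combined)  # each shard fits w = 2.0, so the averaged ensemble is 2.0
```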
Parallel training of multiple models
·Train multiple models simultaneously using a learning algorithm that can be learnt in memory ·Useful when individual models are trained using a subset, filtered or modification of raw data ·Can train 1000 `s of models simultaneously ·Essentially, treat Grid as a large cluster of machines – Leverage fault tolerance of Hadoop ·Train 1 model in each reducer – Map:
·Input: All data ·Filters subset of data relevant for each model training ·Output: <model_index, subset of data for training this model>
– Reduce ·Train model on data corresponding to that model_index
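The Map/Reduce split described above can be simulated on one machine: the mapper routes each record to a model_index, and one model is trained per index in the reduce step. The region-based routing rule and the mean-of-y "learner" are hypothetical stand-ins for whatever filter and in-memory algorithm a real job would use.

```python
from collections import defaultdict

def map_filter(record):
    # Map: emit <model_index, training record> for the model this record
    # belongs to. Here, hypothetically, model k trains on region k's data.
    region, x, y = record
    yield (region, (x, y))

def reduce_train(model_index, records):
    # Reduce: train one in-memory model on the subset for this model_index.
    # Toy learner: the mean of y (any learner fitting in one reducer works).
    ys = [y for x, y in records]
    yield sum(ys) / len(ys)

records = [("north", 1, 10.0), ("south", 2, 4.0), ("north", 3, 14.0)]
buckets = defaultdict(list)
for r in records:
    for k, v in map_filter(r):
        buckets[k].append(v)          # shuffle: one bucket per model_index
models = {k: next(reduce_train(k, vs)) for k, vs in buckets.items()}
print(models)  # → {'north': 12.0, 'south': 4.0}
```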
Apache Mahout
Scalable to large data sets: the core algorithms for clustering, classification and collaborative filtering are implemented on top of scalable, distributed systems. However, contributions that run on a single machine are welcome as well.
Scalable to support your business case: Mahout is distributed under a commercially friendly Apache Software license.
Scalable community: the goal of Mahout is to build a vibrant, responsive, diverse community to facilitate discussions not only on the project itself but also on potential use cases. Come to the mailing lists to find out more.
Currently Mahout supports mainly three use cases:
· Recommendation mining takes users' behavior and from that tries to find items users might like.
· Clustering takes e.g. text documents and groups them into groups of topically related documents.
· Classification learns from existing categorized documents what documents of a specific category look like, and is able to assign unlabelled documents to the (hopefully) correct category.

25 April 2014 - Goodbye MapReduce
The Mahout community decided to move its codebase onto modern data processing systems that offer a richer programming model and more efficient execution than Hadoop MapReduce. Mahout will therefore reject new MapReduce algorithm implementations from now on. We will however keep our widely used MapReduce algorithms in the codebase and maintain them.
We are building our future implementations on top of a DSL for linear algebraic operations which has been developed over the last months. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark. Furthermore, there is an experimental contribution underway which aims to integrate the H2O platform into Mahout.
Apache Spark™ is a fast and general engine for large-scale data processing. H2O is the open source in-memory solution from 0xdata for predictive analytics on big data.
Matrix Methods with Hadoop Slides bit.ly/10SIe1A
Code github.com/dgleich/matrix-hadoop-tutorial
DAVID F. GLEICH
ASSISTANT PROFESSOR COMPUTER SCIENCE PURDUE UNIVERSITY
ACM KDD 2014, 24-27/08
New environments: Microsoft Azure ML Studio, Google Prediction API,…
2 Research Sessions + Industry & Government:
· Statistical Techniques for Big Data
· Scaling-up Methods for Big Data
Topic Modeling
Big data & machine learning
This is a huge field, growing very fast
Many algorithms and techniques: can be seen as a giant toolbox with wide-ranging applications
Ranging from the very simple to the extremely sophisticated
Difficult to see the big picture
Huge range of applications
Math skills are crucial