Introduction to Hadoop


Page 1: Introduction to Hadoop

Introduction to Hadoop

The data bazooka

Iván de Prado Alonso // @ivanprado // @datasalt

Page 2: Introduction to Hadoop

Datasalt

Focus on Big Data
– Contributions to Open Source
– Consulting & development
– Training

Page 3: Introduction to Hadoop

BIG "MAC" DATA

Page 4: Introduction to Hadoop

Anatomy of a Big Data project

Serving
Processing
Acquisition

Page 5: Introduction to Hadoop

Types of Big Data systems

● Offline – latency is not an issue
● Online – data freshness matters
● Mixed – the most common case

Offline: MapReduce (Hadoop), distributed RDBMSs
Online: NoSQL databases, search engines

Page 6: Introduction to Hadoop

“Swiss army knife of the 21st century”

Media Guardian Innovation Awards

http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop

Page 7: Introduction to Hadoop

History

● 2004-2006
– Google publishes the GFS and MapReduce papers
– Doug Cutting implements an open-source version in Nutch
● 2006-2008
– Hadoop is split out of Nutch
– Web scale is reached in 2008
● 2008-today
– Hadoop becomes popular and starts to be exploited commercially

Source: Hadoop: a brief history, Doug Cutting

Page 8: Introduction to Hadoop

Hadoop

"The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model"

From the Hadoop home page

Page 9: Introduction to Hadoop

Distributed File System

● Distributed file system (HDFS)
– Large blocks: 64 MB
● Stored on the underlying OS file system
– Fault tolerant (replication)
– Common formats:
● Plain-text files (CSV)
● SequenceFiles (see the sketch below)
– Sequences of [key, value] pairs
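As an illustration (not from the slides), here is a minimal sketch of writing and reading a SequenceFile of [key, value] pairs through the HDFS API. The path and the pairs are invented, and the classic (now deprecated) createWriter/Reader constructors are used:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/demo.seq"); // hypothetical path

    // Write a few [key, value] pairs
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
    try {
      writer.append(new Text("esto"), new IntWritable(2));
      writer.append(new Text("es"), new IntWritable(1));
    } finally {
      writer.close();
    }

    // Read them back in insertion order
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      Text key = new Text();
      IntWritable value = new IntWritable();
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
      }
    } finally {
      reader.close();
    }
  }
}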

Page 10: Introduction to Hadoop

MapReduce

● Two functions (Map and Reduce)
– Map(k, v) : [z, w]*
– Reduce(k, v*) : [z, w]*
● Example: word count
– Map([document, null]) -> [word, 1]*
– Reduce(word, 1*) -> [word, total]
● MapReduce and SQL
– SELECT word, count(*) GROUP BY word
● Distributed execution on a cluster with horizontal scalability

Page 11: Introduction to Hadoop

The classic Word Count

Input:
Esto es una linea
Esto también

Map:
map("Esto es una linea") = [esto, 1], [es, 1], [una, 1], [linea, 1]
map("Esto también") = [esto, 1], [también, 1]

Reduce:
reduce(es, {1}) = es, 1
reduce(esto, {1, 1}) = esto, 2
reduce(linea, {1}) = linea, 1
reduce(también, {1}) = también, 1
reduce(una, {1}) = una, 1

Result:
es, 1
esto, 2
linea, 1
también, 1
una, 1

Page 12: Introduction to Hadoop

Word Count in Hadoop

public class WordCountHadoop extends Configured implements Tool {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while(itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for(IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    if(args.length != 2) {
      System.err.println("Usage: wordcount-hadoop <in> <out>");
      System.exit(2);
    }

    Path output = new Path(args[1]);
    HadoopUtils.deleteIfExists(FileSystem.get(output.toUri(), getConf()), output);

    Job job = new Job(getConf(), "word count hadoop");
    job.setJarByClass(WordCountHadoop.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);

    return 0;
  }

  public static void main(String[] args) throws Exception {
    ToolRunner.run(new WordCountHadoop(), args);
  }
}

Let's go through it piece by piece!

Page 13: Introduction to Hadoop

Word Count in Hadoop - Mapper

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while(itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

Page 14: Introduction to Hadoop

Word Count in Hadoop - Reducer

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    int sum = 0;
    for(IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}

Page 15: Introduction to Hadoop

Word Count in Hadoop – Configuration and execution

if(args.length != 2) {
  System.err.println("Usage: wordcount-hadoop <in> <out>");
  System.exit(2);
}

Path output = new Path(args[1]);
HadoopUtils.deleteIfExists(FileSystem.get(output.toUri(), getConf()), output);

Job job = new Job(getConf(), "word count hadoop");
job.setJarByClass(WordCountHadoop.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);

Page 16: Introduction to Hadoop

Execution of a MapReduce job

[Diagram: the blocks of the input file feed the Mappers, the intermediate data they produce is shuffled to the Reducers, and the Reducers write the result; mappers and reducers run distributed across the cluster nodes (Node 1, Node 2).]

Page 17: Introduction to Hadoop

Serialization

● Writables (a sketch follows below)
• Hadoop's native serialization
• Very low level
• Basic types: IntWritable, Text, etc.
● Others
• Thrift, Avro, Protostuff
• Backwards compatibility
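To give an idea of how low level Writables are, here is a minimal sketch (not from the slides) of a hypothetical custom Writable with two fields; every field has to be serialized and deserialized by hand, in the same order:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical record: a (url, count) pair serialized by hand.
public class UrlCountWritable implements Writable {

  private String url;
  private int count;

  public UrlCountWritable() {} // Hadoop needs a no-arg constructor

  public UrlCountWritable(String url, int count) {
    this.url = url;
    this.count = count;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(url);   // field order matters: write...
    out.writeInt(count);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    url = in.readUTF();  // ...and read back in exactly the same order
    count = in.readInt();
  }

  @Override
  public String toString() {
    return url + "\t" + count;
  }
}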

Page 18: Introduction to Hadoop

Hadoop's learning curve

is steep

Page 19: Introduction to Hadoop

Tuple MapReduce

● A simpler MapReduce
– Tuples instead of key/value pairs
– At the job level you define
● the fields to group by
● the fields to sort by
– Tuple MapReduce-join

Page 20: Introduction to Hadoop

Pangool

● An implementation of Tuple MapReduce
– Developed by Datasalt
– Open source
– Efficiency comparable to Hadoop
● Goal: to replace the Hadoop API
● If you want to learn Hadoop, start with Pangool

Page 21: Introduction to Hadoop

Pangool's efficiency

● Comparable to Hadoop

See http://pangool.net/benchmark.html

Page 22: Introduction to Hadoop

Pangool – URL resolution

● A join example
– Very hard in Hadoop, easy in Pangool
● The problem:
– There are many URL shorteners and redirections
– For data analysis it is often useful to replace each URL with its canonical URL
– Suppose we have both datasets
● a map with URL → canonical URL entries
● a dataset with URLs (to be resolved) plus other fields
– The following Pangool job solves the problem in a scalable way

Page 23: Introduction to Hadoop

URL Resolution – Defining the Schemas

static Schema getURLRegisterSchema() {
  List<Field> urlRegisterFields = new ArrayList<Field>();
  urlRegisterFields.add(Field.create("url", Type.STRING));
  urlRegisterFields.add(Field.create("timestamp", Type.LONG));
  urlRegisterFields.add(Field.create("ip", Type.STRING));
  return new Schema("urlRegister", urlRegisterFields);
}

static Schema getURLMapSchema() {
  List<Field> urlMapFields = new ArrayList<Field>();
  urlMapFields.add(Field.create("url", Type.STRING));
  urlMapFields.add(Field.create("canonicalUrl", Type.STRING));
  return new Schema("urlMap", urlMapFields);
}

Page 24: Introduction to Hadoop

URL Resolution – Loading the file to resolve

public static class UrlProcessor extends TupleMapper<LongWritable, Text> {

  private Tuple tuple = new Tuple(getURLRegisterSchema());

  @Override
  public void map(LongWritable key, Text value, TupleMRContext context, Collector collector)
      throws IOException, InterruptedException {

    String[] fields = value.toString().split("\t");
    tuple.set("url", fields[0]);
    tuple.set("timestamp", Long.parseLong(fields[1]));
    tuple.set("ip", fields[2]);
    collector.write(tuple);
  }
}

Page 25: Introduction to Hadoop

URL Resolution – Loading the URL map

public static class UrlMapProcessor extends TupleMapper<LongWritable, Text> {

  private Tuple tuple = new Tuple(getURLMapSchema());

  @Override
  public void map(LongWritable key, Text value, TupleMRContext context, Collector collector)
      throws IOException, InterruptedException {

    String[] fields = value.toString().split("\t");
    tuple.set("url", fields[0]);
    tuple.set("canonicalUrl", fields[1]);
    collector.write(tuple);
  }
}

Page 26: Introduction to Hadoop

URL Resolution – Resolving in the reducer

public static class Handler extends TupleReducer<Text, NullWritable> {

  private Text result;

  @Override
  public void reduce(ITuple group, Iterable<ITuple> tuples, TupleMRContext context, Collector collector)
      throws IOException, InterruptedException, TupleMRException {

    if (result == null) {
      result = new Text();
    }
    String cannonicalUrl = null;
    for(ITuple tuple : tuples) {
      if("urlMap".equals(tuple.getSchema().getName())) {
        cannonicalUrl = tuple.get("canonicalUrl").toString();
      } else {
        result.set(cannonicalUrl + "\t" + tuple.get("timestamp") + "\t" + tuple.get("ip"));
        collector.write(result, NullWritable.get());
      }
    }
  }
}

Page 27: Introduction to Hadoop

URL Resolution – Configuring and launching the job

String input1 = args[0];
String input2 = args[1];
String output = args[2];

deleteOutput(output);

TupleMRBuilder mr = new TupleMRBuilder(conf, "Pangool Url Resolution");
mr.addIntermediateSchema(getURLMapSchema());
mr.addIntermediateSchema(getURLRegisterSchema());
mr.setGroupByFields("url");
mr.setOrderBy(new OrderBy().add("url", Order.ASC).addSchemaOrder(Order.ASC));
mr.setTupleReducer(new Handler());
mr.setOutput(new Path(output), new HadoopOutputFormat(TextOutputFormat.class), Text.class, NullWritable.class);
mr.addInput(new Path(input1), new HadoopInputFormat(TextInputFormat.class), new UrlMapProcessor());
mr.addInput(new Path(input2), new HadoopInputFormat(TextInputFormat.class), new UrlProcessor());
mr.createJob().waitForCompletion(true);

Page 28: Introduction to Hadoop

Hadoop vs Pangool

[Side-by-side listings: "URL Resolution in Pangool" vs "URL Resolution in Hadoop". The Pangool panel is the complete UrlResolution class (the same schemas, UrlProcessor, UrlMapProcessor and Handler shown on the previous slides, plus the TupleMRBuilder setup, imports and Apache 2.0 license header). The Hadoop panel is the equivalent HadoopUrlResolution class written against the raw MapReduce API: it additionally needs a custom WritableComparable key (UrlRegJoinUrlMap) with hand-written write/readFields, hashCode, equals and compareTo, a registered raw-byte Comparator, a KeyPartitioner and a GroupingComparator in order to perform the same reduce-side join. The Hadoop listing is several times longer than the Pangool one.]

Page 29: Introduction to Hadoop

Hadoop in the real world

● Data capture
– Using Hadoop itself
– Using Sqoop for SQL databases
– With services that write to HDFS
– Flume
– Storm
● Applying basic patterns
– Filtering
– Sorting
– Distributed execution
– Joins
● Reduce-side, map-side, in-memory (see the sketch after this list)
– Cross product
– Reconciliation
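To make the in-memory join pattern concrete, here is a minimal sketch (not from the slides): when the URL → canonical URL map fits in RAM, the reduce-side join shown earlier can be replaced by a map-only job that loads the map in setup() and resolves each record directly. The configuration key, paths and field layout are assumptions.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMemoryJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

  private final Map<String, String> urlMap = new HashMap<String, String>();
  private final Text result = new Text();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Load the small "url \t canonicalUrl" file into memory (path and property name are hypothetical).
    Path mapFile = new Path(context.getConfiguration().get("urlmap.path", "/data/url-map.tsv"));
    FileSystem fs = FileSystem.get(context.getConfiguration());
    BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(mapFile)));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] fields = line.split("\t");
        urlMap.put(fields[0], fields[1]);
      }
    } finally {
      reader.close();
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    // Input records are "url \t timestamp \t ip"; emit them with the URL replaced by its canonical form.
    String[] fields = value.toString().split("\t");
    String canonical = urlMap.get(fields[0]);
    if (canonical != null) {
      result.set(canonical + "\t" + fields[1] + "\t" + fields[2]);
      context.write(result, NullWritable.get());
    }
  }
}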

Page 30: Introduction to Hadoop

Hadoop in the real world (II)

● Chaining jobs
– The output of one job is connected to the input of the next, forming a flow (see the sketch below)
– Flow systems:
● Oozie
● Updating databases
– Generating/updating
● Solr indexes
● key/value DBs
● SQL DBs
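A minimal sketch (not from the slides) of chaining two jobs through an intermediate HDFS path; the job names and paths are placeholders, and the mapper/reducer classes (and setJarByClass) are omitted for brevity, so as written both jobs run as identity jobs:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedFlow {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    Path intermediate = new Path(args[1]); // output of job 1 = input of job 2
    Path output = new Path(args[2]);

    // First job of the flow (for example: resolve canonical URLs).
    Job first = new Job(conf, "first step");
    FileInputFormat.addInputPath(first, input);
    FileOutputFormat.setOutputPath(first, intermediate);
    if (!first.waitForCompletion(true)) {
      System.exit(1); // stop the flow if the first job fails
    }

    // Second job reads what the first one wrote (for example: count visits per canonical URL).
    Job second = new Job(conf, "second step");
    FileInputFormat.addInputPath(second, intermediate);
    FileOutputFormat.setOutputPath(second, output);
    System.exit(second.waitForCompletion(true) ? 0 : 1);
  }
}

Tools such as Oozie manage this kind of flow (scheduling, dependencies, retries) instead of hand-written drivers like the one above.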

Page 31: Introduction to Hadoop

Let's clear up some common doubts

● Hadoop is not a DB
● Hadoop "apparently" only processes data
● Hadoop does not support lookups

Hadoop is a paradigm shift that takes some effort to absorb

Page 32: Introduction to Hadoop

A change of philosophy

● Always reprocess everything. EVERYTHING!
● Why?
• More fault tolerant
• More flexible
• More efficient. Example:

With a 7200 RPM hard disk
– Random IOPS: 100
– Sequential read: 40 MB/s
– Record size: 5 KB
– 100 MB, 20,000 records
» Sequential read: 2.5 s
» Random read: 200 s

… if as little as 1.25% of the records change, it is faster to rewrite everything than to perform random update accesses (the arithmetic is worked out below).
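A quick check of the slide's numbers (100 MB of 5 KB records, i.e. 20,000 records):

– Sequential read of everything: 100 MB / 40 MB/s = 2.5 s
– Random read of everything: 20,000 records / 100 IOPS = 200 s
– Break-even: in the 2.5 s a full sequential pass takes, the disk can serve 2.5 s × 100 IOPS = 250 random accesses, and 250 / 20,000 = 1.25%

So as soon as more than about 1.25% of the records change, reading and rewriting the whole dataset sequentially is cheaper than updating the changed records in place.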

Page 33: Introduction to Hadoop

High-level tools

● Using Hadoop or Pangool may not be appropriate for every project
● There are higher-level tools that can be very useful and easier to use:
– Hive: data analysis with SQL
– Pig: data analysis with an SQL-like language
– Cascading: a programming framework on top of Hadoop that adds tuples and automatic flow management
– Hadoop Streaming: data processing on Hadoop with any command or scripting language

Page 34: Introduction to Hadoop

Thank you

Iván de Prado Alonso // [email protected] // @ivanprado