Introduction to Hadoop
The data bazooka
Iván de Prado Alonso // @ivanprado // @datasalt
2 / 34
Datasalt
Focus on Big Data
– Open source contributions
– Consulting & development
– Training
3 / 34
BIG “MAC” DATA
4 / 34
Anatomy of a Big Data project
Serving
Processing
Acquisition
5 / 34
Types of Big Data systems
● Offline
  – Latency is not a problem
● Online
  – Data freshness matters
● Mixed
  – The most common case
Offline: MapReduce, Hadoop, distributed RDBMS
Online: NoSQL databases, search engines
6 / 34
“Swiss army knife of the 21st century”
Media Guardian Innovation Awards
http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop
7 / 34
History
● 2004-2006
  – Google publishes the GFS and MapReduce papers
  – Doug Cutting implements an open source version in Nutch
● 2006-2008
  – Hadoop is split out of Nutch
  – Web scale is reached in 2008
● 2008-present
  – Hadoop becomes popular and starts to be exploited commercially.
Source: Hadoop: a brief history. Doug Cutting
8 / 34
Hadoop
“The Apache Hadoop software library is a
framework that allows for the distributed
processing of large data sets across clusters of
computers using a simple programming
model”
From the Hadoop website
9 / 34
Distributed File System
● Distributed file system (HDFS)
  – Large blocks: 64 MB
    ● Stored on the operating system's file system
  – Fault tolerant (replication)
  – Common formats:
    ● Plain-text files (CSV)
    ● SequenceFiles
      – Sequences of [key, value] pairs
10 / 34
MapReduce
● Two functions (Map and Reduce)
  – Map(k, v) : [z, w]*
  – Reduce(k, v*) : [z, w]*
● Example: counting words
  – Map([document, null]) -> [word, 1]*
  – Reduce(word, 1*) -> [word, total]
● MapReduce and SQL
  – SELECT word, count(*) GROUP BY word
● Distributed execution on a horizontally scalable cluster
11 / 34
The typical Word Count
Input:
Esto es una linea
Esto también
Map:
map(“Esto es una linea”) = [esto, 1], [es, 1], [una, 1], [linea, 1]
map(“Esto también”) = [esto, 1], [también, 1]
Reduce:
reduce(es, {1}) = es, 1
reduce(esto, {1, 1}) = esto, 2
reduce(linea, {1}) = linea, 1
reduce(también, {1}) = también, 1
reduce(una, {1}) = una, 1
Result:
es, 1
esto, 2
linea, 1
también, 1
una, 1
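The same flow can be tried without a cluster. The following is a minimal sketch in plain Java, not Hadoop code: the class name WordCountSketch and its methods are made up for illustration, and a sorted in-memory map stands in for the shuffle that Hadoop performs between the map and reduce phases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {

  // Map phase: emit one (word, 1) pair per token, lowercased as in the example.
  static List<Map.Entry<String, Integer>> map(String line) {
    List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
    for (String token : line.toLowerCase(Locale.ROOT).split("\\s+")) {
      if (!token.isEmpty()) {
        pairs.add(Map.entry(token, 1));
      }
    }
    return pairs;
  }

  // Reduce phase: sum all the counts received for one word.
  static int reduce(String word, List<Integer> counts) {
    int sum = 0;
    for (int count : counts) {
      sum += count;
    }
    return sum;
  }

  // Driver: map every line, group the values by key (the "shuffle"), reduce each group.
  static Map<String, Integer> wordCount(List<String> lines) {
    Map<String, List<Integer>> groups = new TreeMap<>(); // keys kept sorted
    for (String line : lines) {
      for (Map.Entry<String, Integer> pair : map(line)) {
        groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
      }
    }
    Map<String, Integer> result = new TreeMap<>();
    for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
      result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
    }
    return result;
  }

  public static void main(String[] args) {
    // Same two input lines as the example ("\u00e9" is the accented e).
    System.out.println(wordCount(List.of("Esto es una linea", "Esto tambi\u00e9n")));
  }
}
```

Each reduce call only sees the values of a single key, which is what lets a real cluster run the reducers for different keys on different nodes.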
12 / 34
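Word Count in Hadoop
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// HadoopUtils is a helper that is not part of Hadoop itself; its import is omitted, as on the slide.
public class WordCountHadoop extends Configured implements Tool {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      // Emit (word, 1) for every token in the line
      StringTokenizer itr = new StringTokenizer(value.toString());
      while(itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      // Sum all the counts received for this word
      int sum = 0;
      for(IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    if(args.length != 2) {
      System.err.println("Usage: wordcount-hadoop <in> <out>");
      System.exit(2);
    }
    Path output = new Path(args[1]);
    // getConf() returns the Configuration injected by ToolRunner
    HadoopUtils.deleteIfExists(FileSystem.get(output.toUri(), getConf()), output);
    Job job = new Job(getConf(), "word count hadoop");
    job.setJarByClass(WordCountHadoop.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    ToolRunner.run(new WordCountHadoop(), args);
  }
}
Better to take it piece by piece!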
13 / 34
Word Count in Hadoop - Mapper
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

  // Reused output objects: Hadoop serializes them on every write
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    // Emit (word, 1) for every token in the line
    StringTokenizer itr = new StringTokenizer(value.toString());
    while(itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
14 / 34
Word Count in Hadoop - Reducer
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException,
      InterruptedException {
    // Sum all the counts received for this word
    int sum = 0;
    for(IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
15 / 34
Word Count in Hadoop – Configuration and execution
if(args.length != 2) {
  System.err.println("Usage: wordcount-hadoop <in> <out>");
  System.exit(2);
}
Path output = new Path(args[1]);
// getConf() returns the Configuration injected by ToolRunner
HadoopUtils.deleteIfExists(FileSystem.get(output.toUri(), getConf()), output);
Job job = new Job(getConf(), "word count hadoop");
job.setJarByClass(WordCountHadoop.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class); // the reducer doubles as combiner because summing is associative
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
16 / 34
Execution of a MapReduce job
[Diagram: on each node (Node 1, Node 2), input file blocks → Mappers → intermediate data → Reducers → result]
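One detail behind the step from mappers to reducers is how each intermediate key is assigned to a reducer. The sketch below reproduces the formula used by Hadoop's default HashPartitioner in plain Java; the class and method names (PartitionSketch, partitionFor) are only illustrative.

```java
public class PartitionSketch {

  // Mask the sign bit, then take the modulus, so the result is always in [0, numReducers).
  static int partitionFor(Object key, int numReducers) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
  }

  public static void main(String[] args) {
    // Every occurrence of the same word lands on the same reducer.
    for (String word : new String[] {"es", "esto", "linea", "una"}) {
      System.out.println(word + " -> reducer " + partitionFor(word, 2));
    }
  }
}
```

Masking with Integer.MAX_VALUE rather than calling Math.abs matters because Math.abs(Integer.MIN_VALUE) is still negative, which would produce an invalid partition number.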
17 / 34
Serialization
● Writables
  • Hadoop's native serialization
  • Very low level
  • Basic types: IntWritable, Text, etc.
● Others
  • Thrift, Avro, Protostuff
  • Backward compatibility.
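To give an idea of what "very low level" means, here is a minimal sketch of the Writable contract (a write(DataOutput) method and a readFields(DataInput) method) written against the plain JDK so that it runs without Hadoop. The UrlVisit type and the serialize/deserialize helpers are hypothetical names used only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class WritableSketch {

  // Hypothetical record; it mimics the two methods of Hadoop's Writable interface.
  static class UrlVisit {
    String url = "";
    long timestamp;

    void write(DataOutput out) throws IOException {
      out.writeUTF(url);        // fields are written by hand, in a fixed order
      out.writeLong(timestamp);
    }

    void readFields(DataInput in) throws IOException {
      url = in.readUTF();       // and must be read back in exactly the same order
      timestamp = in.readLong();
    }
  }

  static byte[] serialize(UrlVisit visit) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      visit.write(new DataOutputStream(bytes));
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  static UrlVisit deserialize(byte[] data) {
    try {
      UrlVisit visit = new UrlVisit();
      visit.readFields(new DataInputStream(new ByteArrayInputStream(data)));
      return visit;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    UrlVisit visit = new UrlVisit();
    visit.url = "http://example.org/a";
    visit.timestamp = 1335000000000L;
    UrlVisit copy = deserialize(serialize(visit));
    System.out.println(copy.url + " " + copy.timestamp);
  }
}
```

Nothing in the byte stream describes the fields, so adding or reordering a field breaks readers of old data. That is the backward-compatibility problem that formats such as Thrift or Avro address.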
18 / 34
Hadoop's learning curve
is steep
19 / 34
Tuple MapReduce
● A simpler MapReduce
  – Tuples instead of key/value
  – Defined at the job level:
    ● The fields to group by
    ● The fields to sort by
  – Tuple MapReduce-join
20 / 34
Pangool
● Implementation of Tuple MapReduce
  – Developed by Datasalt
  – Open source
  – Efficiency comparable to Hadoop
● Goal: to replace the Hadoop API
● If you want to learn Hadoop, start with Pangool
21 / 34
Pangool efficiency
● Comparable to Hadoop
See http://pangool.net/benchmark.html
22 / 34
Pangool – URL resolution
● Join example
  – Very hard in Hadoop. Easy in Pangool.
● Problem:
  – There are many URL shorteners and redirects
  – When analyzing data, it is often useful to replace each URL with its canonical URL
  – Suppose we have both datasets:
    ● A map with URL → canonical URL entries
    ● A dataset with URLs (which we want to resolve) and other fields.
  – The following Pangool job solves the problem in a scalable way.
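Before looking at the job itself, the idea of a reduce-side join can be sketched in plain Java with no framework: group both datasets by URL, make the mapping entry sort first inside each group, and carry the canonical URL over to the records that follow. Everything here (UrlJoinSketch, Row, resolve, the sample URLs) is hypothetical, and an in-memory sort stands in for the distributed shuffle. Unlike the job shown in this deck, the sketch keeps the original URL when no mapping exists; that is a choice of this sketch only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class UrlJoinSketch {

  static final int SOURCE_URL_MAP = 0;      // sorts first inside each group
  static final int SOURCE_URL_REGISTER = 1;

  // Hypothetical flat tuple. "payload" holds the canonical URL for map entries,
  // or the remaining fields (timestamp <TAB> ip) for register entries.
  static class Row {
    final String url;
    final int source;
    final String payload;

    Row(String url, int source, String payload) {
      this.url = url;
      this.source = source;
      this.payload = payload;
    }
  }

  static List<String> resolve(List<Row> urlMap, List<Row> urlRegister) {
    List<Row> all = new ArrayList<>(urlMap);
    all.addAll(urlRegister);
    // Stand-in for the shuffle: group by url and, within a group, put map entries first.
    all.sort(Comparator.comparing((Row r) -> r.url).thenComparingInt(r -> r.source));

    List<String> out = new ArrayList<>();
    String currentUrl = null;
    String canonicalUrl = null;
    for (Row row : all) {
      if (!row.url.equals(currentUrl)) { // a new group starts
        currentUrl = row.url;
        canonicalUrl = null;
      }
      if (row.source == SOURCE_URL_MAP) {
        canonicalUrl = row.payload;
      } else {
        // Assumption of this sketch: fall back to the original URL when there is no mapping.
        String resolved = canonicalUrl != null ? canonicalUrl : row.url;
        out.add(resolved + "\t" + row.payload);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Row> urlMap = List.of(
        new Row("http://sho.rt/a", SOURCE_URL_MAP, "http://example.org/page"));
    List<Row> urlRegister = List.of(
        new Row("http://sho.rt/a", SOURCE_URL_REGISTER, "100\t10.0.0.1"),
        new Row("http://other.org/", SOURCE_URL_REGISTER, "200\t10.0.0.2"));
    resolve(urlMap, urlRegister).forEach(System.out::println);
  }
}
```

The reducer never needs to hold a whole group in memory: because the mapping entry arrives first, a single pass over the sorted records is enough.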
23 / 34
URL Resolution – Defining the schemas
// Schema of the records to resolve: url, timestamp, ip
static Schema getURLRegisterSchema() {
  List<Field> urlRegisterFields = new ArrayList<Field>();
  urlRegisterFields.add(Field.create("url", Type.STRING));
  urlRegisterFields.add(Field.create("timestamp", Type.LONG));
  urlRegisterFields.add(Field.create("ip", Type.STRING));
  return new Schema("urlRegister", urlRegisterFields);
}

// Schema of the mapping: url, canonicalUrl
static Schema getURLMapSchema() {
  List<Field> urlMapFields = new ArrayList<Field>();
  urlMapFields.add(Field.create("url", Type.STRING));
  urlMapFields.add(Field.create("canonicalUrl", Type.STRING));
  return new Schema("urlMap", urlMapFields);
}
24 / 34
URL Resolution – Loading the file to resolve
public static class UrlProcessor extends TupleMapper<LongWritable, Text> {

  private Tuple tuple = new Tuple(getURLRegisterSchema());

  @Override
  public void map(LongWritable key, Text value, TupleMRContext context, Collector collector)
      throws IOException, InterruptedException {
    // Input line format: url <TAB> timestamp <TAB> ip
    String[] fields = value.toString().split("\t");
    tuple.set("url", fields[0]);
    tuple.set("timestamp", Long.parseLong(fields[1]));
    tuple.set("ip", fields[2]);
    collector.write(tuple);
  }
}
25 / 34
URL Resolution – Loading the URL map
public static class UrlMapProcessor extends TupleMapper<LongWritable, Text> {

  private Tuple tuple = new Tuple(getURLMapSchema());

  @Override
  public void map(LongWritable key, Text value, TupleMRContext context, Collector collector)
      throws IOException, InterruptedException {
    // Input line format: url <TAB> canonicalUrl
    String[] fields = value.toString().split("\t");
    tuple.set("url", fields[0]);
    tuple.set("canonicalUrl", fields[1]);
    collector.write(tuple);
  }
}
26 / 34
URL Resolution – Resolving in the reducer
public static class Handler extends TupleReducer<Text, NullWritable> {

  private Text result;

  @Override
  public void reduce(ITuple group, Iterable<ITuple> tuples, TupleMRContext context, Collector collector)
      throws IOException, InterruptedException, TupleMRException {
    if (result == null) {
      result = new Text();
    }
    String cannonicalUrl = null;
    for(ITuple tuple : tuples) {
      if("urlMap".equals(tuple.getSchema().getName())) {
        // Mapping tuple: remember the canonical URL for this group
        cannonicalUrl = tuple.get("canonicalUrl").toString();
      } else {
        // Register tuple: emit it with the URL replaced by the canonical one
        result.set(cannonicalUrl + "\t" + tuple.get("timestamp") + "\t" + tuple.get("ip"));
        collector.write(result, NullWritable.get());
      }
    }
  }
}
27 / 34
URL Resolution – Configuring and launching the job
String input1 = args[0];
String input2 = args[1];
String output = args[2];
deleteOutput(output);

TupleMRBuilder mr = new TupleMRBuilder(conf, "Pangool Url Resolution");
mr.addIntermediateSchema(getURLMapSchema());
mr.addIntermediateSchema(getURLRegisterSchema());
mr.setGroupByFields("url");
// Sort by url and then by schema, so the reducer receives the urlMap tuple of each group first
mr.setOrderBy(new OrderBy().add("url", Order.ASC).addSchemaOrder(Order.ASC));
mr.setTupleReducer(new Handler());
mr.setOutput(new Path(output), new HadoopOutputFormat(TextOutputFormat.class), Text.class, NullWritable.class);
mr.addInput(new Path(input1), new HadoopInputFormat(TextInputFormat.class), new UrlMapProcessor());
mr.addInput(new Path(input2), new HadoopInputFormat(TextInputFormat.class), new UrlProcessor());
mr.createJob().waitForCompletion(true);
28 / 34
Hadoop vs Pangool
URL Resolution in Pangool
/**
 * Copyright [2012] [Datasalt Systems S.L.]
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package com.datasalt.pangool.examples.urlresolution;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.ToolRunner;

import com.datasalt.pangool.examples.BaseExampleJob;
import com.datasalt.pangool.io.ITuple;
import com.datasalt.pangool.io.Schema;
import com.datasalt.pangool.io.Schema.Field;
import com.datasalt.pangool.io.Schema.Field.Type;
import com.datasalt.pangool.io.Tuple;
import com.datasalt.pangool.tuplemr.TupleMRBuilder;
import com.datasalt.pangool.tuplemr.TupleMRException;
import com.datasalt.pangool.tuplemr.TupleMapper;
import com.datasalt.pangool.tuplemr.TupleReducer;
import com.datasalt.pangool.tuplemr.mapred.lib.input.HadoopInputFormat;
import com.datasalt.pangool.tuplemr.mapred.lib.output.HadoopOutputFormat;

/**
 * This example shows how to perform reduce-side joins using Pangool. We have one file with URL Registers:
 * ["url", "timestamp", "ip"] and another file with canonical URL mapping: ["url", "canonicalUrl"]. We want to
 * obtain the URL Registers file with the url substituted with the canonical one according to the mapping file:
 * ["canonicalUrl", "timestamp", "ip"].
 */
public class UrlResolution extends BaseExampleJob {

  static Schema getURLRegisterSchema() {
    List<Field> urlRegisterFields = new ArrayList<Field>();
    urlRegisterFields.add(Field.create("url", Type.STRING));
    urlRegisterFields.add(Field.create("timestamp", Type.LONG));
    urlRegisterFields.add(Field.create("ip", Type.STRING));
    return new Schema("urlRegister", urlRegisterFields);
  }

  static Schema getURLMapSchema() {
    List<Field> urlMapFields = new ArrayList<Field>();
    urlMapFields.add(Field.create("url", Type.STRING));
    urlMapFields.add(Field.create("canonicalUrl", Type.STRING));
    return new Schema("urlMap", urlMapFields);
  }

  @SuppressWarnings("serial")
  public static class UrlProcessor extends TupleMapper<LongWritable, Text> {

    private Tuple tuple = new Tuple(getURLRegisterSchema());

    @Override
    public void map(LongWritable key, Text value, TupleMRContext context, Collector collector)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      tuple.set("url", fields[0]);
      tuple.set("timestamp", Long.parseLong(fields[1]));
      tuple.set("ip", fields[2]);
      collector.write(tuple);
    }
  }

  @SuppressWarnings("serial")
  public static class UrlMapProcessor extends TupleMapper<LongWritable, Text> {

    private Tuple tuple = new Tuple(getURLMapSchema());

    @Override
    public void map(LongWritable key, Text value, TupleMRContext context, Collector collector)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      tuple.set("url", fields[0]);
      tuple.set("canonicalUrl", fields[1]);
      collector.write(tuple);
    }
  }

  @SuppressWarnings("serial")
  public static class Handler extends TupleReducer<Text, NullWritable> {

    private Text result;

    @Override
    public void reduce(ITuple group, Iterable<ITuple> tuples, TupleMRContext context, Collector collector)
        throws IOException, InterruptedException, TupleMRException {
      if (result == null) {
        result = new Text();
      }
      String cannonicalUrl = null;
      for(ITuple tuple : tuples) {
        if("urlMap".equals(tuple.getSchema().getName())) {
          cannonicalUrl = tuple.get("canonicalUrl").toString();
        } else {
          result.set(cannonicalUrl + "\t" + tuple.get("timestamp") + "\t" + tuple.get("ip"));
          collector.write(result, NullWritable.get());
        }
      }
    }
  }

  public UrlResolution() {
    super("UrlResolution: [input_url_mapping] [input_url_regs] [output]");
  }

  @Override
  public int run(String[] args) throws Exception {
    if(args.length != 3) {
      failArguments("Wrong number of arguments");
      return -1;
    }
    String input1 = args[0];
    String input2 = args[1];
    String output = args[2];
    deleteOutput(output);

    TupleMRBuilder mr = new TupleMRBuilder(conf, "Pangool Url Resolution");
    mr.addIntermediateSchema(getURLMapSchema());
    mr.addIntermediateSchema(getURLRegisterSchema());
    mr.setGroupByFields("url");
    mr.setTupleReducer(new Handler());
    mr.setOutput(new Path(output), new HadoopOutputFormat(TextOutputFormat.class), Text.class, NullWritable.class);
    mr.addInput(new Path(input1), new HadoopInputFormat(TextInputFormat.class), new UrlMapProcessor());
    mr.addInput(new Path(input2), new HadoopInputFormat(TextInputFormat.class), new UrlProcessor());
    mr.createJob().waitForCompletion(true);
    return 1;
  }

  public static void main(String[] args) throws Exception {
    ToolRunner.run(new UrlResolution(), args);
  }
}
URL Resolution in Hadoop
/**
 * Copyright [2012] [Datasalt Systems S.L.]
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package com.datasalt.pangool.benchmark.urlresolution;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.WritableUtils;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Partitioner;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * Code for solving the URL Resolution CoGroup Problem in Hadoop Java Map/Red API.
 * <p>
 * The URL Resolution CoGroup Problem is: We have one file with URL Registers: {url timestamp ip} and another file
 * with canonical URL mapping: {url canonicalUrl}. We want to obtain the URL Registers file with the url substituted
 * with the canonical one according to the mapping file: {canonicalUrl timestamp ip}.
 */
public class HadoopUrlResolution {

  public final static int SOURCE_URL_MAP = 0;
  public final static int SOURCE_URL_REGISTER = 1;

  public static class UrlRegJoinUrlMap implements WritableComparable<UrlRegJoinUrlMap> {

    // --- Common fields --- //
    private Text groupUrl = new Text();
    private int sourceId;

    // --- Url register --- //
    private Text ip = new Text();
    private long timestamp;

    // --- Url map --- //
    private Text cannonicalUrl = new Text();

    public void setUrlRegister(String groupUrl, String ip, long timestamp) {
      this.groupUrl.set(groupUrl);
      sourceId = SOURCE_URL_REGISTER;
      this.ip.set(ip);
      this.timestamp = timestamp;
    }

    public void setUrlMap(String groupUrl, String cannonicalUrl) {
      this.groupUrl.set(groupUrl);
      sourceId = SOURCE_URL_MAP;
      this.cannonicalUrl.set(cannonicalUrl);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
      groupUrl.readFields(in);
      sourceId = (int) in.readByte();
      if(sourceId == SOURCE_URL_MAP) {
        cannonicalUrl.readFields(in);
      } else {
        ip.readFields(in);
        timestamp = in.readLong();
      }
    }

    @Override
    public void write(DataOutput out) throws IOException {
      groupUrl.write(out);
      out.writeByte((byte) sourceId);
      if(sourceId == SOURCE_URL_MAP) {
        cannonicalUrl.write(out);
      } else {
        ip.write(out);
        out.writeLong(timestamp);
      }
    }

    @Override
    public int hashCode() {
      if(sourceId == SOURCE_URL_MAP) {
        return ((groupUrl.hashCode() * 31) + cannonicalUrl.hashCode()) * 31;
      } else {
        return ((((groupUrl.hashCode() * 31) + ip.hashCode()) * 31) + (int) timestamp) * 31;
      }
    }

    @Override
    public boolean equals(Object right) {
      if(right instanceof UrlRegJoinUrlMap) {
        UrlRegJoinUrlMap r = (UrlRegJoinUrlMap) right;
        if(sourceId == r.sourceId) {
          if(groupUrl.equals(r.groupUrl)) {
            if(sourceId == SOURCE_URL_MAP) {
              return cannonicalUrl.equals(r.cannonicalUrl);
            } else {
              return ip.equals(r.ip) && timestamp == r.timestamp;
            }
          }
        }
      }
      return false;
    }

    // Sort comparator working on the serialized bytes: by group url, then by source id
    public static class Comparator extends WritableComparator {

      public Comparator() {
        super(UrlRegJoinUrlMap.class);
      }

      public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        try {
          int offset1 = s1;
          int offset2 = s2;
          // Group url
          int strSize1 = WritableComparator.readVInt(b1, offset1);
          int strSize2 = WritableComparator.readVInt(b2, offset2);
          offset1 += WritableUtils.decodeVIntSize(b1[offset1]);
          offset2 += WritableUtils.decodeVIntSize(b2[offset2]);
          int cmp = WritableComparator.compareBytes(b1, offset1, strSize1, b2, offset2, strSize2);
          if(cmp != 0) {
            return cmp;
          }
          offset1 += strSize1;
          offset2 += strSize2;
          // Source id
          int sourceId1 = (int) b1[offset1];
          int sourceId2 = (int) b2[offset2];
          if(sourceId1 != sourceId2) {
            return sourceId1 > sourceId2 ? 1 : -1;
          }
          return 0;
        } catch(IOException e) {
          throw new RuntimeException(e);
        }
      }
    }

    static { // register this comparator
      WritableComparator.define(UrlRegJoinUrlMap.class, new Comparator());
    }

    @Override
    public int compareTo(UrlRegJoinUrlMap o) {
      int cmp = groupUrl.compareTo(o.groupUrl);
      if(cmp != 0) {
        return cmp;
      }
      if(sourceId != o.sourceId) {
        return sourceId > o.sourceId ? 1 : -1;
      }
      return 0;
    }

    public String toString() {
      if(sourceId == SOURCE_URL_MAP) {
        return groupUrl.toString() + "," + cannonicalUrl.toString();
      } else {
        return groupUrl.toString() + "," + ip.toString() + "," + timestamp;
      }
    }
  }

  public static class KeyPartitioner implements Partitioner<UrlRegJoinUrlMap, NullWritable> {

    @Override
    public int getPartition(UrlRegJoinUrlMap key, NullWritable value, int numPartitions) {
      // Mask the sign bit instead of using Math.abs, which stays negative for Integer.MIN_VALUE
      return (key.groupUrl.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    @Override
    public void configure(JobConf arg0) {
    }
  }

  // Groups reducer input by url only, ignoring the source id
  public static class GroupingComparator implements RawComparator<UrlRegJoinUrlMap> {

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
      try {
        int offset1 = s1;
        int offset2 = s2;
        int strSize1 = WritableComparator.readVInt(b1, offset1);
        int strSize2 = WritableComparator.readVInt(b2, offset2);
        offset1 += WritableUtils.decodeVIntSize(b1[offset1]);
        offset2 += WritableUtils.decodeVIntSize(b2[offset2]);
        return WritableComparator.compareBytes(b1, offset1, strSize1, b2, offset2, strSize2);
      } catch(IOException e) {
        throw new RuntimeException(e);
      }
    }

    @Override
    public int compare(UrlRegJoinUrlMap o1, UrlRegJoinUrlMap o2) {
      return o1.groupUrl.compareTo(o2.groupUrl);
    }
  }

  public static class UrlMapClass implements Mapper<LongWritable, Text, UrlRegJoinUrlMap, NullWritable> {

    private final UrlRegJoinUrlMap key = new UrlRegJoinUrlMap();

    @Override
    public void configure(JobConf arg0) {
    }

    @Override
    public void close() throws IOException {
    }

    @Override
    public void map(LongWritable ignore, Text inValue, OutputCollector<UrlRegJoinUrlMap, NullWritable> context,
        Reporter arg3) throws IOException {
      String[] fields = inValue.toString().split("\t");
      key.setUrlMap(fields[0], fields[1]);
      context.collect(key, NullWritable.get());
    }
  }

  public static class UrlRegisterMapClass implements Mapper<LongWritable, Text, UrlRegJoinUrlMap, NullWritable> {

    private final UrlRegJoinUrlMap key = new UrlRegJoinUrlMap();

    @Override
    public void configure(JobConf arg0) {
    }

    @Override
    public void close() throws IOException {
    }

    @Override
    public void map(LongWritable ignore, Text inValue, OutputCollector<UrlRegJoinUrlMap, NullWritable> context,
        Reporter arg3) throws IOException {
      String[] fields = inValue.toString().split("\t");
      key.setUrlRegister(fields[0], fields[2], Long.parseLong(fields[1]));
      context.collect(key, NullWritable.get());
    }
  }

  public static class Reduce extends Reducer<UrlRegJoinUrlMap, NullWritable, Text, NullWritable> {

    Text result = new Text();

    protected void reduce(UrlRegJoinUrlMap key, Iterable<NullWritable> values, Context ctx) throws IOException,
        InterruptedException {
      String cannonicalUrl = null;
      // Hadoop updates "key" in place on every iteration over the values
      for(@SuppressWarnings("unused") NullWritable value : values) {
        if(key.sourceId == SOURCE_URL_MAP) {
          cannonicalUrl = key.cannonicalUrl.toString();
        } else {
          result.set(cannonicalUrl + "\t" + key.timestamp + "\t" + key.ip);
          ctx.write(result, NullWritable.get());
        }
      }
    }
  }

  public final static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if(otherArgs.length != 3) {
      System.err.println("Usage: urlresolution <url-map> <url-register> <out>");
      System.exit(2);
    }
    JobConf job = new JobConf(conf);
    FileSystem fS = FileSystem.get(conf);
    fS.delete(new Path(otherArgs[2]), true);

    MultipleInputs.addInputPath(job, new Path(otherArgs[0]), TextInputFormat.class, UrlMapClass.class);
    MultipleInputs.addInputPath(job, new Path(otherArgs[1]), TextInputFormat.class, UrlRegisterMapClass.class);

    job.setJarByClass(HadoopUrlResolution.class);
    job.setPartitionerClass(KeyPartitioner.class);
    job.setOutputValueGroupingComparator(GroupingComparator.class);
    job.setMapOutputKeyClass(UrlRegJoinUrlMap.class);
    job.setMapOutputValueClass(NullWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[2]));

    Job j = new Job(job);
    j.setReducerClass(Reduce.class);
    j.waitForCompletion(true);
  }
}
29 / 34
Hadoop in the real world
● Data capture
  – Using Hadoop
  – Using Sqoop for SQL databases
  – With services that write to HDFS
  – Flume
  – Storm
● Applying basic patterns
  – Filtering
  – Sorting
  – Distributed execution
  – Joins
    ● Reduce-side, map-side, in-memory
  – Cross product
  – Reconciliation
30 / 34
Hadoop in the real world (II)
● Job chaining
  – The output of one job becomes the input of the next, forming a flow.
  – Flow systems:
    ● Oozie
● Updating databases
  – Generating/updating
    ● Solr indexes
    ● key/value databases
    ● SQL databases
31 / 34
Let's clear up some common doubts
● Hadoop is not a DB
● Hadoop “apparently” only processes data
● Hadoop does not allow “lookups”
Hadoop involves a paradigm shift that takes time to assimilate
32 / 34
A change in philosophy
● Always reprocess everything. EVERYTHING!
● Why?
  • More fault tolerant
  • More flexible
  • More efficient. E.g.:
With a 7200 RPM hard disk
  – Random IOPS: 100
  – Sequential read: 40 MB/s
  – Record size: 5 KB
… as soon as 1.25% of the records change, it is faster to rewrite everything than to update them with random accesses.
  – 100 MB, 20,000 records
    » Sequential read: 2.5 s
    » Random read: 200 s
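The figures above can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope model: it assumes one random I/O per touched record and 1 MB = 1000 KB, and the class and method names are made up for the illustration.

```java
public class RewriteVsRandomAccess {

  // Seconds needed to read (or rewrite) the whole data set sequentially.
  static double sequentialSeconds(double datasetMb, double throughputMbPerSec) {
    return datasetMb / throughputMbPerSec;
  }

  // Seconds needed to touch the given number of records with one random I/O each.
  static double randomSeconds(long records, double iops) {
    return records / iops;
  }

  // Fraction of changed records at which both strategies take the same time.
  static double breakEvenFraction(double datasetMb, double recordKb,
                                  double throughputMbPerSec, double iops) {
    long totalRecords = Math.round(datasetMb * 1000 / recordKb); // 1 MB taken as 1000 KB
    double recordsInSameTime = sequentialSeconds(datasetMb, throughputMbPerSec) * iops;
    return recordsInSameTime / totalRecords;
  }

  public static void main(String[] args) {
    System.out.println(sequentialSeconds(100, 40));           // 100 MB at 40 MB/s
    System.out.println(randomSeconds(20_000, 100));           // 20,000 records at 100 IOPS
    System.out.println(breakEvenFraction(100, 5, 40, 100));   // share of records that breaks even
  }
}
```

In the 2.5 s it takes to stream the whole 100 MB, the disk can serve only 250 random accesses, which is 1.25% of the 20,000 records.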
33 / 34
High-level tools
● Using Hadoop or Pangool may not be appropriate for every project
● There are higher-level tools that can be very useful and easier to use:
  – Hive: data analysis with SQL
  – Pig: data analysis with a SQL-like language
  – Cascading: a programming framework on Hadoop that includes tuples and automatic flow management.
  – Hadoop streaming: data processing on Hadoop with any command or scripting language.