Introduction
Introduction to Avro and Integration with Hadoop
What is Avro?
• Avro is a serialization framework developed within Apache's Hadoop project. It uses JSON to define data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes.
• Avro provides a good way to convert unstructured and semi-structured data into structured data using schemas.
Creating your first Avro schema
Schema description:
{
  "name": "User",
  "type": "record",
  "fields": [
    {"name": "FirstName", "type": "string", "doc": "First Name"},
    {"name": "LastName", "type": "string"},
    {"name": "isActive", "type": "boolean", "default": true},
    {"name": "Account", "type": "int", "default": 0}
  ]
}
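To make "compact binary format" concrete, the sketch below hand-encodes one User record following the encoding rules in the Avro specification: int and long values are zig-zag varints, a string is its UTF-8 byte length followed by the bytes, a boolean is one byte, and a record is simply its fields in schema order. It uses only the Java standard library, not the Avro API. The class name AvroEncodingSketch and its helper methods are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hand-rolled illustration of Avro's binary encoding; this is not the Avro library.
class AvroEncodingSketch {

    // int and long: zig-zag the value, then emit 7 bits per byte, lowest group first.
    static void writeLong(ByteArrayOutputStream out, long value) {
        long zigzag = (value << 1) ^ (value >> 63);
        while ((zigzag & ~0x7FL) != 0) {
            out.write((int) ((zigzag & 0x7F) | 0x80)); // high bit set: more bytes follow
            zigzag >>>= 7;
        }
        out.write((int) zigzag);
    }

    // string: UTF-8 byte length (encoded as a long), then the UTF-8 bytes.
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // boolean: a single byte, 1 for true and 0 for false.
    static void writeBoolean(ByteArrayOutputStream out, boolean b) {
        out.write(b ? 1 : 0);
    }

    // A record is its fields in schema order, with no field names or separators.
    static byte[] encodeUser(String firstName, String lastName, boolean isActive, int account) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, firstName);
        writeString(out, lastName);
        writeBoolean(out, isActive);
        writeLong(out, account); // int uses the same encoding as long
        return out.toByteArray();
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // Convenience for experiments: encode one long and return it as hex.
    static String encodeLongHex(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLong(out, value);
        return hex(out.toByteArray());
    }

    public static void main(String[] args) {
        byte[] record = encodeUser("Joe", "Hadoop", true, 12345);
        System.out.println(hex(record) + " (" + record.length + " bytes)");
    }
}
```

With these sample values the record comes to 15 bytes, because the field names live in the schema rather than in every record. Real applications should rely on the Avro library's writers instead of code like this.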
Avro schema features
1. Primitive types (null, boolean, int, long, float, double, bytes, string)
2. Records
{
  "type": "record",
  "name": "LongList",
  "fields": [
    {"name": "value", "type": "long"},
    {"name": "description", "type": "string"}
  ]
}
3. Others (Enums, Arrays, Maps, Unions, Fixed)
How to create Avro record?
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecordBuilder;

// The schema as a JSON string
String schemaDescription =
    "{\n"
    + "  \"name\": \"User\",\n"
    + "  \"type\": \"record\",\n"
    + "  \"fields\": [\n"
    + "    {\"name\": \"FirstName\", \"type\": \"string\", \"doc\": \"First Name\"},\n"
    + "    {\"name\": \"LastName\", \"type\": \"string\"},\n"
    + "    {\"name\": \"isActive\", \"type\": \"boolean\", \"default\": true},\n"
    + "    {\"name\": \"Account\", \"type\": \"int\", \"default\": 0}\n"
    + "  ]\n"
    + "}";
// Parse the JSON into a Schema object
Schema.Parser parser = new Schema.Parser();
Schema s = parser.parse(schemaDescription);
// The builder fills in schema defaults for fields that are not set
GenericRecordBuilder builder = new GenericRecordBuilder(s);
How to create Avro record? (cont. 2)
1. The first step in creating an Avro record is to create a JSON-based schema.
2. Avro provides a parser that takes an Avro schema string and returns a schema object.
3. Once the schema object is created, we create a builder that lets us create records with default values.
How to create Avro record? (cont. 3)
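GenericRecord r = builder.build();
System.out.println("Record: " + r);
r.put("FirstName", "Joe");
r.put("LastName", "Hadoop");
r.put("Account", 12345);
System.out.println("Record: " + r);
System.out.println("FirstName:" + r.get("FirstName"));
Output:
Record: {"FirstName": null, "LastName": null, "isActive": true, "Account": 0}
Record: {"FirstName": "Joe", "LastName": "Hadoop", "isActive": true, "Account": 12345}
FirstName:Joe
Note: in recent Avro releases, GenericRecordBuilder.build() may throw an AvroRuntimeException for a field that has neither a value nor a default. If that happens, set FirstName and LastName on the builder with builder.set(...) before calling build().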
How to create Avro schema dynamically?
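String[] fields = {"FirstName", "LastName", "Account"};
// Arguments: record name, doc, namespace, isError
Schema s = Schema.createRecord("Ex2", "desc", "namespace", false);
List<Schema.Field> lstFields = new LinkedList<Schema.Field>();
for (String f : fields) {
    // Each field is a string with "" as its default. TextNode is the Jackson type
    // accepted by older Avro releases; newer releases take the default as a plain
    // Object, for example "".
    lstFields.add(new Schema.Field(f, Schema.create(Schema.Type.STRING), "doc", new TextNode("")));
}
s.setFields(lstFields);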
How to sort Avro records?
You can specify, per field, whether it takes part in sorting and in which direction, using the "order" attribute.
Options: ascending (the default), descending, ignore
{ "name" : "isActive", "type" : "boolean", "default" : true, "order" : "ignore" },
{ "name" : "Account", "type" : "int", "default" : 0, "order" : "descending" }
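The effect of these attributes on the User schema can be sketched with a plain Java Comparator. Avro compares records field by field in declaration order, skips fields marked "ignore", and reverses the comparison for fields marked "descending". The sketch below assumes FirstName and LastName keep the default ascending order. The User class and SCHEMA_ORDER constant are hypothetical stand-ins, not Avro classes, and String.compareTo only approximates Avro's string ordering.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Standard-library sketch of Avro's record sort order; not the Avro API.
class SortOrderSketch {

    // Hypothetical stand-in for a User record.
    static final class User {
        final String firstName;
        final String lastName;
        final boolean isActive;
        final int account;

        User(String firstName, String lastName, boolean isActive, int account) {
            this.firstName = firstName;
            this.lastName = lastName;
            this.isActive = isActive;
            this.account = account;
        }
    }

    // Fields are compared in declaration order:
    //   FirstName, LastName -> ascending (assumed default)
    //   isActive            -> "ignore", so it is left out entirely
    //   Account             -> "descending", so its comparison is reversed
    static final Comparator<User> SCHEMA_ORDER =
        Comparator.comparing((User u) -> u.firstName)
            .thenComparing((User u) -> u.lastName)
            .thenComparing(Comparator.comparingInt((User u) -> u.account).reversed());

    public static void main(String[] args) {
        List<User> users = new ArrayList<>();
        users.add(new User("Joe", "Hadoop", true, 100));
        users.add(new User("Joe", "Hadoop", false, 200));
        users.add(new User("Ann", "Avro", true, 5));
        users.sort(SCHEMA_ORDER);
        for (User u : users) {
            System.out.println(u.firstName + " " + u.lastName + " " + u.account);
        }
    }
}
```

Two records that differ only in isActive compare as equal under this ordering, which is what "ignore" means for sorting.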
How to write Avro records in a file?
File file = new File("<file-name>");
DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(writer);
// create() writes the file header, which includes the schema
dataFileWriter.create(schema, file);
for (GenericRecord rec : list) {
    dataFileWriter.append(rec);
}
dataFileWriter.close();
How to read Avro records from a file?
File file = new File("<file-name>");
// No schema is passed in: the reader uses the schema stored in the file header
DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(file, reader);
while (dataFileReader.hasNext()) {
    GenericRecord r = dataFileReader.next();
    System.out.println(r.toString());
}
dataFileReader.close();
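Before handing a file to a reader, it can be useful to check that it is an Avro data file at all. According to the Avro specification, an object container file starts with the four bytes 'O', 'b', 'j', 1. The following is a minimal sketch of such a check using only the Java standard library (Java 9 or later for readNBytes). The class name AvroFileCheck and the method looksLikeAvroContainer are hypothetical, not part of the Avro API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Standard-library sketch: checks only the leading magic bytes, nothing else.
class AvroFileCheck {

    // Avro object container files begin with 'O', 'b', 'j', followed by the byte 1.
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};

    static boolean looksLikeAvroContainer(Path path) throws IOException {
        try (InputStream in = Files.newInputStream(path)) {
            byte[] header = new byte[MAGIC.length];
            int read = in.readNBytes(header, 0, header.length); // fewer bytes if the file is short
            return read == MAGIC.length && Arrays.equals(header, MAGIC);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java AvroFileCheck <file-name>");
            return;
        }
        System.out.println(looksLikeAvroContainer(Paths.get(args[0])));
    }
}
```

A passing check says nothing about whether the rest of the file is intact; DataFileReader remains the authority on that.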
Running MapReduce Jobs on Avro Data
1. Set the input schema on AvroJob based on the schema of the file at the input path
File file = new File(DATA_PATH);
DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(file, reader);
Schema s = dataFileReader.getSchema();
AvroJob.setInputSchema(job, s);
Running MapReduce Jobs on Avro Data - Mapper
public static class MapImpl extends AvroMapper<GenericRecord, Pair<String, GenericRecord>> {
    @Override
    public void map(GenericRecord datum,
                    AvroCollector<Pair<String, GenericRecord>> collector,
                    Reporter reporter) throws IOException {
        ….
    }
}
Running MapReduce Jobs on Avro Data - Reducer
public static class ReduceImpl extends AvroReducer<Utf8, GenericRecord, GenericRecord> {
    @Override
    public void reduce(Utf8 key, Iterable<GenericRecord> values,
                       AvroCollector<GenericRecord> collector,
                       Reporter reporter) throws IOException {
        // Emit only the first record seen for each key
        collector.collect(values.iterator().next());
    }
}
Running Avro MapReduce Jobs on Data with Different schema
List<Schema> schemas = new ArrayList<Schema>();
schemas.add(schema1);
schemas.add(schema2);
Schema schema3 = Schema.createUnion(schemas);
Using the union as the input schema allows a job to read data from sources with different schemas and process all of it in the same mapper.
Summary
• Avro is a great tool for working with semi-structured and structured data
• It simplifies MapReduce development
• It provides a good compression mechanism
• It is a great tool for converting existing SQL code
• Questions?