Parquet and Impala overview external

1 Parquet data format & Impala overview

Description

Evaluation of Hadoop data formats along with Parquet, and a brief intro and overview of Impala.

Transcript of Parquet and Impala overview external

Page 1: Parquet and impala overview external


Parquet data format & Impala overview

Page 2: Parquet and impala overview external


Agenda

• Objective

• Various data formats

• Use case

• Parquet

• Impala

Page 3: Parquet and impala overview external


Objective

• Twofold:

• Quest for a more performant data format than Avro for nested data

• Understand and test new data formats in general

Page 4: Parquet and impala overview external


Hadoop data formats

• Sequence file. Stores key-value pairs of data in a flat binary file; rows are stored as values.

• ORC. Stores column-oriented data. Adds RLE and dictionary encoding, statistics, and single-file output. Bloom filters will be added.

• Avro. Data serialization framework: a serialization format and exchange service, usable from any language. Data is accompanied by its schema (in JSON). Supports schema evolution.
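
As a small illustration of "data accompanied by schema (in JSON)", here is a sketch that parses an Avro schema with the standard Avro Java library. The record and field names are illustrative only (they echo the log fields on a later slide), not the actual production schema.

import org.apache.avro.Schema;

// Illustrative Avro schema in JSON, parsed with the Avro Java API.
// The record and field names are examples, not the real log schema.
public class AvroSchemaExample {
    public static void main(String[] args) {
        String json =
            "{ \"type\": \"record\", \"name\": \"LogEvent\", \"fields\": ["
          + "  { \"name\": \"userId\",    \"type\": \"string\" },"
          + "  { \"name\": \"timestamp\", \"type\": \"long\" },"
          // A nullable field with a default value is one way a schema can evolve:
          // readers using the new schema can still consume old data.
          + "  { \"name\": \"cpuTime\",   \"type\": [\"null\", \"long\"], \"default\": null }"
          + "] }";
        Schema schema = new Schema.Parser().parse(json);
        System.out.println(schema.toString(true));   // pretty-print the schema
    }
}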

Page 5: Parquet and impala overview external


Parquet

• Columnar storage.

• Automatic dictionary encoding and run-length encoding. Separation of encoding vs compression.

• Run-length encoding replaces sequences ("runs") of consecutive repeated characters (or other units of data) with a single instance of the value and the length of the run (see the sketch below).

• Dictionary encoding takes the distinct values present in a column and represents each one in a compact 2-byte form.
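
A minimal sketch of the run-length-encoding idea in plain Java; this shows only the concept, not Parquet's actual hybrid RLE/bit-packing implementation.

import java.util.ArrayList;
import java.util.List;

// Minimal illustration of run-length encoding over an int-valued column.
public class RleSketch {

    // Encodes a sequence of values as (value, runLength) pairs.
    static List<int[]> encode(int[] values) {
        List<int[]> runs = new ArrayList<>();
        int i = 0;
        while (i < values.length) {
            int value = values[i];
            int runLength = 1;
            while (i + runLength < values.length && values[i + runLength] == value) {
                runLength++;
            }
            runs.add(new int[] { value, runLength });
            i += runLength;
        }
        return runs;
    }

    public static void main(String[] args) {
        // A column with long runs of repeated values compresses very well.
        int[] column = { 7, 7, 7, 7, 3, 3, 9, 9, 9, 9, 9 };
        for (int[] run : encode(column)) {
            System.out.println("value=" + run[0] + " runLength=" + run[1]);
        }
        // Prints: value=7 runLength=4, value=3 runLength=2, value=9 runLength=5
    }
}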

Page 6: Parquet and impala overview external


Parquet

• Parquet can handle multiple schemas and supports schema evolution. For example:

• LogType A: organizationId, userId, timestamp, recordId, cpuTime

• LogType V: userId, organizationId, timestamp, foo, bar

• Can be used by any project in the Hadoop ecosystem. Integrations are provided for M/R, Pig, Hive, Cascading and Impala.

Page 7: Parquet and impala overview external


Parquet

• SELECT vs INSERT.

• Parquet tables require relatively little memory to query, because a query reads and decompresses data in 8MB chunks.

• Inserting into a Parquet table is a more memory-intensive operation because the data for each data file (with a maximum size of 1GB) is stored in memory until encoded, compressed, and written to disk.

Page 8: Parquet and impala overview external


Parquet

• Memory issues (heap space errors) were resolved by:

• Reducing parquet.block.size (see the sketch below). The block size is the size of a row group being buffered in memory, and its default value is 256 MB.

• The total memory allocated was around 1 GB.

• With multiple Hive partitions, multiple buffers were getting created (one for writing into each partition).

• So writing data using Parquet will always have a high memory requirement.

• Hive's DISTRIBUTE BY was the workaround to the memory issues!
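
As a rough sketch of reducing the row-group buffer from the Java side: the property name parquet.block.size comes from the slide, while the Hadoop Configuration usage and the 64 MB value are illustrative assumptions, not the exact setup used here.

import org.apache.hadoop.conf.Configuration;

// Sketch: lower the Parquet row-group size that is buffered in memory per
// open output file. "parquet.block.size" is the property named on the slide;
// the 64 MB value is only an example (the slide quotes a 256 MB default).
public class ParquetBlockSizeConfig {

    public static Configuration withSmallerRowGroups() {
        Configuration conf = new Configuration();
        conf.setInt("parquet.block.size", 64 * 1024 * 1024);
        return conf;
    }

    public static void main(String[] args) {
        Configuration conf = withSmallerRowGroups();
        System.out.println("parquet.block.size = " + conf.getInt("parquet.block.size", -1));
    }
}

From Hive, the same property can be set with the SET command before running the INSERT.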

Page 9: Parquet and impala overview external


Parquet vs other formats

Performance test with 100 GB of data over multiple queries

Parquet wins

Page 10: Parquet and impala overview external


Impala overview

• MPP implementation of a query engine.

• Impala vs Hive: Impala targets interactive, exploratory SQL analytics on large data sets, whereas Hive runs as batch.

• Does not use M/R, but does use HDFS.

• Not CEP; closer to an RDBMS.

• Impala uses the same metadata store as Hive to record information about table structure and properties.

Page 11: Parquet and impala overview external


Impala overview

• Can create a table in Hive and use it in Impala.

• E.g. Impala doesn't support Avro, but Hive does.

• The query language is a mix between SQL and HiveQL.

• Requires a lot of memory (128 GB minimum per node).

• Initial load of data is via REFRESH, which can take a lot of time.

• REFRESH loads the block location data for newly added data files.

Page 12: Parquet and impala overview external


Impala overview

• Shortcomings

• Impala doesn't support nested types at this point (version 1.2.3): a table cannot contain nested types such as array, map, or struct.

• Impala currently does not "spill to disk" if intermediate results being processed on a node exceed the memory reserved for Impala on that node.

• No custom Serializer/Deserializer classes (SerDes).

• Impala cancels a running query if any host on which that query is executing fails.

Page 13: Parquet and impala overview external


Impala overview

• Example. There are three ways to create a PARQUET table in IMPALA:

• Use a PARQUET table created in HIVE (with no nested data types).

• Create a normal text table in IMPALA, load it with data, and clone its layout as Parquet:

• IMPALA> create table parquet_table_name LIKE text_table_name STORED AS PARQUET LOCATION '/user/hdfs/..';

• Create a Parquet-format table and then insert into it from the normal text table:

• IMPALA> insert overwrite table parquet_table_name select * from text_table_name;

Page 14: Parquet and impala overview external


Use Case

• Can't query the Avro table in Impala because it has nested columns.

• An Avro table created through Hive can be used in Impala as long as it contains only Impala-compatible data types.

• (It cannot contain nested types such as array, map, or struct.)

Page 15: Parquet and impala overview external


Use Case

• How to deal with nested XML data in Hadoop?

• There is no direct mapping from XML to Avro. The process goes:

• Parse XML and convert to Avro: parse the XML using XMLStreamReader, perform JAXB unmarshalling, and create Avro records from the JAXB objects. A Java class needs to be written for this (see the sketch after this list). Tried using Parquet/Avro:

• Tested: process the XML by first converting it into Avro and then storing it in Parquet format using the parquet-avro APIs.

• The problem is that the schema provided has some arrays whose items are a union of type string and null.

• Currently AvroSchemaConverter is not able to handle such an Avro schema and it throws an exception.

• Tested: Impala 1.2.3 on CDH 4.5.

• Impala doesn't support nested types at this point.
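
Below is a rough Java sketch of the XML-to-Avro-to-Parquet step described above, assuming parquet-avro's AvroParquetWriter. The schema, field names, and file path are illustrative, and the JAXB unmarshalling step is only indicated by a comment.

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.hadoop.fs.Path;

import parquet.avro.AvroParquetWriter;   // org.apache.parquet.avro in later parquet-mr releases

// Sketch of the pipeline on this slide: XML is unmarshalled (e.g. with JAXB),
// mapped onto Avro GenericRecords, and written to Parquet via parquet-avro.
// The schema and values below are illustrative, not the real log schema.
public class XmlToParquetSketch {

    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{ \"type\": \"record\", \"name\": \"LogEvent\", \"fields\": ["
      + "  { \"name\": \"userId\",    \"type\": \"string\" },"
      + "  { \"name\": \"timestamp\", \"type\": \"long\" }"
      + "] }");

    public static void main(String[] args) throws IOException {
        AvroParquetWriter<GenericRecord> writer =
            new AvroParquetWriter<GenericRecord>(new Path("/tmp/logs.parquet"), SCHEMA);
        try {
            // In the real pipeline these values would come from the JAXB objects
            // produced by unmarshalling the nested XML.
            GenericRecord record = new GenericRecordBuilder(SCHEMA)
                .set("userId", "u-001")
                .set("timestamp", 1400000000000L)
                .build();
            writer.write(record);
        } finally {
            writer.close();
        }
    }
}

The failure mentioned above appears when the Avro schema instead contains an array whose element type is a union of string and null; at the time, AvroSchemaConverter rejected that shape with an exception.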

Page 16: Parquet and impala overview external


Thank you