My life as a beekeeper


Transcript of My life as a beekeeper

Page 1: My life as a beekeeper

My life as a beekeeper

@89clouds

Page 2: My life as a beekeeper

Who am I?

Pedro Figueiredo ([email protected])

Hadoop et al

Social/Facebook games, media (TV, publishing)

Elastic MapReduce, Cloudera

NoSQL, as in “Not a SQL guy”

Page 3: My life as a beekeeper

The problem with Hive

It looks like SQL

Page 4: My life as a beekeeper

No, seriously:

SELECT CONCAT(vishi, vislo),
       SUM(CASE WHEN searchengine = 'google' THEN 1 ELSE 0 END) AS google_searches
FROM omniture
WHERE year(hittime) = 2011
  AND month(hittime) = 8
  AND is_search = 'Y'
GROUP BY CONCAT(vishi, vislo);

Page 5: My life as a beekeeper

“It’s just like Oracle!”

Analysts will be very happy

At least until they join with that 30-billion-record table

Pro tip: explain MapReduce to them, then introduce MAPJOIN:

set hive.mapjoin.smalltable.filesize=xxx; -- tables below this many bytes qualify for map-side joins
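As a minimal sketch (the dimension table and columns here are hypothetical), the hint tells Hive to load the small table into memory on every mapper and skip the reduce-side join entirely:

SELECT /*+ MAPJOIN(c) */ o.vishi, c.country_name
FROM omniture o
JOIN countries c ON (o.geo_country = c.country_code);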

Page 6: My life as a beekeeper

Your first interview question

“Explain the difference between CREATE TABLE and CREATE EXTERNAL TABLE”
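The short answer: Hive owns a managed table's data, but only the metadata of an external one. A minimal sketch, with hypothetical names:

-- Managed: DROP TABLE deletes the metadata AND the files under the warehouse directory
CREATE TABLE clicks (url STRING, ts BIGINT);

-- External: DROP TABLE deletes only the metadata; the files at LOCATION survive
CREATE EXTERNAL TABLE clicks_ext (url STRING, ts BIGINT)
LOCATION '/data/clicks';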

Page 7: My life as a beekeeper

Dynamic partitions

Partitions are the poor person’s indexes

Unstructured data is full of surprises:

set hive.exec.dynamic.partition.mode=nonstrict;      -- don't require a static partition column
set hive.exec.dynamic.partition=true;                -- enable dynamic partitioning
set hive.exec.max.dynamic.partitions=100000;         -- raise the per-job partition cap
set hive.exec.max.dynamic.partitions.pernode=100000; -- and the per-node cap

Plan your partitions ahead
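A minimal sketch of a dynamic-partition load (table and column names are hypothetical): the partition column goes last in the SELECT, and Hive creates one partition per distinct value it finds.

INSERT OVERWRITE TABLE events PARTITION (ds)
SELECT user_id, action, ds
FROM raw_events;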

Page 8: My life as a beekeeper

Multi-vitamins

You can minimise input scans by using multi-table INSERTs:

FROM input
INSERT INTO TABLE output1 SELECT foo
INSERT INTO TABLE output2 SELECT bar;
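Each INSERT branch can also carry its own WHERE clause, so a single scan of input can feed several differently filtered outputs.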

Page 9: My life as a beekeeper

Persistence, do you speak it?

External Hive metastore

Avoid the pain of cluster set up

Use an RDS instance if you're on AWS, any other RDBMS otherwise.

10GB will get you a long way; the metastore is tiny.
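A sketch of wiring this up (host and credentials are placeholders; these properties normally live in hive-site.xml rather than on the command line):

hive --hiveconf javax.jdo.option.ConnectionURL=jdbc:mysql://my-metastore-host:3306/hive_metastore \
     --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver \
     --hiveconf javax.jdo.option.ConnectionUserName=hive \
     --hiveconf javax.jdo.option.ConnectionPassword=...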

Page 10: My life as a beekeeper

Now you have 2 problems

Regular expressions are great, if you’re using a real programming language.

WHERE foo RLIKE '(a|b|c)' will hurt

WHERE foo='a' OR foo='b' OR foo='c'

Generate these statements if need be; it will pay off.

Page 11: My life as a beekeeper

Avro

Serialisation framework (think Thrift/Protocol Buffers).

Avro container files are SequenceFile-like, splittable.

Built-in support for Snappy compression.

If using the LinkedIn SerDe, the table creation syntax changes.

Page 12: My life as a beekeeper

Avro

CREATE EXTERNAL TABLE IF NOT EXISTS mytable
PARTITIONED BY (ds STRING)
ROW FORMAT SERDE 'com.linkedin.haivvreo.AvroSerDe'
WITH SERDEPROPERTIES ('schema.url'='hdfs:///user/hadoop/avro/myschema.avsc')
STORED AS
  INPUTFORMAT 'com.linkedin.haivvreo.AvroContainerInputFormat'
  OUTPUTFORMAT 'com.linkedin.haivvreo.AvroContainerOutputFormat'
LOCATION '/data/mytable';

Page 13: My life as a beekeeper

MAKE! MONEY! FAST!

Use spot instances in EMR

Usually stick around until America wakes up

Brilliant for worker nodes

Page 14: My life as a beekeeper

Bag of tricks

set hive.optimize.s3.query=true;          -- EMR-specific optimisation for data on S3
set hive.cli.print.header=true;           -- print column headers in CLI output
set hive.exec.max.created.files=xxx;      -- cap the number of files a job may create
set mapred.reduce.tasks=xxx;              -- set the number of reducers explicitly
set hive.exec.compress.intermediate=true; -- compress data passed between MR jobs
set hive.exec.parallel=true;              -- run independent stages in parallel


Page 20: My life as a beekeeper

To be or not to be

“Consider a traditional RDBMS”

At what size should we do this?

Hive is not an end, it’s the means

Data on HDFS/S3 is simply available, not “available to Hive”

Hive isn’t suitable for near real time

Page 21: My life as a beekeeper

Hive != MapReduce

Don’t use Hive as a substitute for native MapReduce or Streaming jobs

“I know, I’ll just stream this bit through a shell script!”

IMO, Hive excels at analysis and aggregation, so use it for that

Page 22: My life as a beekeeper

Thank you

Fred Easey (@poppa_f)

Peter Hanlon

Page 23: My life as a beekeeper

Questions?

[email protected]

@pfig / @89clouds

http://89clouds.com/