Transcript of Smart421 SyncNorwich Big Data on AWS by Robin Meehan

April 2013

Robin Meehan, CTO

SyncNorwich – Big Data on AWS

http://flickr.com/photos/brunogirin/68341710/

What the hell is it?

http://commons.wikimedia.org/wiki/File:Loud_environment_headphones.jpg

http://commons.wikimedia.org/wiki/File:Ferrari_156_85_in_2011.jpg

http://commons.wikimedia.org/wiki/File:Hundreds_and_thousands.jpg

http://www.flickr.com/photos/krishaamer/2836262962/

http://flickr.com/photos/42033648@N00/61053542

Storm

Spark

Dremel/Drill

Impala

AWS Redshift

etc., etc.

Big data exploitation – in practice


• Aviva has a number of brands/channels to market, including insurance aggregators (e.g. Compare The Market, GoCompare…)

• The raw aggregator quote data is at a scale that presents a 'Big Data' problem – there is great potential for gaining additional insights from this data

So…
• Define some candidate business questions
• Test them against significant volumes of data
• Measure cluster size/£/time performance

Introduction…


The example Aviva Use Case…

Driving AWS EMR…


AWS Elastic MapReduce – configuring a Hadoop cluster…
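The slides don't show the launch code itself, so here is a minimal sketch of driving EMR programmatically using today's boto3 SDK (the region, log bucket, key pair, instance type and release label are illustrative assumptions – modern EMR releases no longer support the m1 "Small" instance types used in the 2013 runs):

import boto3

# Sketch only: spin up a small Hadoop/Pig cluster on EMR.
# Bucket and key pair names are hypothetical.
emr = boto3.client('emr', region_name='eu-west-1')

response = emr.run_job_flow(
    Name='aggregator-quote-analysis',
    ReleaseLabel='emr-5.36.0',              # assumed; the talk predates release labels
    Applications=[{'Name': 'Hadoop'}, {'Name': 'Pig'}],
    LogUri='s3://my-log-bucket/emr-logs/',  # hypothetical bucket
    Instances={
        'MasterInstanceType': 'm4.large',   # stand-in for the talk's Small/Large nodes
        'SlaveInstanceType': 'm4.large',
        'InstanceCount': 10,                # matches the 10-node run below
        'KeepJobFlowAliveWhenNoSteps': True,
        'Ec2KeyName': 'my-keypair',         # hypothetical key pair
    },
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)
print('Cluster id:', response['JobFlowId'])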

Some pig…

Query B: ~10 million quotes (5m each channel). Joining quote data across different channels.

-- Register the XML-flattening UDFs and their dom4j dependency
register 's3n://ashaw-1/jars/myudfs.jar';
register 's3n://ashaw-1/jars/dom4j-1.6.1.jar';

-- Load the raw quote data for the two channels from S3
A = load 's3n://ashaw-1/Intermediate/duplicated/lots' using PigStorage();
Arac = load 's3n://ashaw-1/Intermediate/duplicated/lotsrac' using PigStorage();

-- Cap each channel at 5 million quotes
A1 = limit A 5000000;
Arac1 = limit Arac 5000000;

-- Flatten the XML quote payload in column 5 into a tuple of fields
B = foreach A1 generate myudfs.Flatten((chararray)$5);
Brac = foreach Arac1 generate myudfs.Flatten2((chararray)$5);

-- Join the two channels on their common field ($21), then filter on the
-- first field of each side
C = join B by (chararray)($0.$21), Brac by (chararray)($0.$21);
D = filter C by $1.$0 == 1 OR $0.$0 == 1;

STORE D INTO 's3n://ashaw-1/myoutputfolder/';
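In practice a script like this would be uploaded to S3 and submitted to the running cluster as an EMR step. A hedged boto3 sketch (the cluster id and script path are hypothetical – the original jars live under s3n://ashaw-1/, but the script's own location isn't shown in the talk):

import boto3

emr = boto3.client('emr', region_name='eu-west-1')

# Submit the Pig script as an EMR step via command-runner.jar.
emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',  # hypothetical cluster id
    Steps=[{
        'Name': 'Query B - join quotes across channels',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['pig-script', '--run-pig-script',
                     '--args', '-f',
                     's3://ashaw-1/scripts/queryB.pig'],  # hypothetical path
        },
    }],
)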

XML Flattening results:

• 10 Million quotes:

Costs per run…

Cluster size       Time to execute   Compute hours   Approx. cost per run
10 x Small nodes   64 minutes        11              $1.155 (approx. £0.72)
19 x Small nodes   31 minutes        20              $2.10 (approx. £1.30)
8 x Large nodes    19 minutes        8               $3.78 (approx. £2.34)
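The arithmetic behind the table is simply compute hours multiplied by a per-instance-hour rate. A quick sketch to reproduce the Small-node rows (the $0.105/hour rate and the ~$1.60/£ exchange rate are back-calculated from the figures above, not quoted from the 2013 price list):

# Reproduce the cost-per-run arithmetic for the Small-node runs.
USD_PER_SMALL_HOUR = 0.105  # implied rate: $1.155 / 11 compute hours
USD_PER_GBP = 1.60          # rough 2013 exchange rate, back-calculated

def cost_per_run(compute_hours):
    usd = compute_hours * USD_PER_SMALL_HOUR
    return usd, usd / USD_PER_GBP

for nodes, minutes, hours in [(10, 64, 11), (19, 31, 20)]:
    usd, gbp = cost_per_run(hours)
    print(f'{nodes} Small nodes, {minutes} min: '
          f'{hours} compute hours -> ${usd:.3f} (~£{gbp:.2f})')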

But we could have used spot instances…
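In boto3 terms, switching the workers to spot capacity is just a change to the instance-group definition passed to run_job_flow() – a sketch with an illustrative bid price:

# Illustrative only: keep the master on-demand, bid for the core nodes on
# the spot market. Passed as Instances={'InstanceGroups': instance_groups, ...}.
instance_groups = [
    {'InstanceRole': 'MASTER', 'Market': 'ON_DEMAND',
     'InstanceType': 'm4.large', 'InstanceCount': 1},
    {'InstanceRole': 'CORE', 'Market': 'SPOT',
     'BidPrice': '0.05',  # hypothetical bid, $/hour
     'InstanceType': 'm4.large', 'InstanceCount': 9},
]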

Visualisation

• It will be a similar adoption pattern to cloud:
− Those organisations that make it work and gain additional business insights will:
• market more accurately
• sell more
• have less customer churn
• have better-paying customers

• Market forces will eventually force adoption – or failure at the hands of competitors that do adopt – all other things being equal. It's Darwinian evolutionary forces at work in the marketplace.

• Interestingly, the cost of exploiting big data (well – at least of finding out whether there is value you are missing out on) is now very low thanks to vendors such as AWS, so it's a market advantage that is relatively cheap to attain
− i.e. we're talking about a few savvy, enabled staff and some "pay as you go" compute resources

Wrapping up…