Transcript of Smart421 SyncNorwich Big Data on AWS by Robin Meehan

April 2013

Robin Meehan, CTO

SyncNorwich – Big Data on AWS

http://flickr.com/photos/brunogirin/68341710/

What the hell is it?

http://commons.wikimedia.org/wiki/File:Loud_environment_headphones.jpg

http://commons.wikimedia.org/wiki/File:Ferrari_156_85_in_2011.jpg

http://commons.wikimedia.org/wiki/File:Hundreds_and_thousands.jpg

http://www.flickr.com/photos/krishaamer/2836262962/

http://flickr.com/photos/42033648@N00/61053542

Storm

Spark

Dremel/Drill

Impala

AWS Redshift

etc., etc.

Big data exploitation – in practice


• Aviva has a number of brands/channels to market, including insurance aggregators (e.g. Compare The Market, GoCompare…)

• The raw aggregator quote data is at a scale that presents a 'Big Data' problem – there is great potential for gaining additional insights from this data

So…
• Define some candidate business questions
• Test them against significant volumes of data
• Measure cluster size/£/time performance

Introduction…


The example Aviva Use Case…

Driving AWS EMR…


AWS Elastic MapReduce – configuring a Hadoop cluster…
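The slides don't show the launch code itself, so here is a minimal sketch of driving EMR programmatically using today's boto3 SDK (the region, log bucket, key pair, instance type and release label are illustrative assumptions – modern EMR releases no longer support the m1 "Small" instance types used in the 2013 runs):

import boto3

# Sketch only: spin up a small Hadoop/Pig cluster on EMR.
# Bucket and key pair names are hypothetical.
emr = boto3.client('emr', region_name='eu-west-1')

response = emr.run_job_flow(
    Name='aggregator-quote-analysis',
    ReleaseLabel='emr-5.36.0',              # assumed; the talk predates release labels
    Applications=[{'Name': 'Hadoop'}, {'Name': 'Pig'}],
    LogUri='s3://my-log-bucket/emr-logs/',  # hypothetical bucket
    Instances={
        'MasterInstanceType': 'm4.large',   # stand-in for the talk's Small/Large nodes
        'SlaveInstanceType': 'm4.large',
        'InstanceCount': 10,                # matches the 10-node run below
        'KeepJobFlowAliveWhenNoSteps': True,
        'Ec2KeyName': 'my-keypair',         # hypothetical key pair
    },
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)
print('Cluster id:', response['JobFlowId'])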

Some pig…

Query B: ~10 million quotes (5m each channel). Joining quote data across different channels.

-- Register the XML-flattening UDFs and their dom4j dependency
register 's3n://ashaw-1/jars/myudfs.jar';
register 's3n://ashaw-1/jars/dom4j-1.6.1.jar';

-- Load the raw quote data for the two channels from S3
A = load 's3n://ashaw-1/Intermediate/duplicated/lots' using PigStorage();
Arac = load 's3n://ashaw-1/Intermediate/duplicated/lotsrac' using PigStorage();

-- Cap each channel at 5 million quotes
A1 = limit A 5000000;
Arac1 = limit Arac 5000000;

-- Flatten the XML quote payload in column 5 into a tuple of fields
B = foreach A1 generate myudfs.Flatten((chararray)$5);
Brac = foreach Arac1 generate myudfs.Flatten2((chararray)$5);

-- Join the two channels on their common field ($21), then filter on the
-- first field of each side
C = join B by (chararray)($0.$21), Brac by (chararray)($0.$21);
D = filter C by $1.$0 == 1 OR $0.$0 == 1;

STORE D INTO 's3n://ashaw-1/myoutputfolder/';
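In practice a script like this would be uploaded to S3 and submitted to the running cluster as an EMR step. A hedged boto3 sketch (the cluster id and script path are hypothetical – the original jars live under s3n://ashaw-1/, but the script's own location isn't shown in the talk):

import boto3

emr = boto3.client('emr', region_name='eu-west-1')

# Submit the Pig script as an EMR step via command-runner.jar.
emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',  # hypothetical cluster id
    Steps=[{
        'Name': 'Query B - join quotes across channels',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['pig-script', '--run-pig-script',
                     '--args', '-f',
                     's3://ashaw-1/scripts/queryB.pig'],  # hypothetical path
        },
    }],
)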

XML Flattening results:

• 10 Million quotes:

Costs per run…

Cluster size       Time to execute   Compute hours   Approx. cost per run
10 x Small nodes   64 minutes        11              $1.155 (approx. £0.72)
19 x Small nodes   31 minutes        20              $2.10 (approx. £1.30)
8 x Large nodes    19 minutes        8               $3.78 (approx. £2.34)
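The arithmetic behind the table is simply compute hours multiplied by a per-instance-hour rate. A quick sketch to reproduce the Small-node rows (the $0.105/hour rate and the ~$1.60/£ exchange rate are back-calculated from the figures above, not quoted from the 2013 price list):

# Reproduce the cost-per-run arithmetic for the Small-node runs.
USD_PER_SMALL_HOUR = 0.105  # implied rate: $1.155 / 11 compute hours
USD_PER_GBP = 1.60          # rough 2013 exchange rate, back-calculated

def cost_per_run(compute_hours):
    usd = compute_hours * USD_PER_SMALL_HOUR
    return usd, usd / USD_PER_GBP

for nodes, minutes, hours in [(10, 64, 11), (19, 31, 20)]:
    usd, gbp = cost_per_run(hours)
    print(f'{nodes} Small nodes, {minutes} min: '
          f'{hours} compute hours -> ${usd:.3f} (~£{gbp:.2f})')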

But we could have used spot instances…
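In boto3 terms, switching the workers to spot capacity is just a change to the instance-group definition passed to run_job_flow() – a sketch with an illustrative bid price:

# Illustrative only: keep the master on-demand, bid for the core nodes on
# the spot market. Passed as Instances={'InstanceGroups': instance_groups, ...}.
instance_groups = [
    {'InstanceRole': 'MASTER', 'Market': 'ON_DEMAND',
     'InstanceType': 'm4.large', 'InstanceCount': 1},
    {'InstanceRole': 'CORE', 'Market': 'SPOT',
     'BidPrice': '0.05',  # hypothetical bid, $/hour
     'InstanceType': 'm4.large', 'InstanceCount': 9},
]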

Visualisation

• It will be a similar adoption pattern to cloud:
− Those organisations that make it work and gain additional business insights will:
• market more accurately
• sell more
• have less customer churn
• have better-paying customers

• Market forces will eventually force adoption – or failure at the hands of competitors that do adopt – all other things being equal. It's Darwinian evolutionary forces at work in the marketplace.

• Interestingly, the cost of exploiting big data (well – at least of finding out whether there is value you are missing out on) is now very low thanks to vendors such as AWS, so it's a market advantage that is relatively cheap to attain
− i.e. we're talking about a few savvy, enabled staff and some "pay as you go" compute resources

Wrapping up…