Smart421: Star Testing Event, Big Data, London, 26 Sept 2013 (R Meehan, v1.0)
DESCRIPTION
Contains the slides presented by Robin Meehan, CTO at Smart421 Ltd, at the Star Testing Event - Big Data in London on Thursday 26 Sept 2013. This deck comprises 21 slides in total. More details on the case study for Aviva / Quotemehappy.com can be found at http://www.smart421.com/customers/casestudies/quotemehappy.asp
TRANSCRIPT
10 April 2023
Star Testing Event – Big Data
Robin Meehan, CTO, Smart421
http://commons.wikimedia.org/wiki/File:Encyclopedia_Britannica_series.JPG
http://commons.wikimedia.org/wiki/File:Rail_disruption_at_Buckley.jpg
Cloud…
http://commons.wikimedia.org/wiki/File:Felicidade_A_very_happy_boy.jpg
http://commons.wikimedia.org/wiki/File:Eyjafjallaj%C3%B6kull_%284530326521%29.jpg
…Big Data!
http://commons.wikimedia.org/wiki/File:Woolworths_Beccles_-_geograph.org.uk_-_1077646.jpg
So what’s changed?
Big data exploitation – in practice
Case study
Aviva have a number of brands/channels to market, including insurance aggregators (e.g. Compare the Market, GoCompare…)
The raw aggregator quote data is of a scale to present a ‘Big Data’ problem – there is great potential for gaining additional insights from this data
Define some candidate business questions
Test them against significant volumes of data
Measure cluster size/£/time performance
Driving AWS EMR…
AWS Elastic MapReduce… configuring a Hadoop cluster…
Some pig…
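-- Register the custom UDF jar and the dom4j XML library it uses.
register 's3n://ashaw-1/jars/myudfs.jar';
register 's3n://ashaw-1/jars/dom4j-1.6.1.jar';
-- Load the two quote datasets from S3.
A = load 's3n://ashaw-1/Intermediate/duplicated/lots' using PigStorage();
Arac = load 's3n://ashaw-1/Intermediate/duplicated/lotsrac' using PigStorage();
-- Cap each input at 5 million records.
A1 = limit A 5000000;
Arac1 = limit Arac 5000000;
-- Flatten field $5 of each record with the custom UDFs.
B = foreach A1 generate myudfs.Flatten((chararray)$5);
Brac = foreach Arac1 generate myudfs.Flatten2((chararray)$5);
-- Join the two datasets on field $21 of the flattened tuple.
C = join B by (chararray)($0.$21), Brac by (chararray)($0.$21);
-- Keep joined rows where field $0 of either side equals 1.
D = filter C by $1.$0 == 1 OR $0.$0 == 1;
-- Write the result back to S3.
STORE D INTO 's3n://ashaw-1/myoutputfolder/';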
Query B: ~10 million quotes (5m each channel). Joining quote data across different channels.
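For readers unfamiliar with Pig, the join-and-filter shape of the script can be sketched in plain Python on a handful of in-memory records. This is only an illustration under stated assumptions: the record layout, the sample values, and the helper names join_and_filter and make_record are hypothetical, and the real Flatten UDFs (whose output is not shown in the deck) are replaced here by already-flattened tuples.

```python
from collections import defaultdict

KEY_INDEX = 21   # position of the join key, mirroring $0.$21 in the Pig script
FLAG_INDEX = 0   # position of the value tested by the filter, mirroring $0.$0 / $1.$0


def join_and_filter(channel_a, channel_b):
    """Inner-join two lists of flattened quote tuples on KEY_INDEX,
    keeping pairs where either side has FLAG_INDEX equal to 1."""
    # Index one side by key so the join is a hash join, not a nested loop.
    by_key = defaultdict(list)
    for rec in channel_b:
        by_key[str(rec[KEY_INDEX])].append(rec)

    joined = []
    for rec_a in channel_a:
        for rec_b in by_key.get(str(rec_a[KEY_INDEX]), []):
            if rec_b[FLAG_INDEX] == 1 or rec_a[FLAG_INDEX] == 1:
                joined.append((rec_a, rec_b))
    return joined


def make_record(flag, key):
    """Build a 22-field tuple with filler values (hypothetical layout)."""
    fields = [None] * 22
    fields[FLAG_INDEX] = flag
    fields[KEY_INDEX] = key
    return tuple(fields)


if __name__ == "__main__":
    a = [make_record(1, "Q1"), make_record(0, "Q2"), make_record(0, "Q3")]
    b = [make_record(0, "Q1"), make_record(0, "Q2"), make_record(1, "Q3")]
    for left, right in join_and_filter(a, b):
        print(left[KEY_INDEX], left[FLAG_INDEX], right[FLAG_INDEX])
```

On a Hadoop cluster the same logic is spread across many nodes, which is what makes it practical at the ~10 million quote scale; the sketch only shows what each joined row has to satisfy.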
Visualisation
Costs per run…
Cluster size / Time to execute / Approx. cost:
10 x Small nodes / 64 minutes / 11 compute hours - $1.155 per hour (approx. £0.72)
19 x Small nodes / 31 minutes / 20 compute hours - $2.10 per hour (approx. £1.30)
8 x Large nodes / 19 minutes / 8 compute hours - $3.78 per hour (approx. £2.34)
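As a rough worked example, the per-run cost can be estimated from the hourly rates and run times on this slide. The sketch below reads each "$ per hour" figure as the rate for the whole cluster, which is an interpretation, since the slide does not spell it out. It shows two estimates: a pro-rata figure, and one that rounds the run time up to whole hours (an assumption about how the hours were billed). The function names are illustrative only.

```python
import math

# (label, cluster rate in USD per hour, run time in minutes), copied from the slide.
RUNS = [
    ("10 x Small nodes", 1.155, 64),
    ("19 x Small nodes", 2.10, 31),
    ("8 x Large nodes", 3.78, 19),
]


def pro_rata_cost(rate_per_hour, minutes):
    """Cost if only the minutes actually used were charged."""
    return rate_per_hour * minutes / 60.0


def billed_hours(minutes):
    """Run time rounded up to whole hours (assumed billing granularity)."""
    return math.ceil(minutes / 60)


def whole_hour_cost(rate_per_hour, minutes):
    """Cost if every started hour is charged in full."""
    return rate_per_hour * billed_hours(minutes)


if __name__ == "__main__":
    for label, rate, minutes in RUNS:
        print(f"{label}: pro rata ${pro_rata_cost(rate, minutes):.2f}, "
              f"whole hours ${whole_hour_cost(rate, minutes):.2f}")
```

Under either assumption each run comes to a few dollars at most, so trying several cluster sizes for the same query is cheap.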
But we could have used spot instances…
http://commons.wikimedia.org/wiki/File:Binoculars_25x100.jpg
Challenges…
http://flickr.com/photos/42033648@N00/61053542
Storm
Spark
Dremel/Drill
Impala
AWS Redshift
Accumulo
etc etc
Trends…
Blog: http://smart421.wordpress.com/tag/big-data/
Thank you!
Robin Meehan
CTO, Smart421
Spare slides…
http://commons.wikimedia.org/wiki/File:Loud_environment_headphones.jpg
http://commons.wikimedia.org/wiki/File:Ferrari_156_85_in_2011.jpg
http://commons.wikimedia.org/wiki/File:Hundreds_and_thousands.jpg
http://www.flickr.com/photos/krishaamer/2836262962/