cloudcamp chicago - big data & cloud may 2015 - all slides


Upload: cloudcamp-chicago

Post on 27-Jul-2015


TRANSCRIPT

Page 1: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

CloudCamp Chicago

“Big Data and Cloud”

#cloudcamp@CloudCamp_CHI

Sponsored by

Hosted by

Page 2: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Emcee

Margaret Walker, Cohesive Networks

Tweet: @CloudCamp_Chi #cloudcamp


Page 3: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

… sponsored by you!

William Knowles - Evident.io
Adam Kallish - IBM
Craig Hancock - HealthEngine
Brandon Pittman - VMware
Chuck Mackie - Maven Wave Partners
Brad Foster - Maven Wave Partners
Kim Neuwirth - Narrative Science
Pia Opulencia - Narrative Science
Jim Stiller - Cloud Technology Partners Networks
Brian Lickenbrock - EY

Page 4: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Agenda

6:00 pm: Introductions
6:05 pm: Lightning Talks

•  "Big Data without Big Infrastructure" - Dan Chuparkoff, VP of Product at Civis Analytics @Chuparkoff
•  "Simplicity, Storytelling and Big Data" - Craig Booth, Data Engineer at Narrative Science @craigmbooth
•  "Spark: A Quick Ignition" - Matthew Kemp, Team Lead & Engineer of Things at Signal @mattkemp
•  "Building warehousing systems on Redshift" - Tristan Crockett, Software Engineer at Edgeflip @thcrock

7:00 pm: Unpanel
7:45 pm: Unconference / Networking, drinks and pizza


Page 5: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

"Big Data without Big Infrastructure"

Dan Chuparkoff, VP of Product at Civis Analytics

Tweet: @Chuparkoff #cloudcamp


Page 6: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

@chuparkoff

BIG Data without

BIG Infrastructure

Dan Chuparkoff

VP of Product

Civis Analytics

Page 7: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

@chuparkoff Big Data without Big Infrastructure

Civis is an easy-to-use, incredibly extensible data science platform in the cloud for teams who want to make great data-driven decisions to drive their organizations forward.

I work at Civis

Page 8: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

“The ability to use the data that you’ve built up in the past

to inform & improve what you’re going to do in the future.”

Big Data at Civis Analytics

Page 9: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Data science is too damn hard

Why can’t I…

•  have a report every day that says what happened yesterday?
•  apply predictive modeling to improve my customer retention?
•  use data from my past to improve acquisition in the future?

Page 10: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Everyone’s story

•  Aggregate
•  Unify
•  Explore
•  Optimize
•  Share
•  Automate

Page 11: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Where should we start?

Cloud vs. On-Prem

Page 12: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Civis Analytics uses AWS

Page 13: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

•  No hardware costs and infinitely scalable

•  Safety and security of AWS

•  Automatic backups to multiple data centers

•  Access from any computer with an internet connection

Page 14: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Redshift, S3, EC2, DynamoDB, RDS, EMR

Page 15: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Page 16: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Page 17: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Page 18: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Civis data streams aggregate data from virtually any source.

Get all of your data together in one place.

Aggregate

From data to activation

Page 19: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Next, Civis’ intelligent matching algorithms link data in disparate data stores. No matter where your data starts, Civis helps you build a unified data repository.

Unify

From data to activation

Page 20: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Explore and transform the data in a fast analytics database.

Explore

From data to activation

Page 21: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Build powerful predictive models and easily score results with the Civis platform’s advanced modeling engine. This is the heart of data-driven decision making!

Optimize

From data to activation

Page 22: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Create, automate, & share reports across your team. Empower your entire organization to move forward with precision.

Share

From data to activation

Page 23: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

When tomorrow comes there’s no need to reinvent the wheel. Civis lets you automate and schedule from start to finish, so you can get back to pushing boundaries.

Automate

From data to activation

Page 24: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Big Data + the Cloud + AWS helps Civis Analytics turn

an analyst into a data scientist & a data scientist

into a team of data scientists.

Thanks!

Page 25: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

"Simplicity, Storytelling and Big Data"

Craig Booth, Data Engineer at Narrative Science

Tweet: @craigmbooth #cloudcamp


Page 26: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Simplicity, Storytelling & Big Data

Craig Booth

What I Wish I Knew About Big Data On Day One.

Page 27: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

My Background

data driven science

30+ journal articles; complex analytics on 10s of TB of data

data powered storytelling

Page 28: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides
Page 29: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

lumière léger

Page 30: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides
Page 31: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides
Page 32: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Credit: Josh Bloom and Henrik Brink of wise.io

Page 33: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

“…more than 2000 hours of work in order to come up with the final combination of 107 algorithms that gave them this prize”

Xavier Amatriain and Justin Basilico, Netflix

Page 34: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

“We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment.”

Xavier Amatriain and Justin Basilico, Netflix

Explainability: Can I communicate results?
Implementability: How long will it take me to build?
Accuracy: Can I tolerate some errors?

Page 35: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

"Spark: A Quick Ignition"

Matthew Kemp, Team Lead & Engineer of Things at Signal

Tweet: @mattkemp #cloudcamp


Page 36: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Spark: A Quick Ignition

Matthew Kemp

Page 37: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Provides distributed processing

Main unit of abstraction is the RDD

Can be used with cluster managers like Mesos or YARN

Supports Java, Python and Scala

https://spark.apache.org/

What is Spark?

Page 38: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Can be created from…
•  Files or HDFS
•  In memory iterable
•  Cassandra or SQL tables

Transformations
•  Lazily create a new RDD from an existing one

Actions
•  Usually return a value, force computation of RDD

Resilient Distributed Dataset
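The difference between lazy transformations and eager actions can be sketched without a cluster. The block below is a toy, single-process stand-in written for illustration only: `ToyRDD` and its methods are hypothetical names, not part of Spark, and nothing here is distributed or resilient. It only shows that chaining transformations does no work until an action is called.

```python
class ToyRDD:
    """Toy stand-in for an RDD: NOT Spark, just lazy iteration in one process."""

    def __init__(self, source):
        # source is a zero-argument callable that returns a fresh iterator
        self._source = source

    @classmethod
    def from_iterable(cls, items):
        items = list(items)
        return cls(lambda: iter(items))

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, fn):
        return ToyRDD(lambda: (fn(x) for x in self._source()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._source() if pred(x)))

    # Actions: walk the whole chain and return a value.
    def collect(self):
        return list(self._source())

    def count(self):
        return sum(1 for _ in self._source())


calls = []


def square(x):
    calls.append(x)  # record when the function really runs
    return x * x


squares = ToyRDD.from_iterable(range(5)).map(square).filter(lambda x: x % 2 == 0)
ran_before_action = len(calls)  # still 0: only the plan has been built
result = squares.collect()      # the action triggers the computation
```

In PySpark itself the same chain would be written against a real context, for example `sc.parallelize(range(5)).map(square).filter(...).collect()`.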

Page 39: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Some examples: filter, map, flatMap, distinct, union, intersection, join, reduceByKey

Transformations

Page 40: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Some examples: reduce, collect, take, count, foreach, saveAsTextFile

Actions

Page 42: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Example: Word Count

input → map() → flatMap() → reduceByKey() → map() → output

Page 43: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

    #!/bin/python
    import re
    import string

    regex = re.compile('[%s]' % re.escape(string.punctuation))

    def word_count(sc, in_file_name, out_file_name):
        sc.textFile(in_file_name) \
            .map(lambda line: regex.sub(' ', line).strip().lower()) \
            .flatMap(lambda line: [(word, 1) for word in line.split()]) \
            .reduceByKey(lambda a, b: a + b) \
            .map(lambda pair: '%s,%s' % pair) \
            .saveAsTextFile(out_file_name)  # pair is a (word, count) tuple

Example: Word Count

Page 44: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

    #!/bin/python
    import re
    import string

    regex = re.compile('[%s]' % re.escape(string.punctuation))

    def word_count(sc, in_file_name, out_file_name):
        sc.textFile(in_file_name) \
            .map(lambda line: regex.sub(' ', line)) \
            .map(lambda line: line.strip()) \
            .map(lambda line: line.lower()) \
            .flatMap(lambda line: line.split()) \
            .map(lambda word: (word, 1)) \
            .reduceByKey(lambda a, b: a + b) \
            .map(lambda pair: '%s,%s' % pair) \
            .saveAsTextFile(out_file_name)  # pair is a (word, count) tuple

Example: Alternate Word Count
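For checking results on a small input without a Spark installation, the same stages can be mirrored in plain Python. This is a local sketch, not the speaker's code: `local_word_count` is a hypothetical helper name, and the final sort is an addition so that the output order is stable.

```python
import re
import string
from collections import Counter

PUNCT = re.compile('[%s]' % re.escape(string.punctuation))


def local_word_count(lines):
    """Count words in an iterable of lines, mirroring the Spark stages."""
    counts = Counter()
    for line in lines:
        cleaned = PUNCT.sub(' ', line).strip().lower()  # the first map() stage
        for word in cleaned.split():                    # the flatMap() stage
            counts[word] += 1                           # the reduceByKey() stage
    # The final map() stage: format each pair as 'word,count'.
    return ['%s,%s' % (word, n) for word, n in sorted(counts.items())]


sample = ["An arm, an ARM!", "All about armour."]
formatted = local_word_count(sample)
```

Unlike the Spark version, this holds every count in one process, so it is only useful for small files and for verifying the logic.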

Page 45: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

    $ pyspark
    ...
    Using Python version 2.7.2 (default)
    SparkContext available as sc.
    >>> from word_count import word_count
    >>> word_count(sc, 'text.txt', 'text_counts')

Running the Example

Page 46: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

The Results From Spark

    a,23
    able,1
    about,6
    above,1
    accept,1
    accuse,1
    ago,2
    alarm,2
    all,7
    although,1
    always,2
    an,1
    and,26
    anger,1
    another,1
    any,2
    anyone,1
    arches,1
    are,1
    arm,1
    armour,1
    as,7
    assistant,2
    ...

Page 47: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

    #!/bin/bash
    text=$(cat ${1} | tr "[:punct:]" " " | \
        tr "[:upper:]" "[:lower:]")
    parsed=(${text})
    for w in ${parsed[@]}; do echo ${w}; done | sort | uniq -c

A (Bad) Shell Version

Page 48: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

The Results From the Shell

    23 a
     1 able
     6 about
     1 above
     1 accept
     1 accuse
     2 ago
     2 alarm
     7 all
     1 although
     2 always
     1 an
    26 and
     1 anger
     1 another
     2 any
     1 anyone
     1 arches
     1 are
     1 arm
     1 armour
     7 as
     2 assistant
    ...

Page 49: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Our Use Case

[Diagram labels: two "3rd party" inputs, each followed by distinct() and join(); one "1st party" input; then union(), distinct(), foreach()]

Page 50: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Questions?

Page 51: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Contact [email protected]

@mattkemp

/in/matthewkemp

Page 52: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

"Building warehousing systems on Redshift"

Tristan Crockett, Software Engineer at Edgeflip

Tweet: @thcrock #cloudcamp


Page 53: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Redshift: Lessons Learned

Tristan Crockett – Software Engineer, Edgeflip

Page 54: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Basics

● Analytical database
● PostgreSQL with column storage engine
● Automatic data compression
● No traditional indexes; specify a sort key (how are records in the table sorted?) and distribution key (which node will house a record?)
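To make the sort key and distribution key concrete, here is a small sketch that builds a Redshift-style `CREATE TABLE` statement as a string. The helper name, the table `user_posts`, and its columns are hypothetical and chosen only for illustration; the code does not connect to a database, and which keys suit a real table depends on its query patterns.

```python
def create_table_sql(table, columns, dist_key, sort_keys):
    """Build a Redshift-style CREATE TABLE statement as a string.

    table, columns, dist_key and sort_keys are illustrative inputs;
    nothing here connects to a database.
    """
    names = [name for name, _ in columns]
    for key in [dist_key] + list(sort_keys):
        if key not in names:
            raise ValueError('unknown column: %s' % key)
    column_sql = ',\n'.join(
        '    %s %s' % (name, sql_type) for name, sql_type in columns
    )
    return (
        'CREATE TABLE %s (\n%s\n)\n'
        'DISTKEY(%s)\n'
        'SORTKEY(%s);' % (table, column_sql, dist_key, ', '.join(sort_keys))
    )


ddl = create_table_sql(
    'user_posts',  # hypothetical table name
    [('user_id', 'BIGINT'), ('post_id', 'BIGINT'), ('posted_at', 'TIMESTAMP')],
    dist_key='user_id',                  # which node houses a record
    sort_keys=['user_id', 'posted_at'],  # how records are ordered on disk
)
```

Printing `ddl` shows the distribution key and sort key declared after the column list, which is where they take the place of traditional indexes.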

Page 55: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

My Work with Redshift

● Data warehouse for Facebook user feeds and related app data
● Inputs
  – RDS (MySQL)
  – DynamoDB
  – Facebook
● Stats
  – ~2 TB of compressed data
  – Two main tables, ~5 billion and ~25 billion rows respectively

Page 56: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Advantages / Disadvantages

● Fast at copying data in from S3
● Fast at computing aggregate/analytical functions over an entire table
● Decent at intra-db operations (create table as select, insert into select)
● Most everything else is slow
● Without traditional indexes, table design isn't as flexible

Page 57: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Lessons/Tips

● Optimize load size (1 MB to 1 GB per file)
● Compress input
● Upsert when needed, and always vacuum
● Don't populate tables with 'CREATE TABLE AS' if you like compression (which you do)
● To avoid complicated joins, consider computing single-table aggregates and joining on the results
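The first two tips (sized, compressed input files) can be sketched as a small splitter. This is an illustration under stated assumptions: `write_gzip_parts` is a hypothetical helper, the size check uses uncompressed bytes, and the tiny `max_bytes` value is only there to keep the demo small; for a real load the slide's 1 MB to 1 GB range would apply. Uploading the parts to S3 and running the load itself are not shown.

```python
import gzip
import os
import tempfile


def write_gzip_parts(lines, out_dir, prefix, max_bytes):
    """Write lines into numbered .gz files, starting a new file once the
    uncompressed size of the current part reaches max_bytes."""
    paths, part, size, handle = [], 0, 0, None
    for line in lines:
        data = (line.rstrip('\n') + '\n').encode('utf-8')
        if handle is None or size >= max_bytes:
            if handle is not None:
                handle.close()
            path = os.path.join(out_dir, '%s.%04d.gz' % (prefix, part))
            handle = gzip.open(path, 'wb')
            paths.append(path)
            part, size = part + 1, 0
        handle.write(data)
        size += len(data)
    if handle is not None:
        handle.close()
    return paths


out_dir = tempfile.mkdtemp()
rows = ['%d,example' % i for i in range(1000)]
parts = write_gzip_parts(rows, out_dir, 'feed', max_bytes=4096)
```

Splitting on whole lines keeps each part independently loadable, which is what lets a parallel loader work on several files at once.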

Page 58: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Upsert
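The contents of this slide did not survive in the transcript. As a hedged sketch, one widely documented way to upsert in Redshift is to load new rows into a staging table, delete the matching rows from the target, and insert everything from staging; this may or may not be what the slide showed. The helper below only builds the SQL strings. `upsert_statements` and the table and column names are hypothetical, and the staging table is assumed to be already created and loaded.

```python
def upsert_statements(target, staging, key_columns):
    """Return the SQL statements for a staging-table merge, in order."""
    match = ' AND '.join(
        '%s.%s = %s.%s' % (target, col, staging, col) for col in key_columns
    )
    return [
        'BEGIN;',
        # Remove rows that are about to be replaced.
        'DELETE FROM %s USING %s WHERE %s;' % (target, staging, match),
        # Bring in both the replacements and the brand-new rows.
        'INSERT INTO %s SELECT * FROM %s;' % (target, staging),
        'COMMIT;',
        # VACUUM runs outside the transaction, after the commit.
        'VACUUM %s;' % target,
    ]


statements = upsert_statements('posts', 'posts_staging', ['user_id', 'post_id'])
```

The trailing `VACUUM` reflects the "always vacuum" tip: deleted rows keep occupying space until the table is vacuumed.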

Page 59: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Keep an Eye on Compression

Page 60: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Single-Table Aggregates
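The slide's own example is not in the transcript, so the following is only a sketch of the idea from the tips list: aggregate each table separately, then join the small results instead of joining the raw tables. SQLite (via Python's `sqlite3`) stands in for Redshift so the example can run anywhere; the `posts` and `likes` tables and their rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # SQLite stands in for Redshift here
conn.executescript("""
    CREATE TABLE posts (user_id INTEGER, post_id INTEGER);
    CREATE TABLE likes (user_id INTEGER, post_id INTEGER);
    INSERT INTO posts VALUES (1, 10), (1, 11), (2, 20);
    INSERT INTO likes VALUES (1, 10), (1, 10), (2, 20), (2, 20), (2, 20);
""")

# Aggregate each table on its own first, then join the two small results.
query = """
    SELECT p.user_id, p.n_posts, l.n_likes
    FROM (SELECT user_id, COUNT(*) AS n_posts FROM posts GROUP BY user_id) AS p
    JOIN (SELECT user_id, COUNT(*) AS n_likes FROM likes GROUP BY user_id) AS l
      ON p.user_id = l.user_id
    ORDER BY p.user_id
"""
rows = conn.execute(query).fetchall()
```

Joining the raw tables on `user_id` first would multiply rows before counting, which is both slower and easy to get wrong; aggregating first keeps each side at one row per user.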

Page 61: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Thanks for Listening!

[email protected]

@thcrock

Page 62: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Un-panel Discussion

Volunteer to join the panel & ask questions from the floor!


Page 63: CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Unconference

Small groups & discussions, network

Pizza’s almost here!
