CloudCamp Chicago - Big Data & Cloud, May 2015 - All Slides
TRANSCRIPT
CloudCamp Chicago
“Big Data and Cloud”
#cloudcamp @CloudCamp_CHI
Sponsored by
Hosted by
Emcee
Margaret Walker - Cohesive Networks
Tweet: @CloudCamp_Chi #cloudcamp
… sponsored by you!
William Knowles - Evident.io
Adam Kallish - IBM
Craig Hancock - HealthEngine
Brandon Pittman - VMware
Chuck Mackie - Maven Wave Partners
Brad Foster - Maven Wave Partners
Kim Neuwirth - Narrative Science
Pia Opulencia - Narrative Science
Jim Stiller - Cloud Technology Partners
Brian Lickenbrock - EY
Agenda
6:00 pm: Introductions
6:05 pm: Lightning Talks
- "Big Data without Big Infrastructure" - Dan Chuparkoff, VP of Product at Civis Analytics @Chuparkoff
- "Simplicity, Storytelling and Big Data" - Craig Booth, Data Engineer at Narrative Science @craigmbooth
- "Spark: A Quick Ignition" - Matthew Kemp, Team Lead & Engineer of Things at Signal @mattkemp
- "Building warehousing systems on Redshift" - Tristan Crockett, Software Engineer at Edgeflip @thcrock
7:00 pm: Unpanel
7:45 pm: Unconference / Networking, drinks and pizza
"Big Data without Big Infrastructure"
Dan Chuparkoff
VP of Product at Civis Analytics
Tweet: @Chuparkoff #cloudcamp
@chuparkoff
BIG Data without
BIG Infrastructure
Dan Chuparkoff
VP of Product
Civis Analytics
@chuparkoff Big Data without Big Infrastructure
Civis is an easy-to-use, incredibly extensible data science platform in the cloud for teams who want to make great data-driven decisions to drive their organizations forward.
I work at Civis
“The ability to use the data that you’ve built up in the past
to inform & improve what you’re going to do in the future.”
Big Data at Civis Analytics
Data science is too damn hard
Why can’t I…
• have a report every day that says what happened yesterday?
• apply predictive modeling to improve my customer retention?
• use data from my past to improve acquisition in the future?
Everyone’s story
• Aggregate
• Unify
• Explore
• Optimize
• Share
• Automate
Where should we start?
Cloud vs. On-Prem
Civis Analytics uses AWS
• No hardware costs; capacity scales up and down on demand
• Safety and security of AWS
• Automatic backups to multiple data centers
• Access from any computer with an internet connection
Redshift S3 EC2 DynamoDB RDS EMR
Civis data streams aggregate data from virtually any source.
Get all of your data together in one place.
Aggregate
From data to activation
Next, Civis’ intelligent matching algorithms link data in disparate data stores. No matter where your data starts, Civis helps you build a unified data repository.
Unify
From data to activation
Explore and transform the data in a fast analytics database.
Explore
From data to activation
Build powerful predictive models and easily score results with the Civis platform’s advanced modeling engine. This is the heart of data-driven decision making!
Optimize
From data to activation
Create, automate, & share reports across your team. Empower your entire organization to move forward with precision.
Share
From data to activation
When tomorrow comes, there’s no need to reinvent the wheel. Civis lets you automate and schedule from start to finish, so you can get back to pushing boundaries.
Automate
From data to activation
Big Data + the Cloud + AWS helps Civis Analytics turn
an analyst into a data scientist & a data scientist
into a team of data scientists.
Thanks!
"Simplicity, Storytelling and Big Data"
Craig Booth
Data Engineer at Narrative Science
Tweet: @craigmbooth #cloudcamp
Simplicity, Storytelling & Big Data
Craig Booth
What I Wish I Knew About Big Data On Day One.
My Background
data-driven science
30+ journal articles; complex analytics on 10s of TB of data
data-powered storytelling
lumière léger (French: “light, lightweight”)
Credit: Josh Bloom & Henrik Brink of wise.io
“…more than 2000 hours of work in order to come up with the final combination of 107 algorithms that gave them this prize”
Xavier Amatriain and Justin Basilico, Netflix
“We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment.”
Xavier Amatriain and Justin Basilico, Netflix
Explainability: Can I communicate results?
Implementability: How long will it take me to build?
Accuracy: Can I tolerate some errors?
"Spark: A Quick Ignition"
Matthew Kemp
Team Lead & Engineer of Things at Signal
Tweet: @mattkemp #cloudcamp
Spark: A Quick Ignition
Matthew Kemp
What is Spark?
• Provides distributed processing
• Main unit of abstraction is the RDD
• Can be used with cluster managers like Mesos or YARN
• Supports Java, Python and Scala
https://spark.apache.org/
Resilient Distributed Dataset
• Can be created from: files on HDFS, in-memory iterables, Cassandra or SQL tables
• Transformations: lazily create a new RDD from an existing one
• Actions: usually return a value and force computation of the RDD
Transformations (some examples): filter, map, flatMap, distinct, union, intersection, join, reduceByKey
Actions (some examples): reduce, collect, take, count, foreach, saveAsTextFile
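The actual slide code isn’t captured in this transcript, so here is a plain-Python sketch of the lazy-transformation / eager-action split described above: a toy stand-in for an RDD that only queues work until an action forces it. This illustrates the concept only; it is not how Spark is actually implemented.

```python
from functools import reduce

class ToyRDD:
    """Toy stand-in for an RDD: transformations are queued lazily,
    actions force the whole pipeline to run."""

    def __init__(self, data, ops=None):
        self._data = list(data)
        self._ops = ops or []  # queued transformations, not yet run

    # --- transformations: record work, return a new ToyRDD ---
    def map(self, f):
        return ToyRDD(self._data, self._ops + [lambda xs: [f(x) for x in xs]])

    def filter(self, pred):
        return ToyRDD(self._data, self._ops + [lambda xs: [x for x in xs if pred(x)]])

    def flatMap(self, f):
        return ToyRDD(self._data, self._ops + [lambda xs: [y for x in xs for y in f(x)]])

    # --- actions: run every queued op, then compute a value ---
    def _compute(self):
        xs = self._data
        for op in self._ops:
            xs = op(xs)
        return xs

    def collect(self):
        return self._compute()

    def count(self):
        return len(self._compute())

    def reduce(self, f):
        return reduce(f, self._compute())

rdd = ToyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# Nothing has run yet; collect() forces the pipeline.
print(rdd.collect())   # [20, 30, 40]
print(rdd.count())     # 3
```

The payoff of this design in real Spark is that the scheduler sees the whole chain before running anything, so it can pipeline steps and avoid materializing intermediate datasets.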
Sample Text
Spark Example
Spark Shell
Shell Example
Gists
Example: Word Count
[Pipeline diagram: input → map() → flatMap() → reduceByKey() → map() → output]
Example: Word Count

#!/usr/bin/env python
import re
import string

regex = re.compile('[%s]' % re.escape(string.punctuation))

def word_count(sc, in_file_name, out_file_name):
    sc.textFile(in_file_name) \
      .map(lambda line: regex.sub(' ', line).strip().lower()) \
      .flatMap(lambda line: [(word, 1) for word in line.split()]) \
      .reduceByKey(lambda a, b: a + b) \
      .map(lambda pair: '%s,%s' % pair) \
      .saveAsTextFile(out_file_name)
Example: Alternate Word Count

#!/usr/bin/env python
import re
import string

regex = re.compile('[%s]' % re.escape(string.punctuation))

def word_count(sc, in_file_name, out_file_name):
    sc.textFile(in_file_name) \
      .map(lambda line: regex.sub(' ', line)) \
      .map(lambda line: line.strip()) \
      .map(lambda line: line.lower()) \
      .flatMap(lambda line: line.split()) \
      .map(lambda word: (word, 1)) \
      .reduceByKey(lambda a, b: a + b) \
      .map(lambda pair: '%s,%s' % pair) \
      .saveAsTextFile(out_file_name)
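The Spark job above only runs against a SparkContext, but its logic can be sanity-checked locally. This plain-Python version (not part of the original slides) mirrors the map / flatMap / reduceByKey steps step for step:

```python
import re
import string
from collections import Counter

regex = re.compile('[%s]' % re.escape(string.punctuation))

def word_count_local(lines):
    """Mirror the Spark pipeline: strip punctuation, lowercase,
    split into words, then count occurrences of each word."""
    counts = Counter()
    for line in lines:
        cleaned = regex.sub(' ', line).strip().lower()  # the map() step
        for word in cleaned.split():                    # the flatMap() step
            counts[word] += 1                           # the reduceByKey() step
    return ['%s,%s' % (word, n) for word, n in sorted(counts.items())]

print(word_count_local(['A man, a plan.', 'A canal!']))
# ['a,3', 'canal,1', 'man,1', 'plan,1']
```

Note the output is sorted here for readability; Spark’s saveAsTextFile writes partitions in whatever order reduceByKey produced them.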
Running the Example

$ pyspark
...
Using Python version 2.7.2 (default)
SparkContext available as sc.
>>> from word_count import word_count
>>> word_count(sc, 'text.txt', 'text_counts')
The Results From Spark

a,23
able,1
about,6
above,1
accept,1
accuse,1
ago,2
alarm,2
all,7
although,1
always,2
an,1
and,26
anger,1
another,1
any,2
anyone,1
arches,1
are,1
arm,1
armour,1
as,7
assistant,2
...
A (Bad) Shell Version

#!/bin/bash
text=$(cat ${1} | tr "[:punct:]" " " | \
    tr "[:upper:]" "[:lower:]")
parsed=(${text})
for w in ${parsed[@]}; do echo ${w}; done | sort | uniq -c
The Results From the Shell

 23 a
  1 able
  6 about
  1 above
  1 accept
  1 accuse
  2 ago
  2 alarm
  7 all
  1 although
  2 always
  1 an
 26 and
  1 anger
  1 another
  2 any
  1 anyone
  1 arches
  1 are
  1 arm
  1 armour
  7 as
  2 assistant
...
Our Use Case
[Pipeline diagram: two 3rd-party sources each pass through distinct(), each join() with 1st-party data, then union() → distinct() → foreach()]
Questions?
"Building warehousing systems on Redshift"
Tristan Crockett
Software Engineer at Edgeflip
Tweet: @thcrock #cloudcamp
Redshift: Lessons Learned
Tristan Crockett – Software Engineer, Edgeflip
Basics
● Analytical database
● PostgreSQL with a column storage engine
● Automatic data compression
● No traditional indexes; specify a sort key (how are records in the table sorted?) and a distribution key (which node will house a record?)
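To make the sort-key / distribution-key point concrete, here is a hypothetical DDL sketch. The `user_feeds` table and its columns are invented for illustration; only the DISTKEY/SORTKEY clauses follow Redshift’s CREATE TABLE syntax.

```python
def feed_table_ddl(table='user_feeds'):
    """Build a hypothetical Redshift DDL string: DISTKEY decides which
    node houses each row, SORTKEY decides how rows are ordered on disk."""
    return (
        "CREATE TABLE %s (\n"
        "    user_id   BIGINT,\n"
        "    posted_at TIMESTAMP,\n"
        "    message   VARCHAR(4096)\n"
        ")\n"
        "DISTKEY (user_id)\n"      # co-locate a user's rows on one node
        "SORTKEY (posted_at);"     # keep time-range scans cheap
    ) % table

print(feed_table_ddl())
```

Picking the join column as DISTKEY and the filter column as SORTKEY is the usual starting heuristic; since these can’t be changed in place, it pays to decide before loading billions of rows.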
My Work with Redshift
● Data warehouse for Facebook user feeds and related app data
● Inputs
  – RDS (MySQL)
  – DynamoDB
● Stats
  – ~2 TB of compressed data
  – Two main tables, ~5 billion and ~25 billion rows respectively
Advantages / Disadvantages
● Fast at copying data in from S3
● Fast at computing aggregate/analytical functions over an entire table
● Decent at intra-db operations (CREATE TABLE AS SELECT, INSERT INTO SELECT)
● Most everything else is slow
● Without traditional indexes, table design isn't as flexible
Lessons/Tips
● Optimize load size (1 MB to 1 GB per file)
● Compress input
● Upsert when needed, and always vacuum
● Don't populate tables with CREATE TABLE AS if you like compression (which you do)
● To avoid complicated joins, consider computing single-table aggregates and joining on the results
Upsert
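The slide’s code isn’t captured in this transcript. Redshift has no native UPSERT statement, so the usual pattern (presumably what the slide showed) is: COPY into a staging table, then delete matching rows from the target and insert from staging inside one transaction. A sketch that just generates the SQL, with hypothetical table and key names:

```python
def upsert_sql(target, staging, key):
    """Generate the classic Redshift upsert: delete rows in the target
    that also appear in the staging table, then insert everything from
    staging. One transaction, so readers never see a gap."""
    return (
        "BEGIN;\n"
        "DELETE FROM {t} USING {s} WHERE {t}.{k} = {s}.{k};\n"
        "INSERT INTO {t} SELECT * FROM {s};\n"
        "COMMIT;"
    ).format(t=target, s=staging, k=key)

print(upsert_sql('events', 'events_staging', 'event_id'))
```

This is also why the tip above says “always vacuum”: the DELETE leaves dead rows behind that only VACUUM reclaims.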
Keep an Eye on Compression
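The slide body is missing here too. The likely point, given the earlier tip about CREATE TABLE AS, is to declare column encodings explicitly rather than rely on whatever a copied table inherits (Redshift’s ANALYZE COMPRESSION command can suggest encodings). A hypothetical helper that renders per-column ENCODE clauses; the table, columns, and chosen encodings are invented for illustration:

```python
def encoded_ddl(table, columns):
    """Render a CREATE TABLE that spells out an encoding for every
    column; columns is a list of (name, type, encoding) tuples."""
    cols = ',\n'.join(
        '    %s %s ENCODE %s' % (name, ctype, enc)
        for name, ctype, enc in columns
    )
    return 'CREATE TABLE %s (\n%s\n);' % (table, cols)

print(encoded_ddl('events', [
    ('event_id', 'BIGINT', 'delta'),
    ('payload',  'VARCHAR(4096)', 'lzo'),
]))
```

With multi-billion-row tables like the ones above, the difference between raw and well-encoded columns is a large multiple in both storage and scan time.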
Single-Table Aggregates
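The slide’s SQL isn’t in the transcript; a hedged sketch of the tip above — aggregate the big table down to one row per key first, then join the small result — using invented `users`/`user_feeds` table names:

```python
def aggregate_join_sql():
    """Aggregate the huge table to one row per user first, then join
    the small result to the users table, instead of joining the raw
    multi-billion-row tables directly."""
    return (
        "SELECT u.user_id, u.name, f.feed_count\n"
        "FROM users u\n"
        "JOIN (\n"
        "    SELECT user_id, COUNT(*) AS feed_count\n"
        "    FROM user_feeds\n"
        "    GROUP BY user_id\n"
        ") f ON f.user_id = u.user_id;"
    )

print(aggregate_join_sql())
```

The subquery is a single-table scan, which is exactly what the “Advantages” slide says Redshift does fast; only the shrunken aggregate has to be redistributed for the join.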
Un-panel Discussion
Volunteer to join the panel & ask questions from the floor!
Unconference
Small groups & discussions, network
Pizza’s almost here!