CloudCamp Chicago - Big Data & Cloud, May 2015 - All Slides
TRANSCRIPT
CloudCamp Chicago
“Big Data and Cloud”
#cloudcamp @CloudCamp_CHI
Sponsored by
Hosted by
Emcee
Margaret Walker - Cohesive Networks
Tweet: @CloudCamp_Chi #cloudcamp
… sponsored by you!
William Knowles - Evident.io
Adam Kallish - IBM
Craig Hancock - HealthEngine
Brandon Pittman - VMware
Chuck Mackie - Maven Wave Partners
Brad Foster - Maven Wave Partners
Kim Neuwirth - Narrative Science
Pia Opulencia - Narrative Science
Jim Stiller - Cloud Technology Partners
Brian Lickenbrock - EY
Agenda
6:00 pm: Introductions
6:05 pm: Lightning Talks
- "Big Data without Big Infrastructure" - Dan Chuparkoff, VP of Product at Civis Analytics @Chuparkoff
- "Simplicity, Storytelling and Big Data" - Craig Booth, Data Engineer at Narrative Science @craigmbooth
- "Spark: A Quick Ignition" - Matthew Kemp, Team Lead & Engineer of Things at Signal @mattkemp
- "Building warehousing systems on Redshift" - Tristan Crockett, Software Engineer at Edgeflip @thcrock
7:00 pm: Unpanel
7:45 pm: Unconference / Networking, drinks and pizza
"Big Data without Big Infrastructure"
Dan Chuparkoff
VP of Product at Civis Analytics
Tweet: @Chuparkoff #cloudcamp
@chuparkoff
BIG Data without
BIG Infrastructure
Dan Chuparkoff
VP of Product
Civis Analytics
@chuparkoff Big Data without Big Infrastructure
Civis is an easy-to-use, incredibly extensible data science platform in the cloud for teams who want to make great data-driven decisions to drive their organizations forward.
I work at Civis
“The ability to use the data that you’ve built up in the past
to inform & improve what you’re going to do in the future.”
Big Data at Civis Analytics
Data science is too damn hard
Why can’t I…
• have a report every day that says what happened yesterday?
• apply predictive modeling to improve my customer retention?
• use data from my past to improve acquisition in the future?
Everyone’s story
• Aggregate
• Unify
• Explore
• Optimize
• Share
• Automate
Where should we start?
Cloud vs. On-Prem
Civis Analytics uses AWS
• No hardware costs; capacity scales up and down on demand
• Safety and security of AWS
• Automatic backups to multiple data centers
• Access from any computer with an internet connection
Redshift S3 EC2 DynamoDB RDS EMR
Civis data streams aggregate data from virtually any source.
Get all of your data together in one place.
Aggregate
From data to activation
Next, Civis’ intelligent matching algorithms link data in disparate data stores. No matter where your data starts, Civis helps you build a unified data repository.
Unify
From data to activation
Explore and transform the data in a fast analytics database.
Explore
From data to activation
Build powerful predictive models and easily score results with the Civis platform’s advanced modeling engine. This is the heart of data-driven decision making!
Optimize
From data to activation
Create, automate, & share reports across your team. Empower your entire organization to move forward with precision.
Share
From data to activation
When tomorrow comes, there’s no need to reinvent the wheel. Civis lets you automate and schedule from start to finish, so you can get back to pushing boundaries.
Automate
From data to activation
Big Data + the Cloud + AWS helps Civis Analytics turn
an analyst into a data scientist & a data scientist
into a team of data scientists.
Thanks!
"Simplicity, Storytelling and Big Data"
Craig Booth
Data Engineer at Narrative Science
Tweet: @craigmbooth #cloudcamp
Simplicity, Storytelling & Big Data
Craig Booth
What I Wish I Knew About Big Data On Day One.
My Background
data-driven science
30+ journal articles; complex analytics on 10s of TB of data
data-powered storytelling
lumière léger (French: “light, lightweight”)
Credit: Josh Bloom & Henrik Brink of wise.io
“…more than 2000 hours of work in order to come up with the final combination of 107 algorithms that gave them this prize”
Xavier Amatriain and Justin Basilico, Netflix
“We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment.”
Xavier Amatriain and Justin Basilico, Netflix
Explainability: Can I communicate results?
Implementability: How long will it take me to build?
Accuracy: Can I tolerate some errors?
"Spark: A Quick Ignition"
Matthew Kemp
Team Lead & Engineer of Things at Signal
Tweet: @mattkemp #cloudcamp
Spark: A Quick Ignition
Matthew Kemp
What is Spark?
• Provides distributed processing
• Main unit of abstraction is the RDD
• Can be used with cluster managers like Mesos or YARN
• Supports Java, Python and Scala
https://spark.apache.org/
Resilient Distributed Dataset
• Can be created from: files on HDFS, in-memory iterables, Cassandra or SQL tables
• Transformations: lazily create a new RDD from an existing one
• Actions: usually return a value and force computation of the RDD
Transformations (some examples): filter, map, flatMap, distinct, union, intersection, join, reduceByKey
Actions (some examples): reduce, collect, take, count, foreach, saveAsTextFile
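The actual slide code isn’t captured in this transcript, so here is a plain-Python sketch of the lazy-transformation / eager-action split described above: a toy stand-in for an RDD that only queues work until an action forces it. This illustrates the concept only; it is not how Spark is actually implemented.

```python
from functools import reduce

class ToyRDD:
    """Toy stand-in for an RDD: transformations are queued lazily,
    actions force the whole pipeline to run."""

    def __init__(self, data, ops=None):
        self._data = list(data)
        self._ops = ops or []  # queued transformations, not yet run

    # --- transformations: record work, return a new ToyRDD ---
    def map(self, f):
        return ToyRDD(self._data, self._ops + [lambda xs: [f(x) for x in xs]])

    def filter(self, pred):
        return ToyRDD(self._data, self._ops + [lambda xs: [x for x in xs if pred(x)]])

    def flatMap(self, f):
        return ToyRDD(self._data, self._ops + [lambda xs: [y for x in xs for y in f(x)]])

    # --- actions: run every queued op, then compute a value ---
    def _compute(self):
        xs = self._data
        for op in self._ops:
            xs = op(xs)
        return xs

    def collect(self):
        return self._compute()

    def count(self):
        return len(self._compute())

    def reduce(self, f):
        return reduce(f, self._compute())

rdd = ToyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# Nothing has run yet; collect() forces the pipeline.
print(rdd.collect())   # [20, 30, 40]
print(rdd.count())     # 3
```

The payoff of this design in real Spark is that the scheduler sees the whole chain before running anything, so it can pipeline steps and avoid materializing intermediate datasets.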
Sample Text
Spark Example
Spark Shell
Shell Example
Gists
Example: Word Count
[Pipeline diagram: input → map() → flatMap() → reduceByKey() → map() → output]
Example: Word Count

#!/usr/bin/env python
import re
import string

regex = re.compile('[%s]' % re.escape(string.punctuation))

def word_count(sc, in_file_name, out_file_name):
    sc.textFile(in_file_name) \
      .map(lambda line: regex.sub(' ', line).strip().lower()) \
      .flatMap(lambda line: [(word, 1) for word in line.split()]) \
      .reduceByKey(lambda a, b: a + b) \
      .map(lambda pair: '%s,%s' % pair) \
      .saveAsTextFile(out_file_name)
Example: Alternate Word Count

#!/usr/bin/env python
import re
import string

regex = re.compile('[%s]' % re.escape(string.punctuation))

def word_count(sc, in_file_name, out_file_name):
    sc.textFile(in_file_name) \
      .map(lambda line: regex.sub(' ', line)) \
      .map(lambda line: line.strip()) \
      .map(lambda line: line.lower()) \
      .flatMap(lambda line: line.split()) \
      .map(lambda word: (word, 1)) \
      .reduceByKey(lambda a, b: a + b) \
      .map(lambda pair: '%s,%s' % pair) \
      .saveAsTextFile(out_file_name)
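The Spark job above only runs against a SparkContext, but its logic can be sanity-checked locally. This plain-Python version (not part of the original slides) mirrors the map / flatMap / reduceByKey steps step for step:

```python
import re
import string
from collections import Counter

regex = re.compile('[%s]' % re.escape(string.punctuation))

def word_count_local(lines):
    """Mirror the Spark pipeline: strip punctuation, lowercase,
    split into words, then count occurrences of each word."""
    counts = Counter()
    for line in lines:
        cleaned = regex.sub(' ', line).strip().lower()  # the map() step
        for word in cleaned.split():                    # the flatMap() step
            counts[word] += 1                           # the reduceByKey() step
    return ['%s,%s' % (word, n) for word, n in sorted(counts.items())]

print(word_count_local(['A man, a plan.', 'A canal!']))
# ['a,3', 'canal,1', 'man,1', 'plan,1']
```

Note the output is sorted here for readability; Spark’s saveAsTextFile writes partitions in whatever order reduceByKey produced them.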
Running the Example

$ pyspark
...
Using Python version 2.7.2 (default)
SparkContext available as sc.
>>> from word_count import word_count
>>> word_count(sc, 'text.txt', 'text_counts')
The Results From Spark

a,23
able,1
about,6
above,1
accept,1
accuse,1
ago,2
alarm,2
all,7
although,1
always,2
an,1
and,26
anger,1
another,1
any,2
anyone,1
arches,1
are,1
arm,1
armour,1
as,7
assistant,2
...
A (Bad) Shell Version

#!/bin/bash
text=$(cat ${1} | tr "[:punct:]" " " | \
    tr "[:upper:]" "[:lower:]")
parsed=(${text})
for w in ${parsed[@]}; do echo ${w}; done | sort | uniq -c
The Results From the Shell

 23 a
  1 able
  6 about
  1 above
  1 accept
  1 accuse
  2 ago
  2 alarm
  7 all
  1 although
  2 always
  1 an
 26 and
  1 anger
  1 another
  2 any
  1 anyone
  1 arches
  1 are
  1 arm
  1 armour
  7 as
  2 assistant
...
Our Use Case
[Pipeline diagram: two 3rd-party sources each pass through distinct(), each join() with 1st-party data, then union() → distinct() → foreach()]
Questions?
"Building warehousing systems on Redshift"
Tristan Crockett
Software Engineer at Edgeflip
Tweet: @thcrock #cloudcamp
Redshift: Lessons Learned
Tristan Crockett – Software Engineer, Edgeflip
Basics
● Analytical database
● PostgreSQL with a column storage engine
● Automatic data compression
● No traditional indexes; specify a sort key (how are records in the table sorted?) and a distribution key (which node will house a record?)
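To make the sort-key / distribution-key point concrete, here is a hypothetical DDL sketch. The `user_feeds` table and its columns are invented for illustration; only the DISTKEY/SORTKEY clauses follow Redshift’s CREATE TABLE syntax.

```python
def feed_table_ddl(table='user_feeds'):
    """Build a hypothetical Redshift DDL string: DISTKEY decides which
    node houses each row, SORTKEY decides how rows are ordered on disk."""
    return (
        "CREATE TABLE %s (\n"
        "    user_id   BIGINT,\n"
        "    posted_at TIMESTAMP,\n"
        "    message   VARCHAR(4096)\n"
        ")\n"
        "DISTKEY (user_id)\n"      # co-locate a user's rows on one node
        "SORTKEY (posted_at);"     # keep time-range scans cheap
    ) % table

print(feed_table_ddl())
```

Picking the join column as DISTKEY and the filter column as SORTKEY is the usual starting heuristic; since these can’t be changed in place, it pays to decide before loading billions of rows.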
My Work with Redshift
● Data warehouse for Facebook user feeds and related app data
● Inputs
  – RDS (MySQL)
  – DynamoDB
● Stats
  – ~2 TB of compressed data
  – Two main tables, ~5 billion and ~25 billion rows respectively
Advantages / Disadvantages
● Fast at copying data in from S3
● Fast at computing aggregate/analytical functions over an entire table
● Decent at intra-db operations (CREATE TABLE AS SELECT, INSERT INTO SELECT)
● Most everything else is slow
● Without traditional indexes, table design isn't as flexible
Lessons/Tips
● Optimize load size (1 MB to 1 GB per file)
● Compress input
● Upsert when needed, and always vacuum
● Don't populate tables with CREATE TABLE AS if you like compression (which you do)
● To avoid complicated joins, consider computing single-table aggregates and joining on the results
Upsert
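The slide’s code isn’t captured in this transcript. Redshift has no native UPSERT statement, so the usual pattern (presumably what the slide showed) is: COPY into a staging table, then delete matching rows from the target and insert from staging inside one transaction. A sketch that just generates the SQL, with hypothetical table and key names:

```python
def upsert_sql(target, staging, key):
    """Generate the classic Redshift upsert: delete rows in the target
    that also appear in the staging table, then insert everything from
    staging. One transaction, so readers never see a gap."""
    return (
        "BEGIN;\n"
        "DELETE FROM {t} USING {s} WHERE {t}.{k} = {s}.{k};\n"
        "INSERT INTO {t} SELECT * FROM {s};\n"
        "COMMIT;"
    ).format(t=target, s=staging, k=key)

print(upsert_sql('events', 'events_staging', 'event_id'))
```

This is also why the tip above says “always vacuum”: the DELETE leaves dead rows behind that only VACUUM reclaims.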
Keep an Eye on Compression
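The slide body is missing here too. The likely point, given the earlier tip about CREATE TABLE AS, is to declare column encodings explicitly rather than rely on whatever a copied table inherits (Redshift’s ANALYZE COMPRESSION command can suggest encodings). A hypothetical helper that renders per-column ENCODE clauses; the table, columns, and chosen encodings are invented for illustration:

```python
def encoded_ddl(table, columns):
    """Render a CREATE TABLE that spells out an encoding for every
    column; columns is a list of (name, type, encoding) tuples."""
    cols = ',\n'.join(
        '    %s %s ENCODE %s' % (name, ctype, enc)
        for name, ctype, enc in columns
    )
    return 'CREATE TABLE %s (\n%s\n);' % (table, cols)

print(encoded_ddl('events', [
    ('event_id', 'BIGINT', 'delta'),
    ('payload',  'VARCHAR(4096)', 'lzo'),
]))
```

With multi-billion-row tables like the ones above, the difference between raw and well-encoded columns is a large multiple in both storage and scan time.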
Single-Table Aggregates
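The slide’s SQL isn’t in the transcript; a hedged sketch of the tip above — aggregate the big table down to one row per key first, then join the small result — using invented `users`/`user_feeds` table names:

```python
def aggregate_join_sql():
    """Aggregate the huge table to one row per user first, then join
    the small result to the users table, instead of joining the raw
    multi-billion-row tables directly."""
    return (
        "SELECT u.user_id, u.name, f.feed_count\n"
        "FROM users u\n"
        "JOIN (\n"
        "    SELECT user_id, COUNT(*) AS feed_count\n"
        "    FROM user_feeds\n"
        "    GROUP BY user_id\n"
        ") f ON f.user_id = u.user_id;"
    )

print(aggregate_join_sql())
```

The subquery is a single-table scan, which is exactly what the “Advantages” slide says Redshift does fast; only the shrunken aggregate has to be redistributed for the join.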
Un-panel Discussion
Volunteer to join the panel & ask questions from the floor!
Unconference
Small groups & discussions, network
Pizza’s almost here!