IronSource Atom - Redshift - Lessons Learned

All content is the property and proprietary interest of CloudZone, The removal of any proprietary notices, including attribution information, is strictly prohibited.

Upload: idan-tohami

Post on 16-Apr-2017

TRANSCRIPT

Page 1: IronSource Atom -  Redshift - Lessons Learned

Page 2: IronSource Atom -  Redshift - Lessons Learned

Big Data Month 2016 – Up Next…

[Slide graphic: upcoming sessions dated 14.11, 15.11, 22.11, 22.11, 28.11 and 30.11]

Page 3: IronSource Atom -  Redshift - Lessons Learned

Master AWS Redshift - Agenda

13:00 – 13:20 Intro to Amazon Redshift by IronSource
13:20 – 15:00 LAB I – Using Amazon Redshift
15:00 – 15:15 Break
15:15 – 17:25 LAB II – Table Layout and Schema Design with Amazon Redshift
17:25 – 17:30 Your next steps on AWS by CloudZone

Page 4: IronSource Atom -  Redshift - Lessons Learned

Shimon Tolts, General Manager, Data Solutions

Atom Data Pipeline: Processing 200B events with Node.js and Docker on AWS

Page 5: IronSource Atom -  Redshift - Lessons Learned

About ironSource: Hypergrowth

800M People Reached Each Month

4200 Apps Installed Every Minute with the ironSource Platform

200B Registered & Analyzed Data Events Every Month

[Chart: monthly series from Jun 2015 to May 2016, vertical axis from 0 to 200B]

Page 6: IronSource Atom -  Redshift - Lessons Learned

Our Business Challenge

We needed a way to manage this data:

Collect, Process, Store

Page 7: IronSource Atom -  Redshift - Lessons Learned
Page 8: IronSource Atom -  Redshift - Lessons Learned

Collection

● Multi-region layer - latency-based routing
● Low latency from client to Atom servers
● High availability - AWS regions do fail!
● Storing raw data + headers on receipt

Page 9: IronSource Atom -  Redshift - Lessons Learned

Data Enrichment

● Enrich data before storing it in your Data Lake and/or Warehouse
○ IP to Country
○ Currency conversion
○ Decrypt data
○ User Agent parsing - OS, Browser, Device...
● Any custom logic you would like! - fully extensible
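The enrichment steps listed on this slide can be pictured as a chain of small functions applied to each event before it is stored. The sketch below is illustrative only and is not Atom's actual code: the lookup tables, field names (`ip`, `currency`, `amount`, `user_agent`) and function names are all hypothetical, and a real deployment would use a GeoIP database, a live exchange-rate source and a proper User Agent parser.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]
Enricher = Callable[[Event], Event]

# Hypothetical lookup tables standing in for a GeoIP database and a rates feed.
IP_PREFIX_TO_COUNTRY = {"203.0.113.": "AU", "198.51.100.": "US"}
USD_RATES = {"USD": 1.0, "EUR": 1.1}


def add_country(event: Event) -> Event:
    """IP to Country: match the IP against known prefixes."""
    ip = str(event.get("ip", ""))
    country = next(
        (c for prefix, c in IP_PREFIX_TO_COUNTRY.items() if ip.startswith(prefix)),
        "unknown",
    )
    return {**event, "country": country}


def convert_currency(event: Event) -> Event:
    """Currency conversion: add a USD amount when the rate is known."""
    rate = USD_RATES.get(str(event.get("currency")))
    if rate is None or "amount" not in event:
        return event
    return {**event, "amount_usd": round(float(event["amount"]) * rate, 2)}


def parse_user_agent(event: Event) -> Event:
    """User Agent parsing: a deliberately crude OS detection."""
    ua = str(event.get("user_agent", ""))
    os_name = "Android" if "Android" in ua else "iOS" if "iPhone" in ua else "other"
    return {**event, "os": os_name}


def enrich(event: Event, enrichers: List[Enricher]) -> Event:
    """Apply each enrichment step in order; custom steps can be appended."""
    for step in enrichers:
        event = step(event)
    return event


PIPELINE: List[Enricher] = [add_country, convert_currency, parse_user_agent]
```

Because every step takes an event and returns an event, adding custom logic amounts to appending another function to `PIPELINE`.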

Page 10: IronSource Atom -  Redshift - Lessons Learned

Data Targets

● Near real-time data insertion - 1 minute!
● Stream data to Google Storage and/or AWS S3
● Smart insertion of data into AWS Redshift
○ Set the number of parallel COPYs
○ Configure priority on tables
● BigQuery - streaming data using batch file imports (saves 20% cost)
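One way to read "set the number of parallel COPYs" and "configure priority on tables" is as a priority queue drained by a bounded worker pool. The following is a minimal sketch under that assumption, not the Atom implementation: the class name `CopyQueue`, the default priority of 100 and the `run_copy` callback are hypothetical, and `run_copy` is where a real system would execute the Redshift COPY statement.

```python
import heapq
import itertools
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


class CopyQueue:
    """Queue COPY jobs and run them in priority order with bounded parallelism."""

    def __init__(self, max_parallel: int, table_priority: Dict[str, int]) -> None:
        self.max_parallel = max_parallel
        self.table_priority = table_priority  # lower number = loaded earlier
        self._heap: List[Tuple[int, int, str, str]] = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order per priority

    def submit(self, table: str, s3_path: str) -> None:
        priority = self.table_priority.get(table, 100)  # hypothetical default
        heapq.heappush(self._heap, (priority, next(self._seq), table, s3_path))

    def drain(self, run_copy: Callable[[str, str], None]) -> List[Tuple[str, str]]:
        """Start all queued jobs in priority order, at most max_parallel at once."""
        ordered: List[Tuple[str, str]] = []
        while self._heap:
            _, _, table, path = heapq.heappop(self._heap)
            ordered.append((table, path))
        with ThreadPoolExecutor(max_workers=self.max_parallel) as pool:
            futures = [pool.submit(run_copy, table, path) for table, path in ordered]
            for future in futures:
                future.result()  # re-raise any failure from a COPY job
        return ordered
```

Setting `max_parallel` to 1 serializes all loads, which is the extreme form of avoiding parallel inserts; larger values trade load latency against cluster contention.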

Page 11: IronSource Atom -  Redshift - Lessons Learned
Page 12: IronSource Atom -  Redshift - Lessons Learned

Micro-Services Architecture

● Everything is a service
● Decoupling
● Distributed systems
● Separate lifecycle
● Communication using RESTful / Queue / Streams

Page 13: IronSource Atom -  Redshift - Lessons Learned

Docker

● Linux Container
● Save provisioning time
● Infrastructure as code
● Dev-Test-Production - identical container
● Ship easily

Page 14: IronSource Atom -  Redshift - Lessons Learned

Cloud infrastructure

● Pay as you go - (grow)
● SaaS services
● Auto-scaling-groups
● DynamoDB
● RDS *SQL
● Redshift data warehouse

Page 15: IronSource Atom -  Redshift - Lessons Learned

Continuous Integration

● From commit to production
● Jenkins commit hook
● Git branching model
● AWS dynamic slaves
● Unit tests
● Docker builds
● Updating live environment

Page 16: IronSource Atom -  Redshift - Lessons Learned

Diagram

Page 18: IronSource Atom -  Redshift - Lessons Learned

STARTING POINT

● Xplenty - Hadoop service - ~40 min query
● One big cluster - 96 xlarge nodes
● No WLM configuration
● CSV copy
● No reserved nodes
● A different ETL process implemented by every department

Page 21: IronSource Atom -  Redshift - Lessons Learned

SOLUTION:

● Using 8xl nodes if needed
● Redshift cluster per department
● “Hot and cold” clusters - SSD: fast and furious, HDD: slow but cheap
● WLM configuration
● Reserved Nodes
● JSON copy
● One pipeline to rule them all - ironBeast - currently supporting over 50B events per month, inserting data into more than 10 Redshift clusters
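For the move from CSV copy to JSON copy, a load statement along the following lines would apply. This is a hedged sketch rather than the statement ironSource ran: the helper name `build_json_copy`, the table name, the S3 prefix and the IAM role ARN are placeholders, and the identifiers are assumed to come from trusted configuration since they are interpolated directly into the SQL.

```python
def build_json_copy(
    table: str,
    s3_prefix: str,
    iam_role: str,
    jsonpaths: str = "auto",
    gzip: bool = True,
) -> str:
    """Build a Redshift COPY statement that loads JSON files from S3.

    jsonpaths is either 'auto' (match JSON keys to column names) or the
    S3 location of a JSONPaths file.
    """
    parts = [
        f"COPY {table}",
        f"FROM '{s3_prefix}'",
        f"IAM_ROLE '{iam_role}'",
        f"FORMAT AS JSON '{jsonpaths}'",
    ]
    if gzip:
        parts.append("GZIP")  # input files are gzip-compressed
    return "\n".join(parts) + ";"


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(
        build_json_copy(
            "events",
            "s3://example-bucket/events/2016/11/",
            "arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
        )
    )
```

With `'auto'`, Redshift maps JSON keys to columns by name, so new optional fields in the events do not require re-ordering a CSV layout.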

Page 22: IronSource Atom -  Redshift - Lessons Learned

WORK LOAD MANAGEMENT

Page 23: IronSource Atom -  Redshift - Lessons Learned

THINGS WE LEARNED ALONG THE WAY

● https://github.com/awslabs/amazon-redshift-utils (AdminViews)
● User permissions do not apply to new tables created in a schema
● Vacuum Vacuum Vacuum
● Avoid parallel inserts (especially on 8xl nodes) - if you copy to multiple tables, it is better to implement a COPY queue
● STL_LOAD_ERRORS - money on the floor
● A columnar datastore does not mean you can use as many columns as you want - it is better to split into multiple tables
● Encode your columns - ‘analyze compression’
● Instances that query Redshift should use MTU 1500 - link
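Several of the lessons above map onto standard Redshift SQL statements. The sketch below collects them in one place; it is an assumption about how these lessons might be applied, not the speaker's tooling. The helper names, the schema `analytics`, the table `events` and the group `analysts` are hypothetical, and `ALTER DEFAULT PRIVILEGES` is shown as one common answer to the permissions issue, which the slide itself does not prescribe.

```python
from typing import List


def default_select_grant(schema: str, group: str) -> str:
    # Grants SELECT on tables created in the schema from now on. It applies
    # to tables created by the user who runs it, unless FOR USER is added.
    return (
        f"ALTER DEFAULT PRIVILEGES IN SCHEMA {schema} "
        f"GRANT SELECT ON TABLES TO GROUP {group};"
    )


def vacuum(table: str) -> str:
    # Reclaims space and re-sorts rows after deletes and updates.
    return f"VACUUM {table};"


def analyze_compression(table: str) -> str:
    # Reports a suggested encoding per column; it does not change the table.
    return f"ANALYZE COMPRESSION {table};"


# Most recent load failures, with the reason COPY rejected each row.
RECENT_LOAD_ERRORS = (
    "SELECT starttime, filename, line_number, colname, err_code, err_reason "
    "FROM stl_load_errors ORDER BY starttime DESC LIMIT 20;"
)


def maintenance_plan(schema: str, tables: List[str], group: str) -> List[str]:
    """Return the statements to run, in order, for one schema."""
    statements = [default_select_grant(schema, group)]
    for table in tables:
        qualified = f"{schema}.{table}"
        statements.append(vacuum(qualified))
        statements.append(analyze_compression(qualified))
    statements.append(RECENT_LOAD_ERRORS)
    return statements


if __name__ == "__main__":
    for sql in maintenance_plan("analytics", ["events"], "analysts"):
        print(sql)
```

The statements are returned as plain strings so they can be reviewed or scheduled separately; running them requires a database connection, which is left out here.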

Page 24: IronSource Atom -  Redshift - Lessons Learned

Redshift use cases

Page 25: IronSource Atom -  Redshift - Lessons Learned

10 Million Free Monthly Events

Thank you!

ironsrc.com/atom

[email protected] @shimontolts