(GAM303) Riot Games: Migrating Mountains of Data to AWS
TRANSCRIPT
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Sean Maloney, Riot Games Data Engineer
@SEAN_SEANNERY
October 2015
SEAN MALONEY
BIG DATA ENGINEER
WHO IS THIS GUY?
Lead developer on Riot’s ETL tools
FAVORITE ACTIVITY:
Attempting to grow facial hair but
failing miserably
MOVING MOUNTAINS OF DATA
1. INTRODUCTION
2. WHY WE NEEDED TO MOVE
3. TRY, TRY, TRY AGAIN
4. WHAT WE CAN DO NOW
5. HOW IT IMPACTS OUR USERS
INTRODUCTION
WHAT IS LEAGUE OF LEGENDS?
LAUNCHED 2009
ONLINE MULTIPLAYER
WINDOWS / OSX
40-50 MIN GAMES
THE TEAM
YOUR CHAMP
THE BATTLEGROUND
WHY MOVE?
CHAT
STORE AUDIT
Load Balancers and Firewalls
- 30 HADOOP NODES (CDH)
- 250 TB (FULL)
- PARTITIONS: 1.4 MILLION
- HDFS REPL FACTOR: 2 :(
SQOOP + OOZIE
Our game was growing! Data center was filling up.
We own our infrastructure: more game servers > more analytics servers.
WHY MOVE?
RESOURCE CONTENTION
Hive 0.8, pre-YARN, immature resource scheduling
WHY MOVE?
TRANSACTIONAL DATA: rows stored in fixed columns (a b c d e f, one value per column per row).
SERVER TELEMETRY: rows stored as a timestamp plus a key/value map (ts, map['a'=>1, 'b'=>2, 'c'=> ...]).
WHY MOVE?
Can't join the data
Slower performance
HIVE MAP TYPE
Captures upstream schema changes (we have a lot of upstream schema changes!)
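The map<string,string> approach can be sketched in plain Python: a telemetry record keeps a timestamp plus a string map, so a new upstream column becomes a new map key instead of an ALTER TABLE. Record shapes and field names here are illustrative, not Riot's actual schema.

```python
import time

def to_telemetry_record(row: dict) -> dict:
    """Flatten a fixed-schema row into a (ts, map) record, the shape the
    telemetry tables use. New upstream columns simply become new map keys,
    so schema changes don't require any table migration."""
    return {
        "ts": row.pop("ts", int(time.time())),
        "data": {k: str(v) for k, v in row.items()},  # everything stringly-typed
    }

# An upstream schema change (the new 'region' column) needs no migration:
old = to_telemetry_record({"ts": 1, "a": 1, "b": 2})
new = to_telemetry_record({"ts": 2, "a": 1, "b": 2, "region": "NA1"})
```

The trade-off is exactly the one on the previous slide: the map rows are easy to ingest but harder to join and slower to query than real columns.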
AMAZON SIMPLE STORAGE SERVICE (AMAZON S3)
TRY, TRY, TRY AGAIN
FIRST ATTEMPT
PROPOSED AMAZON EC2 / AMAZON EMR STRUCTURE
METASTORE (RDS)
AMAZON S3
TELEMETRY EMR | ETL EMR | USER EMR
HDFS
hdfs://user/hive/warehouse/
  schema1.db/
    table1/
      realm/
        dt/
          time/
    table2/
    table3/
  schema2.db/
    table1/
    …
  schema3.db/
  schema4.db/
S3
s3n://datawarehouse/
  schema1/
    table1/
      env/
        dt/
          time/
    table2/
    table3/
  schema2/
s3n://telemetrydata/
  application1/
    table1/
      env/
        dt/
    table2/
  application2/
PROPOSED AMAZON S3 STRUCTURE
HIVE
‣ schema1
    table1
      env
        dt
          time
    table2
    table3
‣ schema2
    table1
    ...
‣ schema3
‣ schema4
HOW LONG?
< 6 months | 6 mo < t < 1 yr | > 1 yr
DO IT IN 6 WEEKS
We had one tool that was already storing data in the cloud
PROJECT PLANNING
PLAN A
1. DISTCP copy -> S3 prod location
2. Create temp table in Hive on staging location
3. Insert overwrite temp table from prod table with map conversion
4. Copy files from staging to prod location
5. Choose cut-over date and repoint incoming data ETLs to S3
~$ hadoop distcp
   'hdfs://riothive:54310/user/hive/warehouse/lol_prod.db/store'
   's3n://warehouse/prod/store' &> output.log

hive> CREATE EXTERNAL TABLE copy_stage.store_tmp LIKE lol_prod.store
      LOCATION 's3n://warehouse/temp/store_tmp/';

hive> INSERT OVERWRITE TABLE copy_stage.store_tmp PARTITION (env, dt, h)
      SELECT MAP('id', CAST(id AS string),
                 'type', CAST(type AS string),
                 'date_created', CAST(dt_created AS string)
             ),
             dt, h, CASE realm_id WHEN '1' THEN 'NA1' WHEN '2' THEN 'KR1' …
      FROM lol_prod.store
      WHERE dt = $dt AND h = $h AND realm_id = $realm_id;

~$ aws s3 cp --recursive 's3://warehouse/temp/store_tmp'
   's3://warehouse/prod/store'
PLAN A
Over 70 tables x 15 regions to move
Python script to generate SQL
Ran SQL scripts in parallel for each table
DONE! Tell our customers! Celebrate!
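The "Python script to generate SQL" step might look roughly like this: render one statement per (table, region) pair from a template like the store example above. The template text and table/realm names are hypothetical, not Riot's actual script.

```python
from itertools import product

# Illustrative template; the real statements carried the full MAP()
# conversion shown on the previous slide.
TEMPLATE = (
    "INSERT OVERWRITE TABLE copy_stage.{table}_tmp PARTITION (env, dt, h)\n"
    "SELECT ... FROM lol_prod.{table}\n"
    "WHERE realm_id = '{realm_id}';"
)

def generate_statements(tables, realms):
    """Return one HiveQL string per table x region combination."""
    return [
        TEMPLATE.format(table=t, realm_id=r)
        for t, r in product(tables, realms)
    ]

# 2 tables x 3 regions -> 6 statements, which were then run in parallel.
stmts = generate_statements(["store", "login"], ["1", "2", "3"])
```

With 70+ tables and 15 regions this cross-product is over a thousand statements, which is why the scripts had to run in parallel.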
PLAN A IS THE WORST
Missing partitions
Corrupted partitions
Poor query performance
LEARN FROM OUR MISTAKES
DO:
➔ Use DISTCP tool to move files
➔ Audit every file that gets copied
➔ Allocate time for tuning AWS infrastructure
DON'T:
➔ Don't use Hive 0.8 to migrate
➔ Don't deliver until everything is working; lost trust is hard to regain
➔ Don't underestimate simple problems in big data
WHAT WOULD YOU DO?
- Fix the holes with good data?
- Wipe out everything, start from scratch?
- Give up? Move to the woods, become a lumberjack?
WHAT WE DID
- Fix the holes with good data (second attempt)
- Wipe out everything, start from scratch (third attempt)
SECOND ATTEMPT
PLAN B
Leverage our ETL tools to repair
Compare rowcounts of iron Hive vs cloud Hive for each partition
If rowcount bad, run script to re-import the data
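A minimal sketch of that rowcount comparison, assuming per-partition counts have already been collected from both warehouses. Function and partition names are illustrative; the buckets mirror the categories in the stats below.

```python
def find_bad_partitions(iron_counts, cloud_counts):
    """Compare per-partition row counts between the on-premise ('iron')
    Hive and the cloud Hive. Buckets the differences the way Plan B's
    stats do: duplicated, missing, or partial partitions."""
    bad = {"duplicated": [], "missing": [], "partial": []}
    for part, iron in iron_counts.items():
        cloud = cloud_counts.get(part)
        if cloud is None:
            bad["missing"].append(part)       # never made it to the cloud
        elif cloud > iron:
            bad["duplicated"].append(part)    # copied more than once
        elif cloud < iron:
            bad["partial"].append(part)       # copy was interrupted
    return bad

iron = {"dt=2013-01-01/h=0": 100, "dt=2013-01-01/h=1": 100, "dt=2013-01-01/h=2": 100}
cloud = {"dt=2013-01-01/h=0": 100, "dt=2013-01-01/h=1": 60}
report = find_bad_partitions(iron, cloud)
```

The comparison itself is trivial; the expensive part, as the next slides show, is producing those counts in Hive 0.8 in the first place.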
ROW COUNTS
duplicated data: 2540
missing partitions: 27777
partial partitions: 10528
total bad partitions: 40844 (>=2013)
10 seconds to fix dupes
10 minutes to fix missing / partial backfill
PLAN B IS THE WORST
We didn't have statistics enabled on the cloud Hive
Finding bad partitions is expensive
Row counts in Hive 0.8 mean MapReduce jobs
Fix all tables 2013-01-01 onward, all regions: 266 days
Fix all tables, all of time, all regions: 787 days
LEARN FROM OUR MISTAKES
DO:
➔ Estimate how long the move will take using extrapolation
➔ Turn on rowcount statistics in Hive
➔ Get an auditing solution for DW accuracy
DON'T:
➔ Don't assume repairing is faster than starting fresh
➔ Don't assume your source data warehouse is 100% accurate
THIRD ATTEMPT
3RD TIME'S THE CHARM
Start over from scratch
Modify Hadoop DISTCP tool to be data driven
MapReduce tool to copy files from HDFS -> S3
Recursively list all files needed to move
Write that list to a DB table for tracking and auditing
appl_job_id | hdfs_source         | s3_target                | hdfs_size | s3_size | hdfs_chksum | s3_chksum | copy_status | chksum_status
job_xx_112  | hdfs://mytbl1/file1 | s3://mybkt1/mytbl1/file1 | 132594    |         | mlk567lkm5  |           | not_run     | not_run
job_xx_113  | hdfs://mytbl1/file2 | s3://mybkt1/mytbl1/file2 | 292694    |         | 87gf879sdf9 |           | not_run     | not_run
job_xx_124  | hdfs://mytbl1/file3 | s3://mybkt1/mytbl1/file3 | 3259      |         | h43jhak4h5s |           | not_run     | not_run
job_xx_129  | hdfs://mytbl1/file4 | s3://mybkt1/mytbl1/file4 | 62484     |         | fd767a7e7f6 |           | not_run     | not_run
DATA DRIVEN COPY TOOL
Compare file sizes / checksums after copy completes
Store success / fail status for each copy job
Query DB for failed files and retry / debug
appl_job_id | hdfs_source         | s3_target                | hdfs_size | s3_size | hdfs_chksum | s3_chksum   | copy_status | chksum_status
job_xx_112  | hdfs://mytbl1/file1 | s3://mybkt1/mytbl1/file1 | 132594    | 132594  | mlk567lkm5  | mlk567lkm5  | success     | success
job_xx_113  | hdfs://mytbl1/file2 | s3://mybkt1/mytbl1/file2 | 292694    |         | 87gf879sdf9 |             | failed      | not_run
job_xx_124  | hdfs://mytbl1/file3 | s3://mybkt1/mytbl1/file3 | 3259      | 3259    | h43jhak4h5s | fg53hj65un  | success     | failed
job_xx_129  | hdfs://mytbl1/file4 | s3://mybkt1/mytbl1/file4 | 62484     | 62484   | fd767a7e7f6 | fd767a7e7f6 | success     | success
DATA DRIVEN COPY TOOL
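The audit logic behind the table above can be sketched like this. A dict stands in for one row of the tracking DB, and the size/checksum values are the illustrative ones from the table, not real HDFS checksums.

```python
def audit_copy(job):
    """Fill in copy_status / chksum_status for one copy-job row the way
    the data-driven DISTCP tool does: compare sizes, then checksums."""
    job["copy_status"] = (
        "success" if job.get("s3_size") == job["hdfs_size"] else "failed"
    )
    if job["copy_status"] == "success":
        job["chksum_status"] = (
            "success" if job.get("s3_chksum") == job["hdfs_chksum"] else "failed"
        )
    else:
        job["chksum_status"] = "not_run"  # no point checksumming a failed copy
    return job

def failed_jobs(jobs):
    """The retry query: everything that didn't fully succeed."""
    return [j for j in jobs if "failed" in (j["copy_status"], j["chksum_status"])]

jobs = [
    audit_copy({"appl_job_id": "job_xx_112", "hdfs_size": 132594, "s3_size": 132594,
                "hdfs_chksum": "mlk567lkm5", "s3_chksum": "mlk567lkm5"}),
    audit_copy({"appl_job_id": "job_xx_113", "hdfs_size": 292694, "s3_size": None,
                "hdfs_chksum": "87gf879sdf9", "s3_chksum": None}),
]
```

Because every file's status lives in a DB row, the whole migration becomes repeatable: re-run the tool and it only touches rows that aren't fully successful yet.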
LEARN FROM OUR MISTAKES
DO:
➔ Make your migration tool repeatable
➔ Create S3 permissions and naming standards early
➔ Upgrade your Hive version to more stable releases
➔ Hire people smarter than yourself
DON'T:
➔ Don't wait too long to migrate or else DISTCP might have issues
➔ Don't forget to clean up temp S3 files
➔ Don't stop. Believing.
hive> SHOW SCHEMAS;
OK
copy_stage
test_warehouse
prod_warehouse
DELETE_ME_1
DELETE_ME_2
DELETE_ME_3
DELETE_ME_4
DELETE_ME_5
insights_tech
data_science
sand_box
Time taken: 0.457 seconds, Fetched: 11 row(s)
NOTE TO SELF:
Even if a database schema is named 'DELETE_ME_1',
check where Hive managed tables are pointed before running
DROP DATABASE ... CASCADE.
Also, turn on S3 versioning.
OOPS
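A hypothetical guard for that note: before dropping a schema with CASCADE, collect each table's LOCATION (e.g. scraped from DESCRIBE FORMATTED output) and refuse to drop if anything still points at a location you care about. The prefix and paths here are illustrative.

```python
# Locations that must never be deleted out from under us -- illustrative.
PROTECTED_PREFIXES = ("s3n://warehouse/prod/",)

def drop_blockers(table_locations):
    """table_locations: the LOCATION of every table in the schema.
    Returns the offending locations; an empty list means the
    CASCADE drop looks safe."""
    return [loc for loc in table_locations
            if loc.startswith(PROTECTED_PREFIXES)]

# A schema named DELETE_ME_1 can still hold a table pointed at prod:
offenders = drop_blockers(["s3n://warehouse/tmp/x", "s3n://warehouse/prod/store"])
```

S3 versioning is the backstop for when a check like this is skipped: deleted objects become recoverable delete markers instead of being gone.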
WHAT CAN WE DO NOW?
POST-MOVE STRUCTURE
METASTORE (RDS)
AMAZON S3
TELEMETRY EMR | ETL EMR | USER EMR
AWS INFRASTRUCTURE TODAY
- EMR: Data Science, Analytics / Hue, ETL, Telemetry, Platfora
- EC2: Loading, Auditing ETL, Telemetry collectors, Data dictionary, Rocana (real-time dashboard), Solr (real time), Point Data Service
- Amazon RDS: Metastore
- Amazon DynamoDB: Data Science Fraud DB, ETL App DB, Point Data Store
- S3: Source of "Truth"
- Networking: VPC, AWS Direct Connect
NEW AND IMPROVED
Create ad-hoc EMR clusters
Track billing for teams using our resources
Amazon CloudWatch monitoring
Easy metastore scaling
Don't have to manage HDFS name nodes
No more debugging hardware issues (just spin up a new instance)
FOR THE USERS
CHAMPION MASTERY
Custom rewards for mastering different champions
Intensive query that spans every game that every player has played
Improves player engagement
PLAYER SUPPORT
Full copy of our data warehouse in DynamoDB
Hive -> DynamoDB dynamic partition
Support can answer questions faster than ever
OFFENSIVE CHAT DETECTION
Data science team queries all chat messages in game
Sentiment analysis and classification
Identifies negative, offensive players and mutes them automatically
FINAL THOUGHTS...
QUESTIONS?
@SEAN_SEANNERY
ENGINEERING BLOG: engineering.riotgames.com
EAT. DRINK. PLAY.
re:Invent After Party
TONIGHT! 6pm-10pm @ Palazzo Tower, 3rd Floor - Palazzo Parlor