(GAM303) Riot Games: Migrating Mountains of Data to AWS
TRANSCRIPT
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Sean Maloney, Riot Games Data Engineer
@SEAN_SEANNERY
October 2015
SEAN MALONEY
BIG DATA ENGINEER
WHO IS THIS GUY?
Lead developer on Riot’s ETL tools
FAVORITE ACTIVITY:
Attempting to grow facial hair but
failing miserably
MOVING MOUNTAINS OF DATA
1. INTRODUCTION
2. WHY WE NEEDED TO MOVE
3. TRY, TRY, TRY AGAIN
4. WHAT WE CAN DO NOW
5. HOW IT IMPACTS OUR USERS
INTRODUCTION
WHAT IS LEAGUE OF LEGENDS?
LAUNCHED 2009
ONLINE MULTIPLAYER
WINDOWS / OSX
40-50 MIN GAMES
THE TEAM
YOUR CHAMP
THE BATTLEGROUND
WHY MOVE?
CHAT
STORE AUDIT
Load Balancers and Firewalls
- 30 HADOOP NODES (CDH)
- 250 TB (FULL)
- PARTITIONS: 1.4 MILLION
- HDFS REPL FACTOR: 2 :(
SQOOP + OOZIE
Our game was growing! Data center was filling up.
We own our infrastructure: more game servers > more analytics servers.
WHY MOVE?
RESOURCE CONTENTION
Hive 0.8, pre-YARN, immature resource scheduling
WHY MOVE?
TRANSACTIONAL DATA: rows stored in fixed columns (a b c d e f, one value per column per row).
SERVER TELEMETRY: rows stored as a timestamp plus a key/value map (ts, map['a'=>1, 'b'=>2, 'c'=> ...]).
WHY MOVE?
Can't join the data
Slower performance
HIVE MAP TYPE
Captures upstream schema changes (we have a lot of upstream schema changes!)
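The map<string,string> approach can be sketched in plain Python: a telemetry record keeps a timestamp plus a string map, so a new upstream column becomes a new map key instead of an ALTER TABLE. Record shapes and field names here are illustrative, not Riot's actual schema.

```python
import time

def to_telemetry_record(row: dict) -> dict:
    """Flatten a fixed-schema row into a (ts, map) record, the shape the
    telemetry tables use. New upstream columns simply become new map keys,
    so schema changes don't require any table migration."""
    return {
        "ts": row.pop("ts", int(time.time())),
        "data": {k: str(v) for k, v in row.items()},  # everything stringly-typed
    }

# An upstream schema change (the new 'region' column) needs no migration:
old = to_telemetry_record({"ts": 1, "a": 1, "b": 2})
new = to_telemetry_record({"ts": 2, "a": 1, "b": 2, "region": "NA1"})
```

The trade-off is exactly the one on the previous slide: the map rows are easy to ingest but harder to join and slower to query than real columns.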
AMAZON SIMPLE STORAGE SERVICE (AMAZON S3)
TRY, TRY, TRY AGAIN
FIRST ATTEMPT
PROPOSED AMAZON EC2 / AMAZON EMR STRUCTURE
METASTORE (RDS)
AMAZON S3
TELEMETRY EMR | ETL EMR | USER EMR
HDFS
hdfs://user/hive/warehouse/
  schema1.db/
    table1/
      realm/
        dt/
          time/
    table2/
    table3/
  schema2.db/
    table1/
    …
  schema3.db/
  schema4.db/
S3
s3n://datawarehouse/
  schema1/
    table1/
      env/
        dt/
          time/
    table2/
    table3/
  schema2/
s3n://telemetrydata/
  application1/
    table1/
      env/
        dt/
    table2/
  application2/
PROPOSED AMAZON S3 STRUCTURE
HIVE
‣ schema1
    table1
      env
        dt
          time
    table2
    table3
‣ schema2
    table1
    ...
‣ schema3
‣ schema4
HOW LONG?
< 6 months | 6 mo < t < 1 yr | > 1 yr
DO IT IN 6 WEEKS
We had one tool that was already storing data in the cloud
PROJECT PLANNING
PLAN A
1. DISTCP copy -> S3 prod location
2. Create temp table in Hive on staging location
3. Insert overwrite temp table from prod table with map conversion
4. Copy files from staging to prod location
5. Choose cut-over date and repoint incoming data ETLs to S3
~$ hadoop distcp
   'hdfs://riothive:54310/user/hive/warehouse/lol_prod.db/store'
   's3n://warehouse/prod/store' &> output.log

hive> CREATE EXTERNAL TABLE copy_stage.store_tmp LIKE lol_prod.store
      LOCATION 's3n://warehouse/temp/store_tmp/';

hive> INSERT OVERWRITE TABLE copy_stage.store_tmp PARTITION (env, dt, h)
      SELECT MAP('id', CAST(id AS string),
                 'type', CAST(type AS string),
                 'date_created', CAST(dt_created AS string)
             ),
             dt, h, CASE realm_id WHEN '1' THEN 'NA1' WHEN '2' THEN 'KR1' …
      FROM lol_prod.store
      WHERE dt = $dt AND h = $h AND realm_id = $realm_id;

~$ aws s3 cp --recursive 's3://warehouse/temp/store_tmp'
   's3://warehouse/prod/store'
PLAN A
Over 70 tables x 15 regions to move
Python script to generate SQL
Ran SQL scripts in parallel for each table
DONE! Tell our customers! Celebrate!
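The "Python script to generate SQL" step might look roughly like this: render one statement per (table, region) pair from a template like the store example above. The template text and table/realm names are hypothetical, not Riot's actual script.

```python
from itertools import product

# Illustrative template; the real statements carried the full MAP()
# conversion shown on the previous slide.
TEMPLATE = (
    "INSERT OVERWRITE TABLE copy_stage.{table}_tmp PARTITION (env, dt, h)\n"
    "SELECT ... FROM lol_prod.{table}\n"
    "WHERE realm_id = '{realm_id}';"
)

def generate_statements(tables, realms):
    """Return one HiveQL string per table x region combination."""
    return [
        TEMPLATE.format(table=t, realm_id=r)
        for t, r in product(tables, realms)
    ]

# 2 tables x 3 regions -> 6 statements, which were then run in parallel.
stmts = generate_statements(["store", "login"], ["1", "2", "3"])
```

With 70+ tables and 15 regions this cross-product is over a thousand statements, which is why the scripts had to run in parallel.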
PLAN A IS THE WORST
Missing partitions
Corrupted partitions
Poor query performance
LEARN FROM OUR MISTAKES
DO:
➔ Use DISTCP tool to move files
➔ Audit every file that gets copied
➔ Allocate time for tuning AWS infrastructure
DON'T:
➔ Don't use Hive 0.8 to migrate
➔ Don't deliver until everything is working; lost trust is hard to regain
➔ Don't underestimate simple problems in big data
WHAT WOULD YOU DO?
- Fix the holes with good data?
- Wipe out everything, start from scratch?
- Give up? Move to the woods, become a lumberjack?
WHAT WE DID
- Fix the holes with good data (second attempt)
- Wipe out everything, start from scratch (third attempt)
SECOND ATTEMPT
PLAN B
Leverage our ETL tools to repair
Compare rowcounts of iron Hive vs cloud Hive for each partition
If rowcount bad, run script to re-import the data
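A minimal sketch of that rowcount comparison, assuming per-partition counts have already been collected from both warehouses. Function and partition names are illustrative; the buckets mirror the categories in the stats below.

```python
def find_bad_partitions(iron_counts, cloud_counts):
    """Compare per-partition row counts between the on-premise ('iron')
    Hive and the cloud Hive. Buckets the differences the way Plan B's
    stats do: duplicated, missing, or partial partitions."""
    bad = {"duplicated": [], "missing": [], "partial": []}
    for part, iron in iron_counts.items():
        cloud = cloud_counts.get(part)
        if cloud is None:
            bad["missing"].append(part)       # never made it to the cloud
        elif cloud > iron:
            bad["duplicated"].append(part)    # copied more than once
        elif cloud < iron:
            bad["partial"].append(part)       # copy was interrupted
    return bad

iron = {"dt=2013-01-01/h=0": 100, "dt=2013-01-01/h=1": 100, "dt=2013-01-01/h=2": 100}
cloud = {"dt=2013-01-01/h=0": 100, "dt=2013-01-01/h=1": 60}
report = find_bad_partitions(iron, cloud)
```

The comparison itself is trivial; the expensive part, as the next slides show, is producing those counts in Hive 0.8 in the first place.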
ROW COUNTS
duplicated data: 2540
missing partitions: 27777
partial partitions: 10528
total bad partitions: 40844 (>=2013)
10 seconds to fix dupes
10 minutes to fix missing / partial backfill
PLAN B IS THE WORST
We didn't have statistics enabled on the cloud Hive
Finding bad partitions is expensive
Row counts in Hive 0.8 mean MapReduce jobs
Fix all tables 2013-01-01 onward, all regions: 266 days
Fix all tables, all of time, all regions: 787 days
LEARN FROM OUR MISTAKES
DO:
➔ Estimate how long the move will take using extrapolation
➔ Turn on rowcount statistics in Hive
➔ Get an auditing solution for DW accuracy
DON'T:
➔ Don't assume repairing is faster than starting fresh
➔ Don't assume your source data warehouse is 100% accurate
THIRD ATTEMPT
3RD TIME'S THE CHARM
Start over from scratch
Modify Hadoop DISTCP tool to be data driven
MapReduce tool to copy files from HDFS -> S3
Recursively list all files needed to move
Write that list to a DB table for tracking and auditing
appl_job_id | hdfs_source         | s3_target                | hdfs_size | s3_size | hdfs_chksum | s3_chksum | copy_status | chksum_status
job_xx_112  | hdfs://mytbl1/file1 | s3://mybkt1/mytbl1/file1 | 132594    |         | mlk567lkm5  |           | not_run     | not_run
job_xx_113  | hdfs://mytbl1/file2 | s3://mybkt1/mytbl1/file2 | 292694    |         | 87gf879sdf9 |           | not_run     | not_run
job_xx_124  | hdfs://mytbl1/file3 | s3://mybkt1/mytbl1/file3 | 3259      |         | h43jhak4h5s |           | not_run     | not_run
job_xx_129  | hdfs://mytbl1/file4 | s3://mybkt1/mytbl1/file4 | 62484     |         | fd767a7e7f6 |           | not_run     | not_run
DATA DRIVEN COPY TOOL
Compare file sizes / checksums after copy completes
Store success / fail status for each copy job
Query DB for failed files and retry / debug
appl_job_id | hdfs_source         | s3_target                | hdfs_size | s3_size | hdfs_chksum | s3_chksum   | copy_status | chksum_status
job_xx_112  | hdfs://mytbl1/file1 | s3://mybkt1/mytbl1/file1 | 132594    | 132594  | mlk567lkm5  | mlk567lkm5  | success     | success
job_xx_113  | hdfs://mytbl1/file2 | s3://mybkt1/mytbl1/file2 | 292694    |         | 87gf879sdf9 |             | failed      | not_run
job_xx_124  | hdfs://mytbl1/file3 | s3://mybkt1/mytbl1/file3 | 3259      | 3259    | h43jhak4h5s | fg53hj65un  | success     | failed
job_xx_129  | hdfs://mytbl1/file4 | s3://mybkt1/mytbl1/file4 | 62484     | 62484   | fd767a7e7f6 | fd767a7e7f6 | success     | success
DATA DRIVEN COPY TOOL
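The audit logic behind the table above can be sketched like this. A dict stands in for one row of the tracking DB, and the size/checksum values are the illustrative ones from the table, not real HDFS checksums.

```python
def audit_copy(job):
    """Fill in copy_status / chksum_status for one copy-job row the way
    the data-driven DISTCP tool does: compare sizes, then checksums."""
    job["copy_status"] = (
        "success" if job.get("s3_size") == job["hdfs_size"] else "failed"
    )
    if job["copy_status"] == "success":
        job["chksum_status"] = (
            "success" if job.get("s3_chksum") == job["hdfs_chksum"] else "failed"
        )
    else:
        job["chksum_status"] = "not_run"  # no point checksumming a failed copy
    return job

def failed_jobs(jobs):
    """The retry query: everything that didn't fully succeed."""
    return [j for j in jobs if "failed" in (j["copy_status"], j["chksum_status"])]

jobs = [
    audit_copy({"appl_job_id": "job_xx_112", "hdfs_size": 132594, "s3_size": 132594,
                "hdfs_chksum": "mlk567lkm5", "s3_chksum": "mlk567lkm5"}),
    audit_copy({"appl_job_id": "job_xx_113", "hdfs_size": 292694, "s3_size": None,
                "hdfs_chksum": "87gf879sdf9", "s3_chksum": None}),
]
```

Because every file's status lives in a DB row, the whole migration becomes repeatable: re-run the tool and it only touches rows that aren't fully successful yet.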
LEARN FROM OUR MISTAKES
DO:
➔ Make your migration tool repeatable
➔ Create S3 permissions and naming standards early
➔ Upgrade your Hive version to more stable releases
➔ Hire people smarter than yourself
DON'T:
➔ Don't wait too long to migrate or else DISTCP might have issues
➔ Don't forget to clean up temp S3 files
➔ Don't stop. Believing.
hive> SHOW SCHEMAS;
OK
copy_stage
test_warehouse
prod_warehouse
DELETE_ME_1
DELETE_ME_2
DELETE_ME_3
DELETE_ME_4
DELETE_ME_5
insights_tech
data_science
sand_box
Time taken: 0.457 seconds, Fetched: 11 row(s)
NOTE TO SELF:
Even if a database schema is named 'DELETE_ME_1',
check where Hive managed tables are pointed before running
DROP DATABASE ... CASCADE.
Also, turn on S3 versioning.
OOPS
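A hypothetical guard for that note: before dropping a schema with CASCADE, collect each table's LOCATION (e.g. scraped from DESCRIBE FORMATTED output) and refuse to drop if anything still points at a location you care about. The prefix and paths here are illustrative.

```python
# Locations that must never be deleted out from under us -- illustrative.
PROTECTED_PREFIXES = ("s3n://warehouse/prod/",)

def drop_blockers(table_locations):
    """table_locations: the LOCATION of every table in the schema.
    Returns the offending locations; an empty list means the
    CASCADE drop looks safe."""
    return [loc for loc in table_locations
            if loc.startswith(PROTECTED_PREFIXES)]

# A schema named DELETE_ME_1 can still hold a table pointed at prod:
offenders = drop_blockers(["s3n://warehouse/tmp/x", "s3n://warehouse/prod/store"])
```

S3 versioning is the backstop for when a check like this is skipped: deleted objects become recoverable delete markers instead of being gone.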
WHAT CAN WE DO NOW?
POST-MOVE STRUCTURE
METASTORE (RDS)
AMAZON S3
TELEMETRY EMR | ETL EMR | USER EMR
AWS INFRASTRUCTURE TODAY
- EMR: Data Science, Analytics / Hue, ETL, Telemetry, Platfora
- EC2: Loading, Auditing ETL, Telemetry collectors, Data dictionary, Rocana (real-time dashboard), Solr (real time), Point Data Service
- Amazon RDS: Metastore
- Amazon DynamoDB: Data Science Fraud DB, ETL App DB, Point Data Store
- S3: Source of "Truth"
- Networking: VPC, AWS Direct Connect
NEW AND IMPROVED
Create ad-hoc EMR clusters
Track billing for teams using our resources
Amazon CloudWatch monitoring
Easy metastore scaling
Don't have to manage HDFS name nodes
No more debugging hardware issues (just spin up a new instance)
FOR THE USERS
CHAMPION MASTERY
Custom rewards for mastering different champions
Intensive query that spans every game that every player has played
Improves player engagement
PLAYER SUPPORT
Full copy of our data warehouse in DynamoDB
Hive -> DynamoDB dynamic partition
Support can answer questions faster than ever
OFFENSIVE CHAT DETECTION
Data science team queries all chat messages in game
Sentiment analysis and classification
Identifies negative, offensive players and mutes them automatically
FINAL THOUGHTS...
QUESTIONS?
@SEAN_SEANNERY
ENGINEERING BLOG: engineering.riotgames.com
EAT. DRINK. PLAY.
re:Invent After Party
TONIGHT! 6pm-10pm @ Palazzo Tower, 3rd Floor - Palazzo Parlor