GAM404: Hunting Monsters in a Low-Latency Multiplayer Game on Amazon EC2
TRANSCRIPT
Wes Macdonald, Technical Director, Turtle Rock Studios
October 2015
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What to expect from the session
1. What is Evolve?
2. Real-time game simulation in AWS
3. REST-based server reservation service
4. Automation, key for success
5. Going global
6. Auto Scaling
7. Monitoring and metrics
What is Evolve?
• 5-player action game
• 4-player co-op hunter team
• 1 player-controlled monster
• Hunters chase the monster
• Monster is hunting wildlife
Heavy resource requirements
• Built on Crytek’s CRYENGINE®
• 5 human players plus 30+ AI wildlife
• CPU and memory requirements on par with 40+ player dedicated servers
What we’re working with
• Build process: Executables, assets, packaging
• Clients: Windows PC, Xbox One, PlayStation 4
• Server: Linux, stripped assets
Real-time game servers in the cloud?
• Traditionally, game servers are colocated
• Needed to purchase physical hardware
• Manually install and maintain that hardware
Real-time game servers in the cloud?
• Using a cloud service simplifies those issues
• Testing proved without a doubt it’s possible
• Resource allocation is strict
• CPU, memory, I/O are all predictable
• We shipped it
Hardware requirements
• Must be memory-bound, not CPU-bound
• Automated testing: all-bot rounds
• C3 instances gave us the best CPU-to-memory ratio
• RAID-0 stripe across ephemeral disks
• Space for executables, assets, logs, and core dumps
• We have swap space for emergencies
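As a sketch of the storage setup described above, here is how the mdadm invocation for striping two ephemeral disks into one RAID-0 array could be assembled. The device names and array path are illustrative assumptions, not taken from the talk:

```python
# Hypothetical sketch: build the mdadm command that stripes a set of
# ephemeral disks into a single RAID-0 array (no redundancy, max throughput).
# Device names and the array path are illustrative assumptions.

def mdadm_raid0_command(devices, array="/dev/md0"):
    """Build the mdadm invocation that stripes `devices` into one RAID-0 array."""
    if len(devices) < 2:
        raise ValueError("RAID-0 striping needs at least two devices")
    return [
        "mdadm", "--create", array,
        "--level=0",                          # RAID-0: striping only
        f"--raid-devices={len(devices)}",
    ] + list(devices)

cmd = mdadm_raid0_command(["/dev/xvdb", "/dev/xvdc"])
```

The resulting list would be passed to something like `subprocess.run` at instance bootstrap; the array then holds executables, assets, logs, and core dumps.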
Network requirements
• Enhanced networking is a benefit of C3
• Latency to all instance types was good
• The Internet is the real latency variable
• Low-bandwidth UDP protocol
• Optimized for P2P
Matchmaking and lobbies
• Matchmaking: XBL, PSN, Steam
• Lobby is always P2P
• Client-host; fallback or cost reduction
[Diagram: a client-host (H) and clients (C) form a P2P lobby; matchmaking sits behind an external ELB in a region]
Server reservation request
• Servers checking in
• Poll interval decreases with reservation
• Database is only for the transactional data requirement
• No persistence needed
[Diagram: game instances and app servers behind an external ELB handle the reservation request]
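The check-in schedule above can be sketched as a simple function: idle game servers poll the reservation service on a relaxed heartbeat, and once a reservation lands the interval drops so state changes propagate quickly. The interval values here are illustrative assumptions, not production numbers:

```python
# Hypothetical poll-interval schedule for game-server check-ins.
# Interval values are illustrative, not the ones used in production.

IDLE_INTERVAL_S = 30      # relaxed heartbeat while waiting for a reservation
RESERVED_INTERVAL_S = 5   # tight loop once a match is being set up

def poll_interval(reserved: bool) -> int:
    """Return how long a game server should wait before its next check-in."""
    return RESERVED_INTERVAL_S if reserved else IDLE_INTERVAL_S
```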
Server is reserved
• IP/port/server to host
• Host forwards details
• P2P host migration
[Diagram: client-host (CH) and clients (C) connect to the reserved game instance through an external ELB in a region]
Server automation system
• Unified application for operations and monitoring
• This gives us blanket authentication, authorization, accounting
• Using AWS SDK
• Very few people have access to AWS
• Every action is predefined, from starting a server to provisioning an entire region
Server automation system
[Diagram: operations staff reach a web app and API through an HTTPS ELB and proxy; a processor backed by user and task databases executes actions via the AWS SDK, including Auto Scaling; Grafana, Graphite, and Logstash handle monitoring]
Build distribution
• Every build uploaded into Amazon S3
• 20 GB per file through Amazon CloudFront is a real cost
• Use baked AMIs instead
• Baked before entering production
Build distribution
Game Instance
EBS RAID0 Ephemeral
• Every build into Amazon S3
• Dev from Amazon S3
• Bake AMI
• Production from AMI
We have dependencies
• MySQL servers
• ELB load balancers
• Salt configuration servers
• Many other services we haven’t covered
• An instance needs to discover dependencies easily
Instance metadata (169.254.169.254)
• Gives us everything we need for auto discovery
• No SDK, no AWS Identity and Access Management (IAM) role required, no complexity
• Use subnet as our container
• Build a DNS address out of it in Amazon Route 53
• Region-AZ-subnet-service.Domain.net
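The naming scheme above can be sketched as two small string builders, one per subnet and one region-wide. "Domain.net" stands in for whatever domain the team registered in Route 53, and the subnet label format is an assumption based on the slides:

```python
# Sketch of the Route 53 auto-discovery naming scheme described above.
# "Domain.net" is the placeholder domain from the slides.

def service_dns(az: str, subnet_label: str, service: str,
                domain: str = "Domain.net") -> str:
    """Discovery name for a service within one subnet, e.g. for an ELB or RDS."""
    return f"{az}-{subnet_label}-{service}.{domain}"

def region_dns(region: str, service: str, domain: str = "Domain.net") -> str:
    """Region-wide discovery name spanning all subnets."""
    return f"{region}-{service}.{domain}"
```

An instance only needs its own region, AZ, and subnet (all available from instance metadata) to construct the address of every dependency it has.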
[Diagram: within subnet ABCDEF, Amazon Route 53 resolves (1) us-west-1a-ABCDEF-ELB and (2) us-west-1a-ABCDEF-RDS for the app servers and game servers]
Whole region
[Diagram: us-west-1-ELB.Domain.net covers subnets ABCDEF and ABABAB across us-west-1a and us-west-1c, each with app servers and game servers]
Instance configuration
• We use SaltStack for server configuration
• Download configuration at startup
• Including the baked AMIs
• Using user data for startup scripting
• Allows quick changes to config without a rebake
• Tagging instances with our own data
1. What is Evolve?
2. Real-time game simulation in AWS
3. REST-based server reservation service
4. Automation, key for success
5. Going global
6. Auto Scaling
7. Monitoring and metrics
Make life easier: VPN
• We VPN to all our VPCs from our office and OPS VPCs
• OPS VPCs are for operations services
• OPS in us-west-1, eu-west-1
• Use different IP subnets per VPC
• Direct private SSH to any instance is great
• Simplifies security group management
Region discovery
• UDP ping service for every region
• Measures QoS: Latency and packet loss
• Also verifies build availability at the same time
• Written in C, extremely efficient, doubles as a relay
Region discovery
[Diagram: us-west-1-PING.Domain.net resolves to ping servers alongside the app and game servers in subnets ABCDEF (us-west-1a) and ABABAB (us-west-1c)]
Region failover
• Every client discovers and ranks every region locally
• Second best is always known
• Seamless failover to other regions at any time
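The client-side ranking described above can be sketched as follows. Each client scores every region from its measured latency and packet loss, keeps the full ordering, and so always knows the second-best region for failover. The scoring formula and its loss weight are assumptions for illustration:

```python
# Sketch of client-side region ranking. The QoS formula and the
# packet-loss weight are illustrative assumptions, not the shipped values.

def qos_score(latency_ms: float, loss_pct: float) -> float:
    """Lower is better; packet loss is penalized heavily (assumed weight)."""
    return latency_ms + loss_pct * 50.0

def rank_regions(samples: dict) -> list:
    """samples: region -> (latency_ms, loss_pct). Returns regions best-first."""
    return sorted(samples, key=lambda r: qos_score(*samples[r]))

ranked = rank_regions({
    "us-west-1": (30.0, 0.0),
    "eu-west-1": (140.0, 0.0),
    "us-east-1": (70.0, 2.0),   # low latency but lossy: ranks last here
})
best, second = ranked[0], ranked[1]
```

If the best region becomes unavailable, the client fails over to `second` without re-probing.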
Difficult problem
• We can’t use AWS Auto Scaling
• Scaling metric is the ratio of active to available servers
• Scaling is done per individual subnet
• Metrics sourced directly from RDS databases
• We don’t need to worry about fragmentation
• Ready for some complex math?
Easy solution
• If we’re over 80% utilization scale up 10%
• If we’re under 60% utilization scale down 10%
• Sample often, act at a set interval
• Interval must be longer than scale-up time
• Track scale-downs
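The 80/60 rule above can be written as a single decision function: scale the subnet's fleet up 10% when the active/available ratio exceeds 80%, down 10% below 60%, otherwise hold. This is a sketch of the policy as stated, not Turtle Rock's actual code; the guard against scaling below current demand is an added assumption:

```python
# Sketch of the per-subnet scaling policy described in the talk.
# The floor on scale-down (never below active demand) is an assumption.

def desired_capacity(active: int, available: int) -> int:
    """Return the new server count for one subnet given current usage."""
    utilization = active / available
    if utilization > 0.80:
        # scale up 10%, always adding at least one server
        return max(available + 1, round(available * 1.10))
    if utilization < 0.60:
        # scale down 10%, but never below the servers currently in use
        return max(active, round(available * 0.90))
    return available
```

Sampled often, but only acted on at an interval longer than the time it takes a new server to come up, this avoids oscillation.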
Track everything
• We use Graphite for our data collection
• Grafana for our visualization
• Tracking everything of interest from our applications
• StatsD on all instances to aggregate early
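The "aggregate early" idea can be illustrated with a toy StatsD-style counter: events are accumulated in memory on the instance, and one summed value per metric is flushed per interval instead of one network packet per event. This is a minimal sketch of the pattern, not the StatsD implementation itself:

```python
# Toy StatsD-style pre-aggregation: buffer increments locally, flush one
# summed value per metric per interval. Metric names are made-up examples.
from collections import defaultdict

class MiniStatsD:
    def __init__(self):
        self.counters = defaultdict(int)

    def incr(self, metric: str, n: int = 1):
        self.counters[metric] += n          # cheap in-memory accumulate

    def flush(self) -> dict:
        """Return the aggregated counters and reset for the next interval."""
        out, self.counters = dict(self.counters), defaultdict(int)
        return out

s = MiniStatsD()
for _ in range(3):
    s.incr("game.round.started")
s.incr("game.crash", 2)
```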
Scalable Graphite
• Aggregation periods are important
• I/O requirements are pretty big
• We’re using I2 instances with their large ephemerals
• Striped and using LVM for snapshotting
• Graphite is single threaded
• Need to scale using multiple processes
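The multi-process scaling trick can be sketched as a routing function: a relay hashes each metric name to pick one cache process, so every sample for a given metric always lands on the same single-threaded cache. Carbon's real relay uses a consistent-hash ring; plain modulo is shown here for brevity:

```python
# Sketch of relay-to-cache routing for a multi-process Graphite setup.
# carbon-relay uses a consistent-hash ring; modulo hashing shown for brevity.
import hashlib

def route(metric: str, n_caches: int) -> int:
    """Pick a stable cache index for a metric name."""
    digest = hashlib.md5(metric.encode()).hexdigest()
    return int(digest, 16) % n_caches
```

The key property is stability: the same metric always routes to the same cache, so each cache owns a disjoint slice of the namespace.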
Scalable Graphite
[Diagram: HAProxy load-balances into relay tiers and an aggregator; relays fan out to multiple carbon-cache processes writing to ephemeral RAID-0; a web app behind Varnish serves queries]
Regions and failover
[Diagram: instances in each region send metrics through a local HAProxy and relay tier, which forward to the central Graphite cluster and a mirror]
Scaling Logstash
• Fairly straightforward to scale
• Be sure you’re filtering unnecessary data early
• Used for system and application logs for all servers
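"Filtering unnecessary data early" can be illustrated with a shipping-side predicate that drops chatty lines before they ever reach Logstash. The patterns below are made-up examples, not the studio's actual filters:

```python
# Illustrative early log filter; prefixes and substrings are made-up examples.
NOISY_PREFIXES = ("DEBUG", "TRACE")
NOISY_SUBSTRINGS = ("heartbeat ok",)

def should_ship(line: str) -> bool:
    """Return True only for log lines worth sending to Logstash."""
    if line.startswith(NOISY_PREFIXES):
        return False
    return not any(s in line for s in NOISY_SUBSTRINGS)
```

Dropping on the instance keeps both network traffic and the Logstash tier's CPU in check as the fleet grows.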
Crash collection
• We upload raw core dumps into Amazon S3
• Process them early on the machine with GDB
• Send processed data into aggregation app
• Web front end to view crash aggregation data
• Links to raw dumps in Amazon S3
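The on-host processing step might look like the following: build a batch-mode GDB invocation that extracts a full backtrace from the core dump before the raw dump goes to S3. The paths are hypothetical and the flag set is kept minimal; treat the specifics as assumptions:

```python
# Sketch of on-host core-dump processing with GDB in batch mode.
# Paths are hypothetical; the gdb flag set is a minimal assumption.

def gdb_backtrace_cmd(executable: str, core: str) -> list:
    """gdb in batch mode: print a full backtrace for every thread."""
    return [
        "gdb", "--batch",
        "-ex", "thread apply all bt full",  # full backtrace, all threads
        executable, core,
    ]

cmd = gdb_backtrace_cmd("/opt/evolve/bin/server", "/cores/core.1234")
```

The captured output is what gets sent to the aggregation app, while the raw dump is uploaded to S3 and linked from the web front end.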
Summary
• Running game servers in AWS works
• Automate everything
• Subnets, metadata, Route 53 for auto configuration
• Region failover through ping service
• Auto scaling 10% over/under the 80%/60% utilization thresholds works
• Global scale of monitoring and operations
We’re hiring
• This is what we were working on last year
• What we’re doing now is bigger and better
• TurtleRockStudios.com