Hunting Monsters in a Low-latency Multiplayer Game on Amazon EC2 (GAM404)

Wes Macdonald, Technical Director, Turtle Rock Studios

October 2015

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

What to expect from the session

1. What is Evolve?

2. Real-time game simulation in AWS

3. REST-based server reservation service

4. Automation, key for success

5. Going global

6. Auto Scaling

7. Monitoring and metrics

What is Evolve?

• 5-player action game

• 4-player coop hunter team

• 1 player monster

• Hunters chase the monster

• Monster is hunting wildlife

Heavy resource requirements

• Built on Crytek’s CRYENGINE®

• 5 human players but also 30+ AI wildlife

• CPU and memory requirements on par with 40+ player dedicated servers

What we’re working with

• Build process: Executables, assets, packaging

• Clients: Windows PC, Xbox One, PlayStation 4

• Server: Linux, stripped assets

2. Real-time game simulation in AWS

Real-time game servers in the cloud?

• Traditionally, game servers are colocated

• Needed to purchase physical hardware

• Manually install and maintain that hardware

Real-time game servers in the cloud?

• Using a cloud service simplifies those issues

• Testing proved without a doubt it’s possible

• Resource allocation is strict

• CPU, memory, I/O are all predictable

• We shipped it

Hardware requirements

• Must be memory bound, not CPU bound

• Automated testing: all-bot rounds

• C3 instances gave us the best CPU to memory ratio

• RAID-0 stripe ephemeral disks

• Space for executables, assets, logs and core dumps

• We have swap space for emergencies

Network requirements

• Enhanced networking a benefit of C3

• Latency to all instance types was good

• The Internet is the real latency variable

• Low bandwidth UDP protocol

• Optimized for P2P

Instance configuration

• Process manager

• Multiple game servers

[Diagram: launcher]

3. REST-based server reservation service

Matchmaking and lobbies

• XBL, PSN, Steam

• Lobby is always P2P

• Fallback or cost reduction

• Client-host

[Diagram: matchmaking, external ELB, and region; a lobby of one host (H) and clients (C)]

Server reservation request

• Servers checking in

• Poll interval decreases with reservation requirement

• Database is only for transactional data

• No persistence needed

[Diagram: game instances, external ELB, app servers]
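
The check-in behavior on this slide can be sketched as follows. This is only an illustration, not the studio's implementation: the function name `next_poll_interval`, the interval bounds, and the linear interpolation are all assumptions. The slide states only that the poll interval decreases as the reservation requirement grows.

```python
def next_poll_interval(pending_reservations: int,
                       idle_servers: int,
                       min_interval: float = 1.0,
                       max_interval: float = 30.0) -> float:
    """Return how many seconds an idle game server waits before checking in again.

    Hypothetical policy: the more reservation requests are waiting relative
    to the number of idle servers, the shorter the interval.
    """
    if idle_servers <= 0:
        # Nothing idle to spread the demand over, so poll as fast as allowed.
        return min_interval
    # Clamp demand to the range [0, 1] before interpolating.
    demand = min(1.0, max(0.0, pending_reservations / idle_servers))
    return max_interval - demand * (max_interval - min_interval)
```

With these assumed bounds, a quiet service polls every 30 seconds and a fully loaded one polls every second, so a reservation is picked up quickly exactly when it matters.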

Server is reserved

• IP/port/server to host

• Host forwards details

• P2P host migration

[Diagram: external ELB and region; clients (C) and host (H)]

4. Automation, key for success

People are error prone

• Automate everything

• Anything done manually is a liability

Server automation system

• Unified application for operations and monitoring

• This gives us blanket authentication, authorization, accounting

• Using AWS SDK

• Very few people have access to AWS

• Every action is predefined from starting a server to provisioning an entire region

Server automation system

[Diagram: HTTPS ELB, operations API, web app, proxy, Grafana, Graphite, Logstash, user database, task database, processor, AWS SDK, Auto Scaling]

Build distribution

• Every build uploaded into Amazon S3

• The 20 GB per-file limit in Amazon CloudFront is real

• Use baked AMIs instead

• Baked before entering production

Build distribution

• Every build into Amazon S3

• Dev from Amazon S3

• Bake AMI

• Production from AMI

[Diagram: game instance with EBS and RAID-0 ephemeral storage]

We have dependencies

• MySQL servers

• ELB instances

• Salt configuration servers

• Many other services we haven’t covered

• An instance needs to discover dependencies easily

Instance metadata (169.254.169.254)

• Gives us everything we need for auto discovery

• No SDK, no AWS Identity and Access Management (IAM) role required, no complexity

• Use subnet as our container

• Build a DNS address out of it in Amazon Route 53

• Region-AZ-subnet-service.Domain.net
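
The naming scheme above can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the studio's code: the helper names are hypothetical, `Domain.net` is the placeholder domain from the slide, and the sketch assumes the instance allows plain unauthenticated GET requests to the metadata service. How the slide's short subnet labels (such as ABCDEF) map to real subnet IDs is not stated, so the subnet value is used as-is.

```python
import urllib.request

METADATA_BASE = "http://169.254.169.254/latest/meta-data"


def fetch_metadata(path: str, timeout: float = 2.0) -> str:
    """Read one value from the instance metadata service (only works on EC2)."""
    with urllib.request.urlopen(f"{METADATA_BASE}/{path}", timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()


def service_hostname(az: str, subnet: str, service: str,
                     domain: str = "Domain.net") -> str:
    """Build a per-subnet name such as us-west-1a-ABCDEF-ELB.Domain.net."""
    return f"{az}-{subnet}-{service}.{domain}"


def discover(service: str) -> str:
    """Look up this instance's AZ and subnet, then build the service name."""
    az = fetch_metadata("placement/availability-zone")
    mac = fetch_metadata("mac")
    subnet = fetch_metadata(f"network/interfaces/macs/{mac}/subnet-id")
    return service_hostname(az, subnet, service)
```

On an instance, `discover("ELB")` would return the name to resolve through Amazon Route 53. Off EC2 only `service_hostname` is usable, since the metadata address is not reachable.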

Single subnet

1. us-west-1a-ABCDEF-ELB

2. us-west-1a-ABCDEF-RDS

[Diagram: Amazon Route 53 records 1 and 2 pointing into subnet ABCDEF, which holds app servers and game servers]

Whole region

us-west-1-ELB.Domain.net

[Diagram: subnets ABCDEF and ABABAB, each with app servers and game servers, across us-west-1a and us-west-1c]

Instance configuration

• We use SaltStack for server configuration

• Download configuration at startup

• Including the baked AMIs

• Using user data for startup scripting

• Allows quick changes to config without a rebake

• Tagging instances with our own data

5. Going global

Make life easier, VPN

• We VPN to all our VPCs from our office and OPS VPCs

• OPS VPCs are for operations services

• OPS in us-west-1, eu-west-1

• Use different IP subnets per VPC

• Direct private SSH to any instance is great

• Simplifies security group management

Region discovery

• UDP ping service for every region

• Measures QoS: Latency and packet loss

• Also verifies build availability at the same time

• Written in C, extremely efficient, doubles as a relay
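
The two QoS figures named above, latency and packet loss, can be derived from a batch of ping results as in the sketch below. The real service is written in C; this Python version is only an illustration, and the function name and the use of `None` for a lost packet are assumptions.

```python
from statistics import mean
from typing import Optional, Sequence, Tuple


def summarize_pings(rtts_ms: Sequence[Optional[float]]) -> Tuple[Optional[float], float]:
    """Return (average latency in ms, packet loss fraction).

    Each entry is a round-trip time in milliseconds, or None if no reply
    arrived before the timeout.
    """
    if not rtts_ms:
        raise ValueError("need at least one ping sample")
    replies = [r for r in rtts_ms if r is not None]
    loss = 1.0 - len(replies) / len(rtts_ms)
    # With no replies at all there is no meaningful latency to report.
    latency = mean(replies) if replies else None
    return latency, loss
```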

Region discovery

us-west-1-PING.Domain.net

[Diagram: subnets ABCDEF and ABABAB across us-west-1a and us-west-1c, each with app servers, game servers, and ping servers]

Region failover

• Every client discovers and ranks every region locally

• Second best is always known

• Seamless failover to other regions at any time
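
Client-side ranking of this kind might look like the sketch below. It is a hedged illustration: the scoring formula, the `loss_penalty_ms` weight, and the tuple layout are assumptions, since the slides say only that each client ranks every region and always knows the second best.

```python
from typing import Dict, List, Optional, Tuple

# (average latency in ms or None if unreachable, packet loss fraction, build available)
RegionQoS = Tuple[Optional[float], float, bool]


def rank_regions(results: Dict[str, RegionQoS],
                 loss_penalty_ms: float = 1000.0) -> List[str]:
    """Return usable region names, best first.

    Hypothetical score: latency plus a penalty proportional to packet loss.
    Regions that did not answer or lack the client's build are skipped.
    """
    scored = []
    for region, (latency, loss, has_build) in results.items():
        if latency is None or not has_build:
            continue
        scored.append((latency + loss * loss_penalty_ms, region))
    return [region for _, region in sorted(scored)]
```

The first entry is the preferred region and the second is the failover target, so a client can switch without measuring again.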

6. Auto Scaling

Difficult problem

• We can’t use AWS Auto Scaling

• Scaling metric is the ratio of active to available servers

• Scaling is done per individual subnet

• Metrics sourced directly from RDS databases

• We don’t need to worry about fragmentation

• Ready for some complex math?

Easy solution

• If we’re over 80% utilization scale up 10%

• If we’re under 60% utilization scale down 10%

• Sample often, act at set interval

• Interval must be longer than scale up time

• Track scale downs
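
The threshold rule above is small enough to write out. In this sketch the function name is hypothetical, and "available" is assumed to mean the total number of servers in the subnet. Integer percentages avoid floating-point rounding surprises.

```python
def scaling_delta(active: int, available: int,
                  high_pct: int = 80, low_pct: int = 60,
                  step_pct: int = 10) -> int:
    """Return how many servers to add (positive) or remove (negative).

    Over 80% utilization: scale up by 10%.
    Under 60% utilization: scale down by 10%.
    """
    if available <= 0:
        raise ValueError("available must be positive")
    # Ceiling division so a small fleet still changes by at least one server.
    change = (available * step_pct + 99) // 100
    if active * 100 > high_pct * available:
        return change
    if active * 100 < low_pct * available:
        return -change
    return 0
```

The sampling cadence, the wait between actions (longer than the scale-up time), and the tracking of scale downs are left to the caller.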

IT WORKS!

IT WORKS!

IT WORKS?

Easy fix

• Track highest peak over last week

• Take the larger of 10% of peak or 10% of current
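
One plausible reading of this fix is that 10% of a small fleet is too small a step to keep up with a fast rise, so the step is floored at 10% of the weekly peak. The sketch below follows that reading. The function name and the rounding up are assumptions.

```python
def scale_step(current: int, weekly_peak: int, step_pct: int = 10) -> int:
    """Servers to add or remove per scaling action.

    Uses the larger of 10% of the highest peak over the last week
    or 10% of the current server count.
    """
    def pct_of(n: int) -> int:
        # Ceiling division keeps the step at one or more for any non-empty fleet.
        return (n * step_pct + 99) // 100

    return max(pct_of(weekly_peak), pct_of(current))
```

For example, with 20 servers running and a weekly peak of 1,000, the step is 100 servers rather than 2.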

7. Monitoring and metrics

Track everything

• We use Graphite for our data collection

• Grafana for our visualization

• Tracking everything of interest from our applications

• StatsD on all instances to aggregate early
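
As an illustration of how an application can report to a local StatsD daemon, the sketch below formats a metric in the plain-text StatsD line format and sends it over UDP. The helper names and metric names are hypothetical. Port 8125 is the StatsD default, and whether the studio used it is not stated.

```python
import socket


def statsd_line(name: str, value: float, metric_type: str = "c") -> bytes:
    """Format one metric as a StatsD line: counter (c), gauge (g), or timer (ms)."""
    if metric_type not in ("c", "g", "ms"):
        raise ValueError(f"unsupported metric type: {metric_type}")
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(name: str, value: float, metric_type: str = "c",
                host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send a metric to the local StatsD daemon over UDP (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(statsd_line(name, value, metric_type), (host, port))
```

Because the daemon runs on the same instance, the send is cheap and never blocks the game server. StatsD then aggregates before forwarding to Graphite.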

Scalable Graphite

• Aggregation periods are important

• I/O requirements are pretty big

• We’re using I2 instances with their large ephemerals

• Striped and using LVM for snapshotting

• Graphite is single threaded

• Need to scale using multiple processes

Scalable Graphite

[Diagram: HAProxy, aggregator, two rows of four relays, four caches on ephemeral RAID-0, web app, Varnish]

Regions and failover

[Diagram: instances, HAProxy, relays, aggregator, Graphite, mirror]

Scaling Logstash

• Fairly straightforward to scale

• Be sure you’re filtering unnecessary data early

• Used for system and application logs for all servers

Crash collection

• We upload raw core dumps into Amazon S3

• Process them early on the machine with GDB

• Send processed data into aggregation app

• Web front end to view crash aggregation data

• Links to raw dumps in Amazon S3
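
On-machine processing of this kind could be sketched as below: run GDB in batch mode on the core dump, then reduce the backtrace to a short signature so that similar crashes group together. The signature scheme and all function names are assumptions. The slides do not describe how the aggregation app groups crashes.

```python
import hashlib
import subprocess
from typing import List


def gdb_backtrace_command(executable: str, core_path: str) -> List[str]:
    """Build a GDB batch invocation that prints the crashing thread's backtrace."""
    return ["gdb", "--batch", "-ex", "bt", executable, core_path]


def run_gdb(executable: str, core_path: str) -> str:
    """Run GDB on a core dump and return its textual output."""
    result = subprocess.run(gdb_backtrace_command(executable, core_path),
                            capture_output=True, text=True, check=False)
    return result.stdout


def crash_signature(backtrace: str, frames: int = 5) -> str:
    """Hash the function names of the top frames (hypothetical grouping key)."""
    names = []
    for line in backtrace.splitlines():
        parts = line.split()
        # Frame lines look like "#0  0x... in func (args) at file.c:12"
        # or "#0  func (args) at file.c:12".
        if len(parts) >= 2 and parts[0].startswith("#") and parts[0][1:].isdigit():
            name = parts[3] if len(parts) > 3 and parts[2] == "in" else parts[1]
            names.append(name)
        if len(names) == frames:
            break
    return hashlib.sha1("|".join(names).encode("utf-8")).hexdigest()[:12]
```

Addresses and argument values are ignored, so two dumps that crash in the same call chain get the same signature even when the data differs.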

Summary

• Running game servers in AWS works

• Automate everything

• Subnets, metadata, Route 53 for auto configuration

• Region failover through ping service

• Auto scaling by 10% when over 80% or under 60% utilization works

• Global scale of monitoring and operations

We’re hiring

• This is what we were working on last year

• What we’re doing now is bigger and better

• TurtleRockStudios.com

Thank you!

Remember to complete your evaluations!
