GAM404: Hunting Monsters in a Low-Latency Multiplayer Game on Amazon EC2
TRANSCRIPT
Wes Macdonald, Technical Director, Turtle Rock Studios
October 2015
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What to expect from the session
1. What is Evolve?
2. Real-time game simulation in AWS
3. REST-based server reservation service
4. Automation, key for success
5. Going global
6. Auto Scaling
7. Monitoring and metrics
What is Evolve?
• 5-player action game
• 4-player co-op hunter team
• 1 player-controlled monster
• Hunters chase the monster
• Monster is hunting wildlife
Heavy resource requirements
• Built on Crytek’s CRYENGINE®
• 5 human players plus 30+ AI wildlife
• CPU and memory requirements on par with 40+ player dedicated servers
What we’re working with
• Build process: Executables, assets, packaging
• Clients: Windows PC, Xbox One, PlayStation 4
• Server: Linux, stripped assets
Real-time game servers in the cloud?
• Traditionally, game servers are colocated
• Needed to purchase physical hardware
• Manually install and maintain that hardware
Real-time game servers in the cloud?
• Using a cloud service simplifies those issues
• Testing proved without a doubt it’s possible
• Resource allocation is strict
• CPU, memory, I/O are all predictable
• We shipped it
Hardware requirements
• Must be memory-bound, not CPU-bound
• Automated testing: all-bot rounds
• C3 instances gave us the best CPU-to-memory ratio
• RAID-0 stripe across ephemeral disks
• Space for executables, assets, logs, and core dumps
• We have swap space for emergencies
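As a sketch of the storage setup described above, here is how the mdadm invocation for striping two ephemeral disks into one RAID-0 array could be assembled. The device names and array path are illustrative assumptions, not taken from the talk:

```python
# Hypothetical sketch: build the mdadm command that stripes a set of
# ephemeral disks into a single RAID-0 array (no redundancy, max throughput).
# Device names and the array path are illustrative assumptions.

def mdadm_raid0_command(devices, array="/dev/md0"):
    """Build the mdadm invocation that stripes `devices` into one RAID-0 array."""
    if len(devices) < 2:
        raise ValueError("RAID-0 striping needs at least two devices")
    return [
        "mdadm", "--create", array,
        "--level=0",                          # RAID-0: striping only
        f"--raid-devices={len(devices)}",
    ] + list(devices)

cmd = mdadm_raid0_command(["/dev/xvdb", "/dev/xvdc"])
```

The resulting list would be passed to something like `subprocess.run` at instance bootstrap; the array then holds executables, assets, logs, and core dumps.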
Network requirements
• Enhanced networking is a benefit of C3
• Latency to all instance types was good
• The Internet is the real latency variable
• Low-bandwidth UDP protocol
• Optimized for P2P
Matchmaking and lobbies
• Matchmaking: XBL, PSN, Steam
• Lobby is always P2P
• Client-host; fallback or cost reduction
[Diagram: a client-host (H) and clients (C) form a P2P lobby; matchmaking sits behind an external ELB in a region]
Server reservation request
• Servers checking in
• Poll interval decreases with reservation
• Database is only for the transactional data requirement
• No persistence needed
[Diagram: game instances and app servers behind an external ELB handle the reservation request]
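The check-in schedule above can be sketched as a simple function: idle game servers poll the reservation service on a relaxed heartbeat, and once a reservation lands the interval drops so state changes propagate quickly. The interval values here are illustrative assumptions, not production numbers:

```python
# Hypothetical poll-interval schedule for game-server check-ins.
# Interval values are illustrative, not the ones used in production.

IDLE_INTERVAL_S = 30      # relaxed heartbeat while waiting for a reservation
RESERVED_INTERVAL_S = 5   # tight loop once a match is being set up

def poll_interval(reserved: bool) -> int:
    """Return how long a game server should wait before its next check-in."""
    return RESERVED_INTERVAL_S if reserved else IDLE_INTERVAL_S
```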
Server is reserved
• IP/port/server to host
• Host forwards details
• P2P host migration
[Diagram: client-host (CH) and clients (C) connect to the reserved game instance through an external ELB in a region]
Server automation system
• Unified application for operations and monitoring
• This gives us blanket authentication, authorization, accounting
• Using AWS SDK
• Very few people have access to AWS
• Every action is predefined, from starting a server to provisioning an entire region
Server automation system
[Diagram: operations staff reach a web app and API through an HTTPS ELB and proxy; a processor backed by user and task databases executes actions via the AWS SDK, including Auto Scaling; Grafana, Graphite, and Logstash handle monitoring]
Build distribution
• Every build uploaded into Amazon S3
• 20 GB per file through Amazon CloudFront is a real cost
• Use baked AMIs instead
• Baked before entering production
Build distribution
Game Instance
EBS RAID0 Ephemeral
• Every build into Amazon S3
• Dev from Amazon S3
• Bake AMI
• Production from AMI
We have dependencies
• MySQL servers
• ELB load balancers
• Salt configuration servers
• Many other services we haven’t covered
• An instance needs to discover dependencies easily
Instance metadata (169.254.169.254)
• Gives us everything we need for auto discovery
• No SDK, no AWS Identity and Access Management (IAM) role required, no complexity
• Use subnet as our container
• Build a DNS address out of it in Amazon Route 53
• Region-AZ-subnet-service.Domain.net
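The naming scheme above can be sketched as two small string builders, one per subnet and one region-wide. "Domain.net" stands in for whatever domain the team registered in Route 53, and the subnet label format is an assumption based on the slides:

```python
# Sketch of the Route 53 auto-discovery naming scheme described above.
# "Domain.net" is the placeholder domain from the slides.

def service_dns(az: str, subnet_label: str, service: str,
                domain: str = "Domain.net") -> str:
    """Discovery name for a service within one subnet, e.g. for an ELB or RDS."""
    return f"{az}-{subnet_label}-{service}.{domain}"

def region_dns(region: str, service: str, domain: str = "Domain.net") -> str:
    """Region-wide discovery name spanning all subnets."""
    return f"{region}-{service}.{domain}"
```

An instance only needs its own region, AZ, and subnet (all available from instance metadata) to construct the address of every dependency it has.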
[Diagram: within subnet ABCDEF, Amazon Route 53 resolves (1) us-west-1a-ABCDEF-ELB and (2) us-west-1a-ABCDEF-RDS for the app servers and game servers]
Whole region
[Diagram: us-west-1-ELB.Domain.net covers subnets ABCDEF and ABABAB across us-west-1a and us-west-1c, each with app servers and game servers]
Instance configuration
• We use SaltStack for server configuration
• Download configuration at startup
• Including the baked AMIs
• Using user data for startup scripting
• Allows quick changes to config without a rebake
• Tagging instances with our own data
1. What is Evolve?
2. Real-time game simulation in AWS
3. REST-based server reservation service
4. Automation, key for success
5. Going global
6. Auto Scaling
7. Monitoring and metrics
Make life easier: VPN
• We VPN to all our VPCs from our office and OPS VPCs
• OPS VPCs are for operations services
• OPS in us-west-1, eu-west-1
• Use different IP subnets per VPC
• Direct private SSH to any instance is great
• Simplifies security group management
Region discovery
• UDP ping service for every region
• Measures QoS: Latency and packet loss
• Also verifies build availability at the same time
• Written in C, extremely efficient, doubles as a relay
Region discovery
[Diagram: us-west-1-PING.Domain.net resolves to ping servers alongside the app and game servers in subnets ABCDEF (us-west-1a) and ABABAB (us-west-1c)]
Region failover
• Every client discovers and ranks every region locally
• Second best is always known
• Seamless failover to other regions at any time
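The client-side ranking described above can be sketched as follows. Each client scores every region from its measured latency and packet loss, keeps the full ordering, and so always knows the second-best region for failover. The scoring formula and its loss weight are assumptions for illustration:

```python
# Sketch of client-side region ranking. The QoS formula and the
# packet-loss weight are illustrative assumptions, not the shipped values.

def qos_score(latency_ms: float, loss_pct: float) -> float:
    """Lower is better; packet loss is penalized heavily (assumed weight)."""
    return latency_ms + loss_pct * 50.0

def rank_regions(samples: dict) -> list:
    """samples: region -> (latency_ms, loss_pct). Returns regions best-first."""
    return sorted(samples, key=lambda r: qos_score(*samples[r]))

ranked = rank_regions({
    "us-west-1": (30.0, 0.0),
    "eu-west-1": (140.0, 0.0),
    "us-east-1": (70.0, 2.0),   # low latency but lossy: ranks last here
})
best, second = ranked[0], ranked[1]
```

If the best region becomes unavailable, the client fails over to `second` without re-probing.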
Difficult problem
• We can’t use AWS Auto Scaling
• Scaling metric is the ratio of active to available servers
• Scaling is done per individual subnet
• Metrics sourced directly from RDS databases
• We don’t need to worry about fragmentation
• Ready for some complex math?
Easy solution
• If we’re over 80% utilization scale up 10%
• If we’re under 60% utilization scale down 10%
• Sample often, act at a set interval
• Interval must be longer than scale-up time
• Track scale-downs
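The 80/60 rule above can be written as a single decision function: scale the subnet's fleet up 10% when the active/available ratio exceeds 80%, down 10% below 60%, otherwise hold. This is a sketch of the policy as stated, not Turtle Rock's actual code; the guard against scaling below current demand is an added assumption:

```python
# Sketch of the per-subnet scaling policy described in the talk.
# The floor on scale-down (never below active demand) is an assumption.

def desired_capacity(active: int, available: int) -> int:
    """Return the new server count for one subnet given current usage."""
    utilization = active / available
    if utilization > 0.80:
        # scale up 10%, always adding at least one server
        return max(available + 1, round(available * 1.10))
    if utilization < 0.60:
        # scale down 10%, but never below the servers currently in use
        return max(active, round(available * 0.90))
    return available
```

Sampled often, but only acted on at an interval longer than the time it takes a new server to come up, this avoids oscillation.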
Track everything
• We use Graphite for our data collection
• Grafana for our visualization
• Tracking everything of interest from our applications
• StatsD on all instances to aggregate early
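The "aggregate early" idea can be illustrated with a toy StatsD-style counter: events are accumulated in memory on the instance, and one summed value per metric is flushed per interval instead of one network packet per event. This is a minimal sketch of the pattern, not the StatsD implementation itself:

```python
# Toy StatsD-style pre-aggregation: buffer increments locally, flush one
# summed value per metric per interval. Metric names are made-up examples.
from collections import defaultdict

class MiniStatsD:
    def __init__(self):
        self.counters = defaultdict(int)

    def incr(self, metric: str, n: int = 1):
        self.counters[metric] += n          # cheap in-memory accumulate

    def flush(self) -> dict:
        """Return the aggregated counters and reset for the next interval."""
        out, self.counters = dict(self.counters), defaultdict(int)
        return out

s = MiniStatsD()
for _ in range(3):
    s.incr("game.round.started")
s.incr("game.crash", 2)
```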
Scalable Graphite
• Aggregation periods are important
• I/O requirements are pretty big
• We’re using I2 instances with their large ephemerals
• Striped and using LVM for snapshotting
• Graphite is single threaded
• Need to scale using multiple processes
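The multi-process scaling trick can be sketched as a routing function: a relay hashes each metric name to pick one cache process, so every sample for a given metric always lands on the same single-threaded cache. Carbon's real relay uses a consistent-hash ring; plain modulo is shown here for brevity:

```python
# Sketch of relay-to-cache routing for a multi-process Graphite setup.
# carbon-relay uses a consistent-hash ring; modulo hashing shown for brevity.
import hashlib

def route(metric: str, n_caches: int) -> int:
    """Pick a stable cache index for a metric name."""
    digest = hashlib.md5(metric.encode()).hexdigest()
    return int(digest, 16) % n_caches
```

The key property is stability: the same metric always routes to the same cache, so each cache owns a disjoint slice of the namespace.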
Scalable Graphite
[Diagram: HAProxy load-balances into relay tiers and an aggregator; relays fan out to multiple carbon-cache processes writing to ephemeral RAID-0; a web app behind Varnish serves queries]
Regions and failover
[Diagram: instances in each region send metrics through a local HAProxy and relay tier, which forward to the central Graphite cluster and a mirror]
Scaling Logstash
• Fairly straightforward to scale
• Be sure you’re filtering unnecessary data early
• Used for system and application logs for all servers
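"Filtering unnecessary data early" can be illustrated with a shipping-side predicate that drops chatty lines before they ever reach Logstash. The patterns below are made-up examples, not the studio's actual filters:

```python
# Illustrative early log filter; prefixes and substrings are made-up examples.
NOISY_PREFIXES = ("DEBUG", "TRACE")
NOISY_SUBSTRINGS = ("heartbeat ok",)

def should_ship(line: str) -> bool:
    """Return True only for log lines worth sending to Logstash."""
    if line.startswith(NOISY_PREFIXES):
        return False
    return not any(s in line for s in NOISY_SUBSTRINGS)
```

Dropping on the instance keeps both network traffic and the Logstash tier's CPU in check as the fleet grows.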
Crash collection
• We upload raw core dumps into Amazon S3
• Process them early on the machine with GDB
• Send processed data into aggregation app
• Web front end to view crash aggregation data
• Links to raw dumps in Amazon S3
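The on-host processing step might look like the following: build a batch-mode GDB invocation that extracts a full backtrace from the core dump before the raw dump goes to S3. The paths are hypothetical and the flag set is kept minimal; treat the specifics as assumptions:

```python
# Sketch of on-host core-dump processing with GDB in batch mode.
# Paths are hypothetical; the gdb flag set is a minimal assumption.

def gdb_backtrace_cmd(executable: str, core: str) -> list:
    """gdb in batch mode: print a full backtrace for every thread."""
    return [
        "gdb", "--batch",
        "-ex", "thread apply all bt full",  # full backtrace, all threads
        executable, core,
    ]

cmd = gdb_backtrace_cmd("/opt/evolve/bin/server", "/cores/core.1234")
```

The captured output is what gets sent to the aggregation app, while the raw dump is uploaded to S3 and linked from the web front end.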
Summary
• Running game servers in AWS works
• Automate everything
• Subnets, metadata, Route 53 for auto configuration
• Region failover through ping service
• Auto scaling 10% over/under the 80%/60% utilization thresholds works
• Global scale of monitoring and operations
We’re hiring
• This is what we were working on last year
• What we’re doing now is bigger and better
• TurtleRockStudios.com