State and the Event Log: How to Manage State in Apache Kafka Ben Abramson

Post on 24-May-2020

Introduction

In this talk, we will explore good and bad practice around managing state with Apache Kafka. We will illustrate the dangers of assuming that state can be completely managed through an event log, and show how non-functional requirements are key to developing a reliable design in a Kafka-based system.

State & the Event Log

• The world has a state, and events occur in it.

• The state dictates what events occur.

• The events change the state.

• They are not interchangeable: a persistent event log is not a substitute for a state store, any more than a state store is a substitute for a log of events.
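
The relationship between state and events can be sketched in a few lines. This is only an illustration: the `MatchState` fields, the event names and the validation rule are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MatchState:
    """Hypothetical game state for one football match."""
    in_play: bool = False
    home_goals: int = 0
    away_goals: int = 0


def is_allowed(state, event):
    """The state dictates what events occur: no goals before kick-off."""
    if event == "kick_off":
        return not state.in_play
    return state.in_play


def apply_event(state, event):
    """The events change the state: a valid event yields a new state."""
    if not is_allowed(state, event):
        raise ValueError(f"event {event!r} is not valid in state {state}")
    if event == "kick_off":
        return replace(state, in_play=True)
    if event == "home_goal":
        return replace(state, home_goals=state.home_goals + 1)
    if event == "away_goal":
        return replace(state, away_goals=state.away_goals + 1)
    raise ValueError(f"unknown event {event!r}")


# The log records what happened; the state records where things stand now.
event_log = ["kick_off", "home_goal", "away_goal", "home_goal"]
state = MatchState()
for event in event_log:
    state = apply_event(state, event)
```

Answering "what is the score?" from `state` is a single lookup, whereas answering it from `event_log` means folding over every entry.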

Who are we and what do we do?

• William Hill is one of the oldest and most well-established companies in the gaming industry

• The Trading department of William Hill “trade” what happens in a sports event.

• We manage odds for c. 200k sports events a year. We publish odds for the company and result the markets once events have concluded.

• We cater for both traditional pre-match markets and in-play markets

• We have been building applications based on messaging technology for a long time, as it suits our event-based use cases

What Do We Do? (In the simplest terms…)

What we were asked to build

Features of the System

• A combination of long-living data (fixtures) and short-living data (in-play)

• A combination of stateless and stateful services

• Pricing models require the full state of an event (i.e. everything that has happened)

• Input data is stateless deltas

• Microservices wired together with Kafka

• Kafka used as both a data transport mechanism and data storage mechanism

• Viewed as a stateless pipeline of delta messages, it is very simple

• Introduce state management, and it gets more complex

Attempt #1: The State is in the Event Log

The Design

• Publish delta messages to a topic

• Develop downstream apps that have no persisted state

• When they start, they read from the beginning (or an earlier offset) and build any state

• This sounds good in theory, but once you start thinking about durability and start-up time, it can cause headaches

• Once you ramp up to lots of topics, the state becomes distributed
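
A minimal sketch of this design, with a Python list standing in for a Kafka topic. The message shape (`event_id`, `delta`) is an assumption made for illustration, not the format used in the real system.

```python
from collections import defaultdict


def rebuild_state(topic, from_offset=0):
    """Replay delta messages from an offset and fold them into state.

    Returns the rebuilt state and the number of messages that had to be read.
    """
    state = defaultdict(dict)
    messages_read = 0
    for message in topic[from_offset:]:
        state[message["event_id"]].update(message["delta"])
        messages_read += 1
    return dict(state), messages_read


# Hypothetical delta messages; in Kafka these would sit on a topic partition.
topic = [
    {"event_id": "match-1", "delta": {"status": "pre-match"}},
    {"event_id": "match-1", "delta": {"status": "in-play"}},
    {"event_id": "match-2", "delta": {"status": "pre-match"}},
    {"event_id": "match-1", "delta": {"home_goals": 1}},
]

full_state, messages_read = rebuild_state(topic)
partial_state, _ = rebuild_state(topic, from_offset=3)
```

Every start-up pays for `messages_read`, which grows with the length of the topic. Starting from a later offset is cheaper, but `partial_state` no longer knows that `match-1` is in play or that `match-2` exists.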

Sporting Events: Game State & Game Events

• We have to process game events – e.g. a goal, a substitution

• The game event affects the game state and the prices (odds) we offer

• But we weren’t storing the game state

• This is OK for ‘dumb pass-through’

• However, edge cases and performance soon gave us headaches

• If you have multiple topics, which one contains the state?

The Sound of Inevitability

If you do not have a ‘golden copy’ of state (i.e. a single source of the truth), Global State Mutation becomes an inevitability.

Data Storage versus Data Transportation

• Kafka can be used for data storage and data transportation

• Exercise caution when mixing these use cases

• Using it for transportation implies lots of topics to wire up lots of services

• Using Kafka for data storage as a means of ensuring durability is compatible with using it for transportation

• Using it for state storage and transportation can cause problems: Where is your Golden Copy?

The Abramson Law of Event Log Systems

“In a stateful system, it is essential to have a Golden Copy of the Truth of your state, available in real time” - Me, right now

Sporting Events: Game State & Game Events

• There were a lot of cases of ‘in order to process event X, we need to know Y about the event’

• Some parts of the system could work fine on deltas, but some parts required full state

• Initially, we solved this through either caching, or adding state to messages in the form of ‘metadata’

• This quickly got out of hand

The Classic Kafka Anti-pattern

“To get the full state, you just read the event log from the beginning, it’s easy, and you don’t need a state store.” - People who don’t understand Kafka

• Reading up to millions or even billions of messages can be pretty slow

• The event log is not a replacement for the state store

Attempt #2: State Where Necessary (in Metadata or Cached)

Caching

• In the cases of ‘in order to process event X, we need to know Y about the event’, we started to cache data

• For example, corners:
  • Corner markets are based on the number of corners
  • A message from a feed says that the home team has a corner
  • Punters want bets paid out as quickly as possible
  • Where do you keep the running totals? You can cache it…

Caching

• This works until your service restarts…

• Caching without an underlying layer of persistence is transient, which makes failure recovery a pain
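
The corner-count problem can be reproduced with a purely in-memory cache. `CornerCounter` is a hypothetical service invented for this sketch, and the restart is simulated by creating a fresh instance.

```python
class CornerCounter:
    """Hypothetical service that keeps running corner totals in memory only."""

    def __init__(self):
        self.totals = {}  # (match_id, team) -> corners seen so far

    def on_corner(self, match_id, team):
        """Handle a 'corner' delta from a feed and return the running total."""
        key = (match_id, team)
        self.totals[key] = self.totals.get(key, 0) + 1
        return self.totals[key]


service = CornerCounter()
for _ in range(3):
    service.on_corner("match-1", "home")
before_restart = service.totals[("match-1", "home")]

# Simulated restart: a new process starts with an empty cache.
service = CornerCounter()
after_restart = service.on_corner("match-1", "home")
```

After the restart the fourth corner is counted as the first, so any market settled on the total is now wrong unless the service first re-reads every earlier corner message.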

Metadata

• We started adding ‘metadata’ to messages that required some of the state

• One or two fields you can work with, but we ended up with a lot of data

• We found that metadata is harder to type, so a lot of it became unstructured
  • Map<String, Map<String, List<Object>>>

• Messages became very large and unwieldy
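
To illustrate why untyped metadata is harder to work with, the sketch below puts a Python equivalent of that nested map next to a small typed record. The field names and the misspelt key are made up for the example.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Python analogue of Map<String, Map<String, List<Object>>>:
# nothing constrains which keys exist or what the values mean.
Metadata = Dict[str, Dict[str, List[Any]]]

# A misspelt key ("awya") is accepted without complaint.
untyped: Metadata = {"corners": {"home": [3], "awya": [1]}}


def away_corners_untyped(meta):
    """Every reader has to guess the keys and defend against missing data."""
    values = meta.get("corners", {}).get("away", [])
    return values[0] if values else None


@dataclass(frozen=True)
class CornerCounts:
    """Typed alternative: the structure is explicit and checked on creation."""
    home: int
    away: int


typed = CornerCounts(home=3, away=1)
```

With the untyped map, the typo only shows up later as a silent `None` in a downstream service. With the typed record, `CornerCounts(home=3, awya=1)` fails immediately with a `TypeError`.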

Failure Recovery

The benefits of caching and metadata proved to be limited.

• While caching & metadata solved some problems initially, they introduced others

• With a non-persistent cache, failure means re-reading from the start

• With large messages, re-reading from the start becomes a big job

• Some of our services were taking up to 40 minutes to start

Attempt #3: Full State Messages

State in services or state in messages?

Without an explicit state store, your state needs to be either in your services or in your messages

Why pass full state?

• Failure recovery difficulties meant that some of the caching we did came with drawbacks – full state fixes that

• Full state allowed many of the services to be stateless – everything they needed to process was on the message

• State was maintained in the feed handlers ingesting 3rd party deltas

• Deltas from 3rd party feeds were converted into full state messages and passed on

Why pass full state?

• Very durable & performant

• Messages can be processed asynchronously, and a service can then just pick up from the current offset

Full State Messages

• We are passed stateless deltas by third-party feeds

• The applications reading the feeds would build full state using a state store

• The full state message is passed downstream

• Not really an event log now
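
A rough sketch of a feed handler that turns deltas into full state messages. The `FeedHandler` class, its dictionary "store" and the message layout are assumptions for illustration; the talk does not show the real implementation.

```python
import copy


class FeedHandler:
    """Hypothetical feed handler: ingests deltas, publishes full state."""

    def __init__(self):
        self.store = {}  # event_id -> full state; stands in for a state store

    def on_delta(self, event_id, delta):
        """Merge a third-party delta and return the message to publish."""
        state = self.store.setdefault(event_id, {})
        state.update(delta)
        # Publish a snapshot rather than a reference, so that later deltas
        # cannot mutate a message that has already been sent.
        return {"event_id": event_id, "state": copy.deepcopy(state)}


handler = FeedHandler()
published = [
    handler.on_delta("match-1", {"status": "in-play"}),
    handler.on_delta("match-1", {"home_goals": 1}),
    handler.on_delta("match-1", {"home_corners": 4}),
]
latest = published[-1]
```

A downstream service that reads only `latest` has everything it needs, which is what allows it to be stateless and to start from the current offset.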

Problems with full state

Full state brought some powerful durability guarantees, but there were drawbacks.

• Messages could get very unwieldy in size – some were up to 9Mb

• Some services became inefficient – having to parse massive messages to look at one or two pieces of data

• Understanding the full message requires a lot of knowledge

• Inefficient: essentially an event log where every entry is an event log in itself
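
The last point can be made concrete by comparing the bytes published as deltas with the bytes published as full state. The message contents are invented, and JSON is used only as a convenient way to measure size.

```python
import json


def publish_sizes(n_events):
    """Return (bytes sent as deltas, bytes sent as full state messages)."""
    history = []
    delta_bytes = 0
    full_state_bytes = 0
    for seq in range(n_events):
        delta = {"seq": seq, "type": "corner", "team": "home"}
        history.append(delta)
        # Delta pipeline: each message carries only the new event.
        delta_bytes += len(json.dumps(delta))
        # Full state pipeline: each message carries everything so far.
        full_state_bytes += len(json.dumps({"history": history}))
    return delta_bytes, full_state_bytes


delta_total, full_total = publish_sizes(100)
```

Delta traffic grows linearly with the number of events, while full state traffic grows roughly with its square, because message k repeats the k - 1 events before it.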

FYI…

Passing full state is not an absolute no-no; it might be appropriate in some use cases (but probably not this one).

Abandoned Attempt: Kafka Streams

Kafka Streams

• Kafka Streams offered some solutions to some of our problems
  • Fast data access
  • Exactly-once processing semantics
  • Continually updated dataset
  • Powerful data processing operations supported

Kafka Streams

While Kafka Streams is very powerful, there were some issues.

• Kafka Streams was very immature at the time (2017)

• Failure recovery scenarios again proved to be the Achilles heel

• Failover involved hard-killing the app and restarting it

• The time taken to rebuild datasets meant it could not be supported inside SLAs

• We abandoned Kafka Streams, but it is much more robust now

Attempt #4: State Store

Storage Options

• The three data storage options

• Limitations with caches and Kafka as a store meant revisiting state stores

Return of the State Store

• Re-introducing the state store brought all the benefits:
  • Golden copy of the truth
  • Fully persistent
  • Readily available

• Immediately solved many of the performance and persistence problems

• Allowed us to rationalise & simplify our estate

State Store: Simplification

• Enabled rationalisation of the estate: in many places, two services joined by a Kafka topic could be changed into one service with a Cassandra table

• In some places, four services were replaced by one

• Many problems around start-up time and durability were solved – simply read the database table and start reading from the latest or last committed offset

• The database introduced more latency, but this was somewhat offset by the reduction in Kafka hops

• Querying the state of something could be done in real time
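
The "read the table and resume from the last committed offset" start-up can be sketched as follows. Several things here are assumptions: SQLite (from the Python standard library) stands in for Cassandra, a list stands in for the Kafka topic, and storing the offset in the same database as the state is a choice made for this sketch rather than something the talk describes.

```python
import sqlite3


def open_store(conn):
    """Create the state table and the offset table if they do not exist."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS corners (match_id TEXT PRIMARY KEY, total INTEGER)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS offsets (topic TEXT PRIMARY KEY, next_offset INTEGER)"
    )


def process(conn, topic_name, log):
    """Resume from the stored offset and return how many messages were read."""
    row = conn.execute(
        "SELECT next_offset FROM offsets WHERE topic = ?", (topic_name,)
    ).fetchone()
    start = row[0] if row else 0
    processed = 0
    for offset in range(start, len(log)):
        match_id = log[offset]
        with conn:  # state change and offset are committed in one transaction
            conn.execute("INSERT OR IGNORE INTO corners VALUES (?, 0)", (match_id,))
            conn.execute(
                "UPDATE corners SET total = total + 1 WHERE match_id = ?", (match_id,)
            )
            conn.execute(
                "INSERT OR REPLACE INTO offsets VALUES (?, ?)", (topic_name, offset + 1)
            )
        processed += 1
    return processed


conn = sqlite3.connect(":memory:")
open_store(conn)
log = ["match-1", "match-1", "match-2"]  # each entry is one corner delta
first_run = process(conn, "corners", log)

log.append("match-1")                        # a new delta arrives
second_run = process(conn, "corners", log)   # "restart": only the new delta is read
total = conn.execute(
    "SELECT total FROM corners WHERE match_id = 'match-1'"
).fetchone()[0]
```

The second call reads one message instead of four, and the running total is a single query at any time.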

State Store: Considerations

Think carefully about how and where to store and access data.

• In a microservice architecture, each service should have its own store, completely isolated from everything else.

• This works quite well in the context of data services, but becomes a bit more complex with a data pipeline

State Store: Classic Data Service Microservices

State Store: Data Enrichment Pipeline

State Stores and Services

When applying state stores to event log services, there are a number of options:

• Closed state store: table(s) that are solely used by one service and totally internal to that service

• Open state store: table(s) that are owned by one service, but contain data that is useful to other services. The owning service is the writer and manager of that state, but many services can read

• There are exceptions, e.g. the sub-system state store – information coming from different sources that needs to be amalgamated
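
The open state store rule (one writer, many readers) can be sketched in-process with a read-only view. `OwningService` and its score table are hypothetical; in a real system the same rule would be enforced through database permissions or team convention rather than a Python object.

```python
from types import MappingProxyType


class OwningService:
    """Hypothetical owner of an open state store: the only writer."""

    def __init__(self):
        self._table = {}  # match_id -> (home_goals, away_goals)
        # Live, read-only view handed out to other services.
        self.read_view = MappingProxyType(self._table)

    def record_score(self, match_id, home_goals, away_goals):
        self._table[match_id] = (home_goals, away_goals)


owner = OwningService()
reader_view = owner.read_view            # what another service would hold
owner.record_score("match-1", 1, 0)
score_seen_by_reader = reader_view["match-1"]
```

Readers see the owner's writes as soon as they happen, but an attempt such as `reader_view["match-1"] = (9, 9)` raises a `TypeError`, so there is still exactly one place where the state is changed.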

State Store: Sub System Data Store

Never Attempted: Change Data Capture

Change Data Capture

What is Change Data Capture?

• Creating state from an event log is problematic, but creating an event log from state changes over time is pretty easy

• This is something the state store already does – the log file

• Change data capture is publishing changes to state in real time

Change Data Capture

• Data store provides Golden Copy

• Build the state first and feed deltas downstream

• The state is available to anyone downstream who needs it

Change Data Capture

In hindsight, this probably would have been the best approach.

• Feed handlers ingesting 3rd party data could feed data into data stores

• CDC creates a single synchronous event log for all changes

• Downstream apps can subscribe and act on messages they care about

Change Data Capture

• Changes from multiple feeds update state in the state store

• This publishes to the change log via CDC

• Acts as a triggering mechanism for pricing

• Full state is obtained by going back up to the state store
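
The flow on this slide can be sketched in memory. Everything named here is hypothetical: `StateStore` stands in for the database, `change_log` for the CDC topic, and synchronous callbacks for what would be asynchronous consumers in a real deployment.

```python
class StateStore:
    """Hypothetical state store that emits a change record for every write."""

    def __init__(self):
        self._state = {}
        self.change_log = []    # stands in for the CDC topic
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def apply_delta(self, event_id, delta):
        """Update the golden copy first, then publish what changed."""
        self._state.setdefault(event_id, {}).update(delta)
        change = {
            "offset": len(self.change_log),
            "event_id": event_id,
            "changed": sorted(delta),
        }
        self.change_log.append(change)
        for callback in self._subscribers:
            callback(change)

    def get(self, event_id):
        """Return a copy of the full current state for one event."""
        return dict(self._state[event_id])


store = StateStore()
priced = []


def pricing_trigger(change):
    # The change record is only a trigger; full state comes from the store.
    priced.append(store.get(change["event_id"]))


store.subscribe(pricing_trigger)
store.apply_delta("match-1", {"status": "in-play"})  # from one feed
store.apply_delta("match-1", {"home_goals": 1})      # from another feed
```

The change records stay small, and pricing never has to replay the log to discover that the match is in play.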

Limitations: Hardware-Limitation-Driven Architecture

Hardware Limitation 1

• In order to save money, our Kafka nodes only had 60Gb disks

• The entire cluster had < 1Tb of disk

• This meant we had to be very aggressive with compaction & deletion

• This compounded our problems with both full state (big) messages and our requirement to re-read from the start of a topic to deduce state

• This created more anti-patterns
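
A back-of-the-envelope calculation shows how tight this was. The disk size and the 9Mb full state message size are taken from the slides, read here as gigabytes and megabytes in decimal units. The node count and the replication factor are assumptions: 16 is simply the largest number of 60 GB disks that fits under 1 TB, and a replication factor of 3 is a common choice that the talk does not confirm.

```python
DISK_PER_NODE_GB = 60     # from the slides
MAX_MESSAGE_MB = 9        # largest full state message, from the slides
NODES = 16                # assumption: most 60 GB disks that fit under 1 TB
REPLICATION_FACTOR = 3    # assumption: not stated in the talk

raw_gb = NODES * DISK_PER_NODE_GB                 # total disk in the cluster
usable_gb = raw_gb // REPLICATION_FACTOR          # space for one copy of the data
max_messages = usable_gb * 1000 // MAX_MESSAGE_MB # worst-case messages retained
```

Under these assumptions the cluster holds 960 GB raw, 320 GB of unique data, and only about 35,000 worst-case full state messages across every topic, before allowing for indexes or headroom.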

Anti-patterns Driven from Hardware Limitations

• Limitations on disk meant that the primary topic had a short retention time – 4 hours

• We would write apps that would consume off those topics, filter certain messages and then write them to intermediate topics. This was common for pre-match messages

• The intermediate topics then had longer retentions

• Bigger disks would have been a far better and more cost-effective solution

• The intermediate topics also further dissipated state

Hardware Limitation 2

• William Hill has its own self-built internal cloud

• One of the limitations is no durability or transferability in disk storage (i.e. you cannot reliably write to local disk)

• This limited options for storing data on a service basis

The Failover Russian Doll

• Too many components trying to manage durability & failover

• They could conflict, causing recovery problems

Cultural Lessons

• It’s really important that your ops people and hardware suppliers understand the requirement for lots of disk

• If they are not used to Big Data systems, they will probably push back on hardware requirements

• However, more disk is significantly cheaper than the cost of coding around the issues

• Also, you don’t need expensive disks – commodity hardware will do, because Kafka is designed to handle failure

• Make failover mechanisms simple and not overly layered

Summary

• The event log is not a substitute for the state store. It may contain the state, but in most use cases, it is not easily accessible

• There are some use cases for building state from the start – error replaying, reporting, but building state in real time is not one of them

• Use caution when mixing the Kafka use cases of data transport and data storage

• Think carefully about how to maintain a ‘Golden Copy of the Truth’

• Change Data Capture is a powerful pattern

Summary (cont.)

• If you have requirements for availability and persistence, you will need a fast-access, persistent place for your data. In certain circumstances, Kafka will struggle to do both.

• Kafka has changed how we think about message-oriented & event log architecture. Good practice is still emerging, so don’t be afraid to try different things, but be prepared to be wrong.

• Ensure that everyone understands the implications, not just the developers

Resources

• Consult with Confluent

• Kafka: The Definitive Guide
  • https://www.confluent.io/resources/kafka-the-definitive-guide/

• GitHub Examples
  • https://github.com/confluentinc/examples
  • https://github.com/confluentinc/kafka-streams-examples

• Confluent Enterprise Reference Architecture
  • https://www.confluent.io/whitepaper/confluent-enterprise-reference-architecture/

Questions

Thank You