PostgreSQL + Kafka: The Delight of Change Data Capture


Jeff Klukas - Data Engineer at Simple


Overview

Commit logs: what are they?

Write-ahead logging (WAL)

Commit logs as a data store

Demo: change data capture

Use cases


https://www.confluent.io/blog/hands-free-kafka-replication-a-lesson-in-operational-simplicity/

Commit Logs

Ordered, immutable, durable.

In practice, old logs can be deleted or archived.

Write-Ahead Logging (WAL)


– https://www.postgresql.org/docs/current/static/wal-intro.html

“WAL's central concept is that changes to data files (where tables and indexes reside) must be written only after those changes have been logged, that is, after log records describing the changes have been flushed to permanent storage”
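As a quick illustration (not part of the original slides), you can watch the WAL position advance from psql; the function names below are the PostgreSQL 10+ spellings (9.x uses pg_current_xlog_location and friends):

-- current WAL write location
SELECT pg_current_wal_lsn();

-- current WAL insert location (the in-memory point where new records are added)
SELECT pg_current_wal_insert_lsn();

-- any committed INSERT/UPDATE/DELETE moves these log sequence numbers (LSNs) forward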


– https://www.postgresql.org/docs/9.4/static/logicaldecoding-explanation.html

“Logical decoding is the process of extracting all persistent changes to a database's tables into a coherent, easy to understand format which can be interpreted without detailed knowledge of the database's internal state.”
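The built-in test_decoding output plugin gives a hands-on feel for this. A minimal sketch, assuming wal_level = logical and a free replication slot (CDC tools such as Bottled Water and Debezium install their own output plugins, but the mechanism is the same):

-- create a logical replication slot that decodes changes with test_decoding
SELECT * FROM pg_create_logical_replication_slot('demo_slot', 'test_decoding');

-- make a change somewhere in the database, then inspect the decoded stream
-- without consuming it ...
SELECT * FROM pg_logical_slot_peek_changes('demo_slot', NULL, NULL);

-- ... or consume it, advancing the slot's position
SELECT * FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);

-- drop the slot when done so it stops holding WAL segments
SELECT pg_drop_replication_slot('demo_slot');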


Topic Partitions


Topics


Compacted Topics

(A compacted topic retains at least the most recent message for each key, so it behaves like a continuously updated table of current values rather than an ever-growing log.)


https://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/
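The statements below act on a small transactions table. Its definition isn't shown in the talk, but a minimal sketch consistent with the messages (column names and types inferred) might be:

-- hypothetical demo table; Bottled Water keys messages by the primary key
CREATE TABLE transactions (
    transaction_id integer PRIMARY KEY,
    amount         double precision
);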


INSERT INTO transactions VALUES (56789, 20.00);

Bottled Water - Message Key

{ "transaction_id": { "int": 56789 } }

Bottled Water - Message Value

{ "transaction_id": {"int": 56789}, "amount": {"double": 20.00} }


UPDATE transactions SET amount = 25.00 WHERE transaction_id = 56789;

Bottled Water - Message Key

{ "transaction_id": { "int": 56789 } }

Bottled Water - Message Value

{ "transaction_id": {"int": 56789}, "amount": {"double": 25.00} }


DELETE FROM transactions WHERE transaction_id = 56789;

Bottled Water - Message Key

{ "transaction_id": { "int": 56789 } }

Bottled Water - Message Value

null

A delete produces a message with the primary key and a null value; in a compacted topic, that null acts as a tombstone and the deleted row is eventually dropped.


Use Cases

(Architecture diagram, built up one piece at a time over the following slides:)

tx-service writes to its own database, tx-postgres.

tx-pgkafka streams changes from tx-postgres into the Kafka topic tx-pgkafka.

demux-service reads that topic and fans it out into per-table Kafka topics: customers-table and transactions-table.

activity-service, activity-postgres, and activity-pgkafka follow the same pattern, feeding the Kafka topic activity-pgkafka.

analytics-service consumes the topics and loads them into Amazon Redshift (Data Warehouse) and Amazon S3 (Data Lake).

The finished diagram is then shown three more times, with the relevant portion highlighted for each use case: Change Data Capture, Messaging, and Analytics.

Recap

Commit logs: what are they?

Write-ahead logging (WAL)

Commit logs as a data store

Demo: change data capture

Use cases


Also See…

• Blog post on Simple’s CDC pipeline: https://www.simple.com/engineering

• Bottled Water: https://github.com/confluentinc/bottledwater-pg

• Debezium (CDC to Kafka from Postgres, MySQL, or MongoDB): http://debezium.io/

• https://wecode.wepay.com/posts/streaming-databases-in-realtime-with-mysql-debezium-kafka

• https://www.confluent.io/kafka-summit-sf17/

• Martin Kleppmann, Making Sense of Stream Processing (eBook)

Thank You


Extras


The Dual Write Problem

(If a service writes to both its database and Kafka directly, one write can succeed while the other fails, leaving the two out of sync; capturing changes from the database's commit log avoids that race.)

https://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/


Replicating to Redshift

Redshift Architecture (Amazon Redshift)


Table Schema (Amazon Redshift)

CREATE TABLE pgkafka_txservice_transactions (
    pg_lsn              NUMERIC(20,0) ENCODE raw,
    pg_txn_id           BIGINT        ENCODE lzo,
    pg_operation        CHAR(6)       ENCODE bytedict,
    pg_txn_timestamp    TIMESTAMP     ENCODE lzo,
    ingestion_timestamp TIMESTAMP     ENCODE lzo,
    transaction_id      INT           ENCODE lzo,
    amount              NUMERIC(18,2) ENCODE lzo
)
DISTKEY (transaction_id)
SORTKEY (transaction_id, pg_lsn, pg_operation);
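The talk doesn't show the load step itself; as a hedged sketch, a periodic COPY of change records from S3 into this staging table might look like the following (the bucket path and IAM role are placeholders):

-- hypothetical load: append newline-delimited JSON change records from S3
COPY pgkafka_txservice_transactions
FROM 's3://example-bucket/tx-pgkafka/transactions/'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy'
FORMAT AS JSON 'auto'
TIMEFORMAT 'auto';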


Deduplication (Amazon Redshift)

The pipeline can deliver the same change record more than once, so the staging table is periodically rewritten, keeping one row per pg_lsn:

CREATE TABLE deduped (LIKE pgkafka_txservice_transactions);

INSERT INTO deduped
SELECT pg_lsn, pg_txn_id, pg_operation, pg_txn_timestamp,
       ingestion_timestamp, transaction_id, amount
FROM (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY pg_lsn
                           ORDER BY ingestion_timestamp DESC) AS row_number
    FROM pgkafka_txservice_transactions
) t
WHERE row_number = 1;

DROP TABLE pgkafka_txservice_transactions;

ALTER TABLE deduped RENAME TO pgkafka_txservice_transactions;


View of Current State (Amazon Redshift)

CREATE VIEW current_txservice_transactions AS
SELECT transaction_id, amount
FROM (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY transaction_id
                           ORDER BY pg_lsn, pg_operation) AS n,
        COUNT(*) OVER (PARTITION BY transaction_id
                       ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS c
    FROM pgkafka_txservice_transactions
) t
WHERE n = c
  AND pg_operation <> 'delete';

Within each transaction_id, n numbers the changes in commit order and c is the total number of changes, so n = c keeps only the most recent change; rows whose last change was a delete are then filtered out.
