Apache DistributedLog @ QCon 2017
TRANSCRIPT
![Page 1: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/1.jpg)
Building reliable real-time services with Apache DistributedLog
@sijieg
![Page 2: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/2.jpg)
Logs are Everywhere
● DB Storage Engines - WAL
● DB Replication - Binlog, Log shipping
● Distributed Consensus - Replicated log
● Messaging/Pub-Sub - Kafka
![Page 3: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/3.jpg)
Apache DistributedLog
![Page 4: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/4.jpg)
Log Stream
An endless, totally ordered sequence of immutable records
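The definition above can be illustrated with a minimal in-memory sketch. The class name `LogStream` and its methods are hypothetical and are not part of the DistributedLog API; the sketch only shows the append-only, totally ordered behaviour.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class LogStream:
    """Toy in-memory log stream: append-only and totally ordered."""
    _records: List[bytes] = field(default_factory=list)

    def append(self, payload: bytes) -> int:
        """Append a record and return its position in the stream."""
        self._records.append(bytes(payload))  # store an immutable copy
        return len(self._records) - 1

    def read_from(self, position: int) -> List[Tuple[int, bytes]]:
        """Return (position, payload) pairs from `position` to the newest record."""
        return list(enumerate(self._records[position:], start=position))


stream = LogStream()
for event in (b"a", b"b", b"c"):
    stream.append(event)
```

There is no update or delete operation: a record's position never changes once it has been appended.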
![Page 5: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/5.jpg)
Log Stream
[Figure: a log stream drawn as a row of numbered records, ordered from oldest to newest]
![Page 6: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/6.jpg)
Sequence Numbers - DLSN
[Figure: a log stream drawn as a row of numbered records, ordered from oldest to newest]
DLSN - System Sequence Number
![Page 7: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/7.jpg)
Sequence Numbers - Transaction ID
[Figure: a log stream drawn as a row of numbered records, ordered from oldest to newest]
DLSN - System Sequence Number
Transaction ID - Application Sequence Number
E.g. Offset or Timestamp
![Page 8: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/8.jpg)
Sequence Numbers - Sequence ID
[Figure: a log stream drawn as a row of numbered records, ordered from oldest to newest]
DLSN - System Sequence Number
Transaction ID - Application Sequence Number
E.g. Offset or Timestamp
Sequence ID
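The three sequence numbers can be pictured as fields on each record. In the open-source project a DLSN is a triple of log segment sequence number, entry ID and slot ID; the slide does not spell this out, and the `Record` layout and the meaning given to `sequence_id` below are assumptions made for illustration.

```python
from typing import NamedTuple


class DLSN(NamedTuple):
    """System-assigned position; tuples compare field by field, giving a total order."""
    log_segment_seq_no: int
    entry_id: int
    slot_id: int


class Record(NamedTuple):
    dlsn: DLSN            # assigned by the system on write
    transaction_id: int   # supplied by the application, e.g. an offset or timestamp
    sequence_id: int      # assumed here: running count of records in the stream
    payload: bytes


records = [
    Record(DLSN(1, 0, 0), 1000, 0, b"a"),
    Record(DLSN(1, 0, 1), 1000, 1, b"b"),   # same entry, next slot, same timestamp
    Record(DLSN(2, 0, 0), 1005, 2, b"c"),   # first record of the next segment
]
```

Sorting by `dlsn` reproduces the append order even when two records share a transaction ID.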
![Page 9: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/9.jpg)
Writer & Readers
[Figure: the log stream from oldest to newest, with a writer and readers at different positions]
New records added here
Tailing Reads (close to head of stream)
Catching-up Reads (rewind to any position)
![Page 10: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/10.jpg)
Read Parallelism
[Figure: a log stream drawn as a row of numbered records, ordered from oldest to newest]
Read from multiple positions in parallel
![Page 11: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/11.jpg)
Log Segments
[Figure: the log stream, oldest to newest, divided into consecutive log segments]
Log Segment X
Log Segment X+1
Log Segment X+2
![Page 12: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/12.jpg)
Log Segment Store
[Figure: the log stream, oldest to newest, divided into consecutive log segments]
Log Segment X
Log Segment X+1
Log Segment X+2
Apache BookKeeper
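A rough sketch of how a stream can be cut into segments follows. Rolling after a fixed record count is an assumption chosen to keep the example small (the slides do not state the rolling policy), and `SegmentedLog` is a hypothetical name.

```python
class SegmentedLog:
    """Toy log that rolls to a new segment after `max_records` records."""

    def __init__(self, max_records: int) -> None:
        self.max_records = max_records
        self.segments = [[]]          # segment X, X+1, X+2, ...

    def append(self, payload: bytes):
        if len(self.segments[-1]) >= self.max_records:
            self.segments.append([])  # seal the current segment, start the next
        self.segments[-1].append(payload)
        # Position is (segment index, index within the segment).
        return (len(self.segments) - 1, len(self.segments[-1]) - 1)


log = SegmentedLog(max_records=3)
positions = [log.append(bytes([i])) for i in range(7)]
```

Only the last segment ever receives writes; earlier segments are sealed, which is what allows them to be stored, replicated and truncated as independent units.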
![Page 13: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/13.jpg)
Log Stream Metadata
[Figure: a writer and three readers on the log stream, connected to the stream metadata by links labelled "Updates" and "Notifications"]
Stream Metadata
- List of segments
- Transaction Id Index
- Truncation point
- ...
![Page 14: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/14.jpg)
Namespace
[Figure: a writer and three readers on the log stream, with a "Lookup" link to the namespace]
Namespace
- /manhattan/stream-x
- ...
- /ads/stream_xxx
- /ads/stream_yyy
Lookup
![Page 15: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/15.jpg)
Architecture
Metadata Store
Log Segment Store (BK)
- Segments
![Page 16: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/16.jpg)
Architecture
Metadata Store
Log Segment Store (BK)
Log Streams
- Abstraction & Naming
- Data Management
- Efficient Write & Read
- Intra-cluster & Geo Replication
- Segments
- Raw Streams
![Page 17: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/17.jpg)
Architecture
Metadata Store
Log Segment Store (BK)
Log Streams
- Abstraction & Naming
- Data Management
- Efficient Write & Read
- Intra-cluster & Geo Replication
- Segments
- Raw Streams
Write Proxy
Read Proxy
- Ownership Tracking
- Batching, Compression
- Record Cache
- Rate Limiting, Quota
- Serving
![Page 18: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/18.jpg)
Architecture
Metadata Store
Log Segment Store (BK)
Cold Storage (HDFS)
Log Streams
- Abstraction & Naming
- Data Management
- Efficient Write & Read
- Intra-cluster & Geo Replication
- Segments
- Raw Streams
Write Proxy
Read Proxy
- Ownership Tracking
- Batching, Compression
- Record Cache
- Rate Limiting, Quota
- Serving
![Page 19: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/19.jpg)
Data Flow
[Figure: a write client, a write proxy, three bookies, a read proxy and three read clients, connected by the numbered steps below]
1. Write records
2. Transmit buffer
3. Flush - write a batched entry to bookies
4. Acknowledge
5. Commit - write Control Record
6. Long poll read
7. Speculative read
8. Cache records
9. Long poll read
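Steps 1 to 4 of the write path can be sketched as a buffer that is flushed as one batched entry. `WriteProxySketch` is a hypothetical class, the bookies are modelled as plain lists, and flushing on a fixed batch size is an assumption; the mapping of code lines to step numbers is one reading of the diagram, not the project's implementation.

```python
class WriteProxySketch:
    """Toy write path: buffer records, flush a batch to every bookie, then ack."""

    def __init__(self, bookies, batch_size):
        self.bookies = bookies        # each bookie is modelled as a list of entries
        self.batch_size = batch_size
        self.buffer = []
        self.acked = []

    def write(self, record):
        self.buffer.append(record)                 # step 1: write records
        if len(self.buffer) >= self.batch_size:
            self.flush()                           # step 2: transmit buffer

    def flush(self):
        if not self.buffer:
            return
        entry = tuple(self.buffer)                 # step 3: one batched entry
        for bookie in self.bookies:
            bookie.append(entry)
        self.acked.extend(self.buffer)             # step 4: acknowledge
        self.buffer.clear()


bookies = [[], [], []]
proxy = WriteProxySketch(bookies, batch_size=2)
for record in (b"r1", b"r2", b"r3"):
    proxy.write(record)
```

After the three writes, the first two records have been flushed and acknowledged together, while the third is still waiting in the buffer.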
![Page 20: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/20.jpg)
Consensus
![Page 21: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/21.jpg)
Consensus - Primary Leader Approach
![Page 22: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/22.jpg)
Consensus - Log Replication
![Page 23: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/23.jpg)
Consensus - Safety Assurance
● Election Safety - CAS operation on metadata store
○ Log segment sequence numbers increase monotonically
○ A log segment sequence number is guaranteed to be handed over to a writer only once
● Log Segment Append-Only
○ A writer can only append entries to the log segment that is allocated to it
● Fencing - Termination mechanism of a log segment
○ No entries can be appended to a log segment once it is fenced
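The election and fencing rules above can be sketched with a versioned compare-and-set. `MetadataStore` and `LogSegment` are hypothetical stand-ins, not the metadata store or BookKeeper APIs; the point is only that a stale writer's CAS fails and that a fenced segment rejects appends.

```python
class MetadataStore:
    """Toy versioned store; `compare_and_set` succeeds only on a version match."""

    def __init__(self):
        self.version = 0
        self.last_segment_seq_no = 0

    def compare_and_set(self, expected_version, new_seq_no):
        if expected_version != self.version:
            return False                 # another writer updated the store first
        self.version += 1
        self.last_segment_seq_no = new_seq_no
        return True


class LogSegment:
    def __init__(self, seq_no):
        self.seq_no = seq_no
        self.entries = []
        self.fenced = False

    def append(self, entry):
        if self.fenced:
            raise RuntimeError("log segment is fenced")
        self.entries.append(entry)


store = MetadataStore()
seen = store.version                      # both writers read version 0
first = store.compare_and_set(seen, 1)    # writer A claims sequence number 1
second = store.compare_and_set(seen, 1)   # writer B loses: the version moved on

segment = LogSegment(seq_no=1)
segment.append(b"e1")
segment.fenced = True                     # fencing terminates the segment
```

Because each sequence number is claimed through a CAS, it can be handed to at most one writer, and fencing stops a former owner from appending to a segment it no longer owns.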
![Page 24: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/24.jpg)
Use Cases
![Page 25: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/25.jpg)
Architecture
Metadata Store
Log Segment Store (BK)
Cold Storage (HDFS)
Log Streams
- Abstraction & Naming
- Data Management
- Efficient Write & Read
- Intra-cluster & Geo Replication
- Segments
- Raw Streams
Write Proxy
Read Proxy
- Ownership Tracking
- Batching, Compression
- Record Cache
- Rate Limiting, Quota
- Serving
- Applications
- Different Consumer models
DBs - e.g., Twitter’s Manhattan
Deferred RPC (queuing)
Self-serve Pub/Sub
Stream Computing
Cross DC Replication
![Page 26: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/26.jpg)
Database
Stronger Consistency
![Page 27: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/27.jpg)
Stronger Consistency in Manhattan
[Figure: three MH Coordinators, a log stream ordered from oldest to newest, and three MH Replicas, with steps numbered 1 to 3]
![Page 28: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/28.jpg)
Self-Serve Pub/Sub
Message Delivery
![Page 29: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/29.jpg)
Partitioned Pub/Sub
[Figure: a topic made up of three partitions, each a log stream of numbered messages]
New messages appended here
Reads from any position
- last position stored in offset store
- rewind to any position
- rewind by time (e.g., 15 mins ago)
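Rewinding by time can be sketched as a binary search over transaction IDs, assuming the application uses timestamps as transaction IDs and appends them in non-decreasing order. The function `rewind_by_time`, the dictionary `offset_store` and the subscriber name are hypothetical.

```python
import bisect

# Transaction IDs here are timestamps in seconds, in append order.
transaction_ids = [100, 160, 220, 280, 340, 400]


def rewind_by_time(txids, now, seconds_ago):
    """Index of the first record whose transaction ID is >= now - seconds_ago."""
    return bisect.bisect_left(txids, now - seconds_ago)


offset_store = {}                       # subscriber name -> next index to read
offset_store["sub-a"] = rewind_by_time(transaction_ids, now=400, seconds_ago=150)
```

A subscriber that restarts reads its last position from the offset store, and a subscriber that wants to replay overwrites that position with the result of a rewind.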
![Page 30: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/30.jpg)
Deferred RPC
Reliable Queuing
![Page 31: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/31.jpg)
Reliable RPC System
[Figure: a Web Server, an RPC Queue holding a sequence of records marked E, D and A, two RPC Workers, and Services A, B and C, with steps numbered 1 to 4]
![Page 32: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/32.jpg)
Scale at Twitter
![Page 33: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/33.jpg)
Performance - Basic (GCP)
● Disk & Network Bound
● 1 Journal Disk + 5 Ledger Disks
● Each disk can write/read at ~220 MB/second
● 6 log streams, 1 write proxy + 3 bookies
● 1 writer + 1 tailing reader => 2 million records/second
● 3 catch-up readers => 7.5 million records/second
● End-to-End Latency: within 30 ms when network is around 30% utilized
![Page 34: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/34.jpg)
Performance - Effect of Record Size
![Page 35: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/35.jpg)
Applications at Twitter
● Manhattan Key/Value Store - Stronger Consistency
● Durable Deferred RPC - Journal
● Real-time search indexing - Change propagation
● Self-serve Pub/Sub - Message Delivery, Ads Pipeline
● Stream Computing
○ Source & Sink
○ Stateful Processing in Heron (coming soon)
● Reliable cross datacenter replication
● ...
![Page 36: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/36.jpg)
Scale at Twitter
● O(1) trillion records per day, O(10) petabytes per day
● O(10) thousand streams, O(1) million live log segments
● O(10^2) bookies, O(10^3) proxies
● Record size from 100 bytes to 20KB to even more
● Data is kept from hours to days, even up to a year
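To put the daily figures in per-second terms, here is a back-of-the-envelope calculation. It treats the order-of-magnitude figures as exact values (1 trillion records and 10 decimal petabytes per day), which is an assumption, so the results are only indicative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

records_per_day = 1e12      # assumption: "O(1) trillion" read as exactly 1 trillion
bytes_per_day = 10e15       # assumption: "O(10) petabytes" read as exactly 10 PB

records_per_second = records_per_day / SECONDS_PER_DAY
bytes_per_second = bytes_per_day / SECONDS_PER_DAY
average_record_bytes = bytes_per_day / records_per_day

print(f"{records_per_second / 1e6:.1f} million records/second")
print(f"{bytes_per_second / 1e9:.1f} GB/second")
print(f"{average_record_bytes / 1e3:.0f} KB average record size")
```

Under these assumptions the average record size comes out at about 10 KB, which falls inside the 100 bytes to 20 KB range quoted on the slide.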
![Page 37: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/37.jpg)
Future
![Page 38: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/38.jpg)
Not Just Messaging
● Stream - Events between services
○ Persistent
○ Rewindable
○ Replayable
○ Time independent
● Unification of Messaging and Storage
![Page 39: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/39.jpg)
Apache DistributedLog (incubating)
● Open sourced on 05/09/2016.
● Landed at Apache Incubator on 06/25/2016.
● Website
○ http://distributedlog.io/
○ http://incubator.apache.org/projects/distributedlog.html
● Code - https://github.com/apache/incubator-distributedlog
![Page 40: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/40.jpg)
Apache DistributedLog (incubating)
● Mail List - [email protected]
● Jira - https://issues.apache.org/jira/browse/DL
● Project Ideas - https://cwiki.apache.org/confluence/display/DL/Project+Ideas
● Paper: “DistributedLog: A high performance replicated log service” (ICDE 2017)
![Page 42: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/42.jpg)
Appendix
● Kafka vs DistributedLog
![Page 43: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/43.jpg)
Kafka vs DL - Overall
![Page 44: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/44.jpg)
Kafka vs DL - Data Segmentation
![Page 45: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/45.jpg)
Kafka vs DL - Data Retention
● Kafka
○ Time based Retention
○ Log compaction by keys
● DL
○ Time based Retention (messaging)
○ Explicit truncation (database, replicated state machines)
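The two DL retention modes can be contrasted on a list of sealed segments. The `Segment` fields and both function names are hypothetical, and real retention works on stored log segments rather than on an in-memory list.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    seq_no: int
    last_txid: int      # highest transaction ID in the segment
    closed_at: float    # time the segment was sealed, in seconds


def retain_by_time(segments, now, ttl):
    """Time-based retention: drop segments sealed more than `ttl` seconds ago."""
    return [s for s in segments if now - s.closed_at <= ttl]


def truncate_before(segments, txid):
    """Explicit truncation: drop segments whose records all precede `txid`."""
    return [s for s in segments if s.last_txid >= txid]


segments = [Segment(1, 99, 10.0), Segment(2, 199, 20.0), Segment(3, 299, 30.0)]
```

Time-based retention suits messaging, where old data simply expires. Explicit truncation suits databases and replicated state machines, which must decide for themselves when a prefix of the log is no longer needed.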
![Page 46: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/46.jpg)
Kafka vs DL - Cluster Expansion
● Kafka - Partition Rebalance
○ Adding new brokers
○ Partitions outgrow brokers’ capacity
○ Adding new partitions
● DL
○ New log segments are automatically allocated to new storage nodes
○ Scaling proxies (CPU, memory) is independent of scaling storage
![Page 47: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/47.jpg)
Kafka vs DL - Writer
● Kafka
○ Multiple-Writer Semantics via Brokers
● DL
○ Multiple-Writer Semantics via Write Proxies (messaging)
○ Single-Writer Semantics using Core Library (database, replicated state machines)
■ Fencing, Exclusive Writer
![Page 48: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/48.jpg)
Kafka vs DL - Reader
● Kafka
○ Both writes and reads are served by the leader brokers
○ Polling
● DL
○ Reads from any storage replicas
○ Long poll + Speculative Reads
![Page 49: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/49.jpg)
Kafka vs DL - Replication Scheme
● Kafka
○ ISR Replication
○ Follower brokers catch up with the Leader broker
● DL
○ Quorum-Vote Replication
○ Ack Quorum is adjustable
○ Replication Repair
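Quorum-vote replication with an adjustable ack quorum can be sketched as follows. Bookies are modelled as callables that return `True` on success, and the writes are issued one after another; a real client would send them in parallel and return as soon as enough acks arrive. All names here are hypothetical.

```python
def quorum_write(entry, bookies, ack_quorum):
    """Send `entry` to every bookie in the write set; succeed once `ack_quorum` have acked."""
    acks = 0
    for bookie in bookies:
        if bookie(entry):
            acks += 1
    return acks >= ack_quorum


stored = []


def healthy(entry):
    """A bookie that stores the entry and acknowledges it."""
    stored.append(entry)
    return True


def failed(entry):
    """A bookie that is down and never acknowledges."""
    return False


ok = quorum_write(b"x", [healthy, healthy, failed], ack_quorum=2)
not_ok = quorum_write(b"y", [healthy, failed, failed], ack_quorum=2)
```

Lowering `ack_quorum` trades durability for latency, because the writer waits for fewer replicas before acknowledging.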
![Page 50: Apache distributed log @ q con 2017](https://reader033.vdocuments.net/reader033/viewer/2022052302/58e4a0741a28abf5428b6075/html5/thumbnails/50.jpg)
Kafka vs DL - Storage/Durability
● Kafka
○ File (set of files) per partition
○ Only write to filesystem page cache
● DL (BookKeeper)
○ Interleaved Storage
○ All writes are persisted to disk via explicit fsync before acknowledgement
○ Physical I/O Isolation
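The "fsync before acknowledgement" rule can be illustrated with a small journal append. This is a sketch of the idea using standard operating-system calls, not BookKeeper's journal code; the function name and file layout are hypothetical.

```python
import os
import tempfile


def journal_append(fd, payload):
    """Append `payload` to the journal and return only after it is on disk."""
    os.write(fd, payload)
    os.fsync(fd)            # force the write out of the page cache
    return True             # the acknowledgement


path = os.path.join(tempfile.mkdtemp(), "journal")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
acked = journal_append(fd, b"record-1\n")
os.close(fd)
```

A design that acknowledges after the write but before the fsync would be faster, at the cost of losing acknowledged records if the machine crashes before the page cache is flushed.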