siphon - near real time databus using kafka, eric boyd, nitin kumar

Post on 16-Apr-2017


Thursday, April 14, 2016

Siphon – Near Real Time Databus Using Kafka

Eric Boyd – CVP Engineering – Microsoft

Nitin Kumar – Principal Eng Manager - Microsoft

“Linux is a cancer”


Ads Oslo Schedule

Ads Oslo Feature List

Bing Ads Execution

• Shipped once every 6 months

• Averaged 3 marketplace experiments per month

• Big bets on marketplace features that didn’t work.

• Focused teams on 6 tracks with independent metrics.

• Pushed teams to ship as quickly as they could, focusing only on moving their metric.

• Built/borrowed infrastructure to enable much more rapid experimentation.

• Over 3 years got to a rate of >1000 experiments a month

Profitability!!

Eric joins MSFT

What drove the turnaround?

• Focus on small teams with clear metrics each team was driving.

• Pushing each team to experiment and iterate as fast as possible. Data alone determines what gets shipped.

• Iterated on key metrics until we found the ones with the most impact.

• Commitment that we would get 1.5-2% better each month, and ship a package of experimentally tested improvements each month.

Relationship with Open Source

• From “Linux is a cancer…”

• To contributing to open source

• Storm with C# - SCP.NET (http://www.nuget.org/packages/Microsoft.SCP.Net.SDK/)

• Spark with C# - Mobius (https://github.com/Microsoft/Mobius)

• Kafka with C# - C# Client for Kafka (https://github.com/Microsoft/Kafkanet)

• BOND (https://github.com/Microsoft/bond)

• Across MSFT

• C#

• VSCode

• Hyper-V drivers for Linux

• https://github.com/Microsoft/ with 18 pages of repositories!

Microsoft Big Data History

• Massive batch oriented systems

• Hundreds of thousands of machines

• Exabytes of storage

• SQL-like language with C# extensions

Moving to streaming

[Diagram labels: Devices; Services; Data Bus (scalable pub/sub for NRT data streams); Streaming Processing; Batch Processing; Interactive analytics; Applications]

Vision

• A Databus for all Near Real Time (NRT) data in an organization.

• Quick and easy publication, discovery and subscription of NRT datasets.

• Compatibility with various Stream Processing systems like Storm, Spark, Splunk.

Siphon Adoption

15 months since launch

Excel Word Outlook

Windows 10

Usage

Bing Ads Campaign perf

Bing Live site telemetry

Cortana

Office 365

[Chart: Siphon Data Volume (Ingress and Egress). Y-axis: Throughput (in GBps), 0 to 80. Series: Volume published (GBps), Volume subscribed (GBps), Total Volume (GBps)]

[Chart: Siphon Events per second (Ingress and Egress). Y-axis: Throughput (events per sec), in millions, 0 to 18. Series: EPS In, EPS Out, Total EPS]

1.3 million EVENTS PER SECOND INGRESS AT PEAK

~1 trillion EVENTS PER DAY PROCESSED AT PEAK

3.5 petabytes PROCESSED PER DAY

100 thousand UNIQUE DEVICES AND MACHINES

1,300 PRODUCTION KAFKA BROKERS

Scale: Kafka at Microsoft (Ads, Bing, Office)

Kafka Brokers: 1300+ across 5 Datacenters

Operating System: Windows Server 2012 R2

Hardware Spec: 12 Cores, 32 GB RAM, 4x2 TB HDD (JBOD), 10 GB Network

Incoming Events: 1.3 million per sec (112 billion per day, 500 TB per day)

Outgoing Events: 5 million per sec (~1 trillion per day, 3.5 PB per day)

Kafka Topics/Partitions: 50+/5000+

Kafka version: 0.8.1.1 (3 way replication)
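As a quick sanity check on the figures above, the per-day ingress count and the implied average event size can be worked out directly. This is only back-of-the-envelope arithmetic: it assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes), which the slides do not state, and it takes the egress "~1 trillion per day" figure as quoted (it is given as a peak number, so it is not derived from the 5 million per sec rate here).

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

TB = 10**12  # assumption: decimal terabyte
PB = 10**15  # assumption: decimal petabyte


def events_per_day(events_per_sec: float) -> float:
    """Events per day if the per-second rate were sustained all day."""
    return events_per_sec * SECONDS_PER_DAY


def avg_event_size_bytes(bytes_per_day: float, events_in_day: float) -> float:
    """Average bytes per event implied by a daily volume and a daily count."""
    return bytes_per_day / events_in_day


# Ingress: 1.3 million events per sec, 500 TB per day (figures from the slide).
ingress_events = events_per_day(1.3e6)
ingress_size = avg_event_size_bytes(500 * TB, ingress_events)

# Egress: ~1 trillion events per day, 3.5 PB per day (figures from the slide).
egress_size = avg_event_size_bytes(3.5 * PB, 1e12)

print(f"ingress events/day: {ingress_events / 1e9:.1f} billion")
print(f"avg ingress event:  {ingress_size / 1e3:.2f} KB")
print(f"avg egress event:   {egress_size / 1e3:.2f} KB")
```

Under these assumptions the ingress rate multiplies out to roughly 112 billion events per day, matching the slide, and both directions imply an average event size of a few kilobytes.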

Siphon Architecture

[Diagram labels: Asia DC, Europe DC and US DC, each with Zookeeper, Canary and Kafka; Collector; Agent; Services Data Pull (Agent); Services Data Push; Device Proxy Services; Consumer API (Push/Pull); Streaming; Batch; Audit Trail. Legend: Open Source, Microsoft Internal, Siphon]

Multiple sources and schemas

Siphon Bond Schema

Part A: Main Header (MessageId, AuditId, TimeStamp)

Part B: Extended Header (Key-Value[])

Part C: Payload

Formats shown: CSV, XML, JSON

Bond (https://github.com/Microsoft/bond): a cross-platform framework for working with schematized data, with cross-language (de)serialization. Similar to Protobuf, Thrift and Avro.
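The three-part message layout (Part A main header, Part B extended header, Part C payload) can be sketched as a plain Python data class. This is not the actual Bond schema, which the slides do not show: the field types, the `wrap` helper and the `format` key used in the usage line are all hypothetical, chosen only to illustrate how a common envelope lets payloads in different formats travel on one bus.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SiphonEnvelope:
    # Part A: main header (field names from the slide, types assumed)
    message_id: str
    audit_id: str
    timestamp: float
    # Part B: extended header, a free-form list of key-value pairs
    extended_header: Dict[str, str] = field(default_factory=dict)
    # Part C: payload, kept as opaque bytes in this sketch
    payload: bytes = b""


def wrap(payload: bytes, audit_id: str, **extended: str) -> SiphonEnvelope:
    """Hypothetical helper: wrap a raw payload in the three-part envelope."""
    return SiphonEnvelope(
        message_id=str(uuid.uuid4()),
        audit_id=audit_id,
        timestamp=time.time(),
        extended_header=dict(extended),
        payload=payload,
    )


# Usage: the payload format is recorded in the extended header (an assumption).
env = wrap(b'{"clicks": 3}', audit_id="audit-1", format="json")
```

Keeping the payload opaque means the bus only needs to understand the headers, while consumers decide how to decode Part C.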

Collector – Data Ingestion (Producer)

• HTTP(S) server

• RESTful API with SSL support.

• Abstraction from Kafka internals (Partition, Kafka version)

• Throttling, QPS Monitoring

• PII scrubbing

• Load balancing/failover to multiple DCs

• Supported for both Windows and Linux servers.

[Diagram labels: Device Proxy Services; Services Data Push; Services Data Pull (Agent); Agent; Load Balancer; Collector (x3); Kafka Brokers (4 brokers, partitions P0 to P11). Legend: Open Source, Microsoft Internal, Siphon]

URL: http://localhost/produce/<version>?topic=<topic>

Method: POST
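A producer call against the URL pattern above might look like the following sketch, using only the Python standard library. The slides give the URL shape and the POST method but nothing else, so the version string `v1`, the topic name `clicks` and the request body are hypothetical, and any required headers or body encoding are unknown. The request is only built here, not sent.

```python
import urllib.parse
import urllib.request


def build_produce_request(base_url: str, version: str, topic: str,
                          body: bytes) -> urllib.request.Request:
    """Build a POST to <base_url>/produce/<version>?topic=<topic>."""
    query = urllib.parse.urlencode({"topic": topic})
    path_version = urllib.parse.quote(version, safe="")
    url = f"{base_url}/produce/{path_version}?{query}"
    return urllib.request.Request(url, data=body, method="POST")


# Hypothetical values; a real deployment would use its own host (over HTTPS),
# API version and topic name.
req = build_produce_request("http://localhost", "v1", "clicks", b"example-event")

# Sending is a single call once a Collector is reachable:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```

Because the caller names only a topic, partition selection and the Kafka client version stay behind the Collector, which is the abstraction the slide describes.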

Pull & Push Consumers

[Diagram labels: Virtual Network A; HLC; Pull; Kafka Brokers (4 brokers, partitions P0 to P11); Collector (x2); REST API; Virtual Network B]

Pull

• RESTful API with SSL support

• Works for out of network consumers

• Supports metadata and data operations

• Implements Simple Consumer APIs

• Spark streaming receiver for Kafka REST

Push

• Configurable push to destinations like HDFS, Cosmos, Kafka.

• Utilizes KafkaNet - .NET High Level Consumer (https://github.com/Microsoft/Kafkanet)

High Level Consumer

Monitoring using Canary

[Diagram labels: Device Proxy Services; Services Data Push; Services Data Pull (Agent); Agent; Load Balancer; Collector (x3); Kafka Brokers (4 brokers, partitions P0 to P11); Synthetic message; Audit Trail]

Canary - https://github.com/Microsoft/Availability-Monitor-for-Kafka
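The idea of monitoring with synthetic messages can be illustrated with a small probe: publish a uniquely tagged message, wait for it to come back out, and record availability and round-trip latency. This sketch does not use or reproduce the API of Availability-Monitor-for-Kafka; `InMemoryBus` is a hypothetical stand-in for the real produce and consume path, and the result fields are made up for illustration.

```python
import queue
import time
import uuid


class InMemoryBus:
    """Hypothetical stand-in for the produce/consume path; not a Kafka client."""

    def __init__(self):
        self._q = queue.Queue()

    def produce(self, message):
        self._q.put(message)

    def consume(self, timeout):
        """Return the next message, or None if nothing arrives in time."""
        try:
            return self._q.get(timeout=timeout)
        except queue.Empty:
            return None


def run_canary_probe(bus, timeout=1.0):
    """Send one synthetic message and measure how long it takes to return."""
    probe_id = str(uuid.uuid4())
    sent_at = time.monotonic()
    bus.produce({"id": probe_id, "synthetic": True})
    deadline = sent_at + timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return {"available": False, "latency_s": None}
        msg = bus.consume(timeout=remaining)
        if msg is None:
            return {"available": False, "latency_s": None}
        if msg.get("id") == probe_id:  # skip any message that is not our probe
            return {"available": True, "latency_s": time.monotonic() - sent_at}


result = run_canary_probe(InMemoryBus())
```

Run on a schedule against each data center, a probe of this shape gives an availability signal even when no real traffic is flowing.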

[Diagram labels: High Level Consumer; Device Proxy Services; Services Data Push; Services Data Pull (Agent); Agent; Load Balancer; Collector (x3); Kafka Brokers (4 brokers, partitions P0 to P11); Audit Trail]

Sampled vs Full Auditing support

Data completeness – Audit Trail
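One way to picture a data-completeness check is to count messages per AuditId at the producing side and again at the consuming side, then compare. The slides do not describe how the Audit Trail actually works, so this is an assumed model: each AuditId is taken to label a group of messages, and "sampled" auditing is modeled as deterministically picking a fraction of AuditIds by hash. Both function names are hypothetical.

```python
import zlib
from collections import Counter


def is_audited(audit_id: str, sample_rate: float) -> bool:
    """Sampled auditing: pick a stable fraction of AuditIds by hash.

    sample_rate=1.0 audits everything (full auditing); 0.0 audits nothing.
    """
    bucket = zlib.crc32(audit_id.encode("utf-8")) % 10_000
    return bucket < sample_rate * 10_000


def completeness(produced_audit_ids, consumed_audit_ids):
    """Fraction of produced messages seen downstream, per AuditId."""
    produced = Counter(produced_audit_ids)
    consumed = Counter(consumed_audit_ids)
    report = {}
    for audit_id, sent in produced.items():
        # Cap at the sent count so duplicate deliveries do not exceed 100%.
        received = min(consumed.get(audit_id, 0), sent)
        report[audit_id] = received / sent
    return report


# Usage: one of the two "b" messages never reached the consumer.
report = completeness(["a", "a", "b", "b"], ["a", "a", "b"])
```

Hash-based sampling keeps the decision consistent at every stage without coordination, which is one reason a tunable sample rate is cheap to support.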

Production Experience – Telemetry Charts

• Monitoring using ELK

• E2E Latency

• Data Completeness

• Processing Lag

• EPS breakdown by data center.
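The slides do not define "Processing Lag". Assuming it is meant in the usual Kafka sense of consumer lag (how many messages a consumer is behind the end of each partition), it can be computed from two offset maps. The function name and input shapes below are hypothetical.

```python
def processing_lag(log_end_offsets, committed_offsets):
    """Per-partition and total lag, in messages.

    log_end_offsets:   {partition: offset of the next message to be written}
    committed_offsets: {partition: offset of the next message to be consumed}
    A partition with no committed offset is treated as starting from 0.
    """
    per_partition = {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in log_end_offsets.items()
    }
    return per_partition, sum(per_partition.values())


# Usage: partition 0 is 10 messages behind, partition 1 is caught up.
per_partition, total = processing_lag({0: 100, 1: 50}, {0: 90, 1: 50})
```

Charting the total over time shows whether consumers are keeping up, while the per-partition view helps spot a single slow partition.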

Key Takeaways

• Scale out with Kafka (50K -> 1M -> multi-million Events Per sec)

• Ability to build tunable Auditing/Monitoring

• Producer/Consumer RESTful API provides a nice abstraction

• Config driven Pub/Sub system
