AirStream: Spark Streaming at Airbnb

Post on 06-Jan-2017

Category:

Data & Analytics

TRANSCRIPT

  • AirStream

    Liyin Tang & Jingwei Lu

  • Data Infrastructure at Airbnb

  • Event Logs

    MySQL Dumps

    Gold Cluster

    HDFS

    Hive

    Kafka

    Sqoop

    Silver Cluster

    Spark Cluster

    Spark

    ReAir

    Airflow Scheduling

    S3

    Presto Cluster

    AirPal

    Caravel

    Tableau

    Batch Infrastructure

    Yarn HDFS

    Hive

    Yarn

  • Streaming at Airbnb

    Event Logging

    MySQL BINLOG

    Cluster

    HDFS

    Hive

    SpinalTap

    Presto Cluster

    Yarn

    Kafka

    HBase

    Spark Streaming

    Datadog

    Druid

    Kafka

  • Growing Pain

  • Stateless

    Source → Computation → Sink

    DStream → DF → DF

  • Stateful

    Source → Computation

    DStream → DF → DF

    Sink 1

    Sink 2

    Sink N

    State Storage

    RDD

  • Multiple Streams

    DataFrame

    Sink1

    Process A

    Sink2

    Sink3

    SinkN

    DataFrame

    Sink1

    Process N

    Sink2

    Sink3

    SinkN

    Source

    DStream

    Align by Time

    DataFrame

    DataFrame

    State

    Source

    DStream

  • Streaming + Batch

    DataFrame

    Sink1

    Process A

    Sink2

    Sink3

    SinkN

    DataFrame

    State

    DStream

    Align by Time

    DataFrame

    Sink1

    Process A

    Sink2

    Sink3

    SinkN

  • Simplify and Unify

  • AirStream Architecture

    Sources: Stream #1, Stream #N, Hive Tables, HBase Tables

    Virtual Table Views for Computation

    Computation: Spark SQL, Customized Computation

    Sinks: HBase, Services, Streaming Sources, Druid

    Simple Config

  • AirStream Architecture

    Same Computation for Batch processing

  • Stateful

    State Store

    Merge changes

    Provide fast lookup

    Fast persistent storage across streaming and batch jobs

  • Why HBase

    Rich Functionalities

    Rich Integration with Hadoop EcoSystem

    Easy Management

    Strong Community

    Reliable and Scalable

  • HBase State Store: Operators in AirStream

    Full Table Scan

    Simple Aggregation

    Bulk Upload

    Key/Prefix Lookup

    Update

  • Computation DAG

    Input Data

    Left Outer Join Result

    Key Lookup

  • Key Space Design

    Hash partition key space for load balance

    Composite key for K -> V

    Support full key lookup

    Prefix lookup supported for all keys used in hash function

    Hash key1 key2 key3

    Hash based on key prefix

    Hash key1 key2

    Lookup based on key prefix

    key1 = value1 and key2 = value2
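
The key layout on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not AirStream's actual code: the separator, the MD5-based salt, the salt width, and the names `salt`, `row_key`, and `prefix_lookup` are all hypothetical, and a plain dict stands in for the HBase table. The row key is a short hash of the leading key columns followed by the full composite key, so a lookup that supplies every hashed column can recompute the hash and scan by prefix.

```python
import hashlib

SEP = "|"  # hypothetical separator; assumes key parts never contain it


def salt(hashed_parts, width=4):
    """Short hash of the leading key parts, used to spread rows across the key space."""
    digest = hashlib.md5(SEP.join(hashed_parts).encode("utf-8")).hexdigest()
    return digest[:width]


def row_key(parts, n_hashed):
    """Composite row key: salt(first n_hashed parts) followed by every key part."""
    return SEP.join([salt(parts[:n_hashed])] + list(parts))


def prefix_lookup(table, hashed_parts):
    """Return values whose key starts with the given parts.

    This works only when hashed_parts are exactly the parts fed to the hash,
    because the salt has to be recomputed to build the scan prefix.
    """
    prefix = SEP.join([salt(hashed_parts)] + list(hashed_parts)) + SEP
    return sorted(v for k, v in table.items() if k.startswith(prefix))


# A dict stands in for the HBase table; the hash covers key1 and key2 only.
table = {
    row_key(("user1", "2017-01-06", "evt-a"), n_hashed=2): "a",
    row_key(("user1", "2017-01-06", "evt-b"), n_hashed=2): "b",
    row_key(("user2", "2017-01-06", "evt-c"), n_hashed=2): "c",
}

full = table[row_key(("user1", "2017-01-06", "evt-b"), n_hashed=2)]  # full key lookup
by_prefix = prefix_lookup(table, ("user1", "2017-01-06"))            # key1 and key2 only
```

Hashing only `key1` and `key2` is what keeps the prefix lookup possible: if `key3` were part of the hash, a query without `key3` could not rebuild the salt.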

  • Write Performance

    Partition based on key before write

    Use bulk upload for large volume update
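
As a rough illustration of partitioning by key before a write, the sketch below groups pending puts by the region that owns each key, so every batch targets a single region and is written in key order. The split keys, the function name `partition_by_region`, and the sample data are assumptions for the example; a real job would read the region boundaries from the table.

```python
from bisect import bisect_right
from collections import defaultdict


def partition_by_region(puts, split_keys):
    """Group (key, value) puts by region index.

    split_keys must be sorted. Region i covers keys in
    [split_keys[i - 1], split_keys[i]), matching start-inclusive regions.
    """
    batches = defaultdict(list)
    for key, value in puts:
        batches[bisect_right(split_keys, key)].append((key, value))
    for batch in batches.values():
        batch.sort()  # sorted batches suit sequential writes and bulk upload
    return dict(batches)


puts = [("zebra", 1), ("apple", 2), ("g", 3), ("mango", 4)]
batches = partition_by_region(puts, split_keys=["g", "p"])
```

With these hypothetical split keys, `apple` lands in region 0, `g` and `mango` in region 1, and `zebra` in region 2.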

  • Case Study

    Experiment realtime feedback

    Experiment Assignment Event

    Update

    HBase with TTL

    Booking Event

    Lookup

    Druid

    Datadog

    One AirStream config

    Job 1

    Job 2
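
The flow on this slide can be read as two cooperating jobs: one writes experiment assignments into a state store whose entries expire, the other looks each booking up in that store. The sketch below is an in-memory approximation under that reading. `TTLStore`, `update_assignments`, `tag_bookings`, the event field names, and the one-hour TTL are all hypothetical, and the slide does not say which job number does which step.

```python
class TTLStore:
    """In-memory stand-in for an HBase table whose cells expire after a TTL."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._cells = {}  # key -> (value, write_time)

    def put(self, key, value, now):
        self._cells[key] = (value, now)

    def get(self, key, now):
        cell = self._cells.get(key)
        if cell is None:
            return None
        value, written_at = cell
        if now - written_at > self.ttl_seconds:
            del self._cells[key]  # expired, behave as if it was never written
            return None
        return value


def update_assignments(store, assignment_events):
    """Update step: remember which treatment each user was assigned."""
    for event in assignment_events:
        store.put(event["user_id"], event["treatment"], now=event["ts"])


def tag_bookings(store, booking_events):
    """Lookup step: left outer join of bookings against stored assignments."""
    return [
        {**event, "treatment": store.get(event["user_id"], now=event["ts"])}
        for event in booking_events
    ]


store = TTLStore(ttl_seconds=3600)
update_assignments(store, [{"user_id": "u1", "treatment": "new_checkout", "ts": 0}])
tagged = tag_bookings(store, [
    {"user_id": "u1", "booking_id": "b1", "ts": 600},   # assignment still live
    {"user_id": "u2", "booking_id": "b2", "ts": 700},   # never assigned
    {"user_id": "u1", "booking_id": "b3", "ts": 7200},  # assignment expired
])
```

Bookings without a live assignment keep a `None` treatment, which is the left outer join behaviour; the tagged records are what would then be forwarded to the metrics sinks.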

  • Realtime Data Ingestion

  • Realtime Ingestion on HBase

    Data Infrastructure

    MySQL

    Analytical Events

    Kafka

    Spark Streaming

    HBase

    HDFS

    Presto/Hive/Spark

    Source

    Ingest

    Realtime Query

    Snapshot

    Batch Query

  • Access Data in HBase

    HBase

    Hive

    Presto

    Spark SQL

    Spark Streaming

    Batch Jobs

    Interactive Query

    Streaming

    HDFS Snapshot

    Table Mapping / Unified View on realtime data

  • Snapshot & Reseed

    HBase HDFS

    Snapshot (HFile Links)

    Bulk Upload

  • Case Study 1: Events Ingestion

    Kafka

    topic

    topic

    topic

    Spark

    Executor1

    Executor

    Executor

    HBase

    Dedup

    HDFS Daily

    Realtime

    Hive

    Presto

    Events

    Partition
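
The slide names a dedup step but does not say how it is implemented. One common approach with a key-value store, shown here purely as an assumption, is to use a unique event id as the row key so that a replayed event overwrites its earlier copy instead of adding a second row. The `event_id` field and the `ingest` function are hypothetical, and a dict stands in for the HBase table.

```python
def ingest(table, events):
    """Write events keyed by event_id and return how many were new.

    Writing the same id twice overwrites the same row, so replays after a
    failure or a rewind do not create duplicates.
    """
    new_events = 0
    for event in events:
        key = event["event_id"]
        if key not in table:
            new_events += 1
        table[key] = event  # idempotent: same key, same row
    return new_events


table = {}
first = ingest(table, [{"event_id": "e1", "type": "search"},
                       {"event_id": "e2", "type": "click"}])
replay = ingest(table, [{"event_id": "e2", "type": "click"},   # redelivered
                        {"event_id": "e3", "type": "booking"}])
```

After the replay the table holds three rows, not four, which is the property a downstream daily export would rely on.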

  • Case Study 2: Streaming DB Export

    Kafka RDS

    Table1

    Spinaltap.

    Table2

    TableN

    Spinaltap.

    Table2

    Spinaltap.

    TableN

    Spark

    Executor1

    Executor2

    Executor K

    HBase

    Region1

    Region2

    Region M

    HDFS

    Daily Snapshot

    Realtime Query

  • Case Study: Streaming DB Export

    Rows | CF: Columns | Version                  | Value
         | id          | Fri May 19 00:33:19 2016 | 101
         | city        | Fri May 19 00:33:19 2016 | San Francisco
         | city        | Fri May 10 00:34:19 2016 | New York
         | id          | Fri May 19 00:33:19 2016 | 1

  • Case Study: Streaming DB Export

    TXN 1

    Commit_TS: 101

    TXN 2

    Commit_TS: 102

    TXN 3

    Commit_TS: 103

    TXN N

    Commit_TS: N

    Binlog Order

  • Case Study: Streaming DB Export

    TXN 1

    Commit_TS: 101

    TXN 2

    Commit_TS: 103

    TXN 3

    Commit_TS: 102

    TXN N

    Commit_TS: N

    NTP

    Binlog Order

  • Case Study: Streaming DB Export

    TXN 1

    Commit_TS: 101

    Binlog Order

    TXN 2

    Commit_TS: 103

    TXN 3

    Commit_TS: 102

    TXN N

    Commit_TS: N

    Point-in-Time Restore on TS 102

  • Case Study: Streaming DB Export

    Rows | CF: Columns | Version | Value
         | id          | bin100  | 101
         | city        | bin101  | San Francisco
         | city        | bin102  | New York
         | id          | bin100  | 1

  • Case Study: Streaming DB Export

    Rows Version (Logical Offset) Value

    100 mysql-bin.00000:100

    101 mysql-bin.00000:101

    103 mysql-bin.00000:103

    102 mysql-bin.00000:102
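
The point of versioning cells by binlog position rather than commit timestamp can be shown with a small simulation. This is a sketch under assumptions: `VersionedCell` is a hypothetical stand-in for a multi-version HBase cell, and pairing binlog offsets 101 and 102 with commit timestamps 103 and 102 combines numbers from separate slides for illustration only.

```python
class VersionedCell:
    """Keeps every version of one cell; a read returns the highest version."""

    def __init__(self):
        self._versions = {}  # version -> value

    def put(self, version, value):
        self._versions[version] = value

    def get(self, max_version=None):
        eligible = [v for v in self._versions
                    if max_version is None or v <= max_version]
        return self._versions[max(eligible)] if eligible else None


# (binlog_offset, commit_ts, value), listed in binlog order.
# The commit timestamps are out of order, e.g. because of clock adjustments.
changes = [
    (101, 103, "San Francisco"),
    (102, 102, "New York"),
]

by_commit_ts = VersionedCell()
by_binlog_offset = VersionedCell()
for offset, commit_ts, value in changes:
    by_commit_ts.put(commit_ts, value)
    by_binlog_offset.put(offset, value)

latest_by_ts = by_commit_ts.get()          # "San Francisco": the older write wins
latest_by_offset = by_binlog_offset.get()  # "New York": matches binlog order
```

With timestamps as versions, the store reports the city that the source database had already overwritten. With the logical offset as the version, the latest cell is always the last change in the binlog, whatever the wall clock said.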

  • Operation

  • Job Management: Scaling up

    Config Driver Streaming Job

    Yarn

    Spark Jobs

    Config Driver Streaming Job

    Spark Jobs

    Config Driver Streaming Job

    Spark Jobs

  • Job Management: Scaling up

    Config Driver Streaming Job

    Spark Job 1

    Spark Job 2

    Spark Job N

    Concurrent

    Yarn

  • Job Management: Fault Tolerant

    Driver

    Spark Job 1

    Spark Job 2

    Spark Job N

    Streaming Job

    Concurrent

    Yarn

    Offset Management

    Mesos

    Driver

    Driver

    Config

    Config

    Config

    Checkpoint Rewind
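
The slide only names offset management, checkpoint, and rewind, so the following is a generic sketch of how those three pieces usually fit together, not a description of AirStream internals. `OffsetManager` and `run_batch` are hypothetical names, and a Python list stands in for one Kafka partition. Offsets are committed only after a batch has been processed, so a failed batch is retried on restart, and a rewind moves the committed position back to reprocess older data.

```python
class OffsetManager:
    """Tracks the next offset to read for each (topic, partition)."""

    def __init__(self):
        self._committed = {}

    def committed(self, tp):
        return self._committed.get(tp, 0)

    def commit(self, tp, next_offset):
        self._committed[tp] = next_offset

    def rewind(self, tp, offset):
        # Only ever move backwards; a rewind must not skip unread data.
        self._committed[tp] = min(offset, self.committed(tp))


def run_batch(log, manager, tp, batch_size, process):
    """Read one batch from the committed offset, process it, then commit."""
    start = manager.committed(tp)
    batch = log[start:start + batch_size]
    process(batch)  # if this raises, the offset stays where it was
    manager.commit(tp, start + len(batch))
    return batch


def failing_sink(batch):
    raise RuntimeError("simulated sink failure")


log = ["a", "b", "c", "d", "e", "f"]
manager = OffsetManager()
tp = ("events", 0)
seen = []

run_batch(log, manager, tp, 2, seen.extend)      # processes a, b
try:
    run_batch(log, manager, tp, 2, failing_sink)  # fails, nothing committed
except RuntimeError:
    pass
after_failure = manager.committed(tp)

run_batch(log, manager, tp, 2, seen.extend)      # retry processes c, d
manager.rewind(tp, 0)
replayed = run_batch(log, manager, tp, 2, seen.extend)  # a, b again
```

Committing after processing gives at-least-once delivery, which is why idempotent sinks matter when a job is rewound.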

  • Job Management: Monitoring & Alerting

    Driver

    Spark Job 1

    Spark Job 2

    Spark Job N

    Streaming Job

    Concurrent

    Yarn

    AirStreamListener

  • Summary

    Simplify and Unify Stream Batch Pipeline

    Rich Stateful Computation

    Rich Integration with Hadoop EcoSystem

    Easy Operation
