Traveloka's Data Journey — Traveloka Data Meetup #2
TRANSCRIPT
Traveloka’s Data Journey
Stories and lessons learned on building a scalable data pipeline at Traveloka.
Very Early Days...
Very Early Days
Applications & Services
Summarizer
Internal Dashboard
Report Scripts + Crontab
- Raw Activity
- Key Value
- Time Series
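Not in the original slides: a minimal sketch of what one of those crontab-driven report scripts might have looked like, assuming a MongoDB-backed raw activity store (MongoDB shows up later in the talk). Collection and field names (raw_activity, ts_summary, event_type, ts) are made up.

```python
# Hypothetical cron-driven summarizer: read yesterday's raw activity,
# count events per type, and upsert a time-series summary document.
from collections import Counter
from datetime import datetime, timedelta

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["tracking"]

day_start = datetime.utcnow().replace(hour=0, minute=0, second=0, microsecond=0) - timedelta(days=1)
day_end = day_start + timedelta(days=1)

counts = Counter(
    doc["event_type"]
    for doc in db.raw_activity.find({"ts": {"$gte": day_start, "$lt": day_end}})
)

# One summary document per day keeps dashboard queries cheap.
db.ts_summary.update_one(
    {"date": day_start},
    {"$set": {"counts": dict(counts)}},
    upsert=True,
)
```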
Disk Full... Split & Shard!
Raw, KV, and Time Series DB
Applications & Services
Internal Dashboard
Report Scripts + Crontab
Raw Activity (Sharded)
Key Value DB (Sharded)
Time Series Summary
Summarizer
Lesson Learned
1. UNIX principle: “Do One Thing and Do It Well”
2. Split use cases based on SLA & query pattern
3. Scalable tech based on growth estimation
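As an illustration of the split-and-shard step, a minimal sketch of hash-based shard routing; the shard count and connection URIs are invented.

```python
# Route each write/read to a shard deterministically from its key (e.g. user id).
import hashlib

SHARD_URIS = [
    "mongodb://shard0:27017",
    "mongodb://shard1:27017",
    "mongodb://shard2:27017",
]

def shard_for(key: str) -> str:
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARD_URIS[int(digest, 16) % len(SHARD_URIS)]

print(shard_for("user-42"))  # the same user always lands on the same shard
```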
Throughput?
Kafka comes to the rescue
Applications & Services
Raw Activity (Sharded)
Lesson Learned
1. Use something that can handle higher throughput for cases with high write volume like tracking
2. Decouple publish and consume
Kafka as Datahub
Raw data consumer
Key Value (Sharded)
insert / update
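A sketch of the decoupled publish/consume pattern with Kafka in between, using kafka-python; broker address, topic, and field names are assumptions.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Publisher side: services fire tracking events at Kafka and move on.
producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("raw-activity", {"user_id": "user-42", "event_type": "search"})
producer.flush()

# Consumer side: a separate process drains the topic at its own pace and
# upserts into the sharded key-value store, so publish throughput is not
# limited by the database's write throughput.
consumer = KafkaConsumer(
    "raw-activity",
    bootstrap_servers=["kafka:9092"],
    group_id="kv-writer",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    event = message.value
    # insert/update the key-value DB here
    print(event["user_id"], event["event_type"])
```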
We need a Data Warehouse and a BI Tool, and we need it fast!
Raw Activity (Sharded)
Other sources
Python ETL (temporary solution)
Star Schema DW on Postgres
Periscope BI Tool
Lesson Learned
1. Think about the DW from the beginning of the data pipeline
2. BI Tools: Do not reinvent the wheel
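A hedged sketch of the kind of "temporary" Python ETL that loads raw activity into a star schema on Postgres; the tables (dim_event_type, fact_activity) and columns are invented for illustration, not Traveloka's actual schema.

```python
import psycopg2

conn = psycopg2.connect("dbname=dw user=etl host=postgres")
cur = conn.cursor()

# Normally pulled from the sharded raw-activity store; hard-coded here.
events = [
    {"user_id": "user-42", "event_type": "search", "ts": "2015-01-01 10:00:00"},
]

for e in events:
    # Upsert the dimension row and fetch its surrogate key.
    cur.execute(
        """
        INSERT INTO dim_event_type (event_type) VALUES (%s)
        ON CONFLICT (event_type) DO UPDATE SET event_type = EXCLUDED.event_type
        RETURNING event_type_key
        """,
        (e["event_type"],),
    )
    event_type_key = cur.fetchone()[0]

    # Append to the fact table.
    cur.execute(
        "INSERT INTO fact_activity (user_id, event_type_key, event_ts) VALUES (%s, %s, %s)",
        (e["user_id"], event_type_key, e["ts"]),
    )

conn.commit()
```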
“Have” to adopt big data
Postgres couldn’t handle the load!
Raw Activity (Sharded)
Other sources
Python ETL (temporary solution)
Star Schema DW on Redshift
Periscope BI Tool
Lesson Learned
1. Choose the specific tech that best fits the use case
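Not from the talk, but the Redshift-specific loading pattern this lesson points at: bulk COPY from S3 instead of row-by-row inserts, which Redshift handles poorly. Cluster endpoint, IAM role, bucket, and table are placeholders.

```python
import psycopg2

conn = psycopg2.connect(
    host="dw-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dw", user="etl", password="...",
)
with conn, conn.cursor() as cur:
    # One COPY loads a whole day of compressed JSON files in parallel.
    cur.execute("""
        COPY fact_activity
        FROM 's3://raw-activity-bucket/2016/01/01/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
        FORMAT AS JSON 'auto'
        GZIP
    """)
```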
Scaling out MongoDB every so often is not manageable...
Lesson Learned
1. MongoDB Shard: Scalability needs to be tested!
Kafka as Datahub
Gobblin as Consumer
Raw Activity on S3
“Have” to adopt big data
Lesson Learned
1. Processing has to be easily scaled
2. Scale processing separately for day-to-day jobs and backfill jobs
Kafka as Datahub
Gobblin as Consumer
Raw Activity on S3
Processing on Spark
Star Schema DW on Redshift
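A sketch of the batch processing step: Spark reads the raw activity Gobblin landed on S3 and produces a daily aggregate bound for the Redshift DW. Re-running the same job with different date parameters (on a separate cluster) covers the backfill case. Paths and column names are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("raw-activity-daily").getOrCreate()

# Raw activity partition written by the Kafka -> Gobblin -> S3 path.
raw = spark.read.json("s3a://raw-activity-bucket/2016/06/01/")

daily = (
    raw.withColumn("event_date", F.to_date("ts"))
       .groupBy("event_date", "event_type")
       .agg(F.count("*").alias("event_count"))
)

# Stage as Parquet; a separate COPY step loads it into the Redshift star schema.
daily.write.mode("overwrite").parquet("s3a://dw-staging/daily_activity/2016/06/01/")
```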
Near Real Time on Big Data is challenging
Lesson Learned
1. Dig into requirements until they are very specific; for data this relates to: 1) latency SLA, 2) query pattern, 3) accuracy, 4) processing requirements, 5) tools integration
Kafka as Datahub
MemSQL for Near Real Time DB
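A sketch of that near-real-time path: consume from Kafka and insert into MemSQL, which speaks the MySQL wire protocol (pymysql here); host, table, and field names are assumptions.

```python
import json

import pymysql
from kafka import KafkaConsumer

conn = pymysql.connect(host="memsql", port=3306, user="etl", password="...",
                       database="realtime", autocommit=True)

consumer = KafkaConsumer(
    "raw-activity",
    bootstrap_servers=["kafka:9092"],
    group_id="memsql-writer",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Each event becomes a row that is queryable within seconds.
with conn.cursor() as cur:
    for message in consumer:
        e = message.value
        cur.execute(
            "INSERT INTO activity_rt (user_id, event_type, ts) VALUES (%s, %s, %s)",
            (e["user_id"], e["event_type"], e["ts"]),
        )
```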
No OPS!!!
Open your mind to any combination of tech!
Lesson Learned
1. A combination of cloud providers is possible, but be careful of latency concerns
2. During a research project, always prepare plan B & C plus a proper buffer on the timeline
3. Autoscale!
PubSub as Datahub
Dataflow for Stream Processing
Key Value on DynamoDB
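A hedged sketch of this "no ops" streaming path: an Apache Beam pipeline (run on Cloud Dataflow) reads tracking events from Pub/Sub and writes them to DynamoDB through a DoFn using boto3, since Beam's Python SDK has no built-in DynamoDB sink. Project, subscription, table, and field names are made up.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class WriteToDynamo(beam.DoFn):
    def setup(self):
        import boto3
        self.table = boto3.resource("dynamodb").Table("user-activity")

    def process(self, event):
        self.table.put_item(Item={
            "user_id": event["user_id"],
            "ts": event["ts"],
            "event_type": event["event_type"],
        })


options = PipelineOptions(streaming=True)  # plus --runner=DataflowRunner etc.
with beam.Pipeline(options=options) as p:
    (p
     | beam.io.ReadFromPubSub(subscription="projects/my-project/subscriptions/raw-activity")
     | beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | beam.ParDo(WriteToDynamo()))
```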
More autoscale!
Lesson Learned
1. Autoscale = cost monitoring
Caveat
Autoscale != everything solved
e.g. Pub/Sub default quota of 200 MB/s (can be increased, but requires a manual request)
PubSub as Datahub
BigQuery for Near Real Time DB
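A sketch of BigQuery as the near-real-time store: rows arrive via streaming inserts and are queryable within seconds, while query compute scales independently of storage. Project, dataset, and table names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
table_id = "my-project.realtime.activity"

rows = [
    {"user_id": "user-42", "event_type": "search", "ts": "2018-01-01T10:00:00Z"},
]

# Streaming insert API: no load job, rows are available almost immediately.
errors = client.insert_rows_json(table_id, rows)
if errors:
    raise RuntimeError(f"BigQuery insert errors: {errors}")
```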
More autoscale!
Lesson Learned
1. Make scalability as granular as possible; in this case, separate compute and storage scalability
2. Separate BI with a well-defined SLA from the exploration use case
Kafka as Datahub
Gobblin as Consumer
Raw Activity on S3
Processing on Spark
Hive & Presto on Qubole as Query Engine
BI & Exploration Tools
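A sketch of the exploration path: analysts query the raw activity on S3 through Presto (here via PyHive); host, port, schema, and table names are assumptions.

```python
from pyhive import presto

conn = presto.connect(host="presto.example.internal", port=8081,
                      catalog="hive", schema="default")
cur = conn.cursor()
cur.execute("""
    SELECT event_type, count(*) AS events
    FROM raw_activity
    WHERE dt = '2018-01-01'
    GROUP BY event_type
    ORDER BY events DESC
""")
for event_type, events in cur.fetchall():
    print(event_type, events)
```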
WRAP UP
[Overall architecture diagram]
Consumers of Data: Streaming & Batch
Traveloka App (Android, iOS), Traveloka Services
Kafka, ETL, Data Warehouse, S3 Data Lake, Batch Ingest, NoSQL DB
Hive, Presto Query
DOMO Analytics UI
Ingest: Cloud Pub/Sub
Storage: Cloud Storage
Pipelines: Cloud Dataflow
Analytics: BigQuery
Monitoring & Logging
Key Lessons Learned
● Scalability in mind -- esp. disk full.. :)
● Scalable as granular as possible -- compute, storage
● Scalability needs to be tested (of course!)
● Do one thing, and do it well; dig into your requirements -- SLA, query pattern
● Decouple publish and consume -- publisher availability is very important!
● Choose tech that is specific to the use case
● Careful of gotchas! There's no silver bullet...
THE FUTURE
Future Roadmap
● In the past, we would see a problem/need, see what technology could solve it, and plug it into the existing pipeline.
● It works well.
● But after some time, we need to maintain a lot of different components.
● Multiple clusters:
○ Kafka
○ Spark
○ Hive/Presto
○ Redshift
○ etc.
● Multiple data entry points for analysts:
○ BigQuery
○ Hive/Presto
○ Redshift
Our Goal
● Simplify our data architecture.
● A single data entry point for data analysts/scientists, for both streaming and batch data.
● Without compromising what we can do now.
● Reliability, speed, and scale.
● Less or no ops.
● We also want to make the migration as simple/easy as possible.
How will we achieve this?
● There are a few options that we are considering right now.
● Some of them introduce new technologies/components.
● Some of them make use of our existing technology to its maximum potential.
● We are trying exciting (relatively) new technologies:
○ Google BigQuery
○ Google Dataprep on Dataflow
○ AWS Athena
○ AWS Redshift Spectrum
○ etc.
Plan to simplify
Cloud Pub/Sub
Cloud Dataflow
BigQuery
Cloud Storage
Collector on Kubernetes Cluster
Managed services
BI & Analytics UI
Bigtable
REST API
ML Models
Plan to simplify
● Seems promising, but…
● Needs to be tested.
● Does it cover all the use cases that we need?
● Query migration?
● Costs?
● Maintainability?
● Potential problems?
See You at the Next Event!
Thank You