Traveloka's Data Journey — Traveloka Data Meetup #2
TRANSCRIPT
Traveloka’s Data Journey
Stories and lessons learned on building a scalable data pipeline at Traveloka.
Very Early Days...
Very Early Days
Applications & Services
Summarizer
Internal Dashboard
Report Scripts + Crontab
- Raw Activity
- Key Value
- Time Series
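Not in the original slides: a minimal sketch of what one of those crontab-driven report scripts might have looked like, assuming a MongoDB-backed raw activity store (MongoDB shows up later in the talk). Collection and field names (raw_activity, ts_summary, event_type, ts) are made up.

```python
# Hypothetical cron-driven summarizer: read yesterday's raw activity,
# count events per type, and upsert a time-series summary document.
from collections import Counter
from datetime import datetime, timedelta

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["tracking"]

day_start = datetime.utcnow().replace(hour=0, minute=0, second=0, microsecond=0) - timedelta(days=1)
day_end = day_start + timedelta(days=1)

counts = Counter(
    doc["event_type"]
    for doc in db.raw_activity.find({"ts": {"$gte": day_start, "$lt": day_end}})
)

# One summary document per day keeps dashboard queries cheap.
db.ts_summary.update_one(
    {"date": day_start},
    {"$set": {"counts": dict(counts)}},
    upsert=True,
)
```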
Disk Full... Split & Shard!
Raw, KV, and Time Series DB
Applications & Services
Internal Dashboard
Report Scripts + Crontab
Raw Activity (Sharded)
Key Value DB (Sharded)
Time Series Summary
Summarizer
Lesson Learned
1. UNIX principle: “Do One Thing and Do It Well”
2. Split use cases based on SLA & query pattern
3. Scalable tech based on growth estimation
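As an illustration of the split-and-shard step, a minimal sketch of hash-based shard routing; the shard count and connection URIs are invented.

```python
# Route each write/read to a shard deterministically from its key (e.g. user id).
import hashlib

SHARD_URIS = [
    "mongodb://shard0:27017",
    "mongodb://shard1:27017",
    "mongodb://shard2:27017",
]

def shard_for(key: str) -> str:
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARD_URIS[int(digest, 16) % len(SHARD_URIS)]

print(shard_for("user-42"))  # the same user always lands on the same shard
```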
Throughput?
Kafka comes to the rescue
Applications & Services
Raw Activity (Sharded)
Lesson Learned
1. Use something that can handle higher throughput for cases with high write volume like tracking
2. Decouple publish and consume
Kafka as Datahub
Raw data consumer
Key Value (Sharded)
insert / update
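A sketch of the decoupled publish/consume pattern with Kafka in between, using kafka-python; broker address, topic, and field names are assumptions.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Publisher side: services fire tracking events at Kafka and move on.
producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("raw-activity", {"user_id": "user-42", "event_type": "search"})
producer.flush()

# Consumer side: a separate process drains the topic at its own pace and
# upserts into the sharded key-value store, so publish throughput is not
# limited by the database's write throughput.
consumer = KafkaConsumer(
    "raw-activity",
    bootstrap_servers=["kafka:9092"],
    group_id="kv-writer",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    event = message.value
    # insert/update the key-value DB here
    print(event["user_id"], event["event_type"])
```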
We need a Data Warehouse and a BI Tool, and we need it fast!
Raw Activity (Sharded)
Other sources
Python ETL (temporary solution)
Star Schema DW on Postgres
Periscope BI Tool
Lesson Learned
1. Think about the DW from the beginning of the data pipeline
2. BI Tools: Do not reinvent the wheel
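A hedged sketch of the kind of "temporary" Python ETL that loads raw activity into a star schema on Postgres; the tables (dim_event_type, fact_activity) and columns are invented for illustration, not Traveloka's actual schema.

```python
import psycopg2

conn = psycopg2.connect("dbname=dw user=etl host=postgres")
cur = conn.cursor()

# Normally pulled from the sharded raw-activity store; hard-coded here.
events = [
    {"user_id": "user-42", "event_type": "search", "ts": "2015-01-01 10:00:00"},
]

for e in events:
    # Upsert the dimension row and fetch its surrogate key.
    cur.execute(
        """
        INSERT INTO dim_event_type (event_type) VALUES (%s)
        ON CONFLICT (event_type) DO UPDATE SET event_type = EXCLUDED.event_type
        RETURNING event_type_key
        """,
        (e["event_type"],),
    )
    event_type_key = cur.fetchone()[0]

    # Append to the fact table.
    cur.execute(
        "INSERT INTO fact_activity (user_id, event_type_key, event_ts) VALUES (%s, %s, %s)",
        (e["user_id"], event_type_key, e["ts"]),
    )

conn.commit()
```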
“Have” to adopt big data
Postgres couldn’t handle the load!
Raw Activity (Sharded)
Other sources
Python ETL (temporary solution)
Star Schema DW on Redshift
Periscope BI Tool
Lesson Learned
1. Choose the specific tech that best fits the use case
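Not from the talk, but the Redshift-specific loading pattern this lesson points at: bulk COPY from S3 instead of row-by-row inserts, which Redshift handles poorly. Cluster endpoint, IAM role, bucket, and table are placeholders.

```python
import psycopg2

conn = psycopg2.connect(
    host="dw-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dw", user="etl", password="...",
)
with conn, conn.cursor() as cur:
    # One COPY loads a whole day of compressed JSON files in parallel.
    cur.execute("""
        COPY fact_activity
        FROM 's3://raw-activity-bucket/2016/01/01/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
        FORMAT AS JSON 'auto'
        GZIP
    """)
```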
Scaling out MongoDB every so often is not manageable...
Lesson Learned
1. MongoDB Shard: Scalability needs to be tested!
Kafka as Datahub
Gobblin as Consumer
Raw Activity on S3
“Have” to adopt big data
Lesson Learned
1. Processing has to be easily scaled
2. Scale processing separately for day-to-day jobs and backfill jobs
Kafka as Datahub
Gobblin as Consumer
Raw Activity on S3
Processing on Spark
Star Schema DW on Redshift
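A sketch of the batch processing step: Spark reads the raw activity Gobblin landed on S3 and produces a daily aggregate bound for the Redshift DW. Re-running the same job with different date parameters (on a separate cluster) covers the backfill case. Paths and column names are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("raw-activity-daily").getOrCreate()

# Raw activity partition written by the Kafka -> Gobblin -> S3 path.
raw = spark.read.json("s3a://raw-activity-bucket/2016/06/01/")

daily = (
    raw.withColumn("event_date", F.to_date("ts"))
       .groupBy("event_date", "event_type")
       .agg(F.count("*").alias("event_count"))
)

# Stage as Parquet; a separate COPY step loads it into the Redshift star schema.
daily.write.mode("overwrite").parquet("s3a://dw-staging/daily_activity/2016/06/01/")
```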
Near Real Time on Big Data is challenging
Lesson Learned
1. Dig into requirements until they are very specific; for data this relates to: 1) latency SLA, 2) query pattern, 3) accuracy, 4) processing requirements, 5) tools integration
Kafka as Datahub
MemSQL for Near Real Time DB
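A sketch of that near-real-time path: consume from Kafka and insert into MemSQL, which speaks the MySQL wire protocol (pymysql here); host, table, and field names are assumptions.

```python
import json

import pymysql
from kafka import KafkaConsumer

conn = pymysql.connect(host="memsql", port=3306, user="etl", password="...",
                       database="realtime", autocommit=True)

consumer = KafkaConsumer(
    "raw-activity",
    bootstrap_servers=["kafka:9092"],
    group_id="memsql-writer",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Each event becomes a row that is queryable within seconds.
with conn.cursor() as cur:
    for message in consumer:
        e = message.value
        cur.execute(
            "INSERT INTO activity_rt (user_id, event_type, ts) VALUES (%s, %s, %s)",
            (e["user_id"], e["event_type"], e["ts"]),
        )
```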
No OPS!!!
Open your mind to any combination of tech!
Lesson Learned
1. A combination of cloud providers is possible, but be careful of latency concerns
2. During a research project, always prepare plan B & C plus a proper buffer on the timeline
3. Autoscale!
PubSub as Datahub
Dataflow for Stream Processing
Key Value on DynamoDB
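A hedged sketch of this "no ops" streaming path: an Apache Beam pipeline (run on Cloud Dataflow) reads tracking events from Pub/Sub and writes them to DynamoDB through a DoFn using boto3, since Beam's Python SDK has no built-in DynamoDB sink. Project, subscription, table, and field names are made up.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class WriteToDynamo(beam.DoFn):
    def setup(self):
        import boto3
        self.table = boto3.resource("dynamodb").Table("user-activity")

    def process(self, event):
        self.table.put_item(Item={
            "user_id": event["user_id"],
            "ts": event["ts"],
            "event_type": event["event_type"],
        })


options = PipelineOptions(streaming=True)  # plus --runner=DataflowRunner etc.
with beam.Pipeline(options=options) as p:
    (p
     | beam.io.ReadFromPubSub(subscription="projects/my-project/subscriptions/raw-activity")
     | beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | beam.ParDo(WriteToDynamo()))
```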
More autoscale!
Lesson Learned
1. Autoscale = cost monitoring
Caveat
Autoscale != everything solved
e.g. Pub/Sub default quota of 200 MB/s (can be increased, but requires a manual request)
PubSub as Datahub
BigQuery for Near Real Time DB
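A sketch of BigQuery as the near-real-time store: rows arrive via streaming inserts and are queryable within seconds, while query compute scales independently of storage. Project, dataset, and table names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
table_id = "my-project.realtime.activity"

rows = [
    {"user_id": "user-42", "event_type": "search", "ts": "2018-01-01T10:00:00Z"},
]

# Streaming insert API: no load job, rows are available almost immediately.
errors = client.insert_rows_json(table_id, rows)
if errors:
    raise RuntimeError(f"BigQuery insert errors: {errors}")
```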
More autoscale!
Lesson Learned
1. Make scalability as granular as possible; in this case, separate compute and storage scalability
2. Separate BI with a well-defined SLA from the exploration use case
Kafka as Datahub
Gobblin as Consumer
Raw Activity on S3
Processing on Spark
Hive & Presto on Qubole as Query Engine
BI & Exploration Tools
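A sketch of the exploration path: analysts query the raw activity on S3 through Presto (here via PyHive); host, port, schema, and table names are assumptions.

```python
from pyhive import presto

conn = presto.connect(host="presto.example.internal", port=8081,
                      catalog="hive", schema="default")
cur = conn.cursor()
cur.execute("""
    SELECT event_type, count(*) AS events
    FROM raw_activity
    WHERE dt = '2018-01-01'
    GROUP BY event_type
    ORDER BY events DESC
""")
for event_type, events in cur.fetchall():
    print(event_type, events)
```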
WRAP UP
[Overall architecture diagram]
Consumers of Data: Streaming & Batch
Traveloka App (Android, iOS), Traveloka Services
Kafka, ETL, Data Warehouse, S3 Data Lake, Batch Ingest, NoSQL DB
Hive, Presto Query
DOMO Analytics UI
Ingest: Cloud Pub/Sub
Storage: Cloud Storage
Pipelines: Cloud Dataflow
Analytics: BigQuery
Monitoring & Logging
Key Lessons Learned
● Scalability in mind -- esp. disk full.. :)
● Scalable as granular as possible -- compute, storage
● Scalability needs to be tested (of course!)
● Do one thing, and do it well; dig into your requirements -- SLA, query pattern
● Decouple publish and consume -- publisher availability is very important!
● Choose tech that is specific to the use case
● Careful of gotchas! There's no silver bullet...
THE FUTURE
Future Roadmap
● In the past, we would see a problem/need, see what technology could solve it, and plug it into the existing pipeline.
● It works well.
● But after some time, we need to maintain a lot of different components.
● Multiple clusters:
○ Kafka
○ Spark
○ Hive/Presto
○ Redshift
○ etc.
● Multiple data entry points for analysts:
○ BigQuery
○ Hive/Presto
○ Redshift
Our Goal
● Simplify our data architecture.
● A single data entry point for data analysts/scientists, for both streaming and batch data.
● Without compromising what we can do now.
● Reliability, speed, and scale.
● Less or no ops.
● We also want to make the migration as simple/easy as possible.
How will we achieve this?
● There are a few options that we are considering right now.
● Some of them introduce new technologies/components.
● Some of them make use of our existing technology to its maximum potential.
● We are trying exciting (relatively) new technologies:
○ Google BigQuery
○ Google Dataprep on Dataflow
○ AWS Athena
○ AWS Redshift Spectrum
○ etc.
Plan to simplify
Cloud Pub/Sub
Cloud Dataflow
BigQuery
Cloud Storage
Collector on Kubernetes Cluster
Managed services
BI & Analytics UI
Bigtable
REST API
ML Models
Plan to simplify
● Seems promising, but…
● Needs to be tested.
● Does it cover all the use cases that we need?
● Query migration?
● Costs?
● Maintainability?
● Potential problems?
See You at the Next Event!
Thank You