Stream upload and asynchronous job processing in large-scale systems
DESCRIPTION
Presentation at Barcamp Saigon 2013 - RMIT, 7th July. Presenter: Lê Bá Minh (VNG)
TRANSCRIPT
Stream Upload And Asynchronous Job Processing System
Lê Bá Minh – [email protected] Manager – Zalo Team - VNG
Agenda
• 1/ Why do we need an Asynchronous Job Processing System?
• 2/ How does it work?
• 3/ Applications
• 4/ Q&A
Parallel Stream Upload
• Data is split into chunks
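As a rough illustration of chunked parallel upload, the sketch below splits a payload into indexed chunks, "uploads" them concurrently, and reassembles them by index. Everything here is an assumption made for illustration: the names `split_into_chunks`, `upload_chunk`, `parallel_upload` and `reassemble`, the tiny chunk size, and the in-memory dictionary standing in for the server. It is not the Zalo protocol.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4  # bytes; a tiny hypothetical value so the example is easy to follow


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Return (index, chunk) pairs that cover data in order."""
    return [
        (offset // chunk_size, data[offset:offset + chunk_size])
        for offset in range(0, len(data), chunk_size)
    ]


def upload_chunk(store: dict, index: int, chunk: bytes) -> int:
    """Stand-in for a network upload: record the chunk under its index."""
    store[index] = chunk
    return index


def parallel_upload(data: bytes, max_workers: int = 3) -> dict:
    """Upload all chunks concurrently and return the receiver-side store."""
    store = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(upload_chunk, store, index, chunk)
            for index, chunk in split_into_chunks(data)
        ]
        for future in futures:
            future.result()  # re-raise any error from an upload
    return store


def reassemble(store: dict) -> bytes:
    """Join chunks by index, regardless of the order in which they arrived."""
    return b"".join(store[index] for index in sorted(store))
```

Because each chunk carries its index, the receiver can rebuild the data even when chunks arrive out of order.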
Facts
• Zalo Stream Upload
  • Background continuous voice upload
  • Background image upload
  • …
• Facts (now)
  • 1M voices/day
  • 800K images/day
  • Peak: 500 chunks/second
• Expect:
  • Scalable (more than 5000 chunks/second)
  • High performance
What we need
• Asynchronous Job Processing System
[Diagram: two pipelines. First: Collect Data, Processing Data, Response. Second: Collect Data, Processing Data, Response, plus Workers.]
What we need
• Asynchronous Job Processing System
  • Batch jobs
  • Big data jobs
  • Highly reliable: no job missed
  • Distributed job processing workers
  • High performance
  • Persistent
  • Load balancing, fail-over, recoverable
Open-source solutions
• Shared-memory workers
  • All workers in one physical server
  • No fail-over
  • Not scalable
• Gearman
  • Good, but does not completely fit our requirements
  • No batch job support
  • Not fully reliable (jobs can be lost)
  • Load balancing is incomplete
  • Unstable above 2000 jobs/second
Zalo Async Job Processing System
[Architecture diagram: Clients and Workers 1-3 connect over TCP (short and long connections) to the Job Server. Job Server components: Worker Manager, Job Caching, Job Manager, Persistent Manager, Job Clean-Up. Storage: Z Database.]
Implementation
• C/C++ for Job Server
• C/C++, Java for client and workers
• Binary protocol
• Z-Database
Job State
[State diagram. States: Started, Queuing, Processing, Finished, Failed, Time Out. Transitions: Deliver to Worker, Worker ACK Finished, Worker ACK Failed, No ACK.]
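The job states and events named on this slide can be read as a small state machine. The sketch below is one plausible reading, not the Job Server's actual logic: the pairing of each event with a source and target state is an assumption, "Started" is treated as the entry point that places a job in Queuing, and the names `JobState`, `TRANSITIONS` and `next_state` are hypothetical.

```python
from enum import Enum


class JobState(Enum):
    QUEUING = "Queuing"
    PROCESSING = "Processing"
    FINISHED = "Finished"
    FAILED = "Failed"
    TIME_OUT = "Time Out"


# (current state, event) -> next state.
# Assumption: this pairing is inferred from the slide labels only.
TRANSITIONS = {
    (JobState.QUEUING, "deliver_to_worker"): JobState.PROCESSING,
    (JobState.PROCESSING, "worker_ack_finished"): JobState.FINISHED,
    (JobState.PROCESSING, "worker_ack_failed"): JobState.FAILED,
    (JobState.PROCESSING, "no_ack"): JobState.TIME_OUT,
}


def next_state(state: JobState, event: str) -> JobState:
    """Return the state reached from `state` on `event`, or raise ValueError."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} is not valid in state {state.value}"
        ) from None
```

Keeping the transitions in a table makes illegal moves (for example, an ACK for a job that was never delivered) fail loudly instead of silently corrupting job state.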
Job Type
• Single Job
  • Simple task
  • Delivered immediately
• Batch Job
  • Multiple tasks
  • Delivered once all tasks have been received
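The difference between the two job types can be sketched as follows: a single job is handed to the delivery callback right away, while a batch job is held until every expected task has arrived. This is a minimal in-memory illustration under assumptions; the class names `BatchJob` and `JobBuffer`, the `expected_tasks` count, and the task `index` are all hypothetical and not taken from the Zalo system.

```python
class BatchJob:
    """Collects the tasks of one batch job until all have arrived."""

    def __init__(self, job_id, expected_tasks):
        self.job_id = job_id
        self.expected_tasks = expected_tasks
        self.tasks = {}  # task index -> payload

    def add_task(self, index, payload):
        """Store a task (a re-sent task overwrites) and report completeness."""
        self.tasks[index] = payload
        return len(self.tasks) == self.expected_tasks


class JobBuffer:
    """Delivers single jobs immediately and batch jobs when complete."""

    def __init__(self, deliver):
        self.deliver = deliver  # callback: deliver(job_id, list_of_payloads)
        self.pending = {}       # job_id -> BatchJob still waiting for tasks

    def submit_single(self, job_id, payload):
        self.deliver(job_id, [payload])

    def submit_batch_task(self, job_id, expected_tasks, index, payload):
        job = self.pending.setdefault(job_id, BatchJob(job_id, expected_tasks))
        if job.add_task(index, payload):
            del self.pending[job_id]
            ordered = [job.tasks[i] for i in sorted(job.tasks)]
            self.deliver(job_id, ordered)
```

For a chunked upload, each chunk would map naturally to one task of a batch job, so the worker only sees the upload once it is whole.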
Deployment
[Deployment diagram: Business Server; Job Server 1 and Job Server 2 (synchronized); Workers 1-3.]
Applications
• Used for all asynchronous job processing in Zalo: voice upload, image upload, feed processing…
• Benchmark (single server)
  • 50K images/second (640x480)
  • 50K voices/second (30s)
• Advantages
  • Batch jobs
  • Jobs are never lost
  • Workers can restart or stop at any time
  • Fail-over, load balancing, quick recovery after failure
• Issue
  • Job duplication (handled by the worker)
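Since a job may be delivered more than once, the worker has to make processing idempotent. One common way to do that, shown here as a sketch rather than as the approach Zalo workers actually take, is to remember the IDs of jobs already handled and skip repeats. The `DedupWorker` class and its in-memory `seen` set are hypothetical; a real worker would likely need persistent or shared storage for the IDs.

```python
class DedupWorker:
    """Runs a handler at most once per job ID, ignoring duplicate deliveries."""

    def __init__(self, handler):
        self.handler = handler
        self.seen = set()  # assumption: in-memory only, lost on restart

    def process(self, job_id, payload):
        """Return True if the job was processed, False if it was a duplicate."""
        if job_id in self.seen:
            return False
        self.handler(payload)
        # Mark as seen only after the handler succeeds, so a job whose
        # handler raised can be retried on redelivery.
        self.seen.add(job_id)
        return True
```

Marking the job as seen only after the handler returns trades a possible repeat after a crash for the guarantee that a failed job is never silently dropped.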
Q&A