Spark, Spark Streaming & Tachyon
Who am I? What do we do?
• Name: Johan Hong
• Software Architect at Pearson Higher Education
• Deliver personalized and connected learning at scale
• Build an assessment platform with microservices that serves internal and public services and applications
Definitions
Apache Spark™ is a fast and general engine for large-scale data processing.
Apache Spark is a cluster computing platform designed to be fast and general-purpose.
Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks.
RDD is an Interface
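To make the "interface" idea concrete, here is a minimal sketch in plain Python. It is not Spark's API: the class names (SketchRDD, ListRDD, MappedRDD) are hypothetical, and the real RDD abstraction also covers things like partitioners and preferred locations that are left out here. The point is only that a dataset can be described by its partitions, a way to compute each partition, and its parent dependencies.

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterator, List


class SketchRDD(ABC):
    """Hypothetical, simplified stand-in for the RDD abstraction (not Spark's API)."""

    @abstractmethod
    def partitions(self) -> List[int]:
        """Return the indices of the partitions that make up this dataset."""

    @abstractmethod
    def compute(self, split: int) -> Iterator:
        """Produce the records of a single partition."""

    def dependencies(self) -> List["SketchRDD"]:
        """Parent datasets this one is derived from (its lineage)."""
        return []

    def map(self, fn: Callable) -> "SketchRDD":
        # A transformation only builds a new dataset description; nothing runs yet.
        return MappedRDD(self, fn)

    def collect(self) -> list:
        # An action walks every partition and materializes the records.
        return [rec for p in self.partitions() for rec in self.compute(p)]


class ListRDD(SketchRDD):
    """Source dataset backed by in-memory chunks, one chunk per partition."""

    def __init__(self, chunks: List[list]) -> None:
        self._chunks = chunks

    def partitions(self) -> List[int]:
        return list(range(len(self._chunks)))

    def compute(self, split: int) -> Iterator:
        return iter(self._chunks[split])


class MappedRDD(SketchRDD):
    """Derived dataset: same partitions as the parent, records passed through fn."""

    def __init__(self, parent: SketchRDD, fn: Callable) -> None:
        self._parent = parent
        self._fn = fn

    def partitions(self) -> List[int]:
        return self._parent.partitions()

    def compute(self, split: int) -> Iterator:
        return (self._fn(x) for x in self._parent.compute(split))

    def dependencies(self) -> List[SketchRDD]:
        return [self._parent]
```

For example, ListRDD([[1, 2], [3]]).map(lambda x: x * 2).collect() walks both partitions through the mapped dataset. Because MappedRDD keeps a reference to its parent, a lost partition could in principle be recomputed from the parent rather than restored from a copy.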
Advanced Spark Internals and Tuning
Tachyon System Architecture
Tachyon: Memory Throughput I/O for Cluster Computing Frameworks
Fault Tolerance in Spark Streaming
Could data be lost if the receiving node crashes before it replicates incoming data to other data node(s)?
Yes, it happens. Ooyala loses 1% of its data, but this is considered acceptable.
What can we do to prevent data loss?
We could persist events before they reach the Spark Streaming receiver, then replay the events/messages after the receiver crashes and recovers.
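The persist-then-replay idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Spark Streaming code: DurableLog and ReplayingReceiver are hypothetical names, the log lives in memory here but stands in for a durable store or message queue, and the committed offset is assumed to be saved somewhere that survives a crash.

```python
from typing import Callable, List, Optional


class DurableLog:
    """Hypothetical append-only log; stands in for a durable store or queue."""

    def __init__(self) -> None:
        self._events: List[str] = []

    def append(self, event: str) -> int:
        # Persist the event first and return the offset it was stored at.
        self._events.append(event)
        return len(self._events) - 1

    def read_from(self, offset: int) -> List[str]:
        # Return a copy of every event at or after the given offset.
        return self._events[offset:]


class ReplayingReceiver:
    """Hypothetical receiver that resumes from its last committed offset."""

    def __init__(self, log: DurableLog, sink: Callable[[str], None]) -> None:
        self._log = log
        self._sink = sink
        # Next offset to process; assumed to be stored durably in practice.
        self.committed = 0

    def poll(self, fail_after: Optional[int] = None) -> None:
        # fail_after simulates a crash after that many events in this poll.
        for i, event in enumerate(self._log.read_from(self.committed)):
            if fail_after is not None and i == fail_after:
                raise RuntimeError("simulated receiver crash")
            self._sink(event)
            # Commit only after the event has been handed downstream.
            self.committed += 1
```

After a crash, calling poll() again picks up at the committed offset, so events that were persisted but not yet delivered are replayed instead of lost. Because the offset is committed only after delivery, a crash between the two steps would replay that event, so this gives at-least-once rather than exactly-once delivery.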