Real-time reporting using Spark Streaming
TRANSCRIPT
Breaking ETL barrier with Real-time reporting
using Kafka, Spark Streaming
About us
Concur (now part of SAP) provides travel and expense management services to businesses.
Data Insights
A team that is building solutions to provide customer access to data, visualization and reporting.
Expense
Travel
Invoice
About me
Santosh Sahoo
Principal Architect III, Data Insights
Stack so far..
OLAP Report
ETL
OLTP
App
Numbers
7K OLTP database sources
14K OLAP reporting DBs
28K ETL jobs
2B row changes
300M rows (compacted)
Only ~20 failures a night
Traditional ETL challenges
Scheduled (high latency)
Hard to scale
Failover and recovery
Monolithic-ness
Spaghetti (logic + SQL)
Moving forward
Streaming, real time
Scalable
Highly available
Reduce maintenance overhead
Eventual consistency
Streaming Data Pipeline
Source
Flow Management
Processor
Storage
Querying
Data Source
Event bus for business events
Log scraping
Transaction log scraping
(Oracle GoldenGate, MySQL binlog, MongoDB oplog, Postgres BottledWater, SQL Server fn_dblog)
Change Data Capture
Application messaging/JMS
Micro batching
(High watermark, change tracking)
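The high-watermark style of micro batching can be sketched in plain Scala. This is a minimal illustration, not the pipeline from the talk: `Row`, `version`, and `pollOnce` are hypothetical names, and the "table" is an in-memory sequence standing in for a source database with a monotonically increasing change-version column.

```scala
// Sketch only: Row, version, and pollOnce are hypothetical, not a real library API.
object HighWatermarkPoller {
  // A source row carrying a monotonically increasing change version.
  final case class Row(id: Int, payload: String, version: Long)

  // Return every row changed since the last watermark (oldest change first)
  // together with the watermark to use for the next poll.
  def pollOnce(table: Seq[Row], watermark: Long): (Seq[Row], Long) = {
    val changed = table.filter(_.version > watermark).sortBy(_.version)
    val next = if (changed.isEmpty) watermark else changed.last.version
    (changed, next)
  }
}
```

Each poll ships only the rows above the stored watermark, so a micro batch stays proportional to the amount of change rather than to the table size.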
Kafka - Flow Management
No nonsense logging
100K/s throughput vs 20K of RabbitMQ
Log compaction
Durable persistence
Partition tolerance
Replication
Best in class integration with Spark
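Log compaction is what lets billions of row changes shrink to the latest state per row. The sketch below models the idea in plain Scala, assuming a hypothetical `Record` type; it is not Kafka's implementation. One simplification to note: a record with no value (a tombstone) is dropped immediately here, whereas Kafka retains tombstones for a configurable period before removing them.

```scala
// Sketch only: an in-memory model of compaction semantics, not Kafka code.
object LogCompaction {
  // value = None models a tombstone (a delete marker for the key).
  final case class Record(offset: Long, key: String, value: Option[String])

  // Keep only the newest record per key, drop keys whose newest record
  // is a tombstone, and preserve offset order among the survivors.
  def compact(log: Seq[Record]): Seq[Record] =
    log.groupBy(_.key)
      .values
      .map(_.maxBy(_.offset))
      .filter(_.value.isDefined)
      .toSeq
      .sortBy(_.offset)
}
```

Keying change events by primary key means a consumer that replays the compacted topic sees one current record per row.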
Columnar Storage
Optimized for analytic query performance.
Vertical partitioning
Column projection
Compression
Loosely coupled schema
HBase
AWS Redshift
Parquet
ORC
Postgres (Citus)
SAP HANA
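Column projection can be illustrated with a small Scala sketch. The types `ExpenseRow` and `ExpenseColumns` are hypothetical and exist only for this example; real columnar formats add encoding and compression on top of this layout.

```scala
// Sketch only: hypothetical types contrasting row and column layouts.
object ColumnarSketch {
  // Row-oriented layout: all fields of one record are stored together.
  final case class ExpenseRow(id: Int, employee: String, amount: Double)

  // Column-oriented layout: one vector per column (vertical partitioning).
  final case class ExpenseColumns(
      ids: Vector[Int],
      employees: Vector[String],
      amounts: Vector[Double])

  def toColumns(rows: Seq[ExpenseRow]): ExpenseColumns =
    ExpenseColumns(
      rows.map(_.id).toVector,
      rows.map(_.employee).toVector,
      rows.map(_.amount).toVector)

  // Column projection: the aggregate reads only the amounts column.
  def totalAmount(cols: ExpenseColumns): Double = cols.amounts.sum
}
```

An analytic query such as a sum over one field never touches the other columns, which is why this layout favors reporting workloads.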
Hadoop/HDFS
Pro - Scale
Con - Latency
Spark Streaming
What? A data processing framework to build scalable fault-tolerant streaming applications.
Why? It lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state.
Spark Streaming Architecture
[Diagram labels: Source, Receiver, Driver (Master), Worker (x3), Executor (x3), Task, data blocks D1 to D4, WAL, Replication, DataStore]
DStream - Discretized Stream of RDDs
RDD - Resilient Distributed Dataset
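The idea of a discretized stream, and why batch code carries over to it, can be modeled without Spark. The sketch below is plain Scala with hypothetical names (`Event`, `discretize`, `countByKind`); it is not the Spark API. One difference to keep in mind: this model emits nothing for an interval with no events, whereas Spark Streaming still produces an (empty) RDD for it.

```scala
// Sketch only: a plain-Scala model of a discretized stream, not Spark code.
object DiscretizedStreamSketch {
  final case class Event(timestampMs: Long, kind: String)

  // Cut the event sequence into consecutive batches of batchMs milliseconds,
  // ordered by time. Each batch plays the role of one RDD in a DStream.
  def discretize(events: Seq[Event], batchMs: Long): Seq[Seq[Event]] =
    events.groupBy(_.timestampMs / batchMs).toSeq.sortBy(_._1).map(_._2)

  // Ordinary batch logic: works on the full history or on a single micro batch.
  def countByKind(batch: Seq[Event]): Map[String, Int] =
    batch.groupBy(_.kind).map { case (kind, es) => kind -> es.size }
}
```

Running `countByKind` over every batch returned by `discretize` gives per-interval results, while running it over the whole sequence gives the batch answer, with no change to the function.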
Optimized Direct Kafka API
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
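How
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
// Assumes an existing StreamingContext named streamingContext (direct API, Spark 1.3+)
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092,anotherhost:9092")
val topics = Set("sometopic", "anothertopic")
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](streamingContext, kafkaParams, topics)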
Architecture
[Diagram labels: App, OLTP, Kafka, Spark Streaming, OLAP, Reporting App]
High level view
[Diagram labels: OLTP; Reporting (Cognos, Tableau ?); Archive (Flume, Camus); Stream Processor (Spark; Samza, Storm, Flink); HDFS Import (FTP, HTTP, SMTP); C; Tachyon; P; Standby; Protobuf, Json; Broker; Kafka; Hive/Spark SQL; HANA (Load balance, Failover, Replication); Service bus; Sqoop Snapshot; Pig/Hive/MR - Normalization; Extract; Compensate; Data {Quality, Correction, Analytics}; Migrate method; API/SQL; Expense; Travel; TTX; API]
Complete Architecture
Can Spark Streaming survive Chaos Monkey?
http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
Lambda Architecture
Lambda architecture is a data-processing pattern designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods.
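The way the two layers combine at query time can be sketched in a few lines of Scala. This is an illustration under assumptions, not the architecture shown in the talk: `batchView` and `speedView` are hypothetical per-key totals, the first recomputed from full history up to the last batch run and the second covering only events since that run.

```scala
// Sketch only: hypothetical views keyed by report category.
object LambdaSketch {
  // Serving-layer query: merge the precomputed batch view with the
  // real-time (speed) view by summing the totals for each key.
  def query(
      batchView: Map[String, Long],
      speedView: Map[String, Long]): Map[String, Long] =
    (batchView.keySet ++ speedView.keySet).map { key =>
      key -> (batchView.getOrElse(key, 0L) + speedView.getOrElse(key, 0L))
    }.toMap
}
```

Summation only works for additive metrics such as counts and totals; other aggregates would need a different merge function.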
Demo….
QnA
concur.com/en-us/careers
We are hiring
Thank you!