Fluentd + MongoDB + Spark = Awesome Sauce
Nishant Sahay, Sr. Architect, Wipro Limited
Bhavani Ananth, Tech Manager, Wipro Limited
Wipro – Open Source Practice: Vision & Mission
Vision
“Wipro will be the world leader in solving customer problems through the use of innovative and practical open source solutions. We will be a steward of every open source community in which we engage, and always act with sensitivity and integrity.”
Mission
“Wipro’s Open Source mission is to be the guide and partner to companies seeking to leverage the strategic, financial, organizational and technological benefits of open source software and methods. Wipro will anticipate and solve customers’ needs through a commitment to research, and by taking a balanced approach to legacy and innovative technologies. Wipro’s comprehensive suite of strategic and technology services will be delivered with passion and precision.”
Wipro – Open Source Practice Offerings
Advisory
Enterprise-wide adoption strategies
Best fit analysis & recommendation
Business Case Advisory
Governance
Technical Consulting
Support
Application and Infrastructure
DevOps
Architecture, Development
Open Source Community
Productized Services
Legacy Migration Services
Greenfield Development
Open Source Stack Setup
Open App
Cross Industry Solutions and Process Stacks
Connected Warehouse Platform
[Diagram: Connected Warehouse Platform. Systems shown: OMS, ERP/HOST, TMS, WMS, LMS, IoT, WCS, Direct to Customer. Data flows shown: sales orders (real-time / almost real-time), purchase orders and master data (scheduled), route plan / carrier tracking, associate performance, put/pick status. Integration: web services, publisher queues, subscriber queues, integration mapping, FTP (flat file/XML). Platform functions: facility inventory and orders, alerts and notification, equipment monitor, automation enabler, performance tracker, operations dashboards, warehouse KPIs, master data, transaction data. Participants: CSC, SCP, warehouse mobility and dashboards, carrier, vendor, warehouses, equipment, retailer, supplier.]
ANALYTICS & PREDICTION
The Awesome Sauce
Clickstream Analytics
User Behavior Analysis
Product Affinity
Website Resource Allocation
Prediction & recommendation
PREDICTION & RECOMMENDATION
Prediction Using Machine Learning
Content Recommendation
Conversion Prediction
Visitor Segmentation
Demand Forecasting
LOGS – Sauce Raw Material
Logs, Logs Everywhere!
SysLog
Application
Server Logs
Social Media Feeds
Packet Data
Clickstream Data
Sensor Data
CDR
Custom App (C, Ruby, Python)
Payment Data
Device Logs
Web Access Logs
Database Logs
What can be done with logs?
Real time monitoring
Root cause analysis
Anomaly Detection and Predictive Monitoring
Debugging
Troubleshooting/Support
Challenges with Log Analytics
No standard log formats
Multiple logging frameworks
Logs highly decentralized
Limited real time visualization capability
Scalability Issues
Normalizing and correlating logs from disparate sources
What can be done with logs – Business PoV?
Input Data Analytics
User Interactions /Behavior
End user Experience/Improvements
Awesome Foursome – The Ingredients
FLUENTD – The Ingredients
Why Fluentd
Unified Logging
Simple and Flexible
Proven
Minimal Resources
Reliable
Open Source
Community
Fluentd Plugin Architecture
Input (udp, tcp, http, tail)
Parser (regexp, apache2)
Filter (grep, enrich, delete, mask)
Buffer
Format
Output (out_mongo)
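The plugin stages can be pictured as a chain that each event passes through. The following is a minimal Python sketch of that idea, not Fluentd code: the function names, the simplified access-log pattern, the sample lines, and the chunk size are all assumptions made for illustration.

```python
import json
import re

# Simplified Apache access-log pattern (illustrative; not Fluentd's own apache2 parser).
APACHE_RE = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<code>\d{3}) (?P<size>\d+|-)'
)

# Hypothetical input lines, standing in for what a tail input would read.
SAMPLE_LINES = [
    '10.0.0.1 - alice [10/Oct/2016:13:55:36 +0000] "GET /products/42 HTTP/1.1" 200 2326',
    '10.0.0.2 - bob [10/Oct/2016:13:55:40 +0000] "GET /cart HTTP/1.1" 200 512',
    'not an access log line',
    '10.0.0.3 - carol [10/Oct/2016:13:56:01 +0000] "GET /products/7 HTTP/1.1" 200 1804',
    '10.0.0.1 - alice [10/Oct/2016:13:56:09 +0000] "GET /products/9 HTTP/1.1" 404 -',
]


def parse(line):
    """Parser stage: turn a raw line into a record dict, or None if it does not match."""
    match = APACHE_RE.match(line)
    return match.groupdict() if match else None


def grep(record, key, pattern):
    """Filter stage (grep): keep the record only if record[key] matches the pattern."""
    return record if re.search(pattern, record[key]) else None


def mask(record, key):
    """Filter stage (mask): hide a sensitive field."""
    return {**record, key: "***"}


def enrich(record, **extra):
    """Filter stage (enrich): add fields to the record."""
    return {**record, **extra}


def run_pipeline(lines, chunk_size=2):
    """Input -> parser -> filter -> buffer -> format; yields formatted chunks for an output."""
    buffer = []
    for line in lines:
        record = parse(line)
        if record is None:
            continue  # unparseable line
        record = grep(record, "path", r"^/products")
        if record is None:
            continue  # filtered out
        buffer.append(enrich(mask(record, "user"), tag="clickstream"))
        if len(buffer) >= chunk_size:
            # Format stage: serialize each record as JSON before handing the chunk to an output.
            yield [json.dumps(r, sort_keys=True) for r in buffer]
            buffer = []
    if buffer:
        yield [json.dumps(r, sort_keys=True) for r in buffer]


if __name__ == "__main__":
    for chunk in run_pipeline(SAMPLE_LINES):
        print(chunk)
```

In a real deployment each stage is a configured plugin rather than a function; the sketch only shows the order in which an event moves through them.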
HA Fluentd topology
• “At Most once” and “At Least once” transfers
[Diagram: Node1, Node2, and Node3 each run a Fluentd log forwarder that reads a local log file and pushes to the log aggregators, Fluentd (Active) and Fluentd (Backup); the aggregators push to the destinations, MongoDB and Amazon S3.]
Fluentd – Failure Scenarios
Forwarder goes down
Aggregator goes down
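The difference between the two transfer modes shows up when a send fails. Below is a rough Python sketch under stated assumptions: ScriptedAggregator is a hypothetical stand-in for a log aggregator whose per-attempt outcome is scripted, and the two sender functions are illustrative, not Fluentd APIs. In Fluentd itself, the forward output's require_ack_response option is what switches a forwarder to at-least-once behavior.

```python
class ScriptedAggregator:
    """Hypothetical aggregator whose behavior per attempt is scripted.

    Outcomes: "ok" (stored and acknowledged), "lost" (never arrives),
    "ack_lost" (stored, but the acknowledgement never reaches the sender).
    """

    def __init__(self, outcomes):
        self.outcomes = list(outcomes)
        self.stored = []

    def send(self, event):
        outcome = self.outcomes.pop(0) if self.outcomes else "ok"
        if outcome == "lost":
            return False
        self.stored.append(event)
        return outcome == "ok"  # "ack_lost" stores the event but reports no ack


def send_at_most_once(aggregator, events):
    """Fire and forget: no retry, so a failed send loses the event."""
    for event in events:
        aggregator.send(event)


def send_at_least_once(aggregator, events, max_retries=5):
    """Retry until acknowledged: nothing is lost, but duplicates are possible."""
    for event in events:
        for _ in range(max_retries):
            if aggregator.send(event):
                break
        else:
            raise RuntimeError(f"gave up on {event!r} after {max_retries} attempts")


if __name__ == "__main__":
    lossy = ScriptedAggregator(["ok", "lost", "ok"])
    send_at_most_once(lossy, ["a", "b", "c"])
    print(lossy.stored)

    retried = ScriptedAggregator(["ok", "ack_lost", "ok", "ok"])
    send_at_least_once(retried, ["a", "b", "c"])
    print(retried.stored)
```

At-least-once delivery means the destination should tolerate duplicates, for example by writing with a unique event id.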
KAFKA – The Ingredients
Kafka – distributed streaming platform
[Diagram: a Kafka cluster at the center, connected to producers (apps), consumers (apps), connectors (DBs), and stream processors (apps).]
Publish and subscribe to streams of records
Store streams of records in a fault-tolerant way
Process streams of records
Kafka – Terms
Producer
Consumer
Consumer Group
Topic
Partition
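[Diagram: a producer writes to topics split into Partition-1, Partition-2, and Partition-3, each holding records at offsets 0, 1, 2, ...; brokers host partitions p1 to p3 and replicas R1 to R3; consumers C1 and C2 in consumer groups read from the partitions.]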
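The relationship between these terms can be sketched in a few lines of Python. This is a toy model, not a Kafka client: the Topic class and both functions are hypothetical, CRC32 stands in for the hash used by a real partitioner, and the round-robin assignment is only one possible strategy.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Map a record key to a partition. CRC32 is an assumption standing in
    for whatever hash a real client's partitioner uses."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class Topic:
    """Toy topic: a list of append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Producer side: append a record and return (partition, offset)."""
        p = partition_for(key, len(self.partitions))
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1


def assign_partitions(num_partitions, consumers):
    """Consumer-group side: spread partitions round-robin over the group's
    consumers so that each partition has exactly one owner in the group."""
    assignment = defaultdict(list)
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return dict(assignment)


if __name__ == "__main__":
    topic = Topic(num_partitions=3)
    for i in range(3):
        print(topic.append("user-1", f"click-{i}"))
    print(assign_partitions(3, ["c1", "c2"]))
```

Because all records with the same key land in the same partition, their relative order is preserved for the single consumer that owns that partition.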
Why Kafka
Unified platform for handling real-time data feeds
High throughput for high-volume event streams such as log aggregation
Handles large data backlogs loaded periodically from offline systems
Fault tolerant and scalable
Low-latency delivery, as needed for traditional messaging use cases
Kafka – decouples data pipelines
[Diagram: multiple producers publish to Kafka (the broker) and multiple consumers read from it; producers and consumers are not connected to each other directly.]
Kafka – Guarantees
Messages sent by a producer to a particular topic partition are appended in the order they are sent
A consumer instance sees messages in the order they are stored in the log
A topic with replication factor N can tolerate up to N-1 broker failures without losing committed messages
Kafka – Replication
[Diagram: producers write logs to Topic1-part1 and Topic1-part2; each partition has three replicas spread across Broker1 to Broker4, with one leader and two followers per partition.]
ZooKeeper
• ZooKeeper enables highly reliable distributed coordination
• Kafka bundles a single-node ZooKeeper instance
• Metadata includes broker addresses and message offsets
[Diagram: producers and consumers exchange messages with the Kafka cluster; producers, consumers, and the Kafka cluster exchange metadata with ZooKeeper.]
Kafka Persistence - File System
http://deliveryimages.acm.org/10.1145/1570000/1563874/jacobs3.jpg
Sequential File I/O very fast
Uses OS page cache for data storage
Batching of messages speeds up disk operations, network transfers and in memory iterations.
Batch Processing
One of the big drivers for efficiency
Producers accumulate data in memory and send larger batches in a single request
Cap the size of a batch (in bytes) - batch.size
Wait no longer than a fixed latency bound - linger.ms
Trade a small amount of latency for better throughput
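The interplay of the two settings can be sketched as follows. This is a simplified Python model, not the Kafka producer: the Batcher class and its methods are hypothetical, time is passed in explicitly so the behavior is deterministic, and a batch is closed either when the next record would exceed the size cap or when the oldest buffered record has waited for the linger interval.

```python
class Batcher:
    """Toy producer-side batcher driven by a size cap and a linger time."""

    def __init__(self, batch_size_bytes, linger_ms):
        self.batch_size_bytes = batch_size_bytes
        self.linger_ms = linger_ms
        self._records = []
        self._bytes = 0
        self._first_at = None  # arrival time of the oldest buffered record

    def _flush(self):
        batch = self._records
        self._records, self._bytes, self._first_at = [], 0, None
        return batch

    def add(self, record, now_ms):
        """Buffer a record. Returns a full batch to send if adding this record
        would have exceeded the size cap, otherwise None."""
        flushed = None
        if self._records and self._bytes + len(record) > self.batch_size_bytes:
            flushed = self._flush()
        if not self._records:
            self._first_at = now_ms
        self._records.append(record)
        self._bytes += len(record)
        return flushed

    def poll(self, now_ms):
        """Returns the pending batch once it has lingered long enough, else None."""
        if self._records and now_ms - self._first_at >= self.linger_ms:
            return self._flush()
        return None


if __name__ == "__main__":
    batcher = Batcher(batch_size_bytes=10, linger_ms=5)
    print(batcher.add(b"aaaa", now_ms=0))
    print(batcher.add(b"bbbb", now_ms=1))
    print(batcher.add(b"cccc", now_ms=2))  # size cap reached: first two records go out
    print(batcher.poll(now_ms=4))          # still lingering
    print(batcher.poll(now_ms=7))          # linger expired: last record goes out
```

A larger size cap or a longer linger time means fewer, bigger requests at the cost of added delay for the first record in each batch.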
Log Compaction
Per-record retention, rather than the coarser-grained time-based retention
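Per-record retention means that the log keeps at least the latest value for each key instead of discarding everything older than a time limit. A minimal Python sketch of that idea follows; the compact function and the sample inventory keys are assumptions, and the sketch drops delete markers (value None) immediately, whereas a real broker keeps them for a configurable period first.

```python
def compact(log):
    """Keep only the most recent record for each key, in log order.

    `log` is a list of (key, value) pairs; a value of None marks the key
    as deleted, so no record for that key survives compaction here.
    """
    last_index = {}
    for i, (key, _) in enumerate(log):
        last_index[key] = i  # later entries overwrite earlier ones
    return [
        (key, value)
        for i, (key, value) in enumerate(log)
        if last_index[key] == i and value is not None
    ]


if __name__ == "__main__":
    # Hypothetical stock-level updates keyed by SKU.
    updates = [
        ("sku-1", 5),
        ("sku-2", 3),
        ("sku-1", 4),
        ("sku-3", 9),
        ("sku-3", None),  # sku-3 removed
        ("sku-2", 7),
    ]
    print(compact(updates))
```

A consumer that replays the compacted log ends up with the same final state per key as one that replayed the full log.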
Fluentd Kafka Integration
• Kafka-Fluentd consumer
• Fluentd Kafka plugin
Kafka Ecosystem
[Diagram: Fluentd log forwarders push to the Kafka clusters; Fluentd consumers pull from Kafka and push to the destinations, MongoDB and Amazon S3.]
Advantage - Fluentd-Kafka
Backpressure - Pull versus Push
Reliable, flexible data pipeline
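The backpressure point can be illustrated with a toy comparison in Python. Everything here is an assumption made for illustration: push_burst models a sender pushing a burst into a receiver with a bounded inbox, and PullConsumer models a consumer that reads from a retained log at its own pace by tracking an offset.

```python
def push_burst(events, inbox_capacity):
    """Push model: the sender sets the pace. Whatever does not fit in the
    receiver's bounded inbox is dropped (or would have to block the sender)."""
    return events[:inbox_capacity], events[inbox_capacity:]


class PullConsumer:
    """Pull model: the log retains the events and the consumer fetches at
    most `max_records` per poll, remembering how far it has read."""

    def __init__(self, log, max_records):
        self.log = log
        self.max_records = max_records
        self.offset = 0

    def poll(self):
        batch = self.log[self.offset:self.offset + self.max_records]
        self.offset += len(batch)
        return batch


if __name__ == "__main__":
    burst = list(range(10))

    accepted, dropped = push_burst(burst, inbox_capacity=4)
    print("push accepted:", accepted, "dropped:", dropped)

    consumer = PullConsumer(burst, max_records=3)
    while True:
        batch = consumer.poll()
        if not batch:
            break
        print("pulled:", batch)
```

With pull, a slow destination only falls behind in the log; it is never handed more than it asked for.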
Connected Warehouse – Kafka Cluster Architecture
[Diagram: two active data centers (Data Center 1 and Data Center 2); a Kafka cluster with Kafka Broker 1 (Topic 1, partitions 0..n) and Kafka Broker 2 (Topic 1, partitions n+1..n+n); a ZooKeeper ensemble with ZK 1 (leader) and ZK 2 (follower); the Fluentd-Kafka plugin connected to the cluster.]
MONGODB – The Ingredients
Why MongoDB
Cross-platform, document-oriented NoSQL database
Simple and Flexible Data Model
Field Level Indexing
Built In Query Capabilities
High Performance
System Architecture With Shards
[Diagram: data sources connect through three mongos routers, supported by a config server, to the shards; each shard is a replica set with one primary and two secondaries.]
MongoDB For Analytics
Denormalization with support of Embedded Documents
Text Search Queries
Connectors for almost all kinds of data sources
Aggregation Framework
Range Queries, Key value queries
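As one illustration of a range query combined with the aggregation framework, the sketch below builds a pipeline that counts page views per path inside a time window. The stage operators ($match, $group, $sort, $limit) are standard MongoDB ones; the function name and the field names time and path are assumptions about how clickstream events might be stored.

```python
from datetime import datetime


def top_pages_pipeline(start, end, top_n=5):
    """Build an aggregation pipeline: range filter on time, group by path,
    then keep the most viewed paths."""
    return [
        # Range query: events with start <= time < end.
        {"$match": {"time": {"$gte": start, "$lt": end}}},
        # Count events per page path.
        {"$group": {"_id": "$path", "views": {"$sum": 1}}},
        # Most viewed first.
        {"$sort": {"views": -1}},
        {"$limit": top_n},
    ]


if __name__ == "__main__":
    pipeline = top_pages_pipeline(datetime(2016, 10, 10), datetime(2016, 10, 11))
    for stage in pipeline:
        print(stage)
    # With PyMongo and a running server, this list would be passed to
    # collection.aggregate(pipeline); the collection itself is hypothetical here.
```

Building the pipeline as plain data keeps it easy to log, test, and reuse across collections.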
SPARK – The Ingredients
Spark – Logical Architecture
Apache Spark
Spark SQL, Spark Streaming, MLlib, GraphX
Scala, Java, Python, R
Spark – MongoDB Connector
Putting It All Together – Click Stream + Inventory Mgmt
Collection
Processing
Ingestion
Data Sync
Micro-Service
QUESTIONS & ANSWERS
Thank you