Fluentd + MongoDB + Spark = Awesome Sauce

Nishant Sahay, Sr. Architect, Wipro Limited

Bhavani Ananth, Tech Manager, Wipro Limited


Wipro – Open Source Practice: Vision & Mission

Vision

“Wipro will be the world leader in solving customer problems through the use of innovative and practical open source solutions. We will be a steward of every open source community in which we engage, and always act with sensitivity and integrity.”

Mission

“Wipro’s Open Source mission is to be the guide and partner to companies seeking to leverage the strategic, financial, organizational and technological benefits of open source software and methods. Wipro will anticipate and solve customers’ needs through a commitment to research, and by taking a balanced approach to legacy and innovative technologies. Wipro’s comprehensive suite of strategic and technology services will be delivered with passion and precision.”

Wipro – Open Source Practice Offerings

Advisory

Enterprise-wide adoption strategies

Best fit analysis & recommendation

Business Case Advisory

Governance

Technical Consulting

Support

Application and Infrastructure

DevOps

Architecture, Development

Open Source Community

Productized Services

Legacy Migration Services

Greenfield Development

Open Source Stack Setup

Open App

Cross Industry Solutions and Process Stacks

Connected Warehouse Platform

[Figure: Connected Warehouse Platform. Connected systems: OMS, ERP/HOST, TMS, WMS, LMS, IoT, WCS, and Direct to Customer. Data flows: sales orders (real-time or almost real-time), purchase orders and master data (scheduled), route plan and carrier tracking, associate performance, put/pick status. Integration: web services, publisher and subscriber queues, integration mapping, FTP (flat file/XML). Platform functions: facility inventory and orders, alerts and notifications, equipment monitor, automation enabler, performance tracker, operations dashboards, and warehouse KPIs, over master data and transaction data. Users and channels: CSC, SCP, warehouse mobility and dashboards, carriers, vendors, warehouses, equipment, retailers, suppliers.]

ANALYTICS & PREDICTION

The Awesome Sauce

Clickstream Analytics

User Behavior Analysis

Product Affinity

Website Resource Allocation

Prediction & recommendation

PREDICTION & RECOMMENDATION

Prediction Using Machine Learning

Content Recommendation

Conversion Prediction

Visitor Segmentation

Demand Forecasting

LOGS – Sauce Raw Material

Logs, Logs Everywhere!

SysLog

Application

Server Logs

Social Media Feeds

Packet Data

Clickstream Data

Sensor Data

CDR

Custom App (C, Ruby, Python)

Payment Data

Device Logs

Web Access Logs

Database Logs

What can be done with logs?

Real time monitoring

Root cause analysis

Anomaly Detection and Predictive Monitoring

Debugging

Troubleshooting/Support

Challenges with Log Analytics

No standard log formats

Multiple logging frameworks

Logs highly decentralized

Limited real time visualization capability

Scalability Issues

Normalizing and correlating logs from disparate sources

What can be done with logs – Business PoV?

Input Data Analytics

User Interactions/Behavior

End-User Experience/Improvements

Awesome Foursome – The Ingredients

FLUENTD – The Ingredients

Why Fluentd

Unified Logging

Simple and Flexible

Proven

Minimal Resources

Reliable

Open Source

Community

Fluentd Plugin Architecture

Input (udp, tcp, http, tail)

Parser (regexp, apache2)

Filter (grep, enrich, delete, mask)

Buffer

Format

Output (out_mongo)
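The plugin stages can be pictured as a chain of small functions. What follows is a minimal Python sketch of that idea, not Fluentd's actual API: real Fluentd plugins are written in Ruby and wired together in its configuration file. The function names, the simplified access-log pattern, and the in-memory list standing in for a MongoDB collection are all assumptions made for illustration.

```python
import re
from typing import Callable, Iterable, List, Optional

# Simplified access-log pattern (assumption; real Apache logs carry more fields).
LINE_RE = re.compile(r'(?P<host>\S+) "(?P<method>\S+) (?P<path>\S+)" (?P<code>\d{3})')


def parse(line: str) -> Optional[dict]:
    """Parser stage: turn a raw line into a structured event, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    event = match.groupdict()
    event["code"] = int(event["code"])
    return event


def grep_errors(event: dict) -> Optional[dict]:
    """Filter stage (grep): keep only events with an HTTP status of 400 or above."""
    return event if event["code"] >= 400 else None


def mask_host(event: dict) -> dict:
    """Filter stage (mask): hide the last octet of the client address."""
    head, _, _ = event["host"].rpartition(".")
    return {**event, "host": head + ".xxx"}


def run_pipeline(lines: Iterable[str], chunk_size: int,
                 output: Callable[[List[dict]], None]) -> None:
    """Input -> parser -> filters -> buffer -> output."""
    buffer: List[dict] = []
    for line in lines:                      # input stage: any iterable of raw lines
        event = parse(line)
        if event is None:
            continue                        # unparseable line is dropped in this sketch
        event = grep_errors(event)
        if event is None:
            continue
        buffer.append(mask_host(event))
        if len(buffer) >= chunk_size:       # buffer stage: flush in chunks
            output(buffer)
            buffer = []
    if buffer:
        output(buffer)                      # flush the final partial chunk


stored: List[dict] = []                     # stand-in for a MongoDB collection
run_pipeline(
    [
        '10.0.0.1 "GET /a" 200',
        '10.0.0.2 "GET /b" 404',
        'garbage',
        '10.0.0.3 "POST /c" 500',
    ],
    chunk_size=2,
    output=stored.extend,
)
```

After the run, `stored` holds the two error events (`/b` and `/c`) with masked host addresses; the 200 response and the unparseable line never reach the output.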

HA Fluentd topology

• “At Most once” and “At Least once” transfers

[Figure: HA Fluentd topology. Log forwarders (one Fluentd instance on each of Node1, Node2, and Node3, each reading a local log file) push events to the log aggregators (one active Fluentd, one backup Fluentd), which push to the destinations (MongoDB, Amazon S3).]
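The difference between "at most once" and "at least once" comes down to what the sender does when it gets no acknowledgement. The sketch below simulates both behaviors in plain Python under stated assumptions: `FlakyLink` is a hypothetical stand-in for the network path between a forwarder and an aggregator, and nothing here is Fluentd code. In Fluentd's forward output, at-least-once delivery is usually associated with the `require_ack_response` option.

```python
from typing import Iterable, List


class FlakyLink:
    """Hypothetical link to an aggregator.

    fail_on  : 1-based attempt numbers on which the sender sees a failure.
    lose_ack : if True, a failure means the event arrived but the ack was lost;
               if False, a failure means the event never arrived.
    """

    def __init__(self, fail_on: Iterable[int], lose_ack: bool) -> None:
        self.fail_on = set(fail_on)
        self.lose_ack = lose_ack
        self.attempts = 0
        self.received: List[str] = []

    def send(self, event: str) -> bool:
        self.attempts += 1
        failed = self.attempts in self.fail_on
        if not failed or self.lose_ack:
            self.received.append(event)
        return not failed               # True means the sender saw an ack


def at_most_once(link: FlakyLink, events: Iterable[str]) -> None:
    """Fire and forget: never retry, so events can be lost but never duplicated."""
    for event in events:
        link.send(event)


def at_least_once(link: FlakyLink, events: Iterable[str], max_retries: int = 5) -> None:
    """Retry until acknowledged: events are not lost, but can be duplicated."""
    for event in events:
        for _ in range(1 + max_retries):
            if link.send(event):
                break


events = ["e1", "e2", "e3"]

lossy = FlakyLink(fail_on={2}, lose_ack=False)
at_most_once(lossy, events)             # second send is dropped and never retried

duplicating = FlakyLink(fail_on={2}, lose_ack=True)
at_least_once(duplicating, events)      # second send arrives, ack is lost, sender retries
```

With these inputs, `lossy.received` is missing `e2`, while `duplicating.received` contains `e2` twice. Downstream consumers of an at-least-once pipeline therefore need to tolerate or de-duplicate repeated events.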

Fluentd – Failure Scenarios

Forwarder goes down

Aggregator goes down

KAFKA – The Ingredients

Kafka – distributed streaming platform

[Figure: A Kafka cluster at the center, surrounded by producers (apps), consumers (apps), connectors (DBs), and stream processors (apps).]

Publish-Subscribe streams of records

Store streams of records in fault tolerant way

Process streams of records

Kafka – Terms

Producer

Consumer

Consumer Group

Topic

Partition

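[Figure: A producer writes to topics that are split into Partition-1, Partition-2, and Partition-3, each an ordered sequence of offsets (0, 1, 2, ...). Brokers host the partitions (p1, p2, p3) and their replicas (R1, R2, R3). Consumers in consumer groups (labeled C1 and C2) read from the partitions.]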
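The terms above fit together as follows: a record's key selects a partition, each partition is an append-only sequence with its own offsets, and within a consumer group each partition is read by exactly one consumer. The Python sketch below models this in memory. It is an illustration only: the CRC32 hash is a stand-in (Kafka's default partitioner uses a different hash function), the round-robin assignment is one simple strategy among several, and the class and function names are hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition. CRC32 is a stand-in hash, not Kafka's own."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class Topic:
    """In-memory model of a topic: a list of append-only partitions."""

    def __init__(self, name: str, num_partitions: int) -> None:
        self.name = name
        self.partitions: List[List[Tuple[str, str]]] = [[] for _ in range(num_partitions)]

    def append(self, key: str, value: str) -> Tuple[int, int]:
        """Append a record and return (partition, offset)."""
        partition = partition_for(key, len(self.partitions))
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1


def assign(num_partitions: int, consumers: List[str]) -> Dict[str, List[int]]:
    """Round-robin assignment: every partition goes to exactly one group member."""
    assignment: Dict[str, List[int]] = defaultdict(list)
    for partition in range(num_partitions):
        assignment[consumers[partition % len(consumers)]].append(partition)
    return dict(assignment)


topic = Topic("clicks", num_partitions=3)
first = topic.append("user-42", "view")   # (partition, offset) of the first record
second = topic.append("user-42", "buy")   # same key, so same partition, next offset

group = assign(3, ["C1", "C2"])           # {'C1': [0, 2], 'C2': [1]}
```

Because both records share the key `user-42`, they land in the same partition with consecutive offsets, which is what gives per-key ordering.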

Why Kafka

Ideal unified platform to handle real time data feeds

Has high throughput to support high volume event streams such as log aggregation

Deals well with high volume data loads from offline systems

Fault tolerant and scalable

Delivers the low latency expected of traditional messaging systems

Kafka – decouples data pipelines

[Figure: Multiple producers write to the Kafka broker; multiple consumers read from it independently.]

Kafka – Guarantees

Messages sent by a producer to a particular topic partition are appended in the order they are sent

A consumer instance sees records in the order they are stored in the log

A topic with replication factor N can tolerate up to N-1 broker failures without losing committed records

Kafka – Replication

[Figure: Producers send logs to a four-broker cluster (Broker1 to Broker4). Topic1 has two partitions (part1, part2); each partition has three replicas, one leader and two followers, spread across the brokers.]
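The leader/follower layout is what lets a partition with replication factor N survive N-1 broker failures. The following Python sketch shows the idea with a deliberately simplified model: it assumes every replica is fully in sync and that the next surviving replica simply takes over as leader. The `Partition` class is hypothetical and leaves out how Kafka actually tracks in-sync replicas and elects leaders.

```python
from typing import List, Optional


class Partition:
    """Simplified partition: the first surviving replica acts as leader."""

    def __init__(self, name: str, replicas: List[str]) -> None:
        self.name = name
        self.in_sync = list(replicas)   # assumption: all replicas start fully in sync

    @property
    def leader(self) -> Optional[str]:
        return self.in_sync[0] if self.in_sync else None

    def broker_failed(self, broker: str) -> None:
        """Drop a failed broker; a follower is promoted if the leader was lost."""
        if broker in self.in_sync:
            self.in_sync.remove(broker)


part = Partition("Topic1-part1", ["Broker1", "Broker2", "Broker3"])  # replication factor 3
leaders = [part.leader]

for failed in ["Broker1", "Broker2", "Broker3"]:
    part.broker_failed(failed)
    leaders.append(part.leader)
```

With three replicas, the partition still has a leader after the first and second failure (`Broker2`, then `Broker3`) and only becomes unavailable after the third.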

Zookeeper

• ZooKeeper provides highly reliable distributed coordination

• Kafka ships with a single-node ZooKeeper instance

• Metadata includes broker addresses and message offsets

[Figure: Producers and consumers exchange messages with the Kafka cluster; producers, consumers, and the cluster exchange metadata with ZooKeeper.]

Kafka Persistence - File System

http://deliveryimages.acm.org/10.1145/1570000/1563874/jacobs3.jpg

Sequential File I/O very fast

Uses OS page cache for data storage

Batching of messages speeds up disk operations, network transfers and in memory iterations.

Batch Processing

One of the big drivers for efficiency

Producers accumulate data in memory and send larger batches in a single request

Cap the size of a batch (in bytes) - batch.size

Wait no longer than a fixed latency bound - linger.ms

Trade a small amount of latency for better throughput
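The interplay of the two settings can be shown with a small simulation: a batch is sent when it is full or when its oldest record has waited long enough, whichever comes first. This is a plain-Python sketch, not the Kafka producer. The `Batcher` class and its parameters are hypothetical names that mirror `batch.size` and `linger.ms`, and the clock is passed in explicitly so the behavior is deterministic.

```python
from typing import Callable, List, Optional


class Batcher:
    """Accumulate records; flush when the batch is full or has waited too long."""

    def __init__(self, batch_size_bytes: int, linger_ms: int,
                 send: Callable[[List[bytes]], None]) -> None:
        self.batch_size_bytes = batch_size_bytes    # mirrors batch.size
        self.linger_ms = linger_ms                  # mirrors linger.ms
        self.send = send
        self.pending: List[bytes] = []
        self.pending_bytes = 0
        self.first_ts: Optional[int] = None         # arrival time of the oldest pending record

    def add(self, record: bytes, now_ms: int) -> None:
        # If the record does not fit, send the current batch first.
        if self.pending and self.pending_bytes + len(record) > self.batch_size_bytes:
            self.flush()
        if not self.pending:
            self.first_ts = now_ms
        self.pending.append(record)
        self.pending_bytes += len(record)
        if self.pending_bytes >= self.batch_size_bytes:
            self.flush()

    def tick(self, now_ms: int) -> None:
        """Called periodically: flush once the oldest record has waited linger_ms."""
        if self.pending and now_ms - self.first_ts >= self.linger_ms:
            self.flush()

    def flush(self) -> None:
        if self.pending:
            self.send(list(self.pending))
            self.pending = []
            self.pending_bytes = 0
            self.first_ts = None


sent: List[List[bytes]] = []
batcher = Batcher(batch_size_bytes=10, linger_ms=5, send=sent.append)

batcher.add(b"aaaa", now_ms=0)
batcher.add(b"bbbb", now_ms=1)
batcher.add(b"cccc", now_ms=2)   # would exceed 10 bytes, so the first two are sent
batcher.tick(now_ms=6)           # "cccc" has waited 4 ms: not yet
batcher.tick(now_ms=7)           # "cccc" has waited 5 ms: sent on the linger bound
```

Three records go out in two requests instead of three: the first batch is triggered by size, the second by the latency bound. Raising either limit increases batching (and throughput) at the cost of a little more delay.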

Log Compaction

Per-record retention: keeps at least the latest value for each record key, rather than the coarser-grained time-based retention
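Log compaction can be illustrated on a tiny in-memory log: older records are discarded when a newer record with the same key exists, and surviving records keep their original offsets and order. The Python sketch below makes simplifying assumptions: the `compact` function and the sample keys are hypothetical, and a record with a `None` value (a delete marker) is dropped immediately, whereas Kafka keeps such markers for a configurable period before removing them.

```python
from typing import Any, Dict, List, Optional, Tuple

Record = Tuple[int, str, Optional[Any]]   # (offset, key, value)


def compact(log: List[Record]) -> List[Record]:
    """Keep only the latest record per key; drop keys whose latest value is None."""
    latest: Dict[str, Record] = {}
    for record in log:                    # later records overwrite earlier ones
        latest[record[1]] = record
    survivors = [r for r in latest.values() if r[2] is not None]
    return sorted(survivors, key=lambda r: r[0])   # preserve original offset order


inventory_log: List[Record] = [
    (0, "sku-1", 5),
    (1, "sku-2", 7),
    (2, "sku-1", 4),      # supersedes offset 0
    (3, "sku-3", 1),
    (4, "sku-3", None),   # delete marker for sku-3
]

compacted = compact(inventory_log)
# -> [(1, 'sku-2', 7), (2, 'sku-1', 4)]
```

A consumer that replays the compacted log from the beginning still ends up with the current value for every live key, which is why compaction suits data such as inventory levels.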

Fluentd Kafka Integration

• Kafka – Fluentd Consumer

• Fluentd Kafka plugin

Kafka Ecosystem

[Figure: Log forwarders (Fluentd) push events to the Kafka clusters; consumers (Fluentd) pull from Kafka and push to the destinations (MongoDB, Amazon S3).]

Advantage - Fluentd-Kafka

Backpressure - Pull versus Push

Reliable, flexible data pipeline
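The backpressure advantage of pull over push is that the consumer decides how much it reads and when, so a burst from producers piles up in the log instead of overwhelming the consumer. The sketch below models this in plain Python. The `Log` and `PullConsumer` classes are hypothetical stand-ins for a Kafka partition and a consumer tracking its own offset, not real client code.

```python
from typing import List


class Log:
    """Append-only log standing in for a single Kafka partition."""

    def __init__(self) -> None:
        self.records: List[str] = []

    def append(self, record: str) -> None:
        self.records.append(record)

    def read(self, offset: int, max_records: int) -> List[str]:
        return self.records[offset:offset + max_records]


class PullConsumer:
    """Reads at its own pace: at most max_records per poll, tracking its own offset."""

    def __init__(self, log: Log, max_records: int) -> None:
        self.log = log
        self.max_records = max_records
        self.offset = 0
        self.processed: List[str] = []

    def poll(self) -> int:
        batch = self.log.read(self.offset, self.max_records)
        self.processed.extend(batch)
        self.offset += len(batch)
        return len(batch)

    @property
    def lag(self) -> int:
        """Records written to the log but not yet consumed."""
        return len(self.log.records) - self.offset


log = Log()
for i in range(10):                     # producers write a burst of 10 records
    log.append(f"event-{i}")

consumer = PullConsumer(log, max_records=4)
lags = []
while consumer.poll():                  # the consumer drains the burst in its own chunks
    lags.append(consumer.lag)
```

The lag shrinks from 6 to 2 to 0 across three polls and nothing is dropped. In a push model the same burst would have to be absorbed by the receiver immediately or buffered by the sender.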

Connected Warehouse – Kafka Cluster Architecture

[Figure: Two active data centers (Data Center 1 and Data Center 2) form one Kafka cluster fed by the Fluentd-Kafka plugin. Kafka Broker 1 holds Topic 1, partitions 0..n; Kafka Broker 2 holds Topic 1, partitions n+1..n+n. The ZooKeeper ensemble consists of ZK 1 (leader) and ZK 2 (follower).]

MONGODB – The Ingredients

Why MongoDB

Cross-platform, document-oriented NoSQL database

Simple and Flexible Data Model

Field Level Indexing

Built In Query Capabilities

High Performance

System Architecture With Shards

[Figure: Data sources connect through three mongos routers, backed by a config server, to five shards. Each shard is a replica set with one primary and two secondaries.]

MongoDB For Analytics

Denormalization with support of Embedded Documents

Text Search Queries

Connectors for almost all kinds of data sources

Aggregation Framework

Range Queries, Key value queries
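As an illustration of the aggregation framework combined with a range query, the sketch below builds a pipeline that ranks the most requested pages since a given time. It only constructs the pipeline as plain Python data, so it runs without a database. The field names `time`, `path`, and `code`, the function name, and the idea that access-log events are stored one document per request are assumptions about how the logs were written to MongoDB.

```python
from datetime import datetime
from typing import Any, Dict, List


def top_pages_pipeline(since: datetime, limit: int = 5) -> List[Dict[str, Any]]:
    """Build an aggregation pipeline: most requested paths since a given time."""
    return [
        # Range query on time plus key-value match on the status code.
        {"$match": {"time": {"$gte": since}, "code": 200}},
        # Count hits per path.
        {"$group": {"_id": "$path", "hits": {"$sum": 1}}},
        # Most popular first.
        {"$sort": {"hits": -1}},
        {"$limit": limit},
    ]


pipeline = top_pages_pipeline(datetime(2017, 1, 1), limit=3)
```

With the PyMongo driver, a pipeline like this would typically be passed to a collection's `aggregate` method. Filtering in the first stage lets the database use an index on the time field before any grouping happens.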

SPARK – The Ingredients

Spark – Logical Architecture

Apache Spark

Libraries: Spark SQL, Spark Streaming, MLlib, GraphX

Languages: Scala, Java, Python, R

Spark – MongoDB Connector

Putting It All Together – Clickstream + Inventory Management

Collection

Processing

Ingestion

Data Sync

Micro-Service

Questions & Answers

Thank you

www.modsummit.com

www.developersummit.com
