apache ignite as a data processing hub

APACHE IGNITE AS A DATA PROCESSING HUB

ROMAN SHTYKH

CYBERAGENT, INC.

INTRODUCTION

ABOUT ME

Roman Shtykh

•  R&D Engineer at CyberAgent, Inc.

•  Areas of focus

    •  Data streaming and NLP

•  Committer on the Apache Ignite and MyBatis projects

•  Judoka

•  @rshtykh

CYBERAGENT, INC.

•  Internet ads

•  Games

•  Media

•  Investing

[Pie chart of business segments. Values as extracted: 25%, 13%, 52%, 3%, 7%. Legend as extracted: Games, Media, Internet ads, Investing, Other]

* As of Sep 2015

AMEBA SERVICES

・ Monthly visitors (DUB total): 6 billion*

・ Number of member users: about 39 million*

CyberAgent, Inc.

Ameba Services

* As of Dec 2014

•  Games

•  Community services

•  Content curation

•  Other

AMEBA SERVICES

Ameba Pigg

CONTENTS

•  Apache Ignite

•  Feed your data

•  Log Aggregation with Apache Flume

•  Integration with Apache Ignite

•  Streaming Data with Apache Kafka

•  Data Pipeline with Kafka and Ignite: Example

APACHE IGNITE

•  “High-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time, orders of magnitude faster than possible with traditional disk-based or flash-based technologies.”

•  High performance, unlimited scalability and resiliency

•  High-performance transactions and fast analytics

•  Hadoop Acceleration, Apache Spark

•  Apache project

https://ignite.apache.org/

MAKING APACHE IGNITE A DATA PROCESSING HUB

•  Question: How to feed data?

•  A simple solution: Create a client node

    •  Is it reliable?

    •  Does it scale?

    •  Ignite-only solution?

    •  Does it keep your operational costs low?

LOG AGGREGATION WITH APACHE FLUME

LOG AGGREGATION WITH APACHE FLUME

•  Flume

    •  “Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.”

•  Scalable

•  Flexible

•  Robust and fault tolerant

•  Declarative configuration

•  Apache project

DATA FLOW IN FLUME

[Diagram: Incoming data → Agent (Source → Channel → Sink) → to another Agent or Destination]

DATA FLOW IN FLUME (REPLICATION/MULTIPLEXING)

[Diagram: inside an Agent, the Source feeds a Channel Selector, which routes events to multiple Channels, each with its own Sink]

DATA FLOW IN FLUME (RELIABILITY)

•  No data is lost (configurable)

[Diagram: inside an Agent, the Source writes to the Channel in a Source tx and the Sink reads from it in a Sink tx]
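The reliability guarantee above rests on transactions: a sink removes events from the channel only after they have been delivered. A minimal, self-contained sketch of that idea follows. `ChannelSketch` and its methods are hypothetical stand-ins written for this illustration, not Flume's actual channel or transaction API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for a Flume channel; it illustrates the transactional take only.
class ChannelSketch {
    private final Deque<String> events = new ArrayDeque<>();

    void put(String event) { events.addLast(event); }

    int size() { return events.size(); }

    // Takes up to batchSize events and hands them to the destination.
    // Events are removed from the channel only if delivery succeeds (commit);
    // if the destination throws, nothing is removed (rollback).
    boolean drain(int batchSize, Consumer<List<String>> destination) {
        List<String> batch = new ArrayList<>();
        for (String e : events) {
            if (batch.size() == batchSize) break;
            batch.add(e);
        }
        try {
            destination.accept(batch);
        } catch (RuntimeException ex) {
            return false; // rollback: events stay in the channel for a retry
        }
        for (int i = 0; i < batch.size(); i++) events.removeFirst(); // commit
        return true;
    }

    public static void main(String[] args) {
        ChannelSketch channel = new ChannelSketch();
        channel.put("e1");
        channel.put("e2");
        channel.put("e3");

        // First attempt fails: the channel still holds all three events.
        boolean ok = channel.drain(2, batch -> { throw new IllegalStateException("destination down"); });
        System.out.println(ok + " " + channel.size());

        // Retry succeeds: two events are delivered and removed.
        List<String> delivered = new ArrayList<>();
        ok = channel.drain(2, delivered::addAll);
        System.out.println(ok + " " + channel.size() + " " + delivered);
    }
}
```

The failed attempt leaves the channel untouched, which is what makes a retry safe; the cost is that a destination may see the same batch twice if it fails after a partial write.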

LOG TRANSFER AT AMEBA

[Diagram: Ameba Services → Aggregators → destinations shown: Monitoring, Recommender System, Elastic Search, Hadoop (batch processing), HBase, Stream Processing (Onix), Stream Processing (HBaseSink)]

LOG TRANSFER AT AMEBA

•  Web Hosts

    •  More than 1600

•  Size

    •  5.0 TB/day (raw)

•  Traffic at peak

    •  160 Mbps (compressed)

IGNITE SINK

•  Reads Flume events from a channel

•  Converts them into cacheable entries using a user-implemented, pluggable transformer

•  Adding it requires no modification to the existing architecture

FLUME ⇒ IGNITE (1)

[Diagram in three steps, each showing an Agent with Source → Channel → Ignite Sink: (1) incoming data; new connection. (2) Sink tx: start tx. (3) Sink tx: take event, send events]

ENABLING FLUME SINK

•  Steps

1.  Implement EventTransformer

    •  Converts Flume events into cacheable entries (java.util.Map<K, V>)

2.  Put the transformer’s jar in ${FLUME_HOME}/plugins.d/ignite/lib

3.  Put the IgniteSink and Ignite core jar files in ${FLUME_HOME}/plugins.d/ignite/libext

4.  Set up a Flume agent

    •  Sink setup

a1.sinks.k1.type = org.apache.ignite.stream.flume.IgniteSink
a1.sinks.k1.igniteCfg = /some-path/ignite.xml
a1.sinks.k1.cacheName = testCache
a1.sinks.k1.eventTransformer = my.company.MyEventTransformer
a1.sinks.k1.batchSize = 100
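Step 1 asks for a transformer that turns a batch of Flume events into a map of cache entries. The sketch below shows only that conversion logic. `FlumeEventStub` is a hypothetical stand-in for a Flume event so that the example compiles without Flume or Ignite on the classpath, and the "key:value" body format is an assumption made for illustration; a real transformer would implement Ignite's EventTransformer interface instead.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Flume event: only the body is modeled.
class FlumeEventStub {
    final byte[] body;
    FlumeEventStub(String text) { this.body = text.getBytes(StandardCharsets.UTF_8); }
}

// Sketch of the transformation logic; the "key:value" body format is an assumption.
class MyEventTransformerSketch {
    static Map<String, Integer> transform(List<FlumeEventStub> events) {
        Map<String, Integer> entries = new LinkedHashMap<>();
        for (FlumeEventStub event : events) {
            String[] parts = new String(event.body, StandardCharsets.UTF_8).split(":", 2);
            if (parts.length != 2) continue; // skip events that do not match the assumed format
            try {
                entries.put(parts[0].trim(), Integer.valueOf(parts[1].trim()));
            } catch (NumberFormatException ignored) {
                // skip events whose value is not an integer
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        List<FlumeEventStub> batch = Arrays.asList(
            new FlumeEventStub("user1:10"),
            new FlumeEventStub("malformed"),
            new FlumeEventStub("user2:20"));
        System.out.println(transform(batch)); // prints {user1=10, user2=20}
    }
}
```

Skipping malformed events rather than throwing is a design choice of this sketch: one bad event then does not block the whole batch.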

FLUME SINKS

•  HDFS

•  THRIFT

•  AVRO

•  HBASE

•  ElasticSearch

•  IRC

•  IGNITE

APACHE FLUME & APACHE IGNITE

•  If you do data aggregation with Flume

    •  Adding an Ignite cluster is as simple as writing a data transformer and deploying a new Flume agent

•  If you store your data (and do computations) in Ignite

    •  Improving data ingestion becomes easy with the Flume sink

•  Combining Apache Flume and Ignite makes/keeps your data pipeline (both aggregation and processing)

    •  Scalable

    •  Reliable

    •  Highly performant

STREAMING DATA WITH APACHE KAFKA

APACHE KAFKA

“Publish-subscribe messaging rethought as a distributed commit log”

•  Low latency

•  High throughput

•  Partitioned and replicated

•  Kafka is an essential component of any data pipeline today

http://kafka.apache.org/

APACHE KAFKA

•  Messages are grouped in topics

•  Each partition is a log

•  Each partition is managed by a broker (when replicated, one broker is the partition leader)

•  Producers & consumers (consumer groups)

•  Used for

    •  Log aggregation

    •  Activity tracking

    •  Monitoring

    •  Stream processing

http://kafka.apache.org/documentation.html
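The "each partition is a log" point can be made concrete with a small in-memory model: a topic holds several append-only logs, a record's key selects the partition, and the offset is simply the record's position in that log. `TopicSketch` is hypothetical and is not Kafka's client API; the `hashCode`-based partitioning is a simplification assumed here, not Kafka's actual partitioner.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of a topic; not Kafka's client API.
class TopicSketch {
    private final List<List<String>> partitions = new ArrayList<>();

    TopicSketch(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) partitions.add(new ArrayList<>());
    }

    // Simplified partitioning: records with the same key always land in the same partition.
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    // Appends to the partition's log and returns the record's offset within that partition.
    long append(String key, String value) {
        List<String> log = partitions.get(partitionFor(key));
        log.add(value);
        return log.size() - 1;
    }

    // A consumer reads a partition sequentially, starting from an offset it tracks itself.
    List<String> read(int partition, long fromOffset) {
        List<String> log = partitions.get(partition);
        return new ArrayList<>(log.subList((int) fromOffset, log.size()));
    }

    public static void main(String[] args) {
        TopicSketch topic = new TopicSketch(3);
        long first = topic.append("user-1", "click");
        long second = topic.append("user-1", "impression");
        int p = topic.partitionFor("user-1");
        System.out.println(first + " " + second); // prints 0 1
        System.out.println(topic.read(p, 1));     // prints [impression]
    }
}
```

Because the log is never modified in place, a consumer can re-read from any earlier offset, which is what makes replay possible.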

KAFKA CONNECT

•  Designed for large-scale stream data integration using Kafka

•  Provides an abstraction from communication with your Kafka cluster

    •  Offset management

    •  Delivery semantics

    •  Fault tolerance

    •  Monitoring, etc.

•  Worker (scalability & fault tolerance)

•  Connector (task config)

•  Task (thread)

•  Standalone & Distributed execution models

http://www.confluent.io/blog/apache-kafka-0.9-is-released

INGESTING DATA STREAMS

•  Two ways

    •  Kafka Streamer

    •  Sink Connector

[Diagram: data enters Ignite (SQL queries, distributed closures, transactions) through Kafka Connect; label: ETL]

STREAMING VIA SINK CONNECTOR

•  Configure your connector

•  Configure Kafka Connect worker

•  Start your connector

# connector
name=my-ignite-connector
connector.class=IgniteSinkConnector
tasks.max=2
topics=someTopic1,someTopic2
# cache
cacheName=myCache
cacheAllowOverwrite=true
igniteCfg=/some-path/ignite.xml

$ bin/connect-standalone.sh myconfig/connect-standalone.properties myconfig/ignite-connector.properties

STREAMING VIA SINK CONNECTOR

•  Easy data pipeline

•  Records from Kafka are written to the Ignite grid via the high-performance IgniteDataStreamer

•  At-least-once delivery guarantee

•  As of 1.6, start a new connector to write to a different cache

[Diagram: Kafka records (a, b, c, d, e and a2, b2, c2, d2, e2) at Kafka offsets 0, 1, 2, … are written as entries (a.key, a.val), (b.key, b.val), …]
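At-least-once delivery means that after a failure a record may be written again, but it is never skipped. The sketch below models one way to get that behavior: the offset is committed only after the write, so a restart replays from the last committed offset. `AtLeastOnceSketch` is a hypothetical model, not the connector's code, and the map merely stands in for an Ignite cache.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a sink task; the map stands in for an Ignite cache.
class AtLeastOnceSketch {
    final Map<String, String> cache = new HashMap<>();
    long committedOffset = 0;   // next offset to read after a restart
    int writes = 0;             // counts writes, including redelivered ones

    // Processes records from the committed offset; the offset is committed only
    // after the write, so a crash in between causes redelivery, never loss.
    // crashAtOffset simulates a failure after writing that record (-1 = no crash).
    void poll(List<String[]> log, long crashAtOffset) {
        for (long offset = committedOffset; offset < log.size(); offset++) {
            String[] record = log.get((int) offset);
            cache.put(record[0], record[1]);
            writes++;
            if (offset == crashAtOffset) return; // crashed before committing this offset
            committedOffset = offset + 1;
        }
    }

    public static void main(String[] args) {
        List<String[]> log = Arrays.asList(
            new String[] {"a.key", "a.val"},
            new String[] {"b.key", "b.val"},
            new String[] {"c.key", "c.val"});
        AtLeastOnceSketch task = new AtLeastOnceSketch();
        task.poll(log, 1);   // crashes after writing b, before committing offset 1
        task.poll(log, -1);  // restart: b is delivered again, then c
        System.out.println(task.writes + " writes, " + task.cache.size() + " entries");
    }
}
```

In this model the redelivered record is written twice but the cache still ends up with one entry per key, because keyed writes make a replay harmless.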

INGESTING DATA STREAMS

•  Bi-directional streaming

[Diagram: a Kafka Connect Sink streams data into Ignite (SQL queries, distributed closures, transactions); a Kafka Connect Source streams data back out of Ignite (events, continuous queries)]

STREAMING BACK TO KAFKA

•  Listening to cache events

    •  PUT

    •  READ

    •  REMOVED

    •  EXPIRED, etc.

•  Remote filtering can be enabled

•  Kafka Connect offsets are ignored

•  Currently, no delivery guarantees

[Diagram: cache events evt1, evt2, evt3 are sent to Kafka as records]
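A remote filter is, in essence, a predicate evaluated against each cache event before the event is turned into a Kafka record. The sketch below illustrates that selection step only. `CacheEventType`, `CacheEventStub`, and `EventFilterSketch` are hypothetical stand-ins for Ignite's event classes, and the key-prefix rule is an assumption chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical stand-ins for Ignite's cache event types.
enum CacheEventType { PUT, READ, REMOVED, EXPIRED }

class CacheEventStub {
    final CacheEventType type;
    final String key;
    CacheEventStub(CacheEventType type, String key) { this.type = type; this.key = key; }
}

class EventFilterSketch {
    // Keeps only the subscribed event types whose key starts with the given prefix.
    static Predicate<CacheEventStub> filter(Set<CacheEventType> subscribed, String keyPrefix) {
        return evt -> subscribed.contains(evt.type) && evt.key.startsWith(keyPrefix);
    }

    // Returns a "TYPE:key" string for each event that passes the filter, in order;
    // these would become Kafka records.
    static List<String> toRecords(List<CacheEventStub> events, Predicate<CacheEventStub> filter) {
        List<String> records = new ArrayList<>();
        for (CacheEventStub evt : events) {
            if (filter.test(evt)) records.add(evt.type + ":" + evt.key);
        }
        return records;
    }

    public static void main(String[] args) {
        List<CacheEventStub> events = Arrays.asList(
            new CacheEventStub(CacheEventType.PUT, "ad-1"),
            new CacheEventStub(CacheEventType.READ, "ad-1"),
            new CacheEventStub(CacheEventType.REMOVED, "user-7"),
            new CacheEventStub(CacheEventType.REMOVED, "ad-2"));
        Predicate<CacheEventStub> f = filter(EnumSet.of(CacheEventType.PUT, CacheEventType.REMOVED), "ad-");
        System.out.println(toRecords(events, f)); // prints [PUT:ad-1, REMOVED:ad-2]
    }
}
```

Filtering where the events originate keeps unwanted events off the network, which is the usual motivation for a remote filter.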

ENABLING SOURCE CONNECTOR

•  Configure your connector

    •  Define a remote filter if needed: cacheFilterCls=MyCacheEventFilter

    •  Make sure that event listening is enabled on the server nodes

•  Configure Kafka Connect worker

•  Start your connector

# connector
name=ignite-src-connector
connector.class=org.apache.ignite.stream.kafka.connect.IgniteSourceConnector
tasks.max=2
# topics, events
topicNames=test
cacheEvts=put,removed
# cache
cacheName=myCache
igniteCfg=myconfig/ignite.xml

key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.ignite.stream.kafka.connect.serialization.CacheEventConverter

APACHE KAFKA & APACHE IGNITE

•  If you do data streaming with Kafka

    •  Adding an Ignite cluster is as simple as writing a configuration file (and creating a filter if you need it for the source)

•  If you store your data (and do computations) in Ignite

    •  Improving data ingestion and listening for events on data becomes easy with Kafka Connectors

•  Combining Apache Kafka and Ignite makes/keeps your data pipeline

    •  Scalable

    •  Reliable

    •  Highly performant

    •  Covers a wide range of ETL contexts

DATA PIPELINE WITH KAFKA AND IGNITE: EXAMPLE

DATA PIPELINE WITH KAFKA AND IGNITE

•  Requirements

    •  instant processing and analysis

    •  scalable and resilient to failures

    •  low latency

    •  high throughput

    •  flexibility


•  Filter and aggregate events

[Diagram: data → Flume → filter/transform → data; annotations: slow down on heavy loads, more channels/layers]

[Diagram: data and other sources → filter, transform, etc.]

•  Parsimonious resource use

•  Replay enabled

•  More operations on streams

•  Flexibility


•  Filter and aggregate events

•  Store events

•  Notify about updates on aggregates

[Diagram: data → filter, transform, etc. → Connectors]
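The three stages listed on this slide (filter, aggregate and store, notify) can be sketched in a few lines. The class below is a hypothetical illustration, not code from the talk: the "<type>,<adId>" event format, the click counter, and the listener callback are all assumptions, and the in-memory map only stands in for whatever store holds the aggregates.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical aggregation stage: counts clicks per ad and notifies a listener on each update.
class ClickAggregatorSketch {
    private final Map<String, Long> clicksPerAd = new HashMap<>();
    private final BiConsumer<String, Long> listener;

    ClickAggregatorSketch(BiConsumer<String, Long> listener) { this.listener = listener; }

    // Event format assumed here: "<type>,<adId>", e.g. "click,ad-1".
    void onEvent(String event) {
        String[] parts = event.split(",");
        if (parts.length != 2 || !"click".equals(parts[0])) return; // filter
        long updated = clicksPerAd.merge(parts[1], 1L, Long::sum);   // aggregate and store
        listener.accept(parts[1], updated);                          // notify
    }

    long clicks(String adId) { return clicksPerAd.getOrDefault(adId, 0L); }

    public static void main(String[] args) {
        List<String> notifications = new ArrayList<>();
        ClickAggregatorSketch agg = new ClickAggregatorSketch((ad, n) -> notifications.add(ad + "=" + n));
        for (String e : new String[] {"click,ad-1", "impression,ad-1", "click,ad-1", "click,ad-2"}) {
            agg.onEvent(e);
        }
        System.out.println(notifications); // prints [ad-1=1, ad-1=2, ad-2=1]
    }
}
```

Notifying on every update keeps the sketch simple; a real pipeline might batch or throttle notifications instead.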


•  Improving ads delivery

[Diagram: clicks and impressions flow in, ads flow out; components: Ads delivery, Ads recommender, storage/computation, Image storage; annotation: data & computation in one place]


•  Improving ads delivery

•  Better network utilization and reliability

[Diagram: clicks and impressions flow in, ads flow out; components: Ads delivery, Ads recommender, storage/computation, Image storage, Anomaly detection]

OTHER INTEGRATIONS

OTHER COMPLETED INTEGRATIONS

•  CAMEL

•  MQTT

•  STORM

•  FLINK SINK

•  TWITTER

THE END
