APACHE IGNITE AS A DATA PROCESSING HUB ROMAN SHTYKH CYBERAGENT, INC.

Post on 14-Feb-2017

Page 1: APACHE IGNITE AS A DATA PROCESSING HUB

APACHE IGNITE AS A DATA PROCESSING HUB ROMAN SHTYKH

CYBERAGENT, INC.

Page 2: APACHE IGNITE AS A DATA PROCESSING HUB

INTRODUCTION

Page 3: APACHE IGNITE AS A DATA PROCESSING HUB

ABOUT ME

Roman Shtykh

•  R&D Engineer at CyberAgent, Inc.
•  Areas of focus
    •  Data streaming and NLP
•  Committer on the Apache Ignite and MyBatis projects
•  Judoka
•  @rshtykh

Page 4: APACHE IGNITE AS A DATA PROCESSING HUB

CYBERAGENT, INC.

•  Internet ads
•  Games
•  Media
•  Investing

[Pie chart: share by segment (Games, Media, Internet ads, Investing, Other); values shown: 25%, 13%, 52%, 3%, 7%]

* As of Sep 2015

Page 5: APACHE IGNITE AS A DATA PROCESSING HUB

AMEBA SERVICES

・ Monthly visitors (DUB total): 6 billion*
・ Number of member users: about 39 million*

CyberAgent, Inc.

Ameba Services

* As of Dec 2014

•  Games
•  Community services
•  Content curation
•  Other

Page 6: APACHE IGNITE AS A DATA PROCESSING HUB

AMEBA SERVICES

Ameba Pigg

Page 7: APACHE IGNITE AS A DATA PROCESSING HUB

CONTENTS

•  Apache Ignite
•  Feed your data
    •  Log Aggregation with Apache Flume
        •  Integration with Apache Ignite
    •  Streaming Data with Apache Kafka
•  Data Pipeline with Kafka and Ignite: Example

Page 8: APACHE IGNITE AS A DATA PROCESSING HUB

APACHE IGNITE

•  “High-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time, orders of magnitude faster than possible with traditional disk-based or flash-based technologies.”
•  High performance, unlimited scalability and resiliency
•  High-performance transactions and fast analytics
•  Hadoop Acceleration, Apache Spark
•  Apache project

https://ignite.apache.org/

Page 9: APACHE IGNITE AS A DATA PROCESSING HUB

MAKING APACHE IGNITE A DATA PROCESSING HUB

•  Question: How to feed data?
•  A simple solution: Create a client node

Page 10: APACHE IGNITE AS A DATA PROCESSING HUB

MAKING APACHE IGNITE A DATA PROCESSING HUB

•  Question: How to feed data?
•  A simple solution: Create a client node
    •  Is it reliable?
    •  Does it scale?
    •  Ignite-only solution?
    •  Does it keep your operational costs low?

Page 12: APACHE IGNITE AS A DATA PROCESSING HUB

LOG AGGREGATION WITH APACHE FLUME

Page 13: APACHE IGNITE AS A DATA PROCESSING HUB

LOG AGGREGATION WITH APACHE FLUME

•  Flume
    •  “Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.”
•  Scalable
•  Flexible
•  Robust and fault tolerant
•  Declarative configuration
•  Apache project

Page 14: APACHE IGNITE AS A DATA PROCESSING HUB

DATA FLOW IN FLUME

[Diagram: incoming data -> Agent (Source -> Channel -> Sink) -> another Agent or destination]

Page 15: APACHE IGNITE AS A DATA PROCESSING HUB

DATA FLOW IN FLUME (REPLICATION/MULTIPLEXING)

[Diagram: incoming data -> Agent: Source -> Channel Selector -> two Channels, each drained by its own Sink]

Page 16: APACHE IGNITE AS A DATA PROCESSING HUB

DATA FLOW IN FLUME (RELIABILITY)

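•  No data is lost (configurable)

[Diagram: incoming data -> Agent (Source -> Channel -> Sink); the Source writes to the Channel in a source transaction (Source tx) and the Sink reads from it in a sink transaction (Sink tx)]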
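The source tx / sink tx idea on this slide can be illustrated with a toy model. The sketch below is not Flume code: `ToyChannel` and its methods are hypothetical stand-ins written with the JDK only, assuming a sink that takes events inside a transaction and then either commits (delivery succeeded) or rolls back (delivery failed, events return to the channel).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy in-memory channel with take/commit/rollback (hypothetical, not the Flume API). */
class ToyChannel {
    private final Deque<String> queue = new ArrayDeque<>();
    private final List<String> inFlight = new ArrayList<>();

    /** Source side: append an event to the channel. */
    void put(String event) {
        queue.addLast(event);
    }

    /** Sink side: take the next event; it stays "in flight" until commit or rollback. */
    String take() {
        String event = queue.pollFirst();
        if (event != null) {
            inFlight.add(event);
        }
        return event;
    }

    /** Delivery succeeded: forget the in-flight events. */
    void commit() {
        inFlight.clear();
    }

    /** Delivery failed: return in-flight events to the head of the queue in their original order. */
    void rollback() {
        for (int i = inFlight.size() - 1; i >= 0; i--) {
            queue.addFirst(inFlight.get(i));
        }
        inFlight.clear();
    }

    /** Number of events currently waiting in the channel. */
    int size() {
        return queue.size();
    }
}
```

Because a rollback puts the events back in order, a failed delivery can simply be retried, which is the intuition behind "no data is lost".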

Page 17: APACHE IGNITE AS A DATA PROCESSING HUB

LOG TRANSFER AT AMEBA

[Diagram: Ameba services send logs to a tier of aggregators; components shown downstream: Monitoring, Recommender System, Elasticsearch, Hadoop batch processing, HBase, Stream Processing (Onix), Stream Processing (HBaseSink)]

Page 18: APACHE IGNITE AS A DATA PROCESSING HUB

LOG TRANSFER AT AMEBA

•  Web Hosts
    •  More than 1600
•  Size
    •  5.0 TB/day (raw)
•  Traffic at peak
    •  160 Mbps (compressed)

Page 19: APACHE IGNITE AS A DATA PROCESSING HUB

IGNITE SINK

•  Reads Flume events from a channel
•  Converts them into cacheable entries using a user-implemented, pluggable transformer
•  Adding it requires no modification to the existing architecture

Page 20: APACHE IGNITE AS A DATA PROCESSING HUB

FLUME ⇒ IGNITE (1)

[Diagram: incoming data -> Agent (Source -> Channel -> Ignite Sink); step 1: new connection]

Page 21: APACHE IGNITE AS A DATA PROCESSING HUB

FLUME ⇒ IGNITE (2)

[Diagram: incoming data -> Agent (Source -> Channel -> Ignite Sink); step 2: the Ignite Sink starts a transaction (Sink tx) on the Channel]

Page 22: APACHE IGNITE AS A DATA PROCESSING HUB

FLUME ⇒ IGNITE (3)

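[Diagram: incoming data -> Agent (Source -> Channel -> Ignite Sink); step 3: within the sink transaction (Sink tx), the Ignite Sink takes events from the Channel and sends them on]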

Page 23: APACHE IGNITE AS A DATA PROCESSING HUB

ENABLING FLUME SINK

•  Steps
    1.  Implement EventTransformer
        •  Converts Flume events into cacheable entries (java.util.Map<K, V>)
    2.  Put the transformer’s jar into ${FLUME_HOME}/plugins.d/ignite/lib
    3.  Put the IgniteSink and Ignite core jar files into ${FLUME_HOME}/plugins.d/ignite/libext
    4.  Set up a Flume agent
•  Sink setup

a1.sinks.k1.type = org.apache.ignite.stream.flume.IgniteSink
a1.sinks.k1.igniteCfg = /some-path/ignite.xml
a1.sinks.k1.cacheName = testCache
a1.sinks.k1.eventTransformer = my.company.MyEventTransformer
a1.sinks.k1.batchSize = 100
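As an illustration of step 1, here is a minimal sketch of what a transformer might look like. It is deliberately self-contained: `Event` and `EventTransformer` below are local stand-ins declared for this example, not the real Flume and Ignite types (a real implementation would use Flume's event type and the `EventTransformer` interface shipped with the Ignite Flume module, whose contract, as far as I understand it, is likewise "a batch of events in, a map of cache entries out"). The "key,value" body format is an assumption made purely for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Local stand-in for a Flume event: only the body is modeled here. */
interface Event {
    byte[] getBody();
}

/** Local stand-in for the transformer contract: a batch of events in, cache entries out. */
interface EventTransformer<E, K, V> {
    Map<K, V> transform(List<E> events);
}

/** Hypothetical transformer: expects bodies of the form "key,value" and skips anything else. */
class MyEventTransformer implements EventTransformer<Event, String, String> {
    @Override
    public Map<String, String> transform(List<Event> events) {
        Map<String, String> entries = new HashMap<>();
        for (Event event : events) {
            String line = new String(event.getBody(), StandardCharsets.UTF_8);
            int comma = line.indexOf(',');
            if (comma > 0) {
                // Text before the first comma becomes the cache key, the rest the value.
                entries.put(line.substring(0, comma), line.substring(comma + 1));
            }
        }
        return entries;
    }
}
```

Skipping malformed events rather than throwing is a design choice for this sketch; a production transformer might prefer to log or count them.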

Page 24: APACHE IGNITE AS A DATA PROCESSING HUB

FLUME SINKS

•  HDFS
•  THRIFT
•  AVRO
•  HBASE
•  ElasticSearch
•  IRC
•  IGNITE

Page 25: APACHE IGNITE AS A DATA PROCESSING HUB

APACHE FLUME & APACHE IGNITE

•  If you do data aggregation with Flume
    •  Adding an Ignite cluster is as easy as writing a simple data transformer and deploying a new Flume agent
•  If you store your data (and do computations) in Ignite
    •  Improving data injection becomes easy with the Flume sink
•  Combining Apache Flume and Ignite makes/keeps your data pipeline (both aggregation and processing)
    •  Scalable
    •  Reliable
    •  Highly performant

Page 26: APACHE IGNITE AS A DATA PROCESSING HUB

STREAMING DATA WITH APACHE KAFKA

Page 27: APACHE IGNITE AS A DATA PROCESSING HUB

APACHE KAFKA

“Publish-subscribe messaging rethought as a distributed commit log”

•  Low latency
•  High throughput
•  Partitioned and replicated
•  Kafka is an essential component of any data pipeline today

http://kafka.apache.org/

Page 28: APACHE IGNITE AS A DATA PROCESSING HUB

APACHE KAFKA

•  Messages are grouped in topics, and each topic is split into partitions
•  Each partition is a log
•  Each partition is managed by a broker (when replicated, one broker is the partition leader)
•  Producers & consumers (consumer groups)
•  Used for
    •  Log aggregation
    •  Activity tracking
    •  Monitoring
    •  Stream processing

http://kafka.apache.org/documentation.html
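To make "each partition is a log" concrete, here is a toy model written with the JDK only. It is not Kafka code: `ToyTopic` is a hypothetical class, and choosing the partition from `String.hashCode()` is a simplification assumed for the example rather than Kafka's actual partitioner. The point is that appends return an offset, and that reading from an offset never modifies the log, so the same data can be re-read.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy topic: a fixed number of partitions, each an append-only list (hypothetical, not Kafka code). */
class ToyTopic {
    private final List<List<String>> partitions = new ArrayList<>();

    ToyTopic(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    /** Picks a partition from the key's hash; a simplified stand-in for a real partitioner. */
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    /** Appends a message and returns its offset within the chosen partition. */
    long append(String key, String value) {
        List<String> log = partitions.get(partitionFor(key));
        log.add(value);
        return log.size() - 1;
    }

    /** Returns a copy of everything in a partition from the given offset onward. */
    List<String> readFrom(int partition, long offset) {
        List<String> log = partitions.get(partition);
        return new ArrayList<>(log.subList((int) offset, log.size()));
    }
}
```

Messages with the same key land in the same partition, so their relative order is preserved within that partition's log.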

Page 29: APACHE IGNITE AS A DATA PROCESSING HUB

KAFKA CONNECT

•  Designed for large scale stream data integration using Kafka
•  Provides an abstraction from communication with your Kafka cluster
    •  Offset management
    •  Delivery semantics
    •  Fault tolerance
    •  Monitoring, etc.
•  Worker (scalability & fault tolerance)
•  Connector (task config)
•  Task (thread)
•  Standalone & Distributed execution models

http://www.confluent.io/blog/apache-kafka-0.9-is-released

Page 30: APACHE IGNITE AS A DATA PROCESSING HUB

INGESTING DATA STREAMS

•  Two ways
    •  Kafka Streamer
    •  Sink Connector

[Diagram: data flows from Kafka through Kafka Connect into Ignite (SQL queries, distributed closures, transactions); labeled "ETL"]

Page 31: APACHE IGNITE AS A DATA PROCESSING HUB

STREAMING VIA SINK CONNECTOR

•  Configure your connector
•  Configure Kafka Connect worker
•  Start your connector

# connector
name=my-ignite-connector
connector.class=org.apache.ignite.stream.kafka.connect.IgniteSinkConnector
tasks.max=2
topics=someTopic1,someTopic2
# cache
cacheName=myCache
cacheAllowOverwrite=true
igniteCfg=/some-path/ignite.xml

$ bin/connect-standalone.sh myconfig/connect-standalone.properties myconfig/ignite-connector.properties

Page 32: APACHE IGNITE AS A DATA PROCESSING HUB

STREAMING VIA SINK CONNECTOR

•  Easy data pipeline
•  Records from Kafka are written to the Ignite grid via the high-performance IgniteDataStreamer
•  At-least-once delivery guarantee
•  As of 1.6, start a new connector to write to a different cache

[Diagram: Kafka records a, b, c, d, e at offsets 0, 1, 2, … are written to the cache as key-value entries (a.key, a.val), (b.key, b.val), …; a second row of records a2, b2, c2, d2, e2 is also shown]
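At-least-once delivery means a record may be written more than once, for example when a task fails after writing but before its offset is committed. The toy model below, written with the JDK only, shows why that is harmless when writes overwrite by key (the slides' sink configuration sets `cacheAllowOverwrite=true`). `ToySink` is a hypothetical class for illustration, not the Ignite connector, and the "crash before commit" flag is an assumption used to simulate a failure.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy sink illustrating at-least-once delivery (hypothetical; not the Ignite connector). */
class ToySink {
    final Map<String, String> cache = new HashMap<>();
    long committedOffset = 0;   // next offset to read after a restart
    int writes = 0;             // counts every write, including repeats

    /**
     * Writes all records from committedOffset onward into the cache.
     * If failBeforeCommit is true, the writes happen but the offset is not advanced,
     * as if the task had crashed before committing.
     */
    void deliver(List<String[]> records, boolean failBeforeCommit) {
        for (long i = committedOffset; i < records.size(); i++) {
            String[] kv = records.get((int) i);
            cache.put(kv[0], kv[1]);   // overwriting by key makes a repeated write harmless
            writes++;
        }
        if (!failBeforeCommit) {
            committedOffset = records.size();
        }
    }
}
```

After a simulated failure and a retry, every record has been written twice, yet the cache holds exactly one entry per key.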

Page 33: APACHE IGNITE AS A DATA PROCESSING HUB

INGESTING DATA STREAMS

•  Bi-directional streaming

[Diagram: a Kafka Connect sink streams data from Kafka into Ignite (SQL queries, distributed closures, transactions); a Kafka Connect source streams Ignite events and continuous queries back to Kafka]

Page 34: APACHE IGNITE AS A DATA PROCESSING HUB

STREAMING BACK TO KAFKA

•  Listening to cache events
    •  PUT
    •  READ
    •  REMOVED
    •  EXPIRED, etc.
•  Remote filtering can be enabled
•  Kafka Connect offsets are ignored
•  Currently, no delivery guarantees

[Diagram: cache events evt1, evt2, evt3 are sent to Kafka as records]

Page 35: APACHE IGNITE AS A DATA PROCESSING HUB

ENABLING SOURCE CONNECTOR

•  Configure your connector
    •  Define a remote filter if needed
       cacheFilterCls=MyCacheEventFilter
    •  Make sure that event listening is enabled on the server nodes
•  Configure Kafka Connect worker
•  Start your connector

# connector
name=ignite-src-connector
connector.class=org.apache.ignite.stream.kafka.connect.IgniteSourceConnector
tasks.max=2
# topics, events
topicNames=test
cacheEvts=put,removed
# cache
cacheName=myCache
igniteCfg=myconfig/ignite.xml

key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.ignite.stream.kafka.connect.serialization.CacheEventConverter
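The remote filter named by `cacheFilterCls` decides which cache events are forwarded to Kafka. The sketch below shows only the shape of such a filter, using the JDK's `Predicate` and a local stand-in event class; both are assumptions made to keep the example self-contained. As far as I understand, a real filter would be written against Ignite's own predicate and cache event types instead, and the "ad:" key prefix is invented purely for illustration.

```java
import java.util.function.Predicate;

/** Local stand-in for a cache event: only the fields this sketch needs. */
class ToyCacheEvent {
    final String type;   // e.g. "put" or "removed"
    final String key;

    ToyCacheEvent(String type, String key) {
        this.type = type;
        this.key = key;
    }
}

/** Hypothetical filter: forward only events whose key starts with "ad:". */
class MyCacheEventFilter implements Predicate<ToyCacheEvent> {
    @Override
    public boolean test(ToyCacheEvent event) {
        return event.key != null && event.key.startsWith("ad:");
    }
}
```

Filtering on the server nodes keeps uninteresting events from crossing the network at all, which is the main reason to prefer a remote filter over filtering in the consumer.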

Page 36: APACHE IGNITE AS A DATA PROCESSING HUB

APACHE KAFKA & APACHE IGNITE

•  If you do data streaming with Kafka
    •  Adding an Ignite cluster is as simple as writing a configuration file (and creating a filter if you need it for source)
•  If you store your data (and do computations) in Ignite
    •  Improving data injection and listening for events on data becomes easy with Kafka Connectors
•  Combining Apache Kafka and Ignite makes/keeps your data pipeline
    •  Scalable
    •  Reliable
    •  Highly performant
    •  Covers a wide range of ETL contexts

Page 37: APACHE IGNITE AS A DATA PROCESSING HUB

DATA PIPELINE WITH KAFKA AND IGNITE: EXAMPLE

Page 38: APACHE IGNITE AS A DATA PROCESSING HUB

DATA PIPELINE WITH KAFKA AND IGNITE

•  Requirements
    •  instant processing and analysis
    •  scalable and resilient to failures
    •  low latency
    •  high throughput
    •  flexibility

Page 39: APACHE IGNITE AS A DATA PROCESSING HUB

DATA PIPELINE WITH KAFKA AND IGNITE

•  Filter and aggregate events

[Diagram: data flows through Flume filter/transform stages; annotations: "slow down on heavy loads", "more channels/layers"]

Page 40: APACHE IGNITE AS A DATA PROCESSING HUB

DATA PIPELINE WITH KAFKA AND IGNITE

[Diagram: data and other sources flow through filter, transform, etc. stages]

•  Parsimonious resource use
•  Replay enabled
•  More operations on streams
•  Flexibility

Page 41: APACHE IGNITE AS A DATA PROCESSING HUB

DATA PIPELINE WITH KAFKA AND IGNITE

•  Filter and aggregate events
•  Store events
•  Notify about updates on aggregates

[Diagram: data flows through filter, transform, etc. stages and on through Connectors]
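The three roles on this slide (filter and aggregate, store, notify) can be sketched in a few lines of plain Java. This is only an illustration of the logic: in the pipeline described in the slides these roles are played by Kafka, the connectors and Ignite, whereas `ClickAggregator`, the "type:adId" event format and the listener callback below are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Toy aggregator: filters events, keeps per-ad counts, and notifies a listener on each update. */
class ClickAggregator {
    private final Map<String, Long> counts = new HashMap<>();
    private final BiConsumer<String, Long> listener;

    ClickAggregator(BiConsumer<String, Long> listener) {
        this.listener = listener;
    }

    /** Accepts events of the form "type:adId"; only "click" events are aggregated. */
    void onEvent(String event) {
        String[] parts = event.split(":", 2);
        if (parts.length != 2 || !"click".equals(parts[0])) {
            return;   // filtered out
        }
        long updated = counts.merge(parts[1], 1L, Long::sum);   // store the new aggregate
        listener.accept(parts[1], updated);                     // notify about the update
    }

    /** Current click count for an ad, or 0 if none has been seen. */
    long count(String adId) {
        return counts.getOrDefault(adId, 0L);
    }
}
```

For example, feeding it "click:ad1", "impression:ad1" and "click:ad1" leaves a count of 2 for ad1 and triggers two notifications.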

Page 43: APACHE IGNITE AS A DATA PROCESSING HUB

DATA PIPELINE WITH KAFKA AND IGNITE

•  Improving ads delivery

[Diagram: clicks, impressions and ads; components shown: Ads delivery, Ads recommender, storage/computation, Image storage; annotation: "data & computation in one place"]

Page 44: APACHE IGNITE AS A DATA PROCESSING HUB

DATA PIPELINE WITH KAFKA AND IGNITE

•  Improving ads delivery
•  Better network utilization and reliability

[Diagram: clicks, impressions and ads; components shown: Ads delivery, Ads recommender, storage/computation, Image storage, Anomaly detection]

Page 45: APACHE IGNITE AS A DATA PROCESSING HUB

OTHER INTEGRATIONS

Page 46: APACHE IGNITE AS A DATA PROCESSING HUB

OTHER COMPLETED INTEGRATIONS

•  CAMEL
•  MQTT
•  STORM
•  FLINK SINK
•  TWITTER

Page 47: APACHE IGNITE AS A DATA PROCESSING HUB

THE END