deck36 - log everything! and realtime datastream analytics with storm

Upload: mike-lohmann

Post on 15-Jan-2015

Category:

Technology


DESCRIPTION

We at DECK36 show how the requirement "Log everything!" can be implemented with Hadoop, Amazon EMR, and Twitter Storm.

TRANSCRIPT

Page 1: DECK36 - Log everything! and Realtime Datastream Analytics with Storm

Dr. Stefan Schadwinkel and Mike Lohmann

Page 2: DECK36 - Log everything! and Realtime Datastream Analytics with Storm

Who we are.

Mike Lohmann, Architecture

Author (PHPMagazin, IX, heise.de)

Dr. Stefan Schadwinkel, Analytics

Author (heise.de, Cereb. Cortex, EJN, J. Neurophysiol.)

Page 3: DECK36 - Log everything! and Realtime Datastream Analytics with Storm

Agenda.

What we did. What we do.

Log everything! - Our way from Requirement to Solution

Infrastructure and technologies: Simple, Scalable, Open Source

Happy business users.

Page 4: DECK36 - Log everything! and Realtime Datastream Analytics with Storm

What we did.

Creating & operating education communities

Web applications

Multi-language

Different market rules in different countries

Consolidating the technological basis for multiple (new) products

Page 5: DECK36 - Log everything! and Realtime Datastream Analytics with Storm

DECK36 GmbH & Co. KG

DECK36 is a young spin-off from ICANS

7 core engineers with longstanding expertise (operate, scale, automate, analyze)

Consulting and engineering services for the etruvian group and external customers

Page 6: DECK36 - Log everything! and Realtime Datastream Analytics with Storm

Facts and figures: PokerStrategy.com

PokerStrategy.com: education since 2005

6,000,000 registered users

19 languages

2,800,000 page impressions/day

700,000 posts/day

7,600,000 requests/day

Page 7: DECK36 - Log everything! and Realtime Datastream Analytics with Storm

Moving on…

Build more Education communities like PokerStrategy…

Assume PokerStrategy KPIs(?)

Other Business models

Add mobile and the social web…

Our requirement: Log everything!

Page 8: DECK36 - Log everything! and Realtime Datastream Analytics with Storm

Logging Tools / Technologies

Producer: Web/Mobile Apps, JS Frontend, Servers, Databases

Transport: now RabbitMQ + Erlang consumer, or Kafka + any other consumer; previously Flume

Storage: now S3 storage + Hadoop with EMR, or any other storage; previously virtualized in-house Hadoop

Analytics: MapReduce with Hive/Pig; results in any format (Excel, QlikView, RDBMS, ...)

Realtime Datastream Analytics: Storm / Trident

Page 9: DECK36 - Log everything! and Realtime Datastream Analytics with Storm

Logging Infrastructure

[Diagram: the logging infrastructure in four stages (Producer, Transport, Storage, Analytics). Components shown: Apps 1-x, databases and servers, RabbitMQ, consumer, S3, Hadoop cluster, RDBMS, Graylog, Zabbix, and analytics tools (Excel, QlikView, Tableau, SASS, ...). The Realtime Datastream Analytics (Storm) part shows Nimbus (master), three Zookeeper nodes, three Supervisors with Workers, and NodeJS.]

Page 10: DECK36 - Log everything! and Realtime Datastream Analytics with Storm

Producer

[Diagram: components shown are /Home, PageController, PageHitEvent, Listener, Logger::log(), Monolog logger (Processor, Handler, Formatter), LogMessage (JSON), local RabbitMQ, and Shovel.]
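
The producer components named above (listener, processor, handler, formatter, JSON log message) can be illustrated with a minimal sketch. It uses Python's standard `logging` module as a stand-in for Monolog, so it is an analogy rather than the ICANS implementation. The field names, the logger name and the `StreamHandler` (used here in place of a RabbitMQ handler) are assumptions made for illustration.

```python
import io
import json
import logging


class PageHitFilter(logging.Filter):
    """Plays the role of a processor: enriches each record with extra fields."""

    def filter(self, record):
        # Hypothetical field; a real application would add request context here.
        record.event_type = getattr(record, "event_type", "PageHitEvent")
        return True


class JsonFormatter(logging.Formatter):
    """Plays the role of the formatter: turns a record into one JSON line."""

    def format(self, record):
        return json.dumps({
            "type": record.event_type,
            "message": record.getMessage(),
            "level": record.levelname,
            "path": getattr(record, "path", None),
        }, sort_keys=True)


def build_logger(stream):
    """Wire processor, handler and formatter together."""
    logger = logging.getLogger("pagehit-sketch")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    logger.filters.clear()
    handler = logging.StreamHandler(stream)  # stand-in for a RabbitMQ handler
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.addFilter(PageHitFilter())
    return logger


def log_page_hit(logger, path):
    """What a listener for a page-hit event might do."""
    logger.info("page hit", extra={"path": path})


buffer = io.StringIO()
log_page_hit(build_logger(buffer), "/Home")
message = json.loads(buffer.getvalue())
print(message)
```

In this sketch the application code only raises the event and calls the logger; enrichment, serialization and delivery are separate, replaceable parts.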

Page 11: DECK36 - Log everything! and Realtime Datastream Analytics with Storm

Producer JS (in progress)

[Diagram: components shown are /Home, JS Client, Tracks Event, Local Storage, Trigger, WebSocket, DataCollector (NodeJS), Validator, local RabbitMQ, and Shovel.]

Page 12: DECK36 - Log everything! and Realtime Datastream Analytics with Storm

Producer

LoggingComponent: Provides interfaces, filters and handlers

LoggingBundle: Glues all together with Symfony2

Drupal Logging Module: Using the LoggingComponent

JS Frontend Client: LogClient for Browsers (in progress)

https://github.com/ICANS/IcansLoggingComponent

https://github.com/ICANS/IcansLoggingBundle

https://github.com/ICANS/drupal-logging-module

https://github.com/DECK36/starlog-js-frontend-client

Page 13: DECK36 - Log everything! and Realtime Datastream Analytics with Storm

Transport

1st Solution: Flume

+ Part of the Hadoop Ecosystem

+ Flexible Central config, Extensible via Plugins

- Not mature software (flume, flume-ng, plugin interfaces, ...)

- Central config has problems with Puppet

2nd Solution: RabbitMQ

+ Local RabbitMQ Cluster

+ Decentralized config (producers & consumers simply connect)

- HDFS Sink not pre-packaged

Page 14: DECK36 - Log everything! and Realtime Datastream Analytics with Storm

Storage

1st Solution: Self-hosted Hadoop

- Virtualized Infrastructure makes HDFS redundant

- High costs (cluster always running, admin work)

2nd Solution: Cloud Storage

+ Amazon S3

+ Elastic MapReduce: Hadoop on demand

+ Cost-effective (pay only for what you use)

Page 15: DECK36 - Log everything! and Realtime Datastream Analytics with Storm

Compaction

RabbitMQ consumer (Erlang) stores data to cloud

However, we have a mixed message stream, but want Hive partitioning:

s3://[BUCKET]/icanslog/[WEBSITE]/icans.content/year=2012/month=10/day=01/part-00000.lzo

MapReduce: Streaming (stdin/stdout to any tool); Computation (Hive, Pig, Cascalog, etc.)

Amazon Redshift: PostgreSQL-compatible data warehouse
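
The compaction target can be sketched as a function that maps each message of the mixed stream to its Hive-style partition prefix (year=/month=/day=). This is a simplified, single-process illustration in Python and not the Cascalog job itself. The message fields (`website`, `type`, `ts`), the bucket name and the use of UTC are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_prefix(bucket, website, log_type, timestamp):
    """Build the Hive-style partition prefix for one log message."""
    day = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return (
        f"s3://{bucket}/icanslog/{website}/{log_type}/"
        f"year={day:%Y}/month={day:%m}/day={day:%d}/"
    )


def compact(messages, bucket):
    """Group a mixed message stream by its target partition."""
    partitions = defaultdict(list)
    for msg in messages:
        key = partition_prefix(bucket, msg["website"], msg["type"], msg["ts"])
        partitions[key].append(msg)
    return dict(partitions)


def ts(year, month, day, hour):
    """Helper for the sample data: a UTC epoch timestamp."""
    return datetime(year, month, day, hour, tzinfo=timezone.utc).timestamp()


# Hypothetical sample messages from two websites and two days.
messages = [
    {"website": "site-a", "type": "icans.content", "ts": ts(2012, 10, 1, 8)},
    {"website": "site-a", "type": "icans.content", "ts": ts(2012, 10, 1, 23)},
    {"website": "site-b", "type": "icans.content", "ts": ts(2012, 10, 2, 1)},
]
grouped = compact(messages, "example-bucket")
for prefix in sorted(grouped):
    print(prefix, len(grouped[prefix]))
```

With this layout, Hive can treat `year`, `month` and `day` as partition columns and read only the directories a query needs.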

Page 16: DECK36 - Log everything! and Realtime Datastream Analytics with Storm

Analytics

Cascalog is Clojure, Clojure is Lisp

(?<- (stdout) [?person] (age ?person ?age) … (< ?age 30))

?<- : query operator

(stdout) : Cascading output tap

[?person] : columns of the dataset generated by the query

(age ?person ?age) : "generator"

(< ?age 30) : "predicate"

Generators and predicates: as many as you want; both can be any Clojure function, and Clojure can call anything that is available within a JVM.
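
For readers unfamiliar with Lisp syntax, the query above can be approximated in Python: the generator yields (person, age) rows, the predicate keeps rows with an age below 30, and only the person column is returned. This is a plain in-memory analogy, not Cascalog, and the sample data is hypothetical. The part elided with "…" in the original query is not reproduced.

```python
# Hypothetical sample data standing in for the `age` generator.
AGE = [("alice", 28), ("bob", 33), ("chris", 25)]


def younger_than(rows, limit):
    """Return the person column for all rows whose age is below `limit`."""
    return [person for person, age in rows if age < limit]


result = younger_than(AGE, 30)
print(result)
```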

Page 17: DECK36 - Log everything! and Realtime Datastream Analytics with Storm

Analytics

• We use Cascalog to preprocess and organize that incoming flow of log messages:

Page 18: DECK36 - Log everything! and Realtime Datastream Analytics with Storm

Analytics

Let's run the Cascalog processing on Amazon EMR:

./elastic-mapreduce --create --name "Log Message Compaction" \
  --bootstrap-action s3://[BUCKET]/mapreduce/configure-daemons \
  --num-instances $NUM \
  --slave-instance-type m1.large \
  --master-instance-type m1.large \
  --jar s3://[BUCKET]/mapreduce/compaction/icans-cascalog.jar \
  --step-action TERMINATE_JOB_FLOW \
  --step-name "Cascalog" \
  --main-class icans.cascalogjobs.processing.compaction \
  --args "s3://[BUCKET]/incoming/*/*/*/","s3://[BUCKET]/icanslog","s3://[BUCKET]/icanslog-error"

Page 19: DECK36 - Log everything! and Realtime Datastream Analytics with Storm

Analytics

Now we can access the log data within Hive and store results again to S3:

Page 20: DECK36 - Log everything! and Realtime Datastream Analytics with Storm

Analytics

Now, get the stats by executing a query:

We can now simply copy the data from S3 and import it into any local analytical tool (Excel, Redshift, QlikView, R, etc.).

Page 21: DECK36 - Log everything! and Realtime Datastream Analytics with Storm

Realtime Datastream Analytics

• Storm: Hadoop for realtime analytics

• Rock-solid HA concept

• Highly scalable

• Can: process streams (and trigger events), provide DRPC functionality, work on enormous data loads

• Fancy names for modules (spouts/bolts/tuple/topology)

• Easy to use: small and easy-to-understand API, DevMode

• Add new topologies at run time
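
The module names listed above (spout, bolt, tuple, topology) can be illustrated with a small single-process sketch in Python. It does not use the Storm API, which is JVM-based and runs distributed, so it only mirrors the concepts: a spout emits tuples, bolts transform or aggregate them, and a topology wires them together. All function names and the sample events are hypothetical.

```python
from collections import Counter


def page_hit_spout(events):
    """Spout: emits one (website, path) tuple per incoming event."""
    for event in events:
        yield (event["website"], event["path"])


def filter_bolt(tuples, website):
    """Bolt: passes on only the tuples that belong to one website."""
    for site, path in tuples:
        if site == website:
            yield (site, path)


def count_bolt(tuples):
    """Bolt: counts page hits per path."""
    counts = Counter()
    for _site, path in tuples:
        counts[path] += 1
    return counts


def run_topology(events, website):
    """Topology: spout -> filter bolt -> count bolt."""
    return count_bolt(filter_bolt(page_hit_spout(events), website))


# Hypothetical sample events.
events = [
    {"website": "site-a", "path": "/Home"},
    {"website": "site-b", "path": "/Home"},
    {"website": "site-a", "path": "/Forum"},
    {"website": "site-a", "path": "/Home"},
]
counts = run_topology(events, "site-a")
print(dict(counts))
```

In Storm itself each spout and bolt would run in parallel across worker processes, and the stream would be unbounded rather than a finite list.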

Page 22: DECK36 - Log everything! and Realtime Datastream Analytics with Storm

Realtime Datastream Analytics

Page 23: DECK36 - Log everything! and Realtime Datastream Analytics with Storm

Happy business users!

Questions they ask often can be automated (ETL, reports)

New questions can be explored (ad hoc, search)

Insights can be used as feedback into the system (decisions, WebSockets)

Data-driven applications can be created that can be used by multiple websites, or they can be tailored to individual needs.

Page 24: DECK36 - Log everything! and Realtime Datastream Analytics with Storm

Thank you.

Questions?

Page 25: DECK36 - Log everything! and Realtime Datastream Analytics with Storm

Contacts.

Dr. Stefan Schadwinkel

[email protected]

ICANS_StScha

Mike Lohmann

[email protected]

mikelohmann

Page 26: DECK36 - Log everything! and Realtime Datastream Analytics with Storm

Tools/Technologies

Page 27: DECK36 - Log everything! and Realtime Datastream Analytics with Storm

DECK36 GmbH & Co. KG

Valentinskamp 18

20354 Hamburg

Germany

Phone: +49 40 22 63 82 9-0

Fax: +49 40 38 67 15 92

Web: www.deck36.de