DECK36 - Log everything! and Realtime Datastream Analytics with Storm
DESCRIPTION
We at DECK36 show how the requirement "Log everything!" can be implemented with Hadoop, EMR and Twitter Storm.
TRANSCRIPT
Dr. Stefan Schadwinkel and Mike Lohmann
Who we are.
Log everything
Mike Lohmann, Architecture
Author (PHPMagazin, IX, heise.de)
Dr. Stefan Schadwinkel, Analytics
Author (heise.de, Cereb.Cortex, EJN, J.Neurophysiol.)
Agenda.
What we did. What we do.
Log everything! - Our way from Requirement to Solution
Infrastructure and technologies: Simple, Scalable, Open Source
Happy business users.
What we did.
Creating & operating education communities
Web applications
Multi-language
Different market rules in different countries
Consolidating the technological basis for multiple (new) products
DECK36 GmbH & Co. KG
DECK36 is a young spin-off from ICANS
7 core engineers with longstanding expertise
(operate, scale, automate, analyze)
Consulting and engineering services for the
etruvian group and external customers
PokerStrategy.com in numbers
6,000,000 registered users
PokerStrategy.com: education since 2005
19 languages
2,800,000 PI/day
700,000 posts/day
7,600,000 requests/day
Moving on…
Build more Education communities like PokerStrategy…
Assume PokerStrategy KPIs(?)
Other Business models
Add mobile and the social web…
Our requirement: Log everything!
Logging Tools / Technologies
Producer
Web/Mobile Apps
JS Frontend
Servers
Databases
04/10/2023
Transport
Now: RabbitMQ + Erlang consumer
OR
Kafka + any other consumer
Was: Flume
Storage
Now: S3 storage + Hadoop with EMR
OR
Any other storage
Was: virtualized in-house Hadoop
Analytics
MapReduce with Hive/Pig
Results in any format: Excel, QlikView, RDBMS, ...
Realtime Datastream Analytics
Storm / Trident
Logging Infrastructure
Producer
Transport
Storage
Analytics
Databases and Server
S3
RabbitMQ
Consumer
Excel, QlikView, Tableau, SASS, ...
Graylog
Zabbix
Apps1-x
Hadoop cluster
RDBMS
Realtime Datastream Analytics (Storm)
Nimbus (Master)
Zookeeper (x3)
Supervisor (x3)
Worker (x3)
NodeJS
Producer
PageController
Monolog-Logger
Shovel
Local RabbitMQ
PageHitEvent
Listener
Processor
Handler
Formatter
PageHit-Event
Logger::log()
LogMessage, JSON
/Home
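The producer flow on this slide (page hit, event, listener, logger with processor, formatter and handler, then a JSON log message) could be sketched roughly as follows. This is a minimal illustration in Python, not the Symfony2/Monolog code used in the project; all class names, function names and field names are hypothetical, and the handler only collects messages in memory where the real setup publishes to the local RabbitMQ.

```python
import json
from datetime import datetime, timezone


def add_host(record):
    """Processor: enrich the record with extra context (placeholder value)."""
    record["context"]["host"] = "web01"  # hypothetical host name
    return record


def format_as_json(record):
    """Formatter: turn the record into a JSON log message."""
    return json.dumps(record, sort_keys=True)


class InMemoryHandler:
    """Handler: stands in for publishing to the local RabbitMQ."""

    def __init__(self):
        self.published = []

    def handle(self, message):
        self.published.append(message)


class Logger:
    """Runs every record through processors, the formatter and the handler."""

    def __init__(self, processors, formatter, handler):
        self.processors = processors
        self.formatter = formatter
        self.handler = handler

    def log(self, event_type, context):
        record = {
            "type": event_type,
            "created": datetime.now(timezone.utc).isoformat(),
            "context": dict(context),
        }
        for processor in self.processors:
            record = processor(record)
        self.handler.handle(self.formatter(record))


def on_page_hit(logger, path):
    """Listener: called when the page controller dispatches a page hit event."""
    logger.log("PageHitEvent", {"path": path})


handler = InMemoryHandler()
logger = Logger([add_host], format_as_json, handler)
on_page_hit(logger, "/Home")
```

After the call, handler.published holds one JSON string describing the hit on /Home. Keeping processor, formatter and handler as separate pieces means each one can be swapped without touching the listener.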
Producer JS (in progress)
JS Client
DataCollector (NodeJS)
Shovel
Local RabbitMQ
Local Storage
Validator
Tracks Event
/Home
TriggerWebSocket
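The validator step in the data collector might look something like the sketch below. The slide does not say which fields a tracked event carries, so the schema here (type, path, timestamp) is purely an assumption for illustration, and the sketch is written in Python rather than NodeJS.

```python
# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"type": str, "path": str, "timestamp": (int, float)}


def validate_event(event):
    """Return a list of problems; an empty list means the event is accepted."""
    if not isinstance(event, dict):
        return ["event must be a JSON object"]
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(event[field], bool) or not isinstance(event[field], expected):
            problems.append(f"wrong type for field: {field}")
    return problems
```

Only events with an empty problem list would be handed on to the local queue; everything else can be dropped or logged separately.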
Producer
LoggingComponent: Provides interfaces, filters and handlers
LoggingBundle: Glues all together with Symfony2
Drupal Logging Module: Using the LoggingComponent
JS Frontend Client: LogClient for Browsers (in progress)
https://github.com/ICANS/IcansLoggingComponent
https://github.com/ICANS/IcansLoggingBundle
https://github.com/ICANS/drupal-logging-module
https://github.com/DECK36/starlog-js-frontend-client
Transport
1st Solution: Flume
+ Part of the Hadoop Ecosystem
+ Flexible Central config, Extensible via Plugins
- Not mature software (flume, flume-ng, plugin interfaces, ..)
- Central config has problems with puppet
2nd Solution: RabbitMQ
+ Local RabbitMQ Cluster
+ Decentralized config (producers & consumers simply connect)
- HDFS Sink not pre-packaged
Storage
1st Solution: Self-hosted Hadoop
- Virtualized Infrastructure makes HDFS redundant
- High costs (cluster always running, admin work)
2nd Solution: Cloud Storage
+ Amazon S3
+ Elastic MapReduce: Hadoop on demand
+ Cost-effective (pay only for what you use)
Compaction
RabbitMQ consumer (Erlang) stores data to cloud
However, this gives us a mixed message stream, while we want the data partitioned like this:
s3://[BUCKET]/icanslog/[WEBSITE]/icans.content/year=2012/month=10/day=01/part-00000.lzo
MapReduce:
Streaming (stdin/stdout to any tool)
Computation (Hive, Pig, Cascalog, etc.)
Amazon Redshift
PostgreSQL-compatible Data Warehouse
Hive Partitioning!
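To illustrate what the compaction has to achieve, here is a small Python sketch that sorts a mixed message stream into Hive-style partition prefixes following the path layout shown above. It is not the Cascalog job itself; the message fields (website, type, timestamp) and both function names are assumptions made for the example.

```python
from datetime import datetime, timezone


def partition_prefix(bucket, website, message_type, timestamp):
    """Build the Hive-style partition prefix for one log message."""
    day = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return (
        f"s3://{bucket}/icanslog/{website}/{message_type}/"
        f"year={day:%Y}/month={day:%m}/day={day:%d}/"
    )


def group_by_partition(messages, bucket):
    """Split a mixed message stream into one list per partition prefix."""
    groups = {}
    for message in messages:
        prefix = partition_prefix(
            bucket, message["website"], message["type"], message["timestamp"]
        )
        groups.setdefault(prefix, []).append(message)
    return groups
```

Because the directory names carry key=value pairs (year=, month=, day=), Hive can map them to partition columns and read only the days a query actually needs.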
Analytics
Cascalog is Clojure, Clojure is Lisp
(?<- (stdout) [?person] (age ?person ?age) … (< ?age 30))
Query operator: ?<-
Cascading output tap: (stdout)
Columns of the dataset generated by the query: [?person]
"Generator": (age ?person ?age)
"Predicate": (< ?age 30)
Use as many generators and predicates as you want; both can be any Clojure function.
Clojure can call anything that is available within a JVM.
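For readers who do not know Clojure, the query reads roughly like the Python sketch below: the generator supplies (person, age) rows, the predicate filters them, and only the person column is returned. The data set is made up, and the sketch covers only the generator and the single predicate visible on the slide, not the elided part of the query.

```python
# Made-up rows standing in for the "age" generator.
AGE = [("alice", 28), ("bob", 33), ("chris", 40), ("david", 25)]


def people_younger_than(age_rows, limit):
    """Return the person column for all rows whose age is below the limit."""
    return [person for person, age in age_rows if age < limit]


print(people_younger_than(AGE, 30))  # prints ['alice', 'david']
```

The difference in Cascalog is that the same declarative query is compiled to MapReduce jobs and runs over the whole data set on the cluster.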
Analytics
• We use Cascalog to preprocess and organize the incoming flow of log messages:
Analytics
Let's run the Cascalog processing on Amazon EMR:
./elastic-mapreduce --create --name "Log Message Compaction" \
  --bootstrap-action s3://[BUCKET]/mapreduce/configure-daemons \
  --num-instances $NUM \
  --slave-instance-type m1.large \
  --master-instance-type m1.large \
  --jar s3://[BUCKET]/mapreduce/compaction/icans-cascalog.jar \
  --step-action TERMINATE_JOB_FLOW \
  --step-name "Cascalog" \
  --main-class icans.cascalogjobs.processing.compaction \
  --args "s3://[BUCKET]/incoming/*/*/*/","s3://[BUCKET]/icanslog","s3://[BUCKET]/icanslog-error"
Analytics
Now we can access the log data within Hive and store the results back in S3:
Analytics
Now, get the stats by executing a query:
We can now simply copy the data from S3 and import it into any local analytical tool:
Excel, Redshift, QlikView, R, etc.
Realtime Datastream Analytics
• Storm: Hadoop for realtime analytics
• Rock solid HA concept
• Highly scalable
• Can:
  • Process streams (and trigger events)
  • Provide DRPC functionality
  • Work on enormous data loads
• Fancy names for modules (spouts/bolts/tuple/topology)
• Easy to use:
  • Small and easy-to-understand API
  • DevMode
• Add new topologies at run time
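The vocabulary above (spout, bolt, tuple, topology) can be illustrated with a toy model in plain Python. This is not the Storm API and it runs in a single process; it only shows how the pieces relate: a spout emits tuples, a bolt processes each tuple, and a topology wires them together. All names and the message fields are hypothetical.

```python
from collections import Counter


def page_hit_spout(messages):
    """Spout: the source of the stream, emits one tuple per log message."""
    for message in messages:
        yield (message["website"], message["path"])


class CountBolt:
    """Bolt: keeps a running count per tuple and emits the updated count."""

    def __init__(self):
        self.counts = Counter()

    def execute(self, tup):
        self.counts[tup] += 1
        return (tup, self.counts[tup])


def run_topology(spout, bolt):
    """Topology: feed every tuple from the spout through the bolt."""
    return [bolt.execute(tup) for tup in spout]
```

In a real Storm cluster the spout would read from the message queue without end, and many bolt instances would run in parallel across the worker processes.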
Realtime Datastream Analytics
Happy business users!
Questions they ask often can be automated (ETL, reports).
New questions can be explored (ad hoc, search).
Insights can be fed back into the system (decisions, WebSockets).
Data-driven applications can be created that are shared by multiple websites
or tailored to individual needs.
Thank you.
Questions
?
Contacts.
Dr. Stefan Schadwinkel
ICANS_StScha
Mike Lohmann
mikelohmann
Tools/Technologies
DECK36 GmbH & Co. KG
Valentinskamp 18
20354 Hamburg
Germany
Phone: +49 40 22 63 82 9-0
Fax: +49 40 38 67 15 92
Web: www.deck36.de