clickstream analysis with spark
TRANSCRIPT
![Page 1: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/1.jpg)
CLICKSTREAM ANALYSIS WITH
SPARK – UNDERSTANDING
VISITORS IN REAL-TIME
Dr. Josef AdersbergerQAware GmbH, Germany
![Page 2: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/2.jpg)
THE CHALLENGE
![Page 3: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/3.jpg)
One Kettle to Rule ‘em All
Web Tracking Ad Tracking
ERPCRM
![Page 4: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/4.jpg)
One Kettle to Rule ‘em All
Retention Reach
Monetarization
steer …
Campaigns
Offers
Contents
![Page 5: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/5.jpg)
![Page 6: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/6.jpg)
THE CONCEPTS
by Randy Paulino
![Page 7: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/7.jpg)
The First Sketch
(= real-time)
Tableau
![Page 8: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/8.jpg)
User Journey Analysis
C V VT VT VT C X
C V
V V V V V V V
C V V C V V V
VT VT V V V VT C
V X
Event stream: User journeys:
Web / Ad tracking
KPIs: Unique users
Conversions
Ad costs / conversion value
…
V
X
VT
C Click
View
View Time
Conversion
![Page 9: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/9.jpg)
THE ARCHITECTURE
![Page 10: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/10.jpg)
„Larry & Friends“ Architecture
Collector AggregationSQL DB
Runs not well for more
than 1 TB data in terms of
ingestion speed, query time
and optimization efforts
![Page 11: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/11.jpg)
by adweek.com
Nope.
Sorry, no Big Data.
![Page 12: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/12.jpg)
„Hadoop & Friends“ Architecture
CollectorBatch Processor
[Hadoop]Collector Event Data Lake Batch Processor Analytics DB
JSON
Stream
Aggregation
takes too long
Cumbersome
programming model
(can be solved with
pig, cascading et al.)
Not
interactive
enough
![Page 13: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/13.jpg)
Nope.
Too sluggish.
![Page 14: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/14.jpg)
κ-Architecture
Collector Stream ProcessorAnalytics DB
Persistence
JSON
Stream
Cumbersome
programming model
Over-engineered: We only need
15min real-time ;-)
Stateful aggregations (unique x,
conversions) require a separate DB
with high throughput and fast
aggregations & lookups.
![Page 15: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/15.jpg)
λ-Architecture
Collector
Event Processor
Event Data Lake Batch Processor
Analytics DB
Speed Layer
Batch Layer
JSON
Stream
Cumbersome
programming modelComplex
architecture
Redundant
logic
![Page 16: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/16.jpg)
Feels Over-Engineered…
http://www.brainlazy.com/article/random-nonsense/over-engineered
![Page 17: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/17.jpg)
The Final Architecture**) Maybe called μ-architecture one day ;-)
![Page 18: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/18.jpg)
Functional Architecture
Strange Events
IngestionRaw Event
StreamCollection Events Processing Analytics
Warehouse
Fact
Entries
Atomic Event
Frames
Data Lake
Master Data Integration
Buffers load peeks
Ensures message
delivery (fire & forget
for client)
Create user journeys and
unique user sets
Enrich dimensions
Aggregate events to KPIs
Ability to replay for schema
evolution
The representation of truth
Multidimensional data
model
Interactive queries for
actions in realtime and
data exploration
Eternal memory for all
events (even strange
ones)
One schema per event
type. Time partitioned.
class Analytics Model
«fact»
WebFact
«dimension»
Zeit
«dimension»
Kampagne
Jahr
Quartal
Monat
Woche
Tag
Stunde
Minute
Kunde
+ Land: String
Partner
«dimension»
Tracking
Tracking Group
SensorTag
+ Typ: SensorTagType
Platzierung
+ Format: ImageSize
+ Kostenmodell: KostenmodellArt
Werbemittel
+ AdGroup: String
+ Format: ImageSize
+ Größe: KiloBytes
+ LandingPage: URL
+ Motif: URL
Kampagne
«dimension»
ClientKategorie
Dev ice
+ Bezeichner: String
+ Hersteller: String
+ Typ: String
Browser
+ Typ: String
+ Version: int
«dimension»
AusspielortLandRegion
Stadt
«dimension»
Kanal
Kanal
«dimension»
Vermarktung
«enumeration»
SensorTagType
ORDER_TAG
MASTER_TAG
CUSTOM_TAG
Betriebssystem
+ Typ: String
+ Version: Version
⦁ Dimension: Unabhängiges Prädikat auf Metriken bei der Analyse ("kann isoliert darüber nachdenken / isoliert dazu Analysen fahren")
⦁ Hierarchie: Sub-Prädikat auf Metriken. Erzeugt mehr als eine (zueinander diskunkte) Teilmengen der Metriken. Entspricht den gängigen Drill-Down-Pfaden in den Reports bzw. den Batch-Aggregate-Up-Pfaden in der Aggregationslogik. Semantische Unterstrukturen: "ist Teil von & kann nicht existieren ohne".
⦁ Asssoziation: Nicht verwendet. Separates Stammdatenmodell.
⦁ Attribut: Ermöglicht eine weitere (querschneidende) Einschränkung der Metrikmenge ergänzend zu den Hierarchien.
Domain
Website
Tracking Site
Vermarkter
Auslieferungs-
Domain
Referral
«enumerati...
KostenmodellArt
CPC
CPM
CPO
CPA
«abstract»
DimensionValue
+ id: int
+ name: String
+ sourceId: String
WebsiteFact
+ Bounces: int
+ Verweildauer: float
+ Visits: int
BasicAdFact
+ Clicks: int
+ Sichtbare Views: int
+ Validierte Clicks: int
+ View (angefragt): int
+ View (ausgeliefert): int
+ View (gemessen): int
«dimension»
Produkt
Shop
Produkt
+ Produktkategorie: String
«dimension»
Zeitfenster
Letzte X Tage
«dimension»
UserUser Segment
«dimension»
Order
OrderStatus
+ Status: OrderStatus
«enumeration»
OrderStatus
IN_BEARBEITUNG
ERFOLGREICH (AKTIVIERT)
ABGELEHNT
NICHT_IN_BEARBEITUNG
UniquesFact
+ Unique Clicks: int
+ Unique Users: int
+ Unique Views: int
AdCostFact
+ CPC: int
+ Kosten: float
Conv ersionFact
+ PC: int
+ PR: int
+ PV: int
+ Umsatz PC: float
+ Umsatz PR: float
+ Umsatz PV: float
AdVisibilityFact
+ Sichtbarkeitsdauer: float
Activ atedOrderFact
+ Orders: int
+ Umsatz: float
TrackingFact
+ Orders: int
+ Page Impressions: int
+ Umsatz: float
X = {7, 14, 28, 30}
Fault tolerant message handling
Event handling: Apply schema, time-partitioning, De-dup, sanity
checks, pre-aggregation, filtering, fraud detection
Tolerates delayed events
High throughput, moderate latency (~ 1min)
![Page 19: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/19.jpg)
Series Connection of Streaming
and Batching - all based on Spark.
IngestionRaw Event
StreamCollection
Event Data Lake Processing Analytics
Warehouse
Fact
Entries
SQL Interface
Atomic Event
Frames
Cool programming model
Uniform dev&ops
Simple solution
High compression ratio due to
column-oriented storage
High scan speed
Cool programming model
Uniform dev&ops
High performance
Interface to R out-of-the-box
Useful libs: MLlib, GraphX, NLP, …
Good connectivity (JDBC,
ODBC, …)
Interactive queries
Uniform ops
Can easily be replaced
due to Hive Metastore
Obvious choice for
cloud-scale messaging
Way the best throughput
and scalability of all
evaluated alternatives
![Page 20: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/20.jpg)
LESSONS LEARNED
by: http://hochmeister-alpin.at
![Page 21: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/21.jpg)
Technology Mapping
https://github.com/qaware/big-data-landscape
User Interface
Data Lake
Data Warehouse
Ingestion
Processing
Data Science Interactive Analysis Reporting & Dashboards
Data Sources
Analytics
Micro Analytics Services instead of reporting
servers.
Charting Libraries:Dashboards:
Analytics Frontends
Algorithm Libraries
Structured Data Lake: The eternal memory.
Efficient data serialization formats:
Integated compression
Column-oriented storage
Predicate pushdown
Distributed Filesystem or NoSQL DB
Data Workflows ETL Jobs Massive
Parallelization
PigOpen Studio
Data Logicstics Stream
Processing
NewSQL: SQL meets NoSQL.
Polyglott Persistence
Index Machines: Fast aggregation and search.
In-Memory Databases: Fast access.
Time Series Databases
Atlas
![Page 22: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/22.jpg)
Polyglott Analytics
Data LakeAnalytics
Warehouse
SQL
lane
R
lane
Timeseries
lane
Reporting Data ExplorationData Science
![Page 23: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/23.jpg)
Micro Analytics Services
Microservice
Dashboard
Microservice …
![Page 24: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/24.jpg)
No Retention Paranoia
Data Lake
Analytics
Warehouse
Eternal memory
Close to raw events
Allows replays and refills
into warehouse
Aggressive forgetting with clearly defined
retention policy per aggregation level like:
15min:30d
1h:4m
…
Events
Strange Events
![Page 25: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/25.jpg)
Continuous Tuning
IngestionRaw Event
StreamCollection
Event Data Lake Processing Analytics
Warehouse
Fact
Entries
SQL Interface
Atomic Event
Frames
Load
Generator Throughput & latency probes
![Page 26: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/26.jpg)
In Numbers
Overall dev effort until the first release: 250 person days
Dimensions: 10 KPIs: 26
Integrated 3rd party systems: 7
Inbound data volume per day: 80GB
New data in DWH per day: 2GB
Total price of cheapest cluster which is able to handle production load:
![Page 27: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/27.jpg)
![Page 29: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/29.jpg)
Bonus Topic: Roadmap
SparkKafka HDFS
IaaS
Simplify Ops with Mesos
Faster aggregation & easier updates
with Spark-on-Solr
http://qaware.blogspot.de/2015/06/solr-with-sparks-or-how-to-submit-spark.html
![Page 30: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/30.jpg)
Bonus Topic: Smart Aggregation
IngestionEvent Data Lake Processing Analytics
Warehouse
Fact
Entries
Analytics
Atomic Event
Frames
1 2
3
![Page 31: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/31.jpg)
Architecture follows requirementsclass Analytics Model
«fact»
WebFact
«dimension»
Zeit
«dimension»
Kampagne
Jahr
Quartal
Monat
Woche
Tag
Stunde
Minute
Kunde
+ Land: String
Partner
«dimension»
Tracking
Tracking Group
SensorTag
+ Typ: SensorTagType
Platzierung
+ Format: ImageSize
+ Kostenmodell: KostenmodellArt
Werbemittel
+ AdGroup: String
+ Format: ImageSize
+ Größe: KiloBytes
+ LandingPage: URL
+ Motif: URL
Kampagne
«dimension»
ClientKategorie
Dev ice
+ Bezeichner: String
+ Hersteller: String
+ Typ: String
Browser
+ Typ: String
+ Version: int
«dimension»
AusspielortLandRegion
Stadt
«dimension»
Kanal
Kanal
«dimension»
Vermarktung
«enumeration»
SensorTagType
ORDER_TAG
MASTER_TAG
CUSTOM_TAG
Betriebssystem
+ Typ: String
+ Version: Version
⦁ Dimension: Unabhängiges Prädikat auf Metriken bei der Analyse ("kann isoliert darüber nachdenken / isoliert dazu Analysen fahren")
⦁ Hierarchie: Sub-Prädikat auf Metriken. Erzeugt mehr als eine (zueinander diskunkte) Teilmengen der Metriken. Entspricht den gängigen Drill-Down-Pfaden in den Reports bzw. den Batch-Aggregate-Up-Pfaden in der Aggregationslogik. Semantische Unterstrukturen: "ist Teil von & kann nicht existieren ohne".
⦁ Asssoziation: Nicht verwendet. Separates Stammdatenmodell.
⦁ Attribut: Ermöglicht eine weitere (querschneidende) Einschränkung der Metrikmenge ergänzend zu den Hierarchien.
Domain
Website
Tracking Site
Vermarkter
Auslieferungs-
Domain
Referral
«enumerati...
KostenmodellArt
CPC
CPM
CPO
CPA
«abstract»
DimensionValue
+ id: int
+ name: String
+ sourceId: String
WebsiteFact
+ Bounces: int
+ Verweildauer: float
+ Visits: int
BasicAdFact
+ Clicks: int
+ Sichtbare Views: int
+ Validierte Clicks: int
+ View (angefragt): int
+ View (ausgeliefert): int
+ View (gemessen): int
«dimension»
Produkt
Shop
Produkt
+ Produktkategorie: String
«dimension»
Zeitfenster
Letzte X Tage
«dimension»
UserUser Segment
«dimension»
Order
OrderStatus
+ Status: OrderStatus
«enumeration»
OrderStatus
IN_BEARBEITUNG
ERFOLGREICH (AKTIVIERT)
ABGELEHNT
NICHT_IN_BEARBEITUNG
UniquesFact
+ Unique Clicks: int
+ Unique Users: int
+ Unique Views: int
AdCostFact
+ CPC: int
+ Kosten: float
Conv ersionFact
+ PC: int
+ PR: int
+ PV: int
+ Umsatz PC: float
+ Umsatz PR: float
+ Umsatz PV: float
AdVisibilityFact
+ Sichtbarkeitsdauer: float
Activ atedOrderFact
+ Orders: int
+ Umsatz: float
TrackingFact
+ Orders: int
+ Page Impressions: int
+ Umsatz: float
X = {7, 14, 28, 30}
act Processing
Pro
ce
ss
ing
We
bs
ite
Ord
er
Ac
tiv
ati
on
Ad
Co
st
Co
nv
ers
ion
Un
iqu
es
+
Ov
erl
ap
Ad
Vis
ibil
ity
Ba
sic
Ad
sT
rac
kin
g
Dimensionsv erzeichnis
aktualisieren
Page
Impressions
und Orders
zählen
Aggregation
TrackingFact
Ad
Impressions
zählenAggregation
BasicAdFact
CTR und
Sichtbarkeitsrate
berechnen
Inferenz der
Sichtbarkeitsdauer
Aggregation
AdVisibilityFact
Dimensionsraum
aufspannen
Pro Vektor die Ev ents und
die dort enthaltenen UserIds
und Interaktionsarten
ermitteln
Menge der UserIds pro
Interaktionsart erzeugen und deren
Mächtigkeit bestimmen
UniquesFact
User Journeys
erstellen (bzw.
LV/LC pro User)
Attribution v on Conv ersion
auf Basis v orkonfiguriertem
Conv ersion Modell
Conv ersions
zählen (PV, PC, PR)
Warenkorbwert ermitteln
bei Order-Conv ersion
(PV, PC, PR)
Aggregation
Conv ersionFact
Kosten pro
Ev ent
ermitteln Aggregation
Nicht zuweisbare
Kosten ermitteln
CostFact
Inferenz der Visits
Visits zählen
Bounces zählen
Analyse der Verweildauer
Aggregation
WebsiteFact
Order Status
ermitteln
Anzahl der nicht getrackten Orders ermitteln
AggregationActiv atedOrderFact
:Analytics Warehouse:Event Data Lake
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
Data Processing WorkflowMultidimensional Data Model
![Page 32: Clickstream Analysis with Spark](https://reader031.vdocuments.net/reader031/viewer/2022022203/5871c6b31a28ab55058b82c7/html5/thumbnails/32.jpg)
Sample results
Geolocated and gender-
specific conversions.
Frequency of visits
Performance of an ad campaign