datageeks
DESCRIPTION
Slides for Munich Datageek Meetup
TRANSCRIPT
![Page 1: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/1.jpg)
User Behaviour Tracking Track - Store - Process
Florian Pfeiffer - Head of Data & Infrastructure - gutefrage.net
![Page 2: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/2.jpg)
![Page 3: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/3.jpg)
Vision „Let’s build our own Google Analytics“
![Page 4: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/4.jpg)
Why
Google Analytics samples the data
we want the (raw) data
![Page 5: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/5.jpg)
Ideas, Thoughts & Goals
fast / minimal impact on page loading time
high availability
track user over multiple platforms
storage engine? -> HBase
![Page 6: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/6.jpg)
Infrastructure
![Page 7: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/7.jpg)
Numbers!
10-20ms Response Time per pixel
record so far: ~2500 concurrent requests
1.5 billion entries in HBase
10 Nodes in Hadoop Cluster
![Page 8: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/8.jpg)
Serving Infrastructure
Loadbalancers & RR DNS
nginx with empty_gif module (~2ms)
data is written to logfile
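A minimal sketch of what one logged pixel request might carry and how it could be parsed downstream. The tab-separated layout (timestamp, client IP, request URI), the `/t.gif` path and the parameter names are assumptions for illustration, not the format used in the talk.

```python
from urllib.parse import urlsplit, parse_qs


def parse_pixel_log_line(line):
    """Parse one log line of the assumed form '<sec.millis>\\t<ip>\\t<request uri>'."""
    ts, remote_addr, request_uri = line.rstrip("\n").split("\t")

    # Convert 'seconds.millis' to integer milliseconds without float rounding.
    seconds, _, millis = ts.partition(".")
    ts_ms = int(seconds) * 1000 + int(millis.ljust(3, "0")[:3])

    # parse_qs returns a list per parameter; keep the first value of each.
    query = parse_qs(urlsplit(request_uri).query)
    event = {name: values[0] for name, values in query.items()}
    event["ts_ms"] = ts_ms
    event["ip"] = remote_addr
    return event


example_line = "1389609600.123\t203.0.113.7\t/t.gif?uid=u42&action=view&url=%2Ffrage%2F123"
event = parse_pixel_log_line(example_line)
```

Because the pixel itself is a static empty GIF, all tracking information travels in the query string and ends up in the log, which keeps the serving path cheap.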
![Page 9: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/9.jpg)
Storing Infrastructure
every nginx node has flume-ng
flume ingests logfile
AsyncHBaseSink with custom Serializer
direct writes to HBase
![Page 10: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/10.jpg)
Why Flume?
we had it already in production ;)
Storm might be an interesting alternative
![Page 11: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/11.jpg)
HBase rowkey design
![Page 12: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/12.jpg)
Why?
You can scan through all data and use filters for selecting specific data
But scanning with start & stop row speeds things up (a lot)
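The difference between the two access patterns can be illustrated with a toy, in-memory stand-in for a sorted table. `SortedTable` and `prefix_stop_row` are hypothetical helpers written for this sketch; they are not HBase client calls.

```python
import bisect


class SortedTable:
    """Toy stand-in for an HBase table: rows are kept sorted by byte rowkey."""

    def __init__(self, rows):
        ordered = sorted(rows.items())
        self.keys = [key for key, _ in ordered]
        self.values = [value for _, value in ordered]

    def full_scan_with_filter(self, predicate):
        # Looks at every row and keeps only those the filter accepts.
        return [(k, v) for k, v in zip(self.keys, self.values) if predicate(k)]

    def range_scan(self, start_row, stop_row):
        # Jumps to the start row by binary search and reads up to the
        # stop row, which is exclusive (as in an HBase scan).
        lo = bisect.bisect_left(self.keys, start_row)
        hi = bisect.bisect_left(self.keys, stop_row)
        return list(zip(self.keys[lo:hi], self.values[lo:hi]))


def prefix_stop_row(prefix):
    # Smallest key that sorts after every key starting with `prefix`
    # (assumes the last byte of the prefix is below 0xff).
    return prefix[:-1] + bytes([prefix[-1] + 1])


table = SortedTable({
    b"u1,0001": "view",
    b"u1,0002": "like",
    b"u2,0001": "view",
    b"u3,0001": "answer",
})
filtered = table.full_scan_with_filter(lambda key: key.startswith(b"u1,"))
ranged = table.range_scan(b"u1,", prefix_stop_row(b"u1,"))
```

Both calls return the same rows, but the range scan only touches the rows between the start and stop key, which is why the rowkey layout matters so much.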
![Page 13: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/13.jpg)
HBase rowkey design
Do I need a fast user or a fast timespan lookup?
User - clientid,ts<,connectionId>
Timespan - ts,clientid<,connectionId>
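A sketch of how the two key layouts could be built. The comma separator is taken from the slide notation; the 13-digit zero padding and the function names are assumptions made here so that byte order matches numeric order. Client ids are assumed to be alphanumeric.

```python
def user_rowkey(client_id, ts_ms, connection_id=None):
    """Key for fast per-user lookups: clientid,ts[,connectionId]."""
    parts = [client_id, f"{ts_ms:013d}"]  # zero-padded so "20" sorts after "3"
    if connection_id is not None:
        parts.append(connection_id)
    return ",".join(parts).encode("utf-8")


def timespan_rowkey(ts_ms, client_id, connection_id=None):
    """Key for fast time-range lookups: ts,clientid[,connectionId]."""
    parts = [f"{ts_ms:013d}", client_id]
    if connection_id is not None:
        parts.append(connection_id)
    return ",".join(parts).encode("utf-8")


observations = [("u2", 5), ("u1", 20), ("u1", 3)]

# Sorted as the store would keep them: all rows of one user are adjacent.
by_user = sorted(user_rowkey(client, ts) for client, ts in observations)

# Sorted as the store would keep them: rows are ordered by time across users.
by_time = sorted(timespan_rowkey(ts, client) for client, ts in observations)
```

With the user layout, one start/stop-row scan returns the full history of a client; with the timespan layout, one scan returns everything that happened in a time window.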
![Page 14: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/14.jpg)
Inverse Timestamps
Data in HBase is stored lexicographically sorted
Normal TS - scan would yield oldest results first
Inverse TS - newer entries come first (and you can cancel the scan if you have enough data)
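A small sketch of the inverse timestamp idea, assuming the common convention of subtracting the timestamp from the largest signed 64-bit value. The helper names and the sample events are hypothetical.

```python
LONG_MAX = 2**63 - 1  # largest signed 64-bit value, 19 decimal digits


def inverse_ts(ts_ms):
    # Larger (newer) timestamps map to smaller values.
    return LONG_MAX - ts_ms


def newest_first_rowkey(client_id, ts_ms):
    # Zero-pad to 19 digits so byte order equals numeric order.
    return f"{client_id},{inverse_ts(ts_ms):019d}".encode("utf-8")


def latest_actions(sorted_rows, client_id, limit):
    """Read one client's rows in key order and stop after `limit` hits."""
    prefix = f"{client_id},".encode("utf-8")
    result = []
    for key, action in sorted_rows:
        if key.startswith(prefix):
            result.append(action)
            if len(result) == limit:
                break  # enough data: cancel the scan early
    return result


rows = sorted(
    (newest_first_rowkey("u1", ts), action)
    for ts, action in [(1000, "view"), (2000, "like"), (3000, "answer")]
)
newest_two = latest_actions(rows, "u1", 2)  # ['answer', 'like']
```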
![Page 15: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/15.jpg)
Cross Domain Tracking
(Flash)Cookies
Fingerprinting
Etag
HTML5 Storage
![Page 16: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/16.jpg)
Cookies
or: the olden times…
Easy to drop a 3rd party cookie with userId on different websites
Increasingly blocked by browsers (Safari, Firefox, ...)
![Page 17: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/17.jpg)
Fingerprinting
Yields interesting results on desktop, difficult on e.g. iPhone
invisible to user
Last resort if everything else fails?
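To illustrate why fingerprinting tends to work better on desktops than on phones, here is a rough sketch that hashes a set of client attributes. The attribute names and values are made up for this example; real fingerprinting uses many more signals.

```python
import hashlib


def fingerprint(attributes):
    """Hash a dict of client attributes into a stable identifier."""
    # Sort the keys so the same attributes always give the same hash.
    canonical = "\n".join(f"{name}={attributes[name]}" for name in sorted(attributes))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two desktops usually differ in some attribute (fonts, plugins, screen size).
desktop_a = {"user_agent": "Firefox on Linux", "screen": "1920x1080", "fonts": "DejaVu,Ubuntu"}
desktop_b = {"user_agent": "Firefox on Linux", "screen": "1680x1050", "fonts": "DejaVu,Liberation"}

# Two phones of the same model and OS version report identical attributes.
phone_a = {"user_agent": "Safari on iPhone", "screen": "320x480", "fonts": "system"}
phone_b = {"user_agent": "Safari on iPhone", "screen": "320x480", "fonts": "system"}

desktops_distinct = fingerprint(desktop_a) != fingerprint(desktop_b)
phones_collide = fingerprint(phone_a) == fingerprint(phone_b)
```

The less a device can be customized, the more users end up sharing one fingerprint, which matches the observation on the slide.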
![Page 18: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/18.jpg)
Etag
Quite new, based on browser cache
sounds interesting
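The mechanism can be sketched as a server-side function: on the first request the server hands out an identifier as the `ETag` of a cacheable resource, and on later requests the browser sends it back in `If-None-Match` when revalidating. `handle_pixel_request` and the header dicts are a hypothetical framing for illustration, not code from the talk.

```python
import uuid


def handle_pixel_request(request_headers):
    """Return (status, response_headers, visitor_id) for one pixel request."""
    etag = request_headers.get("If-None-Match")
    if etag:
        # Returning visitor: the browser revalidates its cached copy and
        # sends back the ETag value it was given earlier.
        visitor_id = etag.strip('"')
        return 304, {"ETag": etag}, visitor_id

    # First visit: mint a new identifier and deliver it as the ETag.
    visitor_id = uuid.uuid4().hex
    response_headers = {
        "ETag": f'"{visitor_id}"',
        # Allow caching, but require revalidation on every use.
        "Cache-Control": "private, no-cache",
    }
    return 200, response_headers, visitor_id


first_status, first_headers, first_id = handle_pixel_request({})
second_status, _, second_id = handle_pixel_request({"If-None-Match": first_headers["ETag"]})
```

The identifier survives as long as the cached entry does, so clearing the browser cache resets it.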
![Page 19: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/19.jpg)
HTML5 Storage
Store data in local HTML5 storage
Retrieve data with Cross Domain Messaging
![Page 20: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/20.jpg)
Store data
e.g. UserId, SessionId, GeoIP, URL, action, data
![Page 21: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/21.jpg)
Batch Processing
Calculate how many users are active on platform A and also on B
Get Traffic of all Questions belonging to Channel X sorted by Country
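Both questions can be sketched over an in-memory list of events. The field names (`user_id`, `platform`, `channel`, `country`) are assumptions, and "sorted by country" is read here as ordering the result by country code. On the real data volume this would run as a batch job on the cluster rather than in a single process.

```python
from collections import Counter


def users_on_both(events, platform_a, platform_b):
    """Users who were active on platform A and also on platform B."""
    users_a = {e["user_id"] for e in events if e["platform"] == platform_a}
    users_b = {e["user_id"] for e in events if e["platform"] == platform_b}
    return users_a & users_b


def traffic_by_country(events, channel):
    """Hit counts for one channel as (country, hits) pairs, ordered by country."""
    counts = Counter(e["country"] for e in events if e["channel"] == channel)
    return sorted(counts.items())


events = [
    {"user_id": "u1", "platform": "A", "channel": "X", "country": "DE"},
    {"user_id": "u1", "platform": "B", "channel": "Y", "country": "DE"},
    {"user_id": "u2", "platform": "A", "channel": "X", "country": "AT"},
    {"user_id": "u3", "platform": "B", "channel": "X", "country": "DE"},
]
overlap = users_on_both(events, "A", "B")
channel_x = traffic_by_country(events, "X")
```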
![Page 22: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/22.jpg)
Now to something completely different…
![Page 23: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/23.jpg)
demo
![Page 24: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/24.jpg)
Recommendations
with Myrrix
![Page 25: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/25.jpg)
Myrrix
Evolution: Taste -> Mahout -> Myrrix (-> Oryx)
Recommender based on ALS
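To give an idea of what alternating least squares (ALS) does, here is a deliberately simplified sketch: one latent factor per user and item, explicit values, L2 regularization, pure Python. This is not how Myrrix implements it (Myrrix works with more factors and implicit feedback); all names and numbers below are illustrative.

```python
def objective(ratings, x, y, reg):
    """Regularized squared error of the rank-1 model r_ui ~ x[u] * y[i]."""
    error = sum((r - x[u] * y[i]) ** 2 for (u, i), r in ratings.items())
    return error + reg * (sum(v * v for v in x) + sum(v * v for v in y))


def als_rank1(ratings, n_users, n_items, reg=0.1, iterations=10):
    """ratings maps (user, item) -> value; returns (user_factors, item_factors)."""
    x = [1.0] * n_users
    y = [1.0] * n_items
    for _ in range(iterations):
        # Fix the item factors and solve each user factor in closed form.
        for user in range(n_users):
            seen = [(i, r) for (u, i), r in ratings.items() if u == user]
            x[user] = sum(r * y[i] for i, r in seen) / (sum(y[i] ** 2 for i, _ in seen) + reg)
        # Fix the user factors and solve each item factor in closed form.
        for item in range(n_items):
            seen = [(u, r) for (u, i), r in ratings.items() if i == item]
            y[item] = sum(r * x[u] for u, r in seen) / (sum(x[u] ** 2 for u, _ in seen) + reg)
    return x, y


# User 0 interacted with items 0 and 1, user 1 only with item 0.
ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0}
loss_before = objective(ratings, [1.0, 1.0], [1.0, 1.0], reg=0.1)
x, y = als_rank1(ratings, n_users=2, n_items=2)
loss_after = objective(ratings, x, y, reg=0.1)

# Predicted strength for the pair the model has not seen: user 1, item 1.
prediction = x[1] * y[1]
```

Each half-step is an ordinary least-squares problem, so the regularized objective never increases from one iteration to the next.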
![Page 26: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/26.jpg)
Recommendations @ GF.net
Users emit signals on questions
view, like, give an answer, answer is voted best
Application sends signals through RabbitMQ to recommendation servers
![Page 27: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/27.jpg)
YEAH
but what happens when a new user signs up?
![Page 28: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/28.jpg)
?
![Page 29: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/29.jpg)
Fetch data from tracking
and feed it into Myrrix
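One way this hand-over could look: collect what the new user did before signing up from the tracking events and turn it into weighted user/item preference lines. The `user,item,strength` CSV layout, the field names and the weights per signal are assumptions for this sketch.

```python
from collections import defaultdict

# Hypothetical weights: stronger signals count more than a plain view.
SIGNAL_WEIGHTS = {"view": 1.0, "like": 2.0, "answer": 3.0, "best_answer": 5.0}


def events_to_preference_lines(events, user_id):
    """Aggregate one user's tracked actions into 'user,item,strength' lines."""
    strength = defaultdict(float)
    for event in events:
        if event["user_id"] == user_id and event["action"] in SIGNAL_WEIGHTS:
            strength[event["question_id"]] += SIGNAL_WEIGHTS[event["action"]]
    return [f"{user_id},{question},{value}" for question, value in sorted(strength.items())]


tracked = [
    {"user_id": "u1", "question_id": "q1", "action": "view"},
    {"user_id": "u1", "question_id": "q1", "action": "like"},
    {"user_id": "u1", "question_id": "q2", "action": "view"},
    {"user_id": "u2", "question_id": "q1", "action": "view"},
]
lines = events_to_preference_lines(tracked, "u1")
```

With such lines available at sign-up time, the recommender does not start from an empty profile for the new user.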
![Page 30: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/30.jpg)
Collecting & storing data works great
using & processing is another thing ;)
![Page 31: Datageeks](https://reader033.vdocuments.net/reader033/viewer/2022051412/54b6cfac4a7959b5318b465a/html5/thumbnails/31.jpg)