Big Data Search and Visualization with Elasticsearch and Kibana
Suiting @ DSC 2015
Who Am I
曾書庭 (@suitingtseng) Data Engineer
Gogolook
Jeff, CEO
As a data engineer in Gogolook…
Suiting, what is our DAU?
Let me calculate...
Is the data ready yet?
Still running...
As a data engineer in Gogolook…
Suiting, what is our DAU?
Can you break it down by country?
Can you break it down by version?
Can you show a whole year?
Can you? Can you? Can you?
The "enrage an engineer in one sentence" contest
• Can you break it down by XX?
• Can you turn it into a chart?
• Can you give me the raw data?
• Is there a way to find out where users live?
• Can we tell which users are richer?
• Do users go to bed later on rainy days?
Table of Contents
• Problems
• Solution Requirements
• Elasticsearch & Kibana
• In Gogolook
• Future
Problems
• Request-response model
• Long innovation cycle
• EAAB (engineer as a bottleneck)
• HDC (HiPPO-driven company: the highest-paid person's opinion decides)
• Lack of speed
• => We are not alone (500px)
https://medium.com/@samson_hu/building-analytics-at-500px-92e9a7005c83
Possible solutions
• Approach 1: SQL monkey* zoo
• Approach 2: Provide limited yet easy visualization
http://www.slideshare.net/GloriaLau1/keynote-at-spark-summit/5
Requirements
• Easy: even the CEO can use it
• Fast: must be interactive
• Export: provides CSV files
• Big: must be scalable
• 80-20: solves 80% of the problems
Elasticsearch
• Lucene-based search engine
• Document storage (JSON)
• Distributed, scalable
• Serves search requests in milliseconds
• Builds an index for every field
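To make "an index for every field" concrete, here is a toy sketch in Python. It is not how Lucene or Elasticsearch is implemented; it only illustrates why a per-field lookup structure lets a query avoid scanning every document. The documents and field names are hypothetical.

```python
from collections import defaultdict


def build_field_indexes(docs):
    """Build one toy inverted index per field: field -> value -> set of doc ids."""
    indexes = defaultdict(lambda: defaultdict(set))
    for doc_id, doc in enumerate(docs):
        for field, value in doc.items():
            indexes[field][value].add(doc_id)
    return indexes


def search(indexes, **criteria):
    """Return the ids of documents that match every field=value pair."""
    postings = [
        indexes.get(field, {}).get(value, set())
        for field, value in criteria.items()
    ]
    return set.intersection(*postings) if postings else set()


# Hypothetical sample documents, not real Gogolook data.
docs = [
    {"userid": "u1", "country": "TW", "version": "4.1"},
    {"userid": "u2", "country": "JP", "version": "4.1"},
    {"userid": "u3", "country": "TW", "version": "4.0"},
]
indexes = build_field_indexes(docs)
print(search(indexes, country="TW", version="4.1"))  # {0}
```

Each criterion becomes a set lookup, and combining criteria is a set intersection, so the cost depends on the number of matches rather than on the total number of documents.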
Kibana
• ES visualization tool
• No code required
ES + Kibana
• Fast: indexes every field
• Fast: columnar storage*
• Big: distributed and scalable from the start
• Easy: GUI, no code required
• Export: CSV
Kibana
• Discover
• Visualization
• Dashboard
Discover
• Raw data
• Check data
• Find dirty data
• Try query
Visualization
• 8 visualization types
• 9 grouping methods
• 9 aggregation values
[Figure: Kibana visualization types]
Grouping methods
• Date histogram
• Histogram
• Range of a value
• Top N
• Filter
Aggregation values
• Count
• Avg, Sum, Min, Max, Standard deviation
• Unique count* (HyperLogLog)
• Percentile* (T-Digest)
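As a rough illustration of the two starred metrics, the sketch below builds an Elasticsearch search body that asks for a `cardinality` aggregation (the approximate unique count) and a `percentiles` aggregation. The field names `userid` and `period` are borrowed from the sample event later in the deck; the aggregation names and the chosen percentiles are assumptions made for this example.

```python
import json


def metric_aggs_body(user_field="userid", duration_field="period"):
    """Build a search body asking for an approximate unique count and percentiles."""
    return {
        "size": 0,  # return aggregation results only, no raw hits
        "aggs": {
            # Approximate distinct count of users.
            "unique_users": {"cardinality": {"field": user_field}},
            # Approximate percentiles of a numeric field.
            "duration_percentiles": {
                "percentiles": {"field": duration_field, "percents": [50, 95, 99]}
            },
        },
    }


body = metric_aggs_body()
print(json.dumps(body, indent=2))
```

Both results are approximations, which is what keeps them fast on large data sets; Kibana builds a request of this kind when a visualization uses Unique count or Percentile.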
Visualization
• Same concept, different graph
• FILTER
• GROUP
• AGGREGATE
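The FILTER, GROUP, AGGREGATE pattern can be sketched in plain Python. This is only a model of the concept, not Kibana or Elasticsearch code, and the events and field names are hypothetical. It computes DAU (unique users per day) split by country.

```python
from collections import defaultdict

# Hypothetical events; field names are assumptions for illustration.
events = [
    {"userid": "u1", "country": "TW", "day": "2015-08-22", "version": "4.1"},
    {"userid": "u1", "country": "TW", "day": "2015-08-22", "version": "4.1"},
    {"userid": "u2", "country": "TW", "day": "2015-08-22", "version": "4.0"},
    {"userid": "u3", "country": "JP", "day": "2015-08-22", "version": "4.1"},
    {"userid": "u1", "country": "TW", "day": "2015-08-23", "version": "4.1"},
]


def filter_group_aggregate(events, keep, group_key, aggregate):
    """FILTER events, GROUP them by a key, then AGGREGATE each group."""
    groups = defaultdict(list)
    for event in events:
        if keep(event):  # FILTER
            groups[group_key(event)].append(event)  # GROUP
    return {key: aggregate(members) for key, members in groups.items()}  # AGGREGATE


def unique_users(members):
    """Count distinct user ids in a group of events."""
    return len({event["userid"] for event in members})


dau_by_country = filter_group_aggregate(
    events,
    keep=lambda e: e["day"] == "2015-08-22",
    group_key=lambda e: e["country"],
    aggregate=unique_users,
)
print(dau_by_country)  # {'TW': 2, 'JP': 1}
```

Swapping `group_key` to `lambda e: e["version"]` gives DAU by version; only the grouping changes, which is the sense in which every chart is the same concept drawn differently.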
DAU
Suiting, what is our DAU?
DAU by region
Can you break it down by country?
DAU by version
Can you break it down by version?
server request log
Request_total per minute
GROUP BY DATE HISTOGRAM(minute) COUNT(*)
Request_total by path
GROUP BY TOP(path, 5), DATE HISTOGRAM(minute) COUNT(*)
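The two pseudo-queries above map fairly directly onto Elasticsearch bucket aggregations. The sketch below builds the request bodies as Python dictionaries. The field names `@timestamp` and `path` are assumptions, and the key that carries the histogram interval differs between Elasticsearch versions (`fixed_interval` in recent releases, `interval` in older ones), so it is left as a parameter here.

```python
def requests_per_minute_body(time_field="@timestamp",
                             interval_key="fixed_interval", interval="1m"):
    """GROUP BY DATE HISTOGRAM(minute) COUNT(*): one bucket per minute."""
    return {
        "size": 0,  # only bucket counts are needed, not raw hits
        "aggs": {
            "per_minute": {
                "date_histogram": {"field": time_field, interval_key: interval}
            }
        },
    }


def requests_by_path_body(path_field="path", top_n=5, **histogram_options):
    """GROUP BY TOP(path, 5), DATE HISTOGRAM(minute) COUNT(*)."""
    per_minute = requests_per_minute_body(**histogram_options)["aggs"]
    return {
        "size": 0,
        "aggs": {
            "top_paths": {
                "terms": {"field": path_field, "size": top_n},
                "aggs": per_minute,  # nest the per-minute histogram in each path bucket
            }
        },
    }


body = requests_by_path_body()
print(body["aggs"]["top_paths"]["terms"])  # {'field': 'path', 'size': 5}
```

COUNT(*) needs no explicit aggregation: every bucket in the response carries a `doc_count`.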
Dashboard
• Collection of visualizations
Community tag in MongoDB
Dashboard - 1st peak
Dashboard - 2nd peak
In Gogolook (Aug. 2015)
• 200M+ data points daily
• 150GB+ of data daily
• 24 dashboards, 160 visualizations
• Service status, e.g. requests_total
• Application data, e.g. tag_total
• Log data, e.g. button_ctr
In Gogolook (currently)
• Log user behavior on features
• Log your own events yourself (Planner/PM)
• Build your own dashboards yourself (everyone)
• Monitor performance from day 1
In Gogolook
• Tracking Kibana usage with Google Analytics
In Gogolook (future)
• Log all user events, not just feature-based ones
• { "userid": "suiting", "@timestamp": "2015-08-23T11:48:00", "page": "login", "button": "register", "period": 3500 }
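One way such events could be shipped is through the Elasticsearch `_bulk` endpoint, which takes newline-delimited JSON: an action line followed by the document itself. The sketch below only builds the payload and does not send it. The index name `user-events` is hypothetical, and older Elasticsearch versions also expect a `_type` in the action line.

```python
import json


def to_bulk_lines(events, index_name="user-events"):
    """Serialize events as newline-delimited JSON for the _bulk endpoint."""
    lines = []
    for event in events:
        lines.append(json.dumps({"index": {"_index": index_name}}))  # action line
        lines.append(json.dumps(event))  # document source
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


# The sample event from the slide.
event = {
    "userid": "suiting",
    "@timestamp": "2015-08-23T11:48:00",
    "page": "login",
    "button": "register",
    "period": 3500,
}
payload = to_bulk_lines([event])
print(payload)
```

Because every event shares the same flat shape, any field (`page`, `button`, `period`) can later be used for filtering, grouping, or aggregating without deciding that per feature in advance.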
In Gogolook (future)
• Answer questions
[Figure: options A and B, labeled 40%, 7000ms and 60%, 1500ms respectively]
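Reading the figure as "what share of users chose each option, and how long did they take", the numbers could be derived from logged events roughly as follows. This interpretation is an assumption, as are the helper name and the sample events, which were picked to reproduce the 40% / 60% split shown on the slide.

```python
from collections import defaultdict


def summarize_choices(events, choice_field="button", duration_field="period"):
    """Return {choice: (share_of_events, mean_duration_ms)}."""
    durations = defaultdict(list)
    for event in events:
        durations[event[choice_field]].append(event[duration_field])
    total = sum(len(values) for values in durations.values())
    return {
        choice: (len(values) / total, sum(values) / len(values))
        for choice, values in durations.items()
    }


# Hypothetical events chosen to reproduce the slide's 40% / 60% split.
events = (
    [{"button": "A", "period": 7000}] * 2
    + [{"button": "B", "period": 1500}] * 3
)
summary = summarize_choices(events)
print(summary)  # {'A': (0.4, 7000.0), 'B': (0.6, 1500.0)}
```

In Kibana the same answer would come from grouping on `button` and aggregating a count plus an average of `period`.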
Limits
• No SQL JOIN
• No subqueries
How about the other 20%?
• A more powerful engine/tool is required
• Compute engines:
  • Google BigQuery
  • AWS Redshift
• Visualization tools:
  • Tableau
  • Periscope
Thank you
Questions?