TRANSCRIPT
All The Things We Didn’t Do
Kresten Krab Thorup, Humio CTO
A Tale in Three Parts
• Part 1: About Logging and Metrics Tools
• Part 2: Product Team Practices
• Part 3: Careful Engineering — Data Processing Engine
Log Analytics — And Why You Should Care
Part 1
Record Logs, Monitor & Respond
[Diagram: a Log Aggregation & Analytics Engine ingests the logs and feeds both Metrics/Monitoring (dashboards, alerts) and Incident Response (log search, drill-down).]
[Diagram: four dimensions in tooling. Logs vs. Metrics, Historic vs. Real-Time, Cloud vs. On-Prem, Schema vs. Ad-Hoc.]
Logs vs. Metrics
• Logs are events — metrics are aggregates of events
• Logs have high dimensionality — metrics have low dimensionality
• Logs tend to be unstructured — metrics are structured
• Logs support drill-down and analysis — metrics lean towards dashboards and alerting
• Logs vary in volume — metrics have a fixed volume rate
• Logs tend to be high volume — metrics tend to be low volume
Historic vs. Real-Time
• Real-time processing lets you generate alerts and dashboards
• Historic processing is great for incident response and audits
• Real-time addresses known issues to look out for
• Historic searches let you look for unknown issues
• Real-time needs only CPU processing
• Historic data may require a lot of disk storage
Cloud vs. On-Premises
• Cloud-based systems may have privacy and security concerns
• On-premises deployment is often required in health-care and banking applications
• With cloud systems you can pay as you go
• On-prem systems require dedicated hardware
• With a cloud solution you don’t need to manage it
• An on-prem solution requires you to consider ease of operations
Schema vs. Ad-Hoc Search
• Schema-based systems address known issues to look out for
• With ad-hoc searching, you can dig into new, unknown issues
• Setting up schemas is often a job for the DBA or administrator
• Everyone can use free-text search and learn things about the system
• Schema ≠ index, but they often go hand in hand
• Keeping indexes around increases disk-storage requirements
• Lack of indexes slows down searching
[Chart: effort per query vs. effort per insert]
Log Analytics Sweet Spot
• Record everything - TBs of data per day
• Generate metrics from the logs in real time
• Interactive/ad-hoc search on historic data - 100s of TB
• Can be installed on-premises (privacy / security)
• Affordable - TCO (hardware, license, operations)
Record Events, Monitor & Respond
[Diagram: Humio sits between the recorded events and two consumers: Metrics/Monitoring (dashboards, alerts) and Incident Response (log search, drill-down).]
Humio — Product Team Practices
Part 2
Be The Customer
• Design target was an on-premises solution
• Co-locate with first customer
• Provide a hosted service ⇒ “eat our own dog food”
Safe Environment
• “It takes all kinds”
• Be open about strengths and weaknesses
• Be open to learn (and teach) new practices
• Experienced team initially to set practices and culture
Be in doubt!
• Discuss trade-offs — not do’s and don’ts
• Leave time to wonder
• No one knows “what’s best”
High BUS factor
• We depend on people. Period.
• Don’t try to make them replaceable
• Everyone is responsible
Choosing Scala
• I ❤ Erlang
• Knowing what Erlang can do for you, coordination code is painful to write and manage in Scala (thread pools, futures, async); see the sketch below.
• Use “scala, the good parts”.
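The kind of coordination plumbing that bullet refers to looks roughly like this in Scala. A minimal, hypothetical sketch (not Humio code), assuming a query fanned out over a few shards and combined with Futures on an explicit thread pool; in Erlang the runtime’s processes and mailboxes give you this wiring for free:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}

object CoordinationSketch {
  // Explicit thread pool and ExecutionContext that you manage yourself.
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

  // Hypothetical shard query: returns a partial result asynchronously.
  def queryShard(id: Int): Future[Long] = Future { id.toLong /* ... real work ... */ }

  def main(args: Array[String]): Unit = {
    // Fan out to four shards, collect partial results, sum them, handle failure.
    val total: Future[Long] = Future.sequence((1 to 4).map(queryShard)).map(_.sum)
    total.onComplete {
      case Success(sum) => println(s"total = $sum")
      case Failure(err) => println(s"query failed: $err")
    }
    Thread.sleep(100) // let the callback run before the JVM exits (sketch only)
  }
}
```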
Choosing Elm
• Elm is similar to React — functional JavaScript — but with proper syntax and static type checking.
• Tooling and libraries are less mature.
• Takes time for new devs to learn
• Upside is that it is “cool” — we give talks and contribute to the community.
Take small steps — but look up!
• Running a SaaS with frequent deployments teaches you to take small steps.
• Define design goals and discuss trade-offs. Keep those in mind and work towards them.
• Avoid long-running side-projects. Feature-flag new work.
Manage critical dependencies
• Own all critical components
• It is tempting (and easy) to pull in 200+ Apache libraries
• We use docker for delivery (reduce customer’s deps)
• Two outside dependencies: HighCharts and Kafka
Don’t waste hardware
“The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry.”
—Henry Petroski
Humio — Data Processing Engine
Part 3
Record Events, Monitor & Respond
[Diagram: Humio sits between the recorded events and two consumers: Metrics/Monitoring (dashboards, alerts) and Incident Response (log search, drill-down), as in Part 1.]
State Machine
[Diagram: events stream from the Event Store through the query /error/i | count() into a state machine whose state is the running result (count: 473 … count: 243,565).]
Query Language State Machine
• A query has the form filter … | aggregate()
• Each event is a Map[String,String]
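As a concrete illustration of that pipeline, here is a minimal sketch (hypothetical code, not Humio’s engine) of how a query like /error/i | count() can be read as a filter over Map[String,String] events feeding a tiny state machine; the field name "@rawstring" is an assumption:

```scala
// Hypothetical sketch: /error/i | count() = regex filter + running-count state machine.
object QuerySketch {
  type Event = Map[String, String]

  // Filter step: case-insensitive regex over the raw log line (field name assumed).
  val errorFilter: Event => Boolean = { e =>
    e.get("@rawstring").exists(s => "(?i)error".r.findFirstIn(s).isDefined)
  }

  // Aggregate step: the state is simply the running count.
  final case class CountState(n: Long) {
    def step(e: Event): CountState = CountState(n + 1)
  }

  def run(events: Iterator[Event]): Long =
    events.filter(errorFilter).foldLeft(CountState(0))(_ step _).n

  def main(args: Array[String]): Unit = {
    val events = Iterator(
      Map("@rawstring" -> "2019-01-01T10:00:00 ERROR db timeout"),
      Map("@rawstring" -> "2019-01-01T10:00:01 INFO request ok"),
      Map("@rawstring" -> "2019-01-01T10:00:02 error retrying")
    )
    println(run(events)) // 2
  }
}
```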
Aggregates State Machine

Function | State     | Step                      | Merge                  | Result
count    | N         | N+1                       | N1+N2                  | N
avg      | (N, s)    | (N+1, s+value)            | (N1+N2, s1+s2)         | s/N
stddev   | (N, s, q) | (N+1, s+value, q+value²)  | (N1+N2, s1+s2, q1+q2)  | √(N*q − s²)/N

GroupBy(host, function=count()):
• State: Map[String, State2]
• Step(G, e): key = e[“host”]; map[key] = Step2(map[key], e)
• Merge(G1, G2): ∀key in G1, G2 => result[key] = Merge2(G1[key], G2[key])
• Result(G): ∀key in G => result[key] = Result2(G[key])
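The table above is essentially an interface: a state, a per-event step, a merge of two partial states, and a final result. A minimal sketch under that reading (the trait, “empty”, and the class names are assumptions; step/merge/result mirror the table’s columns). Merge is what lets partial states computed over different data segments be combined later:

```scala
// One aggregate = one small state machine: empty state, step per event,
// merge of two partial states, and a final result.
trait Aggregate[S, R] {
  def empty: S
  def step(s: S, e: Map[String, String]): S
  def merge(a: S, b: S): S
  def result(s: S): R
}

// count: state N, step N+1, merge N1+N2, result N
object CountAgg extends Aggregate[Long, Long] {
  def empty = 0L
  def step(s: Long, e: Map[String, String]) = s + 1
  def merge(a: Long, b: Long) = a + b
  def result(s: Long) = s
}

// avg over a numeric field: state (N, sum), result sum/N
final class AvgAgg(field: String) extends Aggregate[(Long, Double), Double] {
  def empty = (0L, 0.0)
  def step(s: (Long, Double), e: Map[String, String]) =
    e.get(field).flatMap(_.toDoubleOption) match {
      case Some(value) => (s._1 + 1, s._2 + value)
      case None        => s
    }
  def merge(a: (Long, Double), b: (Long, Double)) = (a._1 + b._1, a._2 + b._2)
  def result(s: (Long, Double)) = if (s._1 == 0) 0.0 else s._2 / s._1
}

// groupBy(key, inner): state is a map from key value to the inner aggregate's state
final class GroupByAgg[S, R](key: String, inner: Aggregate[S, R])
    extends Aggregate[Map[String, S], Map[String, R]] {
  def empty = Map.empty[String, S]
  def step(g: Map[String, S], e: Map[String, String]) =
    e.get(key) match {
      case Some(k) => g.updated(k, inner.step(g.getOrElse(k, inner.empty), e))
      case None    => g
    }
  def merge(a: Map[String, S], b: Map[String, S]) =
    (a.keySet ++ b.keySet).iterator.map { k =>
      k -> inner.merge(a.getOrElse(k, inner.empty), b.getOrElse(k, inner.empty))
    }.toMap
  def result(g: Map[String, S]) = g.map { case (k, s) => k -> inner.result(s) }
}
```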
[Diagram: events on a timeline are grouped into fixed-size time buckets, each bucket holding its own aggregate state.]
Time Boxing: groupby( time − time % bucket_size )
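A worked example of the time-boxing formula; a groupBy on the bucket value then gives one aggregate state per time box (a sketch, not Humio’s implementation):

```scala
// Round a timestamp down to the start of its bucket:
//   bucket = time - time % bucketSize
def bucketOf(timeMillis: Long, bucketSizeMillis: Long): Long =
  timeMillis - timeMillis % bucketSizeMillis

// With 60,000 ms buckets, 10:07:44 and 10:07:03 both map to the 10:07:00 bucket.
```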
Event Store Design
• Build a minimal index and compress the data ⇒ store an order of magnitude more events
• Fast “grep” for filtering events ⇒ filtering plus time/metadata selection reduces the problem space
Event Store
[Diagram: the event store is a sequence of segments; each ~10GB run of raw events is compressed into a ~1GB segment, and only (start-time, end-time, metadata) is kept per segment as index.]
• 1 month at 30GB/day ingest ⇒ ~90GB of data on disk, <1MB of index
• 1 month at 1TB/day ingest ⇒ ~4TB of data on disk, <1MB of index
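A sketch of what that minimal per-segment index could look like (hypothetical names; the real on-disk format is Humio’s own), which also makes the <1MB figure plausible:

```scala
// Per-segment metadata is all that is indexed: a time range, tags, and a file location.
final case class SegmentMeta(
  startTime: Long,          // epoch millis of the first event in the segment
  endTime: Long,            // epoch millis of the last event
  tags: Set[String],        // e.g. Set("#ds1", "#web")
  path: java.nio.file.Path  // the ~1GB compressed segment file
)

// Rough size check: at 1TB/day and ~10GB of raw events per segment that is ~100 segments
// per day, ~3,000 per month; a few hundred bytes of metadata each stays well under 1MB.
```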
Query
[Diagram: a query first selects only the segments whose time range and metadata tags match (e.g. #ds1, #web / #ds1, #app / #ds2, #web); the selected ~1GB compressed segments are then filtered in parallel, and the filtered events (on the order of 10GB of raw data) feed the aggregate state machine.]
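Putting the pieces together, a hedged sketch of that query flow, reusing the hypothetical SegmentMeta and Aggregate definitions from the earlier sketches:

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// 1) prune segments by time range and tags, 2) scan the survivors in parallel,
// 3) merge the partial aggregate states and produce the final result.
def runQuery[S, R](
    segments: Seq[SegmentMeta],
    queryStart: Long,
    queryEnd: Long,
    requiredTags: Set[String],
    agg: Aggregate[S, R],
    scanSegment: SegmentMeta => S        // decompress + filter + step over one segment
)(implicit ec: ExecutionContext): R = {
  val candidates = segments.filter { m =>
    m.endTime >= queryStart && m.startTime <= queryEnd && requiredTags.subsetOf(m.tags)
  }
  val partials = Future.traverse(candidates)(m => Future(scanSegment(m)))
  val merged   = Await.result(partials, Duration.Inf).foldLeft(agg.empty)(agg.merge)
  agg.result(merged)
}
```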
Brute Force: Grep at 30x
• Streaming disk access, use async file I/O
• Compress data at rest (and in the OS-level cache)
• Run one JVM per NUMA node
• Critical search code is sticky: one thread per core
• Reduce context switching (explicit scheduling)
• Localize data access (each core works on 64k chunks)
Go and find videos and blog posts about “Mechanical Sympathy” (Martin Thompson, LMAX) and “Why KDB+ is fast”.
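A minimal sketch of the 64k-chunk idea from the list above, on a plain fixed thread pool. True core pinning and NUMA placement are outside standard JVM APIs (one reason for running one JVM per NUMA node), and matches straddling chunk boundaries are ignored here for simplicity:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Count matches in a decompressed segment by handing out fixed 64k chunks to a
// pool with one thread per core, so each worker stays on small, cache-friendly data.
def scanChunks(segment: Array[Byte], countMatches: Array[Byte] => Long): Long = {
  val cores = Runtime.getRuntime.availableProcessors()
  val pool  = Executors.newFixedThreadPool(cores)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try {
    val chunks  = segment.grouped(64 * 1024).toSeq
    val partial = Future.traverse(chunks)(c => Future(countMatches(c)))
    Await.result(partial, Duration.Inf).sum
  } finally pool.shutdown()
}
```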
Event Processing + Brute-Force Search
Event processing:
• “Materialized views” for relevant metrics
• Processed when data is in-memory anyway
• Fast response times for “known” queries
Brute-force search:
• Shift CPU load to query time
• Data compression
• Allows ad-hoc queries
• Requires “full stack” ownership
[Chart: effort per query vs. effort per insert]
Log Analytics
• Part 1: Logging / Metrics Landscape
• Part 2: Product Team Practices & User Engagement
• Part 3: Careful Engineering
Thanks for your time.
Kresten Krab Thorup, Humio CTO