TRANSCRIPT
All The Things We Didn’t Do
Kresten Krab Thorup, Humio CTO
A Tale in Three Parts
• Part 1: About Logging and Metrics Tools
• Part 2: Product Team Practices
• Part 3: Careful Engineering — Data Processing Engine
Log Analytics — And Why You Should Care
Part 1
Record Logs, Monitor & Respond
[Diagram: a Log Aggregation & Analytics Engine ingests the logs and feeds both Metrics/Monitoring (dashboards, alerts) and Incident Response (log search, drill-down).]
[Diagram: four dimensions in tooling. Logs vs. Metrics, Historic vs. Real-Time, Cloud vs. On-Prem, Schema vs. Ad-Hoc.]
Logs vs. Metrics
• Logs are events — metrics are aggregates of events
• Logs have high dimensionality — metrics have low dimensionality
• Logs tend to be unstructured — metrics are structured
• Logs support drill-down and analysis — metrics lean towards dashboards and alerting
• Logs vary in volume — metrics have a fixed volume rate
• Logs tend to be high volume — metrics tend to be low volume
Historic vs. Real-Time
• Real-time processing lets you generate alerts and dashboards
• Historic processing is great for incident response and audits
• Real-time addresses known issues to look out for
• Historic searches let you look for unknown issues
• Real-time needs only CPU processing
• Historic data may require a lot of disk storage
Cloud vs. On-Premises
• Cloud-based systems may have privacy and security concerns
• On-premises deployment is often required in health-care and banking applications
• With cloud systems you can pay as you go
• On-prem systems require dedicated hardware
• With a cloud solution you don’t need to manage it
• An on-prem solution requires you to consider ease of operations
Schema vs. Ad-Hoc Search
• Schema-based systems address known issues to look out for
• With ad-hoc searching, you can dig into new, unknown issues
• Setting up schemas is often a job for the DBA or administrator
• Everyone can use free-text search and learn things about the system
• Schema ≠ index, but they often go hand in hand
• Keeping indexes around increases disk-storage requirements
• Lack of indexes slows down searching
[Chart: effort per query vs. effort per insert]
Log Analytics Sweet Spot
• Record everything - TBs of data per day
• Generate metrics from the logs in real time
• Interactive/ad-hoc search on historic data - 100s of TB
• Can be installed on-premises (privacy / security)
• Affordable - TCO (hardware, license, operations)
Record Events, Monitor & Respond
[Diagram: Humio sits between the recorded events and two consumers: Metrics/Monitoring (dashboards, alerts) and Incident Response (log search, drill-down).]
Humio — Product Team Practices
Part 2
Be The Customer
• Design target was an on-premises solution
• Co-locate with first customer
• Provide a hosted service ⇒ “eat our own dog food”
Safe Environment
• “It takes all kinds”
• Be open about strengths and weaknesses
• Be open to learn (and teach) new practices
• Experienced team initially to set practices and culture
Be in doubt!
• Discuss trade-offs — not do’s and don’ts
• Leave time to wonder
• No one knows “what’s best”
High BUS factor
• We depend on people. Period.
• Don’t try to make them replaceable
• Everyone is responsible
Choosing Scala
• I ❤ Erlang
• Knowing what Erlang can do for you, coordination code is painful to write and manage in Scala (thread pools, futures, async); see the sketch below.
• Use “scala, the good parts”.
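The kind of coordination plumbing that bullet refers to looks roughly like this in Scala. A minimal, hypothetical sketch (not Humio code), assuming a query fanned out over a few shards and combined with Futures on an explicit thread pool; in Erlang the runtime’s processes and mailboxes give you this wiring for free:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}

object CoordinationSketch {
  // Explicit thread pool and ExecutionContext that you manage yourself.
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

  // Hypothetical shard query: returns a partial result asynchronously.
  def queryShard(id: Int): Future[Long] = Future { id.toLong /* ... real work ... */ }

  def main(args: Array[String]): Unit = {
    // Fan out to four shards, collect partial results, sum them, handle failure.
    val total: Future[Long] = Future.sequence((1 to 4).map(queryShard)).map(_.sum)
    total.onComplete {
      case Success(sum) => println(s"total = $sum")
      case Failure(err) => println(s"query failed: $err")
    }
    Thread.sleep(100) // let the callback run before the JVM exits (sketch only)
  }
}
```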
Choosing Elm
• Elm is similar to React — functional JavaScript — but with proper syntax and static type checking.
• Tooling and libraries are less mature.
• Takes time for new devs to learn
• Upside is that it is “cool” — we give talks and contribute to the community.
Take small steps — but look up!
• Running a SaaS with frequent deployments teaches you to take small steps.
• Define design goals and discuss trade-offs. Keep those in mind and work towards them.
• Avoid long-running side-projects. Feature-flag new work.
Manage critical dependencies
• Own all critical components
• It is tempting (and easy) to pull in 200+ Apache libraries
• We use docker for delivery (reduce customer’s deps)
• Two outside dependencies: HighCharts and Kafka
Don’t waste hardware
“The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry.”
—Henry Petroski
Humio — Data Processing Engine
Part 3
Record Events, Monitor & Respond
[Diagram: Humio sits between the recorded events and two consumers: Metrics/Monitoring (dashboards, alerts) and Incident Response (log search, drill-down), as in Part 1.]
State Machine
[Diagram: events stream from the Event Store through the query /error/i | count() into a state machine whose state is the running result (count: 473 … count: 243,565).]
Query Language State Machine
• A query has the form filter … | aggregate()
• Each event is a Map[String,String]
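As a concrete illustration of that pipeline, here is a minimal sketch (hypothetical code, not Humio’s engine) of how a query like /error/i | count() can be read as a filter over Map[String,String] events feeding a tiny state machine; the field name "@rawstring" is an assumption:

```scala
// Hypothetical sketch: /error/i | count() = regex filter + running-count state machine.
object QuerySketch {
  type Event = Map[String, String]

  // Filter step: case-insensitive regex over the raw log line (field name assumed).
  val errorFilter: Event => Boolean = { e =>
    e.get("@rawstring").exists(s => "(?i)error".r.findFirstIn(s).isDefined)
  }

  // Aggregate step: the state is simply the running count.
  final case class CountState(n: Long) {
    def step(e: Event): CountState = CountState(n + 1)
  }

  def run(events: Iterator[Event]): Long =
    events.filter(errorFilter).foldLeft(CountState(0))(_ step _).n

  def main(args: Array[String]): Unit = {
    val events = Iterator(
      Map("@rawstring" -> "2019-01-01T10:00:00 ERROR db timeout"),
      Map("@rawstring" -> "2019-01-01T10:00:01 INFO request ok"),
      Map("@rawstring" -> "2019-01-01T10:00:02 error retrying")
    )
    println(run(events)) // 2
  }
}
```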
Aggregates State Machine

Function | State     | Step                      | Merge                  | Result
count    | N         | N+1                       | N1+N2                  | N
avg      | (N, s)    | (N+1, s+value)            | (N1+N2, s1+s2)         | s/N
stddev   | (N, s, q) | (N+1, s+value, q+value²)  | (N1+N2, s1+s2, q1+q2)  | √(N*q − s²)/N

GroupBy(host, function=count()):
• State: Map[String, State2]
• Step(G, e): key = e[“host”]; map[key] = Step2(map[key], e)
• Merge(G1, G2): ∀key in G1, G2 => result[key] = Merge2(G1[key], G2[key])
• Result(G): ∀key in G => result[key] = Result2(G[key])
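The table above is essentially an interface: a state, a per-event step, a merge of two partial states, and a final result. A minimal sketch under that reading (the trait, “empty”, and the class names are assumptions; step/merge/result mirror the table’s columns). Merge is what lets partial states computed over different data segments be combined later:

```scala
// One aggregate = one small state machine: empty state, step per event,
// merge of two partial states, and a final result.
trait Aggregate[S, R] {
  def empty: S
  def step(s: S, e: Map[String, String]): S
  def merge(a: S, b: S): S
  def result(s: S): R
}

// count: state N, step N+1, merge N1+N2, result N
object CountAgg extends Aggregate[Long, Long] {
  def empty = 0L
  def step(s: Long, e: Map[String, String]) = s + 1
  def merge(a: Long, b: Long) = a + b
  def result(s: Long) = s
}

// avg over a numeric field: state (N, sum), result sum/N
final class AvgAgg(field: String) extends Aggregate[(Long, Double), Double] {
  def empty = (0L, 0.0)
  def step(s: (Long, Double), e: Map[String, String]) =
    e.get(field).flatMap(_.toDoubleOption) match {
      case Some(value) => (s._1 + 1, s._2 + value)
      case None        => s
    }
  def merge(a: (Long, Double), b: (Long, Double)) = (a._1 + b._1, a._2 + b._2)
  def result(s: (Long, Double)) = if (s._1 == 0) 0.0 else s._2 / s._1
}

// groupBy(key, inner): state is a map from key value to the inner aggregate's state
final class GroupByAgg[S, R](key: String, inner: Aggregate[S, R])
    extends Aggregate[Map[String, S], Map[String, R]] {
  def empty = Map.empty[String, S]
  def step(g: Map[String, S], e: Map[String, String]) =
    e.get(key) match {
      case Some(k) => g.updated(k, inner.step(g.getOrElse(k, inner.empty), e))
      case None    => g
    }
  def merge(a: Map[String, S], b: Map[String, S]) =
    (a.keySet ++ b.keySet).iterator.map { k =>
      k -> inner.merge(a.getOrElse(k, inner.empty), b.getOrElse(k, inner.empty))
    }.toMap
  def result(g: Map[String, S]) = g.map { case (k, s) => k -> inner.result(s) }
}
```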
[Diagram: events on a timeline are grouped into fixed-size time buckets, each bucket holding its own aggregate state.]
Time Boxing: groupby( time − time % bucket_size )
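A worked example of the time-boxing formula; a groupBy on the bucket value then gives one aggregate state per time box (a sketch, not Humio’s implementation):

```scala
// Round a timestamp down to the start of its bucket:
//   bucket = time - time % bucketSize
def bucketOf(timeMillis: Long, bucketSizeMillis: Long): Long =
  timeMillis - timeMillis % bucketSizeMillis

// With 60,000 ms buckets, 10:07:44 and 10:07:03 both map to the 10:07:00 bucket.
```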
Event Store Design
• Build a minimal index and compress the data ⇒ store an order of magnitude more events
• Fast “grep” for filtering events ⇒ filtering plus time/metadata selection reduces the problem space
Event Store
[Diagram: the event store is a sequence of segments; each ~10GB run of raw events is compressed into a ~1GB segment, and only (start-time, end-time, metadata) is kept per segment as index.]
• 1 month at 30GB/day ingest ⇒ ~90GB of data on disk, <1MB of index
• 1 month at 1TB/day ingest ⇒ ~4TB of data on disk, <1MB of index
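A sketch of what that minimal per-segment index could look like (hypothetical names; the real on-disk format is Humio’s own), which also makes the <1MB figure plausible:

```scala
// Per-segment metadata is all that is indexed: a time range, tags, and a file location.
final case class SegmentMeta(
  startTime: Long,          // epoch millis of the first event in the segment
  endTime: Long,            // epoch millis of the last event
  tags: Set[String],        // e.g. Set("#ds1", "#web")
  path: java.nio.file.Path  // the ~1GB compressed segment file
)

// Rough size check: at 1TB/day and ~10GB of raw events per segment that is ~100 segments
// per day, ~3,000 per month; a few hundred bytes of metadata each stays well under 1MB.
```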
Query
[Diagram: a query first selects only the segments whose time range and metadata tags match (e.g. #ds1, #web / #ds1, #app / #ds2, #web); the selected ~1GB compressed segments are then filtered in parallel, and the filtered events (on the order of 10GB of raw data) feed the aggregate state machine.]
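Putting the pieces together, a hedged sketch of that query flow, reusing the hypothetical SegmentMeta and Aggregate definitions from the earlier sketches:

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// 1) prune segments by time range and tags, 2) scan the survivors in parallel,
// 3) merge the partial aggregate states and produce the final result.
def runQuery[S, R](
    segments: Seq[SegmentMeta],
    queryStart: Long,
    queryEnd: Long,
    requiredTags: Set[String],
    agg: Aggregate[S, R],
    scanSegment: SegmentMeta => S        // decompress + filter + step over one segment
)(implicit ec: ExecutionContext): R = {
  val candidates = segments.filter { m =>
    m.endTime >= queryStart && m.startTime <= queryEnd && requiredTags.subsetOf(m.tags)
  }
  val partials = Future.traverse(candidates)(m => Future(scanSegment(m)))
  val merged   = Await.result(partials, Duration.Inf).foldLeft(agg.empty)(agg.merge)
  agg.result(merged)
}
```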
Brute Force: Grep at 30x
• Streaming disk access, use async file I/O
• Compress data at rest (and in the OS-level cache)
• Run one JVM per NUMA node
• Critical search code is sticky: one thread per core
• Reduce context switching (explicit scheduling)
• Localize data access (each core works on 64k chunks)
Go and find videos and blog posts about “Mechanical Sympathy” (Martin Thompson, LMAX) and “Why KDB+ is fast”.
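A minimal sketch of the 64k-chunk idea from the list above, on a plain fixed thread pool. True core pinning and NUMA placement are outside standard JVM APIs (one reason for running one JVM per NUMA node), and matches straddling chunk boundaries are ignored here for simplicity:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Count matches in a decompressed segment by handing out fixed 64k chunks to a
// pool with one thread per core, so each worker stays on small, cache-friendly data.
def scanChunks(segment: Array[Byte], countMatches: Array[Byte] => Long): Long = {
  val cores = Runtime.getRuntime.availableProcessors()
  val pool  = Executors.newFixedThreadPool(cores)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try {
    val chunks  = segment.grouped(64 * 1024).toSeq
    val partial = Future.traverse(chunks)(c => Future(countMatches(c)))
    Await.result(partial, Duration.Inf).sum
  } finally pool.shutdown()
}
```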
Event Processing + Brute-Force Search
Event processing:
• “Materialized views” for relevant metrics
• Processed when data is in-memory anyway
• Fast response times for “known” queries
Brute-force search:
• Shift CPU load to query time
• Data compression
• Allows ad-hoc queries
• Requires “full stack” ownership
[Chart: effort per query vs. effort per insert]
Log Analytics
• Part 1: Logging / Metrics Landscape
• Part 2: Product Team Practices & User Engagement
• Part 3: Careful Engineering
Thanks for your time.
Kresten Krab Thorup, Humio CTO