Building a System for Machine and Event-Oriented Data - SF HUG Nov 2015

JOEY ECHEVERRIA | @fwiffo | November 4th, 2015 San Francisco Hadoop Users Group Building a System for Machine and Event-Oriented Data

Upload: felicia-haggarty

Post on 13-Apr-2017

TRANSCRIPT

Page 1: Building a system for machine and event-oriented data - SF HUG Nov 2015

© Rocana, Inc. All Rights Reserved. | 1

JOEY ECHEVERRIA | @fwiffo | November 4th, 2015 | San Francisco Hadoop Users Group

Building a System for Machine and Event-Oriented Data

Page 2: Building a system for machine and event-oriented data - SF HUG Nov 2015

Context

Page 3: Building a system for machine and event-oriented data - SF HUG Nov 2015

Joey

• Where I work: Rocana – Director of Engineering

• Where I used to work: Cloudera (’11 – ’15), NSA

• Distributed systems, security, data processing, “big data”

Page 4: Building a system for machine and event-oriented data - SF HUG Nov 2015

Free stuff!

• Tweet @rocanainc with #SFHUG – best three tweets get a book

Page 5: Building a system for machine and event-oriented data - SF HUG Nov 2015

What we do

• Build a system for the operation of modern data centers

• Triage and diagnostics, exploration, trends, advanced analytics of complex systems

• Our data:
  • logs, metrics, human activity, anything that occurs in the data center

• “Enterprise Software” (i.e. we build for others.)

• Today: how we built what we built

Page 6: Building a system for machine and event-oriented data - SF HUG Nov 2015

Our typical customer use cases

• >100K events / sec (8.6B events / day), sub-second end to end latency, full fidelity retention, critical use cases

• Quality of service - “are credit card transactions happening fast enough?”

• Fraud detection - “detect, investigate, prosecute, and learn from fraud.”

• Forensic diagnostics - “what really caused the outage last Friday?”

• Security - “who’s doing what, where, when, why, and how, and is that ok?”

• User behavior - “capture and correlate user behavior with system performance, then feed it to downstream systems in real time.”

Page 7: Building a system for machine and event-oriented data - SF HUG Nov 2015

10,000 foot view

Page 8: Building a system for machine and event-oriented data - SF HUG Nov 2015

High level architecture

Page 9: Building a system for machine and event-oriented data - SF HUG Nov 2015

Guarantees

• No single point of failure exists

• All components scale horizontally[1]

• Data retention and latency are a function of cost, not tech[1]

• Every event is delivered provided no more than N - 1 failures occur (where N is the Kafka replication level)

• All operations, including upgrade, are online[2]

• Every event is (or appears to be) delivered exactly once[3]

[1] We’re positive there’s a limit, but thus far it has been cost.
[2] From the user’s perspective, at a system level.
[3] When queried via our UI. Lots of details here.

Page 10: Building a system for machine and event-oriented data - SF HUG Nov 2015

Events

Page 11: Building a system for machine and event-oriented data - SF HUG Nov 2015

Modeling our world

• Everything is an event

• Each event contains a timestamp, type, location, host, service, body, and type-specific attributes (k/v pairs)

• Build specialized aggregates as necessary - just optimized views of the data

Page 12: Building a system for machine and event-oriented data - SF HUG Nov 2015

Event schema

{
  id: string,
  ts: long,
  event_type_id: int,
  location: string,
  host: string,
  service: string,
  body: [ null, bytes ],
  attributes: map<string>
}
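As a rough illustration, the schema could be mirrored in Python as follows. This is a minimal sketch, not the system's actual code: the names `Event` and `new_event` are hypothetical, and treating `ts` as epoch milliseconds is an assumption (the slides only say `long`, although a later example uses a window of 60000).

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Event:
    """Illustrative mirror of the slide's schema (hypothetical class name)."""
    id: str
    ts: int                       # assumed: epoch milliseconds, set by the producer
    event_type_id: int
    location: str
    host: str
    service: str
    body: Optional[bytes] = None  # the [ null, bytes ] union
    attributes: Dict[str, str] = field(default_factory=dict)  # map<string>


def new_event(event_type_id, location, host, service,
              body=None, attributes=None, ts=None):
    """Hypothetical helper: fill in id and observation time at the producer."""
    return Event(
        id=str(uuid.uuid4()),
        ts=int(time.time() * 1000) if ts is None else ts,
        event_type_id=event_type_id,
        location=location,
        host=host,
        service=service,
        body=body,
        attributes=dict(attributes or {}),
    )


example = new_event(100, "us-west-2a", "w2a-demo-1", "sshd",
                    body=b"raw bytes", attributes={"syslog_pid": "668"}, ts=0)
```

Keeping every attribute value a string matches `map<string>`; interpreting those strings is left to whatever reads the event type metadata.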

Page 13: Building a system for machine and event-oriented data - SF HUG Nov 2015

Event types

• Some event types are standard
  • syslog, http, log4j, generic text record, …

• Users define custom event types

• Producers populate event type

• Transformations can turn one event type into another

• Event type metadata tells downstream systems how to interpret body and attributes

Page 14: Building a system for machine and event-oriented data - SF HUG Nov 2015

Ex: generic syslog event

event_type_id: 100, // rfc3164, rfc5424 (syslog)
body: … // raw syslog message bytes
attributes: { // extracted fields from body
  syslog_message: "DHCPACK from 10.10.0.1 (xid=0x45b63bdc)",
  syslog_severity: "6", // info severity
  syslog_facility: "3", // daemon facility
  syslog_process: "dhclient",
  syslog_pid: "668",
}
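To show how attributes like these might be extracted from the body, here is a minimal sketch for RFC 3164 style lines only. The function name `parse_syslog`, the regular expression, and the sample line (including its timestamp and host) are assumptions made for illustration. The one firm fact used is that the syslog PRI value equals facility * 8 + severity, so PRI 30 is facility 3 (daemon) and severity 6 (info).

```python
import re

# Assumed layout: "<PRI>Mmm dd hh:mm:ss HOST TAG[PID]: MESSAGE" (RFC 3164 style).
_SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>"
    r"(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s"
    r"(?P<host>\S+)\s"
    r"(?P<process>[^\[:\s]+)(?:\[(?P<pid>\d+)\])?:\s?"
    r"(?P<message>.*)$"
)


def parse_syslog(body: bytes) -> dict:
    """Return string-valued attributes extracted from a raw syslog line."""
    line = body.decode("utf-8", errors="replace")
    match = _SYSLOG_RE.match(line)
    if match is None:
        raise ValueError("not an RFC 3164 style line")
    pri = int(match.group("pri"))
    attrs = {
        "syslog_message": match.group("message"),
        "syslog_severity": str(pri % 8),   # low three bits of PRI
        "syslog_facility": str(pri // 8),  # remaining bits of PRI
        "syslog_process": match.group("process"),
    }
    if match.group("pid") is not None:
        attrs["syslog_pid"] = match.group("pid")
    return attrs


# Hypothetical sample line built to match the slide's attribute values.
sample = (b"<30>Nov  4 10:15:00 w2a-demo-1 dhclient[668]: "
          b"DHCPACK from 10.10.0.1 (xid=0x45b63bdc)")
attributes = parse_syslog(sample)
```

A line that does not match raises `ValueError`, which a consumer could catch and route to a dead letter queue instead of dropping the event.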

Page 15: Building a system for machine and event-oriented data - SF HUG Nov 2015

Ex: generic http event

event_type_id: 102, // generic http event
body: … // raw http log message bytes
attributes: {
  http_req_method: "GET",
  http_req_vhost: "w2a-demo-02",
  http_req_path: "/api/v1/search?q=service%3Asshd&p=1&s=200",
  http_req_query: "q=service%3Asshd&p=1&s=200",
  http_resp_code: "200",
}
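The relationship between `http_req_path` and `http_req_query` can be sketched with the standard library. This assumes the query attribute is simply the undecoded query component of the request target, which is what the example values suggest; the function name `http_request_attributes` is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def http_request_attributes(method, vhost, request_target, status):
    """Build string-valued attributes for a generic http event (illustrative)."""
    parts = urlsplit(request_target)
    return {
        "http_req_method": method,
        "http_req_vhost": vhost,
        "http_req_path": request_target,   # the slide keeps the query in the path
        "http_req_query": parts.query,     # left percent-encoded, as on the slide
        "http_resp_code": str(status),
    }


http_attrs = http_request_attributes(
    "GET", "w2a-demo-02", "/api/v1/search?q=service%3Asshd&p=1&s=200", 200)

# Decoding is deferred to whoever needs the individual parameters.
decoded = parse_qs(http_attrs["http_req_query"])
```

Storing the raw query string and decoding it on demand keeps the event attributes faithful to what was logged.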

Page 16: Building a system for machine and event-oriented data - SF HUG Nov 2015

Consumers

Page 17: Building a system for machine and event-oriented data - SF HUG Nov 2015

Consumers

• …do most of the work

• Parallelism

• Kafka offset management

• Message de-duplication

• Transformation (embedded library)

• Dead letter queue support

• Downstream system knowledge
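Several of these responsibilities can be sketched together in a few lines. The following is a toy, in-memory illustration under stated assumptions, not the talk's implementation: there is no Kafka client here, the class name `DedupingConsumer` is hypothetical, de-duplication is assumed to key on the event `id` with a bounded memory of recent ids, and the offset is recorded only after an event has been handled (an at-least-once pattern).

```python
from collections import OrderedDict


class DedupingConsumer:
    """Toy consumer: de-duplicate by id, transform, dead-letter failures."""

    def __init__(self, transform, sink, dead_letters, max_seen=100_000):
        self.transform = transform        # embedded transformation function
        self.sink = sink                  # stand-in for the downstream system
        self.dead_letters = dead_letters  # stand-in for a dead letter queue
        self.max_seen = max_seen
        self._seen = OrderedDict()        # recently seen ids, oldest first
        self.committed_offset = -1

    def handle(self, offset, event):
        event_id = event["id"]
        if event_id in self._seen:
            self._seen.move_to_end(event_id)  # duplicate: skip, refresh recency
        else:
            try:
                self.sink.append(self.transform(event))
            except Exception as exc:
                # Keep the bad event for inspection instead of blocking the stream.
                self.dead_letters.append((event, repr(exc)))
            self._seen[event_id] = True
            if len(self._seen) > self.max_seen:
                self._seen.popitem(last=False)  # forget the oldest id
        # Record progress only after the event has been dealt with.
        self.committed_offset = offset


def uppercase_body(event):
    return {**event, "body": event["body"].upper()}


sink, dead_letters = [], []
consumer = DedupingConsumer(uppercase_body, sink, dead_letters)
consumer.handle(0, {"id": "a", "body": "x"})
consumer.handle(1, {"id": "a", "body": "x"})  # redelivered duplicate
consumer.handle(2, {"id": "b"})               # malformed: no body
```

A bounded seen-set only catches duplicates that arrive close together; a real system would need to decide how far back duplicates can occur.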

Page 18: Building a system for machine and event-oriented data - SF HUG Nov 2015

Inside a consumer

Page 19: Building a system for machine and event-oriented data - SF HUG Nov 2015

Metrics and time series

Page 20: Building a system for machine and event-oriented data - SF HUG Nov 2015

Aggregation

• Mostly for time series metrics

• Two halves: on write and on query

• Data model: (dimensions) => (aggregates)

• On write
  • reduce(a: A, b: A) => A over window
  • Store “base” aggregates, all associative and commutative

• On query
  • Perform same aggregate or derivative aggregates
  • Group by the same dimensions
  • SQL (Impala)
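The write-side `reduce(a: A, b: A) => A` can be sketched as a merge over a small record of base aggregates. This is an illustrative sketch: the names `Agg`, `observe`, and `merge` are hypothetical, the five aggregates are borrowed from the example slide that follows, and carrying a timestamp alongside `last` is an assumption made here so that merging stays commutative.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Agg:
    count: int
    total: float
    minimum: float
    maximum: float
    last: float
    last_ts: int  # assumption: kept so "last" does not depend on merge order


def observe(ts, value):
    """Lift a single observation into an aggregate."""
    return Agg(1, value, value, value, value, ts)


def merge(a, b):
    """Associative, commutative combination of two aggregates."""
    newest = a if (a.last_ts, a.last) >= (b.last_ts, b.last) else b
    return Agg(
        count=a.count + b.count,
        total=a.total + b.total,
        minimum=min(a.minimum, b.minimum),
        maximum=max(a.maximum, b.maximum),
        last=newest.last,
        last_ts=newest.last_ts,
    )


samples = [observe(1000, 4.0), observe(2000, 10.0),
           observe(3000, 1.0), observe(4000, 8.0)]
in_order = reduce(merge, samples)
reversed_order = reduce(merge, list(reversed(samples)))

# A derivative aggregate computed at query time from the stored base aggregates.
mean = in_order.total / in_order.count
```

Because the merge does not care about order or grouping, partial aggregates written at different times for the same dimensions can be combined again at query time.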

Page 21: Building a system for machine and event-oriented data - SF HUG Nov 2015

Aside: late arriving data (it’s a thing)

• Never trust a (wall) clock

• Producer determines observation time, rest of the system uses this always

• Data that shows up late always processed according to observation time

• Aggregation consequences
  • The same time window can appear multiple times
  • Solution: aggregate every N seconds, potentially generating multiple aggregates for the same time bin

• This is real and you must deal with it
  • Do what we did or
  • Build a system that mutates/replaces aggregates already output or
  • Delay aggregate output for some slop time; drop it if late data shows up
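The first option (flush partial aggregates periodically and accept several rows for the same time bin) can be sketched as follows. This is a simplified illustration, assuming a count-only metric, a one-minute bin, and a caller that invokes `flush()` every N seconds; `WindowedCounter` and its methods are hypothetical names.

```python
from collections import defaultdict

WINDOW_MS = 60_000  # assumed bin size: one minute


class WindowedCounter:
    """Counts events per (bin, host) by observation time, between flushes."""

    def __init__(self):
        self._open = defaultdict(int)  # (bin_start, host) -> count since last flush

    def add(self, observed_ts, host):
        # Bin by the producer's observation time, never by arrival time.
        bin_start = observed_ts - observed_ts % WINDOW_MS
        self._open[(bin_start, host)] += 1

    def flush(self):
        """Emit everything accumulated since the last flush, then reset."""
        rows = [(b, h, c) for (b, h), c in sorted(self._open.items())]
        self._open.clear()
        return rows


counter = WindowedCounter()
counter.add(5_000, "h1")
counter.add(20_000, "h1")
first = counter.flush()        # one row for bin 0

counter.add(61_000, "h1")
counter.add(59_000, "h1")      # late event that belongs to bin 0
second = counter.flush()       # bin 0 appears again, next to bin 60000

# Query-time rollup: rows with the same dimensions are summed.
totals = defaultdict(int)
for bin_start, host, count in first + second:
    totals[(bin_start, host)] += count
```

Nothing already written is mutated; the late event simply produces a second row for bin 0 that a later group-by folds in.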

Page 22: Building a system for machine and event-oriented data - SF HUG Nov 2015

Ex: service event volume by host and minute

• Dimensions: ts, window, location, host, service, metric

• On write, aggregates: count, sum, min, max, last

• epoch, 60000, us-west-2a, w2a-demo-1, sshd, event_volume => 17, 42, 1, 10, 8

• On query:
  • SELECT floor(ts / 60000) AS bin, loc, host, service, metric, sum(value_sum) FROM metrics WHERE ts BETWEEN x AND y AND metric = 'event_volume' GROUP BY bin, loc, host, service, metric

• If late arriving data existed in events, the same dimensions would repeat with another set of aggregates and would be rolled up as a result of the group by

• tl;dr: normal window aggregation operations
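The rollup can be tried end to end with an in-memory SQLite table standing in for Impala. This is a sketch under several assumptions: the table layout, the `window_ms` column, and every `value_*` column other than `value_sum` are invented for illustration, the second and third rows are made-up data, and integer division replaces `floor()` because `ts` is stored as an integer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE metrics (
        ts INTEGER, window_ms INTEGER, loc TEXT, host TEXT, service TEXT,
        metric TEXT, value_count INTEGER, value_sum REAL,
        value_min REAL, value_max REAL, value_last REAL
    )
""")
rows = [
    # The slide's row for the first minute.
    (0, 60000, "us-west-2a", "w2a-demo-1", "sshd", "event_volume", 17, 42, 1, 10, 8),
    # Hypothetical second row for the same minute, produced by late data.
    (0, 60000, "us-west-2a", "w2a-demo-1", "sshd", "event_volume", 3, 5, 1, 3, 3),
    # Hypothetical row for the following minute.
    (60000, 60000, "us-west-2a", "w2a-demo-1", "sshd", "event_volume", 9, 20, 1, 6, 2),
]
conn.executemany("INSERT INTO metrics VALUES (?,?,?,?,?,?,?,?,?,?,?)", rows)

result = conn.execute(
    """
    SELECT ts / 60000 AS bin, loc, host, service, metric, SUM(value_sum)
    FROM metrics
    WHERE ts BETWEEN ? AND ? AND metric = 'event_volume'
    GROUP BY bin, loc, host, service, metric
    ORDER BY bin
    """,
    (0, 119999),
).fetchall()
conn.close()
```

The two rows for the first minute collapse into one result row, which is the sense in which late data needs no special handling at query time.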

Page 23: Building a system for machine and event-oriented data - SF HUG Nov 2015

Extension, pain, and advice

Page 24: Building a system for machine and event-oriented data - SF HUG Nov 2015

Extending the system

• Custom producers

• Custom consumers

• Event types

• Parser / transformation plugins

• Custom metric definition and aggregate functions

• Custom processing jobs on landed data

Page 25: Building a system for machine and event-oriented data - SF HUG Nov 2015

Pain (aka: the struggle is real)

• Lots of tradeoffs when picking a stream processing solution
  • Apache Samza: right features, but low level programming model, not supported by vendors. Missing security features.
  • Apache Storm: too rigid, too slow. Not supported by all Hadoop vendors.
  • Apache Spark Streaming: tons of issues initially, but lots of community energy. Improving.
  • @digitallogic: “My heart says Samza, but my head says Spark Streaming.”
  • Our (current) needs are meager; do work inside consumers.

• Stack complexity, (relative im)maturity

• Scaling SolrCloud to billions of events per day

Page 26: Building a system for machine and event-oriented data - SF HUG Nov 2015

If you’re going to try this…

• Read all the literature on stream processing[1]

• Treat it like the distributed systems problem it is

• Understand, make, and make good on guarantees

• Find the right abstractions

• Never trust the hand waving or “hello worlds”

• Fully evaluate the projects/products in this space

• Understand it’s not just about search

[1] wait, like all of it? yeah, like all of it.

Page 27: Building a system for machine and event-oriented data - SF HUG Nov 2015

Things I didn’t talk about

• Reprocessing data when bad code / transformations are detected

• Dealing with data quality issues (“the struggle is real” part 2)

• The user interface and all the fancy analytics
  • data visualization and exploration
  • event search
  • anomalous trend and event detection
  • metric, source, and event correlation
  • motif finding
  • noise reduction and dithering

• Event delivery semantics (e.g. at least once, exactly once, etc.)

• Alerting

Page 28: Building a system for machine and event-oriented data - SF HUG Nov 2015

Questions?

@fwiffo