TRANSCRIPT
● Introduction
● Harsh realities of network analytics
● netbeam
● Demo
● Technology Stack
● Alternative Approaches
● Lessons Learned
2
ESnet Data, Analytics and Visualization Architecture
3
The Harsh Realities of Network Analytics
1. It’s a mess
   ● Your data isn’t neat and tidy
2. Things change
   ● What you need today may not be what you need tomorrow
3. There’s always more
   ● More devices & more telemetry
4. It’s never really done
   ● Time and money are limited
4
Coping strategies
1. It’s a mess
   ● Design knowing things won’t be tidy
2. Things change
   ● Specify “what”, not “how”
3. There’s always more
   ● Rely on the cloud for scaling
4. It’s never really done
   ● Keep raw data to keep your options open
5
netbeam
Network Analytics in Google Cloud
Three Pillars
1. Real time analytics
   ○ Low latency, incomplete
2. Offline analytics
   ○ High latency, complete
3. Flexible data model
   ○ Changing needs? Recompute from raw data!
Secret sauce: Apache Beam
6
What is Apache Beam?
1. The Beam Programming Model
2. SDKs for writing Beam pipelines
3. Runners for existing distributed processing backends
○ Apache Apex
○ Apache Flink
○ Apache Spark
○ Google Cloud Dataflow
○ Local runner for testing
Slide courtesy of the Apache Beam Project
7
The Evolution of Apache Beam
[Timeline: Google-internal systems (MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel) led to Google Cloud Dataflow, whose programming model became Apache Beam.]
Slide courtesy of the Apache Beam Project
8
Architecture Diagram
[Diagram: an SNMP collection system feeds Apache Beam (stream processing), which writes raw data to BigQuery (immutable) and real-time data to Bigtable (realtime); Apache Beam (batch processing) recomputes rollups (5m, 1h, 1d avg), align/rates, percentiles, etc. from BigQuery (historical); the old SNMP system is imported via avro; an API serves clients from Bigtable.]
9
Architecture Diagram
● Google Pubsub
● Uses Python outside of Google Cloud to poll devices and write to a Pubsub topic
● Code within Google Cloud subscribes to the topic to process data
10
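The polling step above can be sketched as follows: shape one polled SNMP counter sample into a Pub/Sub message payload. This is a minimal illustration; the field names and topic handling are assumptions, not netbeam’s actual schema.

```python
import json
import time

def encode_sample(device, if_name, in_octets, out_octets, ts=None):
    """Shape one polled SNMP counter sample as a Pub/Sub message payload.

    Field names here are illustrative, not netbeam's actual schema.
    """
    return json.dumps({
        "device": device,
        "ifName": if_name,
        "ifHCInOctets": in_octets,
        "ifHCOutOctets": out_octets,
        "timestamp": ts if ts is not None else int(time.time()),
    }).encode("utf-8")

# A real poller would publish each payload with the google-cloud-pubsub
# client, e.g. publisher.publish(topic_path, encode_sample(...)).
payload = encode_sample("router1", "xe-0/0/0", 123456789, 987654321, ts=1500000000)
print(payload.decode("utf-8"))
```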
Architecture Diagram
● Apache Beam / Google Dataflow
● Stream processing
● Subscribes to the Pubsub topic
11
Architecture Diagram
● Apache Beam / Google Dataflow
● Stream processing
● Subscribes to the Pubsub topic
● Raw data is written to BigQuery
12
Architecture Diagram
● Apache Beam / Google Dataflow
● Stream processing
● Subscribes to the Pubsub topic
● Raw data is written to BigQuery
● Real time transformed data (e.g. aligned data rates) is written to Bigtable
● Writes and makes use of metadata in Bigtable (not shown)
13
Architecture Diagram
● Cloud Bigtable
● Like HBase
● Write to cells in rows, indexed by keys
● We write 1 day of data to a single row (columns are the time of day, key is metric and day)
● Fast access to a row by key; can serve data from here
● Store one year
14
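The “1 day per row, columns are the time of day” layout described above can be sketched like this. The exact key format and encoding are assumptions, not netbeam’s real byte layout.

```python
from datetime import datetime, timezone

def row_key(metric, ts):
    """One Bigtable row per metric per UTC day (illustrative key format)."""
    day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
    return "{}::{}".format(metric, day)

def column_qualifier(ts):
    """Column is the time of day: seconds since midnight UTC."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return dt.hour * 3600 + dt.minute * 60 + dt.second

ts = 1502755200 + 3661  # 2017-08-15 01:01:01 UTC
print(row_key("snmp::router1::xe-0/0/0::in", ts))  # ...::in::2017-08-15
print(column_qualifier(ts))  # 3661
```

Reading a whole day of a metric is then a single row fetch by key, which is what makes serving from Bigtable fast.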
Architecture Diagram
● BigQuery
● Data warehousing solution
● Cheap storage, SQL access, but not suitable for real-time access
● Allows SQL queries for ad hoc investigation
● We store our source of truth here
15
Architecture Diagram
● BigQuery
● Data warehousing solution
● Cheap storage, SQL access, but not suitable for real-time access
● Allows SQL queries for ad hoc investigation
● We store our source of truth here
● Also stores historical data (7 years), imported via avro files
16
Architecture Diagram
● Apache Beam / Google Dataflow
● Batch processing
● Run with a cron job
17
Architecture Diagram
● Apache Beam / Google Dataflow
● Batch processing
● Run with a cron job
● Recalculate Bigtable data each night from the source of truth in BigQuery
18
Architecture Diagram
● Apache Beam / Google Dataflow
● Batch processing
● Run with a cron job
● Recalculate Bigtable data each night from the source of truth in BigQuery
● Process Bigtable rows into new rows of 5 min, 1 hr and 1 day aggregations
19
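Reduced to its core, the rollup step averages raw points into fixed windows. A sketch in plain Python rather than the production Beam job:

```python
def rollup(points, window_seconds):
    """Average (timestamp, value) points into fixed windows.

    A sketch of the 5m/1h/1d rollup idea, not the production Beam job.
    """
    buckets = {}
    for ts, value in points:
        start = ts - ts % window_seconds  # start of the window this point falls in
        buckets.setdefault(start, []).append(value)
    return {start: sum(vals) / len(vals) for start, vals in buckets.items()}

# 30-second samples rolled up into 5-minute (300 s) averages
points = [(0, 1.0), (30, 3.0), (300, 5.0), (330, 7.0)]
print(rollup(points, 300))  # {0: 2.0, 300: 6.0}
```

Running the same fold with window_seconds of 3600 and 86400 gives the 1 hr and 1 day rows.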
Architecture Diagram
● Apache Beam / Google Dataflow
● Batch processing
● Run with a cron job
● Recalculate Bigtable data each night from the source of truth in BigQuery
● Process Bigtable rows into new rows of 5 min, 1 hr and 1 day aggregations
● Additional pre-computed views, e.g. percentiles for traffic distribution over a month
20
Architecture Diagram
● API
● Currently runs on App Engine
● Node.js
● Serves data out of Bigtable
● Timeseries data is served as ‘tiles’; each tile is one row
● Would like to use Cloud Endpoints and provide a gRPC service
● Looking forward to a grpc-web solution
21
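Serving time series as ‘tiles’ means a client request for a time range maps to whole-row reads. A sketch of tile addressing under the one-day-per-row layout described earlier; the key format is an assumption, not the real API’s scheme.

```python
from datetime import datetime, timedelta, timezone

def tiles_for_range(metric, start_ts, end_ts):
    """List the day-tile row keys covering [start_ts, end_ts].

    Illustrative only; the real API's tile scheme may differ.
    """
    day = datetime.fromtimestamp(start_ts, tz=timezone.utc).date()
    last = datetime.fromtimestamp(end_ts, tz=timezone.utc).date()
    keys = []
    while day <= last:
        keys.append("{}::{}".format(metric, day.isoformat()))
        day += timedelta(days=1)
    return keys

# A 2-day query becomes three row fetches (start day, middle day, end day)
print(tiles_for_range("snmp::router1::in", 1502755200, 1502755200 + 2 * 86400))
```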
Use case example: Historical Trends
22
Use case example: Historical Trends
[Diagram: the old SNMP system is imported into BigQuery (historical) via avro, alongside streamed data; batch jobs compute per-day interface totals and per-month totals into Bigtable rows, which the Dataserver API (node.js) serves to clients.]

Bigtable row snmp-daily::2017-08::$interface:
  Jan 1: 1.8 Pb | Jan 2: 1.9 Pb | … | Dec 31: 3.1 Pb
Bigtable row snmp-monthly-totals (computed in BigQuery):
  Jan 1991: 28 Gb | Feb 1991: 29 Gb | … | Sep 2017: 56 Pb
23
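The per-day to per-month rollup behind this slide is a simple fold. A sketch under assumed row shapes; the real job runs as a BigQuery/Beam batch:

```python
from collections import defaultdict

def monthly_totals(daily_rows):
    """Fold per-day interface totals into per-month totals.

    daily_rows maps (month, day) -> petabytes; shapes are illustrative,
    not netbeam's actual row format.
    """
    totals = defaultdict(float)
    for (month, day), petabytes in daily_rows.items():
        totals[month] += petabytes
    return dict(totals)

rows = {("2017-08", 1): 1.8, ("2017-08", 2): 1.9, ("2017-09", 1): 2.1}
print(monthly_totals(rows))
```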
Use case: real time anomaly detection
[Diagram: streamed data lands in BigQuery; a baseline generation job and an anomaly detection job write Bigtable rows served by the Dataserver API (node.js).]

Bigtable row baseline::5m::avg::$interface:
  Mon 12am: 2.1 | Mon 1am: 1.9 | Mon 2am: 0.3 | … | Sun 11pm: 0.5
Bigtable row anomaly::5m::avg:
  iface-1: +0.1 | iface-2: +2.0 | … | iface-n: -1.5

Baseline generation: generates an average for each interface over the past 3 months for that hour/day.
Anomaly detection: compares the baseline to real time values to generate the current deviation from normal.
24
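The baseline/deviation idea above can be sketched as follows. Bucketing by (weekday, hour) is an assumption about how “that hour/day” is keyed; the real job’s slot granularity may differ.

```python
from collections import defaultdict
from datetime import datetime, timezone

def build_baseline(history):
    """Average historical rates by (weekday, hour) slot.

    A sketch of "avg for each interface over the past 3 months for that
    hour/day"; slot granularity here is an assumption.
    """
    slots = defaultdict(list)
    for ts, rate in history:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        slots[(dt.weekday(), dt.hour)].append(rate)
    return {slot: sum(vals) / len(vals) for slot, vals in slots.items()}

def deviation(baseline, ts, rate):
    """Current value minus the baseline for this time slot."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return rate - baseline[(dt.weekday(), dt.hour)]

# Two Mondays at 00:00 UTC (1970-01-05 was a Monday), then a third Monday
history = [(345600, 2.0), (345600 + 604800, 3.0)]
base = build_baseline(history)
print(deviation(base, 345600 + 2 * 604800, 4.5))  # 4.5 - avg(2.0, 3.0) = 2.0
```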
Use case example: Percentiles
25
Use case example: Percentiles
[Diagram: streaming writes to Bigtable; daily rollups of 5m averages feed a percentiles job; the Dataserver API (node.js) serves the results.]

Bigtable row rollup-month-5m::2017-08::$interface::in:
  1: 6 Gbps | 2: 5 Gbps | … | 8640: 2 Gbps
Bigtable row percentiles::2017-08::$interface::in:
  1 pct: 0.1 Gbps | 2 pct: 0.3 Gbps | … | 99 pct: 22.1 Gbps
26
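Computing a percentile row from the ~8640 five-minute averages in a month (288 per day) reduces to sorting and rank selection. A nearest-rank sketch; the production job’s interpolation method is an assumption here.

```python
def percentile(samples, p):
    """Nearest-rank percentile (p in 1..99) over rate samples.

    A sketch; the production job may use a different interpolation.
    """
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))  # nearest rank, 1-based
    return ordered[rank - 1]

# Stand-in for a month of 5-minute rate samples
samples = list(range(1, 101))
print(percentile(samples, 95))  # 95
print(percentile(samples, 99))  # 99
```

Pre-computing rows like percentiles::2017-08::$interface::in means the API never sorts thousands of samples at request time.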
Example: Computing Total Traffic

# Python Beam SDK
pipeline = beam.Pipeline('DirectRunner')
(pipeline
 | 'read' >> ReadFromText('./example.csv')
 | 'csv' >> beam.ParDo(FormatCSVDoFn())
 | 'ifName key' >> beam.Map(group_by_device_interface)
 | 'group by iface' >> beam.GroupByKey()
 | 'compute rate' >> beam.FlatMap(compute_rate)
 | 'timestamp key' >> beam.Map(lambda row: (row['timestamp'], row['rateIn']))
 | 'group by timestamp' >> beam.GroupByKey()
 | 'sum by timestamp' >> beam.Map(lambda rates: (rates[0], sum(rates[1])))
 | 'format' >> beam.Map(lambda row: '{},{}'.format(row[0], row[1]))
 | 'save' >> beam.io.WriteToText('./total_by_timestamp'))
pipeline.run()
Full code available at: http://x1024.net/blog/2017/05/chinog-flexible-network-analytics-in-the-cloud/
27
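The pipeline relies on helpers defined in the full code linked above (FormatCSVDoFn, group_by_device_interface, compute_rate). The sketch below illustrates what a compute_rate step must do: turn cumulative SNMP octet counters into bit rates, tolerating 64-bit counter wrap. It is a hypothetical stand-in, not the blog post’s actual implementation.

```python
COUNTER64_MOD = 2 ** 64

def compute_rate(keyed_rows):
    """Turn cumulative octet counters into bit rates for one interface.

    Hypothetical stand-in for the pipeline's compute_rate step; the real
    helper is in the linked blog post. Input: (key, rows) as produced by
    GroupByKey, each row having 'timestamp' and 'inOctets'.
    """
    key, rows = keyed_rows
    rows = sorted(rows, key=lambda r: r['timestamp'])
    for prev, cur in zip(rows, rows[1:]):
        delta = (cur['inOctets'] - prev['inOctets']) % COUNTER64_MOD  # wrap-safe
        seconds = cur['timestamp'] - prev['timestamp']
        yield {'timestamp': cur['timestamp'], 'rateIn': delta * 8 / seconds}

rows = [{'timestamp': 0, 'inOctets': 0}, {'timestamp': 30, 'inOctets': 300}]
print(list(compute_rate(('router1:xe-0/0/0', rows))))
# 300 octets in 30 s -> rateIn 80.0 bits/sec
```

Yielding multiple output rows per input group is why the pipeline uses FlatMap rather than Map for this step.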
Our Stack
● Apache Beam using Scio
● Google Cloud Platform
  ○ Dataflow
  ○ Bigtable
  ○ BigQuery
  ○ Pub/Sub
  ○ App Engine
● Languages
  ○ Scala
  ○ Javascript / Typescript
  ○ Python
28
Current Status & Future Plans

Current
Release candidate for SNMP data:
● Ingest to BigQuery is working
● Migration of historical data is complete
● Streaming ingest to Bigtable
● Early version of utilization visualization
● Simple data server can provide data to clients, but a gRPC API is coming
● Interface time series charts functional
29
Future
● More types of data:
  ○ Flow data
  ○ perfSONAR
● Machine learning
● Anomaly detection
● “Mash up” various data sources
Why not InfluxDB, Elastic or ${FAVORITE_DB}?
● We have a data processing problem, not a data storage problem per se.
  ○ Beam and the ecosystem around it give a huge amount of flexibility -- we can try new ideas as they occur to us
  ○ Ability to move to different platform components
  ○ Machine learning (TensorFlow and others)
● InfluxDB & Elastic
  ○ Require care and feeding -- have to think about disks and machines, etc.
  ○ At our last evaluation (a while ago now) InfluxDB wasn’t able to keep up with our load -- this may have changed, but the other benefits outweigh that.
  ○ Elastic doesn’t seem to be a good fit for long term storage -- everything is in the “hot” tier
30
Why the cloud? Why Google Cloud Platform?

Why the cloud?
● Focus on our problems, not on infrastructure
● Scalability without needing to own lots of systems
● Managed services for databases and compute

Why Google Cloud?
● Apache Beam was Google Dataflow when we first encountered it
● More cohesive ecosystem than AWS, in our experience
● Although we have used Google Cloud specific services, the approach is portable to other environments
31
Lessons learned / Life in the cloud / Good & Bad

The Good
● Not a silver bullet, but makes many things easier
● Scaling! We processed 9,902,585,175 data points in 3.5 hours
● Focus on your services, not on infrastructure
● Scio and Scala allow working at a high level of abstraction

The Not So Good
● GCP tech support is pretty bad
● Python is a second class citizen in Beam for now
● Scala is powerful but challenging at times
● Learning curve is pretty steep in places
32
Thank you!
Jon Dugan <[email protected]>

● MyESnet: https://my.es.net
● ESnet Open Source: http://software.es.net/
  ○ http://software.es.net/react-timeseries-charts/
  ○ http://software.es.net/pond/
  ○ http://software.es.net/react-network-diagrams/
● Scio: https://github.com/spotify/scio
● Beam: https://beam.apache.org
33
The ESnet netbeam team:
● Peter Murphy
● Monte Goode
● Sowmya Balasubramanian
● Scott Richmond