TRANSCRIPT
● Introduction
● Harsh realities of network analytics
● netbeam
● Demo
● Technology Stack
● Alternative Approaches
● Lessons Learned
2
ESnet Data, Analytics and Visualization Architecture
3
The Harsh Realities of Network Analytics
1. It’s a mess
   ● Your data isn’t neat and tidy
2. Things change
   ● What you need today may not be what you need tomorrow
3. There’s always more
   ● More devices & more telemetry
4. It’s never really done
   ● Time and money are limited
4
Coping strategies
1. It’s a mess
   ● Design knowing things won’t be tidy
2. Things change
   ● Specify “what”, not “how”
3. There’s always more
   ● Rely on the cloud for scaling
4. It’s never really done
   ● Keep raw data to keep your options open
5
netbeam
Network Analytics in Google Cloud
Three Pillars
1. Real time analytics
   ○ Low latency, incomplete
2. Offline analytics
   ○ High latency, complete
3. Flexible data model
   ○ Changing needs? Recompute from raw data!
Secret sauce: Apache Beam
6
What is Apache Beam?
1. The Beam Programming Model
2. SDKs for writing Beam pipelines
3. Runners for existing distributed processing backends
○ Apache Apex
○ Apache Flink
○ Apache Spark
○ Google Cloud Dataflow
○ Local runner for testing
Slide courtesy of the Apache Beam Project
7
The Evolution of Apache Beam
[Timeline: Google-internal systems (MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel) led to Google Cloud Dataflow, whose programming model became Apache Beam.]
Slide courtesy of the Apache Beam Project
8
Architecture Diagram
[Diagram: an SNMP collection system feeds Apache Beam (stream processing), which writes raw data to BigQuery (immutable) and real-time data to Bigtable (realtime); Apache Beam (batch processing) recomputes rollups (5m, 1h, 1d avg), align/rates, percentiles, etc. from BigQuery (historical); the old SNMP system is imported via avro; an API serves clients from Bigtable.]
9
Architecture Diagram
● Google Pubsub
● Uses Python outside of Google Cloud to poll devices and write to a Pubsub topic
● Code within Google Cloud subscribes to the topic to process data
10
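The polling step above can be sketched as follows: shape one polled SNMP counter sample into a Pub/Sub message payload. This is a minimal illustration; the field names and topic handling are assumptions, not netbeam’s actual schema.

```python
import json
import time

def encode_sample(device, if_name, in_octets, out_octets, ts=None):
    """Shape one polled SNMP counter sample as a Pub/Sub message payload.

    Field names here are illustrative, not netbeam's actual schema.
    """
    return json.dumps({
        "device": device,
        "ifName": if_name,
        "ifHCInOctets": in_octets,
        "ifHCOutOctets": out_octets,
        "timestamp": ts if ts is not None else int(time.time()),
    }).encode("utf-8")

# A real poller would publish each payload with the google-cloud-pubsub
# client, e.g. publisher.publish(topic_path, encode_sample(...)).
payload = encode_sample("router1", "xe-0/0/0", 123456789, 987654321, ts=1500000000)
print(payload.decode("utf-8"))
```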
Architecture Diagram
● Apache Beam / Google Dataflow
● Stream processing
● Subscribes to the Pubsub topic
11
Architecture Diagram
● Apache Beam / Google Dataflow
● Stream processing
● Subscribes to the Pubsub topic
● Raw data is written to BigQuery
12
Architecture Diagram
● Apache Beam / Google Dataflow
● Stream processing
● Subscribes to the Pubsub topic
● Raw data is written to BigQuery
● Real time transformed data (e.g. aligned data rates) is written to Bigtable
● Writes and makes use of metadata in Bigtable (not shown)
13
Architecture Diagram
● Cloud Bigtable
● Like HBase
● Write to cells in rows, indexed by keys
● We write 1 day of data to a single row (columns are the time of day, key is metric and day)
● Fast access to a row by key; can serve data from here
● Store one year
14
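The “1 day per row, columns are the time of day” layout described above can be sketched like this. The exact key format and encoding are assumptions, not netbeam’s real byte layout.

```python
from datetime import datetime, timezone

def row_key(metric, ts):
    """One Bigtable row per metric per UTC day (illustrative key format)."""
    day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
    return "{}::{}".format(metric, day)

def column_qualifier(ts):
    """Column is the time of day: seconds since midnight UTC."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return dt.hour * 3600 + dt.minute * 60 + dt.second

ts = 1502755200 + 3661  # 2017-08-15 01:01:01 UTC
print(row_key("snmp::router1::xe-0/0/0::in", ts))  # ...::in::2017-08-15
print(column_qualifier(ts))  # 3661
```

Reading a whole day of a metric is then a single row fetch by key, which is what makes serving from Bigtable fast.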
Architecture Diagram
● BigQuery
● Data warehousing solution
● Cheap storage, SQL access, but not suitable for real-time access
● Allows SQL queries for ad hoc investigation
● We store our source of truth here
15
Architecture Diagram
● BigQuery
● Data warehousing solution
● Cheap storage, SQL access, but not suitable for real-time access
● Allows SQL queries for ad hoc investigation
● We store our source of truth here
● Also stores historical data (7 years), imported via avro files
16
Architecture Diagram
● Apache Beam / Google Dataflow
● Batch processing
● Run with a cron job
17
Architecture Diagram
● Apache Beam / Google Dataflow
● Batch processing
● Run with a cron job
● Recalculate Bigtable data each night from the source of truth in BigQuery
18
Architecture Diagram
● Apache Beam / Google Dataflow
● Batch processing
● Run with a cron job
● Recalculate Bigtable data each night from the source of truth in BigQuery
● Process Bigtable rows into new rows of 5 min, 1 hr and 1 day aggregations
19
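Reduced to its core, the rollup step averages raw points into fixed windows. A sketch in plain Python rather than the production Beam job:

```python
def rollup(points, window_seconds):
    """Average (timestamp, value) points into fixed windows.

    A sketch of the 5m/1h/1d rollup idea, not the production Beam job.
    """
    buckets = {}
    for ts, value in points:
        start = ts - ts % window_seconds  # start of the window this point falls in
        buckets.setdefault(start, []).append(value)
    return {start: sum(vals) / len(vals) for start, vals in buckets.items()}

# 30-second samples rolled up into 5-minute (300 s) averages
points = [(0, 1.0), (30, 3.0), (300, 5.0), (330, 7.0)]
print(rollup(points, 300))  # {0: 2.0, 300: 6.0}
```

Running the same fold with window_seconds of 3600 and 86400 gives the 1 hr and 1 day rows.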
Architecture Diagram
● Apache Beam / Google Dataflow
● Batch processing
● Run with a cron job
● Recalculate Bigtable data each night from the source of truth in BigQuery
● Process Bigtable rows into new rows of 5 min, 1 hr and 1 day aggregations
● Additional pre-computed views, e.g. percentiles for traffic distribution over a month
20
Architecture Diagram
● API
● Currently runs on App Engine
● Node.js
● Serves data out of Bigtable
● Timeseries data is served as ‘tiles’; each tile is one row
● Would like to use Cloud Endpoints and provide a gRPC service
● Looking forward to a grpc-web solution
21
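Serving time series as ‘tiles’ means a client request for a time range maps to whole-row reads. A sketch of tile addressing under the one-day-per-row layout described earlier; the key format is an assumption, not the real API’s scheme.

```python
from datetime import datetime, timedelta, timezone

def tiles_for_range(metric, start_ts, end_ts):
    """List the day-tile row keys covering [start_ts, end_ts].

    Illustrative only; the real API's tile scheme may differ.
    """
    day = datetime.fromtimestamp(start_ts, tz=timezone.utc).date()
    last = datetime.fromtimestamp(end_ts, tz=timezone.utc).date()
    keys = []
    while day <= last:
        keys.append("{}::{}".format(metric, day.isoformat()))
        day += timedelta(days=1)
    return keys

# A 2-day query becomes three row fetches (start day, middle day, end day)
print(tiles_for_range("snmp::router1::in", 1502755200, 1502755200 + 2 * 86400))
```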
Use case example: Historical Trends
22
Use case example: Historical Trends
[Diagram: the old SNMP system is imported into BigQuery (historical) via avro, alongside streamed data; batch jobs compute per-day interface totals and per-month totals into Bigtable rows, which the Dataserver API (node.js) serves to clients.]

Bigtable row snmp-daily::2017-08::$interface:
  Jan 1: 1.8 Pb | Jan 2: 1.9 Pb | … | Dec 31: 3.1 Pb
Bigtable row snmp-monthly-totals (computed in BigQuery):
  Jan 1991: 28 Gb | Feb 1991: 29 Gb | … | Sep 2017: 56 Pb
23
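The per-day to per-month rollup behind this slide is a simple fold. A sketch under assumed row shapes; the real job runs as a BigQuery/Beam batch:

```python
from collections import defaultdict

def monthly_totals(daily_rows):
    """Fold per-day interface totals into per-month totals.

    daily_rows maps (month, day) -> petabytes; shapes are illustrative,
    not netbeam's actual row format.
    """
    totals = defaultdict(float)
    for (month, day), petabytes in daily_rows.items():
        totals[month] += petabytes
    return dict(totals)

rows = {("2017-08", 1): 1.8, ("2017-08", 2): 1.9, ("2017-09", 1): 2.1}
print(monthly_totals(rows))
```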
Use case: real time anomaly detection
[Diagram: streamed data lands in BigQuery; a baseline generation job and an anomaly detection job write Bigtable rows served by the Dataserver API (node.js).]

Bigtable row baseline::5m::avg::$interface:
  Mon 12am: 2.1 | Mon 1am: 1.9 | Mon 2am: 0.3 | … | Sun 11pm: 0.5
Bigtable row anomaly::5m::avg:
  iface-1: +0.1 | iface-2: +2.0 | … | iface-n: -1.5

Baseline generation: generates an average for each interface over the past 3 months for that hour/day.
Anomaly detection: compares the baseline to real time values to generate the current deviation from normal.
24
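The baseline/deviation idea above can be sketched as follows. Bucketing by (weekday, hour) is an assumption about how “that hour/day” is keyed; the real job’s slot granularity may differ.

```python
from collections import defaultdict
from datetime import datetime, timezone

def build_baseline(history):
    """Average historical rates by (weekday, hour) slot.

    A sketch of "avg for each interface over the past 3 months for that
    hour/day"; slot granularity here is an assumption.
    """
    slots = defaultdict(list)
    for ts, rate in history:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        slots[(dt.weekday(), dt.hour)].append(rate)
    return {slot: sum(vals) / len(vals) for slot, vals in slots.items()}

def deviation(baseline, ts, rate):
    """Current value minus the baseline for this time slot."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return rate - baseline[(dt.weekday(), dt.hour)]

# Two Mondays at 00:00 UTC (1970-01-05 was a Monday), then a third Monday
history = [(345600, 2.0), (345600 + 604800, 3.0)]
base = build_baseline(history)
print(deviation(base, 345600 + 2 * 604800, 4.5))  # 4.5 - avg(2.0, 3.0) = 2.0
```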
Use case example: Percentiles
25
Use case example: Percentiles
[Diagram: streaming writes to Bigtable; daily rollups of 5m averages feed a percentiles job; the Dataserver API (node.js) serves the results.]

Bigtable row rollup-month-5m::2017-08::$interface::in:
  1: 6 Gbps | 2: 5 Gbps | … | 8640: 2 Gbps
Bigtable row percentiles::2017-08::$interface::in:
  1 pct: 0.1 Gbps | 2 pct: 0.3 Gbps | … | 99 pct: 22.1 Gbps
26
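Computing a percentile row from the ~8640 five-minute averages in a month (288 per day) reduces to sorting and rank selection. A nearest-rank sketch; the production job’s interpolation method is an assumption here.

```python
def percentile(samples, p):
    """Nearest-rank percentile (p in 1..99) over rate samples.

    A sketch; the production job may use a different interpolation.
    """
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))  # nearest rank, 1-based
    return ordered[rank - 1]

# Stand-in for a month of 5-minute rate samples
samples = list(range(1, 101))
print(percentile(samples, 95))  # 95
print(percentile(samples, 99))  # 99
```

Pre-computing rows like percentiles::2017-08::$interface::in means the API never sorts thousands of samples at request time.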
Example: Computing Total Traffic

# Python Beam SDK
pipeline = beam.Pipeline('DirectRunner')
(pipeline
 | 'read' >> ReadFromText('./example.csv')
 | 'csv' >> beam.ParDo(FormatCSVDoFn())
 | 'ifName key' >> beam.Map(group_by_device_interface)
 | 'group by iface' >> beam.GroupByKey()
 | 'compute rate' >> beam.FlatMap(compute_rate)
 | 'timestamp key' >> beam.Map(lambda row: (row['timestamp'], row['rateIn']))
 | 'group by timestamp' >> beam.GroupByKey()
 | 'sum by timestamp' >> beam.Map(lambda rates: (rates[0], sum(rates[1])))
 | 'format' >> beam.Map(lambda row: '{},{}'.format(row[0], row[1]))
 | 'save' >> beam.io.WriteToText('./total_by_timestamp'))
pipeline.run()
Full code available at: http://x1024.net/blog/2017/05/chinog-flexible-network-analytics-in-the-cloud/
27
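The pipeline relies on helpers defined in the full code linked above (FormatCSVDoFn, group_by_device_interface, compute_rate). The sketch below illustrates what a compute_rate step must do: turn cumulative SNMP octet counters into bit rates, tolerating 64-bit counter wrap. It is a hypothetical stand-in, not the blog post’s actual implementation.

```python
COUNTER64_MOD = 2 ** 64

def compute_rate(keyed_rows):
    """Turn cumulative octet counters into bit rates for one interface.

    Hypothetical stand-in for the pipeline's compute_rate step; the real
    helper is in the linked blog post. Input: (key, rows) as produced by
    GroupByKey, each row having 'timestamp' and 'inOctets'.
    """
    key, rows = keyed_rows
    rows = sorted(rows, key=lambda r: r['timestamp'])
    for prev, cur in zip(rows, rows[1:]):
        delta = (cur['inOctets'] - prev['inOctets']) % COUNTER64_MOD  # wrap-safe
        seconds = cur['timestamp'] - prev['timestamp']
        yield {'timestamp': cur['timestamp'], 'rateIn': delta * 8 / seconds}

rows = [{'timestamp': 0, 'inOctets': 0}, {'timestamp': 30, 'inOctets': 300}]
print(list(compute_rate(('router1:xe-0/0/0', rows))))
# 300 octets in 30 s -> rateIn 80.0 bits/sec
```

Yielding multiple output rows per input group is why the pipeline uses FlatMap rather than Map for this step.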
Our Stack
● Apache Beam using Scio
● Google Cloud Platform
  ○ Dataflow
  ○ Bigtable
  ○ BigQuery
  ○ Pub/Sub
  ○ App Engine
● Languages
  ○ Scala
  ○ Javascript / Typescript
  ○ Python
28
Current Status & Future Plans

Current
Release candidate for SNMP data:
● Ingest to BigQuery is working
● Migration of historical data is complete
● Streaming ingest to Bigtable
● Early version of utilization visualization
● Simple data server can provide data to clients, but a gRPC API is coming
● Interface time series charts functional
29
Future
● More types of data:
  ○ Flow data
  ○ perfSONAR
● Machine learning
● Anomaly detection
● “Mash up” various data sources
Why not InfluxDB, Elastic or ${FAVORITE_DB}?
● We have a data processing problem, not a data storage problem per se.
  ○ Beam and the ecosystem around it give a huge amount of flexibility -- we can try new ideas as they occur to us
  ○ Ability to move to different platform components
  ○ Machine learning (TensorFlow and others)
● InfluxDB & Elastic
  ○ Require care and feeding -- have to think about disks and machines, etc.
  ○ At our last evaluation (a while ago now) InfluxDB wasn’t able to keep up with our load -- this may have changed, but the other benefits outweigh that.
  ○ Elastic doesn’t seem to be a good fit for long term storage -- everything is in the “hot” tier
30
Why the cloud? Why Google Cloud Platform?

Why the cloud?
● Focus on our problems, not on infrastructure
● Scalability without needing to own lots of systems
● Managed services for databases and compute

Why Google Cloud?
● Apache Beam was Google Dataflow when we first encountered it
● More cohesive ecosystem than AWS, in our experience
● Although we have used Google Cloud specific services, the approach is portable to other environments
31
Lessons learned / Life in the cloud / Good & Bad

The Good
● Not a silver bullet, but makes many things easier
● Scaling! We processed 9,902,585,175 data points in 3.5 hours
● Focus on your services, not on infrastructure
● Scio and Scala allow working at a high level of abstraction

The Not So Good
● GCP tech support is pretty bad
● Python is a second class citizen in Beam for now
● Scala is powerful but challenging at times
● Learning curve is pretty steep in places
32
Thank you!
Jon Dugan <[email protected]>

● MyESnet: https://my.es.net
● ESnet Open Source: http://software.es.net/
  ○ http://software.es.net/react-timeseries-charts/
  ○ http://software.es.net/pond/
  ○ http://software.es.net/react-network-diagrams/
● Scio: https://github.com/spotify/scio
● Beam: https://beam.apache.org
33
The ESnet netbeam team:
● Peter Murphy
● Monte Goode
● Sowmya Balasubramanian
● Scott Richmond