network visibility and control using industry standard sflow telemetry

71
Network visibility and control using industry standard sFlow telemetry Peter Phaal InMon Corp. March, 2016 Twitter: @sFlow Blog: blog.sflow.com San Francisco Network Visibility Meetup

Upload: pphaal

Post on 14-Feb-2017

541 views

Category:

Data & Analytics


4 download

TRANSCRIPT

Page 1: Network visibility and control using industry standard sFlow telemetry

Network visibility and control using industry standard sFlow telemetry

Peter Phaal InMon Corp.March, 2016

Twitter: @sFlow Blog: blog.sflow.com

San Francisco Network Visibility Meetup

Page 2: Network visibility and control using industry standard sFlow telemetry

Why monitor?

“If you can’t measure it, you can’t improve it”Lord Kelvin

Page 3: Network visibility and control using industry standard sFlow telemetry

Time

Capacity

Demand

Static provisioning

$ Unused capacity

$$$ Service failure

$$ Unused capacity

Page 4: Network visibility and control using industry standard sFlow telemetry

$$ Savings

Time

Capacity

Demand

Dynamic provisioning

Page 5: Network visibility and control using industry standard sFlow telemetry

Feedback control

Measure

Control

System

desired output

measured output

Page 6: Network visibility and control using industry standard sFlow telemetry

Controllability and Observability

Basic concept is simple, a stable feedback control system requires:1. ability to influence all important system states (controllable)2. ability to monitor all important system states (observable)

Page 7: Network visibility and control using industry standard sFlow telemetry

It’s hard to stay on the road if you can’t see the road, or keep to the speed limit without a speedometer

It’s hard to stay on the road or maintain speed if your brakes, engine or steering fail

Controllability and Observability driving example

Observability

Controllability

States location, speed, direction, ...Tule fog in California Central Valley

Page 8: Network visibility and control using industry standard sFlow telemetry

Effect of delay on stability

Measurement delay Planning delay

Time

Configuration delayDisturbance Response delay

EffectLoop delay

DDoS launched Identify target, attacker Black hole, mark, re-route? Switch CLI commands Route propagation Traffic dropped

Components of loop delay

e.g. Slow reaction time causes tired / drunk / distracted

driver to weave, very slow reaction time and they leave

the road

Page 9: Network visibility and control using industry standard sFlow telemetry

What is sFlow?

“In God we trust. All others bring data.”Dr. Edwards Deming

Page 10: Network visibility and control using industry standard sFlow telemetry

Industry standard measurement technology integrated in switcheshttp://sflow.org/

Page 11: Network visibility and control using industry standard sFlow telemetry

Open source agents for hosts, hypervisors and applications

Host sFlow project (http://sflow.net) is center of an ecosystem of related open source projects embedding sFlow in popular

operating systems and applications

Host agent extends network visibility into public / private cloud

Page 12: Network visibility and control using industry standard sFlow telemetry

Network (maintained by hardware in network devices)- MIB-2 ifTable: ifInOctets, ifInUcastPkts, ifInMulticastPkts, ifInBroadcastPkts, ifInDiscards, ifInErrors, ifUnkownProtos,

ifOutOctets, ifOutUcastPkts, ifOutMulticastPkts, ifOutBroadcastPkts, ifOutDiscards, ifOutErrors

Host (maintained by operating system kernel)- CPU: load_one, load_five, load_fifteen, proc_run, proc_total, cpu_num, cpu_speed, uptime, cpu_user, cpu_nice,

cpu_system, cpu_idle, cpu_wio, cpu_intr, cpu_sintr, interupts, contexts

- Memory: mem_total, mem_free, mem_shared, mem_buffers, mem_cached, swap_total, swap_free, page_in, page_out, swap_in, swap_out

- Disk IO: disk_total, disk_free, part_max_used, reads, bytes_read, read_time, writes, bytes_written, write_time

- Network IO: bytes_in, packets_in, errs_in, drops_in, bytes_out, packet_out, errs_out, drops_out

Application (maintained by application)- HTTP: method_option_count, method_get_count, method_head_count, method_post_count, method_put_count,

method_delete_count, method_trace_count, method_connect_count, method_other_count, status_1xx_count, status_2xx_count, status_3xx_count, status_4xx_count, status_5xx_count, status_other_count

- Memcache: cmd_set, cmd_touch, cmd_flush, get_hits, get_misses, delete_hits, delete_misses, incr_hits, incr_misses, decr_hists, decr_misses, cas_hits, cas_misses, cas_badval, auth_cmds, auth_errors, threads, con_yields, listen_disabled_num, curr_connections, rejected_connections, total_connections, connection_structures, evictions, reclaimed, curr_items, total_items, bytes_read, bytes_written, bytes, limit_maxbytes

Standard counters

Page 13: Network visibility and control using industry standard sFlow telemetry

Simple

- standard structures - densely packed blocks of counters

- extensible (tag, length, value)

- RFC 1832: XDR encoded (big endian, quad-aligned, binary) - simple to encode /decode

- unicast UDP transport

Minimal configuration

- collector address

- polling interval

Cloud friendly

- flat, two tier architecture: many embedded agents → central “smart” collector

- sFlow agents automatically start sending metrics on startup, automatically discovered

- eliminates complexity of maintaining polling daemons (and associated configurations)

Scaleable push protocol

Page 14: Network visibility and control using industry standard sFlow telemetry

• Counters tell you there is a problem, but not why.

• Counters summarize performance by dropping high cardinality attributes:

- IP addresses

- URLs

- Memcache keys

• Need to be able to efficiently disaggregate counter by attributes in order to understand root cause of performance problems.

• How do you get this data when there are millions of transactions per second?

Counters aren’t enough

Why the spike in traffic?(100Gbit link carrying 14,000,000 packets/second)

Page 15: Network visibility and control using industry standard sFlow telemetry

• Random sampling is lightweight

• Critical path roughly cost of maintaining one counter: if(--skip == 0) sample();

• Sampling is easy to distribute among modules, threads, processes without any synchronization

• Minimal resources required to capture attributes of sampled transactions

• Easily identify top keys, connections, clients, servers, URLs etc.

• Unbiased results with known accuracy

Break out traffic by client, server and port (graph based on samples from100Gbit link carrying 14,000,000 packets/second)

sFlow also exports random samples

Page 16: Network visibility and control using industry standard sFlow telemetry

Integrated data model

Packet Header

Source Destination

TCP/UDP Socket TCP/UDP Socket

MAC Address MAC Address

Sampled Packet Headers +

Forwarding State

I/F CountersNETWORK

HOST

CPU

Memory

I/O

Adapter MACs

APPLICATION

Sampled Transactions

Transaction Counters

TCP/UDP Socket

Independent agents sFlow analyzer joins data for integrated view

Page 17: Network visibility and control using industry standard sFlow telemetry

Embedded monitoring of all switches, all

Virtual Servers

ApplicationsApache/PHP

Tomcat/Java

Memcached

Virtual Network

Servers

Network

Embedded monitoring of all switches, all servers, all applications, all the time

Consistent measurements shared between multiple

management tools

Comprehensive data center wide visibility

Page 18: Network visibility and control using industry standard sFlow telemetry

Picking the right tools

“This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together.”Doug McIlroy

Page 19: Network visibility and control using industry standard sFlow telemetry

packets

decode hash sendflow cache flushsample

Flow Records

flow cache embedded on switchswitch

NetFlowIPFIX…

decode hash sendflow cache flush

Flow Records

packets

send

polli/f counters

sample

multiple switches export sFlowpackets

send

polli/f counters

sample

...

centralized software flow cache

switch

switchJSON/RESTNetFlowIPFIX…

• Reduce ASIC cost / complexity • Fast response (data not sitting on switch) • Centralized, network-wide visibility • Increase flexibility → software defined analytics

Move flow cache from ASIC to external software

Scale-out alternative to SNMP polling

Traffic analytics with sFlow

Page 20: Network visibility and control using industry standard sFlow telemetry

sFlow-RT.com analytics engine

• Low latency flow analytics for real-time control applications

• Disaggregates flow cache from database. Choose external database(s) for history (InfluxDB, Logstash, etc.)

• Programmable analytics pipeline through open APIs

Page 21: Network visibility and control using industry standard sFlow telemetry

RESTful API for defining flows

http://blog.sflow.com/2013/08/restflow.html

curl -H "Content-Type:application/json" -X PUT —data \ '{"keys":"ipsource,ipdestination,tcpsourceport,tcpdestinationport", \"value":"bytes", "ipfixCollectors":["10.0.0.1"]}'\http://127.0.0.1:8008/flow/tcp/json

curl -H "Content-Type:application/json" -X PUT --data \'{"keys":"ipdestination,icmpunreachableport", "value":"frames"}' \http://127.0.0.1:8008/flow/unreachableport/json

• Instantly enables network wide monitoring of flows • All switches, all ports, including hosts and virtual switches • Contrast with task of re-configuring Flexible Netflow/IPFIX caches on

every switch in multi-vendor network. How many simultaneous flow definitions are allowed? What key / value combinations are allowed?

curl -H "Content-Type:application/json" -X PUT --data \'{"value":"frames"}' \http://127.0.0.1:8008/flow/frames/json

Page 22: Network visibility and control using industry standard sFlow telemetry

InMon sFlow-RT

active timeout active timeout

NetFlow

Open vSwitch

SolarWinds Real-Time NetFlow Analyzer

• sFlow does not use flow cache, so realtime charts more accurately reflect traffic trend • NetFlow spikes caused by flow cache active-timeout for long running connections

Rapid detection of large flows

Flow cache active timeout delays large flow detection,limits value of signal for real-time control applications

Page 23: Network visibility and control using industry standard sFlow telemetry

Counters and packet samples

http://blog.sflow.com/2013/02/measurement-delay-counters-vs-packet.html

• Packet samples give a fast signal that operates at scale • Counters are maintained in hardware and provide precise traffic totals. • Counters capture rare events, like packet discards, that can severely

impact performance. • Counters report important link state information, like link speed, LAG

group membership etc.

Page 24: Network visibility and control using industry standard sFlow telemetry

Metrics and Events

Metrics Events

Sources SNMP, sFlow, collectdTraps, syslog, IPFIX,

NetFlow

Crossover Event count Threshold event

Collectors InfluxDB, OpenTSDB, Graphite, rrdtool

Logstash, flowtools, Splunk

Application Performance Security and alerts

Page 25: Network visibility and control using industry standard sFlow telemetry

Data models and transportssFlow SNMP NetFlow

version 5 OpenConfig Telemetry IPFIX syslog

Model

standard measurements published by sFlow.org, Dataplane

focus: based on IEEE, IETF, APIs (MIB-2,

LAG-MIB, libvirt, JMX, …)

standard MIBs

defined by IETF

standard tcp / udp / icmp flow

record defined by

Cisco

Telemetry defined as part of YANG

configuration models by

OpenConfig.orgControl plane focus: BGP,

MPLS, VLAN, etc.

Encoding XDR (RFC 4506)

ASN1 (IETF)

NetFlow (Cisco)

protobufs, JSON,

NetConf

IPFIX (IETF)

Syslog (RFC 5424)

Transport UDP UDP UDP UDP, HTTPSCTP, TCP, UDP

UDP

Mode Push Pull Push Push Push Push

Easy to combine multiple data sources if you disaggregate tool chain e.g. separate agents from collectors, feed data from all sources into InfluxDB / Logstash

Page 26: Network visibility and control using industry standard sFlow telemetry

sflowtool

https://github.com/sflow/sflowtool

sflowtool

replicate

asciiPerl, Python, awk, grep

pcap wireshark, tcpdumpNetflow

Open source command line tool

Page 27: Network visibility and control using industry standard sFlow telemetry

Network visibility for DevOps tools

• Streaming filtering and summarization reduces data volume and increases scaleability of backend tools

• Streaming flow analytics to generate application metrics

Page 28: Network visibility and control using industry standard sFlow telemetry

Feedback control of cloud infrastructure

“You can’t control what you can’t measure”Tom DeMarco

Page 29: Network visibility and control using industry standard sFlow telemetry

Cloud depends on network

• Server costs (both capex and power consumption) far exceed networking costs in the data center.

• Network congestion caused server to wait, resulting in poor utilization of cloud infrastructure.

• Optimize network to increase data center efficiency

http://perspectives.mvdirona.com/2010/09/overall-data-center-costs/

“Typically the resource that is most scarce is the network.” Amin Vahdat, Google, ONS2015 Keynote

http://blog.sflow.com/2015/06/optimizing-software-defined-data-center.html

Page 30: Network visibility and control using industry standard sFlow telemetry

http://blog.sflow.com/2013/02/sdn-and-large-flows.html

Elephant flows are the small number of long lived large flows responsible

for majority of bytes on network

Large “Elephant” flows

Page 31: Network visibility and control using industry standard sFlow telemetry

ECMP visibility and control

Page 32: Network visibility and control using industry standard sFlow telemetry

Default sFlow settings

Port Speed Large Flow Sampling Rate Polling Interval

1 Gbit/s >= 100 Mbit/s 1-in-1,000 30 seconds

10 Gbit/s >= 1 Gbit/s 1-in-10,000 30 seconds

25 Gbit/s >= 2.5 Gbit/s 1-in-25,000 30 seconds

40 Gbit/s >= 4 Gbit/s 1-in-40,000 30 seconds

50 Gbit/s >= 5 Gbit/s 1-in-50,000 30 seconds

100 Gbit/s >= 10 Gbit/s 1-in-100,000 30 seconds

Supports real-time detection of large “Elephant” flows where large flow is defined as flow consuming >10% of link bandwidth for >1 second

http://blog.sflow.com/2013/06/large-flow-detection.html

Page 33: Network visibility and control using industry standard sFlow telemetry

Elephant flow collisions

http://blog.sflow.com/2015/03/ecmp-visibility-with-cumulus-linux.html

Page 34: Network visibility and control using industry standard sFlow telemetry

Elephant flow collisions

http://blog.sflow.com/2015/03/ecmp-visibility-with-cumulus-linux.html

Page 35: Network visibility and control using industry standard sFlow telemetry

ECMP monitoring challenge

• large number of links, 12 x 10G links

• all links need to be monitored continuously, 180G total bandwidth

• real-time detection of congested links

• real-time detection of Elephant flows

http://blog.sflow.com/2015/03/ecmp-visibility-with-cumulus-linux.html

Page 36: Network visibility and control using industry standard sFlow telemetry

Fabric level performance metrics

• Fabric View application runs on sFlow-RT

• Downloadable from sFlow-RT.com, includes captured data set from 4 node 10G ECMP fabric

• Elephant collisions on spine links occur frequently

• Collisions halve throughput

• Collisions cause packet discards

http://blog.sflow.com/2015/10/fabric-view.html

Page 37: Network visibility and control using industry standard sFlow telemetry

http://www.slideshare.net/martin_casado/elephants-and-mice-elephant-detection-in-the-vswitch-with-hardware-handling

Large “Elephant” flows delay small “Mice” flows

Separating Elephants and Mice into different queues significantly improves latency and response time

Large flow marking

Page 38: Network visibility and control using industry standard sFlow telemetry

http://blog.sflow.com/2014/06/restful-control-of-cumulus-linux-acls.html

Large flow marking

Page 39: Network visibility and control using industry standard sFlow telemetry

ONS2015 SDN ShowcaseCORD: Open-source spine-leaf fabric

Page 40: Network visibility and control using industry standard sFlow telemetry

http://blog.sflow.com/2015/06/leaf-and-spine-traffic-engineering.html

CORD: Open-source spine-leaf fabric

RackRack

RackRack

Spine

http://blog.sflow.com/2015/08/cord-open-source-spine-leaf-fabric.html

Page 41: Network visibility and control using industry standard sFlow telemetry

SDN Central Demo Friday

SDN Optimization of Hybrid Packet / Optical Data Center Fabric

Page 42: Network visibility and control using industry standard sFlow telemetry

http://blog.sflow.com/2014/09/sdn-control-of-hybrid-packet-optical.html

Cumulus, Calient and InMon PoC. Large flows detected in real-time and diverted to direct ToR to ToR optical circuit

SDN Optimization of Hybrid Packet / Optical Data Center Fabric

Page 43: Network visibility and control using industry standard sFlow telemetry

Software Defined Internet Router (SIR)

Page 44: Network visibility and control using industry standard sFlow telemetry

https://eos.arista.com/spotifys-sdn-internet-router/

IXP Routes Installed Routes not installedLINX 18672 95711

DECIX 12518 108164EQIX-ASH 27687 75146

https://eos.arista.com/arista-eos-bgp-selective-route-download/

Page 45: Network visibility and control using industry standard sFlow telemetry

White box Internet router PoC

Page 46: Network visibility and control using industry standard sFlow telemetry

Internet Router PoC

http://blog.sflow.com/2015/07/white-box-internet-router-poc.html

Peer 1 Peer 2 Peer N

DDoS Mitigation

REST APIsFlow-RT controller

real-time analytics OpenFlow, REST, BGP, …

Large flow marking

Routing Offload

BGP

BGP

Router

Switch

http://blog.sflow.com/2015/10/active-route-manager.html

Page 47: Network visibility and control using industry standard sFlow telemetry

Open Networking Summit SDN Idol winning solution

Real-time SDN Analytics for DDoS mitigation

Page 48: Network visibility and control using industry standard sFlow telemetry

DDoS Mitigation Market Opportunity

DDoS Attack Megatrends [Reference 1]

• High bandwidth, volumetric infrastructure layer (Layer 3 & 4) attacks increased approximately 30 percent

• DDoS attack volume also increased month-to-month in 2013, with 10 out of 12 months showing higher attack volume compared to 2012

• Average DDoS attack sizes continued to increase – many over 100 Gbps, the largest peaking at 179 Gbps

DDoS Mitigation Market Growth

• $870M market by 2017, 18.2% CAGR – Source: IDC: Worldwide DDoS Prevention Products and Services 2013-2017 Forecast

• $1049M market by 2017, 25% CAGR – Source: Infonetics: Global DDoS Prevention Appliances 2012-2017 Forecast

Reference 1: Top DDoS Attack Trends http://www.itbriefcase.net/top-ddos-attack-trends-for-2013

Page 49: Network visibility and control using industry standard sFlow telemetry

DDoS Mitigation Use Case (1)

ISP 1

ISP 2

ISP N

• ISP/IX is uniquely positioned to protect customers from DDoS flood attacks• New revenue from DDoS mitigation service + differentiates ISP/IX service

Attacker

User Prevent attack from overwhelming customer access link

Filter attack traffic in real-time

Customer network

DDoS target hostAttack on single host can take out entire customer data center. Customer cannot mitigate flood attack without upstream help

ISP / IX

ISP/IX Market Segment

Page 50: Network visibility and control using industry standard sFlow telemetry

Customer portal

DDoS Mitigation Service

Web UI + RESTful programmatic API• real-time TopN analytics• programmable filtering of traffic• set thresholds + automatic blocking

Real-time sFlow visibility, Hybrid OpenFlow Control capability of Brocade switches/routers

REST API

InM

on s

Flow

-RT

REST API

Ope

nFlo

w C

ontr

olle

r

DDoS Mitigation Application

Customer Network

Internet

1. Flood attack overloads customer port

2. Attack maps to large flows [Ref. 2]. sFlow-RT detects attack (maps to large flows) and characterizes attack (srcip, dstip, protocol, ports, etc.)

3. mitigation application takes signature, applies customer policy, selects optimal control and push OpenFlow rule(s) to switch(es)

5. OpenFlow rule(s) applied to switch forwarding path to drop / mark traffic and protect link

HTTPS HTTPS

4. Controller pushes OpenFlow rule(s) to switch(es)

OpenFlow 1.3 Match Fieldsline rate filtering using Brocade switches

Reference 2: IETF I2RS Working Group Draft - https://ietf.org/doc/draft-krishnan-i2rs-large-flow-use-case/

Page 51: Network visibility and control using industry standard sFlow telemetry

Demonstration

http://blog.sflow.com/2014/03/ons2014-sdn-idol-finalist-demonstrations.html

Page 52: Network visibility and control using industry standard sFlow telemetry

DDoS Mitigation

Page 53: Network visibility and control using industry standard sFlow telemetry

Cloudflare DDoS mitigation

http://blog.sflow.com/2016/02/cloudflare-ddos-mitigation-pipeline.html

• sFlow used to identify attack signature • BPF created to match signature • Detailed signatures minimize collateral

damage

Page 54: Network visibility and control using industry standard sFlow telemetry

Big Switch Network Webinar

Big Tap sFlow: Enabling Pervasive Flow-level Visibility

Page 55: Network visibility and control using industry standard sFlow telemetry

Pervasive monitoring

http://blog.sflow.com/2015/04/big-tap-sflow-enabling-pervasive-flow.html

Page 56: Network visibility and control using industry standard sFlow telemetry

Targeted capture

http://blog.sflow.com/2015/04/big-tap-sflow-enabling-pervasive-flow.html

Page 57: Network visibility and control using industry standard sFlow telemetry

Comments

• sFlow instrumentation is widely available in switches http://sflow.org/products/network.php

• Host sFlow (sFlow.net) agent extends visibility into servers (works with libpcap, iptables, Open vSwitch to efficiently sample packets in host data plane)

• Common data model ensures strong interoperability across sFlow data sources

• Streaming counter and packet telemetry across network, compute and application tiers makes data center observable

• Observability makes it possible to apply feedback controls

Page 58: Network visibility and control using industry standard sFlow telemetry

Extra slides

Page 59: Network visibility and control using industry standard sFlow telemetry

Host sFlow monitoring of Linux datapath

Technology Reference

Adapter, bridge, macvlan, ipvlan

Berkeley Packet Filter (BPF) sampling

function

http://blog.sflow.com/2016/02/linux-bridge-

macvlan-ipvlan-adapters.html

Open vSwitch Kernel datapath has sFlow support

http://openvswitch.org/support/config-

cookbooks/sflow/

Linux Firewalliptables statistic module random

function with ulog

http://blog.sflow.com/2010/12/ulog.html

Top of Rack Switch

ASIC provides wirespeed monitoring

of attached servers

http://blog.sflow.com/2010/04/hybrid-server-

monitoring.html

Efficient monitoring of high traffic production workloads

Page 60: Network visibility and control using industry standard sFlow telemetry

Open vSwitch Fall Conference

New OVS instrumentation features aimed at real-time monitoring of virtual networks

Page 61: Network visibility and control using industry standard sFlow telemetry

Overlay / Underlay Visibility

http://blog.sflow.com/2015/11/network-virtualization-visibility-demo.html

Page 62: Network visibility and control using industry standard sFlow telemetry

Open vSwitch Fall Conference

OVN service injection demonstration

Page 63: Network visibility and control using industry standard sFlow telemetry

Service injection

http://blog.sflow.com/2015/11/ovn-service-injection-demonstration.html

Page 64: Network visibility and control using industry standard sFlow telemetry

White Paper

Actionable Intelligence in the SDN Ecosystem: Optimizing Network Traffic through FRSA

Page 65: Network visibility and control using industry standard sFlow telemetry

WAN Optimization

http://blog.sflow.com/2015/06/wan-optimization-using-real-time.html

Page 66: Network visibility and control using industry standard sFlow telemetry

Dell NFV Summit 2015

Demystifying NFV Infrastructure Hotspots End-to-end Monitoring, Analytics & SDN Control

Page 67: Network visibility and control using industry standard sFlow telemetry

Demystifying NFV Infrastructure Hotspots

http://blog.sflow.com/2016/01/demystifying-nfv-infrastructure-hotspots.html

Page 68: Network visibility and control using industry standard sFlow telemetry

Hybrid OpenFlow ECMP testbed

Page 69: Network visibility and control using industry standard sFlow telemetry

Hybrid OpenFlow ECMP testbed

http://blog.sflow.com/2015/01/hybrid-openflow-ecmp-testbed.html

http://mininet.org/

• Simulated ECMP network for developing visibility and control applications

• sFlow support in Open vSwitch

• OpenFlow for control

Page 70: Network visibility and control using industry standard sFlow telemetry

The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and Applications

2012 Velocity Conference

Page 71: Network visibility and control using industry standard sFlow telemetry

Tagged.com case study

http://blog.sflow.com/2013/04/velocity-conference-talk.html