1
Root-Cause Network Troubleshooting: Optimizing the Process
Tim Titus, CTO, PathSolutions
2
• Business disconnect
• Why is troubleshooting so hard?
• Troubleshooting methodology
• Tool selection
• Finding the root-cause
• Achieving Total Network Visibility
Agenda
3
• You’re responsible for the entire network
• Most network engineers know less about their network’s health and performance than their user community
“You can’t manage what you can’t measure” -- Peter Drucker
Business Disconnect
4
Business Reasons
• Networks are getting more complex
• Fewer staff remain to support the network
Technical Reasons
• Proper methodology is not utilized
• Wrong tools are employed
Why is Troubleshooting so Hard?
5
What graduates a junior-level engineer to a senior-level engineer is their troubleshooting methodology
Troubleshooting Methodologies
6
“Do something to try to fix the problem”
• Reboot the device
• Change the network settings
• Replace hardware
• Re-install the OS
Bad Methodology
7
Collect information
Create hypothesis
Test hypothesis
Implement fix
Verify original problem is solved and no new problems exist
Undo changes (when verification fails)
Document fix
Notify users
Good Methodology
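The workflow on this slide can be read as a loop over candidate causes. The Python sketch below is one possible reading, not a prescribed implementation: the function name `troubleshoot`, its callback parameters, and the toy scenario are all hypothetical and exist only to show the order of the steps, including undoing a change that does not pass verification.

```python
from typing import Callable, Iterable, List


def troubleshoot(hypotheses: Iterable[str],
                 test: Callable[[str], bool],
                 implement: Callable[[str], None],
                 verify: Callable[[], bool],
                 undo: Callable[[str], None]) -> List[str]:
    """Try hypotheses in order; keep a fix only if verification passes."""
    log = ["collect information"]
    for h in hypotheses:
        log.append(f"create hypothesis: {h}")
        if not test(h):                      # evidence does not support it
            log.append(f"hypothesis rejected: {h}")
            continue
        implement(h)
        log.append(f"implement fix: {h}")
        if verify():                         # original problem gone, nothing new broken
            log.append(f"document fix: {h}")
            log.append("notify users")
            return log
        undo(h)                              # never leave a failed change in place
        log.append(f"undo changes: {h}")
    log.append("no fix verified; collect more information")
    return log


# Toy scenario (entirely made up): the real cause is a duplex mismatch.
applied: List[str] = []
log = troubleshoot(
    hypotheses=["server overloaded", "bad patch cable", "duplex mismatch"],
    test=lambda h: h != "server overloaded",
    implement=applied.append,
    verify=lambda: "duplex mismatch" in applied,
    undo=applied.remove,
)
print("\n".join(log))
```

In the toy run, the cable swap is implemented, fails verification, and is undone before the next hypothesis is tried, which is the main difference from the "do something" approach on the previous slide.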
8
Types of Tools
• Cable Testers
• Packet analyzers/capture
• Application Performance Monitoring (APM)
• Flow collectors
• SNMP Collectors
Tool Selection
9
Using a cable tester to solve a call quality problem
[Network diagram: actual VoIP call; cable tester attached to one link]
Cable tester results:
• 4.3 dB of loss
• NEXT detected
You have information about Layer 1 on one link in the network
Cable Testers
10
Good for:
• Confirming physical issues on one link in the network
Bad for:
• Finding physical issues on the network
• Determining application usage
• Finding bandwidth limitations
• Finding device limitations
Cable Testers
11
Using a sniffer to solve a call quality problem
[Network diagram: actual VoIP call; packet capture of the call]
Results of VoIP call:
• Latency: 127 ms
• Jitter: 87 ms
• Packet loss: 8.2%
You have confirmation that there is a problem, but no idea which device or link caused the packet loss
Packet Capture
12
Good for:
• Confirming packet loss (Are we missing packets?)
• Confirming packet contents issues (No QoS tagging on packets when there should be)
• Determining application-level issues (Source and destination IP and ports used for a session)
Bad for:
• Finding physical, data-link, or network issues
• Finding bandwidth limitations
• Finding device limitations
Packet Capture
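To make "confirming packet loss" concrete, here is a minimal Python sketch of how loss and jitter figures like the ones on the previous slide could be derived from a capture. It assumes each captured packet has already been reduced to a sequence number, a send time, and a receive time (in practice the send time would come from the RTP timestamp); the `Packet` shape, function names, and sample values are illustrative. Jitter uses the running estimate defined in RFC 3550, and sequence-number wraparound is ignored for brevity.

```python
from typing import List, Tuple

# (RTP sequence number, send time in seconds, receive time in seconds)
Packet = Tuple[int, float, float]


def loss_percent(packets: List[Packet]) -> float:
    """Share of sequence numbers missing between the first and last seen."""
    seqs = {seq for seq, _, _ in packets}
    expected = max(seqs) - min(seqs) + 1
    return 100.0 * (expected - len(seqs)) / expected


def jitter_ms(packets: List[Packet]) -> float:
    """Smoothed interarrival jitter, using the RFC 3550 running estimate."""
    jitter = 0.0
    ordered = sorted(packets, key=lambda p: p[2])        # arrival order
    for (_, s0, r0), (_, s1, r1) in zip(ordered, ordered[1:]):
        d = (r1 - r0) - (s1 - s0)                        # change in transit time
        jitter += (abs(d) - jitter) / 16.0
    return jitter * 1000.0


capture: List[Packet] = [
    (1, 0.00, 0.050),
    (2, 0.02, 0.070),
    (4, 0.06, 0.126),   # sequence 3 never arrived; this one came 16 ms late
]
print(f"loss {loss_percent(capture):.1f}%  jitter {jitter_ms(capture):.2f} ms")
```

The sample reports 25% loss and roughly 1 ms of jitter. As the slide notes, numbers like these confirm the symptom but say nothing about which device or link along the path dropped or delayed the packets.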
13
Using APM to determine performance through the network
[Network diagram: simulated VoIP call between two agents]
Results of simulation:
• Latency: 127 ms
• Jitter: 87 ms
• Packet loss: 8.2%
You have knowledge of the experience across the network, but no understanding of the source or cause of the problem.
Application Performance Monitoring
14
Good for:
• Measuring user experience across the network (Are we having problems right now?)
Bad for:
• Finding physical, data-link, or network issues
• Finding bandwidth limitations
• Finding device limitations
Application Performance Monitoring
15
Using a flow collector to determine usage of the network
[Network diagram: actual VoIP call; flow record sent to a flow collector]
Results of flow:
• Source IP: 192.168.1.12:80
• Destination IP: 172.16.3.98:3411
• Packets: 251
• Bytes: 19,386
You have knowledge of a transfer across the network, but no indication of whether there were any problems with the transfer.
Flow Collectors
16
Good for:
• Determining communications across the network
Who is using a link?
When do they use it?
What do they use it for?
Bad for:
• Finding physical, data-link, or network issues
• Finding bandwidth limitations
• Finding device limitations
Flow Collectors
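The "who is using a link, and for what" questions amount to grouping flow records and summing bytes. The sketch below is a hedged illustration in Python: the `Flow` record shape and the `bytes_by` helper are assumptions rather than any collector's actual schema, and only the first record comes from the earlier slide; the other two are invented for the example.

```python
from collections import Counter
from typing import List, NamedTuple, Tuple


class Flow(NamedTuple):
    src: str
    src_port: int
    dst: str
    dst_port: int
    packets: int
    octets: int


flows: List[Flow] = [
    Flow("192.168.1.12", 80, "172.16.3.98", 3411, 251, 19386),    # from the slide
    Flow("192.168.1.12", 80, "172.16.3.40", 3502, 900, 1200000),  # invented
    Flow("192.168.1.30", 5060, "172.16.3.98", 5060, 40, 8000),    # invented
]


def bytes_by(records: List[Flow], field: str) -> List[Tuple[object, int]]:
    """Total bytes per value of one flow field, largest first."""
    totals: Counter = Counter()
    for record in records:
        totals[getattr(record, field)] += record.octets
    return totals.most_common()


print("by source host:", bytes_by(flows, "src"))
print("by source port:", bytes_by(flows, "src_port"))
```

Note that nothing in a flow record says whether the transfer suffered loss or delay, which is why flow data alone cannot locate a physical or data-link fault.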
17
Collecting information from switches and routers to discover faults
[Network diagram: actual VoIP call; SNMP collector polling switches and routers]
Results of collection:
• WAN link is overloaded at 2:35pm
You have data about conditions on some parts of the network, but no analysis of the problem or correlation to events
SNMP Collectors
18
Good for:
• Tracking packet loss per interface/device (Are we dropping packets on a link? Why?)
• Monitoring device and link resource limitations (Are we over-utilizing a link? Is the router CPU pegged?)
Bad for:
• Determining who is using the network
• Finding application layer problems
SNMP Collectors
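Answering "are we over-utilizing a link?" from SNMP data comes down to comparing two samples of an octet counter. The Python sketch below shows that arithmetic only; it performs no SNMP queries, and the function names and sample numbers are illustrative. It assumes the values were read from standard IF-MIB objects such as ifOutOctets (a 32-bit counter that can wrap), ifSpeed, and ifOutDiscards.

```python
def counter_delta(prev: int, now: int, bits: int = 32) -> int:
    """Difference between two counter samples, allowing for one wrap."""
    return (now - prev) % (1 << bits)


def utilization_percent(prev_octets: int, now_octets: int,
                        interval_s: float, speed_bps: int,
                        bits: int = 32) -> float:
    """Average link utilization over the polling interval."""
    delta_bits = counter_delta(prev_octets, now_octets, bits) * 8
    return 100.0 * delta_bits / (interval_s * speed_bps)


def discard_percent(delta_discards: int, delta_forwarded: int) -> float:
    """Share of packets discarded out of all packets offered to the interface."""
    total = delta_discards + delta_forwarded
    return 100.0 * delta_discards / total if total else 0.0


# A 10 Mb/s WAN link polled 300 seconds apart; the counter wrapped in between.
print(utilization_percent(4_294_000_000, 374_032_704, 300, 10_000_000))
print(discard_percent(15, 85))
```

A long polling interval averages out short bursts, so a link can show moderate utilization here and still drop packets during brief peaks.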
19
[Network diagram: poor-quality VoIP call]
Step 1: Identify the involved endpoints and where they are connected into the network
Finding the Root-Cause
20
[Network diagram]
Step 2: Identify the full layer-2 path through the network from the first phone to the second phone
Finding the Root-Cause
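Step 2 can be sketched as a graph search once the active links between devices are known. The Python example below is a minimal illustration under stated assumptions: the `links` list is hypothetical and is taken to contain only links that are actually forwarding (in a real network this would be derived from neighbor and bridge forwarding tables, with spanning-tree blocked ports excluded), and all device names are made up.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple


def l2_path(links: List[Tuple[str, str]], src: str, dst: str) -> Optional[List[str]]:
    """Breadth-first search for the hop-by-hop path between two endpoints."""
    neighbors: Dict[str, Set[str]] = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in sorted(neighbors.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


links = [
    ("phone-a", "access-sw-1"),
    ("access-sw-1", "core-sw"),
    ("core-sw", "access-sw-2"),
    ("core-sw", "wan-router"),
    ("access-sw-2", "phone-b"),
]
print(l2_path(links, "phone-a", "phone-b"))
```

Every device and interface on the returned path becomes a candidate for the health and interface checks in the following steps.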
21
[Network diagram]
Step 3: Investigate involved switch and router health (CPU & Memory) for acceptable levels
Finding the Root-Cause
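Step 3 is a threshold check applied to each device on the path. The Python sketch below shows one way to express it; the 80% CPU and 85% memory limits are assumed values chosen for illustration (the slides do not define "acceptable levels"), and the device names and readings are invented.

```python
from typing import Dict, List

CPU_LIMIT = 80.0      # percent; assumed threshold, not from the slides
MEMORY_LIMIT = 85.0   # percent; assumed threshold, not from the slides


def health_problems(path: List[str],
                    stats: Dict[str, Dict[str, float]]) -> List[str]:
    """List every device on the path that is over a limit or has no data."""
    problems: List[str] = []
    for device in path:
        sample = stats.get(device)
        if sample is None:
            problems.append(f"{device}: no health data collected")
            continue
        if sample["cpu"] > CPU_LIMIT:
            problems.append(f"{device}: CPU at {sample['cpu']:.0f}%")
        if sample["memory"] > MEMORY_LIMIT:
            problems.append(f"{device}: memory at {sample['memory']:.0f}%")
    return problems


path = ["access-sw-1", "core-sw", "access-sw-2"]
stats = {
    "access-sw-1": {"cpu": 12.0, "memory": 40.0},
    "core-sw": {"cpu": 97.0, "memory": 61.0},
}
print(health_problems(path, stats))
```

Treating a device with no data as a finding is deliberate: an unmonitored hop on the path is a blind spot, not a healthy device.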
22
[Network diagram]
Step 4: Investigate involved interfaces for:
• VLAN assignment
• DiffServ/QoS tagging
• Queuing configuration
• 802.1p priority settings
• Duplex mismatches
• Cable faults
• Half-duplex operation
• Broadcast storms
• Incorrect speed settings
• Over-subscription
TRANSIENT PROBLEM WARNING: If the error condition is no longer occurring when this investigation is performed, you may not catch the problem
Finding the Root-Cause
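A few of the Step 4 checks can be expressed as rules over interface error counters. The Python sketch below is a rough illustration, not a complete diagnostic: the `InterfaceDelta` shape is hypothetical, and the rules rely on commonly cited symptoms (late collisions on the half-duplex side of a duplex mismatch, FCS errors from cabling faults or a mismatched far end, output discards under over-subscription). It works on counter increases between two polls rather than lifetime totals, which is one way to reduce the transient-problem risk noted on the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InterfaceDelta:
    """Counter increases between two polls of one interface (hypothetical shape)."""
    name: str
    duplex: str            # "full" or "half"
    late_collisions: int
    fcs_errors: int
    out_discards: int


def findings(d: InterfaceDelta) -> List[str]:
    """Map non-zero counter increases to the likely causes they suggest."""
    notes: List[str] = []
    if d.duplex == "half":
        notes.append("half-duplex operation")
    if d.late_collisions > 0:
        notes.append("late collisions: possible duplex mismatch")
    if d.fcs_errors > 0:
        notes.append("FCS errors: possible cable fault or far-end duplex mismatch")
    if d.out_discards > 0:
        notes.append("output discards: possible over-subscription")
    return notes


uplink = InterfaceDelta("Gi0/1", "half", late_collisions=12, fcs_errors=0, out_discards=0)
quiet = InterfaceDelta("Gi0/2", "full", 0, 0, 0)
for iface in (uplink, quiet):
    print(iface.name, findings(iface) or "no errors since last poll")
```

If the counters are only read once, after the complaint, an interface like `quiet` may simply have stopped erroring; keeping a history of deltas is what makes an earlier fault visible.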
23
In a perfect world, you want:
• Monitoring of every switch, router, and link in the entire infrastructure; all error counters on the interfaces; and QoS configuration and performance
• Continuous collection of information
• Automatic layer-1, 2, and 3 mapping from any IP endpoint to any other IP endpoint
• Problems identified in plain English for rapid remediation
This is what PathSolutions TotalView does
Optimizing the Methodology
24
Install TotalView
All switches and routers are queried for information
Result: One location is able to monitor all devices and links in the entire network for performance and errors
Deployment
25
• Broad: All ports on all routers & switches
• Continuous: Health collected every 5 minutes
• Deep: 18 different error counters collected and analyzed
• Network Prescription engine provides plain-English descriptions of errors:
“This interface is dropping 12% of its packets due to a cable fault”
Total Network Visibility®
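As a rough illustration of turning counters into a plain-English sentence like the one quoted on this slide, consider the Python sketch below. It is not how TotalView's engine works, which the slides do not describe; the `CAUSES` mapping from counter name to cause, the `prescribe` function, and the sample numbers are all assumptions made for the example.

```python
from typing import Dict

# Assumed mapping from an error counter to a likely cause (illustrative only).
CAUSES = {
    "fcs_errors": "a cable fault",
    "alignment_errors": "alignment errors",
    "late_collisions": "a duplex mismatch",
}


def prescribe(good_packets: int, errors: Dict[str, int]) -> str:
    """Describe the dominant error on an interface as a loss percentage."""
    total = good_packets + sum(errors.values())
    if total == 0 or not any(errors.values()):
        return "This interface is healthy"
    worst = max(errors, key=errors.get)          # counter with the most errors
    pct = 100.0 * errors[worst] / total
    cause = CAUSES.get(worst, worst)
    return f"This interface is dropping {pct:.0f}% of its packets due to {cause}"


print(prescribe(880, {"fcs_errors": 120, "alignment_errors": 0}))
print(prescribe(1000, {"fcs_errors": 0}))
```

The value of this kind of output is that the counter name, the magnitude, and the probable cause arrive together, so the reader does not have to interpret raw counters under time pressure.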
26
Establish Baseline of Network Health
7% loss from cabling fault
12% loss from alignment errors
11% loss from jumbo frame misconfiguration
28% loss from duplex mismatch
Results Within 12 Minutes
27
Repair Issues
7% loss from cabling fault
12% loss from alignment errors
11% loss from jumbo frame misconfiguration
28% loss from duplex mismatch
Results Within 12 Minutes
28
7:56am: 18% loss from cable fault
11:32am: 100% transmit utilization, 15% loss from discards; latency & jitter penalty incurred
12:02pm: 12% loss from collisions
Path Analysis Report
29
Demo
30
Don’t turtle your network
31
With it, you will always have an easy way to map out your network on any whiteboard!
Free Network Equipment Magnet Set
www.PathSolutions.com