Nokia - Proprietary and confidential. Use pursuant to applicable agreements.

7450 Ethernet Service Switch
7750 Service Router

TROUBLESHOOTING GUIDE

3HE 11475 AAAA TQZZA 01

Issue: 01

December 2016


Nokia is a registered trademark of Nokia Corporation. Other products and company names mentioned herein may be trademarks or tradenames of their respective owners.

The information presented is subject to change without notice. No responsibility is assumed for inaccuracies contained herein.

© 2016 Nokia.

Contains proprietary/trade secret information which is the property of Nokia and must not be made available to, or copied or used by anyone outside Nokia without its written authorization. Not to be used or disclosed except in accordance with applicable agreements.


Table of Contents

1 Getting Started
1.1 About This Guide
1.1.1 Audience
1.2 In This Chapter
1.3 Nokia SR-Series Troubleshooting Process Overview

2 Troubleshooting Packet Loss
2.1 In This Chapter
2.2 Packet Loss Troubleshooting Flowchart
2.3 To Troubleshoot Packet Loss in a Live Network
2.4 To Troubleshoot Packet Loss in a Lab Environment

3 Troubleshooting XPL Data Bus Errors
3.1 In This Chapter
3.2 XPL Data Bus Errors Overview
3.3 Detecting XPL Data Bus Errors Between the IOM and MDA
3.3.1 SNMP Trap Information
3.3.2 CLI Statistics
3.4 XPL Error Troubleshooting Flowchart
3.5 To Troubleshoot XPL Errors

4 Troubleshooting Pchip Parity Alarms
4.1 In This Chapter
4.2 Pchip Memory Parity Error Overview
4.3 Pchip Memory Parity Error Detection and Impact
4.4 Pchip Memory Parity Alarm Troubleshooting Flowchart
4.5 Pchip Parity Alarms Description
4.6 Pchip Alarm Sample Reports

5 Troubleshooting Ingress/Egress FCS Errors
5.1 In This Chapter
5.2 Packet Loss Errors Overview
5.3 Ingress FCS Errors
5.3.1 Detecting Ingress FCS Errors
5.3.1.1 SNMP Trap Information
5.3.1.2 CLI Statistics
5.3.2 Ingress FCS Error Troubleshooting Flowchart
5.4 Egress FCS Errors
5.4.1 Detecting Egress FCS Errors
5.4.1.1 SNMP Trap Information
5.4.1.2 CLI Statistics
5.4.2 Egress FCS Error Troubleshooting Flowchart


6 Troubleshooting Pchip CAM Alarms
6.1 In This Chapter
6.2 Pchip CAM Error Overview
6.3 Pchip CAM Error Detection and Impact
6.4 Pchip CAM Alarm Troubleshooting Flowchart
6.5 Pchip CAM Alarms Description
6.6 Pchip CAM Alarm Sample Reports

7 Troubleshooting Qchip Errors
7.1 In This Chapter
7.2 Qchip Error Overview
7.3 Detecting Qchip Errors
7.3.1 Impact of Qchip Errors
7.4 Fail-On-Error
7.5 Qchip Error Troubleshooting Flowchart
7.6 Qchip Alarms Description
7.6.1 Reporting Qchip Alarms
7.6.2 Qchip Alarm Sample Reports

8 Troubleshooting MLPPP over a Serial Interface
8.1 In This Chapter
8.2 MLPPP Error Overview
8.3 MLPPP Error Troubleshooting Flowchart
8.4 To Troubleshoot One or More Inactive Links
8.5 To Troubleshoot An Inactive Channel Group
8.6 To Troubleshoot Traffic Issues

9 Troubleshooting Multicast Issues
9.1 In This Chapter
9.2 PIM-SM and IGMP Network Overview
9.3 Multicast Troubleshooting Tools
9.3.1 MTRACE
9.3.2 MSTAT
9.4 Workflow to Troubleshoot Multicast Problems
9.4.1 Isolating the Multicast Problem
9.5 Troubleshooting a Problem Isolated to One or More SR Routers
9.5.1 Flowchart to Troubleshoot a Problem Isolated to One SR-Series Router
9.5.2 Check the PIM Output
9.5.3 Discard Counters
9.5.4 No Egress Interface
9.5.5 No Ingress Interface
9.5.6 No Errors Indicated in CLI Output
9.6 Troubleshooting Hardware Issues and Queue Discards
9.6.1 Hardware Errors and Queue Discards Troubleshooting Flowchart
9.6.2 IOM and MDA Errors
9.6.3 Port-Level Errors
9.6.4 Queue-Level Drops


10 Troubleshooting ICC Errors
10.1 In This Chapter
10.2 Inter-Card Communication Overview
10.2.1 Automated Recovery from ICC Errors
10.3 ICC Troubleshooting Flowchart
10.4 To Troubleshoot ICC Failures

11 Upgrading Incompatible Firmware Versions
11.1 In This Chapter
11.2 Command Overview
11.3 Command Behavior and Impact

12 Recovering From Active CPM Lockup
12.1 In This Chapter
12.2 Recovering the Active CPM Overview
12.3 To Recover the Active CPM and Determine Root Cause Using the Lamp Test

13 Hardware Error Protection Features
13.1 In This Chapter
13.2 Hardware Error Protection Overview
13.3 Memory Bit/Parity Errors: Causes, Detection, and Correction
13.4 Fail-On-Error Overview
13.4.1 Clearing a Failed Operational State
13.4.2 Triggering Fail-On-Error
13.4.3 Enabling Log Reports
13.5 Card-Level Fail-On-Error
13.5.1 Card-Level Fail-On-Error Examples
13.6 MDA-Level Fail-On-Error
13.6.1 MDA-Level Fail-On-Error Examples
13.7 To Troubleshoot Using the Fail-On-Error Feature
13.8 Down-On-Internal-Error
13.8.1 Down-On-Internal-Error Examples
13.9 To Troubleshoot Using the Down-On-Internal-Error Feature
13.10 CRC-Monitor
13.10.1 CRC-Monitor Examples
13.11 To Troubleshoot Using the CRC-Monitor Feature


1 Getting Started

1.1 About This Guide

This guide provides troubleshooting procedures for the Nokia SR-series routers. The guide describes commonly encountered problems and events; it does not cover every potential troubleshooting scenario that might occur in the field.

Unless otherwise specified, the topics and procedures described in this document apply to the:

• 7450 ESS

• 7750 SR

7450 ESS applicability statements refer to the 7450 ESS when it is not running in mixed mode. 7750 SR applicability statements refer to the 7750 SR-7/12, 7750 SR-12e, 7750 SR-a4/a8, and 7750 SR-e1/e2/e3 platforms, unless otherwise specified.

Command outputs shown in this guide are examples only; actual outputs may differ depending on supported functionality and user configuration.

1.1.1 Audience

The Troubleshooting Guide is intended for network administrators who are responsible for configuring and managing the routers. It is assumed that the network administrators have an understanding of the following:

• 7750 SR chassis components

• 7750 SR OS CLI

• Networking principles and configurations

• Boot option, configuration, image loading, and initialization procedures

• File system concepts

Note: The alarms and troubleshooting information in this version of the Troubleshooting Guide may not apply to earlier releases of the Nokia SR OS software.


1.2 In This Chapter

This chapter provides the process flow information to troubleshoot the Nokia SR-series routers.

1.3 Nokia SR-Series Troubleshooting Process Overview

Table 1 provides an overview of the organization of the troubleshooting information.

Table 1 Overview of Troubleshooting Information

Chapter Title | Description

Troubleshooting Packet Loss | Provides information about how to troubleshoot packet loss caused by hardware issues in a live network or in a lab environment.

Troubleshooting XPL Data Bus Errors | Provides information about how to troubleshoot XPL data bus errors (IOM/MDA errors).

Troubleshooting Pchip Parity Alarms | Provides information about how to detect and troubleshoot Pchip parity alarms.

Troubleshooting Ingress/Egress FCS Errors | Provides information about how to troubleshoot ingress/egress Ethernet frame check sequence (FCS) errors.

Troubleshooting Pchip CAM Alarms | Provides information about how to troubleshoot Pchip Content Addressable Memory (CAM) alarms on the IOM and the IMM.

Troubleshooting Qchip Errors | Provides information about how to detect and troubleshoot Qchip errors on the IOM3-XP and IMM line cards.

Troubleshooting MLPPP over a Serial Interface | Describes how to troubleshoot the most common issues related to the Multi-Link Point-to-Point Protocol (MLPPP).

Troubleshooting Multicast Issues | Describes how to troubleshoot the most common issues related to multicast networks running PIM-SM or L3 networks.

Troubleshooting ICC Errors | Provides information about how to troubleshoot Inter-Card Communication (ICC) errors in the network.

Upgrading Incompatible Firmware Versions | Analyzes the associated risks and potential impact of unsolicited firmware version upgrades on the SR-series routers.

Recovering From Active CPM Lockup | Describes how to use the lamp test to recover from an active Control Processor Module (CPM) lockup.

Hardware Error Protection Features | Provides an overview of the hardware protection features that can be used to troubleshoot hardware alarms on the SR-series routers.


2 Troubleshooting Packet Loss

2.1 In This Chapter

This chapter describes how to troubleshoot packet loss on an SR-series router. Specifically, information is provided on packet loss that is caused by hardware issues in a live network or in a lab environment. Packet loss due to other network element errors or configuration errors is beyond the scope of this document.

The topics in this chapter include:

• Packet Loss Troubleshooting Flowchart

• To Troubleshoot Packet Loss in a Live Network

• To Troubleshoot Packet Loss in a Lab Environment

Note: Use standard troubleshooting procedures when the traffic loss is not caused by hardware problems (for example, when packets are dropped because QoS policies limit traffic rates, because of spanning tree blocking, and so on).


2.2 Packet Loss Troubleshooting Flowchart

The flowchart in Figure 1 defines the packet loss troubleshooting steps. A process of elimination is used to isolate the issue; proceed as directed.

Note: The physical layer errors mentioned in the flowchart can include FCS errors, malfunctioning transmission equipment, reduced power levels on the link, and so on.


Figure 1 Packet Loss Troubleshooting Flowchart

[Flowchart image not reproduced. Text of the flowchart boxes:]

Decisions:

• Is the suspected chassis with packet loss in a live network or lab environment? (Live network / Lab environment)

• Are there any physical layer errors? (Yes / No; this decision appears in both branches)

• Do ingress versus egress port counters indicate packet loss? (Yes / No)

• Do tests indicate that the packet loss is due to the SR router? (Yes / No)

Actions:

• Investigate the impact of the traffic loss and check logs for any indication of an error that might cause packet loss.

• Check port counters, logs and run tests to confirm that packet loss is on the SR router.

• Confirm that packet loss is not caused by any other network element.

• Isolate the affected chassis and configure an Epipe service. Use a tester to send and receive a fixed number of packets to the Epipe service.

• Use a tester to send and receive packets to the Epipe service for 1 hour or more.

• Take a tech-support file, start running traffic for 1 hour or more. Stop traffic, then take a second tech-support file.

• If physical layer errors are found, or if the ingress and egress number of packets in an Epipe test do not indicate packet loss on the SR router, do not escalate to Nokia Support as packet loss is external to the DUT.

• Contact Nokia Support for further troubleshooting.


2.3 To Troubleshoot Packet Loss in a Live Network

To troubleshoot packet loss in a live network, you must first confirm that the dropped packets are caused by router hardware problems and not by any other network element.

Step 1. Investigate the impact of the packet loss in the customer's network. For example, is there degradation of video quality or dropped calls in the customer network?

Step 2. Check the physical layer including port, fiber, SFP/XFP, transmission equipment, and power levels.

− Clean or swap the fiber if required.

− Check the CLI show port detail context for errors. Some errors, such as FCS issues, are visible in this context. Figure 2 shows an example of the show port detail output information.

Figure 2 Physical Layer Errors

− Clear any errors before proceeding.

Step 3. The traffic patterns in a live network are not always known, which often makes it difficult to determine packet loss occurrence. However, this step is necessary to rule out obvious reasons for packet loss, such as L3 adjacencies being down.

Perform these tests to check if the packet loss is on the SR-series router.

− Check Ingress and Egress Traffic Flow

Using the CLI commands shown in Figure 3, check the port/SAP and service counters to determine if traffic is flowing at ingress and egress.

Example (Figure 2):
A:FN1# show port 1/1/3 detail
<..output omitted>

Ethernet-like Medium Statistics

Alignment Errors : 0          Sngl Collisions : 0
FCS Errors       : 0          Mult Collisions : 0
SQE Test Errors  : 0          Late Collisions : 0
CSE              : 0          Excess Collisns : 0
Too long Frames  : 0          Int MAC Tx Errs : 0
Symbol Errors    : 0          Int MAC Rx Errs : 0
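When many ports must be checked, the counters in this output can be scanned by a script instead of by eye. The following Python sketch is illustrative only: it assumes the output has already been captured as text, and the function names and the sample string are hypothetical, not part of SR OS or any Nokia tool. It extracts every "name : value" pair and reports the counters that are not zero.

```python
import re

# Sample text modeled on the "show port 1/1/3 detail" excerpt in this guide.
SAMPLE = """\
Ethernet-like Medium Statistics

Alignment Errors : 0          Sngl Collisions : 0
FCS Errors       : 0          Mult Collisions : 0
SQE Test Errors  : 0          Late Collisions : 0
CSE              : 0          Excess Collisns : 0
Too long Frames  : 0          Int MAC Tx Errs : 0
Symbol Errors    : 0          Int MAC Rx Errs : 0
"""

# A counter is a name made of letters and spaces, a colon, then an integer.
COUNTER = re.compile(r"([A-Za-z][A-Za-z ]*?)\s*:\s*(\d+)")


def parse_medium_stats(text):
    """Return a dict mapping each counter name to its integer value."""
    counters = {}
    for line in text.splitlines():
        for name, value in COUNTER.findall(line):
            counters[name.strip()] = int(value)
    return counters


def nonzero_counters(counters):
    """Return only the counters whose value is not zero."""
    return {name: value for name, value in counters.items() if value != 0}


if __name__ == "__main__":
    suspicious = nonzero_counters(parse_medium_stats(SAMPLE))
    print(suspicious if suspicious else "No physical layer errors in this excerpt")
```

An empty result only means that this excerpt shows no errors; fiber, optics, and power levels still need to be checked as described in Step 2.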


Figure 3 CLI Commands to Check Ingress/Egress Traffic

− Ping Test

Ping through the SR router so that the ping source and destination addresses are not local to the SR router being tested. If there is latency or if some ping packets are lost, it can help confirm that there is an issue with the router.

− Filters

Configure filters at ingress and egress of the SR router and log the filter hits to memory. Check the packet format and confirm that the packets should be forwarded by the system. For example, if a ping test is sent, the filter can capture the number of ingress and egress ICMP packets; the counters at ingress and egress should match. Figure 4 shows example CLI filter configuration and output information.

Figure 4 Configure Filter Example

Example (Figure 3):
*A:7x50-PE1# show port detail
*A:7x50-PE1# show service id 1 sap lag-1:1.1 detail
*A:7x50-PE1# show service id 1 sdp detail

Example (Figure 4):
Configure the filter:

*A:7x50_PE4# configure filter ip-filter 10
*A:7x50_PE4>config>filter>ip-filter# info

entry 10 create
    match protocol icmp
    exit
    log 101
    action forward
exit

*A:7x50_PE4>config>filter>ip-filter#

Apply the filter:

*A:7x50_PE4>config>service>epipe# info

sap 1/1/15 create
    ingress
        filter ip 10
    exit
exit
no shutdown


Step 4. Check log 99 and log 100 for errors indicating that protocol or hardware issues may be responsible for the packet loss.

If the troubleshooting indicates that there is a packet loss issue on the SR router, contact Nokia Technical Support for further troubleshooting assistance. Include the following information about the affected node:

• two tech-support files (generated an hour or more apart using the admin tech-support CLI command) from the affected node

• the IOM and MDA that are dropping traffic and the reasons why packet loss is suspected

• a summary of the troubleshooting steps, including information from Step 4

• impact of the traffic loss

2.4 To Troubleshoot Packet Loss in a Lab Environment

When troubleshooting in a lab environment, it is best to isolate the affected SR router. This will eliminate the possibility of other network elements and configuration errors as the cause of traffic loss. Send test traffic through the isolated node and measure the total number of packets sent and received.

Step 1. Configure an Epipe service with two SAPs, as shown in Figure 5.

Figure 5 Epipe Service Example

Step 2. Connect two tester ports to the SAPs and send bidirectional traffic through the Epipe, as shown in Figure 6.

Example (Figure 5):
*A:7x50-PE1>config>service# epipe 100
*A:7x50-PE1>config>service>epipe# info

sap 1/1/6 create
exit
sap 1/1/7 create
exit
no shutdown


Figure 6 Epipe Test Setup

Step 3. Check the physical layer (port, fiber, SFP/XFP). Some errors, such as FCS errors, are visible in the CLI in the show port detail context. Clear any errors before proceeding.

Step 4. Stop traffic and clear all SAP statistics, as shown in the example in Figure 7.

Figure 7 Clear SAP Statistics Example

Step 5. Generate a technical support file using the admin tech-support CLI command.

Step 6. Send bidirectional traffic for one hour or more.

Step 7. Stop the traffic.

Step 8. Check if the ingress packets on one Epipe SAP equal the egress packets on the other SAP. In this example, that means SAP 1/1/6 (ingress) = SAP 1/1/7 (egress) and vice versa. See the CLI example shown in Figure 8.

Step 9. If the number of ingress and egress packets compared in Step 8 does not match, generate a tech-support file using the admin tech-support CLI command.

If the number of ingress and egress packets compared in Step 8 matches, the packet loss is due to factors external to the SR router and other network elements should be analyzed.

Step 10. If the troubleshooting steps indicate that there is a packet loss issue on the SR router, escalate the issue to Nokia Technical Support for further troubleshooting assistance. Include the following information about the affected node:

− tech-support files generated in Step 5 and Step 9

[Figure 6 diagram: tester port to Epipe SAP 1/1/6, 7x50 with Epipe configuration, Epipe SAP 1/1/7 to tester port.]

Example (Figure 7):
*A:7x50-PE1# clear service statistics sap 1/1/6 all


− detailed information about the tester streams, including packet size, traffic rate, and type of traffic, as shown in the CLI command output in Figure 8

Figure 8 Example Tester Stream Information

Example:

*A:7x50-PE1# show service id 100 sap 1/1/6 detail

<output omitted>

Sap Statistics

Last Cleared Time     : N/A

                        Packets                 Octets
Forwarding Engine Stats
Dropped               : 991                     113969
Off. HiPrio           : 0                       0
Off. LowPrio          : 2609337                 1912228386
Off. Uncolor          : 0                       0

Queueing Stats(Ingress QoS Policy 1)
Dro. HiPrio           : 0                       0
Dro. LowPrio          : 0                       0
For. InProf           : 0                       0
For. OutProf          : 2609337                 1912228386

Queueing Stats(Egress QoS Policy 1)
Dro. InProf           : 0                       0
Dro. OutProf          : 0                       0
For. InProf           : 110                     7398
For. OutProf          : 2163511                 547459993

Sap per Queue stats

                        Packets                 Octets
Ingress Queue 1 (Unicast) (Priority)
Off. HiPrio           : 0                       0
Off. LoPrio           : 2609337                 1912228386
Dro. HiPrio           : 0                       0
Dro. LoPrio           : 0                       0
For. InProf           : 0                       0
For. OutProf          : 2609337                 1912228386

Egress Queue 1
For. InProf           : 110                     7398
For. OutProf          : 2163511                 547459993
Dro. InProf           : 0                       0
Dro. OutProf          : 0                       0
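The comparison in Step 8 can be scripted once the show service id ... sap ... detail output of both SAPs has been saved as text. The Python sketch below is one possible approach, not a Nokia tool. It assumes that the packets forwarded in the ingress Queueing Stats of one SAP should equal the packets forwarded in the egress Queueing Stats of the other SAP. The function names are hypothetical, and the values shown for SAP 1/1/7 are made up for illustration; only the SAP 1/1/6 values come from the example output.

```python
import re

# Section header, for example "Queueing Stats(Ingress QoS Policy 1)".
HEADER = re.compile(r"Queueing Stats\((Ingress|Egress) QoS Policy \d+\)")
# Counter row: name, colon, packets column, octets column.
ROW = re.compile(r"^\s*(\S.*?)\s*:\s*(\d+)\s+(\d+)\s*$")

# Values taken from the SAP 1/1/6 example output.
SAP_1_1_6 = """\
Queueing Stats(Ingress QoS Policy 1)
Dro. HiPrio           : 0                       0
Dro. LowPrio          : 0                       0
For. InProf           : 0                       0
For. OutProf          : 2609337                 1912228386

Queueing Stats(Egress QoS Policy 1)
Dro. InProf           : 0                       0
Dro. OutProf          : 0                       0
For. InProf           : 110                     7398
For. OutProf          : 2163511                 547459993
"""

# Hypothetical values for SAP 1/1/7, invented for this illustration.
SAP_1_1_7 = """\
Queueing Stats(Ingress QoS Policy 1)
For. InProf           : 0                       0
For. OutProf          : 2163621                 547467391

Queueing Stats(Egress QoS Policy 1)
For. InProf           : 0                       0
For. OutProf          : 2609300                 1912201000
"""


def parse_queueing_stats(text):
    """Collect the packets column of each Queueing Stats section."""
    stats = {"ingress": {}, "egress": {}}
    section = None
    for line in text.splitlines():
        header = HEADER.match(line.strip())
        if header:
            section = header.group(1).lower()
            continue
        row = ROW.match(line)
        if row and section:
            stats[section][row.group(1)] = int(row.group(2))
        elif not row:
            section = None  # any other line closes the current section
    return stats


def forwarded(stats, direction):
    """Total packets forwarded (in-profile plus out-of-profile)."""
    counters = stats[direction]
    return counters.get("For. InProf", 0) + counters.get("For. OutProf", 0)


def epipe_loss(sap_a_text, sap_b_text):
    """Return the packet difference per direction; 0 means the counts match."""
    a = parse_queueing_stats(sap_a_text)
    b = parse_queueing_stats(sap_b_text)
    return {
        "a_to_b": forwarded(a, "ingress") - forwarded(b, "egress"),
        "b_to_a": forwarded(b, "ingress") - forwarded(a, "egress"),
    }


if __name__ == "__main__":
    print(epipe_loss(SAP_1_1_6, SAP_1_1_7))
```

A nonzero difference in either direction corresponds to the "no match" case in Step 9. Whether the forwarded counters or the offered counters are the right basis for the comparison depends on the test setup, so treat the choice made here as an assumption.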


3 Troubleshooting XPL Data Bus Errors

3.1 In This Chapter

This chapter describes the troubleshooting procedures for XPL data bus errors.

Topics in this chapter include:

• XPL Data Bus Errors Overview

• Detecting XPL Data Bus Errors Between the IOM and MDA

• XPL Error Troubleshooting Flowchart

• To Troubleshoot XPL Errors

Note: The troubleshooting information in this chapter applies to SR routers running TiMOS software releases 5.0.R16, 6.1.R1, 6.0.R5, and later. Some alarms and troubleshooting procedures may not apply to routers running older versions of TiMOS software.
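The release condition in the note can be expressed as a small check. The Python sketch below rests on an assumed reading of the note: each listed release (5.0.R16, 6.0.R5, 6.1.R1) is taken as the minimum revision for its own release branch, any newer branch qualifies, and any other older branch does not. The function names and the plain "major.minor.Rn" string format are hypothetical simplifications.

```python
import re

# Assumed reading of the note: minimum revision per listed release branch.
MIN_REVISION = {(5, 0): 16, (6, 0): 5, (6, 1): 1}

# Simplified release format, for example "6.0.R5".
RELEASE = re.compile(r"^(\d+)\.(\d+)\.R(\d+)$")


def parse_release(release):
    """Split a string such as '6.0.R5' into (major, minor, revision)."""
    match = RELEASE.match(release.strip())
    if match is None:
        raise ValueError(f"unrecognized release string: {release!r}")
    major, minor, revision = (int(group) for group in match.groups())
    return major, minor, revision


def chapter_applies(release):
    """Return True if the release meets the assumed minimum for its branch."""
    major, minor, revision = parse_release(release)
    branch = (major, minor)
    if branch in MIN_REVISION:
        return revision >= MIN_REVISION[branch]
    # Branches newer than every listed branch are assumed to qualify.
    return branch > max(MIN_REVISION)


if __name__ == "__main__":
    for candidate in ("5.0.R15", "5.0.R16", "6.0.R5", "7.0.R1"):
        print(candidate, chapter_applies(candidate))
```

If in doubt about a particular release, confirm applicability with the release notes rather than relying on this reading.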


3.2 XPL Data Bus Errors Overview

The 7750 SR and 7450 ESS TiMOS software loadset supports the identification of possible intermittent packet loss within the SR node. The three specific cases of potential packet loss are characterized as:

• Ingress FCS Errors (IOM error)

• Egress FCS Errors (IOM error)

• XPL Data Bus Errors (IOM/MDA error)

The 7x50 TiMOS software generates notifications of potential internal packet loss through the main system event logs (log 99), SNMP traps and CLI. These notifications provide operators with a faster, easier way of identifying whether any of the packet loss conditions are currently occurring in the network.

This chapter describes how to troubleshoot XPL errors. See chapter 5 for information about handling IOM errors.

3.3 Detecting XPL Data Bus Errors Between the IOM and MDA

An IOM accepts up to two Media Dependent Adapters (MDAs). The XPL data bus is a bus between the IOM and each of the two MDAs that it supports. There are two XPL buses on an IOM; each bus works independently of the other (that is, neither impacts the other).

When an XPL data bus error condition exists on a node, it indicates that there is a problem either with the physical layer, or somewhere along the data bus. The XPL data bus errors affect bidirectional traffic.

In addition to an optional log message and SNMP trap, you can also get the timestamp of the last occurrence of the event, and information about the number of times the threshold was crossed. Use the show mda detail CLI command to display log event information to characterize the errors; see section 3.3.2 for more information.


3.3.1 SNMP Trap Information

• SNMP MIB: TIMETRA-CHASSIS-MIB.mib

• SNMP Trap: tmnxEqMdaXplError

Sample Event Log Entry

8249 2008/05/21 16:12:09.33 UTC MINOR: CHASSIS #2058 Base "MDA 1/1 experienced XPL errors."
8250 2008/05/21 16:12:10.33 UTC MINOR: CHASSIS #2058 Base "MDA 1/2 experienced XPL errors."
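Log entries of this form can be scanned programmatically. The following Python sketch is illustrative only: it assumes the log text has been saved with one entry per line in the layout of the sample entries above, and the function name xpl_events is hypothetical (it is not part of any Nokia tool).

```python
import re
from datetime import datetime

# Layout assumed from the sample entries:
# <seq> <date> <time> UTC <severity>: CHASSIS #2058 <router> "MDA <slot>/<mda> experienced XPL errors."
XPL_LOG = re.compile(
    r'^(?P<seq>\d+)\s+(?P<date>\d{4}/\d{2}/\d{2})\s+(?P<time>\d{2}:\d{2}:\d{2}\.\d{2})\s+UTC\s+'
    r'(?P<sev>\w+):\s+CHASSIS\s+#2058\s+\S+\s+"MDA\s+(?P<mda>\d+/\d+)\s+experienced XPL errors\."'
)

def xpl_events(log_text):
    """Return (sequence, timestamp, mda) for each XPL error entry found."""
    events = []
    for line in log_text.splitlines():
        m = XPL_LOG.match(line.strip())
        if m:
            ts = datetime.strptime(m["date"] + " " + m["time"], "%Y/%m/%d %H:%M:%S.%f")
            events.append((int(m["seq"]), ts, m["mda"]))
    return events

sample = '''8249 2008/05/21 16:12:09.33 UTC MINOR: CHASSIS #2058 Base "MDA 1/1 experienced XPL errors."
8250 2008/05/21 16:12:10.33 UTC MINOR: CHASSIS #2058 Base "MDA 1/2 experienced XPL errors."'''

for seq, ts, mda in xpl_events(sample):
    print(seq, ts.isoformat(), mda)
```

Grouping the returned tuples by MDA gives a quick view of which complex is reporting errors and how often.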

3.3.2 CLI Statistics

The CLI can be used to get a report of the number of times the XPL Error trap has occurred, and the last time it was raised. The information is available in the CLI context for the MDA view referencing the individual complex where the errors are being reported.

The following output is an example of the CLI information for the XPL Errors trap.

*A:SR13# show mda 6/2 detail
===============================================================================
MDA 6/2 detail
===============================================================================
Slot  Mda   Provisioned           Equipped              Admin     Operational
            Mda-type              Mda-type              State     State
-------------------------------------------------------------------------------
6     2     m20-1gb-sfp           m20-1gb-sfp           up        up

MDA Specific Data
    Maximum port count            : 20
    Number of ports equipped      : 20
    Network ingress queue policy  : default
    Capabilities                  : Ethernet

Note: The tmnxEqMdaXplError alarm is enabled by default in most software releases. If it is disabled, you can use the configure log event-control chassis 2058 generate CLI command to enable the alarm. See chapter 13 for more information about enabling hardware alarms.

Note: The CLI status and related statistics are cleared after an IOM reboot.


Hardware Data
    Part number                   : 3HE00708AAAA01
    CLEI code                     : IPUIALADAA
    Serial number                 : NS061850066
    Manufacture date              : 05032006
    Manufacturing string          :
    Manufacturing deviations      :
    Administrative state          : up
    Operational state             : up
    Temperature                   : 38C
    Temperature threshold         : 75C
    Time of last boot             : 2008/06/10 16:14:58
    Current alarm state           : alarm cleared
    Base MAC address              : 00:16:4d:27:9a:18

-------------------------------------------------------------------------------
XPL Errors: Trap raised 1 times; Last Trap 06/10/2008 16:26:01
-------------------------------------------------------------------------------
===============================================================================
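The XPL Errors line in this output carries the two values of interest: the trap count and the time of the last trap. A minimal Python sketch for extracting them follows. It assumes the line has exactly the wording shown in the sample and that the date is in MM/DD/YYYY order (consistent with the boot time shown in the same output); parse_xpl_summary is a hypothetical helper name.

```python
import re
from datetime import datetime

XPL_LINE = re.compile(
    r"XPL Errors:\s*Trap raised\s+(\d+)\s+times;\s*"
    r"Last Trap\s+(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})"
)

def parse_xpl_summary(cli_text):
    """Return (trap_count, last_trap_time), or None when no XPL Errors line is present."""
    m = XPL_LINE.search(cli_text)
    if m is None:
        return None
    # Assumption: the date is MM/DD/YYYY, as in the sample output.
    return int(m.group(1)), datetime.strptime(m.group(2), "%m/%d/%Y %H:%M:%S")

print(parse_xpl_summary("XPL Errors: Trap raised 1 times; Last Trap 06/10/2008 16:26:01"))
```

Because the counters are cleared after an IOM reboot, a missing XPL Errors line (a None result here) does not by itself prove that no errors occurred before the last boot.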

3.4 XPL Error Troubleshooting Flowchart

The flowchart in Figure 9 defines the packet loss troubleshooting steps. A process of elimination is used to isolate the issue; proceed as directed.

Note: The physical layer errors mentioned in the flowchart can include FCS, malfunctioning transmission equipment, reduced power levels on the link, and so on.


Figure 9 XPL Errors Troubleshooting Flowchart

[Flowchart image not reproduced in this text. The steps it depicts are described in section 3.5. The flowchart also checks whether the node is running a release below 5.0R23, 6.0R12, 6.1R7, or 7.0R2 (if so, upgrade to at least one of these releases), and whether the specific MDA complex is processing a mixture of jumbo and small Ethernet packets at near line rate.]


3.5 To Troubleshoot XPL Errors

Perform the following tasks in sequence until you identify the root cause of the problem.

Step 1. Identify the type of error

For all ports on the affected MDA, check the output of the show port a/b/c detail CLI command and identify the occurrence of any physical layer errors, such as Internal MAC Transmit errors or Collisions. Perform one of the following:

a. If XPL errors are found in the CLI output, with or without Internal MAC Transmit errors, go to Step 5.

b. If Internal MAC Transmit errors are found but there are no XPL errors, go to Step 2; the problem is probably caused by an MDA issue.

c. If an IMM is reporting the XPL errors, go to Step 5.
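The branching in Step 1 can be written down as a small decision function. This is only a sketch of one reading of the step (XPL errors take precedence, and Internal MAC Transmit errors on their own point to the MDA); the function name next_step and its boolean inputs are hypothetical.

```python
def next_step(xpl_errors, int_mac_tx_errors):
    """Return the step to go to after Step 1, or None if neither error type is seen.

    xpl_errors        -- True if XPL errors are reported for the MDA or IMM
    int_mac_tx_errors -- True if 'Int MAC Tx Errs' is non-zero on any port of the MDA
    """
    if xpl_errors:
        # Cases a and c: XPL errors (on an IOM/MDA complex or an IMM) lead to Step 5.
        return 5
    if int_mac_tx_errors:
        # Case b: only Internal MAC Transmit errors, probably an MDA issue.
        return 2
    return None

print(next_step(xpl_errors=False, int_mac_tx_errors=True))
```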

Step 2. Reset the MDA

Reset the MDA on the complex that is reporting the errors. Use the clear mda CLI command, as shown in the sample CLI output in Figure 10.

If the error is no longer incrementing, monitor the node. Otherwise, go to the next step.

Figure 10 Clear MDA Sample CLI Output

A:7750_SR7# show mda

MDA Summary

Slot  Mda   Provisioned           Equipped              Admin     Operational
            Mda-type              Mda-type              State     State

2     1     isa-aa                isa-aa                up        up/active
      2     m10-1gb-sfp-b         m10-1gb-sfp-b         up        up

A:7750_SR7# clear mda 2/2
A:7750_SR7#

Step 3. Reseat the MDA

Reseat the MDA on the complex reporting the Internal MAC Tx error.

If the error is no longer incrementing, monitor the node. Otherwise, go to the next step.

Step 4. Replace the MDA


Replace the MDA on the complex that is reporting the errors.

If the error is no longer incrementing, monitor the node. Otherwise, go to Step 9.

Step 5. Check for collisions

Common reasons for collisions are:

− Full-duplex/half-duplex mismatch

− Exceeded Ethernet cable length limits

− Incorrect cabling or a non-compliant number of hubs in the network

− Autonegotiate configured on the near end but not at the far end

As shown in Figure 11, full-duplex/half-duplex mismatch is the most probable cause of collisions.

Figure 11 Collisions CLI Sample Output

A:7750_SR7# show port 2/2/1 detail

Ethernet Interface

Description        : 10/100/Gig Ethernet SFP
Interface          : 2/2/1                      Oper Speed       : 1 Gbps
Link-level         : Ethernet                   Config Speed     : 1 Gbps
Admin State        : up                         Oper Duplex      : half
Oper State         : up                         Config Duplex    : full
Physical Link      : Yes                        MTU              : 1514
IfIndex            : 71335936                   Hold time up     : 0 seconds
Last State Change  : 03/27/2009 19:14:45        Hold time down   : 0 seconds
Last Cleared Time  : N/A                        DDM Events       : Enabled
Last Cleared Time  : N/A

<…snip>

Ethernet-like Medium Statistics

Alignment Errors   : 0                          Sngl Collisions  : 17072
FCS Errors         : 0                          Mult Collisions  : 0
SQE Test Errors    : 0                          Late Collisions  : 5182
CSE                : 0                          Excess Collisns  : 0
Too long Frames    : 0                          Int MAC Tx Errs  : 0
Symbol Errors      : 0                          Int MAC Rx Errs  : 0

Check the output of the show port detail CLI command and perform the following:

i. If the CLI output indicates collisions, check the port configuration at the near end and the far end for a mismatch.

ii. Resolve any configuration errors found, then monitor the node.


iii. If the errors are still incrementing, go to the next step.
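The collision check in this step can be partly automated. The sketch below is a hedged illustration, not a Nokia utility: it assumes the show port detail output has been captured as text with the field labels shown in Figure 11 and one value per label, and it treats a difference between Oper Duplex and Config Duplex as a hint of a duplex mismatch. The helper names field and check_port are hypothetical.

```python
import re

COLLISION_FIELDS = ("Sngl Collisions", "Mult Collisions", "Late Collisions", "Excess Collisns")

def field(text, name):
    """Return the first token following 'name :' in the CLI text, or None if absent."""
    m = re.search(re.escape(name) + r"\s*:\s*(\S+)", text)
    return m.group(1) if m else None

def check_port(text):
    """Summarise duplex settings and collision counters from show port detail text."""
    oper, config = field(text, "Oper Duplex"), field(text, "Config Duplex")
    collisions = {name: int(field(text, name) or 0) for name in COLLISION_FIELDS}
    return {
        # A differing operational and configured duplex is only a hint; the far end
        # configuration still has to be checked as described in the step.
        "duplex_mismatch": oper is not None and config is not None and oper != config,
        "total_collisions": sum(collisions.values()),
        "collisions": collisions,
    }

sample = """
Admin State        : up                  Oper Duplex      : half
Oper State         : up                  Config Duplex    : full
Alignment Errors   : 0                   Sngl Collisions  : 17072
FCS Errors         : 0                   Mult Collisions  : 0
SQE Test Errors    : 0                   Late Collisions  : 5182
CSE                : 0                   Excess Collisns  : 0
"""

print(check_port(sample))
```

Running the check on two captures taken some time apart shows whether the collision counters are still incrementing after the configuration has been corrected.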

Step 6. Check alarms in log-id 99

Check log-id 99 for any alarms that show fault set/cleared for the remote or local node. In most cases, the alarms are set and cleared within the same second and you will not see the port bounce.

If these alarms are found in log-id 99, troubleshoot the physical layer and transmission equipment.

Step 7. Measure the power levels

Measure the power on the node reporting the error and on intermediate transmission equipment. Perform one of the following:

a. If the power levels are within specifications and the error is still incrementing, go to the next step.

b. If the power levels are within specifications and the error is not incrementing, monitor the node.

c. If the power levels are not within specifications:

i. Troubleshoot the physical layer until the power levels are within specifications.

ii. If the error is still incrementing, go to the next step.

iii. If the error is no longer incrementing, monitor the node.

Step 8. Troubleshoot the IOM/MDA

i. Perform Steps 2 to 4 to troubleshoot IOM/MDA issues.

If this resolves the error, monitor the node to verify that the error has cleared.

ii. If the error is not resolved, replace the IOM in the slot reporting the XPL errors.

iii. If the IOM replacement resolves the error, monitor the node. Otherwise, proceed to the next step.

Note: The Transmit and Receive power and temperature values for DDM capable ports can be monitored in the show port detail CLI context.

Note: If the card reporting XPL errors is an IMM and performing Steps 5 to 7 has not resolved the problem, then the IMM should be replaced.

Step 9. Escalate the issue to Nokia Technical Support

Gather the following information and escalate the issue to Nokia Technical Support for further troubleshooting assistance:

− Details of the procedure followed to troubleshoot the issue

− Two tech-support files from the node reporting the error and from the far end (if it is a 7x50); generate the files approximately 15 minutes or more apart using the admin tech-support CLI command.

Note: XPL errors increment very slowly in some cases. Therefore, Nokia recommends observing the 15-minute gap between the two tech-support (TS) files.

− Relevant log details

− Whether transmission equipment is present
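Several steps in this procedure depend on whether an error counter is still incrementing, and the slow increment rate is the reason for the 15-minute gap. A minimal sketch of that comparison follows, assuming the trap counts have already been read from two captures of the same MDA or card; the function name still_incrementing and the (time, count) input shape are hypothetical.

```python
from datetime import datetime, timedelta

MIN_GAP = timedelta(minutes=15)  # gap recommended between the two captures

def still_incrementing(first, second):
    """Compare two (capture_time, trap_count) pairs from the same MDA or card.

    Returns True if the count rose between the captures and False if it did not.
    Raises ValueError if the captures are less than 15 minutes apart, or if the
    counter went backwards (for example after a reboot cleared the statistics).
    """
    (t1, c1), (t2, c2) = first, second
    if t2 - t1 < MIN_GAP:
        raise ValueError("captures are less than 15 minutes apart")
    if c2 < c1:
        raise ValueError("counter decreased; statistics were probably cleared")
    return c2 > c1

t0 = datetime(2008, 6, 10, 16, 26, 1)
print(still_incrementing((t0, 1), (t0 + timedelta(minutes=15), 3)))
```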


4 Troubleshooting Pchip Parity Alarms

4.1 In This Chapter

This chapter describes the troubleshooting procedures for handling Pchip parity alarm errors.

The topics in this chapter include:

• Pchip Memory Parity Error Overview

• Pchip Memory Parity Error Detection and Impact

• Pchip Memory Parity Alarm Troubleshooting Flowchart

• Pchip Parity Alarms Description

• Pchip Alarm Sample Reports

Note: The alarms and troubleshooting information in this chapter apply to SR routers running TiMOS software releases 6.1.R5, 6.0.R10, 5.0.R21 and later; they do not apply to SR routers running older TiMOS releases.


4.2 Pchip Memory Parity Error Overview

The SR OS software loadset (6.1.R5, 6.0.R10, 5.0.R21 and later) supports the identification of conditions where an IOM is experiencing Pchip memory parity issues.

This chapter provides information about how and when the Pchip parity alarms are generated and what to do when they are reported.

4.3 Pchip Memory Parity Error Detection and Impact

The Nokia SR OS series Packet Processing chip (Pchip) is a network processor device located on the IOM. It performs various ingress and egress traffic related functions.

The SR OS error detection mechanism monitors the Pchip and determines whether an errored memory sector is correctable (and will correct the error), uncorrectable, or undetectable (possible read error).

In addition to the optional log message and SNMP trap, the timestamp of the last occurrence of the event and number of times the threshold was crossed are also displayed by the show card <X> detail CLI command.

The SR OS generates the tmnxEqCardPChipMemoryEvent alarms when errors are detected; see section 4.5 Pchip Parity Alarms Description for detailed alarm information.

Warning: Pchip memory parity errors may cause service impact to the control and data traffic; it is not possible to determine when a parity error will impact traffic.

The severity of the service impact may vary depending on the end-to-end applications, rate of error increment, location of the affected memory, and other factors.


4.4 Pchip Memory Parity Alarm Troubleshooting Flowchart

The flowchart in Figure 12 defines the troubleshooting steps for situations when Pchip parity alarms are reported by the system. A process of elimination is used to isolate the issue; proceed as directed.


Figure 12 Pchip Memory Parity Error Troubleshooting Flowchart

[Flowchart image not reproduced in this text. The flowchart branches on the IOM type (IOM 1, IOM 2, or IOM 3 XP/IMM), on the running release (6.1R13 or 6.1R15, 7.0R7 or 7.0R8, 8.0, 9.0 or later), and on whether Pchip memory parity errors are incrementing or service impact is associated with the IOM reporting the error. The actions shown are: monitor the card; remotely power cycle the IOM using tools perform card <slot> power-cycle; reseat (power cycle) the IOM (IMM) reporting the alarm; replace the IOM (IMM) reporting the alarm; collect relevant logs and 2 tech-support files 15 minutes apart; and escalate to Nokia Support.]

Warning: Reseating, remotely power cycling, or replacing the IOM reboots the hardware and causes service impact. This procedure may need to be performed during a scheduled service maintenance window.

Note: Startup diagnostics are executed during the reboot of the IOM. If the IOM does not boot up successfully after the power cycle, a spare IOM must be dispatched onsite, and the faulty card must be replaced and returned to Nokia repair following the Nokia Repair and Return process.


4.5 Pchip Parity Alarms Description

The SR OS generates the tmnxEqCardPChipMemoryEvent alarms when errors are detected.

SNMP MIB: TIMETRA-CHASSIS-MIB.mib

SNMP Trap: tmnxEqCardPChipMemoryEvent

The following configuration is required to enable the reporting of the tmnxEqCardPChipMemoryEvent alarms:

B:7x50# configure log event-control "chassis" 2098 generate

4.6 Pchip Alarm Sample Reports

This section provides event log configuration and CLI output examples of Pchip alarm reports.

tmnxEqCardPChipMemoryEvent

The following output is an example of the event log entry and CLI output information.

Sample Event Log Entry

1622 2008/09/14 12:04:05.01 UTC MINOR: CHASSIS #2063 Base
"Slot 3 experienced a pchip parity error occurrence on complex 0"

Sample CLI Output

B:SR12# show card 3 detail
===============================================================================
Card 3
===============================================================================

Note: The Pchip parity alarms are suppressed by default and must be enabled to facilitate post-failure analysis. In 5620 SAM, the alarms are not self clearing and must be cleared by an operator.

Note: The CLI status and statistics are cleared after an IOM / IMM / XCM reboot.


Slot  Provisioned           Equipped              Admin     Operational
      Card-type             Card-type             State     State
-------------------------------------------------------------------------------
3     iom-20g-b             iom-20g-b             up        up

IOM Card Specific Data
    Clock source                  : none
    Available MDA slots           : 2
    Installed MDAs                : 1

Hardware Data
    Part number                   : 3HE00229ABAB01
    CLEI code                     : IPUIAM9DAA
    Serial number                 : NS072370366
    Manufacture date              : 06232008
    Manufacturing string          :
    Manufacturing deviations      :
    Administrative state          : up
    Operational state             : up
    Temperature                   : 42C
    Temperature threshold         : 75C
    Software boot (rom) version   : X-0.0.private on Thu May 1 17:04:44 EDT 20*
    Software version              : TiMOS-I-5.0.S498 iom/hops ALCATEL ESS 7450*
    Time of last boot             : 2008/09/08 12:47:54
    Current alarm state           : alarm cleared
    Base MAC address              : 00:16:4d:de:ca:dd
    Memory capacity               : 1,024 MB

Pchip Errors Detected
    Complex 0 (parity error): Trap raised 1625 times; Last Trap 09/14/2008 12:04:05

===============================================================================
*B:7450-RCA#
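The per-complex counter lines in the show card detail output follow a regular pattern, so they can be collected with a short script. The Python sketch below is an illustration under stated assumptions: the line wording is taken from the sample output above, the text inside the parentheses is kept as an opaque label, and complex_counters is a hypothetical helper name.

```python
import re

COMPLEX_LINE = re.compile(
    r"Complex\s+(\d+)\s+\(([^)]+)\):\s*Trap raised\s+(\d+)\s+times;\s*"
    r"Last Trap\s+(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})"
)

def complex_counters(cli_text):
    """Return one dict per 'Complex N (...)' counter line found in the CLI text."""
    return [
        {"complex": int(cx), "kind": kind, "traps": int(count), "last_trap": last}
        for cx, kind, count, last in COMPLEX_LINE.findall(cli_text)
    ]

sample = (
    "Pchip Errors Detected\n"
    "    Complex 0 (parity error): Trap raised 1625 times; Last Trap 09/14/2008 12:04:05\n"
)

print(complex_counters(sample))
```

The timestamp is left as a string here so that the sketch makes no assumption about the date order used by the node.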


5 Troubleshooting Ingress/Egress FCS Errors

5.1 In This Chapter

This chapter describes the troubleshooting procedures for ingress/egress Ethernet frame check sequence (FCS) errors.

The topics in this chapter include:

• Packet Loss Errors Overview

• Ingress FCS Errors

• Egress FCS Errors

Note: The troubleshooting information in this chapter applies to SR routers running TiMOS software releases 5.0.R16, 6.1.R1, 6.0.R5 and later. Some alarms and troubleshooting procedures may not apply to routers running older versions of the TiMOS software.


5.2 Packet Loss Errors Overview

The 7x50 TiMOS software loadset (5.0.R16, 6.1.R1, 6.0.R5 and later) supports the identification of possible intermittent packet loss within the SR node. The three specific cases of potential packet loss are characterized as:

• Ingress FCS Errors (IOM error)

• Egress FCS Errors (IOM error)

• XPL Data Bus Errors (IOM/MDA errors)

The 7x50 TiMOS software generates notifications of potential internal packet loss through the main system event logs (log 99), SNMP traps, and CLI. These notifications provide operators with a faster, easier way of identifying whether any of the packet loss conditions are currently occurring in the network.

This chapter describes how to troubleshoot ingress and egress FCS errors. See chapter 3 for information about handling IOM/MDA errors.

5.3 Ingress FCS Errors

This section describes the troubleshooting information for ingress FCS errors. The topics are:

• Detecting Ingress FCS Errors

• Ingress FCS Error Troubleshooting Flowchart

5.3.1 Detecting Ingress FCS Errors

Ingress FCS errors are indicated by packet loss on traffic arriving at the ingress of a specific IOM/MDA complex. The packet loss associated with this issue generally affects all ingress ports on a specific IOM/MDA complex.

Warning: Ingress FCS errors cause the system to discard the packets that are detected with a failed CRC. The severity of the service impact may vary depending on the type of packets corrupted (for example, control or data packets), and the rate of discard.


The Nokia SR OS series Packet Processing chip (Pchip) is a network processor device located on the IOM. It performs numerous ingress and egress traffic related functions.

The ingress FCS alarm reports are an indication that the ingress Pchip has received a packet from its on-board MDA that has failed an FCS check.

Due to the nature of the design of specific MDAs, some reports of Ingress Pchip errors may correspond to legitimate line errors. For example, Packet-over-SONET (POS) MDAs do not perform error detection on-board, and rely on their parent IOM to perform the error detection. Therefore, to accurately determine a true case of internal FCS errors, the operator must first validate that the IOM/MDA complex reporting the error is not reporting a case of external (line) FCS. See section 5.3.2 for the workflow to correctly identify and troubleshoot situations where the FCS alarms are reported by the system.

In addition to the optional log message and SNMP trap, you can also display the timestamp of the last occurrence of the event, and information about the number of times the threshold was crossed. Use the show card detail CLI command to display log event information to characterize the errors; see section 5.3.1.2 for more information.

5.3.1.1 SNMP Trap Information

• SNMP MIB: TIMETRA-CHASSIS-MIB.mib

• SNMP Trap: tmnxEqCardPChipError

Sample Event Log Entry

12 2012/04/02 15:57:34.46 UTC MINOR: CHASSIS #2059 Base "Slot 3 detected ingress FCS errors on complex 1."
11 2012/04/02 15:57:34.46 UTC MINOR: CHASSIS #2059 Base "Slot 3 detected ingress FCS errors on complex 0."

Note: In some cases, the reported Ingress Pchip errors are caused by sources external to the system (that is, incoming line errors). Read the following information carefully to avoid misdiagnosing the issue in field-reported cases.

Note: The ingress/egress FCS error trap is enabled (generated) by default in loads 5.0.R16 to 5.0.R20, 6.0.R5 to 6.0.R9, and 6.1.R1 to 6.1.R4. However, the trap is disabled (suppressed) by default in releases 5.0.R21, 6.0.R10, and 6.1.R5 and later.

The configure log event-control chassis 2059 generate CLI command can be used to enable the ingress/egress FCS error trap. See chapter 13 for more information about enabling hardware alarms.
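The release ranges given in the note on the default state of the ingress/egress FCS error trap can be expressed as a small lookup. The following Python sketch is an illustration only. It assumes release strings of the form 5.0.R16, and it assumes that branches newer than 6.1 fall under "and later"; both the function names and that interpretation are assumptions rather than statements from this guide.

```python
import re

# (major, minor) -> (first release with the trap, first release where it is suppressed by default)
BRANCHES = {(5, 0): (16, 21), (6, 0): (5, 10), (6, 1): (1, 5)}

def parse_release(text):
    """Split a release string such as '5.0.R16' into (major, minor, revision)."""
    m = re.fullmatch(r"(\d+)\.(\d+)\.?R(\d+)", text.strip())
    if not m:
        raise ValueError(f"unrecognised release string: {text!r}")
    return tuple(int(g) for g in m.groups())

def fcs_trap_default(release):
    """Return 'generate', 'suppress', or None when the note does not cover the release."""
    major, minor, rev = parse_release(release)
    if (major, minor) in BRANCHES:
        first, suppressed_from = BRANCHES[(major, minor)]
        if rev < first:
            return None  # earlier than the releases named in the note
        return "generate" if rev < suppressed_from else "suppress"
    # Assumption: branches newer than 6.1 count as "and later".
    return "suppress" if (major, minor) > (6, 1) else None

for rel in ("5.0.R16", "5.0.R21", "6.1.R4", "7.0.R2"):
    print(rel, fcs_trap_default(rel))
```

When the result is suppress, the trap has to be enabled with the event-control command described in the note before the log entries appear.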

5.3.1.2 CLI Statistics

The CLI can be used to get a report of the number of times the tmnxEqCardPChipError trap has occurred, and the last time it was raised. The information is available in the CLI context for the IOM view referencing the individual complex reporting the errors.

The following output is an example of CLI information for the Card Chip Errors trap.

*A:SR13# show card 1 detail
===============================================================================
Card 1
===============================================================================
Slot  Provisioned           Equipped              Admin     Operational
      Card-type             Card-type             State     State
-------------------------------------------------------------------------------
1     iom-20g-b             iom-20g-b             up        up

IOM Card Specific Data
    Clock source                  : none
    Named Pool Mode               : Disabled
    Available MDA slots           : 2
    Installed MDAs                : 2

Hardware Data
    Part number                   : 3HE00020AAAA01
    CLEI code                     :
[snipped…]
    Base MAC address              : 00:03:fa:0e:83:ff
    Memory capacity               : 1,024 MB

Pchip Errors Detected
    Complex 0 (ingress): Trap raised 38 times; Last Trap 05/21/2008 16:22:05
    Complex 1 (ingress): Trap raised 1 times; Last Trap 05/21/2008 16:22:05

===============================================================================

The following output is a second example of CLI information for the Card Chip Errors trap; in this output, the counters appear under the FCS Errors Detected heading.

*A:SR13# show card 1 detail
===============================================================================
Card 1
===============================================================================

Note: The CLI status and related statistics are cleared after an IOM reboot.


Slot  Provisioned           Equipped              Admin     Operational
      Card-type             Card-type             State     State
-------------------------------------------------------------------------------
1     iom-20g-b             iom-20g-b             up        up

IOM Card Specific Data
    Clock source                  : none
    Named Pool Mode               : Disabled
    Available MDA slots           : 2
    Installed MDAs                : 2

Hardware Data
    Part number                   : 3HE00020AAAA01
    CLEI code                     :
[snipped…]
    Base MAC address              : 00:03:fa:0e:83:ff
    Memory capacity               : 1,024 MB

FCS Errors Detected
    Complex 0 (ingress): Trap raised 5 times; Last Trap 04/02/2012 16:00:35
    Complex 1 (ingress): Trap raised 5 times; Last Trap 04/02/2012 16:00:35

5.3.2 Ingress FCS Error Troubleshooting Flowchart

The flowchart in Figure 13 defines the ingress FCS alarm troubleshooting steps. A process of elimination is used to isolate the issue; proceed as directed.


Figure 13 Ingress FCS Errors Troubleshooting Flowchart

[Flowchart image not reproduced in this text. The flowchart branches on whether the complex reporting the error is on an IOM or an IMM card, on whether the complex is tied to an Ethernet-based or SONET-based MDA, on whether the ingress port-level statistics report incrementing incoming line errors, and on whether ingress FCS errors are still incrementing or service impact is associated with the IOM reporting the error. The actions shown are: monitor the card; troubleshoot the source of the incoming line errors, including the far end of the link (for example, show port x detail); reseat the MDA; replace the MDA; replace the IOM (IMM); collect relevant logs and 2 tech-support files 15 minutes apart; and escalate to Nokia Support.]

Warning: Reseating or replacing the IOM/MDA reboots the hardware and causes service impact. This procedure should be performed during a scheduled service maintenance window.

Note: Startup diagnostics are executed during the reboot of the hardware. If the IOM or MDA does not boot up successfully after the power cycle, a spare IOM or MDA must be dispatched onsite. The faulty card must be replaced and returned to Nokia repair following the Nokia Repair and Return process.


5.4 Egress FCS Errors

This section describes the troubleshooting information for egress FCS errors. The topics are:

• Detecting Egress FCS Errors

• Egress FCS Error Troubleshooting Flowchart

5.4.1 Detecting Egress FCS Errors

Egress FCS errors are indicated by packet loss on traffic egressing one or more MDA complexes. The packet loss generally affects, in a random manner, all ports tied to the specific MDA that is reporting the error.

The Nokia SR OS series Packet Processing chip (Pchip) is a network processor device located on the IOM. It performs numerous ingress and egress traffic related functions.

The egress FCS alarm reports are an indication that the Pchip on the egress data path has received a packet from the switching fabric that has failed an internal FCS check.

In addition to the optional log message and SNMP trap, you can also display the timestamp of the last occurrence of the event, and information about the number of times the threshold was crossed. Use the show card detail CLI command to display log event information to characterize the errors; see section 5.4.1.2 for more information.

Warning: Egress FCS errors cause the system to discard the packets that are detected with a failed CRC. The severity of the service impact may vary depending on the type of corrupted packets (for example, control or data packets) and the rate of discard.

Note: In some cases, the IOM complex reporting the Egress FCS errors may not be the source of the problem. If more than one IOM complex is simultaneously reporting the Egress FCS alarm, the root cause of the errors may be another card in the system that is forwarding bad frames across the fabric to multiple destination IOMs. It is imperative to understand the traffic pattern traversing the node carefully to avoid misdiagnosing the issue in field-reported cases.


5.4.1.1 SNMP Trap Information

• SNMP MIB: TIMETRA-CHASSIS-MIB.mib

• SNMP Trap: tmnxEqCardPChipError

Sample Event Log Entry

13 2012/04/02 15:59:41.62 UTC MINOR: CHASSIS #2059 Base "Slot 2 detected egress FCS errors on complex 0. Source card(s) of detected errors: 2."
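When many such entries need to be reviewed, the sample entry can be split into its fields with a short script. The following Python sketch is illustrative only: the regular expressions are inferred from the single sample entry above, the function and field names are hypothetical, and other releases may format the entry differently.

```python
import re
from datetime import datetime

# Pattern inferred from the one sample entry in this section (an assumption,
# not a documented log grammar).
LOG_RE = re.compile(
    r'^(?P<seq>\d+)\s+'
    r'(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{2})\s+'
    r'(?P<tz>\S+)\s+'
    r'(?P<severity>[A-Z]+):\s+'
    r'(?P<app>\S+)\s+#(?P<event_id>\d+)\s+'
    r'(?P<router>\S+)\s*'
    r'"(?P<message>.*)"$'
)
SOURCES_RE = re.compile(r'Source card\(s\) of detected errors:\s*([\d,\s]+)\.')


def parse_log_entry(line):
    """Return a dict of fields for one event log entry, or None if no match."""
    match = LOG_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["seq"] = int(entry["seq"])
    entry["event_id"] = int(entry["event_id"])
    entry["timestamp"] = datetime.strptime(entry.pop("ts"), "%Y/%m/%d %H:%M:%S.%f")
    sources = SOURCES_RE.search(entry["message"])
    entry["source_cards"] = (
        [int(n) for n in re.findall(r"\d+", sources.group(1))] if sources else []
    )
    return entry


sample = ('13 2012/04/02 15:59:41.62 UTC MINOR: CHASSIS #2059 Base '
          '"Slot 2 detected egress FCS errors on complex 0. '
          'Source card(s) of detected errors: 2."')
entry = parse_log_entry(sample)
print(entry["event_id"], entry["severity"], entry["source_cards"])
```

The source card list is the field of most interest here, because it indicates which card forwarded the errored frames across the fabric.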

5.4.1.2 CLI Statistics

The CLI can be used to get a report of the number of times the tmnxEqCardPChipError trap has occurred, and the last time it was raised. The information is available in the CLI context for the IOM view referencing the individual complex reporting the errors.

The following output is an example of CLI information for the Card Chip Errors trap.

*A:SR13# show card 1 detail
===============================================================================
Card 1
===============================================================================
Slot  Provisioned      Equipped         Admin    Operational
      Card-type        Card-type        State    State
-------------------------------------------------------------------------------
1     iom-20g-b        iom-20g-b        up       up

IOM Card Specific Data
    Clock source                  : none
    Named Pool Mode               : Disabled
    Available MDA slots           : 2
    Installed MDAs                : 2

Hardware Data
    Part number                   : 3HE00020AAAA01
    CLEI code                     :
[snipped…]
    Base MAC address              : 00:03:fa:0e:83:ff
    Memory capacity               : 1,024 MB

FCS Errors Detected
    Complex 0 (egress): Trap raised 5 times; Last Trap 04/02/2012 15:59:42
        Sources of egress err'd packets (from last trap): Card(s) 2

Note: The ingress/egress FCS error trap is enabled (generated) by default in loads 5.0.R16 to 5.0.R20, 6.0.R5 to 6.0.R9, and 6.1.R1 to 6.1.R4. However, the trap is disabled (suppressed) by default in releases 5.0.R21, 6.0.R10, and 6.1.R5 and later.

The configure log event-control chassis 2059 generate CLI command can be used to enable the ingress/egress FCS error trap. See chapter 13 for more information about enabling hardware alarms.

Note: The CLI status and related statistics are cleared after an IOM reboot.
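When more than one IOM complex reports egress FCS errors, comparing the source cards named by each reporting card can point to a common origin. The following Python sketch assumes the show card detail output of each slot has already been captured as text. The function names are hypothetical, the patterns are inferred from the sample output in this section, and the output for slot 3 is invented for illustration.

```python
import re

# Patterns inferred from the sample "FCS Errors Detected" block (assumption).
TRAP_RE = re.compile(r"Complex (\d+) \(egress\): Trap raised (\d+) times")
SRC_RE = re.compile(
    r"Sources of egress err'd packets \(from last trap\): Card\(s\) ([\d, ]+)"
)


def egress_sources(show_card_detail_text):
    """Return the set of source cards named in one card's output."""
    sources = set()
    for match in SRC_RE.finditer(show_card_detail_text):
        sources.update(int(n) for n in re.findall(r"\d+", match.group(1)))
    return sources


def common_sources(outputs_by_slot):
    """Return the cards named as a source by every slot reporting egress FCS errors."""
    reporting = {
        slot: egress_sources(text)
        for slot, text in outputs_by_slot.items()
        if TRAP_RE.search(text)
    }
    if not reporting:
        return set()
    return set.intersection(*reporting.values())


outputs = {
    # Taken from the sample output for card 1.
    1: ("Complex 0 (egress): Trap raised 5 times; Last Trap 04/02/2012 15:59:42\n"
        "Sources of egress err'd packets (from last trap): Card(s) 2\n"),
    # Hypothetical second reporting card, for illustration only.
    3: ("Complex 1 (egress): Trap raised 2 times; Last Trap 04/02/2012 16:01:10\n"
        "Sources of egress err'd packets (from last trap): Card(s) 2\n"),
    # A card that reports no egress FCS errors is ignored.
    4: "Memory capacity : 1,024 MB\n",
}
print(sorted(common_sources(outputs)))
```

A non-empty result only suggests where to look first; the traffic pattern traversing the node still needs to be understood before a card is declared faulty.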

5.4.2 Egress FCS Error Troubleshooting Flowchart

The flowcharts in Figure 14, Figure 15, Figure 16, and Figure 17 describe the egress FCS alarm troubleshooting steps. A process of elimination is used to isolate the issue; proceed as directed.


Figure 14 Egress FCS Errors Troubleshooting Flowchart Part 1

[Flowchart not reproduced. Decision points recoverable from the figure: whether egress FCS errors are incrementing or any service impact is associated with the IOM reporting the error; whether only one card is reporting errors; whether only one source (a single source) is reported; whether egress FCS errors are still incrementing. Actions: monitor; reseat the card (or the source card) or perform a power cycle using the CLI; generate 2 TS files 15 minutes apart; take 2 TS files 15 minutes apart and escalate to Nokia Support; continue to the next flowchart.]


Figure 15 Egress FCS Errors Troubleshooting Flowchart Part 2

[Flowchart not reproduced; it continues from the previous flowchart and leads to the next one. Decision points recoverable from the figure: whether CPM5/SFM-5 are installed; whether egress FCS errors are still incrementing; whether this is the last SFM. Actions, starting with x = 1 and repeating with x = x+1: shut down SFM x; clear SFM x; replace SFM x; monitor the card(s); collect relevant logs and 2 TS files 15 minutes apart, and contact Nokia Support.]


Figure 16 Egress FCS Errors Troubleshooting Flowchart Part 3

[Flowchart not reproduced; it continues from the previous flowchart and leads to the next one. Decision points recoverable from the figure: whether CPM-3 or CPM-4 are installed and running 11.0R17, 12.0R1 or later; whether egress FCS errors are incrementing. Actions: run admin reboot standby hold; reseat the standby CPM; run admin reboot standby, synchronize the boot environment, and then perform a switchover; monitor; take 2 TS files 15 minutes apart and escalate to Nokia Support.]


Figure 17 Egress FCS Errors Troubleshooting Flowchart Part 4

[Flowchart not reproduced; it continues from the previous flowchart. Decision points recoverable from the figure: whether CPM-1, 2, 3, or 4 are installed and running an older load than 11.0R17, 12.0R1; whether egress FCS errors are incrementing. Actions: synchronize the boot-env and extract the standby CPM; reinsert the standby CPM; replace the standby CPM; perform a CPM switchover; extract the standby CPM; reinsert the standby CPM and synchronize the boot-env; monitor; take 2 TS files 15 minutes apart and escalate to Nokia Support.]


6 Troubleshooting Pchip CAM Alarms

6.1 In This Chapter

This chapter describes the troubleshooting procedures for handling Pchip Content Addressable Memory (CAM) alarms on the IOM and the IMM.

The topics in this chapter include:

• Pchip CAM Error Overview

• Pchip CAM Error Detection and Impact

• Pchip CAM Alarm Troubleshooting Flowchart

• Pchip CAM Alarms Description

• Pchip CAM Alarm Sample Reports

Note: The alarms and troubleshooting information in this chapter apply to SR routers running TiMOS software releases 6.1.R13, 7.0.R7, and later; they do not apply to SR routers running older TiMOS releases.


6.2 Pchip CAM Error Overview

The SR TiMOS software load set (6.1.R13, 7.0.R7 and later) supports the reporting (and correction, if possible) of conditions associated with the CAM located on the IOMs/IMMs and SF/CPMs.

This chapter provides information about how and when the Pchip CAM alarms are generated, and how to address the alarms when they are reported.

6.3 Pchip CAM Error Detection and Impact

The Nokia SR OS series Packet Processing chip (Pchip) is a network processor device located on the IOM. It performs various ingress and egress traffic related functions. CAM is high-speed memory that is primarily used by the Pchip to access the IP and MAC filter data, QoS, and IPv6 forwarding information base (FIB).

There will be a service impact if the affected memory sector is currently in use. While Pchip CAM errors are being reported, creating new filters increases the chance of impacting services, because an errored memory sector may be used.

The SR OS generates the tmnxEqCardPChipCamEvent alarms when errors are detected; see section 6.5 for detailed alarm information.

In addition to the optional log message and SNMP trap, the timestamp of the last occurrence of the event and number of times the threshold was crossed are also displayed by the show card <X> detail CLI command.

Note: The troubleshooting procedures in this chapter apply to the IOMs and IMMs only.

Warning: Pchip CAM memory errors may cause service impact to control and data traffic; it is not possible to determine when a Pchip CAM error will impact the traffic.

The severity of the service impact may vary depending on the end-to-end applications, rate of error increment, location of the affected memory, and other factors.


6.4 Pchip CAM Alarm Troubleshooting Flowchart

The flowchart in Figure 18 defines the troubleshooting steps for situations when Pchip CAM alarms are reported by the system. A process of elimination is used to isolate the issue; proceed as directed.


Figure 18 Pchip CAM Error Troubleshooting Flowchart

[Flowchart not reproduced. Decision points recoverable from the figure: whether CAM errors are incrementing or any service impact is associated with the IOM reporting the error; whether the card is an IOM type 1; whether new CAM alarms are being reported; whether new CAM errors are still being reported. Actions: monitor the card; remotely power cycle the IOM using the tools perform card <slot> power-cycle command; reseat (power cycle) the IOM (IMM) reporting the alarm; replace the IOM (IMM) reporting the alarm; collect relevant logs and 2 tech-support files 15 minutes apart; escalate to Nokia Support.]

Reseating, remotely power cycling, or replacing the IOM reboots the hardware and causes service impact. This procedure may need to be performed during a scheduled service maintenance window.

IMPORTANT: Startup diagnostics are executed during reboot of the IOM. If the IOM does not boot up successfully after the power cycle, a spare IOM will need to be dispatched onsite, and the faulty card will need to be replaced and sent back to Nokia repair following the Nokia Repair and Return process.


6.5 Pchip CAM Alarms Description

The SR OS generates the tmnxEqCardPChipCamEvent alarms when errors are detected.

SNMP MIB: TIMETRA-CHASSIS-MIB.mib

SNMP Trap: tmnxEqCardPChipCamEvent

The following configuration is required to disable the reporting of the tmnxEqCardPChipCamEvent alarms:

B:7x50# configure log event-control "chassis" 2076 suppress

6.6 Pchip CAM Alarm Sample Reports

This section provides event log configuration and CLI output examples of Pchip CAM alarm reports.

tmnxEqCardPChipCamEvent

The following output is an example of the event log entry and CLI output information.

Sample Event Log Entry

425391 2012/02/10 09:52:58.88 CST CRITICAL: CHASSIS #2076 Base "A fault has been detected in the hardware on IOM 2-forwarding engine 1: Please contact Alcatel-Lucent support"

Sample CLI Output

B:7x50# show card 2 detail
===============================================================================
Card 2
===============================================================================
Slot  Provisioned      Equipped         Admin    Operational
      Card-type        Card-type        State    State
-------------------------------------------------------------------------------
2     iom-20g-b        iom-20g-b        up       up

IOM Card Specific Data
    Clock source                  : none
    Named Pool Mode               : Disabled
    Available MDA slots           : 2
    Installed MDAs                : 2

Hardware Data
    Part number                   : 3HE00229ABAF01
    CLEI code                     : IPUIAM9DAA
    Serial number                 : NS101063307
    Manufacture date              : 03112010
    Manufacturing string          :
    Manufacturing deviations      :
    Administrative state          : up
    Operational state             : up
    Temperature                   : 37C
    Temperature threshold         : 75C
    Software boot (rom) version   : X-7.0.R5 on Wed Sep 30 14:13:59 PDT 2009 by*
    Software version              : TiMOS-I-7.0.R10 iom/hops ALCATEL ESS 7450 C*
    Time of last boot             : 2010/09/19 08:54:14
    Current alarm state           : alarm cleared
    Base MAC address              : 6c:be:e9:6c:a4:b2
    Last bootup reason            : hard boot
    Memory capacity               : 1,024 MB

Pchip Errors Detected
    Complex 1 (CAM error): Trap raised 1337 times; Last Trap 02/10/2012 09:44:59

B:7x50#

Note: The Pchip CAM alarm reporting is enabled by default. The alarms are not self clearing in 5620 SAM and must be cleared by an operator.

Note: The CLI status and statistics are cleared after an IOM reboot.
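The troubleshooting flowchart asks whether CAM errors are incrementing. One way to answer that is to compare the trap counts from two show card detail captures taken some time apart. The following Python sketch is a minimal illustration under stated assumptions: the pattern is inferred from the sample "Trap raised" line above, the function names are hypothetical, and the second capture is invented for the example.

```python
import re

# Pattern inferred from the sample output line (assumption).
TRAP_LINE_RE = re.compile(
    r"Complex (\d+) \(([^)]+)\): Trap raised (\d+) times; Last Trap (\S+ \S+)"
)


def trap_counts(show_card_detail_text):
    """Map (complex, error type) to the number of times the trap was raised."""
    return {
        (int(cx), kind): int(count)
        for cx, kind, count, _last in TRAP_LINE_RE.findall(show_card_detail_text)
    }


def classify(before, after):
    """Compare two captures and label each counter."""
    result = {}
    for key in sorted(set(before) | set(after)):
        old, new = before.get(key, 0), after.get(key, 0)
        if new > old:
            result[key] = "incrementing"
        elif new < old:
            # The CLI statistics are cleared after an IOM reboot.
            result[key] = "reset"
        else:
            result[key] = "stable"
    return result


first = "Complex 1 (CAM error): Trap raised 1337 times; Last Trap 02/10/2012 09:44:59"
# Hypothetical later capture, for illustration only.
second = "Complex 1 (CAM error): Trap raised 1340 times; Last Trap 02/10/2012 09:58:12"

print(classify(trap_counts(first), trap_counts(second)))
```

A counter that goes down between captures is reported as "reset" rather than "stable", because the statistics are cleared when the IOM reboots.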


7 Troubleshooting Qchip Errors

7.1 In This Chapter

This chapter describes the troubleshooting procedures for handling Qchip errors.

The topics in this chapter include:

• Qchip Error Overview

• Detecting Qchip Errors

• Qchip Error Troubleshooting Flowchart

• Qchip Alarms Description

• Fail-On-Error

Note: The alarms and troubleshooting information in this chapter apply to SR routers running SR OS software releases 9.0.R23, 10.0.R12, 11.0.R4, 10.0.R18, 11.0.R10, 12.0.R3 and later; they do not apply to SR routers running older SR OS releases.


7.2 Qchip Error Overview

The Qchip refers to the Nokia SR OS series QoS Queuing Engine chip. The Qchip is the Traffic Manager for the line card; it is responsible for frame fragment re-assembly, buffering, and forwarding frames to the Network Processor.

The SR OS software provides the following alarm reports and fail-on-error protection capability for the IOM3-XP and IMM line cards (FP2- and FP3-based cards). See section 7.4 for more information about the fail-on-error mechanism.

• Reports events associated with the Queue Buffer Memory Errors, Queue Statistic Memory Errors, and Qchip Internal Memory Errors detected on a line card (SR OS Release 9.0.R23, 10.0.R12, 11.0.R4 and later).

• Generates traps when fast protocols (such as BFD or ETH-OAM) time out due to temporary traffic forwarding interruption as a result of automatic recovery of transient errors in the internal data path (Qchip) (SR OS Release 10.0.R18, 11.0.R10, 12.0.R3).

7.3 Detecting Qchip Errors

The SR OS error detection mechanism monitors the Qchip and raises an alarm if any errors are detected in the Queue Buffer Memory, Queue Statistic Memory, or Qchip Internal Memory.

In addition to the optional log message and SNMP trap, the timestamp of the last occurrence of the event and number of times the trap was raised are also displayed by the show card <X> detail CLI command.

The SR OS generates two categories of Qchip alarms when errors are detected; see section 7.6 for detailed alarm information.

• Qchip Memory Errors

− tmnxEqCardQChipBufMemoryEvent

− tmnxEqCardQChipStatsMemoryEvent

− tmnxEqCardQChipIntMemoryEvent

• Qchip Recovery Event

− tmnxEqDataPathFailureProtImpact

Note: The troubleshooting procedures described in this chapter apply to the IOM3-XP and IMM line cards only.

7.3.1 Impact of Qchip Errors

Most Qchip errors are transient and there is generally no adverse impact if the error occurs once. Monitor the affected card if a transient Qchip alarm is raised. Follow the troubleshooting procedure described in section 7.5 to assess the system. You can also open a ticket with Nokia Technical Support to verify system health.

The severity of the service impact caused by Qchip memory errors may vary depending on several factors including end-to-end applications, rate of error increment, and location of the affected memory. Qchip memory errors will cause a service impact if the affected memory sector is currently in use.

Qchip recovery events may cause the BFD/ETH-OAM protocol to time out due to temporary traffic forwarding interruption.

7.4 Fail-On-Error

The Queue Buffer Memory Alarm, Queue Statistic Memory Alarm, and Qchip Internal Memory Alarm are triggers for the fail-on-error feature on the line card.

If fail-on-error is enabled on the line card, the operational state of the card is set to Failed upon the first trap raised. The Failed state persists until the clear card command is performed (reset), or the card is power cycled, or the card is removed and re-inserted (physical reseat).
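The behavior described above can be pictured as a small state model. The following Python sketch is only an illustration of the text, not of the SR OS implementation: the class and method names are hypothetical, the trap names are taken from the Qchip memory error list in section 7.3, and the return to the "up" state after a recovery action is an assumption (the card must still boot successfully).

```python
from dataclasses import dataclass

# Traps assumed to correspond to the three memory alarms named in this section.
TRIGGER_TRAPS = {
    "tmnxEqCardQChipBufMemoryEvent",
    "tmnxEqCardQChipStatsMemoryEvent",
    "tmnxEqCardQChipIntMemoryEvent",
}
# Actions that end the Failed state according to the text above.
RECOVERY_ACTIONS = {"clear card", "power cycle", "physical reseat"}


@dataclass
class LineCard:
    slot: int
    fail_on_error: bool = False
    oper_state: str = "up"

    def on_trap(self, trap_name):
        """The first trigger trap fails the card when fail-on-error is enabled."""
        if self.fail_on_error and trap_name in TRIGGER_TRAPS:
            self.oper_state = "failed"

    def recover(self, action):
        """Leave the Failed state through one of the documented actions."""
        if action not in RECOVERY_ACTIONS:
            raise ValueError(f"unsupported recovery action: {action}")
        # Assumption: the card returns to 'up' after a successful reboot.
        self.oper_state = "up"


protected = LineCard(slot=1, fail_on_error=True)
protected.on_trap("tmnxEqCardQChipBufMemoryEvent")
print(protected.oper_state)

unprotected = LineCard(slot=2)
unprotected.on_trap("tmnxEqCardQChipBufMemoryEvent")
print(unprotected.oper_state)
```

The model makes one point explicit: without fail-on-error, the same trap only raises the alarm and leaves the operational state unchanged.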

See chapter 13 for information about the implementation and usage of the fail-on-error feature.

Warning: Multiple Qchip memory errors may cause service impact to the control and data traffic.


7.5 Qchip Error Troubleshooting Flowchart

The flowchart in Figure 19 defines the Qchip alarm troubleshooting steps. A process of elimination is used to isolate the issue; proceed as directed.


Figure 19 Qchip Error Troubleshooting Flowchart

[Flowchart not reproduced. Decision points recoverable from the figure: whether Qchip errors are incrementing on the IOM or IMM, or any service impact is associated with the IOM or IMM reporting the error; whether the Operational State of the IOM or IMM is set to "Fail" (if fail-on-error is enabled); whether new Qchip errors are still being reported. Actions: monitor the card; collect one tech-support file; power cycle the IOM or IMM (during a maintenance window) and capture one tech-support file after the power cycle; replace the IOM or IMM reporting the alarm and capture one tech-support file after the replacement; escalate to Nokia Support and provide the tech-support files captured at each step.]

Reseating, remotely power cycling, or replacing the IOM or IMM reboots the hardware and causes service impact. This procedure may need to be performed during a scheduled service maintenance window.

Note: Startup diagnostics are executed during reboot of the IOM or IMM. If the IOM or IMM does not boot up successfully after the power cycle, a spare IOM or IMM will need to be dispatched onsite. The faulty card will need to be replaced and sent back to Nokia repair following the Nokia Repair and Return process.


7.6 Qchip Alarms Description

Table 2 lists the specific Qchip alarms and events that trigger the fail-on-error protection mechanism.

See chapter 13 for detailed information about the implementation and usage of the fail-on-error feature.

7.6.1 Reporting Qchip Alarms

Table 3 lists the configuration required to enable reporting of Qchip alarms.

Note: The Qchip alarms are not self clearing in 5620 SAM and must be cleared by an operator.

Table 2 SNMP Traps

SNMP MIB: TIMETRA-CHASSIS-MIB.mib

SNMP Trap                          Supported Release
tmnxEqCardQChipBufMemoryEvent      9.0.R23, 10.0.R12, 11.0.R4 and later
tmnxEqCardQChipStatsMemoryEvent    9.0.R23, 10.0.R12, 11.0.R4 and later
tmnxEqCardQChipIntMemoryEvent      9.0.R23, 10.0.R12, 11.0.R4 and later

Note: The Qchip alarms are suppressed by default and must be enabled to facilitate post-failure analysis.
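The commands listed in Table 3 all follow the same form and differ only in the chassis event number. The following Python sketch builds them from a lookup table; the event numbers are those given in Table 3 and in the sample log entries in this chapter, while the dictionary and function names are hypothetical. It only produces command strings and does not connect to a router.

```python
# Chassis event numbers as listed in Table 3.
QCHIP_EVENTS = {
    "tmnxEqCardQChipBufMemoryEvent": 2098,
    "tmnxEqCardQChipStatsMemoryEvent": 2099,
    "tmnxEqCardQChipIntMemoryEvent": 2101,
    "tmnxEqDataPathFailureProtImpact": 2126,
}


def event_control_commands(events, action="generate"):
    """Return one 'configure log event-control' command per event."""
    if action not in ("generate", "suppress"):
        raise ValueError(f"unsupported action: {action}")
    return [
        f'configure log event-control "chassis" {event_id} {action}'
        for event_id in events.values()
    ]


for command in event_control_commands(QCHIP_EVENTS):
    print(command)
```

Passing action="suppress" yields the commands that disable reporting again, in the same form as the suppress command shown for the Pchip CAM alarm.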

Table 3 Log Event Control Configuration

Alarm                              Configuration
tmnxEqCardQChipBufMemoryEvent      B:7x50# configure log event-control "chassis" 2098 generate
tmnxEqCardQChipStatsMemoryEvent    B:7x50# configure log event-control "chassis" 2099 generate


7.6.2 Qchip Alarm Sample Reports

This section provides event log configuration and CLI output examples of Qchip alarm reports.

tmnxEqCardQChipIntMemoryEvent

The following output is an example of the tmnxEqCardQChipIntMemoryEvent event log entry and CLI output information.

Sample Event Log Entry

249 2014/04/17 19:36:36.31 UTC MINOR: CHASSIS #2101 Base "Slot 1 experienced a qchip internal memory error occurrence on complex 0"

Sample CLI Output

B:SR-12# show card 1 detail
===============================================================================
Card 1
===============================================================================
Slot  Provisioned Type                         Admin Operational   Comments
          Equipped Type (if different)         State State
-------------------------------------------------------------------------------
1     iom3-xp                                  up    up
          iom3-xp-b
<snip>
Qchip Errors Detected
    Complex 0 (internal memory error): Trap raised 1 times; Last Trap 04/17/2014 19:36:36
===============================================================================

Table 3 Log Event Control Configuration (Continued)

Alarm                              Configuration
tmnxEqCardQChipIntMemoryEvent      B:7x50# configure log event-control "chassis" 2101 generate
tmnxEqDataPathFailureProtImpact    B:7x50# configure log event-control "chassis" 2126 generate

Note: The CLI status and statistics are cleared after an IOM / IMM / XCM reboot.

tmnxEqCardQChipBufMemoryEvent


The following output is an example of the tmnxEqCardQChipBufMemoryEvent event log entry and CLI output information.

Sample Event Log Entry

263 2014/04/17 20:06:36.31 UTC MINOR: CHASSIS #2098 Base "Slot 1 experienced a Q-chip buffer memory error occurrence on complex 0"

Sample CLI Output

Both MDAs are affected in the following example, and the whole IOM is failed as a result.

B:SR-12# show card 1 detail
===============================================================================
Card 1
===============================================================================
Slot  Provisioned Type                         Admin Operational   Comments
          Equipped Type (if different)         State State
-------------------------------------------------------------------------------
1     iom3-xp                                  up    up
          iom3-xp-b
<snip>
Qchip Errors Detected
    Complex 0 (buffer memory error): Trap raised 1 times; Last Trap 04/17/2014 20:06:36

tmnxEqCardQChipStatsMemoryEvent

The following output is an example of the tmnxEqCardQChipStatsMemoryEvent event log entry and CLI output information.

Sample Event Log Entry

266 2014/04/17 20:16:36.31 UTC MINOR: CHASSIS #2099 Base "Slot 1 experienced a Q-chip statistics memory error occurrence on complex 0"

Sample CLI Output

B:SR-12# show card 1 detail
===============================================================================
Card 1
===============================================================================
Slot  Provisioned Type                         Admin Operational   Comments
          Equipped Type (if different)         State State
-------------------------------------------------------------------------------
1     iom3-xp                                  up    up
          iom3-xp-b
<snip>
Qchip Errors Detected
    Complex 0 (statistics memory error): Trap raised 1 times; Last Trap 04/17/2014 20:16:36


tmnxEqDataPathFailureProtImpact

The following output is an example of the tmnxEqDataPathFailureProtImpact event log entry and CLI output information.

Sample Event Log Entry

369 2014/10/02 21:12:04.61 UTC MINOR: CHASSIS #2126 Base "IO Module 1 experienced a datapath failure which impacted a protocol."

Sample CLI Output

B:SR-12# show card 1 detail
===============================================================================
Card 1
===============================================================================
Slot  Provisioned Type                         Admin Operational   Comments
          Equipped Type (if different)         State State
-------------------------------------------------------------------------------
1     iom3-xp                                  up    up
          iom3-xp-b


8 Troubleshooting MLPPP over a Serial Interface

8.1 In This Chapter

This chapter describes how to troubleshoot Multilink Point-to-Point Protocol (MLPPP) problems. The topics in this chapter include:

• MLPPP Error Overview

• MLPPP Error Troubleshooting Flowchart

• To Troubleshoot One or More Inactive Links

• To Troubleshoot An Inactive Channel Group

• To Troubleshoot Traffic Issues


8.2 MLPPP Error Overview

This document describes how to use the CLI commands to troubleshoot the following MLPPP problem scenarios.

• One or more inactive links; see section 8.4

• Links are active but the group is inactive; see section 8.5

• Traffic issues when group and links are active; see section 8.6

8.3 MLPPP Error Troubleshooting Flowchart

The flowchart in Figure 20 defines the MLPPP errors troubleshooting steps. A process of elimination is used to isolate the issue; proceed as directed.


Figure 20 MLPPP Error Troubleshooting Flowchart

[Flowchart not reproduced. Decision points recoverable from the figure: whether the bundle is oper Up; whether the primary member port is up; whether there are any ports oper Down; for each port that is down, whether the port physical layer is down or bouncing. Actions: check the primary member port status (1); check the ports within the bundle (2); check the physical port status (3); check the bundle NCP layer (4); check the port LCP layers (5); fix the physical layer; escalate to Nokia Support (6). The numbers in parentheses refer to the CLI Command Reference below.]

CLI Command Reference

1: show multilink-bundle bundle-ppp-3/1.1 detail
2: show multilink-bundle bundle-ppp-3/1.1 detail
3: show port 3/1/1.1.2.1 detail
4: show multilink-bundle bundle-ppp-3/1.1 ppp
   tools dump ppp bundle-ppp-3/1.1
5: show port 3/1/1.1.1.1 ppp
   tools dump ppp 3/1/1.1.2.1
   debug ppp 3/1/1.1.2.1
6: When running MC or SC APS, the following commands may help:
   tools dump aps aps-16
   tools dump aps mc-aps-signaling


8.4 To Troubleshoot One or More Inactive Links

In this troubleshooting scenario, the primary port is up and the MLPPP bundle is operational, but one of the members is not operational.

Step 1. Check the Group/Bundle Status

Use the show multilink-bundle bundle-ppp-<X> detail CLI command to check the group/bundle status, as shown in the following sample CLI output.

A:bottom_SR7# show multilink-bundle bundle-ppp-3/1.1 detail
===============================================================================
Bundle bundle-ppp-3/1.1 Detail
===============================================================================
Description        : MultiLink Bundle
Bundle Id          : bundle-ppp-3/1.1     Type               : mlppp
Admin Status       : up                   Oper Status        : up
Minimum Links      : 1                    Bundle IfIndex     : 639631361
Total Links        : 4                    Active Links       : 3
Red Diff Delay     : 0                    Yellow Diff Delay  : 0
Red Diff Delay Act : none                 MRRU               : 1524
Short Sequence     : true                 Oper MRRU          : 1524
Oper MTU           : 1526                 Fragment Threshold : 128 bytes
Up Time            : 0d 00:04:35          Bandwidth          : 4608 KBit
PPP Input Discards : 0                    Primary Member Port: 3/1/1.1.1.1
Mode               : access
Interleave-Frag    : false
-------------------------------------------------------------------------------
Member Port Id  #TS  Admin  Oper  Act  Down Reason        Up Time
-------------------------------------------------------------------------------
3/1/1.1.1.1     24   up     up    yes  N/A                0d 00:04:37
3/1/1.1.2.1     24   up     up    no   under negotiation  N/A
3/1/1.1.3.1     24   up     up    yes  N/A                0d 00:04:37
3/1/1.1.4.1     24   up     up    yes  N/A                0d 00:04:38
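When the member table is long, the inactive links can be picked out of captured output with a short script. The following is a minimal sketch, not a Nokia tool: it assumes the member rows have the column layout shown in the sample above, and the names `ROW` and `inactive_members` are hypothetical.

```python
import re

# Member rows copied from the sample output above.
SAMPLE = """\
3/1/1.1.1.1     24   up     up    yes  N/A                0d 00:04:37
3/1/1.1.2.1     24   up     up    no   under negotiation  N/A
3/1/1.1.3.1     24   up     up    yes  N/A                0d 00:04:37
3/1/1.1.4.1     24   up     up    yes  N/A                0d 00:04:38
"""

# Assumed layout: port id, #TS, Admin, Oper, Act, Down Reason, Up Time.
# The Down Reason may contain spaces, so it is matched lazily up to the
# final Up Time column ("N/A" or "<days>d hh:mm:ss").
ROW = re.compile(
    r"^(?P<port>\S+)\s+(?P<ts>\d+)\s+(?P<admin>up|down)\s+(?P<oper>up|down)\s+"
    r"(?P<act>yes|no)\s+(?P<reason>.+?)\s+(?P<uptime>N/A|\d+d \d\d:\d\d:\d\d)\s*$"
)


def inactive_members(text):
    """Return (port, down_reason) for every member whose Act column is 'no'."""
    result = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match and match.group("act") == "no":
            result.append((match.group("port"), match.group("reason")))
    return result


print(inactive_members(SAMPLE))
```

In the sample, the only inactive member is 3/1/1.1.2.1, which is the port examined in the following steps.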

Step 2. Check the Link Status

Use the show port <X> ppp CLI command to check the link status, as shown in the following sample CLI output. The Last Change output field is a good indicator to determine if the link is down or just toggling.

*A:bottom_SR7# show port 3/1/1.1.1.1 ppp
===============================================================================
LCP Protocol for 3/1/1.1.1.1
===============================================================================
Protocol  State   Last Change          Restart Count  Last Cleared
-------------------------------------------------------------------------------
lcp       opened  01/13/2009 13:44:07  2              01/13/2009 13:29:37
===============================================================================
Keepalive statistics
Request interval   : 10                   Threshold exceeded : 0
Drop Count         : 3                    In packets         : 150
Time to link drop  : 00h00m30s            Out packets        : 150
Last cleared time  : 01/13/2009 13:29:37
PPP Header Compression
ACFC               : Disabled             PFC                : Disabled
===============================================================================
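One way to decide whether a link is toggling is to capture the lcp row twice and compare the Last Change and Restart Count fields. The sketch below illustrates that comparison under stated assumptions: the timestamp format is taken to be MM/DD/YYYY hh:mm:ss, the second snapshot is invented for illustration, and `parse_lcp_row` and `classify` are hypothetical helper names.

```python
import re
from datetime import datetime

LCP_ROW = re.compile(
    r"^lcp\s+(?P<state>.+?)\s+(?P<changed>\d\d/\d\d/\d{4} \d\d:\d\d:\d\d)\s+"
    r"(?P<restarts>\d+)\s+(?P<cleared>\d\d/\d\d/\d{4} \d\d:\d\d:\d\d)\s*$"
)


def parse_lcp_row(line):
    """Extract state, Last Change, and Restart Count from the lcp row."""
    match = LCP_ROW.match(line.strip())
    if match is None:
        raise ValueError("not an lcp row: %r" % line)
    return {
        "state": match.group("state"),
        "changed": datetime.strptime(match.group("changed"), "%m/%d/%Y %H:%M:%S"),
        "restarts": int(match.group("restarts")),
    }


def classify(first, second):
    """Compare two snapshots taken some time apart."""
    if second["changed"] > first["changed"] or second["restarts"] > first["restarts"]:
        return "toggling"
    return "stable-up" if second["state"] == "opened" else "stable-down"


# First row is from the sample above; the second is a hypothetical later capture.
FIRST = parse_lcp_row("lcp  opened  01/13/2009 13:44:07  2  01/13/2009 13:29:37")
LATER = parse_lcp_row("lcp  opened  01/13/2009 13:46:12  3  01/13/2009 13:29:37")

print(classify(FIRST, FIRST))
print(classify(FIRST, LATER))
```

A link that reports "opened" in both snapshots but with a newer Last Change value went down and recovered between the two captures, which is the toggling case described in this step.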

Step 3. Check the Physical Layer

If the link is down or toggling, it may indicate a physical layer problem. Use the show port <X> detail command to check the physical layer, as shown in the following sample CLI output. If the output shows that the port is operationally down, or the Last State Change timestamp keeps advancing between successive runs of the command, the issue is likely caused by physical layer errors.

*A:s224_72# show port 1/2/1.1.1.1.1 detail
===============================================================================
TDM DS0 Chan Group
===============================================================================
Description        : DS0GRP
Interface          : 1/2/1.1.1.1.1
TimeSlots          : 1-24
Speed              : 64                   CRC                : 16
Admin Status       : up                   Oper Status        : down
BER SF Link Down   : disabled
Last State Change  : 02/07/2011 13:34:53  Chan-Grp IfIndex   : 574652506

Configured mode    : access               Encap Type         : bcp-dot1q
Admin MTU          : 1522                 Oper MTU           : 1522
Scramble           : false
Physical Link      : no                   Bundle Number      : 1
Idle Cycle Flags   : flags                Load-balance-algo  : default
Payload Fill Type  : n/a                  Payload Pattern    : N/A
Signal Fill Type   : n/a                  Signal Pattern     : N/A
Ing. Pool % Rate   : 100                  Egr. Pool % Rate   : 100
Egr. Sched. Pol    : N/A
===============================================================================

===============================================================================
Traffic Statistics
===============================================================================
                                                   Input                 Output
-------------------------------------------------------------------------------
Octets                                                 0                      0
Packets                                                0                      0
Errors                                                 0                      0
===============================================================================
Port Statistics
===============================================================================
                                                   Input                 Output
-------------------------------------------------------------------------------
Packets                                                0                      0
Discards                                               0                      0
Unknown Proto Discards                                 0
===============================================================================


Step 4. Collect LCP Information

If the physical layer is not at fault and the link is toggling, use the tools dump ppp <X> CLI command to get detailed Link Control Protocol (LCP) information, as shown in the following sample CLI output. The command output may indicate endpoint issues or other problems (for example, a short-sequence or MRRU option was rejected by the peer, or by the SR router).

A:bottom_SR7# tools dump ppp 3/1/1.1.2.1
==============================================================================
Id             : 3/1/1.1.2.1         ppp unit       : 8
member of      : bundle-ppp-3/1.1
==============================================================================
looped back    : no                  dbgMask        : 0x0
------------------------------------------------------------------------------
LCP
------------------------------------------------------------------------------
phase          : TERMINATE           state          : REQSENT
passive        : off                 silent         : off
restart        : on
keepalive      : on                  echo num       : 28
echo timer     : off                 echos fail     : 4
echo intv      : 10                  echos pend     : 0
options        mru   asyncMap  upap  chap  magic  pfc
we negotiate   Yes   No        No    No    Yes    No
peer ack'd     Yes   No        No    No    Yes    No
we allow       Yes   No        No    No    Yes    No
we ack'd       Yes   No        No    No    Yes    No
options        acfc  lqr  mrru  shortSeq  endPoint  mlhdrfmt
we negotiate   No    No   Yes   Yes       Yes       No
peer ack'd     No    No   Yes   Yes       Yes       No
we allow       No    No   Yes   Yes       Yes       No
we ack'd       No    No   Yes   Yes       Yes       No
MLPPP Endpoint:
we want        : 10.10.2.1
we got         : 10.10.2.1
peer wants     : 10.10.2.1
peer got       : 10.10.2.1
------------------------------------------------------------------------------
IPV6CP
------------------------------------------------------------------------------
active         : no                  state          : INITIAL
options        intId  reqIntId  comp
we negotiate   Yes    No        No
peer ack'd     No     No        No
we allow       Yes    No        No
we ack'd       No     No        No
------------------------------------------------------------------------------
MPLSCP
------------------------------------------------------------------------------
active         : no                  state          : INITIAL
------------------------------------------------------------------------------
OSICP
------------------------------------------------------------------------------
active         : no                  state          : INITIAL
------------------------------------------------------------------------------
BCP
------------------------------------------------------------------------------
active         : no                  state          : INITIAL
local bcp qtag : on                  peer bcp qtag  : off
options        bridge  lineIden  macType  comp  lanIdent  mac
we negotiate   No      No        Yes      No    No        Yes
peer ack'd     No      No        No       No    No        No
we allow       No      No        Yes      No    No        Yes
we ack'd       No      No        No       No    No        No
options:       reqMac  stp  qtag  mgmtInli
we negotiate   No      No   Yes   Yes
peer ack'd     No      No   No    No
we allow       No      No   Yes   Yes
we ack'd       No      No   No    No
------------------------------------------------------------------------------
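The option tables in this output can be read mechanically: an option that appears as Yes in the "we negotiate" row but not in the "peer ack'd" row was requested locally and not acknowledged by the peer. The sketch below shows that comparison for one option block. It assumes the row labels and layout shown in the sample; `parse_option_block` and `unacknowledged` are hypothetical helper names, and the `REJECTED` block is a made-up variation used only to show a non-empty result.

```python
ROW_LABELS = ("we negotiate", "peer ack'd", "we allow", "we ack'd")


def parse_option_block(lines):
    """Turn one 'options ...' block into {row label: {option: Yes/No}}."""
    header = lines[0].split()
    if not header or header[0].rstrip(":") != "options":
        raise ValueError("block must start with an 'options' header row")
    names = header[1:]
    table = {}
    for line in lines[1:]:
        text = line.strip()
        for label in ROW_LABELS:
            if text.startswith(label):
                table[label] = dict(zip(names, text[len(label):].split()))
    return table


def unacknowledged(table):
    """Options we try to negotiate that the peer has not acknowledged."""
    ours = table["we negotiate"]
    acked = table["peer ack'd"]
    return [name for name in ours if ours[name] == "Yes" and acked.get(name) != "Yes"]


# Second LCP option block from the sample output above.
LCP_BLOCK = [
    "options        acfc  lqr  mrru  shortSeq  endPoint  mlhdrfmt",
    "we negotiate   No    No   Yes   Yes       Yes       No",
    "peer ack'd     No    No   Yes   Yes       Yes       No",
    "we allow       No    No   Yes   Yes       Yes       No",
    "we ack'd       No    No   Yes   Yes       Yes       No",
]

# Hypothetical variation: the peer does not acknowledge short sequence numbers.
REJECTED = list(LCP_BLOCK)
REJECTED[2] = "peer ack'd     No    No   Yes   No        Yes       No"

print(unacknowledged(parse_option_block(LCP_BLOCK)))
print(unacknowledged(parse_option_block(REJECTED)))
```

An empty list for an active protocol means every requested option was acknowledged. Blocks that belong to a protocol in the INITIAL state (such as BCP in the sample) have not been negotiated at all, so unacknowledged options there are expected.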

Step 5. Collect Debug LCP Protocol Traces

If the LCP diagnostics are inconclusive, the issue may be caused by a failing protocol handshake. Collect debug LCP protocol traces to isolate the problem.

Use the debug ppp <link-id> CLI command to collect the information in a log or session, as shown in the following sample CLI output.

A:bottom_SR7>config>log# info
----------------------------------------------
    log-id 50
        from debug-trace
        to console
    exit
----------------------------------------------
A:bottom_SR7# debug ppp 3/1/1.1.2.1
1 2009/01/13 13:39:40.70 UTC MINOR: DEBUG #2001 Base PPP
"PPP: 3/1/1.0x6f [fsm_timeout]
(LCP) REQSENT, retrans 6"
2 2009/01/13 13:39:40.70 UTC MINOR: DEBUG #2001 Base PPP
"PPP: 3/1/1.0x6f [fsm_sconfreq]
(LCP), state REQSENT (retransmit)"
3 2009/01/13 13:39:40.70 UTC MINOR: DEBUG #2001 Base PPP
"PPP: 3/1/1.0x6f [lcp_addci]
go->neg_endpoint=1"
4 2009/01/13 13:39:40.70 UTC MINOR: DEBUG #2001 Base PPP
"PPP: 3/1/1.0x6f [log_packet]
Sent Len 29 [LCP ConfReq id=0x29 <mrru 1524> <shortseq> <endpoint IP 0a 0a 02 01> <mru 1500> <magic 0x7f572490>]"

Note: Allow the traces to be collected for a few minutes to ensure that a full LCP handshake is captured.
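The endpoint discriminator in the ConfReq trace is printed as hexadecimal bytes. Converting it to dotted-decimal form makes it easy to compare with the "MLPPP Endpoint" values from the tools dump ppp output in Step 4. This is a small sketch that assumes the `<endpoint IP xx xx xx xx>` notation shown in the trace; `endpoint_ip` is a hypothetical helper name.

```python
import ipaddress
import re

ENDPOINT = re.compile(r"<endpoint IP ((?:[0-9a-fA-F]{2} ?){4})>")


def endpoint_ip(trace_line):
    """Return the endpoint discriminator as a dotted-decimal string, or None."""
    match = ENDPOINT.search(trace_line)
    if match is None:
        return None
    raw = bytes.fromhex(match.group(1).replace(" ", ""))
    return str(ipaddress.IPv4Address(raw))


# Packet line from the sample trace above.
LINE = ("Sent Len 29 [LCP ConfReq id=0x29 <mrru 1524> <shortseq> "
        "<endpoint IP 0a 0a 02 01> <mru 1500> <magic 0x7f572490>]")

print(endpoint_ip(LINE))  # 10.10.2.1
```

The decoded value, 10.10.2.1, matches the "we want" endpoint reported in Step 4.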


8.5 To Troubleshoot An Inactive Channel Group

In this troubleshooting scenario, the member links in the channel group are operational but the group is not active.

Before proceeding with these troubleshooting steps, perform the steps in section 8.4 to check the status of the member links; verify that the links in the bundle are stable and not toggling.

If the member links are stable, an inactive channel group generally indicates a pure NCP negotiation issue (IPCP or BCP). Perform the following steps to isolate the root cause of the problem.

Step 1. Check the Bundle NCP Status

Use the show multilink-bundle bundle-ppp-<X> ppp CLI command to check the bundle NCP status, as shown in the following sample CLI output.

A:bottom_SR7# show multilink-bundle bundle-ppp-3/1.1 ppp
===============================================================================
Bundle bundle-ppp-3/1.1 Multilink PPP Information
===============================================================================
Local EpId Class   : IP Address
Local Discriminator: 10.10.2.1
Yellow Diff Delay  : 0                    Short Sequence     : true
MRRU               : 1524                 Oper MRRU          : 1524
Interleave-Frag    : false                PPP Input Discards : 0
Multiclass Classes : 0
Ing QoS Profile Id : 0                    Egr QoS Profile Id : 0
Magic Number       : Enabled
===============================================================================
===============================================================================
PPP Protocols for bundle-ppp-3/1.1
===============================================================================
Protocol  State         Last Change          Restart Count  Last Cleared
-------------------------------------------------------------------------------
ipcp      request sent  01/13/2009 13:41:47  1              01/13/2009 13:29:37
mplscp    initial       01/13/2009 13:29:37  0              01/13/2009 13:29:37
bcp       initial       01/13/2009 13:29:37  0              01/13/2009 13:29:37
osicp     initial       01/13/2009 13:29:37  0              01/13/2009 13:29:37
ipv6cp    initial       01/13/2009 13:29:37  0              01/13/2009 13:29:37
===============================================================================
Local Mac address  : 00:16:4d:13:21:84    Remote Mac address :
Local IPv4 address : 18.1.1.1             Remote IPv4 address: 0.0.0.0
Local IPv6 address : ::
Remote IPv6 address: ::
===============================================================================

Step 2. Check the Negotiated Parameter Details

Use the tools dump ppp bundle-ppp-<X> CLI command to check further details of the negotiated parameters, as shown in the following sample CLI output.


In this example, the SR router did not acknowledge the far-end address.

*A:bottom_SR7# tools dump ppp bundle-ppp-3/1.1
==============================================================================
Id             : bundle-ppp-3/1.1    ppp unit       : 4100
==============================================================================
looped back    : no                  dbgMask        : 0x0
red delay      : 0                   yellow delay   : 0
peer ofr'd mrru: 1524                ack'd peer mrru: 1524
peer short seq : Yes
members        active  diff-delay
3/1/1.1.1.1    Yes     0
3/1/1.1.2.1    No      0
3/1/1.1.3.1    Yes     0
3/1/1.1.4.1    Yes     0
------------------------------------------------------------------------------
IPCP
------------------------------------------------------------------------------
active         : yes                 state          : REQSENT
options        addr  oldAddr  reqAddr  vj  oldVJ
we negotiate   Yes   No       Yes      No  No
peer ack'd     Yes   No       Yes      No  No
we allow       Yes   No       No       No  No
we ack'd       No    No       No       No  No
options        priDns  secDns  priNbns  secNbns
we negotiate   No      No      No       No
peer ack'd     No      No      No       No
we allow       Yes     No      No       No
we ack'd       No      No      No       No
------------------------------------------------------------------------------

Step 3. Check the NCP Protocol Handshake

Capture the NCP protocol handshake information to characterize the problem. Use the debug ppp bundle-ppp-<X> CLI command to display the handshake information, as shown in the following sample CLI output.

A:bottom_SR7>config>log# info
----------------------------------------------
    log-id 50
        from debug-trace
        to console
    exit
----------------------------------------------
A:bottom_SR7# debug ppp bundle-ppp-3/1.1
20 2009/01/13 13:46:01.85 UTC MINOR: DEBUG #2001 Base PPP
"PPP: bundle-ppp-3/1.1 [fsm_timeout
(IPCP) REQSENT, retrans 2"
21 2009/01/13 13:46:01.85 UTC MINOR: DEBUG #2001 Base PPP
"PPP: bundle-ppp-3/1.1 [fsm_sconfre
(IPCP), state REQSENT (retransmit)"
22 2009/01/13 13:46:01.85 UTC MINOR: DEBUG #2001 Base PPP
"PPP: bundle-ppp-3/1.1 [log_packet]
Sent Len 12 [IPCP ConfReq id=0xcb <addr 18.1.1.1>]"
23 2009/01/13 13:46:04.56 UTC MINOR: DEBUG #2001 Base PPP
"PPP: bundle-ppp-3/1.1 [fsm_timeout
(IPCP) REQSENT, retrans 1"
24 2009/01/13 13:46:04.56 UTC MINOR: DEBUG #2001 Base PPP
"PPP: bundle-ppp-3/1.1 [fsm_sconfre
(IPCP), state REQSENT (retransmit)"
25 2009/01/13 13:46:04.56 UTC MINOR: DEBUG #2001 Base PPP
"PPP: bundle-ppp-3/1.1 [log_packet]
Sent Len 12 [IPCP ConfReq id=0xcb <addr 18.1.1.1>]"
26 2009/01/13 13:46:07.66 UTC MINOR: DEBUG #2001 Base PPP
"PPP: bundle-ppp-3/1.1 [fsm_timeout
(IPCP) REQSENT, retrans 0"

Note: The LCP on the primary link must be stable to obtain accurate debug traces. Allow the debug to run for a few minutes to ensure that a full cycle is captured.
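In the sample trace, the same IPCP ConfReq is sent repeatedly while the "retrans" value steps from 2 down to 0 and the state stays at REQSENT. The sketch below flags that pattern in a captured trace. It rests on an assumption that is not stated in this guide: that the "retrans" value counts the retransmissions remaining, so a countdown to 0 in REQSENT means the peer never answered. `retransmit_countdown` and `peer_silent` are hypothetical helper names.

```python
import re

RETRANS = re.compile(r"\((?P<proto>[A-Z0-9]+)\) (?P<state>[A-Z]+), retrans (?P<left>\d+)")


def retransmit_countdown(trace):
    """Return (protocol, state, retrans value) for each timeout entry."""
    return [
        (m.group("proto"), m.group("state"), int(m.group("left")))
        for m in RETRANS.finditer(trace)
    ]


def peer_silent(trace, proto):
    """True if the retrans value for proto counts down to 0 while in REQSENT."""
    counts = [
        left
        for name, state, left in retransmit_countdown(trace)
        if name == proto and state == "REQSENT"
    ]
    return (
        len(counts) >= 2
        and counts == sorted(counts, reverse=True)
        and counts[-1] == 0
    )


# Timeout entries copied from the sample trace above.
TRACE = """\
(IPCP) REQSENT, retrans 2"
(IPCP) REQSENT, retrans 1"
(IPCP) REQSENT, retrans 0"
"""

print(retransmit_countdown(TRACE))
print(peer_silent(TRACE, "IPCP"))
```

If the check is positive, the far end is a likely place to continue the investigation, because the local router is sending requests and receiving no configuration response.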

8.6 To Troubleshoot Traffic Issues

In this troubleshooting scenario, the channel group and all member links are operational but there are traffic issues.

Before proceeding with troubleshooting the traffic issues, perform the steps in sections 8.4 and 8.5 to check the status (the last status change information) of the bundle and the member links; verify that the bundle and/or the links are not toggling.

Monitor the Bundle and Member Links

Use the monitor port bundle-ppp-<X> interval <X> CLI command to check the bundle status, and the monitor port <X> interval <X> CLI command to check the status of each member link, as shown in the following sample CLI output.

*A:s224_72# monitor port bundle-ppp-1/2.1 interval 3
===============================================================================
Monitor statistics for Port bundle-ppp-1/2.1
===============================================================================
                                                   Input                 Output
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
At time t = 0 sec (Base Statistics)
-------------------------------------------------------------------------------
Octets                                                 0                      0
Packets                                                0                      0
Errors                                                 0                      0
-------------------------------------------------------------------------------
At time t = 3 sec (Mode: Delta)
-------------------------------------------------------------------------------
Octets                                                 0                      0
Packets                                                0                      0
Errors                                                 0                      0
-------------------------------------------------------------------------------
At time t = 6 sec (Mode: Delta)
-------------------------------------------------------------------------------
Octets                                                 0                      0
Packets                                                0                      0
Errors                                                 0                      0

*A:s224_72# monitor port 1/2/1.1.1.1.1 interval 3
===============================================================================
Monitor statistics for Port 1/2/1.1.1.1.1
===============================================================================
                                                   Input                 Output
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
At time t = 0 sec (Base Statistics)
-------------------------------------------------------------------------------
Octets                                                 0                      0
Packets                                                0                      0
Errors                                                 0                      0
-------------------------------------------------------------------------------
At time t = 3 sec (Mode: Delta)
-------------------------------------------------------------------------------
Octets                                                 0                      0
Packets                                                0                      0
Errors                                                 0                      0
-------------------------------------------------------------------------------
At time t = 6 sec (Mode: Delta)
-------------------------------------------------------------------------------
Octets                                                 0                      0
Packets                                                0                      0
Errors                                                 0                      0
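In delta mode, each sample is read here as the change in the counters since the previous sample, so an average rate can be derived from the octet delta and the monitor interval. The following sketch shows the arithmetic. The delta value is hypothetical (the sample output above shows zero traffic), the 4608 KBit bandwidth is borrowed from the bundle in section 8.4 purely for illustration, and "KBit" is assumed to mean 1000 bits per second.

```python
def rate_bps(delta_octets, interval_s):
    """Average bit rate over one monitor interval."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return delta_octets * 8 / interval_s


def utilization_pct(delta_octets, interval_s, bandwidth_kbps):
    """Average utilization as a percentage of the bundle bandwidth."""
    return 100.0 * rate_bps(delta_octets, interval_s) / (bandwidth_kbps * 1000)


# Hypothetical delta of 864000 octets over a 3-second interval.
print(rate_bps(864_000, 3))               # 2304000.0 bits per second
print(utilization_pct(864_000, 3, 4608))  # 50.0 percent of 4608 kbit/s
```

A bundle that reports zero deltas in both directions while traffic is expected, as in the sample output, points to a forwarding problem rather than congestion.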


9 Troubleshooting Multicast Issues

9.1 In This Chapter

This chapter provides information about troubleshooting multicast issues on an SR-series router, specifically multicast networks running Protocol Independent Multicast-Sparse Mode (PIM-SM) or L3 networks. NG-MVPN (using mLDP and P2MP LSPs), and PIM and Internet Group Management Protocol (IGMP) configuration errors are beyond the scope of this chapter.

The topics in this chapter are:

• PIM-SM and IGMP Network Overview

• Multicast Troubleshooting Tools

• Workflow to Troubleshoot Multicast Problems

• Troubleshooting a Problem Isolated to One or More SR Routers

• Troubleshooting Hardware Issues and Queue Discards

Note: This chapter provides a set of guidelines to use in the problem-solving process; it is not intended for use as a comprehensive procedure to treat multicast issues. Additional troubleshooting steps may be required to resolve the problem.

Before verifying a multicast issue, ensure that there are no preexisting network problems.


9.2 PIM-SM and IGMP Network Overview

Figure 21 illustrates the general concept of multicast group management with the Internet Group Management Protocol (IGMP).

Figure 21 IGMP Concept

PIM-SM Summary

• PIM-SM is not a flood and prune mechanism; it requires explicit joins.

• PIM-SM relies on the underlying IGP protocols to make its routing decisions.

• PIM-SM natively uses a Rendezvous Point (RP) as the root of a shared tree; sources send data to the RP, which distributes it to receivers over the shared tree.

• PIM-SM, like other multicast protocols, uses Reverse Path Forwarding (RPF).

• RPF forwards a multicast packet only if it is received on the interface that the router uses to route to the source.
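The RPF rule in the last bullet can be expressed as a short check: look up the source address in the unicast routing table and accept the packet only if it arrived on the interface of the best matching route. This is a conceptual sketch, not the SR OS implementation; the route table is hypothetical, the interface names are borrowed from later examples in this chapter, and `rpf_interface` and `rpf_accept` are illustrative names.

```python
import ipaddress


def rpf_interface(unicast_routes, source):
    """Interface of the longest-prefix route that covers the source address."""
    address = ipaddress.ip_address(source)
    best = None
    for prefix, ifname in unicast_routes.items():
        network = ipaddress.ip_network(prefix)
        if address in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, ifname)
    return best[1] if best else None


def rpf_accept(unicast_routes, source, arrival_if):
    """Accept only if the packet arrived on the interface used to reach the source."""
    expected = rpf_interface(unicast_routes, source)
    return expected is not None and expected == arrival_if


# Hypothetical unicast routes: prefix -> outgoing interface.
ROUTES = {
    "60.60.60.0/24": "int-to-94",
    "0.0.0.0/0": "int-to-96",
}

print(rpf_accept(ROUTES, "60.60.60.2", "int-to-94"))  # True
print(rpf_accept(ROUTES, "60.60.60.2", "int-to-96"))  # False
```

A packet that fails this check is counted as an RPF mismatch and is not forwarded, which is why a unicast routing change can interrupt a multicast stream.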

[Figure 21 labels, "Multicast IGMP in a nutshell": Multicast stream is offered by one or more multicast servers. Server offers stream on a multicast address, e.g. 225.0.0.1. Multicast stream is required by one or more multicast clients (Receiver_A, Receiver_B, Receiver_C). Client sends report requesting multicast group, e.g. 225.0.0.1. One router (per LAN) is querier: sends periodic query messages. Router detects the match and transmits multicast stream 225.0.0.1 to the client.]


9.3 Multicast Troubleshooting Tools

This section describes the troubleshooting tools that can be used for problem diagnostics in a multicast network.

9.3.1 MTRACE

The MTRACE command traces the multicast path from a source to a receiver by passing a trace query hop-by-hop along the reverse path from the receiver to the source. At each hop, information such as the hop address, routing error conditions, and packet statistics is gathered and returned to the requester. A network administrator can use this information to determine where multicast flows stop and to verify the flow of the multicast stream.

The CLI context and syntax are as follows:

mtrace source a.b.c.d group w.x.y.z

Where: a.b.c.d is the source address, and w.x.y.z is the group address

Figure 22 shows a CLI example and sample output for the MTRACE command.

Figure 22 MTRACE Sample Output

A:Dut-F# mtrace source 10.10.16.9 group 224.5.6.7
Mtrace from 10.10.16.9 via group 224.5.6.7
Querying full reverse path...
 0  ? (10.10.10.6)
-1  ? (10.10.10.5) PIM thresh^ 1 No Error
-2  ? (10.10.6.4) PIM thresh^ 1 No Error
-3  ? (10.10.4.2) PIM thresh^ 1 Reached RP/Core
-4  ? (10.10.1.1) PIM thresh^ 1 No Error
-5  ? (10.10.2.3) PIM thresh^ 1 No Error
-6  ? (10.10.16.9)
Round trip time 29 ms; total ttl of 5 required.
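To find where a trace stops on a long path, the hop lines can be parsed and the first hop with an unexpected status reported. The sketch below assumes the hop line layout shown in Figure 22 and treats "No Error" and "Reached RP/Core" as the only non-error statuses, which is an assumption made for this example; `parse_hops` and `first_problem` are hypothetical names.

```python
import re

HOP = re.compile(
    r"^(?P<hop>-?\d+)\s+\?\s+\((?P<addr>[\d.]+)\)"
    r"(?:\s+(?P<proto>\S+)\s+thresh\^\s*(?P<thresh>\d+)\s+(?P<status>.+))?$"
)

# Assumption: statuses that do not indicate a problem.
OK_STATUSES = {"No Error", "Reached RP/Core"}


def parse_hops(text):
    """Return one dict per hop line; lines that are not hops are skipped."""
    hops = []
    for line in text.splitlines():
        match = HOP.match(line.strip())
        if match:
            hops.append({
                "hop": int(match.group("hop")),
                "addr": match.group("addr"),
                "status": match.group("status"),
            })
    return hops


def first_problem(hops):
    """First hop that reports a status outside OK_STATUSES, or None."""
    for hop in hops:
        if hop["status"] is not None and hop["status"] not in OK_STATUSES:
            return hop
    return None


# Hop lines from Figure 22.
SAMPLE = """\
 0  ? (10.10.10.6)
-1  ? (10.10.10.5) PIM thresh^ 1 No Error
-2  ? (10.10.6.4) PIM thresh^ 1 No Error
-3  ? (10.10.4.2) PIM thresh^ 1 Reached RP/Core
-4  ? (10.10.1.1) PIM thresh^ 1 No Error
-5  ? (10.10.2.3) PIM thresh^ 1 No Error
-6  ? (10.10.16.9)
Round trip time 29 ms; total ttl of 5 required.
"""

hops = parse_hops(SAMPLE)
print(len(hops), first_problem(hops))
```

For the sample in Figure 22, every hop reports a clean status, so no problem hop is returned and the RP is identified at 10.10.4.2.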


9.3.2 MSTAT

The MSTAT command traces a multicast path from a source to a receiver and displays multicast packet rate and loss information. The command adds the capability to show the multicast path in a limited graphic display and to report drops, duplicates, TTLs, and delays at each node. Network operators can use this information to identify nodes with high drop and duplicate counts; duplicate counts are shown as negative drops. The following example shows the CLI command and sample output of the MSTAT command.

R4# mstat source 60.60.60.2 group 239.1.1.1
Mtrace from 60.60.60.2 via group 239.1.1.1
Querying full reverse path...
Waiting to accumulate statistics...Results after 10 seconds:

  Source        Response Dest    Overall     Packet Statistics For Traffic From
60.60.60.2      97.97.97.97      Mcast Pkt   60.60.60.2 To 239.1.1.1
     |       __/  rtt 58.0ms     Rate        Lost/Sent = Pct  Rate
     v      /                    -------     ---------------------
60.60.60.1
10.10.10.1      ?
     |     ^      ttl    2       7440 pps    0/74405= 0%  7440 pps
     v     |
10.10.10.2
10.10.10.5      ? Reached RP/Core
     |     ^      ttl    3       7440 pps    -2/74405= 0%  7440 pps
     v     |
10.10.10.6
10.10.10.10     ?
     |      \__   ttl    4       7440 pps    0/74407= 0%  7440 pps
     v         \
10.10.10.9      97.97.97.97
  Receiver      Query Source
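Because duplicates are reported as negative drops, the Lost/Sent column can be scanned for negative values to separate duplication from loss. This is a minimal sketch that assumes the statistics lines have the "ttl N ... Lost/Sent= Pct" layout shown in the sample above; `hop_stats` and `hops_with_duplicates` are hypothetical names.

```python
import re

STATS = re.compile(
    r"ttl\s+(?P<ttl>\d+)\s+(?P<rate>\d+) pps\s+"
    r"(?P<lost>-?\d+)/(?P<sent>\d+)\s*=\s*(?P<pct>\d+)%"
)


def hop_stats(text):
    """Return (ttl, lost, sent) for each statistics line in the output."""
    return [
        (int(m.group("ttl")), int(m.group("lost")), int(m.group("sent")))
        for m in STATS.finditer(text)
    ]


def hops_with_duplicates(stats):
    """TTL values of hops where the lost count is negative (duplicates)."""
    return [ttl for ttl, lost, _sent in stats if lost < 0]


# Statistics lines from the sample output above.
SAMPLE = """\
     |     ^      ttl    2       7440 pps    0/74405= 0%  7440 pps
     |     ^      ttl    3       7440 pps    -2/74405= 0%  7440 pps
     |      \\__   ttl    4       7440 pps    0/74407= 0%  7440 pps
"""

stats = hop_stats(SAMPLE)
print(stats)
print(hops_with_duplicates(stats))
```

In the sample, only the hop at ttl 3 shows a negative value (-2), which indicates two duplicate packets and no loss.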

9.4 Workflow to Troubleshoot Multicast Problems

Table 4 lists the workflow to troubleshoot multicast problems.

Table 4 Workflow to Troubleshoot a Multicast Problem

Workflow Task | Chapter or Section

Characterize and isolate the problem | Section 9.4.1, Isolating the Multicast Problem

Which routers are affected by the problem? | Section 9.5, Troubleshooting a Problem Isolated to One or More SR Routers

Are there any hardware errors or queue discards? | Section 9.6, Troubleshooting Hardware Issues and Queue Discards


9.4.1 Isolating the Multicast Problem

It is essential to identify and isolate a problem before you can fix it. Table 5 provides a list of questions that can help you gather information to characterize the problem and narrow the scope to a specific network element (NE) or a set of NEs. Use this list to find the problem statement that most closely matches your situation. After you have successfully isolated the problem, use the procedures described in this chapter to troubleshoot the multicast issues.

Note: If the problem is isolated to a device that is not a 7x50 router, the information described in section 9.5 still applies, and can be used for problem isolation and verification.

Table 5 Characterize the Problem Questionnaire

Question: How many users are affected by the problem and where are they located?

Analysis:

Single User

• A problem that is limited to a single user points to a possible issue with the last mile.

Multiple Users

• Multiple users affected across multiple 7x50s - Indicates that the issue is probably upstream.

• Multiple users connected to one 7x50 - Indicates that the issue is probably on this 7x50 or a downstream device, such as a DSLAM.

Question: Which DSLAMs are affected by the problem?

Analysis:

• If the DSLAMs aggregate to a common node, try to isolate a common card.

• If the DSLAMs aggregate to multiple 7x50 SR/ESS nodes, determine if there is a common node toward the multicast source (that is, 7750 SR, Juniper, Cisco, and so on).

• If a common 7750 SR is identified as the problem, try to isolate a card that is common to 7x50 SR/ESS.

Question: Does the problem occur on a single channel or on multiple channels?

Analysis:

Single Channel

• A problem that is limited to a single channel points to a possible issue with a single source.

Multiple Channels

• A problem occurring on multiple channels points to a larger problem that is probably related to a network issue.

− Is there a common source for the affected channels?

− Is there a common 7x50 connected to the set of affected sources?


9.5 Troubleshooting a Problem Isolated to One or More SR Routers

Use the troubleshooting information in this section if the questionnaire in Table 5 has isolated the problem to a single router or a set of routers.

9.5.1 Flowchart to Troubleshoot a Problem Isolated to One SR-Series Router

The flowchart in Figure 23 defines the troubleshooting steps; proceed as directed. A process of elimination is used to isolate the issue, or direct you to troubleshoot upstream (toward the source) or downstream (toward the receiver), depending on the symptoms you observe.

Table 5 Characterize the Problem Questionnaire (Continued)

Question: What is the end-user impact? Is there any video degradation or channel blackout?

Analysis:

Video Degradation

• Pixelization may be caused by a hardware-related issue or a capacity issue on one of the network elements.

Blackout

• Channel blackout can occur if the source is not sending traffic, or if the joins are not propagated correctly to receive traffic.

Note: This information can also be used to troubleshoot problems isolated to a device that is not an SR router.


Figure 23 PIM Output Troubleshooting Flowchart

[Flowchart. Decision boxes: Problem isolated to one router? Is there an ingress interface? Node directly connected to the source? Is there an egress interface? Debug PIM packets, are joins coming in? Is there egress traffic? Drop counters incrementing? Action boxes: Check ingress and egress interfaces for PIM group. Check the multicast source. Restart troubleshooting the upstream router. Restart troubleshooting the downstream router. Use ip-filters to verify if the affected group(s) are egressing. This issue is not on this node, restart troubleshooting one hop downstream. Contact Nokia Support. Other labels: Step 1; Go to Step 1 (three occurrences).]

9.5.2 Check the PIM Output

Use the show router pim group CLI command to check the PIM output. The command syntax is as follows:

show router pim group a.b.c.d detail | match expression "Type|Curr|ets"

Run the command multiple times for an affected PIM group on a single 7x50, and collect the following information for the PIM source group:

• Check the type of join

− *,G and S,G

− S,G only

− *,G only

• Note the ingress and egress interface

• Verify that a forwarding rate is displayed

Figure 24 shows an example output of the show router pim group information, and the relevant details are highlighted.


Figure 24 PIM Group Sample Output

7750SR# show router pim group 239.1.1.1 detail

PIM Source Group ipv4

Group Address      : 239.1.1.1
Source Address     : *
RP Address         : 95.95.95.95
Flags              :                    Type               : (*,G)
MRIB Next Hop      :
MRIB Src Flags     : self               Keepalive Timer    : Not Running
Up Time            : 0d 00:00:41        Resolved By        : rtable-u

Up JP State        : Joined             Up JP Expiry       : 0d 00:00:19
Up JP Rpt          : Not Joined StarG   Up JP Rpt Override : 0d 00:00:00

Rpf Neighbor       :
Incoming Intf      :
Outgoing Intf List : int-to-96

Curr Fwding Rate   : 0.0 kbps
Forwarded Packets  : 0                  Discarded Packets  : 0
Forwarded Octets   : 0                  RPF Mismatches     : 0
Spt threshold      : 0 kbps             ECMP opt threshold : 7
Admin bandwidth    : 1 kbps

PIM Source Group ipv4

Group Address      : 239.1.1.1
Source Address     : 60.60.60.2
RP Address         : 95.95.95.95
Flags              : spt, rpt-prn-des   Type               : (S,G)
MRIB Next Hop      : 10.10.10.1
MRIB Src Flags     : remote             Keepalive Timer Exp: 0d 00:03:23
Up Time            : 0d 00:00:08        Resolved By        : rtable-u

Up JP State        : Joined             Up JP Expiry       : 0d 00:00:51
Up JP Rpt          : Pruned             Up JP Rpt Override : 0d 00:00:00

Register State     : No Info
Reg From Anycast RP: No

Rpf Neighbor       : 10.10.10.1
Incoming Intf      : int-to-94
Outgoing Intf List : int-to-96

Curr Fwding Rate   : 2738.1 kbps
Forwarded Packets  : 62863              Discarded Packets  : 0
Forwarded Octets   : 2891698            RPF Mismatches     : 0
Spt threshold      : 0 kbps             ECMP opt threshold : 7
Admin bandwidth    : 1 kbps

Groups : 2

Figure 24 callouts: Check the type of join (*,G and S,G; S,G only). Note the ingress and egress interfaces from this output.
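The checks listed for the PIM group output (join type, ingress and egress interfaces, forwarding rate) can also be applied to captured CLI text. The following Python sketch is illustrative only: the function names and dictionary keys are hypothetical, and the parsing assumes that the field labels appear exactly as in the Figure 24 sample output.

```python
import re

# One (S,G) entry, abbreviated from the Figure 24 sample output.
SAMPLE_ENTRY = """\
Group Address      : 239.1.1.1
Source Address     : 60.60.60.2
Flags              : spt, rpt-prn-des   Type               : (S,G)
Incoming Intf      : int-to-94
Outgoing Intf List : int-to-96
Curr Fwding Rate   : 2738.1 kbps
"""


def _field(text, label):
    """Return the value after 'label :', or '' if the field is absent or empty."""
    # A value ends at a gap of two or more spaces (next column) or at end of line.
    pattern = re.escape(label) + r"\s*:[ \t]*(.*?)(?:\s{2,}|$)"
    match = re.search(pattern, text, re.MULTILINE)
    return match.group(1).strip() if match else ""


def summarize_pim_entry(text):
    """Collect the fields that the troubleshooting checks rely on."""
    rate_text = _field(text, "Curr Fwding Rate")  # for example "2738.1 kbps"
    return {
        "type": _field(text, "Type"),
        "incoming": _field(text, "Incoming Intf"),
        "outgoing": _field(text, "Outgoing Intf List"),
        "rate_kbps": float(rate_text.split()[0]) if rate_text else 0.0,
    }


print(summarize_pim_entry(SAMPLE_ENTRY))
```

An empty "incoming" or "outgoing" value corresponds to the missing-interface cases, and a rate of 0.0 indicates that no forwarding rate is displayed for the entry.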

9.5.3 Discard Counters

If discard counters (RPF mismatches or discarded packets) are incrementing for the PIM group, as shown in the example output in Figure 25, contact Nokia Technical Support for further assistance.

Figure 25 Drop Counters Incrementing
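Whether a counter is "incrementing" can only be judged by comparing two readings taken some time apart. The Python sketch below shows one way to do that comparison; the function name and the dictionary layout are assumptions, and the counter values are taken from the sample outputs in this section.

```python
# Counter names as they appear in the show router pim group output.
WATCHED = ("Discarded Packets", "RPF Mismatches")


def incrementing_counters(earlier, later, watched=WATCHED):
    """Return the watched counters whose value grew between two readings."""
    return [name for name in watched if later.get(name, 0) > earlier.get(name, 0)]


# Two readings of the (S,G) entry for group 239.1.1.1.
first = {"Forwarded Packets": 62863, "Discarded Packets": 0, "RPF Mismatches": 0}
second = {"Forwarded Packets": 367925, "Discarded Packets": 0, "RPF Mismatches": 0}

print(incrementing_counters(first, second))
```

A non-empty result corresponds to the condition under which this section advises contacting Nokia Technical Support.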

9.5.4 No Egress Interface

If there is no egress interface in the PIM group information, run the debug router pim command to check if a join is being received.

• If a join is received and there is no egress interface, contact Nokia Technical Support for further assistance.

• If no join is received, troubleshoot downstream toward the receiver by restarting the troubleshooting procedures at section 9.5.2.

Figure 26 shows the CLI context to debug PIM join/prune packets.

Figure 26 Debug PIM CLI Context

7750SR# show router pim group 239.1.1.1 detail | match expression "Type|Curr|ets"
Flags              :                    Type               : (*,G)
Curr Fwding Rate   : 0.0 kbps
Forwarded Packets  : 0                  Discarded Packets  : 0
Forwarded Octets   : 0                  RPF Mismatches     : 0
Flags              : spt, rpt-prn-des   Type               : (S,G)
Curr Fwding Rate   : 2738.3 kbps
Forwarded Packets  : 367925             Discarded Packets  : 0
Forwarded Octets   : 16924550           RPF Mismatches     : 0

*A:SR7_IMPM# debug router pim jp
  - jp [group <grp-ip-address>] [source <ip-address>] [detail]
  - no jp

 <grp-ip-address>     : multicast group address(ipv4/ipv6) or zero
 <ip-address>         : source address(ipv4/ipv6)
 <detail>             : keyword

Figure 27 shows the example CLI for debugging PIM joins and logging the information to log-id 20.

Figure 27 CLI Debug and Log PIM Join Example

9.5.5 No Ingress Interface

If there is no ingress interface in the PIM group information, as shown in the example in Figure 28, then proceed as follows:

• If the node is directly connected to the source, check the source.

• If the node is not connected directly to the source, troubleshoot downstream toward the receiver by restarting the troubleshooting procedures at section 9.5.2.

Figure 28 No Ingress Interface Sample Output

*A:SR7_IMPM# configure log log-id 20
*A:SR7_IMPM>config>log>log-id$ from debug-trace
*A:SR7_IMPM>config>log>log-id$ to memory 1024
*A:SR7_IMPM# debug router pim jp group 239.1.1.1 source 5.6.7.8

R4# mstat source 60.60.60.2 group 239.1.1.1
Mtrace from 60.60.60.2 via group 239.1.1.1
Querying full reverse path...
Waiting to accumulate statistics...Results after 10 seconds:

  Source        Response Dest    Overall     Packet Statistics For Traffic From
60.60.60.2      97.97.97.97      Mcast Pkt   60.60.60.2 To 239.1.1.1
     |       __/  rtt 58.0ms     Rate        Lost/sent = Pct  Rate
     v      /
60.60.60.1
10.10.10.1      ?
     |     |    ttl 2            7440 pps    0/74405= 0%      7440 pps
     v     |
10.10.10.2
10.10.10.5      ?  Reached RP/Core
     |     |    ttl 3            7440 pps    -2/74405= 0%     7440 pps
     v     |
10.10.10.6
10.10.10.10     ?
     |      \__ ttl 4            7440 pps    0/74407= 0%      7440 pps
     v         \
10.10.10.9      97.97.97.97
  Receiver      Query Source

9.5.6 No Errors Indicated in CLI Output

If the PIM group information indicates no incrementing errors, and both ingress and egress interfaces are displayed, then proceed as follows:

• Apply an egress IP-filter to determine if traffic is actually being forwarded.

Figure 29 shows a CLI filter log configuration example.

Figure 29 Filter Log Configuration Example

Figure 30 shows an example CLI to configure an IP filter and log the hits for entry 10, and Figure 31 shows how to apply the IP filter.

Figure 30 IP Filter Configuration Example

Figure 31 Applying An IP Filter Example

*A:SR7_IMPM# configure filter log 102 create
*A:SR7_IMPM>config>filter>log$ destination memory 1024

*A:SR7_IMPM# configure filter ip-filter 10 create
*A:SR7_IMPM>config>filter>ip-filter$ info
        default-action forward
        entry 10 create
            match
                dst-ip 239.1.1.1/32
            exit
            log 102
            action forward
        exit
*A:SR7_IMPM>config>filter>ip-filter$

*A:SR7_IMPM# configure router interface "toIxia5/7"
*A:SR7_IMPM>config>router>if# info
        address 10.10.10.1/24
        port 1/1/2
        egress
            filter ip 10
        exit
*A:SR7_IMPM>config>router>if#

Figure 32 shows the CLI command to display the filter hits.

Figure 32 Show Filter CLI Command

• If the show router pim group command displays a forwarding rate, but you can verify that traffic is not being forwarded, contact Nokia Technical Support for further assistance.

• If the filters confirm that traffic is being forwarded, restart the troubleshooting procedure on the downstream router.
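The decisions described in sections 9.5.3 through 9.5.6 can be summarized as a single function. This Python sketch is a reading aid, not a Nokia tool: the function name, the parameter names, and the order in which the conditions are tested are assumptions, and the returned strings paraphrase the actions given in the text.

```python
def next_multicast_step(discards_incrementing, has_ingress, has_egress,
                        join_received=False, source_directly_connected=False,
                        filter_confirms_forwarding=False):
    """Map observations from the PIM group output to the action in the text."""
    if discards_incrementing:                      # section 9.5.3
        return "contact Nokia Technical Support"
    if not has_egress:                             # section 9.5.4
        if join_received:
            return "contact Nokia Technical Support"
        return "restart the procedure at section 9.5.2"
    if not has_ingress:                            # section 9.5.5
        if source_directly_connected:
            return "check the source"
        return "restart the procedure at section 9.5.2"
    # Section 9.5.6: both interfaces are present and no errors are incrementing.
    if filter_confirms_forwarding:
        return "restart the procedure on the downstream router"
    return "contact Nokia Technical Support"


# No egress interface and no join seen in the debug output.
print(next_multicast_step(False, has_ingress=True, has_egress=False))
```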

9.6 Troubleshooting Hardware Issues and Queue Discards

This section describes the troubleshooting procedure for hardware errors and queue-level drops.

The topics in this section include:

• Hardware Errors and Queue Discards Troubleshooting Flowchart

• IOM and MDA Errors

• Port-Level Errors

• Queue-Level Drops

9.6.1 Hardware Errors and Queue Discards Troubleshooting Flowchart

The flowchart in Figure 33 defines the troubleshooting steps; proceed as directed. A process of elimination is used to isolate and troubleshoot the issue.

*A:SR7_IMPM# show filter log 102

Figure 33 Port Or Queue Level Drops Troubleshooting Flowchart

[Flowchart not reproduced. Decision points: Hardware errors on IOM or MDA? Incrementing physical layer errors on ingress and egress? Drops incrementing on ingress ports? Is traffic within QoS policy limits? Is IMPM enabled? More than 2G hi or 2G low priority multicast per ingress complex? Are groups being blackholed? Is the IMPM policy blackholing? Actions: Check IOM and MDA. Resolve hardware errors before proceeding. Check fiber, SFPs, and transmission equipment. Check QoS policy. Modify QoS policy or enable IMPM. Check tools dump mcast-path-mgr blackholedsgs command output. Change IMPM policy. Contact Nokia Support if you see forwarding engine drops or need help with the QoS policy.]

9.6.2 IOM and MDA Errors

You can troubleshoot IOM and MDA errors using the show card detail and show mda detail CLI commands. Contact Nokia Technical Support for further assistance if you detect hardware errors on the IOM or MDA.

Figure 34 shows an example of troubleshooting IOM errors and the information output by the CLI command.

Figure 34 IOM Errors Example CLI

Figure 35 shows an example of troubleshooting MDA errors and the information output by the CLI command.

Figure 35 MDA Errors Example CLI

9.6.3 Port-Level Errors

Check the ingress and egress ports and correct any physical layer errors on the ports. See section 9.5.2 for information about identifying the ingress and egress ports using the show CLI command output.

7750SR# show card detail | match expression "^Card|Time|Trap"
Card 1
    Time of last boot: 2010/06/08 03:32:02
    Complex 0 (ingress): Trap raised 380 times; Last Trap 07/21/2010 16:22:05
    Complex 0 (egress): Trap raised 559 times; Last Trap 07/21/2010 16:22:05
    Complex 1 (ingress): Trap raised 100 times; Last Trap 07/21/2010 16:22:05
    Complex 1 (egress): Trap raised 560 times; Last Trap 07/21/2010 16:22:05
Card A
    Time of last boot: 2011/11/17 01:13:36
Card B
    Time of last boot: 2011/11/17 01:13:17

7750SR# show mda detail | match expression "^MDA|Time|Trap"
MDA 1/1 detail
MDA Specific Data
    Time of last boot: 2010/06/08 03:32:04
    XPL Errors: Trap raised 1 times; Last Trap 07/10/2010 16:26:01
MDA 1/2 detail
MDA Specific data
    Time of last boot: 2011/11/17 01:03:18

After you have determined the ingress and egress ports, use the show port 1/2/3 detail CLI command to check for physical layer errors. Figure 36 shows an example CLI output of troubleshooting physical layer errors.

Figure 36 Physical Layer Errors Example CLI

If the show CLI command output indicates that physical layer errors are incrementing, you have isolated the problem. Check the following to resolve the errors:

• Clean the fiber.

• Check the transmission equipment.

• If digital diagnostics (DDM SFP) is enabled, check the SFPs and related information from the show port detail command.

If the errors persist, contact Nokia Technical Support for further assistance.
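To decide whether physical layer errors are incrementing, two captures of the port statistics can be compared. The Python sketch below parses the two-column "Name : value" layout of the Ethernet-like Medium Statistics block and reports the counters that grew. The function names are hypothetical, and the non-zero values in the second capture are invented for illustration; they do not come from a real device.

```python
import re


def parse_counters(text):
    """Parse 'Name : value' pairs (one or two per line) into a dict of ints."""
    pairs = re.findall(r"([A-Za-z][A-Za-z ]*?)\s*:\s*(\d+)", text)
    return {name.strip(): int(value) for name, value in pairs}


def increasing(earlier_text, later_text):
    """Return {counter: growth} for every counter that grew between captures."""
    before = parse_counters(earlier_text)
    after = parse_counters(later_text)
    return {name: after[name] - before.get(name, 0)
            for name in after if after[name] > before.get(name, 0)}


# First capture: all zero, as in the sample output for port 6/1/24.
FIRST = "FCS Errors : 0 Mult Collisions : 0\nSymbol Errors : 0 Int MAC Rx Errs : 0\n"
# Second capture: hypothetical values, invented to show a non-empty result.
SECOND = "FCS Errors : 12 Mult Collisions : 0\nSymbol Errors : 3 Int MAC Rx Errs : 0\n"

print(increasing(FIRST, SECOND))
```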

9.6.4 Queue-Level Drops

Use the show port detail CLI command to check for queue-level drops.

Figure 37 shows a CLI example output for IES 1 SAP ID 3/1/21; the incrementing forwarding engine drops are shown in red. Contact Nokia Technical Support for further assistance if you identify forwarding engine drops in the network.

7750SR# show port 6/1/24 detail | match "Ethernet-like" post-lines 10
Ethernet-like Medium Statistics

Alignment Errors : 0          Sngl Collisions  : 0
FCS Errors       : 0          Mult Collisions  : 0
SQE Test Errors  : 0          Late Collisions  : 0
CSE              : 0          Excess Collisns  : 0
Too long Frames  : 0          Int MAC Tx Errs  : 0
Symbol Errors    : 0          Int MAC Rx Errs  : 0
In Pause Frames  : 0          Out Pause Frames : 0

Figure 37 Incrementing Forwarding Engine Drops Sample Output

Figure 38 shows a CLI example output for ingress queue drops; the drops for Ingress Queue 11 are shown in red.

*A:SR7_IMPM# show service id 1 sap 3/1/21 detail

Service Access Points(SAP)

Service Id         : 1
SAP                : 3/1/21                   Encap             : null
Description        : (Not Specified)
Admin State        : Up                       Oper State        : Up
<…output omitted>
Forwarding Engine Stats
Dropped               : 16848385                6945292440
Off. HiPrio           : 0                       0
Off. LowPrio          : 234929                  355682506
Off. Uncolor          : 0                       0

Figure 38 Ingress Queue Drops Sample Output

If drops are incrementing on traffic mapped to multicast queues, troubleshoot the problem as follows. If the errors persist, contact Nokia Technical Support for further assistance.

• Ingress Multicast Path Management (IMPM) is Disabled

When IMPM is disabled on the system, the total high-priority ingress multicast traffic should not exceed 2G, and the total low-priority multicast traffic should not exceed 2G.

If investigation reveals that the traffic exceeds the 2G+2G ingress multicast limit per IOM/IMM, the following network design changes may be required.

− Enable the IMPM to increase the multicast capacity of the system.

*A:SR7_IMPM# show service id 1 sap 3/1/21 detail
<…snip>
Service Access Points(SAP)

Sap per Queue stats
                        Packets                 Octets

Ingress Queue 1 (Unicast) (Priority)
Off. HiPrio           : 0                       0
Off. LoPrio           : 0                       0
Dro. HiPrio           : 0                       0
Dro. LoPrio           : 0                       0
For. InProf           : 0                       0
For. OutProf          : 0                       0

Ingress Queue 11 (Multipoint) (Priority)
Off. HiPrio           : 0                       0
Off. LoPrio           : 6378599                 9657198886
Off. Managed          : 0                       0
Dro. HiPrio           : 6378594                 9657191316
Dro. LoPrio           : 0                       0
For. InProf           : 0                       0
For. OutProf          : 0                       0

Egress Queue 1
For. InProf           : 10                      696
For. OutProf          : 0                       0
Dro. InProf           : 0                       0
Dro. OutProf          : 0                       0

Caution: The implications of a network design change should be considered carefully before implementation. The impact information is beyond the scope of this document.

− Modify the QoS policy to map traffic to the high- and low-priority queues so that the total ingress multicast traffic does not exceed 2G for high-priority traffic and 2G for low-priority traffic.

• Ingress Multicast Path Management (IMPM) is Enabled

When IMPM is enabled on the system, check to see if the traffic is being blackholed. If the IMPM policy is blackholing traffic, you may need to change the IMPM policy.

Figure 39 shows the CLI context to configure an IMPM policy.

Figure 39 CLI Context to Configure IMPM Policy

Figure 40 shows a CLI example where no groups are blackholed, and the IMPM is enabled on IOM 2 only.

Figure 40 IMPM Enabled and No Blackholes Example

Caution: Modifying the IMPM policy is a major network design change and its implications should be considered carefully before implementation. The impact information is beyond the scope of this document.
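The two queue-drop cases described above (IMPM disabled and IMPM enabled) reduce to a small decision. The Python sketch below is a minimal illustration under stated assumptions: "2G" is read as 2 Gb/s per priority level, the function and parameter names are hypothetical, and the traffic totals are assumed to have been measured separately.

```python
LIMIT_GBPS = 2.0  # assumption: the "2G" limit is 2 Gb/s per priority level


def queue_drop_advice(impm_enabled, hi_gbps=0.0, lo_gbps=0.0, blackholed_groups=0):
    """Suggest the next step for drops on traffic mapped to multicast queues."""
    if not impm_enabled:
        # Without IMPM, check the 2G high-priority and 2G low-priority limits.
        if hi_gbps > LIMIT_GBPS or lo_gbps > LIMIT_GBPS:
            return "enable IMPM or modify the QoS policy"
        return "within the 2G+2G limit; contact support if drops persist"
    # With IMPM, check whether groups are being blackholed.
    if blackholed_groups > 0:
        return "review the IMPM policy"
    return "no groups blackholed; contact support if drops persist"


# Example: IMPM disabled with 2.6 Gb/s of low-priority ingress multicast.
print(queue_drop_advice(False, hi_gbps=0.4, lo_gbps=2.6))
```

Both suggested changes are network design changes, so the cautions in this section apply before acting on the result.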

*A:SR7_IMPM# configure mcast-management
*A:SR7_IMPM>config>mcast-mgmt# info
        chassis-level
            per-mcast-plane-limit 2000 secondary 2000 dual-sfm 2000 secondary-dual-sfm 2000
        exit
        multicast-info-policy "customer1" create
            bundle "default" create
            exit
            bundle "customer1" create
                explicit-sf-path primary
                channel "239.1.1.1" "239.1.10.1" create
                exit
            exit
        exit
*A:SR7_IMPM>config>mcast-mgmt#

*A:SR7_IMPM# tools dump mcast-path-mgr blackholedsgs
McPathMgr[2][0]: 0xf33b0a00
Blackholed SGs:
Source                  BW        Pref      IsExp
Group

10 Troubleshooting ICC Errors

10.1 In This Chapter

This chapter describes how to troubleshoot Inter-Card Communication (ICC) errors in the network. The troubleshooting information does not apply in cases where the ICC automatically recovers from ICC errors described in Section 10.2.1.

The topics in this chapter include:

• Inter-Card Communication Overview

• ICC Troubleshooting Flowchart

• To Troubleshoot ICC Failures

10.2 Inter-Card Communication Overview

ICC is a messaging system that uses the backplane (switch fabric) as the transfer medium to communicate between cards. All ICC communication originates from the active CPM and is distributed to the IOMs and the standby CPM.

When an ICC message is sent, an ICC reply is expected within a certain amount of time. ICC requires an acknowledgment (ACK) message and a response for each ICC request sent, and an ACK message for each ICC response sent. If ICC does not receive the ACK or response message, a timeout occurs: the current transaction fails and the card is declared failed as a result. Any impact on the ICC messages may lead to the failure of a specific slot.

Alarms are raised when the communication between the active CPM and the IOM is lost. The IOM will reset itself in an effort to re-establish communication with the active CPM.

10.2.1 Automated Recovery from ICC Errors

The active CPM will automatically reset itself if the standby CPM and all IOM cards have rebooted three (3) times due to an internal failure in a period shorter than, or equal to, 15 minutes. In dual-CPM systems, the active CPM will automatically reset itself if the standby CPM and all of the IOM cards have rebooted twice due to internal failures in the past 60 minutes.
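The recovery thresholds above are "N reboots within a time window" conditions. The Python sketch below shows how such a condition can be evaluated for the reboot history of one card; the function name is hypothetical and the timestamps are invented for illustration. The text requires the condition to hold for the standby CPM and all IOM cards, so a caller would apply the check to each card in turn.

```python
from datetime import datetime, timedelta


def reboots_within(reboot_times, count, window):
    """Return True if any `count` consecutive reboots span at most `window`."""
    times = sorted(reboot_times)
    return any(times[i + count - 1] - times[i] <= window
               for i in range(len(times) - count + 1))


# Hypothetical reboot history for one card: 0, 6, and 13 minutes after t0.
t0 = datetime(2012, 12, 14, 2, 40)
history = [t0, t0 + timedelta(minutes=6), t0 + timedelta(minutes=13)]

# First threshold in the text: three reboots in 15 minutes or less.
print(reboots_within(history, count=3, window=timedelta(minutes=15)))
# Dual-CPM threshold in the text: two reboots in the past 60 minutes.
print(reboots_within(history, count=2, window=timedelta(minutes=60)))
```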

Sample ICC Alarms Generated in log-id 99

6817 2012/12/14 02:53:05.48 UTC MAJOR: CHASSIS #2001 Base Card 4
"Class IO Module : failed, reason: Failed ICC transaction"

6816 2012/12/14 02:53:05.41 UTC CRITICAL: LOGGER #2002 Base A:ICC:UNUSUAL_ERROR
"iccKillUnrespIoms: Marking slot num 4 mda 0 as FAILED"

6815 2012/12/14 02:53:05.41 UTC CRITICAL: LOGGER #2002 Base A:ICC:UNUSUAL_ERROR
"iccKillUnrespIoms: Id 1794639986, Seq 51400767, Sock 2, Unicast, Ptr 0x7f2ff098: Socket CARD_MANAGEMENT Failed unicast transaction. Slot 4, Mda 0"
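When many such alarms are present, it helps to count failures per slot to see which cards are affected. The following Python sketch does this for text exported from log-id 99. It is an illustration only: the function name is hypothetical, and the pattern assumes the message wording shown in the sample alarms.

```python
import re
from collections import Counter

# Two events copied from the sample alarms.
SAMPLE_LOG = '''\
6817 2012/12/14 02:53:05.48 UTC MAJOR: CHASSIS #2001 Base Card 4
"Class IO Module : failed, reason: Failed ICC transaction"
6816 2012/12/14 02:53:05.41 UTC CRITICAL: LOGGER #2002 Base A:ICC:UNUSUAL_ERROR
"iccKillUnrespIoms: Marking slot num 4 mda 0 as FAILED"
'''

FAILED_SLOT = re.compile(r"Marking slot num (\d+) mda (\d+) as FAILED")


def failed_slots(log_text):
    """Count 'Marking slot ... as FAILED' events per slot number."""
    return Counter(int(m.group(1)) for m in FAILED_SLOT.finditer(log_text))


print(failed_slots(SAMPLE_LOG))
```

A single slot with repeated failures points toward the one-card case in the troubleshooting procedure, while several slots point toward the multiple-card case.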

10.3 ICC Troubleshooting Flowchart

The flowchart in Figure 41 summarizes the ICC troubleshooting steps; proceed as directed. Some exceptions apply to the scenarios represented in the flowchart; for example, 1 IOM and 1 CPM in a chassis.

Figure 41 ICC Troubleshooting Flowchart

[Flowchart not reproduced. Start: ICC alarms are being generated against cards, or cards are rebooting. Check the logs (log-id 99, SAM, or syslogs) and determine which cards are rebooting. Take two tech-support files. Decision points: Single-CPM or dual-CPM system? One or multiple IOMs rebooting? Have the ICC errors stopped? Actions: Force a CPM switchover (*). Power cycle the card using the CLI command "tools perform card x remote power-cycle". Arrange for a spare CPM and on-site technician. Take two tech-support files, collect logs, and contact Nokia Support. Monitor.]

A power cycle will reboot the hardware and cause service impact. The procedure should be performed during a scheduled maintenance window if possible. Spares should be available on-site as a precautionary measure.

* This step is not service impacting; ensure that the CPMs are synchronized.

10.4 To Troubleshoot ICC Failures

To troubleshoot ICC failures, perform the following tasks in sequence until you identify the root cause of the problem.

Step 1. Characterize the issue: Check the logs to identify the cards that are rebooting and determine the frequency of the reboots.

Step 2. Generate two tech-support files before proceeding: Perform the admin tech-support CLI command.

Step 3. If multiple cards are rebooting in a dual-CPM system, the ICC failure may be due to a CPM issue. Force a CPM switchover:

i. Ensure that the active and standby CPM are synchronized: Perform the show redundancy synchronization CLI command.

ii. Synchronize the CPMs, if required: Perform the admin redundancy synchronize boot-env CLI command.

iii. Force a CPM switchover: Perform the admin redundancy force-switchover CLI command.

iv. Go to Step 6 if the issue persists; otherwise, exit the workflow if the ICC failure is resolved.

Step 4. If one card is rebooting, the ICC failure may be due to an IOM issue.

i. Power cycle the card: Perform the tools perform card <slot> remote power-cycle CLI command.

ii. Perform one of the following steps if the power cycle does not resolve the issue:

a. If this is a dual-CPM system and the power cycle does not resolve the issue, then perform Step 3 again to force a CPM switchover.

b. If this is a single-CPM system and the power cycle does not resolve the issue, arrange for a spare CPM on-site and contact Nokia Technical Support for further troubleshooting assistance.

Note: This troubleshooting workflow does not apply if the frequency of card reboots is enough to trigger the automated recovery mechanisms described in section 10.2.1 Automated Recovery from ICC Errors.

Warning: Power cycling a card is a service-affecting procedure; Nokia recommends that a power cycle should be conducted during a scheduled maintenance window.

Step 5. If multiple cards are rebooting in a single-CPM system, the ICC failure may be caused by a CPM issue. Arrange for a spare CPM on-site and contact Nokia Technical Support for further troubleshooting assistance.

Step 6. If one card is rebooting in a single-CPM system, the ICC failure may be caused by an IOM issue.

i. Power cycle the card: Perform the tools perform card <slot> remote power-cycle CLI command.

ii. If a power cycle does not resolve the issue, contact Nokia Technical Support for further troubleshooting assistance.
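The first corrective action in the procedure depends on two observations: whether the system has one or two CPMs, and whether one or several cards are rebooting. The Python sketch below encodes that choice as a lookup. The function name is hypothetical, the command strings are the ones quoted in this chapter (with x standing for the slot, as in the flowchart), and the sketch assumes that the logs have been checked and the tech-support files generated first.

```python
def icc_first_action(dual_cpm, multiple_cards):
    """Return the first corrective actions for an ICC failure scenario."""
    if multiple_cards and dual_cpm:
        # Multiple cards, dual-CPM system: suspect the CPM, force a switchover.
        return ["show redundancy synchronization",
                "admin redundancy synchronize boot-env (if required)",
                "admin redundancy force-switchover"]
    if multiple_cards:
        # Multiple cards, single-CPM system: a switchover is not possible.
        return ["arrange for a spare CPM on-site",
                "contact Nokia Technical Support"]
    # One card rebooting: suspect the IOM and power cycle it (service-affecting).
    return ["tools perform card x remote power-cycle"]


for step in icc_first_action(dual_cpm=True, multiple_cards=True):
    print(step)
```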

11 Upgrading Incompatible Firmware Versions

11.1 In This Chapter

This chapter describes the usage of the admin reboot upgrade command and analyzes the associated risks and potential impact of the unsolicited use of the command to upgrade firmware versions on the SR-series routers.

The topics in this chapter include:

• Command Overview

• Command Behavior and Impact

Note: Nokia strongly recommends that the admin reboot upgrade command should not be used for firmware upgrades unless it is explicitly specified to do so in the Release Notes.

11.2 Command Overview

The admin reboot upgrade command is used for software upgrades in certain cases where an incompatibility exists between the firmware versions in the currently running release and the target release.

Software releases that require the admin reboot upgrade command for firmware upgrades are explicitly identified in the Release Notes of the target release (the load that is being upgraded to).

Unsolicited use of the admin reboot upgrade command introduces numerous unnecessary variables and increased risk.

11.3 Command Behavior and Impact

When it is used, the admin reboot upgrade command triggers an audit of both the bootROM and the firmware of all cards in the system.

bootROM

The bootROM contains code that is executed during power-on or reboot to bring the card to a state where it can load the software image. The bootROM consists of a header and a version; the header information is updated for each new software release.

If the current bootROM version is different from the new version, the admin reboot upgrade command causes the bootROM to be upgraded. However, if the bootROM versions are the same but the header differs between the loads, the system does not upgrade the bootROM because the code is the same.
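The rule above can be stated as a one-line comparison: only the version decides whether the bootROM is upgraded, and the header is ignored. The Python sketch below restates the rule; the function name and the version and header strings are hypothetical and do not reflect how the system stores these values.

```python
def bootrom_will_upgrade(current_version, target_version,
                         current_header=None, target_header=None):
    """Return True if 'admin reboot upgrade' would upgrade the bootROM."""
    # The header arguments are accepted only to show that they do not
    # influence the result: a header-only difference triggers no upgrade.
    return current_version != target_version


print(bootrom_will_upgrade("v1", "v2"))                # versions differ
print(bootrom_will_upgrade("v1", "v1", "hdr-a", "hdr-b"))  # header-only change
```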

Note: Nokia strongly recommends that the admin reboot command should be used to load a new SR software release (Ensure that all pre-upgrade tasks are complete before you run this command).

The admin reboot upgrade command should only be used when explicitly directed to do so in the Release Notes.

Note: The show card x detail CLI command output indicates the application load that the bootROM loaded from. It does not display information about the firmware version of a card.

After the software image is loaded, problems found in the card during operation should not be attributed to the bootROM version.

Firmware

If the version of firmware on a specific card is different from the version in the target image, the admin reboot upgrade command forces an update of the firmware located on the individual cards in the system. Unless it is explicitly stated in the Release Notes, this action is neither recommended nor desired.

The following issues and risks are introduced as a result of the unsolicited use of the admin reboot upgrade command.

• Longer Bootup Times

By using the admin reboot upgrade command, you are forcing the card to update all of its firmware. This can add minutes to the bootup of an IOM above the regular bootup times.

In cases where the card directly connects to end users, the longer bootup procedure will induce additional downtime to customer service.

In cases where the CPM in slot A needs several firmware devices upgraded, it will take longer to boot up than the CPM in slot B. As a result, the CPM in slot B will become the active CPM. While not an issue in itself, this switchover may cause problems if the node was not properly synchronized before the upgrade.

• Increased Potential for Card Downtime

If there are any interruptions in the firmware upgrade process while devices on the IOM or CPM are being programmed, the card will not complete its upgrade. Depending on the stage at which the upgrade cycle was interrupted, the IOM or CPM may need to be reseated to boot up to achieve full operational status. In rare cases, the problem may not be field recoverable; the card may have to be removed from the system and returned to Nokia Repair/Return for reprogramming. If this card is directly connected to customers, there will be serious impact until it is replaced.

• Firmware Version Upgrades are Not Mandated

There has not been a mandated firmware upgrade since the v10 version of IOM firmware was introduced in Release 2.0.R6 of the 7750 SR platform, and Release 1.0.R2 of the 7450 ESS platform. Most customer networks have been upgraded well past any mandated changes in firmware; therefore, the current upgrade process does not require the use of the admin reboot upgrade command.

• CPM Switchover May Occur

In cases where the standby CPM does not require an upgrade but the active CPM does, there is a possibility that the standby CPM will boot up first and become the active CPM after the upgrade.

While this may not negatively impact live traffic, it means that the CPM that was active before the upgrade (for example, CPM-A) will not remain active after the upgrade, and an additional manual CPM switchover may be needed to make CPM-A active again.

12 Recovering From Active CPM Lockup

12.1 In This Chapter

This chapter describes how to use the lamp test to recover from an active Control Processor Module (CPM) lockup.

The topics in this chapter include:

• Recovering the Active CPM Overview

• To Recover the Active CPM and Determine Root Cause Using the Lamp Test

12.2 Recovering the Active CPM Overview

In a few unexpected cases, an SR router may become unreachable through the Telnet/SSH or console sessions due to an active CPM lockup. This chapter describes how to recover the node without losing vital information that is needed to troubleshoot the root cause of the problem.

An unresponsive active CPM card is generally manually reset using one of the following:

• by pressing the Reset button

• by reseating the card

However, the usual manual reset recovery methods cause data to be lost, which makes it impossible to troubleshoot the problem that caused the CPM card to lock up.

Use the Lamp Test functionality described in this chapter to retrieve troubleshooting information and recover the CPM card.

12.3 To Recover the Active CPM and Determine Root Cause Using the Lamp Test

The Lamp Test procedure enables the operator to force the active CPM to dump information even when it is locked up; the procedure is performed using the ACO/LT button on the CPM card.

Step 1 Connect a PC or terminal server to the console port of the standby CPM.

Step 2 Press <Enter> after the connection to the standby CPM is established.

The following text is displayed.

Login not allowed on standby

Step 3 Enter the command to reset the CPM and press <Enter>.

Note: To perform this procedure, the operator must be at the physical location of the SR router, with access to a console cable to establish a direct connection to the console port of the CPMs.

reset

The following message is displayed; there will be no indication whether the reset command was accepted by the system.

Login not allowed on standby

Step 4 Within 5 minutes of the reset command, press the ACO/LT button on the standby CPM.

This triggers a CPM HA switchover, which causes the active CPM to reboot and the standby CPM to take over. The following message is displayed.

Attempting crash dump on active CPM.

The standby CPM becomes the active CPM.

Step 5 Wait for both CPMs to come back up and synchronize.

Step 6 Generate a tech-support file using the admin tech-support CLI command.

As shown in the following sample CLI output, the generated tech-support file contains crucial information about the locked up CPM before it was reset.

*B:NS041510586# show redundancy synchronization
=======================================================================
Synchronization Information
=======================================================================
Standby Status              : synchronizing
Last Standby Failure        : N/A
Standby Up Time             : 2010/06/14 19:58:32
Standby Version             : TiMOS-C-7.0.R6 cpm/hops ALCATEL SR 7750
                              Copyright (c) 2000-2009 Alcatel-Lucent.
                              All rights reserved. All use subject to
                              applicable license agreements.
                              Built on Mon Nov 23 15:53:11 PST 2009 by
                              builder in /rel7.0/b1/R6/panos/main
Failover Time               : 06/14/2010 19:56:36
Failover Reason             : active CPM sync lost
Boot/Config Sync Mode       : Configuration
Boot/Config Sync Status     : Config only synchronized
Last Config File Sync Time  : 06/14/2010 19:58:34
Last Boot Env Sync Time     : Never
=======================================================================
*B:NS041510586# show redundancy synchronization
=======================================================================
Synchronization Information
=======================================================================
Standby Status              : standby ready
Last Standby Failure        : N/A
Standby Up Time             : 2010/06/14 19:58:32
Standby Version             : TiMOS-C-7.0.R6 cpm/hops ALCATEL SR 7750
                              Copyright (c) 2000-2009 Alcatel-Lucent.
                              All rights reserved. All use subject to
                              applicable license agreements.
                              Built on Mon Nov 23 15:53:11 PST 2009 by
                              builder in /rel7.0/b1/R6/panos/main
Failover Time               : 06/14/2010 19:56:36
Failover Reason             : active CPM sync lost
Boot/Config Sync Mode       : Configuration
Boot/Config Sync Status     : Config only synchronized
Last Config File Sync Time  : 06/14/2010 19:58:34
Last Boot Env Sync Time     : Never
=======================================================================
*B:NS041510586# show card
=======================================================================
Card Summary
=======================================================================
Slot  Provisioned      Equipped         Admin    Operational
      Card-type        Card-type        State    State
-----------------------------------------------------------------------
4     iom-20g-b        iom-20g-b        up       up
9     iom2-20g         iom2-20g         up       up
A     sfm-400g         sfm-400g         up       up/standby
B     sfm-400g         sfm-400g         up       up/active
*B:NS041510586# admin tech-support cf3:\lamptest.dat
Processing CPM...
Second Pass
Processing CPM...
Processing CPM Cpu 2...
Processing CPM in Slot A...
Processing CPM in Slot A... Cpu 2
Processing IOM in Slot 4...
Processing IOM in Slot 9...
Processing MDA in 4/1...
Processing MDA in 4/2...
Processing MDA in 9/1...
Processing MDA in 9/2...
Done...
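The sample output shows the standby moving from "synchronizing" to "standby ready" before the tech-support file is generated. The following is a minimal sketch, not a Nokia tool, of how captured output of show redundancy synchronization could be checked for that state. The function names are hypothetical, and the field label is taken from the sample output above.

```python
import re
from typing import Optional


def standby_status(cli_output: str) -> Optional[str]:
    """Return the value of the 'Standby Status' field, or None if it is absent."""
    match = re.search(r"^\s*Standby Status\s*:\s*(.+?)\s*$", cli_output, re.MULTILINE)
    return match.group(1) if match else None


def ready_for_tech_support(cli_output: str) -> bool:
    """True once the standby CPM reports 'standby ready' (Step 5 complete)."""
    return standby_status(cli_output) == "standby ready"


# Two captures modelled on the sample output in this section.
SYNCING = "Standby Status : synchronizing\nLast Standby Failure : N/A\n"
READY = "Standby Status : standby ready\nLast Standby Failure : N/A\n"

if __name__ == "__main__":
    print(standby_status(SYNCING), ready_for_tech_support(SYNCING))
    print(standby_status(READY), ready_for_tech_support(READY))
```

How the output is captured (console log, SSH session transcript, and so on) is left open; the sketch only inspects text that has already been collected.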


13 Hardware Error Protection Features

13.1 In This Chapter

This chapter provides information about troubleshooting hardware alarms and the available hardware protection features.

The topics in this chapter include:

• Hardware Error Protection Overview

• Memory Bit/Parity Errors: Causes, Detection, and Correction

• Card-Level Fail-On-Error

• MDA-Level Fail-On-Error

• To Troubleshoot Using the Fail-On-Error Feature

• Down-On-Internal-Error

• To Troubleshoot Using the Down-On-Internal-Error Feature

• CRC-Monitor

• To Troubleshoot Using the CRC-Monitor Feature

Note: The troubleshooting steps are applicable to IOM3-XP/-B/-C, IOM4-e, IOM4-e-B or IMMs, XCMs, IOM-a and IOM-e, and MDAs or XMAs only. Some alarms and troubleshooting procedures may not apply to systems that are running older versions of the SR OS software.


13.2 Hardware Error Protection Overview

The SR series routers provide the following hardware protection features:

• fail-on-error (card and MDA level), see section 13.4

• down-on-internal-error, see section 13.8

• CRC monitor, see section 13.10

13.3 Memory Bit/Parity Errors: Causes, Detection, and Correction

Densely packed, high-speed memory devices in advanced computing and telecommunications equipment are susceptible to errors defined as soft, firm, and hard errors.

• Soft Errors

Soft errors are transient and occur when bit states are flipped in a memory device. The majority of these are detected and corrected with no visible impact to the system.

• Firm Errors

Firm errors, or stuck bits, in memory clear when power is removed from the card. For this reason, it is difficult to recreate a flipped-bit or stuck-bit scenario in a lab environment, which is a major factor in No Fault Found (NFF) findings.

• Hard Errors

Hard errors occur when a memory cell is permanently damaged as a result of defect, over-stress, or damaging radiation impact. These persistent hard faults in memory chips are diagnosed and corrected in a lab / repair shop.

Note: A transient event is defined as an undesirable event or momentary condition that occurs at an instant in time and for which there is no apparent periodic pattern.
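As an illustration of how a flipped bit can be detected, the following sketch uses a single even-parity bit per stored word. This is a generic textbook scheme chosen for illustration; the guide does not state which detection or correction scheme the hardware uses, and all names here are hypothetical.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word contains an odd number of 1 bits."""
    return bin(word).count("1") % 2


def store(word: int):
    """Return the word together with the parity bit computed at write time."""
    return word, parity_bit(word)


def check(word: int, stored_parity: int) -> bool:
    """True if the word read back still matches its stored parity bit."""
    return parity_bit(word) == stored_parity


word, parity = store(0b10110010)
single_flip = word ^ (1 << 3)   # one bit flipped (soft error)
double_flip = word ^ 0b11       # two bits flipped

if __name__ == "__main__":
    print(check(word, parity))         # intact word passes
    print(check(single_flip, parity))  # single-bit flip is detected
    print(check(double_flip, parity))  # double-bit flip escapes simple parity
```

A single parity bit only detects an odd number of flipped bits and cannot say which bit flipped; correcting an error needs a stronger code, which is outside the scope of this sketch.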


13.4 Fail-On-Error Overview

The fail-on-error feature controls card behavior when any one (or more) of a specific set of card-level errors is encountered in the system. The fail-on-error feature can be enabled at the card and MDA level. When enabled, it causes the operational state of a card to be set to Failed when specific errors are detected, and facilitates a timely handling of the detected errors. The erroneous card or MDA can be taken out of service immediately in a predictable and operator-intended manner, allowing redundant network elements to take over the service for the failed hardware.

Depending on their type, the detected errors may not cause a noticeable service impact in some cases. However, if fail-on-error is enabled, the hardware will still be failed and taken out of service. In such cases, while the originating errors may not cause noticeable service impact, the process of failing traffic over to redundant network elements may cause a temporary service impact due to network protocol convergence.

When the error is detected in the system, the reporting of the event (logs) by the system and the fail-on-error behavior of the card are independent. Log event control configuration will determine whether the events are reported in logs (or SNMP traps, and so on), and the fail-on-error configuration will determine the behavior of the card. The card can be configured to fail-on-error even if the events are suppressed (some events may be suppressed in the system by default).

13.4.1 Clearing a Failed Operational State

When the fail-on-error feature is enabled on the slot or MDA, the operational state of the card/MDA is set to Failed when the first trap is raised. The Failed state is cleared when:

• a clear card or clear mda command is performed to reset the card or MDA

• the card or MDA is reseated or power cycled (removed and reinserted)

If the failed condition persists even after the card/MDA has been reseated, contact Nokia Technical Support for further assistance.

Table 6 describes the general behavior of the clear command.

Table 6 General Behavior of the Clear Command

Command                            Usage                                      Platform
clear mda                          MDA will be soft reset.                    SR/ESS
clear card                         MDA / IOM / IMM will only be soft reset.   SR
tools perform card x power-cycle   MDA / IOM / IMM will be power cycled.      SR

Note: Nokia recommends that the fail-on-error feature should only be enabled on networks that are designed to route traffic around a failed card or MDA (that is, redundant cards, nodes, or other paths must exist in the network).

13.4.2 Triggering Fail-On-Error

Table 7 lists the specific Pchip, Qchip, and XPL alarms and events that trigger the fail-on-error feature.

Note: These alarms are suppressed by default. See section 13.4.3 for information about how to enable alarm reporting.

Table 7 SNMP Traps

SNMP MIB: TIMETRA-CHASSIS-MIB.mib

SNMP Trap                          Supported Release
tmnxEqCardPChipMemoryEvent         9.0.R23, 10.0.R12, 11.0.R4 and later
tmnxEqCardPChipCamEvent            6.1.R13, 7.0.R7 and later
tmnxEqCardPChipError               6.1.R5 and later
tmnxEqCardQChipBufMemoryEvent      9.0.R23, 10.0.R12, 11.0.R4 and later
tmnxEqCardQChipStatsMemoryEvent    9.0.R23, 10.0.R12, 11.0.R4 and later
tmnxEqCardQChipIntMemoryEvent      9.0.R23, 10.0.R12, 11.0.R4 and later
tmnxEqCardChipIfCellEvent          9.0.R23, 10.0.R12, 11.0.R4 and later
tmnxEqMdaXplError                  6.0.R5 and later
tmnxEqMdaIngrXplError              11.0.R13, 12.0.R4 and later
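The "Supported Release" entries in Table 7 can be read as a minimum maintenance release per release train. The sketch below checks a release string such as 10.0.R12 against the tmnxEqCardPChipMemoryEvent entry. It is an interpretation, not a Nokia algorithm: it assumes that "and later" covers every train newer than the newest one listed, and that trains that are older and not listed are unsupported. The function and variable names are hypothetical.

```python
import re

# Minimum maintenance release per release train, copied from the
# tmnxEqCardPChipMemoryEvent row of Table 7: 9.0.R23, 10.0.R12, 11.0.R4 and later.
PCHIP_MEMORY_EVENT = {(9, 0): 23, (10, 0): 12, (11, 0): 4}


def parse_release(release: str):
    """Split a string such as '10.0.R12' into ((10, 0), 12)."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.R(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized release string: {release!r}")
    major, minor, maintenance = map(int, match.groups())
    return (major, minor), maintenance


def supports(release: str, minimums: dict) -> bool:
    """True if the release meets the listed minimum for its train."""
    train, maintenance = parse_release(release)
    if train in minimums:
        return maintenance >= minimums[train]
    # Assumption: "and later" means any train newer than the newest listed one.
    return train > max(minimums)


if __name__ == "__main__":
    for release in ("9.0.R22", "9.0.R23", "10.0.R12", "12.0.R1", "8.0.R10"):
        print(release, supports(release, PCHIP_MEMORY_EVENT))
```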


13.4.3 Enabling Log Reports

To facilitate post-failure analysis when fail-on-error is enabled, you should enable the reporting of the specific events and errors on the system (configure log event-control). Table 8 lists the log event control configuration that is required to generate logs. See TA 10-0127c for more information about log events.

In addition, the following applies to log event configuration:

• Complex number 0 is used by all FP2 and FP3 complexes on the 7x50

For example, IOM3-XP, IMM12-10GB-SFP+, IMM-2PAC-FP3

• The IOM1 and IOM2 cards have two complexes (FP1-based)

− Complex 0 = MDA 1

− Complex 1 = MDA 2

Note: The alarms listed in Table 8 are not self clearing in 5620 SAM and they must be cleared by an operator.

Table 8 Log Event Control Configuration

Alarm                              Configuration
tmnxEqCardPChipMemoryEvent         B:7x50# configure log event-control "chassis" 2063 generate
tmnxEqCardPChipCamEvent            B:7x50# configure log event-control "chassis" 2076 generate
tmnxEqCardPChipError               B:7x50# configure log event-control "chassis" 2059 generate
tmnxEqCardQChipBufMemoryEvent      B:7x50# configure log event-control "chassis" 2098 generate
tmnxEqCardQChipStatsMemoryEvent    B:7x50# configure log event-control "chassis" 2099 generate
tmnxEqCardQChipIntMemoryEvent      B:7x50# configure log event-control "chassis" 2101 generate
tmnxEqCardChipIfCellEvent          B:7x50# configure log event-control "chassis" 2103 generate
tmnxEqMdaXplError                  B:7x50# configure log event-control "chassis" 2058 generate
tmnxEqMdaIngrXplError              B:7x50# configure log event-control "chassis" 2129 generate
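Because the nine event-control commands in Table 8 differ only in the event ID, they can be generated from a small mapping. The following sketch does that; the event names and IDs are copied from Table 8, while the helper name and the idea of scripting the list are assumptions for illustration. It only builds the command strings and does not connect to a router.

```python
# Event name -> CHASSIS event ID, copied from Table 8.
FAIL_ON_ERROR_EVENTS = {
    "tmnxEqCardPChipMemoryEvent": 2063,
    "tmnxEqCardPChipCamEvent": 2076,
    "tmnxEqCardPChipError": 2059,
    "tmnxEqCardQChipBufMemoryEvent": 2098,
    "tmnxEqCardQChipStatsMemoryEvent": 2099,
    "tmnxEqCardQChipIntMemoryEvent": 2101,
    "tmnxEqCardChipIfCellEvent": 2103,
    "tmnxEqMdaXplError": 2058,
    "tmnxEqMdaIngrXplError": 2129,
}


def event_control_commands(events=FAIL_ON_ERROR_EVENTS):
    """Return one 'configure log event-control' command per event, sorted by ID."""
    return [
        f'configure log event-control "chassis" {event_id} generate'
        for event_id in sorted(events.values())
    ]


if __name__ == "__main__":
    for command in event_control_commands():
        print(command)
```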

13.5 Card-Level Fail-On-Error

Table 9 lists the card-specific errors that trigger the fail-on-error feature.

Table 9 Card Errors

Event ID                   Event name
CHASSIS event ID# 2063     tmnxEqCardPChipMemoryEvent
CHASSIS event ID# 2076     tmnxEqCardPChipCamEvent
CHASSIS event ID# 2059     tmnxEqCardPChipError
CHASSIS event ID# 2098     tmnxEqCardQChipBufMemoryEvent
CHASSIS event ID# 2099     tmnxEqCardQChipStatsMemoryEvent
CHASSIS event ID# 2101     tmnxEqCardQChipIntMemoryEvent
CHASSIS event ID# 2103     tmnxEqCardChipIfCellEvent

13.5.1 Card-Level Fail-On-Error Examples

This section provides event log configuration and CLI output examples of the fail-on-error feature at the card level.

Note: The CLI status and statistics are cleared after an IOM / IMM / XCM reboot.

tmnxEqCardPChipMemoryEvent

The following output is an example of the tmnxEqCardPChipMemoryEvent event log entry and CLI output information.

Sample Event Log Entry

1091894 2015/04/13 12:53:31.52 CDT MAJOR: CHASSIS #2001 Base Card 4
"Class IO Module : failed, reason: Memory (GroupA) failure"

1091890 2015/04/13 12:53:31.52 CDT MINOR: CHASSIS #2063 Base
"Slot 4 experienced a pchip parity error occurrence on complex 0"

Sample CLI Output — Both MDAs are affected in the following example and the whole IOM is failed as a result.

A:7750 SR# show card 4 detail
===============================================================================
Card 4
===============================================================================
Slot  Provisioned Type                         Admin Operational   Comments
          Equipped Type (if different)         State State
-------------------------------------------------------------------------------
4     iom3-xp                                  up    failed
<snip>
    Fail On Error                 : Enabled
    Available MDA slots           : 2
    Installed MDAs                : 0

Hardware Data
<snip>
    Administrative state          : up
    Operational state             : failed
    Failure Reason                : Memory (GroupA) failure
<snip>
Pchip Errors Detected
    Complex 0 (parity error): Trap raised 1 times; Last Trap 2015/04/13 12:53:31.52
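When many of these entries must be correlated, the first line of each event log entry can be split into fields. The sketch below does this with a regular expression derived only from the sample entries in this chapter; it is not an official log grammar, and the field names (in particular router and subject) are labels chosen here for illustration.

```python
import re
from typing import NamedTuple, Optional


class LogHeader(NamedTuple):
    sequence: int
    timestamp: str
    timezone: str
    severity: str
    application: str
    event_id: int
    router: str
    subject: str


# Pattern inferred from the sample entries, for example:
# 1091894 2015/04/13 12:53:31.52 CDT MAJOR: CHASSIS #2001 Base Card 4
HEADER_RE = re.compile(
    r"^(?P<seq>\d+) (?P<date>\d{4}/\d{1,2}/\d{1,2}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{2}) "
    r"(?P<tz>[A-Z]+) (?P<sev>[A-Z]+): (?P<app>[A-Z]+) #(?P<id>\d+) "
    r"(?P<router>\S+)(?: (?P<subject>.*))?$"
)


def parse_header(line: str) -> Optional[LogHeader]:
    """Parse the first line of an event log entry; return None if it does not match."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    return LogHeader(
        sequence=int(match.group("seq")),
        timestamp=f'{match.group("date")} {match.group("time")}',
        timezone=match.group("tz"),
        severity=match.group("sev"),
        application=match.group("app"),
        event_id=int(match.group("id")),
        router=match.group("router"),
        subject=match.group("subject") or "",
    )


if __name__ == "__main__":
    print(parse_header("1091894 2015/04/13 12:53:31.52 CDT MAJOR: CHASSIS #2001 Base Card 4"))
    print(parse_header("1091890 2015/04/13 12:53:31.52 CDT MINOR: CHASSIS #2063 Base"))
```

The quoted message on the second line of each entry is not parsed here; it can be kept as free text alongside the parsed header.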

tmnxEqCardPChipCamEvent

The following output is an example of the tmnxEqCardPChipCamEvent event log entry and CLI output information.

Sample Event Log Entry

1091894 2015/04/13 12:53:31.52 CDT MAJOR: CHASSIS #2001 Base Card 4
"Class IO Module : failed, reason: Memory (GroupA) failure"

1091890 2015/04/13 12:53:31.52 CDT CRITICAL: CHASSIS #2076 Base
"A fault has been detected in the hardware on IOM 4-forwarding engine 0: Please contact Alcatel-Lucent support"

Sample CLI Output — Both MDAs are affected in the following example and the whole IOM is failed as a result.

A:7750 SR# show card 4 detail
===============================================================================
Card 4
===============================================================================
Slot  Provisioned Type                         Admin Operational   Comments
          Equipped Type (if different)         State State
-------------------------------------------------------------------------------
4     iom3-xp                                  up    failed
<snip>
    Fail On Error                 : Enabled
    Available MDA slots           : 2
    Installed MDAs                : 0

Hardware Data
<snip>
    Administrative state          : up
    Operational state             : failed
    Failure Reason                : Memory (GroupA) failure
<snip>
Pchip Errors Detected
    Complex 0 (CAM error): Trap raised 1 times; Last Trap 2015/04/13 12:53:31.52

tmnxEqCardPChipError

The following output is an example of the tmnxEqCardPChipError event log entry and CLI output information.

Sample Event Log Entry

370266 2015/11/13 13:26:53.93 PST MAJOR: CHASSIS #2001 Base Card 4
"Class IO Module : failed, reason: Ingress FCS Errors"

370260 2015/11/13 13:26:53.93 PST MINOR: CHASSIS #2059 Base
"Slot 4 detected ingress FCS errors on complex 0."

Sample CLI Output — Both MDAs are affected in the following example and the whole IOM is failed as a result.

A:7750 SR# show card 4 detail
===============================================================================
Card 4
===============================================================================
Slot  Provisioned Type                         Admin Operational   Comments
          Equipped Type (if different)         State State
-------------------------------------------------------------------------------
4     iom3-xp                                  up    failed

IOM Card Specific Data
<snip>
    Fail On Error                 : Enabled
    Available MDA slots           : 2
    Installed MDAs                : 0

Hardware Data
<snip>
    Administrative state          : up
    Operational state             : failed
    Failure Reason                : Ingress FCS Errors
<snip>
FCS Errors Detected
    Complex 0 (ingress): Trap raised 1 times; Last Trap 2015/11/13 13:26:53.93

tmnxEqCardQChipBufMemoryEvent


The following output is an example of the tmnxEqCardQChipBufMemoryEvent event log entry and CLI output information.

Sample Event Log Entry

1091894 2015/04/13 12:53:31.52 CDT MAJOR: CHASSIS #2001 Base Card 4
"Class IO Module : failed, reason: Memory (GroupA) failure"

1091890 2015/04/13 12:53:31.52 CDT MINOR: CHASSIS #2098 Base
"Slot 4 experienced a Q-chip buffer memory error occurrence on complex 0"

Sample CLI Output — Both MDAs are affected in the following example and the whole IOM is failed as a result.

A:7750 SR# show card 4 detail
===============================================================================
Card 4
===============================================================================
Slot  Provisioned Type                         Admin Operational   Comments
          Equipped Type (if different)         State State
-------------------------------------------------------------------------------
4     iom3-xp                                  up    failed
<snip>
    Fail On Error                 : Enabled
    Available MDA slots           : 2
    Installed MDAs                : 0

Hardware Data
<snip>
    Administrative state          : up
    Operational state             : failed
    Failure Reason                : Memory (GroupA) failure
<snip>
Qchip Errors Detected
    Complex 0 (buffer memory error): Trap raised 1 times; Last Trap 2015/04/13 12:53:31.52

tmnxEqCardQChipStatsMemoryEvent

The following output is an example of the tmnxEqCardQChipStatsMemoryEvent event log entry and CLI output information.

Sample Event Log Entry

1091894 2015/04/13 12:53:31.52 CDT MAJOR: CHASSIS #2001 Base Card 4
"Class IO Module : failed, reason: Memory (GroupA) failure"

1091890 2015/04/13 12:53:31.52 CDT MINOR: CHASSIS #2099 Base
"Slot 4 experienced a Q-chip statistics memory error occurrence on complex 0"

Sample CLI Output — Both MDAs are affected in the following example and the whole IOM is failed as a result.

A:7750 SR# show card 4 detail
===============================================================================
Card 4
===============================================================================
Slot  Provisioned Type                         Admin Operational   Comments
          Equipped Type (if different)         State State
-------------------------------------------------------------------------------
4     iom3-xp                                  up    failed
<snip>
    Fail On Error                 : Enabled
    Available MDA slots           : 2
    Installed MDAs                : 0

Hardware Data
<snip>
    Administrative state          : up
    Operational state             : failed
    Failure Reason                : Memory (GroupA) failure
<snip>
Qchip Errors Detected
    Complex 0 (statistics memory error): Trap raised 1 times; Last Trap 2015/04/13 12:53:31.52

tmnxEqCardQChipIntMemoryEvent

The following output is an example of the tmnxEqCardQChipIntMemoryEvent event log entry and CLI output information.

Sample Event Log Entry

1091894 2015/04/13 12:53:31.52 CDT MAJOR: CHASSIS #2001 Base Card 4
"Class IO Module : failed, reason: Memory (GroupA) failure"

1091890 2015/04/13 12:53:31.52 CDT MINOR: CHASSIS #2101 Base
"Slot 4 experienced a qchip internal memory error occurrence on complex 0"

Sample CLI Output — Both MDAs are affected in the following example and the whole IOM is failed as a result.

A:7750 SR# show card 4 detail
===============================================================================
Card 4
===============================================================================
Slot  Provisioned Type                         Admin Operational   Comments
          Equipped Type (if different)         State State
-------------------------------------------------------------------------------
4     iom3-xp                                  up    failed
<snip>
    Fail On Error                 : Enabled
    Available MDA slots           : 2
    Installed MDAs                : 0

Hardware Data
<snip>
    Administrative state          : up
    Operational state             : failed
    Failure Reason                : Memory (GroupA) failure
<snip>
Qchip Errors Detected
    Complex 0 (internal memory error): Trap raised 1 times; Last Trap 2015/04/13 12:53:31.52

tmnxEqCardChipIfCellEvent

The following output is an example of the tmnxEqCardChipIfCellEvent event log entry and CLI output information.

Sample Event Log Entry

1091894 2015/04/13 12:53:31.52 CDT MAJOR: CHASSIS #2001 Base Card 4
"Class IO Module : failed, reason: Memory (GroupA) failure"

1091890 2015/04/13 12:53:31.52 CDT MINOR: CHASSIS #2103 Base
"Slot 4 experienced internal datapath cell errors on complex 0"

Sample CLI Output — Both MDAs are affected in the following example and the whole IOM is failed as a result.

A:7750 SR# show card 4 detail
===============================================================================
Card 4
===============================================================================
Slot  Provisioned Type                         Admin Operational   Comments
          Equipped Type (if different)         State State
-------------------------------------------------------------------------------
4     iom3-xp                                  up    failed
<snip>
    Fail On Error                 : Enabled
    Available MDA slots           : 2
    Installed MDAs                : 0

Hardware Data
<snip>
    Administrative state          : up
    Operational state             : failed
    Failure Reason                : Memory (GroupA) failure
<snip>
Inter-Chip Interface Errors Detected
    Complex 0 (internal datapath cell errors): Trap raised 1 times; Last Trap 2015/04/13 12:53:31.52

13.6 MDA-Level Fail-On-Error

The MDA-level fail-on-error is triggered by the following MDA-specific errors:

• Egress XPL errors

CHASSIS event ID# 2058 - tmnxEqMdaXplError


• Ingress XPL errors

CHASSIS event ID# 2129 - tmnxEqMdaIngrXplError

The ingress XPL alarm/trap and fail-on-error are supported on the following line cards:

• iom3-xp

• iom3-xp-b

• iom3-xp-c

• imm48-1gb-tx

• imm48-1gb-sfp

• imm48-1gb-sfp-b

• imm48-1gb-sfp-c

• imm4-10gb-xfp

• imm8-10gb-xfp

• ism-mg

• ism-mg-b

• imm5-10gb-xfp

• imm1-oc768-tun

• imm1-40gb-tun

The following output is an example of an MDA-level fail-on-error configuration.

A:7750 SR>config>card# info
----------------------------------------------
        card-type iom3-xp
        fail-on-error
        mda 1
            mda-type m10-1gb-sfp-b
            fail-on-error
            no shutdown
        exit
        mda 2
            mda-type m10-1gb-sfp-b
            fail-on-error
            no shutdown
        exit
        no shutdown
    exit
----------------------------------------------


13.6.1 MDA-Level Fail-On-Error Examples

This section provides event log configuration and CLI output examples of the fail-on-error feature at the MDA level.

tmnxEqMdaXplError

The following output is an example of the tmnxEqMdaXplError event log entry, configuration, and CLI output information.

Sample Event Log Entry

2 2015/11/16 18:53:08.89 UTC MAJOR: CHASSIS #2001 Base Mda 1/1
"Class MDA Module : failed, reason: MDA failed due to 60 consecutive windows of more than 1000 Egress XPL Errors"

Sample Configuration

A:7750 SR>config>card>mda# info detail
----------------------------------------------
            mda-type m20-1gb-xp-sfp
<snip>
            egress-xpl
                threshold 1000
                window 60
            exit
----------------------------------------------

threshold (default: 1000): threshold value for egress XPL errors
window (default: 60): window size (in minutes) for egress XPL errors

Sample CLI Output: MDA 1/1 is failed in the following example.

A:7750 SR# show mda 1/1 detail
===============================================================================
MDA 1/1 detail
===============================================================================
Slot  Mda   Provisioned Type                            Admin     Operational
                Equipped Type (if different)            State     State
-------------------------------------------------------------------------------
1     1     m10-1gb-sfp-b                               up        failed

MDA Specific Data
    Capabilities                  : Ethernet
    Fail On Error                 : Enabled
    Egress XPL error threshold    : 1000
    Egress XPL error window       : 60
    Ingress XPL error threshold   : 1000
    Ingress XPL error window      : 60

Hardware Data
<snip>
    Administrative state          : up
    Operational state             : failed
    Failure Reason                : MDA failed due to 60 consecutive windows of
                                    more than 1000 Egress XPL Errors
<snip>
Egress XPL Errors: Trap raised 1000 times; Last Trap 2015/11/16 18:53:08.89

Note: The CLI status and statistics are cleared after an IOM / IMM / XCM reboot.
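The failure reason "60 consecutive windows of more than 1000 Egress XPL Errors" suggests a consecutive-interval trigger. The sketch below models that reading: the MDA is failed once the error count exceeds the threshold in the configured number of consecutive intervals. This is an interpretation of the log text and of the threshold and window parameters, not the documented SR OS algorithm. The function name is hypothetical, and each list item is assumed to be the error count for one interval.

```python
def mda_should_fail(errors_per_interval, threshold=1000, window=60):
    """Return True once 'window' consecutive intervals each exceed 'threshold' errors.

    errors_per_interval: iterable of XPL error counts, one per interval
    (assumed here to be one-minute intervals).
    """
    consecutive = 0
    for count in errors_per_interval:
        # An interval at or below the threshold resets the run.
        consecutive = consecutive + 1 if count > threshold else 0
        if consecutive >= window:
            return True
    return False


if __name__ == "__main__":
    print(mda_should_fail([1001] * 60))                     # sustained errors
    print(mda_should_fail([1001] * 59 + [0] + [1001] * 59)) # run interrupted once
    print(mda_should_fail([1000] * 60))                     # exactly at threshold
```

Under this reading a single quiet interval restarts the count, so short bursts of errors do not take the MDA out of service.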

tmnxEqMdaIngrXplError

The following output is an example of the tmnxEqMdaIngrXplError event log entry, configuration, and CLI output information.

Sample Event Log Entry

3 2015/11/16 18:42:24.66 UTC MAJOR: CHASSIS #2001 Base Mda 1/1
"Class MDA Module : failed, reason: MDA failed due to 60 consecutive windows of more than 1000 Ingress XPL Errors"

Sample Configuration

A:7750 SR>config>card>mda# info detail
----------------------------------------------
            mda-type m20-1gb-xp-sfp
<snip>
            ingress-xpl
                threshold 1000
                window 60
            exit
----------------------------------------------

threshold (default: 1000): threshold value for ingress XPL errors
window (default: 60): window size (in minutes) for ingress XPL errors

Sample CLI Output: MDA 1/1 is failed in the following example.

A:7750 SR# show mda 1/1 detail
===============================================================================
MDA 1/1 detail
===============================================================================
Slot  Mda   Provisioned Type                            Admin     Operational
                Equipped Type (if different)            State     State
-------------------------------------------------------------------------------
1     1     m20-1gb-xp-sfp                              up        failed
<snip>
    Fail On Error                 : Enabled
    Egress XPL error threshold    : 1000
    Egress XPL error window       : 60
    Ingress XPL error threshold   : 1000
    Ingress XPL error window      : 60

Hardware Data
<snip>
    Operational state             : failed
    Failure Reason                : MDA failed due to 60 consecutive windows of
                                    more than 1000 Ingress XPL Errors
<snip>
Ingress XPL Errors: Trap raised 1000 times; Last Trap 2015/11/16 18:42:24.66

13.7 To Troubleshoot Using the Fail-On-Error Feature

The following steps describe the workflow to collect troubleshooting information for card and MDA errors using the fail-on-error feature.

Step 1. While the card/MDA is in the Failed operational state, use the admin tech-support command to generate one TS file.

Step 2. Remotely power cycle the failed card/MDA during a maintenance window (MW). If a remote power cycle is not supported, physically reseat the failed card.

Step 3. Use the admin tech-support command to generate two TS files, ensuring that the files (deltas) are at least 15 minutes apart.

Step 4. Monitor the card/MDA to determine if the hardware error persists. This is indicated if the card/MDA fails again.

Step 5. If the card/MDA fails again, replace the affected hardware. Monitor the replaced hardware for errors.

Step 6. Using the admin tech-support command, generate one TS file of the post-replacement system.

Step 7. Escalate the issue to Nokia Technical Support for further troubleshooting assistance. Ensure that all relevant TS files are attached and appropriately labeled (clearly specifying when each was taken).
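Step 3 asks for two TS files taken at least 15 minutes apart. The following sketch checks that spacing from the two collection times. The helper name is hypothetical, and how the timestamps are obtained (file names, operator notes, and so on) is left as an assumption.

```python
from datetime import datetime, timedelta

# Minimum spacing between the two TS files, from Step 3 of the procedure.
MIN_GAP = timedelta(minutes=15)


def far_enough_apart(first: datetime, second: datetime, min_gap: timedelta = MIN_GAP) -> bool:
    """True if the two collection times are at least 'min_gap' apart."""
    return abs(second - first) >= min_gap


if __name__ == "__main__":
    t1 = datetime(2015, 11, 16, 19, 0, 0)
    print(far_enough_apart(t1, datetime(2015, 11, 16, 19, 15, 0)))
    print(far_enough_apart(t1, datetime(2015, 11, 16, 19, 14, 59)))
```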

13.8 Down-On-Internal-Error

The down-on-internal-error feature allows the operator to configure the system to bring a port operationally Down when internal MAC transmit errors are detected by the system. The Int MAC Tx Errs counter displays the number of frames for which transmission on a specific interface has failed due to an internal MAC sub-layer transmit error. In addition, the following information applies to the Int MAC Tx Errs counter.

• The Int MAC Tx Errs counters are raised at the port level.

• Traps are not generated for Int MAC Tx Errs counter increments.


• The counter information can be viewed using the show port x detail CLI command.

• The internal error threshold is not user configurable.

13.8.1 Down-On-Internal-Error Examples

This section provides event log configuration and CLI output examples for the down-on-internal-error feature.

Sample Configuration

The following output is an example of a port-level down-on-internal-error configuration.

A:7750 SR>config>port>ethernet# info
----------------------------------------------
            mode access
            mtu 9212
            down-on-internal-error
----------------------------------------------

Sample Event Log Entry

2560 2015/01/1 08:20:57.98 UTC WARNING: SNMP #2004 Base 1/1/1
"Interface 1/1/1 is not operational"

2559 2015/01/1 08:20:57.98 JST MINOR: PORT #2054 Base Port 1/1/1
"Excess internal MAC TX errors detected Set"

Sample CLI Output

A:7750 SR# show port 1/1/1 detail===============================================================================Ethernet Interface===============================================================================Description : toFarEndNodeInterface : 1/2/1 Oper Speed : 10 GbpsLink-level : Ethernet Config Speed : N/AAdmin State : up Oper Duplex : fullOper State : down Config Duplex : N/AReason Down : internalMacTxErrorPhysical Link : No MTU : 1918Single Fiber Mode : No Min Frame Length : 64 Bytes

<snip>

Note: Nokia recommends that the down-on-internal-error feature should only be enabled when redundant ports, interfaces, and services are available in the network.


===============================================================================
Ethernet-like Medium Statistics
===============================================================================
Alignment Errors :                   0  Sngl Collisions  :                   0
FCS Errors       :                   0  Mult Collisions  :                   0
SQE Test Errors  :                   0  Late Collisions  :                   0
CSE              :                   0  Excess Collisns  :                   0
Too long Frames  :                   0  Int MAC Tx Errs  :              489517
Symbol Errors    :                   0  Int MAC Rx Errs  :                   0
In Pause Frames  :                   0  Out Pause Frames :                   0
===============================================================================
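When the show port output is saved to a file, a counter such as Int MAC Tx Errs can be extracted with a short script instead of being read by eye. The following Python sketch is illustrative only. It assumes the captured text keeps one table row per line in the two-column "Name : value" layout of the statistics table; the function name parse_counters is hypothetical and is not part of any Nokia tooling.

```python
import re

# Matches "Name : 123" pairs; two pairs may share one row of the output.
_PAIR = re.compile(r"([A-Za-z][A-Za-z /\-]*?)\s*:\s*(\d+)")

def parse_counters(text):
    """Return {counter name: integer value} for every 'Name : value' pair."""
    counters = {}
    for line in text.splitlines():
        for name, value in _PAIR.findall(line):
            counters[name.strip()] = int(value)
    return counters

# Two rows copied from the sample statistics table.
sample = """\
Too long Frames  :            0 Int MAC Tx Errs  :       489517
Symbol Errors    :            0 Int MAC Rx Errs  :            0
"""
counters = parse_counters(sample)
print(counters["Int MAC Tx Errs"])
```

Only rows with purely numeric values are picked up, so text fields such as Reason Down are ignored by this sketch.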

13.9 To Troubleshoot Using the Down-On-Internal-Error Feature

The following steps describe the workflow to collect troubleshooting information using the down-on-internal-error feature.

A port that is taken out of service due to excessive internal errors will remain in an operationally Down state until it is manually re-enabled by an administrator.

Step 1. Check for and address accompanying XPL errors. Proceed to the next step only when the XPL errors have been addressed.

Step 2. Perform a Shut / No Shut on the affected port.

Step 3. Check the Int MAC Tx Errs counter.

If these errors continue to increment, remotely power cycle the failed card during a maintenance window (MW). If remote power cycling is not supported, physically reseat the failed card.

Step 4. Using the admin tech-support CLI command, generate two tech-support (TS) files, taken at least 15 minutes apart.

Step 5. Monitor the port to determine whether the Int MAC Tx Errs counter continues to increment.

Step 6. If the errors persist, replace the affected MDA (or XMA) and monitor.

Step 7. If the errors persist, replace the affected IOM and monitor. This step applies to the 7750 SR only.

Step 8. Using the admin tech-support CLI command, generate one or more TS files of the post-replacement system.

Step 9. Escalate the issue to Nokia Technical Support for further troubleshooting assistance. Ensure that all relevant TS files are attached and appropriately labeled (clearly specifying when each was taken).
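The timing and labeling requirements in Steps 4 and 9 can be checked with a small helper before the files are attached to a support case. This is a minimal Python sketch, not Nokia tooling. The timestamps, the node name, and the label format are hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta

MIN_GAP = timedelta(minutes=15)  # spacing called for in Step 4

def captures_far_enough_apart(first, second, min_gap=MIN_GAP):
    """True if the two capture times are at least min_gap apart."""
    return abs(second - first) >= min_gap

def label_for(node, taken_at, stage):
    """Build a descriptive label stating when a TS file was taken.

    The naming scheme is an assumption made for this example.
    """
    return f"{node}_{stage}_{taken_at:%Y%m%d-%H%M%S}"

# Hypothetical capture times for the two TS files.
t1 = datetime(2015, 1, 1, 8, 30, 0)
t2 = datetime(2015, 1, 1, 8, 46, 0)
print(captures_far_enough_apart(t1, t2))
print(label_for("nodeA", t1, "pre-replacement"))
```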


13.10 CRC-Monitor

The crc-monitor feature allows the operator to configure the Ethernet CRC monitoring parameters. CRC errors occur in received traffic from a remote Ethernet source.

The crc-monitor feature provides three configurable commands.

• Signal-Degrade Threshold

This command specifies the error rate at which to declare the Signal Degrade (SD) condition on an Ethernet interface.

Syntax — [no] sd-threshold N [multiplier M]

Where:

N specifies the exponent (1-9) of the CRC error rate threshold

M (optional parameter) is a multiplier (1-9). The resulting threshold, M*10E-N, is the ratio of errored frames to total frames received over the W seconds of the sliding window.

Example — N=2, M=5

The command default is no sd-threshold.

• Signal-Fail Threshold

This command specifies the error rate at which to declare the Signal Fail (SF) condition on an Ethernet interface.

Syntax — [no] sf-threshold N [multiplier M]

Where:

N specifies the exponent (1-9) of the CRC error rate threshold

M (optional parameter) is a multiplier (1-9). The resulting threshold, M*10E-N, is the ratio of errored frames to total frames received over the W seconds of the sliding window.

Example — N=1

The command default is no sf-threshold.

Note: Nokia recommends using the crc-monitor feature only when redundant ports, interfaces, and services are available in the network.

Note: If the multiplier keyword is omitted or no sd-threshold is specified, the multiplier will return to the default value of 1.


• Window Size

This command specifies the size, in seconds, of the sliding window over which Ethernet frames are sampled to detect signal fail or signal degrade conditions. It is used together with the sf-threshold and sd-threshold commands.

Syntax — [no] window-size W

Where:

W specifies the size of the sliding window in seconds (1-10) over which the errors are measured

Example — no window-size

The command default is no window-size (that is, 10 seconds).
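The interaction of the three commands can be illustrated with a short model. The following Python sketch is a simplified approximation and does not describe how the system implements crc-monitor internally. It assumes one (errored, total) frame sample per second and that a ratio equal to a threshold counts as exceeding it; the class and function names are hypothetical.

```python
from collections import deque

def threshold(n, m=1):
    """Error-ratio threshold M*10E-N (the multiplier defaults to 1)."""
    return m * 10.0 ** -n

class CrcWindow:
    """Keeps the last `window` one-second samples of (errored, total) frames."""

    def __init__(self, sd=None, sf=None, window=10):
        self.sd, self.sf = sd, sf            # None models "no sd/sf-threshold"
        self.samples = deque(maxlen=window)  # oldest sample drops out first

    def add(self, errored, total):
        self.samples.append((errored, total))

    def ratio(self):
        errored = sum(e for e, _ in self.samples)
        total = sum(t for _, t in self.samples)
        return errored / total if total else 0.0

    def state(self):
        r = self.ratio()
        if self.sf is not None and r >= self.sf:
            return "SF"
        if self.sd is not None and r >= self.sd:
            return "SD"
        return "OK"

# sd-threshold 2 multiplier 5, sf-threshold 1, no window-size (10 seconds)
mon = CrcWindow(sd=threshold(2, 5), sf=threshold(1), window=10)
mon.add(errored=6, total=100)     # 6/100 = 0.06, above SD (0.05) only
first_state = mon.state()
mon.add(errored=30, total=100)    # 36/200 = 0.18, above SF (0.1)
second_state = mon.state()
print(first_state, second_state)
```

Because the ratio is computed over the whole window, a short burst of errors is diluted by clean samples, and a larger window reacts more slowly in both directions.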

13.10.1 CRC-Monitor Examples

This section provides event log configuration and CLI output examples for the crc-monitor feature.

Sample Configuration

The following output is an example of a port-level crc-monitor configuration.

A:7750 SR>config>port>ethernet>crc-mon# info detail
----------------------------------------------
        sd-threshold 2 multiplier 5
        sf-threshold 1
        no window-size
----------------------------------------------

Sample Event Log Entry

4 2015/11/16 22:50:57.38 UTC WARNING: SNMP #2004 Base 1/2/11
"Interface 1/2/11 is not operational"

3 2015/11/16 22:50:57.38 UTC MINOR: PORT #2052 Base Port 1/2/11
"CRC errors in excess of the configured fail threshold 1*10e-1 Set"

2 2015/11/16 22:50:55.38 UTC MINOR: PORT #2052 Base Port 1/2/11
"CRC errors in excess of the configured degrade threshold 5*10e-2 Set"
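Event log entries like these can be filtered by script when a large log is being reviewed. The following Python sketch assumes the field order seen in the sample entries (sequence number, date, time, time zone, severity, application, event number, subject, then the quoted message) and may not match every entry type. The function name parse_event and the returned field names are hypothetical.

```python
import re

# Header fields followed by the quoted message; the message may start on
# the same line as the header or on the next line.
EVENT = re.compile(
    r'(?P<seq>\d+)\s+(?P<date>\S+)\s+(?P<time>\S+)\s+(?P<tz>\S+)\s+'
    r'(?P<severity>\w+):\s+(?P<app>\w+)\s+#(?P<event>\d+)\s+'
    r'(?P<subject>[^"]*?)\s*"(?P<message>[^"]*)"'
)
# Threshold notation used in the CRC messages, for example 5*10e-2.
CRC = re.compile(r'configured (?P<kind>fail|degrade) threshold '
                 r'(?P<m>\d)\*10e-(?P<n>\d)', re.IGNORECASE)

def parse_event(entry):
    """Split one log entry into fields; add CRC threshold details if present."""
    match = EVENT.search(entry)
    if match is None:
        return None
    fields = match.groupdict()
    crc = CRC.search(fields["message"])
    if crc:
        fields["kind"] = crc["kind"].lower()
        fields["threshold"] = int(crc["m"]) * 10.0 ** -int(crc["n"])
    return fields

entry = ('3 2015/11/16 22:50:57.38 UTC MINOR: PORT #2052 Base Port 1/2/11\n'
         '"CRC errors in excess of the configured fail threshold 1*10e-1 Set"')
event = parse_event(entry)
print(event["severity"], event["subject"], event["kind"], event["threshold"])
```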

Sample CLI Output

A:7750 SR>config>port>ethernet# show port 1/2/11 detail
===============================================================================
Ethernet Interface
===============================================================================
Description        : 10/100/Gig Ethernet SFP


Interface          : 1/2/11                     Oper Speed       : N/A
Link-level         : Ethernet                   Config Speed     : 1 Gbps
Admin State        : up                         Oper Duplex      : N/A
Oper State         : down                       Config Duplex    : full
Reason Down        : crcError
Physical Link      : No                         MTU              : 1518
<snip>

CRC Mon SD Thresh  : 5*10E-2                    CRC Mon Window   : 10 seconds
CRC Mon SF Thresh  : 1*10E-1
CRC Alarms         : sdThresholdExceeded sfThresholdExceeded

<snip>

===============================================================================
Ethernet Statistics
===============================================================================
Broadcast Pckts  :                  32  Drop Events      :                   0
Multicast Pckts  :           561901396  CRC/Align Errors :             8614383
Undersize Pckts  :                   0  Fragments        :                   0
Oversize Pckts   :                   0  Jabbers          :                   0
Collisions       :                   0

<snip>

===============================================================================
Ethernet-like Medium Statistics
===============================================================================
Alignment Errors :                   0  Sngl Collisions  :                   0
FCS Errors       :             8614383  Mult Collisions  :                   0
SQE Test Errors  :                   0  Late Collisions  :                   0
CSE              :                   0  Excess Collisns  :                   0
Too long Frames  :                   0  Int MAC Tx Errs  :                   0
Symbol Errors    :                   0  Int MAC Rx Errs  :                   0
In Pause Frames  :                   0  Out Pause Frames :                   0
===============================================================================

===============================================================================
Per Threshold MDA Discard Statistics
===============================================================================
                                   Packets                 Octets
-------------------------------------------------------------------------------
Threshold 0 Dropped  :             0                       0
<snip>
Threshold 15 Dropped :             8614390                 930354120
===============================================================================

13.11 To Troubleshoot Using the CRC-Monitor Feature

The following steps describe the workflow to collect the relevant troubleshooting information using the crc-monitor feature.


A port that is taken out of service due to signal-failure will remain in an operationally Down state until it is manually re-enabled by an administrator.

Step 1. Perform a Shut / No-shut on the port that was operationally shut down by the crc-monitor feature.

Step 2. Investigate physical layer issues that might be causing the incrementing CRC errors on the affected port. Check the SFP, inspect and clean the fiber, and check the transport equipment, the far-end port, and so on.

Step 3. At the end of each troubleshooting step, check for CRC errors.

If the CRC errors continue to increment, investigate each physical medium or device in the end-to-end datapath, from the affected port toward the far end.

Note: If the CRC errors persist, the crc-monitor feature may continually take the affected port out of service. To prevent this occurrence, it may be necessary to adjust the SF threshold or disable crc-monitor when the physical layer troubleshooting is in progress.
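The check in Step 3 amounts to comparing successive readings of the CRC error counter and noting which action stopped the increments. A minimal Python sketch of that bookkeeping follows. The counter values and step descriptions are hypothetical, and the handling of a cleared counter is an assumption of this example.

```python
def new_errors(before, after):
    """CRC errors accumulated between two readings of the same counter."""
    if after < before:
        # The counter was cleared between readings; treat the later
        # reading as the count since the reset (an assumption).
        return after
    return after - before

def first_clean_step(readings):
    """readings: (step description, counter value) pairs taken in order,
    starting with a baseline. Returns the first step after which the
    counter stopped incrementing, or None if it never did."""
    for (_, prev), (step, value) in zip(readings, readings[1:]):
        if new_errors(prev, value) == 0:
            return step
    return None

# Hypothetical readings taken at the end of each troubleshooting step.
readings = [
    ("baseline", 1000),
    ("replaced SFP", 1450),
    ("cleaned fiber", 1450),
]
print(first_clean_step(readings))
```

A zero delta is only meaningful if the port carried traffic between the two readings, so the result should be confirmed over a longer observation period.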


Customer Document and Product Support

Customer documentation: Customer Documentation Welcome Page

Technical Support: Product Support Portal

Documentation feedback: Customer Documentation Feedback
