NSX-T Data Center Troubleshooting Guide Modified March 2019 VMware NSX-T Data Center 2.3


You can find the most up-to-date technical documentation on the VMware website at:

https://docs.vmware.com/

If you have comments about this documentation, submit your feedback to [email protected]

Copyright © 2017 – 2019 VMware, Inc. All rights reserved. Copyright and trademark information.

VMware, Inc.
3401 Hillview Ave.
Palo Alto, CA 94304
www.vmware.com

Contents

NSX-T Data Center Troubleshooting Guide

1 Logs and Services
Log Messages
Troubleshooting Syslog Issues
Checking Services
Collecting Support Bundles

2 Troubleshooting Layer 2 Connectivity
Check the NSX Manager and NSX Controller Cluster Status
Check the Logical Ports
Check the Transport Node Status
Check the Logical Switch Status
Check the CCP for the Logical Switch
Check the Local Control Plane Status
Troubleshoot Config Session Issues
Troubleshoot L2 Session Issues
Troubleshoot Dataplane Issues for an Overlay Logical Switch
Troubleshoot Dataplane Issues for a VLAN Logical Switch
Troubleshoot ARP Issues for an Overlay Logical Switch
Troubleshoot Packet Loss for a VLAN Logical Switch or When ARP Is Resolved

3 Troubleshooting Installation

4 Troubleshooting Routing

5 Troubleshooting the Central Control Plane
Deleting a Controller Node Fails
Removing a Controller from a Cluster Fails
Controller VM Fails to Power On
Controller Fails to Register with NSX Manager
Controller Fails to Come Online after Registration
Controller Fails to Join the Cluster
Controller Clustering Fails

6 Troubleshooting Firewall
Determining Firewall Rules that Apply on an ESXi Host
Determining Firewall Rules that Apply on a KVM Host
Firewall Packet Logs
On an ESXi Host, the getrules Command Shows an Unknown MAC Address
Stateful Edge Firewall Not Working

7 Other Troubleshooting Scenarios
Failure to Add or Delete a Transport Node
Transport Node Takes About 5 Minutes to Connect to Another Controller
NSX Manager VM Is Degraded
NSX Agent Times Out Communicating with NSX Manager
Failure to Add an ESXi Host
Incorrect NSX Controller Status
Management IPs on KVM VMs Not Reachable with IPFIX Enabled
Upgrade Fails Due to a Timeout
Edge Transport Node Status Degraded if Any Interface Is Down

NSX-T Data Center Troubleshooting Guide

The NSX-T Data Center Troubleshooting Guide provides information on how to troubleshoot issues that might occur in an NSX-T Data Center environment.

Intended Audience

This guide is for system administrators of NSX-T Data Center. A familiarity with virtualization, networking, and data center operations is assumed.

VMware Technical Publications Glossary

VMware Technical Publications provides a glossary of terms that might be unfamiliar to you. For definitions of terms as they are used in VMware technical documentation, go to http://www.vmware.com/support/pubs.


1 Logs and Services

Logs can be helpful in many troubleshooting scenarios. Checking the status of services is also important.

This chapter includes the following topics:

n Log Messages

n Troubleshooting Syslog Issues

n Checking Services

n Collecting Support Bundles

Log Messages

Log messages from all NSX-T Data Center components, including those running on ESXi hosts, conform to the syslog format as specified in RFC 5424. Log messages from KVM hosts are in the RFC 3164 format. The log files are in the directory /var/log.

On NSX-T Data Center appliances, you can run the following NSX-T Data Center CLI command to view the logs:

get log-file <auth.log | http.log | kern.log | manager.log | node-mgmt.log | syslog> [follow]

On hypervisors, you can use Linux commands such as tac, tail, grep, and more to view the logs. You can also use these commands on NSX-T Data Center appliances.

For more information about RFC 5424, see https://tools.ietf.org/html/rfc5424. For more information about RFC 3164, see https://tools.ietf.org/html/rfc3164.

RFC 5424 defines the following format for log messages:

<facility * 8 + severity> version UTC-TZ hostname APP-NAME procid MSGID [structured-data] msg

A sample log message:

<187>1 2016-03-15T22:53:00.114Z nsx-manager NSX - SYSTEM [nsx@6876 comp="nsx-manager" errorCode="MP4039" subcomp="manager"] Connection verification failed for broker '10.160.108.196'. Marking broker unhealthy.

Every message has the component (comp) and sub-component (subcomp) information to help identify the source of the message.


NSX-T Data Center produces regular logs (facility local6, which has a numerical value of 22) and audit logs (facility local7, which has a numerical value of 23). All API calls trigger an audit log.
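Because the PRI value at the start of each message is facility × 8 + severity, you can decode it with a quick shell function (a sketch; decode_pri is a hypothetical helper, not an NSX-T command):

```shell
# Decode a syslog PRI value into facility and severity.
# PRI = facility * 8 + severity (RFC 5424).
decode_pri() {
    pri=$1
    facility=$((pri / 8))
    severity=$((pri % 8))
    echo "facility=$facility severity=$severity"
}

# The sample message above starts with <187>:
# 187 = 23 * 8 + 3, i.e. facility local7 (audit) at severity 3 (error).
decode_pri 187   # facility=23 severity=3
```

A PRI of 179 would decode to facility 22 (local6, a regular log) at the same severity.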

An audit log that is associated with an API call has the following information:

n An entity ID parameter entId to identify the object of the API.

n A request ID parameter req-id to identify a specific API call.

n An external request ID parameter ereqId if the API call contains the header X-NSX-EREQID:<string>.

n An external user parameter euser if the API call contains the header X-NSX-EUSER:<string>.

RFC 5424 defines the following severity levels:

Severity Level Description

0 Emergency: system is unusable

1 Alert: action must be taken immediately

2 Critical: critical conditions

3 Error: error conditions

4 Warning: warning conditions

5 Notice: normal but significant condition

6 Informational: informational messages

7 Debug: debug-level messages

All logs with a severity of emergency, alert, critical, or error contain a unique error code in the structured-data portion of the log message. The error code consists of a string and a decimal number. The string represents a specific module.

The MSGID field identifies the type of message. For a list of the message IDs, see Log Message IDs.
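For example, the MSGID and the error code can be pulled out of a message with standard text tools (a sketch using the sample message shown earlier; field positions follow the RFC 5424 header layout):

```shell
# Sample RFC 5424 message (abridged from the example above).
msg='<187>1 2016-03-15T22:53:00.114Z nsx-manager NSX - SYSTEM [nsx@6876 comp="nsx-manager" errorCode="MP4039" subcomp="manager"] Connection verification failed'

# MSGID is the sixth space-separated field of the header.
echo "$msg" | awk '{print $6}'                           # SYSTEM

# The error code lives in the structured-data section.
echo "$msg" | sed -n 's/.*errorCode="\([^"]*\)".*/\1/p'  # MP4039
```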

Configure Remote Logging

You can configure NSX-T Data Center appliances and hypervisors to send log messages to a remote logging server.

Remote logging is supported on NSX Manager, NSX Controller, NSX Edge, and hypervisors. You must configure remote logging on each node individually.

On a KVM host, the NSX-T Data Center installation package automatically configures the rsyslog daemon by putting configuration files in the /etc/rsyslog.d directory.

Prerequisites

n Configure a logging server to receive the logs.


Procedure

1 To configure remote logging on an NSX-T Data Center appliance:

a Run the following command to configure a log server and the types of messages to send to the log server. Multiple facilities or message IDs can be specified as a comma-delimited list, without spaces.

set logging-server <hostname-or-ip-address[:port]> proto <proto> level <level> [facility <facility>] [messageid <messageid>] [certificate <filename>] [structured-data <structured-data>]

For more information about this command, see the NSX-T CLI Reference. You can run the command multiple times to add multiple logging server configurations. For example:

nsx> set logging-server 192.168.110.60 proto udp level info facility syslog messageid SYSTEM,FABRIC

nsx> set logging-server 192.168.110.60 proto udp level info facility auth,user

b You can view the logging configuration with the get logging-servers command. For example:

nsx> get logging-servers

192.168.110.60 proto udp level info facility syslog messageid SYSTEM,FABRIC

192.168.110.60 proto udp level info facility auth,user

2 To configure remote logging on an ESXi host:

a Run the following commands to configure syslog and send a test message:

esxcli network firewall ruleset set -r syslog -e true

esxcli system syslog config set --loghost=udp://<log server IP>:<port>

esxcli system syslog reload

esxcli system syslog mark -s "This is a test message"

b You can run the following command to display the configuration:

esxcli system syslog config get

3 To configure remote logging on a KVM host:

a Edit the file /etc/rsyslog.d/10-vmware-remote-logging.conf for your environment.

b Add the following line to the file:

*.* @<ip>:514;RFC5424fmt

c Run the following command:

service rsyslog restart

NSX-T Data Center Troubleshooting Guide

VMware, Inc. 8

Log Message IDs

In a log message, the message ID field identifies the type of message. You can use the messageid parameter in the set logging-server command to filter which log messages are sent to a logging server.

Table 1-1. Log Message IDs

FABRIC: Host node, Host preparation, Edge node, Transport zone, Transport node, Uplink profiles, Cluster profiles, Edge cluster, Bridge clusters and endpoints

SWITCHING: Logical switch, Logical switch ports, Switching profiles, Switch security features

ROUTING: Logical router, Logical router ports, Static routing, Dynamic routing, NAT

FIREWALL: Firewall rules, Firewall rule sections

FIREWALL-PKTLOG: Firewall connection logs, Firewall packet logs

GROUPING: IP sets, MAC sets, NSGroups, NSServices, NSService groups, VNI pool, IP pool

DHCP: DHCP relay

SYSTEM: Appliance management (remote syslog, NTP, etc.), Cluster management, Trust management, Licensing, Users and roles, Task management, Install (NSX Manager, NSX Controller), Upgrade (NSX Manager, NSX Controller, NSX Edge and host-packages upgrades), Realization, Tags

MONITORING: SNMP, Port connection, Traceflow

-: All other log messages
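Besides filtering at the logging server with the messageid parameter, you can filter saved logs locally on the MSGID field, which is the sixth field of the RFC 5424 header (a sketch; the sample lines are illustrative, not real NSX-T output):

```shell
# Keep only messages whose MSGID (6th header field) is SYSTEM or FABRIC.
awk '$6 == "SYSTEM" || $6 == "FABRIC"' <<'EOF'
<187>1 2019-03-01T10:00:00Z nsx-manager NSX - SYSTEM [nsx@6876 comp="nsx-manager"] cluster status changed
<134>1 2019-03-01T10:00:01Z nsx-manager NSX - ROUTING [nsx@6876 comp="nsx-manager"] route added
EOF
```

Only the first sample line, with MSGID SYSTEM, survives the filter.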

Troubleshooting Syslog Issues

If logs are not received by the remote log server, perform the following steps.

n Verify the remote log server's IP address.

n Verify that the level parameter is configured correctly.

n Verify that the facility parameter is configured correctly.

n If the protocol is TLS, set the protocol to UDP to see if there is a certificate mismatch.

n If the protocol is TLS, verify that port 6514 is open on both ends.

n Remove the message ID filter and see if logs are received by the server.

n Restart the rsyslog service with the command restart service rsyslogd.

A sample rsyslog configuration file (/etc/rsyslog.conf):

### rsyslog config file. Customized by VMware.

### Do not edit this file by hand. Use the API to make changes.

$PreserveFQDN on

$ModLoad imklog

$ModLoad immark

module(load="imuxsock" sysSock.useSpecialParser="off")

$RepeatedMsgReduction on

$FileOwner syslog

$FileGroup adm

$FileCreateMode 0640

$DirCreateMode 0755

$Umask 0022

$ActionFileDefaultTemplate RSYSLOG_SyslogProtocol23Format

$IncludeConfig /etc/rsyslog.d/*.conf


$template RFC5424fmt,"<%PRI%>1 %TIMESTAMP:::date-rfc3339% %HOSTNAME% %APP-NAME% %PROCID% %MSGID% %STRUCTURED-DATA% %msg%\n"

$WorkDirectory /var/spool/rsyslog

$ModLoad imudp

$UDPServerAddress 127.0.0.1

$UDPServerRun 514

$PrivDropToUser syslog

$ActionQueueType LinkedList # nsx exporter: e7347687-8be7-4519-a8e1-73c5192c9b43

*.info @1.2.3.4:514;RFC5424fmt # nsx exporter: e7347687-8be7-4519-a8e1-73c5192c9b43
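The RFC5424fmt template in this file controls the on-wire format of forwarded messages. The sketch below reproduces what the template emits for a given set of fields, purely for illustration; format_rfc5424 is a hypothetical helper, and rsyslog itself performs the substitution in production:

```shell
# Emulate the RFC5424fmt template:
# "<PRI>1 TIMESTAMP HOSTNAME APP-NAME PROCID MSGID STRUCTURED-DATA msg"
format_rfc5424() {
    pri=$1; ts=$2; host=$3; app=$4; procid=$5; msgid=$6; sd=$7; msg=$8
    printf '<%s>1 %s %s %s %s %s %s %s\n' \
        "$pri" "$ts" "$host" "$app" "$procid" "$msgid" "$sd" "$msg"
}

format_rfc5424 187 2019-03-01T10:00:00Z nsx-manager NSX - SYSTEM '-' 'test message'
```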

Checking Services

Services that stop running or fail to start can cause problems. It is important to make sure that all services are running normally.

To check the status of NSX Manager services:

nsxmgr> get services

Service name: cm-inventory

Service state: stopped

Service name: http

Service state: stopped

Session timeout: 1800

Connection timeout: 30

Redirect host: (not configured)

Service name: install-upgrade

Service state: stopped

Enabled: True

Service name: liagent

Service state: stopped

Service name: manager

Service state: stopped

Logging level: info

Service name: mgmt-plane-bus

Service state: running

Service name: node-mgmt

Service state: running

Service name: nsx-message-bus

Service state: running

Service name: nsx-upgrade-agent

Service state: running

Service name: ntp

Service state: running


Service name: search

Service state: stopped

Service name: snmp

Service state: stopped

Start on boot: False

Service name: ssh

Service state: running

Start on boot: True

Service name: syslog

Service state: running

In the example above, the http service is stopped. You can start the http service with the following command:

nsxmgr> start service http
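When many services are listed, it can help to filter the get services output down to the stopped ones. The following sketch does this with awk on a saved copy of the output (list_stopped is a hypothetical helper; the sample input mirrors the format shown above):

```shell
# List services whose state is "stopped" from saved `get services` output.
list_stopped() {
    awk '/^Service name:/ { name = $3 }
         /^Service state: stopped/ { print name }' "$1"
}

# Example with a small sample of the output above:
cat > /tmp/services.txt <<'EOF'
Service name: cm-inventory
Service state: stopped
Service name: ntp
Service state: running
Service name: http
Service state: stopped
EOF
list_stopped /tmp/services.txt
```

This prints cm-inventory and http, the two stopped services in the sample.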

SSH Service

If the SSH service was not enabled when deploying the appliance, you can log in to the appliance as admin and enable it with the following command:

start service ssh

You can configure SSH to start when the host starts with the following command:

set service ssh start-on-boot

To enable SSH root login, you can log in to the appliance as root, edit the file /etc/ssh/sshd_config, and replace the line

PermitRootLogin prohibit-password

with

PermitRootLogin yes

and then restart the sshd server with the following command:

/etc/init.d/ssh restart

Alternatively, you can enable the SSH service and enable SSH root access by powering off the appliance and modifying its vApp properties.
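The sshd_config change above can also be scripted, for example with sed. The sketch below runs against a temporary copy; back up the real /etc/ssh/sshd_config before changing it on an appliance:

```shell
# Replace the PermitRootLogin setting in an sshd_config-style file.
# Demonstrated on a temporary copy rather than the real /etc/ssh/sshd_config.
cfg=/tmp/sshd_config.sample
printf 'Port 22\nPermitRootLogin prohibit-password\n' > "$cfg"

sed -i 's/^PermitRootLogin prohibit-password$/PermitRootLogin yes/' "$cfg"

grep '^PermitRootLogin' "$cfg"   # PermitRootLogin yes
```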


Collecting Support Bundles

You can collect support bundles on registered cluster and fabric nodes and download the bundles to your machine or upload them to a file server.

If you choose to download the bundles to your machine, you get a single archive file consisting of a manifest file and support bundles for each node. If you choose to upload the bundles to a file server, the manifest file and the individual bundles are uploaded to the file server separately.

NSX Cloud Note If you want to collect the support bundle for CSM, log in to CSM, go to System > Utilities > Support Bundle and click Download. The support bundle for PCG is available from NSX Manager using the following instructions. The support bundle for PCG also contains logs for all the workload VMs.

Procedure

1 From your browser, log in with admin privileges to NSX Manager at https://nsx-manager-ip-address.

2 Select System > Utilities from the navigation panel.

3 Click the Support Bundle tab.

4 Select the target nodes.

The available types of nodes are Management Nodes, Controller Nodes, Edges, Hosts, and Public Cloud Gateways.

5 (Optional) Specify log age in days to exclude logs that are older than the specified number of days.

6 (Optional) Toggle the switch that indicates whether to include or exclude core files and audit logs.

Note Core files and audit logs might contain sensitive information such as passwords or encryption keys.

7 (Optional) Select a check box to upload the bundles to a file server.

8 Click Start Bundle Collection to start collecting support bundles.

Depending on how many log files exist, each node might take several minutes.

9 Monitor the status of the collection process.

The status field shows the percentage of nodes that completed support bundle collection.

10 Click Download to download the bundle if the option to send the bundle to a file server was not set.


2 Troubleshooting Layer 2 Connectivity

If there is a communication failure between two virtual interfaces (VIFs) that are connected to the same logical switch, for example, you cannot ping one VM from another, you can follow the steps in this section to troubleshoot the failure.

Before you start, make sure that there is no firewall rule blocking traffic between the two logical ports. It is recommended that you follow the order of the topics in this section to troubleshoot the connectivity issue.

This chapter includes the following topics:

n Check the NSX Manager and NSX Controller Cluster Status

n Check the Logical Ports

n Check the Transport Node Status

n Check the Logical Switch Status

n Check the CCP for the Logical Switch

n Check the Local Control Plane Status

n Troubleshoot Config Session Issues

n Troubleshoot L2 Session Issues

n Troubleshoot Dataplane Issues for an Overlay Logical Switch

n Troubleshoot Dataplane Issues for a VLAN Logical Switch

n Troubleshoot ARP Issues for an Overlay Logical Switch

n Troubleshoot Packet Loss for a VLAN Logical Switch or When ARP Is Resolved

Check the NSX Manager and NSX Controller Cluster Status

Verify that the status of NSX Manager and the NSX Controller cluster is normal, and the controllers are connected to the NSX Manager.


Procedure

1 Run the following CLI command on the NSX Manager to make sure the status is stable.

NSX-Manager> get management-cluster status

Number of nodes in management cluster: 1

- 192.168.110.47 (UUID 45A8869B-BB90-495D-8A01-69B5FCC56086) Online

Management cluster status: STABLE

Number of nodes in control cluster: 3

- 192.168.110.201 (UUID 45A8869B-BB90-495D-8A01-69B5FCC56086)

- 192.168.110.202 (UUID 45A8869B-BB90-495D-8A01-69B5FCC56086)

- 192.168.110.203 (UUID 45A8869B-BB90-495D-8A01-69B5FCC56086)

2 Run the following CLI command on an NSX Controller to make sure the status is active.

NSX-Controller1> get control-cluster status

uuid: db4aa77a-4397-4d65-ad33-9fde79ac3c5c

is master: true

in majority: true

uuid address status

0cfe232e-6c28-4fea-8aa4-b3518baef00d 192.168.110.201 active

bd257108-b94e-4e6d-8b19-7fa6c012961d 192.168.110.202 active

538be554-1240-40e4-8e94-1497e963a2aa 192.168.110.203 active

3 Run the following CLI command on an NSX Controller to make sure it is connected to the NSX Manager.

NSX-Controller1> get managers

- 192.168.110.47 Connected

Check the Logical Ports

Check that the logical ports are configured on the same logical switch and their status is up.

Procedure

1 From the NSX Manager GUI, get the logical ports UUIDs.

2 Make the following API call for each logical port to make sure the logical ports are on the same logical switch.

GET https://<nsx-mgr>/api/v1/logical-ports/<logical-port-uuid>

3 Make the following API call for each logical port to make sure the status is up.

GET https://<nsx-mgr>/api/v1/logical-ports/<logical-port-uuid>/status
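From a shell, the same-switch check can be sketched as follows. The curl calls are illustrative placeholders that need valid admin credentials; the parsing assumes the response contains the logical_switch_id field that the logical port API returns, and the sample responses below are abridged, hypothetical payloads:

```shell
# Fetch both logical ports (credentials and UUIDs are placeholders):
#   curl -k -u admin:password https://<nsx-mgr>/api/v1/logical-ports/<port-uuid>
# Each response includes a "logical_switch_id" field. Extract and compare it.

extract_switch_id() {
    sed -n 's/.*"logical_switch_id"[ ]*:[ ]*"\([^"]*\)".*/\1/p'
}

# Illustrative responses, abridged to the field of interest:
resp1='{"id":"port-1","logical_switch_id":"feab22ec-94b2-46f4-88f8-f9d44a416272"}'
resp2='{"id":"port-2","logical_switch_id":"feab22ec-94b2-46f4-88f8-f9d44a416272"}'

id1=$(echo "$resp1" | extract_switch_id)
id2=$(echo "$resp2" | extract_switch_id)
if [ "$id1" = "$id2" ]; then
    echo "ports are on the same logical switch"
else
    echo "ports are on different logical switches"
fi
```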


Check the Transport Node Status

Check the status of the transport node.

Procedure

u Make the following API call to get the state of the transport node.

GET https://<nsx-mgr>/api/v1/transport-nodes/<transport-node-ID>/state

If the call returns the error RPC timeout, perform the following troubleshooting steps:

n Run /etc/init.d/nsx-opsAgent status to see if opsAgent is running.

n Run /etc/init.d/nsx-mpa status to see if nsx-mpa is running.

n To see if nsx-mpa is connected to the NSX Manager, check the nsx-mpa heartbeat logs.

n To see if opsAgent is connected to the NSX Manager, check the nsx-opsAgent log. You will see the following message if opsAgent is connected to the NSX Manager.

Connected to mpa, cookie: ...

n To see if opsAgent is stuck processing HostConfigMsg, check the nsx-opsAgent log. If so, you will see an RMQ request message whose reply is not sent, or is sent only after a long delay.

n Check to see if opsAgent crashed while executing HostConfigMsg.

n To see if the RMQ messages are taking a long time to be delivered to the host, compare the timestamps of log messages on the NSX Manager and the host.

If the call returns the error partial_success, there are many possible causes. Start by looking at the nsx-opsAgent logs. On the ESXi host, check hostd.log and vmkernel.log. On KVM, syslog holds all the logs.
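To compare log timestamps between the NSX Manager and the host, as one of the RPC-timeout checks above suggests, you can convert the two timestamps to epoch seconds and compute the gap (a sketch assuming GNU date; the timestamps are illustrative):

```shell
# Compute the delivery delay between a message logged on the NSX Manager
# and the corresponding message logged on the host.
mgr_ts='2019-03-01T10:00:00Z'
host_ts='2019-03-01T10:00:42Z'

mgr_s=$(date -u -d "$mgr_ts" +%s)
host_s=$(date -u -d "$host_ts" +%s)
echo "delivery delay: $((host_s - mgr_s)) seconds"
```

A gap of more than a few seconds between the two log entries suggests slow RMQ delivery to the host.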

Check the Logical Switch Status

Check the status of the logical switch.

Procedure

u Make the following API call to get the state of the logical switch.

GET https://<nsx-mgr>/api/v1/logical-switches/<logical-switch-ID>/state

If the call returns the error partial_success, the reply will contain a list of transport nodes where the NSX Manager failed to push the logical switch configuration or did not get a reply. The troubleshooting steps are similar to those for the transport node. Check the following:

n All required components are installed and running.

n nsx-mpa is connected to the NSX Manager.


n nsxa is connected to the switching vertical.

n Grep the logical switch ID in nsxa.log and nsxaVim.log to see if the logical switch configuration was received by the transport node.

n Check the nsxa and nsx-mpa uptime. Find out when nsxa was started and stopped by grepping nsxa log messages in the syslog file.

n Find out nsxa's connection time to the switching vertical. If the logical switch configuration is sent to the host when nsxa is not connected to the switching vertical, the configuration might not be delivered to the host.

On KVM, no logical switch configuration is pushed to the host. Therefore, most of the logical switch issues are likely to be in the management plane.

On ESXi, an opaque network is mapped to the logical switch. To use the logical switch, users connect VMs to the opaque network using vCenter Server or the vSphere API.

Check the CCP for the Logical Switch

Verify that the logical switch is in the central control plane (CCP).

Procedure

u Run the following CLI command on an NSX Controller to make sure that the logical switch is present.

NSX-Controller1> get logical-switches

VNI UUID Name

52104 feab22ec-94b2-46f4-88f8-f9d44a416272 ls1

Note This CLI command does not list VLAN-backed logical switches.

Check the Local Control Plane Status

For an overlay logical switch, check that the netcpa on the host is connected to the central control plane.

Prerequisites

Find the controller that the logical switch is on. See Check the CCP for the Logical Switch.

Procedure

1 SSH to the controller that the logical switch is on.

2 Run the following command and verify that the controller shows the hypervisors that are connected to this VNI.

get logical-switch 5000 transport-node-table

3 On the hypervisors, run the command /bin/nsxcli to start NSX CLI.


4 Run the following command to get the CCP sessions.

host1> get ccp-session

Session Index State Controller

Config 0 UP 10.33.74.163

L2 5000 UP 10.33.74.163

You should see a Config session on one of the CCP nodes in the CCP cluster. For every overlay logical switch, you should see an L2 session to one of the CCP nodes in the CCP cluster. For VLAN logical switches, there are no CCP connections.

Troubleshoot Config Session Issues

If the CCP config session is not up, check the status of MPA and netcpa.

Procedure

1 Make the following API call to see if MPA is connected to the NSX Manager.

GET https://<nsx-mgr>/api/v1/logical-ports/<logical-port-uuid>

2 On the hypervisor, run the command /bin/nsxcli to start NSX CLI.

3 Run the following command to get the node-uuid.

host1> get node-uuid

0c123dd4-8199-11e5-95e2-73cc1cd9b614

4 Run the following command to see if the NSX Manager pushed the CCP information to the host.

cat /etc/vmware/nsx/config-by-vsm.xml

5 If config-by-vsm.xml has CCP information, check if a transport node is configured on the hypervisor.

The NSX Manager sends the host certificate for the hypervisor in the transport node creation step. The CCP must have the host certificate before it accepts connections from the host.

6 Check the validity of the host certificate in /etc/vmware/nsx/host-cert.pem.

The certificate must be the same as the one that the NSX Manager has for the host.

7 Run the following command to check the status of netcpa.

On ESXi:

/etc/init.d/netcpad status

On KVM:

/etc/init.d/nsx-agent status


8 Start or restart netcpa.

On ESXi, start netcpa if it is not running, or restart it if it is running.

/etc/init.d/netcpad start

/etc/init.d/netcpad restart

On KVM, start netcpa if it is not running, or restart it if it is running.

/etc/init.d/nsx-agent start

/etc/init.d/nsx-agent restart

9 If the config session is still not up, collect the technical support bundles and contact VMware support.

Troubleshoot L2 Session Issues

This applies to overlay logical switches only.

Procedure

1 On the hypervisor, run the command /bin/nsxcli to start NSX CLI.

2 Run the following command to see if the logical switch is present on the host.

host1> get logical-switches

3 Check that the state of the port is not admin down.

On ESXi, run net-dvs and look at the response. For example,

port 63eadf53-ff92-4a0e-9496-4200e99709ff:

com.vmware.port.extraConfig.opaqueNetwork.id = … <- this should match the logical switch UUID

com.vmware.port.opaque.network.id = …. <- this should match the logical switch UUID

com.vmware.port.opaque.network.type = nsx.LogicalSwitch , propType = RUNTIME

com.vmware.common.port.block = false, ... <- Make sure the value is false.

com.vmware.vswitch.port.vxlan = …

com.vmware.common.port.volatile.status = inUse ... <- make sure the value is inUse.

If the logical port ends up in the blocked state, collect the technical support bundles and contact VMware support. In the meantime, run the following command to get the DVS name:

[root@host1:~] net-dvs | grep nsx-switch

com.vmware.common.alias = nsx-switch , propType = CONFIG

Run the following command to unblock the port:

[root@host1:~] net-dvs -s com.vmware.common.port.block=false <DVS-NAME> -p <logical-port-ID>


On KVM, run ovs-vsctl list interface and verify that the interface with the corresponding VIF UUID is present and admin_state is up. You can see the VIF UUID in OVSDB in external-ids:iface-id.

Troubleshoot Dataplane Issues for an Overlay Logical Switch

The steps in this section are for troubleshooting connectivity issues between VMs on different hypervisors through the overlay switch when the config and runtime states are normal.

If the VMs are on the same hypervisor, go to Troubleshoot ARP Issues for an Overlay Logical Switch.

Procedure

1 Run the following command on the controller that has the logical switch to see if CCP has the correct VTEP list:

controller1> get logical-switch 5000 vtep

2 On each hypervisor, run the following NSX CLI command to see if it has the correct VTEP list:

On ESXi:

host1> get logical-switch <logical-switch-UUID> tep-table

Alternatively, you can run the following shell command for the VTEP information:

[root@host1:~] net-vdl2 -M vtep -s vds -n VNI

On KVM:

host1> get logical-switch <logical-switch-UUID or VNI> tep-table

3 Check to see if the VTEPs on the hypervisors can ping each other.

At the ESXi shell prompt:

host1> ping ++netstack=vxlan <remote-VTEP-IP>

At the KVM shell prompt:

host1> ping <remote-VTEP-IP>

If the VTEPs cannot ping each other,

a Make sure the transport VLAN specified when creating the transport node matches what the underlay expects. If you are using access ports in the underlay, the transport VLAN should be set to 0. If you are specifying a transport VLAN, the underlay switch ports that the hypervisors connect to should be configured to accept this VLAN in trunk mode.


b Check underlay connectivity.

4 Check if the BFD sessions between the VTEPs are up.

On ESXi, run net-vdl2 -M bfd and look at the response. For example,

BFD count: 1

===========================

Local IP: 192.168.48.35, Remote IP: 192.168.197.243, Local State: up, Remote State: up, Local Diag: No Diagnostic, Remote Diag: No Diagnostic, minRx: 1000000, isDisabled: 0

On KVM, find the GENEVE interface to the remote IP.

ovs-vsctl list interface <GENEVE-interface-name>

If you don't know the interface name, run ovs-vsctl find Interface type=geneve to return all tunnel interfaces. Look for BFD information.

If you cannot find a GENEVE interface to the remote VTEP, check if nsx-agent is running and the OVS integration bridge is connected to nsx-agent.

[root@host1 ~]# ovs-vsctl show

96c9e543-fc68-448a-9882-6e161c313a5b

Manager "tcp:127.0.0.1:6632"

is_connected: true

Bridge nsx-managed

Controller "tcp:127.0.0.1:6633"

is_connected: true

Controller "unix:ovs-l3d.mgmt"

is_connected: true

fail_mode: secure

Troubleshoot Dataplane Issues for a VLAN Logical Switch

The steps in this section are for troubleshooting connectivity issues between VMs on different hypervisors through the configured VLAN on the underlay when the config and runtime states are normal.

If the VMs are on the same hypervisor and all the configuration and runtime states are normal, go to Troubleshoot ARP Issues for an Overlay Logical Switch.

Procedure

u Check that the underlay is configured for the VLAN for the logical switch in trunk mode.

On ESXi, verify the VLAN is configured on the logical port by running net-dvs and looking for the logical port. For example:

port 63eadf53-ff92-4a0e-9496-4200e99709ff:

com.vmware.common.port.volatile.vlan = VLAN 1000 propType = RUNTIME VOLATILE


On KVM, the VLAN logical switch is configured as an OpenFlow rule on the integration bridge. In other words, traffic received from the VIF is tagged with VLAN X and forwarded on the patch port to the PIF bridge. Run ovs-vsctl list interface and verify the presence of the patch port between the NSX-managed bridge and the NSX-switch bridge.

Troubleshoot ARP Issues for an Overlay Logical Switch

The steps in this section are for troubleshooting where packets are being lost for an overlay switch.

For a VLAN-backed logical switch, go to Troubleshoot Packet Loss for a VLAN Logical Switch or When ARP Is Resolved.

Before performing the following troubleshooting steps, run the command arp -n on each VM. If ARP is successfully resolved on both VMs, you do not need to perform the steps in this section. Instead, go to the next section, Troubleshoot Packet Loss for a VLAN Logical Switch or When ARP Is Resolved.

Procedure

u If both endpoints are ESXi and ARP proxy is enabled on the logical switch (only supported for overlay logical switches), check the ARP table on the CCP and the hypervisor.

On the CCP:

controller1> get logical-switch 5000 arp-table

On the hypervisor, start NSX CLI and run the following command:

host1> get logical-switch <logical-switch-UUID> arp-table

Fetching the ARP table only tells us whether we have the correct ARP proxy state. If the ARP response is not received via proxy, or if the host is KVM and does not support ARP proxy, the datapath should broadcast the ARP request. There might be a problem with BUM traffic forwarding. Try the following steps:

n If the replication mode for the logical switch is MTEP, change the replication mode to SOURCE for the logical switch from the NSX Manager GUI. This might fix the issue and ping will start working.

n Add static ARP entries and see if the rest of the datapath works.

Troubleshoot Packet Loss for a VLAN Logical Switch or When ARP Is Resolved

You can use the automated traceflow tool or manually trace the packets to troubleshoot packet loss.

To run the traceflow tool, from the NSX Manager GUI, navigate to Tools > Traceflow. For more information, see the NSX-T Administration Guide.


Procedure

u To manually trace the packets,

On ESXi, run net-stats -l to get the switchport ID of the VIFs. If the source and destination VIFs are on the same hypervisor, run the following commands:

pktcap-uw --switchport <src-switch-port-ID> --dir=0

pktcap-uw --switchport <dst-switch-port-ID> --dir=1

If the source and destination VIFs are on different hypervisors, on the hypervisor hosting the source VIF, run the following commands:

pktcap-uw --switchport <src-switch-port-ID> --dir=0

pktcap-uw --uplink <uplink-name> --dir=1

On the hypervisor hosting the destination VIF, run the following commands:

pktcap-uw --uplink <uplink-name> --dir=0

pktcap-uw --switchport <dest-switch-port-ID> --dir=1
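To keep the captures for offline analysis, pktcap-uw can write pcap files with its -o option. A minimal sketch using hypothetical switchport IDs (taken from net-stats -l output on your host):

```shell
# Hypothetical switchport IDs; substitute the values from net-stats -l.
SRC_PORT="50331655"
DST_PORT="50331656"

# Compose the capture commands; run them on the ESXi host, then open the
# resulting pcap files in a packet analyzer.
SRC_CMD="pktcap-uw --switchport ${SRC_PORT} --dir=0 -o /tmp/src.pcap"
DST_CMD="pktcap-uw --switchport ${DST_PORT} --dir=1 -o /tmp/dst.pcap"
echo "${SRC_CMD}"
echo "${DST_CMD}"
```

Comparing the two files shows at which hop the packet disappears.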

On KVM, if the source and destination VIFs are on the same hypervisor, run the following command:

ovs-dpctl dump-flows


3 Troubleshooting Installation

This section provides information about troubleshooting installation issues.

Basic Infrastructure Services

The following services must be running on the appliances and hypervisors, and also on vCenter Server if it is used as a compute manager:

- NTP

- DNS

Make sure that a firewall is not blocking traffic between NSX-T components and hypervisors, and that the required ports are open between the components.
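A quick reachability check from a hypervisor can use nc (netcat). This is a sketch with hypothetical controller and manager IPs; 1235 (netcpa to the CCP) and 5671 (AMQP to the manager) are ports used elsewhere in this guide — consult the ports list for your NSX-T version for the full set:

```shell
# Hypothetical component IPs; replace with your controller and manager.
CONTROLLER="192.168.110.16"
MANAGER="192.168.110.19"

for TARGET in "${CONTROLLER}:1235" "${MANAGER}:5671"; do
  HOST="${TARGET%%:*}"
  PORT="${TARGET##*:}"
  echo "check ${HOST} port ${PORT}"
  # nc -z -w 3 "${HOST}" "${PORT}" && echo open || echo closed   # run on a live hypervisor
done
```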

To flush the DNS cache on the NSX Manager, SSH as root to the manager and run the following command:

root@nsx-mgr-01:~# /etc/init.d/resolvconf restart

[ ok ] Restarting resolvconf (via systemctl): resolvconf.service.

You can then check the DNS configuration file.

root@nsx-mgr-01:~# cat /etc/resolv.conf

# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)

# DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN

nameserver 192.168.253.1

search mgt.sg.lab
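After the cache is flushed, it is worth confirming that the names the appliance depends on actually resolve through the configured nameserver. A sketch with hypothetical FQDNs:

```shell
# Hypothetical FQDNs; replace with your manager and vCenter Server names.
NAMES="nsx-mgr-01.mgt.sg.lab vcsa-01.mgt.sg.lab"

for NAME in ${NAMES}; do
  echo "resolve ${NAME}"
  # getent hosts "${NAME}" || echo "FAILED: ${NAME}"   # uncomment on the appliance
done
```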


Checking Communication from Host to Controller and Manager

On an ESXi host, using NSX-T CLI commands:

esxi-01.corp.local> get managers

- 192.168.110.19 Connected

esxi-01.corp.local> get controllers

Controller IP Port SSL Status Is Physical Master Session State Controller FQDN

192.168.110.16 1235 enabled connected true up NA

On a KVM host using NSX-T CLI commands:

kvm-01> get managers

- 192.168.110.19 Connected

kvm-01> get controllers

Controller IP Port SSL Status Is Physical Master Session State Controller FQDN

192.168.110.16 1235 enabled connected true up NA

On an ESXi host using host CLI commands:

[root@esxi-01:~] esxcli network ip connection list | grep 1235

tcp 0 0 192.168.110.53:42271 192.168.110.16:1235 ESTABLISHED 67702 newreno netcpa

[root@esxi-01:~]

[root@esxi-01:~] esxcli network ip connection list | grep 5671

tcp 0 0 192.168.110.253:11721 192.168.110.19:5671 ESTABLISHED 2103688 newreno mpa

tcp 0 0 192.168.110.253:30977 192.168.110.19:5671 ESTABLISHED 2103688 newreno mpa

On a KVM host using host CLI commands:

root@kvm-01:/home/vmware# netstat -nap | grep 1235

tcp 0 0 192.168.110.55:53686 192.168.110.16:1235 ESTABLISHED 2554/netcpa

root@kvm-01:/home/vmware#

root@kvm-01:/home/vmware#

root@kvm-01:/home/vmware# netstat -nap | grep 5671

tcp 0 0 192.168.110.55:50108 192.168.110.19:5671 ESTABLISHED 2870/mpa

tcp 0 0 192.168.110.55:50110 192.168.110.19:5671 ESTABLISHED 2870/mpa

root@kvm-01:/home/vmware# tcpdump -i ens32 port 1235 | grep kvm-01

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode

listening on ens32, link-type EN10MB (Ethernet), capture size 262144 bytes

<truncated output>

03:46:27.040461 IP nsxcontroller01.corp.local.1235 > kvm-01.corp.local.38754: Flags [P.], seq

3315301231:3315301275, ack 2671171555, win 323, length 44

03:46:27.040509 IP kvm-01.corp.local.38754 > nsxcontroller01.corp.local.1235: Flags [.], ack 44, win

1002, length 0


^C

<truncated output>

root@kvm-01:/home/vmware#

root@kvm-01:/home/vmware# tcpdump -i ens32 port 5671 | grep kvm-01

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode

listening on ens32, link-type EN10MB (Ethernet), capture size 262144 bytes

03:51:16.802934 IP kvm-01.corp.local.58954 > nsxmgr01.corp.local.amqps: Flags [P.], seq 1153:1222, ack

1790, win 259, length 69

03:51:16.823328 IP nsxmgr01.corp.local.amqps > kvm-01.corp.local.58954: Flags [P.], seq 1790:1891, ack

1222, win 254, length 101

^C

<truncated output>

Host Registration Failure

If NSX-T uses the wrong IP address, host registration fails. This can happen when a host has multiple IP addresses. Trying to delete the transport node leaves it in the Orphaned state. To resolve the issue:

- Go to Fabric > Nodes > Hosts, edit the host, and remove all IP addresses except the management one.

- Click the errors and select Resolve.

KVM Host Issues

KVM host issues are sometimes caused by insufficient disk space. The /boot directory can fill up quickly and cause errors such as:

- Failed to install software on host

- No space left on device

You can run the command df -h to check available storage. If the /boot directory is at 100%, you can do the following:

- Run sudo dpkg --list 'linux-image*' | grep ^ii to see all the installed kernels.

- Run uname -r to see your currently running kernel. Do not remove this kernel (linux-image).

- Use apt-get purge to remove images you don't need anymore. For example, run sudo apt-get purge linux-image-3.13.0-32-generic linux-image-3.13.0-33-generic.

- Reboot the host.

- In NSX Manager, check the errors and select Resolve.

- Make sure the VMs are powered on.
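The kernel-listing steps above can be sketched as a single check that prints the candidates for removal while protecting the running kernel. This assumes a Debian/Ubuntu KVM host; always review the list before purging anything:

```shell
# List installed linux-image packages other than the running kernel.
CURRENT="$(uname -r)"
CANDIDATES="$(dpkg --list 'linux-image*' 2>/dev/null | awk '/^ii/ {print $2}' | grep -v "${CURRENT}" || true)"

echo "running kernel: ${CURRENT} (keep its linux-image package)"
echo "purge candidates: ${CANDIDATES:-none found}"
```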


Configuration Error when Deploying an Edge VM

After deploying an Edge VM, NSX Manager shows the VM's status as configuration error. The manager log has a message similar to the following:

nsx-manager NSX - FABRIC [nsx@6876 comp="nsx-manager" errorCode="MP16027" subcomp="manager"] Edge 758ad396-0754-11e8-877e-005056abf715 is not ready for configuration error occurred, error detail is NSX Edge configuration has failed. The host does not support required cpu features: ['aes'].

Restarting the edge datapath service and then the VM should resolve the issue.

Force Removing a Transport Node

You can remove a transport node that is stuck in the Orphaned state by making the following API call:

DELETE https://<NSX Manager>/api/v1/transport-nodes/<TN ID>?force=true

NSX Manager does not validate whether any active VMs are running on the host. You are responsible for deleting the N-VDS and the VIBs. If the node was added through a compute manager, delete the compute manager first and then delete the node. The transport node will be deleted as well.
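The force-delete call can be issued with curl. The manager FQDN and transport node ID below are hypothetical placeholders:

```shell
# Hypothetical manager FQDN and transport node ID; substitute your own.
NSX_MANAGER="nsx-mgr-01.corp.local"
TN_ID="11111111-2222-3333-4444-555555555555"

URL="https://${NSX_MANAGER}/api/v1/transport-nodes/${TN_ID}?force=true"
echo "DELETE ${URL}"
# curl -k -u admin -X DELETE "${URL}"   # uncomment to send the call
```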


4 Troubleshooting Routing

NSX-T has built-in tools to troubleshoot routing issues.

Traceflow

You can use Traceflow to inspect the flow of packets. You can see delivered, dropped, received, and forwarded packets. If a packet is dropped, a reason is displayed. For example, a packet can be dropped because of a firewall rule.

Checking Routing Tables

To see the routing table on a service router, run the following commands:

edge01> get logical-router

Logical Router

UUID VRF LR-ID Name Type Ports

736a80e3-23f6-5a2d-81d6-bbefb2786666 0 0 TUNNEL 3

c9393d0c-1fcf-4c34-889d-2da1eeee25b8 1 10 SR-t0-router SERVICE_ROUTER_TIER0 5

9333c94e-5938-46b4-8c7d-5e6ac2c8b7b5 2 8 DR-t1-router01 DISTRIBUTED_ROUTER_TIER1 6

c91eb7c5-0297-4fed-9c22-b96df1c9b80f 3 9 DR-t0-router DISTRIBUTED_ROUTER_TIER0 4

edge01> vrf 1

edge01(tier0_sr)> get route

Flags: c - connected, s - static, b - BGP, ns - nsx_static

nc - nsx_connected, rl - router_link, t0n: Tier0-NAT, t1n: Tier1-NAT

t1l: Tier1-LB VIP, t1s: Tier1-LB SNAT

Total number of routes: 25

b 10.10.20.0/24 [20/0] via 192.168.140.1

b 10.10.30.0/24 [20/0] via 192.168.140.1

b 10.20.20.0/24 [20/0] via 192.168.140.1

b 10.20.30.0/24 [20/0] via 192.168.140.1

b 30.0.0.0/8 [20/0] via 192.168.140.1

rl 100.64.80.0/31 [0/0] via 169.254.0.1

rl 100.64.80.2/31 [0/0] via 169.254.0.1

rl 100.64.80.4/31 [0/0] via 169.254.0.1

<TRUNCATED OUTPUT>


b 192.168.200.0/24 [20/0] via 192.168.140.1

b 192.168.210.0/24 [20/0] via 192.168.140.1

b 192.168.220.0/24 [20/0] via 192.168.140.1

b 192.168.230.0/24 [20/0] via 192.168.140.1

b 192.168.240.0/24 [20/0] via 192.168.140.1

To get the IP address of interfaces, run the following command:

edge01(tier0_sr)> get interfaces

Logical Router

UUID VRF LR-ID Name Type

c9393d0c-1fcf-4c34-889d-2da1eeee25b8 1 10 SR-t0-router SERVICE_ROUTER_TIER0

interfaces

interface : 977ac2eb-8ab7-40e9-8abe-782a438c749a

ifuid : 285

name : uplink01

mode : lif

IP/Mask : 192.168.140.3/24

MAC : 00:50:56:b5:d5:64

LS port : 14391f86-efef-4e3d-98c3-f291c17d13f8

urpf-mode : STRICT_MODE

admin : up

MTU : 1600

interface : 6af81d72-4d32-5f66-b7ae-403e617290e5

ifuid : 270

mode : blackhole

interface : 015e709d-6079-5c19-9556-8be2e956f775

ifuid : 269

mode : cpu

interface : 3f40f838-eb8a-4f35-854c-ea8bb872dc47

ifuid : 272

name : bp-sr0-port

mode : lif

IP/Mask : 169.254.0.2/28

MAC : 02:50:56:56:53:00

VNI : 25489

LS port : 770a208d-27fa-4f8d-afad-a9c41ca6295b

urpf-mode : NONE

admin : up

MTU : 1500

interface : 00003300-0000-0000-0000-00000000000a

ifuid : 263

mode : loopback

IP/Mask : 127.0.0.1/8

Advertising T1 Routes

You must advertise T1 routes so that they are visible on the T0 router and upwards. There are different types of routes that you can advertise: NSX Connected, NAT, Static, LB VIP, and LB SNAT.
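Route advertisement can also be toggled through the API. The sketch below shows the shape of the NSX-T 2.x advertisement call; the manager FQDN and router ID are hypothetical, and the field names should be verified against the API guide for your version:

```shell
# Hypothetical values: your manager FQDN and the Tier-1 logical router ID.
NSX_MANAGER="nsx-mgr-01.corp.local"
T1_ID="9333c94e-5938-46b4-8c7d-5e6ac2c8b7b5"

# One boolean per advertised route type (NSX Connected, NAT, Static,
# LB VIP, LB SNAT); values here are illustrative.
BODY='{
  "resource_type": "AdvertisementConfig",
  "enabled": true,
  "advertise_nsx_connected_routes": true,
  "advertise_nat_routes": true,
  "advertise_static_routes": false,
  "advertise_lb_vip": false,
  "advertise_lb_snat_ip": false
}'

URL="https://${NSX_MANAGER}/api/v1/logical-routers/${T1_ID}/routing/advertisement"
echo "PUT ${URL}"
# curl -k -u admin -X PUT -H 'Content-Type: application/json' -d "${BODY}" "${URL}"
```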


5 Troubleshooting the Central Control Plane

This section contains information about troubleshooting central control plane (CCP) issues.

This chapter includes the following topics:

- Deleting a Controller Node Fails

- Removing a Controller from a Cluster Fails

- Controller VM Fails to Power On

- Controller Fails to Register with NSX Manager

- Controller Fails to Come Online after Registration

- Controller Fails to Join the Cluster

- Controller Clustering Fails

Deleting a Controller Node Fails

When deleting a controller through the API (POST /api/v1/cluster/nodes/deployments/<node-id>?action=delete) or the NSX Manager UI, you get the error The deletion of cluster node VM has failed.

Cause

- The API call has invalid input, for example, an incorrect node ID.

- There is a connectivity problem between NSX Manager and vCenter Server.

Solution

- Make sure the API parameters are correct.

- Call POST /api/v1/cluster/nodes/deployments/<node-id>?action=delete with the force option to delete the controller. Then manually delete the VM from vCenter Server.

Removing a Controller from a Cluster Fails

When deleting a controller through the API (POST /api/v1/cluster/nodes/deployments/<node-id>?action=delete) or the NSX Manager UI, you get the error Failed to remove the cluster node VM from the cluster.


Cause

- The controller nodes become temporarily unavailable during the process of removing a controller from its cluster.

- The status of the controller cluster is not CONNECTED.

Solution

- From the NSX Manager UI, verify that the status of the controller cluster is healthy. If the controller to be deleted is healthy and the controller cluster is healthy, delete the controller again.

- If either the controller to be deleted or the controller cluster is not healthy, take steps to make sure that both are healthy. Then delete the controller again.

Controller VM Fails to Power On

When deploying a controller through the API (POST /api/v1/cluster/nodes/deployments) or the NSX Manager UI, you get the error The power on of the cluster node VM has failed.

Cause

- The host where the controller VM runs might not have enough memory to power on the VM.

Solution

- Log in to vCenter Server and investigate why the controller VM does not power on. If there is insufficient memory, delete the VM, free up memory on the host, and redeploy the controller VM. Alternatively, redeploy the VM on a different host.

Controller Fails to Register with NSX Manager

When deploying a controller through the API (POST /api/v1/cluster/nodes/deployments) or the NSX Manager UI, the deployment remains in the Waiting To Register state indefinitely. The log has the error message The cluster node VM failed to register itself with the MP within the allotted time.

Cause

- The controller cannot communicate with the NSX Manager.

- The controller has an internal error.

Solution

- If possible, from the NSX Manager UI, delete the controller.

- If there is no option in the NSX Manager UI to delete the controller, call the API POST /api/v1/cluster/nodes/deployments/<node-id>?action=delete to delete the controller.

- Check network connectivity between the controller and the NSX Manager, such as IP address, subnet, and firewall settings.

- Redeploy the controller.


Controller Fails to Come Online after Registration

When deploying a controller through the API (POST /api/v1/cluster/nodes/deployments) or the NSX Manager UI, you get the error The cluster node VM failed to come online after registration.

Cause

- Internal controller or NSX Manager error.

Solution

- Delete the controller and redeploy it.

Controller Fails to Join the Cluster

When deploying a controller through the API (POST /api/v1/cluster/nodes/deployments) or the NSX Manager UI, you get the error Failed to add the cluster node VM to a cluster.

Cause

- The controller cluster becomes unstable or unreachable during the clustering operation.

- The shared secret provided to the controller does not match the shared secret used by the controller cluster.

Solution

- Delete the controller.

- From the NSX Manager UI, check the cluster status of the nodes in the controller cluster.

- Check that the shared secret used by the new controller is the same as the shared secret used by the controller cluster.

- Redeploy the controller.

Controller Clustering Fails

When deploying a controller through the API (POST /api/v1/cluster/nodes/deployments) or the NSX Manager UI, you get the error Failed to add the cluster node VM to a cluster - No stable node or cluster found to join.

Cause

- The controller cluster state is unstable during the clustering operation.

Solution

- Delete the controller.

- From the NSX Manager UI, check that the cluster status of the nodes in the controller cluster is Up.


- Redeploy the controller.


6 Troubleshooting Firewall

This section provides information about troubleshooting firewall issues.

This chapter includes the following topics:

- Determining Firewall Rules that Apply on an ESXi Host

- Determining Firewall Rules that Apply on a KVM Host

- Firewall Packet Logs

- On an ESXi Host, the getrules Command Shows an Unknown MAC Address

- Stateful Edge Firewall Not Working

Determining Firewall Rules that Apply on an ESXi Host

To troubleshoot firewall issues with an ESXi host, you can look at the firewall rules that apply on the host.

Get the list of dvfilters on the ESXi host:

[root@esxi-01:~] summarize-dvfilter

<TRUNCATED OUTPUT>

world 70181 vmm0:app-01a vcUuid:'50 35 9c 70 18 8e 99 1d-3c f9 8e cc 6b 27 4c 6f'

port 50331655 app-01a.eth0

vNic slot 2

name: nic-70181-eth0-vmware-sfw.2

agentName: vmware-sfw

state: IOChain Attached

vmState: Detached

failurePolicy: failClosed

slowPathID: none

filter source: Dynamic Filter Creation

world 70179 vmm0:web-02a vcUuid:'50 35 2b f3 4a 4b 10 83-54 72 50 f7 25 10 d8 64'

port 50331656 web-02a.eth0

vNic slot 2

name: nic-70179-eth0-vmware-sfw.2

agentName: vmware-sfw

state: IOChain Attached

vmState: Detached

failurePolicy: failClosed

slowPathID: none

filter source: Dynamic Filter Creation


Find a dvfilter for a specific VM:

[root@esxi-01:~] summarize-dvfilter | less -p web

world 70179 vmm0:web-02a vcUuid:'50 35 2b f3 4a 4b 10 83-54 72 50 f7 25 10 d8 64'

port 50331656 web-02a.eth0

vNic slot 2

name: nic-70179-eth0-vmware-sfw.2

agentName: vmware-sfw

state: IOChain Attached

vmState: Detached

failurePolicy: failClosed

slowPathID: none

filter source: Dynamic Filter Creation

.

.

.

Determine the firewall rules that apply to a specific dvfilter (in this example, nic-70227-eth0-vmware-sfw.2 is the dvfilter name):

[root@esxi-02:~] vsipioctl getrules -f nic-70227-eth0-vmware-sfw.2

ruleset mainrs {

rule 3072 at 1 inout protocol tcp from any to addrset 48822ec3-2670-497b-82f9-524618c16877 port 443

accept with log;

rule 3072 at 2 inout protocol tcp from any to addrset 48822ec3-2670-497b-82f9-524618c16877 port 80

accept with log;

rule 3074 at 3 inout protocol tcp from addrset 48822ec3-2670-497b-82f9-524618c16877 to addrset

8b9e75e7-bc62-4d7f-9a58-a872f393448e port 8443 accept with log;

rule 3074 at 4 inout protocol tcp from addrset 48822ec3-2670-497b-82f9-524618c16877 to addrset

8b9e75e7-bc62-4d7f-9a58-a872f393448e port 22 accept with log;

rule 3075 at 5 inout protocol tcp from addrset 8b9e75e7-bc62-4d7f-9a58-a872f393448e to addrset

b695c8df-9894-4068-a5e7-5504fe48d459 port 3306 accept with log;

rule 3076 at 6 inout protocol tcp from ip 192.168.110.10 to addrset rdst3076 port 443 accept with log;

rule 3076 at 7 inout protocol icmp typecode 8:0 from ip 192.168.110.10 to addrset rdst3076 accept with

log;

rule 3076 at 8 inout protocol tcp from ip 192.168.110.10 to addrset rdst3076 port 22 accept with log;

rule 3076 at 9 inout protocol tcp from ip 192.168.110.10 to addrset rdst3076 port 80 accept with log;

rule 2 at 10 inout protocol any from any to any accept with log;

}

ruleset mainrs_L2 {

rule 1 at 1 inout ethertype any stateless from any to any accept;

}

Get the list of address sets used in a specific dvfilter:

[root@esxi-02:~] vsipioctl getaddrsets -f nic-70227-eth0-vmware-sfw.2

addrset 48822ec3-2670-497b-82f9-524618c16877 {

ip 172.16.10.13,

mac 52:54:00:42:4d:38,

}

addrset 8b9e75e7-bc62-4d7f-9a58-a872f393448e {


}

addrset b695c8df-9894-4068-a5e7-5504fe48d459 {

ip 172.16.30.11,

mac 52:54:00:64:0e:4f,

}

addrset rdst3076 {

ip 172.16.10.13,

ip 172.16.30.11,

mac 52:54:00:42:4d:38,

mac 52:54:00:64:0e:4f,

}

Check the flows through a specific dvfilter:

[root@esxi-02:~] vsipioctl getflows -f nic-75360-eth0-vmware-sfw.2

Count retrieved from kernel active(L3,L4)=20, active(L2)+inactive(L3,L4)=0, drop(L2,L3,L4)=0

a5d914f7a5b85fe5 Active tcp 0800 IN 3076 0 0 192.168.110.10:Unknown(51281) -> 172.16.10.11:ssh(22)

513 FINWAIT2:FINWAIT2 4304 5177 34 33

a5d914f7a5b86001 Active tcp 0800 OUT 2 0 0 172.16.10.11:http(80) -> 100.64.80.1:Unknown(60006) 457

SYNSENT:CLOSED 56 819 1 1

a5d914f7a5b86006 Active igmp 0800 IN 2 0 0 0.0.0.0 -> 224.0.0.1 36 0 1 0

a5d914f7a5b86011 Active tcp 0800 IN 3072 0 0 100.64.80.1:Unknown(60098) -> 172.16.10.11:http(80) 320

FINWAIT2:FINWAIT2 413 5411 9 6

a5d914f7a5b86012 Active tcp 0800 OUT 3074 0 0 172.16.10.11:Unknown(46001) ->

172.16.20.11:Unknown(8443) 815 FINWAIT2:FINWAIT2 7418 1230 10 9

a5d914f7a5b86013 Active udp 0800 OUT 2 0 0 172.16.10.11:Unknown(40080) -> 192.168.110.10:domain(53)

268 140 2 2

a5d914f7a5b86014 Active udp 0800 OUT 2 0 0 172.16.10.11:Unknown(59251) -> 192.168.110.10:domain(53)

268 140 2 2

a5d914f7a5b86015 Active ipv6-icmp 86dd OUT 2 0 0 fe80::250:56ff:feb5:a60e -> ff02::1:ff62:5ed4 135 0

0 72 0 1

a5d914f7a5b86016 Active ipv6-icmp 86dd OUT 2 0 0 fe80::250:56ff:feb5:a60e -> ff02::1:ff62:5ed4 135 0

0 72 0 1

a5d914f7a5b86017 Active tcp 0800 IN 3072 0 0 100.64.80.1:Unknown(60104) -> 172.16.10.11:http(80) 320

FINWAIT2:FINWAIT2 413 5451 9 7

a5d914f7a5b86018 Active tcp 0800 OUT 3074 0 0 172.16.10.11:Unknown(46002) ->

172.16.20.11:Unknown(8443) 815 TIMEWAIT:TIMEWAIT 7314 1230 8 9

a5d914f7a5b86019 Active tcp 0800 IN 3072 0 0 100.64.80.1:Unknown(60110) -> 172.16.10.11:http(80) 320

FINWAIT2:FINWAIT2 373 5451 8 7

a5d914f7a5b8601a Active tcp 0800 OUT 3074 0 0 172.16.10.11:Unknown(46003) ->

172.16.20.11:Unknown(8443) 815 FINWAIT2:FINWAIT2 7418 1230 10 9

a5d914f7a5b8601b Active tcp 0800 IN 3072 0 0 100.64.80.1:Unknown(60114) -> 172.16.10.11:http(80) 328

TIMEWAIT:TIMEWAIT 413 5451 9 7

a5d914f7a5b8601c Active tcp 0800 OUT 3074 0 0 172.16.10.11:Unknown(46004) ->

172.16.20.11:Unknown(8443) 815 TIMEWAIT:TIMEWAIT 7262 1218 7 9

a5d914f7a5b8601d Active tcp 0800 OUT 2 0 0 172.16.10.11:http(80) -> 100.64.80.1:Unknown(60060) 457

SYNSENT:CLOSED 56 819 1 1

a5d914f7a5b8601e Active tcp 0800 IN 3072 0 0 100.64.80.1:Unknown(60120) -> 172.16.10.11:http(80) 320

TIMEWAIT:TIMEWAIT 373 5411 8 6

a5d914f7a5b8601f Active tcp 0800 OUT 3074 0 0 172.16.10.11:Unknown(46005) ->

172.16.20.11:Unknown(8443) 815 FINWAIT2:FINWAIT2 7418 1230 10 9


a5d914f7a5b86020 Active tcp 0800 IN 3072 0 0 100.64.80.1:Unknown(60126) -> 172.16.10.11:http(80) 229

EST:EST 173 5371 3 5

a5d914f7a5b86021 Active tcp 0800 OUT 3074 0 0 172.16.10.11:Unknown(46006) ->

172.16.20.11:Unknown(8443) 815 FINWAIT2:FINWAIT2 7418 1230 10 9

Determining Firewall Rules that Apply on a KVM Host

To troubleshoot firewall issues with a KVM host, you can look at the firewall rules that apply on the host.

Get the list of VIFs that are subject to firewall rules on the KVM host:

# ovs-appctl -t /var/run/openvswitch/nsxa-ctl dfw/vif

Vif ID : da95fc1e-65fd-461f-814d-d92970029bf0

Port name : db-01a-eth0

Port number : 2

If the output is empty, look for connectivity issues between the node and the controllers.

Get the list of rules applied to a specific VIF (in this example, da95fc1e-65fd-461f-814d-d92970029bf0 is the VIF ID):

# ovs-appctl -t /var/run/vmware/nsx-agent/nsxa-ctl dfw/rules da95fc1e-65fd-461f-814d-d92970029bf0

Distributed firewall status: enabled

Vif ID : da95fc1e-65fd-461f-814d-d92970029bf0

ruleset d035308b-cb0d-4e7e-aae5-a428b461db46 {

rule 3072 inout protocol tcp from any to addrset 48822ec3-2670-497b-82f9-524618c16877 port 443 accept

with log;

rule 3072 inout protocol tcp from any to addrset 48822ec3-2670-497b-82f9-524618c16877 port 80 accept

with log;

rule 3074 inout protocol tcp from addrset 48822ec3-2670-497b-82f9-524618c16877 to addrset 8b9e75e7-

bc62-4d7f-9a58-a872f393448e port 8443 accept with log;

rule 3074 inout protocol tcp from addrset 48822ec3-2670-497b-82f9-524618c16877 to addrset 8b9e75e7-

bc62-4d7f-9a58-a872f393448e port 22 accept with log;

rule 3075 inout protocol tcp from addrset 8b9e75e7-bc62-4d7f-9a58-a872f393448e to addrset

b695c8df-9894-4068-a5e7-5504fe48d459 port 3306 accept with log;

}

ruleset 3027fed3-60b1-483e-aa17-c28719275704 {

rule 3076 inout protocol tcp from 192.168.110.10 to addrset b695c8df-9894-4068-a5e7-5504fe48d459 port

443 accept with log;

rule 3076 inout protocol icmp type 8 code 0 from 192.168.110.10 to addrset b695c8df-9894-4068-

a5e7-5504fe48d459 accept with log;

rule 3076 inout protocol tcp from 192.168.110.10 to addrset b695c8df-9894-4068-a5e7-5504fe48d459 port

22 accept with log;

rule 3076 inout protocol tcp from 192.168.110.10 to addrset b695c8df-9894-4068-a5e7-5504fe48d459 port

80 accept with log;

rule 3076 inout protocol tcp from 192.168.110.10 to addrset 8b9e75e7-bc62-4d7f-9a58-a872f393448e port

443 accept with log;

rule 3076 inout protocol icmp type 8 code 0 from 192.168.110.10 to addrset 8b9e75e7-bc62-4d7f-9a58-

a872f393448e accept with log;

rule 3076 inout protocol tcp from 192.168.110.10 to addrset 8b9e75e7-bc62-4d7f-9a58-a872f393448e port

22 accept with log;


rule 3076 inout protocol tcp from 192.168.110.10 to addrset 8b9e75e7-bc62-4d7f-9a58-a872f393448e port

80 accept with log;

rule 3076 inout protocol tcp from 192.168.110.10 to addrset 48822ec3-2670-497b-82f9-524618c16877 port

443 accept with log;

rule 3076 inout protocol icmp type 8 code 0 from 192.168.110.10 to addrset

48822ec3-2670-497b-82f9-524618c16877 accept with log;

rule 3076 inout protocol tcp from 192.168.110.10 to addrset 48822ec3-2670-497b-82f9-524618c16877 port

22 accept with log;

rule 3076 inout protocol tcp from 192.168.110.10 to addrset 48822ec3-2670-497b-82f9-524618c16877 port

80 accept with log;

}

ruleset 5e9bdcb3-adba-4f67-a680-5e6ed5b8f40a {

rule 2 inout protocol any from any to any accept with log;

}

ruleset ddf93011-4078-4006-b8f8-73f979d7a717 {

rule 1 inout ethertype any stateless from any to any accept;

}

Get the list of address sets used in a specific VIF:

# ovs-appctl -t /var/run/vmware/nsx-agent/nsxa-ctl dfw/addrsets da95fc1e-65fd-461f-814d-d92970029bf0

48822ec3-2670-497b-82f9-524618c16877 {

mac 52:54:00:42:4d:38,

ip 172.16.10.13,

}

8b9e75e7-bc62-4d7f-9a58-a872f393448e {

}

b695c8df-9894-4068-a5e7-5504fe48d459 {

mac 52:54:00:64:0e:4f,

ip 172.16.30.11,

}

Check connections through the Linux conntrack module. In this example, we look for flows between two specific IP addresses.

# ovs-appctl -t ovs-l3d conntrack/show | grep 192.168.110.10 | grep 172.16.10.13

ACTIVE

icmp,orig=(src=192.168.110.10,dst=172.16.10.13,id=1,type=8,code=0),reply=(src=172.16.10.13,dst=192.168.

110.10,id=1,type=0,code=0),start=2018-03-26T04:43:28.325,id=3122159040,zone=23119,status=SEEN_REPLY|

CONFIRMED,timeout=29,mark=3076,labels=0x1f

Firewall Packet Logs

If logging is enabled for firewall rules, you can look at the firewall packet logs to troubleshoot issues.


The log file is /var/log/dfwpktlogs.log for both ESXi and KVM hosts.

# tail -f /var/log/dfwpktlogs.log

2018-03-27T10:23:35.196Z INET TERM 3072 IN TCP FIN 100.64.80.1/60688->172.16.10.11/80 8/7 373/5451

2018-03-27T10:23:35.196Z INET TERM 3074 OUT TCP FIN 172.16.10.11/46108->172.16.20.11/8443 8/9 1178/7366

2018-03-27T10:23:35.196Z INET TERM 3072 IN TCP RST 100.64.80.1/60692->172.16.10.11/80 9/6 413/5411

2018-03-27T10:23:35.196Z INET TERM 3074 OUT TCP RST 172.16.10.11/46109->172.16.20.11/8443 9/7 1218/7262

2018-03-27T10:23:37.442Z 71d32787 INET match PASS 3074 IN 60 TCP 172.16.10.12/35770->172.16.20.11/8443

S

2018-03-27T10:23:38.492Z INET match PASS 2 OUT 1500 TCP 172.16.10.11/80->100.64.80.1/60660 A

2018-03-27T10:23:39.934Z INET match PASS 3072 IN 52 TCP 100.64.80.1/60720->172.16.10.11/80 S

2018-03-27T10:23:39.944Z INET match PASS 3074 OUT 60 TCP 172.16.10.11/46114->172.16.20.11/8443 S

2018-03-27T10:23:39.944Z 71d32787 INET match PASS 3074 IN 60 TCP 172.16.10.11/46114->172.16.20.11/8443

S

2018-03-27T10:23:42.449Z 71d32787 INET match PASS 3074 IN 60 TCP 172.16.10.12/35771->172.16.20.11/8443

S

2018-03-27T10:23:44.712Z INET TERM 3074 IN TCP RST 172.16.10.11/46109->172.16.20.11/8443 9/7 1218/7262

2018-03-27T10:23:44.712Z INET TERM 3074 IN TCP FIN 172.16.10.12/35766->172.16.20.11/8443 9/10 1233/7418

2018-03-27T10:23:44.712Z INET TERM 3074 IN TCP FIN 172.16.10.11/46110->172.16.20.11/8443 9/9 1230/7366

2018-03-27T10:23:44.712Z INET TERM 3074 IN TCP FIN 172.16.10.12/35767->172.16.20.11/8443 9/10 1233/7418

2018-03-27T10:23:44.939Z INET match PASS 3072 IN 52 TCP 100.64.80.1/60726->172.16.10.11/80 S

2018-03-27T10:23:44.957Z INET match PASS 3074 OUT 60 TCP 172.16.10.11/46115->172.16.20.11/8443 S

2018-03-27T10:23:44.957Z 71d32787 INET match PASS 3074 IN 60 TCP 172.16.10.11/46115->172.16.20.11/8443

S

2018-03-27T10:23:45.480Z INET TERM 2 OUT TCP TIMEOUT 172.16.10.11/80->100.64.80.1/60528 1/1 1500/56
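On a busy host this log grows quickly, so it helps to summarize hits per rule. A minimal awk sketch that counts entries by rule ID and verdict; the field positions follow the sample lines in this section and may need adjusting for other versions:

```shell
# Count firewall log entries per "rule-ID verdict" pair (PASS or TERM).
summarize() {
  awk '{
    for (i = 1; i < NF; i++)
      if ($i == "PASS" || $i == "TERM") { counts[$(i+1) " " $i]++; break }
  } END { for (k in counts) print k, counts[k] }'
}

# Sample lines copied from the excerpt above; in practice run:
#   summarize < /var/log/dfwpktlogs.log
OUT="$(printf '%s\n' \
  '2018-03-27T10:23:35.196Z INET TERM 3072 IN TCP FIN 100.64.80.1/60688->172.16.10.11/80 8/7 373/5451' \
  '2018-03-27T10:23:39.934Z INET match PASS 3072 IN 52 TCP 100.64.80.1/60720->172.16.10.11/80 S' \
  '2018-03-27T10:23:39.944Z INET match PASS 3074 OUT 60 TCP 172.16.10.11/46114->172.16.20.11/8443 S' \
  | summarize | sort)"
echo "${OUT}"
```

A rule with an unexpectedly high TERM or PASS count is a good starting point for deeper inspection with vsipioctl getflows.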

On an ESXi Host, the getrules Command Shows an Unknown MAC Address

On an ESXi host, after configuring a layer-2 firewall rule with one MAC set as source and another MAC set as destination, the getrules command on the host shows the destination MAC set having an unknown address.

Problem

After configuring a layer-2 firewall rule with one MAC set as source and another MAC set as destination, the getrules command on the host shows the destination MAC set as 01:00:00:00:00:00/01:00:00:00:00:00. For example:

[root@host1:~] vsipioctl getrules -f nic-1000052822-eth1-vmware-sfw.2

ruleset mainrs {

# generation number: 0

# realization time : 2018-07-26T12:42:28

rule 1039 at 1 inout protocol tcp from any to any port 1521 accept as oracle;

# internal # rule 1039 at 2 inout protocol tcp from any to any port 1521 accept;

rule 1039 at 3 inout protocol icmp from any to any accept;

rule 2 at 4 inout protocol any from any to any accept with log;

}

ruleset mainrs_L2 {

# generation number: 0


# realization time : 2018-07-26T12:42:28

rule 1040 at 1 inout ethertype any stateless from addrset d83a1523-0d07-4b18-8a5b-77a634540b57 to

addrset 9ad9c6ef-c7dd-4682-833d-57097b415e41 accept;

# internal # rule 1040 at 2 in ethertype any stateless from addrset

d83a1523-0d07-4b18-8a5b-77a634540b57 to addrset 9ad9c6ef-c7dd-4682-833d-57097b415e41 accept;

# internal # rule 1040 at 3 out ethertype any stateless from addrset

d83a1523-0d07-4b18-8a5b-77a634540b57 to mac 01:00:00:00:00:00/01:00:00:00:00:00 accept;

rule 1 at 4 inout ethertype any stateless from any to any accept;

}

The internal OUT rule with the address 01:00:00:00:00:00/01:00:00:00:00:00 is created by design to handle outbound broadcast packets and does not indicate a problem.

Solution

None required. The firewall rule will work as configured.

Stateful Edge Firewall Not Working

The stateful Edge firewall does not work between Tier-0 uplinks.

Problem

A loopback or hairpin is created when a Tier-0 router has multiple uplinks, and traffic ingresses on one of the uplinks and egresses on another. When this occurs, firewall rules and NAT are processed only when the packet ingresses on the original uplink. As a result, the reply returning on the second uplink does not match the original session, and the packet may be dropped.

Cause

Services are processed once during the hairpinning process, not on both interfaces. This causes the reply to be considered a separate flow, rather than part of the original flow, because the direction of the packet for both the initial packet and the reply is IN.

Solution

- If no destination NAT rules are present on the SR, add one. A destination NAT rule causes the reply to be matched against the original session rather than being treated as a new session, so the packet is not dropped. See Configure Source and Destination NAT on a Tier-0 Router in the NSX-T Data Center Administration Guide.
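A destination NAT rule can also be created through the API. The sketch below shows the shape of an NSX-T 2.x NAT rule call; the manager FQDN, router ID, and addresses are hypothetical, and the field names should be verified against the API guide for your version:

```shell
# Hypothetical values: manager FQDN and the Tier-0 logical router ID.
NSX_MANAGER="nsx-mgr-01.corp.local"
T0_ID="c9393d0c-1fcf-4c34-889d-2da1eeee25b8"

# Example DNAT rule: translate an uplink-facing address to an internal server.
BODY='{"resource_type": "NatRule", "action": "DNAT", "enabled": true, "match_destination_network": "192.168.140.100", "translated_network": "172.16.10.11"}'

URL="https://${NSX_MANAGER}/api/v1/logical-routers/${T0_ID}/nat/rules"
echo "POST ${URL}"
# curl -k -u admin -X POST -H 'Content-Type: application/json' -d "${BODY}" "${URL}"
```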


7 Other Troubleshooting Scenarios

This section describes how to troubleshoot various error scenarios.

This chapter includes the following topics:

- Failure to Add or Delete a Transport Node

- Transport Node Takes About 5 Minutes to Connect to Another Controller

- NSX Manager VM Is Degraded

- NSX Agent Times Out Communicating with NSX Manager

- Failure to Add an ESXi Host

- Incorrect NSX Controller Status

- Management IPs on KVM VMs Not Reachable with IPFIX Enabled

- Upgrade Fails Due to a Timeout

- Edge Transport Node Status Degraded if Any Interface is Down

Failure to Add or Delete a Transport Node

You cannot delete or add a transport node.

Problem

The error occurs in the following scenario:

1 An ESXi host is a fabric node and a transport node.

2 The host is removed as a transport node. However, transport node deletion fails. The state of the transport node is Orphaned.

3 The host is removed as a fabric node immediately.

4 The host is added as a fabric node again.

5 The host is added as a transport node with a new transport zone and switch. This step results in the error Failed/Partial Success.


Cause

In step 2, if you wait for a few minutes, the transport node deletion will succeed because NSX Manager will retry the deletion. When you delete the fabric node immediately, NSX Manager cannot retry because the host is removed from NSX-T Data Center. This results in incomplete cleanup of the host, with the switch configuration still present, which causes step 5 to fail.

Solution

1 From vCenter Server, delete all vmknics on the host that are connected to the NSX-T Data Center switch.

2 Get the switch name using the esxcfg-vswitch -l CLI command. For example:

esxcfg-vswitch -l

Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks
vSwitch0         1536        4           128               1500    vmnic0

  PortGroup Name        VLAN ID  Used Ports  Uplinks
  VM Network            0        0           vmnic0
  Management Network    0        1           vmnic0

Switch Name      Num Ports   Used Ports  Uplinks
nsxvswitch       1536        4

3 Delete the switch using the esxcfg-vswitch -d <switch-name> --dvswitch CLI command. For example:

esxcfg-vswitch -d nsxvswitch --dvswitch
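If several hosts need the same cleanup, steps 2 and 3 can be scripted. The following is a minimal sketch, not part of NSX-T, that extracts host-switch names from `esxcfg-vswitch -l` output and prints the matching delete commands; the `find_nsx_switches` helper and the `skip` tuple of standard vSwitch names are illustrative assumptions.

```python
import re

def find_nsx_switches(esxcfg_output, skip=("vSwitch0",)):
    """Return switch names from `esxcfg-vswitch -l` output, excluding
    the standard vSwitches listed in `skip`."""
    switches = []
    for line in esxcfg_output.splitlines():
        # Switch rows start with a name followed by numeric port counts;
        # header and PortGroup rows do not match this pattern.
        m = re.match(r"(\S+)\s+\d+\s+\d+", line)
        if m and m.group(1) not in skip:
            switches.append(m.group(1))
    return switches

sample = """\
Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks
vSwitch0         1536        4           128               1500    vmnic0

Switch Name      Num Ports   Used Ports  Uplinks
nsxvswitch       1536        4
"""

for name in find_nsx_switches(sample):
    print("esxcfg-vswitch -d %s --dvswitch" % name)
```

On a host, you would feed the helper the real command output and review the printed commands before running them.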

Transport Node Takes About 5 Minutes to Connect to Another Controller

When an ESXi transport node's connected controller goes down, it takes about 5 minutes for the transport node to connect to another controller.

Problem

An ESXi transport node is normally connected to a specific controller in a controller cluster. You can find the connected controller with the CLI command get controllers. If the connected controller goes down, it takes about 5 minutes for the transport node to connect to another controller.

Cause

The transport node attempts to reconnect to the controller that is down for a certain amount of time before giving up and connecting to another controller. The whole process takes about 5 minutes. This is expected behavior.


NSX Manager VM Is Degraded

NSX Manager deployed on a KVM host returns an error when running CLI commands such as get service and get interface.

Problem

The CLI command get service returns an error. For example,

nsx-manager-1> get service

% An error occurred while processing the service command

Other CLI commands might also return an error. The get support-bundle command indicates that the /tmp directory has become read-only. For example,

nsx-manager-1> get support-bundle file failed-to-get-service.tgz

% An error occurred while retrieving the support bundle: [Errno 30] Read-only file system:

'/tmp/tmpHzXF1u'

The /var/log/messages-<timestamp> log has a message such as the following:

Nov 17 07:26:48 no kernel: NMI watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [qemu-kvm:4386]

Cause

One or more file systems on the NSX Manager appliance were corrupted. Some possible causes are documented in https://access.redhat.com/solutions/22621.

To resolve the issue, you can repair the corrupt file systems or perform a restore from a backup.

Solution

1 Option 1: Repair the corrupt file systems. The following steps are specifically for NSX Manager running on a KVM host.

a Run the virsh destroy command to stop the NSX Manager VM.

b Run the virt-rescue command in write mode on the qcow2 image. For example,

virt-rescue --rw -a nsx-unified-appliance-2.0.0.0.0.6522097.phadniss-p0-DK-to-DGo-on-rhel-prod_nsx_manager_1.qcow2

c In the virt-rescue command prompt, run the e2fsck command to fix the tmp file system. For example,

<rescue> e2fsck /dev/nsx/tmp

d If necessary, run e2fsck /dev/nsx/tmp again until there are no more errors.

e Restart NSX Manager with the virsh start command.


2 Option 2: Perform a restore from a backup.

For instructions, see the NSX-T Administration Guide.

NSX Agent Times Out Communicating with NSX Manager

In a large-scale environment with many transport nodes and VMs, the NSX agents that run on ESXi hosts might time out when communicating with NSX Manager.

Problem

Some operations, such as when a VM vnic tries to attach to a logical switch, fail. The /var/run/log/nsx-opsagent.log has messages such as:

level="ERROR" errorCode="MPA41542"] [MP_AddVnicAttachment] RPC call [0e316296-13-14] to NSX management

plane timout

2017-05-15T05:32:13Z nsxa: [nsx@6876 comp="nsx-esx" subcomp="NSXA[VifHandlerThread:-2282640]"

tid="1000017079" level="ERROR" errorCode="MPA42003"] [DoMpVifAttachRpc] MP_AddVnicAttachment() failed:

RPC call to NSX management plane timout

Cause

In a large-scale environment, some operations might take longer than usual and fail because the default timeout values are exceeded.

Solution

1 Increase the NSX agent timeout value.

a On the ESXi host, stop the NSX opsAgent with the following command:

/etc/init.d/nsx-opsagent stop

b Edit the file /etc/vmware/nsx-opsagent/nsxa.json and change the vifOperationTimeout value from 25 to, for example, 55.

"mp" : {

/* timeout for VIF operation */

"vifOperationTimeout" : 25,

Note This timeout value must be less than the hostd timeout value that you set in step 2.

c Start the NSX opsAgent with the following command:

/etc/init.d/nsx-opsagent start
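Because the nsxa.json fragment above contains /* ... */ comments, a strict JSON parser would reject the file; a plain text substitution is safer when scripting step 1b. The `bump_vif_timeout` helper below is a minimal sketch of that idea, not an NSX-T tool.

```python
import re

def bump_vif_timeout(nsxa_text, new_timeout=55):
    """Rewrite the vifOperationTimeout value in nsxa.json text.
    A textual substitution is used because the file contains
    /* ... */ comments that json.load() would reject."""
    return re.sub(r'("vifOperationTimeout"\s*:\s*)\d+',
                  lambda m: m.group(1) + str(new_timeout),
                  nsxa_text)

fragment = '''"mp" : {
    /* timeout for VIF operation */
    "vifOperationTimeout" : 25,'''

print(bump_vif_timeout(fragment))
```

In practice you would read /etc/vmware/nsx-opsagent/nsxa.json, pass its contents through the helper, and write the result back while the opsAgent is stopped.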


2 Increase the hostd timeout value.

a On the ESXi host, stop the hostd agent with the following command:

/etc/init.d/hostd stop

b Edit the file /etc/vmware/hostd/config.xml. Under <opaqueNetwork>, uncomment the entry for <taskTimeout> and change the value from 30 to, for example, 60.

<opaqueNetwork>

<!-- maximum message size allowed in opaque network manager IPC, in bytes. -->

<!-- <maxMsgSize> 65536 </maxMsgSize> -->

<!-- maximum wait time for opaque network response -->

<!-- <taskTimeout> 30 </taskTimeout> -->

c Start the hostd agent with the following command:

/etc/init.d/hostd start
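Step 2b can likewise be scripted. The `activate_task_timeout` helper below is an illustrative sketch that uncomments the <taskTimeout> entry shown above and sets its value; it works on the XML as text so the surrounding comments and layout stay untouched.

```python
import re

def activate_task_timeout(config_xml, new_timeout=60):
    """Uncomment the <taskTimeout> entry under <opaqueNetwork> and set
    its value; if the entry is already active, just update the value."""
    active = "<taskTimeout> %d </taskTimeout>" % new_timeout
    # First turn a commented-out entry into an active one ...
    config_xml = re.sub(
        r"<!--\s*<taskTimeout>\s*\d+\s*</taskTimeout>\s*-->",
        active, config_xml)
    # ... then update any entry that was already active.
    return re.sub(r"<taskTimeout>\s*\d+\s*</taskTimeout>",
                  active, config_xml)

fragment = ("<opaqueNetwork>\n"
            "  <!-- <taskTimeout> 30 </taskTimeout> -->\n"
            "</opaqueNetwork>")
print(activate_task_timeout(fragment))
```

Run it against the contents of /etc/vmware/hostd/config.xml while hostd is stopped, and keep the hostd timeout larger than the NSX agent timeout from step 1.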

Failure to Add an ESXi Host

You cannot add an ESXi host to the NSX-T Data Center fabric.

Problem

From the NSX Manager GUI, adding an ESXi host fails with the error "File path of ... is claimed by multiple non-overlay VIBs". The log file shows messages such as the following:

Failed to install software on host. Failed to install software on host. 10.172.120.60 :

java.rmi.RemoteException: [DependencyError] File path of '/usr/lib/vmware/vmkmod/nsx-vsip' is claimed

by multiple non-overlay VIBs

Cause

Some VIBs from a previous install are still on the host, probably because a clean uninstall did not occur.

Solution

1 From the error message, get the names of VIBs that are causing the failure.

2 Use ESXi commands to uninstall the VIBs.
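The two steps above can be sketched as follows. The `claimed_paths` helper is a hypothetical illustration that pulls the conflicting file path out of the DependencyError text; the esxcli VIB commands in the trailing comments are the standard ESXi commands for listing and removing VIBs, shown for reference rather than run here.

```python
import re

def claimed_paths(error_text):
    """Extract the file paths named in a DependencyError message."""
    return re.findall(r"File path of '([^']+)' is claimed", error_text)

error = ("[DependencyError] File path of '/usr/lib/vmware/vmkmod/nsx-vsip' "
         "is claimed by multiple non-overlay VIBs")

for path in claimed_paths(error):
    print("conflicting path:", path)

# On the host, list the leftover VIBs and remove them by name:
#   esxcli software vib list | grep nsx
#   esxcli software vib remove -n <vib-name>
```

The extracted path tells you which driver module is contested; the VIB that still owns it is the one to remove before retrying the host add.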

Incorrect NSX Controller Status

Some controllers in an NSX Controller cluster report incorrect status for one of the controllers.

Problem

After a controller is powered off and on a number of times, the other controllers report that it is inactive when it is up and running.


Cause

An internal error involving the ZooKeeper module sometimes occurs when a controller is powered off and on, and causes a communication failure between this controller and the other controllers in the cluster.

Solution

- Remove the controller node that is reported to be inactive from the cluster, remove the cluster configuration from the node, and rejoin the node to the cluster. For more information, see the section "Replace a Member of the NSX Controller Cluster" in the NSX-T Administration Guide.

Management IPs on KVM VMs Not Reachable with IPFIX Enabled

When IPFIX is enabled on multiple VMs on a KVM host and the sampling rate is 100%, the management IPs on some of the VMs might intermittently be unreachable.

Problem

When you enable IPFIX for multiple VMs on the same host and set the sampling rate to 100%, there can be a large amount of IPFIX traffic. This can impact management traffic, causing the management IPs to be intermittently unreachable, even if the production traffic and management traffic go through different OVSes.

Cause

The IPFIX workload places excessive load on the host and the VMs.

Solution

- Reduce the load on the host by reducing the number of VMs with IPFIX enabled or by reducing the sampling rate.

Upgrade Fails Due to a Timeout

An event during the upgrade process fails and the message from the Upgrade Coordinator indicates a timeout error.

Problem

During the upgrade process, the following events might fail because they do not complete within a specific period of time. The Upgrade Coordinator reports a timeout error for the event and the upgrade fails.

Event                                                  Timeout Value
Putting a host into maintenance mode                   4 hours
Waiting for a host to reboot                           32 minutes
Waiting for the NSX service to be running on a host    13 minutes


Solution

- For the maintenance mode issue, log in to vCenter Server and check the status of tasks related to the host. Take actions to resolve any issues.

- For the host reboot issue, check the host to see why it failed to reboot.

- For the NSX service issue, log in to the NSX Manager UI, go to the Fabric > Nodes page, and see if the host has an installation error. If so, you can resolve it from the NSX Manager UI. If the error cannot be resolved, refer to the upgrade logs to determine the cause of the failure.

Edge Transport Node Status Degraded if Any Interface is Down

All interfaces are counted in an Edge transport node's status and should be connected and Up.

Problem

An NSX Edge transport node has a status of degraded, but the data plane is functioning normally. The Edge transport node's degraded state also triggers the corresponding transport zone status to be degraded. The Edge transport node has one or more network interfaces in a Down state.

Cause

The NSX-T Data Center management plane declares an Edge transport node to be in a degraded state if any interface is down, regardless of whether that interface is used or configured. If the Edge is a virtual machine, the vNIC may be disconnected. If you have a bare metal Edge, the NIC port may not be connected, or may have a link state of down.

If the interface that is Down is not used, then there is no functional impact to the Edge.

Solution

- To avoid a degraded state, ensure all Edge interfaces are connected and Up, regardless of whether they are in use.
