Capacity Management in Cisco HyperFlex
White paper
Cisco public
Contents
Cisco HyperFlex HX Data Platform overview
Cisco HyperFlex HX Data Platform all-NVMe storage
Cisco HyperFlex architecture
HyperFlex replication factor
Data striping across nodes and drives in HyperFlex
Cisco Intersight
HyperFlex (HX) space views
Virtual machine space view
HX Connect space views
Additional space consumption
vCenter space views
Understanding HX cleaner and its process
Space accounting for delayed fashion writes and deletes
Best practices
Space accounting for snapshots
VMware vStorage API for Array Integration (VAAI)
Different cases of failure modes
Space usage’s effect on performance
Handling “Out of Space” errors
Cluster expansion
Summary
HyperFlex storage capacity forecasting and trending
Cisco HyperFlex HX Data Platform overview
Cisco HyperFlex™ HX Data Platform (HX Data Platform) is a hyperconverged software appliance that transforms
Cisco® servers into a single pool of compute and storage resources. It eliminates the need for network storage
and enables seamless interoperability between computing and storage in virtual environments. The HX Data
Platform provides a highly fault-tolerant distributed storage system that preserves data integrity and optimizes
performance for Virtual Machine (VM) storage workloads. In addition, native compression and deduplication
reduce storage space occupied by the VMs and VM workloads.
Cisco HyperFlex HX Data Platform all-NVMe storage
Cisco HyperFlex systems are designed with an end-to-end software-defined infrastructure that eliminates the
compromises found in first-generation products. With all-Non-Volatile Memory Express (NVMe) storage configurations and a choice of management tools, Cisco HyperFlex systems deliver a tightly integrated
cluster that is up and running in less than an hour and that scales resources independently to closely match
your application requirements. An all-NVMe storage solution delivers more of what you need to propel mission-
critical workloads. For a simulated Oracle OLTP workload, it provides 71 percent more I/O Operations Per
Second (IOPS) and 37 percent lower latency than our previous-generation all-flash node. A holistic system
approach is used to integrate HX Data Platform software with Cisco HyperFlex HX220c M5 all NVMe Nodes.
The result is the first fully engineered hyperconverged appliance based on NVMe storage. All-NVMe solutions
support most latency-sensitive applications with the simplicity of hyperconvergence.
Cisco HyperFlex architecture
In Cisco HyperFlex Systems, the data platform spans three or more Cisco HyperFlex HX-Series nodes to create a highly available cluster (a two-node edge cluster is the exception). Each node includes a Cisco HyperFlex HX Data Platform controller that implements the distributed file system, using internal flash-based SSD drives for all-flash clusters, internal NVMe SSD drives for all-NVMe clusters, and high-capacity HDDs for hybrid clusters to store data. The controllers communicate with each other over 10 Gigabit Ethernet to present a single pool of storage that spans the nodes in the cluster (see Figure 1). Nodes access data through a data layer using file, block, object, and API plug-ins. As nodes are added, the cluster scales linearly to deliver computing, storage capacity, and I/O performance. This hyperconverged solution integrates servers, storage systems, network resources, and storage software to provide an enterprise-scale environment for application deployment. This highly integrated environment provides reliability, high availability, scalability, and performance to handle large-scale transactional workloads.
Note: Although the testing described here was performed using Cisco UCS 6200 Series Fabric
Interconnects, the Cisco HyperFlex HX Data Platform does include support for Cisco UCS 6300 Series
Fabric Interconnects, which provide higher performance with 40 Gigabit Ethernet.
Figure 1.
Overview of a Cisco HyperFlex architecture
HyperFlex replication factor
The Replication Factor (RF) is the number of copies of each storage block that the HyperFlex system writes. There are two options: an RF of 2 or an RF of 3. The default setting during installation of the HyperFlex system is 3. Blocks are placed by an algorithm that ensures no two copies of a block are written to the same node or the same disk, which eliminates a single point of failure.
Data striping across nodes and drives in HyperFlex
The data platform implements a log-structured file system that uses a caching layer to accelerate read requests
and write responses, and it implements a capacity layer for persistent storage. The capacity layer consists of HDD drives (in hybrid nodes), SSD drives (in all-flash nodes), or NVMe storage (in all-NVMe nodes).
Incoming data is striped across the number of nodes that you define to meet your data availability requirements.
The log-structured file system assembles blocks to be written to a configurable cache until the buffer is full or
workload conditions dictate that it be de-staged to the capacity layer. When existing data is (logically)
overwritten, the log-structured approach simply appends a new block and updates the metadata. When data is
de-staged, the write operation consists of a single seek operation with a large amount of data written. This
approach improves performance significantly compared to the traditional read-modify-write model, which is
characterized by numerous seek operations and small amounts of data written at a time.
Write operations are compressed as they are written to the Write Log (WL), and the data is deduplicated when it is de-staged to the capacity layer in each node. This process occurs after the write operation is acknowledged, so there is no performance penalty for these operations. A small deduplication block size helps increase the deduplication rate. Compression further reduces the data footprint. Data is then moved to the capacity tier as cache segments become free. Read operations in hybrid nodes cache data on the SSD drives and in main memory for high performance. In all-flash and all-NVMe nodes, read operations are served directly from storage. Having the most frequently used data stored in the caching layer helps make Cisco HyperFlex Systems perform well for virtualized applications. When virtual machines modify data, the original block is likely read from the cache, so there is often no need to read and then expand the data on a spinning disk. The data platform decouples the caching tier from the capacity tier and allows independent scaling of I/O performance and storage capacity.
Cisco Intersight
Cisco Intersight™ is a software management platform delivered as a service with embedded analytics for your
Cisco and third-party IT infrastructure. This platform offers an intelligent level of management that enables IT
organizations to analyze, simplify, and automate their environments in more advanced ways than with prior
generations of tools. Cisco Intersight provides an integrated and intuitive management experience for resources
in the traditional data center and at the edge. With flexible deployment options to address complex security
needs, getting started with Intersight is quick and easy. Cisco Intersight has deep integration with Cisco UCS®
and HyperFlex systems, allowing for remote deployment, configuration, and ongoing maintenance. The model-
based deployment works for a single system in a remote location or with hundreds of systems in a data center,
and enables rapid, standardized configuration and deployment. It also streamlines maintaining those systems,
whether you are working with small or large configurations.
Cisco Intersight offers flexible deployment either as Software as a Service (SaaS) on Intersight.com or running
on your premises as Cisco Intersight Virtual Appliance. The virtual appliance provides the benefits of Cisco
Intersight, while allowing more flexibility for those with additional data locality and security requirements. To
learn more about Cisco Intersight features and supported systems, see the Cisco Intersight Getting Started
Guide.
Different views of space
HyperFlex (HX) space views
● Raw - This is the actual total physical capacity available for use. Per node, this equates to multiplying the
number of capacity drives by the drive’s capacity.
● Usable - This is the total available space after deducting RF and metadata overhead from the raw
capacity. Typically, the metadata requirement equates to 7% of capacity or 0.93 as a multiplier.
● Effective - This is essentially the Usable storage adjusted by the combined storage optimization gain from compression and deduplication.
● Storage Optimization Gain – This is best explained using an example. If the anticipated gain from compression is 30%, that equates to 0.3 (30/100), and if the anticipated gain from deduplication is 10%, that equates to 0.1 (10/100). To calculate the overall gain, we use the following calculation: (1 - compression gain) * (1 - deduplication gain) = (1 - 0.3) * (1 - 0.1) = 0.63.
● Storage Operating Threshold - This is the safe contingency or overhead of storage capacity needed to
ensure the system will run optimally during normal operating conditions.
◦ The threshold is typically set at 70 percent storage utilization, or 0.7 as a multiplier.
● The overall calculation to determine the effective storage per HX node is therefore:
◦ Raw (#Disks * Capacity of Disks) / Replication Factor (2 or 3) * 0.93 (Metadata) * 0.7 (Storage
Operating Threshold) / Storage Optimization Gain (0.63 in above example).
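As a worked sketch of this formula (the values are illustrative assumptions: eight 960 GB capacity drives per node, RF3, and the 0.63 optimization gain from the example above), the arithmetic can be checked from any shell:
# Effective capacity per node = raw / RF * 0.93 (metadata) * 0.7 (operating threshold) / 0.63 (optimization gain)
awk 'BEGIN { raw = 8 * 960;        # assumed raw capacity per node, in GB
             printf "effective capacity ~= %.0f GB per node\n", raw / 3 * 0.93 * 0.7 / 0.63 }'
# prints: effective capacity ~= 2645 GB per node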
Virtual machine space view
● Provisioned - This is the actual logical size of the virtual machine, which is thin-provisioned.
● Used - This is the actual physical space used by the virtual machine (application level) before applying
any kind of savings such as deduplication and compression.
HX Connect space views
The HX Connect user interface provides a view of the HX storage cluster status, components, and features,
such as encryption and replication. In the HX Connect UI, there are multiple storage views, with summaries
about the space utilization in the HX cluster. The concept of capacity is applied to both datastores and storage
clusters. As storage subsystems require base-2 math, physical raw capacity values are measured in base-2
(GiB/TiB).
Note: Usually one source of confusion is that when the drive vendors say 1 TB, what they mean is one
terabyte is equal to 10^12, or 1,000,000,000,000 bytes but the storage subsystems require base-2
math. If you convert TB to TiB, there will be about a 10-percent difference between the size of a
tebibyte and a terabyte. This is significant when talking about storage capacity because what storage
subsystems are looking for is 1 TiB, meaning one tebibyte is equal to 2^40, or 1,099,511,627,776 bytes.
A tebibyte equals nearly 1.1 TB.
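As a quick check of this conversion, the following one-liner (a sketch using awk for the arithmetic) shows the roughly 9 to 10 percent gap between the two units:
# 1 TB = 10^12 bytes; 1 TiB = 2^40 bytes
awk 'BEGIN { printf "1 TB = %.3f TiB, 1 TiB = %.3f TB\n", 1e12 / 2^40, 2^40 / 1e12 }'
# prints: 1 TB = 0.909 TiB, 1 TiB = 1.100 TB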
1. HX Connect dashboard overview
The HX Connect dashboard page displays a status summary of your HX storage cluster. This is the first page that you see when you log in to Cisco HyperFlex Connect. The display units are labelled as TB, but they are really measured as TiB to match reporting in vCenter. The following view shows the HX Connect dashboard.
HX Connect dashboard view:
The Capacity section in the HX Connect dashboard page displays a breakdown of the total storage versus how
much storage is used or free. It also displays the storage optimization, compression savings, and deduplication
percentages based on the data stored in the cluster. Capacity, used space and storage optimization are defined
as:
● Capacity (usable space) - The capacity value displayed in the dashboard view isn’t the raw physical capacity. The Capacity section in the HX Connect dashboard page provides the overall usable space that is available after all the deductibles, which include the unit conversion overhead (9%), metadata overhead (8%), and resiliency (the space consumed for data protection, which depends on the replication factor that is set). This is (raw - (conversion overhead) - (RF overhead) - (metadata overhead)).
● Used space - This is the total actual physical space used *after* deduplication and compression. An
important factor to be noted is that this counter includes additional metadata and “garbage” data that is
cleaned in a delayed fashion. Also, the capacity counter is updated in a delayed fashion as the data gets
flushed from the write log to the persistent tier. In order to better understand how this delayed fashion
works, additional details about write log, persistent tier, and how the data is being flushed are explained
in later sections of this guide.
● Storage optimization – This includes overall space efficiency, which is calculated as (1 - (1-%dedupe
savings) * (1 - %Compression savings)).
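The following sketch ties these dashboard values together; the node count, drive sizes, and savings percentages are illustrative assumptions, not values from a real cluster:
# Usable: raw converted to TiB (the ~9% unit conversion), reduced by ~8% metadata, divided by RF3
# Storage optimization: 1 - (1 - %dedupe savings) * (1 - %compression savings)
awk 'BEGIN { raw_bytes  = 4 * 8 * 960e9;     # assumed: 4 nodes x 8 x 960 GB drives
             usable_tib = raw_bytes / 2^40 * 0.92 / 3;
             opt        = 1 - (1 - 0.10) * (1 - 0.30);
             printf "usable ~= %.1f TiB, storage optimization ~= %.0f%%\n", usable_tib, opt * 100 }'
# prints: usable ~= 8.6 TiB, storage optimization ~= 37%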
2. HX Connect datastore overview
HX Connect datastore view:
● Size – This is the logical size of the datastore, which is thin-provisioned.
● Used – This is the used space shown in the datastore view above and is the actual physical space used *after* applying inline compression and deduplication. The capacity counter is updated in a delayed
fashion as the data gets flushed from the write log to the persistent tier.
Note: As the provisioned/used/free capacity of each datastore is based on datastore (thin) provisioned
capacity, the provisioned size of a datastore (specified by the administrator while creating the datastore)
can be higher than the actual storage size.
3. HX Connect virtual machine overview
HX Connect virtual machine view:
● Storage provisioned – This is the logical size of the virtual machine (shown in the screen capture
above), which is thin-provisioned. However, the virtual machine can also be thick-provisioned.
● Storage used - The storage used displayed in the virtual machine view image above is the actual
physical space used by the virtual machine *after* inline compression. The capacity counter is updated in
a delayed fashion as the data gets flushed from the write log to the persistent tier.
Note: In the case of a VM being thick-provisioned, the storage used would be displayed as the same
amount as the storage provisioned. In the image above, the storage used will be 50 GB if the VM is thick-
provisioned.
Additional space consumption
The “used” space shown in the HX Connect dashboard (Capacity section) page includes additional metadata
and *garbage*. Due to this extra space consumption, the used space displayed in the HX Connect datastore
and virtual machine view doesn’t match with what the overall cluster capacity view shows.
● Datastore view ∑used space = virtual machine view ∑used space
● Cluster view ∑used space (includes garbage) ≠ datastore view ∑used space and virtual machine
view ∑used space
A Log-Structured File System (LFS) gives a completely different approach to managing a file system. The
central idea behind LFS is that blocks are never modified; whenever an operation conceptually modifies a file,
the operation places a new block at the end of the log and the old version of the data block becomes “garbage”
and then the file system needs to go back and perform garbage cleaning. Writes always go to the end of the log. The reason for choosing LFS over write-in-place file systems, even though it generates additional garbage, is that LFS makes snapshots and backups inexpensive. A major disadvantage of write-in-place file systems is that performance degrades dramatically under their read-modify-write pattern of many seeks and small writes. Therefore, the cluster’s used space in the dashboard capacity view includes garbage generated by LFS.
vCenter space views
1. Virtual machine overview
vCenter virtual machine view:
● Provisioned space – This is the logical size of the virtual machine, which is thin-provisioned.
● Used space – This is the used space shown in the virtual machine view directly above, and is the actual
physical space used *after* inline compression.
Note: If the virtual machine is thick-provisioned, the values displayed in the VMware vCenter and HX Connect views differ from the thin-provisioned case. From the screen capture above, if the VM is thick-provisioned, the used space will be displayed as 50 GB (the same as the provisioned space). The capacity-related values displayed in the HX Connect view match what the vCenter view displays.
2. Datastore overview
VMware vCenter datastore view:
● Capacity – This is the logical size of the datastore, which is, by default, thin-provisioned.
● Used - This is the actual physical space used after inline deduplication and compression.
● Free – This is the difference between the capacity (thin-provisioned capacity) and used space of the
datastore.
Understanding HX cleaner and its process
The cleaner typically runs in the background continuously. The cleaner goes into sleep mode when it is not
needed and wakes when policy-defined conditions are met. These conditions include:
● Number of flushes since the last cleaner run
● Amount of data written since the last cleaner run
● Amount of data deleted since the last cleaner run
● Storage cluster space usage (see capacity status)
Priority levels:
● Normal priority - The cleaner generates minimum I/O
● High priority - The cleaner I/O is increased. This kicks in when cluster capacity utilization nears the 76 percent threshold
If your storage cluster is experiencing an ENOSPC condition, the cleaner automatically runs at high priority. Priority is determined by cluster space usage. If the storage cluster reaches an ENOSPC WARN condition, the cleaner increases its intensity by increasing the number of I/O operations to collect garbage. With an ENOSPC SET condition, it runs at the highest priority.
Note: Deleting data alone does not trigger the cleaner to run and recover space. The policy conditions and settings described above govern when the cleaner triggers. The cleaner runs aggressively once cluster utilization is above 76 percent. Because the platform runs on a log-structured file system, datastore-level and cluster-level utilization can differ until dead data is cleaned up; the cleaner reclaims that space based on the policy thresholds. If you delete a substantial portion of a large object, the cleaner kicks off and starts proactively cleaning data, but the effect is delayed.
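To check what the cleaner is doing, you can query it from the storage controller VM. The command below is a sketch based on the stcli cleaner subcommand available in recent HX Data Platform releases; the exact syntax can vary by release, so verify it against the CLI reference:
# Run on the storage controller VM (stCtlVM); reports the cleaner's current state and statistics
stcli cleaner info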
Space accounting for delayed fashion writes and deletes
The space views shown in the HX Connect UI are from the persistent-tier storage. From the HX standpoint, accounting follows a delayed model: there is a write log (which is substantial in all-flash nodes, about 60 GB per node) and a persistent tier. Whenever there is a write or delete operation, it first sits in the write log and doesn’t enter the persistent tier until the write log space is fully occupied. Therefore, you will not see the used space go down for a long period of time after a deletion, and similarly, the used space will not grow immediately when data is written to HX storage.
Example: In a 4-node all-flash HX cluster, the total write-log size would be 240 GB. Consider a virtual machine with provisioned space of 1 TB where the actual used space is only 200 GB (meaning 200 GB of data is written to the write log). The HX counters are not updated until the write log exceeds its maximum capacity, so the data sits in the write log rather than the persistent tier. Similarly, for deletion, the used space is not going to update until data is actually deleted from persistent storage. Also, for deletion, the cleaner will not kick off if the deleted amount is tiny, if there are not a lot of writes happening in the system, and if the space utilization is low enough.
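A small sketch of the arithmetic in this example (the per-node write-log size of roughly 60 GB is the figure quoted above):
# 4-node all-flash cluster, ~60 GB write log per node
nodes=4; wl_per_node=60; written=200
echo "data written (${written} GB) vs. total write log ($((nodes * wl_per_node)) GB)"
# 200 GB < 240 GB, so the persistent-tier capacity counters have not been updated yet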
Note: Usually, in a steady state, the flushes will happen periodically and don’t create a problem. Issues
are mostly observed during POCs, where very small amounts of data are written.
Best practices
VM periodic cleanup of space
As a best practice, weekly guest-level cleanup is suggested by setting up a CRON job (a sketch follows at the end of this subsection). Why? The guest VM runs an ext4 or NTFS file system, and the VM Disk (VMDK) presented from the HX Data Platform appears to the VM as an ordinary device/disk. When files are deleted at the application level, the used space from the virtual machine’s point of view drops to “0”, but the underlying drive is never told which blocks are now free. Because of this, no call ever reaches HX to indicate that the deleted space has been freed. So, from the HX standpoint the usage still shows the same value as before, while from the application standpoint the used space is “0”. This is a well-known problem in the virtualization space. Fortunately, there is a way to overcome this issue.
Because HX performs inline deduplication and compression, you can use the dd command with /dev/zero inside the guest file system to write zeros into a very large file. This forces the guest file system (ext4 or NTFS) to allocate the free blocks that only it knows about and fill them with zeros. When the application writes those zeros, the writes land on the formerly used blocks and are sent down to HX. Since the data is all zeros, it deduplicates and compresses down to almost nothing, so the cluster’s actual physical used space (the used space after deduplication and compression savings) drops, even though the effective (logical) used space does not change. The net result has the same effect as deleting the space on your system.
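A minimal sketch of such a scheduled weekly cleanup on a Linux guest follows; the schedule, file name, and zero-fill size are illustrative assumptions, and the zero file should be sized below the free space actually available so the guest file system does not fill up:
# Hypothetical /etc/cron.d entry: every Sunday at 02:00, write a ~10 GB file of zeros into free space, then delete it
0 2 * * 0  root  dd if=/dev/zero of=/zerofill bs=1M count=10240 && rm -f /zerofill
Windows guests can achieve the same effect by scheduling the SDelete procedure described in the next subsection.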
Reclaim space within a VM on a HyperFlex cluster
HyperFlex does not have a mechanism for automatically reclaiming space that is freed up within a thinly provisioned VM. Thin provisioning allows for on-demand growth of a virtual disk. This allows for more efficient use of space from the storage system compared to thick provisioning. When files on a thinly provisioned virtual drive are deleted, the
disk size is not automatically reduced. The reason for this is that the operating system only deletes the indexes
from the file system table. This results in the appearance of the quick deletion of files but no reclamation of
space. Reclaiming disk space only works when the blocks on your virtual disk are empty. Deleting data usually
only removes the entries from the file allocation table but it does not zero out the blocks. As a result, ESX will
still think the blocks are in use.
● Reclaim space within a Windows VM on a HyperFlex cluster
The following resource provides some options for reclaiming space on thinly provisioned virtual drives. Use the tools outlined in this section to zero the blocks of data you want deleted. Before performing any operation that will zero out blocks, ensure you create backup files. Also, make sure there are no existing snapshots on the virtual machine on which you perform the steps below. This includes SENTINEL snapshots. Removal of the SENTINEL snapshot will stun the VM for a prolonged period.
Reclaiming Space on Thinly Provisioned Disks on Windows VMs.
SDelete, a free utility from Microsoft's Sysinternals Suite, will be used for reclaiming space.
Download link: https://docs.microsoft.com/en-us/sysinternals/downloads/sdelete
VM VMDK size before running SDelete:
1. Download the file to the target VM.
2. Once you have downloaded SDelete to the target VM, unzip the file.
3. Open a command prompt as an administrator.
4. Navigate to the directory where you uncompressed the SDelete file.
5. Run the following command: sdelete -z <drive letter>
6. Select “agree”.
VM VMDK size after the SDelete operation:
Reclaim space within a Linux VM on a HyperFlex cluster:
You can use the dd utility to zero out the areas where the deleted data resides.
1. Use df -h to check the free space on the Linux OS.
2. Use the following command to zero out the empty space:
dd bs=1M count=90112 if=/dev/zero of=zero
3. In the preceding command, we are zeroing out about 90 GB of space. After this completes, the VMDK file will grow.
Size of VMDK in vCenter:
To reclaim the space, enter the following command:
$ sudo rm zero
[sudo] password for user: <--- Enter the password for the user
VM VMDK size in vCenter after deletion of the zero file:
Space utilization thresholds
A best practice is to keep storage cluster space utilization, as shown in the HX Connect capacity view, below 76 percent. Usage beyond 76 percent in the HX Connect capacity view results in:
● Performance degradation
● Poor resiliency status. Resiliency status is the ability of the storage clusters to tolerate node failures
within the storage cluster. It describes how well the storage cluster can handle disruptions.
If you experience any drive/node failures (cluster resiliency degrades), maintaining the 70-percent threshold will
help the cluster resiliency come back to a normal state by recreating the copies (stored on the failed drive) onto
the other drives. It can only do this refill/recreation if there is free space left in the cluster. Maintaining capacity
usage at 70 percent provides better performance, availability, rebalancing benefits, and future headroom for
growth.
Note: The automatic healing time for any drive failure is one minute and the wait time for any node failure
is two hours.
Cluster resiliency views at different cluster space utilizations
● Normal space usage
If the cluster space utilization is less than 76 percent at the HX Connect capacity view, then the storage
cluster space usage is normal. The screen capture below provides a snip when cluster storage utilization
is normal.
Capacity view of the resiliency normal health status at 76 percent in HX Connect:
● Warn state
If the cluster space utilization is above 76 percent at the HX Connect capacity view, then the storage
cluster space utilization is above the warn threshold. The screen capture below shows what you can
expect to see when the resiliency health status is at 80 percent in the HX Connect capacity view.
Capacity view of the resiliency normal health status at 80 percent in HX Connect:
HX Connect Alarms view when storage capacity is above the warn state (higher than 76%):
Intersight Alarms view when storage capacity is above the warn state (higher than 76%):
vCenter Alarms view when storage capacity is above the warn state (higher than 76%):
● Read-only state
A cluster above 99 percent of capacity usage will revert to a read-only operational state. In this state,
the cluster doesn’t accept any further writes until you free up some space. Being in this phase leads to
poor resiliency. If there are any disk/node failures, there is no additional space to recreate the copies on
other nodes.
Capacity view of the resiliency health status at 99 percent or greater in HX Connect:
Intersight Alarms view when a cluster is in a “read only state” (higher than 99% capacity):
vCenter Alarms view when a cluster is in a “read-only state” (higher than 99% capacity):
HX Connect System overview view when a cluster is in a “read-only state” (higher than 99% capacity):
Once the cluster’s space usage drops back to 92 percent, the cluster returns to an operational state. The operational status changes from read-only mode to online, but the resiliency status will still be in a warning state, as shown below.
HX Connect dashboard view when a cluster returns to an operational state but is still in a warning state:
How to handle misbehaving drives
A failure of a capacity disk on a node causes the disk to go offline and be marked as blacklisted. Multiple disks in the same node can fail without impact to the node or the cluster with the appropriate RF setting. A rebalance after a disk failure can take several minutes up to two hours, depending on the failed disk’s size and the amount of space consumed on it. Small additional latencies may be observable during this time. After one or more disk failures, the system will start the rebalancing process. After the rebalance, the system is marked as healthy. Once the system is healthy again, the failing disk(s) can be replaced. The specific process for handling a misbehaving disk is to blacklist it. If a disk fails, the software automatically tries removing and re-adding the disk up to five times; if it still fails, the disk is hard blacklisted, meaning the cluster stops reselecting the disk and considers the drive permanently gone. Later, users can always opt to replace the blacklisted drive with a new drive.
System information page after hard blacklisting drive:
vCenter Alarms view when the disk is blacklisted:
Space accounting for snapshots
A snapshot captures the entire state of the virtual machine at the time you take the snapshot. This includes:
● Memory state - the contents of the virtual machine’s memory
● Settings state - the virtual machine settings
● Disk state - the state of all the virtual machine’s disks
Snapshots operate on individual virtual machines. A snapshot preserves only the active virtual machine’s state.
When you revert to a snapshot, you return the virtual machine’s memory, settings, and virtual disks to the state
they were in when you took the snapshot. Storage can be an issue when using VMware snapshots. The size of
a snapshot file can never exceed the size of the original disk file. Any time you change a disk block, the system writes the change to the snapshot’s delta file and updates that file as further changes are made. If you change every single disk block on your server after taking a snapshot, your snapshot would still be the same size as your original disk file.
There is some additional overhead disk space that contains information that can be used to manage the
snapshots. The maximum overhead disk space varies. Traditional overhead estimates are based on the Virtual
Machine File System (VMFS) block size using VMFS-3.
Besides the performance impact, there is also the potential for vCenter to not be aware of snapshots due to an
out-of-sync condition, also called an inconsistent VM snapshot or broken VM snapshot. In this case, it is
possible that a virtual machine, which has snapshots and performs many disk writes, could have a snapshot that
fills the datastore and causes all virtual machines on that datastore to go down until space is freed. So, you can
end up with a massive VM snapshot size eating up your production space.
Any data that was writable on a VM becomes read-only when the snapshot is taken. VMware administrators
can take multiple snapshots of a VM to create multiple possible point-in-time restore points. When a VM reverts
to a snapshot, current disk and memory states are deleted and the snapshot becomes the new parent snapshot
for that VM. The snapshot file cannot exceed the size of the original disk file, and it requires some overhead
disk space. Snapshots will grow rapidly with high disk-write activity volume. VMware recommends deleting
snapshots within 24 – 72 hours.
Snapshot file formats include the *-delta.vmdk file, *.vmsd file, and *.vmsn file. Administrators create snapshots in
VMware vSphere's Snapshot Manager or with the vmware-cmd command line utility. Deleting, or committing,
snapshots merges all the delta files into the VMDK. If delta files remain in the VM's directory after deletion, the
snapshot did not delete properly.
Note: HyperFlex Data Platform Release 4.5.x requires changes in the snapshot procedure. The content provided in this document supports the HyperFlex Data Platform 4.0.2 release.
Check snapshot usage
For example, let’s take a VM titled, “test123highHxBenchVm1”. It has two hard disks and we’ve created a
snapshot called snap1 for the VM as shown below.
Snapshot example:
In the following image, the highlighted disks in yellow are the delta VMDKs created by the snapshot (snap1) for
the VMDKs of the original VM, “test123highHxBenchVm1”. The highlighted file in blue is the snapshot overhead
file.
Snapshot space usage from vCenter datastore view:
The snapshot space usage from the ESXi host view is provided in the screen capture below. (It provides
additional stats such as physical shared and unshared bytes).
Snapshot space usage from the ESXi host view:
vmkfstools –extendedstatinfo <vm_name/snapshot_vmdk_name>
Space usage with multiple snapshots of a VM
When you create a snapshot, the state of the virtual disk at the time the snapshot is taken is preserved and
writes to the VMDK file are not allowed. The system creates an additional VMDK file called a delta disk for
each VMDK disk in the VM and allows you to write any changes to that delta disk. The delta disk represents the
difference between the current state of the virtual disk and the state that existed at the time the previous
snapshot was taken.
If multiple snapshots are taken, delta disks will be created for each of VM disks of every snapshot and can
represent the difference between each snapshot.
When you delete a snapshot, changes between snapshots and the previous disk states are merged and all the
data from the delta disk that contains the information about the deleted snapshot is written to the parent VMDK
disk. The amount of time it takes to commit or delete snapshots depends on how much data the guest
operating system has written to the virtual disks since the last snapshot was taken.
For example, let’s take the same VM we used for explaining the snapshot view and create another snapshot,
called snap2, as shown in the first screen capture below. The next two screen captures show the multiple
snapshots’ space usage within a VMware vCenter datastore and VMware ESXi, respectively.
snap2:
Multiple snapshots space usage from the vCenter datastore view:
Snapshot space usage from the ESXi host view:
Native snapshots
A native snapshot is a reproduction of a VM that includes the state of the data on all VM disks and the VM
power state (on, off, or suspended) at the time the native snapshot is taken. Take a native snapshot to save the
current state of the VM, so that you can revert to the saved state. Always use the HX Data Platform Snapshot
feature to create your first snapshot of a VM. This ensures that all subsequent snapshots are in native format.
When you create the first snapshot of a VM (through either Snapshot Now or a Scheduled Snapshot), the HX Data Platform plug-in creates a base snapshot called a SENTINEL snapshot. The SENTINEL snapshot ensures that follow-on snapshots are all native snapshots. The SENTINEL snapshot consumes one of the snapshots available per VM under the VMware limitation.
HX Data Platform native snapshot options include creating a native snapshot, reverting to any native snapshot, and deleting a native snapshot.
1. Create a native snapshot
Consider a test VM named "Core_perf_vmHxBenchVm12" in the vCenter that has two hard disks. The available
disks and their capacity usage before taking a snapshot are shown in the following snip:
Create native snapshot using stcli commands, as shown in this next snip:
vCenter view after creating a snapshot:
The following information shows the two snapshots that were created after initiating the snapshot creation
command.
Snapshot1 – SENTINEL
Snapshot2 – snap1234
The following information shows the space usage details of the base disks after snapshot creation.
Note: Base disks get renamed after snapshot creation.
SENTINEL Snapshot
SENTINEL snapshots prevent reverted VMs from having VMware redo log-based virtual disks. Redo log-based
virtual disks occur when an original snapshot is deleted and the VM is reverted to the second oldest snapshot.
SENTINEL snapshots are in addition to the revertible native snapshot.
Note: SENTINEL snapshot creation isn’t supported with the 4.5.x Release of HX Data Platform when used
with specific ESXi versions.
When using SENTINEL snapshots:
● Do not delete the SENTINEL snapshot
● Do not revert your VM to the SENTINEL snapshot
Space usage of SENTINEL snapshot:
vCenter view of SENTINEL snapshot:
Space usage view of the HX native snapshot “snap1234”:
vCenter view of HX native snapshot “snap1234”:
2. Revert a snapshot
Reverting a snapshot is returning a VM to a state stored in a snapshot. Reverting a snapshot is performed on
one VM at a time. This is not performed at the resource pool or VM folder level. Reversion of snapshots is
performed through the vCenter Snapshot Manager and not through the HX Data Platform plug-in.
3. Delete a snapshot
Deleting snapshots is managed through the vSphere interface and not through the HX Data Platform plug-in.
Differences between a VM snapshot and HX native snapshot:
● Snapshots created through VMware vSphere are non-native snapshots, while snapshots created from the HX Data Platform plug-in are native snapshots.
● Currently, VMware has a limitation of 32 snapshots per VM. This maximum total includes VMware-created snapshots, an HX Data Platform SENTINEL snapshot, and HX Data Platform native snapshots.
● Native snapshots have a limitation of 31 snapshots per VM since the first native snapshot created for a
VM creates two snapshots, including a SENTINEL snapshot.
● The HX Data Platform native snapshot process is fast because it does not incur I/O overhead.
● HX Data Platform native snapshots do not degrade VM performance.
● HX Data Platform native snapshots are crash-consistent by default (I/O crash).
● Consistency is defined as maintaining the correct order of write operations to enable an application to
restart properly from a crash.
The Cisco HyperFlex HX Data Platform uses metadata-based, zero-copy snapshots to facilitate backup
operations and remote replication—critical capabilities in enterprises that require always-on data availability.
Space-efficient snapshots allow you to perform frequent online backups of data without needing to worry about
the consumption of physical storage capacity. Data can be moved offline or restored from these snapshots
instantaneously.
● Fast snapshot updates – When modified data is contained in a snapshot, it is written to a new location
and the metadata is updated, without the need for read-modify-write operations.
● Rapid snapshot deletions – You can quickly delete snapshots. The platform simply deletes a small
amount of metadata that is located on an SSD, rather than performing a long consolidation process as
needed by solutions that use a delta-disk technique.
● Highly specific snapshots – With the Cisco HyperFlex Data Platform, offloaded snapshots occur at a VM
level of granularity. You are not forced to take a snapshot of the entire volume or datastore to capture a
single VM.
Clones
A clone is a copy of an existing virtual machine. The existing virtual machine is called the parent of the clone.
When the cloning operation is complete, the clone is a separate virtual machine, although it may share virtual
disks with the parent virtual machine.
● Changes made to a clone do not affect the parent virtual machine. Changes made to the parent virtual
machine do not appear in a clone.
● A clone's MAC address and Universally Unique Identifier (UUID) are different from those of the parent
virtual machine.
If you want to save the current state of the virtual machine so you can revert to that state in case you make a
mistake, take a snapshot. If you want to make a copy of a virtual machine for separate use, create a clone.
Full and linked clones
There are two types of clones:
● A full clone is an independent copy of a virtual machine that shares nothing with the parent virtual
machine after the cloning operation. Ongoing operation of a full clone is entirely separate from the parent
virtual machine.
● A linked clone is a copy of a virtual machine that shares virtual disks with the parent virtual machine in an
ongoing manner. This conserves disk space and allows multiple virtual machines to use the same
software installation.
HyperFlex clones
In the Cisco HyperFlex HX Data Platform, clones are writable snapshots that can be used to rapidly provision
items such as virtual desktops and applications for test and development environments. These fast, space-
efficient clones rapidly replicate storage volumes so that virtual machines can be replicated through just
metadata operations, with actual data copying performed only for write operations. With this approach,
hundreds of clones can be created and deleted in minutes. Compared to full-copy methods, this approach can
save a significant amount of time, increase IT agility, and improve IT productivity.
Clones are deduplicated when they are created. When clones start diverging from one another, data that is
common between them is shared, with only unique data occupying new storage space. The deduplication
engine eliminates data duplicates in the diverged clones to further reduce the clone’s storage footprint. As a
result, you can deploy many application environments without needing to worry about storage capacity use.
If you use the VMware cloning operation, you can create only a single clone from a VM. This operation is
manual and slower than batch processing multiple clones from a VM. For example, to create 20 clones of a VM,
you must manually perform the clone operation repeatedly. Use HX Data Platform Ready Clones to create
multiple clones of a VM in one click.
When a native clone of a VM is made, the HX Data Platform simply copies the set of pointers. It takes no extra
disk space (apart from the size of the set of pointers) and is inherently de-duplicated.
Clone a VM using stcli:
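The original screen capture is not reproduced here. As a rough, hypothetical sketch of the Ready Clones CLI (run from the storage controller VM; flag names can differ between releases, so verify against the stcli command reference):
# Create 5 ready clones of the VM "SourceVM" with the name prefix "Clone" (illustrative values)
stcli vm clone --name SourceVM --clone Clone --number 5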
Space usage of clone VM:
VMware vStorage API for Array Integration (VAAI)
VAAI, also called hardware acceleration or hardware offload APIs, is a set of APIs that enable communication
between VMware vSphere ESXi hosts and storage devices. This feature enables your host to offload specific
virtual machine and storage management operations to compliant storage hardware. With the storage hardware
assistance, your host performs these operations faster and consumes less CPU, memory and storage fabric
bandwidth.
This storage offload API allows vSphere to request advanced file system operations such as snapshots and
cloning. The controller causes these operations to occur through manipulation of metadata rather than actual
data copying, providing rapid response, and thus rapid deployment of new application environments.
Without the use of VAAI, cloning or migration of virtual machines by the vSphere VM Kernel Data Mover involves
software data movement. The Data Mover issues I/O to read and write blocks to and from the source and
destination datastores. With VAAI, the Data Mover can use the API primitives to offload operations to the array,
if possible. For example, if the desired operation were to copy a Virtual Machine Disk (VMDK) file from one
datastore to another inside the same array, the array would be directed to make the copy completely inside the
array. Whenever a data movement operation is invoked and the corresponding hardware offload operation is
enabled, the Data Mover will first attempt to use the hardware offload. If the hardware offload operation fails,
the Data Mover reverts to the traditional software method of data movement.
VAAI uses these fundamental operations:
● Atomic Test and Set (ATS), which is used during creation and locking of files on the VMFS volume
● Clone Blocks/Full Copy/XCOPY, which is used to copy or migrate data within the same physical array
● Zero Blocks/Write Same, which is used to zero out disk regions
● Thin-provisioning in ESXi 5.x and later hosts, which allows the ESXi host to tell the array when the space
previously occupied by a virtual machine (whether it is deleted or migrated to another datastore) can be
reclaimed on thin provisioned LUNs
● Block Delete in ESXi 5.x and later hosts, which allows for space to be reclaimed using the SCSI UNMAP
feature
Command to check VAAI status:
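The screen capture for this command is not reproduced here. On an ESXi host, one commonly used check for block devices is shown below (a sketch; note that HyperFlex datastores are presented over NFS, so also see the NFS check that follows):
# Lists VAAI primitive support (ATS, Clone, Zero, Delete) for each block storage device
esxcli storage core device vaai status get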
Command for NFS datastores to check whether VAAI is supported:
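A sketch of the equivalent check for NFS datastores, which is how HX datastores are mounted on ESXi; the Hardware Acceleration column in the output reports VAAI support:
# Lists NFS datastores mounted on the host, including their hardware acceleration status
esxcli storage nfs list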
vSphere view of VAAI status:
For each storage device and datastore, the vSphere client displays the hardware acceleration support status in
the Hardware Acceleration column of the Devices view and the Datastores view.
The status values are “Unknown”, “Supported”, and “Not Supported”. The initial value is “Unknown”. The status
changes to “Supported” after the host successfully performs the offload basic operations. If the offload
operation fails, the status changes to “Not Supported”.
ESXI command-line interface view of VAAI status:
VAAI limitations
VAAI cannot be used when:
● The source and destination VMFS volumes have different block sizes
● The source file type is RDM and the destination file type is non-RDM (regular file)
● The source VMDK type is Thick Eager Zeroed and the destination VMDK type is thin
● The source or destination VMDK is any kind of sparse or hosted format
● Cloning a virtual machine that has snapshots, because this process involves consolidating the snapshots
into the virtual disks of the target virtual machine
● The logical address and/or transfer length in the requested operation is not aligned to the minimum
alignment required by the storage device (all datastores created with the vSphere Client are aligned
automatically)
● The VMFS datastore has multiple LUNs/extents spread across different arrays
Different cases of failure modes
Case 1: Node failure
The behavior of a Cisco HyperFlex cluster during a node failure depends on three factors:
● The overall HyperFlex cluster’s size (number of nodes) during healthy operations
● The Replication Factor (RF) that has been configured for the cluster
● The number of (additional) nodes that have failed simultaneously
When a node failure occurs, workloads will see a noticeable latency spike across the cluster. The surviving nodes may be computationally impacted as the VMs from the failed node are restarted. Capacity utilization in general will rise by roughly 1/n per node once the rebuild completes (n = total number of original nodes). After a certain time, depending on the available
nodes). After a certain time, depending on the available storage resources, the cluster will start rebalancing the
RF2 / RF3 protection. This time is called the healing time. By default, the healing time is set to two hours. If
within this time the nodes come back online, the cluster will start updating the storage layer, depending on the
number of updates written to the other copies of the data. For more information, read the following full
explanation of the different scenarios.
A storage cluster healing timeout is the length of time the HX Connect or HX Data Platform plug-in waits before
automatically healing the storage cluster. If a disk fails, the healing timeout is one minute. If a node fails, the
healing timeout is two hours. A node failure timeout takes priority if a disk and a node fail at same time or if a
disk fails after node failure, but before the healing is finished.
When the cluster resiliency status is “Warning”, the HX Data Platform system supports the storage cluster
failures and responses for RF3 outlined in the bulleted list below. Optionally, click the associated Cluster
Status/Operational Status or Resiliency Status/Resiliency Health in HX Connect and HX Data Platform plug-in to
display the reason messages that explain what is contributing to the current state.
● Three-node cluster with one simultaneous failure – The storage cluster does not automatically heal.
However, there is a Cisco Technical Assistance Center (TAC) procedure to replace the failed node to
restore storage cluster health.
● Four-node cluster with one simultaneous failure – If the node does not recover in two hours, the
storage cluster starts healing by rebalancing data on the remaining nodes. To recover the failed node
immediately and fully restore the storage cluster:
◦ Check that the node is powered on and restart if possible. You might need to replace the node.
◦ Rebalance the cluster.
● A cluster of five or more nodes with two simultaneous failures – If the node does not recover in two
hours, the storage cluster starts healing by rebalancing data on the remaining nodes. To recover the
failed node immediately and fully restore the storage cluster:
◦ Check that the node is powered on and restart if possible. You might need to replace the node.
◦ Rebalance the cluster.
Various impacts during a node failure:
● Potential impacts on the VM (all components) located on the failed node – Depending on the cluster configuration’s VMware settings for high availability and DRS, the VMs that were running on the failing node need to be, or are, automatically restarted on another node. In a normal production environment, the cluster has an N+1 configuration. This means that there is one node with additional resources (spare) in case one node fails. Data is still available since there is no data locality, so Storage vMotion is not needed. This is due to the usage of RF2 or RF3 (the replication factor of the data platform).
● Potential impacts on the VM (all components) located on a different node – The VMs that were
running on the failing node can be restarted on the remaining nodes; however this depends on the
requirements settings of high availability and DRS. It also depends on how many resources are available
on the remaining running nodes. At the time of design, it must be considered how many resources are
available for failover in case of failures. In a normal production environment, the cluster has an N+1
configuration. This means that there is one node with additional resources (spare) in case one node fails.
● System stabilization – After the failing nodes have been replaced/repaired, the system is rebalanced
back to its original state. This means that the VMs can be transferred online to the previously failing
nodes without interruption of the operation. The data will be rebalanced in the background. The time this
will take is dependent on the size of the underlaying storage layer, the number of updates of the data,
the CPU type, and memory size.
HX Connect dashboard page when there is a node failure:
System information page after a node failure:
Case 2: Cache disk failure
After a cache SSD fails, the cluster remains highly available. The VMs keep running because the HyperFlex storage controller VM is still running and continues to move data to and from the HX Data Platform in conjunction with the other nodes and the HyperFlex storage controller VMs residing on those nodes. Data will always be accessible.
Various impacts during a cache disk failure:
● Potential impacts on the VM (all components) hosted on the same node where the cache disk failed – For hybrid clusters, expect a slight latency impact on that specific node because the aggregate read cache is diminished until the SSD is replaced and the cluster rebalances. There is no read cache in an all-flash cluster, so there will be no impact to latency. The write cache always remains online because the HyperFlex storage controller VM will ‘talk’ to the other HyperFlex storage controller VMs residing on the other nodes.
● Potential impacts on the VM (all components) hosted on a different node where cache disk failed:
No impact
● System stabilization – After replacing the cache disk, the cluster will rebalance the data.
Case 3: Capacity disk failure
A failure of a capacity disk on a node will cause the disk to go offline and be marked as blacklisted. Multiple
disks in the same node can fail without impact to the node and cluster with the appropriate RF setting:
● A cluster of three or more nodes with one simultaneous disk failure – If the disk does not recover in
one minute, the storage cluster starts healing by rebalancing data on the remaining nodes.
● A cluster of three or more nodes with two simultaneous disk failures – If the disk does not recover in
one minute, the storage cluster starts healing by rebalancing data. If the system is not restored, replace
the faulty disks and restore the system by rebalancing the cluster.
● A cluster of four or more nodes with two simultaneous disk failures – If the disk does not recover in
one minute, the storage cluster starts healing by rebalancing data on the remaining nodes (where the
disk failure didn’t happen).
● A cluster of five or more nodes with two simultaneous disk failures – The system automatically
triggers a rebalance after a minute to restore storage cluster health.
Note: For disk failures, with RF3, you can lose as many disks as you want in each node, as long as you do
not lose disks in three separate nodes at the same time. So, you can lose five disks in node 1 and 8 disks
in node 2 and still stay online. This assumes the nodes are still available and you are experiencing only disk
losses. The cluster will quickly heal, such that you can continue to sustain losses indefinitely (if free cluster
space remains).
Various impacts during capacity drive failure:
● Potential impacts on the VM (all components) hosted on the same node where capacity disk failed –
A rebalance after a disk failure can take several minutes to two hours, depending on the failed disk size
and the amount of space consumed on it. Small additional latencies may be observable during this time.
● Potential impacts on the VM (all components) hosted on a different node where capacity disk failed
– A slight increase of latency could be measured, although if it can be observed, the impact is extremely
low.
● System stabilization – After one or more disk failures, the system will start the rebalancing process.
After the rebalance, the system is marked as healthy. After the system is healthy again, the failing disk(s)
can be replaced.
System information page before disk failure:
System information page after disk failure:
From the previous snip, there are only five capacity disks shown out of a total of six drives.
Space usage’s effect on performance
Following are the performance benchmarking results at different cluster capacities (all-Flash cluster with RF set
to 3).
Performance metrics observed when cluster capacity is at 70 percent (before the cluster enters a Warn state)
Application-level metrics:
Load Point           10%      50%      75%      90%      100%
IOPS                 7,202    35,604   53,407   63,999   69,554
Latency (ms)         1.3      2.8      5.3      6.9      7.7
Throughput (MBps)    56.3     278.2    417.2    500.0    543.4
HX Connect performance dashboard:
Performance metrics observed when cluster capacity is at 80 percent (above the Warn state threshold of 76 percent)
Application-level metrics:
Load Point           10%      50%      75%      90%      100%
IOPS                 6,201    30,902   46,304   55,492   55,145
Latency (ms)         2.7      4.1      7.5      9.2      10.1
Throughput (MBps)    48.4     241.4    361.7    433.5    430.8
HX Connect performance dashboard:
These test results show a clear drop in performance as the cluster fills up. The real degradation begins
once the cluster enters the Warn state, as can be seen by comparing the results at the two capacity
levels.
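The size of the drop can be quantified directly from the two tables above. The short calculation below compares the 100 percent load points of the 70 percent-full and 80 percent-full runs; it is only arithmetic on the measured values already shown.

```python
# Relative change between the 70 percent-full and 80 percent-full runs
# at the 100 percent load point, using the values from the tables above.
at_70 = {"IOPS": 69554, "Latency (ms)": 7.7, "Throughput (MBps)": 543.4}
at_80 = {"IOPS": 55145, "Latency (ms)": 10.1, "Throughput (MBps)": 430.8}

for metric in at_70:
    change = (at_80[metric] - at_70[metric]) / at_70[metric] * 100
    print(f"{metric}: {change:+.1f}%")

# Output (rounded): IOPS -20.7%, Latency (ms) +31.2%, Throughput (MBps) -20.7%
```

In other words, crossing the Warn threshold in this test cost roughly a fifth of the IOPS and throughput and added roughly a third more latency at the highest load point.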
Note: In the following graph, the load lines moving toward the top left corner are typical of any storage
system. This is not just a HyperFlex phenomenon.
Handling “Out of Space” errors
If the system displays an “Out of Space” error, such as the cluster is entering “Read-only mode”, you can either
add a node to increase free capacity or delete existing unused VMs to release space. When there is an “Out of
Space” condition, the VMs are unresponsive.
Note: Do not delete storage controller VMs. Storage controller VM names have the prefix stCtlVM.
Following are the steps to take to handle “Out of Space” issues (a scripted sketch of the VM cleanup steps follows this list):
● To add a node, use the Expand Cluster feature of the HX Data Platform Installer.
● To delete unused VMs:
◦ Determine which guest VMs you can delete. You can consider factors such as disk space used by the
VM or naming conventions.
◦ Go to vCenter > Virtual Machines to display the virtual machines in the inventory.
◦ Double-click a VM that you want to delete.
◦ Power off the VM.
◦ Delete the VM.
● After the “Out of Space” condition is cleared:
◦ Go to vCenter > Virtual Machines to display the VM in the inventory.
◦ Double-click a VM that you want to use.
◦ Power on the VM.
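For environments where this cleanup needs to be scripted, the steps above can be automated against vCenter. The following is a minimal sketch using the pyVmomi library; the vCenter hostname, credentials, and VM name are hypothetical placeholders, and the check that skips names beginning with stCtlVM reflects the note above about never deleting storage controller VMs.

```python
# Minimal pyVmomi sketch of the manual steps above: locate an unused guest VM,
# power it off, and delete it. All connection details below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

def delete_unused_vm(vc_host, user, pwd, vm_name):
    ctx = ssl._create_unverified_context()                 # lab use only
    si = SmartConnect(host=vc_host, user=user, pwd=pwd, sslContext=ctx)
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.VirtualMachine], True)
        for vm in view.view:
            if vm.name != vm_name:
                continue
            if vm.name.startswith("stCtlVM"):              # never delete controller VMs
                raise RuntimeError("refusing to delete a storage controller VM")
            if vm.runtime.powerState == vim.VirtualMachinePowerState.poweredOn:
                WaitForTask(vm.PowerOffVM_Task())          # power off the VM
            WaitForTask(vm.Destroy_Task())                 # delete the VM from disk
            return True
        return False
    finally:
        Disconnect(si)

# Example (placeholder values):
# delete_unused_vm("vcenter.example.com", "administrator@vsphere.local",
#                  "VMware1!", "unused-test-vm")
```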
Cluster expansion
When there is a need for additional storage in a cluster, the HX Data Platform supports an expansion feature to
increase storage capacity. You can add disks to the existing nodes or add new converged nodes to the existing
cluster.
Guidelines for expansion
● Replication should be paused before performing a cluster expansion; it can be resumed after completing
expansion.
● Use only admin credentials for the controller VM during the expansion workflow. Using credentials other
than admin may cause the expansion to fail.
● The ESXi installation is supported on SD cards for M4 converged nodes and on M.2 SATA SSDs for M5
converged nodes. For compute-only nodes, ESXi installation is supported on SD cards, SAN boot, or a
front SSD/HDD. Installing ESXi on a USB flash drive is not supported for compute-only nodes.
Mixed cluster expansion guidelines
● Expanding an existing M4 cluster with M5 converged nodes is supported.
● Expanding an existing M5 cluster with M4 converged nodes is not supported.
● Expanding an existing mixed M4/M5 cluster with M4 or M5 converged nodes is supported.
● Adding any supported compute-only nodes is permitted with all M4, M5, and mixed M4/M5 clusters
using the HX Data Platform Release 2.6 or later Installer. Some example combinations are listed here;
many other combinations are possible.
Example combinations:
◦ Expand a mixed M4/M5 cluster with compute-only Cisco UCS® B200, C220, C240 M4/M5 servers
◦ Expand an M4 cluster with compute-only Cisco UCS B200 M5, C220 M5, C240 M5 servers
● Only an expansion workflow is supported to create a mixed cluster. An initial cluster creation with mixed
M4/M5 servers is not supported.
● All M5 servers must match the form factor (220/240), type (Hybrid/AF), security capability (Non-SED
only), and disk configuration (quantity, capacity, and non-SED) of the existing M4 servers.
● HyperFlex HX220c M5 nodes will use a maximum of six capacity disks (two disk slots will remain empty)
when mixed with HX220c M4 nodes.
● Mixed M4/M5 clusters are not supported for HX Edge, SED, LFF, Hyper-V, or stretched cluster deployments.
Note: Adding a compute-only node does not change the available cluster capacity.
Cluster node expansion
A converged node can be added to a HyperFlex cluster after cluster creation. The storage on a converged node
is automatically added to the cluster’s storage capacity. Multiple nodes can be added without waiting for
rebalancing, and the new storage capacity is available for use immediately after a node is added. Rebalancing
may not start immediately after the node is added; when it starts depends on the cluster’s storage capacity
usage.
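As a back-of-the-envelope illustration of why a converged node (and not a compute-only node) grows capacity, the sketch below multiplies drives per node by drive size and divides by the replication factor to approximate usable capacity before deduplication and compression. The node count, drive count, drive size, and RF are hypothetical, and real usable capacity also depends on metadata and filesystem overhead.

```python
# Rough capacity estimate: raw capacity and an RF-adjusted usable figure,
# before deduplication, compression, and metadata overhead. Values are hypothetical.
def cluster_capacity_tib(converged_nodes, drives_per_node, drive_tib, rf):
    raw = converged_nodes * drives_per_node * drive_tib
    return raw, raw / rf

before_raw, before_usable = cluster_capacity_tib(3, 6, 3.8, rf=3)
after_raw, after_usable = cluster_capacity_tib(4, 6, 3.8, rf=3)
print(f"before expansion: raw {before_raw:.1f} TiB, ~{before_usable:.1f} TiB after RF")
print(f"after expansion:  raw {after_raw:.1f} TiB, ~{after_usable:.1f} TiB after RF")
# Adding a compute-only node would leave both figures unchanged.
```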
Scenario: adding new nodes (cluster node expansion)
● The storage cluster should be healthy.
● Check the node interoperability rules (e.g., an M4 cluster can be expanded with M5 nodes, but an M5 cluster cannot be expanded with M4 nodes).
● Note: Some features, such as a stretched cluster, are not supported on M4 nodes.
● Clusters need to be homogenous:
◦ All nodes need to have the same number of capacity drives. (An M5 node has additional drive slots; in a
mixed cluster you will not be able to take advantage of them.)
◦ All data drives need to be the same capacity.
◦ All data drives need to be the same type (for instance, you cannot mix SED and non-SED or NVMe
and SATA).
◦ The M4 and M5 nodes have different drive sleds; they will not fit into other nodes.
◦ Cache drives must be compatible.
◦ Pay attention to the cluster version when using alternate drives. (More explanation on this topic will
be covered later in this guide.)
● Ensure that the new node uses the same configuration as the other nodes in the storage cluster. This
includes VLAN IDs and switch types (whether vSwitches), VLAN tagging with External Switch VLAN
Tagging (EST), VLAN tagging with Virtual Switch Tagging (VST), or a virtual distributed switch.
(If the storage cluster is in an “Out of Space” condition, then when you add a new node, the system will
automatically rebalance the storage cluster. This is in addition to the rebalancing that is performed every
24 hours.)
● Ensure that the node you add is of the same model (HX220 or HX240), type (hybrid or all-Flash), and
disk configuration (SED or non-SED). In addition, ensure that the number of capacity disks matches the
existing cluster nodes.
● To add a node that has a different CPU family from what is already in use in the HyperFlex cluster,
enable Enhanced vMotion Compatibility (EVC).
● Ensure that the software version on the node matches the Cisco HX Data Platform version, the ESXi
version, and the vCenter version. To identify the software version, go to the Storage Cluster Summary
tab in vCenter and check the HX Data Platform version in the top section. Upgrade if necessary.
(If you upgraded the cluster, you must download and install a new installer VM that matches the current
version of HyperFlex Data Platform running on the cluster.)
● Ensure that the new node has at least one valid DNS and Network Time Protocol (NTP) server
configured.
● If you are using Single Sign-On (SSO) or auto-support, ensure that the node is configured for SSO and
Simple Mail Transfer Protocol (SMTP) services.
● Allow Internet Control Message Protocol (ICMP) for pinging between the HX Data Platform installer and
the existing cluster’s management IP address. (A basic reachability pre-check is sketched after this list.)
● You can add multiple nodes without waiting for a rebalance.
● New capacity is available immediately. (The rebalance may not kick off immediately and will take time
when it does. Additional read performance will be available post-rebalance.)
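Some of the prerequisites above (DNS resolution, NTP server reachability, and ICMP reachability of the cluster management IP) can be sanity-checked with a short script before starting the expansion. The sketch below is illustrative only; the hostnames and IP address are placeholders, and it relies on the local ping command rather than any HyperFlex API.

```python
# Basic pre-expansion reachability checks. Hostnames and the management IP
# below are placeholders; replace them with values from your environment.
import socket
import subprocess

def dns_resolves(name: str) -> bool:
    try:
        socket.gethostbyname(name)
        return True
    except socket.gaierror:
        return False

def icmp_ping(host: str) -> bool:
    # Uses the local "ping" command; "-c 1" sends a single echo request (Linux/macOS).
    return subprocess.run(["ping", "-c", "1", host],
                          stdout=subprocess.DEVNULL).returncode == 0

print("vCenter name resolves:           ", dns_resolves("vcenter.example.com"))
print("NTP server name resolves:        ", dns_resolves("ntp.example.com"))
print("Cluster management IP reachable: ", icmp_ping("192.168.10.20"))
```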
Summary
● Assign the Compute-Only service profile to a server.
● Install ESXi, downloaded from www.cisco.com/go/software.
● Make sure you can ping the vSphere host.
● Run the HyperFlex installer and choose the “Manual” option.
● Check “Deploy HX Software and Expand Server”.
● Fill in the requested credentials and add a node with the vSphere IP address of the host.
● Wait until the expansion is finished.
Test results for cluster node expansion
Cluster capacity before node expansion:
System information page before node expansion:
Cluster capacity after node expansion:
System information page after expansion:
System overview page after node expansion is complete:
During the rebalance operation:
Cluster capacity after the rebalance operation is complete:
Newly added node after rebalancing is complete:
Storage usage on the newly added node after rebalancing is complete:
Storage usage on other nodes after rebalancing is complete:
Performance snip during the complete node expansion experiment:
In the previous snip, the rebalance operation (highlighted with the black circles at the bottom of the image)
started around 5:15 p.m. and ended at 7:05 p.m.
Performance snip from start of the test (with three nodes) and a fourth node added to the cluster (before
the rebalance operation started):
Performance snip from the start of the rebalance operation through to its completion:
Comparing the performance results during the rebalance operation with the data captured before the rebalance
started (highlighted in the screen captures above) shows a slight performance impact on the cluster while the
rebalance runs after the additional node is added.
Performance snip after rebalancing is complete and end of the workload run:
Cluster data drive expansion
When a cluster is running out of space, another way to increase storage capacity is to add more data drives
to the existing nodes in the cluster. Data drives can be added dynamically to all nodes without waiting for a
rebalance. All data drives in the cluster need to be of the same capacity.
Scenario: Adding additional drives to existing clusters
● Add data drives dynamically – add drives to all nodes without waiting for a rebalance
● Clusters need to be homogeneous:
◦ All nodes need to have the same number of data drives
◦ All data drives need to be the same capacity
◦ All data drives need to be the same type (e.g., you cannot mix SED and non-SED or NVMe and SATA
drives)
◦ M4 and M5 nodes have a different drive sled; they don’t fit into other nodes
◦ Pay attention to the cluster version when using alternate drives (more explanation on this issue is
covered later in this document)
● Adding more drives is not applicable for cache drives (only one cache drive per node is supported)
● The additional capacity will be available immediately (rebalance may not kick off immediately and will
take time when it does)
Newly added drives in UX (4.0(2x) example)
HyperFlex storage capacity forecasting and trending
The Capacity Runway (preview) feature available in Cisco Intersight enables you to predict the storage
utilization for HyperFlex clusters. The Capacity Runway Details view displays the Storage Runway and Utilization
Trend. The details on this page enable you to monitor storage utilization and manage capacity planning. If
storage utilization exceeds 76 percent, add drives or nodes to expand capacity, and consider removing unused
virtual machines and deleting files. On the Capacity Runway Details page, you can view the following
information (a simple forecasting sketch follows this list):
● Storage runway – Displays the number of days predicted for the storage capacity to exceed the
recommended capacity of 76 percent in the next full year
● Storage capacity (TiB) – Displays the amount of storage capacity available in the cluster
● Storage utilization – Displays the amount of storage capacity utilized by the cluster as a percentage. The
storage utilization percentage is computed based on the total capacity and the current capacity
utilization
● Storage utilization trend – Displays the percentage increase or decrease in storage utilization in the last
six months
● Storage optimization – Provides a percentage representation of the efficient usage of the storage
infrastructure
● Utilization trend graph – Displays the current trend and forecast trend for the storage utilization. The
graph is color-coded to indicate the alarms raised for the thresholds. You can select utilization
thresholds of 76 percent or 100 percent to view the predicted storage utilization on the graph.
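The runway idea can be illustrated with a simple linear-trend estimate: fit a slope to recent daily utilization samples and project when the 76 percent recommended limit will be crossed. This is not Intersight’s actual forecasting algorithm, only a minimal sketch, and the daily samples used below are hypothetical.

```python
# Illustrative linear-trend "runway" estimate: days until storage utilization
# is projected to cross the 76 percent recommended limit. Not Intersight's algorithm.
def runway_days(daily_utilization_pct, threshold=76.0, horizon_days=365):
    n = len(daily_utilization_pct)
    if n < 14:                                   # mirrors the 14-day minimum noted below
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_utilization_pct) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_utilization_pct))
             / sum((x - mean_x) ** 2 for x in xs))    # percent per day
    current = daily_utilization_pct[-1]
    if current >= threshold:
        return 0
    if slope <= 0:
        return horizon_days                       # not expected to cross within the horizon
    return min(int((threshold - current) / slope), horizon_days)

# 30 days of hypothetical samples, growing about 0.2 percent per day from 60 percent.
history = [60 + 0.2 * d for d in range(30)]
print(runway_days(history))   # roughly (76 - 65.8) / 0.2 = 51 days
```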
HyperFlex cluster capacity planning
Cisco Intersight enables you to predict the storage utilization for HyperFlex clusters. Using the historical data of
used storage capacity, along with the predicted storage utilization per HyperFlex cluster, you can proactively
scale the storage utilization on the HyperFlex clusters.
Following is a list of some important factors to consider:
● The Capacity Runway is calculated only for HyperFlex clusters claimed and connected to Intersight.
● Claim all the servers in the HyperFlex cluster with the proper license to enable the Capacity Runway.
● After a new HyperFlex cluster installation, it may take up to five minutes for the claimed cluster’s
licensing details to be refreshed. The HyperFlex Capacity Runway tab may temporarily show that the
license is not available.
● Capacity Runway is computed daily at 6:00 a.m. UTC. Historical data from the last six months is used for
the computation.
● A minimum of 14 days of data is required to begin predicting storage utilization.
● The accuracy of the predicted storage utilization increases with more recent historical data collected.
Alarms: Depending on the storage capacity utilization, alarms are raised up to six months in advance. A critical
alarm is raised when the storage utilization is predicted to exceed the recommended capacity limit of 76
percent in 60 days or if the capacity may exceed 100 percent in 90 days. A warning alarm is raised if the
storage utilization is predicted to exceed the recommended capacity limit of 76 percent in 180 days.
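The alarm thresholds described above can be written down as a small classification function. The sketch below simply encodes the rules stated in this section; the day values in the examples are hypothetical.

```python
# Encodes the alarm rules above: critical if 76 percent is predicted within 60 days
# or 100 percent within 90 days; warning if 76 percent is predicted within 180 days.
def runway_alarm(days_to_76, days_to_100):
    if (days_to_76 is not None and days_to_76 <= 60) or \
       (days_to_100 is not None and days_to_100 <= 90):
        return "critical"
    if days_to_76 is not None and days_to_76 <= 180:
        return "warning"
    return "ok"

print(runway_alarm(days_to_76=51, days_to_100=140))    # critical
print(runway_alarm(days_to_76=120, days_to_100=None))  # warning
print(runway_alarm(days_to_76=None, days_to_100=None)) # ok
```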
Notes:
● Capacity Runway is an Essentials feature.
● The Capacity Runway feature is in preview and is not meant for use in your production environment. Cisco
recommends that you use this feature on a test network or system.
● HyperFlex cluster capacity planning is not supported on the Cisco Intersight Virtual Appliance.
Utilization trend:
Printed in USA C11-744026-00 10/20