High availability (HA) networks continue to function even when some components fail. A variety of features in Cisco IOS XE Software provide hardware and software redundancy that contribute to five nines (99.999%) uptime, which translates to no more than 5.26 minutes of downtime per year. That's the kind of reliability that Cisco customers have come to expect. Thousands of Cisco engineers in offices throughout the world make it possible.
This is the first in a series of three blogs that describe significant features in Cisco IOS XE that contribute to HA in the enterprise.
Cisco Stack Manager is a platform-independent discovery protocol that provides failover from active to standby switches in case the active switch experiences a failure. Available on Cisco Catalyst 9000 series, it enables a switch to discover peer nodes, verify their authenticity, raise alarms in case of a mismatch, allocate a unique switch number during discovery, and assign a HA role (e.g., active, standby, and member in one type of configuration). In case of failover, switchover, or a reload of the active switch card, the standby switch takes over.
After Stack Manager assigns roles to the switches (e.g., Active, Standby, Member), the Cisco IOS XE redundancy framework enables the control plane protocols to synchronize configuration data to the standby node. Standby protocols remain in a hot state so the standby switch can become active in case of a failure.
Stack Manager works in three different HA configurations, which will be described in an upcoming blog:
Cluster Manager is an adaptation of Stack Manager for use with Cisco Next Gen StackWise? Virtual Link, which provides the ability to virtualize two connected switches into a single virtual switch. Cluster Manager enables the same standby/active failover features provided by Stack Manager, with the added ability to provide HA across an entire data center environment using Next Gen StackWise Virtual Link. Virtualization eliminates the need to physically stack switches on top of each other. Soon, Cluster Manager will be able to support HA in switch clusters across different geographically dispersed locations.
The Stack Manager solution connects switches in a ring up to 8 switches but in configurations using StackWise Virtual Link and in wireless deployments, there is only a single interface between two nodes: one active, one standby. So, two technologies were created to handle split-brain-related HA scenarios in these configurations: Redundancy Management Interface (RMI) and Dual Active Detection (DAD).
RMI adds another interface to wireless controllers so that if one interface falters or fails, the other will take over to handle HA, first determining if it is an actual failure or just a momentary glitch. If it is an actual failure, RMI provides the redundant connection to ensure that if the active switch goes down, the standby takes over.
For deployments using StackWise Virtual Link, if the connection between the active and standby switches is lost, if one switch fails over to the second, the Dual Active Detection (DAD) process is activated. It queries the node manager for the existence of the lost peer. If it is available, it sends a recovery handshake. Once the handshake is completed, if the lost connection was due to a momentary glitch, the standby switch goes into recovery mode. If the switch is experiencing a failure, the other switch goes into recovery mode and assumes the active role.
All processes in active switches update the database and the database maintains the device's state. Since the standby doesn't communicate to the outside world, when it is updated by the active switch, it uses Operational Data Manager (ODM) to update the database. ODM uses Replication Manager to trigger all the data to sync from an active to a standby switch. The update first goes to the DB and then out to update the processes in the hot standby switch.
Symmetric Early Stacking Authentication (SESA) imposes authentication when one Catalyst 9000 series switch interacts with another and encrypts and decrypts all the remote inter-process communication between them to guard against hacking attempts. It works alongside standard stacking, StackWise Virtual Link, and wireless HA solutions and is Federal Information Processing Standards (FIPS) compliant.
In the past, reloading software on Cisco platforms could take 6-7 minutes. Now, with Extended Fast Software Upgrade (xFSU), the process is reduced to 30 seconds or less. This fast reload feature for Catalyst 9300 series switches decreases downtime during reload