Virtualization is not HA
by Felix Mößbauer
Availability vs Fault Tolerance
I have been asked several times whether running a software service in a virtual environment such as a VMware cluster requires licensing HA (High Availability). To answer that, consider the following statements:
- Virtualization and HA are orthogonal concepts
- HA is not a yes/no property; there are several levels or classes of availability
- Increased fault-tolerance is used to achieve HA
1. Virtualization vs HA
Running software as a single instance in a multi-node cluster can reduce downtime caused by planned maintenance, because the VM can be live-migrated to another node beforehand. This does increase the availability of the instance. However, it only holds for planned events. If the hardware the VM is running on fails unexpectedly, there is no time to migrate to another node, and the instance simply dies.
Another possible cause of an outage is a network partition, where the VMs are still running and operational but cannot be reached from the other partition.
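To make the difference between planned and unplanned events concrete, here is a minimal sketch. The function name `availability` and the yearly figures (8 h of planned maintenance, 4 h of unplanned outage) are hypothetical and chosen only for illustration. The sketch assumes that live migration removes planned downtime completely and has no effect on unplanned downtime.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability(planned_h: float, unplanned_h: float, live_migration: bool) -> float:
    """Availability share of a single VM instance over one year.

    Assumption: with live migration, planned maintenance causes no outage
    for the VM. Unplanned hardware failures always count in full.
    """
    outage_h = unplanned_h + (0.0 if live_migration else planned_h)
    return 1.0 - outage_h / HOURS_PER_YEAR


# Hypothetical figures: 8 h planned maintenance, 4 h unplanned outage per year.
without_migration = availability(8.0, 4.0, live_migration=False)
with_migration = availability(8.0, 4.0, live_migration=True)

print(f"without live migration: {without_migration:.5f}")
print(f"with live migration:    {with_migration:.5f}")
```

Under these assumptions the availability improves, but the unplanned share of the outage stays exactly as it was.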
2. Levels of HA
There are various definitions of HA levels. These can be split into two categories:
- Based on the service quality
- Based on the share of available time
The second is probably the more common definition. Here, HA is defined by the share of time in which the service is not in outage:
availabilityShare = 1 - outage/(available+outage)
The availability classes are then defined by the number of nines in the availability share:
- class 2: 99%
- class 3: 99.9%
- class n: 1 - 0.1^n (as a fraction, e.g. 0.9999 for class 4)
For details, see the High_availability article on Wikipedia.
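The formula and the class definition can be sketched in a few lines of Python. The function names are made up for this example, and the small epsilon in `availability_class` is an assumption added to keep floating-point rounding from pushing exact values such as 0.99 into the class below.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability_share(available: float, outage: float) -> float:
    """availabilityShare = 1 - outage / (available + outage)."""
    return 1.0 - outage / (available + outage)


def availability_class(share: float) -> int:
    """Largest n for which share >= 1 - 0.1**n (the number of nines)."""
    if not 0.0 <= share < 1.0:
        raise ValueError("share must be in [0, 1)")
    # The epsilon compensates for rounding, e.g. 1 - 0.99 is not exactly 0.01.
    return math.floor(-math.log10(1.0 - share) + 1e-9)


def max_outage_hours_per_year(n: int) -> float:
    """Outage budget per year that still satisfies class n."""
    return 0.1 ** n * HOURS_PER_YEAR


# 12 hours of outage in a year (8748 h available) is still only class 2.
share = availability_share(available=8748.0, outage=12.0)
print(availability_class(share))
print(f"{max_outage_hours_per_year(4) * 60:.1f} minutes per year for class 4")
```

For scale: class 4 leaves a budget of roughly 53 minutes of outage per year, which a single unexpected hardware failure can easily exceed.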
3. Fault Tolerance
A key aspect of achieving HA is reducing the number of single points of failure. This can be done on both the hardware and the software side. Common examples are:
- storage: use redundant disk arrays (RAID)
- network: use two network adapters
- run two instances on different nodes
- use slightly different implementations on the two instances, so that a single implementation bug does not take down both
- use an uninterruptible power supply (UPS)
- use a redundant network topology
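The effect of running redundant instances can be estimated with a short calculation. This is a minimal sketch under a strong assumption: the instances fail independently of each other, so the service is down only when all of them are down at the same time. The function name `combined_availability` is hypothetical.

```python
def combined_availability(single: float, instances: int) -> float:
    """Availability of a service backed by several redundant instances.

    Assumption: failures are independent, so the probability that all
    instances are down at once is (1 - single) ** instances.
    """
    if instances < 1:
        raise ValueError("at least one instance is required")
    return 1.0 - (1.0 - single) ** instances


# Two instances of class 2 (99%) each, failing independently.
print(f"{combined_availability(0.99, 2):.6f}")
```

Under the independence assumption, two class 2 instances together reach about class 4. Real failures are often correlated, for example through shared power or a shared network, which is why the other items in the list above matter as well.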
Virtualization is not per se an HA concept, but it is useful for reducing downtime due to planned maintenance. To achieve actual HA (>= class 4), the setup has to be designed to reduce single points of failure. When running regular software on commodity or server hardware and standard infrastructure, this also requires running at least two instances.
Disclaimer: The opinion stated above reflects the technical point of view. Some companies might define HA differently. If unsure, ask their sales and legal departments before implementation.