High Availability, Disaster Recovery, and Microsoft Azure

Both High Availability (HA) and Disaster Recovery (DR) have been essential IT topics. Fundamentally HA is about fault tolerance relevant to the availability of an examined subject like application, database, VMs, etc. While DR roots on the ability to resume operations in the aftermath of a catastrophic event. A fundamental difference of these two is that HA expects no down time and no data loss, while DR does. They are different issues and should be addressed separately.

Background

For many IT shops, either HA or DR has been a high risk and high cost item. Both are essential to business continuity, while traditionally tough technical problems to solve with very significant and long-term commitments on resources. Not only they are technically challenging, but a continual cost-cutting which has become an IT standard practice in the past two decades makes purchasing hardware/software and constructing either HA or DR solution on premises further distant from IT’s financial and technical realties.

Sense of Urgency

Too often, the technical challenges and resource commitments overwhelm IT and turn HA and DR into academic discussions, or symbolic items on a project checklist. At the same time, information is rapidly exploding as internet, mobility and social-network are becoming integral in our daily lives and businesses. There are progressively more data to process and store. For many businesses, the needs for HA and DR is urgent for better managing risks. And continual availability and on-demand ability to recover IT resources are becoming increasingly critical.

This Is the Reality Now

The good news is that the recent introduction of cloud computing has fundamentally changed how an HA or DR solution can be implemented. Microsoft Azure is a vivid example of HA and DR solutions which significantly reduce the required financial commitment and involved technical complexities. The traditional approach by establishing redundancy and acquiring a physical DR site with long-term resources and financial commitments is now largely replaced with consumable services which can be configured in minutes by mouse-clicking and with a manageable cost structure based on usage. HA and DR have become IT solutions which are financially realistic and technically feasible for businesses in all sizes.

HA, Redundancy, and Microsoft Azure LRS

HA denotes a strategy to employ redundancy such that a target application can and will continue being available without downtime while experiencing a failure of hosting hardware or software. There are various and well-developed HA solutions like a hyper-V host cluster using redundant hardware to eliminate a single point of failure of hosting OS or hardware, and an application cluster for eliminating a single point of failure by running the application in multiple VM instances with a synchronous state. Although HA implementations may vary, the fundamental principle nevertheless remains the same. HA expects neither downtime nor data loss while experiencing an outage of a target hardware or software.

HA has become dramatically simple in Microsoft Azure. Basically, all data written to disk in Microsoft Azure are kept at least in the so-called LRS, Locally Redundant Storage. LRS replicates a transaction synchronously to three different storage nodes across fault domains and upgrade domains within the same region for durability. In layman’s terms, Microsoft Azure by default maintains at least three copies of user data to achieve HA.

DR, Replication, and Microsoft Azure GRS

DR is about having a plan and backups in place to resume operations in the aftermath of a catastrophic event. Unplanned outage is assumed in a DR scenario, therefore some data loss is also expected. Notice that HA and DR are different business problems and addressed differently.

While both HA and DR are based on applying redundancy, i.e. a source and replicas, or multiple identical nodes of an examines component like application instance, databases, or VMs, there are however differences between the two. A DR solution generally employs replicas or backups, are implemented with asynchronous processes, and expects an outage of a source and with some data loss in transit while the outage occurs. While HA requires a logical representation with a real-time integrity using synchronous processes across all participating nodes, expects neither downtime nor data loss while experiencing an outage of a participating node.

For a critical workload, one approach of DR is to establish geo-replication to address an outage of an entire geographic area caused by a natural disaster, for example. The concern is that a catastrophic event may impact an entire geographic area causing a datacenter where a mission critical application is being hosted becomes unavailable for an extended period of time.

image

In Microsoft Azure, Geo Redundant Storage or GRS is the default and an optional setting, as shown above, while configuring a storage account. GRS will queue a transaction committed to LRS as an asynchronous replication to a secondary region, a few hundreds miles away from the primary region where a storage account is originated. At the secondary region, data is also stored in LRS, i.e. made durable by replicating it to three storage nodes.

image

Specifically, a Microsoft Azure storage account configured with GRS essentially maintains three replicas locally for high availability, and replicates the content and maintains three replicas at a secondary datacenter a few hundreds miles away for DR. So all are six copies, three locally and three remotely. All these are configured by one, yes one mouse click from a dropdown list while creating a storage account. The above is a conceptual model illustrated a data flow of GRS.

GRS replication has little performance impact on an application since application data are committed to LRS in real-time while replication to GRS is queued, i.e. asynchronously. A write to LRS is synchronous and in real-time, once committed, the changes are expected within 15 minutes to be asynchronously replicated to the secondary site.

For a RA-GRS storage account, in addition to one primary endpoint for read/write operations as it is in a GRS, there is also one secondary endpoint as read only becomes available as shown below. image

The cost implications of GRS or RA-GRS include the additional storage and the transmission costs for egress traffic, as applicable, of the secondary datacenter. Ingress traffic is free. And Microsoft Azure Storage SLA offers 99.9% availability and a cost calculator is also available.

Microsoft Azure Recovery Services

So far, much is about backing up or replicating data. To successfully restore, a DR plan must be put in place and ensure its availability upon a DR scenario in progress. Either placing a DR plan at a primary site where the source is or a secondary site where a replica stays has some issues and concerns.

Keeping a DR plan at the source site where all the resources are in place and on-the-job trainings seems logical. Or does it? DR is assuming a catastrophic event over an extended geographic areas where the source site is experiencing an outage. In such case, keeping a DR plan in the source site defeats the purpose.

Maintaining a DR plan at the secondary site is the choice then. In a DR scenario, a recovery site is to be brought on line within a expected period of time according to a DR plan, and having the DR plan right there and then at a recovery site makes all the sense. Or does it? This decision introduces a number of requirements including the physical readiness, the timeliness, and the financial implications on securing and maintaining a DR plan at a remote physical facility.

For a VMM server running on System Center 2012 SP1 or later, an idea, reliable and straightforward way is to use Azure Recovery Services to maintain a DR plan as shown below. And for any backup needs, using cloud as a backup site makes backing up and restoring data an anytime anywhere operation.

image

Azure Site Recovery Vault

This service essentially acts as the director of a DR process. It orchestrates and manages the protection and failover of VMs in clouds managed by Virtual Machine Manager 2012 SP1 or later. A noticeable advantage is the ability to test a recovery configuration, exercise a proactive failover and recovery, and automate recovery in the event of a site outage.

The SLA of Site Recovery Services is 99.9% availability to ensure a configured DR plan is always in place with expected updates. This is a DR solution that IT can implement, simulate, verify, bring online and be absolutely confident with the readiness.

image

Azure Backup Vault

This is a reliable, scalable and inexpensive data protection solution with zero capital investment and extremely low operational expense. Like other secure communication with Microsoft Azure, you will first upload a public certificate to Microsoft Azure. Then download the backup agent to register a target server with the backup vault. Then select what to be backed up. Both Microsoft Azure Backup SLA (99.9% availability) and cost calculator are available for better assessing the solution.

Closing Thoughts

Form an application’s view, HA is an on-going event while DR is an anticipation. HA and DR are different business problems and should be addressed differently. Nevertheless, Microsoft Azure provides a single platform to gracefully address HA with LRS, DR with GRS, and DR orchestration with Recovery Services, and all with published SLAs and a predictable cost structure. Going forward, IT pros can now include HA and DR as a reliable, scalable and relatively inexpensive proposition by employing Microsoft Azure as a solution platform.

Call to Action