Asynchronous, Synchronous and Near-Synchronous Replications with Nutanix

May 4, 2020 1:00 pm

In a world where uncertainty is certain and IT disasters come without warning, IT leaders cannot afford to take the risk of being ill-prepared.

Nutanix understands that uptime needs vary from application to application. For example, mission-critical applications that handle financial transactions, stock exchange trading, computerized hospital patient records, emergency call centers, and life support services need 24x7x365 uptime, while applications related to engineering services, government services, and DevOps might not have such strict requirements. In addition, IT operational challenges such as system upgrades, migrations, and data corruption cannot be ignored.

Adding to the complexity, most enterprises today are locked into a set of vendors and hypervisors. IT topology becomes even more complex, and operational challenges multiply, when multiple copies of application data must be synchronized and maintained in different physical and geographical locations. The need for such a complex topology often arises from compliance regulations, collaboration across distributed business operations, and regional data privacy laws.

Consequently, IT systems must be resilient enough to handle faults and disasters to ensure business continuity. A research report by the Ponemon Institute pegs datacenter outage costs at around $9,000 per minute.

Keeping these factors in mind, Nutanix has built high availability (fault management for disks, network cards, and power supplies) and data protection into its AOS platform. Our disaster recovery solution extends continuous availability to multiple clusters through recovery plans and runbook planning. Disaster recovery is measured in terms of the recovery point objective (RPO), the maximum amount of data a customer is willing to lose; the recovery time objective (RTO), the time allowed to restore operations when an IT failure occurs; and cost. A disaster recovery topology is a combination of replication and recovery orchestration, and Nutanix supports different RPO and RTO targets through different disaster recovery topologies.
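To make the RPO and RTO definitions concrete, here is a minimal Python sketch that checks a single incident against both targets. The function name, argument names, and timestamps are all illustrative assumptions; none of them come from a Nutanix API.

```python
from datetime import datetime, timedelta


def check_objectives(last_recovery_point, failure_time, service_restored, rpo, rto):
    """Compare one incident against its RPO and RTO targets.

    All names here are hypothetical and used only for illustration.
    """
    # Writes made after the last recovery point and before the failure are lost.
    data_loss_window = failure_time - last_recovery_point
    # Time during which operations were unavailable.
    downtime = service_restored - failure_time
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }


# Made-up incident: last recovery point 15 s before the failure,
# service restored 10 minutes after it.
result = check_objectives(
    last_recovery_point=datetime(2020, 5, 4, 11, 59, 45),
    failure_time=datetime(2020, 5, 4, 12, 0, 0),
    service_restored=datetime(2020, 5, 4, 12, 10, 0),
    rpo=timedelta(seconds=20),
    rto=timedelta(minutes=15),
)
print(result["rpo_met"], result["rto_met"])
```

In this made-up incident the data-loss window is 15 seconds and the downtime is 10 minutes, so both targets are met. Tightening the RTO to 5 minutes would leave the RPO met but the RTO missed.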

The Nutanix disaster recovery journey

Let’s go through each of the disaster recovery methods supported by Nutanix.

1. Asynchronous replication (Async)

Asynchronous disaster recovery is configured by backing up a group of entities (VMs and volume groups) locally on the Nutanix cluster and, optionally, replicating them to one or more remote sites. Only schedules with an RPO of 60 minutes or more can be configured in this mode. Configuring Asynchronous DR provides more details on the implementation guidelines.

2. Near-Synchronous replication (NearSync)

Near-Synchronous disaster recovery builds on Async snapshots. With NearSync we support Lightweight Snapshots (LWS), which are OpLog-based markers that run on SSDs. Because taking an LWS is a constant-time, O(1), operation, the impact on user I/O is minimal. This architecture makes LWS highly scalable and distributed. LWSs are replicated continuously to the remote site. An intermediate snapshot is created every hour and retained for 6 hours, and one daily snapshot is created and retained for 5 days. The intermediate Async snapshots act as checkpoints that help with RTO. In AOS 5.17 we support an RPO as low as 20 seconds. Configuring NearSync disaster recovery provides more details on the implementation guidelines.
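The retention rules above (hourly snapshots kept for 6 hours, one daily snapshot kept for 5 days) can be sketched as a small Python calculation. This is an illustration only: it assumes hourly snapshots land exactly on the hour and the daily snapshot at midnight, which the text does not state, and `retained_snapshots` is a hypothetical helper rather than a Nutanix function.

```python
from datetime import datetime, timedelta
from typing import Dict, List

HOURLY_RETENTION = timedelta(hours=6)  # from the text: hourly snapshots kept 6 hours
DAILY_RETENTION = timedelta(days=5)    # from the text: daily snapshot kept 5 days


def retained_snapshots(now: datetime) -> Dict[str, List[datetime]]:
    """List snapshot timestamps still retained at `now`, newest first.

    Assumption (not from the source): hourly snapshots are taken on the hour
    and the daily snapshot is taken at midnight.
    """
    hourly = []
    t = now.replace(minute=0, second=0, microsecond=0)
    while now - t < HOURLY_RETENTION:
        hourly.append(t)
        t -= timedelta(hours=1)

    daily = []
    t = now.replace(hour=0, minute=0, second=0, microsecond=0)
    while now - t < DAILY_RETENTION:
        daily.append(t)
        t -= timedelta(days=1)

    return {"hourly": hourly, "daily": daily}


snaps = retained_snapshots(datetime(2020, 5, 4, 13, 30))
print(len(snaps["hourly"]), "hourly,", len(snaps["daily"]), "daily")
```

Under these assumptions there are at most six hourly and five daily checkpoints available at any moment, in addition to the continuously replicated LWSs.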

3. Synchronous replication (Metro/Sync)

With Synchronous disaster recovery, we can achieve a zero RPO at VM granularity. Synchronous replication is supported between sites with under 5 ms of latency. To achieve continuous application availability and zero data loss, a secondary copy of all data, including VM data, VM metadata, and the Protection Policies applied to VMs, is maintained across two clusters. This ensures that no data is lost if a site fails, and it also makes live migration of VMs between sites easy to support.

Note:

All the above disaster recovery topologies can be managed through the Intelligent Operations (formerly Prism) UI.
Unplanned failover from the primary site to the secondary site is supported with all the above topologies.
Planned failover from the primary site to the secondary site is only supported with NearSync and Sync disaster recovery topologies.
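As a rough way to compare the three modes, the following Python sketch picks the least demanding replication mode that still satisfies a required RPO, using only the thresholds quoted above (zero RPO and under 5 ms latency for Sync, 20 seconds for NearSync, 60 minutes or more for Async). The function `choose_mode` is hypothetical, and mapping everything between 20 seconds and 60 minutes to NearSync is a simplifying assumption rather than a documented rule.

```python
from typing import Optional

SYNC_MAX_LATENCY_MS = 5.0    # from the text: Sync needs under 5 ms between sites
NEARSYNC_MIN_RPO_S = 20      # from the text: lowest NearSync RPO in AOS 5.17
ASYNC_MIN_RPO_S = 60 * 60    # from the text: Async schedules need RPO >= 60 minutes


def choose_mode(required_rpo_s: int, latency_ms: float) -> Optional[str]:
    """Return the least demanding mode that meets the RPO, or None if none does.

    Hypothetical helper; the thresholds are the ones quoted in this post.
    """
    if required_rpo_s < 0:
        raise ValueError("required_rpo_s must be non-negative")
    if required_rpo_s >= ASYNC_MIN_RPO_S:
        return "Async"
    if required_rpo_s >= NEARSYNC_MIN_RPO_S:
        # Assumption: any RPO between 20 s and 60 min is served by NearSync.
        return "NearSync"
    # Anything tighter than 20 s needs Sync, which has a latency requirement.
    if latency_ms < SYNC_MAX_LATENCY_MS:
        return "Sync"
    return None


for rpo, latency in [(0, 2.0), (0, 12.0), (20, 40.0), (7200, 80.0)]:
    print(f"RPO {rpo:>5}s, latency {latency:>5}ms -> {choose_mode(rpo, latency)}")
```

A `None` result means the requirement cannot be met between those two sites as described here, for example a zero RPO across a link with 12 ms of latency.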

Nutanix Multisite Replication

So far, we have looked at how individual disaster recovery topologies can help meet RPO and RTO requirements. By combining Sync and NearSync, we now provide the gold standard for protecting business-critical workloads.

Highlights of the multisite replication features:

Provide a zero data loss environment for customers with the most stringent requirements across multiple sites
0 RPO for sites within 400 km or less than 5 ms latency
20-second RPO for a recovery site with no distance limitation
30-minute RPO for a fourth site with no distance limitation
Disaster recovery orchestration can be done by VMware SRM or scripts

Let’s now look at specific multisite disaster scenarios and their recovery workflows with Nutanix disaster recovery.

Note: All our scenarios assume a multisite topology across four sites, A, B, C, and D, with the following configuration.

Site A is the primary site and Site C is the disaster recovery site
Sites A and B are in the Production Availability Zone
Sites C and D are in the Recovery Availability Zone
Sync Replication (0 RPO) between Sites A ↔ B
NearSync Replication (20 sec RPO) between Sites A ↔ C
Async Replication (30 min RPO) between Sites A ↔ D
There are four different clusters, one in each site

SCENARIO #1: Production Site Failure (Single Site - Primary cluster)

Recovery Procedure:

Metro Remote (Cluster B) has the most recent copy of the data
This data is sent to Site C (through an out-of-band snapshot)
Only a 20-second delta snapshot is transmitted
Snapshot received at Site C is activated on Recovery Cluster C
Synchronous replication is established from Site C to Site D and the application can resume
When Site A (Cluster A) is back online, a 20-second RPO can be established back to Site A (Cluster C to Cluster A)
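The reason only a small delta needs to travel in Scenario #1 can be shown with a toy model in Python. Writes are represented as `(timestamp_seconds, block_id)` tuples, and `delta_to_ship` is a hypothetical helper. Real LWS replication works on OpLog markers, not Python lists, so treat this purely as a sketch of the idea.

```python
from typing import List, Tuple

Write = Tuple[int, str]  # (timestamp in seconds, block identifier)


def delta_to_ship(metro_writes: List[Write], last_lws_time: int) -> List[Write]:
    """Writes held by the metro remote that the recovery site has not yet seen.

    Toy model: the recovery site is assumed to hold everything up to and
    including `last_lws_time`; only newer writes must be transmitted.
    """
    return [w for w in metro_writes if w[0] > last_lws_time]


# Cluster B (metro remote) mirrors every write made on Cluster A: one every 5 s.
metro_writes = [(t, f"blk-{t}") for t in range(0, 101, 5)]

# Cluster C received its last LWS at t=80; Site A fails at t=100.
delta = delta_to_ship(metro_writes, last_lws_time=80)
print(f"{len(delta)} of {len(metro_writes)} writes need to be sent to Site C")
```

Because Site C is never more than about 20 seconds behind in this topology, the out-of-band snapshot from Cluster B only has to carry that final window of writes rather than a full copy.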

SCENARIO #2: Complete Region Failure (Two Sites – Production Availability Zone)

Recovery Procedure:

DR Site C (Cluster C) has the snapshot that is 20 seconds old
This latest Snapshot at Site C is activated on Recovery Cluster C
Synchronous replication is established from Site C to Site D and the application can resume
When Site A (Cluster A) and Site B (Cluster B) are back online, 20-second RPO can be established back to Site A (Cluster C to Cluster A), and 30-minute RPO can be established back to Site A (Cluster D to Cluster A)
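Scenarios #1 and #2 follow the same rule: recover from the surviving replica with the lowest RPO. The Python sketch below models the four-site topology from the note above and applies that rule. The `Link` class and `freshest_survivor` function are illustrative names, not part of any Nutanix tooling.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    mode: str
    rpo_seconds: int


# Topology from the note above: all replication originates at primary Site A.
TOPOLOGY = [
    Link("A", "B", "Sync", 0),
    Link("A", "C", "NearSync", 20),
    Link("A", "D", "Async", 30 * 60),
]


def freshest_survivor(failed_sites: Iterable[str]) -> Optional[Link]:
    """Return the link to the surviving replica with the lowest RPO, if any."""
    failed = set(failed_sites)
    survivors = [link for link in TOPOLOGY if link.target not in failed]
    return min(survivors, key=lambda link: link.rpo_seconds, default=None)


for failed in (["A"], ["A", "B"]):
    best = freshest_survivor(failed)
    print(f"failed={failed}: recover from Site {best.target} "
          f"({best.mode}, up to {best.rpo_seconds}s behind)")
```

With only Site A down, the metro remote at Site B holds the most recent data. With the whole Production Availability Zone down, the 20-second-old copy at Site C is the freshest one left.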

SCENARIO #3: Data Corruption Restore

Recovery Procedure:

Restore either to any available 20-second LWS or to one of the most recent hourly snapshots
Changes are then propagated to all the other Sites (Clusters)
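Choosing a restore point after data corruption amounts to picking the newest recovery point taken before the corruption happened. Here is a minimal Python sketch of that choice. The helper name, the 15-minute LWS window, and the timestamps are assumptions made for the example; the source does not say how long individual LWSs are kept.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def pick_restore_point(points: Iterable[datetime],
                       corrupted_at: datetime) -> Optional[datetime]:
    """Newest recovery point strictly older than the corruption, or None."""
    clean = [p for p in points if p < corrupted_at]
    return max(clean, default=None)


now = datetime(2020, 5, 4, 14, 0, 0)

# Assumption for illustration: LWSs every 20 s are available for ~15 minutes.
lws_points = [now - timedelta(seconds=20 * k) for k in range(45)]
# Hourly snapshots retained for 6 hours, as described earlier in this post.
hourly_points = [now - timedelta(hours=h) for h in range(6)]
all_points = lws_points + hourly_points

recent = pick_restore_point(all_points, datetime(2020, 5, 4, 13, 58, 30))
older = pick_restore_point(all_points, datetime(2020, 5, 4, 12, 30, 0))
print("recent corruption ->", recent)
print("older corruption  ->", older)
```

When the corruption is recent, a 20-second LWS gives the closest clean state. When it predates the available LWSs, the restore falls back to the nearest earlier hourly snapshot.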

Summary

Nutanix data protection and disaster recovery let you configure protection for each application based on its criticality and business requirements
Now mission-critical applications can be protected with multiple copies stored in multiple sites and managed seamlessly through Nutanix Intelligent Operations
The above features are available in AOS 5.17 with VMware ESXi

© 2020 Nutanix, Inc. All rights reserved. Nutanix, the Nutanix logo and all Nutanix product, feature and service names mentioned herein are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. All other brand names mentioned herein are for identification purposes only and may be the trademarks of their respective holder(s). This post may contain links to external websites that are not part of Nutanix.com. Nutanix does not control these sites and disclaims all responsibility for the content or accuracy of any external site. Our decision to link to an external site should not be considered an endorsement of any content on such a site. Certain information contained in this post may relate to or be based on studies, publications, surveys and other data obtained from third-party sources and our own internal estimates and research. While we believe these third-party studies, publications, surveys and other data are reliable as of the date of this post, they have not independently verified, and we make no representation as to the adequacy, fairness, accuracy, or completeness of any information obtained from third-party sources.