In a world where uncertainty is certain and IT disasters come without warning, IT leaders cannot afford to take the risk of being ill-prepared.
Nutanix understands that uptime needs vary from application to application. For example, mission-critical applications that handle financial transactions, stock exchange trading, computerized hospital patient records, emergency call centers, and life support services need 24x7x365 uptime, while applications related to engineering services, government services, and DevOps might not have such strict requirements. In addition, IT operational challenges such as system upgrades, migrations, and data corruption cannot be ignored.
Adding to the complexity, most enterprises today are locked into a particular set of vendors and hypervisors. IT topology becomes even more complex, and operational challenges multiply, when multiple copies of application data must be synchronized and maintained in different physical and geographical locations. The need for such a complex topology often arises from compliance regulations, collaboration across distributed business operations, and regional data privacy laws.
Consequently, IT systems must be resilient enough to handle faults and disasters to ensure business continuity. A research report by the Ponemon Institute pegs datacenter outage costs at around $9,000 per minute.
Keeping these factors in mind, Nutanix has built high availability (fault management for disks, network cards, and power supplies) and data protection into its AOS platform. Our disaster recovery solution extends this continuous availability to multiple clusters through recovery plans and runbook planning. Disaster recovery is measured in terms of the recovery point objective (RPO), the maximum amount of data a customer is willing to lose; the recovery time objective (RTO), the time allowed to restore operations after an IT failure; and cost. A disaster recovery topology is a combination of replication and recovery orchestration. Nutanix supports a range of RPO and RTO targets through different disaster recovery topologies.
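To make the RPO and RTO definitions concrete, here is a minimal Python sketch that checks a single incident against both targets. It is illustrative only: the `RecoveryObjectives` class, the `evaluate_recovery` function, and the timestamps are hypothetical and are not part of any Nutanix API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rpo: timedelta  # maximum amount of data (measured in time) we accept losing
    rto: timedelta  # maximum time allowed to restore operations


def evaluate_recovery(objectives, last_recovery_point, failure_time, restored_time):
    """Compare one incident against the RPO and RTO targets (hypothetical helper)."""
    # Everything written after the last recovery point is lost.
    data_loss_window = failure_time - last_recovery_point
    # Downtime runs from the failure until operations are restored.
    downtime = restored_time - failure_time
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= objectives.rpo,
        "rto_met": downtime <= objectives.rto,
    }


# Example targets and timestamps, chosen purely for illustration.
objectives = RecoveryObjectives(rpo=timedelta(minutes=60), rto=timedelta(hours=2))
failure = datetime(2020, 5, 1, 12, 0)
result = evaluate_recovery(
    objectives,
    last_recovery_point=failure - timedelta(minutes=45),
    failure_time=failure,
    restored_time=failure + timedelta(hours=3),
)
print(result["rpo_met"], result["rto_met"])  # True False
```

In this example the last recovery point is 45 minutes old, so the 60-minute RPO is met, but the three-hour restore misses the two-hour RTO.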
Let’s go through each of the disaster recovery methods supported by Nutanix.
Asynchronous disaster recovery can be configured by backing up a group of entities (VMs and volume groups) locally to the Nutanix cluster and optionally configuring replication to one or more remote sites. Only schedules with an RPO of 60 minutes or more can be configured in this mode. Configuring Asynchronous DR provides more details on the implementation guidelines.
Near-Synchronous (NearSync) disaster recovery builds on Async snapshots. With NearSync we support Lightweight Snapshots (LWS), which are OpLog-based markers that reside on SSDs. Because taking an LWS takes constant time, O(1), the impact on user I/O is minimal. This architecture makes LWS highly scalable and distributed. LWSs are replicated continuously to the remote site. An intermediate snapshot is created every hour and retained for 6 hours, and one daily snapshot is created and retained for 5 days. These intermediate Async snapshots act as checkpoints that help with RTO. As of AOS 5.17, we support an RPO as low as 20 seconds. Configuring NearSync disaster recovery provides more details on the implementation guidelines.
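The retention schedule described above (hourly snapshots kept for 6 hours, one daily snapshot kept for 5 days) can be sketched as follows. This is a simplified model, not Nutanix code: the function name is hypothetical, and it assumes hourly snapshots land on the hour and the daily snapshot is taken at midnight.

```python
from datetime import datetime, timedelta

HOURLY_RETENTION = timedelta(hours=6)  # from the text: hourly snapshots kept 6 hours
DAILY_RETENTION = timedelta(days=5)    # from the text: daily snapshot kept 5 days


def retained_snapshots(now, history=timedelta(days=7)):
    """Return (hourly, daily) snapshot timestamps still retained at `now`.

    Assumption: hourly snapshots are taken on the hour and the daily
    snapshot is taken at midnight. Real schedules may differ.
    """
    hourly, daily = [], []
    t = now.replace(minute=0, second=0, microsecond=0)  # most recent hourly slot
    oldest = now - history
    while t >= oldest:
        age = now - t
        if age < HOURLY_RETENTION:
            hourly.append(t)
        if t.hour == 0 and age < DAILY_RETENTION:
            daily.append(t)
        t -= timedelta(hours=1)
    return hourly, daily


hourly, daily = retained_snapshots(datetime(2020, 5, 10, 12, 30))
print(len(hourly), len(daily))  # 6 5
```

Under these assumptions, six hourly checkpoints and five daily snapshots are available at any given moment, in addition to the continuously replicated LWSs.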
With Synchronous disaster recovery, we can achieve zero RPO at VM granularity. Synchronous replication is supported between sites with less than 5 ms of latency. To achieve continuous availability of applications and zero data loss, a secondary copy of all data, including VM data, VM metadata, and the Protection Policies applied to VMs, is maintained across two clusters. This ensures that no data is lost if a site fails, and it also makes VM live migration between sites easy to support.
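The three methods can be summarized as a simple decision rule. The sketch below uses only the thresholds quoted in this article (Async for an RPO of 60 minutes or more, NearSync down to 20 seconds, Sync for zero RPO when inter-site latency is under 5 ms). The function and constant names are hypothetical, and picking the least demanding mode that meets the RPO is an assumption of this example rather than a Nutanix rule.

```python
from datetime import timedelta
from typing import Optional

ASYNC_MIN_RPO = timedelta(minutes=60)     # Async: RPO >= 60 minutes
NEARSYNC_MIN_RPO = timedelta(seconds=20)  # NearSync: RPO down to 20 seconds (AOS 5.17)
SYNC_MAX_LATENCY_MS = 5.0                 # Sync: sites under 5 ms latency


def choose_replication(rpo: timedelta, latency_ms: float) -> Optional[str]:
    """Pick the least demanding mode that satisfies the requested RPO.

    Returns None when no mode described in this article fits.
    """
    if rpo >= ASYNC_MIN_RPO:
        return "Async"
    if rpo >= NEARSYNC_MIN_RPO:
        return "NearSync"
    # Anything tighter than NearSync needs Sync, which depends on latency.
    if latency_ms < SYNC_MAX_LATENCY_MS:
        return "Sync"
    return None


print(choose_replication(timedelta(hours=4), latency_ms=30.0))    # Async
print(choose_replication(timedelta(minutes=1), latency_ms=30.0))  # NearSync
print(choose_replication(timedelta(0), latency_ms=2.0))           # Sync
print(choose_replication(timedelta(0), latency_ms=12.0))          # None
```

The last call shows the practical constraint: a zero-RPO requirement cannot be met between sites whose latency is 5 ms or more.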
Note:
So far, we have looked at how individual disaster recovery topologies can help meet RPO and RTO requirements. By combining Sync and NearSync, we now provide the gold standard for protecting business-critical workloads.
Highlights of the multisite replication features:
Let’s now look at specific multisite disaster scenarios and their recovery workflows with Nutanix disaster recovery.
Note: In all our scenarios we consider a multisite topology across four sites, A, B, C, and D, with the following configuration.