Simple Tips for Effective Disaster Recovery Testing
Disaster recovery testing gives you peace of mind that your data backups can be successfully restored when it matters most.
If you aren’t regularly testing your backups (or other recovery protocols), you could be in for a rude surprise when disaster strikes. A number of common problems can occur during the recovery process. And without ongoing testing, you’ll have no early warning that these issues exist:
- Data corruption that prevents your backups from being restored
- Unexpected delays that significantly extend the time it takes to complete the recovery
- Costly mistakes that further delay the recovery and negatively affect the business’s critical operations
So, what is disaster recovery testing exactly, and what’s the best way to do it?
Here’s everything you need to know.
A lesson in the importance of disaster recovery testing
In September 2019, Missouri-based Ferguson Medical Group (FMG) was hit with ransomware, locking up the organization’s computer systems and more than 107,000 patient records.
FMG did everything right: it worked quickly to notify law enforcement, secure its network and activate its recovery plan. And, since the organization had a data backup system, it ignored the ransom demand.
But there was a problem.
When it came time to restore the backups, some of the data couldn’t be retrieved. In fact, more than three months of patient records from the prior year – all data entered or modified from Sept. 20 to Dec. 31, 2018 – were permanently lost.
While FMG did not disclose details on why that data couldn’t be restored, it’s possible the loss could have been prevented with more aggressive disaster recovery testing.
What is disaster recovery testing?
Disaster recovery testing is the process of testing the systems and processes that help a business recover from a disruptive event. Testing typically applies to data backup systems, but it can also include all the protocols that recovery personnel should use following a disaster, as dictated by an organization’s disaster recovery plan.
Examples of disaster recovery testing include:
- Testing backups to ensure data can be restored
- Mock recovery tests that help recovery teams familiarize themselves with the process of restoring backups through a variety of methods
- Drills that test the activation of a disaster recovery plan and documented protocols
For the purposes of this post, we’ll largely be focusing on backup testing. But it’s critical that organizations test every component of their recovery planning to reduce the risk of unexpected problems after a disaster.
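As a rough illustration of the first example above (testing that data can actually be restored), here is a minimal Python sketch that compares a restored directory against the original, file by file. The function names `file_digest` and `compare_restore` are hypothetical, and the sketch assumes you still have access to the source data or a trusted copy of it. It is not how any particular backup product verifies restores.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_restore(source_dir: Path, restored_dir: Path) -> list:
    """List files that are missing from, or differ in, the restored copy."""
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue  # only file contents are compared in this sketch
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file():
            problems.append(f"missing: {rel.as_posix()}")
        elif file_digest(src) != file_digest(restored):
            problems.append(f"mismatch: {rel.as_posix()}")
    return problems
```

An empty list means every source file was found in the restored copy with identical contents. Anything else is worth documenting and investigating before a real disaster.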
Why test backups?
Data backups are an essential layer of protection for every business. However, they are notoriously unreliable during the recovery process, especially if you’re relying on older technology.
Backup integrity can be compromised in a number of ways. And if you aren’t regularly performing tests, you may not know there’s a problem until it’s too late.
Application errors, hardware failures, power outages and OS errors are just a few of the many ways that data can be corrupted, whether prior to being backed up or during the backup process.
Traditional incremental backups depend on a sometimes-fragile series of backups, referred to as the backup chain. This chain consists of the very first full backup and all the smaller incremental backups that follow. But if data corruption has occurred in any one of those incrementals, then it can compromise the integrity of the entire backup, preventing it from being restored.
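To make the chain dependency concrete, here is a simplified Python sketch. It assumes each backup in the chain has a checksum recorded at write time, and that a restore point is only usable if every link before it is intact. The `BackupLink` class and the helper functions are hypothetical stand-ins, not a real product’s format.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BackupLink:
    name: str      # hypothetical label, e.g. "full" or "incr-1"
    data: bytes    # stand-in for the backup file contents
    checksum: str  # SHA-256 recorded when the backup was written


def make_link(name: str, data: bytes) -> BackupLink:
    """Create a link and record the checksum of its data."""
    return BackupLink(name, data, hashlib.sha256(data).hexdigest())


def first_broken_link(chain: List[BackupLink]) -> Optional[int]:
    """Return the index of the first link whose data no longer matches its checksum."""
    for i, link in enumerate(chain):
        if hashlib.sha256(link.data).hexdigest() != link.checksum:
            return i
    return None


def restorable_points(chain: List[BackupLink]) -> List[str]:
    """Names of restore points that do not depend on a corrupted link."""
    broken = first_broken_link(chain)
    usable = chain if broken is None else chain[:broken]
    return [link.name for link in usable]
```

In this model, corruption in the second link of a three-link chain leaves only the first full backup restorable. That is the kind of problem a regular test would surface early.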
Problems reconstructing the chain
With traditional incremental backups, recovery begins by merging all those individual incrementals with the initial full backup to create a single file or image.
There are two issues that tend to occur during this process:
- It can take a very long time.
- It doesn’t always work.
Reconstructing incrementals is by no means “instant.” It can take hours, even when you’re dealing with relatively small volumes of data. And again, if data corruption has occurred, it could mean that the corrupted data can never be restored, or worse: the entire backup could be useless.
Even when the backup can be restored, the length of the recovery can be a nasty surprise for businesses that haven’t been doing testing.
An unexpectedly long recovery can have a costly impact on mission-critical systems. In a ransomware attack, for instance, the loss of data can cause operational downtime across the organization, costing tens of thousands of dollars per hour for smaller companies. For enterprises, that downtime can cost millions.
The problem here isn’t just the delay; it’s the disconnect between the anticipated recovery time (as outlined in the DR plan) and the actual results. When restoring a backup doesn’t go as expected, it can have reverberating effects across the business.
How to conduct disaster recovery testing
Ideally, the first step to testing your disaster recovery systems is deploying a backup system that facilitates comprehensive testing. In the sections below, we outline what you should be testing to ensure your backups can be recovered. But ultimately, what you can test is dictated by the limitations of your BC/DR solution.
If you’re in the process of evaluating backup systems, look for solutions that offer instant recovery options and the ability to test those recovery methods via multiple storage locations, such as on-premises devices and the cloud. “Instant recovery” typically refers to image-based backups that can be restored within seconds directly from a backup device or a virtual server, locally or in the cloud.
Define which recovery systems should be tested
We mentioned above that DR testing shouldn’t be limited to backups only. But recovery teams must be on the same page about what should be tested and when.
As part of your disaster recovery plan, you should define the scope of testing, outlining all the systems and processes that need to be tested. For example, testing could also apply to backup power generators, network hardware and even fire-alarm systems.
All of this should be spelled out prior to starting the testing process.
Start with your on-site backup devices
In most data-loss scenarios, you’ll be restoring data from the local backup device or server, so this is a good place to start your testing. If feasible, test a variety of recoveries, such as file-level and full image restores.
As you test, here are some things you should be measuring:
- How long does the recovery take?
- Does it meet your recovery time objectives (RTOs)?
- Were there any errors or unexpected problems?
If any issues arise, they should be documented and resolved as soon as possible. For example, if testing reveals repeated data corruption issues, IT managers should work to identify and troubleshoot the underlying causes.
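The three measurements above (duration, comparison against the recovery time objective, and errors) can be captured with a small wrapper around whatever restore step you are testing. This is only a sketch: `timed_restore` and `RestoreResult` are hypothetical names, and the `restore` callable stands in for your own restore command or script.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RestoreResult:
    seconds: float        # how long the restore took
    error: Optional[str]  # description of any failure, or None
    met_rto: bool         # True only if it succeeded within the RTO


def timed_restore(restore: Callable[[], None], rto_seconds: float) -> RestoreResult:
    """Run a restore step, timing it and recording any exception it raises."""
    start = time.monotonic()
    error = None
    try:
        restore()
    except Exception as exc:  # record the failure instead of aborting the test run
        error = f"{type(exc).__name__}: {exc}"
    elapsed = time.monotonic() - start
    return RestoreResult(elapsed, error, error is None and elapsed <= rto_seconds)
```

Logging each `RestoreResult` over time gives you a record of how actual recovery times compare with the figure in your DR plan.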
Repeat for cloud backups
If you’re replicating backups to a public or private cloud (and yes, you should be for stronger continuity), then you need to test those backups as well.
Depending on the scale of your testing and the capabilities of your BC/DR solution, recovering from the cloud will typically take longer than a local recovery. What you should be testing for is whether that recovery still meets your documented RTO and whether any problems surface during the process.
Virtualize locally and in the cloud
If your system allows it, backup virtualization enables you to boot your backup as a virtual machine. This provides the fastest access to your protected operating systems, applications and data, so that businesses can continue their critical operations through almost any disaster.
Virtualized recoveries must be tested regularly to ensure they perform as expected. This should be done via the on-site backup device and in the cloud, if your system offers these capabilities.
Speed and performance are key factors. This is especially true if you’re relying completely on the cloud to spin up your backups.
- Is there a noticeable drop in performance?
- Is the speed being hindered by on-site factors, such as network speed?
- Are critical applications usable and functioning properly?
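One simple way to start answering the last question is a scripted smoke test against the virtualized machine. The Python sketch below only checks whether each service accepts a TCP connection and how long that takes. The service names, hosts and ports you pass in are your own, and the function names are hypothetical. A successful connection shows that a port is listening, not that the application behaves correctly, so treat it as a first pass rather than a full functional test.

```python
import socket
import time


def check_service(host: str, port: int, timeout: float = 3.0):
    """Try a TCP connection; return (reachable, seconds_taken)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            reachable = True
    except OSError:  # refused, timed out, unreachable, DNS failure, etc.
        reachable = False
    return reachable, time.monotonic() - start


def smoke_test(services: dict, timeout: float = 3.0) -> dict:
    """Map each service name to True if its (host, port) accepted a connection."""
    return {
        name: check_service(host, port, timeout)[0]
        for name, (host, port) in services.items()
    }
```

For example, `smoke_test({"web": ("10.0.0.5", 443)})` would report whether a (hypothetical) recovered web server at that address is accepting connections.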
How often to test
The frequency of your testing should depend on the business’s unique recovery objectives. However, for most organizations, testing should happen year-round.
Here are some general guidelines:
- Consider doing local recovery testing once per quarter, since this will usually be the most common method for restoring data.
- For more comprehensive recovery scenarios, such as cloud failovers, consider testing at least twice a year.
- Whenever there are significant changes to a production environment, that’s a good time to run another DR test, regardless of the default testing schedule.
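These guidelines can be turned into a simple due-date calculation. The Python sketch below assumes roughly 91 days for a quarterly test and 182 days for a twice-yearly one, and treats a significant production change made after the last test as making a new test due right away. The interval values and the names `INTERVALS` and `next_test_due` are illustrative assumptions, so adjust them to your own recovery objectives.

```python
from datetime import date, timedelta
from typing import Optional

# Assumed intervals based on the general guidelines above.
INTERVALS = {
    "local_restore": timedelta(days=91),    # roughly once per quarter
    "cloud_failover": timedelta(days=182),  # roughly twice a year
}


def next_test_due(kind: str, last_tested: date,
                  last_major_change: Optional[date] = None) -> date:
    """Return the date the next test of this kind is due."""
    # A significant change since the last test overrides the default schedule.
    if last_major_change is not None and last_major_change > last_tested:
        return last_major_change
    return last_tested + INTERVALS[kind]
```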
Today’s best BC/DR solutions test your backups automatically to ensure they can be booted. This is typically referred to as backup verification or validation. It’s an automated process that actively monitors each new backup and notifies administrators that the backup was successfully completed and tested (or not).
In the event that problems occur, the system automatically alerts admins, so they can resolve the issue immediately. Some systems, such as the Datto SIRIS, offer customized control over this automation, allowing you to define how the backup should be tested. For example, you can add your own scripts to make the verification more intensive or remove values that are causing false positives.
A few last tips
We mentioned above how traditional backup systems are notorious for data corruption, due to problems in the backup chain. Newer solutions like the Datto SIRIS and ALTO solve this problem by eliminating dependence on the backup chain entirely. Datto’s Inverse Chain Technology stores each new snapshot in a fully constructed state, resulting in more resilient backups and eliminating the need to wait for incrementals to be reconstructed.
Finally, keep in mind that the first time you run a manual DR test is typically when the most problems arise. That’s the whole point of testing. By running tests on a regular basis, you’ll be able to continually uncover unexpected issues and resolve them before a real disaster strikes.