Best Practices for RTO Disaster Recovery Planning You Need to Know
Forty percent of businesses never reopen their doors after a disaster, according to FEMA. Even among those that initially survive, more than 70 percent fail within two years. The most frustrating part: the vast majority of these failures could be prevented with RTO disaster recovery planning.
Regardless of size or industry, every business needs a comprehensive business continuity plan (BCP). This is a working document that serves to identify the organization’s unique disaster risks, preventative measures and recovery solutions.
In essence, the BCP needs to answer the following questions:
- Which disaster scenarios put your business at risk?
- What is the impact of those disasters on your operations and revenue?
- How quickly could operations be restored after a critical event?
- What tools and protocols are needed to prevent an interruption in operations?
- What steps need to be taken to restore operations?
Today, we’re focusing specifically on question number three, which addresses one of the most important factors affecting your chances of survival after a disaster: your RTO.
What is RTO disaster recovery planning?
RTO is short for recovery time objective. This is the maximum amount of downtime that your business can experience following a disaster, before things start looking really bleak.
In the IT world, RTO disaster recovery generally refers to the recovery time of specific computer networks, data, applications, servers or other systems. It is the amount of downtime that a business can reasonably tolerate before the disaster becomes more devastating in terms of revenue loss, projected costs of survival or other factors.
- Example: if your business could not survive an email system outage of more than six hours, then your RTO would be six hours, at most.
Identifying your RTO, in relation to specific business systems, is thus a crucial part of your business continuity planning. It’s a starting point for determining what kind of interruption the business can withstand, and what actions must be taken to meet those recovery time objectives.
How to determine your RTO
Since a recovery time objective usually relates to operational costs and revenues, you will likely need to consult with different department managers and business units to establish the RTO. Ideally, this group of personnel will already be identified as the recovery team in your BCP.
Before you can determine the RTO, here’s what you need to determine:
- Cost of potential losses
A single hour of operational inactivity costs small businesses an average of $8,581, according to figures in The Atlantic. Other disasters can cost as much as $100,000 an hour. Carefully calculate the losses that your business will incur if various systems fail. Consider the costs of wages paid to idle workers, revenue losses, technology repairs, restoring lost data, government fines and any other expenses. Also consider the losses that may occur due to a damaged company reputation, which can be a significant factor when customers are impacted by your disaster.
- Critical dependencies
How are various operations dependent on a single business system, application or technology? Consider the impact of system failure across the organization. Identify the functions, services and processes that will grind to a halt (or even just slow down) if that single system were to fail.
- Possible workarounds
Your BCP should identify a Plan B (or Plan C, D, etc.) that may help to partially restore operations until a full recovery is completed. This is an ideal process for RTO disaster recovery planning, because it enables you to extend the recovery time objective.
- Losses incurred over time
Determine how the length of downtime will influence the cost of losses. For many businesses, this is not a straight, proportional line. Instead, you may discover that the rate of losses increases exponentially with each additional hour of downtime. Be sure to factor those potential increases into your RTO formula.
- Acceptable recovery time
Once you know what the disaster will cost in losses over time, you can determine the acceptable length of time for an outage to continue before it’s “too late.” That length of time is your RTO.
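To make the steps above concrete, here is a minimal Python sketch of how losses over time can translate into an RTO. It is an illustration under stated assumptions, not a formula from any standard: the $8,581 starting hourly loss comes from the figure cited above, while the 10 percent hourly growth in losses and the $100,000 maximum tolerable loss are hypothetical numbers you would replace with your own projections.

```python
def cumulative_loss(hours, base_hourly_loss, growth_rate):
    """Total loss after `hours` of downtime, assuming the hourly loss
    grows by `growth_rate` (0.10 means 10%) with each additional hour."""
    total = 0.0
    hourly = base_hourly_loss
    for _ in range(hours):
        total += hourly
        hourly *= 1 + growth_rate
    return total


def estimate_rto_hours(base_hourly_loss, growth_rate, max_tolerable_loss,
                       horizon_hours=24 * 90):
    """Largest whole number of downtime hours whose cumulative loss stays
    at or below `max_tolerable_loss`. Stops searching at `horizon_hours`."""
    total = 0.0
    hourly = base_hourly_loss
    for hour in range(1, horizon_hours + 1):
        total += hourly
        if total > max_tolerable_loss:
            return hour - 1
        hourly *= 1 + growth_rate
    return horizon_hours


# Hypothetical inputs: only the $8,581 figure comes from the article.
print(estimate_rto_hours(8581, 0.10, 100_000))  # -> 8
print(estimate_rto_hours(8581, 0.0, 100_000))   # -> 11
```

With these assumed inputs, accelerating losses shorten the acceptable outage from 11 hours to 8, which is why a flat hourly estimate can overstate how much downtime you can absorb.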
Depending on the type of outage, your RTO may be measured in weeks, days, hours, minutes or even seconds.
- Systems tied to critical business functions will naturally require a shorter RTO. Consider a major online retailer being knocked offline by a cyberattack. While companies like Amazon could likely survive a prolonged attack, despite losses in the millions of dollars, you can bet that these companies consider almost any amount of recovery time to be unacceptable. Thus, they put a staggering number of safeguards in place to minimize the risks of downtime. (This also reveals a somewhat subjective nature to identifying an RTO, which we’ll come back to in a minute.)
- Less critical systems may have an RTO of several weeks or even months. A single computer failure at a small business, for example, may not be immediately devastating. But without resolution, the losses incurred will eventually hit an unacceptable point, especially if they’re tied to idle workers and other dependencies.
The exact measurement of an RTO will be clearer once you’ve calculated potential losses as they relate to an acceptable level of recovery time.
RTO vs. RPO: What’s the difference?
RTO and RPO are both used frequently in disaster recovery planning. And while both are based on a similar concept, there is a fundamental distinction between the two terms.
RPO stands for recovery point objective. Whereas RTO defines an acceptable amount of time for recovery, RPO refers to an acceptable amount of data you can lose, measured in time. For example: 12 hours of data that wasn’t included in the most recent backup.
RPO is used to determine the appropriate recovery point for data backups. So, for example, if your organization determined that losing more than four hours of data would cause unacceptable losses or other adverse impact on operations, then your backup recovery points should be no more than four hours apart.
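As a rough illustration of the four-hour example, the sketch below checks whether a backup schedule leaves any gap longer than the RPO. The function names and the sample timestamps are hypothetical, and the check only looks at gaps between backups; it does not account for the time elapsed since the most recent backup.

```python
from datetime import datetime, timedelta


def largest_backup_gap(backup_times):
    """Longest interval between consecutive backups."""
    ordered = sorted(backup_times)
    gaps = (later - earlier for earlier, later in zip(ordered, ordered[1:]))
    return max(gaps, default=timedelta(0))


def meets_rpo(backup_times, rpo):
    """True if no gap between consecutive backups exceeds the RPO."""
    return largest_backup_gap(backup_times) <= rpo


# Hypothetical schedule: the last backup of the day runs six hours after the previous one.
backups = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 4, 0),
    datetime(2024, 1, 1, 8, 0),
    datetime(2024, 1, 1, 14, 0),
]
rpo = timedelta(hours=4)

print(largest_backup_gap(backups))  # longest gap in the schedule
print(meets_rpo(backups, rpo))      # False for this schedule
```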
Determining an RPO is just as critical as identifying an RTO. Both need to be adequately addressed within your continuity planning.
How firm is that O, anyway?
Keep in mind that a recovery time objective is exactly that: an objective. It does not necessarily have to identify the absolute “point of no return” after a disaster. Instead, it can be used as a realistic goal for your recovery teams.
Remember the Amazon example above? While large businesses have the resources to handle an extended outage, they also have the resources to set aggressive RTOs. More personnel, more preventive technologies and more comprehensive continuity planning enable these businesses to aim for the lowest recovery time objectives possible.
Important note: if you are setting an aggressive RTO, be sure it’s realistic and based on actual risk/loss projections. You will need to calculate exactly how much downtime the business can handle before the losses are too much to overcome.
3 reasons for RTO failure
As part of your business continuity plan, you should be testing your business’s resiliency on a regular basis. This can include mock recoveries and other drills to ensure your teams can meet the recovery time objectives you’ve identified.
But let’s face it – anything can happen in a disaster.
In a real-world event, many businesses fail to achieve their RTOs. Here are some common reasons why:
1) Unrealistic expectations
Don’t set an impossible RTO. Also, don’t pull your RTO out of thin air. Your recovery time objective will always be limited to the capabilities of the technologies that are restoring your systems, as well as the people managing that process.
Set realistic expectations based on a thorough analysis of your business’s unique disaster-recovery preparedness.
2) Misguided backup management
Many businesses fail to consider the bigger picture of disaster recovery. They aren’t backing up everything.
For example, some companies make the mistake of backing up all their data, but not their network configurations, system state information and other settings that are needed to restore both the servers and the network. If your RTO only accounts for the time to restore data, you may find that full recovery will actually take much longer.
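A quick back-of-the-envelope check shows how this gap appears. The stage names and hour estimates below are hypothetical, and the sketch assumes the stages run one after another rather than in parallel.

```python
def total_recovery_hours(stage_hours):
    """Sum the estimated hours for every recovery stage, assuming the
    stages run sequentially."""
    return sum(stage_hours.values())


# Hypothetical estimates, in hours, for illustration only.
stages = {
    "restore data": 3.0,
    "rebuild network configuration": 1.5,
    "restore system state and settings": 2.0,
}
rto_hours = 4.0

data_only = stages["restore data"]
full = total_recovery_hours(stages)

print(f"Data only: {data_only} h, within RTO: {data_only <= rto_hours}")
print(f"Full recovery: {full} h, within RTO: {full <= rto_hours}")
```

In this made-up case, the data restore alone fits inside a four-hour RTO, but the full recovery does not.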
3) Slow tape backup recovery
If you’re backing up to tapes, be aware of the limitations of recovery due to tape contention, which occurs when multiple recovery tasks compete for resources on the same tape.
Recovery specialist Michael Maliniak writes in Forbes, “When you are restoring [from tape], many times multiple restores need access to the same tape. This can cause your restores to run serially instead of in parallel, which greatly increases your restore times.”
Know the limitations of your backup technologies and the dependencies that will influence your recovery time.
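The effect of serial restores on recovery time can be estimated with a simple model. This sketch assumes that restores sharing a tape run one after another, that restores on different tapes run at the same time, and that there are enough drives for every tape. The tape names and durations are hypothetical.

```python
from collections import defaultdict


def restore_time_with_contention(jobs):
    """Elapsed hours when jobs on the same tape run serially and
    different tapes are read in parallel. `jobs` is a list of
    (tape_id, hours) pairs."""
    per_tape = defaultdict(float)
    for tape_id, hours in jobs:
        per_tape[tape_id] += hours
    return max(per_tape.values(), default=0.0)


def restore_time_fully_parallel(jobs):
    """Elapsed hours if every job could run at the same time."""
    return max((hours for _, hours in jobs), default=0.0)


# Hypothetical restore jobs: three of the four need the same tape.
jobs = [
    ("tape-A", 2.0),
    ("tape-A", 3.0),
    ("tape-A", 1.5),
    ("tape-B", 2.5),
]

print(restore_time_with_contention(jobs))  # 6.5
print(restore_time_fully_parallel(jobs))   # 3.0
```

If your RTO was set using the fully parallel estimate, the three jobs queued on one tape would more than double the actual restore time in this example.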
Speed up your recovery time with better backups
Learn more about dependable backup and recovery solutions that can recover data faster and virtually eliminate downtime. Talk to the business continuity experts at Invenio IT. Learn more at www.invenioIT.com, call (646) 395-1170 or email success@invenioIT.com.