Disaster recovery planning is like flossing – ignored until the pain becomes too great. The topic is large, the effort poorly defined, and the payoff for planning distant. Disaster recovery (DR) planning is often scheduled for “someday”, when priorities and money will somehow become available. Yet it is in the nature of computer systems to fail, so DR planning is not for an “if” but for a “when”.
This collision between a large indefinite task, limits on effort and money, and the certainty of one day needing a plan can lead to:
- Chaotic, inefficient, and overly stressful responses to system outages;
- Angry phone calls due to mismanaged expectations (“You said this would be up in an hour!”);
- Missing key pieces or vital dependencies (“Turns out we also need XYZ”);
- An inability to show progress from year to year (“This has been an audit point for several years”); and
- Poor communication with end users about what is needed from them (“What do you mean downtime procedures?”).
IT needs to drive the DR planning process, communicate what can and can’t be done, and be very clear on what the process needs from management (prioritization and budget) and end users (input and downtime procedures). The process should also be modularized, so that “quick wins” can be shown regularly, which will build momentum and internal support for further planning.
Our suggested sequence:
- Draft application priorities (a top 10 list is usually sufficient). Announce that anything that is not a priority will be “best effort”: recovery plans for non-priorities will be made at the time of the system outage.
- Circulate the draft priorities for review and feedback (users may surprise you with what they consider a priority).
- Approve the priorities with management (cf. 45 CFR 164.308(a)(7)(ii)(E) – Contingency Plan).
- Identify any dependencies such as networks, VM hosts, and storage, and assign them the appropriate priorities. For example, core network switches are a dependency for almost everything, so they are often assigned Priority 0.
- Estimate the existing Recovery Time Objective (RTO) for each of the Top 10 applications. Be conservative, and allow time for initial troubleshooting and for managing the disaster response.
- For each of the Top 10, identify some possible DR options to shorten the RTO. This could be as elaborate as split-processing between two physically separate VMs, or as simple as having a spare server on a shelf somewhere. Some applications will only have one option; others may have many.
- For each DR option, determine the cost in money and effort to implement it, as well as the reduction in possible outage time that the DR option would provide. Precise estimates are not strictly necessary at this point.
- Present each of the Top 10 priorities to management, with the current RTO, the possible options, each option’s cost, and the gain in recovery time that each option provides. It is then management’s responsibility to either authorize the investment of effort and money to improve an RTO, or to accept the risk of an outage.
- Implement those DR options that management has authorized, and be sure to test them.
- Publish the existing priorities and estimated RTOs, and task end users with developing downtime procedures for use during those RTOs.
- Once per year, review the Top 10 priorities and the existing RTOs with management, along with progress on implementing the authorized DR options.
- If you have fantastic success and complete all the desired improvements for the Top 10, give yourself a pat on the back and start again from the first step, but with priorities 11-20.
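The management presentation described in the sequence above boils down to a small table: for each priority application, the current RTO, each DR option, its rough cost and effort, and the recovery time it would save. The following is a minimal sketch of that arithmetic, not a prescribed tool. The class names, field names, application names, and all figures are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DROption:
    """One possible way to shorten an application's RTO (hypothetical fields)."""
    name: str
    cost: float         # rough cost estimate, in whatever currency you budget in
    effort_days: float  # rough implementation effort
    rto_hours: float    # estimated RTO if this option were implemented


@dataclass
class Application:
    name: str
    priority: int             # 1 = most important; 0 is reserved for core dependencies
    current_rto_hours: float  # conservative estimate of today's recovery time
    options: list = field(default_factory=list)


def summarize(apps):
    """Build one row per choice management can make, ordered by priority."""
    rows = []
    for app in sorted(apps, key=lambda a: a.priority):
        # Baseline row: doing nothing means accepting the current RTO.
        rows.append({
            "priority": app.priority, "application": app.name,
            "option": "accept current risk", "cost": 0.0, "effort_days": 0.0,
            "rto_hours": app.current_rto_hours, "hours_saved": 0.0,
        })
        for opt in app.options:
            rows.append({
                "priority": app.priority, "application": app.name,
                "option": opt.name, "cost": opt.cost,
                "effort_days": opt.effort_days, "rto_hours": opt.rto_hours,
                "hours_saved": app.current_rto_hours - opt.rto_hours,
            })
    return rows


if __name__ == "__main__":
    # Illustrative data only; substitute your own priorities and estimates.
    apps = [
        Application("Email", 2, 24.0, [DROption("Spare server on shelf", 5000.0, 3.0, 8.0)]),
        Application("Patient records", 1, 48.0, [
            DROption("Spare server on shelf", 5000.0, 3.0, 24.0),
            DROption("Split processing across two VMs", 40000.0, 30.0, 4.0),
        ]),
    ]
    for r in summarize(apps):
        print(f"{r['priority']:>2}  {r['application']:<16} {r['option']:<34} "
              f"cost={r['cost']:>8.0f}  RTO={r['rto_hours']:>5.1f}h  "
              f"saved={r['hours_saved']:>5.1f}h")
```

Including an explicit “accept current risk” row for every application keeps the decision where the process says it belongs: management either funds an option or knowingly accepts the existing RTO. Since precise estimates are not required at this stage, round numbers are enough.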
An ideal disaster recovery planning process must satisfy several constituencies: management, IT, and end users.
- Management needs to know what options are available, what those options will cost, and that the ultimate decision is up to them on how much to prepare for a disaster.
- IT needs to know what to work on, in what order, and how far to go in planning for the inevitable system outages.
- Users need to know what to expect – how long systems will be unavailable, in what order they will be restored, and what to do in the meantime.
A sensible DR planning process that focuses on the organization’s true priorities will meet the needs of these three audiences, and give comfort to your entire organization that you have a plan.