This document may not be copied by any means without the prior consent of the author. (Contact information for the author appears at the end of this document.)
A disaster recovery plan is a document that defines the policies and procedures for dealing with various types of disasters that can affect an organization, especially the organization's IT (Information Technology) infrastructure. A disaster is any event that has a significant impact on an enterprise's ability to conduct normal business. This plan includes the information and procedures needed to resume an organization's operation after some sort of disaster. Sometimes the plan is split into several plans, one to address recoverable disasters (e.g., loss of a server) and a more comprehensive business continuity plan for use in total loss situations (e.g., a hurricane takes out New Orleans and businesses must relocate to resume).
A disaster recovery plan may be known by different names:
business continuity planning (BCP),
business resumption planning,
emergency response planning,
business continuity management,
security and risk analysis, and others.
Generally speaking, a DRP deals with disasters
from an IT
perspective, while BCP deals with disasters from a
business point of view (e.g., issues such as relocating an office
or key personnel replacements).
DRPs should be developed by both management and senior system administrators working together.
The following list of reasons for having a DRP was adapted from an article posted at informit.com, "Legal Requirements for Disaster Recovery Planning: Common Facts and Misconceptions," by Leo Wrobel, dated Aug 3, 2007.
Many organizations question the value (in a business sense) of the possibly high cost of developing and maintaining such plans. Even a small business or organization, if dependent on an IT infrastructure, should develop some sort of disaster recovery plan.
In many cases there are laws and regulations that will require an organization to have such a plan. In addition, some other laws and regulations require an organization to be able to provide certain information. This implies that the information must be kept safe in case of a disaster, and thus implies an organization will need a disaster recovery plan. Examples include (but are by no means limited to):
all users, owners, and operators of the bulk power system in the United States. As of 2006 the FERC is still creating the ERO. Once this is done the ERO will establish reliability standards.
insure [sic] optimal reliability, security, interoperability and interconnectivity of, and accessibility to, public communications networks and the internet. Unfortunately there doesn't seem to be any provision that requires telecommunications carriers to have a disaster recovery plan. Worse, most telecoms (and Internet Service Providers) waive most liability.
individually identifiable health information, past, present, or future. This requires covered entities to ensure the confidentiality, integrity, and availability of all electronic protected health information (ePHI) that the covered entity creates, receives, maintains, or transmits. It also requires entities to protect against any reasonably anticipated threats or hazards to the security or integrity of ePHI, protect against any reasonably anticipated uses or disclosures of such information that are not permitted or required by the Privacy Rule, and ensure compliance by their workforce. Required safeguards include application of appropriate policies and procedures, safeguarding physical access to ePHI, and ensuring that technical security measures are in place to protect networks, computers and other electronic devices. You can read more about this from the NIST HIPAA Guide.
In 2009, HIPAA grew more teeth with the passage of ARRA (The American Recovery and Reinvestment Act), specifically the HITECH Act. HITECH Subtitle D details many changes to privacy, breach notification, required audits, and increased penalties.
A few other laws and regulations that may mandate having a disaster recovery plan (or indirectly require one) include:
You as system administrator are at least partially responsible for making your organization compliant with applicable laws and regulations that affect the organization's IT infrastructure. You should have a talk with the organization's legal representative and make sure they find out what requirements pertain to your situation. Note that even for a lawyer, there is no way to easily or accurately keep up with all the changes in the laws and regulations. Every so often a reassessment should be made to make sure your plans are in compliance. (I.e., when the laws or regulations change, when your contracts change, when the nature of your business changes or expands.) In some cases you may be required to review your DRP annually or every few years. Typically the DRP should be reviewed once every one to three years.
Disaster recovery is all about risk management. The cost of ignoring disasters can be very high, including total collapse of the enterprise. As noted above, in most cases there are legal penalties for not having a proper plan. Building a good DRP requires an understanding of your business, your IT infrastructure, and your legal, regulatory, and fiduciary responsibilities.
The first step is to understand the risks your enterprise faces. This is often called a Business Impact Analysis. You need to answer the following questions:
(For example, for a college you could ask, "Could the system being down impact instruction, research, grants, or external funding?")
Before working out disaster recovery policy or procedure documents, you must know what the budget will be. The budget is the annualized cost of doing nothing. This in turn requires a risk analysis, which requires determining the risks for the systems identified in the business impact analysis. Doing a risk analysis is difficult and is not needed frequently, so the best advice may be to use a consultant who specializes in that area. The budget should be documented in your DRP. You can roughly determine a budget for each type of disaster that has a significant probability of occurrence with this formula:
$$\text{budget}_{\text{disaster}} = \text{cost}_{\text{disaster}} \times \text{probability}_{\text{disaster}}$$
Implementing a DRP includes implementing various mitigation measures.
Mitigation measures are techniques, policies, and procedures that either reduce the impact (expense) of a disaster, reduce the time to restore vital services, reduce the probability of the disaster occurring, or some combination of these. Mitigation measures are also known as mitigation strategies, measures, preventative measures, or simply mitigations.
Mitigation measures can include various types of disaster insurance (some insurance companies offer specialized IT insurance policies for this). Often several mitigation measures can be used together, with one amplifying the effect (or reducing the cost to implement) of the others. When performing the risk analysis the various possibilities can be compared by applying this formula for each possible group of mitigation options, summing the dollar amounts for each type of disaster:
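$$\left( \text{cost}_{\text{disaster}} - \left( \text{savings}_{\text{mitigation}} - \text{cost}_{\text{mitigation}} \right) \right) \times \text{probability}_{\text{disaster}}$$
(Here $\text{probability}_{\text{disaster}}$ is the probability of the disaster occurring in one year, after applying mitigations.)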
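The comparison described above can be sketched in a few lines of Python. This is only an illustration: the disaster types, dollar amounts, probabilities, and the names `Scenario` and `expected_annual_cost` are all made up for the example, and a real risk analysis would supply its own figures.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Scenario:
    """One disaster type under one group of mitigation options (hypothetical data)."""
    cost_disaster: int        # dollars lost if the disaster occurs, unmitigated
    savings_mitigation: int   # dollars of loss avoided thanks to the mitigations
    cost_mitigation: int      # dollars spent on the mitigations
    probability: Fraction     # chance per year, after applying mitigations


def expected_annual_cost(scenarios):
    """Sum (cost - (savings - mitigation cost)) * probability over all disaster types."""
    return sum(
        (s.cost_disaster - (s.savings_mitigation - s.cost_mitigation)) * s.probability
        for s in scenarios
    )


# Two candidate groups of mitigation options, each covering the same two
# (invented) disaster types: a server room fire and an extended power outage.
options = {
    "do nothing": [
        Scenario(500_000, 0, 0, Fraction(1, 100)),
        Scenario(200_000, 0, 0, Fraction(1, 20)),
    ],
    "sprinklers + generator": [
        Scenario(500_000, 300_000, 20_000, Fraction(1, 100)),
        Scenario(200_000, 150_000, 10_000, Fraction(1, 50)),
    ],
}

# Pick the group with the lowest expected annual cost.
best = min(options, key=lambda name: expected_annual_cost(options[name]))
```

Exact fractions are used for the probabilities so that small per-year chances do not pick up floating-point rounding noise. Any mitigations that are legally required would have to be included in every candidate group before comparing.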
Some mitigation measures can be very effective and inexpensive, while
others can be very expensive but effective.
(Some measures are not effective at all, in some situations.)
Also remember that some mitigation measures are likely required
for any organization regardless of cost.
Some examples: If your site is in an area where the risk of a serious flood occurring in the next 12 months is 1 chance in a million, and the cost of the flood would likely be $10 million, then the cost of doing nothing is expected to be $10 per year. (This is the budget amount.) In this case it wouldn't be cost effective to implement any mitigations that cost money, not even buying flood insurance. However, you may have legal requirements to have a flood evacuation plan for employee safety, or to implement some other minimal mitigations. You could also implement low cost mitigations, such as putting critical IT servers on the second floor instead of the ground floor (when moving in; changing this later will have a cost).
On the other hand, if you are located in an area where the chance of an earthquake costing a probable $50 million to recover is 1 chance in 5,000 per year, the budget is expected to be $10,000. You should consider various combinations of mitigations that give the most savings after expenses, and that include all required mitigations. (I.e., you choose the group of mitigations with the best benefit to cost ratio.)
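The arithmetic in the two examples above can be checked with a short Python sketch of the budget formula. The function name `annual_budget` is just a label chosen for this illustration; the dollar amounts and probabilities are the ones given in the text.

```python
from fractions import Fraction


def annual_budget(cost_disaster, probability_per_year):
    """Annualized cost of doing nothing: cost of the disaster times its yearly probability."""
    return cost_disaster * probability_per_year


# Flood: $10 million loss, 1 chance in a million per year.
flood = annual_budget(10_000_000, Fraction(1, 1_000_000))

# Earthquake: $50 million loss, 1 chance in 5,000 per year.
earthquake = annual_budget(50_000_000, Fraction(1, 5_000))

print(flood, earthquake)  # prints: 10 10000
```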
A DRP is composed of several related documents. There are two different types of documents to consider: policy and procedure documents. The policy documents say what to do but should not mention any means for doing so. Policy documents also define expected behaviors of employees. On the other hand procedure documents say how to perform specific tasks, in order to fulfill some policy. The two types of documents should be cross-referenced, but if small enough (e.g., for a small home / small office, or SOHO), the two documents could be combined into a single document.
Often the corporate leaders decide overall policy, and it is up to others (such as management and senior system administrators) to design the specific, detailed policies and the procedures that implement them. Owing to different applicable laws in different locations, each site usually must design and implement the specific policies and procedures independently.
Since creating sensible IT policy documents is difficult (often the obvious policy is not the wisest policy), in many cases IT staff are involved in setting policies, not just the procedures. If not, and unrealistic policies are handed down from on high, a system administrator should try to find a way to suggest changes that won't embarrass management (which would not be a good way either to effect changes or to get a raise).
As noted earlier an organization often hires a specialist consultant
to develop these documents, hopefully one well versed in
local laws and applicable regulations.
(A google.com search for "Disaster Recovery Planning" will turn up many.)
There are many related policies to DRP, including security
and backup policies.
Often it can save money to have a consultant help with all
related documents, at the same time.
DRP documents are critical and highly confidential. You should never place a real one on a web server, for example, unless you are sure that those web pages cannot be accessed by non-employees. At the same time it is important that all the people involved can reach copies of the current policy and procedure documents both from their office and from an off-site location. (Use the security features of an Internet web server, or a separate intranet web server not accessible by outsiders.) Copies must be available (especially) during a disaster, even during a power loss, and from a remote location (such as an evacuation shelter).
The policies and procedures must be very detailed.
Vague directions won't be followed!
For example, in the event of a server being attacked and wiped out by a hacker, a procedure that simply says "notify the police" is not likely to work. Have you called the police to see if they handle this sort of problem? If so, who (which department) should be called and what information will they need? It is likely local police won't handle this sort of problem and the correct procedure might be to notify a different law enforcement group; or instead the various senior managers, the public relations office, and/or maybe the company's insurance company (to file a claim) should be notified.
Of course the technical procedures must be spelled out too
(e.g., the procedure for restoring the server from a backup,
or activating a standby server).
Any DRP should clearly address these issues at a minimum:
The disaster recovery policy can be thought of as a contract between an organization and those providing services (either in-house system administrators or outside contractors). Viewed this way, the DRP provides what is known as a service level agreement, or SLA. The SLA states policies such as what services are provided and a time frame for recovery from different types of disasters. Your policy should have a clear SLA so others know what to expect for recovery times of various services.
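One way to make the recovery time frames in an SLA concrete is to record them as data. The following Python sketch assumes three invented services with invented recovery targets; the names `SLA_RECOVERY_TARGETS` and `met_sla` are hypothetical, and real targets would come from your own policy.

```python
from datetime import timedelta

# Hypothetical services and maximum recovery times promised by the SLA.
SLA_RECOVERY_TARGETS = {
    "email": timedelta(hours=4),
    "web server": timedelta(hours=8),
    "payroll": timedelta(days=2),
}


def met_sla(service, actual_downtime):
    """Return True if the service was restored within its promised recovery time."""
    return actual_downtime <= SLA_RECOVERY_TARGETS[service]
```

For example, `met_sla("email", timedelta(hours=3))` is true under these assumed targets, while a nine-hour web server outage would miss its target. A table like this is also easy to publish so others know what to expect.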
Contact information includes persons and organizations that should be informed of various types of disasters (the company president and/or board of trustees, campus deans, major customers, the local media including radio, TV, ...), insurance agents, etc.
Contact information includes service provider contact information (e.g., the electrician, the plumber, security company, police, ISP, gas, water, etc.). Often the organization's webmaster must be notified in order to post updated information on a web server or to switch to some backup website, so include that information too. Service contact information should include names, titles, phone numbers (work, home, and cell), email addresses (not your local email!), and account numbers. This must be kept handy in hard copy. An off-site copy must be maintained too.
Note that in any policy or procedure document, specific locations and other information may change over time. It is easier if you put this data into an appendix and use generic phrases such as "off-site backup storage facility" rather than specific addresses. The same holds for contact names. Assign tasks to function titles (or roles) and only use these title names in the DRP. Note a given role (say "Plumber") may be filled by more than one person/company, and also a given person/company may serve in several capacities at the same time.
The contact information appendix of a DRP should be sorted by role and generic names, and should list the names of companies, organizations, or people that currently fulfill those functions and the specific locations (and other data) that currently fill those generic names. The date of last update should be included.
Such contact and related information in this appendix should be regularly maintained. (And this task should be assigned to someone in the DRP!)
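As a rough illustration, the contact appendix described above can be modeled as a mapping from role or generic name to the parties currently filling it, each with a last-update date. Everything in this Python sketch (the `Contact` fields, the sample entries, the one-year threshold) is an assumption made for the example, not a prescribed layout.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Contact:
    """Whoever currently fills a role or generic name (sample data only)."""
    name: str
    phone: str
    email: str
    last_updated: date


# Appendix keyed by role / generic name, as used in the body of the DRP.
APPENDIX = {
    "Plumber": [
        Contact("Example Plumbing Co.", "555-0100",
                "dispatch@plumber.example", date(2024, 1, 15)),
    ],
    "Off-site backup storage facility": [
        Contact("Example Vaults Inc.", "555-0199",
                "support@vaults.example", date(2025, 3, 1)),
    ],
}


def stale_entries(appendix, today, max_age_days=365):
    """List (role, name) pairs whose entry has not been updated recently."""
    return sorted(
        (role, contact.name)
        for role, contacts in appendix.items()
        for contact in contacts
        if (today - contact.last_updated).days > max_age_days
    )
```

Whoever is assigned the maintenance task could run a check like `stale_entries(APPENDIX, date.today())` periodically to see which entries need to be confirmed.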
One last point: In a disaster it may not be possible to reach some of the key personnel listed. Also if the disaster is protracted, those with families may not be able to stay to handle your organization's disaster. You should make a clear chain of command, so if someone is unavailable everyone knows who will then take charge.
You can't document every type of disaster that might befall an organization's servers and networks. (For example, few companies have a plan on how to handle a swarm of moths shorting out computers.) Make sure your plan includes some general policy guidelines to cover any cases not specifically mentioned. In fact this can help keep your documents much shorter than otherwise.
Some types of disasters you should specifically plan for include:
Low (or no) cost mitigations should be used whatever the budget. Some of these are discussed below (Avoiding Disasters). Often a group of mitigations can be used very effectively together.
A vital step to take in advance is to determine exactly
who is responsible for what.
As mentioned previously
(Contact and Other Data),
the best way to document this is to come up with roles
(for example, the person in charge).
Then write the DRP using only the role names, clearly
indicating the responsibilities of each.
In an appendix you can then list the current personnel that are
assigned each role, including phone, fax, home phone, email, and
any other contact information.
Note a given person may be assigned multiple roles.
In a small company a single person may have all roles.
On the other hand, in a large organization you may need several
people to fulfill a single role (such as handling the phones).
The person in charge is usually in upper management. It is a mistake to list an IT person as in charge, even a senior administrator. The person in charge must have the authority to make policy decisions, such as closing the school early or directing the PR contact to make announcements to the media. However, the policy should be that the person in charge must consult with some IT personnel before making vital IT related decisions. A foolish decision made without understanding the technical issues involved can cost dearly.
An often overlooked step is to implement DRP training for key (or all) personnel. Without some training it is unlikely your plan will prove effective once a disaster occurs.
Remember to establish a clear chain of command. If some key person is unavailable, without a chain of command nobody will know who is in charge or who reports to whom.
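A chain of command is essentially an ordered list of roles, where the first available role takes charge. The short Python sketch below illustrates the idea; the role names in `CHAIN_OF_COMMAND` and the function `acting_lead` are invented for the example.

```python
# Hypothetical roles, listed from highest to lowest authority.
CHAIN_OF_COMMAND = [
    "Person in charge",
    "Deputy person in charge",
    "Department head",
]


def acting_lead(chain, available_roles):
    """Return the highest-ranking role that can currently be reached, or None."""
    for role in chain:
        if role in available_roles:
            return role
    return None
```

If the person in charge cannot be reached, `acting_lead(CHAIN_OF_COMMAND, {"Deputy person in charge", "Department head"})` returns `"Deputy person in charge"`. A `None` result means the chain is too short for the situation, which is itself worth discovering before a real disaster.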
There are a number of techniques that can be used to reduce or eliminate the probability of some disasters. (Of course you can't completely eliminate the risk of disasters!) These mitigation measures often also reduce the cost or time needed for disaster recovery. You should use as many of these mitigation strategies as makes sense for your DRP:
/etc/* files), network maps (showing connections, IP address assignments, DNS data, etc.), serial numbers for all equipment, software keys, licenses and permits, room keys (and combinations for locks), and any other security information (such as the root password for your servers).
Once a disaster is imminent, has occurred, or is occurring, you need to activate the relevant DRP procedures. (Of course you are already well prepared!) The first step is to locate and review your copy of the DRP.
You must understand your DRP role. Know who you must notify, especially to protect legal rights and to avoid charges of negligence. Be certain you understand the chain of command; know who you should report to and take direction from, and who should report to you.
You must know your company policy regarding disasters, especially break-ins and other attacks. Some common policies include phoning the corporate attorney, the president or board members, and others in the company, and letting them follow up. A company may fear negative publicity more than the loss from a disaster, so the policy may be not to report the problem to anyone outside the company. Sometimes you report to the person in charge of publicity (marketing) and let them choose.
When planning your DRP you can contact your ISP and local law enforcement to see what procedures they recommend. Often government agencies such as the FBI (www.fbi.gov), police or other local law enforcement, FEMA (www.fema.gov), CERT (www.cert.org), US-CERT (www.us-cert.gov), and others should be notified in the event of an attack (although the FBI won't take action unless the loss is above $5000 or so, and won't give priority unless the loss is much greater).
If the loss affects customers, various laws and regulations may require you to report the disaster even if your company would prefer you not to. You should become familiar with the laws governing your organization's particular situation. Even if reporting is not required, in some cases the policy may be to report the problem to major customers.
In real life an attorney is consulted early to determine policies and procedures to follow that are required by law or by industry regulations or that are just a good idea to limit your company's legal liability. (For your class project it is OK to make this stuff up; that is you can pretend a lawyer said that you must have daily backups off-site, that you must notify the police, etc.)
As a professional system or network administrator you have a responsibility to obey the laws and applicable regulations. If you feel your organization's policy is illegal or unethical you should work to resolve the issue early on. Otherwise you may be required to "whistle-blow" when a disaster strikes; this will not enhance your job prospects!
When a disaster is imminent it is a good time to perform backups, update system journals, contact backup sites (to let them know to get ready), send the documents and backups that need to be off-site, and other preparation steps. These are called proactive measures. This is also a good time to obtain a hardcopy of the current DRP and review it.
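As one small illustration of a proactive measure, the Python sketch below writes a timestamped compressed archive of a directory, ready to be sent off-site. It uses only the standard library; the function name `make_backup`, the archive naming scheme, and the directories passed to it are assumptions for the example, and a real site would follow its own backup procedure and tools.

```python
import tarfile
from datetime import datetime
from pathlib import Path


def make_backup(source_dir, dest_dir, now=None):
    """Write a timestamped .tar.gz of source_dir into dest_dir and return its path."""
    source = Path(source_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Name the archive after the directory and the time of the backup.
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    archive = dest / f"{source.name}-{stamp}.tar.gz"

    # Add the directory (recursively) under its own name inside the archive.
    with tarfile.open(str(archive), "w:gz") as tar:
        tar.add(str(source), arcname=source.name)
    return archive
```

The timestamp in the file name makes it easy to tell which copy is the most recent when several archives end up at the off-site location.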
The specific steps to take after a disaster has struck are usually known as reactive measures. Some specific measures that should be addressed in the case of a school such as HCC include:
Besides disaster recovery, a company may (and should) have other policy and procedure documents. You may be asked to write such documents related to IT. You need to cover such topics as acceptable use of company equipment (e.g., the computers), data (e.g., customer lists), and services (e.g., email), strategic plans to replace desktop computers every so often, and so on. In addition you should inform employees about any privacy policies and related matters (e.g., password policies).
Policies and procedures that employees need to know about should be accessible, including items such as equipment use forms, account request forms, password reset procedure, etc. A good idea is to use a web server for all this and include an index and a search engine if you have a lot of documents.
Disaster Recovery Planning: Preparing For The Unthinkable, by Jon Toigo.