Anatomy of a Disaster Recovery Plan
A DRP is composed of several related documents. There are two different types of documents to
consider: policy and procedure documents. The policy documents say what to do but should not
mention any means for doing so. Policy documents also define expected behaviors of employees.
Procedure documents, on the other hand, say how to perform specific tasks in order to fulfill some
policy. The two types of documents should be cross-referenced, but if they are small enough (e.g., for a
small office/home office, or SOHO), the two could be combined into a single
document.
Often the corporate leaders decide overall policy, and it is up to others (such as management and
senior system administrators) to design the specific, detailed policies and the procedures that
implement them. Owing to different applicable laws in different locations, each site usually must
design and implement the specific policies and procedures independently.
Since creating sensible IT policy documents is difficult (often the obvious policy is not the
wisest policy), in many cases IT staff are involved in setting policies, not just the procedures. If
not, and unrealistic policies are handed down from on high, a system administrator should try to
find a way to suggest changes that won't embarrass management (which would not be a good way either to
effect changes or to get a raise). As noted earlier, an organization often hires a specialist
consultant to develop these documents, ideally one well versed in local laws and applicable
regulations. (A web search for "Disaster Recovery Planning" will turn up many.) There are many
policies related to the DRP, including security and backup policies. Often it can save money to have a
consultant help with all related documents at the same time.
DRP documents are very critical and highly confidential. You should never place a real one on a web
server for example, unless you are sure that those web pages cannot be accessed by non-employees.
At the same time it is important that all the people involved have copies of the current policies
and procedures documents both from their office and from an off-site location. (Use the security
features of an Internet web server, or a separate intranet web server not accessible by outsiders.)
Copies must be available (especially) during a disaster, even during a power loss, and from a
remote location (such as an evacuation shelter).
The policies and procedures must be very detailed. Vague directions won't be followed! For example,
in the event of a server being attacked and wiped out by a hacker, a procedure that simply says
"notify the police" is not likely to work. Have you called the police to see if they handle this sort
of problem? If so, who (which department) should be called and what information will they need? It
is likely local police won't handle this sort of problem and the correct procedure might be to
notify a different law enforcement group; or instead the various senior managers, the public
relations office, and/or maybe the company's insurance company (to file a claim) should be
notified. Of course the technical procedures must be spelled out too (e.g., the procedure for
restoring the server from a backup, or activating a standby server).
Any DRP should clearly address these issues at a minimum:
• Who will be in charge (and the chain of command)
• Who will be the PR contact (i.e., who will handle the phone)
• Who must be informed
• What must be done regularly (and when)
• What must be done when a disaster is imminent
• What must be done during a disaster
• What must be done after a disaster has struck
Service Level Agreements (SLA)
The disaster recovery policy can be thought of as a contract between an organization and those
providing services (either in-house system administrators or outside contractors). Viewed this way,
the DRP provides what is known as a service level agreement, or SLA. The SLA states policies such as
what services are provided and a time frame for recovery from different types of disasters. Your
policy should have a clear SLA so others know what to expect for recovery times of various
services.
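One way to make recovery-time expectations concrete is to record them as data that can be checked after an incident. The following is a minimal sketch only: the service names, the time limits, and the helper `sla_met` are illustrative assumptions, not values from any real SLA.

```python
from datetime import timedelta

# Hypothetical SLA table: service name -> maximum acceptable recovery time.
# Services and durations are invented for illustration.
SLA_RECOVERY_TARGETS = {
    "web server": timedelta(hours=4),
    "email": timedelta(hours=8),
    "file shares": timedelta(days=1),
}


def sla_met(service: str, actual_downtime: timedelta) -> bool:
    """Return True if the service was restored within its SLA target."""
    return actual_downtime <= SLA_RECOVERY_TARGETS[service]
```

For example, `sla_met("web server", timedelta(hours=3))` reports whether a three-hour web outage stayed within the assumed four-hour target.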
Contact and Other Data
Contact information includes persons and organizations that should be informed of various types of
disasters (the company president and/or board of trustees, campus deans, major customers, the local
media including radio, TV, ...), insurance agents, etc.
Contact information includes service provider contact information (e.g., the electrician, the
plumber, security company, police, ISP, gas, water, etc.). Often the organization's webmaster must
be notified in order to post updated information on a web server or to switch to some backup
website, so include that information too. Service contact information should include names, titles,
phone numbers (work, home, and cell), email addresses (not your local email!), and account numbers.
This must be kept handy in hard copy. An off-site copy must be maintained too.
Note that in any policy or procedure document, specific locations and other information may change
over time. It is easier if you put this data into an appendix and use generic phrases such as
"off-site backup storage facility" rather than specific addresses such as "123 ReelToReel Lane". The
same holds for contact names. Assign tasks to function titles (or roles) and only use these title
names in the DRP. Note that a given role (say, Plumber) may be filled by more than one person/company,
and also a given person/company may serve in several capacities at the same time.
The contact information appendix of a DRP should be sorted by role and generic names, and should
list the names of companies, organizations, or people that currently fulfill those functions and
the specific locations (and other data) that currently fill those generic names (off-site storage,
backup web server, Electrician, ISP, ...). The date of last update should be included.
Such contact and related information in this appendix should be regularly maintained. (And this
task should be assigned to someone in the DRP!)
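The role-based appendix described above could be kept as a small data structure. This is a sketch under stated assumptions: every name, phone number, and address is a made-up placeholder, and the 180-day staleness threshold is an arbitrary choice, not a recommendation from this document.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Contact:
    name: str
    phone: str
    email: str


# All names, numbers, and addresses below are made-up placeholders.
ACME_SERVICES = Contact("Acme Building Services", "555-0101", "dispatch@acme.example")

# Date the appendix was last reviewed (illustrative value).
APPENDIX_LAST_UPDATED = date(2024, 1, 15)

# Role -> everyone currently filling that role. One company may fill
# several roles, and one role may be filled by several companies.
CONTACTS_BY_ROLE = {
    "Electrician": [ACME_SERVICES],
    "Plumber": [ACME_SERVICES, Contact("Bay Plumbing", "555-0102", "help@bayplumbing.example")],
    "ISP": [Contact("Example Net", "555-0103", "noc@examplenet.example")],
}


def contacts_for(role: str) -> list[Contact]:
    """Return the contacts currently assigned to a role (empty if none)."""
    return CONTACTS_BY_ROLE.get(role, [])


def appendix_is_stale(today: date, max_age_days: int = 180) -> bool:
    """Return True if the appendix has not been updated within max_age_days."""
    return (today - APPENDIX_LAST_UPDATED).days > max_age_days
```

Because the body of the DRP refers only to role names, a change of plumber or ISP touches this appendix and nothing else.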
One last point: In a disaster it may not be possible to reach some of the key personnel listed.
Also if the disaster is protracted, those with families may not be able to stay to handle your
organization's disaster. You should make a clear chain of command, so if someone is unavailable
everyone knows who will then take charge.
Types of Disasters to Plan For
You can't document every type of disaster that might befall an organization's servers and networks.
(For example, few companies have a plan on how to handle a swarm of moths shorting out computers.)
Make sure your plan includes some general policy guidelines to cover any cases not specifically
mentioned. In fact this can help keep your documents much shorter than otherwise.
Some types of disasters you should specifically plan for include:
• Physical Break-ins: theft and/or destruction, terrorist attacks
• Remote attacks: attempts to steal, destroy, or corrupt data, theft of service, denial of service
(DoS), computer viruses
• Hardware failures: servers, databases, networks, power outages
• Environmental disasters: fire, flood, hurricane, etc. (Generally all these result in power
outages too)
• Accidents (human error): file loss, DB record loss, data corruption
• Other disruption: disgruntled employees, organized criminal activity, strikes, legal actions
(e.g., shutdown orders), etc.
Preparing for Disaster
Low (or no) cost mitigations should be used whatever the budget. Some of these are discussed below
(Avoiding Disasters). Often a group of mitigations can be used very effectively together.
A vital step to take in advance is to determine exactly who is responsible for what. As mentioned
previously (Contact and Other Data) the best way to document this is to come up with roles, such as
PR contact, network manager, key contact (person in charge), etc. Then write the DRP using only the
role names, clearly indicating the responsibilities of each. In an appendix you can then list the
current personnel that are assigned each role, including phone, fax, home phone, email, and any
other contact information. Note a given person may be assigned multiple roles. In a small company a
single person may have all roles. On the other hand, in a large organization you may need several
people to fulfill a single role (such as handling the phones).
The person in charge is usually in upper management. It is a mistake to list an IT person as in
charge, even a senior administrator. The person in charge must have the authority to make policy
decisions, such as closing the school early or directing the PR contact to make announcements to
the media. However the policy should be that the person in charge must consult with some IT
personnel before making vital IT related decisions. A foolish decision made without understanding
the technical issues involved can cost dearly.
An often overlooked step is to implement DRP training for key (or all) personnel. Without some
training it is unlikely your plan will prove effective once a disaster occurs.
Remember to establish a clear chain of command. If some key person is unavailable, without a chain
of command nobody will know who is in charge or who reports to whom.
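A chain of command is in effect an ordered list with a fallback rule. The sketch below assumes hypothetical role names and a caller-supplied set of roles that can currently be reached; it only illustrates the idea.

```python
from typing import Optional

# Hypothetical chain of command, highest authority first.
CHAIN_OF_COMMAND = ["key contact", "deputy key contact", "network manager"]


def acting_in_charge(chain: list[str], available: set[str]) -> Optional[str]:
    """Return the first role in the chain that can currently be reached."""
    for role in chain:
        if role in available:
            return role
    return None  # nobody in the chain is reachable
```

If the key contact cannot be reached, `acting_in_charge(CHAIN_OF_COMMAND, {"deputy key contact", "network manager"})` selects the deputy, so everyone knows who takes charge next.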
Avoiding Disasters (Mitigation Measures)
There are a number of techniques that can be used to reduce or eliminate the probability of some
disasters. (Of course you can't completely eliminate the risk of disasters!) These mitigation
measures often also reduce the cost or time needed for disaster recovery. You should use as many of
these mitigation strategies as makes sense for your DRP:
• Store key data off-site. The location and access information must be documented in your DRP.
Types of key data and documents to store off-line (and perhaps off-site) include system logs,
backups, hardware inventories and configurations, /etc/passwd and /etc/shadow (and other /etc/*
files), network maps (showing connections, IP address assignments, DNS data, etc.), serial numbers
for all equipment, software keys, licenses and permits, room keys (and combinations for locks), and
any other security information (such as the root password for your servers).
• Keep paper copies of vital data (including your DRP).
• Keep information (contact information, passwords, ...) current.
• Use anti-virus and malware removal software.
• Use and regularly test UPS, fire and smoke sensors and alarms, anti-theft systems.
• Have INFOSEC and compliance (e.g., Sarbanes-Oxley) assessments and evaluations (also known as
audits) done at least once after any major IT infrastructure changes.
• Test the disaster recovery plan by staging a disaster drill. Do this every 1–3 years, more often if a lot
has changed since the last drill (such as key personnel turnovers) or if your personnel need the
practice. Tell people in advance, and also fire, police, ISP, and others you are staging a drill at
a specific time. (Since you also should review the DRP every 1–3 years it makes sense to do this
test after the review, and possible changes.)
• Maintain systems, including regular inspections (e.g., change A/C filters, examine fire
extinguishers, change batteries regularly in smoke detectors and UPSes). Such disaster preventative
measures should be clearly documented in your DRP, including who is responsible for doing what.
• Have a backup ISP (say, via a cheap ISDN line), backup email, and possibly other backup servers in
different geographical locations. (Often a reciprocal agreement can be made between East and West
coast companies to host each other's services in case of emergency.)
• Conduct training sessions.
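The drill interval mentioned in the list above (every 1–3 years, sooner after major changes) can be expressed as a simple date check. This is a sketch only: the function name, the three-year default, and the `major_change_since_drill` flag are assumptions chosen for illustration.

```python
from datetime import date


def drill_due(last_drill: date, today: date, max_years: int = 3,
              major_change_since_drill: bool = False) -> bool:
    """Return True if a disaster drill should be scheduled now."""
    if major_change_since_drill:
        # For example, turnover of key personnel since the last drill.
        return True
    try:
        deadline = last_drill.replace(year=last_drill.year + max_years)
    except ValueError:
        # last_drill was Feb 29 and the target year is not a leap year.
        deadline = last_drill.replace(year=last_drill.year + max_years, day=28)
    return today >= deadline
```

The same check could be reused for the periodic DRP review, since the list suggests holding the drill right after each review.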
Coping with Disasters: What to Do During and After a Disaster
Once a disaster is imminent, has occurred, or is occurring, you need to activate the relevant DRP
procedures. (Of course you are already well prepared!) The first step is to locate and review your
copy of the DRP.
You must understand your DRP role. Know who you must notify, especially to protect legal rights and
to avoid charges of negligence. Be certain you understand the chain of command; know who you should
report to and take direction from, and who should report to you.
You must know your company policy regarding disasters, especially break-ins and other attacks. Some
common policies include phoning the corporate attorney, the president or board members, and others
in the company, and letting them follow up. A company may fear negative publicity more than the loss
from a disaster, so the policy may be not to report the problem to anyone outside the company.
Sometimes you report to the person in charge of publicity (marketing) and let them choose.
When planning your DRP you can contact your ISP and local law enforcement to see what procedures
they recommend. Often government agencies such as the FBI (www.fbi.gov), police or other local law
enforcement, FEMA (www.fema.gov), CERT (www.cert.org), US-CERT (www.us-cert.gov), and others should
be notified in the event of an attack (although the FBI won't take action unless the loss is above
$5000 or so, and won't give priority unless the loss is much greater).
If the loss affects the customers it may be required by various laws and regulations to report the
disaster even if your company would prefer you not to. You should become familiar with the laws
governing your organization's particular situation. Even if not required to report the problem in
some cases the policy may be to report the problem to major customers.
In real life an attorney is consulted early to determine policies and procedures to follow that are
required by law or by industry regulations or that are just a good idea to limit your company's
legal liability. (For your class project it is OK to make this stuff up; that is you can pretend a
lawyer said that you must have daily backups off-site, that you must notify the police, etc.)
As a professional system or network administrator you have a responsibility to obey the laws and
applicable regulations. If you feel your organization's policy is illegal or unethical you should
work to resolve the issue early on. Otherwise you may be required to whistle-blow when a disaster
strikes; this will not enhance your job prospects!
When a disaster is imminent it is a good time to perform backups, update system journals, contact
backup sites (to let them know to get ready), send the documents and backups that need to be
off-site, and other preparation steps. These are called proactive measures. This is also a good
time to obtain a hardcopy of the current DRP and review it.
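As one illustration of a proactive measure, key configuration files can be bundled into a single archive that is then sent off-site. This is a minimal sketch using Python's standard `tarfile` module; the function name is hypothetical, and a real procedure would also need to encrypt the archive, since files such as /etc/shadow are sensitive.

```python
import tarfile
from pathlib import Path


def archive_key_files(paths, archive_path):
    """Bundle the files that exist into a gzip-compressed tar archive.

    Returns the list of paths that were actually archived, so that
    missing files can be noticed and followed up.
    """
    archived = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for p in map(Path, paths):
            if p.exists():
                # Store under a relative name so extraction never
                # overwrites live system files by accident.
                tar.add(p, arcname=p.as_posix().lstrip("/"))
                archived.append(str(p))
    return archived
```

A call such as `archive_key_files(["/etc/passwd", "/etc/hosts"], "etc-backup.tar.gz")` would produce one file to copy to the off-site location; the paths and archive name here are examples only.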
The specific steps to take after a disaster has struck are usually known as reactive measures. Some
specific measures that should be addressed in the case of a school such as HCC include:
• What specifically should be done if a hacker defaces (or completely wipes out) the main web
server?
• Who is to be phoned in the event of a school closure? (And who has the power to order a school
closure?)
• Are there backup (web, DNS, email) servers off-site and if so what must be done to activate
them?
• What is the time frame for a recovery of lost data? Or a server crash?
• What should be done if a DoS attack prevents on-line registration the week or two before a new
semester starts?
Other Policies and Procedures
Besides disaster recovery, a company may (and should) have other policy and procedure documents. You
may be asked to write such documents related to IT. You need to cover such topics as acceptable use
of company equipment (e.g., the computers), data (e.g., customer lists), and services (e.g.,
email), strategic plans to replace desktop computers every so often, and so on. In addition you
should inform employees about any privacy policies and related matters (e.g., password
policies).
Policies and procedures that employees need to know about should be accessible, including items
such as equipment use forms, account request forms, password reset procedure, etc. A good idea is
to use a web server for all this and include an index and a search engine if you have a lot of
documents.
Summary
1. Determine what mitigations are required for your enterprise.
2. Perform a business impact analysis.
3. Perform a cost analysis and determine the budget.
4. Decide what mitigations, policies, and procedures are reasonable in your situation. Make sure
these make sense to upper management, legal department, and senior IT staff.
5. Conduct now, and schedule regularly, an audit of the DRP. This will ensure compliance, minimize
liability, and reduce hazards to employees and others.
6. Distribute the DRP to personnel.
7. Implement the pre-disaster mitigations.
8. Decide on schedules for reviewing (and revising) the DRP. Keep the contact information
current.
9. Decide on a schedule for conducting a DRP drill. (Before a drill is a good time to review and
revise the DRP.)
RAMNETSOL offers a full spectrum of IT Business
Solutions, including Disaster Recovery and Business Continuity. Our
consultants hold current Business Continuity Planning certifications from the Disaster Recovery
Institute International (DRII), and take into account the technology, business process, and human
factors necessary for successful recovery in the event of a disaster. Our approach includes helping
you objectively assess your requirements and current capabilities, map out the optimal path to meet
your requirements, build the capabilities you require, and develop the documented plans to mitigate
your business IT risk. We've delivered these services for dozens of clients over the
last ten years, across the country and in every major industry.