Business Continuity Planning and Disaster Recovery Planning

Overview

The Business Continuity Planning (BCP) and Disaster Recovery Planning (DRP) domain is all about business. We’re not talking about infringements of security policy or unauthorized access; rather, this is about making contingency plans for a business-threatening emergency and continuing the business in the event of a disaster. While the other domains are concerned with preventing risks and protecting the infrastructure against attack, this domain assumes that the worst has happened.

The 21st century is shaping up to be the “disaster” century; it’s sure starting out that way. A lot has been said about 9/11; it was the largest implementation of Disaster Recovery Plans in American history. A great number of recovery stories sprang out of that event, and many companies had to improvise well past their plans. In the publishing world, for example, TheStreet.com and the daily newspaper American Banker ran from various journalists’ homes for several weeks afterward. The August 2003 East-Coast power blackout was proof that what looks good on paper may not work in the real world. The distributed power grid was supposed to isolate power faults and create a fault-tolerant system, whereas in actuality the grid cascaded the faults onto other utilities’ grids.

The effects of a disaster may not be immediately felt. For instance, in August 2001 a large office fire on Wall Street displaced many companies, many of which were able to continue business after the immediate evacuation and relocation. However, a later study showed that 80 percent of the businesses failed within 3 to 5 years after the event, because they could never fully recover their client base or credibility. Their clients were happy with alternative vendors; the event gave their competitors too strong a foothold in their space.

The CISSP candidate should know the following:

The BCP and DRP domains address the preservation of business in the face of major disruptions to normal operations. Business Continuity Planning and Disaster Recovery Planning involve the preparation, testing, and updating of the actions required to protect critical business processes from the effects of major system and network failures. The CISSP candidate must have an understanding of the preparation of specific actions required to preserve the business in the event of a major disruption to normal business operations.

The BCP process includes the following:

DISASTER DEFINITION

The disaster, emergency management, and business continuity community consists of many different types of entities, such as governmental (federal, state, and local), nongovernmental (business and industry), and individuals. Each entity has its own focus and its own definition of a disaster. A very common definition of a disaster is “a suddenly occurring or unstoppable developing event that:

The DRP process includes the following:

Business Continuity Planning

Simply put, business continuity plans are created to prevent interruptions to normal business activity. They are designed to protect critical business processes from natural or man-made failures or disasters and the loss of capital resulting from the unavailability of normal business processes. Business continuity planning is a strategy to minimize the effect of disturbances and to allow for the resumption of business processes.

A disruptive event is any intentional or unintentional security violation that suspends normal operations. The aim of BCP is to minimize the effects of a disruptive event on a company. The primary purpose of business continuity plans is to reduce the risk of financial loss and enhance a company’s capability to recover promptly from a disruptive event. The business continuity plan should also help minimize the cost associated with the disruptive event and mitigate the risk associated with it.

Business continuity plans should look at all critical information-processing areas of the company, including but not limited to the following:

Life safety, or protecting the health and safety of everyone in the facility, is the first priority in an emergency or disaster. Although we talk about the preservation of capital, resumption of normal business-processing activities, and other business continuity issues, the main, overriding concern of all plans is to get the personnel out of harm’s way. Evacuation routes, assembly areas, and accounting for personnel (head counts and last known locations) are the most important elements of emergency procedures. If at any time there’s a conflict between preserving hardware or data and the threat of physical danger to personnel, the protection of the people always comes first. Personnel evacuation and safety must be the first element of a disaster response plan. Providing restoration and recovery and implementing alternative production methods come later.

ASSET LOSS

The loss of assets entails more than just the hard costs of replacing destroyed systems. Other examples of business assets that could be lost or damaged during a disaster are:

Continuity Disruptive Events

The events that can affect business continuity and require disaster recovery are well documented in the Physical Security domain (Chapter 10). Here, we are concerned with those events, either natural or man-made, that are of such a substantial nature as to pose a threat to the continuing existence of the organization. All the plans and processes in this section are “after the fact”; that is, no preventative controls similar to the controls discussed in the Operations Security domain (Chapter 6) will be demonstrated here. Business continuity plans are designed to minimize the damage done by the event and facilitate rapid restoration of the organization to its full operational capability.

We can make a simple list of these events, categorized as to whether their origination was natural or human. Examples of natural events that can affect business continuity are as follows:

Examples of man-made events that can affect business continuity are:

The Four Prime Elements of BCP

There are four major elements of the BCP process:

Scope and Plan Initiation

The Scope and Plan Initiation phase is the first step toward creating a business continuity plan. This phase marks the beginning of the BCP process. It entails creating the scope for the plan and the other elements needed to define the parameters of the plan. This phase embodies an examination of the company’s operations and support services. Scope activities could include creating a detailed account of the work required, listing the resources to be used, and defining the management practices to be employed.

With the advent of the personal computer in the workplace, distributed processing introduces special problems into the BCP process. It’s important that the centralized planning effort encompass all distributed processes and systems.

Roles and Responsibilities

The BCP process involves many personnel from various parts of the enterprise. Creation of a BCP committee will represent the first enterprisewide involvement of the major critical functional business units. All other business units will be involved in some way later, especially during the implementation and awareness phases.

The business resumption, or business continuity, plan must have total, highly visible senior management support. Senior management must agree on the scope of the project, delegate resources for the success of the project, and support the timeline and training efforts.

Also, many elements of the BCP will address senior management, such as the statement of importance and priorities, the statement of organizational responsibility, and the statement of urgency and timing. Table 8-1 shows the roles and responsibilities in the BCP process.

Table 8-1: BCP Department Involvement

WHO                                DOES WHAT
Executive management staff         Initiates the project, gives final approval, and gives ongoing support
Senior business unit management    Identifies and prioritizes time-critical systems
BCP committee                      Directs the planning, implementation, and test processes
Functional business units          Participate in implementation and testing

CONTINGENCY PLANNERS

Contingency planners have many roles and responsibilities when planning business continuity, disaster recovery, emergency management, or business resumption processes. Some of these roles and responsibilities can include:

THE FCPA

The Foreign Corrupt Practices Act of 1977 imposes civil and criminal penalties if publicly held organizations fail to maintain adequate controls over their information systems. Organizations must take reasonable steps to ensure not only the integrity of their data but also the system controls the organization put in place.

Some organizations with mature business resumption plans (BRPs) employ a tiered structure that mirrors the organization’s hierarchy. Senior management is always the highest level of decision makers in the BRP process, although the policy group also consists of upper-level executives. The policy group approves emergency management decisions involving expenditures, liabilities, and service impacts. The next group, the disaster management team, often consists of department and business unit representatives and makes decisions regarding life safety and disaster recovery efforts. The next group, the emergency response team, supplies tactical response to the disaster and may consist of members of data processing, user support, or persons with first aid and evacuation responsibilities.[*]

Because of the concept of due diligence, stockholders may hold senior managers as well as the board of directors personally responsible if a disruptive event causes losses that adherence to base industry standards of due care could have prevented. For this reason and others, it is in the senior managers’ best interest to be fully involved in the BCP process.

Senior corporate executives are increasingly being held liable for failure of due care in disasters. They may also face civil suits from shareholders and clients for compensatory damages. The definition of due care is being updated to include computer functionality outages as more and more people around the world depend upon information to do their jobs.

Business Impact Assessment

The purpose of a BIA is to create a document to be used to help understand what impact a disruptive event would have on the business. The impact may be financial (quantitative) or operational (qualitative, such as the inability to respond to customer complaints). A vulnerability assessment is often part of the BIA process.

BIA has three primary goals:

A BIA generally takes the form of these four steps:

  1. Gathering the needed assessment materials
  2. Performing the vulnerability assessment
  3. Analyzing the information compiled
  4. Documenting the results and presenting recommendations

Gathering Assessment Materials

The initial step of the BIA is identifying which business units are critical to continuing an acceptable level of operations. Often, the starting point is a simple organizational chart that shows the business units’ relationships to each other. Other documents may also be collected at this stage in an effort to define the functional interrelationships of the organization.

As the materials are collected and the functional operations of the business are identified, the BIA will examine these business function interdependencies with an eye toward several factors, such as determining the business success factors involved, establishing a set of priorities between the units, and deciding what alternate processing procedures can be utilized.

The Vulnerability Assessment

The vulnerability assessment is often part of a BIA. It is similar to a Risk Assessment in that there is a quantitative (financial) section and a qualitative (operational) section. It differs in that the vulnerability assessment is smaller than a full risk assessment and is focused on providing information that is used solely for the business continuity plan or disaster recovery plan.

A function of a vulnerability assessment is to conduct a loss impact analysis. Because there will be two parts to the assessment (a financial assessment and an operational assessment), it will be necessary to define loss criteria both quantitatively and qualitatively.

Quantitative loss criteria can be defined as follows:

Qualitative loss criteria can consist of the following:

During the vulnerability assessment, critical support areas must be defined in order to assess the impact of a disruptive event. A critical support area is defined as a business unit or function that must be present to sustain continuity of the business processes, maintain life safety, or avoid public relations embarrassment.

Critical support areas could include the following:

The granular elements of these critical support areas will also need to be identified. By granular elements we mean the personnel, resources, and services that the critical support areas need to maintain business continuity.

Common steps to performing a vulnerability assessment could be[*]:

  1. List potential emergencies, both internal to your facility and external in the community. Natural events, man-made events, technological failures, and human error are all categories of potential emergencies.
  2. Estimate the likelihood that each emergency could occur, in a subjective analysis.
  3. Assess the potential impact of the emergency on the organization in the areas of human impact (death or injury), property impact (loss or damage), and business impact (market share or credibility).
  4. Assess external and internal resources required to deal with the emergency, and determine whether they are located internally or whether external capabilities or procedures are required.

Figure 8-1 shows a sample vulnerability matrix. This can be used to create a subjective impact analysis for each type of emergency and its probability. The lower the final number the better, as a high number means a high probability, impact, or lack of remediation resources.

TYPE OF EMERGENCY    Probability     Human Impact    Property Impact    Business Impact    Internal Resources    External Resources    Total
                     High 5↔1 Low    High Impact 5↔1 Low Impact                            Weak Resources 5↔1 Strong Resources

Figure 8-1: Sample vulnerability assessment matrix.
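
As a rough illustration of how the matrix might be totaled, the following Python sketch sums the six 1-to-5 ratings for each emergency and ranks the results from worst to best. Summing the columns to obtain the Total is our assumption based on the figure’s layout, and the class name, function name, emergency names, and scores are all hypothetical examples rather than data from a real assessment.

```python
from dataclasses import dataclass


@dataclass
class EmergencyRating:
    """One row of the sample matrix; every score runs from 1 to 5."""
    name: str
    probability: int         # 5 = high probability, 1 = low
    human_impact: int        # 5 = high impact, 1 = low impact
    property_impact: int     # 5 = high impact, 1 = low impact
    business_impact: int     # 5 = high impact, 1 = low impact
    internal_resources: int  # 5 = weak resources, 1 = strong resources
    external_resources: int  # 5 = weak resources, 1 = strong resources

    def total(self) -> int:
        """Sum the six scores (assumed scoring rule); lower is better."""
        scores = (
            self.probability,
            self.human_impact,
            self.property_impact,
            self.business_impact,
            self.internal_resources,
            self.external_resources,
        )
        for score in scores:
            if not 1 <= score <= 5:
                raise ValueError(f"score {score} is outside the 1-5 scale")
        return sum(scores)


def rank_by_total(ratings):
    """Return the rows ordered from highest total (worst) to lowest."""
    return sorted(ratings, key=lambda r: r.total(), reverse=True)


# Hypothetical rows, for illustration only.
ratings = [
    EmergencyRating("Regional power outage", 4, 1, 2, 4, 2, 3),
    EmergencyRating("Data center fire", 2, 4, 5, 5, 3, 2),
]

for row in rank_by_total(ratings):
    print(f"{row.name}: {row.total()}")
```

With these made-up scores the fire totals 21 and the power outage 16, so the fire would be examined first even though it is the less probable of the two events.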

THE CRITICALITY SURVEY

A criticality survey is another term for a standardized questionnaire or survey methodology, such as the INFOSEC Assessment Methodology (IAM), or it could be a subset of the Systems Security Engineering Capability Maturity Model (SSE-CMM). Its purpose is to help identify the most critical business functions by gathering input from management personnel in the various business units.

Analyzing the Information

During the analysis phase of the BIA, several activities take place, such as documenting required processes, identifying interdependencies, and determining what an acceptable interruption period would be.

The goal of this section is to clearly describe what support the defined critical areas will require to preserve the revenue stream and maintain predefined processes, such as transaction processing levels and customer service levels. Therefore, elements of the analysis will have to come from many areas of the enterprise.

Documentation and Recommendation

The last step of the BIA entails a full documentation of all the processes, procedures, analyses, and results and the presentation of recommendations to the appropriate senior management.

The report will contain the previously gathered material, list the identified critical support areas, summarize the quantitative and qualitative impact statements, and provide the recommended recovery priorities generated from the analysis.

Business Continuity Plan Development

Business continuity plan development uses the information collected in the BIA to create a recovery strategy that supports the critical business functions. Here the planner begins to map out a strategy for creating a continuity plan.

This phase consists of two main steps:

  1. Defining the continuity strategy
  2. Documenting the continuity strategy

Defining the Continuity Strategy

To define the BCP strategy, the information collected from the BIA is used to create a continuity strategy for the enterprise. This task is large, and many elements of the enterprise must be included in defining the continuity strategy, such as:

In developing plans, consideration should be given to both short-term and long-term goals and objectives. Short-term goals can include:

Long-term goals and objectives can include[*]:

THE INFORMATION TECHNOLOGY DEPARTMENT

The IT department plays a very important role in identifying and protecting the company’s internal and external information dependencies. Also, the information technology elements of the BCP should address several vital issues, including:

Documenting the Continuity Strategy

Documenting the continuity strategy simply refers to the creation of documentation of the results of the continuity strategy definition phase. You will see the word documentation a lot in this chapter. Documentation is required in almost all sections, and it is the nature of BCP/DRP to require a lot of paper.

Plan Approval and Implementation

As the last step, the business continuity plan is implemented. The plan itself must contain a roadmap for implementation. Implementation here doesn’t mean executing a disaster scenario and testing the plan, but rather it refers to the following steps:

  1. Approval by senior management
  2. Creating an awareness of the plan enterprisewide
  3. Maintenance of the plan, including updating when needed

[*]Source: Paul H. Rosenthal, “Business Contingency Planning 201,” Contingency Planning and Management (May 2000).

[*]Source: FEMA, “Emergency Management Guide for Business and Industry,” August 1998.

[*]Source: National Fire Protection Association, “NFPA 1600 Standard on Disaster/Emergency Management and Business Continuity,” 2000 edition.

Disaster Recovery Planning (DRP)

I don’t think anyone can question the importance of a working, tested, reality-based Disaster Recovery Plan (DRP). A disaster recovery plan is a comprehensive statement of consistent actions to be taken before, during, and after a disruptive event that causes a significant loss of information systems resources. Disaster Recovery Plans are the procedures for responding to an emergency, providing extended backup operations during the interruption, and managing recovery and salvage processes afterwards, should an organization experience a substantial loss of processing capability.

The primary objective of the disaster recovery plan is to provide the capability to implement critical processes at an alternate site and return to the primary site and normal processing within a time frame that minimizes the loss to the organization by executing rapid recovery procedures.

When planning for a disaster, it’s important to try to account for the unexpected consequences of both the disaster and the remediation. When you try to “expect the unexpected,” however, that doesn’t mean you can literally and financially prepare for every contingency. Preparing as well as possible for what you can will reduce the negative impact of unforeseen events. If 70 percent, 80 percent, or 90 percent of the recovery goes smoothly and according to plan, the unexpected events will have a much smaller impact on survivability of the business.

Disasters primarily affect availability, which affects the ability of the staff to access the data and access working systems, but a disaster can also affect the other two tenets: confidentiality and integrity.

Goals and Objectives of DRP

A major goal of DRP is to provide an organized way to make decisions if a disruptive event occurs. The purpose of the disaster recovery plan is to reduce confusion and enhance the ability of the organization to deal with the crisis.

Obviously, when a disruptive event occurs, the organization will not have the luxury to create and execute a recovery plan on the spot. Therefore, the amount of planning and testing that can be done beforehand will determine the capability of the organization to withstand a disaster.

The objectives of the DRP are multiple, but each is important. They can include the following:

In this section, we will examine the following areas of DRP:

The Disaster Recovery Planning Process

This phase involves the development and creation of the recovery plans, which are similar to the BCP process. However, BCP is involved in BIA and loss criteria for identifying the critical areas of the enterprise that the business requires to sustain continuity and financial viability; the DRP process assumes that those identifications have been made and the rationale has been created. Now we’re defining the steps we will need to perform to protect the business in the event of an actual disaster. Table 8-2 shows a common scheme to classify the recovery time frame needs of each business function.

Table 8-2: Recovery Time Frame Requirements Classification

RATING CLASS    RECOVERY TIMEFRAME REQUIREMENTS
AAA             Immediate recovery needed; no downtime allowed
AA              Full functional recovery required within four hours
A               Same day business recovery required
B               Up to 24 hours downtime acceptable
C               24 to 72 hours downtime acceptable
D               Greater than 72 hours downtime acceptable
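
The classification in Table 8-2 can be sketched as a simple lookup from the downtime a business function can tolerate to its rating class. This is only a sketch under stated assumptions: the function name and the `same_day_hours` parameter are hypothetical, and treating “same day” as eight hours is our assumption, because the table doesn’t define it in hours.

```python
def classify_recovery_need(max_downtime_hours: float,
                           same_day_hours: float = 8.0) -> str:
    """Map tolerable downtime (in hours) to a Table 8-2 rating class.

    same_day_hours is an assumed cutoff for the "same day" class (A);
    adjust it to match how the organization defines a business day.
    """
    if max_downtime_hours < 0:
        raise ValueError("downtime cannot be negative")
    if max_downtime_hours == 0:
        return "AAA"  # no downtime allowed
    if max_downtime_hours <= 4:
        return "AA"   # full functional recovery within four hours
    if max_downtime_hours <= same_day_hours:
        return "A"    # same day recovery (assumed to mean <= 8 hours)
    if max_downtime_hours <= 24:
        return "B"    # up to 24 hours acceptable
    if max_downtime_hours <= 72:
        return "C"    # 24 to 72 hours acceptable
    return "D"        # more than 72 hours acceptable


# Hypothetical business functions and their tolerable downtime in hours.
functions = {"Order entry": 0, "Payroll": 48, "Archive retrieval": 120}
for name, hours in functions.items():
    print(name, classify_recovery_need(hours))
```

Keeping the “same day” cutoff as a parameter makes the one ambiguous boundary in the table explicit instead of burying a guess in the logic.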

DISASTER RECOVERY PLAN SOFTWARE TOOLS

Several vendors distribute automated tools to create disaster recovery plans. These tools can improve productivity by providing formatted templates customized to the particular organization’s needs. Some vendors also offer specialized recovery software focused on a particular type of business or vertical market. A good source of links to various vendors is located at:

The steps in the disaster planning process phase are:

Data Processing Continuity Planning

The various means of processing backup services are all important elements to the disaster recovery plan. Here we look at the most common alternate processing types:

Mutual Aid Agreements

A mutual aid agreement (sometimes called a reciprocal agreement) is an arrangement with another company that may have similar computing needs. The other company may have similar hardware or software configurations or may require the same network data communications or Internet access as your organization.

In this type of agreement, both parties agree to support each other in the case of a disruptive event. This arrangement is made on the assumption that each organization’s operations area will have the capacity to support the others in a time of need. This is a big assumption.

There are clear advantages to this type of arrangement. It allows an organization to obtain a disaster-processing site at very little or no cost, thereby creating an alternate processing site even though a company may have very few financial resources to create one. Also, if the companies have very similar processing needs (that is, the same network operating system, the same data communications needs, or the same transaction processing procedures), this type of agreement may be workable.

This type of agreement has serious disadvantages, however, and really should be considered only if the organization has the perfect partner (a subsidiary, perhaps) and has no other alternative to disaster recovery (i.e., a solution would not exist otherwise). One disadvantage is that it is highly unlikely that each organization’s infrastructure will have the extra, unused capacity to enable full operational processing during the event. Also, in contrast to a hot or warm site, this type of arrangement severely limits the responsiveness and support available to the organization during an event and can be used only for short-term outage support.

The biggest flaw in this type of plan is obvious if we ask what happens when the disaster is large enough to affect both organizations. A major outage can easily disrupt both companies, thereby canceling any advantage that this agreement may provide. The capacity and logistical elements of this type of plan make it seriously limited.

Subscription Services

Another type of alternate processing scenario is presented by subscription services. In this scenario, third-party commercial services provide alternate backup and processing facilities. Subscription services are probably the most common of the alternate processing site implementations. They have very specific advantages and disadvantages, as we will see.

There are three basic forms of subscription services with some variations:

Hot Site

This is the Cadillac of disaster recovery alternate backup sites. A hot site is a fully configured computer facility with electrical power, heating, ventilation, and air conditioning (HVAC) and functioning file/print servers and workstations. The applications that are needed to sustain remote transaction processing are installed on the servers and workstations and are kept up-to-date to mirror the production system. Theoretically, operators and other personnel should be able to walk in and, with a data restoration of modified files from the last backup, begin full operations in a very short time. If the site participates in remote journaling (that is, mirroring transaction processing with a high-speed data line to the hot site), even the backup time may be reduced or eliminated.

This type of site requires constant maintenance of the hardware, software, data, and applications to ensure that the site accurately mirrors the state of the production site. This adds administrative overhead and can be a strain on resources, especially if a dedicated disaster recovery maintenance team does not exist.

The advantages to a hot site are numerous. The primary advantage is that 24/7 availability and exclusivity of use are ensured. The site is available immediately (or within the allowable time tolerances) after the disruptive event occurs. The site can support an outage for a short time as well as a long-term outage.

Some of the drawbacks of a hot site are as follows:

Warm Site

A warm site could best be described as a cross between a hot site and cold site. Like a hot site, the warm site is a computer facility readily available with electrical power, HVAC, and computers, but the applications may not be installed or configured. It may have file/print servers, but not a full complement of workstations. External communication links and other data elements that commonly take a long time to order and install will be present, however.

To enable remote processing at this type of site, workstations will have to be delivered quickly, and applications and their data will need to be restored from backup media.

The advantages to this type of site, as opposed to the hot site, are primarily as follows:

The primary disadvantage of a warm site, compared to a hot site, is the difference in the amount of time and effort it will take to start production processing at the new site. If extremely urgent critical transaction processing is not needed, this may be an acceptable alternative.

Cold Site

A cold site is the least ready of any of the three choices, but it is probably the most common of the three. A cold site differs from the other two in that it is ready for equipment to be brought in during an emergency, but no computer hardware (servers or workstations) resides at the site. The cold site is a room with electrical power and HVAC, but computers must be brought on-site if needed, and communications links may be ready or not. File and print servers have to be brought in, as well as all workstations, and applications will need to be installed and current data restored from backups.

A cold site is not considered an adequate resource for disaster recovery, because of the length of time required to get it going and all the variables that will not be resolved before the disruptive event. In reality, using a cold site will most likely make effective recovery impossible. It will be next to impossible to perform an in-depth disaster recovery test or to do parallel transaction processing, making it very hard to predict the success of a disaster recovery effort.

There are some advantages to a cold site, however, the primary one being cost. If an organization has very little budget for an alternative backup-processing site, the cold site may be better than nothing. Also, resource contention with other organizations will not be a problem, and geographic location is unlikely to be an issue.

The big problem with this type of site is that having the cold site could engender a false sense of security. But until a disaster strikes, there’s really no way to tell whether it works or not, and by then it will be too late.

TERTIARY SITES

A tertiary site is a secondary backup site that can be used in case the primary backup site (regardless of whether it’s hot, warm, or cold) is not able to handle the recovery process or is completely unavailable. If an organization requires an extremely low maximum tolerable downtime (MTD), or is not totally comfortable with just one backup site, a tertiary site may be designed and built.

Multiple Centers

A variation on the previously listed alternative sites is called multiple centers, or dual sites. In a multiple-center concept, the processing is spread over several operations centers, creating a distributed approach to redundancy and sharing of available resources. These multiple centers could be owned and managed by the same organization (in-house sites) or used in conjunction with some sort of reciprocal agreement.

The advantages are primarily financial, because the cost is contained. Also, this type of site will often allow for resource and support sharing among the multiple sites. The main disadvantage is the same as for mutual aid: a major disaster could easily overtake the processing capability of the sites. Also, multiple configurations could be difficult to administer.

Service Bureaus

In rare cases, an organization may contract with a service bureau to fully provide all alternate backup-processing services. The big advantages of this type of arrangement are the quick response and availability of the service bureau, the ability to test, and the fact that the service bureau may be available for more than backup. The disadvantages of this type of setup are primarily the expense and resource contention during a large emergency.

Other Data Center Backup Alternatives

There are a few other alternatives to the ones we have previously mentioned. Quite often an organization may use some combination of these alternatives in addition to one of the preceding scenarios.

Transaction Redundancy Implementations

The CISSP candidate should understand the three concepts used to create a level of fault tolerance and redundancy in transaction processing. Although these processes are not used solely for disaster recovery, they are often elements of a larger disaster recovery plan. If one or more of these processes are employed, the ability of a company to get back on-line is greatly enhanced.

The creation of hot backup sites with remote journaling and tertiary sites can become quite complicated, with layers of multiple protocols, hardware, and software mirroring required. Figure 8-2 shows an organization using a Frame Relay network, mirroring transactions to multiple sites, employing FRNDs (Frame Relay Network Devices) and FRADs (Frame Relay Access Devices).

Figure 8-2: Frame Relay network mirroring to backup sites.

Disaster Recovery Plan Maintenance

All recovery plans share one trait: they become obsolete quickly, for many different reasons. The company may reorganize, and the critical business units may be different from the ones existing when the plan was first created. Most commonly, changes in the network or computing infrastructure alter the location or configuration of hardware, software, and other components. The reasons may also be administrative: complex disaster recovery plans are not easily updated, personnel lose interest in the process, or employee turnover affects involvement.

Whatever the reason, plan maintenance techniques must be employed from the outset to ensure that the plan remains fresh and usable. It’s important to build maintenance procedures into the organization by using job descriptions that centralize responsibility for updates. Also, create audit procedures that can report regularly on the state of the plan. It’s also important to ensure that multiple versions of the plan do not exist, because they could create confusion during an emergency. Always replace older versions of the text with updated versions throughout the enterprise when a plan is changed or replaced.
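One way to audit for stray versions is to compare a checksum of every distributed copy against the current master. The following is a minimal sketch; the function names and the sample copy locations are hypothetical, and a real audit would read the copies from wherever the organization actually stores them.

```python
import hashlib


def fingerprint(plan_text):
    """Return a SHA-256 digest that identifies one exact version of the plan."""
    return hashlib.sha256(plan_text.encode("utf-8")).hexdigest()


def find_stale_copies(master_text, copies):
    """Return the locations whose copy differs from the master plan.

    `copies` maps a location label to the plan text held there.
    """
    master_digest = fingerprint(master_text)
    return sorted(
        location
        for location, text in copies.items()
        if fingerprint(text) != master_digest
    )


# Hypothetical example: three distributed copies, one of them out of date.
master = "DR plan, revision 4"
copies = {
    "operations-center": "DR plan, revision 4",
    "off-site-vault": "DR plan, revision 3",
    "hr-office": "DR plan, revision 4",
}

stale = find_stale_copies(master, copies)
print(stale)
```

Any location reported by such a check holds a version that should be replaced before it can cause confusion during an emergency.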

Emergency management plans, business continuity plans, and disaster recovery plans should be regularly reviewed, evaluated, modified, and updated. At a minimum, the plan should be reviewed at an annual audit. The plan should also be reevaluated:

Testing the Disaster Recovery Plan

Testing the disaster recovery plan is very important. Just as a tape backup system cannot be considered working until full restoration tests have been conducted, a disaster recovery plan has many elements that are only theoretical until they have actually been tested and certified. A test plan must be created, and testing must be carried out in an orderly, standardized fashion on a regular basis.

Also, there are five specific disaster recovery plan–testing types that the CISSP candidate must know (see “The Five Disaster Recovery Plan Test Types” later in this section). Regular disaster recovery drills and tests are a cornerstone of any disaster recovery plan. No demonstrated recovery capability exists until the plan is tested. The tests must exercise every component of the plan for confidence to exist in the plan’s ability to minimize the impact of a disruptive event.

Reasons for Testing

In addition to the general reasons for testing that we have previously mentioned, there are several specific reasons to test, primarily to inform management of the recovery capabilities of the enterprise. Other specific reasons are as follows:

Creating the Test Document

To get the maximum benefit and coordination from the test, a document outlining the test scenario must be produced, containing the reasons for the test, the objectives of the test, and the type of test to be conducted (see the five following types). Also, this document should include granular details of what will happen during the test, including the following:

Certain fundamental concepts will apply to the testing procedure. Primarily, the test must not disrupt normal business functions. Also, the test should start with the easy testing types (see the following section) and gradually work up to major simulations after the recovery team has acquired testing skills.

It’s important to remember that the reason for the test is to find weaknesses in the plan. If no weaknesses were found, it was probably not an accurate test. The test is not a graded contest on how well the recovery plan or personnel executing the plan performed. Mistakes will be made, and this is the time to make them. Document the problems encountered during the test and update the plan as needed, and then test again.

TEST YOUR BACKUP REGULARLY!

If you don’t know whether the data can be retrieved quickly and accurately, or if the process has not been tested to your level of comfort, it’s not a working backup. One of us had an experience with a small New York securities firm that was in the middle of merger negotiations. Their primary server crashed, and at that point they discovered that all their backup tapes were blank; although the backup was running, no data was ever written to them. They had never tested the restore procedure. The crash was so severe that external third-party disk data restorers weren’t able to restore much data. Although some paper records existed, the value of the company tanked, and the merger failed.

The same one of us also worked with a major university that had had its e-mail system sabotaged by the recently fired systems administrator, and its backups were rendered useless. It took many weeks to build a new e-mail system, using multiple platforms, and although legal action was successfully initiated against the sysadmin, the VP of IT was forced to resign.
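The lesson of both stories is that a backup job that runs is not the same as a backup that restores. As a minimal sketch of a restore test, the following Python example archives a directory, restores it to a scratch location, and compares checksums file by file. The directory layout, file names, and helper names are assumptions made for illustration; a real test would restore from the actual backup media.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def checksums(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def backup(source_dir, archive_path):
    """Write every file and directory under source_dir into a .tar.gz archive."""
    source_dir = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in sorted(source_dir.rglob("*")):
            archive.add(path, arcname=str(path.relative_to(source_dir)), recursive=False)


def restore_matches_source(source_dir, archive_path):
    """Restore the archive to a scratch directory and compare it to the source."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive_path, "r:gz") as archive:
            archive.extractall(scratch)
        return checksums(scratch) == checksums(source_dir)


with tempfile.TemporaryDirectory() as work:
    source = Path(work) / "data"
    source.mkdir()
    (source / "ledger.txt").write_text("trade 1\ntrade 2\n")

    # A normal backup should restore to an exact copy of the source.
    good_archive = Path(work) / "good.tar.gz"
    backup(source, good_archive)
    good_ok = restore_matches_source(source, good_archive)

    # Simulate the "blank tape" case: the job ran but wrote no data.
    blank_archive = Path(work) / "blank.tar.gz"
    with tarfile.open(blank_archive, "w:gz"):
        pass
    blank_ok = restore_matches_source(source, blank_archive)

print(good_ok, blank_ok)
```

The second check is the important one: the blank archive is created without error, and only the restore comparison reveals that it contains nothing.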

The Five Disaster Recovery Plan Test Types

Disaster recovery/emergency management plan testing scenarios have several levels and can be called different things, but there are generally five types of disaster recovery plan tests. The listing here is prioritized, from the simplest to the most complete testing type. As the organization progresses through the tests, each test is progressively more involved and more accurately depicts the actual responsiveness of the company. Some of the testing types, such as the last two, require major investments of time, resources, and coordination to implement. The CISSP candidate should know all of these and what they entail.

The following are the testing types:

Table 8-3 lists the five disaster recovery plan testing types in priority order.

LEVEL  TYPE                 DESCRIPTION
1      Checklist            Copies of plan are distributed to management for review.
2      Table-top Exercise   Management meets to step through the plan.
3      Simulation           All support personnel meet in a practice execution session.
4      Functional Drill     All systems are functionally tested and drills executed.
5      Full-Scale Exercise  Real-life emergency situation is simulated.

Figure 8-3: Disaster Recovery Plan Testing Types

PLAN VIABILITY

Remember: The functionality of the recovery plan will directly determine the survivability of the organization. The plan shouldn’t be a document gathering dust in the CIO’s bookcase. It has to reflect the actual capability of the organization to recover from a disaster, and therefore needs to be tested regularly.

Disaster Recovery Procedures

This part of the plan details what roles various personnel will take on, what tasks must be implemented to recover and salvage the site, how the company interfaces with external groups, and what financial considerations will arise. Senior management must resist the temptation to participate hands-on in the recovery effort, as these efforts should be delegated. Senior management has many very important roles in the process of disaster recovery, including:

Information or technology management has more tactical roles to play, such as:

Monitoring employee morale and guarding against employee burnout during a disaster recovery event is the proper role of human resources. Other emergency recovery tasks associated with human resources could include:

The financial area is primarily responsible for:

Isolation of the incident scene should begin as soon as the emergency has been discovered. Authorized personnel should attempt to secure the scene and control access; however, no one should be placed in physical danger to perform these functions. It’s important for life safety that access be controlled immediately at the scene, and only by trained personnel directly involved in the disaster response. Additional injury or exposure to recovery personnel after the initial incident must be prevented.

The Recovery Team

A recovery team will be clearly defined with the mandate to implement the recovery procedures at the declaration of the disaster. The recovery team’s primary task is to get the predefined critical business functions operating at the alternate backup-processing site.

Among the recovery team's many tasks is the retrieval of needed materials from off-site storage, such as backup tapes, media, and workstations. When this material has been retrieved, the recovery team will install the necessary equipment and communications. The team will also install the critical systems, applications, and data required for the critical business units to resume working.

The Salvage Team

A salvage team, separate from the recovery team, will be dispatched to return the primary site to normal processing environmental conditions. It’s advisable to have a different team, because this team will have a different mandate from the recovery team. They are not involved with the same issues the recovery team is concerned with, such as creating production processing and determining the criticality of data. The salvage team has the mandate to quickly and, more importantly, safely clean, repair, salvage, and determine the viability of the primary processing infrastructure after the immediate disaster has ended.

Clearly, this cannot begin until all possibility of personal danger has ended. Firefighters or police might control the return to the site. The salvage team must identify sources of expertise, equipment, and supplies that can make the return to the site possible. The salvage team supervises and expedites the cleaning of equipment or storage media that may have suffered from smoke damage, the removal of standing water, and the drying of water-damaged media and papers.

This team is often also given the authority to declare when the site is up and running again, that is, when the resumption of normal duties can begin at the primary site. This responsibility is large, because many elements of production must be examined before the recovery team is given the green light to return operations.

Normal Operations Resume

This job is normally the task of the recovery team, although a separate resumption team may be created. The plan must have full procedures on how the company will return production processing from the alternate site to the primary site with the minimum of disruption and risk. It’s interesting to note that the steps to resume normal processing operations differ from the steps in the recovery plan: the least critical work should be brought back to the primary site first.
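To make the difference in ordering concrete, here is a minimal sketch. The business functions and their criticality scores are invented for illustration. Recovery at the alternate site brings the most critical functions up first, while resumption at the primary site moves the least critical functions back first.

```python
# Hypothetical business functions with criticality scores (higher = more critical).
functions = {
    "order processing": 9,
    "payroll": 6,
    "internal newsletter": 1,
}


def recovery_order(criticality):
    """Order for bringing functions up at the alternate site: most critical first."""
    return sorted(criticality, key=criticality.get, reverse=True)


def resumption_order(criticality):
    """Order for moving functions back to the primary site: least critical first."""
    return sorted(criticality, key=criticality.get)


print(recovery_order(functions))
print(resumption_order(functions))
```

Moving the least critical work first means that any remaining problem at the restored primary site is likely to show up before the most critical processing depends on it.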

WHEN IS A DISASTER OVER?

When is a disaster over? The answer is very important. The disaster is not over until all operations have been returned to their normal location and function. A very large window of vulnerability exists when transaction processing returns from the alternate backup site to the original production site. The disaster can be officially called over only when all areas of the enterprise are back to normal in their original home, and all data has been certified as accurate.

It’s important to note that the emergency is not over until all operations are back in full production mode at the primary site. Reoccupying the site of a disaster or emergency should not be undertaken until a full safety inspection has been done. Ideally the investigation into the cause of the emergency has been completed and all damaged property has been salvaged and restored before returning. During and after an emergency, the safety of personnel must be monitored, any remaining hazards must be assessed, and security must be maintained at the scene. After all safety precautions have been taken, an inventory of damaged and undamaged property must be done to begin salvage and restoration tasks. Also, the site must not be reoccupied until all on-site investigative processes have been completed. Detailed records must be kept of all disaster-related costs, and valuations must be made of the effect of the business interruption.[*]

All elements discussed here involve well-coordinated logistical plans and resources. To manage and dispatch a recovery team, a salvage team, and perhaps a resumption team is a major effort, and the short descriptions we have here should not give the impression that this is anything less than a very serious task.

Other Recovery Issues

Several other issues must be discussed as important elements of a disaster scenario:

When an emergency occurs that could potentially have an impact outside the facility, the public must be informed, regardless of whether there is any immediate threat to public safety. The disaster recovery plan should include determinations of the audiences that may be affected by an emergency and procedures to communicate with them. Information the public will want to know could include public safety or health concerns, the nature of the incident, the remediation effort, and future prevention steps. Common audiences for information could include:

Since the media is such an important link to the public, disaster plans and tests must contain procedures for addressing the media and communicating important information. A trained spokesperson should be designated, and established communications procedures should be prepared. Accurate and approved information should be released in a timely manner, without speculation, blame, or obfuscation.

Interfacing with External Groups

Quite often the organization may be well equipped to cope with a disaster in relation to its own employees, but it overlooks its relationship with external parties. The external parties could be municipal emergency groups such as police, fire, EMS, medical, or hospital staff; they could be civic officials, utility providers, the press, customers, or shareholders. How all personnel, from senior management on down, interact with these groups will impact the success of the disaster recovery effort. The recovery plan must clearly define steps and escalation paths for communications with these external groups.

One of the elements of the plan will be to identify how close the operations site is to emergency facilities: medical (hospital, clinic), police, and fire. The timeliness of the response of emergency groups will have a bearing on implementation of the plan when a disruptive event occurs.

Employee Relations

Another important facet of the disaster recovery plan is how the organization manages its relationship with its employees and their families. In the event of a major life- or safety-endangering event, the organization has an inherent responsibility to its employees (and families, if the event is serious enough). The organization must make preparations to be able to continue salaries even when business production has stopped. This salary continuance may be for an extended period of time, and the company should be sure its insurance can cover this cost, if needed. Also, the employees and their families may need additional funds for various types of emergency assistance for relocation or extended living support, as can happen with a major natural event such as an earthquake or flood.

Fraud and Crime

Other problems related to the event may crop up. Beware of those individuals or organizations that may seek to capitalize financially on the disaster by exploiting security concerns or other opportunities for fraud. In a major physical disaster, vandalism and looting are common occurrences. The plan must consider these contingencies.

Financial Disbursement

An often-overlooked facet of the disaster will be expense disbursement. Procedures for storing signed, authorized checks off-site must be considered in order to facilitate financial reimbursement. Also, the possibility that the expenses incurred during the event may exceed the emergency manager’s authority must be addressed.

Media Relations

A major part of any disaster recovery scenario involves the media. An important part of the plan must address dealing with the media and with civic officials. It’s important for the organization to prepare an established and unified organizational response that will be projected by a credible, trained, informed spokesperson. The company should be accessible to the media so they don’t go to other sources; report your own bad news so as to not appear to be covering up. Tell the story quickly, openly, and honestly to avoid suspicion or rumors. Before the disaster, as part of the plan, determine the appropriate clearance and approval processes for the media. It’s important to take control of dissemination of the story quickly and early in the course of the event.

[*]Source: FEMA, “Emergency Management Guide for Business and Industry,” August 1998.

Assessment Questions

You can find the answers to the following questions in Appendix A.

1. 

Which of the following choices is the first priority in an emergency?

  a. Communicating to employees’ families the status of the emergency
  b. Notifying external support resources for recovery and restoration
  c. Protecting the health and safety of everyone in the facility
  d. Warning customers and contractors of a potential interruption of service

2. 

Which of the following choices is not considered an appropriate role for senior management in the business continuity and disaster recovery process?

  a. Delegate recovery roles
  b. Publicly praise successes
  c. Closely control media and analyst communications
  d. Assess the adequacy of information security during the disaster recovery

3. 

Why is it so important to test disaster recovery plans frequently?

  a. The businesses that provide subscription services may have changed ownership.
  b. A plan is not considered viable until a test has been performed.
  c. Employees may get bored with the planning process.
  d. Natural disasters can change frequently.

4. 

Which of the following types of tests of disaster recovery/emergency management plans is considered the most cost-effective and efficient way to identify areas of overlap in the plan before conducting more demanding training exercises?

  a. Full-scale exercise
  b. Walk-through drill
  c. Table-top exercise test
  d. Evacuation drill

5. 

Which type of backup subscription service will allow a business to recover quickest?

  a. A hot site
  b. A mobile or rolling backup service
  c. A cold site
  d. A warm site

6. 

Which of the following represents the most important first step in creating a business resumption plan?

  a. Performing a risk analysis
  b. Obtaining senior management support
  c. Analyzing the business impact
  d. Planning recovery strategies

7. 

What could be a major disadvantage to a mutual aid or reciprocal type of backup service agreement?

  a. It is free or at a low cost to the organization.
  b. The use of prefabricated buildings makes recovery easier.
  c. In a major emergency, the site may not have the capacity to handle the operations required.
  d. Annual testing by the Info Tech department is required to maintain the site.

8. 

In developing an emergency or recovery plan, which of the following would not be considered a short-term objective?

  a. Priorities for restoration
  b. Acceptable downtime before restoration
  c. Minimum resources needed to accomplish the restoration
  d. The organization’s strategic plan

9. 

When is the disaster considered to be officially over?

  a. When the danger has passed and the disaster has been contained
  b. When the organization has processing up and running at the alternate site
  c. When all the elements of the business have returned to normal functioning at the original site
  d. When all employees have been financially reimbursed for their expenses

10. 

When should the public and media be informed about a disaster?

  a. Whenever site emergencies extend beyond the facility
  b. When any emergency occurs at the facility, internally or externally
  c. When the public’s health or safety is in danger
  d. When the disaster has been contained

11. 

What is the number one priority of disaster response?

  a. Resuming transaction processing
  b. Personnel safety
  c. Protecting the hardware
  d. Protecting the software

12. 

Which of the following is the best description of the criticality prioritization goal of the Business Impact Assessment (BIA) process?

  a. The identification and prioritization of every critical business unit process
  b. The identification of the resource requirements of the critical business unit processes
  c. The estimation of the maximum downtime the business can tolerate
  d. The presentation of the documentation of the results of the BIA

13. 

Which of the following most accurately describes a business impact analysis (BIA)?

  a. A program that implements the strategic goals of the organization
  b. A management-level analysis that identifies the impact of losing an entity’s resources
  c. A prearranged agreement between two or more entities to provide assistance
  d. Activities designed to return an organization to an acceptable operating condition

14. 

What is considered the major disadvantage to employing a hot site for disaster recovery?

  a. Exclusivity is assured for processing at the site.
  b. Maintaining the site is expensive.
  c. The site is immediately available for recovery.
  d. Annual testing is required to maintain the site.

15. 

Which of the following is not considered an appropriate role for Financial Management in the business continuity and disaster recovery process?

  a. Tracking the recovery costs
  b. Monitoring employee morale and guarding against employee burnout
  c. Formally notifying insurers of claims
  d. Reassessing cash flow projections

16. 

Which of the following is the most accurate description of a warm site?

  a. A backup processing facility with adequate electrical wiring and air conditioning but no hardware or software installed
  b. A backup processing facility with most hardware and software installed, which can be operational within a matter of days
  c. A backup processing facility with all hardware and software installed and 100 percent compatible with the original site, operational within hours
  d. A mobile trailer with portable generators and air conditioning

17. 

Which of the following is not one of the five disaster recovery plan testing types?

  a. Simulation
  b. Checklist
  c. Mobile
  d. Full Interruption

18. 

Which of the following choices is an example of a potential hazard due to a technological event, rather than a human event?

  a. Sabotage
  b. Financial collapse
  c. Mass hysteria
  d. Enemy attack

19. 

Which of the following is not considered an element of a backup alternative?

  a. Electronic vaulting
  b. Remote journaling
  c. Warm site
  d. Checklist

20. 

Which of the following choices refers to a business asset?

  a. Events or situations that could cause a financial or operational impact to the organization
  b. Protection devices or procedures in place that reduce the effects of threats
  c. Competitive advantage, credibility, or goodwill
  d. Personnel compensation and retirement programs

21. 

Which of the following statements is not correct regarding the role of the recovery team during the disaster?

  a. The recovery team must be the same as the salvage team, because they perform the same function.
  b. The recovery team is often separate from the salvage team, because they perform different duties.
  c. The recovery team’s primary task is to get predefined critical business functions operating at the alternate processing site.
  d. The recovery team will need full access to all backup media.

22. 

Which of the following choices is incorrect regarding when a BCP, DRP, or emergency management plan should be evaluated and modified?

  a. Never; once it has been fully tested, it should not be changed.
  b. Annually, in a scheduled review.
  c. After training drills, tests, or exercises.
  d. After an emergency or disaster response.

23. 

When should security isolation of the incident scene start?

  a. Immediately after the emergency is discovered
  b. As soon as the disaster plan is implemented
  c. After all personnel have been evacuated
  d. When hazardous materials have been discovered at the site

24. 

Which of the following is not a recommended step to take when resuming normal operations after an emergency?

  a. Reoccupy the damaged building as soon as possible.
  b. Account for all damage-related costs.
  c. Protect undamaged property.
  d. Conduct an investigation.

25. 

Which of the following would not be a good reason to test the disaster recovery plan?

  a. Testing verifies the processing capability of the alternate backup site.
  b. Testing allows processing to continue at the database shadowing facility.
  c. Testing prepares and trains the personnel to execute their emergency duties.
  d. Testing identifies deficiencies in the recovery procedures.

26. 

Which of the following statements is not true about the post-disaster salvage team?

  a. The salvage team must return to the site as soon as possible regardless of the residual physical danger.
  b. The salvage team manages the cleaning of equipment after smoke damage.
  c. The salvage team identifies sources of expertise to employ in the recovery of equipment or supplies.
  d. The salvage team may be given the authority to declare when operations can resume at the disaster site.

27. 

Which of the following is the most accurate statement about the results of the disaster recovery plan test?

  a. If no deficiencies were found during the test, then the plan is probably perfect.
  b. The results of the test should be kept secret.
  c. If no deficiencies were found during the test, then the test was probably flawed.
  d. The plan should not be changed no matter what the results of the test.

28. 

Which statement is true regarding the disbursement of funds during and after a disruptive event?

  a. Because access to funds is rarely an issue during a disaster, no special arrangements need to be made.
  b. No one but the finance department should ever disburse funds during or after a disruptive event.
  c. In the event senior-level or financial management is unable to disburse funds normally, the company will need to file for bankruptcy.
  d. Authorized, signed checks should be stored securely off-site for access by lower-level managers in the event senior-level or financial management is unable to disburse funds normally.

29. 

Which statement is true regarding company/employee relations during and after a disaster?

  a. The organization has a responsibility to continue salaries or other funding to the employees and families affected by the disaster.
  b. The organization’s responsibility to the employee’s families ends when the disaster stops the business from functioning.
  c. Employees should seek any means of obtaining compensation after a disaster, including fraudulent ones.
  d. Senior-level executives are the only employees who should receive continuing salaries during the disruptive event.

30. 

Which of the following choices is the correct definition of a Mutual Aid Agreement?

  a. A management-level analysis that identifies the impact of losing an entity’s resources
  b. An appraisal or determination of the effects of a disaster on human, physical, economic, and natural resources
  c. A prearranged agreement to render assistance to the parties of the agreement
  d. Activities taken to eliminate or reduce the degree of risk to life and property

31. 

Which of the following most accurately describes a business continuity program?

  a. An ongoing process to ensure that the necessary steps are taken to identify the impact of potential losses and maintain viable recovery
  b. A program that implements the mission, vision, and strategic goals of the organization
  c. A determination of the effects of a disaster on human, physical, economic, and natural resources
  d. A standard that allows for rapid recovery during system interruption and data loss

32. 

Which of the following would best describe a cold backup site?

  a. A computer facility with electrical power and HVAC, all needed applications installed and configured on the file/print servers, and enough workstations present to begin processing
  b. A computer facility with electrical power and HVAC but with no workstations or servers on-site prior to the event and no applications installed
  c. A computer facility with no electrical power or HVAC
  d. A computer facility available with electrical power and HVAC and some file/print servers, although the applications are not installed or configured, and all the needed workstations may not be on site or ready to begin processing

33. 

Which of the following would best describe a tertiary site?

  a. A computer facility with no electrical power
  b. A secondary backup site
  c. Remote journaling
  d. A mobile trailer with portable generators

Answers

1. 

Answer: c

Life safety, or protecting the health and safety of everyone in the facility, is the first priority in an emergency or disaster.

2. 

Answer: d

The tactical assessment of information security is a role of information management or technology management, not senior management.

3. 

Answer: b

A plan is not considered functioning and viable until a test has been performed. An untested plan sitting on a shelf is useless and might even have the reverse effect of creating a false sense of security. Although the other answers, especially a, are good reasons to test, b is the primary reason.

4. 

Answer: c

In a table-top exercise, members of the emergency management group meet in a conference room setting to discuss their responsibilities and how they would react to emergency scenarios.

5. 

Answer: a

Warm and cold sites require more work after the event occurs to get them to full operating functionality. A mobile backup site might be useful for specific types of minor outages, but a hot site is still the main choice of backup processing site.

6. 

Answer: b

The business resumption, or business continuity plan, must have total, highly visible senior management support.

7. 

Answer: c

The site might not have the capacity to handle the operations required during a major disruptive event. Mutual aid might be a good system for sharing resources during a small or isolated outage, but a major natural or other type of disaster can create serious resource contention between the two organizations, both of which may be affected simultaneously.

8. 

Answer: d

The organization’s strategic plan is considered a long-term goal.

9. 

Answer: c

The disaster is officially over when all the elements of the business have returned to normal functioning at the original site. It’s important to remember that a threat to continuity exists when processing is being returned to its original site after salvage and cleanup have been done.

10. 

Answer: a

When an emergency occurs that could potentially have an impact outside the facility, the public must be informed, regardless of whether there is any immediate threat to public safety.

11. 

Answer: b

The number one function of all disaster response and recovery is the protection of the safety of people; all other concerns are vital to business continuity but are secondary to personnel safety.

12. 

Answer: a

The three primary goals of a BIA are criticality prioritization, maximum downtime estimation, and identification of critical resource requirements. Answer d is a distracter.

13. 

Answer: b

A business impact analysis (BIA) measures the effect of resource loss and escalating losses over time in order to provide the entity with reliable data upon which to base decisions on hazard mitigation and continuity planning. Answer a is a definition of a disaster/emergency management program. Answer c describes a mutual aid agreement. Answer d is the definition of a recovery program.

14. 

Answer: b

A hot site is commonly used for those extremely time-critical functions that the business must have up and running to continue operating, but the expense of duplicating and maintaining all the hardware, software, and application elements is a serious resource drain to most organizations.

15. 

Answer: b

Monitoring employee morale and guarding against employee burnout during a disaster recovery event is the proper role of human resources.

16. 

Answer: b

17. 

Answer: c

18. 

Answer: b

A financial collapse is considered a technological potential hazard, whereas the other three are human events.

19. 

Answer: d

A checklist is a type of disaster recovery plan test. Electronic vaulting is the batch transfer of backup data to an off-site location. Remote journaling is the parallel processing of transactions to an alternate site. A warm site is a backup processing alternative.

20. 

Answer: c

Answer a is a definition for a threat. Answer b is a description of mitigating factors that reduce the effect of a threat, such as an uninterruptible power supply (UPS), sprinkler systems, or generators. Answer d is a distracter.

21. 

Answer: a

The recovery team performs different functions from the salvage team. The recovery team’s primary mandate is to get critical processing reestablished at an alternate site. The salvage team’s primary mandate is to return the original processing site to normal processing environmental conditions.

22. 

Answer: a

Emergency management plans, business continuity plans, and disaster recovery plans should be regularly reviewed, evaluated, modified, and updated. At a minimum, the plan should be reviewed at an annual audit.

23. 

Answer: a

Isolation of the incident scene should begin as soon as the emergency has been discovered.

24. 

Answer: a

Reoccupying the site of a disaster or emergency should not be undertaken until a full safety inspection has been done, an investigation into the cause of the emergency has been completed, and all damaged property has been salvaged and restored.

25. 

Answer: b

The other three answers are good reasons to test the disaster recovery plan.

26. 

Answer: a

Salvage cannot begin until all physical danger has been removed or mitigated and emergency personnel have returned control of the site to the organization.

27. 

Answer: c

The purpose of the test is to find weaknesses in the plan. Every plan has weaknesses. After the test, all parties should be advised of the results, and the plan should be updated to reflect the new information.

28. 

Answer: d

Authorized, signed checks should be stored securely off-site for access by lower-level managers in the event senior-level or financial management is unable to disburse funds normally.

29. 

Answer: a

The organization has an inherent responsibility to its employees and their families during and after a disaster or other disruptive event. The company must be insured to the extent that it can properly compensate its employees and their families. Conversely, employees do not have the right to obtain compensatory damages fraudulently if the organization cannot compensate them.

30. 

Answer: c

A mutual aid agreement is used by two or more parties to provide for assistance if one of the parties experiences an emergency. Answer a describes a business continuity plan. Answer b describes a damage assessment, and answer d describes risk mitigation.

31. 

Answer: a

A business continuity program is an ongoing process supported by senior management and funded to ensure that the necessary steps are taken to identify the impact of potential losses, maintain viable recovery strategies and recovery plans, and ensure continuity of services through personnel training, plan testing, and maintenance. Answer b describes a disaster/emergency management program. Answer c describes a damage assessment. Answer d is a distracter.

32. 

Answer: b

A computer facility with electrical power and HVAC, with workstations and servers not present (but available to be brought on-site when the event begins) and no applications installed, is a cold site. Answer a describes a hot site, and answer d describes a warm site. Answer c is just an empty room.

33. 

Answer: b

A “tertiary site” is a secondary backup site that can be used in case the primary backup site is not available.
