Microsoft SQL Server 2000 High Availability

You can make individual parts of a system fault tolerant. You can design in redundancy for the systems, making the solution highly available. But what happens when a catastrophe strikes and your fault-tolerant, redundant, highly available systems do not weather the proverbial storm ? This is where disaster recovery comes into play.

Many will argue that high availability and disaster recovery are not related , and some incorrectly mistake high availability for disaster recovery. The truth is that high availability planning must include the entire scope of what it takes to keep an application, business, or other enterprise up and running. Technology that makes your solutions fault tolerant, such as failover clustering, might not be sufficient for your disaster recovery needs. For example, although a technology like log shipping can be employed as a primary method of availability, it can also be used in a disaster recovery scenario. People and processes are crucial to all aspects of high availability, and they also form the backbone of most aspects of disaster recovery. This chapter guides you through planning and preparation for disaster recovery, as well as what you can do from a SQL Server perspective to recover your systems.

Planning for Disaster Recovery

There is no easy or add on solution to disaster recovery needs. Disaster recovery planning does not start late in the cycle; it starts from the first day of any project, when you are first assessing risk. Contrary to popular belief, each disaster recovery plan is unique, even for the same technology, because companies are different and their tolerances are different. When you seriously plan for disaster recovery and have your local high availability story well defined, two overriding questions govern the rest of the discussion:

Run Book

There are several benefits to keeping historical records of your environment, including every change made to your servers. The first is, obviously, to help you rebuild it in the event of a disaster. For this reason a complete document, sometimes known as a run book , is crucial to disaster recovery planning and execution. A run book can contain a subset of the configuration information or a very high level of detail, depending on what suits your needs. The run book should also include contact information, disaster recovery plans, and so on.

The run book can be referred to on a daily basis for an easy way to get information about a system. If you keep contact and related system information in it, it can also help you identify systems that send or receive data from your systems and the people who need to be contacted in emergencies (or those who might be willing to answer some questions if, for example, you re researching elusive performance problems).

A run book does not have to be an actual printed book, although you should be sure to print several copies on a regular basis. One copy (whether electronic or printed) should go out with the backup tapes so that the system can be restored in the event of a disaster at the main site. In addition to printing the run book, you could also put the information on a Web site that would be accessible by only authorized users. This could be on an intranet or a secure extranet. For example, you could have a Web page that shows a grid of all the latest service packs , hotfixes, system hardware configuration, and so on. You could also pull that information into a SQL Server database or Extensible Markup Language (XML) and base some self-healing maintenance on that as well. You should also document the organization and procedures for the configuration management of a specific product, project, system, support group , or service.

Items to Place in a Run Book

The following categorized list can be used as the basis for a run book or customized operations guide for your database system. This list is based on Microsoft SQL Server 2000, but it could apply equally to other versions, in concept, and it outlines the configuration items you would use to maintain a high standard of operations support. The following list has been reviewed and approved by operations and field personnel within Microsoft, and it represents the current opinions and best practices of those groups.

On the CD

A sample run book can be found on the CD-ROM in the file Run_Book.doc. This can be used and modified to fit your environment; it serves as a good starting point. In addition, many of the worksheets and templates included for use with other chapters in this book can be used as the starting point for information for your run book.

Important

A few of the items that follow note that you must document passwords. First, make sure this is not in violation of any corporate policy. However, if you cannot document passwords needed for vital accounts during a restore, you must ensure that people who know them can be on site during a recovery operation.

SQL Server Administrative Information

There is a minimum set of information that you must keep to fulfill your responsibilities as database system administrator:

Analysis Services Administrative Information

There is a minimum set of information you must record if you are using Analysis Services in your solutions:

Application System Information

There is a minimum set of information you must record about the non-SQL Server portions of your solutions:

Database Components

There is a minimum set of information you must record about your databases:

Server Configuration

There is a minimum set of information you must record about your servers to use during disaster recovery:

SQL Server Configuration

There is a minimum set of information you must record about each SQL Server instance s configuration for use in a disaster recovery scenario:

Network Information and Configuration

There is a minimum set of information you must record about your network topology for use in a disaster recovery scenario:

Storage Information and Configuration

There is a minimum set of information you must record about your disk subsystem configurations for use in a disaster recovery scenario:

Other Necessary Information

This is a list of other items that should also be recorded that do not fall under one of the preceding categories:

SLAs, Risk, and Disaster Recovery

Think back to Chapter 2, The Basics of Achieving High Availability, and the discussion there of SLAs. As crucial as they are to high availability, they are even more crucial to disaster recovery. This is where the difference in downtime ”planned or unplanned ”of minutes, hours, or even days comes into play. SLAs, in the form of support contracts, must exist for every physical and software component of a system to ensure that your SLAs with all parties are met. If you cannot meet your end user SLA because, for example, your disk subsystem was down and you did not have a proper SLA in place for it, that could be a problem. Now, in some cases this might not be possible, but you need to try. SLAs typically encompass issues such as physical security, money, uptime, and response times.

Obviously, the cost of buying the right SLA must be considered . Does the risk justify the financial cost of addressing it? For example, your company s off-site data storage facility might have limited hours. This is probably not a wise choice, and you would need to find a better solution. In the meantime, management decides to put their off-site backups in a bank safe deposit box for easy access. This is a cost- effective solution because it costs only $100 per year. However, only certain individuals can get at the tapes and only during banking hours. Alternatively, you might decide to let the DBA on the night rotation take home a set of tapes. This is not only a risk, it might also violate corporate security policy.

When you are purchasing your SLAs, do not take only high availability into account. Remember disaster recovery, too.

Planning Step 1: Assessing Risk and Defining Dependencies

Disaster recovery, like local high availability, is about mitigating risk. Risk comes in several varieties, such as risk to the business, risk to human life, and risk to technology. Unfortunately, assessing risk is a complex task and each company s risks and tolerance for those risks will be different. The following steps show you the thought process to follow and how to assess risk in your company.

Example Scenario

You have implemented a customized Microsoft Solution for Internet Business (MSIB). This encompasses IIS, SQL Server 2000, Microsoft BizTalk, and Microsoft Commerce Server. This particular solution involves two SQL Server 2000 failover clusters, each with two instances of SQL Server on them.

First, identify the physical servers that are part of this solution and their names. This extends to not only server functionality, but domain controllers and so on. Missing even one server could prove fatal. Taking an inventory of your environment, you see that you have the following:

Next, take an inventory of all of the software configured on each server and make a detailed list, including versions. A final list should also include service packs and hotfixes. Here is a preliminary list for the example environment:

Now it is time to gather information about the instances of SQL Server. The instance names are as follows :

Table 12-1 lists the databases.

Table 12-1: Databases

Database

SQL Instance

Use

CustomDB3

SQL2A

CustomDB3

CustomDB8

SQL2A

CustomDB3

CustomDB4

SQL2A

CustomDB3

CUSTOMDB1

SQL1B\INST2

CUSTOMDB1

Company_CUSTOMDB1

STAGINGSQL1

CUSTOMDB1

ETSBizTalkReference

SQL1A

BizTalk

CustomDB2

SQL1A

BizTalk

InterchangeBTM

SQL1B\INST2

BizTalk

InterchangeDTA

SQL1B\INST2

BizTalk

InterchangeSQ

SQL1A

BizTalk

MCMS_Data

SQL2A

Content Management Server

MSCS_Admin

SQL2B\INST2

Commerce Server

MSCS_Admin

STAGINGSQL1

Commerce Server

MSCS_BDPermissions

SQL2B\INST2

Commerce Server

MSCS_BDPermissions

STAGINGSQL1

Commerce Server

MSCS_Campaigns

SQL2B\INST2

Commerce Server

MSCS_Campaigns

STAGINGSQL1

Commerce Server

MSCS_CatalogScratch

SQL2B\INST2

Commerce Server

MSCS_CatalogScratch

STAGINGSQL1

Commerce Server

MSCS_EndeavorCE

STAGINGSQL1

Commerce Server

MSCS_ProductCatalog

SQL2B\INST2

Commerce Server

MSCS_ProductCatalog

STAGINGSQL1

Commerce Server

MSCS_Profiles

SQL2B\INST2

Commerce Server

MSCS_Profiles

STAGINGSQL1

Commerce Server

MSCS_TransactionConfig

SQL2B\INST2

Commerce Server

MSCS_TransactionConfig

STAGINGSQL1

Commerce Server

MSCS_Transactions

SQL2B\INST2

Commerce Server

MSCS_Transactions

STAGINGSQL1

Commerce Server

Inventory

SQL2A

CustomDB3

Inventory_Asia

SQL2A

CustomDB3

Inventory_SA

SQL2A

CustomDB3

InventoryArchive

SQL2A

CustomDB3

InventoryArchive2

SQL2A

CustomDB3

CustomDB9

SQL2A

CustomDB3

CustomDB2

SQL2A

CustomDB3

CustomDB11

SQL2A

CustomDB3

CustomDB5

SQL2A

CustomDB3

CustomDB7

SQL2A

CustomDB3

XLANG

SQL1A

BizTalk

Table 12-2 lists database dependencies and their impact on availability.

Table 12-2: Database Dependencies and Impact on Solution Availability

Database

Impact

Dependencies (N/A if None)

CustomDB6

CustomDB3 (North America) will not work if this database is not running.

Databases: CustomDB1, CustomDB8, Inventory, CustomDB9

CustomDB1 (production)

This is a crucial database for the solution. It will, over time, also contain tables for other applications. It is a central point of integration.

All CustomDB3 databases

CustomDB7

These are the database lookups for BizTalk Server (BTS) maps. Some tolerance where CustomDB1, BTM, DTA, and XLANG die here.

N/A

CustomDB5

Risk is low for losing data because it can be recovered elsewhere, but is part of the whole CustomDB2 process.

N/A

InterchangeBTM

Contains the definition of BTS objects. This will not change often, but is crucial for BTS being up.

N/A

InterchangeDTA

This is the logging database for BTS. The use of the textimage field is configurable, and right now it is turned on. There is a low risk to the solution from a high level if this database fails, and it is used mainly for archiving and reporting purposes. However, it is actually a high risk to the solution because if BTS cannot log, BizTalk will not be able to work. You will not lose transactions if this happens, though.

N/A

InterchangeSQ

Shared Queue database. This is very important, as it contains the state of anything received and recorded. If you lose this database, you lose the transaction. CustomDB2 system dials every 60 minutes, CustomDB3 data (spike between 9 A.M. and 12 P.M. North America and Latin America, Asia 10 hours prior, EMEA 6 hours prior).

N/A

MCMS_Data

Content Management Server will be down if this database is not available, and the entire solution will be down if this database is down because it contains all the site content.

Commerce Server databases (from a restore standpoint)

MSCS_Admin (production)

Will affect Commerce Server as it is the administrative definition database.

MCMS_Data (from a restore standpoint)

MSCS_BDPermissions (production)

Like the MSCS_Admin, this is a configuration database mainly used with BizDesk. BizDesk will be affected if this is down, but site will not go down.

MCMS_Data (from a restore standpoint)

MSCS_Campaigns (production only)

This will not affect solution because it is not in use.

N/A

MSCS_CatalogScratch (production only)

Used for virtual catalog generation. There is no need to back up or worry about it because it only affects BizDesk.

N/A

MSCS_ProductCatalog (production only)

Although the site will not shut down if this database is unavailable (it is the master product catalog), you cannot navigate the solution. You will still be able to register users as long as other databases are up.

MCMS_Data (from a restore standpoint), MSCS_Transactions (GUID orphans), MSCS_ProductCatalog (staging)

MSCS_ProductCatalog (staging)

Very important to the solution, as it contains the copy of the catalog that is pushed out to the production server.

N/A

MSCS_Profiles (production)

If down, no one will be able to buy anything on the site or register users.

MCMS_Data (from a restore standpoint)

MSCS_TransactionConfig (production)

This contains the definition of all pipeline configurations. It is updated with code push. It is needed for BTS operation, but can be re-created as long as staging is available. If it is down, users will not be able to buy anything on the site.

MSCS_TransactionConfig (staging)

MSCS_Transactions (production)

This is the heart of BTS; it contains all completed transactions and it will definitely affect the solution. If this database is down, users will not be able to buy anything on the site.

MCMS_Data, MSCS_ProductCatalog

Inventory_USA

This has a minimal impact; CustomDB3 will still be up and running if this database is not available.

CustomDB1, CustomDB8, CustomDB6, CustomDB9

Inventory_Asia

This has a minimal impact; CustomDB3 will still be up and running if this database is not available.

CustomDB1, CustomDB10, CustomDB12

Inventory_SA

This has a minimal impact; CustomDB3 will still be up and running if this database is not available.

CustomDB1, CustomDB11, CustomDB13

InventoryArchive

This has a minimal impact on just archived data for CustomDB3.

CustomDB1

InventoryArchive2

This has a minimal impact on just archived data for CustomDB3.

CustomDB1

CustomDB9

CustomDB3 will not work if this database is down. Customers and partners are in here, and it contains all item definitions.

CustomDB1, CustomDB8, CustomDB6, Inventory_USA

CustomDB2

CustomDB3 (Asia) will not work if this database is not running.

CustomDB1, Inventory_Asia, CustomDB12

CustomDB11

CustomDB3 (South America) will not work if this database is not running.

CustomDB1, Inventory_SA, CustomDB13

CustomDB5

CustomDB3 (Asia) will not work if this database is not running.

CustomDB1, Inventory_Asia, CustomDB10

CustomDB7

CustomDB3 (South America) will not work if this database is not running.

CustomDB1, Inventory_SA, CustomDB11

XLANG

Similar to InterchangeSQ, but stores state for orchestration schedules. This is very important to keeping BTS up.

N/A

Once the dependencies are known, you can then establish the facts for each product that will influence any disaster recovery planning.

Known Facts About Servers

The following is a list of facts for each server product.

BizTalk

Commerce Server

Content Management Server

CustomDB1, CustomDB2, and CustomDB3

Other Servers

Risks and Unknowns

The following is a list of risks and unknowns that must be accounted for:

Planning Step 2: Putting the Plan Together

Once you have assembled your lists of known facts, risks, and dependencies and have all pertinent information in hand (such as SLAs), you can put the plan together. This plan should include the exact steps it will take to recover the system and who is responsible for what task (job roles mainly; a list maintained in the run book should tell you who is currently responsible for the action). Other information that you need to put into your disaster recovery plan includes the following:

When All Else Fails, Go to Plan B

Do not be naive and assume that your plan is perfect. There could be something beyond your control (such as data corruption or inability to get a new disk drive) that might mean you are down for a longer time. Always have a trick up your sleeve and create a backup plan so that if your primary plan fails, you can still get up and running sooner rather than much, much later.

Tip

You might even want to execute Plan B in parallel with Plan A so that there is no lag time in the event Plan A fails.

More Info

For links to SQL Server “specific disaster recovery Knowledge Base articles that contain additional information, consult Knowledge Base article 307775, INF: Disaster Recovery Articles for Microsoft SQL Server. You might also want to print some of these articles and incorporate them into your test plans or run book.

Категории