Upgrading and Repairing Servers

Eventually every system fails, and occasionally systems fail in very unpredictable ways, such as the following:

  • A construction crew cuts through the main fiber-optic cable and disables all data communication to your city.

  • Your building is struck by an act of terrorism.

  • A water main leak or hurricane floods your entire data center.

  • Your building is swallowed up by the earth during an earthquake.

  • A computer virus destroys all of your systems' boot files or a hacker gets into the system and takes control.

  • Solar flares knock out a power station in Ontario, which disables the entire North American power grid.

  • Someone causes a fire in your building.

  • Mergatroid in IT successfully upgrades your domain controllers but fails to migrate all the user accounts before overwriting all the information. Heavens!

All the aforementioned qualify as disasters, and we all pray that we are spared problems like these. However, good practice necessitates that organizations do disaster planning to keep critical business processes working or to recover from these problems as quickly and as cost-effectively as possible.

The Purpose of Disaster Recovery Planning

When disaster strikes, your first inclination when trying to address the problem is likely to be wrong, and in fact may make things worse. In an emergency, there's a very strong tendency to act first and think second. A well-thought-out disaster recovery plan is an essential component of any well-run computing center. In some locales, planning for business continuity is not just good business practice; it is the law.

The purpose of disaster planning is to codify a set of rules and actions that are to be followed when a problem occurs. Disaster recovery takes time to plan, and if disaster doesn't strike, it's a cost in time and salaries you might be tempted to forgo. A plan is a lot like insurance: If you don't need it, it's a waste of money, but when you do need it, it can save your organization a considerable amount of money, improve the quality of the recovery, and greatly diminish your downtime.

Disaster recovery planning is part of an overall fault tolerance strategy. It's where you utilize all your backup systems and test your strategies. However, you don't have time to fix deficiencies in your systems while a disaster is under way, so every system that you count on for recovery must be tested beforehand. Just as with data backups themselves, you need to verify the integrity of your systems by doing the following:

  • If you have a data backup system you count on, do a restore from that system.

  • If you have a mirrored disk system, test it by taking one-half of the mirror offline, running from the second mirror, and then reestablishing the mirror.

  • If you have a clustered server system, try removing one of the nodes.

  • If you have a backup power system, pull the plug on your main system.

  • If there are people who are important in the chain of notification, call them to see if they respond.

There are many different tests that you can perform to check the viability of your recovery plan. The point is that none of them do any good if you run them for the first time when a disaster strikes. So as part of your recovery plan, you need a set of regular action items to test the systems you count on.
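As a minimal sketch of what "regular action items" could look like in practice, the following Python tracks when each recovery test last ran and flags the ones that are overdue. The test names come from the list above; the intervals, dates, and the `RecoveryTest` and `overdue_tests` names are hypothetical and would be set by your own policy.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class RecoveryTest:
    name: str            # what the test exercises
    interval_days: int   # hypothetical interval between runs
    last_run: date       # date the test last completed successfully

    def is_overdue(self, today: date) -> bool:
        # Overdue when more than interval_days have passed since last_run.
        return today - self.last_run > timedelta(days=self.interval_days)


def overdue_tests(tests, today):
    """Return the names of all tests that are past their interval."""
    return [t.name for t in tests if t.is_overdue(today)]


# Hypothetical schedule; every date and interval here is illustrative only.
tests = [
    RecoveryTest("Restore from backup", 30, date(2024, 1, 10)),
    RecoveryTest("Break and rebuild disk mirror", 90, date(2024, 1, 5)),
    RecoveryTest("Remove a cluster node", 90, date(2023, 9, 1)),
    RecoveryTest("Pull plug on main power", 180, date(2023, 12, 1)),
    RecoveryTest("Call notification chain", 30, date(2024, 2, 20)),
]

for name in overdue_tests(tests, date(2024, 3, 1)):
    print("OVERDUE:", name)
```

With these sample dates, the backup restore and the cluster node test are reported as overdue. A printed copy of such a schedule belongs with the plan itself, since the system that runs this check may be one of the systems that fails.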

An Example of a Disaster Recovery Plan

A disaster recovery plan should be an ongoing effort that results in a working document. Every year, at the appointed time, the document should be brought out and revised. People at every significant level of the organization should review and sign off on the plan. Disaster planning is not just an IT exercise; the level of loss that an organization is willing to endure or the amount of money that an organization is willing to pay to avoid a loss is really a business decision. There should be a reasonable calculation made to quantify the decisions made in the disaster recovery plan.
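One way to make that calculation concrete is annualized loss expectancy: the cost of a single incident multiplied by how many times per year it is expected to occur. The text does not prescribe this method; it is a common risk-analysis formula used here purely as an illustration, and every figure and function name below is a hypothetical assumption.

```python
def annualized_loss(single_loss_cost: float, incidents_per_year: float) -> float:
    """Expected yearly loss: cost of one incident times its yearly frequency."""
    return single_loss_cost * incidents_per_year


def mitigation_is_justified(loss_before: float, loss_after: float,
                            annual_mitigation_cost: float) -> bool:
    """True when the yearly reduction in expected loss exceeds the yearly cost."""
    return (loss_before - loss_after) > annual_mitigation_cost


# Hypothetical figures: an outage expected once every four years (0.25/year).
before = annualized_loss(200_000, 0.25)   # no recovery plan in place
after = annualized_loss(40_000, 0.25)     # with a tested recovery plan

print(before, after)                                  # 50000.0 10000.0
print(mitigation_is_justified(before, after, 25_000)) # True
print(mitigation_is_justified(before, after, 45_000)) # False
```

In this sketch the plan reduces expected yearly loss by 40,000, so it pays for itself at a yearly cost of 25,000 but not at 45,000. The inputs are estimates, which is why the business side of the organization needs to sign off on them.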

A disaster recovery plan should be written in the same way that any project plan is written. The plan should start with a clear and concise description of its purpose, there should be a table of contents for the issues the plan covers, and each issue should be written up in its own section. Sections should not only describe the issue and its potential solution(s) but also designate who is responsible for action.

Note

A disaster plan needs to be readily accessible when needed. If your disaster plan is found only on a computer system, it isn't going to do you any good if that system goes down. A disaster plan should be a paper document that is stored with emergency equipment, with a copy or set of copies stored offsite in logical locations.

Disaster recovery plans are a little different from a lot of other project plans. They don't include definite timelines, although they may specify how long operations should take. They must also specify how problems are identified and how to escalate actions to the next level when issues aren't resolved. A good disaster recovery plan should include flowcharts that illustrate how actions should flow.
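The escalation rule described above can be sketched as an ordered list of contacts that is walked until someone acknowledges the problem. This is only an illustration: the `Contact` fields, the `escalate` function, the names, and the phone numbers are all hypothetical, and a real plan would define its own chain.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    name: str
    role: str
    phone: str


def escalate(chain, acknowledged_by):
    """Try each contact in order.

    Returns (responder, tried): the first contact for whom acknowledged_by
    returns True (or None if nobody responds), and the names tried so far.
    """
    tried = []
    for contact in chain:
        tried.append(contact.name)
        if acknowledged_by(contact):
            return contact, tried
    return None, tried


# Hypothetical chain, listed in escalation order.
chain = [
    Contact("A. Operator", "On-call administrator", "555-0100"),
    Contact("B. Manager", "IT manager", "555-0101"),
    Contact("C. Director", "Recovery manager", "555-0102"),
]

# Simulate the first contact not answering.
reachable = {"B. Manager"}
responder, tried = escalate(chain, lambda c: c.name in reachable)
print(responder.name, tried)
```

Returning the list of contacts already tried keeps a record of the escalation, which is useful when the incident is reviewed afterward.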

Table 21.3 shows an example of the parts of a disaster recovery plan.

Table 21.3. The Parts of a Disaster Recovery Plan

| Section | Contents |
|---|---|
| Title page | The title of the document, with the names of the recovery managers, along with their titles, phone numbers, cell phone numbers, pager numbers, and email addresses. The point of the page is that even if someone doesn't open the document, he or she has enough information to take the next step. |
| Action summary | A one-page description of the purpose of the document. The summary describes the problems that are discussed in the document, how these issues are addressed (the chain of execution), and how the document is organized. It is important that the action summary be one page only and that there is enough information on that page to allow the reader to bring a problem to the attention of the person responsible. |
| Table of contents | A list of topics covered. |
| Topic title page | A one-line description of the problem. |
| Short description page | A description of how the problem can be detected and what the scope of the problem includes. |
| Notification page | A list of the persons to be notified for this particular issue, in an order that shows quickly how notification gets escalated. |
| System isolation page | A description of how systems should be taken offline if necessary or what steps should be taken immediately by the person onsite to avoid further damage. |
| Repair sequence page | A description of the proposed remedies, in the order in which they should be performed, along with the conditions that represent restoration of the service. |
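When the plan is revised each year, a simple completeness check can catch topics that are missing one of the pages listed in Table 21.3. The sketch below assumes that the pages from "Topic title page" through "Repair sequence page" repeat for each topic, which is one reading of the table; the `missing_pages` function and the sample topic are hypothetical.

```python
# Per-topic pages taken from Table 21.3 (assumed to repeat for every topic).
REQUIRED_TOPIC_PAGES = [
    "Topic title page",
    "Short description page",
    "Notification page",
    "System isolation page",
    "Repair sequence page",
]


def missing_pages(topic: dict) -> list:
    """Return the required pages that are absent or blank for one topic."""
    return [page for page in REQUIRED_TOPIC_PAGES
            if not topic.get(page, "").strip()]


# Hypothetical draft topic with two pages still unwritten.
flood_topic = {
    "Topic title page": "Data center flooding",
    "Short description page": "Water detected on the data center floor.",
    "Notification page": "Facilities on-call, then IT manager.",
    "Repair sequence page": "   ",
}

print(missing_pages(flood_topic))
```

For this draft the check reports the system isolation page and the repair sequence page, the latter because it contains only whitespace.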

With a fully developed disaster recovery plan, if and when a disaster strikes, you will be in a much better position to minimize the damage, contain the costs, and bring your systems and services back online much more quickly. At times of great difficulty, it is best not to have to spend time thinking through complex responses.
