Mastering Microsoft Exchange Server 2007 SP1
Too often when a consultant sets foot in a new organization that is having availability problems, the consultant starts talking about clustering, load balancing, storage area networks, replication technologies, and so on. Everyone in the room has that "deer in the headlights" look in their eyes, but all of those new technologies sure sound great!
Unfortunately, availability problems are frequently "people and process" related rather than technical. Poor documentation, insufficient training, lack of procedures, and improper preparation for supporting Exchange are the most frequent causes of downtime.
In the following sections, we will start with some of the basics of providing higher availability for your Exchange users.
One of the biggest issues for my clients is Exchange server disaster recovery. The first time a client brings up the subject, I tell them that disaster recovery is the last thing they should worry about. After they get up off the floor, I tell them that they should focus first on Exchange server reliability and availability, of which disaster recovery is but the last part. I go on to tell them that if they worry about the whole reliability and availability spectrum, they'll not only prepare themselves for serious non-disaster recovery scenarios, but they also might be able to avoid many of the disasters that haunt their nightly dreams.
Reliability vs. Availability
Don't sacrifice the reliability of your system just to make your availability numbers look good. In general, an hour of unplanned downtime to fix an impending problem is much more acceptable than an entire day of downtime if the impending problem actually causes a crash.
Tip: Don't sacrifice reliability for availability.
No one likes to have unplanned downtime, even if it is just for a few minutes because you need to reboot a server. An hour of unscheduled downtime in the middle of the business day is the sort of thing that users will remember for the next year. Consequently, many organizations avoid unplanned downtime entirely, even when the work could be done during off-hours with little impact on users.
Tip: Good reliability helps you sleep at night. Good availability helps you keep your job.
Some of the tasks that might need to be performed sooner than your scheduled maintenance window will allow include things like replacing a disk drive that is running in a degraded state or swapping out a failed power supply.
Scheduled Maintenance
People don't often think about high availability and scheduled downtime at the same time. However, there will always be maintenance tasks that require some scheduled downtime. When planning your weekly or monthly operations schedule, you need to make sure that you have a window available for maintenance. This window of time should be published to your user community, and they should expect that the system may be unavailable during those times.
Scheduled downtime may be a hard sell in your organization. This is especially true if you have users that work around the clock. Still, you can make a compelling argument for this downtime window if you think about all the things you might need to do during your scheduled downtime:
- Performing server reboots
- Applying configuration changes
- Installing service packs and critical updates
- Updating firmware and the FlashBIOS
- Performing database maintenance, such as enabling or managing LCR (local continuous replication) databases (see the example following this list)
- Replacing power supplies, disks, UPS batteries, and RAID controller batteries
- Reconfiguring IP addressing or networking equipment
- Addressing environmental issues such as air conditioning, UPS, power, and the security system
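For example, enabling local continuous replication (LCR) on an existing storage group and database is exactly the kind of work that fits into a maintenance window. A minimal sketch using the Exchange 2007 Enable-DatabaseCopy and Enable-StorageGroupCopy cmdlets might look like the following; the server name, storage group, database name, and copy paths are placeholders for illustration only:
# Tell Exchange where the passive copy of the database file should live
Enable-DatabaseCopy -Identity "SERVER01\First Storage Group\Mailbox Database" -CopyEdbFilePath "D:\LCR\Mailbox Database.edb"
# Enable LCR for the storage group, pointing the log and system file copies at a second disk
Enable-StorageGroupCopy -Identity "SERVER01\First Storage Group" -CopyLogFolderPath "D:\LCR\Logs" -CopySystemFolderPath "D:\LCR\System"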
Granted, not every organization needs a weekly maintenance window, but you should ask for it anyway. You don't have to use the maintenance window if you have no maintenance to perform.
When scheduling the window, take into consideration staffing concerns. After all, if you schedule maintenance once per week, someone knowledgeable has to be available during that time to perform the work.
Finally, when planning your scheduled downtime windows, make sure downtime during these windows does not count against your "nines."
The Quest for Nines
No respectable discussion about improving service availability is complete without a definition of "nines." Everyone is familiar with the quest to achieve 99.999 percent availability. This is a measure of the time you actually provide service (in our case, e-mail) compared with the time you have committed to providing it.
If you ask your users how many "nines" you should be providing, they will undoubtedly tell you that you should be operating at 99.999 percent availability. What does that mean? Let's assume that you are expected to provide e-mail services 24 hours per day, 7 days per week, 365 days per year. Here is a breakdown of different "nine" values and the maximum amount of unplanned downtime each allows per year:
Availability | Maximum unplanned downtime per year
99.999 percent | About 5.3 minutes
99.99 percent | About 53 minutes
99.9 percent | About 8.7 hours
99.7 percent | About 1 day
99 percent | About 3.7 days
Providing "five nines" is pretty hard to do; one server reboot during the business day and you have exceeded you maximum permissible downtime for the entire year. Even providing 99.99 percent availability is a hard target to meet, though this is certainly not unrealistic if you have designed in to your system redundancy, fault tolerance, and high-availability solutions.
If you were to plot the cost to implement a system on the y-axis against the desired nines on the x-axis, you would find that the cost climbs considerably as you approach (or pass) 99.999 percent availability.
For organizations that are having availability problems, we typically recommend that you first target somewhere between 99.7 and 99.9 percent availability. With good procedures, scheduled maintenance windows, reliable hardware, and properly configured software, you can meet this goal even without clustering.
The Process Is Just as Important as the Technology
Some of us are just techies at heart, but information technology is more than just providing technology to our user communities. We must take into consideration a lot of other factors besides the technology, including budget, customer service, availability, documentation, responsibility, and change control. The process of running and managing information technology is becoming just as important as the technology itself.
With respect to Exchange Server, there are a number of things that we recommend you document and indoctrinate into your organization. These include creating documentation, processes, and procedures for the following:
- Daily and weekly maintenance
- Keeping documentation up-to-date
- Security review and audit procedures
- Change management and configuration control procedures
- Service pack and update procedures
- Escalation procedures and who is responsible for making decisions
- Information technology acceptable use and ethics policies
Building a Reliable Platform
Recommending that you build a reliable platform for Exchange may seem a bit obvious. After all, everyone wants a reliable platform. Our recommendations are meant to help you make sure that your hardware and software platform is as reliable and stable as possible.
First and foremost, when you are choosing server hardware, don't skimp when it comes to choosing the hardware you will use to support your organization's e-mail servers. You may not need the "BMW 525i" of servers, but at the very least you should be planning to buy a reliable "Honda Civic" type server. Maybe our analogy is not very good, but the point is that you need a server from a major vendor. The server hardware should have a warranty and hardware support options for replacing failed components.
When you choose components for your server (such as additional RAM, disks, disk controllers, tape devices, and so on), you should purchase these components from your server vendor. This ensures that anything you put inside that server's case will be supported by the vendor, and it helps to ensure that the hardware and software integrate properly.
When choosing server hardware and components, make sure all components are on the Windows Server 2003 x64 Hardware Compatibility List. This is especially true if you are planning to implement clustering, volume shadow copy backups, storage area networks, or iSCSI storage. You can view the catalog of Windows Server-tested products at www.windowsservercatalog.com.
As you unbox your shiny new x64-capable server, remember that it has probably been sitting on a shelf in a warehouse for a few months or longer before it made it to your door. This means it may well be a little out-of-date: the vendor's setup CDs and the Windows Server 2003 R2 CD-ROM are both likely to ship with outdated device drivers. There are a few things we recommend you do:
- Download the latest server setup CD from your hardware vendor. Use this CD to install Windows on the server.
- Download any firmware and FlashBIOS updates that are relevant to the server's motherboard, backplane, network adapters, disk controllers, and so on.
- Confirm that you have the latest device drivers for additional hardware on your system. This includes network adapters, disk controllers, tape devices, host bus adapters, and so on.
- When formatting hard disks, use a 64KB allocation unit size (see the example following this list).
- Confirm that you have the correct versions of any third-party software that you will be using, such as antivirus software, backup software, and storage area network (SAN) or storage management software.
- If your server has dual power supplies, connect each power supply to a different UPS.
- As you are building your server, make sure that you have all the components you need and that they are ready to put into production. When you bring a server online for users, you want to know for sure that you won't need to come back in a few weeks to install additional software or hardware.
- Apply all service packs and critical updates that are available before putting the server into production.
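As a quick illustration of the allocation-unit recommendation above, the standard format command can set a 64KB allocation unit size when preparing an NTFS volume. The drive letter and volume label here are placeholders, and formatting of course erases everything on the volume:
format E: /FS:NTFS /V:ExchData /A:64K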
Fault Tolerance vs. High Availability
You have probably heard as many explanations of what fault tolerance and high availability mean as we have. Over the years, these two terms have been squashed together so many times that even knowledgeable people use them interchangeably. However, there is a distinct difference between fault tolerance and high availability.
Fault-tolerance features or components enable a system to offer error-free, nonstop availability in the event of a failure. Examples include mirrored disks or striping-with-parity volumes, as well as uninterruptible power supplies (UPSs) and redundant power supplies. If a fault-tolerant component fails, the system continues to operate without any interruption in service. Clustering is not a fault-tolerance solution.
High-availability solutions are things that you do in order to provide better availability for a specific service. Implementing fault tolerance is one of these things. However, while high-availability solutions promise "better" availability through the use of technologies such as clustering and replication, they do not offer complete tolerance from faults.
Implementing fault tolerance for disks, power supplies, and UPS systems is reasonably cost effective. However, implementing full-fledged fault-tolerance solutions for an application such as Exchange is very costly.
Complexity vs. Supportability
Do not install anything that you are not prepared to support yourself or for which you don't have someone on standby at the other end of a telephone ready to support you. We frequently see companies that have put into production high-availability solutions, failover solutions, backup software, or procedures that are beyond their ability to support. A good example of this is a company whose vendor convinced it to install a SAN-based replication solution and a customized failover script. However, no one within the customer's organization had any experience with the SAN or the replication software, or even knowledge of how the failover script worked.
No high-availability solution comes without ongoing costs. This cost may be in the form of training, additional personnel, consultant fees, or ongoing maintenance.
We don't want to discourage you from seeking solutions that will meet your needs, but approach these solutions with realistic expectations of what they will require to support. Keep these tips in mind:
- Practice simplicity in any system you build or design. Even when the system is complex, keep the components and procedures as simple as possible.
- For any solution that you implement, ask yourself how difficult it would be to explain to someone else who has to manage it.
- Document everything!
Outlook Local Cached Mode and Perceived Downtime
If a server crashes in the woods, but no one is around to hear it, was there truly downtime? Downtime and outages are a matter of perception. If no one was affected, then no one will be calling the help desk to complain. For Outlook 2003 and 2007 users, Microsoft has included a feature that allows Outlook to keep a complete copy of the mailbox cached on the local hard disk.
By default, Outlook 2003 and 2007 enable cached Exchange mode (see Figure 15.1), but in some older Outlook profiles this may have been turned off.
This setting won't help you with server reliability, but it can help with the user perception of those annoying little outages such as network problems or server reboots.
An Exchange server administrator can set a parameter on a user's mailbox that requires the use of cached mode. This is done using the Set-CASMailbox cmdlet. For example, if we want to require that Outlook 2003 or Outlook 2007 must be in cached Exchange mode for a certain user's mailbox, we would use this command:
Set-CASMailbox "Henrik.Walther" -MAPIBlockOutlookNonCachedMode $True
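If you want to confirm that the setting took effect (or check a mailbox before changing it), the corresponding Get-CASMailbox cmdlet returns the current value, for example:
Get-CASMailbox "Henrik.Walther" | Format-List Name,MAPIBlockOutlookNonCachedMode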