Windows Server 2003 on Proliants. Deployment Techniques and Management Tools for System Administrators
If you have dealt with troubleshooting AD for Windows 2000, you know how difficult it can be. Multimaster replication, dynamic DNS, the AD database, File Replication Service (FRS), security options, Group Policy, and other functions of AD make troubleshooting complex. And then you have to deal with security patches and virus updates just to keep the environment somewhat secure. It is impossible to condense my more than five years of experience troubleshooting AD into one section of one chapter. Instead, this section provides some guidelines on how to proceed with troubleshooting problems, lists some tools that are very powerful in Windows Server 2003, and offers a few tips and tricks I've picked up.
Define the Problem
The first step in troubleshooting is to step back from the immediate fire and assess the whole problem. To ascertain the root cause, you need to ask enough questions to determine the scope and magnitude of the problem. Of course, you also need to determine if there is a problem at all. We get customer calls all the time from Administrators who are worried because their event logs have some events they don't understand: warnings or even informational events. The first question we ask is, "Besides the events, what's broken?" Usually, the answer is "Nothing." That helps set the priority and expectations for solving the problem. We need to solve it, but we don't have to work through the night to do it. Following is a set of questions and actions I use to determine the scope and definition of the problem:
After you get a pretty good handle on the scope and magnitude of the problem, you need to build an action plan: a road map of how you intend to attack the problem.
The Action Plan
The action plan can be as simple as an e-mail with a four- or five-bullet list if the problem is fairly confined and low impact, or a formal document that is shared with management if the problem is widespread and causes or might cause significant downtime. It seems everyone has his or her own format for an action plan, so use whatever works for you. I have included a couple of sanitized action plans in this section so you can see what works best for you. An action plan to determine the cause of users at a site complaining about drives not being mapped might include the following bullet points:
This part of the action plan is just the data-gathering phase. After you get some questions answered and get some event logs to look at, you can develop another plan based on those findings. The action plan is very much an iterative process. Start out with your best shot, and then as you narrow it down and research error messages, modify the plan to address the next steps until the problem is resolved. For a more complex problem, you should take a more comprehensive approach. This would be the case for a DC or Exchange server that is intermittently hanging, which could be a hardware failure, a driver failure, or an application failure, and which will probably require the creation and analysis of a crash dump, a reboot of the server, and other intrusive activities. Because this would have widespread implications, you will probably assemble a team, including IT staff and management, and support the troubleshooting effort with support contracts from one or more vendors. My job for the past year or so with HP has been to work through these complex issues, putting together the technical team and managing the problem to resolution and, where possible, determining root cause. Following are the components of a complex action plan that I usually use, in the form of a formal document:
Although Table 10.13 is good for a large multitask action plan, it takes considerable effort to update; when you send updates, especially to the management team, it's difficult to read and see what the new actions are. This format is good for identifying multiple issues simultaneously, each with its own action plan. However, for a narrow, fairly well-defined scope, consider the example shown here:
Action Plan For: ABC Corporation
Purpose: Action Plan for SQL Server Hang
Last updated: March 24, 2004
Overview: Determine cause of hang of ATL-DC1
Summary: DC is experiencing intermittent hangs where the server is unresponsive and has no video, keyboard, or mouse activity. We need to force a crash from a remote console and analyze the dump to determine the cause. We have provided instructions to the customer on how to do this with the windbg.exe tool, and have provided a copy of the tool. Next step is to wait for the next occurrence of a hang, force the crash, reboot, get a dump, and analyze the dump.
Item #1
Item #2
Item #3
Completed Tasks
This format has a title, scope, and listing of technical team contacts, but then it identifies each action item in a paragraph-type listing rather than a matrix. This format allows you to list the items in priority order and easily move them up and down in the list as needed. I've also added a section called "Completed Tasks" where I put completed tasks, rather than deleting them. This keeps the current ones uncluttered, but still records what we did.
Collect Data
One of the first actions in any action plan is to collect data. The problem is getting all you need the first time. Often you look at one log, then need to run a utility to see something else, and on and on. This delays the troubleshooting process considerably. To resolve this situation, Microsoft has provided the Windows community with a very powerful tool called MPS Reports. Once reserved only for Microsoft support engineers and Microsoft support partners, these scripts are publicly available on the Microsoft Web site at http://microsoft.com/downloads/details.aspx?FamilyId=CEBF3C7C-7CA5-408F-88B7-F9C79B7306C0&displaylang=en. On that site, you will see links for versions of MPS Reports that have special features to collect data for
MPS Reports is simply an executable that runs a variety of command-line utilities to collect data and logs and returns the results in plain text format. Each version runs utilities targeted to its product. The results are compressed into a .CAB file, which can then be opened with a tool such as WinZip to expose the text files. For instance, the DS - DirSvc version runs helpful utilities such as DCDiag, NetDiag, GPresult, GPO Tool, Repadmin /showreps, Net Share, and Net Accounts, and contains text and .evt versions of the event logs. You get a good snapshot of the whole environment in one sitting. MPS Reports is generally run on DCs or servers, but some versions, such as the Networking one, can be run on a client. Figure 10.54 shows a sample listing from running the DirSvc (DS) version of MPS Reports on a DC called HPQnet-DC3. Just by viewing the contents of the CAB file, you can double-click a file such as Netdiag.txt to open it in Notepad for a quick look for errors. Note also that the individual text files in the CAB file are prefixed with the name of the computer the tool was run on, which is very handy for keeping them straight. Because all of the event logs are included in .evt and .txt format, and all other output is in .txt format, you can quickly review the files to get a good overall picture of errors.
Figure 10.54. Listing from CAB file generated by the DirSvc (DS) version of MPS Reports.
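When several reports have been extracted to one directory, a short script can do the first pass of looking for trouble words. The following is a minimal Python sketch, not part of MPS Reports; the `.txt` filter, the keyword list, and the assumption that the files are plain ANSI or UTF-8 text are all choices made for illustration.

```python
import re
from pathlib import Path

# Keywords are an assumption; adjust them to the output being reviewed.
KEYWORDS = re.compile(r"error|warning|fail", re.IGNORECASE)

def scan_reports(directory):
    """Return (file name, line number, line text) for every line in the
    .txt files directly under `directory` that contains a keyword."""
    hits = []
    for path in sorted(Path(directory).glob("*.txt")):
        # errors="replace" keeps the scan going if a file has odd bytes.
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if KEYWORDS.search(line):
                hits.append((path.name, number, line.strip()))
    return hits
```

Because each hit carries the file name, and the file names carry the computer name, the results from several DCs stay easy to tell apart.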
Of course, if you have a number of these to analyze, reviewing them one by one becomes a problem. Tricks such as extracting them to a directory and then writing a simple batch file that uses the findstr utility to search for "error", "Warning", or "Failure" can make the job a lot easier. Also, with the event logs in a text file, you can do a simple cut and paste to put them in an Excel spreadsheet using the following steps:
The Excel spreadsheet formats the output easily and allows you to do searches and sorts. This is much faster than putting the .evt file on a DC, opening the event viewer, opening the log in the event viewer, and then scanning individual events. In Figure 10.55, I was analyzing MPS Reports output from several DCs to determine the general health of AD. The text version of the event logs can be imported into an Excel spreadsheet. Here, logs from servers HPQnet-DC1, HPQnet-DC2, HPQnet-DC3, and HPQnet-DC4 were imported to separate worksheets for quick comparison. (See the names of the worksheet tabs at the bottom.)
My golden rule for collecting data is, "You can never have too much data." Don't try to save bandwidth or disk space by asking for bare minimums. Get all you can and then sort it out. After the data is collected, you can turn to the more complex task of analyzing it. The next several sections examine the various components of AD, including some useful tools and troubleshooting tips that will arm you to troubleshoot whatever problem arises.
Troubleshooting General AD Issues
In the "Monitoring Active Directory" section of this chapter we discussed a number of Microsoft and third-party monitoring tools. If you are administering a large environment, you'll soon find that using snap-ins and command-line utilities is not enough. However, for troubleshooting specific problems, the utilities in Support Tools, the Windows Server 2003 Resource Kit, and the utilities built in to the OS are very handy. Some helpful tools for general AD issues include DCDiag, NetDiag, NLTest, the event logs, and even the RDU. In addition, Windows 2003 permits the Administrator to use his or her SmartCard as authentication to execute commands such as Runas. Downloadable Account Lockout tools from Microsoft's Download site are very helpful in resolving account lockout issues. Each of these tools is summarized in the following sections.
DCDiag
DCDiag is a command-line utility that examines DC functions, connectivity, and related services. DCDiag is similar to NetDiag, but has more domain-related information. It is available in the Windows 2000 and Windows Server 2003 Support Tools and is included in the DS version of MPS Reports. Some of its features include tests for the following:
Some useful options include
Tip: There is no switch to generate a log file. You need to direct the command to an output file, for example, dcdiag /v > dcdiag.txt.
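Once the output has been redirected to a file as described in the tip, the pass and fail lines can be pulled out with a script. The sketch below is Python, not a DCDiag feature; it assumes result lines of the form `<server> passed test <name>` or `<server> failed test <name>`, and the sample text is invented for illustration.

```python
import re

# Assumed shape of a DCDiag result line: "<server> passed test <test name>".
RESULT = re.compile(
    r"(?P<server>\S+)\s+(?P<result>passed|failed)\s+test\s+(?P<test>\S+)")

def summarize_dcdiag(text):
    """Group (server, test name) pairs by outcome from redirected output."""
    summary = {"passed": [], "failed": []}
    for line in text.splitlines():
        match = RESULT.search(line)
        if match:
            summary[match.group("result")].append(
                (match.group("server"), match.group("test")))
    return summary

# Hypothetical excerpt; real output contains much more detail per test.
SAMPLE = """\
   ......................... ATL-DC1 passed test Connectivity
   ......................... ATL-DC1 failed test Replications
   ......................... ATL-DC1 passed test NetLogons
"""

print(summarize_dcdiag(SAMPLE)["failed"])
```

A summary like this only points at the failing tests; the surrounding lines in the output file still carry the error text needed to diagnose them.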
Note that each of these tests reports success or failure, and reports an appropriate error message; often you'll see the same message that appears in the event log, as well as others. Thus, you can search for words such as "Fail," "Error," and "Warning" to locate problem spots.
NetDiag
Similar to DCDiag, NetDiag concentrates on network-related errors. It runs tests on WINS and DNS connectivity, DC discovery (can it find a DC?), Kerberos authentication, NetBIOS over TCP/IP (NBT) and DNS name resolution, trust relationships, and default gateway and route connectivity. NetDiag also includes useful output from IPConfig /all and NETSTAT, and lists the hotfixes installed and NIC details. This exposes all network configuration information such as the IP address, WINS and DNS servers, bindings, and so on, so you don't have to do a lot of exploring on the machine. As with DCDiag, you can execute the tests individually with the /test: option. You can use the /l switch to send the output to netdiag.log or redirect the output to a file. Unfortunately, NetDiag can't be executed remotely.
NLTest
The NLTest utility is more of an interactive testing utility than a reporting tool like DCDiag and NetDiag, so you can use it to test specific operations on demand. NLTest can be executed remotely. Some of my favorite options are
The following example shows the result of the command nltest /server:ATL-DC1 /dsgetdc:company.com /timeserv, which reports the time server being used. Note that this command was issued on ATL-DC1; because ATL-DC1 was a time server, it returned itself in response to the query.
E:\>nltest /server:atl-dc1 /dsgetdc:company.com /timeserv
DC: \\ATL-DC1.Company.com
Address: \\10.0.7.253
Dom Guid: 5dd3afa3-0004-47d6-8fde-311af13a3934
Dom Name: Company.com
Forest Name: Company.com
Dc Site Name: Atlanta
Our Site Name: Atlanta
Flags: PDC GC DS LDAP KDC TIMESERV GTIMESERV WRITABLE DNS_DC DNS_DOMAIN DNS_FOREST CLOSE_SITE
The command completed successfully
In the following example, we have added the /avoidself flag, which causes NLTest to return the result from a DC other than the one we are on. In this case, it returns CXO-DC2 as the time server.
E:\>nltest /server:atl-dc1 /dsgetdc:company.com /timeserv /avoidself
DC: \\CXO-dc2.Company.com
Address: \\10.0.7.2
Dom Guid: 5dd3afa3-0004-47d6-8fde-311af13a3934
Dom Name: Company.com
Forest Name: Company.com
Dc Site Name: SaltLakeCity
Flags: DS LDAP KDC TIMESERV WRITABLE DNS_DC DNS_DOMAIN DNS_FOREST
The command completed successfully
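Output in this shape is easy to check from a script, for example to confirm that the DC returned advertises the time service. A minimal Python sketch follows; it only parses text that has already been captured, the function name is hypothetical, and the sample is abbreviated from the first example.

```python
def parse_dsgetdc(output):
    """Parse captured `nltest /dsgetdc` text into a dict of fields.

    Each line is split at its first colon; lines without a colon are
    skipped, and the Flags value is turned into a set of flag names.
    """
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    fields["Flags"] = set(fields.get("Flags", "").split())
    return fields

# Abbreviated copy of the sample output, used here as test input only.
SAMPLE = r"""
DC: \\ATL-DC1.Company.com
Address: \\10.0.7.253
Dom Name: Company.com
Dc Site Name: Atlanta
Our Site Name: Atlanta
Flags: PDC GC DS LDAP KDC TIMESERV GTIMESERV WRITABLE DNS_DC DNS_DOMAIN DNS_FOREST CLOSE_SITE
The command completed successfully
"""

info = parse_dsgetdc(SAMPLE)
print(info["DC"], "TIMESERV" in info["Flags"])
```

Comparing the "Dc Site Name" and "Our Site Name" fields in the same dict is a quick way to notice when a DC from another site answered the query.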
Event Logs
Never underestimate the power of the event logs. Microsoft made some great improvements in Windows 2000 events, putting more verbose information and some problem resolution help in the description field. Windows Server 2003 made further progress in this area by adding more troubleshooting tips to more events and by including references to relevant KB articles and a link to Microsoft's support Web site. Previously in this chapter, I described how to save the event log as a text file and import it into Excel to allow search and sorting of the event data. One powerful feature related to event logs is the verbose logging feature. For AD, you can enable verbose logging for a variety of AD functions, which writes more detailed events to the appropriate event logs. This logging is enabled via the Registry on each DC in the following key:
HKLM\System\CurrentControlSet\Services\NTDS\Diagnostics
The options are shown in Figure 10.56. Note that Windows Server 2003 has added several new values not present in Windows 2000. The data defined for the various values is a hex number from 0 to 5. The default is 0, which turns verbose logging off, whereas 5 is so verbose that the output takes forever to wade through and can fill your disk. Normally, you want to set the value to 3 and then crank it up higher if needed. When you are finished troubleshooting, reset it to 0 to prevent it from either filling your disk or overwriting itself and hiding useful information, depending on how you have the log configured. For example, if you want more detailed events related to replication on GC servers, simply edit the value 5 Replication Events and set the data to 3, and then do the same for the value 18 Global Catalog.
Figure 10.56. The NTDS Diagnostics key in the Registry provides a way to set verbose logging for a variety of AD functions and processes.
Note: Setting these values does not require a reboot of the server or the client. All you have to do is make the setting change and repeat the actions that reproduce the problem. The result will be more verbose events in the standard event logs in Windows 2003.
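Because the change is a plain Registry edit, it can be prepared as a .reg file and imported on each DC. The following Python sketch only builds the text of such a file for the two values mentioned above; the function name and the range check are illustrative assumptions, and the result should be reviewed before it is imported anywhere.

```python
KEY = r"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\NTDS\Diagnostics"

def diagnostics_reg(levels):
    """Build .reg file text that sets NTDS Diagnostics values.

    `levels` maps a value name (for example "5 Replication Events")
    to a logging level from 0 (off) to 5 (most verbose).
    """
    lines = ["Windows Registry Editor Version 5.00", "", f"[{KEY}]"]
    for name, level in levels.items():
        if not 0 <= level <= 5:
            raise ValueError(f"level for {name!r} must be 0-5, got {level}")
        # REG_DWORD data is written as eight hexadecimal digits.
        lines.append(f'"{name}"=dword:{level:08x}')
    return "\r\n".join(lines) + "\r\n"

print(diagnostics_reg({"5 Replication Events": 3, "18 Global Catalog": 3}))
```

Calling the same function with a level of 0 for each value produces the file that turns verbose logging back off when troubleshooting is finished.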
Remote Desktop
Perhaps one of the most powerful tools built in to Windows 2000 was Terminal Services Administration mode. Microsoft improved on this with Remote Desktop in XP and Windows Server 2003. Built in to the OS, Remote Desktop lets the remote session see local drives and printers and supports cut and paste between the remote and local sessions. You'll find this a great troubleshooting tool: you can cut and paste logs from a remote server to your local machine.
Enhanced Use of SmartCards
Windows 2000 supported SmartCards in a low-level manner. Windows Server 2003 Administrators can use SmartCards to run DCPromo, execute Net and Runas commands, and use Terminal Services to remotely administer a machine.
Account Lockout Tools
Account Lockout continues to be a big issue for most help desks and was described in Chapter 5. Microsoft provided two tools:
These tools can be downloaded from http://www.microsoft.com/downloads/details.aspx?FamilyID=7af2e69c-91f3-4e63-8629-b999adde0b9e&DisplayLang=en.
Troubleshooting DNS
DNS is a critical component of AD. Every time a client needs a resource for authentication, GC searches, and so on, it uses DNS to find a server that can satisfy the request. Every time a DC replicates, it uses DNS. Any time any server needs to talk to another, it will probably require DNS. Broken DNS name resolution can ripple through the entire AD environment, causing problems with AD replication, problems with FRS/Distributed File System (DFS) replication, authentication failures, network resource access failures, and so on. The Administrator should make sure the DNS structure is designed correctly and adheres to best practices, and should address failures immediately. DNS failures can be reported in the DS and FRS event logs as well as the DNS log, so make sure you look through all of them. You will often see "DNS Lookup Failure" included in other events, such as the infamous 1311 in the Directory Service event log, so you have to read the description of the event to see the DNS failure. Some things you can do to test DNS name resolution include
Refer to Chapter 8 for DNS design and best practices as well as Microsoft's DNS Center at http://www.microsoft.com/dns.
Troubleshooting Replication
Next to DNS, replication is of prime importance in the health of the AD. Troubleshooting multimaster replication can be difficult, but with a good understanding of how it works and some tools, it's fairly predictable. Failures in the AD can be due to changes not being replicated. Adding a user or changing user attributes, such as passwords, user rights, and so on, might appear to not take effect due to replication failure. Group Policy depends on replication, so there could be a combination of things. The "Replication" section in Chapter 5 provided some good information on analyzing the topology. A poor design will produce a lot of problems. I'm not afraid to tell a customer with a problematic topology to fix the topology and then work on the individual problems. If the topology is sound, a few good tools can help diagnose the problems:
To effectively diagnose replication problems, you must take a holistic view of the entire forest. Replication runs in the configuration NC, a forest-level context, so you need to take a forest-level approach. Table 10.14 shows a quick checklist of troubleshooting steps when you suspect replication failure.
Table 10.14. Replication Troubleshooting Checklist
Hopefully, this will give you some ideas about how to troubleshoot this complex technology. Reviewing the case studies and experiences in Chapter 5 should help as well. Consider these techniques when you build an action plan.
Troubleshooting DCPromo
DCPromo troubleshooting is pretty straightforward, and errors can be solved if you understand where in the process the failure occurs. There are two phases of DCPromo. In phase 1, you answer all the questions in the UI. Phase 2 takes place after the reboot; no messages or UI alert you to success or failure, but a lot of activity goes on behind the scenes. You might think that if you get to the reboot without an error, DCPromo was successful, but that's only half of the process. The following sections break down the DCPromo process step by step, and I've included troubleshooting tips at each step. The DCPromo logs are important during the debug process. First, let's look at the logs and some preliminary items.
DCPromo Logs
There are two logs, DCPromo.log and DCPromoui.log, located in %systemroot%\debug. DCPromo.log is a fairly nonverbose log, and subsequent DCPromo executions append each instance to the end of the log. Thus, there is only one log, which can hold the output of multiple DCPromo runs. DCPromoui.log logs all the information seen in the UI, so you'll see prompts, answers, and so on. Both logs will contain errors, but sometimes different errors are logged in each, so check both. I usually start with DCPromo.log because it's easier and less verbose (remember that the most recent data is at the end of the file). If I need to do more analysis, I'll move to DCPromoui.log. These logs really aren't hard to read and provide good information, such as the source DC, the Time Services synchronization, credentials used, and so on. A few pointers on evaluating these logs include
Step 1 Data Gathered, Credentials Checked
In this section, we collect the data entered from the UI or answer file (forest, domain, new or replica DC, credentials, and so on). To create a replica, you need Domain Admin rights. To create a new domain or demote the last DC in a domain, you need Enterprise Admin rights. Failures are recorded in the DCPromo*.log.
DNS Check
In Windows Server 2003, a DNS summary screen appears just before replication begins. This confirms whether DNS is working or not. If it fails, check DNS configuration:
Step 2 Machine Account Moved to DC's OU
At this point, the UserAccountControl attribute on the server object will be changed. This attribute is set to 4096 (Dec) in Windows 2000 or 1000 (Hex) in Windows Server 2003 for a workstation or server. This attribute will be updated to 532480 (Dec) in Windows 2000 or 82000 (Hex) in Windows Server 2003 for a DC. Note that 532480 is the decimal equivalent of 82000 (Hex) and 4096 is the decimal equivalent of 1000 (Hex). This can be viewed with the LDP or ADSIEDIT tool by browsing to the server object and viewing the attributes, as shown in Figure 10.57.
Figure 10.57. The UserAccountControl attribute is exposed in the LDP tool for a DC in this example.
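The arithmetic is easier to follow when each value is broken into its bit flags. The following Python sketch decodes the two values; the flag names come from Microsoft's published userAccountControl flag table rather than from this chapter, and only the three flags needed here are listed.

```python
# Subset of userAccountControl bit flags (names per Microsoft's documentation).
UAC_FLAGS = {
    0x1000: "WORKSTATION_TRUST_ACCOUNT",  # member workstation or server
    0x2000: "SERVER_TRUST_ACCOUNT",       # domain controller
    0x80000: "TRUSTED_FOR_DELEGATION",
}

def decode_uac(value):
    """Return the names of the known flags set in `value`, lowest bit first."""
    return [name for bit, name in sorted(UAC_FLAGS.items()) if value & bit]

for value in (4096, 532480):
    print(f"{value} = {hex(value)}: {', '.join(decode_uac(value))}")
```

Seen this way, the promotion replaces the workstation trust flag (0x1000) with the server trust flag (0x2000) plus the delegation flag (0x80000), which together give 0x82000.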
If the server being promoted is a member of the domain, the machine account is now moved to the DC's OU. If it's in a workgroup, the account is created in the Computers container in the domain, and then moved to the DC's OU. If DCPromo fails to create the machine account (visible in the dcpromo.log and dcpromoui.log), you can try to join the server to the domain first and then run DCPromo. You join the computer to the domain by right-clicking My Computer, and choosing Properties, Computer Name, Change. Set the Member Of option to Domain and then specify the domain name. It's easier to resolve the issue that way because it breaks DCPromo into two pieces, making it easier to see what the problem is. If joining the domain in this manner fails, you can start checking things such as DNS until it joins the domain. After that is resolved, you can run DCPromo again to determine whether the next part of DCPromo fails. After this is successful, the machine account (that is, DC1$) should be in the DC's OU.
Step 3 Source DC Located (Using DC Locator)
The source DC is identified via the DC Locator process using DNS. The source DC is identified in the dcpromo.log and dcpromoui.log. If there is a failure at this point, there are several options:
Step 4 AD Replicated from Source DC
Inbound connections are created using UDP from the source to the new DC. If this is the first DC in a site, it needs to go outside the site, and use UDP. I've seen issues where a firewall blocked UDP traffic, so check that as well. At this point, the AD (NTDS.DIT) will be replicated from the source to the new DC.
Note: At this point, the computer will reboot to finish DCPromo. This completes Phase 1 of DCPromo.
Step 5 Phase 2
After the system reboots, the following will take place in this order:
Proceed with the following troubleshooting steps.
Tip: You cannot solve the problem of SYSVOL and Net Logon shares not showing up by creating those shares. You can create the share, but it won't fix the problem. This is typically a replication issue, which can in turn be a DNS issue or simple connectivity. Solve the replication issue, and the shares will be created. You don't have to rerun DCPromo.
There are a couple of tricks to troubleshooting this. Microsoft produced a pretty good article: KB article 327781, "How to Troubleshoot Missing SYSVOL and NETLOGON Shares on Windows Server 2003 Domain Controllers." One of its suggestions, creating manual connection objects, is a good one; sometimes you just have to give the KCC a kick and it creates the connections. However, working this issue a while back in Windows 2000 with a Microsoft engineer, we found a very reliable way to solve this if the suggestions in the KB don't work. You use the Repadmin /add command to add a replication link (kind of a low-level connection) between the two servers; then, execute a repadmin /sync to force replication across it. It's nondestructive and solves a lot of these DCPromo problems. The article is available on this book's Web site at http://www.phptr.com/title/0131467581.
Troubleshooting Group Policy
Group Policy failures usually show up as user complaints about authentication, access rights, desktop lockdown (unexpected limits), password failures (due to a change in complexity policies, and so on), and logon script failure (mapped drives aren't showing up). These could be caused by AD replication failure, FRS failure, or DNS lookup failures.
Tools
The most common tools used for Group Policy troubleshooting include GPresult, Group Policy Management Console (GPMC), GPO tool, and event logs:
Why Policy Isn't Applied
The most common Group Policy problem that has to be resolved is determining why Group Policy isn't being applied. Some causes for this include
For additional information on Group Policy troubleshooting, see http://www.microsoft.com/gp.
Troubleshooting FRS/DFS
You were given a pretty comprehensive discussion of FRS/DFS issues in Chapter 5 in the "File Replication Service" section. The Ultrasound, Sonar, and FRSDiag tools, as described in that section, are a great help in managing FRS and diagnosing FRS problems. Also, the Ultrasound help file is invaluable for diagnosing FRS issues, providing information on how FRS works, resolutions, and a listing of common FRS events, what they mean, and common solutions. Remember that FRS is dependent upon AD replication, which is dependent on RPC and DNS. Troubleshoot FRS problems by starting at DNS to make sure it is healthy, then see if there are AD replication problems, and then move to FRS. The FRS event log is also helpful.