Windows Server 2003 on Proliants. Deployment Techniques and Management Tools for System Administrators



If you have dealt with troubleshooting AD for Windows 2000, you know how difficult it can be. Multimaster replication, dynamic DNS, the AD database, File Replication Service (FRS), security options, Group Policy, and other functions of AD make troubleshooting complex. And then you have to deal with security patches and virus updates just to keep the environment somewhat secure. It is impossible to condense my more than five years of experience troubleshooting AD into one section of one chapter. Instead, this section provides some guidelines on how to proceed with troubleshooting problems, lists some tools that are very powerful in Windows Server 2003, and offers a few tips and tricks I've picked up.

Define the Problem

The first step in troubleshooting is to step back from the immediate fire and assess the whole problem. To ascertain the root cause, you need to ask enough questions to determine the scope and magnitude of the problem. Of course, you also need to determine if there is a problem at all. We get customer calls all the time from Administrators who are worried because their event logs have some events they don't understand: warnings or even informational events. The first question we ask is, "Besides the events, what's broken?" Usually, the answer is "Nothing." That helps set the priority and expectations for solving the problem. We need to solve it, but we don't have to work through the night to do it.

Following is a set of questions and actions I use to determine the scope and definition of the problem:

What exactly is failing? "It doesn't work" isn't an acceptable answer. This might take a little investigative work, but ask some leading questions to find out whether it's a security or authentication issue, access to a network resource, or application failure (Exchange will be the big one). Are there other things that don't work?

When did you notice it? Determine when the problem was noticed. When was the last time the functionality worked?

Are there any events in the event log? Whether it's a client workstation, a file/print/application server, or a DC, take a look at the event log for errors, warnings, and even informational messages close to the time of the problem and before. Often the real cause of the failure will log events before the actual symptom is manifest.

How is the problem replicated? Describe specific steps needed to reproduce the problem. Can it be replicated at will, or is the problem intermittent?

Is the problem isolated to a single user, computer, server, or DC? Ask questions to find out whether this is an isolated incident or more widespread. Remember, you might have to visit the site physically to find out if users have experienced the problem; you can't always rely on them to call the help desk.

What are the variables? After you get an idea of the scope of the problem (for example, it happens to multiple users on multiple machines, but not all), narrow the variables. That is, find out if the users are authenticated by the same DC; see if Group Policy is applied; see if they all are members of the same group or are accessing the same resource.

What are the test clients? After you have narrowed the scope of the problem sufficiently, identify a user that you can work with to resolve the problem. If possible, plug a test client into that site and reproduce the problem. That lets you work on the problem without disrupting a user. When you have a possible solution, have the user test it.

After you get a pretty good handle on the scope and magnitude of the problem, you need to build an Action Plan: a road map of how you intend to attack the problem.

The Action Plan

The action plan can be as simple as an e-mail with a four- or five-bullet list if the problem is fairly confined and low impact, or a formal document that is shared with management if the problem is widespread and causes or might cause significant downtime. It seems everyone has his or her own format for an action plan, so use whatever works for you. I have included a couple of sanitized action plans in this section so you can see what works best for you. An action plan to determine the cause of users at a site complaining about drives not being mapped might include the following bullet points:

Identify all users with the drive-mapping problem. Are there users at other sites with the problem?

What groups do the users have in common?

List errors seen by user.

Determine whether a logon script is being run at all (GPresult on client).

Get copies of client event logs.

Determine whether the users are all being authenticated by the same DC.

Get event logs from the DC.

Analyze events and errors.

This part of the action plan is just the data-gathering phase. After you get some questions answered and get some event logs to look at, you can develop another plan based on those findings. The action plan is very much an iterative process. Start out with your best shot, and then as you narrow it down and research error messages, modify the plan to address the next steps until the problem is resolved.

For a more complex action plan, you should take a more comprehensive approach. This would be the case for a DC or Exchange server that is intermittently hanging, which could be caused by a hardware, driver, or application failure and would probably require the creation and analysis of a crash dump, a reboot of the server, and other intrusive activities. Because this would have widespread implications, you will probably assemble a team, including IT staff and management, and support the troubleshooting effort with support contracts from one or more vendors. My job for the past year or so with HP has been to work through these complex issues, putting together the technical team and managing the problem to resolution, and perhaps determining root cause. Following are the components of a complex action plan that I usually use, in the form of a formal document:

Title : Short description of the problem (five to six words).

Description of the Problem : Keep this brief, such as "DC ATL-DC1 hanging intermittently."

Scope : List all the issues that this action plan covers. This must be well-defined. Avoid "problem creep," where you start merging other problems into the plan. If problems depend on each other for resolution, then you can add them, but problems are more easily solved if they are isolated.

Technical Team : Identify names and contact information of those who are working on this issue. This is especially critical if you have logged support calls with vendors.

Task List : Put this in a table format to identify the steps to be accomplished in sequence, persons assigned for each task, and dates such as scheduled and actual completion. Table 10.12 is an example of how this task list could be formatted (note the column headings). Although this example is of multiple tasks (see the next bullet in this list), you can use the same idea for a single task.

Table 10.12. Table Format for a Simple Action Plan

Item No.	Task Description	Responsible Party	Scheduled Completion	Status

Multitask List : If the problem involves tasks to be performed at multiple sites or working with multiple technologies, you can subdivide the Tasks category into several subsections, as shown in Table 10.13. In that example, the problem involved multiple technologies such as client workstations, AD, hardware, and networks. Because you might have different IT staff from each area and perhaps different vendors involved, these subsections allow you to develop mini-action plans in each area. Table 10.13 shows an example of a multifaceted problem that involved Data Gathering, Networking, Outlook, and the XP client. Although the table shows only two sections, Data Gathering and Networking, this format was repeated for the other task areas.

Table 10.13. Formatting a Multitask Action Plan

Action Plan Title : Action Plan for Company XYZ
Author : Gary Olsen
Problem : Poor network performance for Outlook client and file share access.
Task Areas:
	Data Gathering: This area requires Company XYZ to gather data required by HP.
	Networking: This area will work the possible network configuration and performance issues, and includes a Microsoft support person.
	Outlook Client: This area will work Outlook configuration issues.
	XP Client: This area will examine configuration of the XP client since the problems started after the XP upgrade.
Data Gathering Actions
Item No.		Task Description	Responsible Party	Scheduled Completion	Status
1		Run MPS Reports	Tyler Olsen	4/26/2004	Complete (awaiting analysis)
Network Actions
Item No.		Task Description	Responsible Party	Scheduled Completion	Status
1		Run NetMon between server and client	Caroline Urbanawiz	6/06/2004	Need new download of NetMon

Resolution Criteria : Determine how you will all agree that the problem is resolved. In cases of problems that don't have a clear definition of resolution, such as a hang or performance problems, it's important that you define what constitutes resolution. Problems like hangs could conceivably never be resolved because you can't really tell if they will come back. One standard measurement is 2.5 times the longest time between failures. So, if the longest time between instances of the hang is 10 days, then the resolution criteria would be 2.5 * 10 or 25 days.

Support and Escalation Path : If external support channels are required, it's important to define the procedure to engage the right resources, especially for off-hours. For instance, if a mission-critical application server hangs on Saturday night, and the action plan calls for forcing a crash dump before it is rebooted, you need to know who to call. In problem resolution, this will often skirt normal problem reporting procedures and require calling an engineer directly. Make sure this is defined and your staff and the vendors agree.
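The 2.5-times rule of thumb described under Resolution Criteria is simple arithmetic, but it helps to write it down so everyone on the team computes the same date. The following is only a minimal sketch in Python; the function names and the sample hang dates are hypothetical and are not part of any HP or Microsoft tool.

```python
from datetime import date, timedelta


def resolution_window_days(failure_dates, factor=2.5):
    """Failure-free days required: factor times the longest gap between failures."""
    ordered = sorted(failure_dates)
    if len(ordered) < 2:
        raise ValueError("need at least two failure dates to measure an interval")
    # Longest number of days between two consecutive failures.
    longest_gap = max(
        (later - earlier).days for earlier, later in zip(ordered, ordered[1:])
    )
    return factor * longest_gap


def resolution_date(failure_dates, factor=2.5):
    """Date on which the problem can be declared resolved if no new failure occurs."""
    window = resolution_window_days(failure_dates, factor)
    return max(failure_dates) + timedelta(days=window)


# Hypothetical hang dates: the longest gap between hangs is 10 days.
hangs = [date(2004, 3, 1), date(2004, 3, 11), date(2004, 3, 15)]
print(resolution_window_days(hangs))  # 25.0
print(resolution_date(hangs))         # 2004-04-09
```

If a new failure occurs before the computed date, add it to the list and recompute; the window restarts from the most recent failure.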

Although Table 10.13 is good for a large multitask action plan, it takes considerable effort to update, and when you send updates, especially to the management team, it's difficult to read and to see what the new actions are. This format is good for tracking multiple issues simultaneously, each with its own action plan. However, for a narrow, fairly well-defined scope, consider the example shown here:

Action Plan

For : ABC Corporation

Purpose : Action Plan for SQL Server Hang

Last updated : March 24, 2004

Overview : Determine cause of hang of ATL-DC1

Summary : DC is experiencing intermittent hangs where the server is unresponsive and has no video, keyboard, or mouse activity. We need to force a crash from a remote console and analyze the dump to determine the cause. We have provided instructions to the customer on how to do this with the windbg.exe tool, and have provided a copy of the tool. Next step is to wait for the next occurrence of a hang, force the crash, reboot, get a dump, and analyze the dump.

Item #1

Action:	Configure DC for debug.
Why:	Set /debug flag to permit forcing a crash to collect the dump.
Priority:	High.
When:	March 25.
Who:	Seymour Reign.
Status:	Scheduled for maintenance window March 25.

Item #2

Action:	Force crash and collect dump file.
Why:	Analyze for possible cause of the hang.
Priority:	High.
When:	At next occurrence of the hang.
Who:	Seymour Reign.
Status:	Waiting for hang to occur.

Item #3

Action:	Crash dump analysis.
Why:	Determine cause of hang.
Priority:	Medium.
When:	TBD.
Who:	Jack Sprat.
Status:	Waiting for receipt of the crash dump to be taken at next hang.

Completed Tasks

Action:	Provide instructions to customer on how to use windbg to force a crash. Also provide a copy of windbg.exe.
Why:	Allow a crash to be forced from a remote console.
Priority:	Medium.
When:	March 23.
Who:	HP.
Status:	Complete.

This format has a title, scope, and listing of technical team contacts, but then it identifies each action item in a paragraph type listing rather than a matrix. This format allows you to list the items in priority and easily adjust them up and down in the list as needed. I've also added a section called "Completed Tasks" where I put completed tasks, rather than deleting them. This keeps the current ones uncluttered, but still records what we did.

Collect Data

One of the first actions in any action plan is to collect data. The problem is getting all you need the first time. Often you look at one log, then need to run a utility to see something else, and on and on. This delays the troubleshooting process considerably. To resolve this situation, Microsoft has provided the Windows community with a very powerful tool called MPS Reports. Once reserved only for Microsoft support engineers and Microsoft support partners, these scripts are publicly available on the Microsoft Web site at http://microsoft.com/downloads/details.aspx?FamilyId=CEBF3C7C-7CA5-408F-88B7-F9C79B7306C0&displaylang=en. On that site, you will see links for versions of MPS Reports that have special features to collect data for the following:

DS - DirSvc

Networking

Clusters

Microsoft Data Access Components (MDAC)

Setup/Performance

Software Update Service (SUS)

MPS Reports is simply an executable that runs a variety of command-line utilities to collect data and logs and returns the results in plain text format. Each of the versions runs utilities targeted to that product. The results are compressed into a .CAB file, which can be opened in a tool such as WinZip to expose the text files for viewing.

For instance, the DS - DirSvc version runs helpful utilities such as DCDiag, NetDiag, GPresult, GPO Tool, Repadmin /showreps, Net Share, and Net Accounts, and contains text and .evt versions of the event logs. You get a good snapshot of the whole environment in one sitting. MPS Reports is generally run on DCs or servers, but some versions, such as the Networking one, can be run on a client.

Figure 10.54 shows a sample listing from running the DirSvc (DS) version of MPS Reports on a DC called HPQnet-DC3. Just by viewing the contents of the CAB file, you can easily double-click a file such as Netdiag.txt, opening it in Notepad for a quick look for errors. Note also that the individual text files in the CAB file are prefixed with the name of the computer it is run on, which is very handy for keeping them straight. Because all of the event logs are included in .evt and .txt format, and all other output is in .txt format, you can easily and quickly review the files to get a good overall picture of errors.

Figure 10.54. Listing from CAB file generated by the DirSvc (DS) version of MPS Reports.

Of course, if you have a number of these to analyze, it becomes a problem, but using tricks like extracting them to a directory and then writing a simple bat file using the findstr utility to search for "error" or "Warning" or "Failure" can make the job a lot easier.
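The same idea as the findstr batch file can be sketched in Python for those who prefer it. This is only an illustration: the function name scan_reports is hypothetical, and it assumes the CAB contents have already been extracted to a directory as text files readable as UTF-8 (adjust the encoding if your files differ).

```python
import re
from pathlib import Path

# Case-insensitive match for the words worth a closer look.
KEYWORDS = re.compile(r"error|warning|fail", re.IGNORECASE)


def scan_reports(directory, pattern="*.txt"):
    """Return (file name, line number, line text) for every suspicious line."""
    hits = []
    for path in sorted(Path(directory).glob(pattern)):
        # errors="replace" keeps the scan going if a file has odd bytes in it.
        with open(path, encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if KEYWORDS.search(line):
                    hits.append((path.name, number, line.strip()))
    return hits


# Usage, assuming the reports were extracted to a hypothetical C:\mps folder:
#   for name, number, text in scan_reports(r"C:\mps"):
#       print(f"{name}:{number}: {text}")
```

Because each hit carries the file name, and MPS Reports prefixes each file with the computer name, the output shows at a glance which server logged what.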

Also with the event logs in a text file, you can do a simple cut and paste to put it in an Excel spreadsheet using the following steps:

1. Open the MPS Reports CAB file in WinZip, as shown in Figure 10.54.

2. Open the <server name> system.txt file in Notepad by double-clicking the file name displayed in the WinZip console.

3. In Notepad, choose Edit, Select All.

4. Copy the selected text.

5. Open an Excel spreadsheet.

6. On the Sheet 1 worksheet, right-click the A1 cell (the upper-left corner of the worksheet) and paste the text. This automatically puts all the event data into rows and columns (see Figure 10.55).

Figure 10.55. MPS Report output in an Excel spreadsheet.

7. Repeat this procedure for Sheet 2, Sheet 3, and other worksheets so the system log for each server is pasted onto a separate worksheet.

8. Double-click the names of the worksheet tab and give each worksheet the name of the respective server.

The Excel spreadsheet formats the output easily and allows you to do searches and sorts. This is much faster than putting the .evt file on a DC, opening the event viewer, opening the log in event viewer, and then scanning individual events. In Figure 10.55, I was analyzing MPS Report output from several DCs to determine general health of the AD. The text version of the event logs can be imported into an Excel spreadsheet. Here, logs from servers HPQnet-DC1, HPQnet-DC2, HPQnet-DC3, and HPQnet-DC4 were imported to separate worksheets for quick comparison. (See the names of the worksheet tabs at the bottom.)
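If you have many servers, the cut-and-paste steps can be scripted. The sketch below merges several text-format event logs into one CSV file, adding a column for the server name so you can filter or sort by server in Excel. It assumes the text logs are tab-delimited (which is why the paste into Excel splits cleanly into columns) and UTF-8 readable; the function name and the mapping of server names to file paths are hypothetical.

```python
import csv


def combine_event_logs(log_paths, output_csv, delimiter="\t"):
    """Merge text-format event logs into one CSV with a leading server column.

    log_paths maps a server name to the path of that server's text event log.
    Returns the number of event rows written.
    """
    rows_written = 0
    with open(output_csv, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for server, path in log_paths.items():
            with open(path, newline="", encoding="utf-8", errors="replace") as src:
                # QUOTE_NONE: treat quote characters in descriptions as plain text.
                reader = csv.reader(src, delimiter=delimiter, quoting=csv.QUOTE_NONE)
                for fields in reader:
                    if not fields:
                        continue  # skip blank lines
                    writer.writerow([server] + fields)
                    rows_written += 1
    return rows_written
```

A single combined file trades the one-worksheet-per-server layout for the ability to sort events from all DCs by time, which is often what you want when chasing a replication problem.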

My golden rule for collecting data is, "You can never have too much data." Don't try to save bandwidth or disk space by asking for bare minimums. Get all you can and then sort it out.

After the data is collected, you can turn to the more complex task of analyzing it. The next several sections examine the various components of AD, including some useful tools and troubleshooting tips that will arm you to troubleshoot whatever problem arises.

Troubleshooting General AD Issues

In the "Monitoring Active Directory" section of this chapter we discussed a number of Microsoft and third-party monitoring tools. If you are administering a large environment, you'll soon find that using snap-ins and command-line utilities is not enough. However, for troubleshooting specific problems, the utilities in Support Tools, the Windows Server 2003 Resource Kit, and the utilities built in to the OS are very handy. Some helpful tools for general AD issues include DCDiag, NetDiag, NLTest, the event logs, and even the RDU. In addition, Windows 2003 permits the Administrator to use his or her SmartCard as authentication to execute commands such as Runas. Downloadable Account Lockout tools from Microsoft's Download site are very helpful in resolving account lockout issues. Each of these tools is summarized in the following sections.

DCDiag

DCDiag is a command-line utility that looks at DC functions, connectivity, and so on. DCDiag is similar to NetDiag, but has more domain-related information. It is available in the Windows 2000 and Windows Server 2003 Support Tools and is included in the DS version of MPS Reports. Some of its features include tests for the following:

FSMO connectivity

Replication

Whether the DC is advertising itself

Registration of Service Principal Names (SPNs)

RID pool information

AD services operating

LDAP and Remote Procedure Call (RPC) connectivity

FRS connectivity

SYSVOL sharing

Some useful options include

/? : Online help with good descriptions.

/v (verbose) : DCDiag output is not of much value without this switch.

/test : Allows you to run a single test instead of all the default tests you get by running DCDiag with the /v switch or with no switches.

/test:DCPromo : Running this test (not run by default or with /v) checks whether the DNS structure will permit this machine to be promoted to a DC.

/test:Topology : Checks whether the replication topology is fully connected for all DCs. Not run by default or with /v.

/Fix : A cool switch that actually does fix things on occasion, and it certainly won't hurt!

tip

The simplest way to capture the output is to redirect the command to a file, for example, dcdiag /v > dcdiag.txt. DCDiag also accepts a /f:<logfile> switch that writes the output to the named file.

Note that each of these tests reports success or failure, and reports an appropriate error message; often you'll see the same message that appears in the event log, as well as others. Thus, you can search for words such as "Fail," "Error," and "Warning" to locate problem spots.
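When the redirected output runs to hundreds of lines, a quick pass/fail summary saves scrolling. The following Python sketch assumes DCDiag's usual result lines of the form "<target> passed test <TestName>" or "<target> failed test <TestName>"; check that your version's output matches before relying on it. The function name is hypothetical.

```python
import re

# Matches result lines such as:
#   ......................... ATL-DC1 passed test Connectivity
RESULT = re.compile(
    r"(?P<target>\S+)\s+(?P<outcome>passed|failed)\s+test\s+(?P<test>\S+)"
)


def summarize_dcdiag(text):
    """Split DCDiag results into (passed, failed) lists of (target, test name)."""
    passed, failed = [], []
    for line in text.splitlines():
        match = RESULT.search(line)
        if match:
            bucket = passed if match["outcome"] == "passed" else failed
            bucket.append((match["target"], match["test"]))
    return passed, failed


# Usage with a file created by: dcdiag /v > dcdiag.txt
#   with open("dcdiag.txt", encoding="utf-8", errors="replace") as handle:
#       passed, failed = summarize_dcdiag(handle.read())
#   for target, test in failed:
#       print(f"{target} failed {test}")
```

The summary only tells you where to look; the error text around each failed test in the full output is still what explains the failure.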

NetDiag

Similar to DCDiag, NetDiag concentrates on network-related errors. It runs tests on WINS and DNS connectivity, DC discovery (can it find a DC?), Kerberos authentication, NetBIOS over TCP/IP (NBT) and DNS name resolution, trust relationships, and default gateway and route connectivity. NetDiag also includes useful output from IPConfig /all and lists installed hotfixes, NIC details, and NETSTAT output. This exposes all network configuration information such as the IP address, WINS and DNS servers, bindings, and so on, so you don't have to do a lot of exploring on the machine. As with DCDiag, you can execute the tests individually with the /test: option. You can use the /l switch to send the output to netdiag.log or redirect the output to a file. Unfortunately, NetDiag can't be executed remotely.

NLTest

The NLTest utility is more of an interactive testing utility than a reporting tool like DCDiag and NetDiag, so you can use it to test certain operations interactively. NLTest can be executed remotely. Some of my favorite options are

/Server : Server to execute the command on (this isn't a favorite, but the others won't work without it).

/SC_RESET : Resets secure channel on target DC.

/SC_Verify : Verifies secure channel on target DC.

/SC_Change_Pwd : Changes the secure channel password for the target domain on this server.

/DCList : Lists all DCs for the domain.

/DCname : Returns the name of the PDC for the domain.

/DSGETDC : Calls the DSGetDCName API that Net Logon uses in the DC Locater process. This returns information about the DC, but you can also issue options such as /GC to list GCs in the domain.

The first example shows the result of the command nltest /server:ATL-DC1 /dsgetdc:company.com /timeserv, which reports the time server being used. Note that this command was issued on ATL-DC1; because ATL-DC1 was a time server, it returned itself in response to the query.

E:\>nltest /server:atl-dc1 /dsgetdc:company.com /timeserv
           DC: \\ATL-DC1.Company.com
      Address: \\10.0.7.253
     Dom Guid: 5dd3afa3-0004-47d6-8fde-311af13a3934
     Dom Name: Company.com
  Forest Name: Company.com
 Dc Site Name: Atlanta
Our Site Name: Atlanta
        Flags: PDC GC DS LDAP KDC TIMESERV GTIMESERV WRITABLE DNS_DC DNS_DOMAIN DNS_FOREST CLOSE_SITE
The command completed successfully

In the following example, we have added the /avoidself flag, which causes NLTest to return the result from a DC other than the one we are on. In this case, it returns CXO-DC2 as the time server.

E:\>nltest /server:atl-dc1 /dsgetdc:company.com /timeserv /avoidself
           DC: \\CXO-dc2.Company.com
      Address: \\10.0.7.2
     Dom Guid: 5dd3afa3-0004-47d6-8fde-311af13a3934
     Dom Name: Company.com
  Forest Name: Company.com
 Dc Site Name: SaltLakeCity
        Flags: DS LDAP KDC TIMESERV WRITABLE DNS_DC DNS_DOMAIN DNS_FOREST
The command completed successfully
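If you collect this output from many DCs, it can be handy to turn it into data you can compare. The Python sketch below parses captured nltest /dsgetdc output into a dictionary and checks the Flags line for a given role. It assumes one "Name: value" field per line, as NLTest prints it; the function names are hypothetical, and only the field labels seen in the examples above are recognized.

```python
# Field labels as they appear in the nltest /dsgetdc examples above.
KNOWN_FIELDS = {
    "DC", "Address", "Dom Guid", "Dom Name", "Forest Name",
    "Dc Site Name", "Our Site Name", "Flags",
}


def parse_dsgetdc(output):
    """Parse captured 'nltest /dsgetdc' output into a dict of fields."""
    info = {}
    for line in output.splitlines():
        key, separator, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if not separator or key not in KNOWN_FIELDS:
            continue  # skips the command line itself and the completion message
        if key == "Flags":
            info[key] = value.split()
        else:
            # DC and Address are printed with leading backslashes.
            info[key] = value.lstrip("\\")
    return info


def has_flag(info, flag):
    """True if the DC that answered advertises the given flag, e.g. TIMESERV."""
    return flag in info.get("Flags", [])
```

For example, parsing the second listing above yields CXO-dc2.Company.com as the DC, and has_flag(info, "TIMESERV") confirms that it answered as a time server while has_flag(info, "PDC") shows that it is not the PDC.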

Event Logs

Never underestimate the power of the event logs. Microsoft made some great improvements in Windows 2000 events, putting more verbose information and some problem resolution help in the description field. Windows Server 2003 made further progress in this area by adding more troubleshooting tips to more events and includes references to relevant KB articles and a link to Microsoft's support Web site. Previously in this chapter, I described how to save the event log as a text file and import it into Excel to allow search and sorting of the event data. One powerful feature related to event logs is the verbose logging feature. For AD, you can enable verbose logging for a variety of AD functions that will dump verbose logging into appropriate event logs. This logging is enabled via the Registry on each DC in the following key:

HKLM\System\CurrentControlSet\Services\NTDS\Diagnostics

The options are shown in Figure 10.56. Note that Windows Server 2003 has added several new values not present in Windows 2000. The data defined for the various values is a hex number from 0 to 5. The default, 0, turns verbose logging off, whereas 5 is so verbose that the output takes forever to wade through and can fill your disk. Normally, you want to set the value to 3 and then crank it up higher if needed. When you are finished troubleshooting, reset it to 0; otherwise, depending on how the log is configured, it will either fill your disk or overwrite itself and hide useful information. For example, if you want more detailed events related to replication on GC servers, simply edit the value named 5 Replication Events and set the data to 3, and then do the same for the value named 18 Global Catalog.

Figure 10.56. The `NTDS Diagnostics` key in the Registry provides a way to set verbose logging for a variety of AD functions and processes.

note

Setting these values does not require a reboot of the server or the client. All you have to do is make the setting change and repeat the actions that reproduce the problem. The result will be more verbose events in the standard event logs in Windows 2003.
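If you have to turn logging up and back down on several DCs, it is easy to forget one. As a rough sketch, the Python below only builds reg add command lines for the Diagnostics key and the two value names mentioned above; it does not touch the Registry itself, and the function names are hypothetical. Review the generated commands before running them on a DC.

```python
NTDS_DIAGNOSTICS_KEY = r"HKLM\System\CurrentControlSet\Services\NTDS\Diagnostics"


def reg_add_command(value_name, level):
    """Build a 'reg add' command that sets one diagnostics value to a level 0-5."""
    if not isinstance(level, int) or not 0 <= level <= 5:
        raise ValueError("diagnostic logging level must be an integer from 0 to 5")
    return (
        f'reg add "{NTDS_DIAGNOSTICS_KEY}" /v "{value_name}" '
        f"/t REG_DWORD /d {level} /f"
    )


def logging_commands(value_names, level):
    """Build one command per diagnostics value, all at the same level."""
    return [reg_add_command(name, level) for name in value_names]


# Values taken from the replication-on-GC example in the text.
values = ["5 Replication Events", "18 Global Catalog"]
for command in logging_commands(values, 3):
    print(command)  # run these to turn logging up
for command in logging_commands(values, 0):
    print(command)  # run these when troubleshooting is finished
```

Generating the reset commands at the same time as the enable commands makes it harder to leave verbose logging on by accident.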

Remote Desktop

Perhaps one of the most powerful tools built in to Windows 2000 was Terminal Services Administration mode. Microsoft improved on this with the Remote Desktop in XP and Windows Server 2003. Built in to the OS, Remote Desktop has the capability for the remote session to see local drives and printers and cut and paste between the remote and local sessions. You'll find this a great troubleshooting tool, because you can cut and paste logs locally from a remote server.

Enhanced Use of SmartCards

Windows 2000 supported SmartCards in a low-level manner. Windows Server 2003 Administrators can use SmartCards to run DCPromo, execute Net and Runas commands, and use Terminal Services to remotely administer a machine.

Account Lockout Tools

Account Lockout continues to be a big issue for most help desks and was described in Chapter 5. Microsoft provided two tools:

AcctInfo.dll : Provides an extra tab in the user object property sheet to allow password resets to be done on the user's local DC.

LockoutStatus.exe tool : Manages the account lockout status of users.

These tools can be downloaded from http://www.microsoft.com/downloads/details.aspx?FamilyID=7af2e69c-91f3-4e63-8629-b999adde0b9e&DisplayLang=en.

Troubleshooting DNS

DNS is a critical component of AD. Every time a client needs a resource for authentication, GC searches, and so on, it uses DNS to find a server that can satisfy the request. Every time a DC replicates, it uses DNS. Any time any server needs to talk to another, it will probably require DNS. Broken DNS name resolution can ripple through the entire AD environment, causing problems with AD replication, problems with FRS/Distributed File System (DFS) replication, authentication failures, network resource access failures, and so on.

The Administrator should make sure the DNS structure is designed correctly and adheres to best practices, and should address failures immediately. DNS failures can be reported in the DS and FRS event logs as well as the DNS log, so make sure you look through all of them. You will often see "DNS Lookup Failure" included in other events, such as the infamous 1311 in the Directory Service event log, so you have to read the description of the event to see the DNS failure.

Some things you can do to test DNS name resolution include

Ping a failed DC or client by name and then by address. If address succeeds and name fails, there's a DNS problem.

Ping the DNS domain name by name (for example, Ping company.com). This forces a reply from a DNS server that is authoritative for the domain so you know whether the domain name can be resolved and whether the server replying for the domain is correct.

Run NSLookup to determine name resolution (requires reverse lookup zones be established).

Run DCDiag and NetDiag (described earlier) to test DNS in different ways.

Use the Replication Monitor tool (in Windows 2003 Support Tools) to generate a status report of DNS failures.

Use DNSLint (see Microsoft KB article 321045, "Description of the DNSLint Utility") to resolve DNS name resolution issues for replication. The KB has some details and a downloadable image.

Use the DNS Manager snap-in, which has a couple of nice troubleshooting features.

Run tests for a simple query and for a recursive query in the server properties page from the Monitoring tab. The recursive query tests to see whether forwarding and delegation are working. (To run the tests, select the check box by each test and then click OK.) This will help define where the problem is (this machine or someone else).

Use event logging to filter DNS events. All Events is the default.

Use debug logging to generate a log with the packets sent and received by DNS. This is disabled by default and has a number of configuration options.

Clear your server cache. One of the crucial steps in DNS troubleshooting when you are making changes is to ensure the client cache and server cache are cleared so you know you aren't testing against stale data. You clear the server cache by right-clicking on the DNS Server icon and selecting Clear Cache.

Launch NSLookup by right-clicking the DNS Server icon and selecting Launch NSLookup. This is just a shortcut to the NSLookup CLI tool.

Manually observe DNS records. It's always a good idea, if you have an issue with connecting to a DC, to look in the snap-in and determine whether the SRV, host, and Cname records for that DC are correct: correct address, name, and so on. If in doubt, delete them and restart Net Logon to register them.

Clear the client cache. On each client, execute the command ipconfig /flushdns . Just like the server, clearing the DNS cache on the client during troubleshooting ensures you aren't dealing with stale data.

Reregister DNS records (from workstation, servers, DCs). Just restart Net Logon to reregister its DNS records:

C:\>net stop netlogon & net start netlogon

Refer to Chapter 8 for DNS design and best practices as well as Microsoft's DNS Center at http://www.microsoft.com/dns.
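The ping-by-name-then-by-address test at the top of the list can be scripted when you have many DCs to check. The Python sketch below compares what a name resolves to against the address you expect. It is only an illustration: the function name is hypothetical, and it uses the operating system's resolver, which may also consult the hosts file or other name services, so treat a success as "the name resolves" rather than proof that DNS alone is healthy.

```python
import socket


def check_name_resolution(hostname, expected_address, resolve=socket.gethostbyname):
    """Compare the address a name resolves to with the address you expect.

    Returns a (status, detail) tuple where status is one of:
      "ok"            - the name resolved to the expected address
      "mismatch"      - the name resolved, but to a different address (stale record?)
      "lookup_failed" - the name did not resolve at all
    The resolve parameter exists so the lookup can be swapped out for testing.
    """
    try:
        resolved = resolve(hostname)
    except OSError as error:  # socket.gaierror is a subclass of OSError
        return ("lookup_failed", str(error))
    if resolved != expected_address:
        return ("mismatch", resolved)
    return ("ok", resolved)


# Usage with the sample DC from the NLTest examples:
#   status, detail = check_name_resolution("atl-dc1.company.com", "10.0.7.253")
#   print(status, detail)
```

A "lookup_failed" result for a machine that still answers a ping by address points at name resolution, which is the same conclusion the manual test reaches.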

Troubleshooting Replication

Next to DNS, replication is of prime importance in the health of the AD. Troubleshooting multimaster replication can be difficult, but with a good understanding of how it works and some tools, it's fairly predictable. Failures in the AD can be due to changes not being replicated. Adding a user or changing user attributes, such as passwords, user rights, and so on, might appear to not take effect due to replication failure. Group Policy depends on replication, so there could be a combination of things. The "Replication" section in Chapter 5 provided some good information on analyzing the topology. A poor design will produce a lot of problems. I'm not afraid to tell a customer with a problematic topology to fix the topology and then work on the individual problems. If the topology is sound, a few good tools can help diagnose the problems:

MPS Reports (DirSvc version) : This gives you a good snapshot of the environment, including replication, connectivity, name resolution, and so on.

Repadmin : The output of the command Repadmin /showreps is included in MPS Reports and is an excellent report of inbound and outbound replication on that DC, but a few other switches are valuable as well.

/replsum : Tests end-to-end replication in the forest. See the example in Chapter 11, "Disaster Recovery," where we talked about manual demotion of a DC. The sample output shown here demonstrates how this command can quickly show that four DCs have replication errors, and that they have not replicated for 11 days, 8 days, 5 days, and 4 days. This is considerably faster and easier than reviewing event logs on all DCs just to find these four problems.

Replication Summary Start Time: 2004-02-17 19:38:52

Beginning data collection for replication summary, this may take awhile:
  .............................

Source DC           largest delta    fails/total  %%   error
 HPQAM-DC3       11d.06h:19m:22s      3 /   5     60   (1722) The RPC server is unavailable.
 HPQEU-DC4       08d.12h:34m:06s      3 /   3    100   (1722) The RPC server is unavailable.
 HPQNET-DC2      05d.11h:36m:26s      3 /   5     60   (1722) The RPC server is unavailable.
 HPQEU-DC26      04d.17h:53m:00s      6 /   6    100   (1753) There are no more endpoints available from the endpoint mapper.

/removeLingeringObjects : A powerful tool for a big problem. Previously, we had to track these lingering objects down by hand and figure out how to delete them. Now we do it with a simple command (hopefully). Chapters 4, "Assessment of the Enterprise," and 7, "ProLiant Server Installation and Deployment," provide more information on lingering objects.

/latency : Shows delta since last successful replication and the latency of each DC in the forest. Good way to identify machines that are way out of sync.

/bridgeheads : Lists all bridgehead servers (BHSs) in the forest.

Replication Monitor : Perhaps the most valuable tool still for replication. No changes were made for Windows Server 2003. Especially valuable is the Status Report, which I usually request in addition to MPS Reports. You can also generate a nice list of all the replication errors in the domain by choosing Action, Domain, Search Domain for Replication Errors. Click the Run Search button, enter the domain to analyze, and click OK. The list can be output as a text file by selecting the Run As button. Replication Monitor can also
- List and test FSMOs
- Show current status of replication on each DC (and error code if failure)
- Show application of Group Policy
- Show metadata
- Identify NCs replicated by each DC

HP OVOW and the ADTV tool : As noted earlier in this chapter, ADTV is a powerful tool, providing a graphic display of the AD replication topology, identifying sites, connections, site links, GCs, and reporting errors. Examples of ADTV were shown in Figures 10.5 and 10.6 in this chapter.

To effectively diagnose replication problems, you must take a holistic view of the entire forest. Replication runs in the configuration NC, a forest-level context, so you need to take a forest-level approach. Table 10.14 shows a quick checklist of troubleshooting steps when you suspect replication failure.

Table 10.14. Replication Troubleshooting Checklist

Task or Problem	Procedure	Tool(s)
Validate DNS.	Test name resolution to/from DC, DNS errors, DNS configuration	Ping, Event Logs, DNSLint.exe, MPS Reports, Replication Monitor Status report
Validate Cname DNS record (server GUID to fully qualified domain name (FQDN) mapping, required for successful replication).	Ping Cname record `<server guid>._msdcs.domain.com`	Ping, DNS snap-in to determine server GUID of Cname record for this DC (on failure, delete the Cname record, let the DC re-register it)
Test inbound and outbound replication.	1. Create user on problem DC and partner DC to see whether both DCs get both users.	Sites & Services and Users & Computers snap-ins, `Repadmin /showreps`
Test inbound and outbound replication.	2. Force replication from both sides (to/from).	MPS Reports
Test end-to-end replication; see which DCs are in error.	`Repadmin /replsum /bysrc /bydest /sort:delta`	Repadmin.exe
Validate the topology.	Accurately diagram sites, DCs, GCs, site links and cost, connections.	HP OVOW, Quest's Snapshot
Replication failure for greater than tombstone lifetime.	Disconnect, manually demote, repromote.	`DCPromo /forceremoval` + cleanup (see Chapter 11)
RPC server not available.	Test connectivity.	Ping by name
Event 1311.	Follow procedure in the event description.	Sites & Services Snap-in
Duplicate inbound connections between two DCs.	Delete all duplicate objects and use the Check Topology option to get the KCC to rebuild the connections. If they keep coming back, it's a DNS issue.	Sites & Services Snap-in
No inbound connection objects.	Create a manual connection object from another DC in the same site if possible and force replication across it.	Sites & Services Snap-in
Unable to establish outbound connection after DCPromo.	See the "Troubleshooting DCPromo" section in this chapter.
Isolated to one DC?	Consider manual demotion, repromotion (see Chapter 11).	DCPromo /forceremoval

Hopefully, this will give you some ideas about how to troubleshoot this complex technology. Reviewing the case studies and experiences in Chapter 5 should help as well. Consider these techniques when you build an action plan.
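The Cname check from Table 10.14 can also be scripted. The Python sketch below is only an illustration. It assumes the GUID-based record is registered under the `_msdcs` zone of the forest root domain, and the helper names, the GUID, and the domain in the usage note are all hypothetical. The resolver is passed in as a parameter so the logic can be exercised without a live DNS server.

```python
import socket


def dc_guid_cname(server_guid, forest_root):
    """Build the GUID-based Cname name for a DC (assumed _msdcs layout)."""
    guid = server_guid.strip().strip("{}").lower()
    return f"{guid}._msdcs.{forest_root.strip('.').lower()}"


def check_cname(server_guid, forest_root, resolve=socket.gethostbyname):
    """Return (name, address), with address None if the name does not resolve."""
    name = dc_guid_cname(server_guid, forest_root)
    try:
        return name, resolve(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        return name, None
```

A call such as `check_cname("{0F5D3A1C-1111-2222-3333-444455556666}", "domain.com")` returns the name it tried and either an address or `None`. A `None` result points at the delete-and-re-register step in the table.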

Troubleshooting DCPromo

DCPromo troubleshooting is pretty straightforward, and errors can be solved if you understand where in the process the failure occurs. There are two phases of DCPromo. In phase 1, you answer all the questions in the UI. Phase 2 is after the reboot, and there are no messages, no UI, and so on that alert you to success or failure, but a lot of activity goes on behind the scenes. You might think that if you get to the reboot without an error that DCPromo was successful, but that's only half of the process. The following sections break down the DCPromo process step by step, and I've included troubleshooting tips at each step. The DCPromo logs are important during the debug process. First, let's look at the logs and some preliminary items.

DCPromo Logs

There are two logs, DCPromo.log and DCPromoui.log, located in %systemroot%\debug. DCPromo.log is a fairly terse log, and each subsequent DCPromo run appends its output to the end of the file. Thus, there is only one log, which can hold the output of multiple DCPromo runs.

DCPromoui.log logs all the information seen in the UI so you'll see prompts, answers, and so on. DCpromo.log and DCpromoui.log will contain errors, but sometimes there will be different errors logged, so check both logs. I usually start with DCpromo.log because it's easier and less verbose (remember that the most recent data is at the end of the file). If I need to do more analysis, I'll move to the DCpromoui.log. These logs really aren't hard to read and provide good information, such as the source DC, the Time Services synchronization, credentials used, and so on. A few pointers on evaluating these logs include

Read DCPromo.log bottom-up because the latest info is on the bottom (check the time stamp). This is a quick read and I usually do it first.

Errors reported in either log can be looked up with Net Helpmsg. Sometimes Net Helpmsg gives a different message, and you might also get different messages between DCPromo.log and DCPromoui.log.

See if there are errors in the system or DS event logs that coordinate with errors in DCPromo.

DCPromoui.log gets rewritten with each DCPromo attempt. When DCpromoui is run the second time, it renames the log to DCPromoui.001.log. The next time DCPromo runs, the dcpromoui.log gets renamed to DCpromoui.002.log, and so on. DCPromoui.log will always be the log of the latest DCPromo attempt.

DCpromoui can be run in verbose mode (see Microsoft KB article 221254, "Registry Settings for Event Detail in the DCPromoui.log File").

In DCPromoui.log, as it goes through each module, you'll see a line "Error (0x0) - Success." This is not an error. However, if you see numbers other than 0x0, that is a problem. Look at the module above that message and see what it is doing.
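The last pointer lends itself to a quick script. The Python sketch below assumes only what is described above: result lines contain the text "Error (0x...)", a value of 0x0 means success, and the line just before a failure says what the module was doing. The exact log layout beyond that is an assumption, and `find_failures` is a hypothetical helper name.

```python
import re

# Result marker as described in the text: "Error (0x0)" means success.
RESULT = re.compile(r"Error\s*\((0x[0-9A-Fa-f]+)\)")


def find_failures(lines):
    """Return (line_number, code, previous_line) for each non-zero result."""
    failures = []
    previous = ""
    for number, line in enumerate(lines, start=1):
        m = RESULT.search(line)
        if m and int(m.group(1), 16) != 0:
            failures.append((number, m.group(1), previous.strip()))
        if line.strip():
            previous = line  # remember the last non-blank line as context
    return failures
```

To use it, read the log from %systemroot%\debug into a list of lines (opening the file with `errors="replace"` avoids encoding surprises) and pass that list to `find_failures`.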

Step 1 Data Gathered, Credentials Checked

In this section, we collect the data entered from the UI or answer file (forest, domain, new or replica DC, credentials, and so on). To create a replica, you need Domain Admin rights. To create a new domain or demote the last DC in a domain, you need Enterprise Admin rights. Failures are recorded in the DCPromo*.log.

DNS Check

In Windows Server 2003, a DNS summary screen appears just before replication begins. This confirms whether DNS is working or not. If it fails, check DNS configuration:

Make sure the DC's TCP/IP properties are pointing to a valid DNS. If this DC is also a DNS and is the first DNS in the domain and you are using AD Integrated DNS, point this machine to itself for DNS.

note

In Windows 2000, to avoid the "DNS Island" problem in Active Directory Integrated (ADI) zones, only one DC/DNS should be pointed to itself for DNS. All other DCs/DNSs should point their TCP/IP properties to this single DNS. (See Microsoft KB article 275278, "DNS Server becomes an Island when a domain controller points to itself for the _msdcs.forestdnsname domain.") This has been addressed in Windows Server 2003 and is no longer a requirement. However, in spite of Microsoft's assertion that it only affects replication, I have routinely required customers to configure DNS so that only one ADI Primary name server points to itself for DNS for each ADI zone, and it has always made a big performance improvement. This may well be due to replication. My recommendation is to still configure ADI DNS zones this way (only one DNS server per ADI zone pointing to itself for DNS), as Microsoft at this writing has not definitively stated to do otherwise.

Make sure the DNS zone is set for dynamic updates (zone properties). This is NOT a requirement, but it is a big administrative issue if you have to manually register all the SRV records, etc.

If this is the first DC in the domain, you can select the option on the DNS summary to let DCPromo build DNS. In Windows 2000, this was not recommended, but in Windows Server 2003, it works great. Of course, you can preconfigure DNS as well.

Remember the first DC in the forest does not need to contact a DNS. There are no DNS queries needed. So if DNS is not working, it will likely not show up until you try to join the second DC.

Step 2 Machine Account Moved to DC's OU

At this point, the UserAccountControl attribute on the server object will be changed. For a workstation or member server, this attribute is 4096 (displayed in decimal in Windows 2000) or 0x1000 (displayed in hex in Windows Server 2003). For a DC, it is updated to 532480 (decimal, Windows 2000) or 0x82000 (hex, Windows Server 2003). These are the same values in different notation: 532480 is the decimal equivalent of 0x82000, and 4096 is the decimal equivalent of 0x1000. The attribute can be viewed with the LDP or ADSIEDIT tool by browsing to the server object and viewing the attributes, as shown in Figure 10.57.

Figure 10.57. The `UserAccountControl` attribute is exposed in the LDP tool for a DC in this example.
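The two UserAccountControl values make more sense when broken into bit flags. The Python sketch below uses three flag names and values from Microsoft's published userAccountControl flag list. They are not given in this chapter, and the helper names are hypothetical.

```python
# Flag values from Microsoft's userAccountControl documentation.
WORKSTATION_TRUST_ACCOUNT = 0x1000   # workstation or member server
SERVER_TRUST_ACCOUNT = 0x2000        # domain controller
TRUSTED_FOR_DELEGATION = 0x80000

FLAGS = {
    WORKSTATION_TRUST_ACCOUNT: "WORKSTATION_TRUST_ACCOUNT",
    SERVER_TRUST_ACCOUNT: "SERVER_TRUST_ACCOUNT",
    TRUSTED_FOR_DELEGATION: "TRUSTED_FOR_DELEGATION",
}


def decode_uac(value):
    """Return the names of the known flags set in value, lowest bit first."""
    return [name for bit, name in sorted(FLAGS.items()) if value & bit]


def is_domain_controller(value):
    """True if the account is flagged as a DC machine account."""
    return bool(value & SERVER_TRUST_ACCOUNT)
```

Decoding 4096 gives the workstation flag only, while 532480 (0x82000) gives the server trust and delegation flags together. That combination is what you would expect to see on the machine account once this step succeeds.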

If the server being promoted is a member of the domain, the machine account is now moved to the DC's OU. If it's in a workgroup, it's created in the Computers container in the domain, and then moved to the DC's OU. If DCPromo fails to create the machine account (visible in the dcpromo.log and dcpromoui.log), you can try to join the server to the domain first and then run DCPromo. You join the computer to the domain by right-clicking My Computer, and choosing Properties, Computer Name, Change. Set the Member Of option to Domain and then specify the domain name. It's easier to resolve the issue that way because it breaks DCPromo into two pieces, making it easier to see what the problem is. If joining the domain in this manner fails, you can start checking things such as DNS until it joins the domain. After that is resolved, you can run DCPromo again to determine whether the next part of DCPromo fails. After this is successful, the Machine Acct. (that is, DC1$) should be in the DC's OU.

Step 3 Source DC Located (Using DC Locater)

The source DC is identified via the DC Locater process using DNS. The source DC is identified in the DCpromo.log and dcpromoui.log. If there is failure at this point, there are several options:

You can force the new DC to source from a specific DC if it's finding a remote DC when a DC from the same domain is available, which happens more than you might think. The command to force source from DC45 is Dcpromo /replicationsourcedc=DC45

You can also go into the Sites and Services UI and move the desired source DC into the same site as the new DC so that the DC locator will find it first (if there are no other DCs in the site). After replication takes place, you must move the source DC back to maintain the designed AD replication topology.

Check DNS by pinging the source DC by name.

Check your account permissions. You need domain Administrator credentials to create a replica DC. You need Enterprise Admin rights to create the first DC in a new domain.

Step 4 AD Replicated from Source DC

Inbound connections are created using UDP from the source to the new DC. If this is the first DC in a site, it needs to go outside the site, and use UDP. I've seen issues where a firewall blocked UDP traffic, so check that as well. At this point, the AD (NTDS.DIT) will be replicated from the source to the new DC.

note

At this point, the computer will reboot to finish DCPromo. This completes Phase 1 of DCPromo.

Step 5 Phase 2

After the system reboots, the following will take place in this order:

Outbound connections are created to replicate AD information to the new DC's replication partners. This includes information such as the UserAccountControl attribute, the Computer object moving to the DC's OU, and the creation of the Server object (used for replication).

SYSVOL is populated. The SYSVOL tree, including Group Policy templates and logon scripts, is copied to the new DC.

SYSVOL and Net Logon shares are created.

Troubleshooting for this phase begins with simply determining whether the SYSVOL and Net Logon shares were created. If they were, then DCPromo was successful. If not, then replication was not successful. This typically shows up as inbound replication working without outbound replication. That is, changes made on other DCs will be replicated to the new one, but changes made on this DC will not be replicated to the others.

Proceed with the following troubleshooting steps.

tip

You cannot solve the problem of SYSVOL and Net Logon shares not showing up by creating those shares. You can create the share, but it won't fix the problem. This is typically a replication issue, which can be in turn a DNS issue or simple connectivity. Solve the replication issue, and the shares will be created. You don't have to rerun DCPromo.
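Checking for the two shares is easy to automate. The Python sketch below assumes the usual `net share` listing, in which each share name starts in the first column of its line and wrapped remarks are indented. The function names are hypothetical, and `read_net_share` only works on a Windows machine.

```python
import subprocess

REQUIRED_SHARES = ("SYSVOL", "NETLOGON")


def share_names(net_share_output):
    """Collect the first token of every non-indented line, upper-cased."""
    names = set()
    for line in net_share_output.splitlines():
        if not line or line[0].isspace():
            continue  # blank lines and wrapped remark lines
        names.add(line.split()[0].upper())
    return names


def missing_dc_shares(net_share_output):
    """Return the required DC shares that do not appear in the listing."""
    present = share_names(net_share_output)
    return [share for share in REQUIRED_SHARES if share not in present]


def read_net_share():
    """Run 'net share' locally and return its text output (Windows only)."""
    return subprocess.run(["net", "share"], capture_output=True,
                          text=True, check=True).stdout
```

An empty list from `missing_dc_shares(read_net_share())` means both shares exist. Anything else means replication still needs attention, as the tip above explains.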

There are a couple of tricks to troubleshooting this. Microsoft produced a pretty good article: KB article 327781, "How to Troubleshoot Missing SYSVOL and NETLOGON Shares on Windows Server 2003 Domain Controllers." One of the suggestions, creating manual connection objects, is a good one. Sometimes you just need to give the KCC a kick and it creates the connections.

However, working this issue awhile back in Windows 2000 with a Microsoft engineer, we found a very reliable way to solve this if the suggestions in the KB don't work. You use the Repadmin /add command to add a replication link (kind of a low-level connection) between the two servers; then, execute a repadmin /sync to force replication across it. It's nondestructive and solves a lot of these DCPromo problems. The article is available on this book's Web site at http://www.phptr.com/title/0131467581.

Troubleshooting Group Policy

Group Policy failures are usually manifest in user complaints of authentication, access rights, desktop lockdown (unexpected limits), password failures (due to change in complexity policies, and so on), and logon script failure (mapped drives aren't showing up). This could be caused by AD replication failure, FRS failure, or DNS lookup failures.

Tools

The most common tools used for Group Policy troubleshooting include GPresult, Group Policy Management Console (GPMC), GPO tool, and event logs:

GPresult : Run on the local computer to expose all GPO settings applied to the user and computer. The Windows 2000 version was a reskit utility that had to be copied to clients for testing, and it simply reported which DC the client got the security policy from. The new version is built in to XP and Windows Server 2003 and displays Resultant Set of Policy (RSoP) information, including security settings such as password length and Kerberos policy, and other detailed client-side extension settings. This is an incredibly valuable tool. I've actually required customers to go out and buy XP so they could diagnose a Group Policy problem in their Windows 2000 domain. If you have a client joined to a domain, you can execute gpresult from the command line and see the output. This new version also reports any security filtering; that is, whether any GPOs are blocked by security filters. In Windows 2000, we had to drill down through the Group Policy Editor's security settings to determine that. GPResult is run in verbose mode (/v) for MPS Reports. See Chapter 5 for detailed examples of how to use it within the GPMC tool.

GPO Tool : A DC-based tool that analyzes the GPOs stored on each DC, tests integrity, and compares the AD and SYSVOL version numbers. If there is a mismatch, policy fails to apply. A mismatch of a few version numbers is normal after a change is made to a GPO, until FRS replication completes. If it is off by thousands, you need to determine why (replication problem, and so on).

Group Policy Management Console (GPMC) : Allows the export of all GPO settings to a file so you can review them without clicking through the UI. You can also import settings via the GPMC. Chapter 5 contains a very detailed description of how to use GPMC to troubleshoot Group Policy, including the use of WMI filters. You can also refer to Microsoft's Web site at http://www.microsoft.com/GP for more information on and examples of this tool. Jeremy Moskowitz's book, Group Policy, Profiles and IntelliMirror for Windows 2003, Windows 2000 and Windows XP (Sybex, 2004) is also an excellent resource. You can get more details on that book at http://www.amazon.com/exec/obidos/tg/detail/-/0782142982/qid=1066869151/sr=8-3/ref=sr_8_3/103-1472978-5343063?v=glance&n=507846.

Event logs : Look for FRS failures, AD replication failures, and DNS failures. You can also check the application log for Event ID 1704 (source SCECLI), which indicates successful application of policy. If you don't see these events, you have a problem. Group Policy-specific events will be in the application log.

GPupdate : This built-in utility replaces the old Secedit /refreshpolicy command in Windows 2000. It refreshes policies after changes have been made so you can test immediately instead of waiting for the next refresh.
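The version comparison described under GPO Tool can be illustrated with a short Python sketch. It relies on details not stated in this chapter: the GPO version is a single 32-bit number with the user version in the upper 16 bits and the computer version in the lower 16 bits, and the SYSVOL copy is read from the `Version=` line in the `[General]` section of GPT.INI. The function names and the `threshold` of 1000 (standing in for "off by thousands") are hypothetical.

```python
import configparser


def split_version(version):
    """Split a packed GPO version into (user_version, computer_version)."""
    return version >> 16, version & 0xFFFF


def read_gpt_ini_version(gpt_ini_text):
    """Read Version= from the [General] section of GPT.INI text."""
    parser = configparser.ConfigParser()
    parser.read_string(gpt_ini_text)
    return parser.getint("General", "Version")


def compare_versions(ad_version, sysvol_version, threshold=1000):
    """Classify the gap between the AD and SYSVOL copies of a GPO."""
    ad_user, ad_computer = split_version(ad_version)
    sv_user, sv_computer = split_version(sysvol_version)
    gap = max(abs(ad_user - sv_user), abs(ad_computer - sv_computer))
    if gap == 0:
        return "match"
    # A small gap is expected until FRS catches up; a large one is not.
    return "lagging" if gap < threshold else "investigate"
```

Comparing the two halves separately matters because a raw difference of 65536 is only one user-side revision, not tens of thousands of changes.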

Why Policy Isn't Applied

The most common Group Policy problem that has to be resolved is determining why Group Policy isn't being applied. Some causes for this include

The user account is not in the OU/domain structure to allow it to be applied. Remember, policy is applied only to the user, not to groups the user is a member of.

The policy hasn't been replicated to the user's logon server (check the LOGONSERVER environment variable on the client, and then use Replication Monitor to determine whether that DC has had the GPO replicated).

A replication error is preventing the Group Policy from replicating to the authenticating DC for the user (FRS or AD replication).

GPO changes aren't being saved (user error).

Group Policy inheritance might be blocked if the policy is on an upper-level domain.

There may be a conflicting policy that takes precedence (for example, you set the policy to disable something and a higher-priority policy enables it, so you get it enabled).

User or group may be "filtered;" they don't have read and apply Group Policy rights to the policy. GPResult and the GPMC tool will both indicate if a policy is not applied to the user or computer due to filtering.

Use GPResult to see if the GPO is being applied (it lists the GPOs applied).

For additional information on Group Policy troubleshooting, see http://www.microsoft.com/gp.

Troubleshooting FRS/DFS

You were given a pretty comprehensive discussion on FRS/DFS issues in Chapter 5 in the "File Replication Service" section. The Ultrasound, Sonar, and FRSDiag tools, as described in that section, are a great help in managing FRS and diagnosing FRS problems. Also, the Ultrasound help file is invaluable for diagnosing FRS issues, providing information on how it works, resolutions, and a listing of common FRS events, what they mean, and common solutions. Remember that FRS is dependent upon AD replication, which is dependent on RPC and DNS. Troubleshoot FRS problems by starting at DNS to make sure it is healthy, see if there are AD replication problems, and then move to FRS. The FRS event log is also helpful.
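The order of investigation described here (DNS first, then AD replication, then FRS) can be captured in a few lines of Python. This is a generic sketch, not a real diagnostic. The check functions are placeholders you would supply yourself, and `run_in_dependency_order` is a hypothetical name.

```python
def run_in_dependency_order(checks):
    """Run (name, check) pairs in order; skip everything after a failure.

    Each check is a callable returning True when that layer is healthy.
    A check that raises an exception is treated as a failure.
    """
    results = []
    blocked = False
    for name, check in checks:
        if blocked:
            results.append((name, "skipped"))
            continue
        try:
            healthy = bool(check())
        except Exception:
            healthy = False
        results.append((name, "ok" if healthy else "failed"))
        blocked = not healthy
    return results
```

Passing checks named "DNS", "AD replication", and "FRS" in that order means a broken lower layer is reported before any time is spent on the layers that depend on it.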