UNIX Fault Management: A Guide for System Administrators
Many UNIX commands exist to check configuration, status, and resource information. These tools generally report only a snapshot in time. You can write or use custom scripts that incorporate these or other commands and run them periodically, so that you can track configuration changes or test the status of system resources over time.

The more commonly used commands are described in this section, organized alphabetically. You may also want to check the online man pages for additional information about each command. Unless otherwise noted, the commands listed in this section are available on multiple UNIX platforms. (Tools that are specific to networking, such as netstat and nfsstat, are discussed in Chapter 6.)

In addition to these commands, you may want to check the system log file, /var/adm/syslog/syslog.log, for error messages if your system is experiencing problems. Messages written to this log file identify the module experiencing the problem and the time that the event occurred, which can be very valuable when troubleshooting.

bdf and df
The bdf and df commands are commonly used to show the amount of disk and swap space used and available. bdf -i also reports the number of used and free inodes for each filesystem. By default, bdf shows information for all mounted filesystems. If this output is too lengthy, you can specify a single filesystem on the command line. An example is shown in Listing 4-1.

ioscan
The ioscan command is used to discover and display the system hardware, usable I/O system devices, or kernel I/O system data structures. For each device, the output lists the hardware path, the class of hardware, and a brief description. ioscan covers the following hardware: processors, memory, network interface cards, and I/O devices. Listing 4-2 shows how you can check the number of processors on your system by using ioscan.

ioscan is a good tool for getting a complete picture of your system hardware layout. It also reports the software state of each device, indicating whether the proper driver is loaded. By storing the command output in files, you can maintain a history of the hardware configuration changes to your system.

iostat
iostat reports CPU statistics and I/O statistics for disks and terminals. For disks, it lists the device name, kilobytes transferred per second (bps), number of seeks per second (sps), and milliseconds per average seek (msps). For terminals, it shows the number of characters read and the number of characters written. For the CPU, it shows the percentages of time that the system has spent in user mode, nice mode (low-priority user processes), system mode, and idle. Listing 4-3 shows sample output for a system with only one physical disk.

Listing 4-1 bdf output for a specific filesystem.
# bdf /dev/vg00/lvol3
Filesystem          kbytes    used   avail %used Mounted on
/dev/vg00/lvol3     126976   33003   93912   26% /
#

Listing 4-2 Output from ioscan for a two-processor HP-UX system.
# ioscan | grep processor
32             processor   Processor
34             processor   Processor
#

Listing 4-3 Output from the iostat command showing performance measures for disks, terminals, and CPU.
# iostat -t
                  tty                   cpu
              tin  tout         us   ni   sy   id
                0     1          1    0    2   97

  device    bps     sps    msps
  c0t5d0      0     0.0     1.0

You may want to use iostat to compare the activity on different disks, to see whether a load imbalance exists. It is normal for the system disk to have more activity.

ipcs
The ipcs command shows the status of active message queues, shared memory segments, and semaphores. Listing 4-4 shows example output from ipcs. You may want to consult the online man page to see all the available options for this command.

mailstats
If your system is being used as a mail server, you may want to use mailstats to check mail statistics. The mailstats command shows the number of messages and the amount of data sent or received for each mailer running on the system.

ps
The ps command is used to display information about all processes on the system. The metrics provided by ps include the process identifier (PID), parent PID, process start time, cumulative execution time, process state, priority, physical size (in pages), and the command with its command-line options.

ps is a quick way to get a profile of the processes on your system. It is useful for checking whether a specific application or process is running. For example, Listing 4-5 shows an easy way to display the Network File System (NFS) daemons running on your system. The ps output can also be used to identify runaway processes, in terms of both CPU time and size. Numerous processes in the wait state may be an indication of a system bottleneck.

sar
sar is the System Activity Reporter. It is useful for monitoring system activity and can be used to identify memory, CPU, and kernel bottlenecks. You can specify the sampling interval, and sar can log its data to a file (in binary format). It can report on activity from many system resources, including CPU utilization by processor, buffer cache, swapping, disks and tape, run and swap queues, and several system tables. Refer to the online man page for the command-line options.

Listing 4-4 Output from ipcs showing active message queues, shared memory, and semaphores.
# ipcs
IPC status from /dev/kmem as of Sun Mar 14 17:47:20 1999
T      ID     KEY        MODE        OWNER     GROUP
Message Queues:
q       0 0x3c1c0330 -Rrw--w--w-      root      root
q       1 0x3e1c0330 --rw-r--r--      root      root
Shared Memory:
m       0 0x2f180002 --rw-------      root       sys
m     201 0x411c031b --rw-rw-rw-      root       sys
m     402 0x4e0c0002 --rw-rw-rw-      root       sys
m     403 0x41201219 --rw-rw-rw-      root       sys
Semaphores:
s       0 0x2f180002 --ra-ra-ra-      root       sys
s      65 0x411c031b --ra-ra-ra-      root       sys
s     130 0x4e0c0002 --ra-ra-ra-      root       sys
s     131 0x4120121a --ra-ra-ra-      root       sys
s       4 0x00446f6e --ra-r--r--      root      root
s       5 0x00446f6d --ra-r--r--      root      root
s       6 0x01090522 --ra-r--r--      root      root
s       7 0x411c1f3a --ra-ra-ra-      root      root
s       8 0x410c319a --ra-ra-ra-      root      root
#

Listing 4-5 Finding your NFS daemons.
# ps -ef | grep -E 'nfs|PPID'
     UID   PID  PPID  C    STIME TTY       TIME COMMAND
    root   681     1  0   Dec 22 ?         0:00 /usr/sbin/nfsd 4
    root   682   681  0   Dec 22 ?         0:00 /usr/sbin/nfsd 4
    root   686   681  0   Dec 22 ?         0:00 /usr/sbin/nfsd 4
    root   688   681  0   Dec 22 ?         0:00 /usr/sbin/nfsd 4
    root 16761 16718  1 12:14:48 pts/0     0:00 grep nfs
#

For CPU activity, sar breaks utilization down into user mode, system mode, idle time spent waiting for I/O to complete, and other idle time, either per processor or averaged across all processors. Sample output is shown in Listing 4-6.

Listing 4-6 sar output showing system activity.
# sar 5 5

HP-UX cadbury B.10.20 A 9000/871    03/15/99

20:36:32    %usr    %sys    %wio   %idle
20:36:37       0       1       0      98
20:36:42       0       1       0      99
20:36:47       1       1       0      99
20:36:52       0       1       0      99
20:36:57       0       1       0      99

Average        0       1       0      99
#

By using sar -q, you can look at the average lengths of the run and swap queues, and the percentage of time that the queues were occupied. This is shown in Listing 4-7. High CPU utilization combined with a large run queue may indicate a CPU bottleneck. A large swap queue is one sign of memory contention.

sar can also be used to check the effectiveness of buffer cache use. It reports the rates of reads and writes between a disk and the buffer cache, the rates of logical reads and writes to and from the buffer cache, and the buffer cache hit ratios. For swapping activity, you can monitor swap-in rates, swap-outs per second, and context switch rates. sar -v reports the current size, maximum size, and number of overflows of various system tables, including the process table, inode table, and system file table.

Listing 4-7 Output from sar showing queue lengths.
# sar -q 5 5

HP-UX cadbury B.10.20 A 9000/871    03/15/99

20:44:03 runq-sz %runocc swpq-sz %swpocc
20:44:08     1.0      20     0.0       0
20:44:13     0.0       0     0.0       0
20:44:18     0.0       0     0.0       0
20:44:23     1.0      10     0.0       0
20:44:28     0.0       0     0.0       0

Average      1.0       6     0.0       0
#

swapinfo
swapinfo reports system paging or swapping activity and memory utilization. On some implementations of UNIX, it is called swap. This command is useful for showing swap space usage and configuration. For each swap type and device, it displays the kilobytes (K) available, kilobytes used, kilobytes free, and percentage used. If you have insufficient memory, you may see many pages being swapped or high utilization of the swap device.

An example using swapinfo is shown in Listing 4-8. For filesystem swap areas, RESERVE is the number of 1K blocks reserved for filesystem use by ordinary users. For device swap areas, this value is always "-". Checking swapinfo periodically may help you to schedule additions to your swap capacity.

sysdef
The sysdef command, available on HP-UX, reports on a system's tunable kernel parameters. For each kernel parameter, it shows the current value, the value at boot time, and the minimum and maximum values allowed, as demonstrated in Listing 4-9. This command can be used both to monitor whether the system kernel is configured properly and to track whether the usage of a kernel resource is at or approaching its configured limit. You can also use this command, together with ioscan, to track kernel configuration changes.

Listing 4-8 Output from the swapinfo command shows system paging activity.
# swapinfo
             Kb      Kb      Kb   PCT  START/      Kb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE  PRI  NAME
dev      524288   12488  511800    2%       0       -    1  /dev/vg00/lvol2
reserve       -  246876 -246876
memory   404396  207844  196552   51%

Listing 4-9 Showing current values of kernel-tunable parameters.
# sysdef
NAME                     VALUE  BOOT  MIN-MAX        UNITS   FLAGS
acctresume                   4  -     -100-100               -
acctsuspend                  2  -     -100-100               -
allocate_fs_swapmap          0  -     -                      -
bufpages                 10714  -     0-             Pages   -
create_fastlinks             0  -     -                      -
dbc_max_pct                 50  -     -                      -
dbc_min_pct                  5  -     -                      -
default_disk_ir              0  -     -                      -
dskless_node                 0  -     0-1                    -
eisa_io_estimate           768  -     -                      -
eqmemsize                   15  -     -                      -
file_pad                    10  -     0-                     -
fs_async                     0  -     0-1                    -
hpux_aes_override            0  -     -                      -
maxdsiz                  16384  -     256-655360     Pages   -
maxfiles                   120  -     30-2048                -
maxfiles_lim              1024  -     30-2048                -
maxssiz                   2048  -     256-655360     Pages   -
maxswapchunks              256  -     1-16384                -
maxtsiz                  16384  -     256-655360     Pages   -
maxuprc                     75  -     3-                     -
maxvgs                      10  -     -                      -
msgmap                 2555904  -     3-                     -
nbuf                      5772  -     0-                     -
ncallout                   316  -     6-                     -
ncdnode                    150  -     -                      -
ndilbuffers                 30  -     1-                     -
netisr_priority             -1  -     -1-127                 -
netmemmax             14356480  -     -                      -
nfile                     1034  -     14-                    -
nflocks                    200  -     2-                     -
ninode                     500  -     14-                    -
no_lvm_disks                 0  -     -                      -
nproc                      300  -     10-                    -
npty                        60  -     1-                     -
nstrpty                     60  -     -                      -
nswapdev                    10  -     1-25                   -
nswapfs                     10  -     1-25                   -
public_shlibs                1  -     -                      -
remote_nfs_swap              0  -     -                      -
rtsched_numpri              32  -     -                      -
sema                         0  -     0-1                    -
semmap                 4128768  -     4-                     -
shmem                        0  -     0-1                    -
shmmni                     200  -     3-1024                 -
streampipes                  0  -     0-                     -
swapmem_on                   1  -     -                      -
swchunk                   2048  -     2048-16384     kBytes  -
timeslice                   10  -     -1-2147483648  Ticks   -
unlockable_mem            2158  -     0-             Pages   -

timex
The timex command can be used to measure and report, in seconds, the elapsed time, user CPU time, and system CPU time spent executing a given command. The command to be executed is given on the timex command line. timex can also report process accounting data for the command and all of its children (the -p option), as well as the total system activity during execution of the command (the -s option). The timex command can give you a crude idea of the impact of a command on the rest of the system.

top
The top command is useful for monitoring the system CPU and memory loads. It also lists the most active processes on the system. top output is displayed in the terminal window and is updated every five seconds, by default.

top shows CPU resource statistics, including load averages (job queues over the last 1 minute, 5 minutes, and 15 minutes), the number of processes in each state (sleeping, waiting, running, starting, zombie, stopped), and the percentage of time spent in each processor state (user, nice, system, idle, interrupt, and swapper) for each processor on the system, as well as the average across all processors in a multiprocessor system. For memory utilization, top shows virtual and real memory in use, the amount of active memory, and the amount of free memory.

At the process level, top lists the top processes, based on their CPU usage. The process data displayed by top includes the PID, process size (text, data, and stack), resident size of the process (K), process state (sleeping, waiting, running, idle, zombie, or stopped), the number of CPU seconds consumed by the process, and the average CPU utilization of the process. This command can be used to identify processes that may be using large amounts of CPU or memory.

Note that top can also be a quick way to check the number of processors on your system. Listing 4-10 shows the output for a four-processor system.

Listing 4-10 Output from the top command showing process activity.
System: gsyview1                              Fri Feb 12 13:40:24 1999
Load averages: 0.08, 0.11, 0.16
616 processes: 614 sleeping, 2 running
Cpu states:
CPU   LOAD   USER   NICE    SYS   IDLE  BLOCK  SWAIT   INTR   SSYS
 0    0.30   0.0%   0.0%   1.3%  98.7%   0.0%   0.0%   0.0%   0.0%
 1    0.00   0.0%   0.0%   0.7%  99.3%   0.0%   0.0%   0.0%   0.0%
 2    0.01   0.0%   0.0%   0.2%  99.8%   0.0%   0.0%   0.0%   0.0%
 3    0.02   0.4%   0.0%   7.9%  91.8%   0.0%   0.0%   0.0%   0.0%
---   ----  -----  -----  -----  -----  -----  -----  -----  -----
avg   0.08   0.0%   0.0%   2.6%  97.4%   0.0%   0.0%   0.0%   0.0%

Memory: 25754K (2356K) real, 27864K (6144K) virtual, 27838K free  Page# 1/42

CPU TTY     PID USERNAME PRI NI   SIZE    RES STATE    TIME %WCPU  %CPU COMMAND
 3  pts/4 12555 jsymons  187 20 25992K   568K run      0:02  7.84  5.48 top
 0  rroot    19 root     100 20     0K     0K sleep 1449:04  1.05  1.05 netisr
 1  rroot   494 root     154 20   216K   284K sleep 1479:50  1.03  1.02 syncer
 0  rroot     3 root     128 20     0K     0K sleep  960:56  1.00  0.99 statdaemo
 0  rroot  1432 root      20 20  8120K  6956K sleep  842:38  0.61  0.61 cmcld
 3  rroot    38 root     138 20     0K     0K sleep  336:22  0.32  0.31 vx_iflush
 1  rroot     7 root     -32 20     0K     0K sleep  321:07  0.25  0.25 ttisr
 3  rroot   934 root     154 20  6100K  1436K sleep  297:15  0.22  0.22 rpcd
 1  rroot    40 root     138 20     0K     0K sleep  193:38  0.16  0.16 vx_inacti
 0  rroot 26626 root     154 20   868K   880K sleep  245:15  0.15  0.15 opcle
 2  rroot 26587 root     154 20  2580K  1348K sleep  125:22  0.07  0.07 opcmsga
 1  rroot    39 root     138 20     0K     0K sleep   88:15  0.07  0.07 vx_ifree_
 3  rroot    22 root     100 20     0K     0K sleep  159:58  0.06  0.06 netisr
 1  rroot 26586 root     154 20  8468K  1752K sleep   53:47  0.06  0.06 opcctla

uname
The uname command can be used to display configuration information about your system. This information includes the operating system name, machine model, and operating system version. You may want to gather this information and store it for later use, for example, if you are trying to keep all of your systems on the same release of the operating system.

uptime
The uptime command is probably the most commonly used command for checking system resources. It shows the current time, the length of time the system has been up, the number of users logged on, and the average number of jobs in the run queue over the last 1, 5, and 15 minutes. Using uptime with the -w option shows a summary of the current activity on the system for each user. As shown in Listing 4-11, you can see the login time, CPU usage, and command activity for each user.

vmstat
The vmstat command provides good information about system resources, including virtual memory and CPU usage, and is useful for detecting whether you are low on memory or swap space.

Listing 4-11 Output from the uptime command showing user activity.
uptime -w
12:49pm  up 3 days,  2:19,  5 users,  load average: 0.49, 0.56, 0.56
User     tty        login@   idle   JCPU   PCPU  what
jsymons  console   12:32pm  74:17                /usr/sbin
jsymons  ttyp7     12:18pm                       uptime -w

For monitoring real and virtual memory, vmstat shows page faults and paging activity, including reclaimed pages and swapping rates. For the CPU, vmstat gives more detailed information than iostat. It shows faults, including device interrupts, system calls, and context switches, and it breaks CPU utilization down into user, system, and idle time. For processes, vmstat shows the number of processes in various states, including those currently in the run queue, blocked on an I/O operation, and swapped out to disk.

The statistics that you see vary depending on the command option that you specify. By specifying a time interval, you can have vmstat run continuously, so that you can see how the values vary over time. As shown in Listing 4-12, using the -s option prints paging-related activity.

who
The who command tells you who is logged in to the system and when each user logged in, from which you can tell how long each user has been connected. This command can be useful if a performance problem arises, because you can quickly determine whether the number of concurrent users has increased. It can also be useful in checking for security intrusions, because you may notice an unexpected user.

Listing 4-12 Output from the vmstat command showing paging activity.
$ vmstat -s
     5431 swap ins
     5431 swap outs
     1376 pages swapped in
      426 pages swapped out
  9704169 total address trans. faults taken
  2159795 page ins
     9236 page outs
   136606 pages paged in
    21451 pages paged out
  2064504 reclaims from free list
  2097094 total page reclaims
      773 intransit blocking page faults
  6040874 zero fill pages created
  3925703 zero fill page faults
  1457303 executable fill pages created
    76804 executable fill page faults
        0 swap text pages found in free list
   735656 inode text pages found in free list
      185 revolutions of the clock hand
   105428 pages scanned for page out
    16850 pages freed by the clock daemon
 50286274 cpu context switches
 90662460 device interrupts
  2732863 traps
229976779 system calls
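
The counters printed by vmstat -s are running totals, so a single listing such as Listing 4-12 says little about current activity. One way to apply the earlier advice about running commands periodically is to save two snapshots and subtract them. The following is a minimal sketch under stated assumptions, not a definitive tool: the function name vmstat_delta and the snapshot file names are hypothetical, and it assumes that every line consists of a number followed by a description, as in Listing 4-12, and that the system was not rebooted between the two snapshots.

```shell
#!/bin/sh
# Sketch: print how much each `vmstat -s` counter changed between two
# saved snapshots. The name vmstat_delta is hypothetical (not a standard
# command). Assumes lines of the form "<number> <description>".

# Usage: vmstat_delta OLD_SNAPSHOT NEW_SNAPSHOT
vmstat_delta() {
    awk '
        {
            value = $1 + 0                  # force a numeric value
            name = $2                       # rebuild the description
            for (i = 3; i <= NF; i++) name = name " " $i
        }
        # First file: remember each counter by its description.
        NR == FNR { old[name] = value; next }
        # Second file: report only counters that changed.
        (name in old) && value != old[name] {
            printf "%.0f %s\n", value - old[name], name
        }
    ' "$1" "$2"
}
```

To use it, capture a snapshot with vmstat -s > before.txt, wait for the interval of interest, capture after.txt the same way, and then run vmstat_delta before.txt after.txt. Only the counters that changed are printed, which makes a burst of page outs or context switches easier to spot than in the raw totals.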