Mac OS X Internals: A Systems Approach
3.3. The PowerPC 970FX
3.3.1. At a Glance
In this section, we will look at details of the PowerPC 970FX. Although several parts of the discussion could apply to other PowerPC processors, we will not attempt to identify such cases. Table 3-4 lists the important technical specifications of the 970FX.
[a] AS stands for Advanced Series.

[b] VMX is interchangeable with AltiVec. Apple markets the PowerPC's vector functionality as Velocity Engine.

[c] As of 2005.

[d] The two fixed-point (integer) units of the 970FX are not symmetrical. Only one of them can perform division, and only one can be used for special-purpose register (SPR) operations.

3.3.2. Caches
A multilevel cache hierarchy is a common aspect of modern processors. A cache can be defined as a small chunk of very fast memory that stores recently used data, instructions, or both. Information is typically added to and removed from a cache in aligned quanta called cache lines. The 970FX contains several caches and other special-purpose buffers to improve memory performance. Figure 3-4 shows a conceptual diagram of these caches and buffers.

Figure 3-4. Caches and buffers in the 970FX
3.3.2.1. L1 and L2 Caches
The level 1 (L1) cache is closest to the processor. Memory-resident information must be loaded into this cache before the processor can use it, unless that portion of memory is marked noncacheable. For example, when a load instruction is being executed, the processor refers to the L1 cache to see if the data in question is already held by a currently resident cache line. If so, the data is simply loaded from the L1 cache: an L1 cache hit. This operation takes only a few processor cycles, as compared to a few hundred cycles for accessing main memory.[23] If there is an L1 miss, the processor checks the next level in the cache hierarchy: the level 2 (L2) cache. An L2 hit causes the cache line containing the data to be loaded into the L1 cache and then into the appropriate register. The 970FX does not have a level 3 (L3) cache, but if it did, similar steps would be repeated for the L3 cache. If none of the caches contains the requested data, the processor must access main memory.

[23] Main memory refers to the system's installed and available dynamic memory (DRAM).

As a cache line's worth of data is loaded into L1, a resident cache line must be flushed to make room for the new cache line. The 970FX uses a pseudo-least-recently-used (LRU) algorithm[24] to determine which cache line to evict. Unless instructed otherwise, the evicted cache line is sent to the L2 cache, which makes L2 a victim cache. Table 3-5 shows the important properties of the 970FX's caches.

[24] The 970FX allows the data-cache replacement algorithm to be changed from LRU to FIFO through a bit in a hardware-dependent register.
[a] Section 3.3.6.1 discusses GPRs and FPRs. Section 3.3.10.2 discusses VPERM and VALU.
You can retrieve processor cache information using the sysctl command on Mac OS X, as shown in Figure 3-5. Note that the hwprefs command is part of Apple's CHUD Tools package.

Figure 3-5. Retrieving processor cache information using the sysctl command
3.3.2.2. Cache Properties
Let us look more closely at some of the cache-related terminology used in Table 3-5.

Associativity
As we saw earlier, the granularity of operation for a cache, that is, the unit of memory transfer in and out of a cache, is a cache line (also called a block). The cache line size on the 970FX is 128 bytes for both the L1 and L2 caches. The associativity of a cache determines where a cache line's worth of memory may be placed in the cache. If a cache is m-way set-associative, then the total space in the cache is conceptually divided into sets, with each set containing m cache lines. In a set-associative cache, a block of memory can be placed only in certain locations in the cache: It is first mapped to a set in the cache, after which it can be stored in any of the cache lines within that set. Typically, given a memory block with address B, the target set is calculated using the following modulo operation:

target set = B MOD {number of sets in cache}
A direct-mapped cache is equivalent to a one-way set-associative cache. It has the same number of sets as cache lines. This means a memory block with address B can exist only in one cache line, which is calculated as the following: target cache line = B MOD {number of cache lines in cache}
Store Policy
A cache's store policy defines what happens when an instruction writes to memory. In a write-through design, such as the 970FX L1 D-cache, information is written both to the cache line and to the corresponding block in memory. There is no L1 D-cache allocation on write misses: the affected block is modified only in the lower level of the cache hierarchy and is not loaded into L1. In a write-back design, such as the 970FX L2 cache, information is written only to the cache line; the affected block is written to memory only when the cache line is replaced.
Memory pages that are contiguous in virtual memory will normally not be contiguous in physical memory. Similarly, given a set of virtual addresses, it is not possible to predict how they will fit in the cache. A related point is that if you take a block of contiguous virtual memory the same size as a cache, say, a 512KB block (the size of the entire L2 cache), there is little chance that it will fit in the L2 cache.
MERSI
Only the L2 cache is physically mapped, although all caches use physical address tags. Stores are always sent to the L2 cache in addition to the L1 cache, as the L2 cache is the data coherency point. Coherent memory systems aim to provide the same view of memory to all devices accessing it. For example, it must be ensured that processors in a multiprocessor system access the correct data, whether the most up-to-date data resides in main memory or in another processor's cache. Maintaining such coherency in hardware requires a protocol by which the processor "remembers" the state of the sharing of cache lines.[25] The L2 cache implements the MERSI cache-coherency protocol, which has the following five states: Modified, Exclusive, Recent, Shared, and Invalid.

[25] Cache-coherency protocols are primarily either directory-based or snooping-based.
RAS
The caches incorporate parity-based error detection and correction mechanisms. Parity bits are additional bits used along with normal information to detect and correct errors in the transmission of that information. In the simplest case, a single parity bit is used to detect an error. The basic idea in such parity checking is to add an extra bit to each unit of information, say, to make the number of 1s in each unit either odd or even. Now, if a single error (actually, an odd number of errors) occurs during information transfer, the parity-protected information unit would be invalid. In the 970FX's L1 cache, parity errors are reported as cache misses and therefore are implicitly handled by refetching the cache line from the L2 cache. Besides parity, the L2 cache implements an error detection and correction scheme that can detect double errors and correct single errors by using a Hamming code.[26] When a single error is detected during an L2 fetch request, the bad data is corrected and written back to the L2 cache. Thereafter, the good data is refetched from the L2 cache.

[26] A Hamming code is an error-correcting code: an algorithm by which a sequence of numbers can be expressed such that any errors appearing in certain numbers (say, on the receiving side after the sequence was transmitted from one party to another) can be detected, and corrected, subject to certain limits, based on the remaining numbers.

3.3.3. Memory Management Unit (MMU)
During virtual memory operation, software-visible memory addresses must be translated to real (or physical) addresses, both for instruction accesses and for data accesses generated by load/store instructions. The 970FX uses a two-step address translation mechanism[27] based on segments and pages. In the first step, a software-generated 64-bit effective address (EA) is translated to a 65-bit virtual address (VA) using the segment table, which lives in memory. Segment table entries (STEs) contain segment descriptors that define virtual addresses of segments. In the second step, the virtual address is translated to a 42-bit real address (RA) using the hashed page table, which also lives in memory.

[27] The 970FX also supports a real addressing mode, in which physical translation can be effectively disabled.

The 32-bit PowerPC architecture provides 16 segment registers through which the 4GB virtual address space can be divided into 16 segments of 256MB each. The 32-bit PowerPC implementations use these segment registers to generate VAs from EAs. The 970FX includes a transitional bridge facility that allows a 32-bit operating system to continue using the 32-bit PowerPC implementation's segment register manipulation instructions. Specifically, the 970FX allows software to associate segments 0 through 15 with any of the 2^37 available virtual segments. In this case, the first 16 entries of the segment lookaside buffer (SLB), which is discussed next, act as the 16 segment registers.

3.3.3.1. SLB and TLB
We saw that the segment table and the page table are memory-resident. It would be prohibitively expensive if the processor had to go to main memory not only for data fetching but also for address translation. Caching exploits the principle of locality of reference; if caching is effective, then address translations will exhibit the same locality as the memory accesses they describe. The 970FX includes two on-chip buffers for caching recently used segment table entries and page address translations: the segment lookaside buffer (SLB) and the translation lookaside buffer (TLB), respectively. The SLB is a 64-entry, fully associative cache. The TLB is a 1024-entry, four-way set-associative cache with parity protection. It also supports large pages (see Section 3.3.3.4).

3.3.3.2. Address Translation
Figure 3-6 depicts address translation in the 970FX MMU, including the roles of the SLB and the TLB. The 970FX MMU uses 64-bit or 32-bit effective addresses, 65-bit virtual addresses, and 42-bit physical addresses. The presence of the DART introduces another address flavor, the I/O address, which is an address in a 32-bit address space that maps to a larger physical address space.

Figure 3-6. Address translation in the 970FX MMU
Technically, a computer architecture has three (and perhaps more) types of memory addresses: the processor-visible physical address, the software-visible virtual address, and the bus address, which is visible to an I/O device. In most cases (especially on 32-bit hardware), the physical and bus addresses are identical and therefore not differentiated.
The 65-bit extended address space is divided into pages. Each page is mapped to a physical page. A 970FX page table can be as large as 2^31 bytes (2GB), containing up to 2^24 (16 million) page table entry groups (PTEGs), where each PTEG is 128 bytes. As Figure 3-6 shows, during address translation the MMU converts program-visible effective addresses to real addresses in physical memory. It uses a part of the effective address (the effective segment ID) to locate an entry in the segment table. It first checks the SLB to see if it contains the desired STE. If there is an SLB miss, the MMU searches for the STE in the memory-resident segment table. If the STE is still not found, a memory access fault occurs. If the STE is found, a new SLB entry is allocated for it. The STE represents a segment descriptor, which is used to generate the 65-bit virtual address. The virtual address contains a 37-bit virtual segment ID (VSID). Note that the page index and the byte offset in the virtual address are the same as in the effective address. The concatenation of the VSID and the page index forms the virtual page number (VPN), which is used for looking up in the TLB. If there is a TLB miss, the memory-resident page table is searched to retrieve a page table entry (PTE), which contains a real page number (RPN). The RPN, along with the byte offset carried over from the effective address, forms the physical address.
The 970FX allows setting up the TLB to be direct-mapped by setting a particular bit of a hardware-implementation-dependent register.
3.3.3.3. Caching the Caches: ERATs
Information from the SLB and the TLB may be cached in two effective-to-real address translation caches (ERATs): one for instructions (the I-ERAT) and another for data (the D-ERAT). Both ERATs are 128-entry, two-way set-associative caches. Each ERAT entry contains effective-to-real address translation information for a 4KB block of storage. Both ERATs contain invalid information upon power-on. As shown in Figure 3-6, the ERATs represent a shortcut path to the physical address when there is a match for the effective address in the ERATs.

3.3.3.4. Large Pages
Large pages are meant for use by high-performance computing (HPC) applications. The typical page size of 4KB can be detrimental to memory performance in certain circumstances. If an application's locality of reference is too wide, 4KB pages may not capture the locality effectively enough. If too many TLB misses occur, the consequent TLB entry allocations and the associated delays would be undesirable. Since a large page represents a much larger memory range, the number of TLB hits should increase, as the TLB would now cache translations for larger virtual memory ranges.

It is an interesting problem for the operating system to make large pages available to applications. Linux provides large-page support through a pseudo file system (hugetlbfs) that is backed by large pages. The superuser must explicitly configure some number of large pages in the system by preallocating physically contiguous memory. Thereafter, a hugetlbfs instance can be mounted on a directory, which is required if applications intend to use the mmap() system call to access large pages. An alternative is to use the shared memory calls shmget() and shmat(). Files may be created, deleted, mmap()'ed, and munmap()'ed on hugetlbfs. It does not support reads or writes, however. AIX also requires separate, dedicated physical memory for large-page use. An AIX application can use large pages either via shared memory, as on Linux, or by requesting that the application's data and heap segments be backed by large pages.

Note that whereas the 970FX TLB supports large pages, the ERATs do not; a large page requires multiple ERAT entries, one for each referenced 4KB block of the page. Cache-inhibited accesses to addresses in large pages are not permitted.

3.3.3.5. No Support for Block Address Translation Mechanism
The 970FX does not support the Block Address Translation (BAT) mechanism found in earlier PowerPC processors such as the G4. BAT is a software-controlled array used for mapping large (often much larger than a page) virtual address ranges into contiguous areas of physical memory. The entire mapping has the same attributes, including access protection. Thus, the BAT mechanism is meant to reduce address translation overhead for large, contiguous regions of special-purpose virtual address spaces. Since BAT does not use pages, such memory cannot be paged normally. A good example of a scenario where BAT is useful is a region of framebuffer memory, which could be memory-mapped effectively via BAT. Software can select block sizes ranging from 128KB to 256MB.

On PowerPC processors that implement BAT, there are four BAT registers each for data (DBATs) and instructions (IBATs). A BAT register is actually a pair of upper and lower registers, which are accessible from supervisor mode. The eight pairs are named DBAT0U-DBAT3U, DBAT0L-DBAT3L, IBAT0U-IBAT3U, and IBAT0L-IBAT3L. The contents of a BAT register include a block effective page index (BEPI), a block length (BL), and a block real page number (BRPN). During BAT translation, a certain number of high-order bits of the EA, as specified by BL, are matched against each BAT register. If there is a match, the BRPN value is used to yield the RA from the EA. Note that BAT translation takes precedence over page table translation for storage locations that have mappings in both a BAT register and the page table.

3.3.4. Miscellaneous Internal Buffers and Queues
The 970FX contains several miscellaneous buffers and queues internal to the processor, most of which are not visible to software. Examples include the following:
3.3.5. Prefetching
Cache miss rates can be reduced through a technique called prefetching, that is, fetching information before the processor requests it. The 970FX prefetches instructions and data to hide memory latency. It also supports software-initiated prefetching of up to eight data streams called hardware streams, four of which can optionally be vector streams. A stream is defined as a sequence of loads that reference more than one contiguous cache line.

The prefetch engine is a functionality of the Load/Store Unit. It can detect sequential access patterns in ascending or descending order by monitoring loads and recording cache line addresses when there are cache misses. The 970FX does not prefetch store misses.

Let us look at an example of the prefetch engine's operation. Assuming no prefetch streams are active, the prefetch engine will act when there is an L1 D-cache miss. Suppose the miss was for a cache line with address A; then the engine will create an entry in the Prefetch Filter Queue (PFQ)[29] with the address of either the next or the previous cache line, that is, either A + 1 or A - 1. It guesses the direction (up or down) based on whether the memory access was located in the top 25% of the cache line (guesses down) or the bottom 75% of the cache line (guesses up). If there is another L1 D-cache miss, the engine will compare the line address with the entries in the PFQ. If the access is indeed sequential, the line address now being compared must be either A + 1 or A - 1. Alternatively, the engine could have incorrectly guessed the direction, in which case it would create another filter entry for the opposite direction. If the guessed direction was correct (say, up), the engine deems it a sequential access and allocates a stream entry in the Prefetch Request Queue (PRQ)[30] using the next available stream identifier. Moreover, the engine will initiate prefetching for cache line A + 2 to L1 and cache line A + 3 to L2.
If A + 2 is read, the engine will cause A + 3 to be fetched to L1 from L2, and A + 4, A + 5, and A + 6 to be fetched to L2. If further sequential demand-reads occur (for A + 3 next), this pattern will continue until all streams are assigned. The PFQ is updated using an LRU algorithm.

[29] The PFQ is a 12-entry queue for detecting data streams for prefetching.

[30] The PRQ is a queue of eight streams that will be prefetched.

The 970FX allows software to manipulate the prefetch mechanism. This is useful if the programmer knows data access patterns ahead of time. A version of the data-cache-block-touch (dcbt) instruction, which is one of the storage control instructions, can be used by a program to provide hints that it intends to read from a specified address or data stream in the near future. Consequently, the processor would initiate a data stream prefetch from a particular address. Note that if you attempt to access unmapped or protected memory via software-initiated prefetching, no page faults will occur. Moreover, these instructions are not guaranteed to succeed and can fail silently for a variety of reasons. In the case of success, no result is returned in any register; only the cache block is fetched. In the case of failure, no cache block is fetched, and again, no result is returned in any register. In particular, failure does not affect program correctness; it simply means that the program will not benefit from prefetching.

Prefetching continues until a page boundary is reached, at which point the stream must be reinitialized. This is so because the prefetch engine does not know about the effective-to-real address mapping and can prefetch only within a real page. This is an example of a situation in which large pages, with page boundaries 16MB apart, will fare better than 4KB pages.
On a Mac OS X system with AltiVec hardware, you can use the vec_dst() AltiVec function to initiate data read of a line into cache, as shown in the pseudocode in Figure 3-7.

Figure 3-7. Data prefetching in AltiVec
The address argument to vec_dst() is a pointer to a byte that lies within the first cache line to be fetched; the control argument is a word whose bits specify the block size, the block count, and the distance between the blocks; and the stream_id specifies the stream to use.

3.3.6. Registers
The 970FX has two privilege modes of operation: a user mode (problem state) and a supervisor mode (privileged state). The former is used by user-space applications, whereas the latter is used by the Mac OS X kernel. When the processor is first initialized, it comes up in supervisor mode, after which it can be switched to user mode via the Machine State Register (MSR). The set of architected registers can be divided into three levels (or models) in the PowerPC architecture:
The UISA and VEA registers can be accessed by software through either user-level or supervisor-level privileges, although there are VEA registers that cannot be written to by user-level instructions. OEA registers can be accessed only by supervisor-level instructions. 3.3.6.1. UISA and VEA Registers
Figure 3-8 shows the UISA and VEA registers of the 970FX. Their purpose is summarized in Table 3-6. Note that whereas the general-purpose registers are all 64 bits wide, the set of supervisor-level registers contains both 32-bit and 64-bit registers.

Figure 3-8. PowerPC UISA and VEA registers
Processor registers are used with all normal instructions that access memory. In fact, there are no computational instructions in the PowerPC architecture that modify storage. For a computational instruction to use a storage operand, it must first load the operand into a register. Similarly, if a computational instruction writes a value to a storage operand, the value must go to the target location via a register. The PowerPC architecture supports the following addressing modes for such instructions.
rA and rB represent register contents. The notation (rA | 0) means the contents of register rA unless rA is GPR0, in which case (rA | 0) is taken to be the value 0. The UISA-level performance-monitoring registers provide user-level read access to the 970FX's performance-monitoring facility. They can be written only by a supervisor-level program such as the kernel or a kernel extension.
Apple's Computer Hardware Understanding Development (CHUD) is a suite of programs (the "CHUD Tools") for measuring and optimizing performance on Mac OS X. The software in the CHUD Tools package makes use of the processor's performance-monitoring counters.
The Timebase Register
The Timebase (TB) provides a long-period counter driven by an implementation-dependent frequency. The TB is a 64-bit register containing an unsigned 64-bit integer that is incremented periodically. Each increment adds 1 to bit 63 (the lowest-order bit) of the TB. The maximum value that the TB can hold is 2^64 - 1, after which it rolls over to zero without generating any exception. The TB can either be incremented at a frequency that is a function of the processor clock frequency, or it can be driven by the rising edge of the signal on the TB enable (TBEN) input pin.[31] In the former case, the 970FX increments the TB once every eight full-frequency processor clocks. It is the operating system's responsibility to initialize the TB. The TB can be read, but not written to, from user space. The program shown in Figure 3-9 retrieves and prints the TB.

[31] In this case, the TB frequency may change at any time.

Figure 3-9. Retrieving and displaying the Timebase Register
Note in Figure 3-9 that we use inline assembly rather than create a separate assembly source file. The GNU assembler inline syntax is based on the template shown in Figure 3-10.

Figure 3-10. Code template for inline assembly in the GNU assembler
We will come across other examples of inline assembly in this book.
3.3.6.2. OEA Registers
The OEA registers are shown in Figure 3-11. Examples of their use include the following.

Figure 3-11. PowerPC OEA registers
3.3.7. Rename Registers
The 970FX implements a substantial number of rename registers, which are used to handle register-name dependencies. Instructions can depend on one another from the point of view of control, data, or name. Consider two instructions, say, I1 and I2, in a program, where I2 comes after I1:

I1
...
Ix
...
I2
In a data dependency, I2 either uses a result produced by I1, or I2 has a data dependency on an instruction Ix, which in turn has a data dependency on I1. In both cases, a value is effectively transmitted from I1 to I2.

In a name dependency, I1 and I2 use the same logical resource or name, such as a register or a memory location. In particular, if I2 writes to the same register that is either read from or written to by I1, then I2 would have to wait for I1 to execute before it can execute. These are known as write-after-read (WAR) and write-after-write (WAW) hazards.

I1 reads (or writes) <REGISTER X>
...
I2 writes <REGISTER X>

In this case, the dependency is not "real" in that I2 does not need I1's result. One solution for handling register-name dependencies is to rename the conflicting register used in the instructions so that they become independent. Such renaming could be done in software (statically, by the compiler) or in hardware (dynamically, by logic in the processor). The 970FX uses pools of physical rename registers that are assigned to instructions during the mapping stage in the processor pipeline and released when they are no longer needed. In other words, the processor internally renames architected registers used by instructions to physical registers. This makes sense only when the number of physical registers is (substantially) larger than the number of architected registers. For example, the PowerPC architecture has 32 GPRs, but the 970FX implementation has a pool of 80 physical GPRs, from which the 32 architected GPRs are assigned.

Let us consider a specific example, say, of a WAW hazard, where renaming is helpful. In the following sketch, both instructions write r20, so the second would normally have to wait for the first; once each write of r20 is mapped to a different physical register (the physical register numbers shown are illustrative), the two can execute independently.

; before renaming: a WAW hazard on r20
r20 <- r21 + r22
...
r20 <- r23 + r24

; after renaming: the hazard is gone
p35 <- r21 + r22
...
p67 <- r23 + r24

Table 3-7 lists the available rename registers in the 970FX. The table also mentions emulation registers, which are available to cracked and microcoded instructions. Cracking and microcoding, as we will see in Section 3.3.9.1, are processes by which complex instructions are broken down into simpler instructions.
3.3.8. Instruction Set
All PowerPC instructions are 32 bits wide regardless of whether the processor is in 32-bit or 64-bit computation mode. All instructions are word aligned, which means that the two lowest-order bits of an instruction address are irrelevant from the processor's standpoint. There are several instruction formats, but bits 0 through 5 of an instruction word always specify the major opcode. PowerPC instructions typically have three operands: two source operands and one result. One of the source operands may be a constant or a register, but the other operands are usually registers.

We can broadly divide the instruction set implemented by the 970FX into the following instruction categories: fixed-point, floating-point, vector, control flow, and everything else.

3.3.8.1. Fixed-Point Instructions
Operands of fixed-point instructions can be bytes (8-bit), half words (16-bit), words (32-bit), or double words (64-bit). This category includes the following instruction types:
Most load/store instructions can optionally update the base register with the effective address of the data operated on by the instruction.

3.3.8.2. Floating-Point Instructions
Floating-point operands can be single-precision (32-bit) or double-precision (64-bit) floating-point quantities. However, floating-point data is always stored in the FPRs in double-precision format. Loading a single-precision value from storage converts it to double precision, and storing a single-precision value to storage actually rounds the FPR-resident double-precision value to single precision. The 970FX complies with the IEEE 754 standard[32] for floating-point arithmetic. This instruction category includes the following types: [32] The IEEE 754 standard governs binary floating-point arithmetic. The standard's primary architect was William Velvel Kahan, who received the Turing Award in 1989 for his fundamental contributions to numerical analysis.
The precision of the floating-point-estimate instructions (fres and frsqrte) is lower on the 970FX than on the G4. Although the 970FX is at least as accurate as the IEEE 754 standard requires, the G4 is more accurate than required. Figure 3-12 shows a program that can be executed on a G4 and a G5 to illustrate this difference.

Figure 3-12. Precision of the floating-point-estimate instruction on the G4 and the G5
3.3.8.3. Vector Instructions
Vector instructions execute in the 128-bit VMX execution unit. We will look at some of the VMX details in Section 3.3.10. The 970FX VMX implementation contains 162 vector instructions in various categories. 3.3.8.4. Control-Flow Instructions
A program's control flow is sequential (that is, its instructions logically execute in the order they appear) until a control-flow change occurs, either explicitly (because of an instruction that modifies the control flow of the program) or as a side effect of another event. The following are examples of control-flow changes:
Each of these events may have a handler: a piece of code that is executed when the event occurs. For example, a trap handler may be executed when the conditions specified in the trap instruction are satisfied. When a user-space program executes an sc instruction with a valid system call identifier, a function in the operating system kernel is invoked to provide the service corresponding to that system call. Similarly, control flow also changes when the program returns from such handlers. For example, after a system call finishes in the kernel, execution continues in user space, in a different piece of code.

The 970FX supports absolute and relative branching. A branch may be conditional or unconditional. A conditional branch can be based on any of the bits in the CR being 1 or 0. We earlier came across the special-purpose registers LR and CTR. LR can hold the return address on a procedure call. A leaf procedure (one that does not call another procedure) does not need to save LR and therefore can return faster. CTR is used for loops with a fixed iteration limit. It can be used to branch based on its contents (the loop counter) being zero or nonzero, while decrementing the counter automatically. LR and CTR are also used to hold the target addresses of conditional branches for use with the bclr and bcctr instructions, respectively.
Besides performing aggressive dynamic branch prediction, the 970FX allows hints to be provided along with many types of branch instructions to improve branch prediction accuracy. 3.3.8.5. Miscellaneous Instructions
The 970FX includes various other types of instructions, many of which are used by the operating system for low-level manipulation of the processor. Examples include the following types:
3.3.9. The 970FX Core
The 970FX core is depicted in Figure 3-13. We have come across several of the core's major components earlier in this chapter, such as the L1 caches, the ERATs, the TLB, the SLB, the register files, and the register-renaming resources.

Figure 3-13. The core of the 970FX
The 970FX core is designed to achieve a high degree of instruction parallelism. Some of its noteworthy features include the following.
The processor uses a large number of resources, such as reorder queues, rename register pools, and other logic, to track in-flight instructions and their dependencies.

3.3.9.1. Instruction Pipeline
In this section, we will discuss how the 970FX processes instructions. The overall instruction pipeline is shown in Figure 3-14. Let us look at the important stages of this pipeline.
Figure 3-14. The 970FX instruction pipeline
IFAR, ICA[37]
[37] Instruction Cache Access.
Based on the address in the Instruction Fetch Address Register (IFAR), the instruction-fetch logic fetches eight instructions every cycle from the L1 I-cache into a 32-entry instruction buffer. The eight-instruction block, so fetched, is 32-byte aligned. Besides performing IFAR-based demand fetching, the 970FX prefetches cache lines into a 4x128-byte Instruction Prefetch Queue. If a demand fetch results in an I-cache miss, the 970FX checks whether the instructions are in the prefetch queue. If the instructions are found, they are inserted into the pipeline as if no I-cache miss had occurred. The cache line's critical sector (eight words) is written into the I-cache.
D0
There is logic to partially decode (predecode) instructions after they leave the L2 cache and before they enter the I-cache or the prefetch queue. This process adds five extra bits to each instruction to yield a 37-bit instruction. An instruction's predecode bits mark it as illegal, microcoded, conditional or unconditional branch, and so on. In particular, the bits also specify how the instruction is to be grouped for dispatching.
D1, D2, D3
The 970FX splits complex instructions into two or more internal operations, or iops. The iops are more RISC-like than the instructions from which they are derived. Instructions that are broken into exactly two iops are called cracked instructions, whereas those that are broken into three or more iops are called microcoded instructions because the processor emulates them using microcode.
An instruction may not be atomic because the atomicity of cracked or microcoded instructions is at the iop level. Moreover, it is the iops, and not programmer-visible instructions, that are executed out-of-order. This approach allows the processor more flexibility in parallelizing execution. Note that AltiVec instructions are neither cracked nor microcoded.
Fetched instructions go to a 32-instruction fetch buffer. Every cycle, up to five instructions are taken from this buffer and sent through a decode pipeline that is either inline (consisting of three stages, namely, D1, D2, and D3), or template-based if the instruction needs to be microcoded. The template-based decode pipeline generates up to four iops per cycle that emulate the original instruction. In any case, the decode pipeline leads to the formation of an instruction dispatch group. Given the out-of-order execution of instructions, the processor needs to keep track of the program order of all instructions in various stages of execution. Rather than tracking individual instructions, the 970FX tracks instructions in dispatch groups. The 970FX forms such groups containing one to five iops, each occupying an instruction slot (0 through 4) in the group. Dispatch group formation[38] is subject to a long list of rules and conditions such as the following. [38] The instruction grouping performed by the 970FX has similarities to a VLIW processor.
XFER
The iops wait for resources to become free in the XFER stage.
GD, DSP, WRT, GCT, MAP
After group formation, the execution pipeline divides into multiple pipelines for the various execution units. Every cycle, one group of instructions can be sent (or dispatched) to the issue queues. Note that instructions in a group remain together from dispatch to completion. As a group is dispatched, several operations occur before the instructions actually execute. Internal group instruction dependencies are determined (GD). Various internal resources are assigned, such as issue queue slots, rename registers and mappers, and entries in the load/store reorder queues. In particular, each iop in the group that returns a result must be assigned a register to hold the result. Rename registers are allocated in the dispatch phase before the instructions enter the issue queues (DSP, MAP). To track the groups themselves, the 970FX uses a global completion table (GCT) that stores up to 20 entries in program order; that is, up to 20 dispatch groups can be in flight concurrently. Since each group can have up to 5 iops, as many as 100 iops can be tracked in this manner. The WRT stage represents the writes to the GCT.
ISS, RF
After all the resources that are required to execute the instructions are available, the instructions are sent (ISS) to appropriate issue queues. Once their operands become available, the instructions start to execute. Each slot in a group feeds separate issue queues for various execution units. For example, the FXU/LSU and the FPU draw their instructions from slots { 0, 3 } and { 1, 2 }, respectively, of an instruction group. If one pair goes to the FXU/LSU, the other pair goes to the FPU. The CRU draws its instructions from the CR logical issue queue that is fed from instruction slots 0 and 1. As we saw earlier, slot 4 of an instruction group is dedicated to branch instructions. AltiVec instructions can be issued to the VALU and the VPERM issue queues from any slot except slot 4. Table 3-8 shows the 970FX issue queue sizes; each execution unit listed has one issue queue.
[a] LSU0 and FXU0 share an 18-entry issue queue. [b] LSU1 and FXU1 share an 18-entry issue queue. The FXU/LSU and FPU issue queues have odd and even halves that are hardwired to receive instructions only from certain slots of a dispatch group, as shown in Figure 3-15.
Figure 3-15. The FPU and FXU/LSU issue queues in the 970FX
As long as an issue queue contains instructions that have all their data dependencies resolved, an instruction moves every cycle from the queue into the appropriate execution unit. However, there are likely to be instructions whose operands are not ready; such instructions block in the queue. Although the 970FX will attempt to execute the oldest instruction first, it will reorder instructions within a queue's context to avoid stalling. Ready-to-execute instructions access their source operands by reading the corresponding register file (RF), after which they enter the execution unit pipelines. Up to ten operations can be issued in a cycle, one to each of the ten execution pipelines. Note that different execution units may have varying numbers of pipeline stages. We have seen that instructions both issue and execute out of order. However, if an instruction has finished execution, it does not mean that the program will "know" about it. After all, from the program's standpoint, instructions must execute in program order. The 970FX differentiates between an instruction finishing execution and an instruction completing. An instruction may finish execution (speculatively, say), but unless it completes, its effect is not visible to the program. All pipelines terminate in a common stage: the group completion stage (CP). When groups complete, many of their resources are released, such as load reorder queue entries, mappers, and global completion table entries. One dispatch group may be "retired" per cycle.
When a branch instruction completes, the resultant target address is compared with a predicted address. Depending on whether the prediction is correct or incorrect, either all instructions in the pipeline that were fetched after the branch in question are flushed, or the processor waits for all remaining instructions in the branch's group to complete.
Accounting for 215 In-Flight Instructions
We can account for the theoretical maximum of 215 in-flight instructions by looking at Figure 3-14, specifically the areas marked 1 through 6.
Thus, the theoretical maximum number of in-flight instructions can be calculated as the sum 16 + 32 + 15 + 20 + 100 + 32, which is 215.
3.3.9.2. Branch Prediction
Branch prediction is a mechanism wherein the processor attempts to keep the pipeline full, and therefore improve overall performance, by fetching instructions in the hope that they will be executed. In this context, a branch is a decision point for the processor: It must predict the outcome of the branch (whether it will be taken or not) and prefetch instructions accordingly. As shown in Figure 3-14, the 970FX scans fetched instructions for branches. It looks for up to two branches per cycle and uses multistrategy branch prediction logic to predict their target addresses, directions, or both. Consequently, up to 2 branches are predicted per cycle, and up to 16 predicted branches can be in flight. All conditional branches are predicted; whether the 970FX fetches instructions beyond a branch and speculatively executes them is based on the prediction. Once the branch instruction itself executes in the BRU, its actual outcome is compared with its predicted outcome. If the prediction was incorrect, there is a severe penalty: Any instructions that may have speculatively executed are discarded, and instructions in the correct control-flow path are fetched. The 970FX's dynamic branch prediction hardware includes three branch history tables (BHTs), a link stack, and a count cache. Each BHT has 16K 1-bit entries.
The 970FX's hardware branch prediction can be overridden by software.
The first BHT is the local predictor table. Its 16K entries are indexed by branch instruction addresses. Each 1-bit entry indicates whether the branch should be taken or not. This scheme is "local" because each branch is tracked in isolation. The second BHT is the global predictor table. It is used by a prediction scheme that takes into account the execution path taken to reach the branch. An 11-bit vector, the global history vector, represents the execution path. The bits of this vector represent the previous 11 instruction groups fetched. A particular bit is 1 if the next group was fetched sequentially and is 0 otherwise. A given branch's entry in the global predictor table is at a location calculated by performing an XOR operation between the global history vector and the branch instruction address. The third BHT is the selector table. It tracks which of the two prediction schemes is to be favored for a given branch. The BHTs are kept up to date with the actual outcomes of executed branch instructions. The link stack and the count cache are used by the 970FX to predict branch target addresses of branch-conditional-to-link-register (bclr, bclrl) and branch-conditional-to-count-register (bcctr, bcctrl) instructions, respectively. So far, we have looked at dynamic branch prediction. The 970FX also supports static prediction wherein the programmer can use certain bits in a conditional branch operand to statically override dynamic prediction. Specifically, two bits called the "a" and "t" bits are used to provide hints regarding the branch's direction, as shown in Table 3-9.
3.3.9.3. Summary
Let us summarize the instruction parallelism achieved by the 970FX. In every cycle of the 970FX, the following events occur.
3.3.10. AltiVec
The 970FX includes a dedicated vector-processing unit and implements the VMX instruction set, interchangeably known as AltiVec[39], as an extension to the PowerPC architecture. AltiVec provides a SIMD-style 128-bit[40] vector-processing unit.
[39] AltiVec was first introduced in Motorola's e600 PowerPC core (the G4).
[40] All AltiVec execution units and data paths are 128 bits wide.
3.3.10.1. Vector Computing
SIMD stands for single-instruction, multiple-data. It refers to a set of operations that can efficiently handle large quantities of data in parallel. SIMD operations do not necessarily require more or wider registers, although more is better. SIMD essentially makes better use of registers and data paths. For example, a non-SIMD computation would typically use a hardware register for each data element, even if the register could hold multiple such elements. In contrast, SIMD would use a register to hold multiple data elements (as many as would fit) and would perform the same operation on all elements through a single instruction. Thus, any operation that can be parallelized in this manner stands to benefit from SIMD. In AltiVec's case, a vector instruction can perform the same operation on all constituents of a vector. Note that AltiVec instructions work on fixed-length vectors.
SIMD-based optimization does not come for free. A problem must lend itself well to vectorization, and the programmer must usually perform extra work. Some compilers, such as IBM's XL suite of compilers and GCC 4.0 or above, also support auto-vectorization, an optimization that auto-generates vector instructions based on the compiler's analysis of the source code.[41] Auto-vectorization may or may not work well depending on the nature and structure of the code. [41] For example, the compiler may attempt to detect patterns of code that are known to be well suited for vectorization. Several processor architectures have similar extensions. Table 3-10 lists some well-known examples.
AltiVec can greatly improve the performance of data movement, benefiting applications that do processing of vectors, matrices, arrays, signals, and so on. As we saw in Chapter 2, Apple provides portable APIs, through the Accelerate framework (Accelerate.framework), for performing vector-optimized operations.[42] Accelerate is an umbrella framework that contains the vecLib and vImage[43] subframeworks. vecLib is targeted at numerical and scientific computing: it provides functionality such as BLAS, LAPACK, digital signal processing, dot products, linear algebra, and matrix operations. vImage provides vector-optimized APIs for working with image data. For example, it provides functions for alpha compositing, convolutions, format conversion, geometric transformations, histogram operations, and morphological operations. [42] The Accelerate framework automatically uses the best available code that it implements, depending on the hardware it is running on. For example, it will use vectorized code for AltiVec if AltiVec is available. On the x86 platform, it will use MMX, SSE, SSE2, and SSE3 if these features are available. [43] vImage is also available as a stand-alone framework.
Although a vector instruction performs work that would typically require many times more nonvector instructions, vector instructions are not simply instructions that deal with "many scalars" or "more memory" at a time. The fact that a vector's members are related is critical, and so is the fact that the same operation is performed on all members. Vector operations certainly play better with memory accesses: they lead to amortization. The semantic difference between performing a vector operation and a sequence of scalar operations on the same data set is that you are implicitly providing more information to the processor about your intentions. Vector operations, by their nature, alleviate both data and control hazards.
AltiVec has wide-ranging applications since areas such as high-fidelity audio, video, videoconferencing, graphics, medical imaging, handwriting analysis, data encryption, speech recognition, image processing, and communications all use algorithms that can benefit from vector processing. Figure 3-16 shows a trivial AltiVec C program.
Figure 3-16. A trivial AltiVec program
As also shown in Figure 3-16, the -faltivec option to GCC enables AltiVec language extensions.
3.3.10.2. The 970FX AltiVec Implementation
The 970FX AltiVec implementation consists of the following components:
The CR is also modified as a result of certain vector instructions.
The VALU and the VPERM are both dispatchable units that receive predecoded instructions via the issue queues.
The 32-bit VRSAVE serves a special purpose: Each of its bits indicates whether the corresponding vector register is in use or not. The processor maintains this register so that it does not have to save and restore every vector register every time there is an exception or a context switch. Frequently saving or restoring 32 128-bit registers, which together constitute 512 bytes, would be severely detrimental to cache performance, as other, perhaps more critical data would need to be evicted from the cache. Let us extend our example program from Figure 3-16 to examine the value in the VRSAVE. Figure 3-17 shows the extended program.
Figure 3-17. Displaying the contents of the VRSAVE
We see in Figure 3-17 that two high-order bits of the VRSAVE are set and the rest are cleared. This means the program uses two VRs: VR0 and VR1. You can verify this by looking at the assembly listing for the program. The VPERM execution unit can do merge, permute, and splat operations on vectors. Having a separate permute unit allows data-reorganization instructions to proceed in parallel with vector arithmetic and logical instructions. The VPERM and VALU both maintain their own copies of the VRF that are synchronized on the half cycle. Thus, each receives its operands from its own VRF. Note that vector loads, stores, and data stream instructions are handled in the usual LSU pipes. Although no AltiVec instructions are cracked or microcoded, vector store instructions logically break down into two components: a vector part and an LSU part. In the group formation stage, a vector store is a single entity occupying one slot. However, once the instruction is issued, it occupies two issue queue slots: one in the vector store unit and another in the LSU. Address generation takes place in the LSU. There is a slot for moving the data out of the VRF in the vector unit. This is not any different from scalar (integer and floating-point) stores, in whose case address generation still takes place in the LSU, and the respective execution unit (integer or floating-point) is used for accessing the GPR file (GPRF) or the FPR file (FPRF). AltiVec instructions were designed to be pipelined easily. The 970FX can dispatch up to four vector instructions every cycle, regardless of type, to the issue queues. Any vector instruction can be dispatched from any slot of the dispatch group except the dedicated branch slot 4.
It is usually very inefficient to pass data between the scalar units and the vector unit because data transfer between register files is not direct but goes through the caches.
3.3.10.3. AltiVec Instructions
AltiVec adds 162 vector instructions to the PowerPC architecture. Like all other PowerPC instructions, AltiVec instructions have 32-bit-wide encodings. To use AltiVec, no context switching is required. There is no special AltiVec operating modeAltiVec instructions can be used along with regular PowerPC instructions in a program. AltiVec also does not interfere with floating-point registers.
AltiVec instructions should be used at the UISA and VEA levels of the PowerPC architecture but not at the OEA level (the kernel). The same holds for floating-point arithmetic. Nevertheless, it is possible to use AltiVec and floating-point in the Mac OS X kernel beginning with a revision of Mac OS X 10.3. However, doing so comes at the cost of performance overhead in the kernel, since using AltiVec or floating-point will lead to a larger number of exceptions and register save/restore operations. Moreover, AltiVec data stream instructions cannot be used in the kernel. High-speed video scrolling on the system console is an example of the floating-point unit being used by the kernel: the scrolling routines use floating-point registers for fast copying. The audio subsystem also uses floating-point in the kernel. The following points are noteworthy regarding AltiVec vectors.
Instructions in the AltiVec instruction set can be broadly classified into the following categories:
Vector single-element loads are implemented as lvx, with undefined fields not zeroed explicitly. Care should be taken while dealing with such cases as this could lead to denormals[45] in floating-point calculations.
[45] Denormal numbers, also called subnormal numbers, are numbers that are so small they cannot be represented with full precision.
3.3.11. Power Management
The 970FX supports power management features such as the following.
3.3.11.1. PowerTune
PowerTune allows clock frequencies to be dynamically controlled and even synchronized across multiple processors. PowerTune frequency scaling occurs in the processor core, the busses, the bridge, and the memory controller. Allowed frequencies range from f (the full nominal frequency) to f/2, f/4, and f/64. The latter corresponds to the deep nap power-saving mode. If an application does not require the processor's maximum available performance, frequency and voltage can be changed system-wide, without stopping the core execution units and without disabling interrupts or bus snooping. All processor logic, except the bus clocks, remains active. Moreover, the frequency change is very rapid. Since power has a quadratic dependency on voltage, reducing voltage has a desirable effect on power dissipation. Consequently, the 970FX has much lower typical power consumption than the 970, which did not have PowerTune.
3.3.11.2. Power Mac G5 Thermal and Power Management
In the Power Mac G5, Apple combines the power management capabilities of the 970FX/970MP with a network of fans and sensors to contain heat generation, power consumption, and noise levels. Examples of hardware sensors include those for fan speed, temperature, current, and voltage. The system is divided into discrete cooling zones with independently controlled fans. Some Power Mac G5 models additionally contain a liquid cooling system that circulates a thermally conductive fluid to transfer heat away from the processors into a radiant grille. As air passes over the grille's cooling fins, the fluid's temperature decreases.[46]
[46] Similar to how an automobile radiator works.
Operating system support is required to make the Power Mac G5's thermal management work properly. Mac OS X regularly monitors various temperatures and power consumption. It also communicates with the fan control unit (FCU). If the FCU does not receive feedback from the operating system, it will spin the fans at maximum speed. A liquid-cooled dual-processor 2.5GHz Power Mac has the following fans:
Additionally, the Power Mac has sensors for current, voltage, and temperature, as listed in Table 3-11.
[a] The AD7417 is a type of analog-to-digital converter with an on-chip temperature sensor. We will see in Chapter 10 how to programmatically retrieve the values of various sensors from the kernel.
3.3.12. 64-bit Architecture
We saw earlier that the PowerPC architecture was designed with explicit support for 64- and 32-bit computing. PowerPC is, in fact, a 64-bit architecture with a 32-bit subset. A particular PowerPC implementation may choose to implement only the 32-bit subset, as is the case with the G3 and G4 processor families used by Apple. The 970FX implements both the 64-bit and 32-bit forms[47], or dynamic computation modes[48], of the PowerPC architecture. The modes are dynamic in that you can switch between the two dynamically by setting or clearing bit 0 of the MSR. [47] A 64-bit PowerPC implementation must implement the 32-bit subset. [48] The computation mode encompasses addressing mode.
3.3.12.1. 64-bit Features
The key aspects of the 970FX's 64-bit mode are as follows:
Although a Mac OS X process must be 64-bit itself to be able to directly access more than 4GB of virtual memory, having support in the processor for more than 4GB of physical memory benefits both 64-bit and 32-bit applications. After all, physical memory backs virtual memory. Recall that the 970FX can track a large amount of physical memory: 42 bits' worth, or 4TB. Therefore, as long as there is enough RAM, much greater amounts of it can be kept "alive" than is possible with only 32 bits of physical addressing. This is beneficial to 32-bit applications because the operating system can now keep more working sets in RAM, reducing the number of page-outs, even though a single 32-bit application will still "see" only a 4GB address space.
3.3.12.2. The 970FX as a 32-bit Processor
Just as the 64-bit PowerPC is not an extension of the 32-bit PowerPC, the latter is not a performance-limited version of the former: there is no penalty for executing in 32-bit-only mode on the 970FX. There are, however, some differences. Important aspects of running the 970FX in 32-bit mode include the following.
3.3.13. Softpatch Facility
The 970FX provides a facility called softpatch, which is a mechanism that allows software to work around bugs in the processor core and to otherwise debug the core. This is achieved either by replacing an instruction with a substitute microcoded instruction sequence or by making an instruction cause a trap to software through a softpatch exception. The 970FX's Instruction Fetch Unit contains a seven-entry array with content-addressable memory (CAM). This array is called the Instruction Match CAM (IMC). Additionally, the 970FX's instruction decode unit contains a microcode softpatch table. The IMC array has eight rows. The first six IMC entries occupy one row each, whereas the seventh entry occupies two rows. Of the seven entries, the first six are used to match partially (17 bits) over an instruction's major opcode (bits 0 through 5) and extended opcode (bits 21 through 31). The seventh entry matches in its entirety: a 32-bit full instruction match. As instructions are fetched from storage, they are matched against the IMC entries by the Instruction Fetch Unit's matching facility. If matched, the instruction's processing can be altered based on other information in the matched entry. For example, the instruction can be replaced with microcode from the instruction decode unit's softpatch table. The 970FX provides various other tracing and performance-monitoring facilities that are beyond the scope of this chapter.