Mac OS X Internals: A Systems Approach
3.3. The PowerPC 970FX
3.3.1. At a Glance
In this section, we will look at details of the PowerPC 970FX. Although several parts of the discussion could apply to other PowerPC processors, we will not attempt to identify such cases. Table 3-4 lists the important technical specifications of the 970FX.
[a] AS stands for Advanced Series.

[b] VMX is interchangeable with AltiVec. Apple markets the PowerPC's vector functionality as Velocity Engine.

[c] As of 2005.

[d] The two fixed-point (integer) units of the 970FX are not symmetrical. Only one of them can perform division, and only one can be used for special-purpose register (SPR) operations.

3.3.2. Caches
A multilevel cache hierarchy is a common aspect of modern processors. A cache can be defined as a small chunk of very fast memory that stores recently used data, instructions, or both. Information is typically added to and removed from a cache in aligned quanta called cache lines. The 970FX contains several caches and other special-purpose buffers to improve memory performance. Figure 3-4 shows a conceptual diagram of these caches and buffers.

Figure 3-4. Caches and buffers in the 970FX
3.3.2.1. L1 and L2 Caches
The level 1 (L1) cache is closest to the processor. Memory-resident information must be loaded into this cache before the processor can use it, unless that portion of memory is marked noncacheable. For example, when a load instruction is being executed, the processor refers to the L1 cache to see if the data in question is already held by a currently resident cache line. If so, the data is simply loaded from the L1 cache: an L1 cache hit. This operation takes only a few processor cycles, as compared to a few hundred cycles for accessing main memory.[23] If there is an L1 miss, the processor checks the next level in the cache hierarchy: the level 2 (L2) cache. An L2 hit causes the cache line containing the data to be loaded into the L1 cache and then into the appropriate register. The 970FX does not have a level 3 (L3) cache, but if it did, similar steps would be repeated for the L3 cache. If none of the caches contains the requested data, the processor must access main memory.

[23] Main memory refers to the system's installed and available dynamic memory (DRAM).

As a cache line's worth of data is loaded into L1, a resident cache line must be flushed to make room for the new cache line. The 970FX uses a pseudo-least-recently-used (LRU) algorithm[24] to determine which cache line to evict. Unless instructed otherwise, the evicted cache line is sent to the L2 cache, which makes L2 a victim cache. Table 3-5 shows the important properties of the 970FX's caches.

[24] The 970FX allows the data-cache replacement algorithm to be changed from LRU to FIFO through a bit in a hardware-dependent register.
[a] Section 3.3.6.1 discusses GPRs and FPRs. Section 3.3.10.2 discusses VPERM and VALU.
You can retrieve processor cache information using the sysctl command on Mac OS X, as shown in Figure 3-5. Note that the hwprefs command is part of Apple's CHUD Tools package.

Figure 3-5. Retrieving processor cache information using the sysctl command
3.3.2.2. Cache Properties
Let us look more closely at some of the cache-related terminology used in Table 3-5.

Associativity
As we saw earlier, the granularity of operation for a cache, that is, the unit of memory transfer in and out of a cache, is a cache line (also called a block). The cache line size on the 970FX is 128 bytes for both the L1 and L2 caches. The associativity of a cache determines where a cache line's worth of memory may be placed in the cache. If a cache is m-way set-associative, then the total space in the cache is conceptually divided into sets, with each set containing m cache lines. In a set-associative cache, a block of memory can be placed only in certain locations in the cache: It is first mapped to a set in the cache, after which it can be stored in any of the cache lines within that set. Typically, given a memory block with address B, the target set is calculated using the following modulo operation:

target set = B MOD {number of sets in cache}
A direct-mapped cache is equivalent to a one-way set-associative cache. It has the same number of sets as cache lines. This means a memory block with address B can exist only in one cache line, which is calculated as the following: target cache line = B MOD {number of cache lines in cache}
Store Policy
A cache's store policy defines what happens when an instruction writes to memory. In a write-through design, such as the 970FX L1 D-cache, information is written both to the cache line and to the corresponding block in memory. There is no L1 D-cache allocation on write misses: the affected block is modified only in the lower level of the cache hierarchy and is not loaded into L1. In a write-back design, such as the 970FX L2 cache, information is written only to the cache line; the affected block is written to memory only when the cache line is replaced.
Memory pages that are contiguous in virtual memory will normally not be contiguous in physical memory. Similarly, given a set of virtual addresses, it is not possible to predict how they will fit in the cache. A related point is that if you take a block of contiguous virtual memory the same size as a cache, say, a 512KB block (the size of the entire L2 cache), there is little chance that it will fit in the L2 cache.
MERSI
Only the L2 cache is physically mapped, although all caches use physical address tags. Stores are always sent to the L2 cache in addition to the L1 cache, as the L2 cache is the data coherency point. Coherent memory systems aim to provide the same view of memory to all devices accessing it. For example, it must be ensured that processors in a multiprocessor system access the correct data, whether the most up-to-date data resides in main memory or in another processor's cache. Maintaining such coherency in hardware requires a protocol by which the processor "remembers" the state of the sharing of cache lines.[25] The L2 cache implements the MERSI cache-coherency protocol, which has the following five states: Modified, Exclusive, Recent, Shared, and Invalid.

[25] Cache-coherency protocols are primarily either directory-based or snooping-based.
RAS
The caches incorporate parity-based error detection and correction mechanisms. Parity bits are additional bits used along with normal information to detect and correct errors in the transmission of that information. In the simplest case, a single parity bit is used to detect an error. The basic idea in such parity checking is to add an extra bit to each unit of information, say, to make the number of 1s in each unit either odd or even. Now, if a single error (actually, an odd number of errors) occurs during information transfer, the parity-protected information unit would be invalid. In the 970FX's L1 cache, parity errors are reported as cache misses and therefore are implicitly handled by refetching the cache line from the L2 cache. Besides parity, the L2 cache implements an error detection and correction scheme that can detect double errors and correct single errors by using a Hamming code.[26] When a single error is detected during an L2 fetch request, the bad data is corrected and written back to the L2 cache. Thereafter, the good data is refetched from the L2 cache.

[26] A Hamming code is an error-correcting code: an algorithm by which a sequence of numbers can be expressed such that any errors appearing in certain numbers (say, on the receiving side after the sequence was transmitted from one party to another) can be detected, and corrected, subject to certain limits, based on the remaining numbers.

3.3.3. Memory Management Unit (MMU)
During virtual memory operation, software-visible memory addresses must be translated to real (or physical) addresses, both for instruction accesses and for data accesses generated by load/store instructions. The 970FX uses a two-step address translation mechanism[27] based on segments and pages. In the first step, a software-generated 64-bit effective address (EA) is translated to a 65-bit virtual address (VA) using the segment table, which lives in memory. Segment table entries (STEs) contain segment descriptors that define virtual addresses of segments. In the second step, the virtual address is translated to a 42-bit real address (RA) using the hashed page table, which also lives in memory.

[27] The 970FX also supports a real addressing mode, in which physical translation can be effectively disabled.

The 32-bit PowerPC architecture provides 16 segment registers through which the 4GB virtual address space can be divided into 16 segments of 256MB each. The 32-bit PowerPC implementations use these segment registers to generate VAs from EAs. The 970FX includes a transitional bridge facility that allows a 32-bit operating system to continue using the 32-bit PowerPC implementation's segment register manipulation instructions. Specifically, the 970FX allows software to associate segments 0 through 15 with any of the 2^37 available virtual segments. In this case, the first 16 entries of the segment lookaside buffer (SLB), which is discussed next, act as the 16 segment registers.

3.3.3.1. SLB and TLB
We saw that the segment table and the page table are memory-resident. It would be prohibitively expensive if the processor had to go to main memory not only for data fetching but also for address translation. Caching exploits the principle of locality of reference; if caching is effective, then address translations will exhibit the same locality as the memory accesses they describe. The 970FX includes two on-chip buffers for caching recently used segment table entries and page address translations: the segment lookaside buffer (SLB) and the translation lookaside buffer (TLB), respectively. The SLB is a 64-entry, fully associative cache. The TLB is a 1024-entry, four-way set-associative cache with parity protection. It also supports large pages (see Section 3.3.3.4).

3.3.3.2. Address Translation
Figure 3-6 depicts address translation in the 970FX MMU, including the roles of the SLB and the TLB. The 970FX MMU uses 64-bit or 32-bit effective addresses, 65-bit virtual addresses, and 42-bit physical addresses. The presence of the DART introduces another address flavor, the I/O address, which is an address in a 32-bit address space that maps to a larger physical address space.

Figure 3-6. Address translation in the 970FX MMU
Technically, a computer architecture has three (and perhaps more) types of memory addresses: the processor-visible physical address, the software-visible virtual address, and the bus address, which is visible to an I/O device. In most cases (especially on 32-bit hardware), the physical and bus addresses are identical and therefore not differentiated.
The 65-bit extended address space is divided into pages. Each page is mapped to a physical page. A 970FX page table can be as large as 2^31 bytes (2GB), containing up to 2^24 (16 million) page table entry groups (PTEGs), where each PTEG is 128 bytes. As Figure 3-6 shows, during address translation the MMU converts program-visible effective addresses to real addresses in physical memory. It uses a part of the effective address (the effective segment ID) to locate an entry in the segment table. It first checks the SLB to see if it contains the desired STE. If there is an SLB miss, the MMU searches for the STE in the memory-resident segment table. If the STE is still not found, a memory access fault occurs. If the STE is found, a new SLB entry is allocated for it. The STE represents a segment descriptor, which is used to generate the 65-bit virtual address. The virtual address contains a 37-bit virtual segment ID (VSID). Note that the page index and the byte offset in the virtual address are the same as in the effective address. The concatenation of the VSID and the page index forms the virtual page number (VPN), which is used for looking up in the TLB. If there is a TLB miss, the memory-resident page table is searched to retrieve a page table entry (PTE), which contains a real page number (RPN). The RPN, along with the byte offset carried over from the effective address, forms the physical address.
The 970FX allows setting up the TLB to be direct-mapped by setting a particular bit of a hardware-implementation-dependent register.
3.3.3.3. Caching the Caches: ERATs
Information from the SLB and the TLB may be cached in two effective-to-real address translation caches (ERATs): one for instructions (the I-ERAT) and another for data (the D-ERAT). Both ERATs are 128-entry, two-way set-associative caches. Each ERAT entry contains effective-to-real address translation information for a 4KB block of storage. Both ERATs contain invalid information upon power-on. As shown in Figure 3-6, the ERATs represent a shortcut path to the physical address when there is a match for the effective address in the ERATs.

3.3.3.4. Large Pages
Large pages are meant for use by high-performance computing (HPC) applications. The typical page size of 4KB can be detrimental to memory performance in certain circumstances. If an application's locality of reference is too wide, 4KB pages may not capture the locality effectively enough. If too many TLB misses occur, the consequent TLB entry allocations and the associated delays would be undesirable. Since a large page represents a much larger memory range, the number of TLB hits should increase, as the TLB would now cache translations for larger virtual memory ranges.

It is an interesting problem for the operating system to make large pages available to applications. Linux provides large-page support through a pseudo file system (hugetlbfs) that is backed by large pages. The superuser must explicitly configure some number of large pages in the system by preallocating physically contiguous memory. Thereafter, a hugetlbfs instance can be mounted on a directory, which is required if applications intend to use the mmap() system call to access large pages. An alternative is to use the shared memory calls shmget() and shmat(). Files may be created, deleted, mmap()'ed, and munmap()'ed on hugetlbfs. It does not support reads or writes, however. AIX also requires separate, dedicated physical memory for large-page use. An AIX application can use large pages either via shared memory, as on Linux, or by requesting that the application's data and heap segments be backed by large pages.

Note that whereas the 970FX TLB supports large pages, the ERATs do not; a large page requires multiple ERAT entries, one for each referenced 4KB block of the page. Cache-inhibited accesses to addresses in large pages are not permitted.

3.3.3.5. No Support for Block Address Translation Mechanism
The 970FX does not support the Block Address Translation (BAT) mechanism found in earlier PowerPC processors such as the G4. BAT is a software-controlled array used for mapping large (often much larger than a page) virtual address ranges into contiguous areas of physical memory. The entire mapping has the same attributes, including access protection. Thus, the BAT mechanism is meant to reduce address translation overhead for large, contiguous regions of special-purpose virtual address spaces. Since BAT does not use pages, such memory cannot be paged normally. A good example of a scenario where BAT is useful is a region of framebuffer memory, which could be memory-mapped effectively via BAT. Software can select block sizes ranging from 128KB to 256MB.

On PowerPC processors that implement BAT, there are four BAT registers each for data (DBATs) and instructions (IBATs). A BAT register is actually a pair of upper and lower registers, which are accessible from supervisor mode. The eight pairs are named DBAT0U-DBAT3U, DBAT0L-DBAT3L, IBAT0U-IBAT3U, and IBAT0L-IBAT3L. The contents of a BAT register include a block effective page index (BEPI), a block length (BL), and a block real page number (BRPN). During BAT translation, a certain number of high-order bits of the EA, as specified by BL, are matched against each BAT register. If there is a match, the BRPN value is used to yield the RA from the EA. Note that BAT translation takes precedence over page table translation for storage locations that have mappings in both a BAT register and the page table.

3.3.4. Miscellaneous Internal Buffers and Queues
The 970FX contains several miscellaneous buffers and queues internal to the processor, most of which are not visible to software. Examples include the following:
3.3.5. Prefetching
Cache miss rates can be reduced through a technique called prefetching, that is, fetching information before the processor requests it. The 970FX prefetches instructions and data to hide memory latency. It also supports software-initiated prefetching of up to eight data streams called hardware streams, four of which can optionally be vector streams. A stream is defined as a sequence of loads that reference more than one contiguous cache line.

The prefetch engine is a functionality of the Load/Store Unit. It can detect sequential access patterns in ascending or descending order by monitoring loads and recording cache line addresses when there are cache misses. The 970FX does not prefetch store misses.

Let us look at an example of the prefetch engine's operation. Assuming no prefetch streams are active, the prefetch engine will act when there is an L1 D-cache miss. Suppose the miss was for a cache line with address A; then the engine will create an entry in the Prefetch Filter Queue (PFQ)[29] with the address of either the next or the previous cache line, that is, either A + 1 or A - 1. It guesses the direction (up or down) based on whether the memory access was located in the top 25% of the cache line (guesses down) or the bottom 75% of the cache line (guesses up). If there is another L1 D-cache miss, the engine will compare the line address with the entries in the PFQ. If the access is indeed sequential, the line address now being compared must be either A + 1 or A - 1. Alternatively, the engine could have incorrectly guessed the direction, in which case it would create another filter entry for the opposite direction. If the guessed direction was correct (say, up), the engine deems it a sequential access and allocates a stream entry in the Prefetch Request Queue (PRQ)[30] using the next available stream identifier. Moreover, the engine will initiate prefetching for cache line A + 2 to L1 and cache line A + 3 to L2.
If A + 2 is read, the engine will cause A + 3 to be fetched to L1 from L2, and A + 4, A + 5, and A + 6 to be fetched to L2. If further sequential demand-reads occur (for A + 3 next), this pattern will continue until all streams are assigned. The PFQ is updated using an LRU algorithm.

[29] The PFQ is a 12-entry queue for detecting data streams for prefetching.

[30] The PRQ is a queue of eight streams that will be prefetched.

The 970FX allows software to manipulate the prefetch mechanism. This is useful if the programmer knows data access patterns ahead of time. A version of the data-cache-block-touch (dcbt) instruction, which is one of the storage control instructions, can be used by a program to provide hints that it intends to read from a specified address or data stream in the near future. Consequently, the processor would initiate a data stream prefetch from a particular address. Note that if you attempt to access unmapped or protected memory via software-initiated prefetching, no page faults will occur. Moreover, these instructions are not guaranteed to succeed and can fail silently for a variety of reasons. In the case of success, no result is returned in any register; only the cache block is fetched. In the case of failure, no cache block is fetched, and again, no result is returned in any register. In particular, failure does not affect program correctness; it simply means that the program will not benefit from prefetching.

Prefetching continues until a page boundary is reached, at which point the stream must be reinitialized. This is so because the prefetch engine does not know about the effective-to-real address mapping and can prefetch only within a real page. This is an example of a situation in which large pages, with page boundaries 16MB apart, will fare better than 4KB pages.
On a Mac OS X system with AltiVec hardware, you can use the vec_dst() AltiVec function to initiate data read of a line into cache, as shown in the pseudocode in Figure 3-7.

Figure 3-7. Data prefetching in AltiVec
The address argument to vec_dst() is a pointer to a byte that lies within the first cache line to be fetched; the control argument is a word whose bits specify the block size, the block count, and the distance between the blocks; and the stream_id specifies the stream to use.

3.3.6. Registers
The 970FX has two privilege modes of operation: a user mode (problem state) and a supervisor mode (privileged state). The former is used by user-space applications, whereas the latter is used by the Mac OS X kernel. When the processor is first initialized, it comes up in supervisor mode, after which it can be switched to user mode via the Machine State Register (MSR). The set of architected registers can be divided into three levels (or models) in the PowerPC architecture:
The UISA and VEA registers can be accessed by software through either user-level or supervisor-level privileges, although there are VEA registers that cannot be written to by user-level instructions. OEA registers can be accessed only by supervisor-level instructions. 3.3.6.1. UISA and VEA Registers
Figure 3-8 shows the UISA and VEA registers of the 970FX. Their purpose is summarized in Table 3-6. Note that whereas the general-purpose registers are all 64 bits wide, the set of supervisor-level registers contains both 32-bit and 64-bit registers.

Figure 3-8. PowerPC UISA and VEA registers
Processor registers are used with all normal instructions that access memory. In fact, there are no computational instructions in the PowerPC architecture that modify storage. For a computational instruction to use a storage operand, it must first load the operand into a register. Similarly, if a computational instruction writes a value to a storage operand, the value must go to the target location via a register. The PowerPC architecture supports the following addressing modes for such instructions.
rA and rB represent register contents. The notation (rA | 0) means the contents of register rA unless rA is GPR0, in which case (rA | 0) is taken to be the value 0. The UISA-level performance-monitoring registers provide user-level read access to the 970FX's performance-monitoring facility. They can be written only by a supervisor-level program such as the kernel or a kernel extension.
Apple's Computer Hardware Understanding Development (CHUD) is a suite of programs (the "CHUD Tools") for measuring and optimizing performance on Mac OS X. The software in the CHUD Tools package makes use of the processor's performance-monitoring counters.
The Timebase Register
The Timebase (TB) provides a long-period counter driven by an implementation-dependent frequency. The TB is a 64-bit register containing an unsigned 64-bit integer that is incremented periodically. Each increment adds 1 to bit 63 (the lowest-order bit) of the TB. The maximum value that the TB can hold is 2^64 - 1, after which it rolls over to zero without generating any exception. The TB can either be incremented at a frequency that is a function of the processor clock frequency, or it can be driven by the rising edge of the signal on the TB enable (TBEN) input pin.[31] In the former case, the 970FX increments the TB once every eight full-frequency processor clocks. It is the operating system's responsibility to initialize the TB. The TB can be read, but not written to, from user space. The program shown in Figure 3-9 retrieves and prints the TB.

[31] In this case, the TB frequency may change at any time.

Figure 3-9. Retrieving and displaying the Timebase Register
Note in Figure 3-9 that we use inline assembly rather than create a separate assembly source file. The GNU assembler inline syntax is based on the template shown in Figure 3-10.

Figure 3-10. Code template for inline assembly in the GNU assembler
We will come across other examples of inline assembly in this book.
3.3.6.2. OEA Registers
The OEA registers are shown in Figure 3-11. Examples of their use include the following.

Figure 3-11. PowerPC OEA registers
3.3.7. Rename Registers
The 970FX implements a substantial number of rename registers, which are used to handle register-name dependencies. Instructions can depend on one another from the point of view of control, data, or name. Consider two instructions, say, I1 and I2, in a program, where I2 comes after I1:

I1
...
Ix
...
I2
In a data dependency, I2 either uses a result produced by I1, or I2 has a data dependency on an instruction Ix, which in turn has a data dependency on I1. In both cases, a value is effectively transmitted from I1 to I2.

In a name dependency, I1 and I2 use the same logical resource or name, such as a register or a memory location. In particular, if I2 writes to the same register that is either read from or written to by I1, then I2 would have to wait for I1 to execute before it can execute. These are known as write-after-read (WAR) and write-after-write (WAW) hazards.

I1 reads (or writes) <REGISTER X>
...
I2 writes <REGISTER X>

In this case, the dependency is not "real" in that I2 does not need I1's result. One solution for handling register-name dependencies is to rename the conflicting register used in the instructions so that they become independent. Such renaming could be done in software (statically, by the compiler) or in hardware (dynamically, by logic in the processor). The 970FX uses pools of physical rename registers that are assigned to instructions during the mapping stage in the processor pipeline and released when they are no longer needed. In other words, the processor internally renames architected registers used by instructions to physical registers. This makes sense only when the number of physical registers is (substantially) larger than the number of architected registers. For example, the PowerPC architecture has 32 GPRs, but the 970FX implementation has a pool of 80 physical GPRs, from which the 32 architected GPRs are assigned.

Let us consider a specific example, say, of a WAW hazard, where renaming is helpful. In the following sketch, both instructions write r20, so the second would normally have to wait for the first; once each write of r20 is mapped to a different physical register (the physical register numbers shown are illustrative), the two can execute independently.

; before renaming: a WAW hazard on r20
r20 <- r21 + r22
...
r20 <- r23 + r24

; after renaming: the hazard is gone
p35 <- r21 + r22
...
p67 <- r23 + r24

Table 3-7 lists the available rename registers in the 970FX. The table also mentions emulation registers, which are available to cracked and microcoded instructions. Cracking and microcoding, as we will see in Section 3.3.9.1, are processes by which complex instructions are broken down into simpler instructions.
3.3.8. Instruction Set
All PowerPC instructions are 32 bits wide regardless of whether the processor is in 32-bit or 64-bit computation mode. All instructions are word aligned, which means that the two lowest-order bits of an instruction address are irrelevant from the processor's standpoint. There are several instruction formats, but bits 0 through 5 of an instruction word always specify the major opcode. PowerPC instructions typically have three operands: two source operands and one result. One of the source operands may be a constant or a register, but the other operands are usually registers.

We can broadly divide the instruction set implemented by the 970FX into the following instruction categories: fixed-point, floating-point, vector, control flow, and everything else.

3.3.8.1. Fixed-Point Instructions
Operands of fixed-point instructions can be bytes (8-bit), half words (16-bit), words (32-bit), or double words (64-bit). This category includes the following instruction types:
Most load/store instructions can optionally update the base register with the effective address of the data operated on by the instruction.

3.3.8.2. Floating-Point Instructions
Floating-point operands can be single-precision (32-bit) or double-precision (64-bit) floating-point quantities. However, floating-point data is always stored in the FPRs in double-precision format. Loading a single-precision value from storage converts it to double precision, and storing a single-precision value to storage actually rounds the FPR-resident double-precision value to single precision. The 970FX complies with the IEEE 754 standard[32] for floating-point arithmetic. This instruction category includes the following types: [32] The IEEE 754 standard governs binary floating-point arithmetic. The standard's primary architect was William Velvel Kahan, who received the Turing Award in 1989 for his fundamental contributions to numerical analysis.
The precision of the floating-point-estimate instructions (fres and frsqrte) is lower on the 970FX than on the G4. Although the 970FX is at least as accurate as the IEEE 754 standard requires, the G4 is more accurate than required. Figure 3-12 shows a program that can be executed on a G4 and a G5 to illustrate this difference.

Figure 3-12. Precision of the floating-point-estimate instruction on the G4 and the G5
3.3.8.3. Vector Instructions
Vector instructions execute in the 128-bit VMX execution unit. We will look at some of the VMX details in Section 3.3.10. The 970FX VMX implementation contains 162 vector instructions in various categories. 3.3.8.4. Control-Flow Instructions
A program's control flow is sequential (that is, its instructions logically execute in the order they appear) until a control-flow change occurs, either explicitly (because of an instruction that modifies the control flow of the program) or as a side effect of another event. The following are examples of control-flow changes:
Each of these events may have a handler: a piece of code that is executed when the event occurs. For example, a trap handler may be executed when the conditions specified in the trap instruction are satisfied. When a user-space program executes an sc instruction with a valid system call identifier, a function in the operating system kernel is invoked to provide the service corresponding to that system call. Similarly, control flow also changes when the program returns from such handlers. For example, after a system call finishes in the kernel, execution continues in user space, in a different piece of code.

The 970FX supports absolute and relative branching. A branch may be conditional or unconditional. A conditional branch can be based on any of the bits in the CR being 1 or 0. We earlier came across the special-purpose registers LR and CTR. LR can hold the return address on a procedure call. A leaf procedure (one that does not call another procedure) does not need to save LR and therefore can return faster. CTR is used for loops with a fixed iteration limit. It can be used to branch based on its contents (the loop counter) being zero or nonzero, while decrementing the counter automatically. LR and CTR are also used to hold the target addresses of conditional branches for use with the bclr and bcctr instructions, respectively.
Besides performing aggressive dynamic branch prediction, the 970FX allows hints to be provided along with many types of branch instructions to improve branch prediction accuracy. 3.3.8.5. Miscellaneous Instructions
The 970FX includes various other types of instructions, many of which are used by the operating system for low-level manipulation of the processor. Examples include the following types:
3.3.9. The 970FX Core
The 970FX core is depicted in Figure 3-13. We have come across several of the core's major components earlier in this chapter, such as the L1 caches, the ERATs, the TLB, the SLB, the register files, and the register-renaming resources.

Figure 3-13. The core of the 970FX
The 970FX core is designed to achieve a high degree of instruction parallelism. Some of its noteworthy features include the following.
The processor uses a large number of resources, such as reorder queues, rename register pools, and other logic, to track in-flight instructions and their dependencies.

3.3.9.1. Instruction Pipeline
In this section, we will discuss how the 970FX processes instructions. The overall instruction pipeline is shown in Figure 3-14. Let us look at the important stages of this pipeline.
Figure 3-14. The 970FX instruction pipeline
IFAR, ICA[37]
[37] Instruction Cache Access.
Based on the address in the Instruction Fetch Address Register (IFAR), the instruction-fetch logic fetches eight instructions every cycle from the L1 I-cache into a 32-entry instruction buffer. The eight-instruction block, so fetched, is 32-byte aligned. Besides performing IFAR-based demand fetching, the 970FX prefetches cache lines into a 4x128-byte Instruction Prefetch Queue. If a demand fetch results in an I-cache miss, the 970FX checks whether the instructions are in the prefetch queue. If the instructions are found, they are inserted into the pipeline as if no I-cache miss had occurred. The cache line's critical sector (eight words) is written into the I-cache.
D0
There is logic to partially decode (predecode) instructions after they leave the L2 cache and before they enter the I-cache or the prefetch queue. This process adds five extra bits to each instruction to yield a 37-bit instruction. An instruction's predecode bits mark it as illegal, microcoded, conditional or unconditional branch, and so on. In particular, the bits also specify how the instruction is to be grouped for dispatching.
D1, D2, D3
The 970FX splits complex instructions into two or more internal operations, or iops. The iops are more RISC-like than the instructions from which they are derived. Instructions that are broken into exactly two iops are called cracked instructions, whereas those that are broken into three or more iops are called microcoded instructions because the processor emulates them using microcode.
An instruction may not be atomic because the atomicity of cracked or microcoded instructions is at the iop level. Moreover, it is the iops, and not programmer-visible instructions, that are executed out-of-order. This approach allows the processor more flexibility in parallelizing execution. Note that AltiVec instructions are neither cracked nor microcoded.
Fetched instructions go to a 32-instruction fetch buffer. Every cycle, up to five instructions are taken from this buffer and sent through a decode pipeline that is either inline (consisting of three stages, namely, D1, D2, and D3), or template-based if the instruction needs to be microcoded. The template-based decode pipeline generates up to four iops per cycle that emulate the original instruction. In any case, the decode pipeline leads to the formation of an instruction dispatch group. Given the out-of-order execution of instructions, the processor needs to keep track of the program order of all instructions in various stages of execution. Rather than tracking individual instructions, the 970FX tracks instructions in dispatch groups. The 970FX forms such groups containing one to five iops, each occupying an instruction slot (0 through 4) in the group. Dispatch group formation[38] is subject to a long list of rules and conditions such as the following. [38] The instruction grouping performed by the 970FX has similarities to a VLIW processor.
XFER
The iops wait for resources to become free in the XFER stage.
GD, DSP, WRT, GCT, MAP
After group formation, the execution pipeline divides into multiple pipelines for the various execution units. Every cycle, one group of instructions can be sent (or dispatched) to the issue queues. Note that instructions in a group remain together from dispatch to completion. As a group is dispatched, several operations occur before the instructions actually execute. Internal group instruction dependencies are determined (GD). Various internal resources are assigned, such as issue queue slots, rename registers and mappers, and entries in the load/store reorder queues. In particular, each iop in the group that returns a result must be assigned a register to hold the result. Rename registers are allocated in the dispatch phase before the instructions enter the issue queues (DSP, MAP). To track the groups themselves, the 970FX uses a global completion table (GCT) that stores up to 20 entries in program order; that is, up to 20 dispatch groups can be in flight concurrently. Since each group can have up to 5 iops, as many as 100 iops can be tracked in this manner. The WRT stage represents the writes to the GCT.
ISS, RF
After all the resources that are required to execute the instructions are available, the instructions are sent (ISS) to appropriate issue queues. Once their operands become available, the instructions start to execute. Each slot in a group feeds separate issue queues for various execution units. For example, the FXU/LSU and the FPU draw their instructions from slots { 0, 3 } and { 1, 2 }, respectively, of an instruction group. If one pair goes to the FXU/LSU, the other pair goes to the FPU. The CRU draws its instructions from the CR logical issue queue that is fed from instruction slots 0 and 1. As we saw earlier, slot 4 of an instruction group is dedicated to branch instructions. AltiVec instructions can be issued to the VALU and the VPERM issue queues from any slot except slot 4. Table 3-8 shows the 970FX issue queue sizes; each execution unit listed has one issue queue.
[a] LSU0 and FXU0 share an 18-entry issue queue. [b] LSU1 and FXU1 share an 18-entry issue queue. The FXU/LSU and FPU issue queues have odd and even halves that are hardwired to receive instructions only from certain slots of a dispatch group, as shown in Figure 3-15.
Figure 3-15. The FPU and FXU/LSU issue queues in the 970FX
As long as an issue queue contains instructions that have all their data dependencies resolved, an instruction moves every cycle from the queue into the appropriate execution unit. However, there are likely to be instructions whose operands are not ready; such instructions block in the queue. Although the 970FX will attempt to execute the oldest instruction first, it will reorder instructions within a queue's context to avoid stalling. Ready-to-execute instructions access their source operands by reading the corresponding register file (RF), after which they enter the execution unit pipelines. Up to ten operations can be issued in a cycle, one to each of the ten execution pipelines. Note that different execution units may have varying numbers of pipeline stages. We have seen that instructions both issue and execute out of order. However, if an instruction has finished execution, it does not mean that the program will "know" about it. After all, from the program's standpoint, instructions must execute in program order. The 970FX differentiates between an instruction finishing execution and an instruction completing. An instruction may finish execution (speculatively, say), but unless it completes, its effect is not visible to the program. All pipelines terminate in a common stage: the group completion stage (CP). When groups complete, many of their resources are released, such as load reorder queue entries, mappers, and global completion table entries. One dispatch group may be "retired" per cycle.
When a branch instruction completes, the resultant target address is compared with a predicted address. Depending on whether the prediction is correct or incorrect, either all instructions in the pipeline that were fetched after the branch in question are flushed, or the processor waits for all remaining instructions in the branch's group to complete.
Accounting for 215 In-Flight Instructions
We can account for the theoretical maximum of 215 in-flight instructions by looking at Figure 3-14, specifically the areas marked 1 through 6.
Thus, the theoretical maximum number of in-flight instructions can be calculated as the sum 16 + 32 + 15 + 20 + 100 + 32, which is 215.
3.3.9.2. Branch Prediction
Branch prediction is a mechanism wherein the processor attempts to keep the pipeline full, and therefore improve overall performance, by fetching instructions in the hope that they will be executed. In this context, a branch is a decision point for the processor: It must predict the outcome of the branch (whether it will be taken or not) and prefetch instructions accordingly. As shown in Figure 3-14, the 970FX scans fetched instructions for branches. It looks for up to two branches per cycle and uses multistrategy branch prediction logic to predict their target addresses, directions, or both. Consequently, up to 2 branches are predicted per cycle, and up to 16 predicted branches can be in flight. All conditional branches are predicted; whether the 970FX fetches instructions beyond a branch and speculatively executes them is based on the prediction. Once the branch instruction itself executes in the BRU, its actual outcome is compared with its predicted outcome. If the prediction was incorrect, there is a severe penalty: Any instructions that may have speculatively executed are discarded, and instructions in the correct control-flow path are fetched. The 970FX's dynamic branch prediction hardware includes three branch history tables (BHTs), a link stack, and a count cache. Each BHT has 16K 1-bit entries.
The 970FX's hardware branch prediction can be overridden by software.
The first BHT is the local predictor table. Its 16K entries are indexed by branch instruction addresses. Each 1-bit entry indicates whether the branch should be taken or not. This scheme is "local" because each branch is tracked in isolation. The second BHT is the global predictor table. It is used by a prediction scheme that takes into account the execution path taken to reach the branch. An 11-bit vector, the global history vector, represents the execution path. The bits of this vector represent the previous 11 instruction groups fetched. A particular bit is 1 if the next group was fetched sequentially and is 0 otherwise. A given branch's entry in the global predictor table is at a location calculated by performing an XOR operation between the global history vector and the branch instruction address. The third BHT is the selector table. It tracks which of the two prediction schemes is to be favored for a given branch. The BHTs are kept up to date with the actual outcomes of executed branch instructions. The link stack and the count cache are used by the 970FX to predict branch target addresses of branch-conditional-to-link-register (bclr, bclrl) and branch-conditional-to-count-register (bcctr, bcctrl) instructions, respectively. So far, we have looked at dynamic branch prediction. The 970FX also supports static prediction wherein the programmer can use certain bits in a conditional branch operand to statically override dynamic prediction. Specifically, two bits called the "a" and "t" bits are used to provide hints regarding the branch's direction, as shown in Table 3-9.
3.3.9.3. Summary
Let us summarize the instruction parallelism achieved by the 970FX. In every cycle of the 970FX, the following events occur.
3.3.10. AltiVec
The 970FX includes a dedicated vector-processing unit and implements the VMX instruction set, interchangeably known as AltiVec[39], as an extension to the PowerPC architecture. AltiVec provides a SIMD-style 128-bit[40] vector-processing unit.
[39] AltiVec was first introduced in Motorola's e600 PowerPC core (the G4).
[40] All AltiVec execution units and data paths are 128 bits wide.
3.3.10.1. Vector Computing
SIMD stands for single-instruction, multiple-data. It refers to a set of operations that can efficiently handle large quantities of data in parallel. SIMD operations do not necessarily require more or wider registers, although more is better. SIMD essentially makes better use of registers and data paths. For example, a non-SIMD computation would typically use a hardware register for each data element, even if the register could hold multiple such elements. In contrast, SIMD would use a register to hold multiple data elements (as many as would fit) and would perform the same operation on all elements through a single instruction. Thus, any operation that can be parallelized in this manner stands to benefit from SIMD. In AltiVec's case, a vector instruction can perform the same operation on all constituents of a vector. Note that AltiVec instructions work on fixed-length vectors.
SIMD-based optimization does not come for free. A problem must lend itself well to vectorization, and the programmer must usually perform extra work. Some compilers, such as IBM's XL suite of compilers and GCC 4.0 or above, also support auto-vectorization, an optimization that auto-generates vector instructions based on the compiler's analysis of the source code.[41] Auto-vectorization may or may not work well depending on the nature and structure of the code. [41] For example, the compiler may attempt to detect patterns of code that are known to be well suited for vectorization. Several processor architectures have similar extensions. Table 3-10 lists some well-known examples.
AltiVec can greatly improve the performance of data movement, benefiting applications that do processing of vectors, matrices, arrays, signals, and so on. As we saw in Chapter 2, Apple provides portable APIs, through the Accelerate framework (Accelerate.framework), for performing vector-optimized operations.[42] Accelerate is an umbrella framework that contains the vecLib and vImage[43] subframeworks. vecLib is targeted at numerical and scientific computing: it provides functionality such as BLAS, LAPACK, digital signal processing, dot products, linear algebra, and matrix operations. vImage provides vector-optimized APIs for working with image data. For example, it provides functions for alpha compositing, convolutions, format conversion, geometric transformations, histogram operations, and morphological operations. [42] The Accelerate framework automatically uses the best available code that it implements, depending on the hardware it is running on. For example, it will use vectorized code for AltiVec if AltiVec is available. On the x86 platform, it will use MMX, SSE, SSE2, and SSE3 if these features are available. [43] vImage is also available as a stand-alone framework.
Although a vector instruction performs work that would typically require many times more nonvector instructions, vector instructions are not simply instructions that deal with "many scalars" or "more memory" at a time. The fact that a vector's members are related is critical, and so is the fact that the same operation is performed on all members. Vector operations certainly play better with memory accesses: they lead to amortization. The semantic difference between performing a vector operation and a sequence of scalar operations on the same data set is that you are implicitly providing more information to the processor about your intentions. Vector operations, by their nature, alleviate both data and control hazards.
AltiVec has wide-ranging applications since areas such as high-fidelity audio, video, videoconferencing, graphics, medical imaging, handwriting analysis, data encryption, speech recognition, image processing, and communications all use algorithms that can benefit from vector processing. Figure 3-16 shows a trivial AltiVec C program.
Figure 3-16. A trivial AltiVec program
As also shown in Figure 3-16, the -faltivec option to GCC enables AltiVec language extensions.
3.3.10.2. The 970FX AltiVec Implementation
The 970FX AltiVec implementation consists of the following components:
The CR is also modified as a result of certain vector instructions.
The VALU and the VPERM are both dispatchable units that receive predecoded instructions via the issue queues.
The 32-bit VRSAVE serves a special purpose: Each of its bits indicates whether the corresponding vector register is in use or not. The processor maintains this register so that it does not have to save and restore every vector register every time there is an exception or a context switch. Frequently saving or restoring 32 128-bit registers, which together constitute 512 bytes, would be severely detrimental to cache performance, as other, perhaps more critical data would need to be evicted from the cache. Let us extend our example program from Figure 3-16 to examine the value in the VRSAVE. Figure 3-17 shows the extended program.
Figure 3-17. Displaying the contents of the VRSAVE
We see in Figure 3-17 that two high-order bits of the VRSAVE are set and the rest are cleared. This means the program uses two VRs: VR0 and VR1. You can verify this by looking at the assembly listing for the program. The VPERM execution unit can do merge, permute, and splat operations on vectors. Having a separate permute unit allows data-reorganization instructions to proceed in parallel with vector arithmetic and logical instructions. The VPERM and VALU both maintain their own copies of the VRF that are synchronized on the half cycle. Thus, each receives its operands from its own VRF. Note that vector loads, stores, and data stream instructions are handled in the usual LSU pipes. Although no AltiVec instructions are cracked or microcoded, vector store instructions logically break down into two components: a vector part and an LSU part. In the group formation stage, a vector store is a single entity occupying one slot. However, once the instruction is issued, it occupies two issue queue slots: one in the vector store unit and another in the LSU. Address generation takes place in the LSU. There is a slot for moving the data out of the VRF in the vector unit. This is not any different from scalar (integer and floating-point) stores, in whose case address generation still takes place in the LSU, and the respective execution unit (integer or floating-point) is used for accessing the GPR file (GPRF) or the FPR file (FPRF). AltiVec instructions were designed to be pipelined easily. The 970FX can dispatch up to four vector instructions every cycle, regardless of type, to the issue queues. Any vector instruction can be dispatched from any slot of the dispatch group except the dedicated branch slot 4.
It is usually very inefficient to pass data between the scalar units and the vector unit because data transfer between register files is not direct but goes through the caches.
3.3.10.3. AltiVec Instructions
AltiVec adds 162 vector instructions to the PowerPC architecture. Like all other PowerPC instructions, AltiVec instructions have 32-bit-wide encodings. To use AltiVec, no context switching is required. There is no special AltiVec operating modeAltiVec instructions can be used along with regular PowerPC instructions in a program. AltiVec also does not interfere with floating-point registers.
AltiVec instructions should be used at the UISA and VEA levels of the PowerPC architecture but not at the OEA level (the kernel). The same holds for floating-point arithmetic. Nevertheless, it is possible to use AltiVec and floating-point in the Mac OS X kernel beginning with a revision of Mac OS X 10.3. However, doing so comes at the cost of performance overhead in the kernel, since using AltiVec or floating-point will lead to a larger number of exceptions and register save/restore operations. Moreover, AltiVec data stream instructions cannot be used in the kernel. High-speed video scrolling on the system console is an example of the floating-point unit being used by the kernel: the scrolling routines use floating-point registers for fast copying. The audio subsystem also uses floating-point in the kernel. The following points are noteworthy regarding AltiVec vectors.
Instructions in the AltiVec instruction set can be broadly classified into the following categories:
Vector single-element loads are implemented as lvx, with undefined fields not zeroed explicitly. Care should be taken while dealing with such cases as this could lead to denormals[45] in floating-point calculations.
[45] Denormal numbers, also called subnormal numbers, are numbers that are so small they cannot be represented with full precision.
3.3.11. Power Management
The 970FX supports power management features such as the following.
3.3.11.1. PowerTune
PowerTune allows clock frequencies to be dynamically controlled and even synchronized across multiple processors. PowerTune frequency scaling occurs in the processor core, the busses, the bridge, and the memory controller. Allowed frequencies range from f (the full nominal frequency) to f/2, f/4, and f/64. The latter corresponds to the deep nap power-saving mode. If an application does not require the processor's maximum available performance, frequency and voltage can be changed system-wide, without stopping the core execution units and without disabling interrupts or bus snooping. All processor logic, except the bus clocks, remains active. Moreover, the frequency change is very rapid. Since power has a quadratic dependency on voltage, reducing voltage has a desirable effect on power dissipation. Consequently, the 970FX has much lower typical power consumption than the 970, which did not have PowerTune.
3.3.11.2. Power Mac G5 Thermal and Power Management
In the Power Mac G5, Apple combines the power management capabilities of the 970FX/970MP with a network of fans and sensors to contain heat generation, power consumption, and noise levels. Examples of hardware sensors include those for fan speed, temperature, current, and voltage. The system is divided into discrete cooling zones with independently controlled fans. Some Power Mac G5 models additionally contain a liquid cooling system that circulates a thermally conductive fluid to transfer heat away from the processors into a radiant grille. As air passes over the grille's cooling fins, the fluid's temperature decreases.[46]
[46] Similar to how an automobile radiator works.
Operating system support is required to make the Power Mac G5's thermal management work properly. Mac OS X regularly monitors various temperatures and power consumption. It also communicates with the fan control unit (FCU). If the FCU does not receive feedback from the operating system, it will spin the fans at maximum speed. A liquid-cooled dual-processor 2.5GHz Power Mac has the following fans:
Additionally, the Power Mac has sensors for current, voltage, and temperature, as listed in Table 3-11.
[a] The AD7417 is a type of analog-to-digital converter with an on-chip temperature sensor. We will see in Chapter 10 how to programmatically retrieve the values of various sensors from the kernel.
3.3.12. 64-bit Architecture
We saw earlier that the PowerPC architecture was designed with explicit support for 64- and 32-bit computing. PowerPC is, in fact, a 64-bit architecture with a 32-bit subset. A particular PowerPC implementation may choose to implement only the 32-bit subset, as is the case with the G3 and G4 processor families used by Apple. The 970FX implements both the 64-bit and 32-bit forms[47], or dynamic computation modes[48], of the PowerPC architecture. The modes are dynamic in that you can switch between the two dynamically by setting or clearing bit 0 of the MSR. [47] A 64-bit PowerPC implementation must implement the 32-bit subset. [48] The computation mode encompasses addressing mode.
3.3.12.1. 64-bit Features
The key aspects of the 970FX's 64-bit mode are as follows:
Although a Mac OS X process must be 64-bit itself to be able to directly access more than 4GB of virtual memory, having support in the processor for more than 4GB of physical memory benefits both 64-bit and 32-bit applications. After all, physical memory backs virtual memory. Recall that the 970FX can track a large amount of physical memory: 42 bits' worth, or 4TB. Therefore, as long as there is enough RAM, much greater amounts of it can be kept "alive" than is possible with only 32 bits of physical addressing. This is beneficial to 32-bit applications because the operating system can now keep more working sets in RAM, reducing the number of page-outs, even though a single 32-bit application will still "see" only a 4GB address space.
3.3.12.2. The 970FX as a 32-bit Processor
Just as the 64-bit PowerPC is not an extension of the 32-bit PowerPC, the latter is not a performance-limited version of the former: there is no penalty for executing in 32-bit-only mode on the 970FX. There are, however, some differences. Important aspects of running the 970FX in 32-bit mode include the following.
3.3.13. Softpatch Facility
The 970FX provides a facility called softpatch, which is a mechanism that allows software to work around bugs in the processor core and to otherwise debug the core. This is achieved either by replacing an instruction with a substitute microcoded instruction sequence or by making an instruction cause a trap to software through a softpatch exception. The 970FX's Instruction Fetch Unit contains a seven-entry array with content-addressable memory (CAM). This array is called the Instruction Match CAM (IMC). Additionally, the 970FX's instruction decode unit contains a microcode softpatch table. The IMC array has eight rows. The first six IMC entries occupy one row each, whereas the seventh entry occupies two rows. Of the seven entries, the first six are used to match partially (17 bits) over an instruction's major opcode (bits 0 through 5) and extended opcode (bits 21 through 31). The seventh entry matches in its entirety: a 32-bit full instruction match. As instructions are fetched from storage, they are matched against the IMC entries by the Instruction Fetch Unit's matching facility. If matched, the instruction's processing can be altered based on other information in the matched entry. For example, the instruction can be replaced with microcode from the instruction decode unit's softpatch table. The 970FX provides various other tracing and performance-monitoring facilities that are beyond the scope of this chapter.