Itanium® Architecture for Programmers: Understanding 64-Bit Processors and EPIC Principles

2017-07-07 02:10:07

The Itanium architecture, like many others, uses separate opcodes for operations involving general (integer) or floating-point registers. An ISA may provide instructions that copy data between storage locations: register-memory, register-register, or memory-memory. For floating-point data, RISC and Itanium architectures do not support direct memory-memory copying; instead, they provide register-memory (load and store) as well as register-register (move) transfers.

In this chapter, we take up floating-point instructions in the same order used for integer instructions, concentrating on their intended effect. Details of the conversions of floating-point data between memory and register formats have been discussed by Triebel and documented in "Application Architecture" (Volume 1 of the Intel Itanium Architecture Software Developer's Manual).

8.3.1 Floating-Point Store Instructions

There are three forms of floating-point store instructions for Itanium architecture: normal, integer, and spill, each having the following syntax:

    stffsz.sthint       [r3]=f2        // mem[r3] <- f2
    stffsz.sthint       [r3]=f2,imm9   // mem[r3] <- f2
                                       // r3 <- r3 + sext(imm9)
    stf8.sthint         [r3]=f2        // mem[r3] <- f2<63:0>
    stf8.sthint         [r3]=f2,imm9   // mem[r3] <- f2<63:0>
                                       // r3 <- r3 + sext(imm9)
    stf.spill.sthint    [r3]=f2        // mem[r3] <- f2
    stf.spill.sthint    [r3]=f2,imm9   // mem[r3] <- f2
                                       // r3 <- r3 + sext(imm9)

where fsz, written immediately after stf in the mnemonic (giving stfs, stfd, or stfe), is the size of the information unit to which the quantity in register f2 is converted before the normal form copies it to memory at the address specified in register r3. The available values for fsz are s for 4-byte single-precision, d for 8-byte double-precision, and e for 10-byte double extended precision data. The simplest store instruction uses register direct addressing for the source register operand and register indirect addressing for the destination operand.

Store operations can be susceptible to numerous exceptions, chief among them being an attempt to store unaligned data. To store 4 or 8 bytes, we must ensure that the lowest 2 bits (for 4 bytes) or 3 bits (for 8 bytes) of the address held in register r3 are 0.
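The alignment rule can be sketched in a few lines of Python. This is only an illustrative model, not a description of how the hardware checks addresses; the function name `is_aligned` is hypothetical, and the sketch covers only power-of-two sizes such as the 4-byte and 8-byte cases mentioned above.

```python
def is_aligned(address: int, size: int) -> bool:
    """Return True if `address` is a multiple of `size`.

    Hypothetical helper: `size` must be a power of two (4 or 8 in the
    cases discussed in the text).
    """
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a positive power of two")
    # For a power-of-two size, the low log2(size) bits must all be zero:
    # 2 bits for 4-byte data, 3 bits for 8-byte data.
    return address & (size - 1) == 0
```

For example, `is_aligned(0x1004, 4)` holds because the lowest 2 bits of the address are 0, while `is_aligned(0x1004, 8)` does not because bit 2 is set.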

There are two values for sthint (the store hint completer): none at all and nta. None at all corresponds to an ordinary store operation; the processor receives the hint that the program associates temporal locality in cache with the value stored. On the other hand, nta provides the hint that the program considers the value stored to have nontemporal locality at all levels of cache and memory hierarchy. The use of nta may thus avoid knocking out of the caches other data that will be reused.

The Itanium floating-point store instructions provide for postmodification of the value in pointer register r3 by a signed adjustment of -256 to +255. Many architectures provide an addressing mode known as autoincrement, where the pointer register is postincremented only by the byte size of data copied. Some architectures also offer an autodecrement addressing mode, usually with a predecrement, unlike the postdecrement discussed here.
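The effect of sext(imm9) on the pointer register can be modeled as follows. This is a minimal sketch under stated assumptions: the names `sext9`, `post_modify`, and `MASK64` are hypothetical, the immediate is treated as a raw 9-bit field rather than as the instruction's actual bit encoding, and the pointer is assumed to wrap modulo 2^64.

```python
MASK64 = (1 << 64) - 1  # assumption: r3 is a 64-bit register that wraps


def sext9(imm9: int) -> int:
    """Sign-extend a raw 9-bit field (0..0x1FF) to the range -256..+255."""
    if not 0 <= imm9 <= 0x1FF:
        raise ValueError("imm9 must fit in 9 bits")
    # Bit 8 is the sign bit of a 9-bit two's-complement quantity.
    return imm9 - 0x200 if imm9 & 0x100 else imm9


def post_modify(r3: int, imm9: int) -> int:
    """Return the new pointer value: r3 <- r3 + sext(imm9)."""
    return (r3 + sext9(imm9)) & MASK64
```

A field value of 0x1F8 sign-extends to -8, so `post_modify(0x1000, 0x1F8)` steps the pointer back one double-precision element to 0xFF8, which is the postdecrement behavior described above.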

The stf8 form (i.e., integer form) stores the significand bits <63:0> from register f2 into the quad word memory location specified in register r3. If the significand corresponds to a quad word integer, the value could subsequently be brought from memory into a general register using an ld8 instruction (Section 4.5.3).

The spill form always copies register f2 into a 16-byte region in memory, which should be aligned on a 16-byte addressing boundary. The spill form can be used to save register contents when, for example, an operating system switches context from one process to another.

8.3.2 Floating-Point Load Instructions

There are three forms of floating-point load instructions for Itanium architecture: normal, integer, and fill, each having the following syntax:

    ldffsz.fldtype.ldhint   f1=[r3]        // f1 <- mem[r3]
    ldffsz.fldtype.ldhint   f1=[r3],r2     // f1 <- mem[r3]
                                           // r3 <- r3 + r2
    ldffsz.fldtype.ldhint   f1=[r3],imm9   // f1 <- mem[r3]
                                           // r3 <- r3 + sext(imm9)
    ldf8.fldtype.ldhint     f1=[r3]        // f1<63:0> <- mem[r3]
    ldf8.fldtype.ldhint     f1=[r3],r2     // f1<63:0> <- mem[r3]
                                           // r3 <- r3 + r2
    ldf8.fldtype.ldhint     f1=[r3],imm9   // f1<63:0> <- mem[r3]
                                           // r3 <- r3 + sext(imm9)
    ldf.fill.ldhint         f1=[r3]        // f1 <- mem[r3]
    ldf.fill.ldhint         f1=[r3],r2     // f1 <- mem[r3]
                                           // r3 <- r3 + r2
    ldf.fill.ldhint         f1=[r3],imm9   // f1 <- mem[r3]
                                           // r3 <- r3 + sext(imm9)

where fsz, written immediately after ldf in the mnemonic (giving ldfs, ldfd, or ldfe), is the size of the information unit at the address specified in register r3, from which the normal form converts a value and places it into register f1. The available values for fsz are s for 4-byte single-precision, d for 8-byte double-precision, and e for 10-byte double extended precision data. The simplest load instruction uses register indirect addressing for the source operand and register direct addressing for the destination operand.

Load operations can be susceptible to numerous exceptions, chief among them being an attempt to load unaligned data. To load 8-byte (double-precision) data, for example, we must ensure that the lowest 3 bits of the address held in register r3 are 0.

There are several values for fldtype (the load type completer). None at all corresponds to an ordinary load operation. Other types correspond to a check load, speculative load, or advanced load. We shall consider the advantages and potential drawbacks of speculative and advanced load instructions in a later chapter.

There are three values for ldhint (the load hint completer): none at all, nt1, and nta. None at all corresponds to an ordinary load operation; the processor receives the hint that the program associates temporal locality in the L1 cache with the value loaded, although early Itanium implementations do not use the L1 cache for floating-point data. At the other extreme, nta provides the hint that the program considers the value loaded to have nontemporal locality at all levels of cache and memory hierarchy, while nt1 provides the hint that the program considers the value loaded to have intermediate temporal locality. The use of nta may thus avoid knocking out of the caches other data that will be reused.

The Itanium floating-point load instructions provide for postmodification of the value in pointer register r3 by a signed adjustment ranging from -256 to +255 or by a full 64-bit signed amount in register r2. Many architectures provide an addressing mode known as autoincrement, where the pointer register is postincremented only by the byte size of data copied. Some architectures also offer an autodecrement addressing mode, usually with a predecrement, unlike the postdecrement discussed here.

The ldf8 form (i.e., integer form) loads 8 bytes from a quad word memory location specified in register r3 into the significand bits <63:0> of register f1. At the same time, the sign bit <81> is set to zero and the biased exponent field of bits <80:64> is set to the value 0x1003E (2⁶³) appropriate for interpretation of the significand as a 64-bit integer.
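The register image produced by ldf8 can be modeled in Python to show where the value 0x1003E comes from. This is an illustrative sketch only: `FpReg`, `ldf8`, `pack82`, and `value_of` are hypothetical names, memory is modeled as a `bytes` object, the significand is read as an unsigned quantity, and little-endian byte order is an assumption (the architecture supports either byte order).

```python
from typing import NamedTuple

BIAS = 0xFFFF              # bias of the 17-bit register exponent
INT_EXPONENT = BIAS + 63   # 0x1003E, i.e. an unbiased exponent of 63


class FpReg(NamedTuple):
    sign: int         # bit <81>
    exponent: int     # bits <80:64>, biased
    significand: int  # bits <63:0>


def ldf8(memory: bytes, address: int) -> FpReg:
    """Model ldf8: 8 bytes become the significand; sign and exponent are fixed."""
    raw = memory[address:address + 8]
    if len(raw) != 8:
        raise ValueError("need 8 readable bytes at address")
    # Assumption: little-endian byte order.
    return FpReg(0, INT_EXPONENT, int.from_bytes(raw, "little"))


def pack82(reg: FpReg) -> int:
    """Assemble the 82-bit register image from its three fields."""
    return (reg.sign << 81) | (reg.exponent << 64) | reg.significand


def value_of(reg: FpReg) -> float:
    """Interpret the fields with the binary point just below bit 63."""
    magnitude = reg.significand * 2.0 ** (reg.exponent - BIAS - 63)
    return -magnitude if reg.sign else magnitude
```

With the exponent fixed at 0x1003E the scale factor in `value_of` is exactly 1, so a quad word holding the integer 5 reads back as 5.0; this is the sense in which the exponent is appropriate for treating the significand as an integer.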

The fill form always accesses 16 bytes and converts the information in the appropriate 11 bytes into the register f1. The fill form can be used to restore register contents and is paired with stf.spill when an operating system switches context from one process to another.

8.3.3 Floating-Point Load Pair Instruction

There are three forms of floating-point load pair instructions for the Itanium architecture: single (s), double (d), and integer (8), each having the following syntax:

    ldfps.fldtype.ldhint   f1,f2=[r3]      // f1 <- mem[r3]
                                           // f2 <- mem[r3+4]
    ldfps.fldtype.ldhint   f1,f2=[r3],8    // f1 <- mem[r3]
                                           // f2 <- mem[r3+4]
                                           // r3 <- r3 + 8
    ldfpd.fldtype.ldhint   f1,f2=[r3]      // f1 <- mem[r3]
                                           // f2 <- mem[r3+8]
    ldfpd.fldtype.ldhint   f1,f2=[r3],16   // f1 <- mem[r3]
                                           // f2 <- mem[r3+8]
                                           // r3 <- r3 + 16
    ldfp8.fldtype.ldhint   f1,f2=[r3]      // f1<63:0> <- mem[r3]
                                           // f2<63:0> <- mem[r3+8]
    ldfp8.fldtype.ldhint   f1,f2=[r3],16   // f1<63:0> <- mem[r3]
                                           // f2<63:0> <- mem[r3+8]
                                           // r3 <- r3 + 16

For each form, data from two successive information units at the address specified by register r3 are converted and brought into two floating-point registers, f1 and f2. The sizes are 4 bytes (single form) or 8 bytes (double and integer forms).

One destination register must be odd-numbered and the other even-numbered, but they do not have to be consecutively numbered. Refer to Intel's "Instruction Set Reference" for the further restrictions that apply when either destination is a rotating register.

A load pair instruction can be optionally accompanied by a postmodification of the value in the pointer register r3 by an amount equivalent to the aggregate storage size of the pair of data elements, thus facilitating sequential access through a list of paired values.
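The data movement and optional pointer update of the single and double forms can be sketched with Python's `struct` module. The function names and the `post_increment` flag are hypothetical stand-ins for the two syntactic forms, memory is modeled as a `bytes` object, little-endian byte order is assumed, and alignment checking and the register-numbering rules are left out.

```python
import struct


def ldfps(memory: bytes, r3: int, post_increment: bool = False):
    """Model ldfps: two 4-byte singles at r3 and r3+4; optionally r3 += 8."""
    f1, f2 = struct.unpack_from("<2f", memory, r3)
    return f1, f2, (r3 + 8 if post_increment else r3)


def ldfpd(memory: bytes, r3: int, post_increment: bool = False):
    """Model ldfpd: two 8-byte doubles at r3 and r3+8; optionally r3 += 16."""
    f1, f2 = struct.unpack_from("<2d", memory, r3)
    return f1, f2, (r3 + 16 if post_increment else r3)


def sum_pairs(memory: bytes, count: int) -> float:
    """Walk `count` pairs of singles, letting the post-increment advance r3."""
    total, r3 = 0.0, 0
    for _ in range(count):
        a, b, r3 = ldfps(memory, r3, post_increment=True)
        total += a + b
    return total
```

The loop in `sum_pairs` shows why the increment equals the aggregate size of the pair: each call leaves the pointer at the start of the next pair with no separate add instruction.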

These load pair instructions accept the same values for fldtype (the load type completer) and ldhint (the load hint completer) as the standard floating-point load instructions (Section 8.3.2). Moreover, these instructions are subject to similar guidelines for alignment of the data on suitable addressing boundaries.

The purpose of the Itanium floating-point load pair instructions is to help offset the adverse performance effect of transfers from memory, which are slow when compared to the execution time of floating-point computations.

8.3.4 Floating-Point Pseudoinstructions for Register-Register Copying

We observed that a value could be copied from one general register to another as a special case of certain Itanium integer instructions. For example, adding a source value to zero results in the source value being copied to the destination register, as does a bitwise Boolean OR instruction. Thus the Itanium ISA can have a mov pseudo-op without needing distinct hardware support for move operations. We discussed several other special cases in Chapter 4.

Here we first introduce pseudo-ops involving the movement of data among floating-point registers, and then take up the actual Itanium instruction that serves as the origin for these especially useful cases. The pseudoinstructions for copying floating-point data within the CPU are:

    mov       f1=f3   // f1 <- f3
    fabs      f1=f3   // f1 <- abs(f3)
    fneg      f1=f3   // f1 <- - f3
    fnegabs   f1=f3   // f1 <- -abs(f3)

where f1 and f3 may be any of the floating-point registers. In these and other instructions for copying data, the source location still holds the original value, while the destination is overwritten to hold a copy of the value (mov) or a related value (fabs, fneg, fnegabs).

Special permanent values in Fr₀ and Fr₁

The Itanium architecture specifies that Fr₀ and Fr₁ will contain, respectively, read-only values of +0.0 and +1.0 at all times.

Clearing a floating-point register or setting it to ±1.0

It is self-evident that a mov (or fabs) operation with Fr₁ as the source results in a destination value of +1.0. Similarly, an fneg (or fnegabs) operation with a source of Fr₁ would place the value -1.0 in the destination register. Finally, a mov (or fabs) operation with a source of Fr₀ would place the value +0.0 in the destination register, effectively clearing that register, while an fneg (or fnegabs) operation with a source of Fr₀ would place the value -0.0 in the destination register.

8.3.5 Floating-Point Merge Instruction

Floating-point numbers are stored and manipulated as sign and magnitude quantities. The Itanium architecture provides several forms of the merge instruction for selective manipulation of the sign bit with or without the 17-bit biased exponent field of the register representation of floating-point quantities:

    fmerge.s    f1=f2,f3   // f1 <- sign(f2) with rest(f3)
    fmerge.ns   f1=f2,f3   // f1 <- -sign(f2) with rest(f3)
    fmerge.se   f1=f2,f3   // f1 <- sign&exp(f2) with rest(f3)

where f1, f2, and f3 may be any of the floating-point registers. In general, these instructions combine elements from the floating-point representations of two numbers in registers f2 and f3 in order to compose a new floating-point number for the destination.

Various choices of Fr₀ or register f3 as the source register f2 result in the special pseudo-op cases (Section 8.3.4). Other special cases that set register f1 to ±0.0 or ±1.0 derive from specifying Fr₀ and/or Fr₁ for f2 and f3 appropriately.
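The way the pseudo-ops of Section 8.3.4 fall out of the merge forms can be checked with a small Python model of the register fields. This is a sketch under stated assumptions: `FpReg` and the function names are hypothetical, registers are modeled as (sign, exponent, significand) triples, and special cases such as NaTVal are ignored.

```python
from typing import NamedTuple


class FpReg(NamedTuple):
    sign: int         # bit <81>
    exponent: int     # bits <80:64>, biased by 0xFFFF
    significand: int  # bits <63:0>


F0 = FpReg(0, 0, 0)              # Fr0: +0.0
F1 = FpReg(0, 0xFFFF, 1 << 63)   # Fr1: +1.0


def fmerge_s(f2: FpReg, f3: FpReg) -> FpReg:
    """Sign from f2; exponent and significand from f3."""
    return FpReg(f2.sign, f3.exponent, f3.significand)


def fmerge_ns(f2: FpReg, f3: FpReg) -> FpReg:
    """Inverted sign from f2; exponent and significand from f3."""
    return FpReg(1 - f2.sign, f3.exponent, f3.significand)


def fmerge_se(f2: FpReg, f3: FpReg) -> FpReg:
    """Sign and exponent from f2; significand from f3."""
    return FpReg(f2.sign, f2.exponent, f3.significand)


# Pseudo-ops expressed as special cases of the merge forms.
def mov(f3: FpReg) -> FpReg:
    return fmerge_s(f3, f3)      # own sign, own rest


def fneg(f3: FpReg) -> FpReg:
    return fmerge_ns(f3, f3)     # own sign inverted


def fabs(f3: FpReg) -> FpReg:
    return fmerge_s(F0, f3)      # Fr0 supplies a positive sign


def fnegabs(f3: FpReg) -> FpReg:
    return fmerge_ns(F0, f3)     # inverted Fr0 sign is negative
```

In this model `fneg(F1)` and `fnegabs(F1)` both yield the fields of -1.0, and `fneg(F0)` yields -0.0, matching the special cases listed in Section 8.3.4.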