AMD Llano A-Series: Architecture Analysis - Fetch and Decode Unit


Fetch and Decode Unit

The fetch and decode unit translates the old and complex x86-64 instruction set into more manageable macro-ops. Each macro-op can describe an arithmetic or logic operation (integer or floating point) and, at the same time, a memory operation: a read, a write, or a read-modify-write (an atomic operation, useful for implementing semaphores in operating system kernels).

Llano is not very different from its predecessor. It has two separate decoders: one for simple instructions, which decode into at most 2 macro-ops (the so-called DirectPath), and one for more complex instructions, which decode into 3 or more macro-ops (the so-called VectorPath).

When the instruction window (32 bytes, divided into two 16-byte halves) is read from the L1 instruction cache, the bytes are examined to determine whether each instruction is of the DirectPath or VectorPath type. This is one piece of the pre-decode information stored in the L1. The output of the decoders preserves the program order of the instructions.

As we will see later, Llano is an out-of-order architecture: it can execute instructions in an order different from program order to speed up execution.

Together, the decoders can produce up to three macro-ops per cycle, and in any given cycle all of them must come from only one of the two decoder types. The decoder outputs are merged and passed to the next unit, the Instruction Control Unit (ICU).

Since decoding a VectorPath instruction produces at least 3 macro-ops, and the macro-ops of only one decoder can be sent to the ICU at a time, decoding a VectorPath instruction stalls the decoding of DirectPath instructions. The number of DirectPath instructions that can be decoded in a cycle depends on their complexity.

The DirectPath decoder can decode any combination of DirectPath x86 instructions that results in a sequence of 2 or 3 macro-ops, given that decoding is done in program order. Three instructions can be decoded if there are 3 consecutive instructions that each produce one macro-op (so-called DirectPath Single). Two instructions can be decoded if one is a DirectPath Single and the other is a DirectPath Double (the name for simple instructions that generate two macro-ops). Only a single instruction can be decoded if there are two consecutive DirectPath Double instructions, which cannot be decoded together because that would produce 4 macro-ops, beyond the architecture's limit of 3.
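The grouping rule above can be sketched in a few lines of Python. This is only an illustration under the assumption that the decoder greedily packs instructions in program order until the 3 macro-op limit would be exceeded; the function name `group_directpath` and the list encoding (1 for a DirectPath Single, 2 for a DirectPath Double) are hypothetical, not taken from AMD documentation.

```python
MAX_MACRO_OPS_PER_CYCLE = 3  # per-cycle limit stated in the text


def group_directpath(macro_op_counts):
    """Split a program-order list of macro-op counts into per-cycle groups.

    Each element is 1 (DirectPath Single) or 2 (DirectPath Double).
    Assumption: instructions are packed greedily, in order, until adding
    the next one would exceed the 3 macro-op limit.
    """
    groups = []
    current = []
    used = 0
    for count in macro_op_counts:
        if count not in (1, 2):
            raise ValueError("a DirectPath instruction yields 1 or 2 macro-ops")
        if used + count > MAX_MACRO_OPS_PER_CYCLE:
            # The next instruction does not fit: close this cycle's group.
            groups.append(current)
            current, used = [], 0
        current.append(count)
        used += count
    if current:
        groups.append(current)
    return groups
```

For example, `group_directpath([2, 2])` puts each Double in its own cycle, while `group_directpath([1, 1, 1])` fits all three Singles in one.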

Another limit on the number of instructions decoded is that, in a given cycle, only one of the two 16-byte blocks of the 32-byte window can be accessed, so only the instructions contained in that block can be decoded. Since some instructions can be up to 15 bytes long, a 16-byte block may not contain enough instructions to keep all the decoders busy.
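To make the effect of the 16-byte block concrete, here is a rough sketch that counts how many instructions fit entirely inside one block. It assumes, for simplicity, that an instruction crossing the block boundary is not decoded in that cycle and that at most 3 instructions are decoded per cycle; the name `instructions_in_block` and its parameters are hypothetical.

```python
BLOCK_BYTES = 16          # size of the block accessed per cycle (from the text)
MAX_X86_LENGTH = 15       # maximum x86 instruction length (from the text)
MAX_DECODE_PER_CYCLE = 3  # at most 3 DirectPath instructions per cycle


def instructions_in_block(lengths, start_offset=0):
    """Count the instructions that fit entirely inside one 16-byte block.

    `lengths` holds the byte length of each instruction in program order.
    Assumption: an instruction that would cross the end of the block is
    left for a later cycle.
    """
    offset = start_offset
    count = 0
    for length in lengths:
        if not 1 <= length <= MAX_X86_LENGTH:
            raise ValueError("x86 instructions are 1 to 15 bytes long")
        if count == MAX_DECODE_PER_CYCLE or offset + length > BLOCK_BYTES:
            break
        offset += length
        count += 1
    return count
```

With three short instructions such as `[3, 4, 2]` all decoders are used, whereas a single 15-byte instruction leaves room for nothing else in the block.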

 

Branch Prediction

Branch prediction in Llano works the same way as in the previous generation. A new jump is predicted as not taken until it is actually taken once. From then on it is predicted as taken until that prediction turns out to be wrong. After these two mispredictions, the CPU starts to use the Branch Prediction Table (BPT).

The fetch logic accesses the L1 instruction cache and the BPT in parallel, and the information in the BPT is used to predict the direction of the jump. When instructions are moved to the L2 cache, the pre-decode information and the jump selectors (which record the state of each jump: never seen, taken once, or taken and then not taken) are copied and stored with them in place of the ECC code.
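The three selector states described above behave like a small state machine. The following Python sketch models it under that reading of the text; the names `Selector`, `predict` and `update` are hypothetical, and `table_prediction` stands in for whatever the BPT would predict.

```python
from enum import Enum


class Selector(Enum):
    NEVER_TAKEN = 0  # jump never seen taken: predict not taken
    TAKEN_ONCE = 1   # jump has been taken: predict taken
    DYNAMIC = 2      # taken and then not taken: defer to the table


def predict(selector, table_prediction):
    """Return True if the jump is predicted taken."""
    if selector is Selector.NEVER_TAKEN:
        return False
    if selector is Selector.TAKEN_ONCE:
        return True
    return table_prediction


def update(selector, taken):
    """Advance the selector after the real outcome of the jump is known."""
    if selector is Selector.NEVER_TAKEN and taken:
        return Selector.TAKEN_ONCE
    if selector is Selector.TAKEN_ONCE and not taken:
        return Selector.DYNAMIC
    return selector
```

Each transition corresponds to one misprediction, so a jump reaches the `DYNAMIC` state after exactly two wrong predictions.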

The branch prediction technique is based on a combination of a branch target buffer (BTB) with 2048 entries and a global history bimodal counter (GHBC) table with 16384 entries, each a 2-bit saturating counter used to predict whether a conditional branch is taken. The counter is incremented when the jump is taken and decremented when it is not, saturating at 0 and 3, so it summarises the recent behaviour of the jump: the jump is predicted as taken when the counter is in its upper half (2 or 3).
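A 2-bit saturating counter of the kind mentioned above is easy to model. The sketch below is the generic textbook version, not a description of AMD's exact circuit; the class name and the initial value are assumptions.

```python
class SaturatingCounter:
    """Generic 2-bit saturating counter (values 0 to 3)."""

    def __init__(self, value=1):
        if not 0 <= value <= 3:
            raise ValueError("a 2-bit counter holds values 0 to 3")
        self.value = value

    def predict_taken(self):
        # Upper half of the range (2 or 3) means "predict taken".
        return self.value >= 2

    def update(self, taken):
        # Move towards 3 on a taken jump and towards 0 otherwise,
        # without wrapping around at either end.
        if taken:
            self.value = min(self.value + 1, 3)
        else:
            self.value = max(self.value - 1, 0)
```

Because the counter saturates, a single unusual outcome in a long run of taken jumps does not flip the prediction.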

The GHBC is addressed with a combination of an unspecified number of conditional jump outcomes and the address of the last jump. This is a standard prediction technique in which the table index is a hash of the jump address combined with the outcome of the last n jumps.
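As an illustration of this kind of indexing, the sketch below uses the well-known gshare scheme, which XORs the jump address with a global history register. The actual hash used by Llano is not given in the text, so the XOR, the function names, and the `history_bits` parameter are all assumptions; only the table size of 16384 comes from the article.

```python
TABLE_ENTRIES = 16384  # number of GHBC entries stated in the text


def update_history(history, taken, history_bits):
    """Shift the latest outcome into a global history register.

    `history_bits` is the number of past outcomes kept; the text does
    not say how many Llano uses, so the caller must choose a value.
    """
    mask = (1 << history_bits) - 1
    return ((history << 1) | int(taken)) & mask


def table_index(branch_address, history):
    """gshare-style index (an assumed hash, not AMD's documented one)."""
    return (branch_address ^ history) & (TABLE_ENTRIES - 1)
```

Mixing history into the index lets the same jump use different counters depending on the path that led to it.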

The branch predictor also includes a return address stack (RAS) of 24 entries, which records the return addresses of procedure calls in order to predict the destinations of the matching returns. Finally, there is a table of 512 entries to predict indirect jumps, including those with multiple destinations.
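A return address stack can be sketched as a bounded LIFO. In the Python model below, the depth of 24 comes from the text, while the overflow policy (silently dropping the oldest entry) and the class and method names are assumptions made for illustration.

```python
from collections import deque


class ReturnAddressStack:
    """Bounded stack of predicted return addresses."""

    def __init__(self, depth=24):
        # deque with maxlen discards the oldest entry on overflow
        # (assumed policy; the text does not describe overflow handling).
        self._entries = deque(maxlen=depth)

    def on_call(self, return_address):
        # A CALL pushes the address of the instruction that follows it.
        self._entries.append(return_address)

    def predict_return(self):
        # A RET pops the most recent entry; None means "no prediction".
        return self._entries.pop() if self._entries else None
```

Nested calls deeper than 24 levels lose the outermost return addresses in this model, so the corresponding returns get no prediction.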

 

Sideband Stack Optimizer

This unit keeps track of the stack-pointer register, so that several instructions that use it as an input register can run in parallel (CALL, RET, PUSH, POP, addressing indexed by the stack pointer, and calculations that have the stack pointer as a source register).

The instructions that cannot be executed in parallel are those that have the stack register as their destination, those that use it in indexed addressing (in which calculations are performed on the register), and VectorPath instructions that use the register in any way (because of the difficulty of keeping track of the register within VectorPath instructions).
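The idea of tracking the stack pointer on the side can be illustrated with a running offset. The sketch below is a conceptual model only, assuming 8-byte pushes and pops in 64-bit mode; the class `StackTracker` and its methods are hypothetical and do not describe the actual hardware.

```python
class StackTracker:
    """Tracks the stack pointer as an offset from its last known value."""

    def __init__(self):
        self.delta = 0  # bytes relative to the last known stack pointer

    def push(self, size=8):
        # PUSH first decrements the stack pointer, then stores there.
        self.delta -= size
        return self.delta  # offset at which this PUSH writes

    def pop(self, size=8):
        # POP reads at the current offset, then increments.
        offset = self.delta
        self.delta += size
        return offset  # offset from which this POP reads

    def resync(self):
        # An instruction that writes the stack pointer directly makes
        # the tracked offset stale, so tracking restarts from zero.
        self.delta = 0
```

Since every PUSH or POP receives its offset immediately, it does not need to wait for the previous stack instruction to produce an updated register value, which is what allows them to proceed in parallel.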

 

Instruction Control Unit

The ICU is the control center of the processor. It controls the centralized reorder buffer for the instructions in flight, as well as the integer and floating point schedulers.

It is responsible for the dispatch of macro-ops (i.e. forwarding them to the appropriate scheduler); the retirement of macro-ops (i.e. determining and validating their results); the resolution of register and flag dependencies through renaming (a technique that lets unrelated instructions accessing the same register run in parallel); the management of execution resources, interrupts and exceptions (handled when macro-ops retire); and the handling of mispredicted jumps, which includes flushing the various queues and cancelling the operations in flight.

The ICU accepts up to 3 macro-ops per cycle from the decoders and places them in a centralized reorder buffer consisting of 3 lanes of 28 macro-ops each, up from 24 in the previous Stars architecture. The increase gives the upstream decoders more headroom: if downstream instructions cannot execute, for example because they are waiting for data from memory, this queue fills up quickly. Enlarging it reduces the time the decoders spend stalled, since they must stop when the queue is full.

This buffer can track up to 84 macro-ops, both integer and floating point. The ICU can issue macro-ops simultaneously to the various integer and floating point schedulers, which perform the final decoding and execution of the macro-ops.

Once execution completes, the ICU retires the instructions in program order and handles any exceptions, including mispredicted jumps.

It's worth noting that macro-ops can be executed out of order and in parallel, both within the same unit (integer or floating point) and across different units. Dispatch and retire, however, always occur in program order.
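The combination of in-order dispatch, out-of-order completion and in-order retire can be sketched with a simple queue. The Python model below is a simplification: the capacity of 84 (3 lanes of 28) comes from the text, while the class `ReorderBuffer`, its methods, and the use of integer ids for macro-ops are assumptions, and the per-cycle dispatch and retire widths are not modelled.

```python
from collections import deque

CAPACITY = 3 * 28  # 3 lanes of 28 macro-ops, i.e. 84 entries


class ReorderBuffer:
    def __init__(self, capacity=CAPACITY):
        self.capacity = capacity
        self._queue = deque()  # macro-op ids in program order, oldest first
        self._done = set()     # ids whose execution has completed

    def dispatch(self, op_id):
        """Add a macro-op in program order; False means the decoder stalls."""
        if len(self._queue) >= self.capacity:
            return False
        self._queue.append(op_id)
        return True

    def complete(self, op_id):
        """Mark a macro-op as executed; this may happen in any order."""
        self._done.add(op_id)

    def retire(self):
        """Retire finished macro-ops from the head, strictly in order."""
        retired = []
        while self._queue and self._queue[0] in self._done:
            op_id = self._queue.popleft()
            self._done.discard(op_id)
            retired.append(op_id)
        return retired
```

If macro-op 2 completes before macro-op 0, `retire()` returns nothing until 0 has completed, which mirrors how a long memory access at the head of the buffer holds back everything behind it and eventually fills the queue.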
