GPU

The GPU component of the Llano chip, code-named "Sumo", inherits many characteristics of AMD's "Redwood" GPU family, which is used in the Radeon HD 5500 and Radeon HD 5600 series.

Comparing the two images, we see that not much has changed between "Sumo" and "Redwood" in the computing core. There are 5 SIMD engines, each consisting of 16 thread processors and 4 texture units.

Each thread processor is a VLIW5 unit capable of issuing up to 5 operations per clock cycle, for a total of 400 calculations (5 SIMD engines × 16 thread processors × 5 operations) and 20 texture operations per clock cycle. In addition, there are two render back-ends, each capable of 4 raster operations (ROPs) per cycle.
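The per-clock totals follow directly from the unit counts quoted above. A minimal Python sketch of that arithmetic (the constant and function names are only illustrative):

```python
# Per-clock throughput of "Sumo", using the unit counts given in the text.
SIMD_ENGINES = 5
THREAD_PROCESSORS_PER_SIMD = 16
OPS_PER_THREAD_PROCESSOR = 5    # VLIW5: up to 5 operations per clock
TEXTURE_UNITS_PER_SIMD = 4
RENDER_BACK_ENDS = 2
ROPS_PER_BACK_END = 4


def per_clock_throughput():
    """Multiply out the unit counts to get the peak per-clock figures."""
    alu_ops = SIMD_ENGINES * THREAD_PROCESSORS_PER_SIMD * OPS_PER_THREAD_PROCESSOR
    texture_ops = SIMD_ENGINES * TEXTURE_UNITS_PER_SIMD
    raster_ops = RENDER_BACK_ENDS * ROPS_PER_BACK_END
    return {"alu_ops": alu_ops, "texture_ops": texture_ops, "raster_ops": raster_ops}


print(per_clock_throughput())
```

These are peak figures: they assume that all 5 slots of every VLIW5 unit are filled on every cycle.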

The figure shows the structure of the VLIW5 unit used in AMD graphics architectures. The special function unit can perform one floating-point (FP) multiply-add (MAD) per cycle, or one step of a special function (transcendental functions, exponentials, etc.) per cycle.

The remaining 4 execution units can together perform four 32-bit FP MADs, or two 64-bit FP additions or multiplications in any combination, or one 64-bit FP MAD, or four 24-bit integer additions or multiplications in any combination.

[Figure: "Sumo" GPU block diagram]

The comparison with "Redwood" shows that the latter has a 128-bit memory controller for DDR3 or GDDR5, a second-generation UVD and 4 display controllers, enough for a certain level of Eyefinity support. "Sumo" instead has a direct connection to the 128-bit DDR3 memory controller in the North Bridge, a third-generation UVD and 2 dedicated display controllers. If you give up the PCI Express x16 link, further video outputs can be configured, making Eyefinity possible.

Integrating the GPU on the die and connecting it directly to the North Bridge has helped the GPU, partly offsetting the fact that it shares the DDR3 controller with the CPU.

The GPU is in fact connected to the rest of the system by no fewer than 3 buses.

The first link is "Garlic", a direct, low-latency, 128-bit connection without coherency checks between the GPU and the RAM controller. It is arbitrated solely by the RAM controller front-end, which ensures high priority and low latency. The link presumably runs at the North Bridge frequency, which should be around 2 GHz: AMD quotes a bandwidth of more than 30 GB/s to the RAM controller, and 128 bits at 2 billion transfers per second gives 32 GB/s, slightly over that figure. It is an improvement over the 64-bit bus of the Ontario/Zacate GPU.
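The estimate of the "Garlic" frequency rests on simple arithmetic: bus width in bytes multiplied by the transfer rate. A small Python sketch, where the 2 billion transfers per second figure is the article's own guess and not a published specification:

```python
def peak_bandwidth_gb_s(bus_width_bits, transfers_per_second):
    """Peak bandwidth in GB/s (1 GB = 10**9 bytes) of a parallel bus."""
    bytes_per_transfer = bus_width_bits // 8
    return bytes_per_transfer * transfers_per_second / 1e9


# 128-bit link at the presumed 2 billion transfers per second
garlic = peak_bandwidth_gb_s(128, 2e9)

# A 64-bit link at the same (assumed) rate, for comparison only
half_width = peak_bandwidth_gb_s(64, 2e9)

print(garlic, half_width)
```

The 128-bit result lands just above the "more than 30 GB/s" quoted by AMD, which is why a frequency of around 2 GHz is a plausible guess.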

The second link is "Onion", which interfaces the GPU with the coherent part of the system, made up of the memory, the CPU cores and their caches, and governed by the MOESI protocol. This direct, low-latency connection is a significant step forward. With an external GPU, and especially with an IGP, a request had to travel over the bus between the graphics card (or the North Bridge, in the case of an IGP) and the CPU, wait for the CPU's integrated North Bridge to check the caches and possibly the RAM, and then wait for the results to be sent back. Now the two checks can be done independently and in parallel (through the Onion and Garlic interfaces) and with much lower latency.

The third link is a direct connection between the GPU and the GIO unit, giving access to the PCI Express controller for a possible CrossFire setup with a discrete card and for managing the video outputs. The CPU cores are not so lucky: to reach the GIO they must pass through the North Bridge and put up with its latencies and queuing.

Another great advantage of integrating the GPU on the same chip as the CPU concerns the exchange of data in memory. With conventional graphics cards, rendering a scene requires several data structures: vertex grids (meshes), color and lighting information, and the textures mapped to obtain the final color.

This information is created in memory by the 3D software and passed to the graphics card driver. Because processes work with virtual memory, the driver may even need to load the data from disk in order to access it. In any case, the data must then be transferred to the video card's on-board memory over the PCI Express bus.

The same happens for GPGPU calculations, where in addition the data must also travel the reverse route, to bring the results of the calculation back into main memory.

With an APU, all this is no longer necessary. The Llano APU implements 2 techniques to speed up these operations.

The first is called zero copy: the data is no longer copied, the GPU is simply told where it is and uses it directly. This is possible because the GPU can access the RAM directly. However, the virtual memory system moves data in and out of memory and relocates it at will, and Llano does not yet implement a sophisticated virtual memory manager that would provide a unified memory space. The next architecture will implement one and make the GPU a true vector coprocessor, much like the first FPUs that appeared years ago.

To avoid problems, the second technique is employed: Pin in Place. It locks the data structures in a specific area of memory, so that the GPU can find and access them safely and they are never swapped out to the hard disk, which would lengthen access times.

In practice, the CPU uses the classic virtual memory technique, demand paging, managed by the OS, while GPU paging is handled by the video driver, working in concert with the operating system to implement the two techniques above. Pin in Place does just that: the graphics driver locks the pages it needs in memory, through the normal mechanisms offered by the operating system, and then maps those pages into the GPU's virtual memory so that the GPU can access them.
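The interplay between demand paging and Pin in Place can be illustrated with a toy model. The Python sketch below is purely conceptual: the `PageTable` class, its methods and the oldest-first eviction policy are hypothetical and do not correspond to any real driver or operating system API.

```python
class PageTable:
    """Toy model of demand paging with pinned pages (not a real driver API)."""

    def __init__(self, resident_limit):
        self.resident_limit = resident_limit  # how many pages fit in RAM
        self.resident = []                    # pages in RAM, oldest first
        self.pinned = set()                   # pages that must never be evicted
        self.gpu_map = {}                     # GPU virtual page -> system page

    def touch(self, page):
        """Bring a page into RAM, evicting the oldest unpinned page if full."""
        if page in self.resident:
            return
        if len(self.resident) >= self.resident_limit:
            victim = next((p for p in self.resident if p not in self.pinned), None)
            if victim is None:
                raise MemoryError("all resident pages are pinned")
            self.resident.remove(victim)
        self.resident.append(page)

    def pin_for_gpu(self, page, gpu_page):
        """Lock a page in RAM, then map it into the GPU's address space."""
        self.touch(page)
        self.pinned.add(page)
        self.gpu_map[gpu_page] = page


table = PageTable(resident_limit=2)
table.pin_for_gpu("mesh", gpu_page=0)
table.touch("a")
table.touch("b")  # RAM is full: "a" is evicted, the pinned "mesh" stays
print(table.resident, table.gpu_map)
```

The point of the model is that eviction only ever picks unpinned pages, so the location recorded in the GPU mapping stays valid for as long as the page remains pinned.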

The GPU handles memory with more relaxed rules than the x86 CPU, allowing heavier reordering of accesses and therefore better performance. The GPU must follow the stricter CPU rules only when accessing pinned memory, to avoid interfering with the CPU; the CPU, in turn, relies on the driver to synchronize with the GPU.

Now for the details of how information is exchanged between the CPU and the GPU.

The GPU accesses memory in interleaved mode, i.e. blocks of consecutive data alternate between the RAM channels to maximize access bandwidth, while the CPU typically accesses memory in non-interleaved mode, or in any case in a mode optimized for latency.
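The difference between the two modes comes down to how an address is mapped to a RAM channel. A minimal Python sketch, in which the 256-byte block size and the 1 GiB per-channel range are illustrative assumptions and not Llano's actual parameters:

```python
def channel_interleaved(address, num_channels=2, block_bytes=256):
    """Interleaved mode: consecutive blocks alternate between channels."""
    return (address // block_bytes) % num_channels


def channel_linear(address, num_channels=2, channel_bytes=1 << 30):
    """Non-interleaved mode: each channel owns one contiguous address range."""
    return (address // channel_bytes) % num_channels


# Stream through 1 KiB of consecutive data, one block at a time
addresses = range(0, 1024, 256)
interleaved = [channel_interleaved(a) for a in addresses]
linear = [channel_linear(a) for a in addresses]
print(interleaved, linear)
```

In the interleaved case a sequential stream keeps both channels busy, which favors bandwidth. In the linear case the same stream stays on a single channel.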

CPU accesses to normal RAM follow the classic approach, coherent under the MOESI protocol. The memory reserved for the GPU is seen by the CPU as non-cacheable memory with write combining enabled. CPU writes to this memory follow the usual protocol for such accesses: the data passes over the "Onion" bus to the GPU, which then writes it to memory, at a higher speed than the roughly 8 GB/s of the PCI Express bus. CPU reads from GPU memory are rather slow, both because the memory is uncacheable, so that the read must go over the "Onion" bus and force all modified GPU data to be written back to RAM, and because only one such transaction can be in flight at any time.

The GPU accesses the RAM reserved for it in non-coherent mode, so it avoids all the overhead that coherency brings, and in interleaved mode across the memory channels. All this goes through the "Garlic" bus, which has high priority and a direct connection to the memory controller.

The driver instructs the GPU to read memory not reserved for it when external data is needed, such as meshes, textures or GPGPU data. It orders writes to that memory when the results of a GPGPU calculation must be passed back to the CPU, or when the frame buffer must be copied to RAM for further processing. Memory not reserved for the GPU is handled either in uncached mode or under the MOESI coherence protocol implemented by the CPU cores. If the memory is not cacheable, the caches need not be checked, so there is no need to use the "Onion" bus to reach the coherent domain. The "Garlic" bus can be used, but the memory must be locked (pinned), because the GPU does not yet implement in hardware a virtual memory mode compatible with the x86 cores. Moreover, this memory is typically not accessed in interleaved mode, so it is a bit slower, though not as slow as cached memory access, which must go through the "Onion" bus, be queued in the IFQ, and wait for the CPU caches to be checked. Here too, the memory must be locked, since the GPU works with physical addresses.
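The three cases described above can be summarized as a small decision table. The Python sketch below only restates the text; the `Memory` enum and the `gpu_access_path` function are hypothetical names, and the `False` in the pinning column for GPU-reserved RAM is an assumption (the text requires pinning only for memory not reserved for the GPU).

```python
from enum import Enum


class Memory(Enum):
    GPU_RESERVED = "RAM reserved for the GPU"
    SYSTEM_UNCACHED = "system RAM, uncached"
    SYSTEM_CACHED = "system RAM, cacheable (MOESI)"


def gpu_access_path(memory):
    """Return (bus, coherent, pinned_by_driver) for a GPU access."""
    if memory is Memory.GPU_RESERVED:
        # Non-coherent, interleaved, straight to the memory controller.
        # Assumption: the reserved region needs no page-by-page pinning.
        return ("Garlic", False, False)
    if memory is Memory.SYSTEM_UNCACHED:
        # No caches to check, so the coherent domain can be bypassed.
        return ("Garlic", False, True)
    # Cacheable memory: the CPU caches must be checked through "Onion".
    return ("Onion", True, True)


# Relative speed as described in the text, fastest first
FASTEST_TO_SLOWEST = [Memory.GPU_RESERVED, Memory.SYSTEM_UNCACHED, Memory.SYSTEM_CACHED]

for kind in FASTEST_TO_SLOWEST:
    print(kind.value, gpu_access_path(kind))
```

Only the cacheable case pays for the trip through the coherent domain, which is why the text ranks it as the slowest of the three.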

When future generations of GPU implement virtual memory in a manner identical to that of the x86 CPU, it will be possible to talk about a unified memory space, and such software tricks will no longer be necessary. It would even be possible to page graphics memory out to disk, provided that page fault signals can be exchanged between the GPU and the CPU.

In exceptional cases, these technologies allow higher speed than a discrete video card. In general, however, sharing the RAM controller makes the GPU a bit slower than its discrete equivalent.

Unlike the GPU in Sandy Bridge, this GPU fully supports DirectX 11 and angle-independent anisotropic filtering.

Other supported features are OpenGL 4.1, OpenCL 1.1 and the MSAA, SSAA and MLAA anti-aliasing techniques.

The picture shows a block diagram of the various generations of UVD. The third generation, implemented in Llano, can hardware-accelerate MPEG-4 Part 2 (which includes DivX and Xvid), MPEG-2 and the Multi-View Codec (MVC) used by Blu-ray 3D, which can be viewed through Llano's HDMI 1.4a port.

Decoding is done entirely in the UVD3 block, so the rest of the GPU can be put into a deep power-saving state (power gating), saving a lot of energy.

Dual Graphics

The last topic for the GPU is Llano's Dual Graphics technology. Like CrossFire, it lets you pair video cards to increase performance. Unlike CrossFire, however, it can balance the load between the GPUs, taking into account their differences in power.
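The article does not describe how the load is actually balanced. Purely as an illustration of the idea of weighting work by GPU power, the Python sketch below splits a batch of frames in proportion to relative speed. The `split_frames` function and the speed figures are hypothetical and are not AMD's algorithm.

```python
def split_frames(total_frames, relative_speed):
    """Distribute frames in proportion to each GPU's relative speed."""
    total_speed = sum(relative_speed.values())
    shares = {
        gpu: total_frames * speed // total_speed
        for gpu, speed in relative_speed.items()
    }
    # Frames lost to integer rounding go to the fastest GPU.
    leftover = total_frames - sum(shares.values())
    fastest = max(relative_speed, key=relative_speed.get)
    shares[fastest] += leftover
    return shares


# Hypothetical case: a discrete card twice as fast as the integrated GPU
print(split_frames(10, {"integrated": 1, "discrete": 2}))
```

An even split, by contrast, would leave the faster card waiting for the slower one, which is the limitation the text attributes to the DirectX 9 case.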

The mechanism is not perfect: it depends heavily on the quality and maturity of the drivers, and it works well only with DirectX 10 and 11. With DirectX 9, performance is equal to that of the slower card.

As we saw earlier, CrossFire/Dual Graphics and Eyefinity are mutually exclusive unless you settle for a smaller number of display interfaces: the x16 PCI Express controller can be split into two x8 controllers, one used for CrossFire and one for additional displays. Of course, getting the maximum number of video interfaces means giving up CrossFire support; conversely, full CrossFire support, with one x16 interface or with two x8 interfaces for a triple CrossFire, means giving up Eyefinity.

If the performance gap between the discrete card (or cards) and the integrated GPU is large, the gain over using the discrete card alone is fairly modest.

In compensation, AMD has implemented the drivers so that OpenCL 1.1 calls can be routed to the integrated GPU without burdening the discrete GPUs while they are working on a heavy 3D load.

AMD Llano A-Series: Architecture Analysis - GPU
