HSA and Kaveri
The idea is that current software is too heavily CPU-biased and doesn't properly utilize the compute capabilities of the graphics cores lying dormant on the die. Today's programs are inefficient at jumping back and forth between CPU and GPU cores, often requiring heavy code changes and time-consuming efforts to call back to the CPU from the GPU. The Kaveri APUs are designed to address this imbalance. AMD admits that the effort to better utilize both processing units didn't come without compromise: CPU frequencies at high TDPs took a hit, but AMD says those losses are offset by an IPC boost of up to 20% from the new Steamroller cores.
The Kaveri APU introduces a new way of looking at the CPU and the GPU cores. All cores on the die are now called Compute Cores. A compute core is a programmable hardware block that can independently run processes in its own context and virtual memory space. In the A10-7850K, we see four multi-threaded Steamroller CPU cores and eight GCN-based GPU cores that combine to make twelve Compute Cores. Kaveri and HSA use two technologies to make this combination possible: hUMA (heterogeneous Uniform Memory Access) and hQ (heterogeneous Queuing).
In Kaveri, hUMA means that the GPU and the CPU share virtual memory, and both processing units have uniform visibility into the entire memory space. This way, the CPU and GPU can share data without having to repackage it and send it off to GPU memory. This change should reasonably enhance the usability of the GPU cores even with current programming languages, since the GPU can access the same memory as the CPU.

On the other side, hQ makes the GPU an equal partner to the CPU. In the past, the GPU has always had its processes routed through the CPU for approval first, like a micro-managing administrator. The GPU, being the good little worker, couldn't take on new tasks or dispatch tasks on its own. With hQ, the GPU can now interact directly with applications and even send tasks to the queue, where they can be dispatched to either the CPU or the GPU. This eliminates a big bottleneck and lowers processing latency. Beyond making applications run faster and more efficiently, AMD also says that hQ leads to huge power savings. And since this is all handled in the architecture, programmers no longer have to write specifically for the GPU cores; they can now act just like CPU cores, creating and dispatching tasks on their own.
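The data-movement difference that hUMA removes can be modeled in a few lines of ordinary Python. This is a conceptual sketch only; the function names `copy_offload` and `shared_offload` are illustrative and not part of any AMD API, and real hUMA operates on hardware page tables rather than Python objects:

```python
# Conceptual model of pre-HSA offload vs. hUMA-style shared memory.
# Illustrative only: the "GPU" here is just a function, and the copy
# step stands in for marshalling data into discrete GPU memory.

def copy_offload(host_data):
    """Legacy path: data is repackaged and copied into GPU memory."""
    gpu_copy = list(host_data)             # explicit copy into "GPU memory"
    result = sum(x * x for x in gpu_copy)  # GPU works on its private copy
    return result                          # result copied back to the CPU

def shared_offload(host_data):
    """hUMA-style path: the GPU reads the CPU's data in place."""
    return sum(x * x for x in host_data)   # no copy; same virtual memory

data = [1, 2, 3, 4]
assert copy_offload(data) == shared_offload(data) == 30
```

Both paths compute the same answer; the point is that the copy step, and all the code needed to manage it, disappears when both processors see the same virtual address space.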
Because these HSA features (hUMA and hQ) implement capabilities expected in the OpenCL 2.0 standard, AMD is calling Kaveri the first OpenCL 2.0 capable chip and has laid out scenarios showing just how HSA can help in specific cases.
- Data pointers in Binary Tree Searches: Traditionally, the binary tree would have to be flattened, written to the GPU, and saved in the GPU's memory before the results could be written back to the CPU. With Kaveri, the binary tree can be accessed in place by the GPU and the search results written directly to the CPU. This greatly reduces code complexity.
- Platform atomics in Binary Tree Updates: Currently, the CPU and the GPU cannot be used simultaneously and, again, the tree must be flattened and written to the GPU memory for use. With Kaveri, both the CPU and the GPU can access the tree in place and work simultaneously.
- Large data sets: Historically, since the information accessed by the GPU must be written to GPU memory, only part of the data set could be loaded at a time. If the desired information sits at a lower level, the GPU memory must be cleared and the lower levels written into it for access. Because this is so expensive, the GPU often goes unused on large data sets. With Kaveri, the data sets can be accessed in place and the higher performance of the GPU can be utilized.
- CPU Callbacks: Legacy programming methods must run multiple kernels to check for potential callbacks because the GPU cannot call directly to the CPU. With Kaveri, these kernels are eliminated by allowing the GPU to call directly to the CPU.
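The first scenario above, searching a pointer-based tree in place, can be sketched with a small conceptual model. This is illustrative Python, not AMD's code: `flatten`, `search_flat`, and `search_in_place` are made-up names standing in for the legacy serialize-and-copy path and the HSA pointer-chasing path, respectively:

```python
# Conceptual model of the binary-tree search scenario. Pre-HSA, a
# pointer-based tree must be flattened into an array before the GPU
# can search it; with shared virtual memory, the GPU can follow the
# CPU's pointers directly.

class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def flatten(node, out):
    """Legacy prerequisite: serialize the tree for the GPU's memory."""
    if node:
        flatten(node.left, out)
        out.append(node.key)
        flatten(node.right, out)
    return out

def search_flat(keys, target):
    """The 'GPU' searches the flattened copy."""
    return target in keys

def search_in_place(node, target):
    """hUMA-style search: traverse the CPU's tree structure directly."""
    while node:
        if target == node.key:
            return True
        node = node.left if target < node.key else node.right
    return False

tree = Node(8, Node(3, Node(1), Node(6)), Node(10, None, Node(14)))
assert search_flat(flatten(tree, []), 6) == search_in_place(tree, 6) == True
assert search_in_place(tree, 7) == False
```

Both searches return the same answer, but the in-place version skips the flatten-and-copy step entirely, which is exactly the code-complexity reduction AMD is describing.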
AMD touts HSA as the method of choice for increasing performance in a wide spectrum of applications with heavily parallel workloads. These application areas include Natural UI and Gestures, Biometrics, Augmented Reality, AV Content Management, Content Everywhere, and Beyond HD.