Consequently, major silicon vendors now spend their transistor budget on symmetric multi-core processors for the mass market. This evolutionary path might suffice for two, four and perhaps eight cores, as users might run that many programs simultaneously. However, in the foreseeable future we will most likely get hundreds of cores. This is a major issue: If silicon vendors and application developers cannot give better performance to users with new hardware, the whole hardware and software market will go from selling new products to simply maintaining existing product lines.
Today’s multi-core CPUs spend most of their transistors on logic and cache, with a lot of power spent on non-computational units. Processor design has always been a rapidly evolving research field. Many of these same devices now offer ‘context-aware’ subsystems that allow the system to initiate highly advanced, task-enhancing decisions without prompting the user. For example, temperature, chemical, infrared and pressure sensors can evaluate safety risks and track a user’s health in dangerous environments. Precision image sensors and ambient light sensors can boost image resolution and display readability automatically as environmental conditions change.
These new capabilities significantly impact system design. To optimise decision-making, these devices must collect, transfer and analyse data as quickly as possible. The faster the system responds, the more accurately it can adapt to rapidly changing conditions. Furthermore, since ‘context-aware’ systems must be ‘always on’ to track changes in the environment, these new capabilities significantly drain system power.
To address this problem, a growing number of developers are adopting mobile heterogeneous computing (MHC) architectures. One of the primary reasons why system designers are moving to MHC is the ability it gives developers to move repetitive computation tasks to the most efficient processing resources so as to lower power consumption. For example, one key distinction between GPUs, CPUs and FPGAs is how they process data. GPUs and CPUs typically operate in a serial fashion, performing one calculation after another.
If designers want to reduce system latency to respond to sensor inputs in real time, they must accelerate the system clock and, in the process, increase power consumption. FPGAs, on the other hand, enable a system to perform calculations in parallel, which, in turn, reduces power consumption, particularly in compute-intensive repetitive applications.
The fact that FPGAs have thus far been used sparingly for these tasks can be attributed to a common misconception—many designers think of FPGAs as relatively large devices. However, this is not necessarily the case. Low-density FPGAs offer a number of other advantages in the current generation of intelligent systems.
The rapid proliferation of sensors and displays in today’s mobile devices presents new challenges from an input/output (I/O) interface perspective. Designers must integrate sensors and displays with a growing diversity of interfaces, including legacy systems using proprietary or custom solutions.
In many cases, designers can use low-density FPGAs or programmable application-specific standard products built around an FPGA fabric to aggregate data from multiple sensors onto a single bus or bridge between multiple disparate interfaces. With reprogrammable I/Os, these FPGAs are capable of supporting a wide variety of bridging, buffering and display applications. The recent emergence and rapid adoption of low-cost Mobile Industry Processor Interfaces (MIPI) such as CSI-2 and DSI has helped to simplify this task. By utilising the latest advances in I/O from the mobile computing market and MIPI together with the inherent advantages of low-density programmable logic in MHC architectures, designers can optimise their systems’ ability to collect, transfer and analyse this key resource.
CUDA architecture enables general-purpose computing on the GPU and retains traditional DirectX/OpenGL graphics. The dominance of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors. Processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels.
GPUs and FPGAs are becoming popular in PC-based heterogeneous systems for speeding up compute-intensive kernels of scientific, imaging and simulation applications. GPUs can execute hundreds of concurrent threads, while FPGAs provide customised concurrency for highly parallel kernels. However, exploiting the parallelism available in these applications is currently not a push-button task. Often the programmer has to expose the application’s fine and coarse grained parallelism by using special application programming interfaces (APIs).
OpenMP (Open Multi-Processing) API supports multi-platform shared memory multiprocessing programming in C, C++ and Fortran. It consists of a set of compiler directives, library routines and environment variables that influence run-time behaviour.
OpenACC (for open accelerators) is a programming standard for parallel computing developed by Cray, CAPS, Nvidia and PGI. The standard is designed to simplify parallel programming of heterogeneous CPU/GPU systems. The programmer can annotate C, C++ and Fortran source code to identify areas that should be accelerated using compiler directives and additional functions.
The four steps to accelerate the code include: Identify parallelism, express parallelism, express data locality and optimise.