Heterogeneous computing refers to systems which use more than one kind of processor or cores to maximise performance or energy efficiency. The processors incorporate specialised processing capabilities to handle particular tasks.
Heterogeneous architectures break away from the traditional evolutionary processor design path, posing new challenges and opportunities in high-performance computing.
Heterogeneous computing will continue to add many cores and hardware features such as transactional memory, random number generators, scatter/gather and wider single-instruction multiple-data/advanced vector extensions, ensuring synergies with big data, mobile and graphics markets. It has the potential to achieve greater energy efficiency by combining traditional processors with unconventional cores such as custom logic, field-programmable gate arrays (FPGAs) or general-purpose graphics processing units (GPUs). Instead of using just a single CPU or GPU, heterogeneous architectures add an application-specific integrated circuit (ASIC) or FPGA to perform highly dedicated processing tasks. Heterogeneous architectures achieve performance gain by parallelism instead of clock frequency.
There are four levels of parallelism in hardware: single-instruction single-data (SISD), single-instruction multiple-data (SIMD), multiple-instruction single-data (MISD) and multiple-instruction multiple-data (MIMD).
In SISD, there are tens to hundreds of data paths, which all run by one instruction stream. General-purpose host CPU executes the main application, with data transfer and calls to the SIMD processor for compute-intensive kernels. SIMD has dominated high-performance computing (HPC) since the time of Cray-1 supercomputer.
SIMD performance depends on hiding random-access memory latency, which may be hundreds of cycles, by accessing data in big chunks at a very high memory bandwidth. Data-parallel feed-forward applications are common in HPC. Large register files are provided in data paths to hold large regular data structures such as vectors.
It is a type of parallel computing architecture where many functional units perform different operations on the same data. Pipeline architectures belong to this type, though a purist might say that the data is different after processing by each stage in the pipeline. There is a sequence of data transmitted to the set of processors. Each processor executes a different instruction sequence.
MIMD class of parallel architecture is the most familiar and possibly most basic form of parallel processor. MIMD architecture consists of a collection of ‘n’ independent, tightly-coupled processors, each with a memory that may be common to all processors, and/or local and not directly accessible by the other processors.
Two subdivisions of MIMD are single-program multiple-data (SPMD) and multiple-program multiple-data (MPMD). Single-chip cell broadband engine architecture consists of a traditional CPU core and eight SIMD accelerator cores. It is a very flexible architecture, where each core can run separate programs in MPMD fashion and communicate through a fast on-chip bus. Its main design criteria has been to maximise performance whilst consuming minimal power. It defines a new processor structure based upon the 64-bit Power Architecture technology, but with unique features directed toward distributed processing and media-rich applications.
The cell broadband engine architecture defines a single-chip multiprocessor consisting of one or more power processor elements (PPEs) and multiple high-performance SIMD synergistic processor elements (SPEs). For example, a GPU with 30 highly multi-threaded SIMD accelerator cores in combination with a standard multicore CPU.
The GPU has a vastly superior bandwidth and computational performance, and is optimised for rerunning SPMD programs with little or no synchronisation. It is designed for high-performance graphics, where data throughput is the key. FPGA scan also incorporates regular CPU cores on-chip, making it a heterogeneous chip by itself. FPGAs can be viewed as user-defined ASICs that are reconfigurable. These offer fully deterministic performance and are designed for high throughput, for example, in telecommunication applications.
Accelerator cores are designed to maximise performance, given a fixed power or transistor budget. These use fewer transistors and run at lower frequencies than traditional CPUs.
Algorithms such as finite-state machines and other intrinsically serial algorithms are most suitable for single-core CPUs running at high frequencies. Parallel algorithms such as Monte Carlo simulations, on the other hand, benefit greatly from many accelerator cores running at a lower frequency. Most applications consist of a mixture of such serial and parallel tasks, and ultimately perform best on heterogeneous architectures.
Mobile heterogeneous computing
Mobile systems are more intelligent than ever. As users demand more functionality, designers are continually adding to a growing list of embedded sensors. Image sensors support functions such as gesture and facial recognition, eye tracking, and proximity, depth and movement perception. Health sensors monitor the user’s EKG, EEG, EMG and temperature. Audio sensors add voice recognition, phrase detection and location-sensing services.