Optimising peak power guided by profile results
The technique can also be used to guide application-based optimisation by analysing the processor’s behaviour during the cycles of peak power consumption. Three different optimisations can then be applied as appropriate:
1. Replace a complex instruction that induces a lot of activity in one cycle with a sequence of simpler instructions, thus spreading out the activity over several cycles
2. Delay the activation of one or more modules, previously activated in a peak cycle, until a later cycle
3. Avoid the multiplier being active simultaneously with the processor core by inserting a no-operation instruction (NOP) into the pipeline during the cycle in which the multiplier is active
Taking combined figures across the benchmarks, these techniques reduced peak power by up to 10 per cent (5 per cent on average) with up to 34 per cent (18 per cent on average) reduction in peak power dynamic range.
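As a rough illustration of the third optimisation, the sketch below models a per-cycle power trace and moves the core's work out of any cycle in which the multiplier is active. The module names, power values and baseline draw are hypothetical, and 'dynamic range' is assumed here to mean peak power minus minimum power; none of these come from the measurements quoted above.

```python
# Hypothetical per-module power draw in milliwatts; illustrative values only.
MODULE_POWER_MW = {"core": 3.0, "multiplier": 4.0, "memory": 2.0}
IDLE_POWER_MW = 1.0  # assumed baseline drawn in every cycle


def cycle_power(active_modules):
    """Power of one cycle: baseline plus the draw of each active module."""
    return IDLE_POWER_MW + sum(MODULE_POWER_MW[m] for m in active_modules)


def profile(trace):
    """Return (peak, dynamic_range) for a trace of per-cycle module sets.

    Dynamic range is assumed to be peak minus minimum cycle power.
    """
    powers = [cycle_power(cycle) for cycle in trace]
    return max(powers), max(powers) - min(powers)


def separate_multiplier(trace):
    """Split every cycle in which the core and multiplier overlap.

    The multiplier cycle keeps a NOP in the core, and the displaced
    core work runs in an extra cycle inserted immediately afterwards.
    """
    out = []
    for cycle in trace:
        if {"core", "multiplier"} <= cycle:
            out.append(cycle - {"core"})  # multiplier active, core executes a NOP
            out.append({"core"})          # displaced core work, one cycle later
        else:
            out.append(cycle)
    return out


trace = [{"core"}, {"core", "memory"}, {"core", "multiplier"}, {"core"}]
print(profile(trace))                       # before the optimisation
print(profile(separate_multiplier(trace)))  # after the optimisation
```

In this toy model the peak falls because the multiplier and core no longer draw power in the same cycle, at the cost of one extra cycle of execution time.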
High-level synthesis (HLS) tools such as Cadence Stratus and Mentor Catapult help to generate hardware for a given application. However, unlike bespoke processor design, HLS involves additional development cost, since a new high-level specification of application behaviour must be written and that specification must itself be verified. In addition, while HLS tools can transform many C programs into efficient ASICs, they have well-known limitations that further increase development costs.
Cadence Stratus HLS delivers up to ten times better productivity than traditional RTL design. It lets you quickly design and verify high-quality RTL implementations from abstract SystemC, C or C++ models. Using the platform, you can reduce the intellectual property (IP) development cycle from months to weeks.
You can easily create abstract models using Stratus HLS’ integrated design environment (IDE) and synthesise optimised hardware from those models. You can then retarget these models to new technology platforms and reuse them more easily than you could with traditional hand-coded RTL. You can actively make tradeoffs between power, area and performance from within the HLS environment. Users have reported productivity as high as two million verified gates/designer/year compared to 200,000 with the traditional RTL flow.
The Catapult HLS platform empowers designers to use industry-standard ANSI C++ and SystemC to describe functional intent and move up to a more productive abstraction level. From these high-level descriptions, Catapult generates production-quality RTL. By speeding time to RTL and by automating the generation of bug-free RTL, it significantly reduces the time to verified RTL.
The Catapult platform pairs synthesis with the power of formal C property checking to find bugs early at the C, C++ or SystemC level and to comprehensively verify source code before synthesis. Its advanced power optimisations automatically provide significant reductions in dynamic power consumption. The highly-interactive Catapult workflow provides full visibility and control of the synthesis process, enabling designers to rapidly converge upon the best implementation for power, performance and area.
Dynamic memory allocation, pointer ambiguity, memory parallelism extraction and efficient schedule creation for arbitrary C programs are all challenges for HLS. In fact, most commercial tools limit the use of pointers and dynamic memory allocation, requiring additional hardware-aware design development to create a working ASIC from a C program.
In contrast, a bespoke processor tool flow automatically creates a bespoke processor from the original, already verified gate-level netlist and the application binary, without further design work. Also, unlike HLS, it can generate a design that supports multiple applications on the same hardware (including a full-fledged OS) and can support in-field updates. Thus a bespoke processor design flow can decrease design and verification effort and allow increased programmability compared with HLS tool flows.
Statically-specialised cores such as conservation cores, QsCores and GreenDroid automatically develop hardware implementations that are connected to a general-purpose processor at the data cache and target hotspots within an application's code. GreenDroid, a prototype mobile application processor chip, leverages ‘dark silicon’ to dramatically reduce energy consumption in smartphones. It provides many specialised processors targeting key portions of Google’s Android smartphone platform. GreenDroid reduces energy consumption for these codes by integrating conservation cores (c-cores). Such cores increase energy efficiency at the expense of increasing the total area of a design, and thus may not be a good fit for area-constrained applications.
Reconfigurable architectures such as DySER can also increase energy efficiency by mapping frequently-executed code segments onto tightly-coupled reconfigurable execution units. However, increased energy efficiency comes with an increase in area and power for the additional reconfigurable units.
Extensible processors, such as Xtensa, allow designers to specify configurations including structure sizing, optional modules and custom application-specific functional units. Such processors are limited in the extent to which they can reduce area and power, since these customisations are applied primarily at the module level. Furthermore, the process is not fully automated and requires additional hardware design effort.
Compared with extensible application-specific processors, bespoke processors can reduce power further because they can remove gates within modules, and they require less manual design effort.
Chip generators can be used to generate chip families from the ground up for a particular application domain by allowing domain expert hardware designers to encode domain-specific knowledge into tools that design application-specific chips within the same domain.
Like HLS, this approach still requires a domain expert to design the overarching hardware in an HLS-like manner and then specify functions that allow arbitrary elaboration of the hardware design (for example, encoding optimisation functions for determining lower-level parameters such as cache associativity). Therefore chip generators require a change in the design process, while tailoring bespoke processors to applications can be completely automated from a program binary and processor netlist.
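To make the idea of an elaboration function concrete, the sketch below shows the kind of rule a domain expert might encode to pick cache associativity. It is a minimal illustration rather than the interface of any real chip generator: the `CacheConfig` type, the function names and the miss-rate figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CacheConfig:
    """Hypothetical low-level parameters produced by elaboration."""
    size_bytes: int
    line_bytes: int
    associativity: int


def choose_associativity(miss_rate_by_ways, max_miss_rate):
    """Return the smallest associativity whose miss rate meets the target.

    miss_rate_by_ways maps a number of ways to an estimated miss rate
    for the target domain. If no option meets the target, fall back
    to the largest associativity available.
    """
    for ways in sorted(miss_rate_by_ways):
        if miss_rate_by_ways[ways] <= max_miss_rate:
            return ways
    return max(miss_rate_by_ways)


def elaborate_cache(size_bytes, line_bytes, miss_rate_by_ways, max_miss_rate):
    """Turn a high-level request into a concrete cache configuration."""
    ways = choose_associativity(miss_rate_by_ways, max_miss_rate)
    return CacheConfig(size_bytes, line_bytes, ways)


# Illustrative miss-rate estimates for an imagined application domain.
estimates = {1: 0.12, 2: 0.07, 4: 0.04, 8: 0.035}
print(elaborate_cache(16 * 1024, 32, estimates, max_miss_rate=0.05))
```

Even in this small case, someone has to write and validate the rule and supply the estimates, which is the manual design effort that the chip generator approach still requires.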
Simulate-and-eliminate attempts to create a design tailored to an application by simulating the target application with a user-provided set of inputs on multiple base designs. Logic and interconnect components that are not used by the application are removed.
Simulate-and-eliminate differs from bespoke processors in three fundamental ways: level of automation, scope of elimination and correctness guarantees.
First, simulate-and-eliminate requires significant user input to guide the selection of core parameters, selection of bit widths and definition of optimisations. Bespoke processors require no user intervention.
Second, simulate-and-eliminate only considers high-level, manually-identified components when determining what is used by a processor, and consequently does not achieve area and power reductions as large as those of fine-grained bespoke processor tailoring.
Third, simulate-and-eliminate relies on user-specified inputs to determine components that are never used by an application. This means it cannot guarantee safe optimisation for applications where inputs affect control flow. Additionally, simulate-and-eliminate cannot determine whether an unsafe elimination is performed. Bespoke processor tailoring guarantees correctness by considering all possible application inputs when determining which gates to remove.
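The third difference can be seen in a toy model. In the sketch below, a made-up application exercises different gates depending on its input, and elimination based on a handful of sample inputs is compared with elimination that accounts for every possible input. The gate names and the application are hypothetical, and brute-force enumeration of all 8-bit inputs merely stands in for whatever analysis a real bespoke flow performs.

```python
# Hypothetical netlist, reduced to a set of named gate groups.
NETLIST = {"adder", "shifter", "multiplier", "divider"}


def gates_used(x):
    """Toy application: the gates exercised for an 8-bit input x."""
    used = {"adder"}
    if x >= 200:            # input-dependent control flow
        used.add("multiplier")
    if x % 2 == 0:
        used.add("shifter")
    return used


def simulate_and_eliminate(sample_inputs):
    """Keep only the gates exercised by the user-provided inputs."""
    kept = set()
    for x in sample_inputs:
        kept |= gates_used(x)
    return kept


def tailor_for_all_inputs():
    """Keep every gate that any possible 8-bit input could exercise."""
    return simulate_and_eliminate(range(256))


sampled = simulate_and_eliminate([3, 10, 57])
safe = tailor_for_all_inputs()
print(sorted(NETLIST - sampled))         # gates removed using sample inputs
print(sorted(NETLIST - safe))            # gates removed considering all inputs
print(sorted(gates_used(250) - sampled)) # gates input 250 needs but sampling removed
```

With these sample inputs, the sampled design drops a gate group that an unsampled input later needs, and nothing in the sampling procedure reveals the problem. The exhaustive variant removes only the group that no input can reach.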