Read Part 1
Over 20 billion connected devices expected in next five years will require a special class of processors that will have ultra-low-power and space requirements
More complex processors contain more performance-enhancing features such as large caches, prediction or speculation mechanisms, and out-of-order execution that introduce non-determinism into the instruction stream. Co-analysis is capable of handling this added non-determinism at the expense of analysis tool runtime. For example, by injecting an X as the result of a tag check, both the cache hit and miss paths are explored in the memory hierarchy. Similarly, since co-analysis already explores taken and not-taken paths for input-dependent branches, it can be adapted to handle branch prediction.
In an out-of-order processor, instruction ordering is based on the dependence pattern between instructions. While instructions may execute in different orders depending on the state of pipelines and schedulers, a processor that starts from a known reset state and executes the same piece of code will transition through the same sequence of states each time. Thus, modifying input-independent control flow graph (CFG) exploration to perform input-independent exploration of the data flow graph (DFG) may allow analysis to extend to out-of-order execution.
For complex applications, CFG complexity increases. This may not be an issue for simple in-order processors, since the maximum length of instruction sequences (CFG paths) that must be considered is limited based on the number of instructions that can be resident in the processor pipeline at once. However, for complex applications running on complex processors, heuristic techniques may have to be used to improve scalability.
In a multi-programmed setting (including systems that support dynamic linking), we take the union of toggle activities of all applications (caller, callee and the relevant OS code in case of dynamic linking) to get a conservative profile of unusable gates. Similarly, for self-modifying code, the set of usable gates for the processor is chosen as the union of usable gate sets for all code versions. In case of fine-grained execution, any state that is not maintained as part of a thread’s context is assumed to have a value of X when symbolic execution is performed for an instruction belonging to the thread. This leads to a conservative coverage of usable gates for the thread, irrespective of the behaviour of the other threads.
Bespoke processor design
Bespoke processor design is a novel approach to reducing processor area and power consumption without any degradation in performance. In this approach, a processor is tailored for an application such that it consists of only the gates required by the application for any possible execution with any possible inputs. A bespoke processor still runs the unmodified application binary without any performance degradation. Symbolic gate-level simulation-based methodology takes the original microprocessor IP and application binary as input to produce a design that is functionally-equivalent to the original processor from the perspective of the target application while consisting of the minimum number of gates needed for execution.
A large class of emerging applications is characterised by severe area and power constraints. For example, wearables and implantables are extremely area- and power-constrained. Several IoT applications such as stick-on electronic labels, RFIDs and sensors are also extremely area- and power-constrained. Area constraints are expected to be severe also for printed plastic and organic applications.
Cost concerns drive many of these applications to use general-purpose microprocessors and microcontrollers instead of much more area- and power-efficient ASICs, since, among other benefits, development cost of microprocessor IP cores can be amortised by the IP core licensor over a large number of chip makers and licensees. In fact, ultra-low-area- and power-constrained microprocessors and microcontrollers powering these applications are already the most widely used type of processing hardware in terms of production and usage, in spite of their well-known inefficiency compared to ASIC and FPGA-based solutions.
Given this mismatch between the extreme area and power constraints of emerging applications and the relative inefficiency of general-purpose microprocessors and microcontrollers compared to their ASIC counterparts, there exists a considerable opportunity to make microprocessor-based solutions for these applications much more area- and power-efficient.
The bespoke processor design methodology relies on gate-level symbolic simulation to identify gates in the microprocessor IP that cannot be toggled by the application, irrespective of the application inputs, and automatically eliminates them from the design to produce a significantly smaller and lower power design with the same performance. In many cases, reduction in the number of gates also introduces timing slack, which can be exploited to improve performance or further reduce power and area. Since the original design is pruned at the granularity of gates, the resulting methodology is much more effective than any approach that relies on coarse-grained application-specific customisation.
This methodology can be used either by IP licensors or IP licensees to produce bespoke designs for the application of interest. Its simple extensions can be used to generate bespoke processors that can support multiple applications or different degrees of in-field software programmability, debuggability and updates.