FPGAs in Data Centres: Opportunities and Challenges (Part 2 of 2)

V.P. Sampath is a senior member of IEEE and a member of Institution of Engineers India. He is currently working as technical architect at AdeptChips, Bengaluru. He is a regular contributor to national newspapers, IEEE-MAS section, and has published international papers on VLSI and networks -- Dr V.N. Ramakrishnan is an associate professor in Department of Micro & Nanoelectronics, VIT University, Vellore


The Accelerator-6D PCIe form-factor accelerator board (Source: Achronix)

Fig. 20: The Accelerator-6D PCIe form-factor accelerator board (Source: Achronix)

Companies are collaborating on the specification for the new Cache Coherent Interconnect for Accelerators (CCIX). A single interconnect technology specification will ensure that processors using different instruction-set architectures (ISAs) can coherently share data with accelerators and enable efficient heterogeneous computing, significantly improving compute efficiency for servers running data centre workloads. CCIX will allow these components to access and process data irrespective of where it resides, without the need for complex programming environments. This will enable both off-load and bump-in-the-wire inline application acceleration while building on existing server ecosystems and form factors, thereby lowering software barriers and improving the total cost-of-ownership of accelerated systems. CCIX will enable a new breed of high-density, flexible platforms for accelerating compute, storage and networking applications.

SuperVessel platform

Fig. 21: SuperVessel platform

Intel has presented the future of Xeon processors with integrated FPGAs in a single-package socket compatible to the standard Xeon E5 processors. FPGAs provide users with a programmable, high-performance coherent acceleration capability to turbo-charge their critical algorithms. Algorithms can be changed as new workloads emerge and compute demands fluctuate. The new hardware combination could deliver over 20 times performance gains compared to more traditional ASIC-based solutions.

SuperVessel roadmap

Fig. 22: SuperVessel roadmap

Reprogrammable FPGAs play a central role, enabling a more flexible infrastructure with software-defined allocation and prioritisation of virtualised computing, networking and storage resources. In effect, each FPGA acts as a node controller, for different algorithms to be accelerated across different FPGAs as shared resources in a data centre.

FPGAs allow users to tailor compute power to specific workloads or applications. Xilinx has added support for the proposed interface to its 16nm UltraScale+ product plans, combining its programmable logic with High Bandwidth Memory (HBM) and new accelerator interconnect technology for heterogeneous computing. Built on TSMC’s chip-on-wafer-on-substrate process, Xilinx HBM-enabled FPGAs will improve acceleration capabilities by offering ten times higher memory bandwidth relative to discrete memory channels. HBM technology enables multi-terabit memory bandwidth integrated in a package for the lowest latency.

SuperVessel architecture

Fig. 23: SuperVessel architecture

Vineyard employs widely used data analytics frameworks such as Spark that make it easier for cloud users to access accelerators.

IBM’s SuperVessel cloud platform, built on top of Power/OpenPower architecture technologies, provides remote access to all the ecosystem developers. Built on OpenStack, the cloud enables:

1. Latest infrastructure-as-services, including PowerKVM, containers and docker services, with big endian and little endian options

2. Big data service through collaboration with IBM big data technology for Hadoop 1.0 and open source technology for Hadoop 2.0 (Spark service)

3. Internet-of-Things application platform service which has successfully incubated several projects in healthcare, smart city, etc

4. Accelerator-as-service (FPGA virtualisation) with the novel marketplace, through collaboration with Altera

In SuperVessel platform, FPGAs and GPUs provide acceleration services, while OpenStack manages the whole cloud.

OpenCL flow

Fig. 24: OpenCL flow


OpenCL flow allows code written in the increasingly popular parallel-friendly GPU-inspired C dialect to be targeted to Altera FPGAs. Xilinx, on the other hand, has been showing a broader play. For hardware engineers, Vivado is a comprehensive suite of design, analysis and implementation tools.

SDAccel development environment supports OpenCL, C and C++ for software engineers using FPGAs. It enables up to 25 times better performance/watt for data centre application acceleration leveraging FPGAs. High-level synthesis (HLS) provides maximum benefit from the FPGA’s hardware programmability.

Fig. 25: SDAccel environment

Altera’s OpenCL implementation creates what one might think of as a GPU on an FPGA—with processing elements analogous to GPU cores that execute software kernels in parallel. This solution is easy to program, as it mimics many elements of a GPU, but doesn’t achieve the same performance and quality-of-results as Xilinx’s HLS-based approach. HLS has been repeatedly shown to deliver hardware architectures on par with what skilled experts can create with hand-coded, manually-optimised RTL.

HLS can quickly generate the RTL code, which synthesises down into your chosen architecture. But encapsulating all that power into something that appears to software engineers as a ‘compiler’ is a substantial task.

HLS supports several languages (C, C++ and OpenCL), all at the same time, with the same implementation back-end. There are certainly vast application areas that use C and C++ rather than OpenCL. In order to create a ‘software-like’ development and runtime environment, several pieces of the normal FPGA flow had to be improved.

First, for testing and debugging code, the normal HLS, logic synthesis, place-and-route, timing-closure and device programming cycle would be far too clunky for the rapid-iteration development software engineers are accustomed to. To get around this, Xilinx created a software environment that would be comfortable to any Eclipse-using programmer, and then set up their HLS tool to spit out software-executable cycle-accurate versions of the HLS architecture. This allows the ‘accelerated’ algorithm to execute completely in software for debug and testing purposes, without the long logic synthesis and place-and-route runs required for the final hardware-based implementation.


Please enter your comment!
Please enter your name here