Wednesday, March 22, 2023

FPGAs in Data Centres: Opportunities and Challenges (Part 1)

V.P. Sampath is a senior member of IEEE and a member of the Institution of Engineers India. He is currently working as a technical architect at AdeptChips, Bengaluru. He is a regular contributor to national newspapers and the IEEE-MAS section, and has published international papers on VLSI and networks. Dr V.N. Ramakrishnan is an associate professor in the Department of Micro & Nanoelectronics, VIT University, Vellore.

Fig. 6: SMT4 vs SMT8 core

Heterogeneous computing approaches for cloud data centres

In heterogeneous systems, different kinds of processors—x86 CPUs and FPGAs—cooperate on a computing task. Heterogeneous architectures are already widely used in the mobile world, where ARM CPUs, GPUs, encryption engines and digital signal processing (DSP) cores routinely work together, often on a single die.

There are two approaches to establishing heterogeneous computing systems in cloud data centres.

The first approach is based on heterogeneous super nodes that tightly couple compute resources to multi-core CPUs and their coherent memory system via high-bandwidth, low-latency interconnects such as CAPI or NVLink. CAPI is carried over PCIe, so it uses the same bus architecture, but it is designed to deliver higher effective bandwidth and lower overall latency. NVLink is a high-bandwidth, energy-efficient interconnect for communication between the CPU and GPU, and between GPUs. It is quoted as allowing data sharing five to twelve times faster than the traditional PCIe Gen 3 interconnect, which speeds up applications and enables a new class of high-density, flexible servers for accelerated computing.
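To put the "five to twelve times" figure in context, the following back-of-the-envelope sketch computes the theoretical one-direction bandwidth of a PCIe Gen 3 link (8 GT/s per lane with 128b/130b encoding) and scales it by the quoted factors. The x16 link width is an assumption, since the article does not say which PCIe configuration the comparison uses, and real throughput is lower once protocol overhead is included.

```python
# PCIe Gen 3 signalling: 8 GT/s per lane with 128b/130b line encoding.
TRANSFERS_PER_S = 8e9
ENCODING_EFFICIENCY = 128 / 130
LANES = 16  # assumption: a x16 slot


def pcie_gen3_gb_per_s(lanes: int = LANES) -> float:
    """Theoretical one-direction bandwidth in GB/s (decimal gigabytes)."""
    bits_per_s = TRANSFERS_PER_S * ENCODING_EFFICIENCY * lanes
    return bits_per_s / 8 / 1e9


base = pcie_gen3_gb_per_s()
low, high = 5 * base, 12 * base
print(f"PCIe Gen 3 x16: {base:.2f} GB/s")
print(f"5x to 12x of that: {low:.1f} to {high:.1f} GB/s")
```

Under these assumptions the baseline is a little under 16 GB/s, so the quoted range corresponds to roughly 80 to 190 GB/s.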

Fig. 7: CAPI FPGA accelerator (based on a standard PCIe accelerator)

The second approach is based on the disaggregation of data centre resources where the individual compute, memory and storage resources are connected via the network fabric and can be individually optimised and scaled in line with the cloud paradigm.

CAPI on Power9

CAPI on a Power9 system provides a high-performance way to implement client-specific, computation-heavy algorithms on an FPGA. It can replace either application programs running on a core or custom acceleration implementations attached via input/output (I/O). Applications for CAPI include Monte Carlo algorithms, key-value stores, and financial and medical algorithms. CAPI removes the overhead and complexity of the I/O subsystem, allowing an accelerator to operate as part of an application.

The specific algorithm to be accelerated is contained in a unit on the FPGA called the accelerator function unit (AFU, or simply the accelerator). The AFU gives applications a higher density of computational units for customised functions, which improves application performance and offloads the host processor. Using an AFU allows cost-effective processing over a wide range of applications. CAPI can also be used as a base for flash memory expansion, as is the case for IBM Data Engine for NoSQL Power Systems Edition.

Fig. 8: Coherent accelerator processor interface

The coherent accelerator processor proxy (CAPP) unit maintains a directory of all cache lines held by the off-chip accelerator, allowing it to act as the proxy that maintains architectural coherence for the accelerator across its virtual memory space. A Power service layer (PSL) resides on the FPGA alongside the acceleration engine and works in concert with the CAPP unit across a PCIe connection. The PSL presents a straightforward command interface to the client-written accelerator, giving it access to coherent memory.
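The directory idea can be illustrated with a toy software model. This is only a sketch of the bookkeeping described above, not how the CAPP hardware is built: the class name, its methods and the 128-byte line size are assumptions made for illustration.

```python
CACHE_LINE = 128  # bytes; assumed line size for this illustration


class ToyCappDirectory:
    """Tracks which cache lines the off-chip accelerator currently holds."""

    def __init__(self):
        self.held_lines = set()

    @staticmethod
    def line_of(addr: int) -> int:
        return addr // CACHE_LINE

    def accelerator_read(self, addr: int) -> None:
        # The accelerator takes a copy of the line; the directory records it.
        self.held_lines.add(self.line_of(addr))

    def host_write(self, addr: int) -> bool:
        """Return True if the accelerator's copy had to be invalidated."""
        line = self.line_of(addr)
        if line in self.held_lines:
            self.held_lines.discard(line)
            return True
        return False


directory = ToyCappDirectory()
directory.accelerator_read(0x1000)
same_line = directory.host_write(0x1040)   # falls in the same 128-byte line
other_line = directory.host_write(0x2000)  # a line the accelerator never held
print(same_line, other_line)
```

Because the proxy knows exactly which lines the accelerator holds, a host write only triggers invalidation traffic when it touches one of them.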

Fig. 9: CAPI technology connections

The CAPP and PSL handle all virtual-to-physical memory translations, simplifying the programming model and freeing the accelerator to do the number crunching directly on the data it receives. In addition, the PSL holds a 256 KB resident cache on behalf of the accelerator. Based on the needs of the algorithm, the accelerator can direct the use of this cache by marking its memory accesses (reads and writes) as cacheable or non-cacheable. While the application executes on the host processor, the CAPI model offloads computation-heavy functions to the accelerator.
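Why an accelerator would want to choose between cacheable and non-cacheable accesses can be seen in a small simulation. The sketch below is a toy model, not the PSL's actual design: the class, the tiny capacity and the least-recently-used replacement policy are all assumptions chosen to keep the example short.

```python
from collections import OrderedDict


class ToyPslCache:
    """LRU cache keyed by line number; capacity is given in lines."""

    def __init__(self, capacity_lines: int):
        self.capacity = capacity_lines
        self.lines = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, memory: dict, line: int, cacheable: bool):
        if line in self.lines:
            self.hits += 1
            self.lines.move_to_end(line)
            return self.lines[line]
        self.misses += 1
        value = memory[line]
        if cacheable:
            self.lines[line] = value
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict least recently used
        return value


memory = {n: n * 10 for n in range(8)}
cache = ToyPslCache(capacity_lines=2)
for stream_line in range(1, 8):
    cache.read(memory, 0, cacheable=True)             # reused lookup data
    cache.read(memory, stream_line, cacheable=False)  # data streamed once
print(cache.hits, cache.misses)
```

Marking the streamed data as non-cacheable keeps it from evicting the small piece of data that is reused on every iteration.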

Fig. 10: Issues with PCI accelerators

The accelerator is a full peer to the application. It uses unmodified virtual addresses, with full access to the application’s address space. It uses the processor’s page tables directly, with page faults handled by system software. In a CAPI solution, the application might set up the data for the AFU and then send the AFU a process element to initiate the data-intensive execution.
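The shared-address-space idea can be sketched with a toy translation model in which the application and the accelerator resolve the same pointer through one page table. The class, the 4 KB page size and the allocate-on-fault handler are assumptions for illustration and do not reflect the actual Power memory management hardware.

```python
PAGE_SIZE = 4096  # assumption: 4 KB pages, chosen only for illustration


class ToyMmu:
    """One page table shared by the host application and the accelerator."""

    def __init__(self):
        self.page_table = {}      # virtual page number -> physical frame
        self.next_free_frame = 0
        self.faults = 0

    def handle_page_fault(self, vpn: int) -> None:
        # Stand-in for system software mapping the page on demand.
        self.page_table[vpn] = self.next_free_frame
        self.next_free_frame += 1
        self.faults += 1

    def translate(self, vaddr: int) -> int:
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.page_table:
            self.handle_page_fault(vpn)
        return self.page_table[vpn] * PAGE_SIZE + offset


mmu = ToyMmu()
app_pointer = 0x7F001234
host_view = mmu.translate(app_pointer)
afu_view = mmu.translate(app_pointer)  # same pointer, no conversion step
print(host_view == afu_view, mmu.faults)
```

The point of the model is that the accelerator needs no separate address format or pinned buffer: it is handed the application's own pointer and any fault is resolved once, for both sides.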

Fig. 11: Benefits of coherent accelerators

The process element contains a work element descriptor (WED) provided by the application. The WED can contain the full description of the job to be performed or a pointer to other main memory structures in the application’s memory space. The application can be the master or the slave to the accelerator, whichever mechanism the algorithm demands. Alternatively, the accelerator can receive incoming packets from the network, work on the data, and then inform the application that the processed data is ready for consumption.
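A WED that points to structures in the application's memory might look like the sketch below. The field layout is hypothetical, since a real WED is defined by whoever designs the AFU, and `toy_afu` is a plain Python function standing in for the accelerator, with byte inversion as a placeholder computation.

```python
import ctypes


class WorkElementDescriptor(ctypes.Structure):
    """Hypothetical WED layout; a real one is defined by the AFU designer."""
    _fields_ = [
        ("src", ctypes.c_uint64),     # virtual address of the input buffer
        ("dst", ctypes.c_uint64),     # virtual address of the output buffer
        ("length", ctypes.c_uint64),  # number of bytes to process
        ("done", ctypes.c_uint64),    # set by the accelerator on completion
    ]


def toy_afu(wed_addr: int) -> None:
    """Stand-in for the accelerator: follows the WED's pointers directly."""
    wed = WorkElementDescriptor.from_address(wed_addr)
    src = (ctypes.c_ubyte * wed.length).from_address(wed.src)
    dst = (ctypes.c_ubyte * wed.length).from_address(wed.dst)
    for i in range(wed.length):
        dst[i] = src[i] ^ 0xFF  # placeholder computation: invert each byte
    wed.done = 1


# The application sets up its data, then hands over only the WED's address.
src_buf = (ctypes.c_ubyte * 4)(0x00, 0x0F, 0xF0, 0xFF)
dst_buf = (ctypes.c_ubyte * 4)()
wed = WorkElementDescriptor(
    ctypes.addressof(src_buf), ctypes.addressof(dst_buf), 4, 0
)
toy_afu(ctypes.addressof(wed))
print(bytes(dst_buf).hex(), wed.done)
```

Only a single address crosses the boundary. Everything else is reached by the accelerator dereferencing ordinary application pointers, which is what the coherent, shared address space makes possible.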

Fig. 12: Typical PCIe model flow

The acceleration platform can be integrated into cloud-based services. In this case, the application on the core runs as an Application as a Service (AaaS) for other applications that require services of the accelerator. The AaaS model maintains work queues for the accelerator on behalf of each of the requesting threads and performs any maintenance function needed to inform the accelerator of pending work. These connections are made through operating-system kernel extensions and library functions created specifically for CAPI.
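The per-requester work queues can be sketched as follows. This is an illustrative model only: it does not use the CAPI kernel extensions or library functions mentioned above, and the class, its methods and the squaring function standing in for the accelerator are all hypothetical.

```python
import queue
import threading
from collections import defaultdict


class ToyAcceleratorService:
    """Keeps one work queue per requesting client and drains them in turn."""

    def __init__(self):
        self.queues = defaultdict(queue.Queue)
        self.lock = threading.Lock()

    def submit(self, client_id: int, job) -> None:
        with self.lock:
            self.queues[client_id].put(job)

    def drain(self, accelerator) -> list:
        """Hand every pending job to the accelerator callable."""
        results = []
        with self.lock:
            for pending in self.queues.values():
                while not pending.empty():
                    results.append(accelerator(pending.get()))
        return results


service = ToyAcceleratorService()


def client(client_id: int) -> None:
    for n in range(2):
        service.submit(client_id, client_id * 10 + n)


threads = [threading.Thread(target=client, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

results = service.drain(lambda job: job * job)
print(sorted(results))
```

Keeping a separate queue per requester lets the service notify the accelerator of pending work without the requesting threads having to coordinate with each other.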


