Continued from: FPGAs in Data Centres, Part 1
Microsoft's Catapult cloud field-programmable gate array (FPGA) architecture accelerates both cloud services (such as search engines) and the Azure cloud platform; Azure is an open, flexible, enterprise-grade cloud computing platform. The reconfigurable Catapult fabric is embedded into each half-rack of 48 servers in the form of a small board, with a medium-sized FPGA and local dynamic random-access memory (DRAM) attached to each server. The FPGAs used in Catapult servers are central to delivering better Bing results: they quickly score, filter, rank and measure the relevancy of text and image queries.
The Catapult v2 design is more flexible, circumventing traditional data centre structures for machine learning and expanding the role of FPGAs as accelerators. It makes FPGAs more widely available, allowing them to be hooked up to a larger pool of computing resources. Each FPGA is connected to DRAM, the central processing unit (CPU) and the network switch, so it can accelerate local applications or act as a processing resource in large-scale deep-learning models. Much as with Bing, the FPGAs can be involved both in scoring results and in training deep-learning models.
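This dual role is easy to picture in code. The Python sketch below shows one way a host might prefer its own FPGA and spill work to the network-attached pool; the Fpga class, the dispatch policy and all names are illustrative assumptions, not Microsoft's actual scheduler.

# Sketch of the two roles Catapult v2 gives an FPGA: accelerating its
# own host, or serving as a network-attached resource in a shared pool.
# The policy and names are assumptions for illustration only.
class Fpga:
    def __init__(self, name):
        self.name = name
        self.queue = []  # jobs waiting on this device

    def run(self, job):
        self.queue.append(job)
        return f"{job} -> {self.name}"

def dispatch(job, local, pool):
    # Prefer the host's own FPGA; otherwise pick the least-loaded pooled one.
    target = local if not local.queue else min(pool, key=lambda f: len(f.queue))
    return target.run(job)

local = Fpga("local")
pool = [Fpga(f"pool-{i}") for i in range(3)]
print(dispatch("score-query", local, pool))  # score-query -> local
print(dispatch("dnn-layer", local, pool))    # dnn-layer -> pool-0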
Fig. 16: Deep learning
The Catapult v2 design can be used for cloud-based image recognition, natural language processing and other tasks typically associated with machine learning, and it could provide a blueprint for using FPGAs in machine-learning installations. Many machine-learning models today are driven by graphics processing units (GPUs), while the role of FPGAs is less clear. FPGAs can deliver deep-learning results quickly, but consume too much power if not programmed correctly. They can be reprogrammed to execute specific tasks, but that also makes them one-dimensional at any given moment. GPUs, in comparison, are more flexible and can handle several kinds of calculation without reconfiguration.
Catapult server specification
Project Catapult employs an elastic architecture that links FPGAs together in a 6×8 ring network providing 20Gbps of peak bidirectional bandwidth at sub-microsecond latency. This allows FPGAs to share data directly without having to go back through the host servers. The Catapult fabric can be reprogrammed to adapt to new ranking algorithms without the time and cost of developing custom ASICs, and without the disruption of pulling servers from production to install new chips, a primary consideration when operating cloud-scale data centres.
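Treating the 6×8 arrangement as a two-dimensional torus (an assumption; the article only calls it a ring network), each of the 48 FPGAs has four directly linked neighbours, which a few lines of Python can enumerate:

# Neighbour addressing in a 6x8 torus of FPGAs, assuming wrap-around
# links in both dimensions; illustrative only, not Catapult's routing code.
ROWS, COLS = 6, 8  # 48 FPGAs, one per server in the half-rack

def neighbours(node_id):
    # Return the IDs of the four directly linked FPGAs.
    r, c = divmod(node_id, COLS)
    return [((r - 1) % ROWS) * COLS + c,  # north
            ((r + 1) % ROWS) * COLS + c,  # south
            r * COLS + (c - 1) % COLS,    # west
            r * COLS + (c + 1) % COLS]    # east

print(neighbours(0))  # [40, 8, 7, 1]: the corner node wraps to row 5 and column 7

With direct links like these, a scoring request can hop FPGA-to-FPGA at sub-microsecond latency instead of detouring through the host CPUs.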
Fig. 17: Drivers for deep learning
Microsoft is putting FPGAs on PCI Express networking cards in every new server deployed in its data centres. The FPGAs handle compression, encryption, packet inspection and other rapidly changing tasks for data centre networks whose data rates have jumped from 1Gbps to 50Gbps in six years. GPUs, in comparison, typically consume more power and are more difficult to program for these tasks.
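A quick budget calculation shows why such line rates need hardware help. The figures below assume a 200MHz FPGA fabric clock and minimum-size 64-byte Ethernet frames; both are illustrative assumptions, not Microsoft's numbers.

# Per-packet processing budget for a 50Gbps FPGA NIC (assumed 200MHz
# fabric clock, 64-byte frames, preamble and inter-packet gap ignored).
LINK_GBPS = 50
CLOCK_MHZ = 200
FRAME_BYTES = 64

packets_per_sec = LINK_GBPS * 1e9 / (FRAME_BYTES * 8)
cycles_per_packet = CLOCK_MHZ * 1e6 / packets_per_sec
print(f"{packets_per_sec / 1e6:.1f} Mpps")       # ~97.7 Mpps
print(f"{cycles_per_packet:.2f} cycles/packet")  # ~2.05 cycles/packet

At roughly two fabric clock cycles per minimum-size packet, compression, encryption and inspection logic must be deeply pipelined with many packets in flight, which is exactly what an FPGA data path is suited to.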
IBM's Power9 processors use Nvidia's NVLink to connect to GPUs such as Pascal, as well as IBM's own CAPI to link to FPGAs from Xilinx. Separately, ARM will use its Coherent Hub Interface to link to Xilinx FPGAs and other accelerators. Intel, the x86 giant, is already showing packages that contain Xeon server and Altera FPGA dies, and aims to put both chips on a single die as early as next year. Both CPUs and FPGAs draw a lot of power, so putting the two in one socket limits performance to the thermal envelope of a single package.
Fig. 18: Altera vs PCI
Microsoft is using Altera FPGAs both in its networking cards and as accelerators for its Bing search, where they speed up the ranking of search terms, a new partition of the overall search job. Chinese server vendor Inspur, meanwhile, is trying to ride the GPU train with a server packing 16 graphics processors. The server uses a message-passing interface the company created for the Caffe deep-learning framework, developed at U.C. Berkeley for use with GPU accelerators.
China's Baidu has developed an FPGA board that accelerates at least ten low-level tasks, with a special embedded core that makes the FPGA easier to program. The PCIe 2.0 card uses a Xilinx Kintex-7 480T with 4GB of memory.
Fig. 19: Catapult uses a network of directly linked FPGAs
Achronix's Speedster 22i FPGAs, built on Intel's advanced 22nm process technology, implement all interface functions as hardened IP. This results in lower power consumption, less use of the programmable-logic fabric, faster clock rates and simpler design, since the interface IP is already timing-closed.
The Achronix Accelerator-6D board is said to offer the highest memory bandwidth of any FPGA-based PCIe form-factor board, making this add-in card well suited to high-speed data centre acceleration applications. The Accelerator-6D packs a Speedster22i HD1000 FPGA with 700,000 lookup tables, connected to six independent memory controllers that together allow up to 192GB of memory and 690Gbps of total memory bandwidth. The board comes with a power supply, a one-year licence for Achronix ACE design tools and multiple system-level reference designs for Ethernet, DDR3 and PCIe operation.
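Divided evenly across the six controllers (an assumption about the board's balance), those headline numbers work out as follows:

# Per-controller share of the quoted Accelerator-6D figures, assuming
# capacity and bandwidth are split evenly across the six controllers.
controllers = 6
print(192 / controllers)  # 32.0 GB of memory per controller
print(690 / controllers)  # 115.0 Gbps of bandwidth per controller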