Open Source Computer Vision (OpenCV) is a widely used and heavily documented image processing library. With the latest version drawing participation from big players like Intel and AMD, let us find out about the Open Computing Language (OpenCL) acceleration layer added to it.
Dr. Harris Gasparakis, OpenCV manager, Computer Vision, AMD, speaks to Priya Ravindran from EFY about graphics processing and AMD’s contribution to OpenCV. Take a look.
Q. To set a base for this interview, how is graphics processing unit (GPU) computing different from central processing unit (CPU) computing when not handling graphics?
A. GPU computing is built with parallelism in mind. CPUs have multiple cores and support hyper-threading, allowing each core to run more than one thread (two threads per core is a typical number). So, a retail CPU may have two to four cores, each core running one or two threads. To contrast this with GPUs: nominally, an integrated GPU will have eight cores, and a discrete GPU will have 40 cores. Clearly, you have massively more concurrent threads on a GPU than on a CPU! To compensate for that, CPUs support vectorised instructions, exposed through compiler intrinsics, such as Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX). However, taking advantage of these vector capabilities is a non-trivial task, and it is far from automatic. In fact, the modern trend, embodied by GPUs, is to employ scalar architectures, support many concurrent threads, and let the compiler and the runtime do their magic!
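The contrast above can be sketched in plain Python: a CPU offers a handful of hardware threads that a thread pool can fan work out to, whereas a GPU would run the same per-element function across thousands of threads at once. This is a conceptual illustration, not AMD or OpenCV code; the function names are invented for the demo.

```python
# Conceptual sketch (stdlib only): CPU-side data-parallel map over a few
# hardware threads. On a GPU, each element would map to its own thread.
import os
from concurrent.futures import ThreadPoolExecutor

def square(x):
    """Per-element work; a GPU would assign one thread per element."""
    return x * x

def parallel_map(func, data):
    # os.cpu_count() reports logical cores (cores x threads per core),
    # typically a small number -- e.g. 4 to 8 on a retail CPU.
    workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, data))

print(parallel_map(square, range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]
```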
Q. How in your opinion could computing evolve from here?
A. It is worth mentioning here that we are at the dawn of an era that transcends CPU and GPU computing by combining both: I am talking about heterogeneous computing, where CPUs and GPUs are not only integrated on the same die, but also efficiently synchronise access to, and operate on, the same data in main memory, without redundant data copies. OpenCL 2.0 provides the application programming interface (API) that enables this.
Q. What are the main points to keep in mind while creating an OpenCL implementation?
A. There are two main elements in creating an OpenCL implementation or port of any library. First, you need to make sure that your data makes it to your processing cores; e.g., with a discrete GPU, you need to explicitly copy the data from main memory to on-board memory. That approach will also work on an integrated device, but it will not be efficient. The other element in porting a library to OpenCL is that you need to port the processing logic to OpenCL kernel syntax.
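To make the second element concrete, here is what OpenCL kernel syntax looks like for a simple SAXPY operation (`y = a*x + y`), alongside the same per-element logic in plain Python. The kernel string is illustrative only: executing it would require an OpenCL runtime and a host program, so only the CPU reference runs here. The names `SAXPY_KERNEL` and `saxpy_reference` are invented for this sketch.

```python
# Illustrative sketch: OpenCL kernel syntax vs. the equivalent scalar logic.
SAXPY_KERNEL = """
__kernel void saxpy(const float a,
                    __global const float *x,
                    __global float *y)
{
    // Each work-item handles one element, indexed by its global id.
    int i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}
"""

def saxpy_reference(a, x, y):
    """CPU reference: the same logic the kernel applies per work-item."""
    return [a * xi + yi for xi, yi in zip(x, y)]

print(saxpy_reference(2.0, [1.0, 2.0, 3.0], [10.0, 10.0, 10.0]))
# [12.0, 14.0, 16.0]
```

Note how the loop disappears in the kernel: the runtime launches one work-item per element, which is the scalar, massively threaded style described in the answer above.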
Q. What are the main differences between OpenCV-without-OpenCL and OpenCV-with-OpenCL? What new capabilities does it offer?
A. It was not about new functionality; rather, it was about accelerating the most common existing functionality to take advantage of GPUs. Instead of copying data to on-board memory, on an integrated GPU you would want to use “zero copy”, i.e., you want the GPU to use the same data as the CPU. This was possible using OpenCL 1.2, but it is significantly easier with OpenCL 2.0. In OpenCV, our goal was to create an implementation that would work on all OpenCL-capable devices, such as integrated or discrete GPUs, and as part of OpenCV’s extensive automated testing, we make sure that the OpenCL functionality works correctly on a wide gamut of devices. The other element is porting the processing logic. As far as the programming goes, OpenCL is a high-level, C-like language (and stay tuned for C++ goodness in the future), so that is not very difficult in principle. In practice, we wanted a high-performance implementation, so we fine-tuned most of the OpenCL kernels for performance.
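The “zero copy” idea can be illustrated with nothing but the standard library: the difference is whether the consumer works on a duplicate of the producer’s data or on the very same buffer. This is an analogy only, not the OpenCL API; on an integrated GPU the shared buffer would be the main memory both processors see.

```python
# Analogy (stdlib, no OpenCL): copy vs. zero copy over one buffer.
buf = bytearray(b"\x01\x02\x03\x04")

copied = bytes(buf)       # explicit copy: like staging data in GPU memory
shared = memoryview(buf)  # zero copy: a view over the same storage

buf[0] = 0xFF             # the producer updates the data in place

print(copied[0])  # 1   -> the copy is stale and cost an extra transfer
print(shared[0])  # 255 -> the view sees the update immediately, no copy
```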
Q. Why did you choose to go with the transparent acceleration layer? How difficult was it to integrate this into OpenCV?
A. We introduced an OpenCL module in OpenCV 2.4. While we kept the names of most functions the same, taking advantage of OpenCL acceleration back then required ‘porting’ the code to use the data structures and functions of the OpenCL module (under the ‘ocl’ namespace). While this was not difficult, it was a step I felt we could do without! The goal of the transparent acceleration layer (T-API) was to enable people to write their code only once: if an OpenCL-capable device is available, it will be used; otherwise, the fallback is CPU execution, which can also include accelerators like Integrated Performance Primitives (IPP) or intrinsics like AVX/SSE. Detection of OpenCL devices happens dynamically, at runtime. Integration inside OpenCV was a significant effort. We sponsored the maintainers of OpenCV; they became very excited with the vision and carried it forward.
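The write-once-dispatch-at-runtime pattern described above can be sketched in a few lines of Python. This is a conceptual model of the T-API idea, not OpenCV’s actual implementation: all function names here are invented, and the device check is hardwired for the demo.

```python
# Conceptual sketch of transparent dispatch: the caller uses one entry
# point; an accelerated path is chosen at runtime if a device exists,
# otherwise execution falls back to the CPU path.
def opencl_device_available():
    """Stand-in for runtime device detection; hardwired for the demo."""
    return False  # pretend no OpenCL-capable device was found

def blur_gpu(data):
    raise RuntimeError("no OpenCL device")  # would enqueue an OpenCL kernel

def blur_cpu(data):
    # Simple 3-tap box filter (edge-replicated) as the CPU fallback path.
    padded = [data[0]] + list(data) + [data[-1]]
    return [sum(padded[i:i + 3]) / 3.0 for i in range(len(data))]

def blur(data):
    """Single user-facing call: the dispatch is invisible to the caller."""
    if opencl_device_available():
        return blur_gpu(data)
    return blur_cpu(data)

print(blur([3.0, 3.0, 3.0]))  # [3.0, 3.0, 3.0]
```

The key design point mirrored here is that user code never names a backend; swapping the detection result changes which path runs without touching the call site.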