For Artificial Intelligence (AI) applications – like pedestrian detection for an autonomous vehicle or image quality enhancement for a digital still camera – trained neural networks have surpassed programmed digital signal processors (DSPs) in performance, efficiency and algorithmic flexibility. That doesn’t mean DSPs aren’t needed in AI processing. In fact, just the opposite: neural network accelerators paired with vector DSPs make a strong combination for AI subsystems across a range of applications.
It's important to consider neural network processing techniques – like convolutional neural networks (CNNs) or transformers – separately from the hardware needed to run these models. There are many options for implementing them: any processor that can perform multiplications and move large amounts of data around can eventually execute these computation-heavy models. With good quantization techniques, the 32-bit floating-point weights of a trained neural network can be run on 8-bit integer controllers or processors with little to no accuracy degradation.
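As a rough illustration of what that quantization step involves, the sketch below shows symmetric per-tensor INT8 quantization of a weight array. It is a minimal example, not any particular vendor's flow – production toolchains typically use calibration data, per-channel scales and quantization-aware retraining.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization of float32 weights to INT8."""
    scale = np.abs(weights).max() / 127.0          # map the largest magnitude to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float32 weights."""
    return q.astype(np.float32) * scale

# Example: the round trip introduces only a small quantization error
w = np.random.randn(1000).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
```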
That means a CNN inference can be processed on a CPU, a GPU, a DSP or even a lowly microcontroller and attain the same accuracy. The choice of processor for AI inference matters much more when real-time performance (in frames per second, where a frame is an uncompressed image) is important – in other words, when there is limited time to process each image frame before the next one arrives. There are many real-time applications where high levels of performance are crucial: imagine a car barreling down on a pedestrian at 70 MPH and trying to decide quickly whether to apply the brakes. Multiple cameras, high resolutions and minimal latency all drive the need for maximum computational efficiency to make a life-or-death decision.
Figure 1: AI applications span a wide range of performance requirements from a few GOPS to thousands of TOPS.
Giga Operations Per Second (GOPS) and Tera Operations Per Second (TOPS), where 1 TOPS equals 1,000 GOPS, are often used as shorthand for the computational capability available to process AI models. TOPS is, at best, a first-order metric that indicates the relative ideal performance of a processor. For AI processing, TOPS is usually compared for INT8 calculations (20 INT16 TOPS would require much more memory and data movement than 20 INT8 TOPS). As Figure 1 shows, performance requirements vary widely across AI applications.
TOPS is calculated by taking the number of operations that can be performed in one cycle (with a multiply-accumulate counting as two operations) and multiplying it by the maximum clock frequency. This is a good first estimate of performance, since the bulk of the computation is driven by matrix multiplies, which are built from multiply-accumulate (MAC) operations. A CPU with DSP extensions that can perform one MAC per clock cycle and runs at 2 GHz would deliver 1 MAC/cycle x 2 operations (multiply, accumulate) x 2 GHz = 4 GOPS, or 0.004 TOPS. Table 1 shows the relative performance you can expect. Neural processing units (NPUs) are clearly the best choice for the highest computational throughput.
| Processor Type | MACs per cycle | Fmax | Ideal TOPS |
| --- | --- | --- | --- |
| CPU with DSP extensions | 1 | 2 GHz | 0.004 TOPS (4 GOPS) |
| Vector DSP | 512 | 1.2 GHz | 1.2 TOPS |
| NPU (low end) | 4,096 | 1.3 GHz | 10.6 TOPS |
| NPU (high end) | 98,304 | 1.3 GHz | 255.6 TOPS |
Table 1 – Approximate performance ranges by processor type
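As a quick sanity check, the same back-of-the-envelope formula reproduces the Table 1 values. This is a minimal sketch of ideal peak throughput only; sustained throughput in a real system also depends on memory bandwidth and MAC utilization.

```python
def ideal_tops(macs_per_cycle: int, fmax_ghz: float) -> float:
    """Ideal peak throughput: MACs/cycle x 2 ops/MAC x clock frequency (GHz), in TOPS."""
    return macs_per_cycle * 2 * fmax_ghz / 1000.0

for name, macs, fmax in [
    ("CPU with DSP extensions", 1, 2.0),
    ("Vector DSP", 512, 1.2),
    ("NPU (low end)", 4_096, 1.3),
    ("NPU (high end)", 98_304, 1.3),
]:
    print(f"{name}: {ideal_tops(macs, fmax):.3f} TOPS")
```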
Although not in the table, GPUs can also provide high levels of performance, but at a much higher power and area cost that is hard for real-time applications to absorb. In fact, each processor type in the table requires a different amount of power and area to achieve its TOPS. For real-time applications, power (sometimes better expressed as thermal impact) and area (directly related to cost and manufacturability) are almost as important as performance. NPUs have been designed and optimized to be the most performance-, power- and area-efficient processors for executing neural network algorithms.
Not every AI application needs the highest levels of neural network performance an NPU provides. Referring to Figure 1, if your system already includes a microcontroller, a CPU with DSP extensions or a vector DSP, it may be perfectly fine to run lower-TOPS neural network tasks in between other typical processing tasks. If you need a mix of modest AI performance – say, less than 1 TOPS – along with the non-AI processing capabilities of a CPU or DSP, those would be good choices for your application. Above 1 TOPS, the AI performance, power efficiency and area efficiency of an NPU are hard to beat for real-time applications.
An NPU’s best-in-industry efficiency comes from the significant number of multiplies it can perform each cycle, plus dedicated hardware for other neural network operations such as activation functions. The challenge for an NPU is to include as much hardware acceleration as possible to maximize neural network efficiency while maintaining some amount of programmability. Certainly, a fully hardwired neural network ASIC would be more efficient than a programmable NPU. However, with the unrelenting pace of academic AI research producing new techniques, models and activation functions, maintaining some programmability is critical given the multi-year cycle needed to produce an AI SoC.
An NPU is a dedicated neural processing engine, while a CPU or DSP can perform AI as an additional task. Since an NPU performs only AI workloads, for AI requirements above 1 TOPS it is recommended to combine a vector DSP with an NPU. This provides maximum performance with additional programmability. For example, in an autonomous vehicle, while the NPU is looking for pedestrians, identifying street signs or using neural networks for part of the radar or LiDAR processing, a vector DSP in the system is needed for additional filtering, radar or LiDAR processing, as well as pre- and post-processing for the NPU.
Figure 2: Different combinations of vector DSP and neural network performance.
In Figure 2, you can see the range of possibilities for combining NPUs and vector DSPs in AI applications. In all three cases, a high-resolution image frame sits in DDR memory waiting to be processed before the next frame arrives. In the first configuration, on the left, a vector DSP by itself handles the DSP processing and some amount of AI processing – consider this the <1 TOPS use case: big DSP, small AI. A specific example is a vector DSP performing sensorless field-oriented control (FOC) for a permanent magnet synchronous motor (PMSM). The DSP-based motor control is extended with AI processing that performs position monitoring fed back into the control loop. The sampling rate and computational complexity of the AI model allow it to fit within the AI capabilities of a vector DSP.
In the second configuration, in the middle, the AI SoC requires both a large amount of vector DSP performance and a large amount of AI performance: big AI, big DSP. While the vector DSP handles DSP-heavy tasks, it needs to be supplemented with the neural network acceleration of an NPU for the AI-intensive tasks. One example use case is a digital still camera. The vector DSP can perform vision processing and pre- and post-processing support for the NPU, while the NPU is dedicated to CNN or transformer processing (object detection, semantic segmentation, super resolution, etc.) on high-resolution images. These use cases require vector DSP and NPU solutions that are closely integrated and can scale to fit the performance targets.
In the third configuration, on the right, all the processing is focused on neural networks and DSP processing is only needed to support the NPU: small DSP, big AI. The NPU processes the neural networks, which can often be executed completely within the NPU. Standard neural network models or graphs like ResNet-50, MobileNet v2, etc., can run entirely on Synopsys’ ARC® NPU. Some more complicated neural network models include floating-point operations that require support from the vector DSP – for example, the ROI pooling and ROI alignment in Mask R-CNN, or the non-integer scale factors used by DeepLab v3. Even if the AI SoC doesn’t need any additional DSP processing, there is still a benefit to including some amount of vector DSP performance to support the NPU for future proofing.
While there are multiple options for vector DSPs and NPUs on the market, for the second and third configuration types it is best to choose an AI solution with closely integrated processors. Some neural network accelerators embed a vector DSP inside the neural network solution, limiting its availability for other programming tasks. Synopsys’ ARC EV7x vision processors are heterogeneous processors that closely couple vector DSPs with optional neural network engines. To increase customer flexibility and programmability, the ARC EV7x family is evolving into the ARC VPX family of vector DSPs and the ARC NPX family of NPUs. The VPX and NPX form a closely coupled solution for AI. Figure 3 shows the high-level block diagrams of these two processors and how they are interconnected.
Figure 3: The closely coupled combination of the Synopsys ARC VPX5 and ARC NPX6
The ARC VPX DSP IP excels at parallel DSP processing based on a very long instruction word (VLIW)/single instruction-multiple data (SIMD) architecture and is optimized for the power, performance and area (PPA) requirements of embedded workloads. The VPX family can be configured for floating-point and multiple integer formats, including INT8 operations for AI inference. A range of performance is available: the VPX family operates on 128-bit (VPX2, VPX2FS), 256-bit (VPX3, VPX3FS) and 512-bit (VPX5, VPX5FS) vector words and can scale from one to four cores. This provides from 16 INT8 MACs per cycle up to 512 INT8 MACs per cycle (using the dual-MAC configuration on a four-core VPX5).
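A rough way to see where that 16-to-512 range comes from is sketched below. It assumes one 8-bit lane per 8 bits of vector width and treats the dual-MAC option as a simple x2 factor, which matches the figures quoted above but is only an estimate of the configuration math, not a product specification.

```python
def vpx_int8_macs_per_cycle(vector_bits: int, cores: int = 1, dual_mac: bool = False) -> int:
    """Estimate INT8 MACs/cycle: one 8-bit lane per 8 bits of vector width,
    doubled for a dual-MAC configuration, summed over all cores."""
    lanes = vector_bits // 8
    return lanes * (2 if dual_mac else 1) * cores

print(vpx_int8_macs_per_cycle(128))                          # single-core, 128-bit -> 16
print(vpx_int8_macs_per_cycle(512, cores=4, dual_mac=True))  # four-core VPX5      -> 512
```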
The ARC NPX NPU IP is dedicated to neural network processing and is likewise optimized for the PPA requirements of real-time applications. The family scales from a 4,096 MAC per cycle version to a 96K MAC per cycle version, which can then be scaled further with multiple instances. The NPX6 family can scale from 1 to 1,000s of TOPS of AI performance on a single SoC. It has also been optimized for the latest CNN models and for the emerging class of transformer models.
As seen in Figure 3, the VPX and NPX families are closely integrated. ARCsync is additional RTL that provides interrupt control between the processors. Data passes through an external NoC or AXI bus that is usually readily available in SoC systems. While the two processors can operate completely independently, the VPX5 has the ability to reach into the NPX6’s L2 memory as needed.
The close integration of the VPX5 and NPX6 is also supported by one common software development toolchain, ARC MetaWare MX, which supports any combination of NPX and VPX. SoC architects can choose the right combination of DSP performance and AI performance from these scalable processor families to maximize performance and minimize area overhead. For AI-heavy workloads, the rule of thumb for the big AI, small DSP configuration is one VPX5 for every 8K or 16K NPX MACs (depending on models and workload). For an NPX6-64K configuration, at least four VPX5 cores are recommended.
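A minimal sketch of that sizing rule of thumb is shown below. The 16K-MACs-per-VPX5 ratio is just the upper figure quoted above taken as a default; the right ratio in practice depends on the actual models and workload.

```python
import math

def recommended_vpx5_count(npx_macs_per_cycle: int, macs_per_vpx5: int = 16_384) -> int:
    """Rule-of-thumb VPX5 core count: one VPX5 per 8K-16K NPX MACs (16K assumed here)."""
    return max(1, math.ceil(npx_macs_per_cycle / macs_per_vpx5))

# Example: an NPX6-64K configuration (65,536 MACs/cycle) -> at least 4 VPX5 cores
print(recommended_vpx5_count(64 * 1024))
```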
It's true that neural network processing has replaced DSP processing for specific tasks like detecting a pedestrian, but the SIMD capabilities of a vector DSP, combined with its DSP and AI support capabilities, make it a valuable part of an AI system. As the demand for AI processing continues to grow in embedded applications, the combination of an NPU for AI processing and a vector DSP for NPU support and DSP processing is the best recommendation for a flexible design that helps future-proof an AI SoC for rapidly evolving AI.