Graham Wilson, Sr. Product Marketing Manager, ARC Processors, 草榴社区
In the early 2000s, digital signal processors (DSPs) were simple in architecture and limited in performance, but complex to program. They relied on fused instructions and single instruction, multiple data (SIMD) computation for multiply-accumulate (MAC) operations, executing one or two MACs per cycle. These DSPs were assembly-programmed cores with a limited instruction set architecture (ISA) and were generally used in a variety of signal processing and voice processing applications.
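To make the MAC-centric workload concrete, here is a minimal C sketch of a fixed-point dot product; the function name and types are illustrative rather than taken from any particular DSP toolchain. Each loop iteration is one multiply-accumulate, so a single-MAC DSP retires one iteration per cycle and a dual-MAC DSP can retire two.

```c
#include <stdint.h>

/* Illustrative fixed-point dot product: the inner loop is one
 * multiply-accumulate (MAC) per iteration. A single-MAC DSP issues
 * one of these per cycle; a dual-MAC DSP can process two per cycle. */
int64_t dot_q15(const int16_t *x, const int16_t *y, int n)
{
    int64_t acc = 0;                    /* wide accumulator to avoid overflow */
    for (int i = 0; i < n; i++) {
        acc += (int32_t)x[i] * y[i];    /* 16x16 -> 32-bit multiply, accumulate */
    }
    return acc;
}
```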
DSPs further evolved in performance for use in 3G cellular baseband modem applications. A typical 3G modem would have a single DSP optimized for dual/quad SIMD MAC performance, with instructions to accelerate basic DSP kernels such as the Fast Fourier Transform (FFT) and Infinite Impulse Response (IIR) filters. The very long instruction word (VLIW) architecture was introduced to increase performance by enabling execution of multiple parallel operations in tight, closed loops. At this point, C-compiler technology was advancing, but for optimized performance, DSP algorithms were still programmed at the assembly level.
DSP libraries became available, allowing quicker software development and more optimized operation. 3G wireless communication algorithms require computation on complex (I and Q) data, so complex data types were added to the architecture and libraries and, with the compiler, could be mapped onto the SIMD MAC units of the DSPs.
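As a rough illustration of what complex (I and Q) computation looks like at the C level, the sketch below accumulates a complex dot product over interleaved Q15 samples; the data layout and function name are assumptions for this example, not part of any vendor library.

```c
#include <stdint.h>

/* Illustrative complex (I/Q) multiply-accumulate on interleaved Q15 samples:
 * x[2k] holds I, x[2k+1] holds Q. A DSP with native complex data types can
 * map each iteration onto its SIMD MAC units; here it is plain C. */
void cdot_q15(const int16_t *x, const int16_t *y, int n,
              int64_t *acc_i, int64_t *acc_q)
{
    int64_t re = 0, im = 0;
    for (int k = 0; k < n; k++) {
        int32_t xi = x[2 * k], xq = x[2 * k + 1];
        int32_t yi = y[2 * k], yq = y[2 * k + 1];
        re += (int64_t)xi * yi - (int64_t)xq * yq;   /* real part */
        im += (int64_t)xi * yq + (int64_t)xq * yi;   /* imaginary part */
    }
    *acc_i = re;
    *acc_q = im;
}
```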
However, at the time, roughly half of a wireless communication application's computation was digital signal processing, while the other half consisted of control and non-vector (scalar) data computation tasks. Control and scalar tasks were not well suited to a SIMD VLIW DSP core, because control code has branches and exceptions that cause long pipeline stalls. As a result, scalar computation units and branch prediction were introduced into the DSP.
DSPs were also well suited for voice and audio applications, as their SIMD MAC throughput could deliver the computation needed for the audio offload engines used in smartphones and tablets. This opened up a new market for DSPs.
Wireless (cellular and Wi-Fi) mobile communication standards drove the need for greater computational complexity. 3.9G modems, with multiple-antenna, multiple-input multiple-output (MIMO) schemes, channel aggregation, and channel estimation algorithms, needed more software programmability to support greater functionality on a mobile baseband chip. In parallel, compiler technology evolved to recognize DSP vector data types, match them to loop strides, and perform basic auto-vectorization of inner loops.
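The following sketch shows the kind of loop such compilers can auto-vectorize: unit-stride accesses, independent iterations, and restrict-qualified pointers so the arrays are known not to alias. The names and the Q15 scaling are illustrative assumptions, and saturation is omitted for brevity.

```c
#include <stdint.h>

/* A loop shaped for auto-vectorization: independent iterations, unit stride,
 * and 'restrict' pointers so the compiler knows the arrays do not alias.
 * A DSP compiler that recognizes vector data types can map groups of
 * iterations onto SIMD MAC lanes instead of scalar operations. */
void scale_add_q15(int16_t *restrict out,
                   const int16_t *restrict a,
                   const int16_t *restrict b,
                   int16_t gain, int n)
{
    for (int i = 0; i < n; i++) {
        int32_t p = (int32_t)a[i] * gain;       /* 16x16 multiply          */
        out[i] = (int16_t)((p >> 15) + b[i]);   /* Q15 scale, then add     */
    }                                           /* (no saturation shown)   */
}
```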
As performance requirements increased, DSPs' SIMD width grew to 16, 32, and even 64 MACs/cycle for cellular mobile and infrastructure applications. In addition, more customized ISAs, with additional DSP filter acceleration instructions, matrix computation acceleration, and dedicated addressing modes, further improved performance.
A dual load/store architecture with wider load and store units was used to support higher throughput of complex vector data into the MAC units. Coupled with this, the register files were expanded with more dedicated vector data registers to balance the architecture against the computation throughput and minimize internal register pressure.
The VLIW architecture also expanded to support higher computation throughput. For example, the FFT requires a load, a load, an execute, and a store each cycle to run at optimal performance, hence 4-issue VLIW for DSP vector operations.
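A simplified radix-2 butterfly loop, written here in floating-point C for brevity, makes that load/load/execute/store mix visible; the function and its arguments are illustrative, not a production FFT.

```c
#include <complex.h>

/* Simplified radix-2 FFT butterfly loop (single group of one stage shown).
 * Each iteration loads two complex operands, performs the complex
 * multiply and add/subtract, and stores two results -- roughly the
 * load / load / execute / store mix described above, which is what a
 * 4-issue VLIW slot arrangement keeps fed every cycle. */
void butterfly_stage(float complex *data, const float complex *twiddle,
                     int half, int stride)
{
    for (int k = 0; k < half; k++) {
        float complex u = data[k];                 /* load                   */
        float complex t = twiddle[k * stride]
                        * data[k + half];          /* load + complex multiply */
        data[k]        = u + t;                    /* execute + store        */
        data[k + half] = u - t;                    /* execute + store        */
    }
}
```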
These DSPs integrated vector and scalar engines that can execute in parallel. Some architectures split DSP vector operations and scalar operations into separate VLIW slots, resulting in long instruction words. Others merge vector and scalar operations into one VLIW scheme, resulting in shorter instruction words but more instruction decoding complexity. On these DSPs, the control and DSP functions are pre-mapped and scheduled by the compiler into VLIW instructions.
As the computational complexity increased with LTE category number, there was a shift to using multiple cores in modem systems to address varying processing requirements within the system. The processors had different architectures and performance capabilities depending upon the functional component of the system in which they were used. For example, the front-end of the LTE modem requires complex vector data computation on DSP functions. Separate from that is the need for soft-bit domain processing that consists of scalar 16-bit data best suited for a different type of architecture/ISA, and better served by smaller or task-specific processors.
With 4G (LTE-Advanced) modems, the computation complexity increased about 10x compared to 3G modems. To support this, DSPs were further optimized, accelerating wireless communications algorithms with customized instructions, either as part of the base ISA or as extendable options. The addition of floating-point support, used for infrastructure MIMO computation algorithms, was another key advancement in DSP processor technology.
As the computation workload and complexity of algorithms continued to increase, a 'one processor fits all' approach was no longer viable, and there was a move to heterogeneous systems. Designers cannot fit a big core running at GHz speeds into a system-on-chip (SoC) that is subject to tight power budgets, so multiple DSPs were used for the main data path computation, with highly task-specific programmable cores handling the most data-intensive tasks, such as front-end FFT, turbo codec, and channel estimation (Figure 1).
Figure 1: Processor flexibility for balancing performance requirements against power budgets
In contrast to the higher end of 4G wireless communication modems, massive IoT and always-on activation applications push the requirements on DSPs in a completely different direction. Low cost, low power, and the ability to consolidate multiple DSP functions into one core are driving single-core implementations. These new DSPs have an architecture and ISA that bring together audio, wireless communications, control, and image/motion detection algorithms. This 'one core fits all' approach offers the flexibility to select or deselect computation and architectural options to achieve the smallest size and lowest power possible.
The next step is 5G, which is estimated to bring a further 10x to 20x increase in computation complexity compared to 4G modems. 5G modems are considerably more complex, with the scalability to cover a wider range of applications, bandwidths, and latency requirements. The MIMO configurations are much larger, with some infrastructure applications using massive MIMO that can only be implemented with high-precision floating-point computation.
To address these requirements, more task-specific programmable cores will be used to offload the main computation-intensive tasks, eliminating the need for the wide SIMD VLIW DSPs to run these algorithms. The wide SIMD VLIW DSPs are then used purely for DSP computation, with control and sequencing operations offloaded to an optimized controller/DSP core. This is a new class of controller core (CPU), targeted at high maximum clock frequencies (GHz range), very efficient task switching, and multithreaded operation. DSP extensions supporting single-, dual-, and quad-MAC computation handle non-critical DSP functions that are not optimal to run on the main wide SIMD VLIW DSP but can run on the controller/DSP core. A superscalar architecture is better suited than VLIW for a processor targeting controller algorithms, because the code is control-oriented, with branching, interrupts, and exceptions, as illustrated below.
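A minimal sketch of the kind of branch-heavy control code in question follows; the protocol and state names are invented purely for illustration.

```c
#include <stdint.h>

/* Branch-heavy control code of the kind described above: a small, made-up
 * framing state machine. Data-dependent branches like these stall a deep
 * VLIW DSP pipeline, but map well to a superscalar controller with branch
 * prediction and late condition resolution. */
typedef enum { IDLE, SYNC, PAYLOAD, ERROR } link_state_t;

link_state_t link_step(link_state_t s, uint8_t byte)
{
    switch (s) {
    case IDLE:    return (byte == 0x7E) ? SYNC : IDLE;       /* wait for flag */
    case SYNC:    return (byte != 0x00) ? PAYLOAD : ERROR;   /* header check  */
    case PAYLOAD: return (byte == 0x7E) ? IDLE : PAYLOAD;    /* end of frame  */
    case ERROR:   return IDLE;                               /* resynchronize */
    default:      return IDLE;
    }
}
```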
In addition to the control and low-level DSP functions, these cores operate as sequencers for hardware accelerator blocks, another function offloaded from the main DSPs. The controller cores require direct connectivity to the hardware blocks, and support for service requests, data movement, and synchronization between hardware blocks.
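The sketch below illustrates that sequencing role under stated assumptions: a hypothetical memory-mapped accelerator with invented register offsets and base address, programmed and polled by the controller core. It is not an ARC-specific interface.

```c
#include <stdint.h>

/* Hypothetical memory-mapped accelerator block, used only to illustrate the
 * sequencing role described above: the controller core programs the block,
 * starts it, and synchronizes on completion. Register layout, bit
 * definitions, and the base address are invented for this sketch. */
typedef struct {
    volatile uint32_t src;     /* source buffer address      */
    volatile uint32_t dst;     /* destination buffer address */
    volatile uint32_t len;     /* processing length in bytes */
    volatile uint32_t ctrl;    /* bit 0: start               */
    volatile uint32_t status;  /* bit 0: done                */
} accel_regs_t;

#define ACCEL ((accel_regs_t *)0x40001000u)    /* assumed base address */

static void accel_run(uint32_t src, uint32_t dst, uint32_t len)
{
    ACCEL->src  = src;
    ACCEL->dst  = dst;
    ACCEL->len  = len;
    ACCEL->ctrl = 1u;                          /* start the block */
    while ((ACCEL->status & 1u) == 0u) {
        /* poll for completion; a real system would typically
         * sleep and take an interrupt instead of spinning */
    }
}
```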
Figure 2: The Evolution of DSPs and CPUs for cellular user equipment modems
草榴社区 is a leader in processor IP and has worked with 4G/5G modem SoC developers for many years, so it fully understands the issues designers face and the need for high-performance DSPs and controller cores. This understanding translated into the recent release of the HS45D (which uses closely coupled instruction and data memories) and HS47D (which uses instruction and data caches), high-performance controller cores with DSP extensions built for these specific needs.
The HS4xD (HS45D and HS47D) cores are based on an advanced 10-stage pipeline with dual-issue superscalar instruction execution. The pipeline implements late arithmetic logic unit (ALU) execution, which gives conditional instructions more cycles to resolve before instruction commit, and mispredicted branches are resolved early. This greatly reduces pipeline stalls and improves control-code performance.
Figure 3: ARC HS4xD processor block diagram
The advanced pipeline architecture offers true two-cycle instruction and data memory access, implemented as two pipeline stages dedicated to accessing CCMs and caches. This gives SoC developers more options in closely coupled memory technology and enables the processor to be clocked at twice the frequency of the memories, which eases design and reduces implementation bottlenecks.
The DSP extensions include 150 DSP instructions for fixed-point, complex, and floating-point (single- and double-precision) computation, native to the core. There are also instructions and addressing modes to accelerate most DSP filter functions and algorithms used in communications, radar, and home audio applications. The cores can sustain dual-MAC (16-bit x 16-bit) operation, with quad-MAC (16-bit x 16-bit) throughput for key digital filter functions. The DSP functions benefit from the parallel instruction execution of the superscalar architecture; combined with the core's advanced load/store unit, this achieves sustained high-performance DSP computation comparable to DSP-only cores.
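As a rough illustration of how a dual-MAC datapath is kept busy, the following generic C FIR kernel computes two outputs per pass so that each coefficient load feeds two 16x16 MACs. It is plain C rather than the ARC DSP library API, and the names are illustrative; n_out is assumed even.

```c
#include <stdint.h>

/* Sketch of a Q15 FIR kernel arranged for a dual-MAC datapath: two outputs
 * are computed per outer iteration so each coefficient load feeds two
 * 16x16-bit MACs. x must hold at least n_out + n_taps samples. */
void fir_q15_dual(const int16_t *x, const int16_t *coef,
                  int16_t *y, int n_out, int n_taps)
{
    for (int i = 0; i < n_out; i += 2) {
        int64_t acc0 = 0, acc1 = 0;
        for (int t = 0; t < n_taps; t++) {
            int32_t c = coef[t];                  /* one coefficient load ... */
            acc0 += c * x[i + t];                 /* ... feeds two MACs       */
            acc1 += c * x[i + 1 + t];
        }
        y[i]     = (int16_t)(acc0 >> 15);         /* Q15 scaling              */
        y[i + 1] = (int16_t)(acc1 >> 15);
    }
}
```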
Because of these architectural and ISA features, the HS45D (non-cached version) can achieve a typical clock frequency of 2.5GHz in a 16nm FinFET process, leaving ample performance headroom for future computation growth. The core also delivers an industry-leading 5.2 CoreMark/MHz as well as 3.0 Dhrystone MIPS/MHz.
The HS4xD cores are C programmable and fully compatible with other ARC DSP solutions. They include a performance-optimized DSP library that allows developers to quickly achieve the required performance on critical algorithms.
Like all ARC processors, the HS4xD cores are configurable in terms of architecture, memory subsystem, and ISA options. Many configuration options allow developers to tailor the core to meet performance requirements while keeping size and power consumption to a minimum. ARC APEX technology is also available with these cores. APEX allows users to add instruction-set extensions, register banks, custom registers, and custom interfaces to the core, so a developer can add custom instructions and registers to accelerate key algorithms, delivering very high performance where it is needed.
The HS4xD offers one of the widest ranges of connection schemes within an SoC. On top of the main bus interface (e.g., AMBA), there is a separate peripheral bus on the HS4xD. This peripheral bus has a dedicated region of the memory map and zero latency, and SoC developers can connect their own peripherals to it. This allows performance-critical peripherals to be isolated from bus-sharing latency delays.
Using APEX register extensions, hardware blocks can be connected to APEX registers and accessed directly. These registers can be defined at any width to fit the hardware block's operation, allowing user hardware blocks to be directly connected to the core and controlled with core instructions. Coupled with the µDMA engine available with the HS4xD cores, data movement to and from the hardware blocks can be controlled by the HS4xD core.
DSPs have changed significantly over the last decade and a half to meet the needs of wireless communication standards, and they have found their place in the baseband modems of mobile handsets. But ever-higher performance from ever-bigger cores does not work in modern mobile devices, where battery life and power consumption are critical. Hence the evolution toward more task-specific, heterogeneous cores as well as combined DSP/controller cores. A supplier of DSP IP should therefore offer SoC developers a much broader range of DSPs, accelerators, customized cores, and DSP/controller cores. 草榴社区' ARC HS4xD processors feature a dual-issue, 32-bit RISC + DSP architecture for embedded applications where high performance, high clock speed, and signal processing are required. The HS4xD cores offer the flexibility, control, signal processing, and power efficiency needed to address modern DSP challenges.
For more information: