ASIP eUpdate - April 2018 | DesignWare IP

ASIP eUpdate, April 2018

ASIP Designer

ASIP Designer is the Synopsys solution for efficiently designing and implementing your own application-specific instruction-set processor (ASIP) when you can't find suitable processor IP, or when hardware implementations require more flexibility.

This bi-annual newsletter provides you with easy access to ASIP-related resources. This issue includes the following topics:

Technology Feature: Domain-specific Processors
What's New: 2018.03 Release Update
Additional Resources: White papers & Customer Stories

Technology Feature: Domain-specific Processors

ASIP Designer comes with a rich set of example processor models provided in source code, which serve both as a modeling reference and as a starting point for customer designs.

In previous issues of the eUpdate Newsletter we covered the subject of data-level parallelism (June 2016) and instruction-level parallelism (Jan 2017), including the corresponding example models. In the October 2017 issue we looked at example processor models that demonstrate how to do a fast context switch. In this issue, we will conclude the series on example models, looking at models that highlight how a processor can be tuned towards very specific algorithmic requirements, featuring all the concepts covered in the previous editions.

Featuring: Tgauss

Tgauss is an example processor that uses the Gaussian Filtering application to illustrate processor concepts suited for certain image processing kernels.

Applying Gaussian filtering to images results in two-dimensional matrix operations. The specifics of the algorithm allow for a separation into two 1-D filters, one performing the horizontal phase while the other handles the vertical phase. An efficient implementation depends on organizing the register files as line buffers and on vector-based processing, i.e. applying SIMD concepts.
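The separation into a horizontal and a vertical 1-D phase can be sketched in plain Python. This is an illustrative model only, not Tgauss code: the function names, the floating-point arithmetic and the border handling (clamping coordinates at the image edge) are assumptions made for the example. It checks that the two 1-D passes give the same result as the full 2-D filter.

```python
import math

def gaussian_kernel_1d(taps, sigma):
    """Normalized 1-D Gaussian kernel with an odd number of taps."""
    half = taps // 2
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def filter_rows(image, kernel):
    """Horizontal phase: 1-D filter along each row, clamping at the borders."""
    half = len(kernel) // 2
    width = len(image[0])
    return [[sum(kernel[k] * row[min(max(x + k - half, 0), width - 1)]
                 for k in range(len(kernel)))
             for x in range(width)]
            for row in image]

def transpose(image):
    return [list(column) for column in zip(*image)]

def separable_filter(image, kernel):
    """Horizontal phase followed by vertical phase (rows of the transpose)."""
    horizontal = filter_rows(image, kernel)
    return transpose(filter_rows(transpose(horizontal), kernel))

def direct_filter(image, kernel):
    """Reference: full 2-D filter whose weights are kernel[j] * kernel[i]."""
    half = len(kernel) // 2
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for j, kj in enumerate(kernel):
                yy = min(max(y + j - half, 0), height - 1)
                for i, ki in enumerate(kernel):
                    xx = min(max(x + i - half, 0), width - 1)
                    acc += kj * ki * image[yy][xx]
            out[y][x] = acc
    return out

def macs_per_pixel(taps):
    """Multiply-accumulates per pixel and channel: (direct 2-D, separable)."""
    return taps * taps, 2 * taps
```

For a 9x9 filter, `macs_per_pixel(9)` gives 81 multiply-accumulates per pixel and channel for the direct form against 18 for the separable form, which is why the separable organization is the natural starting point for the data path.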

Figure 1: Tgauss data path

Some of the design decisions taken for Tgauss:

Vector data path to process six bytes (= two RGB pixels) in parallel:
- vmul, vmac
- vector register files A, B with cyclic buffer access
Separate vector memory for the input/output image
Separate vector memory for buffering lines between the horizontal and vertical filter phases
Split 32-bit register file into 16-bit Rh and Rl components for filter coefficient storage
3-level hardware loop
Memory interfaces to support delays in image memory response
Two pipeline execution stages (E1, E2) for higher frequency synthesis
Instruction-level parallelism (ILP) tuned to the needs of the application: a load from line buffer memory can be executed in parallel with a vector data path operation

Tgauss comes in two versions: the first performs horizontal and vertical phases in an iterative sequence; the second performs two horizontal phases followed by two vertical phases, reducing the line buffer memory load traffic by a factor of 2.

To illustrate the performance, these are the results for a 9x9 filter with RGB processing:

11.9 cycles/pixel = 920 kcycles/frame (320 x 240)
9 ms/frame @ 100 MHz
133 instructions

Smaller filters need fewer cycles, e.g. a separable 3x3 filter takes about 6 cycles/pixel.
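The relation between the quoted figures can be checked with a few lines of arithmetic. This is a small sketch; the function names are made up for the example, and the inputs are the numbers quoted above.

```python
def cycles_per_frame(cycles_per_pixel, width, height):
    """Total cycles needed to filter one frame."""
    return cycles_per_pixel * width * height

def frame_time_ms(cycles, clock_hz):
    """Frame time in milliseconds at the given clock frequency."""
    return 1000.0 * cycles / clock_hz

# 9x9 filter, RGB, 320 x 240 frame, 100 MHz clock (figures quoted above)
cycles = cycles_per_frame(11.9, 320, 240)
time_ms = frame_time_ms(cycles, 100e6)
frames_per_second = 1000.0 / time_ms

print(f"{cycles / 1e3:.0f} kcycles/frame, {time_ms:.2f} ms/frame, "
      f"{frames_per_second:.0f} frames/s")
```

With 11.9 cycles/pixel this comes to a little over 910 kcycles and a little over 9 ms per frame, in line with the quoted 920 kcycles and 9 ms (the per-pixel figure is presumably rounded). At that rate the core would sustain on the order of 100 frames per second at this resolution.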

The example indicates options for further speed-up:

Increase vector size
Reduce the loop size through more ILP, e.g. parallel stores to XM and LM

Featuring: Tcom8

Tcom8 is an example processor that illustrates processor concepts suited for certain communication kernels, especially those containing FFT and FIR operations.

Starting from a scalar processor model, the datapath architecture extension for Tcom8 adds SIMD vector operations on 8-component vectors, and adds the following new storage elements:

4 vector word registers of 8 x 16 bit (V), split in 2 tuples (VA, VB)
4 vector accumulator registers of 8 x 40 bit (W), split in 2 tuples (WA, WB)

The vector memory is split into two parts, VMA and VMB, and allows simultaneous read and write accesses to different parts, i.e. (read VMA || write VMB) or (write VMA || read VMB).

A 16 x 16 bit vector multiplication consumes two cycles and allows for:

pure 16 x 16 mult
real complex 16 x 16 mult
imag complex 16 x 16 mult (takes the real part as additional input)
multiply accumulate 16 x 16 mult + 40 bit accumulator
dedicated division instruction (both scalar & vector)
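Two aspects of this list can be illustrated with a short sketch: how a complex multiplication splits into a real-part and an imaginary-part computation, and how much headroom a 40-bit accumulator gives over 16 x 16 bit products. The function names are hypothetical, and the code models only the arithmetic per vector lane; the exact operand semantics of the Tcom8 instructions (for example how the imaginary variant uses the real part as an extra input) are not modeled here.

```python
def complex_mul_real(ar, ai, br, bi):
    """Real part of (ar + j*ai) * (br + j*bi)."""
    return ar * br - ai * bi

def complex_mul_imag(ar, ai, br, bi):
    """Imaginary part of (ar + j*ai) * (br + j*bi)."""
    return ar * bi + ai * br

def guard_bits(acc_bits=40, operand_bits=16):
    """Accumulator bits beyond the width of a full product."""
    return acc_bits - 2 * operand_bits

def worst_case_accumulations(acc_bits=40, operand_bits=16):
    """How many worst-case signed products fit before the accumulator
    can overflow (two's complement assumed)."""
    max_product = (1 << (operand_bits - 1)) ** 2   # (-32768) * (-32768)
    max_accumulator = (1 << (acc_bits - 1)) - 1    # largest positive value
    return max_accumulator // max_product

print(complex_mul_real(3, 4, 1, -2), complex_mul_imag(3, 4, 1, -2))
print(guard_bits(), worst_case_accumulations())
```

Under these assumptions the 8 guard bits allow several hundred worst-case products to be accumulated without overflow, comfortably more than a 32-tap FIR needs per output sample.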

In addition, specific instructions and parallel formats have been provided to efficiently map FFT and FIR applications (as explained below).

To illustrate the performance of the core (and of course of the automatically generated compiler), the model comes with example code for FFT and FIR:

For a 256 point FFT, the performance is about 570 cycles
The inner loops make optimal use of the hardware:
- In steady state, it is possible to perform 2 complex butterflies and 2 complex mults per cycle, illustrating the power of the architecture specialization for the FFT algorithm
The FFT scales at every stage. Dynamic, data-dependent scaling can be considered as a future extension
Specific instructions have been added for this application, in particular:
- radix 2 butterfly
- vector transpose
For a 32 tap FIR, the performance is about 6 cycles per sample
Specific instructions have been added for this application, in particular:
- vector element select and broadcast
- vector concatenate and vector window select
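The following sketch illustrates a radix-2 FFT that scales at every stage, together with simple lower bounds for the cycle counts. It is a floating-point reference model written for this illustration, not the Tcom8 example code: scaling by 1/2 per stage, the decimation-in-time structure and all function names are assumptions. The bounds assume 2 complex butterflies per cycle (as stated above) and one 8-lane vector multiply-accumulate issued per cycle.

```python
import cmath

def bit_reverse_permute(x):
    """Reorder the input into bit-reversed index order."""
    bits = len(x).bit_length() - 1
    return [x[int(format(i, f"0{bits}b")[::-1], 2)] for i in range(len(x))]

def radix2_butterfly(a, b, w):
    """Radix-2 butterfly with a scaling by 1/2 (assumed scaling scheme)."""
    t = w * b
    return (a + t) / 2, (a - t) / 2

def scaled_fft(x):
    """Radix-2 decimation-in-time FFT, scaled at every stage.
    Returns (DFT(x) / N, number of butterflies executed)."""
    n = len(x)
    assert n >= 2 and n & (n - 1) == 0, "length must be a power of two"
    data = bit_reverse_permute(list(x))
    butterflies = 0
    size = 2
    while size <= n:
        half = size // 2
        for start in range(0, n, size):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / size)
                i, j = start + k, start + k + half
                data[i], data[j] = radix2_butterfly(data[i], data[j], w)
                butterflies += 1
        size *= 2
    return data, butterflies

def min_fft_cycles(n, butterflies_per_cycle):
    """Lower bound on cycles: (N/2) * log2(N) butterflies in total."""
    total = (n // 2) * (n.bit_length() - 1)
    return -(-total // butterflies_per_cycle)

def min_fir_vector_macs_per_sample(taps, lanes):
    """Lower bound on vector MACs per output sample."""
    return -(-taps // lanes)
```

A 256-point FFT needs 1024 butterflies, so `min_fft_cycles(256, 2)` is 512; the reported figure of about 570 cycles is therefore close to this bound. Likewise `min_fir_vector_macs_per_sample(32, 8)` is 4, against the reported figure of about 6 cycles per sample.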

Featuring: SHA 256

This model highlights the design of a programmable accelerator as an alternative to fixed-function RTL, using the SHA-256 cryptographic hash function as an example. SHA-2 comes in different variants (SHA-224, 256, 384, 512, 512/224, 512/256), making it a good candidate for a flexible (because programmable) yet dedicated crypto engine.

As with all models, the SHA 256 crypto processor comes in source code. In addition, the model library includes a slide deck that describes the design process, starting from an existing 32-bit MCU, all the way to the final SHA 256 architecture.

Some of the architecture design steps taken:

Starting with an initial 32-bit MCU
Removing unneeded hardware elements, such as hardware multipliers
Introducing a special-purpose functional unit that implements the most time-consuming computations, especially the Compression Step present in the SHA256 transform function. This step contains many simple operations where a sequential execution in software is inefficient, but an implementation in hardware is inexpensive
Adding zero-overhead loop capabilities, automatically selected by the compiler
Adding post-increment addressing mode
Reorganizing memory into separate program memory, data memory and a K-memory for the SHA-256 round constants, which can be implemented as a 32-bit-wide ROM
This allows for 3-way instruction-level parallelism (ILP) that performs two loads and one arithmetic operation in parallel. The ILP can be encoded in 32-bit instructions, so there is limited impact on the required program memory
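For readers less familiar with the algorithm, the sketch below is a plain software reference of the SHA-256 transform function as defined in FIPS 180-4. It is not the ASIP Designer model. The 64-round loop in `compress` corresponds to the many simple operations (rotates, XORs, additions) that the steps above move into a special-purpose functional unit, and the table `K` is the kind of data a separate 32-bit-wide K-memory would hold; that mapping is an interpretation made for this example. The constants are derived from the cube and square roots of the first primes rather than typed in.

```python
import hashlib
import math

MASK = 0xFFFFFFFF

def first_primes(count):
    primes, candidate = [], 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def icbrt(n):
    """Integer cube root (floor) by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

PRIMES = first_primes(64)
# Round constants: first 32 fractional bits of the cube roots of the primes
K = [icbrt(p << 96) & MASK for p in PRIMES]
# Initial state: first 32 fractional bits of the square roots of 8 primes
H0 = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def compress(state, block):
    """Process one 64-byte block and return the updated 8-word state."""
    w = [int.from_bytes(block[i:i + 4], "big") for i in range(0, 64, 4)]
    for t in range(16, 64):  # message schedule expansion
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
    a, b, c, d, e, f, g, h = state
    for t in range(64):  # 64 compression rounds
        big_s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + big_s1 + ch + K[t] + w[t]) & MASK
        big_s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (big_s0 + maj) & MASK
        a, b, c, d, e, f, g, h = ((t1 + t2) & MASK, a, b, c,
                                  (d + t1) & MASK, e, f, g)
    return [(s + v) & MASK for s, v in zip(state, (a, b, c, d, e, f, g, h))]

def sha256(message):
    """Pad the message, run the transform per block, return the digest."""
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    padded += (len(message) * 8).to_bytes(8, "big")
    state = list(H0)
    for i in range(0, len(padded), 64):
        state = compress(state, padded[i:i + 64])
    return b"".join(v.to_bytes(4, "big") for v in state)

# Self-check against the standard library implementation
assert sha256(b"abc") == hashlib.sha256(b"abc").digest()
```

Each round reads one word of the message schedule and one constant from `K`, then performs a handful of logic and add operations. That access pattern matches the two-loads-plus-one-arithmetic-operation parallelism described in the last step above.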

Figure 2: SHA256 transform function

The example illustrates the typical tradeoff analysis between performance (here measured as throughput in cycles/byte) and the required area. Such architectural exploration is at the heart of almost any ASIP design. Fundamental to this approach is the immediate availability of a cycle-accurate instruction-set simulator, an optimizing C/C++ compiler, and the ability to go to synthesis. ASIP Designer is the ideal tool for this task.

Figure 3: Architectural exploration of SHA256 processor

What's New in ASIP Designer

2018.03 Release Update

Since the October 2017 newsletter, ASIP Designer has again seen a number of enhancements and extensions. The following is an extract, sorted by categories (please refer to the official Release Notes for the comprehensive list).

Example Models

New Tgauss model demonstrating how to implement an image processing kernel (see also the Technology Feature section above)
All example models featuring the LLVM compiler frontend have been updated to LLVM 6.0

Processor Modeling

Defining general load/store intrinsic functions (where guarded load/store functions are a special case), with improved LLVM support (no longer worst-case points-to information)

Simultaneous Hardware / Software Debugging

New support for the Synopsys Verdi Hardware-Software Debug solution, which enables embedded software-driven SoC verification by providing a synchronized multi-window view of the design's hardware and software behavior. It combines an instruction-accurate, fully synchronized view of RTL, C and assembly, providing a comprehensive SoC debug solution
Eclipse plugins have been upgraded to support Neon and Oxygen

Figure 4: Hardware/Software Debugging using Verdi

C/C++ Compiler

LLVM version updated from 5.0 to 6.0, making C++14 the default C++ version
New load/store widening optimization on processors supporting unaligned dual-word memory access, to increase the memory bandwidth of word accesses
Improved results for control benchmarks in the LLVM flow, resulting in up to 4.8 CoreMarks/MHz for selected example cores
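The idea behind load/store widening can be illustrated with a toy memory model: two loads of adjacent 32-bit words are replaced by a single dual-word load whose result is split into its two halves. This is a conceptual sketch only, not a description of how the ASIP Designer compiler implements the optimization. The `Memory` class, the little-endian byte order and the function names are assumptions made for the example.

```python
class Memory:
    """Toy byte-addressable memory that counts load accesses."""

    def __init__(self, data):
        self.data = bytes(data)
        self.accesses = 0

    def load(self, address, size):
        self.accesses += 1
        return int.from_bytes(self.data[address:address + size], "little")

def load_two_words_narrow(mem, address):
    """Before widening: two separate 32-bit word loads."""
    return mem.load(address, 4), mem.load(address + 4, 4)

def load_two_words_widened(mem, address):
    """After widening: one 64-bit load, split into low and high word."""
    dword = mem.load(address, 8)
    return dword & 0xFFFFFFFF, dword >> 32

contents = bytes(range(32))
narrow_mem, wide_mem = Memory(contents), Memory(contents)

# Address 4 is word-aligned but not dual-word-aligned, which is why the
# processor must support unaligned dual-word accesses for this to be legal.
narrow = load_two_words_narrow(narrow_mem, 4)
wide = load_two_words_widened(wide_mem, 4)
```

Both variants return the same pair of words, but the widened version needs one memory access instead of two.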

RTL Generation and Synthesis Support, FPGA Prototyping

New GO RTL generation options to configure the generated HDL regarding multiplexer implementation, decoding structure, guarding of pipeline registers, and reducing Spyglass warnings
Support for a new PDG assert statement, including auto-insertion to check index boundaries on bit slices and vector elements

Labs, Tutorials and Documentation

New Eclipse IDE manual, included in the regular manuals directory
The BRIDGE linker manual has been reworked and includes the archiving functionality
The GO manual has a new chapter on Verdi Hardware-Software Debug

Additional Resources

Customer References

Cognitive Systems: Designing a Complex SIMD/VLIW DSP in Record Time

Cognitive Systems is a startup that developed an innovative security system, based on wireless signal analysis. The application required a chipset that covers a wide spectrum between 650MHz and 4GHz, supporting a large variety of wireless standards. Read why Cognitive Systems decided on an ASIP, and how they managed to design a complex SIMD/VLIW DSP in less than 12 months, with a small team.

White Papers

Software Development Kits (SDKs) for Proprietary Processors: Why They Matter, What It Takes to Develop Them

To develop a proprietary processor that can stand the test of time, a highly functional SDK must be developed. The complexity, cost and duration of SDK development vary depending on the architecture of the processor and the skillset of the SDK developers. In this paper, we analyze the requirements for an SDK. We then introduce a tool-based methodology for SDK development based on the Synopsys ASIP Designer tool suite.

Rapid Architectural Exploration in Designing Application-Specific Processors

Architectural exploration is at the heart of any ASIP design approach. Designers need to rapidly explore the impact of different architectural choices on power consumption and performance, ideally using real-world application C-code as part of the design flow. This white paper explains the architectural tradeoffs that are available to an ASIP designer, how to trade off performance vs. area, and why an ASIP design can still maintain full C-programmability while being optimized for a certain application domain.

Designing ASIPs in Multicore SoCs

Modern SoCs integrate dozens of complex system functions, each requiring its own optimal balance of performance, flexibility, energy consumption, communication, and design time. The traditional model of a (configurable) general-purpose processor core with a number of fixed hardware accelerators no longer suffices. ASIPs can offer the best balance for each system function, and thus form the basis of new generations of multicore SoCs.