Synopsys

ASIP eUpdate, April 2018

ASIP Designer

Synopsys’ solution for efficiently designing and implementing your own application-specific instruction-set processor (ASIP) when you can’t find suitable processor IP, or when hardware implementations require more flexibility.

This bi-annual newsletter provides you with easy access to ASIP-related resources. This issue includes the following topics:

 

Technology Feature: Domain-specific Processors

ASIP Designer comes with a rich set of example processor models provided in source code, which serve both as a modeling reference and as a starting point for customer designs.

In previous issues of the eUpdate Newsletter we covered the subject of data-level parallelism (June 2016) and instruction-level parallelism (Jan 2017), including the corresponding example models. In the October 2017 issue we looked at example processor models that demonstrate how to do a fast context switch. In this issue, we will conclude the series on example models, looking at models that highlight how a processor can be tuned towards very specific algorithmic requirements, featuring all the concepts covered in the previous editions.

 

Featuring: Tgauss

Tgauss is an example processor that uses the Gaussian Filtering application to illustrate processor concepts suited for certain image processing kernels.

Applying Gaussian Filtering to images results in two-dimensional matrix operations. The specifics of the algorithm allow for a separation into two 1-D filters, one performing the horizontal phase while the other handles the vertical phase. A performance-efficient implementation depends directly on organizing the register files as line buffers and on using vector-based processing, i.e. applying SIMD concepts.
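The separation into two 1-D phases can be sketched in plain Python. This is an illustration only, not the Tgauss source code: the function names, the edge clamping, and the floating-point taps are assumptions made for the example.

```python
import math


def gaussian_taps(radius, sigma):
    """Return 2*radius + 1 normalized 1-D Gaussian taps."""
    raw = [math.exp(-(k * k) / (2.0 * sigma * sigma))
           for k in range(-radius, radius + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def clamp(i, lo, hi):
    return max(lo, min(i, hi))


def filter_rows(image, taps):
    """Horizontal phase: filter every row with the 1-D taps (edges clamped)."""
    radius = len(taps) // 2
    width = len(image[0])
    return [
        [
            sum(taps[k + radius] * row[clamp(x + k, 0, width - 1)]
                for k in range(-radius, radius + 1))
            for x in range(width)
        ]
        for row in image
    ]


def transpose(image):
    return [list(col) for col in zip(*image)]


def separable_filter(image, taps):
    """Horizontal phase followed by the vertical phase."""
    horizontal = filter_rows(image, taps)
    # The vertical phase is the same 1-D filter applied along the columns.
    return transpose(filter_rows(transpose(horizontal), taps))


def direct_filter_2d(image, taps):
    """Reference: full 2-D filter using the outer product of the taps."""
    radius = len(taps) // 2
    height, width = len(image), len(image[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            acc = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    pixel = image[clamp(y + dy, 0, height - 1)][clamp(x + dx, 0, width - 1)]
                    acc += taps[dy + radius] * taps[dx + radius] * pixel
            row.append(acc)
        out.append(row)
    return out
```

Both functions produce the same image, but for a 9x9 filter the direct form needs 81 multiplications per pixel and channel, while the two 1-D phases need 9 + 9 = 18. The intermediate result of the horizontal phase is what has to be held in a line buffer until the vertical phase consumes it.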

Figure 1: Tgauss data path  

Some of the design decisions taken for Tgauss: 

  • Vector data path to process six bytes (= two RGB pixels) in parallel:
    • vmul, vmac
    • vector register files A, B with cyclic buffer access
  • Separate vector memory for the input/output image
  • Separate vector memory for buffering lines between the horizontal and vertical filter phases
  • Split 32-bit register file into 16-bit Rh and Rl components for filter coefficient storage
  • 3-level hardware loop
  • Memory interfaces to support delays in image memory response
  • Two pipeline execution stages (E1, E2) for higher frequency synthesis 
  • Instruction level parallelism (ILP) tuned to the needs of the application: a load from line buffer memory can be executed in parallel with a vector data path operation  

Tgauss comes in two versions: the first performs horizontal and vertical phases in an iterative sequence; the second performs two horizontal phases followed by two vertical phases, reducing the line buffer memory load traffic by a factor of 2.

To illustrate the performance, here are the results for a 9x9 filter with RGB processing:

  • 11.9 cycles/pixel = 920 kcycles/frame (320 x 240)
  • 9 ms/frame @ 100 MHz 
  • 133 instructions 
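The relation between these figures can be checked with a few lines of arithmetic. The snippet below only uses the numbers quoted above; the variable names are made up for the example.

```python
CYCLES_PER_PIXEL = 11.9     # quoted result for the 9x9 RGB filter
WIDTH, HEIGHT = 320, 240    # quoted frame size
CLOCK_HZ = 100e6            # 100 MHz

pixels = WIDTH * HEIGHT
cycles_per_frame = CYCLES_PER_PIXEL * pixels
ms_per_frame = cycles_per_frame / CLOCK_HZ * 1e3

print(f"{pixels} pixels, {cycles_per_frame / 1e3:.0f} kcycles/frame, "
      f"{ms_per_frame:.1f} ms/frame")
```

This gives about 914 kcycles and 9.1 ms per frame, in line with the quoted 920 kcycles and 9 ms; the small difference is presumably due to rounding in the published cycles/pixel figure.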

Smaller filters need fewer cycles, e.g. a separable 3x3 filter takes about 6 cycles/pixel.

The example indicates options for further speed-up:

  • Increase vector size
  • Reduce loop size by more ILP, e.g. parallel store to XM, LM

 

Featuring: Tcom8

Tcom8 is an example processor that illustrates processor concepts suited for certain communication kernels, especially those containing FFT and FIR operations.

Starting from a scalar processor model, the datapath architecture extension for Tcom8 adds SIMD vector operations on 8-component vectors, and has the following new storages:

  • 4 vector word registers of 8 x 16 bit (V), split in 2 tuples (VA, VB)
  • 4 vector accumulator registers of 8 x 40 bit (W), split in 2 tuples (WA, WB)

The vector memory is split into two parts, VMA and VMB, and allows simultaneous read and write accesses to different parts, i.e. (read VMA || write VMB) or (write VMA || read VMB).

A 16 x 16 bit vector multiplication consumes two cycles and allows for:

  • pure 16 x 16 mult
  • real complex 16 x 16 mult
  • imag complex 16 x 16 mult (takes the real part as additional input)
  • multiply accumulate 16 x 16 mult + 40 bit accumulator
  • dedicated division instruction (both scalar & vector)

In addition, specific instructions and parallel formats have been provided to efficiently map FFT and FIR applications (as explained below). 

To illustrate the performance of the core (and of course of the automatically generated compiler), the model comes with example code for FFT and FIR: 

  • For a 256 point FFT, the performance is about 570 cycles
  • The inner loops make optimal use of the hardware:
    • In steady state, it is possible to perform 2 complex butterflies, and 2 complex mults per cycle – illustrating the power of the architecture specialization for the FFT algorithm 
  • The FFT scales at every stage. Dynamic scaling, depending on the data, can be considered as a future extension
  • Specific instructions have been added for this application, in particular:
    • radix 2 butterfly
    • vector transpose 
  • For a 32 tap FIR, the performance is about 6 cycles per sample
  • Specific instructions have been added for this application, in particular:
    • vector element select and broadcast
    • vector concatenate and vector window select
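As a rough illustration of the FFT structure referred to above, the following Python sketch implements a radix-2 FFT that scales by 1/2 at every stage and counts the butterflies. It is a scalar floating-point model written for this explanation, not the Tcom8 example code; the function names are hypothetical, and the mapping onto the vector instructions of the core is not shown.

```python
import cmath


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def fft_radix2_scaled(samples):
    """Radix-2 decimation-in-time FFT that halves the data after every stage.

    Returns (spectrum, butterfly_count); the spectrum equals DFT(samples) / N.
    """
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    data = [0j] * n
    for i, s in enumerate(samples):
        data[bit_reverse(i, bits)] = complex(s)

    butterflies = 0
    span = 1
    while span < n:                      # one iteration per FFT stage
        for start in range(0, n, 2 * span):
            for k in range(span):
                twiddle = cmath.exp(-1j * cmath.pi * k / span)
                a = data[start + k]
                b = data[start + k + span] * twiddle   # one complex multiply
                data[start + k] = (a + b) / 2          # radix-2 butterfly,
                data[start + k + span] = (a - b) / 2   # scaled by 1/2
                butterflies += 1
        span *= 2
    return data, butterflies


if __name__ == "__main__":
    _, count = fft_radix2_scaled([0.0] * 256)
    print(count)  # 8 stages x 128 butterflies = 1024
```

A 256-point radix-2 FFT has 8 stages of 128 butterflies, i.e. 1024 butterflies in total. At 2 butterflies per cycle this would take at least 512 cycles, which is a back-of-the-envelope reading consistent with the quoted figure of about 570 cycles once loop setup and data reordering are included.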

 

Featuring: SHA 256

This model highlights the design of a programmable accelerator as an alternative to fixed-function RTL, using the SHA-256 cryptographic hash function as an example. SHA-2 comes in different variants (SHA-224, 256, 384, 512, 512/224, 512/256), making it a good candidate for a flexible (because programmable), yet dedicated crypto engine.

As with all models, the SHA 256 crypto processor comes in source code. In addition, the model library includes a slide deck that describes the design process, starting from an existing 32-bit MCU, all the way to the final SHA 256 architecture.

Some of the architecture design steps taken:

  1. Starting with an initial 32-bit MCU
  2. Removing unneeded hardware elements, such as hardware multipliers
  3. Introducing a special-purpose functional unit that implements the most time-consuming computations, especially the Compression Step present in the SHA256 transform function. This step contains many simple operations that are inefficient to execute sequentially in software but inexpensive to implement in hardware
  4. Adding zero-overhead loop capabilities, automatically selected by the compiler
  5. Adding post-increment addressing mode
  6. Reorganizing memory into separate program memory, data memory, and a dedicated K-memory for the keys, which can be implemented as a 32-bit-wide ROM.
    This allows for adding 3-way instruction-level parallelism (ILP) that performs two loads and one arithmetic operation in parallel. ILP can be encoded in 32-bit instructions, so there is limited impact on the required program memory 
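To make the compression step concrete, here is a compact Python reference of SHA-256 following the public standard. It is a software sketch for illustration, not the processor model or its C code; all function names are chosen for this example. The table `K` holds the 64 round constants of the standard, which is presumably what the K-memory stores.

```python
import hashlib
import math

MASK = 0xFFFFFFFF


def _primes(count):
    found, candidate = [], 2
    while len(found) < count:
        if all(candidate % p for p in found):
            found.append(candidate)
        candidate += 1
    return found


def _icbrt(n):
    """Floor of the cube root of n, by binary search on integers."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


PRIMES = _primes(64)
# Round constants: first 32 fractional bits of the cube roots of the first 64 primes.
K = [_icbrt(p << 96) & MASK for p in PRIMES]
# Initial state: first 32 fractional bits of the square roots of the first 8 primes.
H0 = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def compression_step(state, k, w):
    """One of the 64 rounds: only rotates, XORs, ANDs and 32-bit additions."""
    a, b, c, d, e, f, g, h = state
    s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
    ch = (e & f) ^ (~e & g)
    t1 = (h + s1 + ch + k + w) & MASK
    s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
    maj = (a & b) ^ (a & c) ^ (b & c)
    t2 = (s0 + maj) & MASK
    return [(t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g]


def transform(state, block):
    """Process one 64-byte block: message schedule plus 64 compression steps."""
    w = [int.from_bytes(block[i:i + 4], "big") for i in range(0, 64, 4)]
    for t in range(16, 64):
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
    work = list(state)
    for t in range(64):
        # Each round reads one constant K[t] and one schedule word w[t].
        work = compression_step(work, K[t], w[t])
    return [(s + v) & MASK for s, v in zip(state, work)]


def sha256(message):
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    padded += (8 * len(message)).to_bytes(8, "big")
    state = list(H0)
    for offset in range(0, len(padded), 64):
        state = transform(state, padded[offset:offset + 64])
    return b"".join(v.to_bytes(4, "big") for v in state)


# Cross-check against the standard library implementation.
assert sha256(b"abc") == hashlib.sha256(b"abc").digest()
```

Each call to `compression_step` is a short chain of cheap bit operations, which is why executing it instruction by instruction is slow while a dedicated functional unit is small. The round loop also reads one value from `K` and one from `w` per iteration, which fits the pattern of two loads in parallel with one arithmetic operation, although how the model actually schedules these accesses is an assumption here.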

 

Figure 2: SHA256 transform function

The example illustrates the typical tradeoff analysis between performance (here measured as throughput in cycles/byte) and the required area. Such architectural exploration is at the heart of almost any ASIP design. Fundamental to this approach is the immediate availability of a cycle-accurate instruction-set simulator, an optimizing C/C++ compiler, and the ability to go to synthesis. ASIP Designer is the ideal tool for this task.

Figure 3: Architectural exploration of SHA256 processor

What’s New in ASIP Designer?

2018.03 Release Update 

Since the October 2017 newsletter, ASIP Designer has again seen a number of enhancements and extensions. The following is an extract, sorted by categories (please refer to the official Release Notes for the comprehensive list).

Example Models 

  • New Tgauss model demonstrating how to implement an image processing kernel (see also the Technology Feature section above)
  • All example models featuring the LLVM compiler frontend have been updated to LLVM 6.0  

Processor Modeling 

  • Defining general load/store intrinsic functions (where guarded load/store functions are a special case), with improved LLVM support (no longer worst-case points-to information)

Simultaneous Hardware / Software Debugging

  • New support for Synopsys’ Verdi Hardware-Software Debug solution, which enables embedded software-driven SoC verification by providing a synchronized multi-window view of the design’s hardware and software behavior. It combines an instruction-accurate, fully synchronized view of RTL, C, and assembly, providing a comprehensive SoC debug solution
  • Eclipse plugins have been upgraded to support Neon and Oxygen 

Figure 4: Hardware/Software Debugging using Verdi

C/C++ Compiler 

  • LLVM version updated from 5.0 to 6.0, making C++14 the default C++ version
  • New load/store widening optimization on processors supporting unaligned dual-word memory access, to increase the memory bandwidth of word accesses
  • Improved results for control benchmarks in the LLVM flow, resulting in up to 4.8 CoreMarks/MHz for selected example cores 

RTL Generation and Synthesis Support, FPGA Prototyping 

  • New GO RTL generation options to configure the generated HDL regarding multiplexer implementation, decoding structure, guarding of pipeline registers, and reducing Spyglass warnings
  • Support for a new PDG assert statement, including auto-insertion to check index boundaries on bit slices and vector elements 

Labs, Tutorials and Documentation 

  • New Eclipse IDE manual, included in the regular manuals directory
  • The BRIDGE linker manual has been reworked and includes the archiving functionality
  • The GO manual has a new chapter on Verdi Hardware-Software Debug 

Additional Resources

Customer References 

Cognitive Systems is a startup that developed an innovative security system, based on wireless signal analysis. The application required a chipset that covers a wide spectrum between 650 MHz and 4 GHz, supporting a large variety of wireless standards. Read why Cognitive Systems decided on an ASIP, and how they managed to design a complex SIMD/VLIW DSP in less than 12 months, with a small team.

 

White Papers

  • Software Development Kits (SDKs) for Proprietary Processors – Why They Matter, What It Takes to Develop Them

    In order to develop a proprietary processor that can stand the test of time, a highly functional SDK must be developed. The complexity, cost and duration of SDK development vary depending on the architecture of the processor and the skillset of the SDK developers. In this paper, we analyze the requirements for an SDK. We then introduce a tool-based methodology for SDK development based on Synopsys’ ASIP Designer tool suite.

  • Rapid Architectural Exploration in Designing Application-Specific Processors

    Architectural exploration is at the heart of any ASIP design approach. Designers need to rapidly explore the impact of different architectural choices on power consumption and performance, ideally using real-world application C-code as part of the design flow. This white paper explains the architectural tradeoffs that are available to an ASIP designer, how to trade off performance vs. area, and why an ASIP design can still maintain full C-programmability while being optimized for a certain application domain.

  • Designing ASIPs in Multicore SoCs

    Modern SoCs integrate dozens of complex system functions, each requiring its own optimal balance of performance, flexibility, energy consumption, communication, and design time. The traditional model of a (configurable) general-purpose processor core with a number of fixed hardware accelerators no longer suffices. ASIPs can offer the best balance for each system function, and thus form the basis of new generations of multicore SoCs.