Synopsys

ASIP eUpdate, October 2023


ASIP Designer

ASIP Designer is Synopsys’ solution for efficiently designing and implementing your own application-specific instruction-set processor (ASIP) when you can’t find suitable processor IP, or when hardware implementations require more flexibility.

This bi-annual newsletter provides you with easy access to ASIP-related resources. This issue includes the following topics:


Technology Feature: Example Processor Models

Designers can choose from an extensive library of example processor models provided as nML source code. In combination with ASIP Designer, these models can be used as a starting point for architectural exploration and customer-specific production designs.

A new Example Processor Models web page is now available. It provides a concise overview of the example processor models shipped with ASIP Designer and of their features.

Tsec: An ASIP for Post-Quantum Cryptography

This section describes a new example processor model introduced with the 2023.06 release of ASIP Designer. It is called “Tsec” and implements an accelerator for post-quantum cryptography.

Kyber, the first standardized key-encapsulation mechanism designed to withstand attacks by powerful future quantum computers, is computationally very demanding, in part because of its extensive use of hashing. The Tsec example is an ASIP optimized for accelerating Kyber. It evolved from a RISC-V base model, to which we added custom application-specific instructions as well as architectural specializations that go beyond simple RISC-V extension mechanisms, such as heterogeneous storage.

The underlying base model is Trv32p5x, a previously existing example processor model with a RISC-V scalar instruction set (RV32IM) and 5 pipeline stages, enhanced with DSP-type extensions including:

- A zero-overhead looping mechanism that efficiently implements loops iterating over arrays

- Load and store instructions with a post-modify addressing mode, which update pointers without instruction overhead

- 2-way instruction-level parallelism to support the simultaneous execution of a compute operation and a memory access

Using the rich profiling capabilities of ASIP Designer, an open-source software implementation of the Kyber algorithm was simulated and profiled on the baseline model. Two main computational kernels were identified as the dominating bottlenecks: modular finite-field operations such as “Montgomery reduction” and “Barrett reduction”, and a hashing mechanism called “Keccak state permutation”.

The Montgomery and Barrett reduction functions could be accelerated by fusing them into single instructions. These fused instructions operate just like a custom scalar ALU instruction on the central register file X. 
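To make the two reduction kernels concrete, here is a minimal Python sketch modeled on the arithmetic found in public Kyber reference code. It is not taken from the Tsec model. The modulus q = 3329 and the constant 62209 (q^-1 mod 2^16) are Kyber parameters; the helper names and the emulation of 16-bit wrap-around are assumptions made for this illustration. Each function body corresponds to the chain of multiplies, shifts and subtractions that a fused instruction would collapse into a single operation.

```python
Q = 3329        # Kyber modulus
QINV = 62209    # q^-1 mod 2^16

def to_int16(x):
    """Emulate wrap-around to a signed 16-bit value (illustrative helper)."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def montgomery_reduce(a):
    """Return r with r * 2^16 = a (mod q) and -q < r < q, for |a| < q * 2^15."""
    t = to_int16(a * QINV)      # low 16 bits of a * q^-1, read as signed
    return (a - t * Q) >> 16    # a - t*q is an exact multiple of 2^16

def barrett_reduce(a):
    """Return the centered representative of a mod q for a 16-bit signed a."""
    v = ((1 << 26) + Q // 2) // Q       # rounded 2^26 / q
    t = (v * a + (1 << 25)) >> 26       # rounded a / q
    return a - t * Q

if __name__ == "__main__":
    print(montgomery_reduce(1 << 16), barrett_reduce(Q + 5))
```

In software each reduction costs several dependent scalar instructions, which is why both are natural candidates for fusion.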


Figure 1: Arithmetic unit for Montgomery reduction

Figure 1 depicts the custom hardware resource needed for a single fused instruction performing Montgomery reduction. The resource for Barrett reduction looks very similar, so the two were merged and are shared between the instructions. Furthermore, multiple instances of the Barrett reduction block, along with adders and finite-field multipliers, were combined into a larger butterfly-like hardware block, depicted in Figure 2, which is triggered by even more specialized single instructions.


Figure 2: Custom butterfly unit with Barrett reduction logic and finite-field multipliers
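The exact operation mix of the Tsec butterfly block is not spelled out here, so the following Python sketch only illustrates the kind of computation such a block fuses. It assumes a Cooley-Tukey butterfly as used in public Kyber reference code, with a Montgomery-based finite-field multiplication and a Barrett reduction on each output. All function names, including `to_mont`, are hypothetical.

```python
Q = 3329        # Kyber modulus
QINV = 62209    # q^-1 mod 2^16

def to_int16(x):
    """Emulate wrap-around to a signed 16-bit value (illustrative helper)."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def montgomery_reduce(a):
    """Return r with r * 2^16 = a (mod q)."""
    t = to_int16(a * QINV)
    return (a - t * Q) >> 16

def barrett_reduce(a):
    """Return the centered representative of a mod q."""
    v = ((1 << 26) + Q // 2) // Q
    t = (v * a + (1 << 25)) >> 26
    return a - t * Q

def fqmul(a, b):
    """Finite-field multiply; one operand is expected in Montgomery form."""
    return montgomery_reduce(a * b)

def to_mont(x):
    """Convert x to Montgomery form x * 2^16 mod q (hypothetical helper)."""
    return (x << 16) % Q

def butterfly(a, b, zeta_mont):
    """Assumed butterfly: returns (a + zeta*b, a - zeta*b), both reduced mod q."""
    t = fqmul(zeta_mont, b)
    return barrett_reduce(a + t), barrett_reduce(a - t)

if __name__ == "__main__":
    print(butterfly(5, 7, to_mont(17)))
```

With a = 5, b = 7 and zeta = 17, the two outputs are congruent to 5 + 119 and 5 - 119 modulo q, so the call returns (124, -114).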

The debugger snapshot in Figure 3 shows how the specialized butterfly instructions are utilized by the compiler in the innermost loop of the number-theoretic transform (NTT) function.


Figure 3: Software-pipelined NTT function

The innermost loop is implemented as a hardware loop (zlp). Its six-instruction loop body is software-pipelined and combines butterfly instructions, finite-field multiplications, and additions, with memory accesses scheduled in parallel.
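For readers unfamiliar with the NTT, the sketch below shows the loop nest whose innermost loop is the one discussed above. It follows the loop structure of public Kyber reference code, but uses plain modular arithmetic in Python instead of Montgomery and Barrett reductions to stay short. The names `brv7`, `ZETAS` and `ntt` are chosen for this illustration, and 17 is the primitive 256th root of unity modulo q = 3329 used by Kyber.

```python
Q = 3329     # Kyber modulus
ZETA = 17    # primitive 256th root of unity mod q

def brv7(i):
    """Reverse the 7-bit binary representation of i."""
    return int(format(i, "07b")[::-1], 2)

# Twiddle factors in bit-reversed order, as consumed by the loop nest below.
ZETAS = [pow(ZETA, brv7(k), Q) for k in range(128)]

def ntt(f):
    """Forward NTT of a 256-coefficient polynomial (seven butterfly layers)."""
    r = list(f)
    k = 1
    length = 128
    while length >= 2:
        for start in range(0, 256, 2 * length):
            zeta = ZETAS[k]
            k += 1
            # Innermost loop: one butterfly per iteration.
            for j in range(start, start + length):
                t = zeta * r[j + length] % Q
                r[j + length] = (r[j] - t) % Q
                r[j] = (r[j] + t) % Q
        length //= 2
    return r

if __name__ == "__main__":
    print(ntt([1] + [0] * 255)[:4])
```

Each iteration of the innermost loop reads two coefficients, performs one multiply, one add and one subtract, and writes two coefficients back, which is the pattern that benefits from a hardware loop with parallel memory accesses.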

For the Keccak permutation function, the situation is a bit more complicated. The bit-level logic operations of the hashing mechanism can still be fused into one big logic cloud. The interface of the function, however, takes an entire array of 25 64-bit state variables as an argument, which results in extensive load/store traffic on the general-purpose register file. The general-purpose register file of the baseline processor (32 x 32-bit) is simply not big enough to hold 25 64-bit values simultaneously, and it would be too expensive to add the number of parallel ports required by the Keccak operation.

Instead, we created a dedicated register file “S” with 25 fields of 64 bits, and with dedicated 64-bit load/store access to the data memory. In addition, each register field has a direct port to the Keccak logic, which can thus access all 25 fields in parallel, as depicted in Figure 4.


Figure 4: Keccak Unit with dedicated register file
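The following Python sketch shows the Keccak-f[1600] state permutation on a flat array of 25 64-bit lanes, the same amount of state that the S register file holds. It follows the structure of the publicly available compact FIPS 202 reference code. The function names and the flat lane index x + 5*y are choices made for this illustration, and nothing here describes how the Tsec logic is partitioned into instructions. The helper `sha3_256_empty_message` exists only to allow a check against Python's `hashlib`.

```python
import hashlib

MASK64 = (1 << 64) - 1

def rol64(a, n):
    """Rotate a 64-bit lane left by n bits."""
    n %= 64
    return ((a << n) | (a >> (64 - n))) & MASK64

def keccak_f1600(state):
    """Apply the 24-round permutation to 25 lanes stored at index x + 5*y."""
    a = [[state[x + 5 * y] for y in range(5)] for x in range(5)]
    lfsr = 1
    for _ in range(24):
        # theta: mix column parities into every lane
        c = [a[x][0] ^ a[x][1] ^ a[x][2] ^ a[x][3] ^ a[x][4] for x in range(5)]
        d = [c[(x + 4) % 5] ^ rol64(c[(x + 1) % 5], 1) for x in range(5)]
        a = [[a[x][y] ^ d[x] for y in range(5)] for x in range(5)]
        # rho and pi: rotate each lane and move it to a new position
        x, y = 1, 0
        current = a[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, a[x][y] = a[x][y], rol64(current, (t + 1) * (t + 2) // 2)
        # chi: non-linear step along each row
        for y in range(5):
            row = [a[x][y] for x in range(5)]
            for x in range(5):
                a[x][y] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
        # iota: add the round constant, generated by an 8-bit LFSR
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                a[0][0] ^= 1 << ((1 << j) - 1)
    return [a[x][y] for y in range(5) for x in range(5)]

def sha3_256_empty_message():
    """SHA3-256 of the empty message, built on keccak_f1600 (rate 136 bytes)."""
    state = [0] * 25
    state[0] ^= 0x06           # domain-separation bits and first padding bit
    state[16] ^= 0x80 << 56    # final padding bit in byte 135
    state = keccak_f1600(state)
    return b"".join(lane.to_bytes(8, "little") for lane in state[:4])

if __name__ == "__main__":
    print(sha3_256_empty_message() == hashlib.sha3_256(b"").digest())
```

Every round reads and writes all 25 lanes, which illustrates why a register file with only a few ports becomes the bottleneck for this kernel.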

The debugger snapshot in Figure 5 shows how the compiler schedules a single-cycle instruction triggering the Keccak logic, embedded in a single-instruction hardware loop, which is surrounded by memory load/store instructions to the special S register file.


Figure 5: Single-cycle Keccak instruction scheduled in a single-instruction hardware loop

Figure 6 is a screenshot of the nML viewer, a utility for graphically inspecting the hierarchy of the instruction set. It shows how the custom Keccak instruction and the special finite-field instructions (grouped under “kyber_instrs”) are integrated in both the single-issue 32-bit instruction format and the parallel dual-issue 64-bit instruction format.


Figure 6: Graphical view of the Tsec instruction set (partially expanded)

The new Tsec example model illustrates how ASIP Designer can be used to extend a RISC-V baseline architecture for higher performance. The specialization for the Keccak state permutation and the reduction functions results in an 8.3x speed-up of the Kyber algorithm compared to the original RISC-V baseline implementation with DSP extensions, at a moderate gate-count increase by a factor of 1.8.

What’s New: ASIP Designer U-2023.06 Release

Since the last edition of this newsletter, we launched a new feature release of ASIP Designer in June 2023, providing various enhancements and extensions. The following is an extract, sorted by category (customers can refer to the official Release Notes for a comprehensive list).


Example Processor Models

In the 2023.06 release the following updates were made to the library of example processor models: 

  • A new example model “Tsec” has been added, demonstrating an accelerator for the Kyber post-quantum cryptography algorithm. More details can be found in the Technology Feature section in this eUpdate.
  • A new educational example model “Matmul” has been added, included in a workshop that demonstrates the successive extension of a RISC-V baseline model with vector SIMD instructions to accelerate matrix multiplication.
  • The “Trv” model family (RISC-V processors) has been updated:
    • CSR instructions and interrupt support have been added to all 32-bit variants.
    • All floating-point variants now implement the Zfinx ISA, which means that the floating-point instructions access the general-purpose register file X. There is no longer a separate floating-point register file F.
  • All models of the “Tvec” model family (generic SIMD processors) have been unified and cleaned up.

Processor Modeling

  • Enhanced hierarchical instantiation of I/O modules inside other I/O modules and I/O interfaces.
  • Generalized bit selection in nML image attributes, for more convenient modeling of split encodings.

C/C++ Compiler

ASIP Designer comes with a unique and patented compiler solution, with the compiler automatically retargeting itself to the processor architecture. This eliminates any need for compiler backend customization by the user. Release 2023.06 offers the following enhancements: 

  • Support for multiple software stacks in the LLVM-based front-end.
  • Under the hood, a new “node-based” list scheduling algorithm has been introduced in the compiler. It assigns operations to time steps in a more balanced way than the original list scheduling algorithm. The new algorithm better exploits the following features of advanced ASIP architectures: negative dependency lengths (exposed pipeline), delay slots, and combinations of ASAP and ALAP preferences.
  • The LLVM-based front-end has been updated to the more recent LLVM version 16.0.
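As background for the scheduling item above, the sketch below shows classic cycle-by-cycle list scheduling in Python. It is a generic textbook formulation, not the “node-based” algorithm of the release and not ASIP Designer code. The function name, the data layout, and the two resource classes (“alu” and “mem”, loosely mirroring a dual-issue machine with one compute slot and one memory slot) are all assumptions made for illustration.

```python
def list_schedule(ops, deps, unit, latency, slots):
    """Assign a start cycle to every operation.

    ops     : operation names in priority order
    deps    : op -> list of operations that must complete first
    unit    : op -> resource class the op occupies for one cycle
    latency : op -> cycles until the result is available (at least 1)
    slots   : resource class -> number of units available per cycle
    """
    start = {}
    remaining = list(ops)
    cycle = 0
    while remaining:
        free = dict(slots)  # resources are fully available again each cycle
        for op in list(remaining):
            ready = all(
                p in start and start[p] + latency[p] <= cycle
                for p in deps.get(op, [])
            )
            if ready and free[unit[op]] > 0:
                start[op] = cycle
                free[unit[op]] -= 1
                remaining.remove(op)
        cycle += 1
    return start

# Hypothetical five-operation kernel plus one independent increment.
OPS = ["ld_a", "ld_b", "mul", "add", "st", "inc"]
DEPS = {"mul": ["ld_a", "ld_b"], "add": ["mul"], "st": ["add"]}
UNIT = {"ld_a": "mem", "ld_b": "mem", "st": "mem",
        "mul": "alu", "add": "alu", "inc": "alu"}
LATENCY = {op: 1 for op in OPS}
SLOTS = {"alu": 1, "mem": 1}

if __name__ == "__main__":
    print(list_schedule(OPS, DEPS, UNIT, LATENCY, SLOTS))
```

In this toy example the independent `inc` operation fills the otherwise idle compute slot in cycle 0. A greedy scheduler of this kind always places operations as early as possible, which is the sort of behavior that more balanced algorithms aim to improve on.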

 

ChessDE GUI, Instruction-Set Simulation and Debugging

  • The language server support, which was introduced to the ChessDE editor in Release 2022.12, has been enhanced and extended with additional functionality.
  • Support for parallel compilation of batch projects. Different subsidiary projects of a batch project can now be compiled in parallel.
  • Integration of pretty-printing in the GDB debug flow.

RTL Generation, Verification, and Synthesis Support

  • Basic support for partitioned synthesis.
  • Support for unit testing of PDG primitive functions, using annotations in the PDG code.

Additional Resources

Events and Webinars

White Papers & Articles

Customer References