Synopsys

Under the Hood of ASIP Designer - Application-Specific Processor Design Made Possible by Tool Automation

Gert Goossens, Sr. Director of Engineering, Synopsys

Introduction

The continuous evolution towards smarter connected systems triggers a need for heterogeneous System-on-Chip (SoC) architectures. Microprocessor cores are complemented with application-specific instruction-set processors (ASIPs) that can execute high workloads with low power consumption. Unlike traditional hardwired accelerators, ASIPs remain software-programmable, which provides the flexibility to make late functional changes. Embedded software development can start before the ASIP architecture is frozen.

Efficient ASIP architectures exploit instruction-level and data parallelism, instruction pipelining, and specialization of functional units and storage elements, in ways that are tailored to the application domain. Semiconductor and system companies often design their own differentiating ASIPs, tuned to their specific requirements. This is made possible by tool automation.

Synopsys’ ASIP Designer is the most advanced software tool-suite for the design and programming of ASIPs. In this article we look under the hood of ASIP Designer, and provide some insight into how advanced modeling and tool automation technology enable a unique methodology for architectural exploration, concurrent hardware and software development, and optimization for best power, performance and area (PPA).

Modeling and Optimizing ASIP Architectures with Retargetable Tools

ASIP Designer automatically supports any processor architecture modeled by the user in the nML language. nML captures the instruction-set architecture (ISA) and the microarchitecture of the processor. nML can be viewed as a hardware description language augmented with dedicated constructs to describe processor features concisely. The language eases modeling of a processor architecture and enables quick iterations to tune the architecture to an application domain.

Figure 1: ASIP Designer tool flow

ISA Design with the Compiler-in-the-Loop

The nML model serves as an input to ASIP Designer’s retargetable software development kit (SDK), which will automatically target the user-defined processor architecture (step 1 in Figure 1). The SDK consists of an optimizing C and C++ compiler that exploits the ISA defined in nML, a cycle-accurate instruction-set simulator (ISS), a graphical debugger and profiler.

The instantaneous availability of an SDK with an efficient C/C++ compiler is the cornerstone of ASIP Designer’s Compiler-in-the-Loop design methodology. Designers start with a baseline nML model, either defined by themselves or picked from a rich set of example models that come with the tool. They then compile selected C or C++ benchmark programs that are representative of the target application domain and run the compiled code in the ISS. The ISS generates profiling information, identifying critical code segments where most cycles are spent and hot spots in the utilization of processor resources. From the profile reports, designers can identify opportunities for tuning the ISA. These changes can be applied quickly in the nML model (step 2 in Figure 1), and subsequently validated in the next iteration with the retargetable SDK. Multiple design iterations can be made, gradually extending the set of benchmark applications. Significant tuning of the ISA is possible within days.

ISA and Microarchitecture Optimization with Synthesis-in-the-Loop

At any moment, ASIP Designer’s RTL generator can convert the nML model into a synthesizable RTL model in the languages Verilog or VHDL (step 3 in Figure 1). The RTL generator implements advanced optimizations for good timing performance and low power dissipation. The instantaneous availability of an RTL model is the cornerstone of ASIP Designer’s Synthesis-in-the-Loop methodology. By synthesizing the RTL model, designers can quickly and accurately evaluate the PPA of the design for the target process technology and clock frequency. This may reveal opportunities to optimize the ASIP’s ISA or microarchitecture. For example, an area reduction may be achieved by reducing the number of registers or register ports, or timing closure may require adding extra stages to the instruction pipeline to break the critical path. With every change, the SDK and its compiler are automatically updated as well. Therefore, the SDK and the RTL model always stay in sync, and nML constitutes the golden reference model of the design.

ASIP Designer offers full interoperability with Synopsys’ Design Compiler (DC), Fusion Compiler (FC) and RTL Architect. RTL Architect is of interest during design iterations, since it can predict PPA quickly and accurately, and it supports the parallel exploration of configuration options of ASIP Designer’s RTL generator. Design Compiler or Fusion Compiler are used for the actual physical implementation of the ASIP.

Under the Hood: Processor Modeling

Figure 2 shows a simple ASIP architecture, composed of its datapath, program control unit (PCU) and peripheral interfaces. The code box to the left shows a snippet of the ASIP’s nML model. 

Figure 2: Example processor architecture, with snippets of its nML and PDG descriptions

An nML description has two parts:

  • The resource definition.  Resource types include memories, registers, register files, pipeline registers, transitories (i.e. wires connecting other resources) and functional units. All resources are named. Resources other than functional units have a user-definable datatype and in most cases a dimension.

  • The instruction-set grammar. This part defines a hierarchical decomposition of the instruction-set into instruction classes.  At each level of the hierarchy, an operation rule (with keyword opn) composes an instruction class from lower-level rules. One type of composition is made by or-rules (e.g. my_core() in Figure 2), specifying that instructions exclusively belong to one of the lower-level classes. Another type of composition is made by and-rules, which introduce instruction-level parallelism (e.g. arith_inst()). Or-rules and and-rules can be mixed throughout the instruction-set hierarchy.
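One way to read the two composition types is as a counting rule: an or-rule offers exclusive alternatives, so the sizes of its sub-classes add up, while an and-rule combines one choice from each sub-class in parallel, so the sizes multiply. The Python sketch below illustrates this reading only; it is not nML. The rule names my_core, arith_inst and alu_inst are borrowed from Figure 2, whereas load_store_inst, control_inst, the way the rules are nested and all class sizes are hypothetical.

```python
from math import prod

def count_instructions(rule):
    """Count the distinct instructions described by a rule tree.

    A rule is a (kind, body) pair:
      ("leaf", n)        lowest-level class with n instructions
      ("or",  [rules])   exclusive alternatives  -> sizes add up
      ("and", [rules])   parallel composition    -> sizes multiply
    """
    kind, body = rule
    if kind == "leaf":
        return body
    sizes = [count_instructions(sub) for sub in body]
    return sum(sizes) if kind == "or" else prod(sizes)

# Hypothetical class sizes, chosen only for illustration.
alu_inst = ("leaf", 3)          # e.g. add, and, or
load_store_inst = ("leaf", 2)   # hypothetical class
control_inst = ("leaf", 4)      # hypothetical class

arith_inst = ("and", [alu_inst, load_store_inst])   # two parallel slots
my_core = ("or", [arith_inst, control_inst])        # exclusive classes

total = count_instructions(my_core)   # 3 * 2 + 4
```

The multiplication in the and-rule shows why instruction-level parallelism grows the instruction set quickly, and why a hierarchical grammar is a compact way to describe it.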

Rules representing the lowest classes of the instruction-set hierarchy can have up to three different attributes:

  • The action attribute (keyword action) defines the concurrent register-transfer behavior of all instructions in that class. Register-transfer statements can call primitive functions, which are implemented as combinational logic in functional units (e.g. add(), and() and or() in instruction class alu_inst, all implemented in the alu functional unit). Register transfer operations are grouped into stages of the processor’s instruction pipeline (keyword stage).

  • The syntax attribute (keyword syntax) defines the instructions’ assembly language.

  • The image attribute (keyword image) defines the instructions’ binary encoding.
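To make the three attributes concrete, the following Python sketch models a single toy instruction class with an action, a syntax and an image. It is an illustration rather than nML code: the 11-bit encoding, the opcode values, the 16-bit datapath width and the 8-entry register file are all assumptions made for this example.

```python
from dataclasses import dataclass

# Hypothetical opcodes and 16-bit primitive functions (illustrative only).
OPCODES = {"add": 0b00, "and": 0b01, "or": 0b10}
PRIMITIVES = {
    "add": lambda a, b: (a + b) & 0xFFFF,  # wrap around at 16 bits
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
}

@dataclass(frozen=True)
class AluInst:
    op: str    # one of the keys of OPCODES
    rd: int    # destination register index, 0..7
    rs1: int   # first source register index, 0..7
    rs2: int   # second source register index, 0..7

    def action(self, regs):
        """Register-transfer behavior: regs[rd] = op(regs[rs1], regs[rs2])."""
        regs[self.rd] = PRIMITIVES[self.op](regs[self.rs1], regs[self.rs2])

    def syntax(self):
        """Assembly text of the instruction."""
        return f"{self.op} r{self.rd}, r{self.rs1}, r{self.rs2}"

    def image(self):
        """Binary encoding: opcode (2 bits), rd, rs1, rs2 (3 bits each)."""
        return (OPCODES[self.op] << 9) | (self.rd << 6) | (self.rs1 << 3) | self.rs2

regs = [0] * 8
regs[1], regs[2] = 0xFFFF, 2
inst = AluInst("add", rd=3, rs1=1, rs2=2)
inst.action(regs)   # regs[3] now holds the wrapped 16-bit sum
```

Keeping behavior, assembly syntax and encoding together in one class description mirrors the idea that a single model can drive the compiler, the simulator and the RTL generator.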

The behavior of each primitive function is defined in bit-accurate C code. A subset of the C language with support of C99 fixed-width integer datatypes is used, referred to as the PDG language. PDG is also used to describe the behavior of the ASIP’s program control unit, and of its peripheral interfaces. 

Under the Hood: Retargetable Compilation

ASIP Designer’s C/C++ compiler offers full and automatic retargetability to any processor architecture defined in the nML language.

  • 3 specialized functional units, supporting complex fixed-point arithmetic: VU0 (butterfly operations), VU1 (complex multiplications) and VU2 (special radix computation)

  • 2 memories

  • Special-purpose heterogeneous register and interconnect structure

  • Data-level parallelism: vectors of 8 lanes, each representing a complex number

  • Instruction-level parallelism: very-long instruction word (VLIW) with 5 parallel issue slots

  • 4-stage instruction pipeline

Figure 3: PrimeCore ASIP architecture

Optimized code generation for specialized ASIP architectures

To appreciate the compiler’s efficiency, consider the so-called PrimeCore ASIP shown in Figure 3. This is one of the example ASIPs that are made available as nML code with the ASIP Designer tools.  PrimeCore’s architecture was optimized for the efficient execution of FFT and DFT algorithms in wireless baseband applications.

Figure 4: Compiler-generated machine code on PrimeCore

Figure 4 shows a machine code snippet implementing part of a DFT algorithm, as generated by ASIP Designer’s compiler on PrimeCore. The compiler was able to fill nearly all of the 5 issue slots throughout the complete inner loop of the algorithm.

Anatomy of the Retargetable Compiler

Efficient compiler performance with full architectural retargetability requires a special compiler technology. This is what ASIP Designer offers. The internal architecture of the retargetable compiler is shown in Figure 5.

Figure 5: Internal architecture of ASIP Designer's retargetable compiler

The compiler’s application language front-end supports the programming languages C, C++ and OpenCL C. It translates source code into an intermediate representation in the form of a control and data flow graph (CDFG), in which nodes represent operations and edges represent dependencies. Users have a choice between ASIP Designer’s original front-end with powerful optimizations for digital signal processing applications, and a version of the Clang front-end from the open-source compiler project LLVM that was extended to support the broad architectural scope of ASIP architectures.

The compiler’s nML front-end translates the nML processor description into an intermediate representation in the form of an instruction-set graph (ISG). Nodes in the ISG represent primitive functions alternated with storage elements (e.g. memories, registers), and edges represent connectivity. The ISG is closer to a hardware representation than machine models of traditional compilers. It represents all processor resources, data types, connectivity, the instruction encoding, instruction-level parallelism and the instruction pipeline. Note that the ISG representation is also used by ASIP Designer’s RTL generator, where it is a basis for generating an RTL netlist of the processor. In the context of the compiler however, the ISG can be viewed as a superposition of all data-flow patterns that are legal on the processor.

In multiple steps referred to as compilation phases (depicted as blue boxes in the lower-right of Figure 5), the compiler maps the CDFG representation onto the ISG representation. Compilation phases are fired by a code generation engine that can cope with phase coupling, i.e. mutual dependencies between phases are accounted for to generate efficient code.

Every compilation phase implements advanced optimization algorithms, which directly operate on the ISG representation that was derived from the nML model. Unlike traditional compilers there is no need to develop architecture-specific optimization phases. Architecture designers and software engineers can retarget the compiler autonomously, merely by modifying the nML model.

A few key optimization algorithms are introduced next. 

Register Allocation for Heterogeneous Storage and Interconnect

Register allocation is a compilation phase in which the compiler decides which storage locations (registers and memories) to use for the intermediate variables of the application program. To increase performance, many ASIPs have a heterogeneous architecture with dedicated registers or register files that are locally connected to inputs or outputs of functional units. The ISA design may also constrain the register choices to reduce the opcode space required by instructions, e.g., an instruction may require that its destination register always equals one of its source registers.

ASIP Designer’s compiler uses a graph-based data-routing algorithm to solve the register allocation problem for such heterogeneous architectures. The algorithm applies path-search techniques in the ISG model of the processor to determine alternative ways to route intermediate variables from the functional units where they are produced to the ones where they are consumed. Variables are allocated to the storage locations on those paths. These path-search techniques are embedded in a branch-and-bound search method to select efficient solutions. 
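The path-search idea can be sketched on a toy connectivity graph. The Python example below enumerates routes from a producing functional unit to a consuming one, shortest first, and reports the storage elements on each route. It is a simplified illustration, not ASIP Designer's algorithm: the graph, the unit and register names (alu, mul, R, P, ACC) are hypothetical, and the branch-and-bound selection among routes is left out.

```python
from collections import deque

# Hypothetical connectivity: functional units alternate with storage elements.
STORAGE = {"R", "P", "ACC"}
EDGES = {
    "mul": ["P"],          # multiplier writes only the product register P
    "alu": ["R", "ACC"],   # ALU writes register file R or accumulator ACC
    "P": ["alu"],          # P can be read only by the ALU
    "R": ["alu", "mul"],   # R feeds both units
    "ACC": ["alu"],        # ACC feeds only the ALU
}

def routes(producer, consumer, max_nodes=7):
    """Breadth-first enumeration of routes producer -> consumer.

    Routes are returned shortest first; a node is never revisited,
    except that the consumer may be the same unit as the producer.
    """
    found = []
    queue = deque([[producer]])
    while queue:
        path = queue.popleft()
        for nxt in EDGES[path[-1]]:
            if nxt == consumer:
                found.append(path + [nxt])
            elif nxt not in path and len(path) + 1 < max_nodes:
                queue.append(path + [nxt])
    return found

def storage_on(path):
    """Storage locations a variable would occupy along a route."""
    return [node for node in path if node in STORAGE]

# A product that must be multiplied again has to pass through the ALU.
mul_to_mul = routes("mul", "mul")
alu_to_alu = routes("alu", "alu")
```

In this toy graph a value produced by mul and consumed by mul can only travel via P, a pass-through in the ALU, and R, which shows how a heterogeneous interconnect turns register allocation into a routing problem.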

Aggressive Scheduling for Parallel Architectures

In the scheduling phase, the compiler decides in which time step each individual operation will be executed. If the processor supports instruction-level parallelism, the scheduler is expected to fill the parallel issue slots as densely as possible (see Figure 4 above). The scheduler must respect all data dependencies (i.e., variables can only be consumed after they have been produced) as well as anti-dependencies (i.e., variables cannot be overwritten before they have been consumed).
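As a minimal illustration of filling issue slots under data dependencies, the Python sketch below implements a greedy list scheduler. It rests on strong simplifying assumptions that do not hold for a real ASIP: every operation takes one cycle, any operation fits any slot, the dependency graph is acyclic, and anti-dependencies are ignored. The operation names are hypothetical.

```python
def list_schedule(deps, slots):
    """Greedy list scheduling.

    deps maps each operation to the set of operations it depends on.
    Returns a dict mapping each operation to its issue cycle, placing
    at most `slots` operations per cycle. Assumes an acyclic graph.
    """
    cycle_of = {}
    remaining = dict(deps)
    cycle = 0
    while remaining:
        # Ready: all predecessors were issued in an earlier cycle.
        ready = [op for op, pre in remaining.items()
                 if all(p in cycle_of for p in pre)]
        for op in ready[:slots]:
            cycle_of[op] = cycle
            del remaining[op]
        cycle += 1
    return cycle_of

# Hypothetical data-flow: f = store(load_a * load_b + load_d)
deps = {
    "load_a": set(),
    "load_b": set(),
    "mul": {"load_a", "load_b"},
    "load_d": set(),
    "add": {"mul", "load_d"},
    "store": {"add"},
}

two_slots = list_schedule(deps, slots=2)
one_slot = list_schedule(deps, slots=1)
length_two = max(two_slots.values()) + 1
length_one = max(one_slot.values()) + 1
```

With two slots the schedule takes four cycles instead of six, and the remaining length is set by the dependency chain from the loads through mul and add to store rather than by the number of slots.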

Many traditional compilers do not expose the detailed instruction pipeline to the scheduler, so their schedules are not always the most compact. ASIP Designer supports an aggressive scheduling mode, in which the instruction pipeline is fully exposed. The scheduler knows exactly in which pipeline stage each variable is produced or consumed, which allows more instructions to be scheduled in parallel.

Table 1 compares the cycle count obtained with standard and aggressive scheduling of FFT algorithms of different sizes on the PrimeCore ASIP with its 4-stage instruction pipeline, showing an average cycle count gain of 22%. Higher gains can be obtained on ASIPs with deeper pipelines.

Table 1: Cycle count comparison for standard and aggressive scheduling of FFT algorithms on the PrimeCore ASIP

 

FFT Size    Cycle Count Standard    Cycle Count Aggressive    Gain
8           5                       5                         0%
16          16                      14                        12%
32          25                      20                        20%
64          42                      29                        31%
128         75                      52                        31%
256         248                     180                       27%
512         446                     328                       26%
1024        878                     654                       25%
2048        1700                    1288                      24%
4096        4412                    3585                      19%
Average                                                       22%
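The Gain column can be read as the relative cycle-count reduction, (standard - aggressive) / standard. The short Python check below recomputes it from the cycle counts in Table 1; the formula is an assumption about how the column was derived, although it reproduces the reported average. Because the table entries are rounded, a recomputed row can differ from the printed value by up to one percentage point.

```python
# (FFT size, standard cycle count, aggressive cycle count) from Table 1.
TABLE_1 = [
    (8, 5, 5), (16, 16, 14), (32, 25, 20), (64, 42, 29), (128, 75, 52),
    (256, 248, 180), (512, 446, 328), (1024, 878, 654),
    (2048, 1700, 1288), (4096, 4412, 3585),
]

def gain_percent(standard, aggressive):
    """Relative cycle-count reduction in percent."""
    return 100.0 * (standard - aggressive) / standard

gains = {size: gain_percent(std, agg) for size, std, agg in TABLE_1}
average_gain = sum(gains.values()) / len(gains)

for size, gain in gains.items():
    print(f"{size:>5}: {gain:5.1f}%")
print(f"average: {average_gain:.1f}%")
```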

While executing aggressively scheduled code, the processor cannot accept interrupts. The designer can, however, restrict the application of aggressive scheduling to selected sections in the application source code. ASIP Designer will automatically mask interrupts during the execution of such sections.

Software Pipelining for Parallel Architectures

Software pipelining is a code transformation in the scheduling phase, for applications that contain for-loops. The compiler will try to move operations from one loop iteration to a subsequent one, where they can fill issue slots that would otherwise remain unused.

Figure 6 shows a machine code fragment generated by ASIP Designer’s compiler for the so-called MMSE ASIP (named after the minimum mean-square error algorithm). MMSE, which is shipped as an example model with the tools, was optimized for the efficient execution of channel equalization functions in wireless communication. It offers instruction-level parallelism with four issue slots. The shown function has two nested for-loops, as indicated by the angular arrows. The compiler has applied software pipelining to the inner for-loop, resulting in a compact schedule of two cycles per loop iteration. The compiler inserted prolog and epilog code before and after the loop, to ensure correct initialization and termination of the software pipeline.
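The structure of a software-pipelined loop, with its prolog, kernel and epilog, can be shown on a toy loop written in Python. This sketch only illustrates the transformation at source level: the two-stage loop body (a load followed by a compute) and the arithmetic are hypothetical and unrelated to the MMSE code in Figure 6. On a parallel machine, the load and the compute in the kernel would occupy different issue slots of the same instruction.

```python
def plain_loop(xs):
    """Reference loop: each iteration loads, then computes."""
    out = []
    for x in xs:
        loaded = x                    # stage 1: load
        out.append(loaded * 2 + 1)    # stage 2: compute and store
    return out

def software_pipelined_loop(xs):
    """Same loop with the load moved one iteration ahead."""
    out = []
    if not xs:
        return out
    loaded = xs[0]                    # prolog: load for iteration 0
    for i in range(1, len(xs)):       # kernel: overlap two iterations
        next_loaded = xs[i]           #   load for iteration i
        out.append(loaded * 2 + 1)    #   compute for iteration i - 1
        loaded = next_loaded
    out.append(loaded * 2 + 1)        # epilog: compute for the last iteration
    return out

result = software_pipelined_loop([1, 2, 3])
```

The prolog fills the pipeline and the epilog drains it, so the transformed loop produces the same results as the original while its kernel works on two iterations at once.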

Figure 6: Software-pipelined schedule of an inner for-loop on the MMSE ASIP

Many compilers only apply software pipelining to the innermost loop of each loop nest in the application code. ASIP Designer has an option to also apply software pipelining at higher levels of loop nests. This is often beneficial because, as illustrated by the triangular shapes in Figure 6, the issue-slot utilization of a loop’s prolog and epilog is typically complementary. Moving epilog code to the beginning of the next iteration of the higher-level loop therefore raises the issue-slot utilization in the higher-level loop body and further reduces the cycle count. This comes at the expense of larger code size, because the compiler must now insert extensive prolog and epilog code for the higher-level loop as well.

Conclusion

ASIPs play an important role in the design of SoCs for smart applications. They are used as accelerators to offload microprocessor cores. Unlike traditional hardwired accelerators, ASIPs are software-programmable, resulting in pre- and post-silicon flexibility.

In this article we described how tool automation brings the design of ASIPs within easy reach of chip designers. Synopsys’ ASIP Designer tool suite addresses all aspects of the design cycle, based on its Compiler-in-the-Loop and Synthesis-in-the-Loop methodologies. We provided insight into some of the advanced technologies under the hood of ASIP Designer, focusing on processor modeling and retargetable compilation. Relying on Synopsys’ expertise in ASIP tools, semiconductor and system companies can focus on designing their own differentiating ASIPs, tuned to their product requirements.

For more information on ASIP Designer, visit