Synopsys

ASIP eUpdate, November 2019

ASIP Designer

Synopsys’ solution to efficiently design and implement your own application-specific instruction-set processor (ASIP) when you can’t find suitable processor IP, or when hardware implementations require more flexibility.

This bi-annual newsletter gives you easy access to ASIP-related resources. This issue includes the following topics:

Technology Feature: Variable Length Instruction Encoding

Instruction-level parallelism (ILP) - the parallel dispatching of a few distinct operations - is a proven architectural feature for increasing the performance of processor architectures. It enables the simultaneous activation of distinct hardware resources such as computational units and load/store units. Compared to superscalar architectures, static ILP moves the complexity of composing parallel instructions from the hardware to the compiler, resulting in much simpler dispatch logic. ILP requires that all of the parallel operations, or ILP slots, are encoded in a single instruction. This may lead to wide instructions composed of multiple slots.

The following code was generated by the compiler for the Tvliw example processor, which is part of the ASIP Designer release package. Tvliw supports two arithmetic (D) slots and two load/store (M) slots. The instructions highlighted in green form a loop body. For these instructions, the four slots are used in the most efficient way. Note that in the code before and after the loop, not all slots can be filled with useful operations. If the processor supports only instructions of the maximal width, a NOP must be encoded for each of these unused slots. This is a waste of program memory.


We can reduce the code size of the program by adding shorter instruction formats to the ISA.  On Tvliw, the variable length mechanism is implemented by adding a continue bit to each VLIW slot.  When this bit is set, the next slot will still be part of the current instruction, while a zero bit indicates this is the final slot of the current instruction.  A four-slot instruction DDMM is therefore encoded as follows:

Whereas a two-slot instruction DM is encoded as


Due to the short formats, the above code snippet can be compressed into 22 slots instead of the original 32 slots.
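The effect of the continue bit on code size can be illustrated with a minimal Python sketch. It is not the Tvliw encoding itself: the tuple representation, the function names `encode_fixed` and `encode_variable`, and the three-instruction program are assumptions made for illustration, and the sketch ignores how the slot type (D or M) is encoded.

```python
# Hypothetical model: an instruction is a 4-tuple in DDMM slot order,
# with None marking a slot the compiler could not fill.

def encode_fixed(program):
    """Fixed-width encoding: every unused slot becomes an explicit NOP."""
    return [op if op is not None else "nop"
            for instr in program for op in instr]

def encode_variable(program):
    """Variable-length encoding: keep only the used slots.

    Each emitted slot is a (continue_bit, op) pair. The bit is 1 when
    another slot of the same instruction follows and 0 on the final slot.
    """
    stream = []
    for instr in program:
        # An instruction with no operations still occupies one slot.
        used = [op for op in instr if op is not None] or ["nop"]
        for i, op in enumerate(used):
            cont = 1 if i < len(used) - 1 else 0
            stream.append((cont, op))
    return stream

program = [
    ("add", "sub", "ld", "st"),   # DDMM: all four slots used
    ("add", None, "ld", None),    # DM
    (None, None, "ld", None),     # M
]
fixed = encode_fixed(program)
variable = encode_variable(program)
```

For this small program, the fixed-width encoding needs 12 slots, while the variable-length encoding needs 7, with continue bits 1, 1, 1, 0 for the DDMM instruction, 1, 0 for DM, and 0 for the single M slot.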

On Tvliw, there is one maximal instruction (DDMM) but there are 7 shorter formats (DDM, DMM, DD, DM, MM, D and M). While it is possible to capture the complete ISA (the maximal width instruction and the shorter formats) in the main nML model, the presence of many variable length instruction formats has a few disadvantages:

- It makes the internal processor model complex and slows down the compiler

- It results in a large instruction decoder

The 2019.09 release of ASIP Designer introduces the concept of nML format rules: in the main nML model, only the full VLIW format is modeled. The shorter formats are modeled as a secondary part of the ISA, in so-called nML format rules. This technique significantly reduces the complexity of the internal processor model. It also speeds up the compiler, because the compaction or “NOP compression” is done as a separate pass at the end of the compilation process, and only this pass needs to be aware of the short formats.

Another advantage is that the instruction decoder is simplified, as it now only needs to decode the DDMM format. When short format instructions are fetched from the program memory, they go through a pre-decoding step where they are converted to the DDMM format. For example, a standalone ld r6, [p1+=m0] is converted into:


The pre-decoder function is generated automatically by ASIP Designer.   
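The pre-decoding idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the PDG function that ASIP Designer generates: the `(continue_bit, kind, op)` slot representation, the function name `predecode`, and the choice to place a lone operation in the first free slot of its kind are all hypothetical.

```python
# Hypothetical model: the fetched stream is a list of
# (continue_bit, kind, op) slots, where kind is "D" or "M".

def predecode(stream, pos=0):
    """Expand the short-format instruction starting at stream[pos]
    into the full DDMM format, padding unused slots with NOPs.

    Returns (wide_instruction, next_pos).
    """
    ops = {"D": [], "M": []}
    while True:
        cont, kind, op = stream[pos]
        pos += 1
        ops[kind].append(op)
        if cont == 0:          # a zero continue bit ends the instruction
            break
    # Assumption: operations fill the first free slot of their kind.
    d = ops["D"] + ["nop"] * (2 - len(ops["D"]))
    m = ops["M"] + ["nop"] * (2 - len(ops["M"]))
    return d + m, pos

stream = [
    (0, "M", "ld r6, [p1+=m0]"),               # standalone M instruction
    (1, "D", "add"), (0, "M", "st"),           # DM instruction
]
first, nxt = predecode(stream, 0)
second, end = predecode(stream, nxt)
```

Under these assumptions, the standalone load expands to `nop, nop, ld r6, [p1+=m0], nop`, so the decoder behind the pre-decoder only ever sees four-slot instructions.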

A known concern related to VLIW compaction is that of unaligned jump targets. ASIP Designer addresses this concern through the concept of elongation. Typically, the processor fetches larger instruction packets from program memory, at aligned addresses. In the diagram below on the left, instruction C is the target instruction of a jump. It is a three-slot instruction that happens to be spread over two fetch packets (at addresses 104 and 108). Before C can be issued, two fetches are required, and this results in a stall cycle. To avoid the stall, the compiler elongates the previous instruction by encoding a redundant NOP slot, so that the jump target is located at an aligned address, or at least is contained within one fetch packet.
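A rough Python sketch of elongation follows. It is not the compiler's actual algorithm. It assumes that addresses count slots, that a fetch packet holds four slots, that no instruction may exceed four slots, and that a jump target is simply left unaligned when the preceding instruction has no room for padding. The function name `elongate` and its parameters are hypothetical.

```python
def elongate(lengths, targets, packet=4, max_width=4):
    """Pad the predecessor of each straddling jump target with NOP slots.

    lengths: instruction lengths in slots, in program order.
    targets: set of instruction indices that are jump targets.
    Returns the new list of lengths.
    """
    out = list(lengths)
    addr = 0  # slot address of the current instruction after padding
    for i, length in enumerate(lengths):
        offset = addr % packet
        straddles = offset + length > packet
        if i in targets and i > 0 and straddles:
            pad = packet - offset  # NOP slots needed to reach the boundary
            if out[i - 1] + pad <= max_width:
                out[i - 1] += pad  # elongate the previous instruction
                addr += pad        # the target now starts a fetch packet
        addr += length
    return out

# Instruction 2 is a three-slot jump target starting at slot 6, so it
# straddles the packets at slots 4 and 8 until its predecessor is padded.
padded = elongate([4, 2, 3, 1], targets={2})
unchanged = elongate([3, 4, 3], targets={2})
```

In the first call, the two-slot predecessor grows to four slots and the target moves to slot 8. In the second call, the predecessor is already four slots wide, so this sketch leaves the target where it is.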


What's New in ASIP Designer?

2019.09 Release Update

In September 2019, we launched the latest release of ASIP Designer, providing a number of enhancements and extensions. The following is an extract, sorted by categories (please refer to the official Release Notes for the comprehensive list).

Example Models

  • MMSE example model: minimum mean square error equalization (MMSE) is another example of a domain-specific accelerator. Among other things, it illustrates the efficient computation of triangular matrices.
  • We have added two more examples that show the integration of existing RTL code into the ASIP Designer design flow. For illustration purposes, both integrate floating-point units:
    • FLX2: IEEE-compliant floating-point operation, using a Verilog implementation from the DesignWare library. Because the operation is IEEE-compliant, the simulator uses the standard C++ built-in floating-point operations.
    • FLX3: non-IEEE-compliant floating-point format (e.g. 24-bit format). Again, it uses a Verilog implementation from the DesignWare library. As the format is no longer IEEE-compliant, the example illustrates the use of Verilator to arrive at a bit-accurate simulation model.
    • Note: another way to model floating-point operations is to describe their behavior in PDG. In this case, both the RTL and the simulation model are generated from the single PDG description. This implementation is available as the FLX1 example model.

Processor Modeling

  • ASIP Designer supports a new concept of format rules in nML to introduce variable length instruction formats in a VLIW instruction set. The nML front end generates both the corresponding instruction truncation rules, used by the compiler and assembler for NOP compression, and a PDG predecode function, to be included in the PCU, to expand the truncated instruction stream back to the wide format.
  • We added the option to enable NMLVIEW in an ASIP Programmer processor package exported from ASIP Designer. This enables you to browse through the instruction formats and the assembly syntax in ASIP Programmer mode.

C/C++ Compiler

  • ASIP Designer’s unique compiler-in-the-loop methodology enables architectural exploration by profiling the actual application code using the C compiler and the cycle-accurate instruction set simulator. Further improving the compiler-in-the-loop approach, ASIP Designer now provides a graphical scheduling report that helps identify why a specific schedule was picked. For inner modulo-scheduled loops, it includes a report on the critical cycles and resources, including register usage.
  • Global pointers can now also be allocated to memory, to support multiple data memories and multiple libraries using position-independent code. This allows a library to be linked independently of the main program and still be loaded at any place in memory.
  • The compiler’s LLVM-based front end now supports synthetic struct members, including restrict pointers.
  • The LLVM-based front end, including the C++ library, and all example models featuring the LLVM-based front end have been updated to the most recent LLVM version, 9.0.

Simulation and Debugging

  • Further speedup for on-chip debugging data block load/store operations. Remembering the last accessed register and memory can accelerate these operations by up to 50%.

RTL Generation, Verification, and Synthesis Support

  • Better support for Spyglass:  Using the option “spyglass_scripts”, the generated Makefile will include support to run the Spyglass tool, and it generates a project file and an ASIP Designer specific waiver file. It uses the default rule set for “rtl_handoff” with the default goal “lint_rtl”, but this can be easily adapted, as described in the RTL generation manual. Example models that originally resulted in Spyglass warnings or errors have been updated accordingly. 

Additional Resources

ASIP Designer Online Training

Online training for ASIP Designer has seen strong adoption by new users. Additional recordings have been added. Register for access to the training modules, which provide a deep dive into the concepts, languages, and files that are used to capture a processor design.

 

ASIP University Day 2019

On September 25th, Synopsys, in cooperation with Lund University, organized the ASIP University Day 2019. Leading university teams presented results from their ongoing ASIP projects in domains such as 5G baseband and AI accelerators. Synopsys presented the latest case studies and in-depth insight into the ASIP Designer technology. Teaching embedded processor design classes, and how to leverage ASIP Designer, was another subject of the day. Check here for the agenda and access to the proceedings.

 

Customer References

“To meet our customer-specific requirements, we are developing specialized processors and programmable accelerators that are fully optimized for performance, power, area, and code size, while offering the required flexibility,” said Thierry Brouste, Manager, Embedded Computing 草榴社区, STMicroelectronics. “Using ASIP Designer as our tool of choice gives us a significant competitive advantage, because it enables us to quickly develop complex and highly differentiated application-specific processors, while maximizing our design team’s efficiency through design automation and architecture exploration.”

RIKEN’s drug discovery molecular simulation platform team applies leading computational technologies on large-scale, high-speed supercomputers, specifically for molecular simulation. These molecular simulators are used to identify drug behavior at the atomic level and help predict which structural formulas make for highly effective and selective drug candidates. Molecular dynamics (MD) simulations are computationally intensive and need petaflops of processing performance. RIKEN recognized that a general-purpose processor would not deliver the required performance, so they decided to develop their own specialized custom processor using Synopsys’ ASIP Designer tool, and integrated 17 instances of the processor in a custom multicore chip.

 

White Papers   

Over the past decade, the trend in SoC design has been to add more functionality into software. There are several reasons for this, including (a) software is easier and faster to fix and update, (b) evolving trends and not-yet fully specified standards require flexibility since the final functionality might not be known at the time the hardware design must be locked down, and (c) the desire to reuse SoCs for different products and derivatives, improving the return on investment (ROI) for a single design. Read the white paper to find out how ASIPs can contribute and what it takes to develop them.

To develop a proprietary processor that can stand the test of time, a highly functional SDK must be developed. The complexity, cost and duration of SDK development vary depending on the architecture of the processor and the skillset of the SDK developers. In this paper, we analyze the requirements for an SDK. We then introduce a tool-based methodology for SDK development based on Synopsys’ ASIP Designer tool suite.

Architectural exploration is at the heart of any ASIP design approach. Designers need to rapidly explore the impact of different architectural choices on power consumption and performance, ideally using real-world application C-code as part of the design flow. This white paper explains the architectural tradeoffs that are available to an ASIP designer, how to trade off performance vs. area, and why an ASIP design can still maintain full C-programmability while being optimized for a certain application domain.

Modern SoCs integrate dozens of complex system functions, each requiring its own optimal balance of performance, flexibility, energy consumption, communication, and design time. The traditional model of a (configurable) general-purpose processor core with a number of fixed hardware accelerators no longer suffices. ASIPs can offer the best balance for each system function, and thus form the basis of new generations of multicore SoCs.

 
