Michael Thompson, Sr. Product Marketing Manager, 草榴社区
There is a transformation occurring in high-end embedded applications as processing spreads from the cloud to the edge and the endpoints of the Internet. Performance requirements are rising rapidly, changing both processor architectures and how processors are implemented in designs. This is behind the increased use of multicore processors to deliver higher performance. Most high-end processors today are available in dual- and quad-core configurations, and a few support up to eight CPU cores, but even this won’t deliver the performance needed for emerging applications in storage, automotive, networking, and 5G. Next-generation embedded applications need scalable support for larger CPU clusters and specialized hardware accelerators to deliver the required performance. Large multicore processors will require a new architectural approach that enables higher performance without creating additional implementation and timing closure problems for embedded designers.
It is no secret that advanced process nodes no longer deliver the higher clock speeds and lower power consumption that they once did. For many process generations, logic speeds have continued to increase but memory access times have not (Figure 1). The speed-limiting paths in processors almost always run through memory, and this is unlikely to change in future process nodes because of the very real limitations of semiconductor physics.
Figure 1: Embedded Memory Performance Gap
At the same time, maximum clock speeds in embedded applications have topped out in the 1 GHz to 2 GHz range (Figure 2). To be sure, a few exceptions are clocked above 2 GHz, but for most applications this isn’t possible: there are limits on power consumption and area, both of which increase rapidly with clock speed. Most embedded designs are clocked at less than 1 GHz, and this won’t change as we move into the future. Increasing performance by increasing clock speed is not a realistic approach for most embedded designs.
The challenge is that the performance requirements for embedded applications continue to increase, driven by competition, the addition of new features, and evolving requirements within application spaces. For example, SSD capacities are growing rapidly in response to demand for more storage and higher access speeds, and computational storage and artificial intelligence are being added to extend drive life and improve data access and performance. All of these raise the performance requirements for SSD controllers and the processors used to implement them.
Figure 2: Embedded Processor Speeds Top Out at 2 GHz
Many methods have been used to increase processor performance. Increasing the number of pipeline stages has been used for years to deal with the limitations in memory speeds. For example, the DesignWare® ARC® HS processors’ 10-stage pipeline with two-cycle memory access can be clocked at 1.8 GHz (worst case) in 16FFC processes. Because there are limits to how fast embedded designs are clocked, adding more stages to a processor’s pipeline is of limited benefit. This may change in the future, but currently a 10-stage pipeline is optimal for embedded designs.
Superscalar implementation is a good tradeoff in terms of performance gain versus increased area and power. Moving from a single-issue architecture to dual-issue can increase RISC performance by as much as 40% with limited increases in area and power, a worthwhile tradeoff for an embedded processor. Moving to triple- or quadruple-issue further increases area and power but yields smaller performance gains. Performance at any cost is never the goal of an embedded processor.
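As a rough illustration, consider a loop whose body contains two independent dependency chains. A dual-issue pipeline can pair instructions from the two chains in the same cycle, while a single-issue core must serialize them. The C sketch below is illustrative only; whether the operations actually pair depends on the compiler and the microarchitecture.

```c
#include <stddef.h>

/* Illustrative only: a loop body with two independent dependency chains.
 * A dual-issue pipeline can issue the work on 'a' alongside the work on
 * 'b' in the same cycle; a single-issue core must serialize them. */
void scale_and_sum(const int *a, const int *b, int *out_a, int *out_b,
                   int ka, int kb, size_t n)
{
    int acc_a = 0, acc_b = 0;
    for (size_t i = 0; i < n; i++) {
        acc_a += a[i] * ka;   /* chain 1: independent of chain 2 */
        acc_b += b[i] * kb;   /* chain 2: can issue alongside chain 1 */
    }
    *out_a = acc_a;
    *out_b = acc_b;
}
```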
Adding out-of-order (OoO) execution can increase performance for embedded applications without increasing clock speeds. Typically, full OoO support is overkill for embedded applications; a limited approach gives an optimal performance increase without blowing up the size of the processor. Limited OoO is commonly used in high-end embedded processors.
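The benefit of limited OoO is easiest to see around long-latency loads. In the sketch below, a pointer-chasing load may miss in cache; an in-order core stalls on the dependent add, while an OoO core can keep the independent checksum work executing under the miss. The code is a minimal illustration, not drawn from any particular workload.

```c
#include <stddef.h>

/* Illustrative only: 'head->next' may miss in cache. An in-order core
 * stalls on the dependent use, while a (limited) out-of-order core can
 * keep the independent 'checksum' chain executing under the miss. */
typedef struct node { struct node *next; unsigned value; } node_t;

unsigned walk_and_checksum(const node_t *head, const unsigned *buf, size_t n)
{
    unsigned sum = 0, checksum = 0;
    size_t i = 0;
    while (head) {
        sum += head->value;       /* dependent on the pointer-chasing load */
        if (i < n)
            checksum ^= buf[i++]; /* independent work OoO can overlap */
        head = head->next;
    }
    return sum ^ checksum;
}
```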
Caching is used to bring memory closer to the processor, thus increasing performance. Cache has single-cycle access for the processor, and the performance improvement comes from information already being in cache when it is needed. Frequently used code and data are kept in Level 1 (L1) cache; less frequently used code and data are kept in slower Level 2 (L2) cache or external memory and accessed as needed. For multicore processors, maintaining coherency between the L1 data caches also improves performance. L1 caching and coherency are common in embedded processors, while L2 (and Level 3) caches are used only for higher-end applications.
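How much caching pays off depends heavily on access patterns. The two functions below compute the same sum over a matrix: the row-major version walks memory sequentially and uses every fetched cache line fully, while the column-major version strides across lines and thrashes L1. The matrix size is an arbitrary choice for the example, picked to exceed a typical L1 capacity.

```c
#define N 512

/* Cache-friendly: walks memory sequentially, so each cache line
 * fetched into L1 is fully used before eviction. */
long sum_row_major(const int m[N][N])
{
    long sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += m[i][j];
    return sum;
}

/* Cache-hostile: strides N * sizeof(int) bytes per access, touching a
 * new cache line almost every iteration and thrashing L1. */
long sum_col_major(const int m[N][N])
{
    long sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += m[i][j];
    return sum;
}
```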
Embedded designs are seeing increased use of multiple processors. A few years ago, a typical system-on-chip (SoC) had one or two processors. Today, more than five processors is common even for low-end designs, and the number is increasing. To support this, processors for mid-range and high-end embedded applications offer multicore implementations, with two, four, and eight CPU cores available. Running Linux or another operating system enables programmers to get smooth operation across the CPU cores while balancing execution to increase performance.
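As a simple illustration of this style of multicore programming under Linux, the sketch below splits a summation into one POSIX thread per online core and lets the scheduler balance the threads across the CPUs. The chunking scheme and the 64-thread cap are arbitrary choices for the example.

```c
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

typedef struct { const int *data; size_t len; long sum; } chunk_t;

static void *sum_chunk(void *arg)
{
    chunk_t *c = arg;
    long s = 0;
    for (size_t i = 0; i < c->len; i++)
        s += c->data[i];
    c->sum = s;
    return NULL;
}

/* Split the work into one chunk per online core; the Linux scheduler
 * load-balances the threads across the CPU cores. */
long parallel_sum(const int *data, size_t n)
{
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    if (cores < 1) cores = 1;
    if (cores > 64) cores = 64;

    pthread_t tid[64];
    chunk_t chunk[64];
    size_t per = n / (size_t)cores;
    long total = 0;

    for (long t = 0; t < cores; t++) {
        chunk[t].data = data + (size_t)t * per;
        chunk[t].len  = (t == cores - 1) ? n - (size_t)t * per : per;
        pthread_create(&tid[t], NULL, sum_chunk, &chunk[t]);
    }
    for (long t = 0; t < cores; t++) {
        pthread_join(tid[t], NULL);
        total += chunk[t].sum;
    }
    return total;
}
```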
The use of hardware accelerators is increasing in embedded designs. They offer high performance with minimal power and area while offloading the processor. The main drawback of hardware accelerators is that they are not programmable; adding accelerators to work in conjunction with a processor can mitigate this. Unfortunately, existing processors have limited or no capabilities to support hardware accelerators. Some processors, like the ARC processors, support custom instructions that enable the user to add hardware to the processor pipeline. While custom instructions are attractive, hardware accelerators offer additional benefits and, when used in conjunction with a processor, can deliver a significant performance improvement.
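To make the contrast concrete, the sketch below compares a plain C dot product with a version built around a custom instruction exposed to software as a compiler intrinsic. The intrinsic name `_my_dmac` and its semantics are hypothetical, invented for this illustration; they are not the actual APEX or MetaWare API.

```c
#include <stdint.h>
#include <stddef.h>

/* Baseline software implementation of a 16-bit dot product. */
int32_t dot_sw(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}

/* Hypothetical intrinsic mapping to a custom dual multiply-accumulate
 * instruction added to the pipeline. The name and signature are
 * illustrative only, not an actual APEX or MetaWare interface. */
extern int32_t _my_dmac(int32_t acc, uint32_t packed_a, uint32_t packed_b);

/* Same dot product over pairs of 16-bit values packed into 32-bit words:
 * two MACs retire per custom instruction. */
int32_t dot_custom(const uint32_t *a2, const uint32_t *b2, size_t pairs)
{
    int32_t acc = 0;
    for (size_t i = 0; i < pairs; i++)
        acc = _my_dmac(acc, a2[i], b2[i]);
    return acc;
}
```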
There are challenges to increasing processor performance for embedded applications. Deeper pipelines, superscalar implementations, and OoO execution help, but they can only go so far, and caching is already prolific, as is coherency, so further gains there are unlikely. A path to higher performance that embedded designers are already pursuing is to implement more CPU cores and hardware accelerators in their designs.
Next-generation processors will add support for large multicore implementations and hardware acceleration (Figure 3). Processor vendors must do more than just add interfaces to their existing processors. Processors supporting four or eight CPU cores are already hitting maximum frequency limits and can have significant issues with timing closure. Adding more cores will just make this worse. Next-generation processors must start with a full re-architecting of the internal processor interconnect to facilitate timing closure, address speed limitations, and increase internal bandwidth. The bandwidth of the external interfaces must also be increased to support the movement of data into and out of the processor.
Figure 3: Next-Generation Embedded Processor Architecture
Quality of Service (QoS) has been implemented extensively in network-on-chip (NoC) interconnects but has seen limited implementation in multicore processors. This will change in next-generation processors, giving programmers the ability to manage the internal bandwidth to each CPU core and accelerator to maximize performance. This is application dependent, and while QoS will not be needed for every design, it will be essential in many others to ensure predictable performance.
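What this could look like to software is sketched below: a set of memory-mapped weight registers that apportion interconnect bandwidth among the cores. The register addresses, fields, and weighting scheme are entirely hypothetical, invented for illustration rather than drawn from any shipping programming model.

```c
#include <stdint.h>

/* Hypothetical memory-mapped QoS registers: the base address, layout,
 * and weighting scheme are invented for illustration and do not
 * describe any actual processor's programming model. */
#define QOS_BASE            0xF0001000u
#define QOS_CORE_WEIGHT(n)  (*(volatile uint32_t *)(QOS_BASE + 4u * (n)))

/* Give a latency-critical core a larger share of internal interconnect
 * bandwidth than the best-effort cores. */
static void qos_configure(void)
{
    QOS_CORE_WEIGHT(0) = 8;  /* real-time control loop: high priority */
    QOS_CORE_WEIGHT(1) = 2;  /* background processing */
    QOS_CORE_WEIGHT(2) = 2;  /* background processing */
    QOS_CORE_WEIGHT(3) = 1;  /* housekeeping */
}
```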
Large multicore processors have advantages over clusters of smaller multicore processors. Implementing a processor with 12 CPU cores, as opposed to three processor clusters of four CPU cores each, reduces latency between the CPU cores and enables direct support for snooping across the cores. Another advantage is better software scaling. A 12-core processor offers programmers greater flexibility in how software is partitioned, and the number of cores assigned to a task can be allocated dynamically depending on the performance needed. With multiple processor clusters, it is more difficult to get this level of software performance control because of the lack of uniform access between the CPU cores.
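Under Linux, this kind of dynamic allocation can be expressed with CPU affinity. The sketch below restricts a thread to a contiguous set of cores, a set that can be grown or shrunk at run time as a task's performance needs change; the helper function and its partitioning policy are assumptions made for the example.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Restrict the calling thread to cores [first, first + count). With
 * uniform access across a single 12-core cluster, a task's core set can
 * be grown or shrunk at run time as its performance needs change. */
static int pin_to_cores(int first, int count)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c = first; c < first + count; c++)
        CPU_SET(c, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```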
Large multicore processors will also benefit from tight coupling to hardware accelerators. Moving the hardware accelerator interfaces inside the processor, instead of connecting to accelerators over an SoC bus, reduces latencies and traffic on SoC buses while increasing data sharing and system performance. This can also make programmable control over the accelerators more efficient if shared user registers are implemented.
草榴社区’ next-generation DesignWare ARC HS5x and ARC HS6x Processor IP take advantage of many of the methods described previously to increase processor performance. They are built with a high-speed 10-stage, dual-issue pipeline that offers increased utilization of functional units with a limited increase in power and area. The ARC 64-bit HS6x processors feature a full 64-bit pipeline and register file and supports 64-bit virtual and 52-bit physical address spaces to enable direct addressing of current and future large memories, as well as 128-bit loads and stores for efficient data movement (Figure 4).
Figure 4: DesignWare ARC HS5x/HS6x Processor IP Block Diagram
Multicore versions of both the 32-bit ARC HS5x and 64-bit ARC HS6x processors include an advanced, high-bandwidth internal interconnect that links up to 12 CPU cores, supports interfaces for up to 16 user hardware accelerators, and delivers up to 800 GB/s of aggregate internal bandwidth. To ease timing closure, each core can reside in its own power domain and have an asynchronous clock relationship with the other cores. Like all DesignWare ARC processors, the HS5x and HS6x processors are highly configurable and implement ARC Processor EXtension (APEX) technology, which enables support for custom instructions to meet the unique performance, power, and area requirements of embedded applications.
To accelerate software development, the ARC HS5x and HS6x processors are supported by the ARC MetaWare Development Toolkit that generates highly efficient code. Open-source tool support for the processors includes the Zephyr real-time operating system, an optimized Linux kernel, the GNU Compiler Collection (GCC), GNU Debugger (GDB), and the associated GNU programming utilities (binutils).
The performance requirements for embedded applications will continue to increase, and the processors used in these applications must increase in performance as well. This is challenging because area and power are limited, and the easy processor performance gains have already been made. Advanced process nodes no longer deliver the gains they once did, embedded processor clock speeds are limited, superscalar and OoO capabilities are already common in high-end processors, and 64-bit support, though necessary, offers a limited increase in performance. A new generation of multicore processors with support for more than eight CPU cores and internally connected hardware accelerators is needed. New processors like the DesignWare ARC HS5x and HS6x Processor IP will deliver scalable performance and capabilities while enabling designers to manage the power and area requirements of their embedded applications. Built on an advanced architecture and implemented with a high-speed internal interconnect, these processors address the performance needs of today’s high-end embedded applications while leaving plenty of headroom for future designs.