By Angela Raucher, Product Line Manager, ARC EM Processors
Many IoT applications will operate from small batteries or even harvested energy for at least a portion of the time, and thus have a very strict energy consumption budget. System-on-Chip (SoC) designers targeting the IoT market face the challenge of delivering the growing set of features required by the market while maintaining the low power demanded by the application.
Often the system architect requests an upgrade to an application processor level of performance for executing advanced system functions while maintaining the power profile of an 8-bit microcontroller-based system. Think of it as delivering the brain of a smartphone for the power of a wind-up toy. The ability to configure a processor to achieve these seemingly conflicting goals is critical. This article describes techniques and options to reduce system power through processor selection and configuration.
IoT devices are defined by their ability to take in or “sense” real-world signals, perform operations on the associated data, and communicate results over a network, whether the internet or a local network. Most general-purpose RISC processors can process the signals successfully, but dedicated DSPs can perform these tasks with better power efficiency and lower latency. On the other hand, RISC processors are well suited for transferring data and setting up communication channels. Using separate, independent processors is an option, but it adds cost and board space to the system and requires multiple development and debug environments and tools. This complexity and cost can be reduced by using a single processor core that provides both functions.
Key features such as voice triggering, voice control, speech playback, and inertial sensor processing, which are needed in always-on and low-power environments, leverage DSP instructions to perform tasks such as filtering, Fast Fourier Transform (FFT), and interpolation while still meeting energy goals.
The DesignWare® ARC® EMxD family of processors meets these challenges by adding a DSP engine with ARCv2DSP instruction set architecture (ISA) to ARC configurable processor cores, enabling RISC and digital signal processing within a single unified architecture (Figure 1). They offer low power consumption and can perform speech detection for voice control in less than 1 µW.
The ARC EM DSP processors are highly configurable so that each instance can be tailored to achieve the optimum balance of DSP and RISC performance for the target application as well as power- and area-efficiency. For example, the ARC EM5D and EM7D are well suited for applications requiring around 50% DSP processing and the EM9D and EM11D, with support for XY memory, are ideal for more DSP intensive applications. ARC Processor EXtension (APEX) technology also offers designers the ability to create user-defined instructions, enabling the integration of custom hardware accelerators that improve application-specific performance while reducing energy consumption and the amount of memory required.
Figure 1: ARC EMxD Block Diagram
Code used to implement a typical DSP multiply-accumulate (MAC) operation in a RISC + DSP processor consists of loading data from memory, followed by performing the MAC operation on the operands. This architecture can achieve a maximum throughput of one-third of a MAC operation per cycle, because the instruction sequence consists of two data moves through load instructions followed by the MAC operation, as shown in Figure 2.
Figure 2: DSP MAC Operation in a RISC + DSP Architecture
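The one-in-three figure can be reproduced with a small cycle-counting model. The sketch below is illustrative only: it is written in Python rather than ARC assembly, the function name mac_load_load_mac is hypothetical, and it assumes each instruction takes exactly one cycle with no loop overhead or stalls.

```python
def mac_load_load_mac(x, h):
    """Dot product modelled as load, load, MAC per sample.

    Returns (accumulator, cycles). One cycle per instruction is an
    assumption of this sketch, not a measured figure.
    """
    acc = 0
    cycles = 0
    for i in range(len(x)):
        a = x[i]        # first data move (load instruction)
        cycles += 1
        b = h[i]        # second data move (load instruction)
        cycles += 1
        acc += a * b    # the MAC operation itself
        cycles += 1
    return acc, cycles


samples = [1, 2, 3, 4]
coeffs = [5, 6, 7, 8]
acc, cycles = mac_load_load_mac(samples, coeffs)
throughput = len(samples) / cycles   # MAC operations per cycle
```

With four samples the model spends twelve cycles, so only every third cycle performs useful arithmetic.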
DSP applications needing higher throughput can be supported by the addition of an XY memory system. An XY memory-based system typically consists of multiple memory banks and automated address generation units (AGUs) with pointers and update registers. The AGUs are built into the instruction pipeline, and allow a single instruction to execute three data moves, a MAC operation and three address pointer updates. Multiple address pointer update modes can be supported. In this way, using an XY memory-based system architecture, an effective throughput of one MAC-operation per cycle can be achieved for a significant performance boost (Figure 3). An XY memory system also reduces code size, as there is no need for separate load and increment instructions.
Figure 3: DSP MAC Operation in a RISC + DSP Architecture with XY Memory
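A rough way to picture the address generation units is as pointers that advance on every access. The sketch below is a Python model under stated assumptions, not the ARC hardware interface: the AGU class, its step parameter, and the mac_xy function are hypothetical names, the third data move is assumed to be a store of the running accumulator, and each fused step is assumed to cost one cycle.

```python
class AGU:
    """Hypothetical address generation unit: a pointer plus an update value."""

    def __init__(self, memory, start=0, step=1):
        self.memory = memory
        self.ptr = start
        self.step = step

    def read(self):
        value = self.memory[self.ptr]
        self.ptr += self.step   # pointer update folded into the access
        return value

    def write(self, value):
        self.memory[self.ptr] = value
        self.ptr += self.step   # pointer update folded into the access


def mac_xy(x_agu, y_agu, out_agu, count):
    """Each loop step models one instruction: three data moves,
    one MAC and three pointer updates (assumed to take one cycle)."""
    acc = 0
    cycles = 0
    for _ in range(count):
        acc += x_agu.read() * y_agu.read()   # two operand moves + MAC
        out_agu.write(acc)                   # third data move (assumed store)
        cycles += 1
    return acc, cycles


x_bank = [1, 2, 3, 4]
y_bank = [5, 6, 7, 8]
out_bank = [0, 0, 0, 0]
acc, cycles = mac_xy(AGU(x_bank), AGU(y_bank), AGU(out_bank), 4)
```

No separate load or increment steps appear in the loop body, which mirrors the code size reduction described above: the same four MAC operations finish in four modelled cycles instead of twelve.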
Aside from increased throughput and code size reduction, an often overlooked advantage is lower energy consumption. As shown in Figure 4, for DSP functions the energy efficiency can improve significantly with the use of XY memory (EM9D), as fewer clock cycles are needed for the same functions, especially when they are tailored to a RISC + DSP architecture that allows concurrent accesses for both RISC and DSP.
Figure 4: Comparison of Energy with and without XY Memory as DSP Needs Increase
The increasing demand for performance and processing capabilities in IoT applications is driving a shift from tightly coupled embedded systems based on 8-bit microcontrollers towards bus-based embedded systems built around 32-bit processors. This shift increases the power and area of the system, which conflicts with another key requirement of IoT products: they must become smaller and cheaper as they achieve mass adoption. Tightly coupled extensions to 32-bit embedded processor systems can meet all of these system goals simultaneously by removing the less efficient bus infrastructure. The processor can access memories and peripheral registers directly, reducing latency and the required clock frequency, which in turn reduces the amount of energy required to perform the same function.
An example of such a reduction is shown in Figure 5, which compares a bus-based processor subsystem to a tightly coupled system processing sensor data. The processor core accesses the auxiliary registers in one cycle instead of a minimum of four cycles for the peripheral registers in a bus-based system.
Figure 5: Energy Savings for Processing Sensor Data in a Tightly Coupled System
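The effect of the access latency on the required clock frequency can be estimated with simple arithmetic. In the sketch below, the one-cycle and four-cycle figures come from the text; the number of register accesses per sample and the sample rate are hypothetical placeholders, and the model covers only register access cycles, not the rest of the processing.

```python
AUX_ACCESS_CYCLES = 1   # tightly coupled auxiliary register access (from the text)
BUS_ACCESS_CYCLES = 4   # minimum for a bus-based peripheral register (from the text)


def access_cycles(num_accesses, cycles_per_access):
    """Cycles spent purely on peripheral register accesses."""
    return num_accesses * cycles_per_access


def required_clock_hz(cycles_per_sample, sample_rate_hz):
    """Lowest clock that fits the given cycles into one sample period."""
    return cycles_per_sample * sample_rate_hz


accesses_per_sample = 6   # hypothetical sensor driver
sample_rate_hz = 100      # hypothetical sensor output rate

bus_cycles = access_cycles(accesses_per_sample, BUS_ACCESS_CYCLES)
tight_cycles = access_cycles(accesses_per_sample, AUX_ACCESS_CYCLES)
bus_clock = required_clock_hz(bus_cycles, sample_rate_hz)
tight_clock = required_clock_hz(tight_cycles, sample_rate_hz)
```

Because four cycles is a minimum for the bus-based case, the fourfold difference in this model is a lower bound on the saving for the register access portion of the workload.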
Another option to reduce power in a processor system is to use direct memory access (DMA), which enables the peripherals to move data without involvement of the CPU. To ensure an area-efficient system, DMA has to be highly optimized for the processor and application. Combining DMA with multibank memory saves even more energy as the internal DMA moves data in and out of XY memory without impacting the processor pipeline.
Synopsys’ µDMA option for the ARC EM family of processors is designed with IoT applications in mind, and includes only the features needed for this type of embedded system. The µDMA controller enables lower power operation by offering the option to put the EM core to sleep while the µDMA moves data around the chip from peripheral to memory or memory to memory, only waking up the core when it’s needed. Multiple sleep modes allow customization for the lowest possible power.
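The benefit of sleeping the core during transfers can be sketched as a simple energy budget. Every number below is a hypothetical placeholder chosen to make the arithmetic easy to follow, not a measurement of any ARC EM configuration; the function names are likewise invented for illustration. Power is in microwatts and time in milliseconds, so energy comes out in nanojoules.

```python
def energy_cpu_copy(active_uw, transfer_ms, process_ms):
    """Core stays active while it moves the data itself, then processes it."""
    return active_uw * (transfer_ms + process_ms)


def energy_dma_sleep(active_uw, sleep_uw, dma_uw, transfer_ms, process_ms):
    """Core sleeps while the DMA controller moves the data, then wakes to process."""
    transfer_energy = (sleep_uw + dma_uw) * transfer_ms
    process_energy = active_uw * process_ms
    return transfer_energy + process_energy


# Hypothetical placeholder values (uW and ms), not vendor data.
ACTIVE_UW, SLEEP_UW, DMA_UW = 100, 5, 10
TRANSFER_MS, PROCESS_MS = 8, 2

without_dma_nj = energy_cpu_copy(ACTIVE_UW, TRANSFER_MS, PROCESS_MS)
with_dma_nj = energy_dma_sleep(ACTIVE_UW, SLEEP_UW, DMA_UW,
                               TRANSFER_MS, PROCESS_MS)
```

The model shows why the saving grows with the share of time spent moving data: only the processing interval is billed at the active power level, while the transfer interval is billed at the much lower combined sleep and DMA power.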
As mentioned, requirements for IoT applications continue to expand, and security is among the most important. Yet security algorithms add complexity to a system that is already tight on power and area budget. Processors that accelerate security algorithms, reducing the number of clock cycles needed to achieve the same functionality, will save power. The same is true of any frequently used function that the system needs: the more often it is used, the more power can be saved by executing it more efficiently.
The ARC EM processor family uses ARC Processor EXtension (APEX) technology to allow SoC designers to simplify and automate the process of designing and verifying extensions for common functions like cryptographic software algorithms, or customer-specific code so that these frequently used algorithms take less time, memory, and energy to execute.
Figure 6: Energy and Cycle Count Reduction Running Sensor Application Software with APEX Acceleration
When designing a chip for IoT applications, designers often have concerns about trading off energy consumption for performance to meet ever-evolving feature requirements. Designers can make architectural choices that deliver the performance required without sacrificing energy efficiency. Flexibility and configurability are key factors when selecting processor architecture, along with the ability to scale to meet the needs of evolving applications.
The ARC EM family of processors offers scalability and options that can future-proof product roadmaps, with the flexibility needed to find an optimum performance-to-power ratio. The ability to customize the processor with APEX technology also helps designers differentiate their products in the competitive IoT market.