Cloud native EDA tools & pre-optimized hardware platforms
By: Ken Brock, Product Marketing Manager, 草榴社区
A large part of the chip area of deep learning accelerators is dedicated to AI-specific computational functions and large memories. Designers integrating deep learning technologies face performance, power, and area challenges when building ASICs and ASSPs for data centers and artificial intelligence (AI) inference engines at the edge. These challenges can be addressed at the foundation IP and system-on-chip (SoC) levels using practical solutions for designing energy-efficient SoCs. Integrating on-chip foundation IP - memories and logic - that are optimized for AI and deep learning functions can dramatically reduce power and area while meeting the application’s performance requirements.
Deep learning accelerators use neural network models trained in deep learning frameworks, such as TensorFlow or Caffe, as the basis for inference. The top left of Figure 1 shows an initial neural network model consisting of several attributes that the training engine evaluates using large amounts of training data. For example, early deep learning studies at Google and Stanford exploited the abundance of cat faces in web images to develop many of today’s most common algorithms and deep learning practices. Once the neural network models have been trained on large amounts of image data organized as data/label pairs, they enable inference engines to quickly infer targets and take action accordingly, in this case identifying a cat face while scanning new images, as shown on the right.
Figure 1: Neural network models in deep learning framework
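As a rough illustration of this train-then-infer flow, the sketch below uses the TensorFlow Keras API with randomly generated placeholder images and labels standing in for a real “cat / not-cat” dataset; the network shape and hyperparameters are arbitrary assumptions for illustration only.

```python
import numpy as np
import tensorflow as tf

# Placeholder data: random arrays standing in for labeled training images.
images = np.random.rand(1000, 64, 64, 3).astype("float32")
labels = np.random.randint(0, 2, size=(1000,))

# A deliberately small, arbitrary network for the sketch.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(64, 64, 3)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Training: the framework adjusts the weights using data/label pairs.
model.fit(images, labels, epochs=2, batch_size=32)

# Inference: the trained network predicts a label for a new, unseen image.
new_image = np.random.rand(1, 64, 64, 3).astype("float32")
prediction = model.predict(new_image)
print("predicted class:", prediction.argmax(axis=-1))
```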
Convolutional neural networks (CNNs) have proven very effective as inference engines in areas such as image recognition and classification. CNNs are successful in identifying faces, objects, and traffic signs as well as powering vision in robots and self-driving cars. Deep neural networks (DNNs) use a series of layers and features to produce very accurate and computationally efficient results. These layers of Convolution, Non-Linearity, Normalization, Pooling, and Classification (Fully Connected Layer), as illustrated in Figure 2, can be implemented in software or in dedicated hardware consisting of arithmetic logic functions, with intermediate results often stored in local memories.
Figure 2: Modern deep convolutional neural networks
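The layer sequence above can be sketched in a few lines with the TensorFlow Keras API; the filter counts, kernel sizes, and input shape below are illustrative assumptions, not a reference design.

```python
import tensorflow as tf

# A minimal CNN containing the layer types named above, assuming 32x32 RGB
# inputs and 10 output classes (both arbitrary choices for this sketch).
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu",
                           input_shape=(32, 32, 3)),   # convolution + non-linearity
    tf.keras.layers.BatchNormalization(),              # normalization
    tf.keras.layers.MaxPooling2D(2),                   # pooling
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),   # classification (fully connected)
])
model.summary()
```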
Much of the early work in training and inference was done with general-purpose CPUs running programs in high-level languages, due to ease of implementation and adequate performance on the limited data sets of the time. As training data sets grew larger and larger, GPUs proved more efficient at executing neural network functions in parallel for better throughput. With the availability of “big data,” dedicated deep learning processors show the most promise for performing these large computational tasks the fastest, with speeds often measured in teraflops (trillions, or 10^12, of floating-point operations per second) at high efficiency. Today’s most advanced FinFET processes provide excellent platforms for building deep learning accelerators and inference engines that are optimized for the specific application.
Due to the limits of today’s processors based on von Neumann architectures, there is a large and growing gap between off-chip memory performance and the speed of processors following Moore’s Law (Figure 3). This memory access bottleneck is exacerbated by the large amounts of sparse data and by increasing silicon performance.
Figure 3: CPU/GPU performance has been outpacing memory performance, and deep learning processor performance shows an even greater gap
Many recent hardware platforms, including general-purpose CPUs and GPUs, as well as dedicated CNN processors and custom DNN servers, offer special features that target CNN processing. CNN inference has also been demonstrated on embedded SoCs, and 草榴社区’s Embedded Vision (EV) Processor can perform huge numbers of multiplier-accumulator (MAC) computations per second with low power and small area, supported by the MetaWare EV development toolkit for software development.
The art of designing deep learning accelerators lies in minimizing the memory bottleneck through processor architectures that optimize numerical precision while preserving sufficient accuracy, making efficient use of on-chip memory, and employing energy-efficient logic for fixed-point and floating-point computations. Figure 4 shows a typical architecture of a dedicated deep learning accelerator that includes a hierarchy of off-chip and on-chip memory.
Figure 4: Example of dedicated deep learning accelerator memory architecture
The memory hierarchy is connected to an array of arithmetic logic units (ALUs), each containing a small scratchpad memory and control circuits. The array of ALUs performs the dedicated functions and layers of DNNs to efficiently process large amounts of data in training and inference. The flexibility of the connections that can be made between memories and processors in this array enables the same hardware to handle either a small number of dedicated tasks or a wide variety of learning tasks.
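This dataflow can be approximated in software. The NumPy sketch below models an output-stationary tile of MAC units, where partial sums stay in a local scratchpad while weights and activations stream in from shared memory; the tile size and matrix shapes are arbitrary assumptions, not a description of any particular accelerator.

```python
import numpy as np

def mac_array_matmul(activations, weights, tile=4):
    """Tiled matrix multiply emulating a tile x tile array of MAC units."""
    rows, inner = activations.shape
    inner2, cols = weights.shape
    assert inner == inner2
    out = np.zeros((rows, cols), dtype=activations.dtype)
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            # Local scratchpad: partial sums stay inside this tile of ALUs.
            psum = np.zeros((min(tile, rows - r0), min(tile, cols - c0)),
                            dtype=activations.dtype)
            for k in range(inner):  # stream one weight/activation slice per step
                psum += np.outer(activations[r0:r0 + tile, k],
                                 weights[k, c0:c0 + tile])
            out[r0:r0 + tile, c0:c0 + tile] = psum  # write results back to shared memory
    return out

a = np.random.rand(8, 16).astype("float32")
w = np.random.rand(16, 8).astype("float32")
assert np.allclose(mac_array_matmul(a, w), a @ w, atol=1e-4)
```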
Reducing the numerical precision of scalar data in memories is key to minimizing energy in deep learning, because most deep learning algorithms that apply weights to scalar data are quite tolerant of small round-off errors. For example, an 8-bit fixed-point adder consumes 3.3X less energy and 3.8X less area than a 32-bit fixed-point adder, and 30X less energy and 116X less area than a 32-bit floating-point adder. Reducing the precision also reduces the memory area and energy needed to store this typically sparse matrix data and eases the memory bandwidth issues that can quickly become the system performance bottleneck. These integer and floating-point MACs use a variety of multiplier architectures, such as Booth and Wallace tree, and a wide variety of adders, including carry-lookahead, carry-save, and carry-select. The 草榴社区 DesignWare? Library contains RTL for a wide variety of adders, multipliers, and dot-product generators with full IEEE 754 compatibility for single and double precision.
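The effect of reduced precision is easy to demonstrate in software. The sketch below quantizes float32 weights and activations to 8-bit integers using simple per-tensor scales (an assumption made for illustration), accumulates the products in a wide integer as a hardware MAC would, and compares the rescaled result against full precision.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization to int8 (illustrative scheme)."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

weights = np.random.randn(256).astype(np.float32)
activations = np.random.randn(256).astype(np.float32)

qw, sw = quantize_int8(weights)
qa, sa = quantize_int8(activations)

# int8 products accumulate into a wide integer, as a hardware MAC would
# accumulate into a wider register to avoid overflow.
int32_acc = np.dot(qw.astype(np.int32), qa.astype(np.int32))
approx = int32_acc * sw * sa
exact = np.dot(weights, activations)
print(f"float32: {exact:.4f}  int8 approx: {approx:.4f}  "
      f"error: {abs(exact - approx):.4f}")
```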
Deep learning requires large on- or off-chip memories to efficiently perform sparse matrix multiplication, with a special emphasis on minimizing total power. On-chip SRAM and register file memories require much less energy per access than off-chip DRAM. Today’s on-chip memory features, such as light sleep, deep sleep, shutdown, low-voltage operation with read/write assist, and dual rail, take advantage of the variety of bit cells and periphery options available to optimize performance, power, and area. Memory and register file configurations such as single port, dual port, and multi-port are frequently used in deep learning processors. The special multi-port configuration of two reads and one write, seen in Figure 5, is especially useful when feeding ALUs with filter weight and mapping information while the partial sum comes from local registers.
Figure 5: Multi-port memory for deep learning
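A simple behavioral model helps illustrate the idea; the Python class below is a hypothetical two-read/one-write register file written for this article, not a model of any 草榴社区 memory compiler output.

```python
class TwoReadOneWriteRF:
    """Behavioral sketch of a 2-read/1-write register file."""

    def __init__(self, depth, init=0.0):
        self.mem = [init] * depth

    def cycle(self, read_addr_a, read_addr_b, write_addr=None, write_data=None):
        # Both read ports are serviced in the same cycle, e.g. a filter weight
        # and a feature-map value delivered to an ALU together.
        a, b = self.mem[read_addr_a], self.mem[read_addr_b]
        # The single write port refills another entry; read-before-write
        # ordering is an assumption of this sketch.
        if write_addr is not None:
            self.mem[write_addr] = write_data
        return a, b

rf = TwoReadOneWriteRF(depth=16)
rf.cycle(0, 1, write_addr=2, write_data=0.5)   # refill one entry with a new weight
weight, fmap = rf.cycle(2, 3)                  # read two operands in one cycle
print(weight, fmap)                            # 0.5 0.0
```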
Content-addressable memory (CAM) is now common in networking applications but may become very useful in deep learning applications if its power consumption can be minimized. For example, ternary content-addressable memory (TCAM) is a specialized type of high-speed memory that searches its entire contents in a single clock cycle and could be a lower-power solution for deep learning SoCs. The term “ternary” refers to the memory’s ability to store and query data using three different inputs: 0, 1, and X (the “don’t care” bit acts as a wildcard during searches).
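A short sketch shows how a ternary match works; the Python loop below stands in for the parallel, single-cycle comparison performed in hardware, with X bits acting as wildcards.

```python
def tcam_lookup(entries, key):
    """Return indices of stored patterns matching the key; 'X' matches 0 or 1."""
    hits = []
    for idx, pattern in enumerate(entries):
        # In hardware every entry is compared in parallel in one cycle;
        # here a loop stands in for that parallel search.
        if all(p in (k, "X") for p, k in zip(pattern, key)):
            hits.append(idx)
    return hits

table = ["10X1", "0XX0", "111X"]
print(tcam_lookup(table, "1011"))  # [0]
print(tcam_lookup(table, "0110"))  # [1]
```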
Near-memory processing and in-memory processing may take advantage of non-volatile memory (NVM) technologies such as spin-transfer torque RAM (STT-RAM), phase-change memory (PCM), and resistive RAM (RRAM). These emerging technologies have been produced in discrete devices for years by mainstream semiconductor manufacturers and could be used to perform multiplication with an analog method of weighted bits, using resistive ladders and capacitors for accumulation. If they can be combined with CMOS logic, these technologies could bring an order-of-magnitude improvement in energy efficiency to deep learning, as could quantum computing, although both are years away. Today, energy-efficient embedded SRAMs and register files are available for deep learning accelerators.
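The analog multiply-accumulate principle can be sketched numerically: weights are stored as conductances, inputs are applied as voltages, and the current summed on each bit line performs the accumulation by Ohm’s and Kirchhoff’s laws. The values below are illustrative, not measured device parameters.

```python
import numpy as np

voltages = np.array([0.2, 0.5, 0.1, 0.4])      # input activations applied as volts
conductances = np.array([                       # weights stored as conductances (siemens)
    [1e-6, 2e-6, 0.5e-6, 1.5e-6],
    [2e-6, 1e-6, 1.0e-6, 0.5e-6],
])

# Each row models one bit line: the summed current I = sum(G * V) is the
# analog equivalent of a multiply-accumulate result.
bitline_currents = conductances @ voltages
print(bitline_currents)
```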
Area-efficient and power-efficient logic is required for massively parallel matrix multiplication circuits that often operate on data in tensor formats. Tensors are geometric objects that describe linear relations between geometric vectors, scalars, and other tensors; elementary examples of such relations include the dot product, the cross product, and linear maps. These computation engines can include both integer and floating-point operations. In today’s accelerators, critical circuits include 8-bit integer and 16-bit floating-point MACs, along with support for conversions between integer and floating-point formats.
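For readers less familiar with the terminology, the elementary tensor operations mentioned above look like this in NumPy; the vectors and matrix are arbitrary examples.

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0]])

print(np.dot(u, v))    # dot product: two vectors contracted to a scalar
print(np.cross(u, v))  # cross product of two 3-vectors
print(A @ u)           # linear map: a matrix (rank-2 tensor) acting on a vector
```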
To build higher-level functions, logic libraries must contain a rich variety of fast and efficient half and full adders, XORs, compressors, Booth encoders, and flop and latch families, including multi-bit cells, for partial products and other storage. Enabling large training applications with minimum energy requires low-voltage arithmetic logic operations, since dynamic power is proportional to CV²f, along with low-voltage design techniques and special low-voltage characterization as circuits approach near-threshold operation and variability becomes a major concern. To minimize the power consumed in the clock tree, asynchronous logic may provide the next breakthrough in power optimization of deep learning accelerators. The power efficiency of MACs, which are repeated millions of times on a deep learning processor, is measured by a figure of merit of computation throughput per unit of energy, such as GMAC/s/W, or giga multiply-accumulate operations per second per watt. The choice of optimal numerical precision and energy-efficient arithmetic logic cells can be critical in minimizing power. Figure 6 shows the logic in a typical floating-point MAC.
Figure 6: Multiply-accumulate MAC
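As a worked example of this figure of merit, the numbers below are assumptions chosen for illustration (1,024 MAC units at 1 GHz with 2 pJ per multiply-accumulate), not measured silicon data.

```python
# Illustrative GMAC/s/W calculation under assumed numbers.
num_macs = 1024                  # MAC units operating in parallel (assumption)
clock_hz = 1.0e9                 # 1 GHz clock (assumption)
energy_per_mac_j = 2.0e-12       # 2 pJ per multiply-accumulate (assumption)

throughput_gmacs = num_macs * clock_hz / 1e9          # GMAC/s
power_w = num_macs * clock_hz * energy_per_mac_j      # W, ignoring leakage and clock tree
print(f"{throughput_gmacs:.0f} GMAC/s at {power_w:.2f} W -> "
      f"{throughput_gmacs / power_w:.0f} GMAC/s/W")   # 1024 GMAC/s, ~500 GMAC/s/W
```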
Designing a deep learning accelerator takes more than a great algorithm, an innovative architecture, an efficient design, and great foundation IP. Deep learning SoCs must move large amounts of data on and off chip, which can completely change the overall energy metrics per operation. 草榴社区 provides a broad portfolio of high-speed interface controllers and PHYs, including HBM, PCI Express, USB, MIPI, and SerDes, to transfer data at high speeds with minimal power and area. 草榴社区 also provides a comprehensive EDA platform to design and validate ASICs and ASSPs for deep learning.
Optimized Foundation IP is critical for building deep learning accelerators because a large portion of the chip area is dedicated to AI-specific computational functions and large memories. 草榴社区 provides comprehensive Foundation IP consisting of energy-efficient logic libraries and memory compilers in all of the advanced FinFET processes targeted for deep learning accelerators. These accelerators are being designed into the most power-efficient ASICs and ASSPs in data centers and for AI inference engines in mobile devices.