Gordon Cooper, Product Marketing Manager, ARC Processors
It’s been ten years since AlexNet, a deep learning convolutional neural network (CNN) model running on GPUs, displaced more traditional vision processing algorithms to win the ImageNet Large Scale Visual Recognition Competition (ILSVRC). AlexNet, and its successors, provided significant improvements in object classification accuracy at the cost of intense computational complexity and large datasets. In other words, it took a lot of data movement and a lot of multiplies and accumulations to achieve this record-breaking accuracy. In the academic community, the rush was on to improve neural network optimization techniques to increase accuracy and performance while minimizing power and area for practical applications.
Neural networks have found extensive uses in real-world applications like pedestrian detection for self-driving cars, speech recognition for smart personal assistants, and facial recognition for cellphone and laptop access control (Figure 1). The strength of neural networks is their ability to identify patterns within data sets – in some cases surpassing human ability – even when the data is noisy or incomplete.
Figure 1: Examples of embedded neural network applications
Implementations on GPUs, like AlexNet’s ILSVRC submission, provide a great starting point for training of the model and early prototyping. For high-volume, cost-sensitive applications where performance and power are critical, designers have turned to neural processing units (NPUs) with programmable but optimized hardware acceleration for neural networks. The challenge for an NPU is to be optimized to accelerate math-intensive neural networks, to be area efficient, and yet to remain programmable enough to be future proof when a new neural network technique or algorithm is published.
The first neural network accelerators began appearing in 2014, about the time that VGG16, a neural network model that improved upon AlexNet, became a widely used CNN architecture for vision classification tasks. VGG16’s architecture is fairly simple. It uses 3x3 convolutions and a straightforward activation function, ReLU (Figure 2). An activation function helps sort between useful and not-so-useful data by driving the output of a node toward a one or a zero (thereby activating or not activating the output). VGG16’s accuracy gains over AlexNet came from an increased number of layers, driving even greater complexity and data movement requirements.
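As a purely illustrative aside, ReLU is simple enough to express in a couple of lines of NumPy; the sample input values below are arbitrary:

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: passes positive values through, zeroes out negatives."""
    return np.maximum(0.0, x)

# A node whose weighted sum is negative is "not activated" (output 0);
# a positive sum passes through unchanged.
print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0.  0.  0.  1.5 3. ]
```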
Convolutions – the heart of neural network processing – have gotten more complicated in the last ten years. MobileNet, for example, introduced depthwise separable convolutions. VGG16’s 3x3 convolutions gave way to multiple alternatives, especially an increased reliance on 1x1 convolutions. Activation functions have become more varied and complicated. Although ReLU wasn’t the only activation function used in 2014 – sigmoid and tanh were often used with recurrent neural networks (RNNs) – it was the most popular for CNNs. Evolving research has since introduced multiple new activation functions. An NPU must still efficiently support ReLU, but it must also support a dozen or more alternatives. Figure 2 shows some of the complexities that today’s neural network architectures must support and be optimized for.
Figure 2: The left image shows ReLU, a computationally efficient, non-saturating, non-linear activation function used in many early CNNs. The right image shows just some of the activation functions that need to be supported by the latest NPUs.
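To see why a depthwise separable convolution is attractive, here is a rough back-of-envelope comparison of MAC counts for a single layer; the layer dimensions are assumed for illustration and are not tied to any particular model:

```python
# Standard 3x3 convolution vs. a depthwise separable (3x3 depthwise + 1x1
# pointwise) replacement. All dimensions below are illustrative assumptions.
H, W = 56, 56           # output feature-map height and width
C_in, C_out = 128, 128  # input and output channels
K = 3                   # kernel size

standard  = H * W * C_out * C_in * K * K   # every output taps every input channel
depthwise = H * W * C_in * K * K           # one 3x3 filter per input channel
pointwise = H * W * C_out * C_in           # 1x1 convolution mixes the channels
separable = depthwise + pointwise

print(f"standard:  {standard / 1e6:.1f} M MACs")
print(f"separable: {separable / 1e6:.1f} M MACs "
      f"(~{standard / separable:.1f}x fewer)")
```

The arithmetic savings are large, but the data movement pattern is quite different from a plain 3x3 convolution, which is one reason hardware tuned only for VGG16-era layers ages poorly.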
The multiple advancements in CNN architectures over the last eight years have improved performance, efficiency, accuracy, and bandwidth, but at the cost of additional hardware complexity. Hardware designed to maximize AlexNet, VGG16, or other early ImageNet-winning models would be inadequate today for running the latest neural network models (e.g., YOLO v5, EfficientNet) efficiently, or for supporting emerging deep learning models like transformers and recommender networks.
The transformer neural network is a newer type of deep learning architecture that originally gained traction with its ability to handle natural language processing (NLP). Like RNNs, transformers are designed to handle sequential input data such as audio or text. Unlike RNNs, which process the data serially and therefore suffer from bandwidth limits in hardware, transformers allow for far more parallelism, which improves efficiency and accuracy and allows for training on larger data sets than was previously possible. In addition to NLP, transformers are now being applied to vision applications as well.
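For a concrete picture of where that parallelism comes from, the sketch below shows scaled dot-product attention, the core transformer operation, in a few lines of NumPy. The sequence length and model width are arbitrary, and a real transformer wraps this kernel in multiple heads, projections, and normalization:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core transformer operation: every position attends to every other
    position via matrix multiplies, with no sequential recurrence."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over each row
    return weights @ V                                  # weighted sum of values

seq_len, d_model = 8, 16                                # illustrative sizes
rng = np.random.default_rng(0)
Q = rng.standard_normal((seq_len, d_model))
K = rng.standard_normal((seq_len, d_model))
V = rng.standard_normal((seq_len, d_model))
print(scaled_dot_product_attention(Q, K, V).shape)      # (8, 16)
```

Because the whole sequence is processed in a handful of matrix multiplications rather than one timestep at a time, the workload maps naturally onto wide MAC arrays, provided the NPU supports the surrounding tensor operators.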
An NPU’s neural network acceleration must evolve to better support the latest neural network models. It must evolve from a CNN engine into a broader AI engine.
It is not just the complexity of the neural network models driving the need for NPU improvements. Real-world applications have a growing demand for ever greater levels of neural network performance. Mobile phones have shown a 30x jump in AI processing performance in the last couple of years. Autonomous vehicle demands for neural network processing have grown from 10s to 100s to 1000s of tera-operations/second (TOPS) in the last several years, thanks to the increasing number of cameras used, higher image resolutions, and more complex algorithms. Where L3 autonomy might require 10s of TOPS, L4 autonomy is expected to need 100s of TOPS and L5 autonomy, 1000s of TOPS.
Figure 3: The levels of automotive autonomy where the automated system monitors the driving environment and the expected neural network performance required
The easiest way to improve performance for a neural network accelerator is to increase the number of multiply-accumulators (MACs) – the building blocks of matrix multiplication. However, while the computational units are growing exponentially, the memory bandwidth available to feed data into these large accelerators is not. There is a lot of pressure on neural network accelerator designers to come up with ways to minimize bandwidth so that all those MACs in the system can actually be utilized.
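A rough back-of-envelope calculation, using assumed numbers for the MAC count, clock rate, DRAM bandwidth, and arithmetic intensity, illustrates the imbalance:

```python
# Back-of-envelope (all numbers assumed for illustration): peak compute of a
# hypothetical 32K-MAC accelerator versus what external DRAM bandwidth can feed.
macs_per_cycle = 32 * 1024          # hypothetical accelerator width
clock_hz = 1.0e9                    # 1 GHz clock
peak_tops = macs_per_cycle * 2 * clock_hz / 1e12   # 2 ops (multiply + add) per MAC
print(f"peak compute: ~{peak_tops:.0f} TOPS")

dram_bw_bytes = 50e9                # assumed 50 GB/s of external DRAM bandwidth
ops_per_byte = 10.0                 # assumed arithmetic intensity without aggressive on-chip reuse
fed_tops = dram_bw_bytes * ops_per_byte / 1e12
print(f"bandwidth-fed: ~{fed_tops:.1f} TOPS "
      f"({100 * fed_tops / peak_tops:.1f}% of peak)")
```

With numbers like these, most of the MAC array would sit idle unless on-chip memory, data reuse, compression, and sparsity techniques close the gap.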
Designers of AI-enabled SoCs need neural network IP that keeps pace with the latest evolutionary advancements in neural network algorithms, can scale for the growing demand for higher and higher levels of neural network performance, and can be easily programmed with a mature set of development tools. For automotive and aeronautical use cases, it is also important to meet increasingly stringent functional safety standards.
To keep pace with evolving neural network advancements and the growing demand for higher performance, 草榴社区 has recently introduced the DesignWare® ARC® NPX6 NPU IP (Figure 4). The NPX6 NPU IP addresses the demands of real-time compute with ultra-low power consumption for deep learning applications. It is 草榴社区’ sixth generation of neural network accelerator IP.
Figure 4: DesignWare ARC NPX6 NPU IP
There are multiple sizes of NPX6 NPU IP to choose from to meet specific application performance requirements. The NPX6 NPU’s scalable architecture is based on individual cores that can scale from 4K MACs to 96K MACs. A single NPX6 processor can deliver up to 250 TOPS at 1.3 GHz on 5nm processes in worst-case conditions, or up to 440 TOPS by using new sparsity features, which can increase the performance and decrease the energy demands of a neural network.
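The headline dense-compute figure follows directly from the MAC count and clock rate (each MAC performs two operations per cycle, a multiply and an add):

```python
# Worked arithmetic behind the quoted dense-compute figure.
macs = 96 * 1024                 # largest configuration: 96K MACs
clock_hz = 1.3e9                 # 1.3 GHz
dense_tops = macs * 2 * clock_hz / 1e12
print(f"~{dense_tops:.0f} TOPS dense")   # ~256 TOPS, in line with "up to 250 TOPS"

# The 440 TOPS sparsity figure is quoted from the product description; it comes
# from skipping work on zero values rather than from additional MAC hardware.
```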
Each NPX6 core includes up to three computational units optimized for the latest neural networks. The convolution accelerator supports 4,096 MACs per clock cycle for matrix multiplications, including convolutional operations. The tensor accelerator supports a broad range of tensor operators for CNNs, RNNs, and newer networks like transformers. The tensor accelerator also provides a programmable look-up table (LUT) that supports any current or future activation function, including ReLU, PReLU, ReLU6, tanh, sigmoid, Mish, and Swish. The Tensor Floating-Point Unit (TFPU) offers optional 16-bit floating-point (both FP16 and BF16 formats) support inside the neural processing hardware, maximizing layer performance and simplifying the transition from GPUs used for AI prototyping to high-volume, power- and area-optimized SoCs.
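As a software sketch only (not a description of the actual hardware), the idea behind a programmable activation LUT can be shown with a sampled table plus linear interpolation; the table size, input range, and interpolation scheme here are assumptions:

```python
import numpy as np

def build_lut(fn, x_min=-4.0, x_max=4.0, entries=64):
    """Sample an activation function into a table (software sketch of a
    programmable LUT; table size and range are illustrative assumptions)."""
    xs = np.linspace(x_min, x_max, entries)
    return xs, fn(xs)

def lut_activate(x, xs, ys):
    """Approximate fn(x) by linear interpolation between table entries,
    clamping inputs that fall outside the tabulated range."""
    return np.interp(x, xs, ys)

xs, ys = build_lut(np.tanh)          # "program" the table for tanh
x = np.linspace(-6, 6, 7)
print(np.max(np.abs(lut_activate(x, xs, ys) - np.tanh(x))))  # small approximation error
```

Reprogramming the table, rather than respinning hardware, is what lets a LUT-based approach absorb activation functions that have not been invented yet.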
This scalability of the computational blocks is supported by advanced bandwidth techniques and a memory hierarchy with L1 memory in each core and L2 memory shared between cores and external DRAM. The ability to scale up to 24 cores is provided by a high-performance, low-latency interconnect. There are many hardware and software features designed into the NPX family to help scale up the TOPS while keeping the external memory bandwidth in a manageable range. These include on-the-fly compression by the DMA, exploiting graph sparsity, advanced buffer management, multi-level tiling, and layer fusion, to name a few.
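A simple, assumption-laden traffic estimate shows why layer fusion and tiling matter: if consecutive layers are fused and their intermediate feature maps stay in on-chip memory, those tensors never have to be written out to DRAM and read back. The feature-map shape, data type, and equal-size simplification below are illustrative only:

```python
# Conceptual DRAM-traffic estimate (all sizes assumed; every layer is assumed
# to produce a feature map of the same shape for simplicity).
H, W, C = 224, 224, 64            # illustrative intermediate feature-map shape
bytes_per_elem = 1                # 8-bit activations
fmap_bytes = H * W * C * bytes_per_elem
layers_fused = 3                  # run three consecutive layers per tile

# Unfused: the input is read, each intermediate is written out and read back,
# and the final output is written.
unfused_traffic = (1 + 2 * (layers_fused - 1) + 1) * fmap_bytes
# Fused: intermediates stay in on-chip memory; only the first input and the
# final output cross the DRAM boundary.
fused_traffic = 2 * fmap_bytes

print(f"unfused: {unfused_traffic / 1e6:.1f} MB, "
      f"fused: {fused_traffic / 1e6:.1f} MB of DRAM traffic for this group of layers")
```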
To take advantage of all these integrated hardware features and to accelerate application software development for the NPX processor family, the new DesignWare ARC MetaWare MX Development Toolkit provides a comprehensive compilation environment with automatic neural network algorithm partitioning to maximize resource utilization. Together, the NPX IP and high-productivity programming tools optimize the performance, power, and area of high-performance SoCs for a broad range of embedded AI applications, including advanced driver assistance systems (ADAS), surveillance, digital TVs and cameras, and data center and edge server inference.