Cloud native EDA tools & pre-optimized hardware platforms
Prasad Saggurti, Director of Marketing for Foundation IP, 草榴社区
As artificial intelligence (AI) approaches human-brain levels of speed and accuracy, systems increasingly rely on centralized servers connecting applications from the edge to the cloud. The explosion in the number of devices connected to the Internet, combined with an exponential increase in Internet traffic, means that today's systems must perform very fast searches in many contexts. Routers, a key component of networking equipment, must receive each packet of data and decide where to send it in order to perform Internet Protocol (IP) forwarding, or IP routing. Today's routers require very fast lookups across large amounts of data to enable fast packet routing. Other applications requiring high-speed searches include translation lookaside buffers (TLBs) and fully associative cache controllers in CPUs, database engines, and neural networks.
While designers can choose among many options to execute these searches, the most effective method involves using content addressable memories (CAMs). CAMs compare search data against a table of stored data and return the address of the matching data [1]. A CAM search function operates much faster than its counterpart in software, and thus CAMs are replacing software in search-intensive applications such as address lookup in Internet routers, data compression, and database acceleration [2].
Readers may recall from their computer science coursework that a hash table is among the most efficient search implementations; a CAM can be considered a hardware realization of this construct. The linear search of a list is similar to walking through all locations of a memory and comparing each against a key in serial fashion. A CAM-based search is the equivalent of comparing against all contents in parallel and then returning the address of the successful compare. This is inherently much faster, although more complex to build.
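The contrast between the two search styles can be modeled in a few lines of software. This is an illustrative sketch only (the function names and data are hypothetical, not part of any hardware API): both functions return the address of the matching word, but in hardware the CAM performs all of its compares in a single cycle, whereas the serial search takes one cycle per location.

```python
def linear_search(memory, key):
    """Serial search: walk every location and compare, one at a time."""
    for address, word in enumerate(memory):
        if word == key:
            return address
    return None  # no match found


def cam_search(memory, key):
    """CAM-style search: conceptually, every location is compared at once.
    In hardware all compares happen in parallel in one cycle; here we
    model only the result -- the address of the first matching word."""
    matches = [addr for addr, word in enumerate(memory) if word == key]
    return matches[0] if matches else None


memory = ["10010", "01100", "11111", "00101"]
print(cam_search(memory, "11111"))  # -> 2
```

Both routines return the same answer; the difference that matters in hardware is that the CAM's cost is paid in silicon area and power rather than in search time.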
Along with the performance benefits comes the downside of larger area and higher power dissipation (Figure 1). Unlike static RAM (SRAM), which has simple storage cells, each individual memory bit in a fully parallel CAM must have its own associated comparison circuit to detect a match between the stored bit and the input bit. Additionally, the match outputs from each cell in the data word must be combined to yield a complete data-word match signal. This additional circuitry increases the physical size of the CAM. A large amount of the circuitry is active in a given cycle because all the entries are searched in parallel. Thus, a key challenge is to minimize the CAM power consumption, which grows along with the size of the CAM configuration.
Figure 1. A CAM-based search compares against all contents in parallel, which is much faster but can increase complexity, area, and power consumption compared to SRAM
CAMs come in two major types: Binary CAMs (BCAMs) and Ternary CAMs (TCAMs). BCAMs are the simplest type of CAM in that they use only 1s and 0s in the stored word. TCAMs also allow a third matching state of X or “don’t care” for one or more of the bits in the search word. Where a BCAM has “10010” as a stored word, a TCAM may have “10XX0” as one of its stored words. The “don’t care” state allows the TCAM to flexibly match any one of four search words – "10000," "10010," "10100," or "10110." Adding a “don’t care” state is done by adding a mask bit for each memory cell and increases complexity even more. A priority encoder in the TCAM ensures that only the first matching entry is output. The flexible coding with X reduces the number of stored entries, thus further improving efficiency of the search.
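The "don't care" matching and the priority encoder described above can be sketched in software. The following is a behavioral model only, with hypothetical function names; a stored TCAM entry matches the search key when every bit position is either equal or X, and the priority encoder returns the lowest matching address:

```python
def tcam_match(entry, key):
    """A stored TCAM entry matches the key if every bit position is
    either equal to the key bit or an 'X' (don't care)."""
    return all(e in (k, "X") for e, k in zip(entry, key))


def tcam_search(table, key):
    """Model of the priority encoder: of all entries that match,
    return the one at the lowest address (highest priority)."""
    for address, entry in enumerate(table):
        if tcam_match(entry, key):
            return address
    return None  # no entry matched


table = ["10XX0", "10010", "11111"]
# "10110" matches both "10XX0" (via don't-cares) and would match an exact
# "10110" entry if present; the priority encoder picks the first match.
print(tcam_search(table, "10110"))  # -> 0
```

Note how the single entry "10XX0" stands in for four exact BCAM entries ("10000", "10010", "10100", "10110"), which is the storage efficiency the article describes.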
Figure 2 shows three possible configurations of TCAMs in networking applications, where speed, or CAM accesses/sec (Msps), is of the essence. Each of the examples – MAC, switching, and packet processing – is a key component of networking hardware and requires high accesses per second. While packet processing has the highest speed requirements, the MAC and switching blocks also require high-speed access rates. In addition, all three applications have high port bandwidth requirements because of the increasing amount of traffic and the larger number of devices on advanced networks.
Figure 2. TCAM configurations in networking applications
The amount of content in TCAMs is increasing, and enterprise, networking, and routing system-on-chip (SoC) designs are moving to smaller geometries (Figure 3). In smaller geometries, TCAMs take on additional functions such as IP and MAC lookup and error correction code (ECC).
An access control list (ACL) is a filter that controls network traffic in and out of a network. Packets are either permitted or denied access to specific ports or specific types of services using a TCAM. ACL rule tables for a typical network gateway can range from just a few entries to tens of thousands. For example, an ACL rule table with a depth of 2K entries and a width of 288 bits can handle both IPv4 and IPv6 standards. IPv6 is the next-generation Internet Protocol (IP) address standard intended to supplement and eventually replace IPv4 (the protocol most Internet services use today). Cisco's VNI Global IP Traffic Forecast, 2015–2020, projected that by 2020, 34% of total global Internet traffic would be IPv6-driven, with IPv6 traffic growing sixteen-fold from 2015 to 2020 at a CAGR of 74%.
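An ACL maps naturally onto ternary entries: an IPv4 prefix rule fixes the leading bits and leaves the rest as don't-cares, and the TCAM's priority order implements "first match wins." The sketch below is illustrative (the rules and helper names are hypothetical, and real ACL entries also match ports and protocols, not just addresses):

```python
def ip_to_bits(ip):
    """Convert a dotted-quad IPv4 address to a 32-character bit string."""
    return "".join(f"{int(octet):08b}" for octet in ip.split("."))


def prefix_rule(prefix, length, action):
    """Encode 'prefix/length -> action' as a ternary TCAM entry:
    the first `length` bits are fixed, the rest are don't-cares."""
    return (ip_to_bits(prefix)[:length] + "X" * (32 - length), action)


rules = [
    prefix_rule("10.0.0.0", 8, "deny"),        # entry 0: highest priority
    prefix_rule("192.168.1.0", 24, "permit"),  # entry 1
    ("X" * 32, "deny"),                        # default rule: deny all else
]


def acl_lookup(rules, ip):
    """First matching rule wins, mirroring TCAM priority encoding."""
    key = ip_to_bits(ip)
    for entry, action in rules:
        if all(e in (k, "X") for e, k in zip(entry, key)):
            return action


print(acl_lookup(rules, "192.168.1.77"))  # -> "permit"
print(acl_lookup(rules, "10.5.5.5"))      # -> "deny"
```

In hardware, all three rules are compared against the packet's address in a single TCAM cycle, which is what makes line-rate filtering possible.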
Figure 3. Aggregate TCAM content in representative SoCs
Testing TCAMs is both complex and time consuming due to their unique mix of logic and memory. It is important for TCAM BIST algorithms to deliver coverage of all failure mechanisms and to do so efficiently. Conventional TCAM array BIST algorithms have complexity on the order of O(x·y), where x is the number of words and y is the number of bits per word. In addition to the bitcells, sub-blocks of the priority encoder (the multi-match resolver and the match address encoder) also need to be tested. As the sizes of the TCAM macros and the number of macros on a chip increase, chip designers should consider redundancy to improve yield. ECC also needs to be considered for higher-reliability applications.
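The O(x·y) complexity gives a quick feel for BIST runtime scaling. The following is a back-of-the-envelope cost model only, not an actual DesignWare SMS-CAM algorithm: it assumes each of the x words is exercised with one operation per bit position, so the operation count grows linearly in both the depth and the width of the macro.

```python
def bist_operation_count(words, bits_per_word):
    """Illustrative O(x*y) cost model: assume each stored word is
    exercised with one test operation per bit position."""
    return words * bits_per_word


# The 2K x 288 ACL configuration mentioned earlier:
print(bist_operation_count(2048, 288))  # -> 589824
# Doubling either dimension doubles the test-operation count:
print(bist_operation_count(4096, 288))  # -> 1179648
```

This scaling is why larger TCAM configurations, and SoCs with many TCAM macros, benefit from efficient BIST algorithms and shared test infrastructure.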
TCAMs can offer significant performance advantages for AI and cloud applications. If you need to integrate TCAMs on your next chip design, 草榴社区 can help. Silicon-proven DesignWare® TCAMs are available at multiple process nodes – 90nm, 65nm, 40nm, 28nm, 16nm, 14nm and 7nm. To simplify testing of TCAMs, the DesignWare STAR Memory System (SMS)-CAM and the SMS ECC Compiler deliver a complete and robust solution. Please visit DesignWare Ternary and Binary Content-Addressable Memory Compilers and DesignWare STAR Memory System® for more details on these products.
References
1. K. Pagiamtzis and A. Sheikholeslami, "A low-power content-addressable memory (CAM) using pipelined hierarchical search scheme," IEEE Journal of Solid-State Circuits, vol. 39, no. 9, pp. 1512-1519, Sept. 2004.
2. T.-B. Pei and C. Zukowski, "Putting routing tables in silicon," IEEE Network Magazine, vol. 6, pp. 42-50, Jan. 1992.
3. L. Chisvin and R. J. Duckworth, "Content-addressable and associative memory: Alternatives to the ubiquitous RAM," IEEE Computer, vol. 22, pp. 51-64, July 1989.
4. N. Mohan, W. Fung, D. Wright, and M. Sachdev, "Design techniques and test methodology for low-power TCAMs," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, pp. 573-586, 2006.