
An Introduction to CCIX

By: Richard Solomon, Technical Marketing Manager, 草榴社区

Cache Coherent Interconnect for Accelerators (CCIX) refers to a set of specifications being developed by a new industry standards body, the CCIX Consortium. The driving factors for CCIX are the need for faster interconnects than currently available technologies, and for cache coherency to allow faster access to memory in a heterogeneous multi-processor system. To that end, the consortium's efforts have focused on enabling hardware accelerators to use memory shared with multiple processors in a cache coherent manner. This article describes the CCIX standard and its key benefits for high-performance applications such as machine learning, network processing, storage off-load, and in-memory databases. 

What is Cache Coherency?

When multiple CPUs share a common memory space, they gain performance from communicating the cached and/or cacheable state of pieces of that memory. In this way, each CPU can safely work on a portion of a common data set without having to use (slow) software semaphores to control access. If CPU A has a piece of memory cached, it can ensure that CPU B does not modify that same memory space or use a stale copy of the data. CCIX extends this communication so agents other than CPUs can participate, which enables hardware accelerators to gain the same benefits. CCIX’s coherence protocol is also vendor-independent, so CPUs, GPUs, and other accelerators can all participate equally and without onerous licensing restrictions.

To better understand cache coherency, let’s examine a coherence protocol that has been in common use for some time: MESI. The acronym MESI refers to the four possible states of each cache line in the system: Modified, Exclusive, Shared, or Invalid. Modified means a cache line is stored ONLY in the current cache and differs from the data in main memory (“dirty” in cache parlance). Any other agent attempting to read from an address marked somewhere in the system as Modified will cause the cache holding the modified data to write it back to main memory before the read may proceed. An Exclusive cache line is also stored ONLY in the current cache, but it matches the data in main memory (“clean” in cache parlance). If the agent owning that cache line makes changes to it, the state switches to Modified. A Shared cache line is also “clean” like an Exclusive one, but it may ALSO exist in other caches in the system (where it would likewise be in the Shared state). Finally, an Invalid cache line is exactly what it sounds like: an unused or no-longer-valid cache line. Clearly, the various caches in such a system must communicate several pieces of information with each other. They must snoop, or monitor, bus transactions from other agents to determine when their cache state needs to change, and they must have some means of communicating state changes to other caches in the system.
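
To make these transitions concrete, here is a minimal sketch in C, using entirely hypothetical names rather than anything from the CCIX specification. It models the MESI state of a single cache line and two of the events described above: a write by the local CPU and a snooped read from another agent.

```c
#include <stdio.h>

/* The four MESI states a cache line can be in. */
typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state_t;

/* The local CPU writes to the line. An Exclusive (clean, private) line
 * simply becomes Modified; a Shared line must first have the other
 * copies in the system invalidated; an Invalid line must first be
 * fetched with ownership. In every case the result is Modified. */
mesi_state_t on_local_write(mesi_state_t s)
{
    switch (s) {
    case SHARED:    /* broadcast an invalidate to the other caches */
    case INVALID:   /* fetch the line for exclusive ownership      */
    case EXCLUSIVE: /* clean and private: just mark it dirty       */
    case MODIFIED:  return MODIFIED;
    }
    return INVALID; /* unreachable */
}

/* Another agent reads the same address (observed by snooping). A
 * Modified line is written back to main memory first; afterwards both
 * caches hold the line as Shared. */
mesi_state_t on_remote_read(mesi_state_t s)
{
    switch (s) {
    case MODIFIED:  /* write the dirty data back to memory first */
    case EXCLUSIVE:
    case SHARED:    return SHARED;
    case INVALID:   return INVALID; /* nothing cached, nothing to do */
    }
    return INVALID; /* unreachable */
}

int main(void)
{
    mesi_state_t line = EXCLUSIVE;     /* CPU A reads the line alone  */
    line = on_local_write(line);       /* Exclusive -> Modified       */
    line = on_remote_read(line);       /* CPU B reads: writeback, and */
    printf("final state: %d\n", line); /* both copies are now Shared  */
    return 0;
}
```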

The CCIX protocol specification defines a set of cache states and associated messages and mechanisms to accomplish this same general type of behavior. While full details are available only to CCIX Consortium members, this article will give a high-level overview of the protocol specification. 

Why CCIX for Cache Coherency?

One of the biggest advantages of the CCIX specification is that it builds on the PCI® Express specifications. CCIX’s coherence protocol can be carried across PCI Express links with little or no modification. As shown in Figure 1, an existing PCI Express controller implementation can be extended with logic to implement a CCIX transaction layer. The CCIX transaction layer is responsible for carrying the coherence messages, while the CCIX protocol and link layers are responsible for implementing the coherence protocol itself and acting upon it. These blocks require tight integration with internal system-on-chip (SoC) logic for caching, and are likely to be very specific to the particular architecture in use on that SoC. SoC designers implementing CCIX in their next designs therefore typically partition the CCIX protocol and link layers separately from the generic CCIX transaction layer, enabling tight integration with the internal SoC logic.

Figure 1: CCIX specification utilizes the PCI Express protocol to implement a CCIX transaction layer


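As a rough illustration of this partitioning, the sketch below uses hypothetical C types and names (not from any CCIX Consortium or 草榴社区 API). The transaction layer is modeled as a generic message transport over the PCI Express link, while the protocol/link layer block carries the SoC-specific hooks into the cache logic.

```c
#include <stddef.h>
#include <stdint.h>

/* A coherence message treated as an opaque payload: the CCIX
 * transaction layer transports these but does not interpret them. */
typedef struct {
    uint8_t payload[64];
    size_t  len;
} ccix_msg_t;

/* Transaction layer: extends the PCI Express controller and is
 * responsible only for carrying coherence messages over the link. */
typedef struct {
    int (*send)(ccix_msg_t *msg); /* out over the PCIe/ESM link */
    int (*recv)(ccix_msg_t *msg); /* in from the PCIe/ESM link  */
} ccix_txn_layer_t;

/* Protocol and link layers: implement and act on the coherence
 * protocol, so they need hooks into the SoC's own cache hierarchy.
 * These callbacks are the architecture-specific part of the design. */
typedef struct {
    void (*snoop_request)(uint64_t addr); /* consult local caches  */
    void (*state_update)(uint64_t addr, int new_state);
    ccix_txn_layer_t *txn;                /* the generic transport */
} ccix_protocol_layer_t;

/* Example flow: the protocol layer reacts to an incoming coherence
 * message by snooping local caches, then responds via the transport. */
int handle_incoming(ccix_protocol_layer_t *p, ccix_msg_t *in)
{
    uint64_t addr = 0;       /* would be decoded from 'in'          */
    p->snoop_request(addr);  /* tightly coupled to SoC cache logic  */
    return p->txn->send(in); /* reply travels over the PCIe link    */
}
```
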
Moving Beyond 16GT/s

As noted earlier, one of the biggest attractions of CCIX is its compatibility with PCI Express; in fact, CCIX’s cache coherency protocol can be carried over any PCI Express link running at 8GT/s or faster. The highest data rate specified by PCI Express 4.0 is 16GT/s, which works out to around 64GB/s of total bidirectional bandwidth on a 16-lane link, but some members of the CCIX Consortium needed even more bandwidth. They determined that by raising the transfer rate to 25GT/s, a CCIX link could approach 100GB/s under the same conditions. This led to a CCIX feature known as Extended Speed Mode (ESM).

Since PCI Express is owned by a different standards body, the CCIX Consortium chose a clever mechanism to preserve compatibility between ESM-capable components and PCI Express components. Two CCIX components wishing to communicate with each other proceed through the normal PCI Express link initialization process (generally a hardware-autonomous process) to the highest mutually supported PCI Express speed. From that point, software running on the host system can interrogate CCIX-specific configuration registers to determine whether both components are ESM-capable and, if so, identify their highest supported speeds. That software then programs other CCIX-specific registers on both components to map PCI Express link speed(s) to CCIX ESM link speed(s). From then on, link negotiation targets the CCIX ESM speed(s), so by forcing a link retraining, the two components can communicate at up to 25GT/s.
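
Because the CCIX register offsets and field layouts are available only to consortium members, the following C sketch uses entirely hypothetical register names and accessors; it only illustrates the software sequence described above: check ESM capability on both components, program the speed mapping, and force a link retrain.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical accessors for a component's CCIX-specific config
 * registers; the real offsets and fields are defined in the CCIX
 * specification, which is available to consortium members. */
extern uint32_t ccix_cfg_read(int comp, uint32_t reg);
extern void     ccix_cfg_write(int comp, uint32_t reg, uint32_t val);
extern void     pcie_retrain_link(int comp);

#define CCIX_ESM_CAP      0x00 /* assumed: ESM capable + max speed   */
#define CCIX_ESM_SPEEDMAP 0x04 /* assumed: PCIe-to-ESM speed mapping */

/* After normal PCI Express link training has completed, attempt to
 * switch the link between 'host' and 'dev' to Extended Speed Mode. */
bool ccix_enable_esm(int host, int dev)
{
    uint32_t h = ccix_cfg_read(host, CCIX_ESM_CAP);
    uint32_t d = ccix_cfg_read(dev,  CCIX_ESM_CAP);

    if (!(h & 1) || !(d & 1)) /* bit 0 (assumed): ESM capable */
        return false;         /* stay at standard PCIe speeds */

    /* Pick the highest ESM speed both ends support, then program both
     * components to remap PCIe link speed(s) to ESM link speed(s). */
    uint32_t speed = (h >> 1) < (d >> 1) ? (h >> 1) : (d >> 1);
    ccix_cfg_write(host, CCIX_ESM_SPEEDMAP, speed);
    ccix_cfg_write(dev,  CCIX_ESM_SPEEDMAP, speed);

    /* Force a retrain: negotiation now targets the ESM speeds, so the
     * link can come back up at rates as high as 25GT/s. */
    pcie_retrain_link(host);
    return true;
}
```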

Conclusion

Designers looking for a cache-coherent interconnect with a relatively easy migration path from today’s dominant PCI Express interface should consider CCIX for their next high-performance SoC. Built on the silicon-proven DesignWare® IP for PCI Express 4.0, which is validated in over 1,500 designs and shipped in billions of units, 草榴社区’ complete CCIX IP solution enables cache coherency and speeds up to 25GT/s. An active member of the CCIX Consortium and PCI-SIG, 草榴社区 will continue to innovate in CCIX, PCI Express, and related technologies to ensure the ecosystem meets the high-performance connectivity requirements of data-intensive cloud computing applications. 
