
For many years, memory and storage have been clearly distinct things. Memory is a short-term place to hold data while a nearby CPU or accelerator processes that data. It is effectively working memory placed near the elements that need the data, providing rapid access with very low latency. Typical examples include SRAM and DRAM in its various forms: DDR, LPDDR and, more recently, HBM. In addition to providing rapid access to data with low latency, these memory devices share the characteristic of being volatile: they must remain powered on to retain their data.

Storage, on the other hand, is long-term: it is non-volatile, so data is retained even when the device is powered off, but its access times and latency are much greater than those of traditional memory devices. Typical examples include hard disk drives (HDDs) and solid state drives (SSDs). Table 1 compares some of the characteristics of memory and storage.

 

| Characteristic | Memory | Storage |
|---|---|---|
| Examples | DDR, LPDDR, HBM | SSD, HDD |
| Proximity to CPU | Near or embedded | Farther |
| Access time | Fast, low latency | Slower, higher latency |
| Permanence | Volatile, requires power | Persistent, no power required |
| Capacity | Limited by physical constraints | Not inherently limited |
| Data access size | Byte | Blocks: pages or sectors (kBytes) |
| Interface to CPU | Various JEDEC DDR standards | PCIe with NVMe, other (SATA, SAS) |

 

Table 1: Characteristics of memory versus storage 

 

Years ago, before the advent of SSDs, the differences between memory and storage were simple and stark. Memory meant random access memory (RAM), and storage meant magnetic media (disk drives or magnetic tape). Because of the physical differences between the two, memory and storage also differ in how data is accessed, as shown in the last rows of Table 1. Memory can be read or written a byte at a time, while storage, largely due to its rotating disk structure, has a minimum access unit of a sector (typically 512 bytes for HDDs). As SSDs replaced HDDs, this limitation remained: even though SSDs are not rotating magnetic media, the nature of their design still prevents them from being read or written a byte at a time like RAM.
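To make the distinction concrete, the minimal C sketch below contrasts the two access models: RAM is touched one byte at a time through a pointer, while a block device is accessed in whole blocks via pread()/pwrite(). The device path /dev/sdX is a placeholder, and a real program would need appropriate permissions and alignment handling.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK_SIZE 4096   /* typical minimum I/O unit for an SSD/HDD */

int main(void)
{
    /* Memory: byte-addressable -- a single byte can be read or written. */
    uint8_t *ram = malloc(BLOCK_SIZE);
    if (!ram) return 1;
    ram[17] = 0xAB;                       /* touch exactly one byte */
    printf("byte 17 in RAM = 0x%02X\n", ram[17]);

    /* Storage: block-addressable -- even a one-byte change means
     * transferring at least one whole block. "/dev/sdX" is a placeholder
     * for an actual block device or file. */
    int fd = open("/dev/sdX", O_RDWR);
    if (fd >= 0) {
        uint8_t block[BLOCK_SIZE];
        pread(fd, block, BLOCK_SIZE, 0);  /* read the whole block in      */
        block[17] = 0xAB;                 /* modify one byte in the copy  */
        pwrite(fd, block, BLOCK_SIZE, 0); /* write the whole block back   */
        close(fd);
    }

    free(ram);
    return 0;
}
```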

SSDs store data in a matrix of electrical cells organized into rows called pages, where the data is stored. Pages are grouped together to form blocks. SSDs can only write to empty pages within a block. The net result is that SSDs read data a page at a time and can write at the page level only if the target page is empty; otherwise, an entire block must be erased before the page can be written. Reading or writing a page can translate to 16 kB of data, so these devices are poorly suited to cache-like applications where small amounts of data must be accessed frequently while working on a problem.
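The erase-before-write behavior described above can be illustrated with a small, simplified model in C. The page and block sizes are assumptions chosen for illustration (16 kB pages, 256 pages per block); real NAND geometries vary, and this sketch ignores wear leveling and the flash translation layer.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Illustrative NAND geometry -- real devices vary. */
#define PAGE_SIZE        (16 * 1024)   /* smallest unit that can be read/programmed */
#define PAGES_PER_BLOCK  256           /* smallest unit that can be erased          */

struct block {
    bool page_written[PAGES_PER_BLOCK];
    char data[PAGES_PER_BLOCK][PAGE_SIZE];
};

/* Erase the whole block: the only way to make its pages writable again. */
static void erase_block(struct block *b)
{
    memset(b, 0, sizeof(*b));
}

/* Program a page. Fails if the page has already been written since the
 * last erase, mirroring the "write only to empty pages" rule. */
static bool write_page(struct block *b, int page, const char *src)
{
    if (b->page_written[page])
        return false;                  /* must erase the whole block first */
    memcpy(b->data[page], src, PAGE_SIZE);
    b->page_written[page] = true;
    return true;
}

int main(void)
{
    static struct block blk;
    static char payload[PAGE_SIZE];

    erase_block(&blk);
    printf("first write ok?  %d\n", write_page(&blk, 0, payload)); /* 1 */
    printf("rewrite ok?      %d\n", write_page(&blk, 0, payload)); /* 0 */

    erase_block(&blk);                 /* erase wipes all 256 pages */
    printf("after erase ok?  %d\n", write_page(&blk, 0, payload)); /* 1 */
    return 0;
}
```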

With new types of persistent memory, like Intel's Optane technology and others, offering non-volatility and access times approaching those of DRAM, the line between memory and storage is beginning to blur, opening up interesting possibilities.


PCIe vs CXL for Memory/Storage

PCI Express (PCIe) implementations are expandable and hierarchical, with embedded switches or switch chips allowing one root port to interface with multiple endpoints, such as multiple storage devices (as well as other endpoints like Ethernet cards and display drivers). However, these implementations show their limitations in large systems with isolated memory pools that require heterogeneous computing, where the processor and accelerator share the same data and memory space within a single 64-bit address space. The lack of a cache coherency mechanism makes memory performance for these applications inefficient and latency unacceptable compared with alternative implementations using CXL.

While PCIe typically transfers large blocks of data through a direct memory access (DMA) mechanism, CXL uses the dedicated CXL.mem protocol with load-store semantics (load = read, store = write) for short data exchanges at extremely low latency.
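A rough C sketch of the two access models follows. The DMA descriptor layout and the submit/wait helpers are hypothetical stand-ins for a real PCIe driver, while the CXL.mem path is represented by ordinary loads and stores to memory that the hardware exposes in the system address space (emulated here with a plain array).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* --- PCIe-style access: software builds a descriptor and the device
 *     moves a large block via DMA. The submit/wait helpers below are
 *     empty placeholders standing in for a real driver. --- */
struct dma_descriptor {
    uint64_t src_addr;   /* device-side source address          */
    uint64_t dst_addr;   /* host buffer address                 */
    uint32_t length;     /* typically KB- to MB-sized transfers */
};

static void submit_dma(const struct dma_descriptor *d) { (void)d; /* stub */ }
static void wait_for_completion(void)                  { /* stub */ }

static void read_block_over_pcie(uint64_t dev_addr, uint64_t host_buf, uint32_t len)
{
    struct dma_descriptor d = { dev_addr, host_buf, len };
    submit_dma(&d);          /* device DMAs the whole block into host memory */
    wait_for_completion();   /* software waits before touching the data      */
}

/* --- CXL.mem-style access: device memory sits in the system address
 *     space, so the CPU issues ordinary loads and stores and the
 *     hardware moves 64-byte cachelines on demand. --- */
static uint64_t read_word_over_cxl(volatile uint64_t *cxl_mem, size_t i)
{
    return cxl_mem[i];       /* one load; coherency handled by hardware */
}

int main(void)
{
    static uint64_t fake_cxl_region[1024];   /* stand-in for mapped device memory */
    fake_cxl_region[3] = 42;

    read_block_over_pcie(0x1000, (uint64_t)(uintptr_t)fake_cxl_region, 4096);
    printf("word 3 via CXL-style load: %llu\n",
           (unsigned long long)read_word_over_cxl(fake_cxl_region, 3));
    return 0;
}
```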

While the introduction of PCIe 6.0.1 at 64 GT/s helps increase the bandwidth available for storage applications with minimal or no increase in latency, the lack of coherency still limits PCIe to applications like traditional SSDs, which are block storage devices. For these storage applications, NVMe, which uses PCIe as the transport interface, has become the dominant SSD technology. Next-generation SSDs with CXL interfaces instead of PCIe are currently being developed.

Table 2 summarizes some of the important characteristics of PCIe versus CXL for storage applications. The remainder of this article highlights CXL's main characteristics for high-performance computing storage applications.

 

| Feature | PCI Express | CXL |
|---|---|---|
| Max bandwidth | 32 GT/s x16 for PCIe 5.0; 64 GT/s x16 for PCIe 6.0 | 32 GT/s x16 for CXL 2.0; 64 GT/s x16 for CXL 3.0 |
| Coherency | None | Supported; host-managed |
| Latency | 100s of ns | 10s of ns |
| Cacheability | PCIe address space typically non-cacheable | CXL address space cacheable by definition |
| Switching | Switch chips and embedded switches | Embedded in CXL 2.0 and 3.0; future switch chips expected |
| Topologies | Host to device, switched | Host to device, switched, fabrics |
| Memory access | Typically DMA | Dedicated CXL.mem |
| Transfer sizes | Optimized for larger data payloads; traditional block storage (512B, 1KB, 2KB, 4KB); lower overhead for non-cached data | Optimized for 64B cacheline transfers; fixed size offers low latency |
| Storage standards | PCIe-based Flash memory (NVMe) | Emerging SSDs and DRAM with CXL interfaces, promising for many new types of memory/storage applications |
| Datapath | 32b – 512b, 1024b | Natively 512b, 1024b |
| Implementation | PCIe only | CXL controller with support for PCIe; no Home Agent needed in devices |
| Applications | Non-coherent data movement, large DMA block transfers, traditional storage controllers; NVMe | Linear storage: byte-addressable (vs. block or sector) SSD successors; computational memory |
| Ecosystem | Massive and well established up to PCIe 5.0 | Limited adoption so far; expected to accelerate through 2022 and 2023 |

 

Table 2: Characteristics of PCIe versus CXL for storage applications



Advantages of CXL for Emerging HPC Applications: Memory Composability and Disaggregation

Memory pooling, introduced in CXL 2.0, has been calculated to theoretically enable support for at least 1.28 petabytes (PB) of CXL-attached memory, and with multi-level switching and other features introduced in CXL 3.0, the figure could be even higher. This opens the door to new approaches to solving large computation problems, in which multiple hosts work on massive problems while accessing the entire dataset simultaneously. For example, with access to a petabyte of memory, whole new models can be created and coded to work on complex problems, like modeling climate change, under the assumption that the system can handle the entire problem at once rather than breaking it down into smaller pieces.
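The exact capacity depends on the topology assumed. As a purely illustrative sketch (the switch, device, and capacity counts below are placeholder assumptions, not figures from the CXL specification or the calculation behind the 1.28 PB estimate), a few multiplications show how pooled CXL memory quickly reaches petabyte scale.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Placeholder topology -- not taken from the CXL specification. */
    const uint64_t switches           = 16;  /* assumed CXL switches in the system  */
    const uint64_t devices_per_switch = 16;  /* assumed pooled memory devices each  */
    const uint64_t tb_per_device      = 8;   /* assumed capacity per device, in TB  */

    uint64_t total_tb = switches * devices_per_switch * tb_per_device;
    printf("pooled capacity under these assumptions: %llu TB (%.2f PB)\n",
           (unsigned long long)total_tb, total_tb / 1000.0);
    return 0;
}
```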

Advanced fabric capabilities introduced in CXL 3.0 are a shift from previous generations and their traditional tree-based architectures. The new fabric supports up to 4,096 nodes, each able to communicate with one another via a port-based routing (PBR) mechanism. A node can be a CPU host, a CXL accelerator (with or without memory), a PCIe device, or a Global Fabric Attached Memory (GFAM) device. A GFAM device is a Type 3 device that effectively acts as a shared pool of memory whose I/O space is owned either by a single Host or by a Fabric Manager. After configuration, other Hosts and devices on the CXL fabric can directly access the pooled memory of the GFAM device. The GFAM device opens up an array of new possibilities for building systems made up of compute and memory elements arranged to satisfy the needs of specific workloads. For example, with access to a terabyte or a petabyte of memory, it's possible to create whole new models to tackle complex challenges such as mapping the human genome.

Table 3 shows some of the key features that are driving the adoption of CXL for memory and storage applications.

 

| Feature | When Introduced |
|---|---|
| Coherency and low latency | Introduced in CXL 1.0/1.1 |
| Switching | Introduced in CXL 2.0 as single-level switching for CXL.mem; expanded to multi-level switching for all protocols in CXL 3.0 |
| Memory pooling & sharing | Pooling introduced in CXL 2.0 with MLD support; sharing added in CXL 3.0 |
| Fabrics | Introduced in CXL 3.0 |

 

Table 3: Key CXL characteristics for storage applications 


Traditionally, there have been only a couple of options for attaching additional memory to accelerators or other SoCs. The most common method is to add DDR memory channels supporting more standard DDR memory modules. The only other viable option has been to integrate memory together with the SoC in the same package. With CXL, it becomes possible to put memory onto something very much like the PCIe bus (CXL uses PCIe PHYs and electricals). This enables a system to support more memory modules using a card with a standard CXL interface, without the need for additional DDR channels. Figure 1 shows an example of how this can vastly increase the memory accessible to the SoC, both in amount (GB) and in type (RAM or persistent memory). Using this approach, memory begins to look like a pooled resource that can be accessible to multiple hosts via switching, which was first introduced in CXL 2.0 and vastly expanded in CXL 3.0.

Figure 1: CXL enables media independence with a single interface, supporting DDR3/4/5, LPDDR3/4/5, and persistent memory/storage

As can be seen in Figure 1, CXL solves a problem that has blocked the development of expandable pools of memory accessible to multiple systems: it does away with proprietary interconnects, so that any CPU, GPU, or tensor processing unit (TPU) that needs access to additional memory can be designed with an industry-standard CXL interface. CXL will eventually permit connection to a vast array of memory modules, including SSDs, DDR DRAM, and emerging persistent memories. The combination of CXL's low latency, coherency, and memory pooling and sharing capabilities makes it a viable technology for allowing system architects to create large pools of both volatile and persistent memory that extend across multiple infrastructure pools, becoming a true shared resource.
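From the software side, CXL-attached memory expanders on Linux typically appear as additional, often CPU-less, NUMA nodes, so standard NUMA-aware allocation can place data on them. The sketch below uses libnuma; the node number is an assumption for illustration and would differ from system to system.

```c
/* Build with: gcc cxl_numa.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA support not available\n");
        return 1;
    }

    /* Assumed: the CXL memory expander shows up as NUMA node 2 on this
     * system (often a CPU-less node). Check `numactl --hardware` first. */
    const int cxl_node = 2;
    const size_t size = 64UL * 1024 * 1024;   /* 64 MB */

    if (cxl_node > numa_max_node()) {
        fprintf(stderr, "node %d not present on this system\n", cxl_node);
        return 1;
    }

    void *buf = numa_alloc_onnode(size, cxl_node);
    if (!buf) {
        fprintf(stderr, "allocation on node %d failed\n", cxl_node);
        return 1;
    }

    memset(buf, 0, size);   /* touch the pages so they are actually placed */
    printf("64 MB placed on NUMA node %d (CXL-attached memory)\n", cxl_node);

    numa_free(buf, size);
    return 0;
}
```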

Another advantage of the approach shown in Figure 1 is that the SoC pins devoted to CXL interfaces do not have to be dedicated to memory interfaces—they can be used to connect anything with a CXL interface, including additional CXL switches, GFAM devices, or chip-to-chip interconnects. 

At the 2022 Flash Memory Summit, it was clear that CXL is emerging as the leading architecture for pooling and sharing connected memory devices, targeting both DRAM and NAND flash devices. Many large SSD companies are either introducing or announcing plans for flash-based SSDs with a CXL interface, and others are discussing memory controllers or other memory products featuring CXL as the high-speed interface to memory (see Figure 1).

The CXL Consortium has now acquired the assets of Gen-Z and OpenCAPI, further expanding the scope and types of applications that CXL can handle.

Figure 2: CXL enables fine grained memory allocation (pooling) and sharing among multiple Hosts


CXL for Memory Disaggregation and Composability

The advantages of CXL are many, but two in particular are worth highlighting: memory disaggregation and composability. Memory disaggregation refers to the capability of spreading memory across various devices while still allowing sharing and coherency among multiple servers, such that memory is no longer aggregated and devoted to a single device or server. Composability refers to the capability to allocate the disaggregated memory to particular CPUs or TPUs as needed, with the result that memory utilization can be increased substantially. This enhanced utilization offers a critical improvement over current systems, where actual memory utilization, as measured by Microsoft and highlighted in "Microsoft Azure Blazes the Disaggregated Memory Trail with zNUMA," can be on the order of only 40%, with most virtual machines (VMs) utilizing less than half of the memory allocated to them by their hypervisor.

With CXL 2.0 and CXL 3.0, which include switching, a host can access memory from one or more devices that form a pool. It's important to note that in this kind of pooled configuration, only the resources themselves, and not the contents of the memory, are shared among the hosts: each region of memory can belong to only a single coherency domain. Memory sharing, added in the CXL 3.0 specification, allows individual memory regions within pooled resources to be shared between multiple hosts. Figure 2 shows an example illustrating memory pooling and memory sharing within a single system.
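The pooling-versus-sharing distinction can be sketched with a toy data structure in C. This is a conceptual illustration only, not the CXL specification's actual structures: a pooled region is owned by exactly one host, while a shared region may be mapped by several hosts at once.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Conceptual model only -- not the CXL specification's actual structures. */
struct region {
    const char *name;
    uint32_t    host_mask;   /* bit i set => host i can access the region */
};

static int is_pooled(const struct region *r)
{
    /* exactly one bit set => dedicated to a single host (one coherency domain) */
    return r->host_mask && (r->host_mask & (r->host_mask - 1)) == 0;
}

int main(void)
{
    struct region regions[] = {
        { "pooled-to-host0", 1u << 0 },                            /* CXL 2.0-style pooling */
        { "pooled-to-host3", 1u << 3 },
        { "shared-h0-h1-h2", (1u << 0) | (1u << 1) | (1u << 2) },  /* CXL 3.0 sharing */
    };

    for (size_t i = 0; i < sizeof(regions) / sizeof(regions[0]); i++)
        printf("%-16s %s\n", regions[i].name,
               is_pooled(&regions[i]) ? "pooled (single host)"
                                      : "shared (multiple hosts)");
    return 0;
}
```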

CXL can also enable computational memory, where attached memory devices can carry out some computation directly on the memory contents without the involvement of a Host or accelerator.  

The ultimate goal is 100% disaggregation and composability, in which all memory attached to a system can be utilized by any attached device and is all available as a pooled resource. 

To achieve this goal of 100% disaggregation and composability, a system needs to be able to discover and enumerate every device within the system, including servers, accelerators, memory expansion devices, and other devices with shareable memory. This requires rack-level discovery and identification of 100% of the devices (servers, memory pools, accelerators, and storage devices), whether already composed or as yet unassigned. This can only be accomplished using PCIe and CXL capabilities, since fabrics like Ethernet and InfiniBand can't support such fine-grained discovery, disaggregation, and composition.
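On a Linux host with the kernel's CXL driver stack enabled, discovered CXL devices are exposed under /sys/bus/cxl/devices; the short C sketch below simply lists them. This is offered as an illustration of host-side enumeration, not of the rack-level fabric management described above.

```c
#include <dirent.h>
#include <stdio.h>

/* List CXL devices known to the kernel. Requires a Linux kernel with the
 * CXL driver stack enabled; on other systems the directory won't exist. */
int main(void)
{
    const char *path = "/sys/bus/cxl/devices";
    DIR *dir = opendir(path);
    if (!dir) {
        perror(path);
        return 1;
    }

    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_name[0] == '.')
            continue;                     /* skip "." and ".." */
        printf("%s\n", entry->d_name);    /* e.g. mem0, port1, decoder0.0 */
    }

    closedir(dir);
    return 0;
}
```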

This approach of dynamically creating flexible hardware configurations capable of meeting different workload requirements is often referred to as Composable Disaggregated Infrastructure (CDI), and it becomes possible with the low-latency fabrics now enabled by CXL. This capability can effectively permit an entire rack to be configured and act like a single server.


Summary

CXL is rapidly becoming the interface of choice for managing and sharing large amounts of memory coherently among multiple Devices and Hosts. It is enabling a true heterogeneous composable and disaggregated architecture supporting more than just memory. The CXL 3.0 spec expands on previous versions of CXL, doubling the per-lane bandwidth to 64 GT/s with no added latency, while adding multi-level switching, efficient peer-to-peer communications, and memory sharing.

To ease and accelerate adoption of the latest CXL protocol, 草榴社区 offers a complete CXL IP solution, encompassing a controller with IDE Security Module, PHY, and verification IP, to deliver secure, low-latency, high-bandwidth interconnection for AI, machine learning, and high-performance computing applications, including storage.

Built on silicon-proven 草榴社区 PCIe IP, our CXL IP solution lowers integration risks for device and host applications and helps designers achieve the benefits that CXL 3.0 brings to SoCs for data-intensive applications. As an early CXL contributor, 草榴社区 had early access to the latest specification, enabling our engineers to deliver a more mature solution. 草榴社区 has already delivered CXL 2.0 and 3.0 solutions with IDE support to several customers, including for next-generation SSD and advanced memory applications, with proven silicon in customer products and successful third-party interoperability demonstrated in hardware.
