With the Frontier supercomputer topping the list of the world's fastest supercomputing systems in June 2022 and delivering 1.1 exaflops Rmax at 21.1 MW of power, the era of exascale computing has begun. The latest HPC performance benchmarks have demonstrated that the throughput of HPC data centers depends heavily on the networking fabric. In the Frontier cluster, close to 9 million CPU/GPU cores are connected by a PCIe 4.0 physical layer through a 12.8 Tbps switch providing 100GbE over a 145 km network. Next-generation HPC data centers are expected to implement 200G/400G/800G networks, with interconnect upgrades to PCIe 5.0/6.0 along with 56G/112G Ethernet PHYs.
This article summarizes the implementation options that a 112G Ethernet PHY offers with respect to reach, fabric architecture, power, and channel types.
With the deployment of HPC data centers for exascale computing, the interconnect fabric, ports and SerDes IP are shifting to support much higher speeds. Figure 1 shows a schematic representation of a two-rack network in an HPC data center, with Top of Rack (ToR) switches connecting the racks through an optical link. Within a rack, compute resources are connected through PCIe/CXL. A Data Processing Unit, which is essentially a Network Interface Card with processing capabilities, connects the ToR switch to these cores through Direct Attach Copper (DAC) cables or Active Copper Cables (ACC).
Figure 1: HPC as a network of computing resources
Table 1 summarizes the present and future interconnect implementation options for HPC data centers. Early deployments with four-lane (x4) or eight-lane (x8) form factors and 56G PHYs led to 200G/400G ports. With a SerDes upgrade from 56G to 112G Ethernet PHYs, new rack unit design starts are expected to retain the x4/x8 port choices, doubling the port bandwidth to 400G/800G.
| | Operational | Early Deployment | Mainstream Deployment |
| CPU-Accelerator Fabric | PCIe 4.0 | PCIe 5.0/CXL 2.0 | PCIe 6.0/CXL 3.0 |
| System Interconnect | 100GbE with four-lane 25G PHY | | |
| ToR Switch | 12.8T | 25.6T | 51.2T |
Table 1: Comparison of HPC Network Components and Interconnects
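The port arithmetic described above (lane count times per-lane rate) can be sketched in a few lines. This is only an illustration: the function name and the lookup table are hypothetical, and the nominal payload figures (about 50G per 56G-class lane and about 100G per 112G-class lane) are assumptions consistent with the 200G/400G and 400G/800G port speeds quoted in the text.

```python
# Nominal Ethernet payload per lane in Gb/s, keyed by SerDes class.
# Assumption: a 56G-class lane carries ~50G and a 112G-class lane ~100G.
LANE_PAYLOAD_GBPS = {"56G": 50, "112G": 100}


def port_bandwidth_gbps(serdes: str, lanes: int) -> int:
    """Return the nominal port bandwidth for a lane count and SerDes class."""
    return LANE_PAYLOAD_GBPS[serdes] * lanes


for serdes in ("56G", "112G"):
    for lanes in (4, 8):
        print(f"x{lanes} port with {serdes} PHY: {port_bandwidth_gbps(serdes, lanes)}G")
```

Keeping the x4/x8 form factors fixed and changing only the per-lane rate is what doubles the port bandwidth in this model.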
New 400G/800G optical module designs using the QSFP-DD form factor are targeting a challenging 14 W power budget from the multi-source agreement (MSA) standard. This creates the need for a power-optimized VSR-based electrical interface for optical DSP SoCs. The 112G-VSR specification defines a 15 dB channel for the chip-to-module interface, whereas the LR specification defines a 28 dB channel with two connectors.
Compared to an LR specification, the lower channel loss target for VSR channels allows SerDes designers to offer better overall power efficiency with a dedicated architecture.
Every additional serialization/deserialization stage in the data path not only adds power for data transmission but also requires additional power for system cooling. This forces system designers to explore implementation options with and without retimers by deploying modules with 112G VSR/LR PHYs. Figure 2 illustrates a representative implementation in which longer switch-to-port links are enabled by deploying a retimer in line with the VSR optical module. Alternatively, although an LR PHY consumes more power than a VSR PHY, the additional Digital Signal Processing equalization in a 112G LR PHY can potentially remove the need for a retimer.
Figure 2: Implementation choices with retimer and LR/VSR PHYs
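The trade-off between a retimed VSR link and a retimer-less LR link can be sketched as a simple loss-budget check. This is a rough illustration, not a design rule: the 15 dB and 28 dB budgets are the figures quoted above, while the function name and the retimer model (a retimer placed mid-channel that splits the loss into two equal segments, each of which must fit the VSR budget) are simplifying assumptions.

```python
VSR_MAX_LOSS_DB = 15.0  # chip-to-module budget quoted in the text
LR_MAX_LOSS_DB = 28.0   # LR budget quoted in the text (two connectors)


def link_options(channel_loss_db: float) -> list[str]:
    """List the PHY choices that can close a channel with the given loss."""
    options = []
    if channel_loss_db <= VSR_MAX_LOSS_DB:
        options.append("VSR PHY, no retimer")
    if channel_loss_db <= LR_MAX_LOSS_DB:
        options.append("LR PHY, no retimer")
    # Assumption: the retimer regenerates the signal mid-channel, so each
    # half of the channel only has to fit within the VSR budget.
    if VSR_MAX_LOSS_DB < channel_loss_db <= 2 * VSR_MAX_LOSS_DB:
        options.append("VSR PHY + retimer")
    return options


for loss_db in (10.0, 24.0, 35.0):
    print(f"{loss_db:.0f} dB channel: {link_options(loss_db) or 'no option in this model'}")
```

In this model a 24 dB channel can be closed either by an LR PHY alone or by a VSR PHY with a retimer, which is the choice a system designer would then weigh on power and cooling.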
Inter-ToR switch connectivity is invariably implemented through optical links, whereas intra-rack links are implemented with pluggable modules and DAC or ACC cables. With 112G Ethernet PHY deployments, the industry is exploring multiple electrical interfaces to reduce the combined power of the SerDes, optical engine and retimer. Table 2 summarizes emerging implementation options for next-generation electro-optical links.
Table 2: Next generation electro optical links in HPC data centers
CEI-112G-LR-PAM4 specifies a 112 Gb/s chip-to-chip PAM4 electrical interface for less than 28 dB loss at the Nyquist frequency, including two connectors. A 112G LR SerDes is expected to work with all these channels and provide the PHY level BER of 1e-4. The forward error correction (FEC) at the protocol layer is expected to improve the BER from 1e-4 to 1e-12 or 1e-15.
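To give a feel for what these BER targets mean at this line rate, the sketch below converts a BER into an average time between bit errors. It assumes independent errors and uses the raw 112 Gb/s lane rate; the function name is hypothetical and the calculation is plain arithmetic, not part of any specification.

```python
LINE_RATE_BPS = 112e9  # raw per-lane line rate named in the text


def mean_seconds_between_errors(ber: float, line_rate_bps: float = LINE_RATE_BPS) -> float:
    """Average time between bit errors, assuming independent errors."""
    return 1.0 / (ber * line_rate_bps)


for ber in (1e-4, 1e-12, 1e-15):
    seconds = mean_seconds_between_errors(ber)
    print(f"BER {ber:.0e}: one bit error every {seconds:.3g} s on average")
```

At a BER of 1e-4 the lane sees roughly eleven million bit errors per second, while 1e-12 corresponds to about one error every nine seconds and 1e-15 to about one every two and a half hours, which is why FEC is needed on top of the PHY.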
As system deployments progress, implementers are considering an LR Max option for 112G SerDes so that they have more margin in their system designs. Table 3 shows the typical and maximum values of each component in an LR channel.
Table 3: Emerging requirements of 112G SerDes LR and LR Max
An orthogonal channel with 9-inch traces on both line cards, built with Megtron material, may be considered a typical implementation. However, the trace length, package loss and choice of PCB material change the loss, Insertion Loss Deviation (ILD), and reflections of the channel. Figure 3 shows the loss profiles for various channels.
Figure 3: Varied loss profile of HPC LR channels
While there is currently no standard for LR Max, the industry's need for additional margin is driving demand for an LR Max SerDes architecture. Innovative DSP techniques for receiver equalization, such as maximum likelihood sequence detection (MLSD, shown in Figure 4) in the LR Max receiver, provide attractive implementation options at the cost of a marginal increase in power and latency.
Figure 4: Adaptive DSP with MLSD for LR Max equalization
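The idea behind MLSD can be illustrated with a small Viterbi detector for PAM4. This is a minimal sketch under stated assumptions, not the receiver architecture shown in Figure 4: the channel is modeled as a single post-cursor tap, y[k] = x[k] + alpha * x[k-1], the tap value alpha = 0.5 is arbitrary, the symbol preceding the data is assumed known, and all function names are hypothetical.

```python
PAM4_LEVELS = (-3.0, -1.0, 1.0, 3.0)
START_LEVEL = PAM4_LEVELS[0]  # assumed known symbol preceding the data


def apply_channel(symbols, alpha):
    """Noiseless one-tap ISI channel: y[k] = x[k] + alpha * x[k-1]."""
    received, prev = [], START_LEVEL
    for x in symbols:
        received.append(x + alpha * prev)
        prev = x
    return received


def slicer_detect(received):
    """Symbol-by-symbol decision that ignores ISI (nearest PAM4 level)."""
    return [min(PAM4_LEVELS, key=lambda lvl: abs(y - lvl)) for y in received]


def mlsd_detect(received, alpha):
    """Viterbi search for the symbol sequence with the least squared error."""
    inf = float("inf")
    # metric[s]: best accumulated error of paths ending in PAM4_LEVELS[s]
    metric = [0.0 if lvl == START_LEVEL else inf for lvl in PAM4_LEVELS]
    history = []  # history[k][s]: best previous state for state s at step k
    for y in received:
        new_metric = [inf] * len(PAM4_LEVELS)
        best_prev = [0] * len(PAM4_LEVELS)
        for cur, x_cur in enumerate(PAM4_LEVELS):
            for prev, x_prev in enumerate(PAM4_LEVELS):
                cost = metric[prev] + (y - (x_cur + alpha * x_prev)) ** 2
                if cost < new_metric[cur]:
                    new_metric[cur], best_prev[cur] = cost, prev
        metric = new_metric
        history.append(best_prev)
    # Trace back from the best final state to recover the sequence.
    state = min(range(len(PAM4_LEVELS)), key=metric.__getitem__)
    decided = []
    for best_prev in reversed(history):
        decided.append(PAM4_LEVELS[state])
        state = best_prev[state]
    decided.reverse()
    return decided


sent = [3.0, -3.0, 1.0, -1.0, -3.0, 3.0]
rx = apply_channel(sent, alpha=0.5)
print("slicer:", slicer_detect(rx))
print("MLSD:  ", mlsd_detect(rx, alpha=0.5))
```

With this tap value the inter-symbol interference is large enough to push samples across the slicer thresholds, so the symbol-by-symbol decision makes errors even without noise, whereas the sequence detector recovers the transmitted symbols by accounting for the previous symbol on every branch. The extra state tracking and traceback are where the added power and latency come from.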
The networking infrastructure in HPC data centers is evolving to enable exascale computing, going from 100G to 200G/400G and 800G. New electro-optical interfaces such as co-packaged optics, near-packaged optics and pluggable optics with a linear interface provide a multitude of options to optimize power, latency and performance. 草榴社区 provides integrated 112G Ethernet PHY IP for extra short reach (XSR) and XSR+, Linear, and very short reach (VSR) interfaces, as well as UCIe PHY IP, to implement these electrical interfaces. The 草榴社区 112G Ethernet PHY for LR and LR Max channels addresses the need for additional margin in intra-rack DAC/ACC links.