
With the Frontier supercomputer topping the June 2022 list of the world's fastest supercomputing systems, delivering an Rmax of 1.1 exaflops at 21.1 MW, exascale computing has entered its era. The latest HPC performance benchmarks have demonstrated that the throughput of HPC datacenters depends heavily on the networking fabric. In the Frontier cluster, close to 9 million CPU/GPU cores are connected through 12.8 Tbps switches using a PCIe 4.0 physical layer, providing 100GbE links across a network with 145 km of cabling. Next-generation HPC data centers are expected to implement 200G/400G/800G networks with interconnect upgrades using PCIe 5.0/6.0 along with 56G/112G Ethernet PHYs.

This article summarizes the implementation options that a 112G Ethernet PHY offers with respect to reach, fabric architecture, power, and channel types.

Interconnect Fabric, Ports and SerDes

With the deployment of HPC data centers for exascale computing, the interconnect fabric, ports, and SerDes IP are now shifting to support much higher speeds. Figure 1 shows an iconographic representation of a two-rack network in an HPC datacenter with Top of Rack (ToR) switches connecting racks through an optical link. Within a rack, compute resources are connected through PCIe/CXL, and a Data Processing Unit, essentially a Network Interface Card with processing capabilities, connects these cores to the ToR switch through Direct Attach Copper (DAC) or Active Copper Cables (ACC).

Figure 1: HPC as a network of computing resources

Table 1 summarizes the present and future interconnect implementation options for HPC data centers. Early deployments with four-lane (x4) or eight-lane (x8) form factors and 56G PHYs led to 200G/400G ports. With a SerDes upgrade from 56G to 112G Ethernet PHY, new rack-unit design starts are expected to retain the choice of x4/x8 ports, doubling the port bandwidth to 400G/800G.

 

| | Operational | Early Deployment | Mainstream Deployment |
|---|---|---|---|
| CPU-Accelerator Fabric | PCIe 4.0 | PCIe 5.0 / CXL 2.0 | PCIe 6.0 / CXL 3.0 |
| System Interconnect | 100GbE with four-lane 25G PHY | 200G with four-lane 56G PHY; 400G with eight-lane 56G PHY; 400G with four-lane 112G PHY | 400G with four-lane 112G PHY; 800G with eight-lane 112G PHY; 800G with four-lane 224G PHY |
| ToR Switch | 12.8T | 25.6T | 51.2T |

 

Table 1: Comparison of HPC Network Components and Interconnects
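
The lane-to-port arithmetic behind Table 1 can be sketched as follows. The payload-per-lane figures are approximations introduced for this sketch: a "56G" PHY carries roughly 50 Gb/s of payload per lane once PAM4 line coding and FEC overhead are accounted for, a "112G" PHY roughly 100 Gb/s, and so on.

```python
# Approximate payload rate per lane for each nominal PHY rate (Gb/s).
# These values are round-number assumptions, not spec-exact figures.
PAYLOAD_PER_LANE_GBPS = {56: 50, 112: 100, 224: 200}

def port_speed_gbps(lanes: int, phy_rate_gbps: int) -> int:
    """Marketed port speed for a given lane count and nominal PHY rate."""
    return lanes * PAYLOAD_PER_LANE_GBPS[phy_rate_gbps]

assert port_speed_gbps(4, 56) == 200    # 200G with four-lane 56G PHY
assert port_speed_gbps(8, 56) == 400    # 400G with eight-lane 56G PHY
assert port_speed_gbps(4, 112) == 400   # 400G with four-lane 112G PHY
assert port_speed_gbps(8, 112) == 800   # 800G with eight-lane 112G PHY
assert port_speed_gbps(4, 224) == 800   # 800G with four-lane 224G PHY
```

This is why a 56G-to-112G SerDes upgrade doubles port bandwidth without changing the x4/x8 form factor.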

Retimer, Very Short Reach (VSR), and Long Reach (LR) PHYs

New 400G/800G optical module designs using the QSFP-DD form factor are targeting a challenging 14W power budget from the MSA (multi-source agreement) standard. This creates the need for a power-optimized VSR-based electrical interface for optical DSP SoCs. The 112G-VSR specification defines a 15dB channel for the chip-to-module interface, whereas the LR specification defines a 28dB channel with two connectors.

Compared to the LR specification, the lower channel-loss target for VSR channels allows SerDes designers to offer better overall power efficiency with a dedicated architecture.
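
The power-efficiency gap follows directly from how much signal survives each channel. A minimal sketch of the insertion-loss arithmetic, converting a dB loss budget into a linear amplitude ratio:

```python
def amplitude_ratio(loss_db: float) -> float:
    """Fraction of the launch amplitude surviving a channel with the
    given insertion loss (voltage dB: ratio = 10^(-dB/20))."""
    return 10 ** (-loss_db / 20)

vsr = amplitude_ratio(15)   # ~0.18 of the launch amplitude survives
lr = amplitude_ratio(28)    # ~0.04 of the launch amplitude survives
```

At 28 dB only about 4% of the transmit amplitude reaches the receiver, which is why an LR receiver needs heavier equalization, and therefore more power, than a VSR receiver.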

Every additional serialization/deserialization step in the data path not only adds power for data transmission but also requires additional power for system cooling. This forces system designers to explore implementation options with and without retimers when deploying modules with 112G VSR/LR PHYs. Figure 2 illustrates a representative implementation where longer switch-to-port links are enabled by deploying a retimer in line with the VSR optical module. Alternatively, although an LR PHY consumes more power than a VSR PHY, the additional digital signal processing (DSP) equalization in a 112G LR PHY can potentially avoid the need for a retimer.

Figure 2: Implementation choices with retimer and LR/VSR PHYs


Evolution of Electro-Optical Interface

Inter-ToR-switch connectivity is invariably implemented through optical links, whereas intra-rack links are implemented with pluggable modules, DACs, and ACCs. With 112G Ethernet PHY deployments, the industry is exploring multiple electrical interfaces to save overall SerDes + optical engine + retimer power. Table 2 summarizes emerging implementation options for next-generation electro-optical links.

Table 2: Next generation electro optical links in HPC data centers

  • Co-Packaged Optics: Improvements in silicon photonics and packaging technology have made co-packaged optics (CPO) viable with 25G and 100G lambda optical dies. For the electrical interconnect between the electrical and optical dies, the OIF 112G-XSR specification offers an efficient serial interface, while the latest UCIe die-to-die standard offers a parallel option.
  • Near Packaged Optics: Because the performance of optical components depends significantly on operating temperature, and some implementers are concerned about serviceability with CPO, an alternative approach with near packaged optics (NPO) is being considered to ease integration challenges.
  • Pluggable optics with linear electrical interface: The quest for saving overall power has guided OIF to draft a 112G-Linear standard for C2M channels. In this implementation, the DSP in the 112G-Linear PHY compensates for optical impairments.

LR and LR Max for Within Rack Connectivity

CEI-112G-LR-PAM4 specifies a 112 Gb/s chip-to-chip PAM4 electrical interface for channels with less than 28 dB loss at the Nyquist frequency, including two connectors. A 112G LR SerDes is expected to work with all such channels and provide a PHY-level BER of 1e-4. Forward error correction (FEC) at the protocol layer is expected to improve the BER from 1e-4 to 1e-12 or 1e-15.
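
The FEC gain quoted above can be estimated with a standard textbook model. The sketch below assumes the RS(544,514) "KP4" code used by 100G-per-lane Ethernet and independent bit errors; real links see correlated and burst errors, so actual margins are tighter than this idealized calculation suggests.

```python
import math

def uncorrectable_prob(pre_fec_ber: float, n: int = 544, k: int = 514,
                       m: int = 10) -> float:
    """Probability that an RS(n, k) codeword over m-bit symbols sees more
    than t = (n - k) // 2 symbol errors, assuming independent bit errors."""
    t = (n - k) // 2                       # 15 correctable symbols for KP4
    p_sym = 1 - (1 - pre_fec_ber) ** m     # m-bit symbol error probability
    return sum(math.comb(n, i) * p_sym ** i * (1 - p_sym) ** (n - i)
               for i in range(t + 1, n + 1))

frame_loss = uncorrectable_prob(1e-4)  # far below the 1e-12..1e-15 targets
```

Under this idealized model, a 1e-4 pre-FEC BER leaves the uncorrectable-codeword probability many orders of magnitude below the 1e-12/1e-15 system targets, which is the headroom that absorbs real-world burst errors.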

As system deployments progress, implementers are considering an LR Max option of the 112G SerDes so that they have more margin in their system designs. Table 3 shows the typical and maximum values of each component in an LR channel.

Table 3: Emerging requirements of 112G SerDes LR and LR Max

An orthogonal channel with a 9" trace on each line card, using Megtron material, may be considered a typical implementation; however, the trace length, package loss, and PCB material choice all change the loss, insertion loss deviation (ILD), and reflections of the channel. Figure 3 shows the loss profiles for various channels.
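
The budget decomposition can be sketched as below. All figures here are illustrative assumptions, not values from Table 3; the intent is only to show how an LR budget splits into package, trace, and connector contributions at the Nyquist frequency.

```python
def channel_loss_db(trace_inches: float, db_per_inch: float,
                    package_db: float, connector_db: float,
                    n_connectors: int = 2) -> float:
    """Total end-to-end insertion loss as a sum of its contributors."""
    return trace_inches * db_per_inch + package_db + n_connectors * connector_db

# Hypothetical example: 9" on each of two line cards (18" total),
# assumed 1.0 dB/inch PCB material, 3 dB of package loss, two 1.5 dB
# connectors. None of these numbers come from the article's Table 3.
typical = channel_loss_db(18, 1.0, package_db=3.0, connector_db=1.5)
assert typical <= 28  # fits within the CEI-112G-LR 28 dB budget
```

A longer trace, lossier package, or cheaper PCB material quickly consumes the remaining margin, which is the motivation for the LR Max option discussed above.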

Figure 3: Varied loss profile of HPC LR channels

While there is currently no standard for LR Max, it is important to note that the industry's need for additional margin is creating the need for an LR Max SerDes architecture. Innovative DSP techniques in receiver equalization, such as maximum likelihood sequence detection (MLSD, shown in Figure 4), make the LR Max receiver an attractive implementation option at the cost of marginally higher power and latency.
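
MLSD itself is a classic Viterbi search over candidate symbol sequences rather than a symbol-by-symbol slicer. The toy sketch below, for PAM4 over an assumed two-tap channel, illustrates the principle only; the tap values are invented for the example and this is not a description of any production receiver architecture.

```python
LEVELS = (-3, -1, 1, 3)   # PAM4 symbol alphabet
H = (1.0, 0.45)           # assumed channel: main cursor + one postcursor (ISI)

def mlsd(received):
    """Viterbi search for the PAM4 sequence with minimum squared
    Euclidean distance to the received samples (state = previous symbol)."""
    paths = {s: ([s], (received[0] - H[0] * s) ** 2) for s in LEVELS}
    for r in received[1:]:
        new = {}
        for s in LEVELS:
            # Pick the predecessor state minimizing the accumulated metric.
            prev, (seq, m) = min(
                paths.items(),
                key=lambda kv: kv[1][1] + (r - H[0] * s - H[1] * kv[0]) ** 2)
            new[s] = (seq + [s], m + (r - H[0] * s - H[1] * prev) ** 2)
        paths = new
    return min(paths.values(), key=lambda v: v[1])[0]

# Noiseless example: MLSD recovers the transmitted sequence despite ISI.
tx = [3, -1, 1, -3, 1]
rx = [H[0] * tx[i] + (H[1] * tx[i - 1] if i else 0) for i in range(len(tx))]
assert mlsd(rx) == tx
```

By scoring whole sequences against a channel model, MLSD tolerates residual ISI that would defeat a memoryless slicer, which is the extra equalization capability the LR Max receiver buys with its added power and latency.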

Figure 4: Adaptive DSP with MLSD for LR Max equalization

Conclusion

The networking infrastructure in HPC data centers is evolving to enable exascale computing, going from 100G to 200G/400G/800G. New electro-optical interfaces such as co-packaged optics, near packaged optics, and pluggable optics with a linear interface provide a multitude of options to optimize power, latency, and performance. Synopsys provides integrated 112G Ethernet PHY IP for extra short reach (XSR) and XSR+, Linear, very short reach (VSR), and UCIe PHY to implement the electrical interfaces. Synopsys 112G Ethernet PHY for LR and LR Max channels addresses the need for additional margin in intra-rack DAC/ACC links.
