
IP and process innovations within the semiconductor industry have evolved to address the compute needs of future applications, and multi-die systems are now becoming pervasive. However, workload demands are straining compute arrays, memories, and the bandwidth of DDR, HBM, UCIe, and more, as each generation of hardware design plays catch-up with future AI workloads.

What’s Truly Driving Multi-Die Systems

SoC performance continues to see generational upgrades, and with Dennard scaling widely argued to have ended, much of the discussion has centered on multi-core designs as the driver of those gains (Figure 1).

Figure 1: Generational upgrades in SoC performance


Regardless of where that debate lands, to satisfy performance needs foundries continue to drive next-generation process nodes that enable higher frequencies and higher logic densities to incorporate more processing elements, all at reduced power footprints. This continued innovation is shown in Figure 2.

Figure 2: IMEC's recent projected process node roadmap

Multi-core architectures and processing element arrays, together with process node innovation, continue to deliver aggressive generational performance gains. However, it is neither multi-core innovation nor process node upgrades alone that require a new multi-die system architecture. The memory wall is clearly one of the key drivers for multi-die systems (Figure 3). The chart shows memory densities improving at roughly 2x every 2 years, while workload memory requirements are growing by roughly 240x every 2 years. For more information, see the related blog post on Medium.com.

Figure 3: The amount of compute, measured in petaFLOPs, needed to train SOTA CV, NLP, and speech models, along with the scaling of Transformer models (750x/2yrs) and the scaling of all models combined (15x/2yrs)
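To put the compounding nature of this gap in perspective, here is a minimal sketch that simply compounds the growth rates quoted above. The rates are the figures from this article; the time horizons are arbitrary and purely illustrative.

```python
# Minimal sketch: compound the memory-wall growth rates quoted above.
# Assumptions: memory density improves ~2x every 2 years, while workload
# memory demand grows ~240x every 2 years (figures from the text above).

def compound(per_2yr_factor: float, years: float) -> float:
    """Growth after `years`, given a multiplier applied every 2 years."""
    return per_2yr_factor ** (years / 2.0)

for years in (2, 4, 6):
    demand = compound(240.0, years)  # workload memory demand
    supply = compound(2.0, years)    # memory density improvement
    print(f"after {years} years: demand {demand:,.0f}x, "
          f"density {supply:.0f}x, gap {demand / supply:,.0f}x")
```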

To satisfy memory bandwidth needs, the off-chip memory market has already been disrupted by new technologies such as High Bandwidth Memory (HBM). The industry has seen HBM3 become mainstream for HPC markets, and this disruption will continue as HBM has a strong roadmap for future performance gains.

Unfortunately, the memory wall described above focuses only on off-chip memories, while it is on-chip memory systems that play a pivotal role in most SoC designs today; disruption there may well be inevitable in the near future.

The Current Focus of Design

AI and security workloads have increased dramatically, driving many on-chip performance innovations beyond frequency increases. Many of these innovations have focused on the processors required for these workloads: AI algorithms have driven designs with massive multiply-accumulate parallelism and carefully structured nested loops that reduce cycles and increase the work completed per cycle. These workloads also require much larger memory densities to store weights, coefficients, and training data, which has driven larger-capacity, higher-bandwidth memories both on-chip and off-chip. For off-chip memories, the industry has rapidly adopted next-generation HBM, DDR, and LPDDR. However, it is the on-chip memory configurations that provide vendor differentiation. Examples include the AI accelerator space, where each vendor integrates more memory density at higher bandwidths in the form of global SRAMs and caches. Unique methods of optimizing the memory configuration per processing element are also a critical piece of the innovation puzzle.
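As an illustration of the loop structure this implies (a generic sketch, not any vendor's implementation), a matrix multiply can be blocked so that each tile of weights and activations stays resident in local SRAM while the inner multiply-accumulate loops run:

```python
# Illustrative sketch of a tiled multiply-accumulate loop nest: the loops are
# blocked so that one tile of A and B fits in on-chip SRAM, maximizing the
# work completed per fetch from off-chip memory.

TILE = 4  # assumed tile size, chosen to fit a local SRAM budget

def tiled_matmul(a, b):
    """C = A @ B with blocked loops; each tile is reused from local memory."""
    n, k, m = len(a), len(a[0]), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, TILE):
        for j0 in range(0, m, TILE):
            for k0 in range(0, k, TILE):
                # In hardware, this tile of A and B would be resident in local
                # SRAM; the inner loops map onto the MAC array.
                for i in range(i0, min(i0 + TILE, n)):
                    for j in range(j0, min(j0 + TILE, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + TILE, k)):
                            acc += a[i][kk] * b[kk][j]  # multiply-accumulate
                        c[i][j] = acc
    return c

# 8x8 example; in practice the tile size is matched to the local SRAM budget
a = [[1.0] * 8 for _ in range(8)]
b = [[1.0] * 8 for _ in range(8)]
assert tiled_matmul(a, b)[0][0] == 8.0
```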

Taking a step back from SoC-specific performance gains, a critical industry problem is the power usage of these AI systems within the cloud. Figure 4 shows Google data center power usage. The design activity to build an SoC with a more efficient CPU is clearly critical, and multiple SoC startups openly promote their AI processor efficiency as a solution. But overall system power also includes the off-chip memory, with DRAM consuming 18% of the total. Designing for lower power has driven increased HBM adoption because its pJ/bit performance is unmatched by other technologies.
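A back-of-the-envelope sketch shows why pJ/bit matters at this scale. The energy-per-bit figures below are illustrative assumptions chosen only to contrast interface classes, not vendor specifications:

```python
# Back-of-the-envelope: average interface power for sustained memory traffic.
# The pJ/bit values are assumed, illustrative figures, not measured numbers.

ASSUMED_PJ_PER_BIT = {"DDR-class interface": 20.0, "HBM-class interface": 5.0}

def memory_power_watts(bandwidth_gbps: float, pj_per_bit: float) -> float:
    """Average power for sustained traffic at the given bandwidth (GB/s)."""
    bits_per_second = bandwidth_gbps * 1e9 * 8
    return bits_per_second * pj_per_bit * 1e-12  # pJ -> J

for name, pj in ASSUMED_PJ_PER_BIT.items():
    print(f"{name}: {memory_power_watts(1000, pj):.0f} W at 1 TB/s sustained")
```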

Figure 4: Google Data Center Power Usage

Returning to the performance puzzle and the broader SoC system: on-chip, it has been a race for AI accelerators to integrate more SRAM and cache each generation, and in particular more than their competitors. For instance, market share leaders such as Nvidia have aggressively adopted the latest process technologies and integrated larger L2 cache and global SRAM densities each generation to deliver better performance for AI workloads.

Additional Challenges to Designing for AI Workloads

When designing for AI workloads there are several things to consider. One discussed often is memory bandwidth, which many AI SoC vendors use as their key performance message. However, memory bandwidth needs more context. For instance, accessing data from global memory can take roughly 1.9x more cycles than from L2 cache, and an L2 access takes almost 6x more cycles than an L1 access, as the following cycle counts illustrate:

  • Global memory access (up to 80GB): ~380 cycles
  • L2 cache access: ~200 cycles
  • L1 cache or shared memory access (up to 128 KB per Streaming Multiprocessor): ~34 cycles
  • Fused multiply-add, a*b+c (FFMA): 4 cycles
  • Tensor Core matrix multiply: 1 cycle

Therefore, to improve performance for these workloads, it is critical to increase L1 and L2 cache capacity relative to previous generations. Increasing the density of the caches next to processing elements is one of the most effective design improvements for AI workloads with massive processor parallelism.
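A minimal sketch of that effect, using the cycle counts listed above and illustrative (assumed) hit rates, shows how much a larger cache that lifts hit rates cuts the average cycles per access:

```python
# Average memory access time (AMAT) using the cycle counts listed above.
# The hit rates are illustrative assumptions, not measurements.

L1_CYCLES, L2_CYCLES, GLOBAL_CYCLES = 34, 200, 380

def amat(l1_hit: float, l2_hit: float) -> float:
    """Expected cycles per access for two cache levels in front of global memory."""
    return (l1_hit * L1_CYCLES
            + (1 - l1_hit) * (l2_hit * L2_CYCLES
                              + (1 - l2_hit) * GLOBAL_CYCLES))

print(f"smaller caches (70%/60% hit rates): {amat(0.70, 0.60):.0f} cycles/access")
print(f"larger caches  (90%/80% hit rates): {amat(0.90, 0.80):.0f} cycles/access")
```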

Another method for on-chip memory system optimization relies on knowledge of the specific AI algorithm. One example is sizing the local memories to hold the largest intermediate activation produced by the algorithm, which removes bottlenecks in on-chip data transfer. This method will be deployed more commonly at the edge, where efficiency is dictated by hardware/software co-design, but it requires intimate knowledge of the end application. Again, modeling these systems can play a pivotal role in improving hardware performance, and 草榴社区 is uniquely positioned to offer solutions that help developers.
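As a hedged sketch of that sizing method, the snippet below walks a hypothetical list of intermediate activation shapes and reports the largest one, which sets the minimum local buffer size. The shapes and data width are assumptions for illustration only:

```python
# Find the largest intermediate activation of a hypothetical model; that size
# sets the minimum local SRAM needed per processing element to avoid spills.

import math

# (channels, height, width) of each intermediate activation -- hypothetical
ACTIVATION_SHAPES = [(32, 112, 112), (64, 56, 56), (128, 28, 28), (256, 14, 14)]
BYTES_PER_ELEMENT = 2  # assuming 16-bit activations

sizes = [math.prod(shape) * BYTES_PER_ELEMENT for shape in ACTIVATION_SHAPES]
largest = max(sizes)
print(f"largest intermediate activation: {largest / 1024:.0f} KiB "
      f"-> minimum local buffer size to avoid off-chip spills")
```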

What Will the Next-Generation of Innovation Focus Be?

We’ve discussed the adoption of off-chip memory interfaces such as DDR/LPDDR and HBM to improve memory bandwidth; however, these technologies are not keeping up with the ability to integrate processors for AI workloads within the die. The off-chip memory trajectory gap is clearly increasing, a trend Meta pointed to in a recent OCP Summit presentation (Figure 5).

Figure 5: The increasing off-chip memory trajectory gap

Interface IP standards have seen a recent uptick in the pace of next-generation releases to keep up with this performance gap. For instance, a next-generation standard interface was typically released every four years; that cadence has recently accelerated to every two years. The advent of both AI and security workloads has contributed to this faster adoption of next-generation technologies.

Figure 6: Next-generation standards are being released more frequently to keep up with the performance gap

This processing-to-memory gap involves not only advancements in off-chip memories but also on-chip memories. Taking a closer look at process node advancements, we see continued innovation in three categories (Table 1):

  1. Maximum processor frequencies improve with each generation at leading foundries such as TSMC across 16nm, 7nm, 5nm, 3nm and beyond, increasing performance
  2. Power consumption is reduced with each process node advancement, to the tune of 25% or more
  3. Logic area reductions, meaning more processors per mm2, are well over 40%, as seen in Table 1 below

 

                           | TSMC 7nm vs 16nm | TSMC 5nm vs 7nm | TSMC 3nm vs 5nm | Samsung GAA 3nm vs TSMC 3nm FinFET
                           | (source: TSMC)   | (source: TSMC)  | (source: TSMC)  | (source: Samsung)
Performance Increase       | 30-40%           | 20%             | 15%             | 30%
Power Consumption Decrease | 60-65%           | 40%             | 30%             | 50%
Logic Reduction            | 70%              | 45%             | 70%             | 45%
SRAM Reduction             | 64%              | 22%             | 0%              | ?

Table 1. Innovations resulting from process node advancements

However, one innovation that is slowing is the foundry-provided density improvement of on-chip SRAM and cache. The scale of reduction has slowed to the point that migrating from a 5nm node to a 3nm node may deliver little or no improvement in SRAM density. This poses a problem for future computing, where AI workloads require more efficient and higher-density memories per processing element. 草榴社区 has focused on this market challenge to ensure density improvements for memories at each node migration.
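A quick calculation illustrates why flat SRAM scaling is such a problem: if logic area shrinks each node while SRAM area does not, SRAM consumes a growing share of the die. The scaling factors below come from Table 1; the starting 50/50 area split is an illustrative assumption:

```python
# Sketch of SRAM's growing share of die area when logic scales but SRAM does not.
# Scaling factors from Table 1 (3nm vs 5nm); the 50/50 starting split is assumed.

logic_area, sram_area = 50.0, 50.0  # assumed mm^2 split on the older node

logic_area_next = logic_area * (1 - 0.70)  # ~70% logic reduction
sram_area_next = sram_area * (1 - 0.00)    # ~0% SRAM reduction

total = logic_area_next + sram_area_next
print(f"SRAM share of die: 50% -> {100 * sram_area_next / total:.0f}%")
```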

Beyond improving on-chip memories, another way to improve the compute-to-memory ratio is to extend these on-chip memories within a distributed compute and memory system architecture. To increase performance, future architectures must take advantage of multi-die systems: satisfy the correct processor-to-memory ratio on each individual die, then extend that performance across multiple dies in a complex system.
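The ratio argument can be sketched with a simple model: balance compute against on-chip memory on a single die, then scale out in whole dies so the ratio is preserved rather than diluted. All of the figures below are hypothetical, purely for illustration:

```python
# Sketch: preserve a balanced compute-to-on-chip-memory ratio by scaling out
# in whole dies. All figures are hypothetical, not product specifications.

import math

DIE_TOPS = 100.0          # assumed per-die compute
DIE_SRAM_MB = 144.0       # assumed per-die on-chip SRAM
TARGET_MB_PER_TOPS = 1.4  # assumed ratio the workload demands

# The individual die must satisfy the ratio first
assert DIE_SRAM_MB / DIE_TOPS >= TARGET_MB_PER_TOPS

def dies_needed(workload_tops: float) -> int:
    """Scale out in whole dies so the per-die compute-to-memory ratio is preserved."""
    return math.ceil(workload_tops / DIE_TOPS)

n = dies_needed(350.0)
print(f"{n} dies -> {n * DIE_TOPS:.0f} TOPS, {n * DIE_SRAM_MB:.0f} MB on-chip SRAM, "
      f"{DIE_SRAM_MB / DIE_TOPS:.2f} MB/TOPS preserved across the system")
```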

The advent of the UCIe and XSR standards fills the gap for a standard, reliable way to connect multiple dies and scale performance. The AI accelerator industry has almost universally adopted some form of die-to-die interconnect for these workloads, and UCIe, a standard parallel die-to-die interface, is quickly becoming the market-leading interface for today's performance-leading multi-die systems. Most importantly, these die-to-die connectivity standards allow embedded memories to scale, and embedded memories clearly outperform external memory accesses for specific processing elements. This is why companies race to embed the highest-performance, densest memory arrays possible to satisfy the insatiable workloads of the future.

Performance is clearly driving next-generation monolithic SoC and multi-die system architectures. To scale performance, unique and innovative AI processors have been developed, and future process nodes enable denser processing arrays. However, memories must scale too, and the most effective memories are those closest to the processing elements. To best scale these future SoCs, multi-die system architectures will be adopted alongside interface IP standards with increasing bandwidth, process node advancements, and innovative multi-core processor implementations. Next-generation interface IP such as HBM3 and UCIe will be adopted to scale bandwidth, but it is also imperative that innovative embedded memories are available to scale performance at each process node generation.

Conclusion

Multi-die systems are one of the hottest topics in the industry, but it is the memory-related technical challenges of current AI and security workloads that are clearly driving next-generation SoC architecture innovations. These architectures need higher performance and more memory per processing element as process nodes advance. If memories scale more slowly than processing elements while workloads demand more memory per processing element, technology disruptions are inevitable. One clear solution has been multi-die systems, leveraging more on-chip memory at higher bandwidths and improved densities. These memory and I/O innovations will see rapid adoption to meet the needs of future workloads and may open opportunities for further industry disruption.
