
Overview

Cerebras Systems, based in Sunnyvale, California, is known for developing the world's largest silicon chip, the Wafer-Scale Engine (WSE). Its second-generation WSE-2 has a silicon area of 46,225 mm² and contains 84 interconnected dies, each 550 mm². This design significantly reduces communication overhead and physical connections within AI systems, enabling massive memory, compute, and communication capabilities.


"In designing the Wafer-Scale Engine 2, we needed a trusted IP vendor to provide a comprehensive monitoring and sensing solution. The Synopsys DesignWare In-Chip temperature sensors and voltage monitors enabled us to understand the dynamic thermal and supply conditions of our WSE-2 in real time, which was important for power and performance optimization."

Dhiraj Mallick | Senior Vice President, Hardware Engineering and Operations, Cerebras Systems

Challenges

Cerebras Systems faced several significant challenges in developing their WSE-2 chip:
  • Massive Memory and Compute Requirements: Giant models need massive memory and compute, plus the communication to tie them together. Providing this with thousands of small devices turns the scaling of all three into interdependent distributed problems.
  • Distribution Complexity: As model size grows, the model must be partitioned across more chips, which requires fine-grained coordination and synchronization. The challenge lies in getting all these distributed components to work together to solve a single large neural network problem.
  • Scaling Issues: The complexity grows dramatically with cluster size and becomes overwhelming as the network grows, making it essential to address these issues effectively.

Solution

Cerebras Systems selected Synopsys DesignWare embedded in-chip temperature sensors and voltage monitors, distributed across the WSE-2 device. These sensors and monitors were polled at the fastest possible rate, providing real-time thermal and supply condition data and maintaining watermarks for the highest and lowest readings.
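The polling and watermark idea described above can be illustrated with a minimal Python sketch. It is not Cerebras or vendor code: the `Watermark` class, the `track_watermarks` function, the sensor names, and the sample values are all hypothetical, and a real system would read hardware registers rather than a list.

```python
import math
from dataclasses import dataclass


@dataclass
class Watermark:
    """Lowest and highest reading seen so far for one sensor (hypothetical type)."""
    low: float = math.inf
    high: float = -math.inf

    def update(self, reading: float) -> None:
        # Widen the recorded range only when a new extreme is observed.
        self.low = min(self.low, reading)
        self.high = max(self.high, reading)


def track_watermarks(samples):
    """Fold a stream of (sensor_id, reading) samples into per-sensor watermarks."""
    marks = {}
    for sensor_id, reading in samples:
        marks.setdefault(sensor_id, Watermark()).update(reading)
    return marks


# Made-up temperature samples in degrees Celsius, listed in polling order.
samples = [("tile_0", 61.5), ("tile_1", 70.2), ("tile_0", 64.0),
           ("tile_1", 68.9), ("tile_0", 59.8)]
marks = track_watermarks(samples)
```

Keeping only the extremes means each sensor needs constant storage no matter how fast it is polled, which is one plausible reason to track watermarks instead of logging every sample.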

  • Validation and Characterization: The system, consuming approximately 20 kW of power, was validated to stay within the safe zone. Characterization involved ramping the temperature up and down.
  • Variation Measurement: Extensive distribution of sensors allowed for measuring variation across the tiles in the wafer.
  • Thermal Throttling: In-cluster thermal throttling was conducted due to the distributed architecture.
  • Data Visualization: GUIs displaying heat maps and statistical variations provided valuable data about the silicon's health throughout its lifecycle, from design and testing to production and in-field operation.
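As a rough illustration of the variation measurement and heat-map statistics mentioned in the list above, the following Python sketch summarizes a small grid of per-tile temperatures. The function name `tile_variation`, the grid layout, and the readings are assumptions made for this example; the source does not describe how Cerebras computed these statistics.

```python
from statistics import mean, pstdev


def tile_variation(grid):
    """Summarize a 2D grid of per-tile readings (hypothetical helper)."""
    # Pair each reading with its (row, column) position so the hottest tile can be located.
    cells = [(temp, (r, c))
             for r, row in enumerate(grid)
             for c, temp in enumerate(row)]
    temps = [temp for temp, _ in cells]
    _, hottest_pos = max(cells)
    return {
        "mean": mean(temps),
        "stdev": pstdev(temps),            # population standard deviation across tiles
        "spread": max(temps) - min(temps),
        "hottest": hottest_pos,
    }


# Made-up 2x2 grid of tile temperatures in degrees Celsius.
grid = [[60.0, 62.0],
        [61.0, 65.0]]
summary = tile_variation(grid)
```

The same grid could be passed to a plotting library to draw a heat map; the summary values are the kind of statistical variation figures such a display might show alongside it.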

Results

The implementation of Synopsys DesignWare In-Chip temperature sensors and voltage monitors led to significant improvements:

  • Enhanced Performance Optimization: Real-time monitoring enabled more efficient power and performance management.
  • Improved Thermal Management: Granular sensing allowed for better handling of thermal hotspots, improving overall device reliability.
  • Increased Data Throughput: The solution helped manage IR drops caused by bursty AI workloads, resulting in better data processing capabilities.
  • Comprehensive Lifecycle Monitoring: The ability to monitor silicon health throughout its lifecycle led to more informed decision-making and better long-term performance.

These advancements allowed Cerebras Systems to successfully develop and deploy the WSE-2, pushing the boundaries of AI hardware technology.