Synopsys

Configuring DDR Subsystems for Your Design

By Tim Kogel, Principal R&D Engineer, Synopsys

 

DDR interface IP is used extensively in the semiconductor industry, in many different application areas. Every application has different constraints on the DDR interface. Even within the same application space, design teams creating SoCs that use the same CPUs and have similar functions often use the DDR interface differently.

DDR IP has evolved to be adaptable and configurable to different applications' constraints. For example, designers using DDR IP like Synopsys' uMCTL2 memory controller have about 70 compile-time options to decide upon, plus 15 further options per port, plus many more run-time options. Combined, most designs need over 100 options set correctly for an optimal DDR configuration.

Some key compile-time options that the designer needs to set are: memory type, memory data bus width, memory bus frequency, number of channels, number of AXI ports, width of each port, depth of read and write buffering throughout the design, depth of queues for the CAM-based scheduler, controller-to-PHY frequency ratio, ECC support, and Quality-of-Service (QoS) options. Additionally, there are many more run-time options, especially around the QoS interface and, critically, in the logical-to-physical address mapping.

Searching for the Right Configuration

In the past, finding the right configuration was difficult and people tended to over-design their DDR interfaces. Designers used a number of methods to attempt to find the right configuration.

The Do-Over Method

Designers of the new SoC simply carry forward what was done on the previous design, and if more memory bandwidth is needed, the frequency of operation is adjusted upwards. Sometimes this requires a change in memory type (for example, from DDR3 to DDR4), and there may be new timing restrictions at the different frequency or memory type that this method does not account for. This method often leads to over-design, where the DDR bus is specified at a higher frequency than is really necessary, resulting in higher cost. It is also very difficult to make quantitative decisions on options like QoS with this method, which relies on the skill of the designer and their familiarity with hundreds of pages of user manuals to find the optimal result.

The Spreadsheet Method

This technique is slightly more architectural, based on static spreadsheet formulas that capture assumptions about the types of traffic required by the different masters in the system, together with an assumed memory bandwidth utilization percentage. The data all goes into spreadsheet calculations, and the quality of the result depends on the assumptions used, particularly the memory bandwidth utilization. Like the "do-over" method, the spreadsheet method relies on the designer's familiarity with the user manuals to get a good result, and for this reason there is still some over-specification in most cases.
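The arithmetic behind the spreadsheet method is simple enough to sketch. The masters, their bandwidth demands, and the 70% utilization figure below are hypothetical illustrations, not values from any real design:

```python
# Sketch of the spreadsheet method: sum the bandwidth demanded by each
# master, then divide by an assumed bus-utilization percentage -- the
# assumption on which the whole estimate hinges.
masters_mbps = {"cpu": 3200, "gpu": 6400, "display": 1500, "dma": 800}

required_mbps = sum(masters_mbps.values())   # total demanded bandwidth
assumed_utilization = 0.70                   # the critical assumption

# Raw bandwidth the DDR interface must provide to cover the demand
interface_mbps = required_mbps / assumed_utilization

# For a hypothetical 32-bit (4-byte) interface, the implied data rate
data_rate_mtps = interface_mbps / 4

print(f"Demand {required_mbps} MB/s -> interface {interface_mbps:.0f} MB/s "
      f"-> {data_rate_mtps:.0f} MT/s on a 32-bit bus")
```

If the utilization assumption is off by even ten points, the computed data rate shifts by hundreds of MT/s, which is exactly how this method ends up over-specifying the interface.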

The FPGA Method

In this approach, users model their system in an FPGA or prototyping system. Designers using this technique can get an accurate representation of the traffic in the final SoC. However, the FPGA method only evaluates the design choices that were actually implemented, so the final system may work correctly for the choices that were made, while a combination of options that was never tried might have performed better.

The Traffic Modeling Method

In the traffic modeling method, the designer creates a system-level model of the masters to represent the traffic of the application and the specific fixed configuration of the memory controller they have selected. This approach has good accuracy, but it takes a lot of time and effort to create the model, and still does not address the need to explore configuration options that were not selected. 

A Better DDR Subsystem Solution

In the hundreds of DDR designs that Synopsys has seen, there have been many examples of chips that were over-designed (and some that were under-designed) using these techniques. Over-design costs money, whether it's additional power and area from more buffering than necessary, or a DDR interface specified at a higher frequency than the application requires, demanding faster and more expensive DDR devices.

It became clear that something better was needed for the critical DDR interface of the SoC: something more accurate than spreadsheet analysis, which allows the designer to both model their application traffic and explore the effects of many different combinations of configuration options in the memory controller. The result is Synopsys Platform Architect, which contains fast and accurate SystemC models of Synopsys' uMCTL2, LPDDR5, and DDR5 memory controllers that run more than 10X faster than RTL, with the ability to launch hundreds of simulation runs with different combinations of features, and powerful analysis tools to select the feature combinations with the best results for your application.

Using Platform Architect, designers can apply models of the traffic at each interface of the DDR controller, kick off multiple simulation runs with varying combinations of parameters that can affect the performance, and analyze the results using the tools that are part of Platform Architect.

Hot-Bit Analysis

A simple but effective example of the use of Platform Architect is the technique of hot-bit analysis of traffic to determine the best logical-to-physical address mapping in the DDR system (Figure 1). DDR devices are composed of a number of banks (and, in the case of DDR4 and DDR5, bank groups). Each bank is divided into a number of rows, and each row has a number of columns. Because of this organization, accesses to different columns in the same row are fast, and accesses to different banks are fast as long as the same bank is not accessed twice in quick succession, but accesses to different rows within the same bank incur a long delay between them. Meanwhile, the logical addressing on the AXI or other bus is expected to be flat and linear; the application rarely has any understanding of the architecture of the DDR devices on the bus. The goal of hot-bit analysis is to analyze the transactions on the AXI bus and determine the optimal address mapping from logical (AXI) to physical (DDR) addresses, resulting in optimum performance.
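The core of hot-bit analysis can be illustrated with a toy sketch: count how often each address bit toggles between consecutive transactions. The trace addresses below are invented for illustration, and this is only a minimal stand-in for what the Platform Architect analysis views compute:

```python
# Toy hot-bit analysis: XOR consecutive AXI addresses and count which
# address bits toggle most often across the trace.
def hot_bits(addresses, width=32):
    toggles = [0] * width
    for prev, curr in zip(addresses, addresses[1:]):
        diff = prev ^ curr                 # bits that changed this transition
        for bit in range(width):
            if diff & (1 << bit):
                toggles[bit] += 1
    return toggles

# Hypothetical trace: mostly sequential 64-byte bursts with one large jump
trace = [0x1000, 0x1040, 0x1080, 0x10C0, 0x2000, 0x2040]
counts = hot_bits(trace)

# The "hottest" (most frequently toggling) bits are the best candidates to
# map to DDR column or bank bits; mapping them to row bits would cause
# frequent row-to-row switches and their long timing penalties.
hottest = sorted(range(32), key=lambda b: counts[b], reverse=True)[:4]
print(hottest)
```

In this trace, bit 6 (the 64-byte burst stride) toggles on every transition, so it clearly belongs on a column bit rather than a row bit.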

When using Platform Architect for hot-bit analysis, designers don't need to vary many parameters initially. Designers can simply run Platform Architect on their traffic samples and use the analysis tools to indicate which bits have the most activity. Then the designers can use the programmable address mapping feature in the uMCTL2, LPDDR5, and DDR5 memory controllers to map the most frequently changing bits to columns, banks, or bank groups to avoid the memory timing penalties associated with row-to-row switching.

Figure 1: The hot bit analysis screen

The hottest bits on the AXI bus (as shown in darker colors in the hot bit analysis view) can be mapped to DDR column or bank bits as needed to produce more page hits or more bank rotation for better performance. The Memory Data Channel Utilization view shows the profile of read and write traffic going over the data channel against a horizontal time axis, while the Memory Channel Utilization view shows the corresponding number of page misses (or activations) per bank on the command channel. Together with the Hot Bit analysis view, these views give insights into the performance of the system and a visual comparison among options to help improve the utilization.

Clock Frequency Optimization

As designers approach the end of the process, they can look at clock frequency optimization. The goal is to find the ideal clock frequency and DDR speed bin for the workload that is required so they don’t over-design the interface.

Synopsys customers commonly ask for help with DDR designs that are heavily guardbanded (over-designed) or that simply specify the fastest available DDR device for the interface. While this may be good for the SoC datasheet, it is costly for the end user, as it requires higher-speed DDR parts (which are generally more expensive than lower-speed parts), additional design and signal integrity work on the PCB, and sometimes more power.

The core memory arrays of DDR parts are analog components surrounded by a digital interface. As a result, many memory core timing parameters are expressed in nanoseconds and must be converted to clock cycles by dividing by the clock period. Some memory core timing parameters also vary by the speed bin, or the maximum speed at which the manufacturer states the part will work.
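The nanosecond-to-cycle conversion described above can be sketched in a few lines. The timing value and clock frequencies below are illustrative rather than taken from any datasheet:

```python
import math

# Convert a memory-core timing parameter from nanoseconds to controller
# clock cycles. The result must be rounded UP: truncating would violate
# the device's minimum timing.
def ns_to_cycles(t_ns, clk_mhz):
    period_ns = 1000.0 / clk_mhz          # clock period in ns
    return math.ceil(t_ns / period_ns)

# e.g. a hypothetical 13.75 ns core parameter at two candidate clocks
print(ns_to_cycles(13.75, 800))    # 1.25 ns period
print(ns_to_cycles(13.75, 1600))   # 0.625 ns period
```

Note that doubling the clock frequency doubles the cycle count for the same analog delay, which is one reason a faster speed bin does not automatically deliver proportionally better performance.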

The designer may choose to do some final sweeps over different DDR speed bins and frequencies of operation to find the speed bin that gives the best bus utilization while meeting the goals of total throughput and maximum latency (Figure 2).

Figure 2: Utilization and overall execution times for four different speed bins. The best choice is the lowest speed bin (i.e., 1866) at which the execution time constraint of 150 µs is still met.
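The selection logic behind such a sweep is straightforward to sketch. The execution times below are hypothetical placeholders, not the measured data behind Figure 2:

```python
# Hypothetical sweep results: simulated execution time (microseconds)
# per speed bin. Illustrative numbers only.
results_us = {1600: 168.0, 1866: 147.5, 2133: 141.0, 2400: 138.2}
constraint_us = 150.0

# Keep only the bins that meet the execution-time constraint, then pick
# the lowest (i.e., cheapest) of those to avoid over-design.
feasible = [speed for speed, t in results_us.items() if t <= constraint_us]
best = min(feasible)
print(best)
```

Choosing the minimum feasible bin rather than the fastest one is precisely the guardband reduction this step is after.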

QoS Optimization

Once the logical to physical address mapping has been completed and a good mapping has been found, designers can turn their attention to optimizing the Quality of Service (QoS) for a particular set of traffic and performance goals.

Synopsys DDR memory controllers have compile-time options to create separate resources for high-priority and low-priority traffic, and the controller can have different queues for each kind of traffic to prevent low-priority traffic from blocking high-priority traffic. The QoS may also be affected by other compile-time options, such as the scheduler CAM depth.

uMCTL2, LPDDR5, and DDR5 controllers have run-time options in registers for tuning the QoS to different workloads. As shown for uMCTL2 in Figure 3, among the available QoS parameters that can be programmed into registers are the mapping of QoS inputs to queues, the mapping of QoS inputs into traffic classes, the starvation prevention registers, the read/write switching control, and the division of the CAM into High Priority and Variable/Low Priority areas.

Figure 3: uMCTL2 QoS architecture and mapping into traffic classes, and some of the available run-time parameters that affect performance.

For Quality of Service optimization, Platform Architect models the QoS system in uMCTL2 by modeling both the compile-time options and the run-time options for QoS, and using the QoS input on the traffic to drive the behavior of the model.

Many of the QoS parameters are inter-related, meaning that multi-variable analysis may be required to find the best operating point. For example, different QoS input values into traffic class mappings may necessitate changes to the size of the CAM or the division of the CAM. The best way to determine this is to run several multi-variable combinations, or sweeps, to determine the best combination of settings (Figure 4).
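A multi-variable sweep of this kind amounts to enumerating the cross-product of the chosen parameter values. The parameter names and value lists below are hypothetical stand-ins for QoS registers such as the CAM split and starvation thresholds:

```python
import itertools

# Hypothetical sweep definition: each key is a parameter to vary, each
# list holds the values to try for it.
sweep = {
    "hp_cam_entries":  [8, 16, 32],
    "rd_starve_limit": [16, 64],
    "wr_starve_limit": [16, 64],
}

# One simulation run per permutation of the parameter values
runs = [dict(zip(sweep, combo))
        for combo in itertools.product(*sweep.values())]

print(len(runs))   # 3 * 2 * 2 permutations
print(runs[0])
```

Because the run count multiplies with every added parameter, the fast SystemC models matter: sweeping four or five inter-related QoS parameters can easily produce hundreds of runs, as in the 256-point sweep of Figure 5.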

Figure 4: Platform Architect parameter editing screen. Multi-variable sweeps can be performed by defining a set of sweep parameters and specifying permutations over their respective values. 

When the sweeps are complete, designers can use available analysis tools like pivot charts in a spreadsheet to analyze the outputs from all the sweeps to find the combination of parameters that produces the desired performance (Figure 5).
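The pivot-chart step can be mimicked with a minimal pure-Python grouping: collect the result rows, group them by one sweep parameter, and average the metric of interest. The result rows and the parameter name below are made up for illustration; the 250 ns constraint is the one cited in Figure 5:

```python
# Hypothetical sweep outputs: one row per simulation run, keyed by the
# swept parameter value and the measured average read latency.
results = [
    {"hp_cam_entries": 8,  "avg_rd_latency_ns": 310.0},
    {"hp_cam_entries": 8,  "avg_rd_latency_ns": 290.0},
    {"hp_cam_entries": 16, "avg_rd_latency_ns": 240.0},
    {"hp_cam_entries": 16, "avg_rd_latency_ns": 220.0},
]

# Group latencies by parameter value (the "pivot" axis)
pivot = {}
for row in results:
    pivot.setdefault(row["hp_cam_entries"], []).append(row["avg_rd_latency_ns"])

averages = {k: sum(v) / len(v) for k, v in pivot.items()}

# Keep only configurations meeting the 250 ns design constraint
passing = {k: v for k, v in averages.items() if v <= 250.0}
print(averages)
print(passing)
```

In practice the same filtering is done interactively in the pivot chart, narrowing hundreds of sweep points down to the few configurations that meet every QoS requirement.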

Figure 5: Using a pivot chart to explore the performance across a range of results produced from multiple simulations performed in sweeps. The figure shows average read latencies for one of the ports in the design, across 256 data points representing variations of four variables that influence starvation levels of the read and write queues, while targeting a design constraint of 250 ns.

Figure 6: The pivot chart shows a reduced set of configurations that meets all the quality of service requirements. It also shows the final configuration that produces the highest utilization.

It may be desirable to iterate between QoS optimization and clock frequency optimization to converge on a final result.

Confirmatory Runs

Although Platform Architect usually trends in the same direction as the memory controller RTL, it uses accelerated models of uMCTL2, LPDDR5, and DDR5 that are not 100% cycle-accurate. Synopsys recommends a final simulation against the RTL to confirm the results found with Platform Architect.

Post-Silicon Analysis

Platform Architect may also be used post-silicon. If the final SoC is presented with a new workload that was not modeled in the pre-silicon architectural phase and is not performing as expected, the designer can still use Platform Architect to analyze where the traffic may be improved and to suggest new register settings for address mapping and QoS that better handle the new traffic.

Conclusion

Synopsys' Platform Architect offers accelerated models of the DesignWare uMCTL2, LPDDR5, and DDR5 memory controllers, with powerful analysis capabilities and a convenient method of generating multi-variable sweep simulation runs. The capabilities of Platform Architect allow users to thoroughly and rapidly explore the DDR controller configuration and programming space to find combinations of settings that produce the target bandwidth, latency, and QoS for the user's DDR design.