By: Richard Solomon, Sr. Technical Marketing Manager, Synopsys
With every generation, automotive electronics become increasingly complex. Infotainment systems now rival home theaters in complexity, and Advanced Driver Assistance Systems (ADAS) practically require the compute power of a data center. The architectures for many ADAS involving machine vision are similar to high performance cloud computing systems: an array of powerful processors connected by high bandwidth PCI Express® (PCIe) links. Due to the successful use of PCIe in cloud computing applications, it is no surprise that PCI Express is becoming prevalent in automotive electronic systems. It is also common to find PCIe WiFi chips, PCIe GPUs, and PCIe ASIC-to-ASIC connections in infotainment systems.
Automotive electronics must meet stringent reliability standards wherever safety is of the utmost importance: in powertrain and braking controls, ADAS, and other vehicle operation platforms. Even an automotive infotainment system is expected to perform flawlessly despite being powered on for an entire day of driving during which temperature, humidity, and vibration can vary drastically from moment to moment. While consumers may accept rebooting their smartphone weekly or replacing their phone every 2-3 years, they expect to do neither throughout the life of their automobile. Reliability is a key component of functional safety and is critical to achieving Automotive Safety Integrity Level (ASIL) certification, which is required for most ADAS. System-on-chip (SoC) designers need to approach automotive reliability with even more concern than they would for a high-performance server operating in a traditional data center.
The PCI Express protocol includes a very robust link integrity scheme, but it has some reliability limitations which may not be immediately apparent. Every application packet includes a link-level cyclic redundancy check (LCRC) which is verified immediately upon receipt. An Acknowledged/Not-Acknowledged (ACK/NAK) mechanism handles seamless retransmission of erroneous packets, and includes timeouts to ensure broken links do not go unnoticed. Perhaps the most obvious limitation is that the LCRC can only protect the data actually presented to the PCI Express interface logic – it has no way to confirm that data is actually correct. More subtly, the retransmission of erroneous packets due to NAKs can hide signal integrity problems in the physical interconnect since the application software and even upper-layer hardware are less likely to be aware of the retransmissions. Whether due to a fundamental problem present at design/manufacturing time, or due to degradation over the product lifetime, all but the most severe PCI Express link errors will be largely invisible to software.
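Both limitations can be illustrated with a minimal Python sketch. It is a software model for intuition only, not controller logic: `zlib.crc32` stands in for the real LCRC (which uses its own bit ordering), and `send_with_replay` and `noisy_channel` are hypothetical names introduced here.

```python
import zlib

def add_lcrc(payload: bytes) -> bytes:
    """Append a 32-bit CRC (zlib CRC-32 used as a stand-in for the PCIe LCRC)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def lcrc_ok(frame: bytes) -> bool:
    """Receiver-side check: recompute the CRC over the payload and compare."""
    payload, crc = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == crc

def send_with_replay(payload: bytes, channel, max_attempts: int = 4):
    """Return (delivered_payload, attempts); channel(frame, attempt) models the wire."""
    frame = add_lcrc(payload)
    for attempt in range(1, max_attempts + 1):
        received = channel(frame, attempt)
        if lcrc_ok(received):          # receiver would ACK
            return received[:-4], attempt
        # receiver would NAK, so the sender replays the same frame
    raise TimeoutError("replay limit reached, link treated as down")

def noisy_channel(frame: bytes, attempt: int) -> bytes:
    """Hypothetical marginal channel: flips one bit on the first two attempts."""
    if attempt <= 2:
        corrupted = bytearray(frame)
        corrupted[0] ^= 0x01
        return bytes(corrupted)
    return frame

# Limitation 2: the application gets correct data and never sees the two NAKs.
data, attempts = send_with_replay(b"brake status", noisy_channel)

# Limitation 1: data corrupted before the CRC is computed still passes the check.
already_bad = add_lcrc(b"brake statuz")
undetected = lcrc_ok(already_bad)
```

In this model only the `attempts` value reveals that the channel is marginal, which is exactly the information that stays hidden from software unless the hardware exposes it.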
To address these shortcomings, SoC designers must first ensure that on-chip data is reliable so known bad data is never sent out on the link and that any bad data received on the link is never passed into the application logic. Secondly, SoC designers must make sure the link itself is reliable, remains available even when degraded, and alerts the application logic to any problems.
There are two sub-areas of on-chip data protection: “data at rest” and “data in-flight”. Protecting data at rest requires some mechanism to ensure data stored in a memory array doesn’t change while “resting” in that array. In the early days of on-chip SRAMs, failure rates and random error rates were high, so designers included protection mechanisms like parity and/or redundancy in attempts to guard against unintended data changes. As CMOS processes matured, these concerns lessened and designers in many markets chose to accept unprotected SRAMs to cut down on area overhead for protection against increasingly less likely error events. However, with the rapid shrinking of silicon geometries and the change from planar to FinFET transistors, concern over such “soft” or “random” errors appears to be growing again. Fortunately, the increased gate counts possible with modern silicon processes make more advanced techniques such as Error Correcting Code (ECC) feasible – and for automotive applications, arguably mandatory as they provide much stronger protection against data corruption.
While precise details vary by the ECC chosen, today’s SoC designer should be able to get full Single Error Correct, Double Error Detect (SECDED) protection at a cost of around 8 bits of additional storage for every 64 bits of data. The additional logic complexity is outweighed by the ability of the system to survive single-bit errors. It is particularly important for the automotive SoC designer to ensure that both correctable and uncorrectable errors are logged and reported to software. By logging both the failed data bit(s) and the SRAM line address, application or diagnostic software will have the information necessary to identify potentially failing hardware from patterns of even soft errors over time. In PCI Express designs, stored data is generally in transit from one layer to the next, so there is little benefit in writing corrected values back into the originating memory: once the data has been passed to the next layer, those memory locations are reused for a later packet.
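The 8-bits-per-64 figure matches an extended Hamming code: 7 check bits locate any single flipped bit among 71 positions, and one overall parity bit separates single errors from double errors. The following Python sketch shows one textbook construction; it is an illustration under that assumption, not the ECC of any particular memory or IP, and the bit layout and function names are chosen here for clarity.

```python
PARITY_POS = [1, 2, 4, 8, 16, 32, 64]                        # Hamming check bits
DATA_POS = [p for p in range(1, 72) if p not in PARITY_POS]  # 64 data positions

def _syndrome(bits):
    """XOR together the positions (1..71) of every set bit."""
    s = 0
    for pos in range(1, 72):
        if bits[pos]:
            s ^= pos
    return s

def encode(data: int) -> list:
    """Encode a 64-bit word into a 72-bit codeword (position 0 = overall parity)."""
    bits = [0] * 72
    for i, pos in enumerate(DATA_POS):
        bits[pos] = (data >> i) & 1
    s = _syndrome(bits)
    for p in PARITY_POS:
        bits[p] = 1 if s & p else 0   # choose check bits so the syndrome becomes 0
    bits[0] = sum(bits[1:]) & 1       # make the total number of set bits even
    return bits

def decode(codeword):
    """Return (data, status, syndrome); data is None when uncorrectable."""
    bits = list(codeword)
    s = _syndrome(bits)
    odd = sum(bits) & 1
    if s == 0 and not odd:
        status = "ok"
    elif odd and s < 72:
        bits[s] ^= 1                  # single error at position s (0 = parity bit)
        status = "corrected"
    else:
        return None, "uncorrectable", s
    data = 0
    for i, pos in enumerate(DATA_POS):
        data |= bits[pos] << i
    return data, status, s

word = 0xDEADBEEFCAFEF00D
clean = encode(word)

single = list(clean)
single[37] ^= 1                       # one soft error
double = list(clean)
double[5] ^= 1
double[60] ^= 1                       # two soft errors in the same word
```

Here `decode(single)` recovers the original word and returns the failing bit position, which is the kind of detail the paragraph above recommends logging alongside the SRAM line address; `decode(double)` reports the word as uncorrectable rather than returning wrong data.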
Protecting data in-flight is the process of ensuring correct data is carried through the various non-storage data paths of the SoC. For designers using ECC on their memories, carrying the ECC code along with the data certainly accomplishes the desired protection, but the additional ECC checks may be undesirable because of their area cost or their impact on timing closure. Given that even cutting-edge FinFET flip-flops are considered to be fairly reliable, the industry practice of carrying simple parity is likely sufficient, even in automotive applications.
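As a rough illustration of what simple parity does and does not catch, the sketch below computes one even-parity bit per byte of a 64-bit datapath word. The datapath width, lane numbering, and function names are assumptions made for the example.

```python
def byte_parity(word: int, width_bytes: int = 8) -> int:
    """Return one even-parity bit per byte; bit i of the result covers byte i."""
    parity = 0
    for i in range(width_bytes):
        byte = (word >> (8 * i)) & 0xFF
        parity |= (bin(byte).count("1") & 1) << i
    return parity

def failing_lanes(word: int, parity: int, width_bytes: int = 8) -> list:
    """Return the byte lanes whose recomputed parity disagrees with the carried parity."""
    diff = byte_parity(word, width_bytes) ^ parity
    return [i for i in range(width_bytes) if (diff >> i) & 1]

word = 0x0123456789ABCDEF
carried = byte_parity(word)                               # travels alongside the data

one_flip = failing_lanes(word ^ (1 << 20), carried)       # bit 20 is in byte lane 2
two_flips = failing_lanes(word ^ (0b11 << 8), carried)    # two bits in byte lane 1
```

A single flipped bit is flagged in its byte lane, while two flips within the same byte cancel out and go unnoticed. Parity detects but never corrects, which is the trade-off accepted in exchange for far less logic than a full ECC check at every pipeline stage.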
When uncorrectable errors are detected anywhere on the outbound path to the PCI Express link, SoC designers must implement some type of error recovery handshake with the application logic. Because packets are often pipelined, simply invalidating an outbound packet and notifying the application logic may not be able to prevent a subsequent packet from being transmitted. Worst case, that packet might indicate a higher-level protocol “successful completion” message related to the corrupted data. Even though the bad packet was never transmitted, the system memory (intended to be updated by the now invalidated packet) will not have valid data, and so receiving a “success” message would be catastrophic.
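One possible shape for such a handshake is sketched below in Python: the outbound queue drops the bad packet, stops transmitting, and holds everything behind it until the application says which dependent packets to discard. The `OutboundQueue` class and its methods are hypothetical and model only packet ordering, not a real controller interface.

```python
from collections import deque

class OutboundQueue:
    """Toy model of a pipelined outbound path with an error-recovery handshake."""

    def __init__(self):
        self.pending = deque()   # (name, data_ok) in submission order
        self.sent = []           # names of packets actually transmitted
        self.halted = False
        self.bad_packet = None

    def submit(self, name: str, data_ok: bool = True):
        self.pending.append((name, data_ok))

    def pump(self) -> bool:
        """Transmit in order; on an uncorrectable error, drop the packet and halt."""
        while self.pending and not self.halted:
            name, data_ok = self.pending.popleft()
            if data_ok:
                self.sent.append(name)
            else:
                self.bad_packet = name
                self.halted = True   # nothing behind the bad packet may leave yet
        return self.halted

    def recover(self, discard):
        """Application handshake: remove dependent packets, then release the queue."""
        self.pending = deque(p for p in self.pending if p[0] not in discard)
        self.halted = False

q = OutboundQueue()
q.submit("write A")
q.submit("write B", data_ok=False)   # uncorrectable error found on the way out
q.submit("completion for B")         # already queued behind the bad packet
q.submit("write C")

halted = q.pump()                    # stops after "write A"
q.recover(discard={"completion for B"})
q.pump()                             # "write C" goes out, the completion never does
```

The important property is that halting happens in the same step as invalidation, so the "success" message for the corrupted write cannot slip out between the error and the application's response.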
The PCI Express transport is inherently excellent at delivering correct data, so if the SoC designer can provide solid data protection up to the PCI Express controller, correct data transfer will be assured. The key area for improvement here is tracking reliability from the perspective of first-time error-free transfer. If every packet takes three attempts to deliver successfully, the link may be reliable in the sense of correct data delivery, but not in the sense of error-free transfers. Long experience with PCI Express has shown that poor quality channels are the number one contributor to poor link reliability. Unfortunately, the channel design is usually out of the hands of the SoC designer, and automotive environments are notoriously harsh – with wide temperature swings and high levels of vibration. The SoC designer can track channel quality through a series of event counters and logging facilities. Of course the internal data protection errors (both correctable and uncorrectable) should be tracked as previously noted. Figure 1 shows some of the key information which should be considered for tracking at the various layers of the PCI Express protocol. Some of this data may be best understood in the context of number of events per some unit of time, while others may make most sense as simple event logs.
Figure 1: Key events to track at the various layers of the PCI Express protocol
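The distinction between correct delivery and first-time error-free delivery can be made concrete with a small counter model. The event names below are placeholders rather than the contents of Figure 1, and `LinkStats` is a hypothetical class used only to show both views of the data: a running ratio and events per unit of time.

```python
from collections import Counter

class LinkStats:
    """Toy event tracker: cumulative counters plus a timestamped event log."""

    def __init__(self):
        self.counts = Counter()
        self.log = []                     # (timestamp_s, event_name)

    def record_event(self, name: str, timestamp_s: float):
        self.counts[name] += 1
        self.log.append((timestamp_s, name))

    def record_transfer(self, attempts: int, timestamp_s: float):
        """Record one delivered packet and how many attempts it needed."""
        self.counts["transfers"] += 1
        if attempts == 1:
            self.counts["first_pass"] += 1
        for _ in range(attempts - 1):
            self.record_event("replay", timestamp_s)

    def first_pass_ratio(self):
        total = self.counts["transfers"]
        return None if total == 0 else self.counts["first_pass"] / total

    def events_per_second(self, name: str, start_s: float, end_s: float) -> float:
        hits = sum(1 for t, n in self.log if n == name and start_s <= t < end_s)
        return hits / (end_s - start_s)

stats = LinkStats()
stats.record_transfer(attempts=1, timestamp_s=0.0)
stats.record_transfer(attempts=3, timestamp_s=0.5)   # delivered, but only on the third try
stats.record_transfer(attempts=1, timestamp_s=1.2)
stats.record_transfer(attempts=2, timestamp_s=1.7)

ratio = stats.first_pass_ratio()                      # 2 of 4 transfers were clean
early = stats.events_per_second("replay", 0.0, 1.0)
late = stats.events_per_second("replay", 1.0, 2.0)
```

Every transfer in this example delivers correct data, yet the first-pass ratio is only 0.5, which is the signal diagnostic software would need in order to flag a degrading channel.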
An important consideration here is non-volatile storage of the reliability data. At a minimum, the registers in question must survive an SoC reset, because if the link goes down, a system reset may be needed to bring it back up again. Ideally, the data would be preserved in a non-volatile storage medium so it could be accessed after a loss of power or after a long time has passed. Consider that the automobile in which the SoC resides might not be seen for system diagnostics for a year or more! It’s also useful to note that the same link quality data that reflects reliability can be invaluable during initial laboratory bring-up of the system, so it is important to consider an access mechanism that is independent of a working PCI Express link. Examples might include a logic analyzer connection, a USB interface for debugging, a processor In-Circuit-Emulator connection, or other proprietary mechanisms for accessing the SoC.
Error injection is another capability to consider. Automotive certifications tend to require extensive system testing, and creating the full gamut of potential PCI Express error events can be very difficult. By designing-in the ability to generate those events – both inbound as if they’d been detected on the PCI Express link and outbound by actually causing them to occur on the PCI Express link – SoC designers can greatly facilitate such testing. Furthermore, controlled error injection substantially eases the process of software testing (both embedded firmware and system drivers) and so a comprehensive system of error injection provides a huge benefit overall.
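The two injection directions can be sketched in a few lines of Python. This is a behavioral illustration only: `zlib.crc32` again stands in for the LCRC, and the `ErrorInjector` class and the fault names `tx_bad_lcrc` and `rx_fake_lcrc_error` are hypothetical.

```python
import zlib

class ErrorInjector:
    """Hypothetical injector holding one-shot faults armed by test software."""

    def __init__(self):
        self.armed = {}                  # fault name -> remaining injections

    def arm(self, fault: str, count: int = 1):
        self.armed[fault] = self.armed.get(fault, 0) + count

    def take(self, fault: str) -> bool:
        """Consume one armed injection of the given fault, if any remain."""
        if self.armed.get(fault, 0) > 0:
            self.armed[fault] -= 1
            return True
        return False

def transmit(payload: bytes, injector: ErrorInjector) -> bytes:
    """Outbound injection: really put a frame with a bad CRC on the link."""
    crc = zlib.crc32(payload)
    if injector.take("tx_bad_lcrc"):
        crc ^= 0xFFFFFFFF                # inverted CRC is guaranteed to mismatch
    return payload + crc.to_bytes(4, "big")

def receive(frame: bytes, injector: ErrorInjector, counters: dict):
    """Inbound injection: report an error as if one had been seen on the link."""
    payload, crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    bad = zlib.crc32(payload) != crc
    if injector.take("rx_fake_lcrc_error"):
        bad = True
    if bad:
        counters["lcrc_errors"] = counters.get("lcrc_errors", 0) + 1
        return None                      # a real receiver would NAK here
    return payload

inj, counters = ErrorInjector(), {}
normal = receive(transmit(b"tlp", inj), inj, counters)

inj.arm("tx_bad_lcrc")
outbound_case = receive(transmit(b"tlp", inj), inj, counters)

inj.arm("rx_fake_lcrc_error")
inbound_case = receive(transmit(b"tlp", inj), inj, counters)

after = receive(transmit(b"tlp", inj), inj, counters)   # faults are one-shot
```

Because each armed fault fires a known number of times, test software can check that the error counters and recovery paths respond exactly as expected, then confirm that normal traffic resumes.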
Today's connected automobile contains compute platforms and architectures which closely resemble those used in data centers. Given PCI Express' successful use in such cloud and data center applications, it is no surprise to see the protocol used in automotive applications. Automotive electronic systems must meet certain standards of reliability and safety, and the PCI Express protocol can fulfill those requirements with a combination of external and internal data protection and reliability features. SoC designers should provide data protection both at rest and in-flight, using industry best-practices of SECDED ECC on all memories and at least byte-parity on datapaths. They should also design in link reliability measuring hardware, with non-volatile storage as/where appropriate, to enable comprehensive system diagnosis and to ease initial link bring-up efforts. SoC designers should also implement the corresponding error injection capabilities to best provide for system-wide reliability testing and software development. Synopsys' DesignWare® IP for PCI Express supports these features and enables designs for automotive reliability.