草榴社区

Speaking Up About Silent Data Corruption

These two panels explore the challenges posed by Silent Data Corruption (SDC) and the strategic interventions available within Reliability, Availability, and Serviceability (RAS) for contemporary systems. The discussions define SDC and explore its potential causes, including areas where SDC and RAS interact. Both RAS and SDC face challenges from technology scaling and integration, compute power, testing of advanced nodes, and the detection and correction of permanent, degrading, intermittent, or transient errors. Hear insights from industry experts at 草榴社区, Google, and Microsoft, covering multiple perspectives including end users, hyperscalers, OEMs, semiconductor suppliers, and EDA companies.


Speakers (left to right): Rama Govindaraju, Jyotika Athavale, Robert S. Chappell

Part I of this panel defines and explores potential causes of SDC.

Part II of this panel discusses strategies to mitigate SDC.

Our Experts

Jyotika Athavale

Director, Engineering Architecture, 草榴社区

Jyotika is a Director, Engineering Architecture at 草榴社区, leading quality, reliability and safety research, pathfinding and architectures for data centers and automotive applications. Jyotika also serves as the 2024 President of the global IEEE Computer Society, overseeing overall IEEE-CS programs and operations.
For her leadership in international safety standardization, Jyotika was awarded the 2023 IEEE SA Standards Medallion. For her leadership in service, she received the IEEE Computer Society Golden Core Award in 2022. This year, she is a finalist for the Electronics Weekly Woman of the Year awards in two categories.

Rama Govindaraju

Principal Engineer, Google

Rama Govindaraju is a Principal Engineer at Google, leading the effort to ensure the reliability of large-scale machine learning supercomputers. He was previously Director of Engineering at Google, where he led the Systems Infrastructure Architecture team. Before that, Rama was a Distinguished Engineer at IBM, responsible for the software architecture at IBM's Supercomputing Lab, where he led the development of five generations of supercomputers. Rama received his MS and PhD in Computer Science from Rensselaer Polytechnic Institute in New York and his BE in Computer Science from BIT Mesra, Ranchi, India.

Robert S. Chappell

Partner Hardware Architecture, Microsoft

Robert S. Chappell is a Partner at Microsoft in Redmond, WA. Rob has a passion for "at-scale" computing and is responsible for improving the reliability and performance of the millions of server nodes underlying Azure's core cloud business. Prior to joining Microsoft in 2019, Rob spent over 20 years architecting high-volume CPUs. Rob earned a Ph.D. in Computer Architecture from the University of Michigan.

Additional Resources

Blog
Ensuring the Health and Reliability of Multi-Die Systems