# You Can't Fix What You Can't Measure

Bill Gervasi, Principal Memory Solutions Architect
Monolithic Power Systems
bill.gervasi@monolithicpower.com

### The Memory and Storage Tier is Dominated by DRAM

Various DRAM options fill all these tiers:

- **LPDDR DRAM Direct**
- **HBM**
- **DDR DRAM Direct**
- **NUMA DRAM 1 Hop**
- **NUMA DRAM 2 Hops**
- **CXL DRAM Direct**
- **CXL DRAM 1 Hop**
- **CXL DRAM 2 Hops**
- **Hybrid DRAM + NAND**
- SSD
- **Network**

### Truism That Preventing an Error is Cheaper Than Fixing It

This talk will focus on telemetry gathering for today's DRAM modules

- Host In-Band allows interrogating DRAMs directly, but DRAM access must be halted
- Host Sideband allows interrogating module support logic, and DRAM access is not interrupted

Mux allows

- CPUs and BMC to share sideband
- Supporting multiples of 8 modules per bus segment

### **Anatomy of a Memory Module**

### DDR

Settings done in-band

### Reliable DRAM operation starts with effective calibration

Each signal type is calibrated independently

- Data & strobes
- Address/commands
- Clocks
- Chip selects

Settings in dozens of mode registers

Input capture eye maximized by

- Shmooing
- Test patterns
- On-die termination
- Voltage margining

### When good memories go bad

Each DRAM provides hints about internal status

- Internal temperature monitor
  - Suggested refresh rate
    - 1X up to 85°C
    - 2X from 85°C to 95°C
- On-Die Error Correction Code (ECC) is supported
  - Reads correct single-bit errors before sending to host
  - Writes save ECC codes along with data
- Runtime error transparency
  - Maximum number of errors
  - Worst 3 rows error count
- Error check scrub (ECS) error repair
  - Reads data, corrects errors, writes it back
  - Total error counter
  - Rows with errors counter

### **Dealing with errors**

When an error occurs, the DRAM drives an ALERT signal to the host

- CRC error
- Activation threshold exceeded

Post-package repair of naughty rows

- If a row is misbehaving, it can be swapped out
- Combine mPPR with Memory Built-In Self-Test
(MBIST)

### **Serial Presence Detect (SPD)**

- Powered via LDO independent of other module circuits
- Host communication via sideband bus
- 1 KB of non-volatile memory contains module parameters
- Operates as a hub from the host to local bus devices
- Integrated thermal sensor with low, high, and critical settings
- Host interrupt capability for itself and all local bus devices

E.g., PMIC sees high temperature:

1. PMIC sends interrupt to the SPD
2. SPD interrupts the host
3. Host interrogates PMIC

### Power Management ICs (PMICs)

PMICs contribute to calibration and telemetry gathering

Communication over sideband interface

**Calibration:**

- Each voltage rail can be adjusted based on device and signaling limits\*
- These adjustments may compensate for corner conditions as power planes cover large distances

**Telemetry:**

- Each voltage rail can be interrogated to measure voltage and current
- Total module wattage may be calculated on the fly
- Coupled with operational test patterns, power per operation type may be calculated

**Warnings and Error reporting:**

- Each voltage rail can be configured with warnings at high and critical levels
- Error counts are kept in non-volatile memory
- Overvoltage and overcurrent are treated separately

\* Overclockers do this as standard procedure

### **Thermal Sensors**

- Communication over sideband interface
- Thermal sensors tie directly to the module ground plane
  - Direct thermal path from the DRAMs
- Positioned at both ends of the module, since airflow may come from either end (50/50)

### **Registering Clock Driver (RCD)**

RCDs contribute to calibration and telemetry gathering

Communication over in-band and sideband interfaces

RCD Event Handling

| Event | DERROR_IN | ALERT | IBI | NACK |
|---------------------------------|-----------|----------|----------|----------|
| DRAM data CRC error | ✓ | ✓ | | |
| DRAM PRAC alert back-off | ✓ | ✓ | | |
| DRAM activation count violation | ✓ | ✓ | | |
| DRAM ALERT verification | ✓ | ✓ | | |
| RCD address parity violation | | ✓ | | |
| RCD DCS training | | ✓ | | |
| RCD DCA training | | ✓ | | |
| RCD DFE training | | ✓ | | |
| RCD DES training | | ✓ | | |
| Sideband bus PEC error | | | ✓ | ✓ |
| Sideband bus parity error | | | ✓ | ✓ |
| Handling | In-band | In-band | Sideband | Sideband |

- Interface between RCD and DRAMs calibrated
- Detects and reports DRAM errors
- Register errors
- On-chip error log reports what it saw on the address bus

Each error or warning type has its own set of reactions and mitigations

At the system level, a point of failure can have major impact: should the fan run faster for all modules in a rack to deal with one warning?

Collecting data at the rack, cage, hall, and building level can improve:

- Dealing with errors and warnings
- Predicting failures before they occur

This sounds like a good problem for AI

- Non-DRAM memory failures from the memory controller and memory channel cause the majority of errors
- Newer DRAM cell fabrication technologies have substantially higher failure rates, increasing by 1.8× over the previous generation
- Using lower density DIMMs and fewer cores per chip can reduce failure rates of a baseline server by up to 57.7%

https://ieeexplore.ieee.org/document/7266869/

### "Improving Memory Reliability at Data Centers"

AI creates a model of predictive patterns by comparing thousands of memory error logs from the field, then compares this model with scans from an operator's data center to determine where problems may exist, supporting data center operation and workload continuity

Predictive memory resilience technology can reduce uncorrectable error rates by nearly 50%

https://www.intel.cn/content/dam/www/public/us/en/documents/intel-and-samsung-mrt-improving-memory-reliability-at-data-centers.pdf

CISA (America's Cyber Defense Agency, CISA.gov):

An electronically available hardware bill of materials (HBOM) is under deployment for security

Once the HBOM is downloaded, error information can be added to the models
and tracked to specific components

https://www.cisa.gov/resources-tools/resources/hardware-bill-materials-hbom-framework-supply-chain-risk-management

"Problem" parts can indicate potential for future failures

### **Conclusions**

- Increasing demand for memory has exacerbated sensitivity to errors
- Memory subsystems provide a collection of reports
  - Errors detected
  - Warnings
- Using this data to mitigate an error is useful, but...
- ...Predicting the next error before it happens is essential

### Thank you for your time

Any questions?
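As a concrete illustration of the PMIC telemetry points earlier in the talk (per-rail voltage and current, total module wattage on the fly, warnings at high and critical levels), here is a minimal Python sketch. Everything specific in it is an assumption made for illustration: the rail names, the readings, and the thresholds are hypothetical, and a real system would obtain the values from the PMIC over the sideband bus rather than from hard-coded data.

```python
from dataclasses import dataclass


@dataclass
class RailReading:
    """One telemetry sample for a single PMIC voltage rail."""
    name: str     # hypothetical rail label, not a real register name
    volts: float  # measured rail voltage in volts
    amps: float   # measured rail current in amperes


def rail_power(reading: RailReading) -> float:
    """Power drawn on one rail, in watts (P = V * I)."""
    return reading.volts * reading.amps


def module_power(readings: list[RailReading]) -> float:
    """Total module wattage: the sum of the per-rail powers."""
    return sum(rail_power(r) for r in readings)


def classify(value: float, high: float, critical: float) -> str:
    """Map a measurement onto the two warning levels named in the talk."""
    if value >= critical:
        return "critical"
    if value >= high:
        return "high"
    return "ok"


# Hypothetical sample data standing in for values read over the sideband bus.
readings = [
    RailReading("rail_a", volts=1.1, amps=2.0),
    RailReading("rail_b", volts=1.8, amps=0.5),
    RailReading("rail_c", volts=1.0, amps=1.0),
]

total_watts = module_power(readings)
# The 4.0 W and 5.0 W thresholds are illustrative only.
status = classify(total_watts, high=4.0, critical=5.0)
print(f"module power: {total_watts:.2f} W, status: {status}")
```

Sampling the same calculation while a known test pattern runs is one way to approximate the "power per operation type" figure mentioned in the telemetry list; logging the classified status per module gives the kind of record that rack-level or building-level analysis could later consume.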