



# An Advanced Flash Emulator for Designing Today's High-Capacity Controllers

# Th. Antonakopoulos, **N. Toulgaridis**, M. Varsamou, E. Bougioukou and T. Petropoulos

University of Patras, Greece





- Why is NAND Flash emulation needed at the system level?
- What are the challenges of system-level emulation?
- The architecture of the NAND Flash Emulator
  - System architecture
  - Memory organization
  - Low-latency memory access
  - Experimental results
- Beyond the current NAND Flash Emulator





### Storage

Flash Memory Summit

- Storage devices
  - Multiple memory ICs organized in channels
  - Multiple channels operating in parallel
  - Large memory capacity per channel (a few tens or hundreds of GB) and huge capacity at the system level (several TB)
  - High I/O rates and fast response times, especially for page reads
  - Complex functions (e.g. wear leveling, workload balancing) in the storage controller
- Full-system prototyping and testing before the actual memory chips are available, based only on their specifications
- Evaluation of the implemented algorithms' performance under different load conditions
- Reduced time-to-market for the storage device when new memory chips become available

There is a need for a NAND Flash Channel Emulator that can <u>emulate the whole memory capacity</u> of a device and <u>respond in real time</u> according to the NAND Flash specs.



### The main challenges of a system-level Flash emulator


- Emulate the whole system capacity
  - Single-board emulators have limited fast memory capacity (a few tens of GB)
  - Storage systems have multiple channels with multiple dies per channel, and their total capacity ranges from hundreds of GB up to several TB

### Solution: Exploit the DRAM capacity of server motherboards

- The DRAM is accessed directly by the host processor and indirectly by devices attached to PCIe slots
- Access performance depends on the host processor used (number of DRAM controllers, internal data paths)

### Respond in real time according to the Flash IC specs

- NAND Flash data access time is 30 to 50 usecs, and the page transfer time depends on the supported NAND Flash interface and the Flash page size
- Multiple channels operating in parallel generate asynchronous access requests
- The latency introduced by the operating system must be avoided
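The real-time requirement can be made concrete with a small timing model. The following is a minimal Python sketch, not the emulator's FPGA logic, and it rests on stated assumptions: tR = 50 usecs (the upper end of the quoted range), a 200 MBps interface, one outstanding read per die, and a data bus shared by all dies of a channel; command/address cycles and cache reads are ignored. The names `ReadCmd` and `schedule_reads` are hypothetical.

```python
from dataclasses import dataclass

# Assumed timing parameters (illustrative, not taken from a specific datasheet).
T_READ_US = 50.0     # array read time tR in usecs
RATE_MBPS = 200.0    # NAND interface transfer rate; bytes / (MB/s) = usecs


@dataclass
class ReadCmd:
    arrival_us: float   # time the DUT issues the read command
    channel: int
    die: int
    page_bytes: int


def schedule_reads(cmds):
    """Return (ready_us, done_us) for each command, in input order.

    ready_us: the die leaves the busy state (start + tR).
    done_us:  the page has been clocked out on the channel bus,
              which all dies of a channel share.
    """
    die_free = {}   # (channel, die) -> time the die becomes idle
    bus_free = {}   # channel -> time the channel bus becomes idle
    results = [None] * len(cmds)
    # Serve commands in arrival order (stable sort keeps ties in input order).
    for i in sorted(range(len(cmds)), key=lambda k: cmds[k].arrival_us):
        c = cmds[i]
        key = (c.channel, c.die)
        start = max(c.arrival_us, die_free.get(key, 0.0))
        ready = start + T_READ_US
        bus_start = max(ready, bus_free.get(c.channel, 0.0))
        done = bus_start + c.page_bytes / RATE_MBPS
        die_free[key] = done          # simplification: die busy until data is out
        bus_free[c.channel] = done
        results[i] = (ready, done)
    return results


if __name__ == "__main__":
    cmds = [ReadCmd(0.0, 0, 0, 8640), ReadCmd(0.0, 0, 1, 8640), ReadCmd(0.0, 1, 0, 8640)]
    for cmd, (ready, done) in zip(cmds, schedule_reads(cmds)):
        print(cmd.channel, cmd.die, round(ready, 1), round(done, 1))
```

In this example all three reads become ready at 50 usecs, but the second read on channel 0 waits for the shared bus, so its data is out at 136.4 usecs instead of 93.2 usecs. An emulator that serves such requests from host DRAM has to meet deadlines of this kind, which is why operating-system latency must stay out of the data path.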

### Solution: Use a fast PCIe-based FPGA board to which the DUT is attached

- Custom logic must be developed for direct access to the host's DRAM
- Modular design to support different Flash interfaces
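The custom logic that reaches the host's DRAM has to turn a Flash address into a DRAM address. The slides do not describe how the emulator does this, so the following is only a minimal Python sketch under hypothetical assumptions: a flat layout in which every emulated page occupies a fixed-size DRAM slot (5 kB, the value listed for 4 kB user pages in the Memory Organization table), a hypothetical base address, and a made-up geometry of 256 pages per block and 8,192 blocks per die (8 GiB of user data, i.e. a 64 Gbit die).

```python
# Minimal sketch of Flash-address -> host-DRAM-address translation.
# All constants are hypothetical illustration values, not the emulator's.
CHANNELS = 16
DIES_PER_CHANNEL = 4
BLOCKS_PER_DIE = 8192
PAGES_PER_BLOCK = 256
SLOT_BYTES = 5 * 1024          # DRAM slot per page: 4.22 kB Flash page rounded up to 5 kB
BASE_ADDR = 0x1_0000_0000      # hypothetical start of the DRAM region reserved for emulation

PAGES_PER_DIE = BLOCKS_PER_DIE * PAGES_PER_BLOCK


def dram_address(channel: int, die: int, block: int, page: int) -> int:
    """Return the host DRAM byte address of the slot holding one emulated page."""
    limits = (CHANNELS, DIES_PER_CHANNEL, BLOCKS_PER_DIE, PAGES_PER_BLOCK)
    for value, limit in zip((channel, die, block, page), limits):
        if not 0 <= value < limit:
            raise ValueError("address outside the emulated geometry")
    # Linearize (channel, die, block, page) into a global page index ...
    die_index = channel * DIES_PER_CHANNEL + die
    page_index = die_index * PAGES_PER_DIE + block * PAGES_PER_BLOCK + page
    # ... then scale by the fixed slot size: one multiply, no lookup table.
    return BASE_ADDR + page_index * SLOT_BYTES


# Total DRAM this layout reserves for all channels and dies.
REGION_BYTES = CHANNELS * DIES_PER_CHANNEL * PAGES_PER_DIE * SLOT_BYTES
```

Rounding each Flash page (data plus spare area) up to a whole number of kB wastes some DRAM, but it presumably keeps the address arithmetic simple enough for hardware. With this hypothetical geometry the region comes to 640 GiB, the upper end of the DRAM range quoted for 512 GB of emulated Flash.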




### The characteristics of the system-level Flash emulator




- Uses a low-cost, commercially available motherboard that supports the maximum possible DRAM (several TB)
- Uses a PCIe card with a high-speed SoC FPGA that acts as the digital front-end (DFE) to the DUT
- Modular, reusable system design
- The design can be split across two FPGA boards, if needed
  - The FPGA boards are interconnected using a High-Speed Digital Link (HSDL), e.g. xSFP+

### Advantages:

- Minimum latency, high capacity, support for various NAND Flash I/O interfaces, and flexibility in the mechanical attachment of the DUT.
- Minimum additional development effort when new memory devices must be supported.









# Memory Organization and Emulator Capabilities






| User Page<br>[kB] | Flash Page<br>[kB] | DRAM per Page<br>[kB] | Total Number of<br>Emulated Pages | Die Capacity<br>[Gbit] | Total Number of Dies | Supported Channels –<br>Dies/channel |
|-------------------|--------------------|-----------------------|-----------------------------------|------------------------|----------------------|--------------------------------------|
| 4.0               | 4.22               | 5.0                   | 107.3 M                           | 64                     | 64                   | 16 – 4                               |
| 8.0               | 8.44               | 9.0                   | 59.6 M                            | 128                    | 32                   | 8 – 4                                |
| 16.0              | 17.25              | 18.0                  | 29.8 M                            | 256                    | 16                   | 4 – 4                                |
| 16.0              | 18.16              | 19.0                  | 28.2 M                            | 512                    | 8                    | 4 – 2 or 2 – 4                       |

Every configuration emulates 512 GB of NAND Flash (DRAM required: 576 to 640 GB).
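The capacity figures in the table can be cross-checked with a few lines of arithmetic. This Python sketch assumes binary units (1 kB = 1024 B, 1 Gbit = 2^30 bit) and treats the "DRAM per Page" column as the DRAM slot reserved for each Flash page; both are interpretations rather than statements from the slides.

```python
KIB = 1024
GIB = 1024 ** 3

# Rows of the Memory Organization table:
# (user page kB, DRAM slot per page kB, die capacity Gbit, channels, dies per channel)
ROWS = [
    (4, 5, 64, 16, 4),
    (8, 9, 128, 8, 4),
    (16, 18, 256, 4, 4),
    (16, 19, 512, 4, 2),
]


def flash_bytes(die_gbit: int, channels: int, dies_per_channel: int) -> int:
    """Emulated Flash capacity: die capacity times number of dies (Gbit -> bytes)."""
    return die_gbit * channels * dies_per_channel * GIB // 8


def dram_needed(flash: int, user_page_kb: int, slot_kb: int) -> int:
    """DRAM needed to hold every user page of `flash` in a slot of `slot_kb` kB."""
    return flash // (user_page_kb * KIB) * slot_kb * KIB


def pages_in_dram(dram: int, slot_kb: int) -> int:
    """Number of page slots that fit in a given amount of DRAM."""
    return dram // (slot_kb * KIB)


for user_kb, slot_kb, die_gbit, ch, dpc in ROWS:
    flash = flash_bytes(die_gbit, ch, dpc)
    print(f"{user_kb:>2} kB pages: {flash // GIB} GiB Flash needs "
          f"{dram_needed(flash, user_kb, slot_kb) // GIB} GiB DRAM; "
          f"512 GiB DRAM holds {pages_in_dram(512 * GIB, slot_kb) / 1e6:.2f} M pages")
```

Every row yields 512 GiB of Flash, and the DRAM needed comes out at 640, 576, 576 and 608 GiB, consistent with the 576 to 640 GB range above. The "Total Number of Emulated Pages" column appears to equal 512 GiB divided by the slot size, truncated to one decimal (107.37 M becomes 107.3 M, and so on).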

### Prototype of the NAND Flash Emulator






Host: Supermicro X10SRi-F motherboard with an Intel Xeon E5-2650 v4



FPGA board: HTG-K800 with a Xilinx Kintex UltraScale KU085

Emulates up to:

- 2 NAND Flash channels
- 4/8 CEs per channel
- 2 GB L1 cache
- Supported specifications
  - ONFI 1.0 to 3.0
  - Toggle 1.0 to 2.0

| DIMM<br>capacity | Total<br>DRAM | DRAM for<br>NFE (NAND Flash Emulator) | Emulated NAND<br>Flash Memory |
|------------------|---------------|-----------------|-------------------------------|
| 64 GB            | 512 GB        | 496 GB          | 480 GB                        |
| 128 GB           | 1 TB          | 768 GB          | 656 GB                        |





64 or 128 GB DRAM SO-DIMMs




## **3CPU and Tracer Architecture**








| Page Size<br>[B] | Read Time<br>[usecs] | Write Time<br>[usecs] | ONFI<br>Version | Transfer Rate<br>[MBps] | Transfer Time<br>[usecs] | Transfer Time over PCIe<br>(8 lanes Gen3, 128-bit DMA)<br>[usecs] | NAND Flash<br>Channels<br>Supported |
|------------------|----------------------|-----------------------|------|-------------------------|--------------------------|--------------------------------------------------------------------|-------------------------------------|
| 8,640            | 50                   | 1,300                 | 2.0  | 166                     | 52.0                     | 2.6                                                                | 18                                  |
| 8,640            | 35                   | 300                   | 2.2  | 200                     | 43.2                     | 2.6                                                                | 12                                  |
| 18,592           | 50                   | 1,400                 | 3.0  | 166                     | 112.0                    | 5.1                                                                | 8                                   |
| 18,592           | 50                   | 1,400                 | 3.0  | 333                     | 55.8                     | 5.1                                                                | 8                                   |
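The "Transfer Time" column follows directly from page size and interface rate: with MB = 10^6 bytes, 1 MBps is one byte per usec, so dividing bytes by MBps gives usecs. The short Python check below reproduces that column; it also prints the effective PCIe throughput implied by the measured PCIe times. It does not try to reproduce the "Channels Supported" column, because the slides do not say how it was derived.

```python
# Rows of the table above: (page size B, NAND transfer rate MBps, measured PCIe time usecs)
ROWS = [(8640, 166, 2.6), (8640, 200, 2.6), (18592, 166, 5.1), (18592, 333, 5.1)]


def nand_transfer_us(page_bytes: int, rate_mbps: float) -> float:
    """Page transfer time on the NAND interface: bytes / (MB/s) = usecs."""
    return page_bytes / rate_mbps


for page, rate, pcie_us in ROWS:
    t = nand_transfer_us(page, rate)
    # bytes per usec equals MB/s; divide by 1e3 to express it in GB/s
    print(f"{page:>6} B @ {rate} MBps: NAND transfer {t:.1f} us, "
          f"PCIe {pcie_us} us ({page / pcie_us / 1e3:.2f} GB/s effective)")
```

The PCIe transfer of a page is one order of magnitude shorter than both the array read time and the NAND interface transfer, which is what allows a single PCIe link to keep several emulated channels busy.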



## **3CPU Experimental Results**









Characteristics of the multi-channel NAND Flash emulator:

- Supports a large number of NAND Flash channels
- Supports practically unlimited NAND Flash capacity
- Responds according to the timing specifications of the emulated NAND Flash chips
- Supports ONFI and Toggle interfaces
- Due to its modular design, it can be reused to emulate other non-volatile memory (NVM) technologies and/or other I/O interfaces (e.g. eMMC)
- Enables the design of new algorithms based on data analytics (e.g. minimizing read latency by predicting future read/write commands)



http://www.loe.ee.upatras.gr/English/COMES-home.htm





# Back-up slides













A single DFE cannot support all the channels of the storage device.




# Prototype of the NAND Flash Emulator

64 or 128 GB DRAM SO-DIMMs






| DIMM<br>capacity | Total<br>DRAM | DRAM for<br>NFE | Emulated NAND<br>Flash Memory |
|------------------|---------------|-----------------|-------------------------------|
| 16 GB            | 128 GB        | 112 GB          | 96 GB                         |
| 32 GB            | 256 GB        | 250 GB          | 224 GB                        |
| 64 GB            | 512 GB        | 496 GB          | 480 GB                        |
| 128 GB           | 1 TB          | 768 GB          | 656 GB                        |



#### HTG-K800 Xilinx Kintex UltraScale KU085



Emulates up to:

- 2 NAND Flash channels
- 4/8 CEs per channel
- 2 GB L1 cache
- Supported specifications
  - ONFI 1.0 to 3.0
  - Toggle 1.0 to 2.0




### **Tracer Experimental Results**

[Figure: tracer latency [usecs]; axes labeled in bytes and MBps]