

### Use an Intelligent SSD to Accelerate Machine Learning

## Hung-Wei Tseng, University of California, Riverside

Flash Memory Summit 2019 Santa Clara, CA







K. Hazelwood et al., "Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective," 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), Vienna, 2018, pp. 620-629.




### ML is still time-consuming

| Service              | Resource                  | Training Frequency | Training Duration |
|----------------------|---------------------------|--------------------|-------------------|
| Facer                | GPUs + single socket CPUs | Every N Photos     | Seconds           |
| News Feed            | Dual Socket CPUs          | Daily              | Hours             |
| Lumos                | GPUs                      | Multi-monthly      | Hours             |
| Search               | Vertical Dependent        | Hourly             | Hours             |
| Language Translation | GPUs                      | Weekly             | Days              |
| Sigma                | Dual Socket CPUs          | Sub-Daily          | Hours             |
| Speech Recognition   | GPUs                      |                    | Hours             |

Source: K. Hazelwood et al., "Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective," 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), Vienna, 2018, pp. 620-629.



### The ML data processing pipeline

| CPU         | Data<br>preparation<br>#1 | Data<br>preparation<br>#2 | Data<br>preparation<br>#3 | Data<br>preparation<br>#4 |                 |                 |
|-------------|---------------------------|---------------------------|---------------------------|---------------------------|-----------------|-----------------|
| TPU/<br>GPU |                           | Training #1               | Training #2               | Training #3               | Training #4     |                 |
| TPU/<br>GPU |                           |                           | Inference<br>#1           | Inference<br>#2           | Inference<br>#3 | Inference<br>#4 |

Time advances from left to right.
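The overlap in the pipeline table can be reasoned about with a small timing model. The sketch below is an illustration under stated assumptions, not a measurement: the three stages follow the table (data preparation, training, inference), the per-batch durations are made-up numbers, and `pipeline_makespan` is a hypothetical helper. It suggests that once the stages overlap, the slowest stage sets the overall pace.

```python
def pipeline_makespan(stage_times, num_batches):
    """Finish time of the last batch in a linear pipeline.

    stage_times: per-batch duration of each stage, in pipeline order.
    Each stage handles one batch at a time; a batch enters a stage only
    after it has left the previous stage and that stage is free.
    """
    finish = [0.0] * len(stage_times)  # finish time of the latest batch per stage
    for _ in range(num_batches):
        ready = 0.0  # when the current batch leaves the previous stage
        for s, duration in enumerate(stage_times):
            start = max(ready, finish[s])
            finish[s] = start + duration
            ready = finish[s]
    return finish[-1]


# Hypothetical per-batch durations in arbitrary time units:
# [data preparation, training, inference].
balanced = pipeline_makespan([1.0, 1.0, 1.0], num_batches=4)
prep_bound = pipeline_makespan([3.0, 1.0, 1.0], num_batches=4)
print(balanced, prep_bound)
```

With these made-up numbers, slowing only the data preparation stage stretches the whole run, even though the accelerator stages are unchanged.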





## The ML data processing pipeline: TPU





### Tasks in this new bottleneck

- Reading inputs
- Reducing precision
- Shuffling data
- Creating application objects
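The four tasks above can be sketched as a host-side routine. This is a minimal illustration, assuming a made-up record layout (a fixed number of `float64` features followed by an `int32` label); `Sample` and `prepare` are hypothetical names and do not come from the talk.

```python
import random
import struct
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical application object: one feature vector plus a label."""
    features: list
    label: int


def prepare(raw: bytes, num_features: int, seed: int = 0):
    """Run the four data preparation tasks on a raw binary buffer.

    Assumed layout per record: num_features float64 values followed by
    one int32 label, little-endian, no padding.
    """
    record = struct.Struct(f"<{num_features}di")
    to_f32 = struct.Struct("<f")

    # 1. Reading inputs: decode every fixed-size record in the buffer.
    records = list(record.iter_unpack(raw))

    # 2. Reducing precision: round-trip each feature through float32.
    reduced = [
        ([to_f32.unpack(to_f32.pack(v))[0] for v in rec[:-1]], rec[-1])
        for rec in records
    ]

    # 3. Shuffling data: a seeded shuffle keeps the example reproducible.
    random.Random(seed).shuffle(reduced)

    # 4. Creating application objects.
    return [Sample(features, label) for features, label in reduced]


raw = b"".join(struct.pack("<2di", i + 0.1, i + 0.2, i) for i in range(4))
samples = prepare(raw, num_features=2)
print(len(samples), sorted(s.label for s in samples))
```

Every step here runs on the CPU and touches every byte of the input, which is why these tasks can become the limiting stage once the accelerators are fast enough.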



### Adjusting data resolutions in storage: Varifocal Storage

- Shuffling data in storage
- Conclusion



### We don't need really detailed inputs

Reduce the resolution by 25%







### **Approximate Computing**

### A large set of applications can tolerate inaccuracies

- Machine learning
- Data mining
- Video/Image processing
- Scientific computing

### Benefits of approximate computing

- Reduce the amount of computation
- Simplify hardware design
- Deliver higher throughputs
- Improve the area-efficiency












### We can save both computation overhead and bandwidth if the storage device can reduce the resolution!
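The bandwidth argument can be made concrete with a little arithmetic. The sketch below is not how Varifocal Storage is implemented; it only compares how many bytes cross the interconnect when the host shrinks an image versus when the device does. It assumes that "reduce the resolution by 25%" means each dimension shrinks to 75% of its original size, and `downsample` is a hypothetical nearest-neighbor helper working on a synthetic 8-bit grayscale image.

```python
def downsample(pixels, width, height, scale):
    """Nearest-neighbor downsample of a row-major 8-bit grayscale image."""
    new_w = max(1, int(width * scale))
    new_h = max(1, int(height * scale))
    out = bytearray(new_w * new_h)
    for y in range(new_h):
        src_y = min(height - 1, int(y / scale))
        for x in range(new_w):
            src_x = min(width - 1, int(x / scale))
            out[y * new_w + x] = pixels[src_y * width + src_x]
    return bytes(out), new_w, new_h


width, height = 64, 64
image = bytes((x + y) % 256 for y in range(height) for x in range(width))

# Assumption: each dimension shrinks to 75% of its original size.
small, w, h = downsample(image, width, height, scale=0.75)

host_side_bytes = len(image)   # full image is transferred, host shrinks it
in_storage_bytes = len(small)  # device shrinks it, only the result is transferred
print(w, h, in_storage_bytes / host_side_bytes)
```

Under this assumption the transfer shrinks to a little over half of the original size, and the host no longer spends cycles on the resize.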









### Varifocal Storage: dynamic multiresolution storage





### Varifocal Storage






### Speedup in "data preparation"







### Adjusting data resolutions in storage: Varifocal Storage

- Shuffling data in storage
- Conclusion




### **Conventional NVMe Read**

- The command specifies the starting address in the SSD and the length to read
- The command contains a list of memory locations that receive the data being read
  - These locations are consecutive in the virtual address space presented to the application
  - These locations may not be physically consecutive
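The read described above can be modeled in a few lines. This is a simplified sketch, not the NVMe wire format: real commands address logical blocks and carry the destination list as PRP entries or a scatter-gather list, whereas this model uses byte offsets and a plain Python list. `ReadCommand`, `execute_read`, the page size, and the physical addresses are all assumptions made for illustration.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed host memory page size, in bytes


@dataclass
class ReadCommand:
    """Simplified model of a read command (hypothetical, not the NVMe layout)."""
    start_offset: int  # starting address in the SSD, in bytes
    length: int        # number of bytes to read
    dest_pages: list   # physical page addresses that receive the data, in order


def execute_read(flash: bytes, host_memory: dict, cmd: ReadCommand) -> None:
    """Copy the requested range, page by page, into the listed destinations."""
    data = flash[cmd.start_offset:cmd.start_offset + cmd.length]
    for i, page_addr in enumerate(cmd.dest_pages):
        chunk = data[i * PAGE_SIZE:(i + 1) * PAGE_SIZE]
        host_memory[page_addr][:len(chunk)] = chunk


# Three pages that are consecutive in the application's virtual buffer
# but scattered in physical memory (addresses are made up).
physical_pages = [0x9000, 0x3000, 0x7000]
host_memory = {addr: bytearray(PAGE_SIZE) for addr in physical_pages}

# Synthetic flash contents: page n is filled with the byte value n.
flash = b"".join(bytes([n]) * PAGE_SIZE for n in range(8))

cmd = ReadCommand(start_offset=2 * PAGE_SIZE, length=3 * PAGE_SIZE,
                  dest_pages=physical_pages)
execute_read(flash, host_memory, cmd)

# The application sees the pages in list order: flash pages 2, 3, 4.
virtual_view = b"".join(bytes(host_memory[addr]) for addr in physical_pages)
print(virtual_view[0], virtual_view[PAGE_SIZE], virtual_view[2 * PAGE_SIZE])
```

The point of the model is that the device fills destinations in list order, so the application's view stays contiguous even though the physical pages are scattered.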





### Shuffled NVMe Read





## Performance of shuffled NVMe read





### Conclusion

- Conventional research focuses on single-point designs, missing opportunities for cross-layer, full-stack solutions
- The I/O stack is becoming the new bottleneck for accelerator-based architectures
- We need to carefully examine the bottlenecks in modern applications; they may not be computation-bound






https://www.escalab.org/