

### A 3 TOPS/W RISC-V Parallel Cluster for Inference of Fine-Grain Mixed-Precision Quantized Neural Networks

<u>Alessandro Nadalini</u><sup>1</sup>, Georg Rutishauser <sup>2</sup>, Alessio Burrello <sup>1</sup>, Nazareno Bruschi <sup>1</sup>, Angelo Garofalo <sup>1</sup>, Luca Benini <sup>1,2</sup>, Francesco Conti <sup>1</sup>, Davide Rossi <sup>1</sup>

<sup>1</sup>University of Bologna, <sup>2</sup> ETH Zürich









### **Introduction and Motivation**

- Emerging application areas for AI-enabled IoT
  - Personalized Healthcare
  - Augmented Reality
  - Nano-Robotics
- Challenges
  - High computational demand from DNNs, other algorithms
  - Diverse computational patterns and requirements
- Opportunities
  - Bit-precision tolerance
  - Accelerable workloads



















- Fabric Controller (FC)
- Cluster
  - 8 RISC-V processors
  - 128 kB of L1 TCDM memory
  - L1 I\$
  - DMA controller















#### **RISCY**

- 32-bit in-order single-issue RISC-V processor
- 4 pipeline stages
- DSP extensions
- SIMD arithmetic instructions down to 8-bit precision

#### MatMul pseudo-assembly code:

| Label | Instruction | Operands    |
|-------|-------------|-------------|
|       | lp.setup    | 11, 12, end |
|       | p.lw        | w1, 4(aw1!) |
|       | p.lw        | w2, 4(aw2!) |
|       | p.lw        | w3, 4(aw3!) |
|       | p.lw        | w4, 4(aw4!) |
|       | p.lw        | x1, 4(ax1!) |
|       | p.lw        | x2, 4(ax2!) |
|       | pv.sdotp.b  | s1, x1, w1  |
|       | pv.sdotp.b  | s2, x1, w2  |
|       | pv.sdotp.b  | s3, x1, w3  |
|       | pv.sdotp.b  | s4, x1, w4  |
|       | pv.sdotp.b  | s5, x2, w1  |
|       | pv.sdotp.b  | s6, x2, w2  |
|       | pv.sdotp.b  | s7, x2, w3  |
| end:  | pv.sdotp.b  | s8, x2, w4  |



Performance degradation due to load operations within the innermost loop.
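The cost of the loads in the MatMul inner loop can be illustrated with a small functional model. This is a minimal sketch, assuming each `pv.sdotp.b` accumulates a 4-lane 8-bit dot product and that the loop is unrolled over 4 weight rows and 2 activation rows; the function names `sdotp_b` and `matmul_4x2_inner` are hypothetical and not part of any PULP library.

```python
def sdotp_b(acc, x_word, w_word):
    """Model one 8-bit SIMD sum of dot products: acc += sum over 4 lanes."""
    return acc + sum(x * w for x, w in zip(x_word, w_word))


def matmul_4x2_inner(weights, activations):
    """Run a 4x2 unrolled inner loop and count loads and sdotp instructions.

    weights: 4 rows, activations: 2 rows; each row is a list of 4-lane words.
    (Hypothetical data layout, chosen only for illustration.)
    """
    acc = [[0] * 4 for _ in range(2)]   # acc[j][i] plays the role of s1..s8
    loads = sdotps = 0
    for k in range(len(weights[0])):
        w = [row[k] for row in weights]        # four p.lw (w1..w4)
        x = [row[k] for row in activations]    # two p.lw (x1, x2)
        loads += len(w) + len(x)
        for j in range(2):
            for i in range(4):
                acc[j][i] = sdotp_b(acc[j][i], x[j], w[i])
                sdotps += 1
    return acc, loads, sdotps


weights = [[(1, 0, 0, 0)], [(0, 1, 0, 0)], [(0, 0, 1, 0)], [(0, 0, 0, 1)]]
activations = [[(1, 2, 3, 4)], [(5, 6, 7, 8)]]
acc, loads, sdotps = matmul_4x2_inner(weights, activations)
print(acc, loads, sdotps)
```

In this model each loop iteration issues 6 loads for 8 dot-product instructions, so a large share of the issue slots in the innermost loop does no arithmetic.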










### **XpulpNN**

Single-cycle MAC + load instruction (Mac&Load) down to 2-bit width

| Phase      | Label | Instruction    | Operands      |
|------------|-------|----------------|---------------|
| Init NN-RF |       | pv.nnsdotusp.h | zero, aw1, 16 |
| Init NN-RF |       | pv.nnsdotusp.h | zero, aw2, 18 |
| Init NN-RF |       | pv.nnsdotusp.h | zero, aw3, 20 |
| Init NN-RF |       | pv.nnsdotusp.h | zero, aw4, 22 |
| Init NN-RF |       | pv.nnsdotusp.h | zero, ax1, 8  |
|            |       | lp.setup       | 11, 12, end   |
| Loop body  |       | pv.nnsdotup.h  | zero, ax2, 9  |
| Loop body  |       | pv.nnsdotusp.b | s1, aw2, 0    |
| Loop body  |       | pv.nnsdotusp.b | s2, aw4, 2    |
| Loop body  |       | pv.nnsdotusp.b | s3, aw3, 4    |
| Loop body  |       | pv.nnsdotusp.b | s4, ax1, 14   |
| Loop body  |       | pv.nnsdotusp.b | s5, aw2, 17   |
| Loop body  |       | pv.nnsdotusp.b | s6, aw4, 19   |
| Loop body  |       | pv.nnsdotusp.b | s7, aw3, 21   |
| Loop body  | end:  | pv.nnsdotusp.b | s8, aw1, 23   |

Only one explicit load inside the innermost loop
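The idea of fusing a MAC with a load can be sketched as a single step that consumes two operand registers and refills one of them from memory. This is only a toy model under stated assumptions: the dictionary `nn_rf` stands in for a dedicated operand register file, and the function `mac_and_load` with its arguments is hypothetical. It does not reproduce the real immediate encoding of the `pv.nnsdot*` instructions.

```python
def mac_and_load(acc, nn_rf, w_reg, x_reg, memory, pointer, refill_reg):
    """Toy Mac&Load step (hypothetical interface).

    1. MAC: accumulate the dot product of two operand registers.
    2. Load: refill one operand register from memory and post-increment
       the pointer, as part of the same step.
    """
    acc += sum(w * x for w, x in zip(nn_rf[w_reg], nn_rf[x_reg]))
    nn_rf[refill_reg] = memory[pointer]   # old value was already consumed above
    return acc, pointer + 1


nn_rf = {"w0": [1, 2, 3, 4], "x0": [1, 1, 1, 1]}
memory = [[5, 6, 7, 8], [0, 0, 0, 0]]

acc, ptr = mac_and_load(0, nn_rf, "w0", "x0", memory, 0, "w0")
acc, ptr = mac_and_load(acc, nn_rf, "w0", "x0", memory, ptr, "w0")
print(acc, ptr)
```

Because the refill happens in the same step as the MAC, the loop body in this model needs no separate load for the refilled operand.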





### **MPIC**

- HW sub-byte and mixed-precision sum of dot products (sdotp)
- Virtual SIMD instructions
- Dynamic bit-scalable execution mode



[Figure: MPIC datapath. Blocks shown: Operand A, Operand B, Operand C, MPC CNT, SIMD FMT CSR (SIMD format MIX8x4), slicer and router, DOTP-16, DOTP-8, DOTP-4, DOTP-2, MAC32, MULT/ALU, output mux, result, and a decoder for scalar and virtual SIMD instructions (SDOTP.v, SDOTP.N).]

No more need for packing operations!

Instruction sequence shown next to the datapath:

| Instruction | Operands        |
|-------------|-----------------|
| p.lw        | x10, 4(x4!)     |
| p.lw        | x11, 4(x5!)     |
| p.extract   | x5, x11, 4, 0   |
| p.extract   | x6, x11, 4, 4   |
| p.extract   | x7, x11, 4, 8   |
| p.extract   | x8, x11, 4, 12  |
| pv.packlo.b | x15, x5, x6     |
| pv.packhi.b | x15, x7, x8     |
| pv.sdotsp.b | x20, x15, x10   |
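A mixed-precision sum of dot products can be modelled in software to show what the extract and pack instructions would otherwise have to do. This is a minimal sketch, assuming four signed 8-bit activations in one 32-bit word and signed 4-bit weights packed eight per word; the helper names `unpack_signed` and `mixed_sdotp` and the `half` selector are hypothetical, and the signedness is a simplification.

```python
def unpack_signed(word, bits, count, offset=0):
    """Extract `count` signed fields of width `bits` from `word`,
    starting at field index `offset` (field 0 is the least significant)."""
    mask = (1 << bits) - 1
    fields = []
    for i in range(offset, offset + count):
        raw = (word >> (i * bits)) & mask
        # Two's-complement sign extension of the field.
        fields.append(raw - (1 << bits) if raw >> (bits - 1) else raw)
    return fields


def mixed_sdotp(acc, act_word, wgt_word, half):
    """acc += dot(4 x int8 activations, 4 x int4 weights).

    `half` selects which group of four 4-bit weights in `wgt_word` is used
    (hypothetical stand-in for a hardware slice selector).
    """
    acts = unpack_signed(act_word, 8, 4)
    wgts = unpack_signed(wgt_word, 4, 4, offset=4 * half)
    return acc + sum(a * w for a, w in zip(acts, wgts))


act_word = 0x04030201          # activations 1, 2, 3, 4
wgt_word = 0x1111F321          # weights 1, 2, 3, -1 then 1, 1, 1, 1
low = mixed_sdotp(0, act_word, wgt_word, half=0)
both = mixed_sdotp(low, act_word, wgt_word, half=1)
print(low, both)
```

In hardware the slicing happens inside the dot-product unit, so none of the unpacking above costs instructions in the loop.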




## Our proposal for energy-efficient inference of DNNs



- Flex-V core
  - Performance of XpulpNN extensions
  - Flexibility of MPIC
  - Mixed-precision *Mac&Load* instructions
- Optimized SW library targeting well-known mixed-precision QNNs

[Figure labels: Optimized SW Library; HW support for mixed-precision]



























### **Automatic Address Generation**

- Based on static information related to the kernel
- The required invariant parameters are stored in CSRs
- Only one pointer for activations and one for weights
- Extensible to all 2D strided patterns, not only MatMuls!

#### How does it work?

1. Read the current address
2. Check the number of updates performed
3. Add the proper increment
4. Store the new address back in the related CSR
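The four steps above can be sketched as a small software model. This is a sketch under stated assumptions, not the hardware design: the class `AddressGenerator` and the parameter `skip` (number of updates before a rollback is applied) are hypothetical names, and the field `address` merely stands in for the CSR that holds the pointer.

```python
from dataclasses import dataclass


@dataclass
class AddressGenerator:
    address: int       # stands in for the CSR holding the current pointer
    stride: int        # increment applied on ordinary updates
    rollback: int      # increment applied on every `skip`-th update
    skip: int          # hypothetical: updates per inner pass
    updates: int = 0

    def step(self) -> int:
        current = self.address                 # 1. read the current address
        self.updates += 1
        if self.updates == self.skip:          # 2. check the number of updates
            increment = self.rollback          # 3. add the proper increment
            self.updates = 0
        else:
            increment = self.stride
        self.address = current + increment     # 4. store the new address back
        return current


# Walk four rows that are 64 bytes apart, 4 bytes at a time:
# three strides down the rows, then a rollback to the next word of row 0.
gen = AddressGenerator(address=0, stride=64, rollback=-3 * 64 + 4, skip=4)
addresses = [gen.step() for _ in range(8)]
print(addresses)
```

With these values a single pointer visits the same word offset in each of the four rows before moving on, which is the kind of 2D strided pattern the slide refers to.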






### **Mixed-precision + M&L + Automatic Address Generation**

| Phase                                | Label | Instruction    | Operands           |
|--------------------------------------|-------|----------------|--------------------|
| CSR configuration outside the kernel |       | csrwi          | sb_legacy, 0       |
| CSR configuration outside the kernel |       | csrwi          | simd_fmt, 8        |
| CSR configuration outside the kernel |       | csrwi          | mix_skip, 16       |
| CSR configuration outside the kernel |       | csrw           | a_stride, A_STRIDE |
| CSR configuration outside the kernel |       | csrw           | w_stride, W_STRIDE |
| CSR configuration outside the kernel |       | csrw           | a_rollback, A_ROLLB |
| CSR configuration outside the kernel |       | csrw           | w_rollback, W_ROLLB |
| CSR configuration in the kernel      |       | csrw           | a_csr, A_BASE_ADDR |
| CSR configuration in the kernel      |       | csrw           | w_csr, W_BASE_ADDR |
| Init the NN-RF                       |       | pv.mlsdotsp.h  | zero, aw, 16       |
| Init the NN-RF                       |       | pv.mlsdotsp.h  | zero, aw, 18       |
| Init the NN-RF                       |       | pv.mlsdotsp.h  | zero, aw, 20       |
| Init the NN-RF                       |       | pv.mlsdotsp.h  | zero, aw, 22       |
| Init the NN-RF                       |       | pv.mlsdotsp.h  | zero, ax, 8        |
|                                      |       | lp.setup       | 11, 12, end        |
| Loop body                            |       | pv.mlsdotsp.h  | zero, ax, 9        |
| Loop body                            |       | pv.mlsdotusp.b | s1, aw, 0          |
| Loop body                            |       | pv.mlsdotusp.b | s2, aw, 2          |
| Loop body                            |       | pv.mlsdotusp.b | s3, aw, 4          |
| Loop body                            |       | pv.mlsdotusp.b | s4, ax, 14         |
| Loop body                            |       | ...            |                    |
| Loop body                            |       | pv.mlsdotusp.b | s13, aw, 1         |
| Loop body                            |       | pv.mlsdotusp.b | s14, aw, 3         |
| Loop body                            |       | pv.mlsdotusp.b | s15, aw, 5         |
| Loop body                            |       | pv.mlsdotusp.b | s16, ax, 15        |
| Loop body                            |       | pv.mlsdotusp.b | s1, aw, 0          |
| Loop body                            |       | ...            |                    |
| Loop body                            |       | pv.mlsdotusp.b | s13, aw, 17        |
| Loop body                            |       | pv.mlsdotusp.b | s14, aw, 19        |
| Loop body                            |       | pv.mlsdotusp.b | s15, aw, 21        |
| Loop body                            | end:  | pv.mlsdotusp.b | s16, aw, 23        |

- No extraction/packing operations within the innermost loop of MatMuls
- Masked load operations
- Extension of the MatMul unrolling factor

All of this comes at the cost of a few simple CSR writes outside the body of the loop.




### **Results – Single kernels Energy Efficiency**

Physical implementation in GF-22nm technology













| Network                     | MobileNetV1 (8b) | MobileNetV1 (8b4b) | ResNet-20 (4b2b) |
|-----------------------------|------------------|--------------------|------------------|
| Top-1 Accuracy              | 69.3 %           | 66.0 %             | 90.2 % [1]       |
| Degradation w.r.t. 8b       | -                | 3.3 %              | 0.15 %           |
| Model size                  | 1.9 MB           | 997 kB             | 142 kB           |
| Memory saved                | -                | 47 %               | 63 %             |
| **Performance (MAC/cycle)** |                  |                    |                  |
| STM32H7                     | 0.33             | 0.30               | -                |
| XpulpV2                     | 5.6              | 3.2                | 4.8              |
| XpulpNN                     | 6.0              | 2.7                | 4.4              |
| Flex-V                      | 6.0              | 5.8                | 11.2             |

Slide annotations: ~20x (next to the STM32H7 row) and 2.5x (next to the XpulpNN entry for ResNet-20).

[1] Z. Dong et al., "HAWQ: Hessian AWare Quantization of Neural Networks With Mixed-Precision"
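The speedup annotations can be cross-checked against the MAC/cycle rows of the table. The short script below is only an arithmetic check on the numbers reported above; the dictionary layout and the helper `speedup` are illustrative choices, and pairing each annotation with a specific ratio is an assumption.

```python
# MAC/cycle values copied from the table; None marks a missing entry.
mac_per_cycle = {
    "STM32H7": {"MobileNetV1 (8b)": 0.33, "MobileNetV1 (8b4b)": 0.30, "ResNet-20 (4b2b)": None},
    "XpulpV2": {"MobileNetV1 (8b)": 5.6, "MobileNetV1 (8b4b)": 3.2, "ResNet-20 (4b2b)": 4.8},
    "XpulpNN": {"MobileNetV1 (8b)": 6.0, "MobileNetV1 (8b4b)": 2.7, "ResNet-20 (4b2b)": 4.4},
    "Flex-V":  {"MobileNetV1 (8b)": 6.0, "MobileNetV1 (8b4b)": 5.8, "ResNet-20 (4b2b)": 11.2},
}


def speedup(network, baseline, target="Flex-V"):
    """Ratio of MAC/cycle between `target` and `baseline` on one network."""
    return mac_per_cycle[target][network] / mac_per_cycle[baseline][network]


vs_stm32 = speedup("MobileNetV1 (8b4b)", "STM32H7")   # about 19.3
vs_xpulpnn = speedup("ResNet-20 (4b2b)", "XpulpNN")   # about 2.55
print(f"{vs_stm32:.1f}x over STM32H7, {vs_xpulpnn:.2f}x over XpulpNN")
```

Both ratios are consistent with the rounded figures on the slide, assuming the annotations compare Flex-V against those two baselines.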

### Conclusion

- In this work, we proposed a full stack to optimize the inference of fine-grain mixed-precision QNNs
- We designed new RISC-V ISA extensions:
  - Starting from the XpulpV2 baseline
  - Mixed-precision *Mac&Load* instructions through virtual SIMD instructions
  - Automatic address generation
- We optimized key kernels for the inference of mixed-precision QNNs
- Outperformed the baseline core by 19x, reaching 91.5 MAC/cycle
- Reached an energy efficiency of 3.3 TOPS/W, comparable to HW accelerators
- Benchmarked our ISA extensions on full networks
  - Obtaining a speedup over all reference architectures
  - Low Top-1 accuracy loss in exchange for a large reduction of the memory footprint

### Thank you!


