

## ∧ Meta

## Next stop XR: towards on-sensor PULP computing for micropower eXtended Reality

Francesco Conti f.conti@unibo.it

**PULP Platform** Open Source Hardware, the way it should be!

@pulp\_platform pulp-platform.org



youtube.com/pulp\_platform

## "Smart" Glasses, today

Socially Acceptable form factor  $\rightarrow$  like regular glasses

Lightweight  $\rightarrow$  < 50 grams

All-day battery → LiPo 167mAh @ 3.7V → 25mW for 24-hour operation



[https://www.techinsights.com/blog/ray-ban-stories-smart-glasses-cameras]










#### Meta Quest 2



Microsoft HoloLens 2



Dedicated form factor → cumbersome and uncomfortable in the long run

Heavyweight → ~500 grams

2-3 hour battery → Li-ion 3640mAh @ 3.85V → ~5W operation
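
The two battery budgets quoted on these slides (smart glasses and XR headsets) follow from simple energy arithmetic. The sketch below reproduces them under the assumption of nominal capacity and voltage, with no conversion losses or battery derating; the function names are illustrative and not taken from the talk.

```python
def battery_energy_mwh(capacity_mah: float, voltage_v: float) -> float:
    """Nominal stored energy in mWh (assumes no losses or derating)."""
    return capacity_mah * voltage_v


def average_power_mw(capacity_mah: float, voltage_v: float, hours: float) -> float:
    """Average power that drains the battery in exactly `hours`."""
    return battery_energy_mwh(capacity_mah, voltage_v) / hours


def runtime_hours(capacity_mah: float, voltage_v: float, power_mw: float) -> float:
    """Runtime at a constant power draw."""
    return battery_energy_mwh(capacity_mah, voltage_v) / power_mw


# Smart glasses: 167 mAh LiPo at 3.7 V, spread over 24 hours.
glasses_mw = average_power_mw(167, 3.7, 24)
# Headset: 3640 mAh Li-ion at 3.85 V, drawing about 5 W.
headset_h = runtime_hours(3640, 3.85, 5000)

print(f"glasses budget: {glasses_mw:.1f} mW")  # about 25.7 mW
print(f"headset runtime: {headset_h:.1f} h")   # about 2.8 h
```

The gap between the two is roughly a factor of 200 in average power, which is the budget any XR function folded into glasses has to fit within.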





ETHZÜRICH 🕮 ALMA MATER STUDIORUM

[https://en.wikichip.org/wiki/qualcomm/snapdragon\_800/865] [https://arstechnica.com/]



## How to "fold" XR functionality into smart glasses?




Many technological hurdles to address as well! E.g., see-through hi-res displays [0]

#### Save energy where it counts!



[E. Beigné, ISSCC Forum 4: Advancing Technologies for Extended Reality (XR) to Make the "Metaverse" Possible]






## The PULP value

#### Composability

- Vast library of silicon-proven IPs
- Ranging from microcontroller to HPC

#### Heterogeneity

- L0 acceleration: RISC-V extensions, SSR, ...
- L1 acceleration: HWPEs (neural engines, TPEs, optimization engines...)
- L2 acceleration: AXI autonomous units (& multi-cluster)

#### Efficiency

• Otherwise it would be the P\_\_\_\_ Platform!



High-speed on-chip interconnect (NoC, AXI, other..)





## A vision for PULP-based XR glasses

Distributed, on-sensor computing

- Collect raw data
- Process directly **on-sensor**
- Aggregate on larger computing platforms

#### Acceleration

- On-chip **NVM** for DNN weights
- L1 HW acceleration for DNNs
- L0 acceleration for diverse processing




## Siracusa first steps

- A heterogeneous cluster template: ARchiMEDES
- A novel accelerator: NEureka





Siracusa



#### ARchiMEDES\* cluster template

Architectural Heterogeneity <-> Compute Diversity

DSP, Models, and other general parallel computations -> PULP cluster



#### A "classic" PULP cluster with 8 **RV32IMCFXpulpnn** cores

- private multi-precision FPUs
- hierarchical instruction cache (4 KiB + 512B per core)
- **Xpulpnn** extensions [5] for integer mixed-precision DSP + DNNs
- 256 KiB of Tightly-Coupled Data Memory (TCDM) divided in 16 word-interleaved SRAM banks
- The Logarithmic Interconnect we have known and loved since < 2013 ☺
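
As a rough illustration of what "word-interleaved" means for the 16-bank TCDM, the sketch below maps a byte offset inside the 256 KiB memory to a bank and a row within that bank. It assumes 32-bit words (consistent with RV32 cores) and a plain modulo mapping; the actual address decoding in the cluster may differ, and all names are hypothetical.

```python
TCDM_BYTES = 256 * 1024   # 256 KiB of L1 TCDM
NUM_BANKS = 16            # word-interleaved SRAM banks
WORD_BYTES = 4            # assumption: 32-bit words


def tcdm_bank(offset: int) -> tuple[int, int]:
    """Map a byte offset in the TCDM to (bank index, row inside the bank)."""
    if not 0 <= offset < TCDM_BYTES:
        raise ValueError("offset outside the TCDM")
    word = offset // WORD_BYTES
    # Consecutive words land in consecutive banks, wrapping every 16 words.
    return word % NUM_BANKS, word // NUM_BANKS


# Eight cores reading eight consecutive words hit eight different banks,
# so under this mapping they would not conflict with each other.
banks = [tcdm_bank(core * WORD_BYTES)[0] for core in range(8)]
print(banks)
```

Under this mapping, unit-stride accesses from the 8 cores spread across banks, while strides that are multiples of 16 words all fall on the same bank and would serialize.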



\* Augmented Reality Architecture with Minimum Energy DNNs Embedded Specialization

#### **ARchiMEDES cluster template**

Architectural Heterogeneity <-> Compute Diversity

Quantized DNNs -> Tightly-Coupled Neural Engine NEureka (3<sup>rd</sup> gen after RBE [5], NE16)





- 1 Core = receptive field of 1x1 px in output across 32 out-chans
- Output stationary, Input quasi-stationary
- Parametric number of Cores (NxM out-px)
- 8b activations, 2-8b weights
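
To make the Core geometry concrete: if each Core produces one output pixel across 32 output channels, an NxM array produces an N x M x 32 output tile per pass. The sketch below counts how many such tiles are needed to cover a layer's output, assuming simple non-overlapping tiling with padded edges; the real scheduling is not described on the slide, and the layer and array sizes used here are illustrative only.

```python
CHANNELS_PER_CORE = 32  # from the slide: 1 Core = 1x1 px across 32 out-chans


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def output_tiles(out_h: int, out_w: int, out_ch: int,
                 cores_n: int, cores_m: int) -> int:
    """Tiles needed to cover an out_h x out_w x out_ch output tensor."""
    return (ceil_div(out_h, cores_n)
            * ceil_div(out_w, cores_m)
            * ceil_div(out_ch, CHANNELS_PER_CORE))


# Hypothetical 48x48x64 output on two illustrative array sizes.
tiles_4x4 = output_tiles(48, 48, 64, 4, 4)
tiles_6x6 = output_tiles(48, 48, 64, 6, 6)
print(tiles_4x4, tiles_6x6)
```

More Cores mean a larger output tile and therefore fewer passes over the same layer, which is the scaling knob the "parametric number of Cores" bullet refers to.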



#### **ARchiMEDES cluster template**





#### Boost memory energy efficiency

A large power-optimized on-chip memory for network weights -> cluster-level **weight stationarity** 

- 4x 1 MiB SRAM banks (64b-wide)
- 4x 1 MiB NVM banks (64b-wide)

Paging support for transparent network reconfiguration with negligible increase in overall circuit area.
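
Weight stationarity at cluster level only works if the network's weights fit in the weight memory. A back-of-the-envelope check, assuming densely packed weights with no padding or metadata and considering only one 4 MiB memory (4x 1 MiB banks), might look like the following. The 6M-parameter network is a made-up example, not a workload from the talk.

```python
WMEM_BYTES = 4 * 1024 * 1024  # one weight memory: 4x 1 MiB banks
WMEM_BITS = WMEM_BYTES * 8


def max_weights(weight_bits: int) -> int:
    """Largest number of densely packed weights that fit at this precision."""
    return WMEM_BITS // weight_bits


def fits(num_weights: int, weight_bits: int) -> bool:
    """True if the whole network can stay resident in the weight memory."""
    return num_weights * weight_bits <= WMEM_BITS


for bits in (8, 4, 2):
    print(bits, max_weights(bits))

# Hypothetical 6M-parameter network: too large at 8b, resident at 4b.
print(fits(6_000_000, 8), fits(6_000_000, 4))
```

Lower weight precision therefore helps twice: less compute per weight and a larger network that can remain stationary on chip.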



#### Siracusa SoC




## NEureka – DNN Accelerator Engine

$$\mathbf{y}(k_{out}) = quant\left(\sum_{i=0..Wbit}\sum_{k_{in}} 2^{i} \left(\mathbf{W}_{bin}(k_{out}, k_{in}) \otimes \mathbf{x}(k_{in})\right)\right)$$



- Partially bit-serial dataflow for CONV3x3, PW1x1, DWCONV3x3
  - 3x3, 1x1 and 3x3 depthwise mode
  - Activations 8b, Weights 2-8b
- Core receptive field of 1 output px across 32 output chans → more cores, larger output "tile"
- Stationarity
  - Output -> fully stationary in Accumulators
  - Input -> quasi-stationary in Input Buffers
  - Weights -> non-stationary (but stationary @ cluster level, thanks to WMEM!)
- **Dispatching network** maps input across Cores
- Accumulators: 32x 32-bit registers to store partial sums
- Quant Normalization, Quantization, ReLU
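
The equation above decomposes each multi-bit weight into binary planes, accumulates one plane at a time, and scales each partial sum by a power of two. Below is a minimal software sketch of that idea for a single output channel. It assumes unsigned weights, bit index i running over the Wbit planes 0..Wbit-1, and a hypothetical scale/bias/shift requantization followed by ReLU-style clipping to 8 bits; none of these details are confirmed by the slides, and this is not the NEureka implementation.

```python
def bit_serial_dot(weights, activations, w_bits):
    """Dot product computed one weight bit-plane at a time."""
    acc = 0
    for i in range(w_bits):
        partial = 0
        for w, x in zip(weights, activations):
            w_bin = (w >> i) & 1   # 1-bit slice of the weight
            partial += w_bin * x   # binary weight gates the activation
        acc += partial << i        # apply the 2^i scaling
    return acc


def requantize(acc, scale=1, bias=0, shift=0):
    """Hypothetical normalization + quantization + ReLU to unsigned 8 bits."""
    value = (acc * scale + bias) >> shift
    return max(0, min(255, value))


weights = [3, 5, 0, 7]            # 3-bit unsigned weights (illustrative)
activations = [10, 20, 30, 40]    # 8-bit activations (illustrative)

acc = bit_serial_dot(weights, activations, w_bits=3)
assert acc == sum(w * x for w, x in zip(weights, activations))
print(acc, requantize(acc, shift=1))
```

Because the loop over bit-planes is explicit, fewer weight bits translate directly into fewer passes, which is why throughput and efficiency scale with weight precision in a bit-serial design.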












#### NEureka scalability and performance



#### Weight-precision scaling


[A. Prasad, L. Benini, F. Conti, "Specialization meets Flexibility: a Heterogeneous Architecture for High-Efficiency, High-Flexibility AR/VR Processing," DAC 2023 (to appear)]



## Siracusa SoC – prototype in TSMC 16nm

- 4 mm x 4 mm
- A cornucopia of **memory**
  - 4 MiB of WMEM-NVM
  - 4 MiB of WMEM-SRAM
  - 2 MiB of L2 SRAM
  - 256 KiB of L1 TCDM
- Largest NEureka configuration
  - 6x6 = 36 Cores

[M. Scherer et al., "Siracusa: A Low-Power On-Sensor RISC-V SoC for Extended Reality Visual Processing in 16nm CMOS," ESSCIRC 2023 (to appear)]







#### **Siracusa** Performance and Efficiency



#### RISC-V cores (GP, DSP, ...)



Very near to our target:

- hundreds of GOPS
- ~10 TOPS/W

NEureka




#### The Evolution of the Accelerator Species






#### The real challenge: using it!










#### Back to the vision!

Distributed, on-sensor computing

- Collect raw data
- Process directly **on-sensor**
- Aggregate on larger computing platforms

#### Acceleration

- On-chip **NVM** for DNN weights
- L1 HW acceleration for DNNs
- L0 acceleration for diverse processing








## From focus on single node towards distributed network of on-sensor nodes

[J. Gomez et al., Distributed On-Sensor Compute System for AR/VR Devices: A Semi-Analytical Simulation Framework for Power Estimation]




## Distributed on-sensor computing simulation with GVSOC









#### Siracusa Team @ ETH / UNIBO



- Manuel Eggimann: top-level design & verification, WMEM subsystem, silicon measurements
- Arpan Prasad: NEureka design & verification, system-level simulation
- Moritz Scherer: silicon measurements, applications
- Alfio Di Mauro: interfaces integration, top-level verification
- **Francesco Conti**: NEureka architecture, Siracusa architecture





#### References

- [0] M. Abrash, "Creating the Future: Augmented Reality, the next Human-Machine Interface," in 2021 IEEE International Electron Devices Meeting (IEDM), Dec. 2021, pp. 1.2.1–1.2.11. doi: 10.1109/IEDM19574.2021.9720526.
- [1] Y. Feng, N. Goulding-Hotta, A. Khan, H. Reyserhove, and Y. Zhu, "Real-Time Gaze Tracking with Event-Driven Eye Segmentation," in 2022 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), Mar. 2022, pp. 399–408. doi: 10.1109/VR51125.2022.00059.
- [2] S. Huang *et al.*, "A new head pose tracking method based on stereo visual SLAM," *Journal of Visual Communication and Image Representation*, vol. 82, p. 103402, Jan. 2022, doi: 10.1016/j.jvcir.2021.103402.
- [3] F. Zhang *et al.*, "MediaPipe Hands: On-device Real-time Hand Tracking." arXiv, Jun. 17, 2020. doi: 10.48550/arXiv.2006.10214.
- [4] A. Li, W. Liu, C. Zheng, and X. Li, "Embedding and Beamforming: All-Neural Causal Beamformer for Multichannel Speech Enhancement," in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2022, pp. 6487–6491. doi: 10.1109/ICASSP43922.2022.9746432.



