

# Toward End-to-End Open Platforms for the Embodied AI Era

#### Luca Benini

benny@stanford.edu

lbenini@iis.ee.ethz.ch

luca.benini@unibo.it

#### **PULP Platform**

Open Source Hardware, the way it should be!



# **Embodied AI:** Artificial Intelligence Everywhere











2010 - 2018

2019 - 2025

2025...



[SCR23]



# **Embodied AI:** Artificial Intelligence Everywhere


#### **Smart Glasses**





#### Nano-Drone







On-car Computing P<sub>avg</sub> < 150 W

On-drone Computing P<sub>avg</sub> < 150 mW

On-glass Computing P<sub>avg</sub> < 1.50 mW







### **Embodied AI**: Efficiency Challenge







Model complexity 10× every ~2.5 years

Moore's Law 10x every 12 years!

[AMD HotChips24]
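To make the gap concrete, here is a minimal sketch that compounds the two growth rates quoted above over the same period. The function name and the 10-year horizon are illustrative assumptions, not figures from the slide.

```python
def growth_factor(years: float, tenfold_period_years: float) -> float:
    """Growth after `years` for a quantity that grows 10x every `tenfold_period_years`."""
    return 10.0 ** (years / tenfold_period_years)


horizon_years = 10.0  # illustrative horizon (assumption)

model_growth = growth_factor(horizon_years, 2.5)     # model complexity: 10x every ~2.5 years
silicon_growth = growth_factor(horizon_years, 12.0)  # Moore's Law: 10x every 12 years
gap = model_growth / silicon_growth                  # what algorithm, architecture, design must cover

print(f"models: {model_growth:.0f}x, silicon: {silicon_growth:.1f}x, gap: {gap:.0f}x")
```

Under these assumptions the gap after a decade is roughly three orders of magnitude, which is the part that cannot come from technology scaling alone.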

### Algorithm, Architecture, Design are key!







# Efficiency through Heterogeneity: Multi-Specialization



Brain-inspired: Multiple areas, different structure different function!



Wernicke's Area
Written and spoken
language understanding

#### FUNCTIONAL AREAS OF THE CEI









# Perceptive & Generative AI: A Fast-Evolving Model Zoo



[Z. Sun et al.] MobileBERT

Encoder Transformer





[P. Busia et al.] **EEGFormer** 

**Encoder Transformer** 



Need to design embedded AI SoCs that provide 1) flexibility, 2) performance, and 3) energy efficiency





DINOv2: Learning Robust Visual Features
without Supervision
[M. Oquab et al.]

**Encoder Transformer** 





Auto-regressive Transformers? Maybe...

**v**Denoiser







# Kraken: 22FDX SoC, Multiple Heterogeneous Accelerators



The *Kraken*: an "Extreme Edge" Brain

- RISC-V Cluster: 8 compute cores + 1 DMA core
- CUTIE: dense ternary-neural-network accelerator
- SNE: energy-proportional spiking-neural-network accelerator



| Technology   | 22 nm FDSOI           |
|--------------|-----------------------|
| Chip Area    | 9 mm<sup>2</sup>      |
| SRAM SoC     | 1 MiB                 |
| SRAM Cluster | 128 KiB               |
| VDD range    | 0.55 V - <b>0.8 V</b> |
| Cluster Freq | ~370 MHz              |
| SNE Freq     | ~250 MHz              |
| CUTIE Freq   | ~140 MHz              |







### **CUTIE: Perception from Frame Sensors**





#### Output channel compute unit (OCU)

- Completely Unrolled Ternary Neural Inference Engine: K × K window, all input channels, cycle-by-cycle sliding
- One Output Compute Unit (OCU) computes one output activation per cycle!
- Zeros in weights and activations, together with spatial smoothness of activations, reduce switching activity

### Aggressive quantization and full specialization





### Kraken's CUTIE Implementation





- Configuration in Kraken
  - 96 channels (Output compute units)
  - 3 × 3 kernels
  - 64 × 64 pixels feature maps (158 KiB)
  - 9 layers of weights (117 KiB)
- Lots of TMAC/cycle
  - 96 OCUs, 96 Input channels, 3 × 3 kernels:
  - $96 \times 96 \times 3 \times 3 = 82'944$  Ternary-MAC/cycle
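The arithmetic above can be checked, and extended to a peak throughput estimate, with a short sketch. The ~140 MHz clock is the CUTIE frequency from the Kraken table; counting one MAC as two operations is a common convention and an assumption here, not something the slide states.

```python
ocus = 96            # output compute units, one per output channel
input_channels = 96
kernel_h = 3
kernel_w = 3

# Every OCU consumes a full K x K window over all input channels each cycle.
tmac_per_cycle = ocus * input_channels * kernel_h * kernel_w

cutie_freq_hz = 140e6                             # ~140 MHz (Kraken table)
peak_tmac_per_s = tmac_per_cycle * cutie_freq_hz  # ternary MACs per second at full utilization
peak_top_per_s = 2 * peak_tmac_per_s              # assumption: 1 MAC = 2 operations

print(tmac_per_cycle)
print(f"{peak_tmac_per_s / 1e12:.1f} TMAC/s, {peak_top_per_s / 1e12:.1f} TOP/s peak")
```

These are peak numbers; sustained throughput depends on how well a given network fills the 96 channels and the 3 × 3 window.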



# 1fJ/MAC (1POP/s/W) Ternary OPS







### SNE: Perception on Event Sensors



Event Sensors – DVS camera

Ultra-low latency

Energy-proportional interface



Spiking Neural Engine (SNE)





[Di Mauro et al. DATE22]

SNE works seamlessly with DVS (event-based) sensors







### Event consumption, and output spikes generation





More complex dynamics than conventional DNN neurons:

- Membrane potential accumulation/activation: 1× SynAcc = 1× 4b-ADD + 1× 8b-COMPARE
- Membrane potential decay: 1× SynDec = (1× 8b-MUL) + (1× 8b-MUL + 1× 8b-ADD)
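The two neuron operations can be illustrated with a generic leaky integrate-and-fire model. This is a minimal sketch, not SNE's actual datapath: the class name, threshold, decay factor, reset-to-zero policy, and the single-multiply decay are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class LifNeuron:
    threshold: int = 64   # firing threshold (hypothetical value)
    decay_num: int = 230  # decay factor as an 8-bit fraction, 230/256 (hypothetical value)
    potential: int = 0    # membrane potential, kept in an unsigned 8-bit range

    def decay(self) -> None:
        """Leak step: scale the potential by decay_num / 256 (one multiply and a shift)."""
        self.potential = (self.potential * self.decay_num) >> 8

    def accumulate(self, weight: int) -> bool:
        """Event step: add a 4-bit signed weight, compare with the threshold, fire and reset."""
        assert -8 <= weight <= 7, "weight must fit in 4 signed bits"
        self.potential = max(0, min(255, self.potential + weight))
        if self.potential >= self.threshold:
            self.potential = 0  # reset after emitting an output spike
            return True
        return False


neuron = LifNeuron(threshold=20)
spikes = []
for weight in [7, 7, 7, 7, -3, 7]:  # one input event per time step
    neuron.decay()
    spikes.append(neuron.accumulate(weight))

print(spikes)  # [False, False, False, True, False, False]
```

Work is done only when an event arrives, which is what makes the computation energy-proportional to input activity.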







# General Purpose: Domain-Specialized RV32 Core (PE)





RISC-V® instruction set: open and extensible by construction (great!)

#### 8-bit Convolution

Vanilla RISC-V core:

- `addi a0,a0,1`, `addi t1,t1,1`, `addi t3,t3,1`, `addi t4,t4,1`
- `lbu a7,-1(a0)`, `lbu a6,-1(t4)`, `lbu a5,-1(t3)`, `lbu t5,-1(t1)`
- `mul s1,a7,a6`, `mul a7,a7,a5`, `add s0,s0,s1`, `mul a6,a6,t5`
- `add t0,t0,a7`, `mul a5,a5,t5`, `add t2,t2,a6`, `add t6,t6,a5`

### Specialized for AI → Mixed precision SIMD (16-2bit)

```
# Init NN-RF (outside of the loop)
pv.nnsdotup.h s0, ax1, 9
pv.nnsdotsp.b s1, aw2, 0
pv.nnsdotsp.b s2, aw4, 2
pv.nnsdotsp.b s3, aw3, 4
pv.nnsdotsp.b s4, ax1, 14
```
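The idea behind the packed instructions can be sketched in software: one 32-bit register holds 4, 8, or 16 lanes (8, 4, or 2 bits each), and a single sum-of-dot-product instruction multiplies all lanes and accumulates. The helper names below are hypothetical, and the sketch does not model the exact semantics of `pv.nnsdotup`/`pv.nnsdotsp` or the NN-RF; it only shows why lower precision means more MACs per instruction.

```python
def pack_lanes(values, lane_bits):
    """Pack signed integers into one 32-bit word, lowest lane first."""
    assert len(values) * lane_bits == 32
    mask = (1 << lane_bits) - 1
    word = 0
    for i, value in enumerate(values):
        word |= (value & mask) << (i * lane_bits)
    return word


def unpack_lanes(word, lane_bits):
    """Split a 32-bit word into sign-extended lanes, lowest lane first."""
    mask = (1 << lane_bits) - 1
    sign_bit = 1 << (lane_bits - 1)
    lanes = []
    for i in range(32 // lane_bits):
        value = (word >> (i * lane_bits)) & mask
        lanes.append(value - (1 << lane_bits) if value & sign_bit else value)
    return lanes


def sdotp(acc, a_word, w_word, lane_bits):
    """Sum-of-dot-product: acc + sum(a[i] * w[i]) over all packed lanes."""
    a = unpack_lanes(a_word, lane_bits)
    w = unpack_lanes(w_word, lane_bits)
    return acc + sum(x * y for x, y in zip(a, w))


# 8-bit lanes: 4 MACs per call; 2-bit lanes: 16 MACs per call.
acc8 = sdotp(0, pack_lanes([1, -2, 3, 4], 8), pack_lanes([5, 6, -7, 8], 8), 8)
acc2 = sdotp(10, pack_lanes([1] * 16, 2), pack_lanes([-2, 1] * 8, 2), 2)
print(acc8, acc2)  # 4 2
```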



15× fewer instructions than vanilla; 90%+ ALU utilization

Specialization cost: power and area $1.5\times \uparrow$, time $15\times \downarrow$ $\rightarrow$ $E = P \cdot T$: $10\times \downarrow$
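The energy figure follows directly from the two ratios on the slide. As a worked check, with subscripts "spec" and "van" for the specialized and vanilla cores (names introduced here for illustration only):

$$\frac{E_{spec}}{E_{van}} = \frac{P_{spec}}{P_{van}} \cdot \frac{T_{spec}}{T_{van}} = 1.5 \times \frac{1}{15} = \frac{1}{10}$$

A 1.5× power cost paired with a 15× shorter runtime therefore gives about 10× lower energy per task.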








### PULP Paradigm: A PE cluster accelerates a host system









### Heterogeneous, Multiscale Accelerated Computing



#### **Multiple Scales of acceleration**

#### Extensions to processor cores

- Explore new extensions
- Efficient implementations

#### **Shared-memory Accelerators**

- Domain specific
- Local memory

#### Multiple Decoupled Accelerators

- Communication
- Synchronization

High-speed on-chip interconnect (NoC, AXI, other...)



RISC-V is a key enabler  $\rightarrow$  max agility, enabling SW build-up, without vendor lock-in







# **Tightly-coupled Accelerators**











# HWPE: Reconfigurable Binary Engine



$$\mathbf{y}(k_{out}) = \mathbf{quant}\left(\sum_{i=0..M}\sum_{j=0..N}\sum_{k_{in}} 2^{i}2^{j}\left(\mathbf{W_{bin}}(k_{out},k_{in})\otimes\mathbf{x_{bin}}(k_{in})\right)\right)$$
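The formula decomposes a multi-bit dot product into binary dot products over bit planes, weighted by powers of two. Below is a minimal sketch for one output channel, assuming unsigned operands and reading $\otimes$ as a bitwise AND followed by a sum; `w_bits` and `x_bits` correspond to $M+1$ and $N+1$, and the `quant` helper (shift and clip) is a hypothetical stand-in for the real requantization.

```python
def bit_plane(values, bit):
    """Extract one bit plane (a list of 0/1) from a list of unsigned integers."""
    return [(v >> bit) & 1 for v in values]


def bit_serial_dot(weights, activations, w_bits, x_bits):
    """Accumulate 2^i * 2^j * sum(W_bin AND x_bin) over all bit-plane pairs (i, j)."""
    acc = 0
    for i in range(w_bits):
        w_bin = bit_plane(weights, i)
        for j in range(x_bits):
            x_bin = bit_plane(activations, j)
            binary_dot = sum(w & x for w, x in zip(w_bin, x_bin))
            acc += (1 << i) * (1 << j) * binary_dot
    return acc


def quant(acc, shift, out_bits):
    """Hypothetical requantization: right shift, then clip to out_bits unsigned bits."""
    return max(0, min((1 << out_bits) - 1, acc >> shift))


weights = [3, 0, 5, 7]       # 3-bit unsigned weights for one output channel
activations = [9, 4, 15, 2]  # 4-bit unsigned activations

acc = bit_serial_dot(weights, activations, w_bits=3, x_bits=4)
reference = sum(w * x for w, x in zip(weights, activations))
print(acc, reference, quant(acc, shift=2, out_bits=4))  # 116 116 15
```

Because the number of bit-plane passes scales with `w_bits * x_bits`, the same binary datapath can be reconfigured for different precisions by changing only the loop bounds.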



Energy efficiency: 10-20× better (0.1 pJ/OP) than SW on the cluster at the same accuracy







### Specialization in perspective



Using 22FDX tech, NT@0.6V, High utilization, minimal IO & overhead

Energy-Efficient RV Core → 20pJ (8bit)



ISA-based 10-20x  $\rightarrow$ 1pJ (4bit)



**XPULP** 



Configurable DP 10-20x  $\rightarrow$  100fJ (4bit)



**RBE** 



Highly specialized DP  $100x \rightarrow 1fJ$  (ternary)



**CUTIE, SNN** 
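Energy per operation and efficiency in operations per second per watt are reciprocals, so the ladder above can be restated in TOPS/W. The sketch below uses the per-operation energies from the slide; the dictionary keys are shortened labels chosen for illustration.

```python
energy_per_op_joules = {
    "Energy-efficient RV core (8 bit)": 20e-12,
    "ISA-based, XPULP (4 bit)": 1e-12,
    "Configurable datapath, RBE (4 bit)": 100e-15,
    "Highly specialized datapath, CUTIE/SNN (ternary)": 1e-15,
}


def tops_per_watt(joules_per_op: float) -> float:
    """ops/s/W equals ops/J, i.e. 1 / (J/op); the result is scaled to tera-ops."""
    return 1.0 / joules_per_op / 1e12


for name, energy in energy_per_op_joules.items():
    print(f"{name}: {tops_per_watt(energy):g} TOPS/W")
```

The last entry, 1 fJ per operation, corresponds to 1000 TOPS/W, that is, 1 POP/s/W.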



# Marsellus: AI-IoT Heterogeneous SoC









#### **Combine:**

- Heterogeneous architecture
- Quantization
- V<sub>DD</sub> scaling
- Adaptive Body Biasing

### **Prototype implemented in GF 22FDX**

→ flip-well LVT & SLVT cells, 2.43mm² for CLUSTER







### Vega: On-Chip NVMem for NN Weights

In cooperation with GREENWAVES

| Technology | 22 nm FDSOI       |
|------------|-------------------|
| Chip Area  | 12 mm<sup>2</sup> |
| SRAM       | 1.7 MB            |
| MRAM       | 4 MB              |
| VDD range  | 0.5 V - 0.8 V     |
| VBB range  | 0 V - 1.1 V       |
| Fr. Range  | 32 kHz - 450 MHz  |
| Pow. Range | 1.7 µW - 49.4 mW  |









end-to-end on-chip computation

3.5x less energy





### Not only academia: GAP9 with NE16





### Best-in-class in latency and energy efficiency in MLPerf Tiny 1.0!

Benchmarks (data, model, target accuracy): Visual Wake Words (Visual Wake Words Dataset, MobileNetV1 0.25x, 80% top 1); CIFAR-10 (ResNet-V1, 85% top 1); Keyword Spotting (Google Speech Commands, DSCNN, 90% top 1); Anomaly Detection (ToyADMOS ToyCar, FC AutoEncoder, 0.85 AUC). Each benchmark reports latency in ms followed by energy in µJ.

| Submitter | Board Name | SoC Name | Processor(s) & Number | Accelerator(s) & Number | Software | Clock / Voltage | Visual Wake Words latency (ms) | Visual Wake Words energy (µJ) | CIFAR-10 latency (ms) | CIFAR-10 energy (µJ) | Keyword Spotting latency (ms) | Keyword Spotting energy (µJ) | Anomaly Detection latency (ms) | Anomaly Detection energy (µJ) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Greenwaves Technologies | GAP9 EVK | GAP9 | RISC-V Core (1+9) | NE16 (1) | GreenWaves GAPFlow | GAP9 (370MHz, 0.8Vcore) | 1.13 | 58.4 | 0.62 | 40.4 | 0.48 | 26.7 | 0.18 | 7.29 |
| Greenwaves Technologies | GAP9 EVK | | RISC-V Core (1+9) | NE16 (1) | GreenWaves GAPFlow | GAP9 (240MHz, 0.65Vcore) | 1.73 | | | | 25-01 | | 757 | 5.25 |
| OctoML | NRF5340DK | - | Arm® Cortex®-M33 | | microTVM using CMSIS-NN backend | 128MHz | 232.0 | | 316.1 | | 76.1 | | 6.27 | |
| OctoML | NUCLEO-L4R5ZI | IT6U | Arm® Cortex®-M4 | | microTVM using CMSIS-NN backend | 120MHz, 1.8Vbat | 301.2 | 15531.4 | 389.5 | 20236.3 | 99.8 | 5230.3 | 8.60 | 443.2 |
| OctoML | NUCLEO-L4R5ZI | IT6U | Arm® Cortex®-M4 | | microTVM using native codegen | 120MHz, 1.8Vbat | 336.5 | 17131.6 | 389.2 | 21342.3 | 144.0 | 7950.5 | 11.7 | 633.7 |
| Plumerai | B_U585I_IOT02A | | Arm® Cortex®-M33 | | Plumerai Inference Engine 2022.09 | 160MHz | 107.0 | | 107.1 | | 35.4 | | 4.90 | |
| Plumerai | CY8CPROTO-062-4343w | | Arm® Cortex®-M4 | | Plumerai Inference Engine 2022.09 | 150MHz | 192.5 | | 193.1 | | 61.4 | | 6.70 | |
| Plumerai | DISCO-F746NG | STM32F746 | Arm® Cortex®-M7 | | Plumerai Inference Engine 2022.09 | 216MHz | 57.0 | | 64.8 | | 19.1 | | 2.30 | |
| Plumerai | NUCLEO-L4R5ZI | STM32L4R5ZIT6U | Arm® Cortex®-M4 | | Plumerai Inference Engine 2022.09 | 120MHz | 208.6 | | 173.2 | | 71.7 | | 5.60 | |
| Silicon Labs | xG24-DK2601B | | Arm® Cortex®-M33 | Silicon Labs MVP(1) | TensorFlowLite for Microcontrollers, CMSIS-NN, Silicon Labs Gecko SDK | | 111.6 | 1139.2 | 120.9 | 1234.7 | 36.3 | 401.9 | 5.43 | 47.3 |
| STMicroelectronics | NUCLEO-H7A3ZI-Q | IT6Q | Arm® Cortex®-M7 | | X-CUBE-AI v7.3.0 | 280MHz, 3.3Vbat | 50.7 | 7978.5 | 54.3 | 8707.3 | 16.8 | 2721.8 | 1.82 | |
| STMicroelectronics | NUCLEO-L4R5ZI | STM32L4R5ZIT6U | Arm® Cortex®-M4 | | X-CUBE-AI v7.3.0 | 120MHz, 1.8Vbat | 230.5 | 10066.6 | 226.9 | 10681.6 | 75.1 | 3371.7 | 7.57 | 323.0 |
| STMicroelectronics | NUCLEO-U575ZI-Q | | Arm® Cortex®-M33 | | X-CUBE-AI v7.3.0 | 160MHz, 1.8Vbat | 133.4 | 3364.5 | 139.7 | 3642.0 | 44.2 | 1138.5 | 4.84 | 119.1 |
| Syntiant | NDP9120-EVL | NDP120 | M0 + HiFi | Syntiant Core 2 (98MH | Syntiant TDK | Syntiant Core 2 (98MHz, | 4.10 | 97.2 | 5.12 | 139.4 | 1.48 | 43.8 | | |
| Syntiant | NDP9120-EVL | NDP120 | M0 + HiFi | Syntiant Core 2 (30MH | Syntiant TDK | Syntiant Core 2 (30MHz, | 12.7 | 71.7 | 16.0 | 101.8 | 4.37 | 31.5 | | |
| Qualcomm Innovation Center | Next Generation Snapdragon Mobile Platform HDK | | Qualcomm Kryo CPU(1) | Qualcomm Sensing Hub(1) | Qualcomm AI Stack | | | | | | | | 0.098 | |
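Latency and energy per inference together give the average power drawn while the inference runs. The sketch below applies this to the GAP9 row at 370 MHz, 0.8 V. It assumes that the second number in each benchmark pair is energy per inference in µJ (the usual MLPerf Tiny convention); only the latency unit, ms, is legible in the extracted table header.

```python
def average_power_mw(energy_uj: float, latency_ms: float) -> float:
    """uJ / ms = mW: average power over one inference."""
    return energy_uj / latency_ms


# (latency in ms, energy in uJ) per benchmark for GAP9 at 370 MHz, 0.8 V
gap9 = {
    "Visual Wake Words": (1.13, 58.4),
    "CIFAR-10": (0.62, 40.4),
    "Keyword Spotting": (0.48, 26.7),
}

for name, (latency_ms, energy_uj) in gap9.items():
    print(f"{name}: {average_power_mw(energy_uj, latency_ms):.1f} mW")
```

Under that assumption all three workloads run at a few tens of milliwatts.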









# AI Innovation beyond "NVIDIA Gravity" is Challenging!



- It's the software → flexibility, fast evolution!
- Need an open standard to counter a monopoly



RISC-V: The Free and Open RISC Instruction Set Architecture







### **RISC-V** is Accelerating







EuroHPC: 200+ M€ for RV HPC (DARE FPA)

Chips (KDT): 300+ M€ for RV Automotive





India Ministry for Electronics & Information Technology launched Digital India RISC-V (DIR-V) program for commercial SHAKTI & VEGA silicon.



Industry Leaders Launch RISE to Accelerate the Development of Open Source Software for RISC-V

















Six chip giants to drive RISC-V application in automotive, enhance industry resilience

























### Fully Open-Source Deployment Flow!









### Open SW & HW Embodied AI Platform?











# Curtailing IP<sub>€</sub>: Open-Source Hardware







#### **Platforms**





- PULPino, PULPissimo
- Cheshire



- OpenPULP
- ControlPULP



- Hero, Carfield, Astral
- Occamy, Mempool

### IOT

#### **Accelerators and ISA extensions**

- XpulpNN, XpulpTNN
- ITA (Transformers)
- RBE, NEUREKA (QNNs)
- FFT (DSP)
- REDMULE (FP-Tensor)







# We make everything (we can) available openly



- All our development is on GitHub using a <u>permissive</u> license
  - HDL source code, testbenches, software development kit, virtual platform

### https://github.com/pulp-platform



Allows anyone to use, change, and make products without restrictions.









# Curtailing EDA<sub>€</sub>: Open-Source Implementation?









# End-to-end Open-Source Digital IC Design is Possible Today!









### Basilisk: Open RTL, Open EDA, Open PDK





### **Designed in IHP 130nm OpenPDK**

- 6.25mm x 5.50mm
- 60MHz
- 1.08 MGE logic, 60% density
- 24 SRAM macros (114 KiB)
- CVA6 based SoC
  - Runs and boots Linux
- **Active collaboration with**















### Basilisk SoC: Cheshire Platform


- Multi-million gate design
- 64-bit RISC-V Core
  - Complete Linux-capable SoC
  - Simple "Raspberry Pi"
- Rich Peripherals
  - Includes an open USB 1.1 host
- Open-source DRAM interface
  - Digital-only interface
- Silicon-proven
  - Multiple tapeouts with commercial EDA



github.com/pulp-platform/cheshire-ihp130-o





### Mlem is our 2<sup>nd</sup> end-to-end open SoC: Croc Platform



- Scalable ULP design
- 32-bit RISC-V Core
  - Complete Linux-capable SoC
  - Simple "Raspberry Pi"
- Rich Peripherals
- Ready for Acceleration
  - Digital-only interface
- Silicon-proven
  - Tapeouts with open & commercial EDA





github.com/pulp-platform/croc





### Open-source vs. Commercial EDA – Reality Check



- SV-to-Verilog chain @ <2min runtime
- Yosys synthesis:
  - → 1.1 MGE (1.6×) @ 77 MHz (2.3×)
  - → 1.4× less runtime, 2.4× less peak RAM
- OpenROAD P&R: tuning
  - → -12% die area, +10% core utilization

### Improvements June-October

- Yosys-slang replaces SV2V
  - 1.6× less runtime, 10× less peak RAM
  - -10% logic area (preliminary)

Figure: Logic Area (MGE) vs. Critical Path (ns). Labeled points: Baseline, MUX, LMS, today, Fastest, Commercial. Values shown: (30 ns, 1.8 MGE), (27 ns, 1.4 MGE), (14 ns, 1.1 MGE), (13 ns, 1.1 MGE), (10 ns, 1.4 MGE).

### **Open EDA is maturing really fast!**

Recommendations and Roadmap for **Open-Source EDA** in Europe

### Does it Make Sense for a Foundry?





# $\mathrm{Cost} = \mathrm{IP}_{\text{€}} + \mathrm{EDA}_{\text{€}} + \mathrm{SI}_{\text{€}}$

1. Silicon cost remains the bottom line
2. Openness facilitates ecosystem build-up
3. Eases life-cycle (training, audit, certification, support)
4. Great to boost return (€) on a mature node
5. Hybrid models are always possible

### But...

Need a mature node for **Energy-Efficient Digital** (**22FDX**)

### **Embodied AI is the Perfect Target Market for End-to-end Open Platforms!**











pulp-platform.org

# Thank You!

