

# Designing "Artificial Brains" for Next-Generation Autonomous Systems

Luca Benini lbenini@iis.ee.ethz.ch, luca.benini@unibo.it

**PULP Platform** Open Source Hardware, the way it should be! @pulp\_platform pulp-platform.org youtube.com/pulp\_platform

#### Autonomous Systems: Roadmap

Path towards full autonomy, plotted as compute power (TFLOPS) and networking speed (Gbit/s), both on a 0.1 to 100 scale [SCR23]:

- 2010 - 2018, Level 1-2 (simple aid): local computing, decision "behind" every sensor
- 2019 - 2025, Level 2-3 (assistant): centralized computing integrates input from all sensors (sensor fusion), similar to a human driver's brain
- 2025..., Level 4-5 (self driving): high-performance brain with a high-speed, reliable & secure nervous system
- **Efficient** on-car computing: $P_{MAX} < 1.5 \, kW$

ETH zürich



### **Embodied AI**



[AMD HotChips24]





**Efficient** on-car computing: P<sub>MAX</sub> < 1.5 kW

- Model complexity: 10× every ~2.5 years
- Moore's Law: 10× every 12 years!
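To make the gap between the two growth rates concrete, here is a minimal sketch. It assumes pure exponential growth at exactly the rates quoted above; the function names are illustrative, not from the talk.

```python
def growth_factor(years: float, tenfold_period_years: float) -> float:
    """Growth after `years` for a quantity that grows 10x every `tenfold_period_years`."""
    return 10.0 ** (years / tenfold_period_years)


def complexity_gap(years: float,
                   model_period: float = 2.5,
                   moore_period: float = 12.0) -> float:
    """Ratio of model-complexity growth to technology-scaling growth after `years`."""
    return growth_factor(years, model_period) / growth_factor(years, moore_period)


if __name__ == "__main__":
    for years in (2.5, 5.0, 12.0):
        print(f"after {years:>4} years: gap = {complexity_gap(years):.1f}x")
```

Under these assumptions the gap itself grows exponentially, which is the motivation for specialization in the rest of the deck.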

### **Autonomous Nano-Drones**

#### Advanced autonomous drone

A. Bachrach, "Skydio autonomy engine: Enabling the next generation of autonomous flight," IEEE Hot Chips 33 Symposium (HCS), 2021



https://www.skydio.com/skydio-2-plus

- 3D Mapping & Motion Planning
- **Object recognition & Avoidance**
- 0.06m<sup>2</sup> & 800g of weight
- Battery Capacity 5410 mAh



#### Nano-drone





https://www.bitcraze.io/products/crazyflie-2-1

- Smaller form factor of 0.008m<sup>2</sup>
- Weight: **27 g (30× lighter)**
- Battery capacity: 250 mAh (20× smaller)



Intelligence in a 30× smaller payload, 20× lower energy budget?



# Achieving True Autonomy on Nano-UAVs

Multiple, complex, heterogeneous tasks at high speed and robustness, fully on board

ALMA MATER STUDIORUM

Obstacle avoidance & Navigation





#### **Object detection**



Environment exploration



Multi-GOPS workload at extreme efficiency  $\rightarrow P_{max}$  100mW

### **Efficiency through Heterogeneity: Multi-Specialization**

**Brain-inspired**: multiple areas, different structure, different function!



# Kraken: 22nm SoC, Multiple Heterogeneous Accelerators



The Kraken: an "Extreme Edge" Brain

- RISC-V Cluster
  - 8 Compute cores + 1 DMA core
- CUTIE
  - Dense ternary-neural-network accelerator
- SNE
  - Energy-proportional spiking-neural-network accelerator

Die photo labels: SoC Domain, Cluster Domain (PULPO), SNE, CUTIE.

| Parameter    | Value                 |
|--------------|-----------------------|
| Technology   | 22 nm FDSOI           |
| Chip Area    | 9 mm<sup>2</sup>      |
| SRAM SoC     | 1 MiB                 |
| SRAM Cluster | 128 KiB               |
| VDD range    | 0.55 V - <b>0.8 V</b> |
| Cluster Freq | ~370 MHz              |
| SNE Freq     | ~250 MHz              |
| CUTIE Freq   | ~140 MHz              |

### **CUTIE: Perception from Frame Sensors**



#### **Output channel compute unit (OCU)**

- Completely Unrolled Ternary Neural Inference Engine: K × K window, all input channels, cycle-by-cycle sliding
- One Output Compute Unit (OCU) computes one output activation per cycle!
- Zeros in weights and activations, spatial smoothness of activations reduce switching activity

Aggressive quantization and full specialization
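The OCU behaviour described above can be sketched in software as follows. This is a functional illustration only: the two-threshold ternarization, the default threshold values, and the mapping of one OCU per output channel are assumptions made here, not details confirmed by the slides.

```python
from typing import List

# Indexed [row][col][channel]; every value is in {-1, 0, +1}.
Tensor3 = List[List[List[int]]]


def ternarize(acc: int, lo: int, hi: int) -> int:
    """Map an integer accumulator to {-1, 0, +1} using two thresholds (assumed scheme)."""
    if acc > hi:
        return 1
    if acc < lo:
        return -1
    return 0


def ocu_step(window: Tensor3, weights: Tensor3, lo: int = -1, hi: int = 1) -> int:
    """One output activation: K x K window over all input channels, fully unrolled in HW."""
    acc = 0
    for win_row, w_row in zip(window, weights):
        for win_px, w_px in zip(win_row, w_row):
            for a, w in zip(win_px, w_px):
                acc += a * w  # zero weights or activations contribute nothing
    return ternarize(acc, lo, hi)


def conv_layer(fmap: Tensor3, kernels: List[Tensor3], lo: int = -1, hi: int = 1) -> Tensor3:
    """Slide the window over the feature map (no padding); one kernel per output channel."""
    k = len(kernels[0])
    out = []
    for y in range(len(fmap) - k + 1):
        row = []
        for x in range(len(fmap[0]) - k + 1):
            window = [r[x:x + k] for r in fmap[y:y + k]]
            # In hardware these would run in parallel, one OCU per output channel.
            row.append([ocu_step(window, kern, lo, hi) for kern in kernels])
        out.append(row)
    return out
```

The inner triple loop is what the hardware unrolls completely, so each window position costs one cycle instead of K × K × C_in multiply-accumulates in sequence.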

# Kraken's CUTIE Implementation



#### 1fJ/MAC (1POPS/W) – Ternary OPS




General Purpose: Domain-Specialized RV32 Core (PE)

Specialization cost: power and area $1.5 \times \uparrow$, time $15 \times \downarrow$ $\rightarrow$ energy $E = P \cdot T$: $10 \times \downarrow$
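As a worked check of these numbers (the subscripts *spec* and *base* are labels introduced here for the specialized and the general-purpose core):

$$\frac{E_{spec}}{E_{base}} = \frac{P_{spec}}{P_{base}} \cdot \frac{T_{spec}}{T_{base}} = 1.5 \times \frac{1}{15} = 0.1$$

So a 1.5× power increase combined with a 15× shorter run time yields 10× lower energy per task.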


# **PULP Paradigm:** A PE **cluster** accelerates a host system






# **SNE: Perception on Event Sensors**

Event sensors (DVS camera)

- Ultra-low latency
- Energy-proportional interface



#### Spiking Neural Engine (SNE)



Leaky Integrate & Fire (LIF) neurons



[Di Mauro et al. DATE22]

#### SNE works seamlessly with DVS (event-based) sensors


# Event consumption, and output spikes generation



More complex dynamics than in conventional DNN neurons:

- Membrane potential accumulation/activation: 1× SynAcc = 1× 4b-ADD + 1× 8b-COMPARE
- Membrane potential decay: 1× SynDec = (1× 8b-MUL) + (1× 8b-MUL + 1× 8b-ADD)
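A minimal integer sketch of a leaky integrate-and-fire neuron illustrates the two operations. The threshold, the reset-to-zero rule, and the single-multiply leak are simplifying assumptions made here; the SNE decay uses two multiplies and an add, which this sketch does not reproduce.

```python
class LIFNeuron:
    """Integer leaky integrate-and-fire neuron with illustrative parameters."""

    def __init__(self, threshold: int = 32, decay_num: int = 7, decay_shift: int = 3):
        self.threshold = threshold      # firing threshold (hypothetical value)
        self.decay_num = decay_num      # leak factor = decay_num / 2**decay_shift
        self.decay_shift = decay_shift
        self.v = 0                      # membrane potential, clamped to 8 bits

    def on_event(self, weight: int) -> bool:
        """SynAcc-like step: add a signed 4-bit weight, then compare to the threshold."""
        if not -8 <= weight <= 7:
            raise ValueError("weight must fit in a signed 4-bit value")
        self.v = max(-128, min(127, self.v + weight))
        if self.v >= self.threshold:
            self.v = 0                  # assumed reset after an output spike
            return True
        return False

    def decay(self) -> None:
        """Simplified leak: scale the potential toward zero."""
        magnitude = (abs(self.v) * self.decay_num) >> self.decay_shift
        self.v = magnitude if self.v >= 0 else -magnitude


def count_spikes(neuron: LIFNeuron, weights: list) -> int:
    """Feed one input event per time step and count the output spikes."""
    spikes = 0
    for w in weights:
        spikes += neuron.on_event(w)
        neuron.decay()
    return spikes
```

Work happens only when an event arrives, which is where the energy-proportional behaviour comes from: no input events, no accumulate or compare operations.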

# **Kraken Shield and System Architecture**

7g payload


- DVS and frame-based cameras  $\rightarrow$  real-time multi-modal perception.
- Designed for integration into nano-UAV platforms






# Kraken Power Consumption (all Included)

Combined power consumption of SNE, CUTIE, PULP cluster

| Model | Inference/s | μJ/inf | Power (mW) |
|-------|-------------|--------|---------------|
| SNE   | 1.02k       | 18     | 98            |
| CUTIE | 10k         | 6      | 110           |
| PULP  | 221         | 750    | 165           |

#### P=373mW, representing just 5% of the UAV's power budget
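The total can be checked directly from the power column of the table. The implied UAV budget below is derived here from the quoted 5% share; it is not a figure stated in the slides.

```python
# Power per block in mW, copied from the table above.
POWER_MW = {"SNE": 98, "CUTIE": 110, "PULP": 165}


def total_power_mw(power_mw: dict) -> int:
    """Combined power with all three blocks active."""
    return sum(power_mw.values())


def implied_budget_w(total_mw: float, share: float) -> float:
    """Overall power budget implied by `total_mw` being `share` of it (derived estimate)."""
    return total_mw / share / 1000.0


if __name__ == "__main__":
    total = total_power_mw(POWER_MW)
    print(f"total = {total} mW, implied UAV budget = {implied_budget_w(total, 0.05):.2f} W")
```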







# Heterogeneous, Multiscale Accelerated Computing



#### **Multiple Scales of acceleration**

Extensions to processor cores

- Explore new extensions
- Efficient implementations

Shared-memory Accelerators

- Domain specific
- Local memory

Multiple Decoupled Accelerators

- Communication
- Synchronization

Figure: a computing cluster with tightly coupled accelerators (RISC-V cores and accelerators sharing memory banks through a tightly coupled data memory interconnect, with DMA and instruction cache), attached to a host, L2 memory, an external memory controller, peripherals, and decoupled accelerator clusters #1 to #M.

#### RISC-V is a key enabler $\rightarrow$ max agility, enabling SW build-up, without vendor lock-in

High-speed on-chip interconnect (NoC, AXI, other...)





### **Tightly-coupled Accelerators**









#### Energy efficiency 10-20× (0.1pJ/OP) w.r.t. SW on cluster @same accuracy


# Specialization in perspective

Using 22FDX tech, NT@0.6V, High utilization, minimal IO & overhead







# Beyond Perception: Reasoning with Gen.AI

#### LLM Reasoning on Human Commands & Robot Observations







### Pervasive Gen.AI Challenge

OpenAI'23 arXiv:2303.08774







Performance of GPT-4 and smaller models. y-axis: mean log pass rate on a subset of the HumanEval dataset; x-axis: training compute (log scale). Dotted line: a power-law fit to the smaller models (excluding GPT-4), which accurately predicts GPT-4's performance.

# There is no Other Way to Go, but UP



#### **Multiple Scales of acceleration**

Extensions to processor cores

- Explore new extensions
- Efficient implementations

Shared-memory Accelerators

- Domain specific
- Local memory

Multiple Decoupled Accelerators

- Communication
- Synchronization



#### Specialize interconnects too! Local, global, package, system

### Snitch Core: Latency Tolerant, Extensible RV PE

- Snitch: tiny (20KGE), extensible RV core
  - Extensible through accelerator port
  - Latency-tolerant through scoreboard + ld/st queue → can issue ~10 non-blocking memOPs
  - Tolerates 10 cycles of memory latency (Little's law)
- Paired with ISA extension subsystem
- Native streaming support
  - Load/store elision
  - Reduction of I\$ pressure
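The Little's law argument for Snitch can be written out as a short sketch. It assumes the core issues at most one memory operation per cycle; the function names are illustrative.

```python
def outstanding_requests(latency_cycles: float, requests_per_cycle: float = 1.0) -> float:
    """Little's law: requests in flight = throughput x latency."""
    return latency_cycles * requests_per_cycle


def tolerable_latency(max_outstanding: int, requests_per_cycle: float = 1.0) -> float:
    """Longest memory latency that can be hidden with `max_outstanding` requests in flight."""
    return max_outstanding / requests_per_cycle


if __name__ == "__main__":
    # ~10 non-blocking memory operations at one issue per cycle hide ~10 cycles of latency.
    print(tolerable_latency(10))
```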







### SSR & FREP: Streaming Extension

- SSR: links register reads/writes to implicit LD/ST
  - Extension around the core's register file
  - Address generators (2-3KGE/SSR)
  - Configured out of inner loop (LD/ST elision)
  - Staggering: generators prefetch from memory (latency tolerant!)
- FREP: L0 instruction buffer (no I\$ access)
  - Pseudo-dual issue (Int pipeline can proceed in parallel)
  - No boundary checking for loop (similar to HW loops in DSPs)
  - Boost FPU utilization → 100% (once setup is amortized)

| dotp: 30% FPU | dotp: 90% FPU |
|---------------|---------------|
| loop:<br><b>fld r0</b>, %[a]<br><b>fld r1</b>, %[b]<br><b>fmadd</b> r2, <b>r0</b>, <b>r1</b> | scfg 0, %[a], ldA<br>scfg 1, %[b], ldB<br>loop:<br>fmadd r2, ssr0, ssr1 |
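A rough issue-slot model shows why eliding the loads helps. It assumes a single-issue core that retires one instruction per cycle and ignores loop bookkeeping and stalls, so it only approximates the 30% and 90% figures quoted in the table.

```python
def fpu_utilization(n_iters: int, fpu_per_iter: int, other_per_iter: int,
                    setup_instrs: int = 0) -> float:
    """Fraction of issue slots spent on FPU instructions (simplified model)."""
    fpu = n_iters * fpu_per_iter
    total = setup_instrs + n_iters * (fpu_per_iter + other_per_iter)
    return fpu / total


if __name__ == "__main__":
    # Baseline loop body: two explicit loads plus one fmadd.
    print(f"baseline: {fpu_utilization(1000, 1, 2):.0%}")
    # SSR loop body: only the fmadd; two configuration instructions are paid once.
    print(f"with SSR: {fpu_utilization(1000, 1, 0, setup_instrs=2):.0%}")
```

In this model the remaining loss with SSRs is only the one-time setup, which matches the "once setup is amortized" caveat.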





Latency Tolerance: Less expensive than OoO (CPU) and Multi-threading (GPU)

# Snitch Cluster: The Fundamental Compute Block

- 8 Snitch compute cores
  - SIMD 64b FPU with SSRs & FREP
- 9<sup>th</sup> Core: DMA engine
  - 512b interface to interconnect
  - HW support for autonomous ≤ 2D transfers, higher dimensions through SW
  - Latency-tolerant block transfers (100s of cycles)
- 128 KiB TCDM
  - 32-bank, low-latency shared scratchpad
  - Double-buffer large chunks (KBs) with DMA
- Shared TCDM, I-cache and peripherals







# Specializing the Cluster for Gen.AI

- Attention is key
  - Attention matrix is a square matrix of order input length
  - Quadratic memory requirement vs. sequence length
  - No asymmetry between operands ("weightless")
  - MatMul & Softmax dominate

$$\mathrm{Softmax}(\mathbf{x})_i = \frac{e^{x_i - \max(\mathbf{x})}}{\sum_{j=1}^{n} e^{x_j - \max(\mathbf{x})}}$$
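The max-subtracted Softmax and the quadratic size of the attention matrix can be sketched in plain Python. The one-byte element size in `attention_matrix_bytes` is an illustrative default, not a figure from the slides.

```python
import math
from typing import List


def softmax(x: List[float]) -> List[float]:
    """Softmax with the row maximum subtracted first, so exp() cannot overflow."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def attention_matrix_bytes(seq_len: int, bytes_per_element: int = 1) -> int:
    """Storage for one seq_len x seq_len attention matrix (quadratic in sequence length)."""
    return seq_len * seq_len * bytes_per_element


if __name__ == "__main__":
    print(softmax([1.0, 2.0, 3.0]))
    print(attention_matrix_bytes(4096) / 2**20, "MiB per head at 1 byte per element")
```

Subtracting the maximum leaves the result unchanged because the common factor cancels between numerator and denominator.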



# Matmul Benefits from Large Shared-L1 clusters

- Why?
  - Better global latency tolerance if L1<sub>size</sub> > 2× L2<sub>latency</sub> × L2<sub>bandwidth</sub> (Little's law + double buffer)
  - Smaller data partitioning overhead
  - Larger Compute/Boundary bandwidth ratio: N<sup>3</sup>/N<sup>2</sup> for MMUL grows linearly with N!
- A large "MemPool": 256+ cores and 1+ MiB of shared L1 data memory
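Both sizing arguments reduce to short formulas. The sketch below assumes square N × N operands and counts boundary traffic as reading A and B and writing C once; the constant factor 3 is an assumption of this model, while the linear growth in N is the point made above.

```python
def min_l1_bytes(l2_latency_cycles: int, l2_bytes_per_cycle: int) -> int:
    """Smallest L1 that hides L2 latency with double buffering (2 x latency x bandwidth)."""
    return 2 * l2_latency_cycles * l2_bytes_per_cycle


def matmul_compute_to_boundary_ratio(n: int) -> float:
    """MACs per element moved across the cluster boundary for an N x N matmul."""
    macs = n ** 3
    boundary_elements = 3 * n * n   # read A, read B, write C (assumed traffic model)
    return macs / boundary_elements


if __name__ == "__main__":
    # Hypothetical numbers: 100-cycle L2 latency, 64 B/cycle L2 bandwidth.
    print(min_l1_bytes(100, 64), "bytes of L1 needed")
    for n in (64, 128, 256):
        print(n, matmul_compute_to_boundary_ratio(n))
```

Doubling the tile size that fits in L1 doubles the compute performed per byte fetched, which is why a larger shared L1 pays off for matmul.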



**MemPool Cluster** 

![](_page_26_Picture_9.jpeg)

# MemPool Cluster: A physical-aware design

- A Scalable Manycore Architecture with Low-Latency Shared L1 Memory
  - 256+ cores
  - 1+ MiB of shared L1 data memory
  - ≤ 8 cycle latency (Snitch can handle it)
- Hierarchical design
- Implemented in GF22
  - Targeting 500 MHz (SS/0.72V/125°C)
  - Reaching 600 MHz (TT/0.80V/25°C)
  - Targeting iso-frequency with PULP
- Cluster area of 13 mm<sup>2</sup>
  - 5 mm diagonal
  - Round trip in 5 cycles
- Terapool: 1024 Cores!

#### **MemPool Group**

![](_page_27_Picture_15.jpeg)

![](_page_27_Picture_16.jpeg)


# MemPool + Integer Transformer Accelerator (ITA)

#### **Tightly coupled Acceleration Engine**

- Matmul & Softmax
- Reduce pressure on memory and interconnect

#### **Collaborative Execution**

- Cores prepare activations for the next attention head
- Final head accumulation computed in cores
- Nonlinearity in cores (PACE)

![](_page_28_Figure_8.jpeg)

# MemPool + Integer Transformer Accelerator (ITA)

#### **Integer Attention Accelerator**

- 8-bit inputs, weights & outputs
- Built-in data marshaling & pipelined operation
- Streaming partial Softmax adding no additional latency
- Fused  $Q \times K^T$ , Softmax and  $A \times V$  computation
- Support for hardware-aware Softmax approximation in QuantLib
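The idea behind a streaming partial Softmax can be shown with the generic running-maximum formulation. This is a floating-point illustration of the principle only; it is not ITA's integer approximation, and the function names are made up for this sketch.

```python
import math
from typing import Iterable, List, Tuple


def streaming_softmax_stats(chunks: Iterable[List[float]]) -> Tuple[float, float]:
    """Accumulate the row maximum and the exp-sum while non-empty chunks stream in."""
    running_max = -math.inf
    running_sum = 0.0
    for chunk in chunks:
        new_max = max(running_max, max(chunk))
        # Rescale the partial sum to the new maximum, then add this chunk's terms.
        running_sum = (running_sum * math.exp(running_max - new_max)
                       + sum(math.exp(v - new_max) for v in chunk))
        running_max = new_max
    return running_max, running_sum


def normalize(row: List[float], row_max: float, exp_sum: float) -> List[float]:
    """Final normalization once the statistics of the whole row are known."""
    return [math.exp(v - row_max) / exp_sum for v in row]


if __name__ == "__main__":
    row = [0.5, 2.0, -1.0, 3.0]
    row_max, exp_sum = streaming_softmax_stats([row[:2], row[2:]])
    print(normalize(row, row_max, exp_sum))
```

Because the statistics are updated as each partial result arrives, the Softmax denominator is ready as soon as the last chunk of the row has been produced, without a separate pass over the data.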

![](_page_29_Figure_7.jpeg)

![](_page_29_Figure_8.jpeg)


### Attention on ITA

- Performance increase of **15×**
- Energy efficiency increase of **36×**
- Area efficiency increase of **74×**

![](_page_30_Figure_4.jpeg)

#### **Attention Efficiency**

# Scaling UP: Efficient and Flexible Data Movement

![](_page_31_Picture_1.jpeg)

# **Problem:** HBM Accesses are critical in terms of

- Access energy
- Congestion
- High latency

Instead reuse data on lower levels of the memory hierarchy

- Between clusters
- Across groups

Smartly distribute workload

- Clusters: Tiling, Depth-First
- Chiplets: E.g. Layer pipelining

### **Big trend!**

![](_page_31_Picture_13.jpeg)

### Addressing interconnect scalability

![](_page_32_Picture_1.jpeg)

- Fat-tree was very challenging in implementation
  - AXI has severe scalability issues
  - Top-level Xbar had to be split up
  - Still, interconnect takes up almost 40%\*
- Working on NoC solution, *FlooNoC*
  - Fully AXI4 compatible
  - Solves AXI4 scalability issues
  - Designed with awareness of physical design
  - Wide & physical channels

![](_page_32_Picture_11.jpeg)

# Replacing the AXI interconnect with a NoC

![](_page_33_Picture_1.jpeg)

- Potential for big area/performance gains
  - Only ~10% interconnect area
  - 66% more clusters, same floorplan
  - *High Bandwidth*: 629 Gbps/link
  - High Energy-Efficiency: 0.19 pJ/B/hop
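The two link figures translate into per-transfer cost as follows. This is a back-of-the-envelope sketch that assumes a transfer uses one link at full rate with no protocol overhead; the function names are illustrative.

```python
LINK_BANDWIDTH_GBPS = 629.0          # Gbit/s per link, from the slide
ENERGY_PJ_PER_BYTE_PER_HOP = 0.19    # pJ/B/hop, from the slide


def transfer_energy_pj(num_bytes: int, hops: int) -> float:
    """Energy to move `num_bytes` across `hops` NoC hops."""
    return ENERGY_PJ_PER_BYTE_PER_HOP * num_bytes * hops


def min_transfer_time_ns(num_bytes: int) -> float:
    """Lower bound on transfer time over one link (1 Gbit/s equals 1 bit/ns)."""
    return num_bytes * 8 / LINK_BANDWIDTH_GBPS


if __name__ == "__main__":
    # Example: a 4 KiB tile crossing 4 hops.
    print(f"{transfer_energy_pj(4096, 4):.1f} pJ, at least {min_transfer_time_ns(4096):.1f} ns")
```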

![](_page_33_Figure_7.jpeg)

![](_page_33_Picture_8.jpeg)

![](_page_33_Picture_9.jpeg)

![](_page_33_Picture_11.jpeg)

### MHA Mapping on NoC: FlattenAttention

- Proposed Dataflow Schedule of MHA
  - We leverage all-cluster L1 for single-head attention: focus on only one head at a time, processing the heads sequentially
  - Minimize I/O complexity
  - Gen.AI specialized NoC
    - Matrix transpose engine for transposition of  $(K \rightarrow K^T)$
    - Collective operations on NoC
- Benchmark & Results
  - 16x16 Clusters (8TFLOPS, 256kB L1), 2TB/s HBM
  - One layer MHA of Llama3-70B (seq=4K, batch=8)
  - Efficient collective operation support on NoC is essential
    - 3x speedup to baseline

![](_page_34_Figure_11.jpeg)

# Scaling UP: From Chip to chiplets

![](_page_35_Figure_1.jpeg)

**Occamy System** 

![](_page_35_Picture_3.jpeg)

![](_page_35_Figure_4.jpeg)

Figure labels: Snitch cluster with compute cores CC 0 to CC 8 (each with an L0 I\$), shared L1 I\$, peripherals, a shared L1 scratchpad crossbar over 32 banks grouped into SuperBanks SB 0 to SB 3, ZeroMemory, and a 512b AXI crossbar.

![](_page_35_Picture_5.jpeg)

![](_page_35_Picture_6.jpeg)


![](_page_35_Picture_8.jpeg)

# Not Only Layer-by-Layer distribution across Chiplets!

![](_page_36_Figure_1.jpeg)


![](_page_37_Figure_1.jpeg)

![](_page_37_Figure_2.jpeg)

![](_page_37_Picture_3.jpeg)

![](_page_37_Picture_4.jpeg)

![](_page_38_Picture_1.jpeg)

![](_page_38_Figure_2.jpeg)

![](_page_39_Picture_1.jpeg)

![](_page_39_Picture_2.jpeg)

![](_page_40_Picture_1.jpeg)

![](_page_40_Figure_2.jpeg)

![](_page_41_Picture_1.jpeg)

![](_page_41_Figure_2.jpeg)

![](_page_42_Figure_0.jpeg)

### What's next?

![](_page_43_Picture_1.jpeg)

![](_page_43_Figure_2.jpeg)


### What's next?

![](_page_44_Picture_1.jpeg)

![](_page_44_Picture_2.jpeg)

![](_page_44_Picture_4.jpeg)

### What's next?

![](_page_45_Figure_1.jpeg)


![](_page_46_Picture_0.jpeg)

# Thank You!