

#### Toward Gen.AI Pervasive Intelligent Systems: An Open RISC-V Platform Approach

Luca Benini lbenini@iis.ee.ethz.ch

**PULP Platform** Open Source Hardware, the way it should be!



@pulp\_platform >> pulp-platform.org



### Perception $\rightarrow$ Gen.AI $\rightarrow$ Pervasive Gen.AI



Interactive, creative



Efficient, RT-safe, secure





#### Pervasive Gen.AI: Robots

#### LLM Reasoning on Human Commands & Robot Observations







#### Pervasive Gen.AI: AI-native PHY for RAN





https://developer.nvidia.com/blog/real-time-neural-receivers-drive-ai-ran-innovation/

H. Ye, L. Liang, G. Y. Li and B.-H. Juang, "Deep Learning-Based End-to-End Wireless Communication Systems With Conditional GANs as Unknown Channels," IEEE Transactions on Wireless Communications, vol. 19, no. 5, 2020





#### Pervasive Gen.AI Challenge

OpenAI'23 arXiv:2303.08774







Figure: Performance of GPT-4 and smaller models. The y-axis is the mean log pass rate on a subset of the HumanEval dataset; the x-axis is training compute (log scale). Dotted line: a power-law fit to the smaller models (excluding GPT-4), which accurately predicts GPT-4's performance.

#### Technology is not Enough





- On-car computing: P<sub>MAX</sub> < 1.5 kW
- Model complexity: 10× every ~2.5 years
- Moore's Law: 10× every 12 years!

[AMD HotChips24]

#### **Efficiency through Heterogeneity: Multi-Specialization**. Brain-inspired: multiple areas, different structure, different function!



ALMA MATER STUDIORUM

**ETH** zürich

Somatosensory Association Area: understanding of weight, texture, temperature, etc. for recognizing and comprehending an object


Hailo-10H M.2 Key M ET **Generative AI Acceleration** Module (40 TOPS, few TOPS/W)





### Looking up to the Leader

ALMA MATER STUDIORUM Università di Bologna


Dally HotChips 2023

Figure (Int8 TOPS vs. date, 4/1/12 to 3/15/23): Gains from Single-Chip Inference Performance - 1000X in 10 years.

- Data points (Int8 TOPS): K20X 3.94, M40 6.84, P100 21.20, V100 125, Q8000 261, A100 1248, H100 4000
- Number Representation (FP32, FP16, Int8, FP8; Transformer Engine, TF32, BF16): ~16x
- Complex Instructions (DP4, HMMA, IMMA): ~12.5x
- Process (28nm, 16nm, 7nm, 5nm): ~2.5x
- Structured Sparsity: ~2x

Model efficiency has also improved - overall gain > 1000x

### Why Does NVIDIA Own the Market?

- It's the software → flexibility, fast evolution!
- Is there a way to escape "NVIDIA gravity"?
- A standard is needed to combat a monopoly


RISC-V: The Free and Open RISC Instruction Set Architecture







RISC-V is a key enabler  $\rightarrow$  max agility, enabling SW build-up, without vendor lock-in

### Heterogeneous, Multiscale Accelerated Computing


#### **Multiple Scales of acceleration**

Extensions to processor cores

- Explore new extensions
- Efficient implementations

Shared-memory Accelerators

- Domain specific
- Local memory

Multiple Decoupled Accelerators

- Communication
- Synchronization



#### Specialize interconnects too! Local, global, package, system

### Snitch Core: Tiny, Latency Tolerant, Extensible RV PE

- Snitch: tiny (20KGE), extensible RV core
  - Extensible through accelerator port
  - Latency-tolerant through scoreboard
     → can issue ~10 non-blocking memOPs
- Paired with ISA extension subsystem
- Native streaming support
  - Load/store elision
  - Reduction of I\$ pressure





#### **ISA Extension:** Quantization Galore

Extension for low-bitwidth INT (binary, ternary, crumble, nibble, byte) and FP

- Tensor unit support (being standardized now in two versions: "attached" vs. "integrated")
- OCP *Microscaling* Formats (MX)  $\rightarrow$  RVV ISA is a good match
  - Version 1.0 published Sept 2023
  - Proponents: AMD, Arm, Intel, Meta, Microsoft, NVIDIA, Qualcomm
- Polynomial Approximation (PACE: stay tuned)
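The idea behind block-scaled formats such as MX can be sketched in a few lines: every block of values shares one power-of-two scale, and each element is stored as a narrow integer. The sketch below is a simplified illustration, not a bit-exact implementation of the OCP specification; the function names `mx_like_quantize` and `mx_like_dequantize`, the rounding rule, and the symmetric clipping range are assumptions made for this example.

```python
import math

def mx_like_quantize(block, elem_bits=8):
    """Quantize one block to signed integers that share a power-of-two scale.

    Simplified, hypothetical model: the real MX formats define exact
    encodings for the shared scale and for each element type.
    """
    qmax = 2 ** (elem_bits - 1) - 1          # e.g. 127 for 8-bit elements
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0] * len(block)
    # Smallest exponent such that amax / 2**exp still fits in [-qmax, qmax].
    exp = math.ceil(math.log2(amax / qmax))
    scale = 2.0 ** exp
    q = [max(-qmax, min(qmax, round(v / scale))) for v in block]
    return exp, q

def mx_like_dequantize(exp, q):
    """Reconstruct approximate real values from the shared exponent."""
    scale = 2.0 ** exp
    return [v * scale for v in q]

exp, q = mx_like_quantize([0.5, -1.0, 0.25, 0.0])
restored = mx_like_dequantize(exp, q)
```

With 8-bit elements and one 8-bit shared scale per block of 32, a block costs 32 × 8 + 8 = 264 bits, or 8.25 bits per element on average.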




[Table: **MX** number formats, listing the block-level exponent and the total bits per block of 32 and per block of 64]



[SemiAnalysis24]

#### SSR & FREP: Streaming Extension

- SSR: link register reads/writes to implicit LD/ST
  - Extension around the core's register file
  - Address generators (2-3 kGE/SSR)
  - Configured outside the inner loop (LD/ST elision)
  - Staggering: generators prefetch from memory (latency tolerant!)
- FREP: L0 instruction buffer (no I\$ access)
  - Pseudo-dual issue (Int pipeline can proceed in parallel)
  - No boundary checking for the loop (similar to HW loops in DSPs)
  - Boosts FPU utilization → 100% (once setup is amortized)

| dotp (explicit loads): 30% FPU utilization                | dotp (with SSRs): 90% FPU utilization                                               |
|-----------------------------------------------------------|-------------------------------------------------------------------------------------|
| loop:<br>fld r0, %[a]<br>fld r1, %[b]<br>fmadd r2, r0, r1 | scfg 0, %[a], ldA<br>scfg 1, %[b], ldB<br>loop:<br>fmadd r2, ssr0, ssr1             |
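The effect of load/store elision on FPU utilization can be illustrated with a toy issue-slot model. This is a rough sketch under stated assumptions: a single-issue core, one issue slot per instruction, no loop or branch overhead, and a hypothetical `affine_stream` generator standing in for an SSR address generator. It does not model the real Snitch pipeline, so the numbers only indicate the trend.

```python
def affine_stream(memory, base, stride, count):
    """Hypothetical stand-in for one SSR address generator."""
    for i in range(count):
        yield memory[base + i * stride]

def dotp_explicit_loads(memory, a_base, b_base, n):
    """Two explicit loads plus one FMA per element."""
    acc, issued = 0.0, 0
    for i in range(n):
        r0 = memory[a_base + i]; issued += 1   # fld
        r1 = memory[b_base + i]; issued += 1   # fld
        acc += r0 * r1;          issued += 1   # fmadd
    return acc, n / issued                     # result, FPU utilization

def dotp_streamed(memory, a_base, b_base, n):
    """Streams configured once; the loop body is a single FMA."""
    ssr0 = affine_stream(memory, a_base, 1, n)
    ssr1 = affine_stream(memory, b_base, 1, n)
    acc, issued = 0.0, 2                       # two scfg setup instructions
    for x, y in zip(ssr0, ssr1):
        acc += x * y;            issued += 1   # fmadd reading ssr0, ssr1
    return acc, n / issued

mem = [float(v) for v in range(1, 37)]
baseline = dotp_explicit_loads(mem, 0, 18, 18)
streamed = dotp_streamed(mem, 0, 18, 18)
```

In this model the explicit-load loop can never exceed one third utilization, while the streamed version approaches 100% as the vector length grows and the setup cost is amortized.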





Latency Tolerance: Less expensive than OoO (CPU) and Multi-threading (GPU)

# Snitch Cluster: The Fundamental Compute Block

- 8 Snitch compute cores
  - SIMD 64b FPU with SSRs & FREP
- 9<sup>th</sup> core: DMA engine
  - 512b interface to interconnect
  - HW support for autonomous ≤ 2D transfers, higher dimensions through SW
  - Latency-tolerant block transfers (100s of cycles)
- 128 KiB TCDM
  - 32-bank, low-latency shared scratchpad
  - Double-buffer large chunks with DMA
- Shared TCDM, I-cache and peripherals
- Shared DMA (10% overhead) for global latency tolerance



# Specializing the Cluster for Gen.AI

- Attention is key
- Attention matrix is a square matrix of order input length
  - Quadratic memory requirement vs. sequence length
  - No asymmetry between operands ("weightless")
  - MatMul & Softmax dominate

$$\mathrm{Softmax}(\mathbf{x})_i = \frac{e^{x_i - \max(\mathbf{x})}}{\sum_{j=1}^{n} e^{x_j - \max(\mathbf{x})}}$$
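As a minimal floating-point sketch of the max-subtracted Softmax above, together with the quadratic growth of the attention matrix, consider the following. The helper name `attention_matrix_bytes` and the one-byte element size are assumptions chosen for illustration.

```python
import math

def softmax(x):
    """Softmax with the maximum subtracted first, so exp() cannot overflow."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def attention_matrix_bytes(seq_len, bytes_per_element=1):
    """Size of one seq_len x seq_len attention matrix (one head)."""
    return seq_len * seq_len * bytes_per_element

probs = softmax([1000.0, 1000.0])        # would overflow without the max shift
size_4k = attention_matrix_bytes(4096)   # 16 MiB per head at 1 byte/element
```

Doubling the sequence length quadruples the attention matrix size, which is why this intermediate result dominates memory requirements for long sequences.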




### Matmul Benefits from Large Shared-L1 clusters

- Why?
  - Better global latency tolerance if $L1_{size} > 2 \times L2_{latency} \times L2_{bandwidth}$ (Little's law + double buffer)
  - Smaller data partitioning overhead
  - Larger compute/boundary bandwidth ratio: N<sup>3</sup>/N<sup>2</sup> for MMUL grows linearly with N!
- A large "MemPool": 256+ cores and 1+ MiB of shared L1 data memory
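Both arguments can be checked with a few lines of arithmetic. The sketch below applies the Little's-law bound with double buffering and counts multiply-accumulates against words crossing the tile boundary for an N × N matrix multiplication. The latency and bandwidth values are hypothetical placeholders, not measurements from the slides, and the three-matrix traffic count (read A, read B, write C) is a simplifying assumption.

```python
def min_l1_bytes(l2_latency_cycles, l2_bandwidth_bytes_per_cycle):
    """Little's law: data in flight = latency x bandwidth; x2 for double buffering."""
    return 2 * l2_latency_cycles * l2_bandwidth_bytes_per_cycle

def matmul_compute_to_boundary_ratio(n):
    """MACs per word moved for an n x n by n x n tile (read A and B, write C)."""
    macs = n ** 3
    words_moved = 3 * n ** 2
    return macs / words_moved

# Hypothetical example: 100-cycle L2 latency, 64 B/cycle of L2 bandwidth.
required = min_l1_bytes(100, 64)
ratio_small = matmul_compute_to_boundary_ratio(48)
ratio_large = matmul_compute_to_boundary_ratio(96)
```

Doubling the tile edge doubles the work done per word moved, so a larger shared L1 directly lowers the bandwidth needed from the next memory level.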



#### MemPool Cluster



### MemPool Cluster: A Physically Aware Design

- A Scalable Manycore Architecture with Low-Latency Shared L1 Memory
  - 256+ cores
  - 1+ MiB of shared L1 data memory
  - ≤ 8 cycle latency (Snitch can handle it)
- Hierarchical design
- Implemented in GF22
  - Targeting 500 MHz (SS/0.72V/125°C)
  - Reaching 600 MHz (TT/0.80V/25°C)
  - Targeting iso-frequency with PULP
- Cluster area of 13 mm<sup>2</sup>
  - 5 mm diagonal
  - Round trip in 5 cycles
- TeraPool: 1024 cores!

#### **MemPool Group**






# MemPool + Integer Transformer Accelerator (ITA)

#### **Tightly Coupled Acceleration Engine**

- Matmul & Softmax
- Reduce pressure on memory and interconnect

#### **Collaborative Execution**

- Cores prepare activations for the next attention head
- Final head accumulation computed in cores
- Nonlinearity in cores (PACE)



### MemPool + Integer Transformer Accelerator (ITA)

#### **Integer Attention Accelerator**

- 8-bit inputs, weights & outputs
- Built-in data marshaling & pipelined operation
- Streaming partial Softmax adding no additional latency
- Fused  $Q \times K^T$ , Softmax and  $A \times V$  computation
- Support for hardware-aware Softmax approximation in QuantLib
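A streaming Softmax can be sketched in software by keeping a running maximum and a running denominator that is rescaled whenever a new maximum appears, so that a row can be processed chunk by chunk as the $Q \times K^T$ results arrive. The code below is a generic floating-point illustration of that idea; it is not ITA's integer implementation, and the function names are chosen for this example.

```python
import math

def streaming_softmax_stats(chunks):
    """One pass over non-empty chunks; returns (row maximum, denominator)."""
    running_max = -math.inf
    running_sum = 0.0
    for chunk in chunks:
        new_max = max(running_max, max(chunk))
        # Rescale the partial sum to the new maximum, then add this chunk.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(v - new_max) for v in chunk)
        running_max = new_max
    return running_max, running_sum

def softmax_streaming(chunks):
    """Normalize every element using the statistics gathered in one pass."""
    row_max, denom = streaming_softmax_stats(chunks)
    return [math.exp(v - row_max) / denom for chunk in chunks for v in chunk]

result = softmax_streaming([[1.0, 2.0], [3.0, 0.5]])
```

The rescaling step is what lets the denominator be accumulated without first seeing the whole row, which is the property a fused attention pipeline relies on.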






#### Extending ITA to MXTA






#### Attention on ITA

- Performance increase of **15x**
- Energy efficiency increase of **36x**
- Area efficiency increase of **74x**



#### **Attention Efficiency**

# Scaling UP: Efficient and Flexible Data Movement



# **Problem:** HBM Accesses are critical in terms of

- Access energy
- Congestion
- High latency

Instead reuse data on lower levels of the memory hierarchy

- Between clusters
- Across groups

Smartly distribute workload

- Clusters: Tiling, Depth-First
- Chiplets: e.g., layer pipelining

#### **Big trend!**



#### Addressing interconnect scalability



- Fat-tree was very challenging in implementation
  - AXI has severe scalability issues
  - Top-level Xbar had to be split up
  - Still, interconnect takes up almost 40%\*
- Working on NoC solution, *FlooNoC*
  - Fully AXI4 compatible
  - Solves AXI4 scalability issues
  - Designed with awareness of physical design
  - Wide & physical channels



# Replacing the AXI interconnect with a NoC



- Potential for big area/performance gains
  - Only ~10% interconnect area
  - 66% more clusters, same floorplan
  - *High Bandwidth*: 629 Gbps/link
  - *High Energy-Efficiency*: 0.19 pJ/B/hop
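To get a feel for the per-link figures above, the short calculation below estimates the energy and the serialization time of a single transfer. The transfer size and hop count are hypothetical, and the model ignores router latency, contention, and protocol overhead.

```python
def noc_transfer_energy_pj(num_bytes, hops, pj_per_byte_per_hop=0.19):
    """Energy of moving num_bytes across the given number of hops, in pJ."""
    return num_bytes * hops * pj_per_byte_per_hop

def link_serialization_time_ns(num_bytes, link_gbps=629.0):
    """Time to push num_bytes through one link, in ns (1 Gbps = 1 bit/ns)."""
    return num_bytes * 8 / link_gbps

# Hypothetical example: a 1 KiB burst travelling 4 hops.
energy_pj = noc_transfer_energy_pj(1024, 4)
time_ns = link_serialization_time_ns(1024)
```

Under these assumptions the 1 KiB burst costs roughly 778 pJ and occupies a link for about 13 ns.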







#### MHA Mapping on NoC: FlattenAttention

- Proposed dataflow schedule of MHA
  - We leverage all-cluster L1 for single-head attention
  - Minimize I/O complexity
  - Gen.AI specialized NoC
    - Matrix transpose engine for transposition of K (K → K<sup>T</sup>)
    - Collective operations on NoC
- Benchmark & results
  - 16x16 clusters (8 TFLOPS, 256 kB L1), 2 TB/s HBM
  - One layer MHA of Llama3-70B (seq=4K, batch=8)
- Efficient collective operation support on NoC is essential (focus on only one head at a time, processing every head sequentially)
  - 3x speedup over baseline



### Scaling UP: From Chip to chiplets



**Occamy System** 








#### Occamy Group





## Not Only Layer-by-Layer distribution across Chiplets!


































#### What next?














# Thank You!