

## Occamy: A 432-core RISC-V Based 2.5D Chiplet System for Ultra-Efficient (Mini-)Floating-Point Computation

Gianna Paulin (pauling@iis.ee.ethz.ch) and the PULP team



**PULP Platform** 

Open Source Hardware, the way it should be!

> @pulp\_platform | pulp-platform.org



## Our latest design Occamy: 0.75 TFLOP/s, 400+ cores

#### Dual Chiplet System Occamy:

- 216+1 RISC-V cores per chiplet
- 0.75 TFLOP/s
- GF12LPP
- Area: 73mm<sup>2</sup>

#### 2x 16 GByte HBM2e DRAMs (Micron)

#### 2.5D Integration

#### Silicon Interposer Hedwig:

- Technology: 65nm, passive (only BEOL)
- Area: 26.3mm x 23.05mm

#### Carrier PCB:

- RO4350B (Low-CTE, high stability)
- 52.5mm x 45mm



## How did we get here?

#### Concept architecture presented at the Hot Chips 2020 conference [1]

ETH zürich

- (Quad-)chiplet-based architecture
- AI/HPC focused
- Essential components have been manufactured in GF22
- Measured for energy-efficiency

ALMA MATER STUDIORUM

- Extrapolation to larger AI workloads (full training and inference steps)

GlobalFoundries, Synopsys<sup>®</sup>



[1] F. Zaruba et al., "Manticore: A 4096-Core RISC-V Chiplet Architecture for Ultraefficient Floating-Point Computing," in IEEE Micro, vol. 41, no. 2, pp. 36-42, 1 March-April 2021, doi: 10.1109/MM.2020.3045564.







## Not All Programs Are Created Equal

- Processors can do two kinds of useful work:

#### **Decide** (jump to different program part)

- Modulate flow of instructions
- Smarts:
  - Don't work too much
  - Be clever about the battles you pick (e.g., search in a database)
- Lots of decisions, little number crunching

#### **Compute** (plough through numbers)

- Modulate flow of data
- Diligence:
  - Don't think too much
  - Just plough through the data (e.g., machine learning)
- Few decisions, lots of number crunching
- Many of today's challenges are of the diligence kind:
  - Tons of data, algorithm ploughs through, few decisions done based on the computed values
  - "Data-oblivious algorithms" (ML, or more precisely DNNs, are of this kind)
  - Large data footprint + sparsity
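The two kinds of work can be contrasted with a minimal Python sketch. The function names and the counters are illustrative assumptions, not part of the talk: a binary search makes one data-dependent branch per step and does almost no arithmetic, while a dot product executes the same multiply-add sequence regardless of the values it reads.

```python
def binary_search(sorted_keys, key):
    """'Decide' workload: every iteration branches on the data just read."""
    lo, hi = 0, len(sorted_keys)
    decisions = 0
    while lo < hi:
        mid = (lo + hi) // 2
        decisions += 1                 # one data-dependent branch per step
        if sorted_keys[mid] < key:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(sorted_keys) and sorted_keys[lo] == key
    return (lo if found else -1), decisions


def dot_product(a, b):
    """'Compute' workload: control flow does not depend on the values."""
    acc = 0.0
    flops = 0
    for x, y in zip(a, b):
        acc += x * y                   # one multiply and one add per element
        flops += 2
    return acc, flops


keys = list(range(0, 2048, 2))         # 1024 sorted keys
index, decisions = binary_search(keys, 10)
value, flops = dot_product([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(index, decisions, value, flops)
```

The search touches only a logarithmic number of elements but cannot know which one comes next until it has compared the current one; the dot product touches every element, yet its access pattern is known in advance, which is what makes it amenable to streaming hardware.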


## Snitch – a Tiny 32b Control Core with a big 64b FPU



#### Introducing SNITCH

- Start with a simple RISC-V core
- Focus on key features:
  - Lightweight microarchitecture
  - Extensibility: Performance through ISA extensions
  - Latency tolerant
  - Competitive **frequency**
- Around 15-25 kGE
- Capable 64b FPU with many extensions

[2] F. Zaruba et al., "Snitch: A Tiny Pseudo Dual-Issue Processor for Area and Energy Efficient Execution of Floating-Point Intensive Workloads," in IEEE Transactions on Computers, vol. 70, no. 11, pp. 1845-1860, 1 Nov. 2021, doi: 10.1109/TC.2020.3027900.

[3] F. Schuiki et al., "Stream Semantic Registers: A Lightweight RISC-V ISA Extension Achieving Full Compute Utilization in Single-Issue Cores," in IEEE Transactions on Computers, vol. 70, no. 2, pp. 212-227, 1 Feb. 2021, doi: 10.1109/TC.2020.2987314.



### Snitch Cluster – 5 MGE





## Main compute architecture is open-source!!!



The main compute architecture is being developed fully open-source!!!

github.com/pulp-platform/snitch

<u>github.com/pulp-platform/serial\_link</u>

HBM, DFG, FLL, and any proprietary components are in a separate private repository on our internal GitLab.



## **Programming Model**







- Multiple layers of abstraction:
  - Hand-tuned assembly
  - LLVM intrinsics (FREP, SSRs, DMA, ...)
  - High-level frameworks:
    - DaCe: <u>spcl.inf.ethz.ch/Research/DAPP/</u>
    - PyTorch + DORY: tiling of neural networks

- Bare-metal runtime
- Basic OpenMP runtime
- Occamy mapped onto 2x VCU128 (with HBM) + 1x VCU1525
  - 1x CVA6
  - 2-4x 9-core Snitch cluster
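The "tiling of neural networks" item above refers to splitting a layer's computation into blocks small enough for a cluster's local memory. The following is a generic loop-tiling sketch in plain Python, not the algorithm used by DORY or any PULP tool; the function name `matmul_tiled` and the `tile` parameter are hypothetical and chosen only for illustration.

```python
def matmul_tiled(a, b, tile=4):
    """Multiply a (n x k) by b (k x m), visiting the operands tile by tile."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Assumption: on a real system, a DMA transfer would copy the
                # a[i0.., p0..] and b[p0.., j0..] tiles into local memory here.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_tiled(a, b, tile=2))
```

The tile size is the knob a tiling framework would tune: larger tiles reuse data more, but each tile of both operands and of the result has to fit into the available local memory at once.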



## Our Silicon Interposer Hedwig (65nm, passive, GF)

#### Taped out: 15<sup>th</sup> of October 2022

- **Compact** die arrangement
  - No *dummy dies* or *stitching* needed
- Fairly low I/O pin count due to no high-bandwidth periphery
  - Off-package connectivity: ~200 wires
  - Array of 40 x 35 (-1) C4s (total of 1'399 C4 bumps)
    - Diameter: 400μm, Pitch: 650μm

- Die-to-Die: ~600 wires
- HBM: ~1700 wires
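The C4 figures above can be cross-checked with a few lines of arithmetic. This is a small sanity-check sketch: the constants come from the slides, while the helper names and the assumption that the 40-bump side runs along the 26.3mm edge of the interposer are mine.

```python
COLS, ROWS = 40, 35                # C4 array dimensions from the slide
MISSING = 1                        # the "(-1)": one array position is left out
DIAMETER_UM, PITCH_UM = 400, 650   # bump diameter and pitch in micrometres
INTERPOSER_UM = (26_300, 23_050)   # 26.3mm x 23.05mm interposer


def bump_count(cols, rows, missing):
    """Number of bumps in a full grid minus the omitted positions."""
    return cols * rows - missing


def array_span_um(n, pitch_um, diameter_um):
    """Edge-to-edge extent of n bumps in a line: (n - 1) pitches plus one diameter."""
    return (n - 1) * pitch_um + diameter_um


total = bump_count(COLS, ROWS, MISSING)
span = (array_span_um(COLS, PITCH_UM, DIAMETER_UM),
        array_span_um(ROWS, PITCH_UM, DIAMETER_UM))
# Assumption: the 40-bump side is aligned with the longer interposer edge.
fits = all(s <= limit for s, limit in zip(span, INTERPOSER_UM))
print(total, span, fits)
```

Under these assumptions the count matches the 1'399 bumps quoted on the slide, and the array extent of 25.75mm x 22.5mm stays within the interposer outline.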



## Carrier PCB mainly provides "fan-out" for PCB mounting

#### **Carrier PCB** (52.5mm x 45mm)

- Material selection: RO4350B
  - Low coefficient of thermal expansion (CTE)
  - High stability
- Decoupling caps
- Custom ZIF socket design





## Waiting for the assembly to complete...

- Finished chiplet tapeout in less than 15 months
  - Initial discussions: 20<sup>th</sup> of October 2020
  - Started on 20<sup>th</sup> of April 2021
  - Taped out chiplet on 1<sup>st</sup> of July 2022
  - Taped out interposer on 15<sup>th</sup> of October 2022
  - Currently being assembled
- Biggest challenges:
  - Access to IPs
  - Low-volume assembly
- Up to 25 engineers involved



http://pulp-platform.org

@pulp\_platform