

### Occamy: The adventure of bringing our open-source PULP platform to HPC

Huawei STW 2022 – Zurich Subsite

**Frank K. Gürkaynak** kgf@iis.ee.ethz.ch and the PULP team

PULP Platform

Open Source Hardware, the way it should be!



@pulp\_platform
pulp-platform.org
youtube.com/pulp\_platform

### This is the story of how we designed a

- 1 billion transistor,
- 400+ RISC-V core,
- Chiplet-based system with two compute tiles and HBM2e memories,
- with a peak performance of more than **0.75 Tera** double-precision floating-point Op/s



# About 10-12 years ago, this was as good as we could get



### 2011 Sandy Bridge from Intel

- 4 cores
- 32nm, 216 mm<sup>2</sup>
- ~1 billion transistors

### 2022 Occamy from ETH Zürich

- 216 + 1 cores
- 12nm, 72 mm<sup>2</sup>
- ~1 billion transistors

Academic research results in serious silicon



### I want to talk about three points today



- **Open Source Hardware**
- **From Concept to Real Silicon**
- **Computer Architecture Research**





### Who are we and who is behind PULP?









### Timeline of Parallel Ultra Low Power (PULP) project








### PULP uses a permissive open source license

- All our development is on GitHub: <u>https://github.com/pulp-platform</u>
  - HDL source code, testbenches, software development kit, virtual platform
- The license allows anyone to use, change, and make products without restrictions.





### The complicated relationship of Open Source Hardware



| TYPE                   | EXAMPLES       | STATUS       |
|------------------------|----------------|--------------|
| Open Specifications    | RISC-V         | Established  |
| Architectures          | PULP           | Quite mature |
| Implementations in RTL | Snitch, Hero   | Many         |
| Open source Hard IP    | FLL, DDR PHY   | Very Limited |
| Process Design Kits    | Skywater 130nm | Just Started |
| Open Source Tools      | OpenLane       | On its way   |






# Why is RISC-V so special: Freedom to Explore and Fail!




- The ISA provides a contract between HW and SW
  - As long as you stick to the ISA, you can develop HW and SW independently
  - All RISC-V research in HW can continue to rely on the growing SW ecosystem for RISC-V
- RISC-V comes with plenty of options for extensions
  - There are reserved encoding spaces for instruction set extensions
- Being able to change everything gives great flexibility
  - Do you want 33 registers, or a 48-bit accumulator? No problem.
  - You need to bring the SW support for your additions.



# What if we had a tiny 32b core and add a big 64b FPU



### **Introducing SNITCH**

- Start with a simple RISC-V core
- Focus on key features:
  - Lightweight microarchitecture
  - Extensibility: Performance through ISA extensions
  - Latency tolerant
  - Competitive frequency
- Around 15-25 kGE
- Capable 64b FPU with many extensions


# What if we add a Floating-point Repetition Buffer? (FREP)






### Remove control flow overhead

- Programmable micro-loop buffer
- Sequencer steps through the buffer, independently of the FPU
- Integer core free to operate in parallel: Pseudo-dual issue
- High area- and energy-efficiency
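
To make the micro-loop buffer idea concrete, here is a minimal software model in C. It is a sketch under stated assumptions, not the Snitch hardware or its ISA encoding: the two-operation instruction set, the `frep_run` helper, and the two instructions of loop overhead per iteration are all hypothetical. The point it illustrates is that the integer core fills the buffer once and a sequencer replays it, so the number of instructions the integer core issues no longer grows with the iteration count.

```c
#include <stddef.h>

/* Hypothetical FP micro-ops, for illustration only (not the real ISA). */
typedef enum { FP_MUL, FP_ADD } fp_op_t;

typedef struct {
    fp_op_t op;
    double  operand;   /* immediate operand used by this toy model */
} fp_insn_t;

/* Replays the `len` buffered instructions `reps` times on accumulator `acc`.
 * Models a sequencer stepping through the buffer without any branch or
 * loop-counter instructions coming from the integer core. */
double frep_run(const fp_insn_t *buf, size_t len, unsigned reps, double acc)
{
    for (unsigned r = 0; r < reps; ++r) {
        for (size_t i = 0; i < len; ++i) {
            switch (buf[i].op) {
            case FP_MUL: acc = acc * buf[i].operand; break;
            case FP_ADD: acc = acc + buf[i].operand; break;
            }
        }
    }
    return acc;
}

/* Instructions issued by the integer core for a plain loop: the body plus
 * an ASSUMED two instructions of overhead (counter update, branch). */
unsigned long issued_plain_loop(unsigned long body_len, unsigned long reps)
{
    return reps * (body_len + 2);
}

/* With the repetition buffer: one configuration instruction (assumed)
 * plus the loop body, issued a single time. */
unsigned long issued_with_frep(unsigned long body_len)
{
    return 1 + body_len;
}
```

Under these assumptions, a two-instruction body repeated 100 times costs the integer core 400 issued instructions as a plain loop and 3 with the buffer, which leaves it free to do other work in parallel.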





Allows custom instruction set extensions



# What if we could stream data to and from the FPU directly? (SSR)

High FPU utilization ≈ high energy-efficiency

Idea: Turn register reads/writes into memory loads/stores.

- Extension around the core's register file
- Address generation hardware

Intuition:

| Baseline loop           | With SSRs                |
|-------------------------|--------------------------|
| `loop:`                 | `scfg 0, %[a], ldA`      |
| `fld r0, %[a]`          | `scfg 1, %[b], ldB`      |
| `fld r1, %[b]`          | `loop:`                  |
| `fmadd r2, r0, r1`      | `fmadd r2, ssr0, ssr1`   |

- Increase FPU/ALU utilization by ~3x up to 100%
- SSRs ≠ memory operands
  - Perfect prefetching, latency-tolerant
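
The streaming idea can be sketched as a small software model in C. This is an illustration only: `stream_t`, `stream_setup`, `stream_pop`, and `dot_streamed` are hypothetical names, and the fields do not mirror the hardware configuration registers. Each stream is an address generator walking an array with a fixed stride, and reading the stream delivers the next element, so the loop body shrinks to a single multiply-add.

```c
#include <stddef.h>

/* Toy model of one stream: an address generator walking a 1-D array.
 * Field names and semantics are illustrative, not the hardware registers. */
typedef struct {
    const double *base;   /* first element of the array */
    size_t count;         /* number of elements to deliver */
    size_t stride;        /* step between delivered elements, in elements */
    size_t next;          /* index of the next element to deliver */
} stream_t;

stream_t stream_setup(const double *base, size_t count, size_t stride)
{
    stream_t s = { base, count, stride, 0 };
    return s;
}

/* Reading the stream "register" returns the next element and advances. */
double stream_pop(stream_t *s)
{
    double v = s->base[s->next * s->stride];
    s->next++;
    return v;
}

/* Dot product whose loop body is a single multiply-add on two streams. */
double dot_streamed(const double *a, const double *b, size_t n)
{
    stream_t s0 = stream_setup(a, n, 1);
    stream_t s1 = stream_setup(b, n, 1);
    double acc = 0.0;
    while (s0.next < s0.count) {
        acc += stream_pop(&s0) * stream_pop(&s1);
    }
    return acc;
}
```

In the baseline loop of the table, only one of every three issued instructions is an `fmadd`, so the FPU is busy at most a third of the time. Once the loads are implicit, every issued instruction can be an `fmadd`, which is where the roughly 3x utilization gain comes from.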







### We have a processor that maximizes FPU efficiency



Spending energy where it counts the most == Efficiency






### Heterogeneous + Parallel... Why?

- Processors can do two kinds of useful work:

**Decide** (jump to different program part)

- Modulate flow of **instructions**
- Mostly sequential decisions:
  - Don't work too much
  - Be clever about the battles you pick (latency is king)
- Lots of decisions, little number crunching

**Compute** (plough through numbers)

- Modulate flow of **data**
- Embarrassingly data parallel:
  - Don't think too much
  - Plough through the data (throughput is king)
- Few decisions, lots of number crunching

- Today's workloads are dominated by "Compute":
  - Tons of data, few (as fast as possible) decisions based on the computed values
  - Data-oblivious algorithms (ML, or better, DNNs are so!)
  - Large data footprint + sparsity

### How to design an efficient "Compute" fabric?



### Efficient Architecture: Heterogeneous + Parallel








# Here is one solution: PULP cluster made of Snitch cores


- We start with a single core
- Typical cluster has 4-16 cores
- Local scratchpad memory
  - Multibanked memory
  - Logarithmic interconnect
  - All cores connect to all banks
  - Banking factor reduces conflicts
- A DMA is used to copy data
  - To and from external memory
  - One specialized core supports DMA
- (Shared) Instruction cache
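
The effect of the multibanked scratchpad can be illustrated with a short C sketch of word-interleaved bank selection. The parameters are assumptions chosen for illustration (8-byte words, 32 banks, so a banking factor of 4 for 8 cores); the slides do not state the bank count, and `bank_of` and `count_conflicts` are hypothetical helper names.

```c
#include <stdint.h>

/* ASSUMED parameters for illustration; the slides do not give these. */
enum { WORD_BYTES = 8, NUM_BANKS = 32, NUM_CORES = 8 };

/* Word-interleaved mapping: consecutive words land in consecutive banks. */
unsigned bank_of(uint32_t byte_addr)
{
    return (byte_addr / WORD_BYTES) % NUM_BANKS;
}

/* Counts how many of the given same-cycle accesses (one per core) would
 * have to wait because an earlier access in the list targets the same bank. */
unsigned count_conflicts(const uint32_t *addrs, unsigned n)
{
    unsigned busy[NUM_BANKS] = { 0 };
    unsigned conflicts = 0;
    for (unsigned i = 0; i < n; ++i) {
        unsigned b = bank_of(addrs[i]);
        if (busy[b]) {
            conflicts++;
        }
        busy[b] = 1;
    }
    return conflicts;
}
```

With this mapping, eight cores reading eight consecutive words hit eight different banks and see no conflict, while eight accesses spaced exactly `NUM_BANKS` words apart all collide on one bank. Having more banks than cores makes the second case less likely for typical access patterns.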





We have a solid building block to make larger systems



### Our systems are getting more complex, we need help!

- Modern IC design is complex and expensive
  - We need partners to help and collaborate
  - We need support (IPs, donations) to realize designs

### **Open Source to the rescue**

- Makes it easy to collaborate with external partners
  - Less paperwork/NDAs to get started
  - Partners see/are aware of what we provide
- What we do can be re-used (permissive licensing) by our partners
- Results can be more easily verified



### Open source collaboration scheme explained

#### Direct research collaborators on PULP






#### Academic users we are aware of





# The open model led to successful industry collaborations








Occamy, ambitious project: needs strong partners







Fraunhofer IZM





# Where it started: Manticore Multi-Chiplet Concept (2020)

- Concept architecture presented two years ago at the Hot Chips 32 conference
  - AI/HPC focused
    - Extrapolation on larger …
  - Quad-chiplet-based architecture
    - 222mm<sup>2</sup> (14.9 x 14.9mm)
    - Essential components h…
  - Three die-to-die links:
    - Each die has short-range … each sibling for non-uni…
    - Efficient inter-die synchronization
  - Private 8GB HBM2 per chiplet
    - SoA BW and efficiency




# Life cycle of an IC Design project

[Figure: performance over time for an IC design project, with the stages Idea, RTL, Netlist, Tapeout, and Chip back, and the phases Backend and Test and Bringup]

**Designs always look better on paper**

### Back to reality, constraints start eating into our margins

- Circuit size limits number of C2C links
  - Two instead of Four Chiplets
- IP availability reduces external BW
  - No PCIe IPs available
- Interposer size (32mm x 28mm) limits die size
  - Number of cores per chiplet reduced
- Physical design requires compromises
  - Smaller collection of clusters
- Metal stack incompatibility between IPs
  - Have to choose Die2Die or HBM memory








Low volume chiplet assembly is not easy to organize

# The heart: Occamy Chiplet: 384 GFlop/s (double precision) Engine

- GF12, target 1GHz (typ)
- 2 AXI NoCs (multi-hierarchy)
  - 64-bit for configuration/service
  - 512-bit with "interleaved" mode
- Peripherals
- Linux-capable manager core CVA6
- 6 Quadrants: 216 cores/chiplet
  - 4 clusters / quadrant:
    - 8 compute + 1 DMA core / cluster
    - 1 multi-format FPU / core (FP64, 2x FP32, 4x FP16/alt, 8x FP8/alt)
- 8-channel HBM2e (8GB) 512GB/s
- D2D link (Wide, Narrow) 70+2GB/s
- System-level DMA


SPM (2MB wide, 512KB narrow)
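
The headline numbers on this slide can be cross-checked with a few lines of C arithmetic. The core counts, clock target, and HBM2e bandwidth are taken from the slide. The one assumption, not stated in the slides, is that each compute core retires one FP64 fused multiply-add per cycle, counted as two floating-point operations. The function names are made up for this sketch.

```c
/* Numbers taken from the slide. */
enum {
    QUADRANTS_PER_CHIPLET     = 6,
    CLUSTERS_PER_QUADRANT     = 4,
    COMPUTE_CORES_PER_CLUSTER = 8,
    DMA_CORES_PER_CLUSTER     = 1,
    CHIPLETS                  = 2
};

/* ASSUMPTION: one FP64 fused multiply-add per core per cycle = 2 flop. */
enum { FLOP_PER_CORE_PER_CYCLE = 2 };

/* All Snitch cores on one chiplet, compute and DMA cores together. */
unsigned cores_per_chiplet(void)
{
    return QUADRANTS_PER_CHIPLET * CLUSTERS_PER_QUADRANT *
           (COMPUTE_CORES_PER_CLUSTER + DMA_CORES_PER_CLUSTER);
}

/* Only the cores that contribute floating-point throughput. */
unsigned compute_cores_per_chiplet(void)
{
    return QUADRANTS_PER_CHIPLET * CLUSTERS_PER_QUADRANT *
           COMPUTE_CORES_PER_CLUSTER;
}

/* Peak FP64 throughput of one chiplet in Gflop/s at the given clock. */
double peak_gflops_per_chiplet(double clock_ghz)
{
    return compute_cores_per_chiplet() * FLOP_PER_CORE_PER_CYCLE * clock_ghz;
}

/* Peak FP64 throughput of the two-chiplet system in Gflop/s. */
double peak_gflops_system(double clock_ghz)
{
    return CHIPLETS * peak_gflops_per_chiplet(clock_ghz);
}

/* Memory bytes available per floating-point operation at peak. */
double bytes_per_flop(double mem_gbytes_per_s, double gflops)
{
    return mem_gbytes_per_s / gflops;
}
```

At the 1 GHz target this gives 216 cores and 384 Gflop/s per chiplet, and 768 Gflop/s for two chiplets, consistent with the "more than 0.75 Tera" figure. Dividing the 512 GB/s of HBM2e bandwidth by 384 Gflop/s gives about 1.33 bytes per flop per chiplet.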





# Snitch Cluster, eight + one core for DMA + 128kB memory






# Multiple Snitch clusters form a group



[Figure: Snitch clusters connected to the Group AXI Narrow (64bit) and Group AXI Wide (512bit) interconnects]





### Snitch Group in Occamy: 4 Clusters



[Figure label: to Toplevel AXI Narrow 64bit]





### **Total of Six Snitch Groups in Occamy**




# Occamy – Finally all together



[Figure: Occamy block diagram with the Snitch groups, 2 AXI busses, peripherals, 512bit constant cache, HBM2E DRAM, and the D2D serial links]

# Occamy Chiplet, balancing bandwidth and compute



### Occamy system with two compute dies and two HBM2e




### Hedwig, our Silicon Interposer (GF 65)



Challenges in power distribution and fast signal routing



# **Programming Model**


```
int main(void) {
    unsigned repetition = 2, bound = 4, stride = 8;
    static int data[8] = {1, 2, 3, 4, 5, 6, 7, 8};

    /* Configure stream 0 as a 1-D stream over `data`. */
    __builtin_ssr_setup_1d(0, repetition, bound, stride, data);
    static volatile double d = 42.0;

    /* Stream accesses are only valid between enable and disable. */
    __builtin_ssr_enable();
    __builtin_ssr_push(0, d);
    volatile double e;
    e = __builtin_ssr_pop(0);
    __builtin_ssr_disable();
    return 0;
}
```

- Multiple layers of abstraction:
  - Hand-tuned assembly
  - LLVM intrinsics
  - FREP inference
  - High-level frameworks:
    - DaCe: <u>spcl.inf.ethz.ch/Research/DAPP/</u>
    - PyTorch+Dory: tiling of neural networks
- Bare-metal runtime
- Basic OpenMP runtime








### **Prototyping and Emulation**





- "Quad-chiplet" prototype board with FPGA interface
- Occamy mapped onto 2x VCU128 (with HBM) + 1x VCU1525
  - 1x CVA6
  - 2-4x 9-core Snitch cluster







### The adventure is just starting...

- Occamy (compute die) has been taped out
- Hedwig (silicon interposer) will follow end of September
- Wafers back end of year
- Assembly 30 weeks
- We should be up and running in 3Q23

# Much more to come...









Luca Benini, Alessandro Capotondi, Alessandro Ottaviano, Alessio Burrello, Alfio Di Mauro, Andrea Borghesi, Andrea Cossettini, Angelo Garofalo, Arpan Prasad, Corrado Bonfanti, Cristian Cioflan, Daniele Palossi, Davide Rossi, Fabio Montagna, Florian Glaser, Florian Zaruba, Francesco Conti, Georg Rutishauser, Germain Haugou, Gianna Paulin, Giuseppe Tagliavini, Hanna Müller, Luca Bertaccini, Luca Colagrande, Luca Valente, Manuel Eggimann, Manuele Rusci, Marco Bertuletti, Marco Guermandi, Matheus Cavalcante, Matteo Perotti, Michael Rogenmoser, Moritz Scherer, Moritz Schneider, Nazareno Bruschi, Nils Wistoff, Pasquale Davide Schiavone, Paul Scheffler, Philipp Mayer, Robert Balas, Samuel Riedel, Sergio Mazzola, Sergei Vostrikov, Simone Benatti, Stefan Mach, Thomas Benz, Thorir Ingolfsson, Tim Fischer, Victor Javier Kartsch Morinigo, Victor Jung, Vlad Niculescu, Xiaying Wang, Yichao Zhang, Frank K. Gürkaynak,

all our past collaborators and many more that we forgot to mention





http://pulp-platform.org



@pulp\_platform

# Related publications and open-source repositories

### Publications:

- Manticore @HotChips: <u>https://ieeexplore.ieee.org/abstract/document/9296802</u>
- SSRs: <u>https://ieeexplore.ieee.org/abstract/document/9474230</u>
- ISSRs: <u>https://ieeexplore.ieee.org/abstract/document/9474230</u>
- Snitch core & FREP: <u>https://ieeexplore.ieee.org/abstract/document/9216552</u>
- MiniFloat-NN: <u>https://arxiv.org/pdf/2207.03192.pdf</u> (accepted ARITH '22)
- SoftTiles: <u>https://arxiv.org/pdf/2209.00889.pdf</u> (accepted ISVLSI '22)

### Open-source repositories:

- Snitch/Occamy: github.com/pulp-platform/snitch

### See also our main web page:

<u>https://pulp-platform.org</u>

