

# **PULP: Looking Back and Looking Forward**

#### Luca Benini (lbenini@ethz.ch, luca.Benini@unibo.it)





@pulp\_platform pulp-platform.org






**PULP Platform** 

Open Source Hardware, the way it should be!

# Looking Back: April 2012, Job Talk @ ETHZ



#### Digital Platform Design in the Twilight of Moore's Law

Luca Benini Università di Bologna & STMicroelectronics



[Figure: Chip Power (ITRS) vs. Max Power (air cooling + heatsink), highlighting Dark Silicon]

The Twilight of Moore's Law: Power

Thermal wall: transistor count still increases exponentially but we can no longer power the entire chip (voltages, cooling do not scale) [Hardavellas11]

[Figure: semiconductor manufacturers per technology node: 130nm, 65nm/55nm, 45/40nm, 32/28nm, 22/20nm]

Market volume wall: only the largest volume products will be manufactured with the most advanced technology

#### The Twilight of Moore's Law: IO Bandwidth



**Memory wall**: larger datasets and limited bandwidth at high power cost for accessing external memory







#### The Twilight of Moore's Law: Economics

### Looking Back: April 2012, Job Talk @ ETHZ



### Heterogeneous, Accelerated Computing, 3D integration...



# Looking Back: April 2012, Job Talk @ ETHZ





### P2022 was too early + Crashed against ARM dominance



# Looking Back: A few good ideas



### Parallel, Ultra Low Power Processors



Target pJ/OP @ GOPS and beyond
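The pJ/OP @ GOPS target translates directly into a power budget, since energy per operation times operations per second is power. A minimal sketch of that arithmetic follows; the function name and the operating points are hypothetical illustrations, not figures from the talk.

```python
def power_mw(energy_pj_per_op: float, throughput_gops: float) -> float:
    """Average power in mW for a given energy per operation and throughput.

    pJ/op * Gop/s = 1e-12 J * 1e9 1/s = 1e-3 W, so the product is already in mW.
    """
    return energy_pj_per_op * throughput_gops


if __name__ == "__main__":
    # Hypothetical operating points, chosen only to illustrate the scaling.
    for pj, gops in [(1.0, 1.0), (1.0, 100.0), (10.0, 1.0)]:
        print(f"{pj} pJ/op at {gops} GOPS -> {power_mw(pj, gops)} mW")
```

In other words, at 1 pJ/OP every GOPS of sustained throughput costs 1 mW.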

### But for what?

And how to escape the proprietary ISA cage?



### Looking Back: Serendipity!

Sevilla 22: arXiv:2202.05924, epochai.org





### 10x every 2 years



Large Scale Era




## Looking Back: More Serendipity!



### 2014 – A cute, open ISA



Recommendations and Roadmap for European Sovereignty in Open Source Hardware, Software, and RISC-V Technologies

Report from the

Open Source Hardware & Software Working Group

August 2022






### 2023 – Disruptive Force

RISC-V International in 2023: more than 26% year-over-year membership growth, with over 3,180 members across 70 countries. More than 10 billion RISC-V cores are in the market, and 10K+ engineers are working on RISC-V.



# A Good Intuition



**Open Source Hardware!** → RTL source code (a permissive\* license, e.g. Apache, is key for industrial adoption)

Later stages contain closed IP of various actors  $\rightarrow$  not open source by default (working on that...)



# **Open Source Platform**





- OpenHW Group is a not-for-profit, global organization (EU, NA, Asia) where HW and SW designers collaborate in the development of open-source cores, related IP, tools, and SW, such as the CORE-V family
- OpenHW Group provides an infrastructure for hosting high-quality open-source HW developments in line with industry best practices.






# Creating Product Value with OSHW



**First iteration**: test-chip for IP qualification, early customer engagement (MPW)

**Second iteration**: first low-volume production (most effort on c and d) (MLR or full mask set)

NOTE: aggressive (e.g. Greenwaves: IoT processor) vs. cost-sensitive fabless (e.g. Eggtronics: cellular charger IC) users

- Aggressive: customizing OSHW to provide differentiation with respect to ARM. Targets advanced nodes.
- Cost-sensitive: using OSHW "as is" to reduce cost with respect to ARM, and time to market (TtM) and effort with respect to in-house development. Targets older nodes.

# With a Little Help from my Friends...



**Now to Eric+Loic!** 



### Forward to 2022: Job Done?














# Forward to 2022: Job Done?

#### **RISC-V Cluster**

- Energy efficiency on 32-bit to 8-bit operations comparable to other SOA PULPs [7]
- Highest energy efficiency on sub-byte SIMD operations (4-bit and 2-bit)

#### **SNE**

- 1.7x higher energy efficiency than SOA [5]

#### **CUTIE**

- 2x higher energy efficiency than SOA [6]



### **CUTIE, SNE** can work concurrently for SNN + TNN "fused" inference (never done so far)



[5] L. Deng et al., "Tianjic," JSSC, 2020.
[6] B. Moons et al., "BinarEye," CICC, 2018.
[7] D. Rossi et al., "Vega," JSSC, 2022.

# Fast Forward: Perceptive $\rightarrow$ Generative $\rightarrow$ Embodied AI





## **Disruptive Embodied AI: Automotive**



### AD Chips Comparison

| Chip        | Tech. node | Perf. (TOPS) | Power (W) | Perf/Watt (TOPS/W) |
|-------------|------------|--------------|-----------|--------------------|
| Mobileye Q4 | 28 nm      | 2.5          | 3         | 0.83               |
| Tesla FSD   | 14 nm      | 144          | 72        | 2                  |
| Mobileye Q5 | 7 nm       | 24           | 10        | 2.4                |
| NVIDIA Orin | 7 nm       | 244          | 70        | 3.48               |
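
The Perf/Watt column of the comparison above is simply peak TOPS divided by power. A short sketch that recomputes it from the other two columns follows; the variable and function names are illustrative, and the inputs are the values listed in the table.

```python
# (chip, peak performance in TOPS, power in W), as listed in the comparison table
CHIPS = [
    ("Mobileye Q4", 2.5, 3),
    ("Tesla FSD", 144, 72),
    ("Mobileye Q5", 24, 10),
    ("NVIDIA Orin", 244, 70),
]


def tops_per_watt(tops: float, watts: float) -> float:
    """Energy efficiency in TOPS/W."""
    return tops / watts


for name, tops, watts in CHIPS:
    print(f"{name:12s} {tops_per_watt(tops, watts):.4f} TOPS/W")
```

For the last row, 244 / 70 is about 3.486, so the 3.48 listed in the table appears to be truncated rather than rounded.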

[Figure: Occamy chip; labels: HBM2E Ctrl & PHY, HBM2E DRAM, GlobalFoundries, Fraunhofer]

Peak 384 GDPflop/s per chiplet

- GF12, target 1GHz (typ)
- 2 AXI NoCs (multi-hierarchy)
  - 64-bit
  - 512-bit with "interleaved" mode
- Peripherals
- Linux-capable manager core CVA6
- 6 Quadrants: 216 cores/chiplet
  - 4 clusters / quadrant:
    - 8 compute + 1 DMA core / cluster
    - 1 multi-format FPU / core (FP64, x2 FP32, x4 FP16/alt, x8 FP8/alt)
- 8-channel HBM2e (8GB) 512GB/s
- D2D link (Wide, Narrow) 70+2GB/s
- System-level DMA
- SPM (2MB wide, 512KB narrow)
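
The core count and peak throughput quoted for the chiplet can be cross-checked from the hierarchy listed above. The sketch below does that arithmetic under two assumptions that are not stated on the slide: each compute core's FPU issues one fused multiply-add per cycle, counted as two floating-point operations, and the narrower SIMD formats scale throughput linearly with the lane count. All names are illustrative.

```python
# Figures taken from the chiplet description
QUADRANTS = 6
CLUSTERS_PER_QUADRANT = 4
COMPUTE_CORES_PER_CLUSTER = 8
DMA_CORES_PER_CLUSTER = 1
CLOCK_GHZ = 1.0            # target frequency (typ)
HBM_BANDWIDTH_GBPS = 512   # GB/s
SIMD_LANES = {"FP64": 1, "FP32": 2, "FP16": 4, "FP8": 8}

# Assumptions (not stated on the slide)
FMA_PER_CORE_PER_CYCLE = 1
FLOPS_PER_FMA = 2

clusters = QUADRANTS * CLUSTERS_PER_QUADRANT
total_cores = clusters * (COMPUTE_CORES_PER_CLUSTER + DMA_CORES_PER_CLUSTER)
compute_cores = clusters * COMPUTE_CORES_PER_CLUSTER


def peak_gflops(lanes: int) -> float:
    """Peak GFLOP/s per chiplet, assuming throughput scales with SIMD lanes."""
    return compute_cores * FMA_PER_CORE_PER_CYCLE * FLOPS_PER_FMA * lanes * CLOCK_GHZ


peak_fp64 = peak_gflops(SIMD_LANES["FP64"])
# Ratio of HBM bandwidth to peak FP64 throughput, in bytes per flop
bytes_per_flop = HBM_BANDWIDTH_GBPS / peak_fp64

print(f"cores per chiplet: {total_cores}")
print(f"peak FP64: {peak_fp64} GFLOP/s")
for fmt, lanes in SIMD_LANES.items():
    print(f"peak {fmt}: {peak_gflops(lanes)} GFLOP/s")
print(f"HBM bytes per FP64 flop: {bytes_per_flop:.2f}")
```

Under these assumptions the 6 x 4 x (8 + 1) hierarchy gives 216 cores, and the 192 compute cores at 1GHz give 384 GFLOP/s in FP64, matching the figures quoted for the chiplet.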



### Conclusion

- Efficient, RT, Safe, Secure: PE, Cluster, SoC, System
- Key ideas
  - Deep PE optimization  $\rightarrow$  extensible ISAs (RISC-V!)
  - Low-overhead work distribution. Latency hiding  $\rightarrow$  large "mempools"
  - Heterogeneous architecture  $\rightarrow$  host+accelerator(s)
- Game-changing technologies
  - "Commoditized" chiplets: 2.5D, 3D
  - Computing "at" memory (DRAM mempool)
  - Coming: optical IO and smart NICs, switches
- Challenges:
  - High performance RV Host
  - RV HPC software ecosystem?
  - Access to technology!





[RIKEN Matsuoka MODSIM22]



Luca Benini, Alessandro Capotondi, Alessandro Ottaviano, Alessio Burrello, Alfio Di Mauro, Andrea Borghesi, Andrea Cossettini, Andreas Kurth, Angelo Garofalo, Antonio Pullini, Arpan Prasad, Bjoern Forsberg, Corrado Bonfanti, Cristian Cioflan, Daniele Palossi, Davide Rossi, Fabio Montagna, Florian Glaser, Florian Zaruba, Francesco Conti, Georg Rutishauser, Germain Haugou, Gianna Paulin, Giuseppe Tagliavini, Hanna Müller, Luca Bertaccini, Luca Valente, Manuel Eggimann, Manuele Rusci, Marco Guermandi, Matheus Cavalcante, Matteo Perotti, Matteo Spallanzani, Michael Rogenmoser, Moritz Scherer, Moritz Schneider, Nazareno Bruschi, Nils Wistoff, Pasquale Davide Schiavone, Paul Scheffler, Philipp Mayer, Robert Balas, Samuel Riedel, Sergio Mazzola, Sergei Vostrikov, Simone Benatti, Stefan Mach, Thomas Benz, Thorir Ingolfsson, Tim Fischer, Victor Javier Kartsch Morinigo, Vlad Niculescu, Xiaying Wang, Yichao Zhang, Frank K. Gürkaynak, all our past collaborators and many more that we forgot to mention





http://pulp-platform.org

