

#### Efficient Parallelization of 5G-PUSCH on a Scalable RISC-V Many-Core Processor

Integrated Systems Laboratory (ETH Zürich)

Marco Bertuletti (mbertuletti@iis.ee.ethz.ch), Yichao Zhang (yiczhang@iis.ee.ethz.ch), Alessandro Vanelli-Coralli (avanelli@iis.ee.ethz.ch), Luca Benini (lbenini@iis.ee.ethz.ch)

**PULP Platform** 

Open Source Hardware, the way it should be!



@pulp\_platform
pulp-platform.org

#### Introduction


- 5G processing requires high throughput on high-dimensional signals
- Research on open RISC-V platforms ensures long-term scalability, speeds up community-developed solutions, and reduces vendor captivity

- Complexity evaluation of the 5G-PUSCH processing chain
- Implementation of key kernels on a RISC-V many-core cluster with low access latency
- Barriers for **partial synchronization** in the cluster
- Evaluation of **speed-up** and **utilization**





#### **PUSCH** processing

We receive frequency-multiplexed transmissions (symbols):

- On orthogonal subcarriers
- From multiple antennas
- 14 symbols in each Transmission Time Interval (0.5 ms)

(Pilot symbols are known at both the TX and the RX, and allow the channel to be reconstructed.)

ALMA MATER STUDIORUM

**ETH** zürich








#### PUSCH processing: Computational complexity

- A computational complexity analysis shows that most of the multiply-accumulate operations (MACs) are in the FFT, beamforming (BF), and MIMO stages
- We therefore focus on optimizing these stages



#### MemPool/TeraPool: our target many-core

#### Snitch processing core

- RV32IMA instruction set architecture + Xpulpimg
- Single-stage single-issue core + LSU & IPU (pipelined)





#### MemPool/TeraPool: NUMA architecture

#### Tiles are grouped in hierarchical levels



|                    | MemPool | TeraPool |
|--------------------|---------|----------|
| Cores per Tile     | 4       | 8        |
| Tiles per Group    | 16      | 16       |
| Groups per cluster | 4       | 8        |
| Total cores        | 256     | 1024     |

| Memory request                             | Latency  |
|--------------------------------------------|----------|
| Bank in the same Tile                      | 1 cycle  |
| Bank in a different Tile of the same Group | 3 cycles |
| Bank in a Tile of another Group            | 5 cycles |
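
The two tables can be read as a small model of the cluster. The sketch below is illustrative only: the names `CLUSTERS`, `total_cores`, and `access_latency` are hypothetical, and the latencies are simply the three values listed above, applied by comparing the Tile and Group of the requesting core with those of the target bank.

```python
# Cluster parameters copied from the table above.
CLUSTERS = {
    "MemPool": {"cores_per_tile": 4, "tiles_per_group": 16, "groups": 4},
    "TeraPool": {"cores_per_tile": 8, "tiles_per_group": 16, "groups": 8},
}


def total_cores(name):
    """Cores per Tile x Tiles per Group x Groups per cluster."""
    c = CLUSTERS[name]
    return c["cores_per_tile"] * c["tiles_per_group"] * c["groups"]


def access_latency(core_group, core_tile, bank_group, bank_tile):
    """Latency in cycles of one memory request, per the latency table."""
    if core_group != bank_group:
        return 5  # bank in a Tile of another Group
    if core_tile != bank_tile:
        return 3  # bank in a different Tile of the same Group
    return 1      # bank in the same Tile


print(total_cores("MemPool"), total_cores("TeraPool"))
```

The gap between 1 and 5 cycles is why the kernels in the following slides try to keep accesses inside the Tile of the requesting core.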



Cavalcante, Matheus, et al. "MemPool: A shared-L1 memory many-core cluster with a low-latency interconnect." **DATE 2021**.



#### Programming model

Fork-join programming model

- Serial execution forks to parallel execution
- Cores access memory concurrently
- Cores are synchronized and parallel execution joins to serial
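
The three steps above can be sketched with Python threads standing in for cores. This is not the MemPool runtime; `fork_join` and its arguments are hypothetical names, and the strided split of the data is just one possible work distribution.

```python
import threading


def fork_join(num_cores, data):
    """Sum `data` with a fork-join pattern: fork, concurrent reads, join."""
    partial = [0] * num_cores  # one slot per worker, so writes never collide

    def worker(core_id):
        # Every worker reads the shared input concurrently (strided slice).
        partial[core_id] = sum(data[core_id::num_cores])

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_cores)]
    for t in threads:   # fork: serial execution spawns the parallel section
        t.start()
    for t in threads:   # join: wait for every worker before continuing
        t.join()
    return sum(partial)  # serial execution resumes here


print(fork_join(4, list(range(100))))
```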





#### Synchronization barriers


- Arrival = atomic writes to a synchronization variable
- Hardwired wake-up triggers for departure
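
A minimal software sketch of this arrival/departure scheme, under stated assumptions: a lock-protected counter stands in for the atomic write, and a `threading.Event` stands in for the hardwired wake-up trigger. `CounterBarrier` and `run_partial_sync` are hypothetical names. Giving each subset of cores its own barrier illustrates partial synchronization: a core waits only for the cores in its own subset.

```python
import threading


class CounterBarrier:
    """Single-use barrier: arrivals bump a counter, the last arrival wakes all."""

    def __init__(self, parties):
        self.parties = parties
        self.count = 0
        self.lock = threading.Lock()   # stands in for an atomic add
        self.wake = threading.Event()  # stands in for the wake-up trigger

    def wait(self):
        with self.lock:                # arrival
            self.count += 1
            last = self.count == self.parties
        if last:
            self.wake.set()            # departure: release every waiter
        else:
            self.wake.wait()


def run_partial_sync(num_cores, subset_size):
    """Each subset of `subset_size` cores synchronizes on its own barrier.

    Assumes num_cores is a multiple of subset_size.
    """
    barriers = [CounterBarrier(subset_size) for _ in range(num_cores // subset_size)]
    stage1 = [0] * num_cores
    result = [0] * num_cores

    def worker(core_id):
        s = core_id // subset_size
        stage1[core_id] = core_id + 1   # first stage: produce a value
        barriers[s].wait()              # wait only for cores of the same subset
        lo = s * subset_size
        result[core_id] = sum(stage1[lo:lo + subset_size])  # second stage

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result


print(run_partial_sync(8, 4))
```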







#### Implemented kernels

To implement the most computationally complex PUSCH kernels:

- We enforced **local access** to the banks in a Tile, to avoid long latencies
- We limited **contention** for the shared memory-interconnect resources
- We kept **synchronization** to the bare minimum





#### Implemented kernels: FFT

The radix-4 butterfly gets inputs at distance N/4
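
To make the N/4 spacing concrete, here is a floating-point sketch of a recursive radix-4 decimation-in-frequency FFT. It is not the implementation evaluated in these slides: `fft_radix4` is a hypothetical name, the input length is assumed to be a power of 4, and no folding or parallelization is modeled. Each butterfly reads `x[i]`, `x[i + N/4]`, `x[i + N/2]`, and `x[i + 3N/4]`.

```python
import cmath


def fft_radix4(x):
    """Radix-4 decimation-in-frequency FFT; len(x) must be a power of 4."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 4 != 0:
        raise ValueError("length must be a power of 4")
    q = n // 4
    y = [[0j] * q for _ in range(4)]
    for i in range(q):
        # The four butterfly inputs sit N/4 apart in the input array.
        a, b, c, d = x[i], x[i + q], x[i + 2 * q], x[i + 3 * q]
        t0, t1 = a + c, a - c
        t2, t3 = b + d, -1j * (b - d)
        w = cmath.exp(-2j * cmath.pi * i / n)  # twiddle factor W_N^i
        y[0][i] = t0 + t2
        y[1][i] = (t1 + t3) * w
        y[2][i] = (t0 - t2) * w ** 2
        y[3][i] = (t1 - t3) * w ** 3
    out = [0j] * n
    for r in range(4):
        sub = fft_radix4(y[r])       # N/4-point FFT of each butterfly output
        for k in range(q):
            out[4 * k + r] = sub[k]  # output bin 4k + r
    return out


spectrum = fft_radix4([1, 0, 0, 0])  # unit impulse: every bin equals 1
print(spectrum)
```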



Data is folded to keep these accesses local





Store access pattern



Data stored in the local memory of cores using it in the subsequent stage






4 cores work on each 64-point FFT $\rightarrow$ we synchronize only these cores (partial synchronization)

Independent FFTs can be run in sequence by the same cores before synchronization





#### Implemented kernels: Matrix-Matrix Multiplication

- 4x4 output window maximizes the use of the RF in Snitch
- Parallel version is optimized to avoid contentions
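
The 4x4 output window can be sketched as a single-core loop nest. This is an illustration under assumptions, not the MemPool kernel: `matmul_4x4_window` is a hypothetical name, and both output dimensions are assumed to be multiples of 4. For each step along the inner dimension, 4 values of A and 4 values of B are loaded once and reused for 16 multiply-accumulates, so the 16 accumulators can stay in registers.

```python
def matmul_4x4_window(A, B):
    """C = A x B, computed one 4x4 output window at a time."""
    m, k, n = len(A), len(B), len(B[0])
    if m % 4 != 0 or n % 4 != 0:
        raise ValueError("output dimensions must be multiples of 4")
    C = [[0] * n for _ in range(m)]
    for i0 in range(0, m, 4):
        for j0 in range(0, n, 4):
            acc = [[0] * 4 for _ in range(4)]  # 16 accumulators
            for p in range(k):
                a = [A[i0 + i][p] for i in range(4)]  # 4 loads from A
                b = B[p][j0:j0 + 4]                   # 4 loads from B
                for i in range(4):
                    for j in range(4):
                        acc[i][j] += a[i] * b[j]      # 16 MACs per 8 loads
            for i in range(4):
                C[i0 + i][j0:j0 + 4] = acc[i]
    return C


A = [[1, 2], [3, 4], [5, 6], [7, 8]]
B = [[1, 0, 2, 0], [0, 1, 0, 2]]
print(matmul_4x4_window(A, B))
```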

Each core is assigned 4 rows of A

Cores in the same Tile shift their accesses to avoid hitting the same Group






Cores are assigned columns of B to compute the output windows



17-19 April 2023

#### Implemented kernels: Cholesky Decomposition

- The output matrix is computed column by column
- At each iteration, cores access different rows in parallel → rows are folded in the local memory
- Two mirrored matrices are computed at a time by the same core, to increase utilization
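
A column-by-column Cholesky decomposition can be sketched as follows. This is a serial, real-valued illustration: `cholesky_columnwise` is a hypothetical name, the input is assumed to be symmetric positive definite, and the folding and mirrored-matrix scheduling described above are not modeled. Within one column, each row below the diagonal is computed independently, which is what allows different rows to be processed in parallel.

```python
import math


def cholesky_columnwise(A):
    """Return lower-triangular L with L * L^T = A, built one column at a time."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal element of column j.
        s = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(s)
        # Rows below the diagonal are independent of each other.
        for i in range(j + 1, n):
            dot = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (A[i][j] - dot) / L[j][j]
    return L


A = [[4, 12, -16], [12, 37, -43], [-16, -43, 98]]
print(cholesky_columnwise(A))
```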






Figure: fraction of total cycles (0.0 to 1.0) on MemPool and TeraPool for three benchmarks: FFT (4096 points; 16 independent FFTs run between barriers), MMM (input 1: 4096x64, input 2: 64x32), and Cholesky (4x4 matrices; 16 independent decompositions run between barriers).




- TeraPool scales well compared to MemPool (overhead = synchronization)
- LSU stalls are reduced to less than 10% of the total execution time



#### High IPC is obtained on all benchmarks

#### Quasi-ideal speed-up and low latency





Use case: 4096 subcarriers, 64 antennas, 32 beams and 4 UEs on the same subcarrier

- The three benchmarks sum up to 0.785 ms @ 1 GHz
- Further improvement from architecture specialization



#### Conclusions


- Identified the most computationally complex kernels in the PUSCH lower PHY
- Introduced partial synchronization between cores of the cluster
- Reduced the LSU stalls to less than 10% of the execution time
- Achieved high speed-up and utilization $\rightarrow$ 0.785 ms execution time @ 1 GHz

github.com/pulp-platform/mempool


Marco Bertuletti, mbertuletti@iis.ee.ethz.ch, ETZ, Gloriastrasse 35, 8092 Zürich. @pulp\_platform @MarcoBertuletti




# Thank you!





