# HERO: A Heterogeneous Research Platform to Explore HW/SW Codesign and RISC-V Manycore Accelerators Luca Bertaccini <a href="mailto:lbertaccini@iis.ee.ethz.ch">lbertaccini@iis.ee.ethz.ch</a> # **Heterogeneous Systems-on-Chip (HeSoCs)** #### **HOST** - General-purpose - Linux-capable - Versatility - Programmability ## **PMCA** - Parallel Manycore Accelerator (PMCA) - Domain-specialized - Energy-efficient #### **Domain Specialization & Heterogeneity** Energy efficiency challenges (post-Moore era) **Domain Specialization** Low versatility and programmability for highly specialized design **Heterogeneous Systems** **TH**Zürich #### **Industrial HeSoCs** Qualcomm Snapdragon 888 (mobile processor – 5nm) NVIDIA Grace (CPU for AI and HPC – 5nm) Picture from qualcomm.com Picture from apple.com Picture from nvidia.com **TH**zürich #### **HERO: Overview and Goals** #### **HERO = Heterogeneous Research Platform** User-Space Software HERO API OpenMP RTL Accel Lib Kernel-Space Software Accelerator Driver Linux Kernel H Hardware Enables research and development on heterogeneous computers: - Algorithms and Applications - Programming Models, Task Distribution, Scheduling - Manycore Architectures, Hardware Accelerators, Core Microarchitecture - Memory Organization, Communication, Synchronization Focus of this talk Host OpenMP RTL Virt Mem Mgmt Lib **Runtime Environment** Hardware Abstraction Lib Accelerator **ETH**Zürich #### **HERO: Hardware** # Integration Host-PMCA Support for efficient communication On-Chip Network System-Level Cache Off-Chip Mem Ctrl **Coherent Cache** Physically-Addressed Scratchpad Memory #### **HERO: Hybrid IOMMU** #### **Hybrid IOMMU** - To bridge the gap between the different memory systems - SW-controlled TLB - TLB prefetching - Shared Virtual Memory (SVM) accessible by DMA transfers without additional buffers in the IOMMU # **HERO: TLB misses handling** PMCA's **DMA** issues a transaction that generates a TLB miss **IOMMU** responds with an error and drops the transaction One of the PMCA cores walks the page table, adds the new TLB entry, and notifies the DMA PMCA's DMA re-issues the transaction whose TLB miss has been resolved Worker Threads (**WTs**) and Prefetching Helper Threads (**PHTs**). The cores are statically allocated to WTs or PHTs Worker Threads (**WTs**) and Prefetching Helper Threads (**PHTs**). The cores are statically allocated to WTs or PHTs The PHTs are automatically generated by the compiler which checks for SVM accesses in the code The PHTs are automatically generated by the compiler which checks for SVM accesses in the code WTs contain additional store instructions to the L1 SPM to share the execution state while PHTs contain additional load instructions The PHTs are automatically generated by the compiler which checks for SVM accesses in the code WTs contain additional store instructions to the L1 SPM to share the execution state while PHTs contain additional load instructions The prefetch method informs the TLB miss handlers that a TLB must be set up ahead of the moment when a worker requires the data on a page # **HERO: Software** # SW stack Efficient offloading # **HERO: Software Stack** **ETH** zürich #### **HERO: OpenMP support** #### **Copy-Based Shared Memory** • **Data is copied** to and from a physically contiguous, uncached section in main memory, and **physical pointers** are passed to the PMCA #### **Shared Virtual Memory** It enables zero-copy offloads, directly passing virtual pointers to the PMCA #### **HERO: Programming Model and API** Principle: single-source heterogeneous programming. Example: ``` void host_function(unsigned n elems, const float a, const float* x, float* y) { #pragma omp target map(to: n_elems, a, x, y) device(HERO_DEVICE_PULP) float buf_x[BUF_ELEMS], buf_y[BUF_ELEMS]; for (unsigned offset = 0; offset < n elems; offset += BUF ELEMS) {</pre> const unsigned cur n elems = min(n elems - offset, BUF ELEMS); const size t cur memcpy size = cur n elems * sizeof(float); hero_memcpy_host2dev(buf x, x+offset, cur memcpy size); hero_memcpy_host2dev(buf y, y+offset, cur memcpy size); hero_dblas_saxpy(cur n elems, a, buf x, buf y); hero_memcpy_dev2host(y+offset, buf_y, cur_memcpy_size); ``` #### **HERO: Heterogeneous Compilation** #### Single-source, single-binary heterogeneous compilation ... ... provides first-class support for heterogeneous programming! ## **HERO: Implementation and Results** # FPGA implementations Results #### **HERO on FPGA** HERO implementations have been deployed on different FPGA platforms: - Zynq UltraScale+ MPSoC ZCU102 (Xilinx) - Zynq ZC706 Evaluation Kit (Xilinx) - Juno development board (Arm) - Virtex UltraScale+ HBM VCU128 (Xilinx) Larger programmable logic available on the FPGA enables research on larger PMCAs or multi-cluster PMCAs. #### Results: Zero-Copy vs. Copy-Based The main motivation for shared virtual memory (SVM) is programmability. However, SVM can also significantly improve performance! PageRank (algorithm to analyze graph connectivity): The overhead of manipulating pointers at offload-time in the copy-based approach is higher than the overhead introduced by translating pointer with SVM **MemCopy** (Copy a large array from DRAM to PMCA and back) The host copy phase takes much more time than letting the PMCA access data directly with high-bandwidth DMA transfers Copy-Based SM # **HERO: Ongoing efforts** **ETH** zürich #### Conclusion - HERO is a research platform to explore HW/SW codesign of heterogeneous systems - HERO enables full-system exploration of RISC-V manycore accelators - HERO achieves efficient collaboration between host and PMCA through a shared virtual memory enabled by its hybrid IOMMU - OpenMP plugin allows for transparent accelerator programming # ETHzürich # Thank you for your attention!