OpenPiton + Ariane 🚀 : The First Linux-Booting Open-Source RISC-V Manycore

Jonathan Balkind, Michael Schaffner
Princeton University, ETH Zurich
Who are we?

• Jonathan Balkind
  • Lead architect of OpenPiton

• OpenPiton Team
  • Led by Prof. David Wentzlaff
  • Princeton Parallel Research Group
  • Open source HW since 2015
  • 13 PhD students
  • 1 Postdoc
  • N undergraduates

• Michael Schaffner
  • Responsible for OpenPiton+Ariane integration

• PULP Team
  • Led by Prof. Luca Benini
  • ETHZ / Università di Bologna
  • Open source HW since 2013
  • Leaders in RISC-V development
  • Ariane dev: Florian Zaruba, Michael Schaffner and others
This material is based on research sponsored by the NSF under Grants No. CNS-1823222, CCF-1217553, CCF-1453112, CCF-1823032, and CCF-1438980, AFOSR under Grant No. FA9550-14-1-0148, Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) under agreement No. FA8650-18-2-7846, FA8650-18-2-7852, and FA8650-18-2-7862, and DARPA under Grants No. N66001-14-1-4040 and HR0011-13-2-0005. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA), the NSF, AFOSR, or the U.S. Government.
Project Overview

• Collaboration between Princeton University and ETH Zurich

• Goal is to develop a permissively licensed, Linux capable manycore research platform based on RISC-V
  • Based on mature, extensible designs
  • Booted SMP Linux in <6 months
  • The world's first open-source, Linux-booting, RISC-V manycore

• Ariane
  • RV64GC Core (with extensions)
  • Linux capable

• OpenPiton
  • Manycore research platform
  • Distributed cache coherence and NoC
Ariane RV64GC Core

• Application class processor
  • Written in SystemVerilog
• Linux Capable
  • Tightly integrated D$ and I$
  • M, S and U privilege modes
  • TLB, SV39
  • Hardware PTW
• Optimized for performance
  • Frequency: 1.5 GHz (22 FDX)
  • Area: ~ 175 kGE
  • Critical path: ~ 25 logic levels
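
For readers unfamiliar with SV39: the RISC-V privileged specification defines it as a 39-bit virtual address made of three 9-bit virtual page numbers and a 12-bit page offset, which the hardware PTW walks level by level. The sketch below only shows that address split. The function name `split_sv39` is illustrative and is not taken from the Ariane sources, and the function expects the low 39 bits of the address (the sign-extended upper bits are assumed to be stripped already).

```python
def split_sv39(vaddr: int) -> dict:
    """Split a 39-bit Sv39 virtual address into its page-table indices.

    Layout per the RISC-V privileged spec:
    VPN[2] = bits 38..30, VPN[1] = bits 29..21, VPN[0] = bits 20..12,
    page offset = bits 11..0.
    """
    if not 0 <= vaddr < (1 << 39):
        raise ValueError("expected the low 39 bits of a virtual address")
    return {
        "vpn2": (vaddr >> 30) & 0x1FF,  # index into the root page table
        "vpn1": (vaddr >> 21) & 0x1FF,  # index into the second-level table
        "vpn0": (vaddr >> 12) & 0x1FF,  # index into the leaf table
        "offset": vaddr & 0xFFF,        # byte offset within the 4 KiB page
    }


fields = split_sv39(0x40201123)
# fields == {"vpn2": 1, "vpn1": 1, "vpn0": 1, "offset": 0x123}
```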

• 6-stage pipeline
  • In-order (single) issue
  • Out-of-order write-back
  • In-order commit
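
The three pipeline properties above can be illustrated with a toy timing model. This is a sketch under strong assumptions, not a model of Ariane itself: one instruction issues per cycle, there are no data or structural hazards, and the per-instruction latencies are hypothetical inputs.

```python
def simulate(latencies):
    """Return (writeback_order, commit_order) as lists of instruction indices.

    latencies[i] is a hypothetical execution latency, in cycles, of
    instruction i. Hazards and commit-port limits are ignored.
    """
    n = len(latencies)
    # In-order single issue: instruction i issues in cycle i.
    done_cycle = [i + lat for i, lat in enumerate(latencies)]
    # Out-of-order write-back: results arrive as soon as execution finishes.
    writeback_order = sorted(range(n), key=lambda i: (done_cycle[i], i))
    # In-order commit: an instruction retires only after all older ones have.
    commit_cycle, last = [], 0
    for i in range(n):
        last = max(last, done_cycle[i])
        commit_cycle.append(last)
    commit_order = sorted(range(n), key=lambda i: (commit_cycle[i], i))
    return writeback_order, commit_order


# A slow instruction followed by two fast ones.
wb, commit = simulate([4, 1, 1])
# wb == [1, 2, 0]: the fast instructions write back first.
# commit == [0, 1, 2]: architectural state still updates in program order.
```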

• Scoreboarding

• Designed for extensibility

• Branch-prediction
  • Return Address Stack (RAS)
  • Branch Target Buffer (BTB)
  • Branch History Table (BHT)
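
As background on the BHT bullet: a common textbook organization is a table of 2-bit saturating counters indexed by low PC bits. The slides do not state which scheme or table size Ariane uses, so the class below, its 128-entry default, and its indexing are assumptions chosen purely for illustration.

```python
class BranchHistoryTable:
    """Textbook BHT with 2-bit saturating counters (0..3); >= 2 predicts taken."""

    def __init__(self, entries: int = 128):  # table size is an assumption
        self.entries = entries
        self.counters = [1] * entries  # start in "weakly not taken"

    def _index(self, pc: int) -> int:
        # Drop bit 0: with the C extension, instructions are 2-byte aligned.
        return (pc >> 1) % self.entries

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


bht = BranchHistoryTable()
pc = 0x80000010
bht.update(pc, taken=True)   # counter 1 -> 2
bht.update(pc, taken=True)   # counter 2 -> 3
bht.update(pc, taken=False)  # counter 3 -> 2, still predicts taken
```

The two-bit hysteresis means a single mispredicted loop exit does not flip the prediction for the next loop entry.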
Peeking inside...
Silicon Proven Designs: Ariane

- **Ariane** taped-out in **GlobalFoundries 22nm FDX** twice
- 16kB instruction and 32kB data caches

**Poseidon:**
- Area: 0.23 mm², 175 kGE
- 0.2–1.7 GHz (0.5–1.15 V)

**Kosmodrom:**
- RV64GCXsmallFloat, Transprecision / Vector FPU
- **Ariane HP**
  - 8T library, 0.8V, 1.3 GHz
  - 55 mW @ 1 GHz
- **Ariane LP**
  - 7.5T ULP library, 0.5V, 250 MHz
  - 5 mW @ 200 MHz
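
The two Kosmodrom power figures can be compared on an energy-per-cycle basis by dividing power by clock frequency. The helper below is a small worked calculation using only the numbers on this slide; the function name is illustrative.

```python
def energy_per_cycle_pj(power_mw: float, freq_mhz: float) -> float:
    """Energy per clock cycle in picojoules (mW / MHz = nJ, times 1000 = pJ)."""
    return power_mw * 1000.0 / freq_mhz


ariane_hp = energy_per_cycle_pj(55, 1000)  # 55 mW @ 1 GHz   -> 55 pJ per cycle
ariane_lp = energy_per_cycle_pj(5, 200)    # 5 mW @ 200 MHz  -> 25 pJ per cycle
```

On these figures the LP variant spends a bit less than half the energy per cycle of the HP variant, at one fifth of the clock rate.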
OpenPiton

- Open source manycore
- Written in Verilog RTL
- Scales to ½ billion cores
- Configurable core, uncore
- Simulation in VCS, ModelSim, Incisive, Verilator, Icarus
- Includes synthesis and back-end flow
- ASIC & FPGA verified
- ASIC power and energy fully characterized [HPCA 2018]
- Runs full stack multi-user Debian Linux
- Used for Architecture, Programming Language, Compilers, Operating Systems, Security, EDA research
System Overview

Chipset

P-Mesh Off-Chip Routers (3)

P-Mesh Chipset Crossbars (3)

Chip Bridge

Chip

[Diagram showing a system overview with Chip and Chipset, including elements like Chip Bridge, P-Mesh Off-Chip Routers (3), P-Mesh Chipset Crossbars (3), AXI I/O, DRAM, and Wishbone SDHC.]
OpenPiton Tile

L2 Cache Slice + Directory Cache

P-Mesh Routers (3)

Modified OpenSPARC T1 Core

MITTS (Traffic Shaper)

L1.5 Cache

CCX Arbiter

FPU

To Other Tiles
Silicon Proven Designs: Piton

• 25-core
  • 2 Threads per core
  • Modified 64 bit OpenSPARC T1 Core

• 3 NoCs (P-Mesh)
  • 64 bit, 2D Mesh
  • Extend off-chip enabling multichip systems

• Directory-Based Cache System
  • 64kB L2 Cache per core (Shared)
  • 8kB L1.5 & L1 Data Caches
  • 16kB L1 Instruction Cache

• IBM 32nm SOI Process
  • 6mm x 6mm
  • 460 Million Transistors - Among largest chips built in academia

• Target: 1 GHz Clock @ 900 mV
• Received silicon and runs full-stack Debian in lab
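
To give a feel for distances on a 2D mesh like the one above, here is a minimal sketch of dimension-ordered (X first, then Y) routing. Two assumptions are made that the slide does not state: that the 25 cores are arranged as a 5 × 5 grid, and that routing is dimension-ordered. The function name `xy_route` is illustrative.

```python
def xy_route(src, dst):
    """Return the list of (x, y) tiles visited from src to dst, X first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:            # travel along the X dimension first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:            # then along the Y dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


path = xy_route((0, 0), (2, 1))
# path == [(0, 0), (1, 0), (2, 0), (2, 1)]

# Corner to corner on an assumed 5 x 5 grid: 8 hops (the Manhattan distance).
worst_case_hops = len(xy_route((0, 0), (4, 4))) - 1
```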
OpenPiton+Ariane

[Diagram showing the architecture of OpenPiton with Ariane and P-Mesh components, including blocks such as L1S, L1DS Adapter, L2, Traffic Shaper, and NoC Routers, plus several through connections labeled NIC 1, NIC 2, NIC 3, and others. The Chipset on the right includes Bootrom, Debug Module, CLINT, PLIC, UART, Ethernet, SD, and DRAM Ctrl.]
OpenPiton+Ariane

- New write-through cache subsystem with invalidations and the TRI interface
- LR/SC in L1.5 cache
- Fetch-and-op in L2 cache
- RISC-V Debug
- RISC-V Peripherals
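
For context on the LR/SC bullet, the following toy model shows the load-reserved / store-conditional semantics that RISC-V software relies on: a store-conditional succeeds only if the hart still holds a reservation on that address. It is a behavioral sketch only. The class and method names are hypothetical, reservations are tracked per address rather than per cache line, and nothing here describes how the L1.5 cache implements the feature.

```python
class ReservationModel:
    """Toy LR/SC model: one reservation per hart, broken by other harts' writes."""

    def __init__(self):
        self.mem = {}          # address -> value
        self.reservation = {}  # hart id -> reserved address

    def load_reserved(self, hart: int, addr: int) -> int:
        self.reservation[hart] = addr
        return self.mem.get(addr, 0)

    def store(self, hart: int, addr: int, value: int) -> None:
        self.mem[addr] = value
        # A write to an address breaks reservations held by other harts.
        for h, a in list(self.reservation.items()):
            if a == addr and h != hart:
                del self.reservation[h]

    def store_conditional(self, hart: int, addr: int, value: int) -> int:
        """Return 0 on success and 1 on failure (RISC-V: nonzero means failure)."""
        ok = self.reservation.pop(hart, None) == addr
        if ok:
            self.store(hart, addr, value)
        return 0 if ok else 1


m = ReservationModel()
m.load_reserved(0, 0x100)
m.store(1, 0x100, 7)                      # hart 1 writes in between
status = m.store_conditional(0, 0x100, 6)
# status == 1: hart 0 must retry its LR/SC sequence.
```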
Configurability Options

<table>
<thead>
<tr>
<th>Component</th>
<th>Configurability Options</th>
</tr>
</thead>
<tbody>
<tr>
<td>Cores (per chip)</td>
<td>Up to 65,536</td>
</tr>
<tr>
<td>Cores (per system)</td>
<td>Up to 500 million</td>
</tr>
<tr>
<td>Core Type</td>
<td>OpenSPARC T1</td>
</tr>
<tr>
<td></td>
<td>Ariane 64 bit RISC-V</td>
</tr>
<tr>
<td>Threads per Core</td>
<td>1/2/4</td>
</tr>
<tr>
<td>Floating-Point Unit</td>
<td>FP64, FP32</td>
</tr>
<tr>
<td></td>
<td>FP64, FP32, FP16, FP8, BFLOAT16</td>
</tr>
<tr>
<td>TLBs</td>
<td>8/16/32/64 entries</td>
</tr>
<tr>
<td></td>
<td>Number of entries (16 entries)</td>
</tr>
<tr>
<td>L1 I-Cache</td>
<td>Number of Sets, Ways (16kB, 4-way)</td>
</tr>
<tr>
<td>L1 D-Cache</td>
<td>Number of Sets, Ways (8kB, 4-way)</td>
</tr>
<tr>
<td>L1.5 Cache</td>
<td>Number of Sets, Ways (8kB, 4-way)</td>
</tr>
<tr>
<td>L2 Cache</td>
<td>Number of Sets, Ways (64kB, 4-way)</td>
</tr>
<tr>
<td>Intra-chip Topologies</td>
<td>2D Mesh, Crossbar</td>
</tr>
<tr>
<td>Inter-chip Topologies</td>
<td>2D Mesh, 3D Mesh, Crossbar, Butterfly Network</td>
</tr>
<tr>
<td>Bootloading</td>
<td>SD/SDHC Card, UART, RISC-V JTAG Debug</td>
</tr>
</tbody>
</table>
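
The two core-count limits in the table can be related with a little arithmetic. The sketch below assumes a square mesh at the per-chip maximum and treats "500 million" as a rounded figure. The helper names are illustrative, and the bit widths it derives are implied by the numbers, not quoted from the OpenPiton documentation.

```python
import math


def mesh_side(cores_per_chip: int) -> int:
    """Side length of a square mesh holding exactly cores_per_chip tiles."""
    side = math.isqrt(cores_per_chip)
    if side * side != cores_per_chip:
        raise ValueError("core count is not a perfect square")
    return side


def chips_needed(total_cores: int, cores_per_chip: int) -> int:
    """Smallest number of chips that reaches total_cores (ceiling division)."""
    return -(-total_cores // cores_per_chip)


side = mesh_side(65_536)                     # 256 x 256 tiles
coord_bits = (side - 1).bit_length()         # 8 bits per X or Y coordinate
chips = chips_needed(500_000_000, 65_536)    # 7630 chips
chip_id_bits = (chips - 1).bit_length()      # 13 bits to number those chips
cores_at_full_id_space = (1 << chip_id_bits) * 65_536  # 536,870,912
```

Under these assumptions a 13-bit chip ID combined with 8-bit X and Y coordinates addresses 2^29 cores, which is consistent with the "½ billion" figure quoted earlier in the deck.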
FPGA Prototyping Platforms

Available:

- Digilent Genesys2
  - $999 ($600 academic)
  - 1-2 cores at 66MHz
- Xilinx VC707
  - $3500
  - 1-4 cores at 60MHz
- Digilent Nexys Video
  - $500 ($250 academic)
  - 1 core at 30MHz

In progress:

- Xilinx VCU118, BittWare XUPP3R
  - $7000-8000
  - >100MHz
- Amazon AWS F1
  - Rent by the hour
Boot SMP Linux Today!

• Clone from:
  • [https://github.com/PrincetonUniversity/openpiton](https://github.com/PrincetonUniversity/openpiton)
  • Simulation with ModelSim, VCS, Verilator

• Prebuilt bitfiles and Linux image available
  • Play Tetris, browse the web!

• Roadmap:
  • OpenSBI, U-Boot (?), Debian/Fedora distro
  • Simulation extensions (RV Torture, litmus, etc)
  • Performance enhancements (TLBs, mem. IF, multi-issue)
  • Tapeouts!
Upcoming Events / Papers

• Hands-on workshop @WOSH this Thursday afternoon (13:30 – 18:00, ETZ D61.1)  
  http://openpiton.org/WOSH19_tutorial.html

• Hands-on workshop @ISCA on Sunday afternoon  
  (June 23 14:00, Phoenix, Arizona, USA) 
  http://openpiton.org/ISCA19_tutorial.html

• Talk @CARRV on Saturday morning  
  (June 22 09:00, Phoenix, Arizona, USA) 
  Paper with more details: https://carrv.github.io
QUESTIONS?

@OpenPiton
http://openpiton.org

@pulp_platform
http://pulp-platform.org
<table>
<thead>
<tr>
<th>Board Name / FPGA Type</th>
<th>Clock [MHz]</th>
<th>Config X × Y</th>
<th>Core Type</th>
<th>FPU</th>
<th>LUTs [k] (71%)</th>
<th>Registers [k] (27%)</th>
<th>RAM Tiles [#] (18%)</th>
<th>DSPs [#] (2%)</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3">Digilent Nexys Video<br>Artix 7<br>7a200tsbg484</td>
<td>30</td>
<td>1 × 1</td>
<td>Ariane</td>
<td>no</td>
<td>95</td>
<td>72</td>
<td>66</td>
<td>16</td>
</tr>
<tr>
<td>30</td>
<td>1 × 1</td>
<td>Ariane</td>
<td>yes</td>
<td>110</td>
<td>75</td>
<td>66</td>
<td>27</td>
</tr>
<tr>
<td>30</td>
<td>1 × 1</td>
<td>OpenSPARCT1</td>
<td>yes</td>
<td>115</td>
<td>96</td>
<td>59</td>
<td>13</td>
</tr>
<tr>
<td rowspan="4">Digilent Genesys2<br>Kintex 7<br>7k325tffg900-2</td>
<td>67</td>
<td>1 × 1</td>
<td>Ariane</td>
<td>no</td>
<td>86</td>
<td>72</td>
<td>66</td>
<td>16</td>
</tr>
<tr>
<td>67</td>
<td>1 × 1</td>
<td>Ariane</td>
<td>yes</td>
<td>99</td>
<td>75</td>
<td>66</td>
<td>27</td>
</tr>
<tr>
<td>67</td>
<td>2 × 1</td>
<td>Ariane</td>
<td>no</td>
<td>141</td>
<td>113</td>
<td>124</td>
<td>16</td>
</tr>
<tr>
<td>67</td>
<td>2 × 1</td>
<td>OpenSPARCT1</td>
<td>yes</td>
<td>167</td>
<td>120</td>
<td>124</td>
<td>54</td>
</tr>
<tr>
<td rowspan="5">Xilinx VC707<br>Virtex 7<br>7vx485tffg1761-2</td>
<td>60</td>
<td>1 × 1</td>
<td>Ariane</td>
<td>no</td>
<td>99</td>
<td>73</td>
<td>63</td>
<td>16</td>
</tr>
<tr>
<td>60</td>
<td>1 × 1</td>
<td>Ariane</td>
<td>yes</td>
<td>114</td>
<td>77</td>
<td>63</td>
<td>27</td>
</tr>
<tr>
<td>60</td>
<td>2 × 2</td>
<td>Ariane</td>
<td>no</td>
<td>284.1</td>
<td>202</td>
<td>237</td>
<td>64</td>
</tr>
<tr>
<td>60</td>
<td>3 × 1</td>
<td>Ariane</td>
<td>yes</td>
<td>268</td>
<td>169</td>
<td>179</td>
<td>81</td>
</tr>
<tr>
<td>60</td>
<td>3 × 1</td>
<td>OpenSPARCT1</td>
<td>yes</td>
<td>255</td>
<td>208</td>
<td>158</td>
<td>48</td>
</tr>
<tr>
<td rowspan="3">Xilinx VCU118<br>Virtex US+<br>xcvu9p-flga2104-2L</td>
<td>100</td>
<td>1 × 1</td>
<td>Ariane</td>
<td>no</td>
<td>90</td>
<td>81</td>
<td>88</td>
<td>19</td>
</tr>
<tr>
<td>100</td>
<td>4 × 4</td>
<td>Ariane</td>
<td>no</td>
<td>923</td>
<td>704</td>
<td>963</td>
<td>259</td>
</tr>
<tr>
<td>100</td>
<td>4 × 2</td>
<td>Ariane</td>
<td>yes</td>
<td>583</td>
<td>399</td>
<td>495</td>
<td>219</td>
</tr>
</tbody>
</table>

† Without Coherence Domain Restriction [8] in caches.
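
One way to read the table above is to estimate how many LUTs each additional tile costs. The sketch below does a two-point linear fit (fixed cost plus per-tile cost) using only LUT counts copied from the table for Ariane without FPU. The linear model and the function name are assumptions made for illustration, and the results are rough estimates, not measured per-tile figures.

```python
def fit_tile_cost(small, large):
    """Two-point linear fit: kLUTs = fixed + per_tile * tiles.

    small and large are (tiles, kLUTs) pairs for the same board and core
    configuration. Returns (per_tile, fixed), both in kLUTs.
    """
    (n0, l0), (n1, l1) = small, large
    per_tile = (l1 - l0) / (n1 - n0)
    fixed = l0 - per_tile * n0
    return per_tile, fixed


# Ariane without FPU, LUT counts in thousands from the table above.
genesys2 = fit_tile_cost((1, 86), (2, 141))   # 1 x 1 vs. 2 x 1
vcu118 = fit_tile_cost((1, 90), (16, 923))    # 1 x 1 vs. 4 x 4
```

Both fits give roughly 55 kLUTs per added tile, with the remaining 30 to 35 kLUTs attributable to logic that does not scale with tile count under this model.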