



ALMA MATER STUDIORUM UNIVERSITÀ DI BOLOGNA

# Designing Real SoCs using open EDA tools

Integrated Systems Laboratory (ETH Zürich)

#### Luca Benini

lbenini@iis.ee.ethz.ch, luca.Benini@unibo.it

**PULP Platform** Open Source Hardware, the way it should be!



@pulp\_platform pulp-platform.org 📲 youtube.com/pulp platform



### In 11 years PULP team has designed more than 60 chips



**RISC-V** and open-source hardware have been instrumental in our success



MC65

# A simplified view of the IC design flow





# Most of open source hardware is at RTL level





#### We have created a sandbox to design System on Chips





UNIVERSITÀ DI BOLOGNA

# We make everything (we can) available openly

- All our development is on GitHub using a permissive license
  - HDL source code, testbenches, software development kit, virtual platform

#### https://github.com/pulp-platform



• Allows anyone to use, change, and make products without restrictions.



LMA MATER STUDIORUM

JNIVERSITÀ DI BOLOGN/

**ETH** zürich

#### Heterogeneous Research Platform (HERO)

HERO is an **FPGA-based research platform** that enables accurate and fast exploration of **heterogeneous computers** consisting of **programmable many-core accelerators** and an **application-class host CPU**. Currently, 32-bit RISC-V cores are supported in the accelerator and 64-bit ARMv8 or RISC-V cores as host CPU. HERO allows to **seamlessly share data between host and accelerator** through a unified heterogeneous programming interface based on OpenMP 4.5 and a mixed-data-model, mixed-ISA heterogeneous compiler based on LLVM.

HERO's hardware architecture, shown below, combines a general-purpose host CPU (in the upper left corner) with a domain-specific programmable many-core accelerator (on the right side) so that data in the main memory (in the lower left corner) can be shared effectively.





# Diverse set of open source based industry collaborations

#### P b p

#### GF22 (2018) **Arnold**

eFPGA coupled with a RISC-V microcontroller.

In one year from agreement to actual tapeout



#### GF22 (2022) **Marsellus**

*Heterogeneous IoT processor With Aggressive voltage scaling* 



Permissive open-source licensing key to our industrial relationships

#### Siracusa

SoC for Extended Reality visual processing

# 🔿 Meta

**ETH** zürich



#### Carfield

*Open-Research platform for safety, resilient and time-predictable systems* 





### And many continue to use our work for their research





# Unlocking the rest of the design flow









3<sup>rd</sup> party IP when included can limit what can be open sourced







Fabs do not make PDK information accessible

# Open PDKs are a key enabler for further development







#### The output (and even scripts) of EDA vendors are closed EUROPRACTICE Fab **3rd Party IP** Library PDK Netlist RTL GDS2 Chip Verification **Synthesizer** P&R

#### EDA vendors limit the output of their tools

#### Open-source community can develop EDA tools too! EUROPRACTICE Fab **3rd Party IP** Library PDK Netlist Chip RTL GDS2 Verification Synthesizer P&R



#### **Open-Source synthesis flow: Yosys**



#### Elaboration

- Behavioral RTL to connected cells (structural)
- High-level phase
  - Cells are arithmetic operations
  - Fuse and transform operations





#### **Open-Source synthesis flow: Yosys**

- Elaboration
  - Behavioral RTL to connected cells (structural)
- High-level phase
  - Cells are arithmetic operations
  - Fuse and transform operations
- Generic gate phase
  - Abstract standard cell library
  - Gate-level optimizations





### **Open-Source synthesis flow: Yosys**

- Elaboration
  - Behavioral RTL to connected cells (structural)
- High-level phase
  - Cells are arithmetic operations
  - Fuse and transform operations
- Generic gate phase
  - Abstract standard cell library
  - Gate-level optimizations
- Technology mapping

**ETH** zürich

- Performed in included tool called ABC
- High-performance logic optimization
- Mapping to standard cell library



### Yosys is structured, documented and maintained

#### Clear structure

- 'Passes' operate on current representation
- Each pass is a file in a category (directory)
- Guides for users and developers
  - Starts with simple 'how to use'
  - Ends with 'how do I implement a pass'
- Regular contributors
  - YosysHQ employs developers
  - Other stakeholders also contribute often





#### But, what about handling practical (>MG designs) with acceptable QoR?



### Using SystemVerilog with Yosys

- SystemVerilog is widespread
  - PULP-Platform: Cheshire
  - lowRISC: OpenTitan and its IPs
  - OpenHW Group: CVA6, CV32E40P
  - BlackParrot RISC-V core
  - Most industrial IPs are implemented in SV





# Using SystemVerilog with Yosys

- SystemVerilog is widespread
  - PULP-Platform: Cheshire
  - lowRISC: OpenTitan and its IPs
  - OpenHW Group: CVA6, CV32E40P
  - BlackParrot RISC-V core
  - Most industrial IPs are implemented in SV
- In the past:
  - Pre-process the RTL
  - SVase simplifies SystemVerilog
  - SV2V converts to Verilog



ETH zürich

### Using SystemVerilog with Yosys

- SystemVerilog is widespread
  - PULP-Platform: Cheshire
  - lowRISC: OpenTitan and its IPs
  - OpenHW Group: CVA6, CV32E40P
  - BlackParrot RISC-V core
  - Most industrial IPs are implemented
  - - Pre-process the vosys-slang
  - SystemVerilog Frontend Svase simplifies







# Yosys-Slang is improving quickly

- Developed by Martin Povišer
  - Yosys developer and freelancer
  - Based on Slang (SystemVerilog library)
  - Started in June
  - First release in November
- Compatibility testing
  - See repo below
  - If you have a design, consider adding it
  - They are used for regression testing

- Fully elaborates (# signals)
  - Black Parrot (132k)
  - BSC Core Tile (62k)
  - CV32E40P (16k)
  - CVA6 (90k)
  - Snitch Cluster (800k)
  - Ara (1.8M)
  - Mempool
  - ITA
  - OpenTitan IPs

github.com/povik/yosys-slang-compat-suite





#### **Yosys-Slang elaborates better**

Cheshire synthesis, memory usage over time



#### Distribution starting now

- Release Candidate
  - First release end of October / early November
- OSS-CAD Suite (Yosys)
  - Will be included as a Yosys plugin
- IIC-OSIC-Tools docker container
  - Included in next release
  - Yosys plugin (loaded by default)
  - ETHZ started using this container
    - Croc (educational SoC) already uses it
    - Cheshire-IHP130-o will be updated shortly







github.com/iic-jku/iic-osic-tools



# Library of Arithmetic Unit (LAU)

- Block replacement is implemented in Yosys
  - Only used in FPGA designs to infer DSP slices
  - Detects and replaces arithmetic operators
- No open-source LAU
  - LAU created at IIS as part of a PhD thesis
    - A wide range of arithmetic operations
    - 3 different performance variants of generic gate netlists
    - Thoroughly QoR evaluated and optimized
  - Currently: fully manual selection
    - Replace Yosys operators via script
    - Instantiate directly in RTL



#### **Optimized ABC Scripts and Better Integration**

- Utilizing "Lazy Man's Synthesis" by Yang et al.
  - Pre-computation of optimal 6-input graphs
  - Used as look-up-table in a mapping step
  - Repeated application with graph re-writing and balancing in between
  - High QoR at an affordable runtime
- Improved ABC integration in Yosys
  - Improved ABC command sequence
  - Different ABC scripts: span larger design space
  - ABC requires defining slew and gain parameters to use PDK timings
    - Otherwise it uses unit-delays







#### **Our Improved ABC Script**

- Optimization loop
  - Repeated 20 times
  - LMS with 6-input and 4-input LUTs
  - Structural choices computation
- Mapping loop
  - Convergence: 5-6 iterations
  - Optimization and mapping sequence
  - Technology mapping
- Buffer optimization and resizing

strash &get -n &st; &if -y -K 6; Loop 20 &syn2; &b; &st; &opt iter &dch -x; &if -K 4; loopend Loop 5 &opt iter; &map\_iter; &opt\_iter loopend &st; &nf {period\_ps}; &put topo buffer -p 100p 2 upsize {period ps} downsize {period ps} loopend

#### Compared to Yosys Default

| strash                                           | Depart lass expansive                           | strash                          |                                         |  |  |
|--------------------------------------------------|-------------------------------------------------|---------------------------------|-----------------------------------------|--|--|
| &get -n                                          | Repeat less-expensive<br>LMS-based optimization | &get -n                         | &st &if -y -K 6;                        |  |  |
| &fraig -x                                        | loop                                            | Loop 20                         | &syn2 &b &st                            |  |  |
| &put                                             | юор                                             | &opt_iter <                     | &dch -x; &if -K 4;                      |  |  |
| scorr                                            |                                                 | loopend                         |                                         |  |  |
| dc2                                              |                                                 | Loop 5                          |                                         |  |  |
| dretime                                          |                                                 |                                 | nap_iter; &opt_iter                     |  |  |
| strash                                           |                                                 | loopend                         |                                         |  |  |
| &get -n                                          |                                                 | ↦_iter                          |                                         |  |  |
| &dch –f                                          |                                                 | &put                            | <pre>&amp;st &amp;nf {period_ps};</pre> |  |  |
| &nf {period_ps}                                  |                                                 | topo                            |                                         |  |  |
| &put                                             |                                                 | buffer -p                       |                                         |  |  |
| buffer                                           | Repeatedly optimize during                      | loop 2                          |                                         |  |  |
| <pre>upsize {period_ps} dnsize {period_ps}</pre> | Technology mapping                              | upsize {period_ps}              |                                         |  |  |
| <pre>dnsize {period_ps}</pre>                    |                                                 | <pre>downsize {period_ps}</pre> |                                         |  |  |
|                                                  |                                                 | loopend                         |                                         |  |  |

#### What Yosys Devs are working on

- Reshape flow to improve runtime/memory
  - Generic mapping (techmap command) is a bottleneck
  - Time and memory intense

ETH zürich

- Idea: Skip it for most operators
- Instead write out AIG and go directly into ABC

#### Operator choices (with OpenROAD)

- Yosys synthesizes all operator implementations it knows (eg different adder architectures)
- Handed to OpenRoad via netlist with attributes
- Backend evaluates and choses fitting one
- First steps towards a joint Synthesis+Backend flow



30

#### Ongoing Work at ETHZ

- Integrate Mockturtle
  - Offers different algorithms





### Ongoing Work at ETHZ

#### Integrate Mockturtle

- Offers different algorithms
- Highlight: emap mapper
  - Up to 15% smaller average area
  - Up to 5% faster critical path
- Good documentation

Table 4.9: Comparing our technology mapper *emap* against ABC &*nf* after performance-driven buffering and gate sizing using the ASAP7 library.

| Benchmark       | Baseline |       | ABC nf-p           |            | Our work emap      |            |
|-----------------|----------|-------|--------------------|------------|--------------------|------------|
|                 | Size     | Depth | Area ( $\mu m^2$ ) | Delay (ps) | Area ( $\mu m^2$ ) | Delay (ps) |
| ac97_ctrl       | 14241    | 11    | 947.86             | 126.61     | 785.69             | 117.96     |
| aes_core        | 21376    | 21    | 1674.44            | 273.42     | 1697.16            | 252.39     |
| des_area        | 4808     | 27    | 470.15             | 372.71     | 433.14             | 353.17     |
| des_perf        | 80101    | 17    | 7213.21            | 237.51     | 7156.42            | 232.15     |
| DMA             | 24278    | 21    | 1581.92            | 271.38     | 1394.36            | 249.12     |
| DSP             | 45004    | 52    | 2813.2             | 577.41     | 2529.13            | 556.61     |
| ethernet        | 86576    | 27    | 5164.59            | 335.3      | 4219.71            | 354.08     |
| iwls05_i2c      | 1132     | 12    | 69.34              | 160.69     | 62.52              | 162.4      |
| iwls05_mem_ctrl | 15048    | 29    | 790.94             | 356.72     | 721.45             | 344.57     |
| pci_bridge32    | 22705    | 22    | 1421.13            | 294.12     | 1278.46            | 296.03     |
| RISC            | 74651    | 36    | 4298.5             | 404.78     | 3838.19            | 417.24     |
| sasc            | 770      | 8     | 46.9               | 101.19     | 37.69              | 109.13     |
| simple_spi      | 1039     | 10    | 64.97              | 134.48     | 55.94              | 131.12     |
| spi             | 3760     | 31    | 241.09             | 322.71     | 229.46             | 327.47     |
| ss_pcm          | 405      | 7     | 32.75              | 90.55      | 35.87              | 65.24      |
| systemcaes      | 12242    | 44    | 770.63             | 508.21     | 641.51             | 474.88     |
| systemcdes      | 2877     | 23    | 283.16             | 303.6      | 228.79             | 308.51     |
| tv80            | 9461     | 43    | 619.61             | 509.44     | 575.3              | 483.14     |
| usb_funct       | 15715    | 23    | 926.56             | 277.43     | 823.67             | 283.81     |
| usb_phy         | 452      | 9     | 30.47              | 90.78      | 29.48              | 92.9       |
| wb_conmax       | 47520    | 18    | 2360.05            | 279.26     | 2281.22            | 276.46     |
| Reduction       |          |       |                    |            | 9.22%              | 2.59%      |

**Tempia Calvino, A.** (2024). *Technology Mapping and Optimization Algorithms for Logic Synthesis of Advanced Technologies.* Doctoral thesis.



### Ongoing Work at ETHZ

- Integrate Mockturtle
  - Offers different algorithms
  - Highlight: emap mapper
    - Up to 15% smaller average area
    - Up to 5% faster critical path
  - Good documentation
- Exploration using ABC/Mockturtle
  - Finer control over 'abc' command
    - Define custom flows calling other tools
  - Multiple scripts with *different optimization goals*
  - On a per-module basis, decide which ones to try
  - Evaluate and pick one or more candidates





# Ongoing Work at ETHZ (2)

- Architectural Variants
  - VHDL has architectural and behavioral
  - We want different variants of an architecture )
  - Compatible with Verilog/SystemVerilog
    - Uses only parameters and attributes

```
module adder #(
 (* arch variant = "FASTEST, BALANCED, SMALLEST" *)
   parameter string SpeedGrade = "BALANCED"
   input
          logic [7:0] a,
          logic [7:0] b,
   input
   output logic [7:0] sum
);
   if (SpeedGrade == "FASTEST") begin
       // fastest implementation
   end
   else if (SpeedGrade == "SMALLEST") begin
       // smallest implementation
   end
   else begin
       // default implementation (balanced)
   end
```

endmodule

ETH zürich

# Ongoing Work at ETHZ (2)

- Architectural Variants
  - VHDL has architectural and behavioral
  - We want different variants of an architecture
  - Compatible with Verilog/SystemVerilog
    - Uses only parameters and attributes
- Why?
  - Integrate ETHZ library of arithmetic units
  - Prepare Yosys for deeper exploration
    - Explore flattening/keeping of hierarchy
    - Explore different implementations and evaluate
    - Needs to keep track of many internal variations
  - Offer place & route variants it may choose from



#### On a more Abstract Level

- Synthesis should explore design spaces
  - This requires variants and iterations
- Retain Information
  - It may allow for specific optimizations later in the flow
  - Eg flattening if there is very little benefit is not good
  - Annotate nets/modules etc with additional info
    - Eg if you find a bus, make sure it is kept as a bus and mark it as such -> place and route can use this info
- Add intelligence over time
  - Start with manual operator selection
  - Move to simple heuristic
  - End with complex systems (eg ML-driven)



## The output (and even scripts) of EDA vendors are closed **EUROPRACTICE** Fab **3rd Party IP** Library **PDK** Netlist RTL Chip GDS2 P&R Synthesizer Verification Yosys

## 1. Floorplan

- Define size
- Place pads and macros





## 1. Floorplan

- Define size
- Place pads and macros
- 2. Power distribution





## 1. Floorplan

- Define size
- Place pads and macros
- 2. Power distribution
- 3. Placement
  - Rough global placement
  - Legalize cell positions (detailed placement)





## 1. Floorplan

- Define size
- Place pads and macros
- 2. Power distribution

## 3. Placement

- Rough global placement
- Legalize cell positions (detailed placement)

## 4. Generate clock tree





## 1. Floorplan

- Define size
- Place pads and macros
- 2. Power distribution

## 3. Placement

- Rough global placement
- Legalize cell positions (detailed placement)

## 4. Generate clock tree

5. Routing

**ETH** zürich

- Plan resource utilization for each wire
- Create wires, fix violations (shorts etc)





## **OpenROAD: A Collection of Backend Tools**

- Research turned into a common flow
  - Global place: RePlace
  - Global route: FastRoute
  - Clock tree: TritonCTS
- Common openDB data structure
  - Designed by industry professionals
  - Documented and tested
- Supporting infrastructure around it
  - CLI, GUI, reporting, metrics collection etc
  - Plugin system for easy extensibility















We are getting there, first fully open chips are underway

# Meet Basilisk: Open RTL, Open EDA, Open PDK



## Designed in IHP 130nm OpenPDK

- 6.25mm x 5.50mm
- 60MHz
- 1.08 MGE logic, 60% density
- 24 SRAM macros (114 KiB)

## CVA6 based SoC

- Runs and boots Linux
- Active collaboration with





#### 48

# Basilisk is based on PULP: Cheshire SoC Platform

- Multi-million gate design
- 64-bit RISC-V Core
  - Complete Linux-capable SoC
  - Simple "Raspberry Pi"
- Rich Peripherals
  - Includes an open USB 1.1 host
- Open-source DRAM interface
  - Digital-only interface
- Silicon-proven
  - Multiple tapeouts with commercial EDA

JNIVERSITÀ DI BOLOGN.



#### github.com/pulp-platform/cheshire-ihp130-o



# Open-source vs. Commercial EDA – Reality Check

#### Improvements until June 2024 (Basilisk TO)

- SV-to-Verilog chain @ <2min runtime
- Yosys synthesis:

→ 1.1 MGE (1.6×) @ 77 MHz (2.3×) → 1.4× less runtime, 2.4× less peak RAM

- OpenROAD P&R: tuning
   → -12% die area, +10% core utilization
- Improvements June-October
  - Yosys-slang replaces SV2V
    - 1.6× less runtime, 10× less peak RAM
  - -10% logic area (preliminary)







- Constraints and Timing
  - Yosys has zero constraints support
  - ABC has very limited support
  - We want to use SDC compatible constraints
  - Approaches:
    - Integrate (parts of) OpenSTA?
    - Create something new?
  - Yosys needs to respect constraints at every level!
  - We need also timing-driven P&R and integration



ETH ZÜRICH

- Joint Synthesis + PnR flow
  - How should the tools integrate?
  - How can the flow be more iterative across Synthesis and PnR?
  - How can they collaborate more closely? (ie synthesis hinting to placer how something may be layed out)





- Interchange Formats
  - AIGER and BLIF used between Yosys and ABC
  - More difficult to debug for users
  - Limits how tools exchange information
    - Fundamentally limits some features (multi clock-domain optimizations/analysis, re-use of structures etc)
  - Missing Features

**ETH** zürich

- Robust variant support
- Annotation of arbitrary objects (requires named objects)
- Fast to read and write
- Do we extend an existing format?

ALMA MATER STUDIORUN

UNIVERSITÀ DI BOLOGN/

- Can we adapt something else?
  - Eg: naja-edits DNL/SNL formats (based on Cap'n Proto)



#### Runtime and Memory

- Yosys' data structure is a legacy constraint
  - Yosys devs aware of this
  - Changes are either compatible and iterative
  - Or break compatibility with large performance gains
- How can we work around this constraint?
- How would we solve it for the future?





- Four main points
  - Constraints and Timing
  - Joint Synthesis + PnR flow
  - Interchange formats
  - Runtime and memory optimizations
  - Bonus: Can we better leverage compute?
    - Heavily multi-threaded
    - GPU based techniques and algorithms





# Croc SoC: A simple chip for students





github.com/pulp-platform/croc



# Croc SoC: A simple chip for students

- Croc is simple to understand
  - Everything in one repository
  - Plain SystemVerilog
  - (soon) guides from students for students
- Croc is flexible

**ETH** zürich

- Students can replace or add parts
- Every step in the flow is fully exposed
- Croc flow is easy to run
  - Runs on older laptops (<4GB RAM)
  - Tools in a docker container Works on Linux, Windows and MacOS
  - 4 Make commands from RTL to GDS



## Used for ETH Zürich VLSI 2 lecture starting 2025

# Benefits of end to end openness



- Easier collaboration (no NDAs)
- Reproducible results, benchmarking
- Combined impact of design and design automation



- Transparent chain of trust, sovereignty
- Lower initial cost
- Faster research  $\rightarrow$  product



- Increased accessibility
- No black boxes, full visibility
- Experiment with flows and tools



# International Cooperation is Key





## More Open PDKs are needed

- We have access to three open PDKs at the moment
  - Skywater 130nm
  - Globalfoundries 180 (500nm high-voltage flavor of their 180nm node)
  - IHP 130nm
- State of the Art from 2000-2004
  - Many exciting designs possible
- We need more
  - For more innovation: higher volume, faster turn-around, frequent MPWs
  - For more capabilities: Access to newer nodes

## An Open PDK in the 65-28nm would be a game changer!





# Looking Forward...

- Open EDA is moving fast → MCU ✓ MPU ✓ HPC/GenAI ×
- Scale is the challenge, QoR is next
- Can we accelerate?
  - Al for EDA can be a key enabler (Language  $\rightarrow$  RTL is ripe IMHO)
  - ChatGPT arxiv.org/pdf/2305.14019, ChipChat (arxiv.org/abs/2305.13243)
  - Is a LLAMAv3 moment near for AI-based EDA?

| Benchmark      | Test Set | ChatGPT-4 |           |            | ChatGPT-3.5 |           |            |
|----------------|----------|-----------|-----------|------------|-------------|-----------|------------|
|                |          | Outcome   | Compliant | # Messages | Outcome     | Compliant | # Messages |
| Shift Register | T1       | TF        | 1         | 3          | SHF         | 1         | 13         |
|                | T2       | TF        | 1         | 9          | FAIL        |           | 25         |
|                | T3       | AHF       | 1         | 15         | FAIL        |           | 11         |
| Sequence Gen.  | TI       | AHF       | 1         | 14         | FAIL        |           | 25         |
|                | T2       | TF        | 1         | 4          | FAIL        | -         | 7          |
|                | T3       | AHF       | 1         | 20         | FAIL        |           | 25         |
| Sequence Det.  | T1       | FAIL      | -         | 24         | FAIL        | -         | 21         |
|                | T2       | SHF       | 1         | 9          | SHF         | ×         | 8          |
|                | T3       | TF        | 1         | 13         | SHF         | ×         | 8          |
| ABRO           | T1       | FAIL      |           | 16         | FAIL        | -         | 25         |
|                | T2       | AHF       | 1         | 20         | MHF         | 1         | 15         |
|                | T3       | TF        | 1         | 12         | NFN         | ×         | 3          |
| LFSR           | T1       | TF        | 1         | 12         | FAIL        | -         | 25         |
|                | T2       | SHF       | /         | 7          | TF          | 1         | 4          |
|                | T3       | SHF       | 1         | 9          | FAIL        | -         | 11         |
| Binary to BCD  | T1       | TF        | 1         | 4          | SHF         | ×         | 8          |
|                | T2       | NFN       | 1         | 2          | FAIL        | -         | 12         |
|                | T3       | SHF       | 1         | 9          | TF          | ×         | 4          |
| Traffic Light  | TI       | TF        | 1         | 4          | FAIL        | -         | 25         |
|                | T2       | SHF       | 1         | 12         | FAIL        | -         | 13         |
|                | T3       | TF        | 1         | 5          | FAIL        | -         | 18         |
| Dice Roller    | T1       | SHF       | ×         | 8          | MHF         | ×         | 9          |
|                | T2       | SHF       | 1         | 9          | FAIL.       | 1 2       | 25         |
|                | T3       | SHF       | ×         | 18         | NFN         | X         | 3          |









## **Final words**

- We use open source because it works
  - Allows us to manage complex designs
  - Facilitates Industry/Academia Relationships
  - Creates Auditable Designs, Reproducible Results
  - Enables research into new directions

### Helps us and others concentrate work where it matters

- Open Source sees no borders
  - There is no 'European/Chinese/American Open Source',
  - There can be 'European/Chinese/American support for Open Source'

#### Open Source is global, it just can have more or less support in a region/country



# There is still more to come 🕲



# Thank You!