Understanding the TCDM interconnect and implementing HWPEs
#1
Hello,

I am currently trying to implement a HWPE in the PULPissimo platform.
I am facing a few problems, and it would be great if someone could help me clarify some things.
This might be a rather long thread with quite a few questions, but I hope it will help not only me but also other people who (might) face the same or similar problems.

To understand the interaction between the L2 Memory, the TCDM interconnect and a HWPE, I am building my own simple HWPE.
At first it should only consist of a couple of registers and be able to read/write from/to the L2 Memory.
In other words, data flow is what I am interested in at the moment.
If this simple HWPE works and can successfully read/write, a more complex HWPE can easily follow.


So first I read the HWPE Interface Specifications and the SystemVerilog code to get a rough overview of the (possible) interactions between L2 Memory <-> TCDM <-> HWPE.
This is how I think things work, so please correct me if I am wrong:


There are (at least) two components that can do read/write operations on the L2 Memory.

First, the core, which has an instruction memory interface and a data memory interface.
The fc_subsystem has two XBAR_TCDM_BUS.Master ports used by the core.
One for data (l2_data_master) and one for instructions (l2_instr_master).
These connect the core with the XBAR_TCDM_BUS.Slave ports of the soc_interconnect_wrap.
The soc_interconnect itself then has a RISC data port and a RISC instruction port.
Now this is where things start to get more confusing to me.
But the important thing is that the core can use its data memory interface to interact with the L2 Memory.
This means I know that it is possible to send signals to the L2 Memory (req, addr, etc.) for read/write operations.
So one possibility would be to have the HWPE interact with the L2 Memory the same way the core does.
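For reference, this is roughly the port I have in mind, based on how I read the XBAR_TCDM_BUS interface (signal names and widths are my own assumption, so please correct me if they are off):

Code:
// Sketch of a TCDM-style master port as I understand it, i.e. the same
// request/grant handshake the core data port uses. Names follow my
// reading of XBAR_TCDM_BUS and may not match the sources exactly.
interface my_tcdm_bus;
  logic        req;      // request valid
  logic        gnt;      // grant from the interconnect
  logic [31:0] add;      // byte address into L2
  logic        wen;      // write enable (active low in PULP, I believe)
  logic [3:0]  be;       // byte enables
  logic [31:0] wdata;    // write data
  logic [31:0] r_rdata;  // read data, returned after the grant
  logic        r_valid;  // read data valid

  modport Master (output req, add, wen, be, wdata,
                  input  gnt, r_rdata, r_valid);
  modport Slave  (input  req, add, wen, be, wdata,
                  output gnt, r_rdata, r_valid);
endinterface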
This brings up some questions:
1) Can the core and the HWPE share the same bus? If yes, would that mean that the core would have to stall if the HWPE is using the bus?
2) If they don't share the same bus and you create a new connection to the L2 Memory based on the core's data memory interface, how could you handle possible hazards such as Write-After-Read? I am not sure whether the request/response trees can handle such scenarios, as PULPissimo is a single-core system. Then again, there are also multi-core variants, but I haven't read anything about data flow in such PULP systems.
3) Similar to 2) but this time the HWPE would use more than one port (similar to how the current HWPE implementation works).

The second component that can do read/write operations on the L2 Memory is the HWPE.
At least in the pulp-rt-example the data is loaded into the L2 Memory.
Now the HWPE uses streams, and I tried to make my own HWPE use only one master port.
Questions regarding the HWPE variant:
1) Can you use a single port for read/write or do you need at least two (source+sink)?
2) I tried the pulp-rt-example for the accelerator and reduced the number of master ports down to two. This failed; it seems that just changing the parameter for the number of master ports is not enough. You probably also have to make some changes in the stream controller, right? (maybe even more changes)


So in a nutshell: I am trying to implement my own HWPE in the PULPissimo platform. Currently the HWPE should only be able to read/write from/to the L2 Memory. At first I wanted to use only one port for that; if that works, I want to increase the number of ports.
The questions are:
- Does it make sense to use the same port as the core?
- Does it make sense to create a new port which mimics the port of the core?
- Would the best/easiest/most efficient way be to just use the ports of the HWPE which are already defined and just replace the example HWPE with my own?
- What are the limits of the number of ports for both the core style variant as well as the HWPE variant?


If anything is unclear please feel free to ask and I will try my best to give further details.


Thank you very much.
LPLA
#2
Hello,

Just wanted to remind you that we have a tutorial under:
https://pulp-platform.org/conferences.html

Slides:
https://pulp-platform.org/docs/riscv_wor...torial.pdf

And there is also a video recording of the hands on talk, which could be useful:
https://www.youtube.com/watch?v=27tndT6cBH0

I think this could answer a good number of your questions. In a nutshell, you can add as many ports to the HWPE as you want; generally the issue in computing is memory bandwidth, so the more the better. Of course, adding an arbitrary number of ports could
a) lead to contention (multiple sources/sinks access the same memory and end up waiting), and
b) complicate the crossbar in between, reducing the access speed.
So a good balance is required.

If you do not have much memory access, you can actually even get away with making it a regular APB-connected peripheral.
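For illustration, a couple of memory-mapped registers behind a plain APB slave could look like the sketch below (made-up module and register names, not code from the PULPissimo tree):

Code:
// Minimal sketch of an APB-mapped register block (illustrative only;
// adapt the address decoding and widths to your needs).
module my_apb_regs (
  input  logic        pclk,
  input  logic        presetn,
  input  logic        psel,
  input  logic        penable,
  input  logic        pwrite,
  input  logic [11:0] paddr,
  input  logic [31:0] pwdata,
  output logic [31:0] prdata,
  output logic        pready,
  output logic        pslverr
);
  logic [31:0] reg0, reg1;

  assign pready  = 1'b1;   // always ready, no wait states
  assign pslverr = 1'b0;   // never signal an error

  // write during the ACCESS phase of an APB write transfer
  always_ff @(posedge pclk or negedge presetn) begin
    if (!presetn) begin
      reg0 <= '0;
      reg1 <= '0;
    end else if (psel && penable && pwrite) begin
      case (paddr[3:2])
        2'd0: reg0 <= pwdata;
        2'd1: reg1 <= pwdata;
        default: ;
      endcase
    end
  end

  // combinational read mux
  always_comb begin
    case (paddr[3:2])
      2'd0:    prdata = reg0;
      2'd1:    prdata = reg1;
      default: prdata = '0;
    endcase
  end
endmodule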

As for the internals of the HWPE, we have some documentation:
https://hwpe-doc.readthedocs.io/en/latest/

The streamers are designed to drive the memory ports, so you are right, you should also modify them. The docs above could help you with that.

Part 1, Q1: Yes, the core and the HWPE use the same interconnect (not a bus) to access the same memory. If the accesses are not to the same physical memory block, they can be concurrent. There is a round-robin-like arbitration to make sure that the core (or the HWPE) does not stall unfairly.

Hope this helps a bit
Visit pulp-platform.org and follow us on twitter @pulp_platform
#3
Thank you for the quick response.
I will check out the information you provided.
#4
Hi LPLA, some more detail with respect to the second part of your question.

Quote:Questions regarding the HWPE variant:
1) Can you use a single port for read/write or do you need at least two (source+sink)?

Well, both things! You can use a single port, but you need two different streamers to manage incoming streams (source, to generate loads) and outgoing ones (sink, to generate stores). The separate "streams" of loads and stores generated by source and sink can then be mixed by means of a dynamic mux or a static mux. In the former case, the mux is essentially a level of interconnect arbitrating between conflicting accesses. In the latter, it is really a multiplexer: there is a static signal that selects which "stream" of loads and stores goes through to the port.
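To give an idea of the static case, a sketch could look like the following (signal names follow the hwpe-stream TCDM interface; the actual hwpe-stream mux modules are more complete, so take this as illustrative only):

Code:
// Illustrative static mux: one quasi-static select signal decides whether
// the load "stream" (from a source) or the store "stream" (from a sink)
// owns the single TCDM master port. Not the real hwpe-stream module,
// just the idea behind it.
module tcdm_static_mux (
  input  logic sel_i,                     // 0: source (loads), 1: sink (stores)
  hwpe_stream_intf_tcdm.slave  in_load,   // from the source streamer
  hwpe_stream_intf_tcdm.slave  in_store,  // from the sink streamer
  hwpe_stream_intf_tcdm.master out        // towards the memory port
);
  // forward the selected request
  assign out.req  = sel_i ? in_store.req  : in_load.req;
  assign out.add  = sel_i ? in_store.add  : in_load.add;
  assign out.wen  = sel_i ? in_store.wen  : in_load.wen;
  assign out.be   = sel_i ? in_store.be   : in_load.be;
  assign out.data = sel_i ? in_store.data : in_load.data;

  // return grant and read data only to the selected side
  assign in_load.gnt      = sel_i ? 1'b0        : out.gnt;
  assign in_store.gnt     = sel_i ? out.gnt     : 1'b0;
  assign in_load.r_data   = out.r_data;
  assign in_store.r_data  = out.r_data;
  assign in_load.r_valid  = sel_i ? 1'b0        : out.r_valid;
  assign in_store.r_valid = sel_i ? out.r_valid : 1'b0;
endmodule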

Quote:2) I tried the pulp-rt-example for the accelerator and reduced the number of master ports down to two. This failed as it seems just changing the parameters for the number of master ports is not enough. You probably have to do some changes in the stream controller, right? (maybe even more changes)

In general, what is in this repo https://github.com/pulp-platform/hwpe-mac-engine is provided as an example; assume you will have to change it. In the case of the toy MAC engine, there are three source modules and one sink module; what you probably need is only two (one source and one sink).

Quote:- Does it make sense to use the same port as the core? 

If by this you mean the same physical memory, yes -- but you'll need a layer of arbitration in between. Also, I do not recommend doing it.


Quote:- Does it make sense to create a new port which mimics the port of the core?

This is essentially how it already works, except that HWPE ports jump a few demux/interconnection layers as they are not capable of accessing some parts of L2 (non-interleaved memory).

Quote:- Would the best/easiest/most efficient way be to just use the ports of the HWPE which are already defined and just replace the example HWPE with my own?

By far the simplest and most recommended route. You can simply tie off the unused master ports, i.e. drive their req, add, wen, be and data signals to 0.
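For instance, something along these lines (a sketch; MP, N_USED_PORTS and the tcdm port name are placeholders for whatever your engine uses):

Code:
// Tie off the unused TCDM master ports so they never issue requests.
// Signal names as in the hwpe-stream TCDM interface.
for (genvar i = N_USED_PORTS; i < MP; i++) begin : gen_tcdm_tie_off
  assign tcdm[i].req  = 1'b0;
  assign tcdm[i].add  = '0;
  assign tcdm[i].wen  = 1'b0;
  assign tcdm[i].be   = '0;
  assign tcdm[i].data = '0;
end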


Quote:- What are the limits of the number of ports for both the core style variant as well as the HWPE variant?

I am not sure if I understand this question; in PULPissimo specifically it does not make much sense to have more than four ports without making further changes, because there are only four memory banks (so more ports would not increase the available bandwidth). In PULP (the multi-core cluster) there is no such limitation, although up to now we have stuck with four ports due to other architectural considerations (mainly, to keep the size/complexity of the interconnect in check).
#5
Thanks a lot! This is very useful information.
#6
Hi, I want to connect my accelerator to PULPissimo as a HWPE and was looking for some resources that could guide me in depth on how to make the connection, access memory for data, and run a simple program on it. The accelerator is currently a simple one and performs only one operation. Could you help me out with this?
#7
Hello,

We have a number of tutorials on this topic. Did you check:
https://pulp-platform.org/pulp_training.html

A Deep Dive into HW/SW Development with PULP should actually cover most questions and you would have a better view of what is needed further.
Visit pulp-platform.org and follow us on twitter @pulp_platform
#8
(02-13-2024, 07:19 AM)kgf Wrote: Hello,

We have a number of tutorials on this topic. Did you check:
  https://pulp-platform.org/pulp_training.html

A Deep Dive into HW/SW Development with PULP should actually cover most questions and you would have a better view of what is needed further.

Hey, thank you for replying.
 
So I went through both parts of the tutorial, and there was one exercise (full-stack IP integration) that involved connecting a hardware accelerator to PULPissimo. But that one does not use the HWPE infrastructure for the connection; it only uses an AXI4 port to attach the accelerator. I didn't find steps for connecting through the HWPE interface. Could you help me out with this?

