It has been a long time since I posted on my webpage about Digital Design and SoC concepts. My last post evaluated the data transfer rates between OCM, DDR3 RAM and PL Block RAM by utilizing the PS DMA of the Processing System of the Zynq SoC. Please visit the related post for details:
I want to continue with data mover solutions for the Xilinx Zynq SoC. In this post I will perform the same latency tests, data transfer between OCM, DDR3 RAM and PL BRAM, but this time with the AXI CDMA IP instead of the PS DMA.
Before going into the details of the topic and the implementation, I recommend that anyone interested in data transfer methods between PS and PL from a system design perspective read the document below:
https://docs.xilinx.com/v/u/en-US/wp459-data-mover-IP-zynq
Xilinx has different IP solutions for data transfer between PS and PL. If you search for DMA usage in the PL, you will find several IP cores sharing the common keywords AXI and DMA. I typed DMA into the Xilinx IP catalog search window and this is the result (in Vivado 2021.1):
The memory elements we will be transferring data between use memory-mapped interfaces rather than streaming interfaces. For data transfer between memory-mapped interfaces, Xilinx provides the AXI Central Direct Memory Access (AXI CDMA) IP:
“The AMD LogiCORE™ IP AXI Central Direct Memory Access (CDMA) core is a soft AMD Intellectual Property (IP) core for use with the Vivado™ Design Suite. The AXI CDMA provides high-bandwidth Direct Memory Access (DMA) between a memory-mapped source address and a memory-mapped destination address using the AXI4 protocol.” from
https://www.xilinx.com/products/intellectual-property/axi_central_dma.html
I decided to use the AXI CDMA IP to move data between PL BRAM, OCM and DDR3 DRAM in the Zynq SoC. I checked the interconnect of the Zynq SoC to decide which PS-PL ports I have to use:
The first option is to use one of the four high performance slave ports: S_AXI_HP0/1/2/3
The second option is to use the Accelerator Coherency Port (ACP): S_AXI_ACP
Here is the interconnect from Zynq TRM:
We can see that both ports can access both the OCM and the DDR memory from the PL. I decided to use a high performance slave port instead of the ACP. Maybe in a future post I can focus on and utilize the ACP.
Okay, now I started creating the Vivado design by adding the Zynq PS. After running block design automation, by default it does not enable a slave HP port, but it enables a master GP port:
So I opened the Zynq PS IP configuration and added an HP port:
You can select the AXI data width as 32 or 64 bits; I continue with the default 64 bits to improve performance.
I also enabled fabric interrupts from PL to PS in the Zynq PS customization:
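On the software side, this fabric interrupt lets the CDMA signal transfer completion to the PS through the Generic Interrupt Controller (GIC) instead of being polled. Below is a minimal sketch of how that hookup typically looks in a bare-metal application; the interrupt ID macro name and the helper name are assumptions that depend on your block design and BSP, and this is not code taken from my project:

```c
#include "xscugic.h"
#include "xaxicdma.h"
#include "xil_exception.h"
#include "xparameters.h"

static XScuGic Gic;

/* Hypothetical helper: route the CDMA interrupt line (e.g. the
 * XPAR_FABRIC_..._CDMA_INTROUT_INTR ID from xparameters.h) to the
 * driver's built-in interrupt handler. */
int hook_cdma_interrupt(XAxiCdma *CdmaInst, u32 CdmaIntrId)
{
	XScuGic_Config *GicCfg = XScuGic_LookupConfig(XPAR_SCUGIC_SINGLE_DEVICE_ID);
	if (GicCfg == NULL)
		return XST_FAILURE;

	if (XScuGic_CfgInitialize(&Gic, GicCfg, GicCfg->CpuBaseAddress) != XST_SUCCESS)
		return XST_FAILURE;

	/* The xaxicdma driver ships its own ISR; connect and enable the line.
	 * (Core-side interrupts are enabled with XAxiCdma_IntrEnable when a
	 * transfer is submitted with a completion callback.) */
	if (XScuGic_Connect(&Gic, CdmaIntrId,
			    (Xil_InterruptHandler)XAxiCdma_IntrHandler,
			    CdmaInst) != XST_SUCCESS)
		return XST_FAILURE;
	XScuGic_Enable(&Gic, CdmaIntrId);

	/* Register the GIC with the ARM exception system and enable IRQs. */
	Xil_ExceptionRegisterHandler(XIL_EXCEPTION_ID_INT,
				     (Xil_ExceptionHandler)XScuGic_InterruptHandler,
				     &Gic);
	Xil_ExceptionEnable();

	return XST_SUCCESS;
}
```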
In order to change the BRAM size, open the Address Editor and change the memory range from the default 8K to 32K:
Then the final block design diagram looks like this:
FPGA view and resource utilization after P&R:
In Vitis, we first select our generated XSA file:
I will create a standalone (bare-metal, no OS) software project:
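In this bare-metal project, the AXI CDMA is driven through the standalone xaxicdma driver. Before issuing any transfer, the driver instance has to be initialized; a minimal sketch is below, assuming the canonical XPAR_AXICDMA_0_DEVICE_ID macro (the exact name in xparameters.h depends on the instance name in your block design):

```c
#include "xparameters.h"
#include "xaxicdma.h"
#include "xstatus.h"

static XAxiCdma Cdma;	/* driver instance for the AXI CDMA core */

int init_cdma(void)	/* hypothetical helper name */
{
	/* Look up the configuration generated for this core in xparameters.h */
	XAxiCdma_Config *Cfg = XAxiCdma_LookupConfig(XPAR_AXICDMA_0_DEVICE_ID);
	if (Cfg == NULL)
		return XST_FAILURE;

	/* Initialize the driver with the core's register base address */
	if (XAxiCdma_CfgInitialize(&Cdma, Cfg, Cfg->BaseAddress) != XST_SUCCESS)
		return XST_FAILURE;

	return XST_SUCCESS;
}
```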
I first tried the AXI CDMA options with a 64-bit write/read data width and a write/read burst size of 256. The DDR-to-DDR data transfer worked with this configuration, but the DDR-to-BRAM and DDR-to-OCM transfers did not.
Then I changed the burst size to 64. Still no change.
Then I changed the write/read data width to 32 and the burst size to 64. This time the DDR, OCM and BRAM transfers all completed successfully.
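For reference, a single transfer in the bare-metal code boils down to the sketch below. This is a simplified polled version, not the exact code from my project; the cache maintenance calls matter because the HP port is not cache coherent, so the source buffer must be flushed from the data cache before the transfer and the destination invalidated before the CPU reads it back:

```c
#include "xaxicdma.h"
#include "xil_cache.h"
#include "xstatus.h"

/* Hypothetical helper: copy Len bytes from Src to Dst with the CDMA and poll
 * until the core is done. Src/Dst are physical addresses (DDR, OCM or BRAM). */
int cdma_copy(XAxiCdma *Cdma, UINTPTR Src, UINTPTR Dst, int Len)
{
	/* Push the source data from the caches into memory so the DMA sees it */
	Xil_DCacheFlushRange((INTPTR)Src, Len);

	if (XAxiCdma_SimpleTransfer(Cdma, Src, Dst, Len, NULL, NULL) != XST_SUCCESS)
		return XST_FAILURE;

	/* Poll until the CDMA reports the transfer is finished */
	while (XAxiCdma_IsBusy(Cdma))
		;

	if (XAxiCdma_GetError(Cdma) != 0)
		return XST_FAILURE;

	/* Drop stale cached copies of the destination before the CPU checks it */
	Xil_DCacheInvalidateRange((INTPTR)Dst, Len);

	return XST_SUCCESS;
}
```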
The performance of a 32 kB data transfer with the AXI CDMA, measured while running on the Zedboard, is as follows (a sketch of the measurement code follows the list):
DDR to DDR (32 kB): 85.194 us
DDR to OCM (32 kB): 84.969 us
DDR to BRAM (32 kB): 126.111 us
DDR to DDR with CPU load/store (32 kB): 19127.15 us
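The latencies above come from timing each transfer with the Cortex-A9 global timer exposed by xtime_l.h. The sketch below shows how such a measurement (and the CPU load/store comparison via memcpy) can be done; measure() is a hypothetical helper and cdma_copy() is the polled helper sketched earlier, so this is not the exact code from my project:

```c
#include <string.h>
#include "xtime_l.h"
#include "xil_printf.h"
#include "xaxicdma.h"

#define XFER_BYTES (32 * 1024)	/* 32 kB, matching the BRAM range */

extern int cdma_copy(XAxiCdma *Cdma, UINTPTR Src, UINTPTR Dst, int Len);

/* Time one CDMA copy and one plain CPU memcpy of the same block.
 * COUNTS_PER_SECOND comes from xtime_l.h (global timer = CPU clock / 2). */
void measure(XAxiCdma *Cdma, UINTPTR Src, UINTPTR Dst)
{
	XTime start, end;

	XTime_GetTime(&start);
	cdma_copy(Cdma, Src, Dst, XFER_BYTES);
	XTime_GetTime(&end);
	xil_printf("CDMA  : %d us\r\n",
		   (int)(((end - start) * 1000000) / COUNTS_PER_SECOND));

	XTime_GetTime(&start);
	memcpy((void *)Dst, (void *)Src, XFER_BYTES);
	XTime_GetTime(&end);
	xil_printf("memcpy: %d us\r\n",
		   (int)(((end - start) * 1000000) / COUNTS_PER_SECOND));
}
```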
So, the best performance is achieved with OCM and DDR, with OCM slightly faster than DDR. The BRAM transfer is a little slower, but compared to the CPU load/store transfer, the AXI CDMA is a great solution.
The performance is slightly worse than with the PS DMA transfer method. The PS DMA performance (calculated in my previous post) was:
That said, when only DDR-to-DDR data movement is involved, the AXI CDMA performance was around 45 us with the 64-bit read/write data width. However, I couldn't find a way to make the OCM and BRAM data transfers work with a 64-bit data width.
Another way to improve performance is to increase the PL clock frequency. I tried 200 MHz, but Vivado could not meet timing at 200 MHz. Then I tried 150 MHz and the flow completed successfully. However, the clock frequency had nearly no effect on the performance.
In the future, I will focus on why the 64-bit data width does not work for data transfers to OCM and BRAM.
You can regenerate the block design with the Tcl scripts that I provided on my GitHub page:
https://github.com/mbaykenar/zynq-soc-hw-sw-design/blob/main/ders15/system.tcl
https://github.com/mbaykenar/zynq-soc-hw-sw-design/blob/main/ders15/zynq_axi_cdma.tcl
You can also view the software on GitHub:
https://github.com/mbaykenar/zynq-soc-hw-sw-design/blob/main/ders15/znyq_axi_cdma_tutorial.c
And the modified linker script:
https://github.com/mbaykenar/zynq-soc-hw-sw-design/blob/main/ders15/lscript.ld
In this post, I showed how to transfer data between DDR RAM, OCM and PL BRAM by utilizing the Xilinx AXI CDMA IP and compared the latencies. You can find all the source code related to this post on my GitHub page:
https://github.com/mbaykenar/zynq-soc-hw-sw-design/tree/main/ders15
Regards,
Mehmet Burak AYKENAR
You can connect with me via LinkedIn: just send me an invitation.