KALMAN FILTER IMPLEMENTATION on ZYNQ PS (ARM CORTEX-A9) and LATENCY MEASUREMENTS

In the last post I implemented a Kalman filter with 6 states and 3 measurements using the “Eigen” library in C++ in Microsoft Visual Studio. If you have not read it, I suggest looking at it before this post:

https://www.mehmetburakaykenar.com/kalman-filter-implementation-in-c-with-eigen-library-in-visual-studio/159/

Now I will use the “Eigen” library again, but this time the target platform is the processing system (PS) part of the Xilinx Zynq SoC, which has an ARM Cortex-A9 CPU running at 666.6 MHz.

I have Digilent’s Zedboard. You can read the Zedboard reference manual at this link:

https://digilent.com/reference/programmable-logic/zedboard/reference-manual

Normally, in the development cycle of a Zynq SoC device, the hardware description is defined first and the XSA (Xilinx Support Archive) file is generated, with or without the FPGA bitstream. With the XSA file, the SW developer can then write the application. For this example I will use the pre-defined XSA file of the Zedboard, which is built into Vitis.

So after opening Vitis (I use 2020.1), I created a new application project and selected Zedboard for the XSA file:

On the next page, where you need to choose the platform, I chose “zed” and continued:

On the next page you need to choose the application name and the processor core on which you want to run your SW:

On the domain page I kept the defaults. The important part here is that I use the standalone (bare-metal) OS, which is not actually an operating system; drivers for the peripherals and other resources are provided as an API:

Then I selected the “Empty Application (C++)” option and finished the setup:

Then the project is created. We need to add the folder path where Eigen is installed:

You can see that the includes <iostream> and <random> appear unresolved. You can get rid of this error by enabling the CDT GCC Built-in Compiler Settings option under Preprocessor Include Paths, Macros etc.:

The Eigen webpage also recommends adding an XML file for header substitution:

After all these includes and other settings, the code below compiled successfully:

#include <iostream>
#include <Eigen/Dense>
#include <random>
using namespace Eigen;

#define T                   1       // second
#define MC_RUN              100	    // Monte-Carlo run number
#define SIM_NUM             1000	// simulation run number

#define STATE_VECTOR        6       // length of the state vector
#define MEASUREMENT_VECTOR  3       // length of the measurement vector
#define PROCESS_NOISE_SIGMA 0.5     // process std_dev
#define MEAS_NOISE_SIGMA    5       // measurement std_dev

int main()
{
	return 0;
}

Then I simply copied and pasted the code from the last post. I only added the platform.h, platform_config.h and platform.c files to the project, which are required for the init_platform() and cleanup_platform() functions. To measure the latency of the Kalman filter I also included the xscutimer.h header file, which has the functions and definitions needed to use the Zynq PS timer unit. So the header and definition section of the code is:

#include <iostream>
#include <Eigen/Dense>
#include <random>
#include "xscutimer.h"
#include "platform.h"
using namespace Eigen;

#define T                   1       // second
#define MC_RUN              100	    // Monte-Carlo run number
#define SIM_NUM             1000	// simulation run number

#define STATE_VECTOR        6       // length of the state vector
#define MEASUREMENT_VECTOR  3       // length of the measurement vector
#define PROCESS_NOISE_SIGMA 0.5     // process std_dev
#define MEAS_NOISE_SIGMA    5       // measurement std_dev

#define TIMER_DEVICE_ID		XPAR_XSCUTIMER_0_DEVICE_ID
#define TIMER_LOAD_VALUE	0xFFFFFFFF

The other parts are the same as in the last post, with some small additions such as the timer parameter definitions and initialization:

u64 et[MC_RUN];
float et_us[MC_RUN];

XScuTimer Timer;		/* Cortex A9 SCU Private Timer Instance */
XScuTimer_Config *ConfigPtr;
XScuTimer *TimerInstancePtr = &Timer;
u16 DeviceId = TIMER_DEVICE_ID;
ConfigPtr = XScuTimer_LookupConfig(DeviceId);
int Status = 0;
u32 tic = 0;
u32 toc = 0;

Status = XScuTimer_CfgInitialize(TimerInstancePtr, ConfigPtr,
			 ConfigPtr->BaseAddr);
if (Status != XST_SUCCESS) {
	return XST_FAILURE;
}

Before the Kalman filter for loop I start the timer, and after the loop ends I stop it:

XScuTimer_LoadTimer(TimerInstancePtr, TIMER_LOAD_VALUE);
XScuTimer_Start(TimerInstancePtr);
tic = XScuTimer_GetCounterValue(TimerInstancePtr);

This part, placed after the loop, measures the latency:

XScuTimer_Stop(TimerInstancePtr);
toc = XScuTimer_GetCounterValue(TimerInstancePtr);
et[i] = tic - toc;
et_us[i] = ((float)((float)et[i]/997.0) / 333333333.3)*1000000;

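I divide et (the elapsed time in timer ticks) by 997, as that is the number of simulation steps. The timer counts down, which is why the elapsed tick count is tic - toc. The timer clock frequency is half of the main ARM clock, which is 666.6 MHz, so one tick is about 3 ns.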

In order to debug the project on Zynq, I created a debug configuration as:

When I ran (debugged) the code, the simulation took some time to finish, and the et_us (elapsed time in microseconds) variable showed that the latency of each Kalman filter step was around 1.5 ms:

I expected 936 multiplications and 879 additions on the float (single-precision) data type for one Kalman step with 6 states and 3 measurements. So one multiplication plus one addition takes roughly 500 timer clock cycles (1.5 ms is about 500,000 timer ticks at 333.3 MHz), or about 1000 CPU cycles. This seemed high to me, as the ARM Cortex-A9 has a dedicated floating-point unit. So I measured the latency of multiplying two 6×6 matrices with the code below:

XScuTimer_LoadTimer(TimerInstancePtr, TIMER_LOAD_VALUE);
XScuTimer_Start(TimerInstancePtr);
tic = XScuTimer_GetCounterValue(TimerInstancePtr);

P_pre = F * P;

XScuTimer_Stop(TimerInstancePtr);
toc = XScuTimer_GetCounterValue(TimerInstancePtr);
et[j] = tic - toc;
et_us[j] = ((float)((float)et[j]) / 333333333.3)*1000000;

The average latency was 270 us, which is nearly 180,000 CPU clock cycles at 666.6 MHz. Then I changed the code to measure the latency of just one single-precision floating-point multiplication:

XScuTimer_LoadTimer(TimerInstancePtr, TIMER_LOAD_VALUE);
XScuTimer_Start(TimerInstancePtr);
tic = XScuTimer_GetCounterValue(TimerInstancePtr);

//P_pre = F * P;
test3 = test1*test2;


XScuTimer_Stop(TimerInstancePtr);
toc = XScuTimer_GetCounterValue(TimerInstancePtr);
et[j] = tic - toc;
et_us[j] = ((float)((float)et[j]) / 333333333.3)*1000000;

where test1, test2 and test3 are all float variables. The latency was 36 timer clock cycles, which is about 108 ns. Then I became curious about the performance of the Eigen library; or maybe the problem was the huge memory requirement, as I was saving all the RMS errors, the estimated and measured values, and the ground truth. More memory means more memory accesses, which can take a lot of clock cycles. Therefore, I commented out the huge matrices used for error measurement and reduced the simulation length from 1000 to 50 by changing the SIM_NUM constant. Surprise:

Now the latency of 1 Kalman filter step is around 73 us!!!
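As a rough sanity check, both latencies can be converted to CPU cycles per multiplication. This assumes the 666.6 MHz CPU clock and the 936 multiplications per step quoted earlier, and it lumps the additions and all load/store overhead into the same figure, so it is only an order-of-magnitude estimate:

```latex
\begin{align*}
N_{\text{before}} &\approx 1.5\,\text{ms} \times 666.6\,\text{MHz} \approx 1.0 \times 10^{6}\ \text{cycles},
  & \frac{N_{\text{before}}}{936} &\approx 1070\ \text{cycles} \\
N_{\text{after}} &\approx 73\,\mu\text{s} \times 666.6\,\text{MHz} \approx 4.9 \times 10^{4}\ \text{cycles},
  & \frac{N_{\text{after}}}{936} &\approx 52\ \text{cycles}
\end{align*}
```

So the change reduced the per-step cost by a factor of about 20.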

Well, let me tell you the truth: I was expecting a latency of around 70 us, as this is not the first time I have tried a Kalman filter on the Zynq PS 😊 I had tried a Kalman filter implementation in C without the Eigen library and the result was around 73 us, so observing latencies around 1.5 ms was shocking for me. I became curious and tried to assess the situation and find where I had made a mistake. The mistake was using a huge amount of memory through huge 2D arrays, which increases the number of store and load instructions in the code.

So, at the end of the day, I learnt to be careful when measuring software performance. In digital design, FPGA or ASIC, you have the chance to measure the cycle-by-cycle latency of an operation. This performance does not change with whatever is happening on the other side of the silicon. In software, things happen sequentially and memory usage becomes an important factor.

For the next post I plan to use the same C++ code in the Vivado HLS tool and look at the synthesis results. Wait for it!!!
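If the large arrays are only needed to compute an RMS error at the end, one option is to accumulate the squared error as the simulation runs, so that only a sum and a count are stored. This is not what the code in this post does (I simply commented the arrays out); it is a minimal sketch of the idea, and the RunningRms name is hypothetical:

```cpp
#include <cmath>
#include <cstddef>

// Running RMS accumulator: stores a sum of squares and a sample count
// instead of keeping every individual error sample in memory.
struct RunningRms {
    double sum_sq = 0.0;     // sum of squared errors seen so far
    std::size_t count = 0;   // number of samples added

    // Add one error sample (for example, estimate minus ground truth).
    void add(double error)
    {
        sum_sq += error * error;
        ++count;
    }

    // RMS of all samples added so far; returns 0 when no samples were added.
    double value() const
    {
        return count ? std::sqrt(sum_sq / static_cast<double>(count)) : 0.0;
    }
};
```

One such accumulator per state would replace a full 2D error history, at the cost of no longer being able to plot the error over time.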
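Two precautions help when timing something as small as a single multiplication: keep the compiler from optimizing the operation away, and estimate the cost of the timer reads themselves. The sketch below shows both, under some assumptions: std::chrono::steady_clock is used only as a host-side stand-in for the SCU timer, and the helper names (time_ns, min_time_ns, multiply_volatile) are hypothetical.

```cpp
#include <chrono>
#include <cstdint>

using Clock = std::chrono::steady_clock;

// Time a single call of `work` in nanoseconds.
template <typename F>
std::int64_t time_ns(F&& work)
{
    const auto start = Clock::now();
    work();
    const auto stop = Clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

// Smallest time observed over `runs` repetitions (runs >= 1).
// Taking the minimum filters out one-off disturbances such as cache misses.
template <typename F>
std::int64_t min_time_ns(F&& work, int runs)
{
    std::int64_t best = time_ns(work);
    for (int i = 1; i < runs; ++i) {
        const std::int64_t t = time_ns(work);
        if (t < best) {
            best = t;
        }
    }
    return best;
}

// Multiply through volatile variables so the compiler must really load the
// operands, perform the multiplication and store the result.
float multiply_volatile(float a, float b)
{
    volatile float x = a;
    volatile float y = b;
    volatile float z = x * y;
    return z;
}
```

Timing an empty lambda with min_time_ns gives an estimate of the measurement overhead, which can then be subtracted from the time measured around multiply_volatile.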

Regards,

Mehmet Burak AYKENAR

You can connect with me via LinkedIn; just send me an invitation:

https://tr.linkedin.com/in/mehmet-burak-aykenar-73326419a
