STM32MP1 NTP server, part 3

Previous posts: part 1, part 2, NTP client

The NTP client confirmed that there was a 943ns difference between my two NTP servers.  Now, to investigate where it is coming from.

Changes

I'll start with a graph of the results of various changes and then explain them in detail.  I tried a number of different things before finding a working solution.

Offset between stm32mp1 and the APU2

Each change is explained in its own section.

RX timestamp

I disabled the NTP RX timestamp adjustment and used the 1588 RX timestamp.  As covered in the previous post, this made the timestamp measurements symmetrical.  The APU2 server was already using the 1588 RX timestamp.

arch_sys_counter

I changed the system time source from the TIM2 peripheral to the built-in CPU counter, arch_sys_counter, which has much lower read latency.  Below is a measurement by phc2sys comparing the TIM2 peripheral vs arch_sys_counter.  The "delay" column at the far right is in nanoseconds.

Reading the time in order to sync the ethernet hardware clock

Reading from the TIM2 peripheral adds two measurements each with around 1.2us latency, but does not change the offset significantly.
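To make the "delay" figure concrete, here is a minimal sketch of a sandwich read: sample the system clock, sample the other clock, then sample the system clock again. The gap between the two system clock reads is the delay, and the other clock's sample is assumed to sit in the middle of that gap. The function name, the clock callables and the "keep the round with the smallest delay" rule are my own illustration, not code taken from phc2sys.

```python
import time


def sandwich_read(read_sys_ns, read_other_ns, rounds=5):
    """Return (offset_ns, delay_ns) from the round with the smallest delay.

    read_sys_ns / read_other_ns are hypothetical callables returning
    nanosecond timestamps. offset is sys minus other, assuming the other
    clock was sampled halfway between the two sys reads.
    """
    best = None
    for _ in range(rounds):
        t1 = read_sys_ns()
        other = read_other_ns()
        t2 = read_sys_ns()
        delay = t2 - t1                    # cost of the whole sandwich
        offset = t1 + delay // 2 - other   # midpoint assumption
        if best is None or delay < best[1]:
            best = (offset, delay)
    return best


if __name__ == "__main__":
    # Userspace stand-in: compare the realtime and monotonic clocks.
    offset_ns, delay_ns = sandwich_read(time.time_ns, time.monotonic_ns)
    print("offset:", offset_ns, "ns  delay:", delay_ns, "ns")
```

A slower clock source makes every sandwich wider, which shows up as a larger delay even when the offset estimate barely moves.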

No PPS adjust

Next, I experimented with disabling the PPS timestamp adjustment code.  The function in the screenshot below calls pps_get_ts to get the current system time.  I need to adjust that system time to match the counter value, so I read TIM_CNT as well.  The previous code read TIM_CNT before and after pps_get_ts and assumed pps_get_ts happened exactly in the middle of those two readings.  This version of the code just assumes pps_get_ts happens immediately after the TIM_CNT reading.

calculating the PPS timestamp

Adjust *

  • Adjust 1/2 - I took two timer peripheral timestamps in a row and added half of that latency to the PPS timestamp
  • Adjust 100% - I took two timer peripheral timestamps in a row and added the whole latency to the PPS timestamp
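The three variants (no adjust, 1/2, 100%) differ only in which counter value is assumed to line up with the system time sample. Here is a rough sketch of that idea. Every name in it is hypothetical, and the order of the reads, the sign of the correction and the lack of counter wraparound handling are all assumptions made for illustration, not the code from the screenshot.

```python
def sys_time_at_capture(capture_cnt, cnt_first, cnt_second,
                        sys_ns, ns_per_tick, mode):
    """Estimate the system time (ns) at which the PPS edge was captured.

    capture_cnt: counter value latched by the timer at the PPS edge
    cnt_first, cnt_second: two back-to-back counter reads
    sys_ns: system time sampled around those reads
    mode: "none", "half" or "full" (how much read latency to apply)
    """
    read_latency = cnt_second - cnt_first  # ticks spent on one counter read
    if mode == "none":
        assumed_cnt = cnt_first                     # sys time taken right at the read
    elif mode == "half":
        assumed_cnt = cnt_first + read_latency / 2  # midpoint of the two reads
    elif mode == "full":
        assumed_cnt = cnt_first + read_latency      # a full read latency later
    else:
        raise ValueError("unknown mode: " + mode)
    ticks_since_capture = assumed_cnt - capture_cnt
    # Walk the system time sample back to the moment of the PPS edge.
    return sys_ns - ticks_since_capture * ns_per_tick
```

With a 20 tick read latency at 10ns per tick, for example, the three modes give estimates that are 100ns apart from each other, which is the scale of error being chased here.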

pps_get_ts doesn't happen exactly after the TIM_CNT read: it takes some time to read the system clock and calculate the system time.  This function takes about 4us to complete on this hardware.

Looking closer at what it is doing, pps_get_ts calls ktime_get_snapshot.  ktime_get_snapshot then calls tk_clock_read, and tk_clock_read calls the function that actually reads the clock.  Then ktime_get_snapshot does some other processing that takes around 3us.

ktime_get_snapshot

I looked at the compiled version of ktime_get_snapshot to see how much work it is doing.  There are 16 instructions between when it starts and when it calls the function to read the clock (the "blx r3" instruction).  That's about 26ns on this hardware.

ktime_get_snapshot - compiled

So the time that passes between calling ktime_get_snapshot and the moment it reads the counter value it uses to calculate the time should be about the same (for small values of "about the same") as the time between two back-to-back reads of the counter.

new adjustment

The new code takes the latency between reading the timer twice (about 1.2us on this hardware), and adds that to the PPS offset.

Final graph

The median offset between these two systems is now 41ns!  The standard deviation is 87ns, with a minimum of -503ns and a maximum of +416ns.
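For anyone wanting to produce the same kind of summary from their own offset log, here is a small sketch using Python's statistics module. The sample values are made up for illustration, and using the sample standard deviation (rather than the population one) is an assumption on my part.

```python
import statistics


def summarize(offsets_ns):
    """Return median, standard deviation, min and max of offset samples (ns)."""
    return {
        "median": statistics.median(offsets_ns),
        "stdev": statistics.stdev(offsets_ns),  # sample standard deviation
        "min": min(offsets_ns),
        "max": max(offsets_ns),
    }


if __name__ == "__main__":
    # Hypothetical offsets in nanoseconds, not measured data.
    samples = [-120, -35, 10, 41, 60, 95, 180]
    print(summarize(samples))
```

The median is less sensitive than the mean to the occasional outlier of several hundred nanoseconds, which makes it a reasonable headline number for this kind of data.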

I'm happy with this result.  Next up, figure out where the offset is coming from on my embedded NTP client.

The code is on GitHub.