Watchdog Timers, Part I

3/27/2018

I find that a lot of engineers do not understand the role of a Watchdog Timer (WDT) and how to use it properly. In this series I will cover various important aspects of WDTs as applied to embedded systems in general, and to nanosatellites in particular.

The purpose of a WDT is to help the system autonomously recover from a serious error. This recovery usually takes the form of a restart or reset, after which the system is in a well-defined initial (and presumably safe and correct) state.

Suppose, for example, that your system included a small piece of code that had not been tested particularly well. Sometime during normal operation, your application found itself in this little code snippet, and from external appearances, stopped working. In reality, this little snippet had an (unintended) infinite loop in it, and this prevented the application from operating normally. But you don't know that; all you know by observing your application is that it suddenly stopped working, it stopped responding to commands, etc. In a situation like this, all you can typically do is press the reset button (I hope you incorporated one in your design), watch the application restart, and then get to the bottom of this problem. Think of a WDT as a reset button that automatically restarts the application when and only when the application software/firmware goes "into the weeds."

WDTs all share a basic behavior: the application must regularly communicate with the WDT within a certain window of time, or else the WDT will cause the application to restart. WDTs can be internal, implemented in software, or external, in the form of a standalone hardware WDT. When we say that we "kick" the WDT, we mean that we interact with the WDT to prevent it from restarting the application. How and when to kick your WDT is a subject unto itself that we may cover at a later time. Suffice it to say that you must structure your software/firmware such that whenever your application stops working properly, the WDT is no longer kicked, and the system automatically restarts a short time later.

In the example above, we're stuck in a loop that (presumably) doesn't kick the WDT. Because the WDT is no longer being kicked, its timer expires, and the WDT forces a system reset/restart.
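
To make the kick-or-reset behavior concrete, here is a minimal sketch in C of a software model of a WDT. It is only an illustration, not the code of any particular WDT: the names wdt_t, wdt_init, wdt_kick, wdt_expired and simulate_hang are all hypothetical, time is simulated in 1 ms ticks, and a real WDT would reset the processor rather than return a value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of a WDT. All names are illustrative. */
typedef struct {
    uint32_t timeout_ms;   /* longest allowed gap between kicks */
    uint32_t last_kick_ms; /* time of the most recent kick */
} wdt_t;

void wdt_init(wdt_t *w, uint32_t timeout_ms, uint32_t now_ms)
{
    assert(w != NULL && timeout_ms > 0);
    w->timeout_ms = timeout_ms;
    w->last_kick_ms = now_ms;
}

/* "Kicking" the WDT restarts its timer. */
void wdt_kick(wdt_t *w, uint32_t now_ms)
{
    w->last_kick_ms = now_ms;
}

/* True once the application has gone timeout_ms without a kick.
 * Unsigned subtraction keeps this correct across counter wraparound. */
bool wdt_expired(const wdt_t *w, uint32_t now_ms)
{
    return (uint32_t)(now_ms - w->last_kick_ms) >= w->timeout_ms;
}

/* Simulate an application that kicks once per 1 ms tick for healthy_ms,
 * then gets stuck in a loop that never kicks. Returns the tick at which
 * the WDT would force a reset, or -1 if it never fires within max_ms. */
int32_t simulate_hang(uint32_t timeout_ms, uint32_t healthy_ms, uint32_t max_ms)
{
    wdt_t w;
    wdt_init(&w, timeout_ms, 0);
    for (uint32_t t = 1; t <= max_ms; t++) {
        if (t <= healthy_ms) {
            wdt_kick(&w, t);   /* application is still working */
        }
        if (wdt_expired(&w, t)) {
            return (int32_t)t; /* WDT forces a reset/restart here */
        }
    }
    return -1;
}
```

With a 50 ms timeout and an application that hangs after 100 ms, simulate_hang(50, 100, 1000) returns 150: the reset arrives one timeout period after the last kick. If the application never hangs, the function returns -1, because a healthy application never lets the timer expire.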

Let's examine the "a short time later" clause, above, and how it may apply to nanosatellites. On a typical nanosatellite, imagine a serious but recoverable software fault (say, one that inadvertently disables the nanosatellite's receivers). Whether the WDT forces a restart in 20ms or 2 minutes or 2 hours or 2 days is up to the designer, though the faster the reset/restart, the less time we'll spend waiting for the system to come back online and receive commands after this software fault. Some nanosatellite power systems have a feature whereby they will power-cycle the entire spacecraft if they don't receive a command or telemetry request within a specified period; this is eminently suited to solve this sort of problem, and is a nice example of a real-world WDT in action. These power systems ignore a system-wide -RESET signal; they neither respond to it, nor do they drive it. Instead, they cycle off and then on all power that is sourced to the other systems in the nanosatellite, and thereby force a system-wide software/firmware restart via Power-On Reset (POR). There are subtleties surrounding a signal like -RESET that we'll explore later in this series.
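
The power-system behavior described above might be modeled roughly as follows. This is a sketch under stated assumptions, not the firmware of any actual power system: the names eps_wdt_t, eps_wdt_init and eps_wdt_tick are hypothetical, the tick is assumed to be one second, and I assume the idle timer restarts after each power cycle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a power system's command-inactivity WDT. */
typedef struct {
    uint32_t timeout_s;    /* allowed time without a command or telemetry request */
    uint32_t idle_s;       /* seconds since the last request was seen */
    uint32_t power_cycles; /* number of power cycles forced so far */
} eps_wdt_t;

void eps_wdt_init(eps_wdt_t *e, uint32_t timeout_s)
{
    assert(e != NULL && timeout_s > 0);
    e->timeout_s = timeout_s;
    e->idle_s = 0;
    e->power_cycles = 0;
}

/* Call once per second. request_seen is true if a command or telemetry
 * request arrived during that second; any such request acts as the kick.
 * Returns true when the rest of the spacecraft should be power-cycled. */
bool eps_wdt_tick(eps_wdt_t *e, bool request_seen)
{
    if (request_seen) {
        e->idle_s = 0;
        return false;
    }
    e->idle_s++;
    if (e->idle_s >= e->timeout_s) {
        e->idle_s = 0;      /* assumption: timer restarts after the cycle */
        e->power_cycles++;  /* power off, then on: everything else sees a POR */
        return true;
    }
    return false;
}
```

Note that nothing here touches a reset line. The only action is removing and restoring power, which matches the description of a power system that neither drives nor responds to -RESET.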

Now imagine a software fault that leaves a high-pressure propulsion valve stuck in the open position. In this case, the nanosatellite can quickly spin out of control and/or exhaust its supply of propellant, should the valve remain energized and open. Here, a WDT with a much faster kick rate (say, 100Hz) is required to ensure that the valve never stays open for more than 10ms longer than it should. The takeaway here is that some software failures are perfectly well-served by WDTs with long periods, but others require a much faster system response, and hence a "faster" WDT.
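
The arithmetic behind the 100Hz and 10ms figures can be sketched in a few lines of C. The function names are hypothetical, and the sketch assumes that the WDT's timeout equals the kick period and that the reset itself takes no time; a real design would need to budget for both.

```c
#include <assert.h>
#include <stdint.h>

/* Interval between kicks, in microseconds, for a given kick rate. */
uint32_t kick_period_us(uint32_t kick_rate_hz)
{
    assert(kick_rate_hz > 0 && kick_rate_hz <= 1000000u);
    return 1000000u / kick_rate_hz;
}

/* Slowest kick rate, rounded up to a whole Hz, whose period does not
 * exceed max_overrun_ms. Assumes the WDT timeout equals the kick period
 * and ignores the time the reset itself takes. */
uint32_t min_kick_rate_hz(uint32_t max_overrun_ms)
{
    assert(max_overrun_ms > 0 && max_overrun_ms <= 1000u);
    return (1000u + max_overrun_ms - 1u) / max_overrun_ms;
}
```

Here kick_period_us(100) gives 10000 microseconds, i.e. 10ms, and min_kick_rate_hz(10) gives 100Hz, which are the two numbers in the valve example.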

In the next installment, we'll explore how the WDT is connected to the rest of a typical nanosatellite.

AEK

From deep in (Pumpkin) space ...