Async instead of RTOS?

One of the more recent developments in programming languages is asynchronous code. While languages like C# and Python have sported async features for quite a while (C#’s async keyword was introduced back in 2012), these features are fairly recent additions to the languages traditionally associated with embedded work: C++ only gained async support in C++20, and that support is of limited use for embedded systems. Rust, however, has pretty mature async support, with frameworks such as Tokio that make async code a breeze to write. The way async/await has been integrated into Rust allows the runtime implementation to be swapped out easily if the need arises, and indeed, for common embedded use cases Tokio is not a good fit and needs to be replaced with something else. One alternative would be embassy.rs.

Intro to Async Code

But why async code in the first place? A lot of today’s embedded work relies on an RTOS such as FreeRTOS. RTOSes are readily available, well understood and have a decently low resource footprint, making them usable on smallish microcontrollers. They offer basic OS functionality, usually in the form of a scheduler, tasks, synchronization primitives and, very often, mailboxes/queues for passing data around. An RTOS will usually sport a very low interrupt latency compared to a traditional OS and typically has memory requirements in the single-digit to low two-digit KiB range. This makes them good choices for mid-range MCUs. An async runtime, on the other hand, provides a couple of similar constructs but has a decidedly lower resource footprint than an RTOS, most notably because:

  • The runtime is a lot smaller than a typical RTOS kernel
  • Async has no need for dedicated task stacks. With an RTOS the programmer has to provide and manage a stack for each task. Needing a stack per task obviously increases the minimum amount of RAM the program needs, thus increasing the resource footprint. Further, badly dimensioned stacks often lead to hard-to-diagnose failures (large stack overflows in particular tend to cause this, and they are usually not reliably detected by the RTOS’s stack protection mechanisms).

So, what’s the real difference between an RTOS and an async runtime? As said before, an RTOS task behaves very similarly to a thread in a desktop OS, i.e. it is its own unit of execution and requires its own stack space. This allows pre-emptive scheduling on the part of the RTOS kernel. An async runtime only ever has a single active task and uses cooperative scheduling, meaning: tasks must be written in such a way that they never block for extended periods of time without returning control to the runtime. This means that any hot loop has the potential to starve the rest of the system. On the plus side, this approach lets us get away with a single stack, as our program is effectively a super-loop program. In the past, similar approaches have usually been implemented with handcrafted state machines, where each invocation of a task function would execute a single state of the state machine, advance to the next state and then return control to the runtime. This would usually look similar to this:

enum class TaskState { Begin, Middle, End, Finished };

TaskState MyTaskFunc(TaskState lastState)
{
  switch (lastState)
  {
     case TaskState::Begin:
       // code for "Begin" state
       return TaskState::Middle;
     case TaskState::Middle:
       // code for "Middle" state
       return TaskState::End;
     case TaskState::End:
       // code for "End" state
       return TaskState::Finished;
     default:
       // nothing left to do
       return TaskState::Finished;
  }
}

Each invocation of MyTaskFunc executes a different state. While this works, it is tedious, error-prone and hard to understand. The introduction of async/await allowed compilers to generate the above code and hide it behind syntactic sugar. An async variant of the above, written in Rust-style syntax, would be:

async fn my_task_func() {
    code_for_begin_state().await;
    code_for_middle_state().await;
    code_for_end_state().await;
}

The compiler will, as stated, generate a state machine similar to the one further above from this code, and the async runtime will take care that all states are progressed through. How this is done is a topic for another article. The important bit is that each “await” corresponds to a return in the original code, meaning that, at runtime, the code between awaits will always be executed completely before the async runtime gets a chance to switch to a different task.

Why this is a good fit for embedded systems

The observant reader will have noted from the last sentence that, as a consequence, no task preemption happens at all, meaning: a single misbehaving task can starve the system. However, this is also somewhat true for an RTOS system, where a stray “while(true);” loop will have a dramatic impact on system performance and is capable of bringing down the system in a lot of cases. Further: most modern firmware is designed to have as few (RTOS!) tasks active as possible, with most tasks waiting to receive a message of some type, even though the tasks technically should run in parallel (or at least in the RTOS’s approximation of parallel). Effectively this means that there is often only ever a single task doing any work while the other tasks are waiting, which is pretty much a perfect fit for the async execution model. This extends to a lot of BSPs, where the chip vendor integrates asynchronous concepts in the form of completion callbacks for certain functions. Consider this snippet from Microchip’s softpack library:

static void _usart_dma_rx(const uint8_t* buffer, uint32_t len)
{
	struct _buffer rx = {
		.data = (unsigned char*)buffer,
		.size = len,
		.attr = USARTD_BUF_ATTR_READ,
	};
	struct _callback _cb = {
		.method = _usart_finish_rx_transfer_callback,
		.arg = 0,
	};
	usartd_transfer(0, &rx, &_cb);
	usartd_wait_rx_transfer(0);
}

What is basically happening is that the “usartd_transfer” function gets a callback that will be called once the transfer is done. Yes, in this example the author chose to call “usartd_wait_rx_transfer”, which sounds as if it blocks until the transfer has completed, but that is obviously not mandatory, and we can use this API completely asynchronously. “But wait, isn’t that just a callback that will be called by means of some interrupt handler?” Glad you asked. Of course that is the case, and that is the main reason why asynchronous code is a good fit for embedded systems: we have external triggers that we can use to control the “readiness” of a task. This means that it is actually fairly easy to wrap CPU/MCU peripheral functions in async code, as our usual mode of interaction is “trigger a hardware function that takes a while, wait for the interrupt, continue execution”.
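To make this a bit more concrete, here is a rough sketch of what such a wrapper could look like in Rust. This is not taken from any real BSP or runtime: start_uart_rx, uart0_rx_irq_handler and UartRxFuture are made-up names, and the waker storage assumes the critical-section crate. The pattern is exactly the one described above: kick off the hardware operation, return Pending until the completion interrupt fires, and let the interrupt handler wake the waiting task.

use core::cell::RefCell;
use core::future::Future;
use core::pin::Pin;
use core::sync::atomic::{AtomicBool, Ordering};
use core::task::{Context, Poll, Waker};
use critical_section::Mutex;

// Completion flag set by the interrupt handler, plus storage for the waker of
// the task that is currently waiting for the transfer.
static RX_DONE: AtomicBool = AtomicBool::new(false);
static RX_WAKER: Mutex<RefCell<Option<Waker>>> = Mutex::new(RefCell::new(None));

struct UartRxFuture;

impl Future for UartRxFuture {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if RX_DONE.swap(false, Ordering::Acquire) {
            // The receive-complete interrupt has already fired.
            return Poll::Ready(());
        }
        // Remember how to wake this task, then hand control back to the runtime.
        critical_section::with(|cs| {
            RX_WAKER.borrow(cs).replace(Some(cx.waker().clone()));
        });
        // Re-check in case the interrupt fired while the waker was being stored.
        if RX_DONE.swap(false, Ordering::Acquire) {
            Poll::Ready(())
        } else {
            Poll::Pending
        }
    }
}

// Hypothetical receive-complete interrupt handler: set the flag and wake the task.
fn uart0_rx_irq_handler() {
    RX_DONE.store(true, Ordering::Release);
    if let Some(waker) = critical_section::with(|cs| RX_WAKER.borrow(cs).take()) {
        waker.wake();
    }
}

// Hypothetical stand-in for the BSP call that programs the UART/DMA registers.
fn start_uart_rx(_buffer: &mut [u8]) {}

// The async wrapper: trigger the hardware, then suspend until the interrupt fires.
async fn uart_receive(buffer: &mut [u8]) {
    start_uart_rx(buffer);
    UartRxFuture.await;
}

In a real system the handler would of course be hooked into the vector table and the future would return the received data; runtimes/HALs such as embassy already provide this kind of wrapping for many peripherals.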

Resource Implications

As stated before, using an async approach allows us to get rid of all the task stacks that a standard RTOS would need. Especially for smaller systems this is a boon (e.g. I’ve worked with systems where we’d spend > 50% of the available RAM on task stacks) and frees up valuable memory for other things. Further, async runtimes tend to be on the order of 4 to 7 times smaller than an RTOS. E.g. FreeRTOS clocks in at ~4 to 9 KiB, whereas runtimes such as embassy take up considerably less (Source). The actual difference is a bit hard to gauge from the source, as we only have final binaries to compare: ~20 KiB (FreeRTOS) vs. ~14 KiB (embassy/Rust). Assuming the code needed to run the chip (i.e. the BSP) is of roughly similar size in both examples, the 6 KiB difference on top of FreeRTOS’s own footprint suggests the embassy runtime itself should take up less than 2 KiB. Add to this the fact that the async solution only needs about 15% of the RAM of the FreeRTOS version, and we end up being able to do the same thing on a smaller/cheaper MCU. Or, to put it differently: while the source’s FreeRTOS example would need an MCU with at least 8 KiB of RAM and 32 KiB of ROM, we’d get away with a measly 1 KiB of RAM and 16 KiB of flash using the async solution, which firmly puts us into a price bracket where we can actually source parts for less than 1€. The FreeRTOS solution needs a beefier MCU, which will most definitely cost more in comparable quantities.

Performance/Latency Implications

This is the elephant in the room. While Dion Dokter’s findings suggest an all-around better performance for the asynchronous solution, there are several things to consider:

  1. Since a task will always run to its next “await” point before a different task gets a chance to run, the latency between an interrupt occurring and the task that waits for that interrupt resuming is harder to reason about, as the currently active task needs to reach an “await” point first.
  2. Newly written code has to be written with care and should “await” frequently to give the runtime the possibility to switch to another task. There must never be long-running/blocking parts of code in a task, so your typical “while(RegisterValueIsNotCorrectYet)” busy wait is not viable (see the sketch after this list for a cooperative alternative).
  3. The currently available async runtimes do not offer priorities. Tasks are scheduled in round-robin fashion, which makes reasoning about the expected latencies even harder. However, since the runtimes are small and not very hard to understand, it is entirely feasible to roll your own runtime that contains some sort of priorities for tasks.
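To illustrate point 2 from the list above: a register poll has to cooperate with the runtime instead of spinning. The following is only a sketch; status_register_ready() is a hypothetical placeholder for whatever status bit you are waiting on, and most runtimes already ship a ready-made yield helper, so you would not normally hand-roll the YieldNow future yourself.

use core::future::Future;
use core::pin::Pin;
use core::task::{Context, Poll};

// A future that returns Pending exactly once, giving every other task a chance to run.
struct YieldNow {
    yielded: bool,
}

impl Future for YieldNow {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            cx.waker().wake_by_ref(); // ask the runtime to poll us again soon
            Poll::Pending
        }
    }
}

// Hypothetical placeholder for reading the peripheral's status register.
fn status_register_ready() -> bool {
    false
}

// Instead of `while (!RegisterValueIsCorrect) {}`, which would starve every other
// task, the loop hands control back to the runtime on each iteration.
async fn wait_for_register() {
    while !status_register_ready() {
        YieldNow { yielded: false }.await;
    }
}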

Other things to note

One very interesting property of async code is the fact that the writer gets to choose when a reschedule can happen. This alone eliminates almost an entire class of hard-to-find errors: data races. Consider the following snippet:

/*...*/
auto i = get_value(SOME_VALUE) | 0x01;    // retrieve a value from somewhere and set a single bit
set_value(SOME_VALUE, i);                 // write the value back with the newly changed bit
/*...*/

If executed in the context of an RTOS, these two lines need synchronized access to “SOME_VALUE”, i.e. a mutex is needed around the two lines. Otherwise it is entirely possible that the task is pre-empted and another task changes “SOME_VALUE” after it has already been retrieved and the bitwise or has been performed in the first line. E.g.:

Task A                   Task B                   SOME_VALUE (global)   Copy in Task A   Copy in Task B
get_value(SOME_VALUE)    Nop                      0x00                  0x00             n.a.
| 0x01                   Nop                      0x00                  0x01             n.a.
Nop                      get_value(SOME_VALUE)    0x00                  0x01             0x00
Nop                      | 0x02                   0x00                  0x01             0x02
Nop                      set_value(SOME_VALUE)    0x02                  0x01             0x02
set_value(SOME_VALUE)    Nop                      0x01                  0x01             n.a.

Data race when accessing the global; in each row, the task that is not executing a Nop is the currently active one.

In the above example two tasks access “SOME_VALUE”: both first obtain a copy, independently set a single bit and write the result back to “SOME_VALUE”. If the first task is preempted after it has already read “SOME_VALUE” (i.e. after the first or second row in the table!), the change made by the second task will be overwritten once the first task is rescheduled. Since these kinds of bugs are highly timing dependent, it is entirely possible that they occur only once every couple of hours or even days, making them very hard to find and diagnose. If we were to run the above example with an async runtime, the described bug could never happen, unless the developer were to actively put it in, e.g.:

/*...*/
auto i = get_value(SOME_VALUE) | 0x01;    // retrieve a value from somewhere and set a single bit
await (SomeThing);                        // will reschedule, might trigger a data race
set_value(SOME_VALUE, i);                 // write the value back with the newly changed bit
/*...*/

The nice thing here is: the location where the task might get rescheduled is clearly visible, so even if this error is made, it is far easier to spot than in the pre-emptive case.

TL;DR

Async code brings a lot of interesting properties to the table and should be considered a valid option for embedded work, especially for small systems. The technical risks involved are manageable and are far outweighed by the possible gains.

Article image: Louis Reed / Unsplash

3 thoughts on “Async instead of RTOS?”

  1. Thanks for this interesting perspective.

    The main things that may have held back async programming in more places are that

    – Writing async code is somewhat cumbersome if you don’t have first-class support in the language. If you have to rely on callbacks for async code, you pay a heavy syntax tax and you need closures for the callbacks, with all the consequences (dynamic dispatch, potentially dynamic memory management). Having the language generate the state machines from linear code makes the code nicer and also improves its runtime properties.
    – Creating an async runtime is hard (as is writing kernel task schedulers) but especially so in a multi-threaded environment. Having good portable ones available (like tokio) and also restricting oneself to single-threaded execution helps a lot.

    So, I guess, it might not be surprising after all that async code might thrive in the future in the embedded space.

    (Incidentally, there is some debate going on in the JVM world whether async is still the right way to go after green threads/continuations have been introduced with the JVM’s Project Loom. Some people argue that with green threads all the thread pool and thread starvation problems get solved automatically, so why would you still care about explicit async programming when you can get most of the benefits for free? One answer to that is that you still need to keep the stacks around, which will be much heavier than keeping just enough data around to execute the next async thunk. But interestingly, from what I gather from your article here, the memory argument is even more important in a memory-constrained environment, where stacks can make up a more significant chunk of the total memory usage than in a typical JVM backend environment where the heaps dominate memory usage…)

  2. I agree with the sentiment that you absolutely need language-level support to make async really worthwhile. However: given that, writing a runtime is not all that difficult and not that much work. You’ll need decent command of your target language, of course. But, looking at embassy’s implementation, this can be done in less than 1 KLoC, and that code is not all that hard to understand if you have a reasonable understanding of Rust’s type system. More full-featured runtimes such as Tokio are obviously a lot more complex and are nothing for the faint of heart to write.
