2026-04-29
u/Less-Owl-4025 How would you design a logging system for a low latency application to not have much impact on the latency?
One thing comes to my mind is, not doing any formatting on the hot path and send raw data through a channel to another thread. In that thread, format the log appropriately and use tracing, tracing-subscriber, tracing-appender to log to a file.
Is there any other suggested approaches or crates for that kind of problem?
Thanks in advance.
Disclaimer: I love logging. I've implemented no less than 4 complete logging systems at various companies in the last 16 years, two of which were low-latency. Brace yourself.
Nothing is faster than doing nothing. Nothing.
A low-latency X must therefore endeavor to do as little work as possible. This means displacing the work that needs to be done as much as possible:
This may seem obvious, but it's actually surprising how much can be displaced.
A special call out that initialization does NOT mean first call, it means a separate phase.
Unless you have a warm-up phase for your application, which can exercise all the execution paths that matter, you should avoid lazy initialization like the plague, lest the first time an execution path is taken it's much slower than usual.
Even if you don't care (now) or think you don't, I still recommend avoiding it. It's the kind of unexpected "surprise" that comes back to bite you, and it can be surprisingly tricky to identify later on.
One the keys to doing less when logging is to note just how much of the information contained in a log statement is static and immutable:
That's a lot of information which:
So just do it.
The constructor crate allows executing functions at binary/library load time. My personal strategy is thus:
main, instantiate the GlobalLogger type, which will walk the above list and register everything.As an aside, I personally like granular activation/deactivation of logs. Being able to specify a minimum log level on a per-module basis (or finer grained) is fairly useful for debugging, I find.
My trick is to use an AtomicU32 log ID per log statement. A static reference to that atomic is registered alongside its information, and the GlobalLogger can activate/deactivate each based on a configuration file.
At log time, it's a simple Relaxed load and compare to 0:
Remember: there's nothing faster than doing nothing.
In my implementation, the lowest level of log level (Debug) expands to nothing in Release mode. This means that within a Debug log statement I can perform potentially costly operations confident in the knowledge that it won't ever matter in Release mode, with no reliance on the optimizer.
Similarly, nothing is executed prior to checking the activation state. This means once again that I can perform potentially costly operations in the log statement, confident that they won't actually be executed unless I activate it.
Combined with a #[cold] annotation for the branch of the second lowest log level, so the compiler can place the entire branch out of the way.
Logging is ancillary concern of the application. It should be out of the way as much as possible.
As such, I recommend the following structure:
if id != 0 {
log_impl(id, || (arguments, ...));
}
Where log_impl itself is #[inline(never)]:
Minimizing the code footprint of log statements means that the optimizer is less likely to stop performing an optimization on the surrounding when a log is introduced.
It's common for logs to be timestamped. Yes, really.
The cheapest way to get an idea of time on x86 is the rdtsc instruction, which clocks in at about 6ns. It doesn't give you the time, but instead the idealized number of cycles since the start of the host machine.
If you're not that low-latency, gettimeofday will give you the time -- nanoseconds resolution -- for only 12ns. An extra 6ns to avoid syncing clocks yourself may be worth it.
Do ask yourself whether you care, however:
If you care about 6ns (or 12ns), those are alternatives to consider.
Formatting can be displaced, as you mentioned. All you need is sufficient information to decode the byte stream...
... the complexity of which varies depending on what information you try to encode in your byte stream.
It's common for logs to allow logging any object. Yes, really.
I recommend not bothering. Instead:
#[repr([i|u]X)] enums.For the latter, my cutesy trick is to register meta-information about the enum -- a mapping from index to name, associated to an u32 ID. I then transmit the enum ID + index (u32) in the stream, which the formatter will translate back to the name of the enumerator, for greater readability.
You could use registration to encode meta-information about user-defined types, or sequences of such. I've never found it worth it. Relying on the fact that expensive calls in log statements are only performed if the log is activated, I just use format!("{x:?") as an argument when I need to, generally in Debug statements (compiled out anyway), or Warn/Error statements (where the latency impact is the least of my worries).
At some point, you do need to pass the dynamic information through. As quickly as possible.
I recommend going as low-level as possible: using a bytes-oriented SPSC queue, with deferred commit.
Unpacking:
The least amount of bytes transmitted the better. The least branches necessary the better.
memcpy the bytes in the slice. The optimizer is pretty good at making optimizing a serie of contiguous writes with no branch in-between.Yes, commits should be done at the end of the event-loop. Really.
There are, however, situations where "pending writes" may accumulate, quite a lot, within a single iteration of the event-loop:
As such, I recommend having a check after writing in the queue: if the "pending writes" are > 1/4 of the queue capacity, then in a cold branch, call the never-inlined commit function.
In a low-latency settings, as long as the consumer keeps up, it'll never be called, so the branch will be well-predicted and have nigh zero impact on performance.
I do personally add a little more work in there, for Quality of Service (QoS).
Any queue needs to deal with the potential for the queue getting full. There are 3 general strategies to do so:
They are here ordered by cost. Extending the queue is very costly. Absolutely incompatible with low-latency. Overwriting old data is possible, cheaply, though a little more complicated... and who, exactly, would want to overwrite an Error log with a Debug one? Not me.
Hence I throw away new data if the queue is full... but I favor doing so responsibly:
The two, combined, add a wee bit of extra code in the critical path, which cost a few cycles. I consider them well-spent. YMMV.
Something, at the other end of the queue, actually needs to handle all of this and push those logs somewhere.
I recommend using a separate process over using a separate thread for several reasons:
A different process does have some extra complexity. Most notably, it cannot access the log meta-information registry nor the enum meta-information registry or a potential object meta-information registry.
I structure communication between the application & logging processes in 1+N queues:
It's all fine... just pace it out.
In order to minimize the impact of logging on the main application -- and therefore on the host -- it's important to avoid large batches of work. They may clog the PCI bus at the wrong moment.
Similarly, if you send logs over the network, you'll ideally want isolation: separate card, separate interface, separate cable, separate switch. If you can't have complete isolation, then once again pace it out:
I did warn I was pretty enthusiastic about the topic, didn't I?
I hope you enjoyed the read ;)
Stephen Funk