Memory Ordering has been the latest topic under discussion in the Open Source Study Group. We walked through several concepts such as atomic memory access, volatile declarations, compiler and CPU reordering with different semantics -including the acquire-release model-, fences, compare-and-set (CAS) loops, ABA problems, lock-free algorithms and more. A few lock-free ring-buffer implementations were analyzed; including loki, dpdk and FreeBSD’s buf_ring (kernel).
We found a memory corruption bug in FreeBSD’s ring-buffer. Under specific thread scheduling conditions, and assuming a multiple-producers scenario, it could be possible to overwrite unread entries. However, we did not attempt full exploitation and our PoC (proof-of-concept) makes a few assumptions that may not accurately represent the reality. As a result, it well be the case that the bug is there but triggering not feasible under the constraints -currently- imposed by the context.
At the risk of being overly conservative, we reported this bug following FreeBSD’s Security Reporting guidelines and let them assess potential denial-of-service or other security implications. Only after that is that we are publishing this work. FreeBSD’s Security Team acknowledged the bug, is currently working on a fix but do not believe that embargoed treatment is required. At the end of this article you will find the chronological sequence of events and the reference to the public report.
Looks to us that FreeBSD version 12.1.0 -latest at the moment of conducting this research- is affected. Previous releases may be affected as well.
Getting started
While doing static analysis in buf_ring.h, we made a couple of annotations of what could possibly go wrong. Given the inherent complexity of race condition bugs and memory ordering, a PoC was essential to confirm or discard each hypothesis.
Before moving into the code, there were a few obstacles to sort out:
- setting up an environment to compile and debug the FreeBSD kernel would require some time;
- “pseudo-randomness” does not guarantee that proper scheduling conditions are consistently met for the bug to be triggered; and
- there has to be a reliable mechanism to detect memory corruption -crashes may be valid on some scenarios, but could be more subtle-.
To get around the previous, we made the following decisions:
- port the buf_ring implementation to Linux user-space, compiling with GCC for x86_64 and developing all the scaffolding around to minimize changes; and,
- instrument buf_ring to generate both deterministic thread scheduling conditions and debug traces.
Porting buf_ring from FreeBSD kernel to Linux user-space
buf_ring implementation is mostly contained in buf_ring.h, with a few functions in subr_bufring.c. Most of the source code is plain C, architecture and operating system independent. There are a few exceptions, though; which explain why machine/cpu.h is an include. To make this work in Linux, we developed our own version of cpu.h mocking all the structures and functions used; trying to provide reasonable equivalence:
- rmb
- A memory fence (mfence) should be a valid replacement in x86_64, as long as it’s not reordered by the compiler.
- critical_enter / critical_exit
- These functions are used in FreeBSD to avoid preemption but do not impose locking or memory ordering constraints. We assume that they can be turned into no-operations: the ring-buffer should still deal with concurrency in the presence of multiple CPUs and, perhaps, interruptions.
- cpu_spinwait
- Busy loop which, in x86_64, is a nop instruction.
- atomic_cmpset_acq_int
- GCC’s __atomic_compare_exchange_n intrinsic generates an equivalent sequence of instructions to FreeBSD in x86_64: they both begin with a lock cmpxchg and, while in FreeBSD a sete explicitly gets the result, in GCC a conditional jump acts upon it.
- atomic_store_rel_int / atomic_load_acq_32
- GCC’s atomic_store_rel_int and atomic_load_acq_32 intrinsics with __ATOMIC_RELEASE and __ATOMIC_ACQUIRE should be semantically equivalent. In x86_64, sequential consistency and atomic uint32 access won’t require special memory barriers anyways.
Deterministic scheduling conditions can be achieved by making the system work like a state machine, with only one thread executing at a time. A mutex lock-unlock scheme with a global state variable would be helpful in that regard. buf_ring has to be instrumented so threads can stop where needed and debug traces generated. A complete state machine move would be: realize who the thread is (pthread_self), check the current global state, do something based on that, set the next global state, wake up the next thread unlocking its mutex and block again.
The bug
We’ll go straight to the bug now but you probably want to refresh how a ring-buffer with multiple consumers and producers work first. There is plenty of information out there; among which the following could be a good starting point: Lock-Free Queue – Part I and Lock Free Queue – Part II.
FreeBSD’s buf_ring allows reads and writes of one entry at a time. There is one strong invariant: at least one entry is always empty in the buffer. The reason is to have a stopper sentinel which prevents Producers from writing unread entries. To write a value (buf_ring_enqueue function), Producers read the current Producer Head (prod_head), calculate its next value considering buffer wrap-around (prod_next) and read the current Consumer Tail (cons_tail).
1 2 3 |
prod_head = br->br_prod_head; prod_next = (prod_head + 1) & br->br_prod_mask; cons_tail = br->br_cons_tail; |
If cons_tail is equal to prod_next, the invariant of one entry always empty would be broken. As a result, there is a check before trying the CAS:
1 2 3 |
if (prod_next == cons_tail) { ... } |
The CAS is supposed to succeed only if no-other Producer wrote first and changed the prod_head value. When succeeding in the CAS, the Producer will write its entry in prod_head while the ring-buffer’s prod_head was set to prod_next. This means that any other Producer succeeding in the CAS will write the next entry and there won’t be conflicts between them.
CAS and entry writing:
1 2 3 |
while (!atomic_cmpset_acq_int(&br->br_prod_head, prod_head, prod_next)); br->br_ring[prod_head] = buf; |
Things get more interesting when a Producer succeeds in the CAS not because no-other Producer wrote in-between but because too many did: the wrapped-around prod_head value is indistinguishable from the one originally read. This is known as an ABA problem.
In the ABA scenario, a Producer will succeed in the CAS but
the ‘unnoticed’ activity may have moved cons_tail. In particular, it may
have moved cons_tail to the point that the previous prod_next ==
cons_tail check result is now invalid. The Producer will move forward -without any further check- and write the entry supposed to be always empty, breaking the strong invariant. With no stopper empty entry, following Producers may overwrite unread values.
We believe that short ring-buffers in highly concurrent scenarios are
the circumstances that can expose the problem. This is because of a more likely prod_head wrap-around.
PoC execution trace
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 |
main: Start Producer 5: About to execute CAS. Values read: prod_head = 0, prod_next = 1, cons_tail = 0 Producer 1: About to execute CAS. Values read: prod_head = 0, prod_next = 1, cons_tail = 0 Producer 1: Waiting for the attack signal Producer 5: About to execute CAS. Values read: prod_head = 1, prod_next = 2, cons_tail = 0 Producer 2: About to execute CAS. Values read: prod_head = 1, prod_next = 2, cons_tail = 0 Producer 2: Waiting for the attack signal Producer 5: About to execute CAS. Values read: prod_head = 2, prod_next = 3, cons_tail = 0 Producer 3: About to execute CAS. Values read: prod_head = 2, prod_next = 3, cons_tail = 0 Producer 3: Waiting for the attack signal Producer 5: Flush the queue (filled with Producer 5 entries) so more producers can join the CAS loop Consumer 1: Read br->br_ring[0] = a Consumer 1: Value retrieved = a Consumer 1: Read br->br_ring[1] = a Consumer 1: Value retrieved = a Producer 5: About to execute CAS. Values read: prod_head = 3, prod_next = 0, cons_tail = 2 Producer 4: About to execute CAS. Values read: prod_head = 3, prod_next = 0, cons_tail = 2 Producer 4: Waiting for the attack signal Producer 5: About to execute CAS. Values read: prod_head = 0, prod_next = 1, cons_tail = 2 Producer 5: All producers ready to attack Producer 5: Queue is full: flush it first Consumer 1: Read br->br_ring[2] = a Consumer 1: Value retrieved = a Consumer 1: Read br->br_ring[3] = a Consumer 1: Value retrieved = a Consumer 1: Read br->br_ring[0] = a Consumer 1: Value retrieved = a Producer 5: prod_head = 1, cons_tail = 1 Producer 5: Launch attack! Producer 2: CAS succeeded. br->br_prod_head = 2 Producer 2: Wrote br->br_ring[1] = a Producer 3: CAS succeeded. br->br_prod_head = 3 Producer 3: Wrote br->br_ring[2] = a Producer 4: CAS succeeded. br->br_prod_head = 0 Producer 4: Wrote br->br_ring[3] = a Producer 1: CAS succeeded. br->br_prod_head = 1 Producer 1: Wrote br->br_ring[0] = a Producer 5: All queue entries were written with 'a', none of them read Producer 5: The 'one entry always empty' invariant has been broken at this point Producer 5: Writing a 'b' would corrupt an unread entry Producer 5: Wrote br->br_ring[1] = b main: Success |
You see there how Producers 1 to 4 first waited before trying the CAS but after passing the prod_next == cons_tail check. Producer 5 and Consumer 1 are helpers to generate this condition. Once achieved, Producer 5 calls each waiting producer in the right order to move, succeed in the CAS and write an entry to the ring-buffer. All entries are eventually written breaking the one entry always empty invariant. Producer 5 is then able to write one more entry and overwrite an unread one.
Please note that the use of many Producers is for illustration purposes only: a couple should be enough -ready for the challenge?-.
Proposed fix
The goal of this fix is to always detect the ABA and abort the CAS, forcing the re-execution of the loop and checking prod_next == cons_tail again. In other words, making the ABA unrealistic to accomplish.
As demonstrated before, the length (range of possible values for prod_head) is a weak protection against the ABA on short ring-buffers: something larger is needed. There are two options:
- Turn the CAS into a double-word or wide CAS. The first word would be prod_head -as before- but the second a uint32 counter incremented always by one. Both values will be checked and set atomically. For an ABA to occur, the counter needs to be wrapped-around the uint32 range and the prod_head around the ring-buffer at the same time. The downside of this approach is that requires support from the architecture. See further information here.
- An easier fix -almost as robust as the previous- is to let prod_head take values across the whole uint32 range. Any uses of prod_head (either for reading or writing entries) would need the mask to be applied first, so there is not a ring-buffer overflow. This is the approach taken by dpdk.
Any of these alternatives should turn the ABA into an unrealistic event, even in the presence of short ring-buffers and highly concurrent scenarios. The second is probably easier to implement, though.
Bug reporting events
- 2020-05-10: Encrypted report sent to FreeBSD’s Security Officer (security-officer@freebsd.org) explaining the bug and attaching the PoC code. Confirmation and an expected timeline were solicited.
- 2020-05-10: FreeBSD’s Security Team confirmed reception and asked for further clarification.
- 2020-05-10: More information sent.
- 2020-05-10: FreeBSD’s Security Team informed that they were going to analyze the report and reply in a couple of days.
- 2020-05-15: FreeBSD’s Security Team confirmed the bug and informed that they are working on a fix. However, they don’t believe that an embargoed treatment is required and asked us to report on the public Bugzilla.
- 2020-05-15: We reported the bug on the public Bugzilla (bug ID 246475) and proceeded to publish this write-up.
Credits: Andres Lopez, Martin Di Paola and Martin Balao.