Under Windows you won't need hazard pointers because you have structured
exception handling (EXCEPTION_ACCESS_VIOLATION). This is possible with
POSIX as well if you trap SIGSEGV and do a siglongjmp out of the handler.
A problem with that is that these signal handlers are process-global,
while a thread only has its own private signal mask.
So I invented a little framework to handle signals with a thread-local
signal handler. A global signal handler is established as a static
object before main(). If a matching signal (given as an int template
parameter) is caught, it dispatches to the thread-local handler.
Here's the source of my little framework (yes, it's rather a framework
and not a lib since it has a callback): [...]
On 09.03.2026 at 23:36, Chris M. Thomasson wrote:
I don't think it mentions it there, but SEH is used. I prefer proxy
collectors over hazard pointers, but that's just me. Well, Joe knows.
Hazard pointers basically need async membar behavior or else they eat
your lunch with that damn #StoreLoad.
I've developed a lock-free stack with SEH. With one thread it is
slower than a SLIST, with two threads it is faster, and with 32
threads it is orders of magnitude faster:
I searched for thread_local and didn't see it.
thread_local gets basic structure into the linker and generalises to
uctx's from older POSIX standards on systems where uctx is
green-thread-like (fillet-cut wrt. thread_local, e.g. Linux), even if
you have to exclude those systems that are fiber-like (steak-cut wrt.
thread_local, as I understand it, e.g. BSD).
thread_local works as C++11 defined it.
#pragma once
#if defined(_WIN32)
    #include <Windows.h>
#endif
#include <atomic>
#include <cstddef>

template<typename T>
struct lockfree_stack
{
    struct node : T
    {
        using T::T;
    private:
        template<typename U>            // U, not T: redeclaring T would shadow it
        friend struct lockfree_stack;
        node *m_next;
    };
    void push( node *nd ) noexcept;
    node *pop() noexcept;
private:
    struct head { node *ptr; std::size_t ctr; };
    static_assert(std::atomic_ref<head>::is_always_lock_free);
    alignas(2 * sizeof(std::size_t)) head m_head = { nullptr, 0 };
};

template<typename T>
void lockfree_stack<T>::push( node *nd ) noexcept
{
    using namespace std;
    atomic_ref ptr( m_head.ptr );
    atomic_ref ctr( m_head.ctr );
    head hdRef( ptr.load( memory_order_relaxed ),
        ctr.load( memory_order_relaxed ) );
    atomic_ref<head> aHead( m_head );
    do
        nd->m_next = hdRef.ptr;
    while( !aHead.compare_exchange_strong( hdRef, head( nd, hdRef.ctr + 1 ),
        memory_order_release, memory_order_relaxed ) );  // retry until the CAS succeeds
}

template<typename T>
auto lockfree_stack<T>::pop() noexcept -> node *
{
    using namespace std;
    atomic_ref ptr( m_head.ptr );
    atomic_ref ctr( m_head.ctr );
    head hdRef( ptr.load( memory_order_relaxed ),
        ctr.load( memory_order_relaxed ) );
    atomic_ref<head> aHead( m_head );
    for( ; ; )
    {
        node *next;
        if( !hdRef.ptr )
            return nullptr;
        __try
        {
            // may fault if another thread already popped and freed *hdRef.ptr
            next = hdRef.ptr->m_next;
        }
        __except( EXCEPTION_EXECUTE_HANDLER )
        {
            hdRef = head( ptr.load( memory_order_relaxed ),
                ctr.load( memory_order_relaxed ) );
            continue;
        }
        if( aHead.compare_exchange_strong( hdRef, head( next, hdRef.ctr + 1 ),
            memory_order_acquire, memory_order_relaxed ) )
            return hdRef.ptr;
    }
}
In the next step I'll port this to POSIX with my scoped_signal.
On 3/9/2026 6:44 AM, Bonita Montero wrote:
Well, IIRC SLIST uses SEH for its work when shit hits the fan on
Windows. However, it's very specialized for the algorithm:
https://learn.microsoft.com/en-us/windows/win32/sync/interlocked-singly-linked-lists
I don't think it mentions it there, but SEH is used. I prefer proxy collectors over hazard pointers, but that's just me. Well, Joe knows.
Hazard pointers basically need async membar behavior or else they eat
your lunch with that damn #StoreLoad.
It depends on how standard your C++ is and whether your system defines a
generalisation beyond the C++ standard; the charter of this group isn't
restricted to ISO standards AFAIK, so you'll find reality creeping in.
Notice where you align?
alignas(2 * sizeof(size_t)) head m_head = { 0, 0 };
Well, I don't think a SLIST anchor is automatically aligned. ...
Notice _aligned_malloc?
Also, have you checked that your compare_exchange_strong is using
CMPXCHG8B on 32 bit systems and CMPXCHG16B on 64 bit systems? That
used to piss me off when it said not always lock-free.
On 11.03.2026 at 00:59, Chris M. Thomasson wrote:
Notice where you align?
alignas(2 * sizeof(size_t)) head m_head = { 0, 0 };
Because atomic_ref does a runtime check on that alignment.
Well, I don't think a SLIST anchor is automatically aligned. ...
On ARM it should.
Notice _aligned_malloc?
Not necessary here; the head is just a class.
Also, have you checked that your compare_exchange_strong is using
CMPXCHG8B on 32 bit systems and CMPXCHG16B on 64 bit systems? That
used to piss me off when it said not always lock-free.
Both are lock-free for sure.
Just to be tidy, so to speak. Pad _and_ align the head on an L2 cache
line, even with SLIST_HEADER. Completely isolate it. Also, for your
nodes, ditto.
Good! What does the lock-free check from C++ give you?
Also, for fun... Keep a counter in a "debug" or "profile" mode of your
code. It simply counts how many times the SEH fired. ...
On 11.03.2026 at 20:28, Chris M. Thomasson wrote:
Just to be tidy, so to speak. Pad _and_ align the head on an L2 cache
line, even with SLIST_HEADER. Completely isolate it. Also, for your
nodes, ditto.
It's enough not to straddle two cache lines. With my Zen4 CPU there's
a performance penalty only if the word straddles a page boundary.
Good! What does the lock-free check from C++ give you?
I trust cl, clang-cl, g++ and clang++. All do it correctly.
Also, for fun... Keep a counter in a "debug" or "profile" mode of your
code. It simply counts how many times the SEH fired. ...
I didn't test that since the code is trivial.
Well, align it on a boundary and pad it up to a boundary.
That way you know your head, or anchor, is isolated. Also, for your
nodes... Each node should be its own L2 cache line, properly aligned
and padded. This goes for using SLIST_HEADER as well...
Did you do a disassembly? On a 64-bit system it should boil down to LOCK CMPXCHG16B and report always lock-free = true. :^)
You should do it. Does it get above zero in your tests?
On 12.03.2026 at 08:36, Bonita Montero wrote:
You should do it. Does it get above zero in your tests?
It's only one line of code with less than 20 characters.
It's only that line "next = hdRef.ptr->m_next;" which is
encapsulated in a __try / __except.
Right. Just count how many times it gets triggered? It should give you an insight...
On 13.03.2026 at 00:05, Chris M. Thomasson wrote:
Right. Just count how many times it gets triggered? It should give you
an insight...
No matter how often, since it's clear that this is a rare condition
with today's memory allocation.
It should be rare, but... Well, add a counter anyway.
Just to see how far it goes over zero... ;^)
On 13.03.2026 at 05:27, Chris M. Thomasson wrote:
It should be rare, but... Well, add a counter anyway.
Just to see how far it goes over zero... ;^)
You can't simulate that without writing a large program with
a specific allocation scheme.
On 3/12/2026 10:13 PM, Bonita Montero wrote:
You can't simulate that without writing a large program with
a specific allocation scheme.
Well, you need to try? See what that counter gets to after, say, a
5-hour run of your system.
On 13.03.2026 at 09:33, Chris M. Thomasson wrote:
Well, you need to try? See what that counter gets to after, say, a
5-hour run of your system.
It's very unlikely that the counter comes back to the same value and
the pointer too. With 32 bits it's 2^32 iterations, and the pointer
must also have the same value. With 64 bits it's 2^64, i.e. 1.8E19
rounds, so you can assume this never happens either.
On 3/13/2026 2:38 AM, Bonita Montero wrote:
It's very unlikely that the counter comes back to the same value and
the pointer too. With 32 bits it's 2^32 iterations, and the pointer
must also have the same value. With 64 bits it's 2^64, i.e. 1.8E19
rounds, so you can assume this never happens either.
It can occur. You can make a special setup to artificially trigger the
SEH to fire?
On 3/13/2026 11:49 AM, Chris M. Thomasson wrote:
It can occur. You can make a special setup to artificially trigger the
SEH to fire?
Architect a special condition.
On 13.03.2026 at 19:49, Chris M. Thomasson wrote:
It can occur. You can make a special setup to artificially trigger
the SEH to fire?
It's obvious where and why the access violation may happen. It's easy
to test that with a simple null-pointer dereference. Given that code,
you can combine the next-pointer read with the trap-catching code.
There's no need to test it; it's only one CPU instruction that may fail here.
On 10.03.2026 at 23:45, Tristan Wibberley wrote:
Depends on how standard your C++ is and whether your system defines a
generalisation to beyond the C++ standard, the charter of this group
isn't restricted to ISO standards AFAIK so you'll find reality
creeping in.
Which compiler is non-conforming in that sense?
See how many times it trips during an "intense simulation" for fun?
Build a program, let it run for say, 12 hours. And see how many times
the SEH fired. You are making me think about the good ol' days. Thanks.
On 15.03.2026 at 21:58, Chris M. Thomasson wrote:
See how many times it trips during an "intense simulation" for fun?
Build a program, let it run for say, 12 hours. And see how many times
the SEH fired. You are making me think about the good ol' days. Thanks.
It's not possible to simulate that in the same environment as a real
application, since the allocation and deallocation behaviour of a real
application is highly dependent on the surrounding code.
Well, I have seen many programs over the years that call new for a node,
push it, pop a node, read its data, and delete it, in a thread pool with
producers and consumers.
That is not efficient, but they were there nonetheless.
On 16/03/2026 19:03, Chris M. Thomasson wrote:
Well, I have seen many programs over the years that call new for a node;
push it; pop a node; read its data and delete it in a thread pool with
producers and consumers.
That is not efficient, but yet they were there.
overloaded operator new?