mbox series

[kirkstone,V2,0/8] glibc: Fix lost wakeup in pthread_cond_wait (PR25847)

Message ID 20250612095639.3630518-1-sunilkumar.dora@windriver.com
Headers show
Series glibc: Fix lost wakeup in pthread_cond_wait (PR25847) | expand

Message

Dora, Sunil Kumar June 12, 2025, 9:56 a.m. UTC
From: Sunil Dora <sunilkumar.dora@windriver.com>

V2 changes:
- Split the previous single large patch into 9 individual patches, 
  each corresponding to an upstream commit, for easier review as requested.

PR: https://sourceware.org/bugzilla/show_bug.cgi?id=25847

A race condition in pthread_cond_wait that could cause lost wake-ups
under high thread contention. The issue particularly affects
applications with many threads waiting on condition variables.

[Root Cause Analysis]
 The lost wakeup occurs when all of these happen simultaneously:
 1. A waiter thread gets preempted between signal decrement and group check
 2. Multiple group rotations occur during preemption
 3. Signal accounting becomes corrupted
 4. Final signals are delivered to empty groups

[Impact]
 Applications using pthread condition variables could experience hangs due to lost wake-ups.
 This particularly affects high-contention scenarios with many threads.

[Fix Details]
 The fix prevents signal stealing by:
     Broadening scope of g_refs to cover entire wait operation
     Removing the complex signal stealing handling code
     Properly maintaining signal accounting invariants

Upstream Status: Fixed in glibc 2.41 and master:
 [https://sourceware.org/bugzilla/show_bug.cgi?id=25847#c72]

According to https://sourceware.org/bugzilla/show_bug.cgi?id=25847#c74, 
commit c36fc50781995e6758cae2b6927839d0157f213c is unsuitable for backporting to older branches 
and has therefore been excluded.

[Testing]
 Verified on x86_64 with:
 - Custom reproduction test case
 - Stress testing with high thread counts

Reproduction Steps:
 1. Build glibc(2.35) with injected delay (usleep(10)) in __pthread_cond_wait_common
     Just before : uint64_t g1_start = __condvar_load_g1_start_relaxed (cond);
     File: { nptl/pthread_cond_wait.c }
     (Note: The injected delay is for testing purposes only to reliably reproduce the race condition)
 2. Compile test program with custom glibc:
     [https://sourceware.org/bugzilla/attachment.cgi?id=14360]
     gcc -g -o pthread_cond_bug pthread_cond_bug.c \
     -Wl,-dynamic-linker=<CUSTOM_GLIBC>/lib/ld-linux-x86-64.so.2 \
     -Wl,-rpath=<CUSTOM_GLIBC>/lib -L<CUSTOM_GLIBC>/lib \
     -I<CUSTOM_GLIBC>/include -lpthread
 3. Run with CPU pinning: taskset -c 0 ./pthread_cond_bug

[Behavior]
 Without the fix:
 - Process hangs for 300 seconds
 - Eventually crashes with core dump
 - Verifiable via gdb

With the fix:
 - Process runs smoothly even with forced delay

Sunil Dora (8):
  pthreads NPTL: lost wakeup fix 2
  nptl: Update comments and indentation for new condvar implementation
  nptl: Remove unnecessary catch-all-wake in condvar group switch
  nptl: Remove unnecessary quadruple check in pthread_cond_wait
  nptl: Use a single loop in pthread_cond_wait instaed of a nested loop
  nptl: Fix indentation
  nptl: rename __condvar_quiesce_and_switch_g1
  nptl: Use all of g1_start and g_signals

 .../glibc/glibc/0026-PR25847-1.patch          | 453 ++++++++++++++++++
 .../glibc/glibc/0026-PR25847-2.patch          | 143 ++++++
 .../glibc/glibc/0026-PR25847-3.patch          |  77 +++
 .../glibc/glibc/0026-PR25847-4.patch          | 116 +++++
 .../glibc/glibc/0026-PR25847-5.patch          | 103 ++++
 .../glibc/glibc/0026-PR25847-6.patch          | 171 +++++++
 .../glibc/glibc/0026-PR25847-7.patch          | 159 ++++++
 .../glibc/glibc/0026-PR25847-8.patch          | 191 ++++++++
 meta/recipes-core/glibc/glibc_2.35.bb         |   8 +
 9 files changed, 1421 insertions(+)
 create mode 100644 meta/recipes-core/glibc/glibc/0026-PR25847-1.patch
 create mode 100644 meta/recipes-core/glibc/glibc/0026-PR25847-2.patch
 create mode 100644 meta/recipes-core/glibc/glibc/0026-PR25847-3.patch
 create mode 100644 meta/recipes-core/glibc/glibc/0026-PR25847-4.patch
 create mode 100644 meta/recipes-core/glibc/glibc/0026-PR25847-5.patch
 create mode 100644 meta/recipes-core/glibc/glibc/0026-PR25847-6.patch
 create mode 100644 meta/recipes-core/glibc/glibc/0026-PR25847-7.patch
 create mode 100644 meta/recipes-core/glibc/glibc/0026-PR25847-8.patch

Comments

Gyorgy Sarvari June 12, 2025, 10:29 a.m. UTC | #1
On 6/12/25 11:56, Dora, Sunil Kumar via lists.openembedded.org wrote:
> From: Sunil Dora <sunilkumar.dora@windriver.com>
>
> V2 changes:
> - Split the previous single large patch into 9 individual patches, 
>   each corresponding to an upstream commit, for easier review as requested.
>
Thanks a lot for splitting it and for the extended introduction, it was
much easier the patches compare it to the source.

One small thing: the recipe name is missing from the individual commit
summaries/email subjects.
Otherwise (for the whole series):

Reviewed-by: Gyorgy Sarvari <skandigraun@gmail.com>
Dora, Sunil Kumar June 12, 2025, 10:46 a.m. UTC | #2
Thank you for the feedback. To confirm, would you like me to send a V3 with glibc: added to each commit subject line as well? I had included it only in the cover letter to avoid modifying the original commit titles.
Gyorgy Sarvari June 12, 2025, 10:54 a.m. UTC | #3
On 6/12/25 12:46, Dora, Sunil Kumar via lists.openembedded.org wrote:
> Thank you for the feedback. To confirm, would you like me to send a V3
> with |glibc:| added to each commit subject line as well? I had
> included it only in the cover letter to avoid modifying the original
> commit titles.
>
Yes, it will be easier to read the commit history. You could also send
all the separate patches in one email, though at this point I think it
would be easier an faster just to keep the series and prefix the
subjects of the individual messages.