Message ID | 20250702204758.2773339-1-ecordonnier@snap.com |
---|---|
State | New |
Series | uihelper: Fix KeyError race condition in pidmap access |
On Wed, 2025-07-02 at 22:47 +0200, Etienne Cordonnier via lists.openembedded.org wrote:
> From: Etienne Cordonnier <ecordonnier@snap.com>
>
> I'm seeing random build errors on scarthgap 5.0.10 because the pid is not contained in pidmap.
> Fix the error by ensuring the PID exists in pidmap before accessing it.
>
> Call-stack of the error:
>
> ```
> Traceback (most recent call last):
>   File "poky/bitbake/lib/bb/ui/knotty.py", line 681, in main
>     helper.eventHandler(event)
>   File "poky/bitbake/lib/bb/ui/uihelper.py", line 43, in eventHandler
>     removetid(event.pid, tid)
>   File "poky/bitbake/lib/bb/ui/uihelper.py", line 28, in removetid
>     if self.pidmap[pid] == tid:
> KeyError: 21041
> WARNING: Exiting due to interrupt.
> ```

The code was designed such that this should not happen. If it is happening, it suggests there is some other issue going on with the task reference handling, and the patch is just hiding some other underlying problem.

Is there any way to reproduce the issue? Do you have other patches applied to bitbake?

Cheers,

Richard
Hi Richard,

I have not yet found a way to reproduce the issue. I'm seeing it in 2 out of 50 builds after updating from 5.0.9 to 5.0.10, on a build server with 60 cores, so that's about a 4% failure rate. I never saw the error before the update. There are no other patches applied to bitbake; however, many other layers are used, so it may not be reproducible on stand-alone poky. I'll monitor and try to find a way to reproduce.

Étienne

On Thu, Jul 3, 2025 at 8:42 AM Richard Purdie <richard.purdie@linuxfoundation.org> wrote:
> The code was designed such that this should not happen. If it is
> happening, it suggests there is some other issue going on with the task
> reference handling and the patch is just hiding some other underlying
> problem.
>
> Is there any way to reproduce the issue? Have you other patches applied
> to bitbake?
>
> Cheers,
>
> Richard
On Thu, 2025-07-03 at 10:03 +0200, Etienne Cordonnier wrote:
> Hi Richard,
>
> I have not yet found a way to reproduce the issue. I'm seeing the
> issue in 2 out of 50 builds after updating from 5.0.9 to 5.0.10, on a
> build server with 60 cores, so that's about 4% failure rate. I've
> never seen the error before the update. There are no other patches
> applied to bitbake, however there are many other layers used so it
> may not be reproducible on stand-alone poky. I'll monitor and try to
> find a way to reproduce.

Is there any pattern, such as only interrupted builds (Ctrl+C?) or builds with failing tasks? Any particular kinds of failures? Just trying to narrow down the possible causes and give ideas...

Cheers,

Richard
There is no obvious pattern. The two failing builds run successfully until this error appears in the logs. There is no Ctrl+C interruption, since these are build servers without human interaction. There is also no obvious reason why a SIGINT would be sent. I'll try to find the root cause, or at least some way to reproduce.

Étienne

On Thu, Jul 3, 2025 at 10:52 AM Richard Purdie <richard.purdie@linuxfoundation.org> wrote:
> Is there any pattern such as only interrupted builds (Ctrl+C?) or build
> with failing tasks? Any particular kinds of failures? Just trying to
> narrow down the possible causes and give ideas...
>
> Cheers,
>
> Richard
On Wed, 2025-07-02 at 22:47 +0200, Etienne Cordonnier via lists.openembedded.org wrote:
> From: Etienne Cordonnier <ecordonnier@snap.com>
>
> I'm seeing random build errors on scarthgap 5.0.10 because the pid is not contained in pidmap.
> Fix the error by ensuring the PID exists in pidmap before accessing it.
>
> Call-stack of the error:
>
> ```
> Traceback (most recent call last):
>   File "poky/bitbake/lib/bb/ui/knotty.py", line 681, in main
>     helper.eventHandler(event)
>   File "poky/bitbake/lib/bb/ui/uihelper.py", line 43, in eventHandler
>     removetid(event.pid, tid)
>   File "poky/bitbake/lib/bb/ui/uihelper.py", line 28, in removetid
>     if self.pidmap[pid] == tid:
> KeyError: 21041
> WARNING: Exiting due to interrupt.
> ```
>
> Signed-off-by: Etienne Cordonnier <ecordonnier@snap.com>
> ---
>  lib/bb/ui/uihelper.py | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/lib/bb/ui/uihelper.py b/lib/bb/ui/uihelper.py
> index e6983bd55..0bc526ca1 100644
> --- a/lib/bb/ui/uihelper.py
> +++ b/lib/bb/ui/uihelper.py
> @@ -25,7 +25,7 @@ class BBUIHelper:
>          def removetid(pid, tid):
>              self.running_pids.remove(tid)
>              del self.running_tasks[tid]
> -            if self.pidmap[pid] == tid:
> +            if pid in self.pidmap and self.pidmap[pid] == tid:
>                  del self.pidmap[pid]
>              self.needUpdate = True
>

FWIW I did discover a bug recently:

https://git.yoctoproject.org/poky/commit/?id=b7173ca2254421c45f01243a77be611fe4b9d1c5

which could potentially have caused the issue you saw.

Copying Steve as this fix could potentially be stable branch material.

Cheers,

Richard
Thanks for the information. I have continued to monitor this on our side, but I have not seen a single failure since July 2, even though 300 builds have run on the same CI build pipeline that failed 2 out of 50 builds while building the 5.0.10 update branch. The fix above is not applied. I guess there was some precondition that triggered the race condition in this update branch and which is no longer there.

Étienne

On Wed, Jul 23, 2025 at 12:45 PM Richard Purdie <richard.purdie@linuxfoundation.org> wrote:
> FWIW I did discover a bug recently:
>
> https://git.yoctoproject.org/poky/commit/?id=b7173ca2254421c45f01243a77be611fe4b9d1c5
>
> which could potentially have caused the issue you saw.
>
> Copying Steve as this fix could potentially be stable branch material.
>
> Cheers,
>
> Richard
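As a rough sanity check on the numbers reported in this thread (2 failures in 50 builds, then 0 in 300), the sketch below computes how likely a clean run of 300 builds would be if the per-build failure rate were still about 4%. It assumes builds fail independently with a fixed probability, which is a simplification and not something the thread establishes; the function name `prob_no_failures` is made up for illustration.

```python
def prob_no_failures(failure_rate: float, builds: int) -> float:
    """Chance of zero failures in `builds` independent builds.

    Assumes every build fails independently with the same probability,
    which is a simplification of a real CI pipeline.
    """
    return (1.0 - failure_rate) ** builds


observed_rate = 2 / 50   # failures seen while building the 5.0.10 update branch
later_builds = 300       # builds since July 2 with no failure

# A very small value means the old failure rate is unlikely to still hold.
print(f"{prob_no_failures(observed_rate, later_builds):.1e}")
```

Under that assumption the probability is tiny, which is consistent with the guess that whatever triggered the error is no longer present, though it says nothing about what the trigger was.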
diff --git a/lib/bb/ui/uihelper.py b/lib/bb/ui/uihelper.py
index e6983bd55..0bc526ca1 100644
--- a/lib/bb/ui/uihelper.py
+++ b/lib/bb/ui/uihelper.py
@@ -25,7 +25,7 @@ class BBUIHelper:
         def removetid(pid, tid):
             self.running_pids.remove(tid)
             del self.running_tasks[tid]
-            if self.pidmap[pid] == tid:
+            if pid in self.pidmap and self.pidmap[pid] == tid:
                 del self.pidmap[pid]
             self.needUpdate = True
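
For readers who want to see the effect of the one-line change in isolation, here is a minimal stand-alone sketch. `PidTracker`, `add`, `remove_unguarded` and `remove_guarded` are hypothetical names, not bitbake code; only the three containers and the body of the removal step mirror the diff above. How the pid goes missing from `pidmap` in a real build is not established in this thread, so the demo simply deletes the entry by hand.

```python
class PidTracker:
    """Hypothetical stand-in for the state touched by removetid()."""

    def __init__(self):
        self.running_tasks = {}   # tid -> task info
        self.running_pids = []    # list of tids, as in the diff
        self.pidmap = {}          # pid -> tid
        self.needUpdate = False

    def add(self, pid, tid):
        self.running_tasks[tid] = {"pid": pid}
        self.running_pids.append(tid)
        self.pidmap[pid] = tid
        self.needUpdate = True

    def remove_unguarded(self, pid, tid):
        # Behaviour before the patch: raises KeyError if pid is absent.
        self.running_pids.remove(tid)
        del self.running_tasks[tid]
        if self.pidmap[pid] == tid:
            del self.pidmap[pid]
        self.needUpdate = True

    def remove_guarded(self, pid, tid):
        # Behaviour after the patch: a missing pid is silently tolerated.
        self.running_pids.remove(tid)
        del self.running_tasks[tid]
        if pid in self.pidmap and self.pidmap[pid] == tid:
            del self.pidmap[pid]
        self.needUpdate = True


if __name__ == "__main__":
    tracker = PidTracker()
    tracker.add(21041, "example:do_compile")
    del tracker.pidmap[21041]      # simulate the unexplained missing entry
    tracker.remove_guarded(21041, "example:do_compile")   # no exception
```

The guard removes the crash but, as noted in the review, it does not explain why the entry was missing in the first place.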