Message ID | 20241111131604.364308-3-mikko.rapeli@linaro.org |
---|---|
State | New |
Headers | show |
Series | [1/3] uki.bbclass: fix debug print logging level | expand |
On Mon, 2024-11-11 at 13:16 +0000, Mikko Rapeli via lists.openembedded.org wrote: > runqemu can fail with RuntimeError exception. Non-cought exception > causes cooker process leaks which bind to successive bitbake command > line calls and that can cause really odd errors to users, e.g. when > build/tmp is wiped and cooker processes expect files to be there. > > Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org> > --- > meta/classes-recipe/testimage.bbclass | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/meta/classes-recipe/testimage.bbclass b/meta/classes-recipe/testimage.bbclass > index 19075ce1f3..a9b031093a 100644 > --- a/meta/classes-recipe/testimage.bbclass > +++ b/meta/classes-recipe/testimage.bbclass > @@ -371,7 +371,7 @@ def testimage_main(d): > complete = True > if results.hasAnyFailingTest(): > run_failed_tests_post_actions(d, tc) > - except (KeyboardInterrupt, BlockingIOError) as err: > + except (KeyboardInterrupt, BlockingIOError, RuntimeError) as err: > if isinstance(err, KeyboardInterrupt): > bb.error('testimage interrupted, shutting down...') > else: > During review it is hard to understand what the real issue is from this description. I don't like the sound of processes leaking and if that is happening, adding another exception to this list doesn't feel correct. I was going to ask for a better explanation but looking at the code, perhaps this error handling path just needs rewriting/improving with more of the code in the finally, conditionally? I just want to make sure we fix the real bug here. Cheers, Richard
Hi, On Tue, Nov 12, 2024 at 11:25:51AM +0000, Richard Purdie wrote: > On Mon, 2024-11-11 at 13:16 +0000, Mikko Rapeli via lists.openembedded.org wrote: > > runqemu can fail with RuntimeError exception. Non-cought exception > > causes cooker process leaks which bind to successive bitbake command > > line calls and that can cause really odd errors to users, e.g. when > > build/tmp is wiped and cooker processes expect files to be there. > > > > Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org> > > --- > > �meta/classes-recipe/testimage.bbclass | 2 +- > > �1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/meta/classes-recipe/testimage.bbclass b/meta/classes-recipe/testimage.bbclass > > index 19075ce1f3..a9b031093a 100644 > > --- a/meta/classes-recipe/testimage.bbclass > > +++ b/meta/classes-recipe/testimage.bbclass > > @@ -371,7 +371,7 @@ def testimage_main(d): > > �������� complete = True > > �������� if results.hasAnyFailingTest(): > > ������������ run_failed_tests_post_actions(d, tc) > > -��� except (KeyboardInterrupt, BlockingIOError) as err: > > +��� except (KeyboardInterrupt, BlockingIOError, RuntimeError) as err: > > �������� if isinstance(err, KeyboardInterrupt): > > ������������ bb.error('testimage interrupted, shutting down...') > > �������� else: > > > > During review it is hard to understand what the real issue is from this > description. I don't like the sound of processes leaking and if that is > happening, adding another exception to this list doesn't feel correct. > I was going to ask for a better explanation but looking at the code, > perhaps this error handling path just needs rewriting/improving with > more of the code in the finally, conditionally? > > I just want to make sure we fix the real bug here. Sorry for being unclear. I thought the backtrace would be too verbose. The bug happens when runqemu startup fails: poky/meta/lib/oeqa/targetcontrol.py: raise RuntimeError("%s - FAILED to start qemu - check the task log and the boot log" % self.pn) cooker processes do leak when the exceptions are not cought. Maybe these are not strictly related but it happens for me. It can be that cleanup happens but just slowly, and when I run other bitbake commands right after failure they connect to these leaked cooker processes which then behave badly, for example when build/tmp was already wiped. Here is a failure log from one of my builds from meta-arm with other changes on top which are the root cause for missing rootfs file: ERROR: core-image-base-1.0-r0 do_testimage: Invalid rootfs /home/builder/src/base/meta-arm/build/tmp/deploy/images/qemuarm64-secureboot/core-image-base-qemuarm64-secureboot.rootfs.wic.qcow2 ERROR: core-image-base-1.0-r0 do_testimage: Error executing a python function in exec_func_python() autogenerated: The stack trace of python calls that resulted in this exception/failure was: File: 'exec_func_python() autogenerated', lineno: 2, function: <module> 0001: *** 0002:do_testimage(d) 0003: File: '/home/builder/src/base/meta-arm/build/../poky/meta/classes-recipe/testimage.bbclass', lineno: 122, function: do_testimage 0118: dump-guest-memory {"paging":false,"protocol":"file:%s.img"} 0119:} 0120: 0121:python do_testimage() { *** 0122: testimage_main(d) 0123:} 0124: 0125:addtask testimage 0126:do_testimage[nostamp] = "1" File: '/home/builder/src/base/meta-arm/build/../poky/meta/classes-recipe/testimage.bbclass', lineno: 364, function: testimage_main 0360: orig_sigterm_handler = signal.signal(signal.SIGTERM, sigterm_exception) 0361: try: 0362: # We need to check if runqemu ends unexpectedly 0363: # or if the worker send us a SIGTERM *** 0364: tc.target.start(params=d.getVar("TEST_QEMUPARAMS"), runqemuparams=d.getVar("TEST_RUNQEMUPARAMS")) 0365: import threading 0366: try: 0367: threading.Timer(int(d.getVar("TEST_OVERALL_TIMEOUT")), handle_test_timeout, (int(d.getVar("TEST_OVERALL_TIMEOUT")),)).start() 0368: except ValueError: File: '/home/builder/src/base/meta-arm/build/../poky/meta/lib/oeqa/core/target/qemu.py', lineno: 91, function: start 0087: except (subprocess.CalledProcessError, subprocess.TimeoutExpired, FileNotFoundError) as err: 0088: msg += "Error running command: %s\n%s\n" % (blcmd, err) 0089: msg += "\n\n===== end: snippet =====\n" 0090: *** 0091: raise RuntimeError("FAILED to start qemu - check the task log and the boot log %s" % (msg)) 0092: 0093: def stop(self): 0094: self.runner.stop() Exception: RuntimeError: FAILED to start qemu - check the task log and the boot log ===== start: snippet ===== ===== end: snippet ===== ===== start: snippet ===== Error running command: ['tail', '-20', '/home/builder/src/base/meta-arm/build/tmp/work/qemuarm64_secureboot-poky-linux/core-image-base/1.0/testimage/qemu_boot_log.20241031142321'] Command '['tail', '-20', '/home/builder/src/base/meta-arm/build/tmp/work/qemuarm64_secureboot-poky-linux/core-image-base/1.0/testimage/qemu_boot_log.20241031142321']' returned non-zero exit status 1. ===== end: snippet ===== ERROR: Logfile of failure stored in: /home/builder/src/base/meta-arm/build/tmp/work/qemuarm64_secureboot-poky-linux/core-image-base/1.0/temp/log.do_testimage.1779338 NOTE: recipe core-image-base-1.0-r0: task do_testimage: Failed ERROR: Task (/home/builder/src/base/meta-arm/build/../poky/meta/recipes-core/images/core-image-base.bb:do_testimage) failed with exit code '1' NOTE: Tasks Summary: Attempted 7084 tasks of which 7054 didn't need to be rerun and 1 failed. Summary: 1 task failed: /home/builder/src/base/meta-arm/build/../poky/meta/recipes-core/images/core-image-base.bb:do_testimage Cheers, -Mikko
On Mon, 2024-11-18 at 10:00 +0200, Mikko Rapeli wrote: > On Tue, Nov 12, 2024 at 11:25:51AM +0000, Richard Purdie wrote: > > On Mon, 2024-11-11 at 13:16 +0000, Mikko Rapeli via lists.openembedded.org wrote: > > > runqemu can fail with RuntimeError exception. Non-cought exception > > > causes cooker process leaks which bind to successive bitbake command > > > line calls and that can cause really odd errors to users, e.g. when > > > build/tmp is wiped and cooker processes expect files to be there. > > > > > > Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org> > > > --- > > > meta/classes-recipe/testimage.bbclass | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/meta/classes-recipe/testimage.bbclass b/meta/classes-recipe/testimage.bbclass > > > index 19075ce1f3..a9b031093a 100644 > > > --- a/meta/classes-recipe/testimage.bbclass > > > +++ b/meta/classes-recipe/testimage.bbclass > > > @@ -371,7 +371,7 @@ def testimage_main(d): > > > complete = True > > > if results.hasAnyFailingTest(): > > > run_failed_tests_post_actions(d, tc) > > > - except (KeyboardInterrupt, BlockingIOError) as err: > > > + except (KeyboardInterrupt, BlockingIOError, RuntimeError) as err: > > > if isinstance(err, KeyboardInterrupt): > > > bb.error('testimage interrupted, shutting down...') > > > else: > > > > > > > During review it is hard to understand what the real issue is from this > > description. I don't like the sound of processes leaking and if that is > > happening, adding another exception to this list doesn't feel correct. > > I was going to ask for a better explanation but looking at the code, > > perhaps this error handling path just needs rewriting/improving with > > more of the code in the finally, conditionally? > > > > I just want to make sure we fix the real bug here. > > Sorry for being unclear. I thought the backtrace would be too verbose. > > The bug happens when runqemu startup fails: > > poky/meta/lib/oeqa/targetcontrol.py: raise RuntimeError("%s - FAILED to start qemu - check the task log and the boot log" % self.pn) > > cooker processes do leak when the exceptions are not cought. > Maybe these are not strictly related but it happens for me. It > can be that cleanup happens but just slowly, and when I run > other bitbake commands right after failure they connect to these > leaked cooker processes which then behave badly, for example when > build/tmp was already wiped. > Sorry for the delay in looking at this patch. I'm a bit worried about there being a leaked processes and wanted to understand if there was other cleanup we should be doing. Instead of this path, would it make sense to move the results.stop() inside the finally? I'm worried that other forms of exception would also leak processes. Cheers, Richard
On Tue, 2025-01-28 at 13:04 +0000, Richard Purdie via lists.openembedded.org wrote: > On Mon, 2024-11-18 at 10:00 +0200, Mikko Rapeli wrote: > > On Tue, Nov 12, 2024 at 11:25:51AM +0000, Richard Purdie wrote: > > > On Mon, 2024-11-11 at 13:16 +0000, Mikko Rapeli via > > > lists.openembedded.org wrote: > > > > runqemu can fail with RuntimeError exception. Non-cought > > > > exception > > > > causes cooker process leaks which bind to successive bitbake > > > > command > > > > line calls and that can cause really odd errors to users, e.g. > > > > when > > > > build/tmp is wiped and cooker processes expect files to be > > > > there. > > > > > > > > Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org> > > > > --- > > > > meta/classes-recipe/testimage.bbclass | 2 +- > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > diff --git a/meta/classes-recipe/testimage.bbclass > > > > b/meta/classes-recipe/testimage.bbclass > > > > index 19075ce1f3..a9b031093a 100644 > > > > --- a/meta/classes-recipe/testimage.bbclass > > > > +++ b/meta/classes-recipe/testimage.bbclass > > > > @@ -371,7 +371,7 @@ def testimage_main(d): > > > > complete = True > > > > if results.hasAnyFailingTest(): > > > > run_failed_tests_post_actions(d, tc) > > > > - except (KeyboardInterrupt, BlockingIOError) as err: > > > > + except (KeyboardInterrupt, BlockingIOError, RuntimeError) > > > > as err: > > > > if isinstance(err, KeyboardInterrupt): > > > > bb.error('testimage interrupted, shutting > > > > down...') > > > > else: > > > > > > > > > > During review it is hard to understand what the real issue is > > > from this > > > description. I don't like the sound of processes leaking and if > > > that is > > > happening, adding another exception to this list doesn't feel > > > correct. > > > I was going to ask for a better explanation but looking at the > > > code, > > > perhaps this error handling path just needs rewriting/improving > > > with > > > more of the code in the finally, conditionally? > > > > > > I just want to make sure we fix the real bug here. > > > > Sorry for being unclear. I thought the backtrace would be too > > verbose. > > > > The bug happens when runqemu startup fails: > > > > poky/meta/lib/oeqa/targetcontrol.py: raise > > RuntimeError("%s - FAILED to start qemu - check the task log and > > the boot log" % self.pn) > > > > cooker processes do leak when the exceptions are not cought. > > Maybe these are not strictly related but it happens for me. It > > can be that cleanup happens but just slowly, and when I run > > other bitbake commands right after failure they connect to these > > leaked cooker processes which then behave badly, for example when > > build/tmp was already wiped. > > > > Sorry for the delay in looking at this patch. I'm a bit worried about > there being a leaked processes and wanted to understand if there was > other cleanup we should be doing. > > Instead of this path, would it make sense to move the results.stop() > inside the finally? I'm worried that other forms of exception would > also leak processes. I've looked more at this and I don't understand how this patch works. By adding the exception to the exception clause, it will trigger a bb.error() call. The results.stop() call won't happen as results isn't set in this failure. The only significant difference would therefore be the results = tc.results assignment. Does that really stop processes? If so, should we always be doing that? I'd like to understand what we need to do to "fix" things so we can handle other exceptions. Cheers, Richard
Hi, On Tue, Jan 28, 2025 at 01:49:00PM +0000, Richard Purdie wrote: > On Tue, 2025-01-28 at 13:04 +0000, Richard Purdie via > lists.openembedded.org wrote: > > On Mon, 2024-11-18 at 10:00 +0200, Mikko Rapeli wrote: > > > On Tue, Nov 12, 2024 at 11:25:51AM +0000, Richard Purdie wrote: > > > > On Mon, 2024-11-11 at 13:16 +0000, Mikko Rapeli via > > > > lists.openembedded.org wrote: > > > > > runqemu can fail with RuntimeError exception. Non-cought > > > > > exception > > > > > causes cooker process leaks which bind to successive bitbake > > > > > command > > > > > line calls and that can cause really odd errors to users, e.g. > > > > > when > > > > > build/tmp is wiped and cooker processes expect files to be > > > > > there. > > > > > > > > > > Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org> > > > > > --- > > > > > �meta/classes-recipe/testimage.bbclass | 2 +- > > > > > �1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > > > diff --git a/meta/classes-recipe/testimage.bbclass > > > > > b/meta/classes-recipe/testimage.bbclass > > > > > index 19075ce1f3..a9b031093a 100644 > > > > > --- a/meta/classes-recipe/testimage.bbclass > > > > > +++ b/meta/classes-recipe/testimage.bbclass > > > > > @@ -371,7 +371,7 @@ def testimage_main(d): > > > > > �������� complete = True > > > > > �������� if results.hasAnyFailingTest(): > > > > > ������������ run_failed_tests_post_actions(d, tc) > > > > > -��� except (KeyboardInterrupt, BlockingIOError) as err: > > > > > +��� except (KeyboardInterrupt, BlockingIOError, RuntimeError) > > > > > as err: > > > > > �������� if isinstance(err, KeyboardInterrupt): > > > > > ������������ bb.error('testimage interrupted, shutting > > > > > down...') > > > > > �������� else: > > > > > > > > > > > > > During review it is hard to understand what the real issue is > > > > from this > > > > description. I don't like the sound of processes leaking and if > > > > that is > > > > happening, adding another exception to this list doesn't feel > > > > correct. > > > > I was going to ask for a better explanation but looking at the > > > > code, > > > > perhaps this error handling path just needs rewriting/improving > > > > with > > > > more of the code in the finally, conditionally? > > > > > > > > I just want to make sure we fix the real bug here. > > > > > > Sorry for being unclear. I thought the backtrace would be too > > > verbose. > > > > > > The bug happens when runqemu startup fails: > > > > > > poky/meta/lib/oeqa/targetcontrol.py:����������� raise > > > RuntimeError("%s - FAILED to start qemu - check the task log and > > > the boot log" % self.pn) > > > > > > cooker processes do leak when the exceptions are not cought. > > > Maybe these are not strictly related but it happens for me. It > > > can be that cleanup happens but just slowly, and when I run > > > other bitbake commands right after failure they connect to these > > > leaked cooker processes which then behave badly, for example when > > > build/tmp was already wiped. > > > > > > > Sorry for the delay in looking at this patch. I'm a bit worried about > > there being a leaked processes and wanted to understand if there was > > other cleanup we should be doing. > > > > Instead of this path, would it make sense to move the results.stop() > > inside the finally? I'm worried that other forms of exception would > > also leak processes. > > I've looked more at this and I don't understand how this patch works. > > By adding the exception to the exception clause, it will trigger a > bb.error() call. The results.stop() call won't happen as results isn't > set in this failure. > > The only significant difference would therefore be the results = > tc.results assignment. Does that really stop processes? If so, should > we always be doing that? > > I'd like to understand what we need to do to "fix" things so we can > handle other exceptions. If I remember correctly, I saw this when building meta-arm qemuarm64-secureboot kas config which runs oeqa runtime tests with testimage.bbclass during build. Without fix 4b3b37156cdf70230bc74c647784da8514c284cf in poky a single failing test leaked Cooker processes. Then a rebuild of core-image-base connected to that Cooker process and somehow failed to re-generate rootfs which became an empty directory inside .wic image. This resulted qemu boot hanging which again throws RuntimeError from meta/lib/oeqa/core/target/qemu.py line 52. Example with poky from October 2024 (from old log files which I still had): NOTE: Reconnecting to bitbake server... ... NOTE: recipe core-image-base-1.0-r0: task do_create_image_spdx: Succeeded QMP Available for connection at /home/builder/src/base/meta-arm/build/tmp/.ndozl5yn QMP connected to QEMU at 10/31/24 13:55:56 and took 0.56 seconds QMP released QEMU at 10/31/24 13:55:56 and took 0.19 seconds from connect Bitbake still alive (no events for 600s). Active tasks: /home/builder/src/base/meta-arm/build/../poky/meta/recipes-core/images/core-image-base.bb:do_testimage WARNING: core-image-base-1.0-r0 do_testimage: Target didn't reach login banner in 1000 seconds (10/31/24 14:12:36) WARNING: core-image-base-1.0-r0 do_testimage: Last 25 lines of login console (22962): [ 4.345255] Key type fscrypt-provisioning registered [ 4.363120] Btrfs loaded, zoned=no, fsverity=no [ 4.522255] input: QEMU QEMU USB Tablet as /devices/platform/4010000000.pcie/pci0000:00/0000:00:03.0/usb1/1-1/1-1:1.0/0003:0627:0001.0001/input /input0 [ 4.528765] hid-generic 0003:0627:0001.0001: input: USB HID v0.01 Mouse [QEMU QEMU USB Tablet] on usb-0000:00:03.0-1/input0 [ 4.636796] Key type encrypted registered [ 4.644819] printk: legacy console [netcon0] enabled [ 4.645638] netconsole: network logging started [ 4.651065] clk: Disabling unused clocks [ 4.651780] PM: genpd: Disabling unused power domains [ 4.669045] usb 1-2: new high-speed USB device number 3 using xhci_hcd [ 4.794967] Freeing unused kernel memory: 1856K [ 4.803684] Run /init as init process [ 4.847358] input: QEMU QEMU USB Keyboard as /devices/platform/4010000000.pcie/pci0000:00/0000:00:03.0/usb1/1-2/1-2:1.0/0003:0627:0001.0002/inp ut/input1 [ 4.919633] hid-generic 0003:0627:0001.0002: input: USB HID v1.11 Keyboard [QEMU QEMU USB Keyboard] on usb-0000:00:03.0-2/input0 Starting systemd-udevd version 256.6 [ 7.407191] virtio_net virtio0 enp0s1: renamed from eth0 [ 12.633172] ext3: Unknown parameter 'umask' [ 12.634318] ext2: Unknown parameter 'umask' [ 12.634924] ext4: Unknown parameter 'umask' [ 12.676391] EXT4-fs (vda2): mounted filesystem 41dc66e6-d693-46b3-a5d6-a46998500609 r/w with ordered data mode. Quota mode: disabled. [ 13.080423] /dev/disk/by-label/root: Can't open blockdev [ 13.081201] /dev/disk/by-label/root: Can't open blockdev ERROR: There's no '/dev' on rootfs. WARNING: core-image-base-1.0-r0 do_testimage: Check full boot log: /home/builder/src/base/meta-arm/build/tmp/work/qemuarm64_secureboot-poky-l inux/core-image-base/1.0/testimage/qemu_boot_log.20241031135427 Output from runqemu: runqemu - INFO - Received signal: 15 runqemu - INFO - Cleaning up runqemu - INFO - Host uptime: 14098967.33 tput: No value for $TERM and no -T specified WARNING: core-image-base-1.0-r0 do_testimage: Extra log data read: Output from runqemu: runqemu - INFO - Received signal: 15 runqemu - INFO - Cleaning up runqemu - INFO - Host uptime: 14098967.33 tput: No value for $TERM and no -T specified Output from runqemu: runqemu - INFO - Received signal: 15 runqemu - INFO - Cleaning up runqemu - INFO - Host uptime: 14098967.33 tput: No value for $TERM and no -T specified ERROR: core-image-base-1.0-r0 do_testimage: Error executing a python function in exec_func_python() autogenerated: The stack trace of python calls that resulted in this exception/failure was: File: 'exec_func_python() autogenerated', lineno: 2, function: <module> 0001: *** 0002:do_testimage(d) 0003: File: '/home/builder/src/base/meta-arm/build/../poky/meta/classes-recipe/testimage.bbclass', lineno: 122, function: do_testimage 0118: dump-guest-memory {"paging":false,"protocol":"file:%s.img"} 0119:} 0120: 0121:python do_testimage() { *** 0122: testimage_main(d) 0123:} 0124: 0125:addtask testimage 0126:do_testimage[nostamp] = "1" File: '/home/builder/src/base/meta-arm/build/../poky/meta/classes-recipe/testimage.bbclass', lineno: 364, function: testimage_main 0360: orig_sigterm_handler = signal.signal(signal.SIGTERM, sigterm_exception) 0361: try: 0362: # We need to check if runqemu ends unexpectedly 0363: # or if the worker send us a SIGTERM *** 0364: tc.target.start(params=d.getVar("TEST_QEMUPARAMS"), runqemuparams=d.getVar("TEST_RUNQEMUPARAMS")) 0365: import threading 0366: try: 0367: threading.Timer(int(d.getVar("TEST_OVERALL_TIMEOUT")), handle_test_timeout, (int(d.getVar("TEST_OVERALL_TIMEOUT")),)).start( ) 0368: except ValueError: File: '/home/builder/src/base/meta-arm/build/../poky/meta/lib/oeqa/core/target/qemu.py', lineno: 91, function: start 0087: except (subprocess.CalledProcessError, subprocess.TimeoutExpired, FileNotFoundError) as err: 0088: msg += "Error running command: %s\n%s\n" % (blcmd, err) 0089: msg += "\n\n===== end: snippet =====\n" 0090: *** 0091: raise RuntimeError("FAILED to start qemu - check the task log and the boot log %s" % (msg)) 0092: 0093: def stop(self): 0094: self.runner.stop() Exception: RuntimeError: FAILED to start qemu - check the task log and the boot log ===== start: snippet ===== ===== end: snippet ===== ===== start: snippet ===== Error running command: ['tail', '-20', '/home/builder/src/base/meta-arm/build/tmp/work/qemuarm64_secureboot-poky-linux/core-image-base/1.0/te stimage/qemu_boot_log.20241031135427'] Command '['tail', '-20', '/home/builder/src/base/meta-arm/build/tmp/work/qemuarm64_secureboot-poky-linux/core-image-base/1.0/testimage/qemu_b oot_log.20241031135427']' returned non-zero exit status 1. ===== end: snippet ===== ERROR: Logfile of failure stored in: /home/builder/src/base/meta-arm/build/tmp/work/qemuarm64_secureboot-poky-linux/core-image-base/1.0/temp/ log.do_testimage.1578012 NOTE: recipe core-image-base-1.0-r0: task do_testimage: Failed ERROR: Task (/home/builder/src/base/meta-arm/build/../poky/meta/recipes-core/images/core-image-base.bb:do_testimage) failed with exit code '1 ' NOTE: Tasks Summary: Attempted 7116 tasks of which 7081 didn't need to be rerun and 1 failed. Summary: 1 task failed: /home/builder/src/base/meta-arm/build/../poky/meta/recipes-core/images/core-image-base.bb:do_testimage Since this exit happens without catching RuntimeError, there is no cleanup and Cooker, qemu-system-aarch64 etc processes get leaked again. If the RuntimeError is captured, then clean up is done and qemu-system-aarch64 and Cooker processes get killed. Together with 4b3b37156cdf70230bc74c647784da8514c284cf in poky the failing oeqa runtime test under qemu was handled correctly and rebuilds always ended up with same failure, specifically from parselogs which is a bit greedy when detecting errors. Does this help? If not, I can try to replicate the setup and failures again, and try different fixes. Cheers, -Mikko
On Tue, 2025-01-28 at 16:37 +0200, Mikko Rapeli wrote: > On Tue, Jan 28, 2025 at 01:49:00PM +0000, Richard Purdie wrote: > > On Tue, 2025-01-28 at 13:04 +0000, Richard Purdie via > > lists.openembedded.org wrote: > > > > > > Sorry for the delay in looking at this patch. I'm a bit worried about > > > there being a leaked processes and wanted to understand if there was > > > other cleanup we should be doing. > > > > > > Instead of this path, would it make sense to move the results.stop() > > > inside the finally? I'm worried that other forms of exception would > > > also leak processes. > > > > I've looked more at this and I don't understand how this patch works. > > > > By adding the exception to the exception clause, it will trigger a > > bb.error() call. The results.stop() call won't happen as results isn't > > set in this failure. > > > > The only significant difference would therefore be the results = > > tc.results assignment. Does that really stop processes? If so, should > > we always be doing that? > > > > I'd like to understand what we need to do to "fix" things so we can > > handle other exceptions. > > If I remember correctly, I saw this when building meta-arm qemuarm64-secureboot > kas config which runs oeqa runtime tests with testimage.bbclass during build. > > Without fix 4b3b37156cdf70230bc74c647784da8514c284cf in poky a single failing test > leaked Cooker processes. Then a rebuild of core-image-base connected to that Cooker > process and somehow failed to re-generate rootfs which became an empty directory > inside .wic image. This resulted qemu boot hanging which again throws RuntimeError > from meta/lib/oeqa/core/target/qemu.py line 52. Just to be clear, is memory resident bitbake set or not (BB_SERVER_TIMEOUT) ? > Example with poky from October 2024 (from old log files which I still had): > > NOTE: Reconnecting to bitbake server... > ... > NOTE: recipe core-image-base-1.0-r0: task do_create_image_spdx: Succeeded > QMP Available for connection at /home/builder/src/base/meta-arm/build/tmp/.ndozl5yn > QMP connected to QEMU at 10/31/24 13:55:56 and took 0.56 seconds > QMP released QEMU at 10/31/24 13:55:56 and took 0.19 seconds from connect > Bitbake still alive (no events for 600s). Active tasks: > /home/builder/src/base/meta-arm/build/../poky/meta/recipes-core/images/core-image-base.bb:do_testimage > WARNING: core-image-base-1.0-r0 do_testimage: Target didn't reach login banner in 1000 seconds (10/31/24 14:12:36) > WARNING: core-image-base-1.0-r0 do_testimage: Last 25 lines of login console (22962): > [ 4.345255] Key type fscrypt-provisioning registered > [ 4.363120] Btrfs loaded, zoned=no, fsverity=no > [ 4.522255] input: QEMU QEMU USB Tablet as /devices/platform/4010000000.pcie/pci0000:00/0000:00:03.0/usb1/1-1/1-1:1.0/0003:0627:0001.0001/input > /input0 > [ 4.528765] hid-generic 0003:0627:0001.0001: input: USB HID v0.01 Mouse [QEMU QEMU USB Tablet] on usb-0000:00:03.0-1/input0 > [ 4.636796] Key type encrypted registered > [ 4.644819] printk: legacy console [netcon0] enabled > [ 4.645638] netconsole: network logging started > [ 4.651065] clk: Disabling unused clocks > [ 4.651780] PM: genpd: Disabling unused power domains > [ 4.669045] usb 1-2: new high-speed USB device number 3 using xhci_hcd > [ 4.794967] Freeing unused kernel memory: 1856K > [ 4.803684] Run /init as init process > [ 4.847358] input: QEMU QEMU USB Keyboard as /devices/platform/4010000000.pcie/pci0000:00/0000:00:03.0/usb1/1-2/1-2:1.0/0003:0627:0001.0002/inp > ut/input1 > [ 4.919633] hid-generic 0003:0627:0001.0002: input: USB HID v1.11 Keyboard [QEMU QEMU USB Keyboard] on usb-0000:00:03.0-2/input0 > Starting systemd-udevd version 256.6 > > [ 7.407191] virtio_net virtio0 enp0s1: renamed from eth0 > [ 12.633172] ext3: Unknown parameter 'umask' > [ 12.634318] ext2: Unknown parameter 'umask' > [ 12.634924] ext4: Unknown parameter 'umask' > [ 12.676391] EXT4-fs (vda2): mounted filesystem 41dc66e6-d693-46b3-a5d6-a46998500609 r/w with ordered data mode. Quota mode: disabled. > [ 13.080423] /dev/disk/by-label/root: Can't open blockdev > [ 13.081201] /dev/disk/by-label/root: Can't open blockdev > ERROR: There's no '/dev' on rootfs. > > WARNING: core-image-base-1.0-r0 do_testimage: Check full boot log: /home/builder/src/base/meta-arm/build/tmp/work/qemuarm64_secureboot-poky-l > inux/core-image-base/1.0/testimage/qemu_boot_log.20241031135427 > Output from runqemu: > runqemu - INFO - Received signal: 15 > runqemu - INFO - Cleaning up > runqemu - INFO - Host uptime: 14098967.33 > > tput: No value for $TERM and no -T specified > WARNING: core-image-base-1.0-r0 do_testimage: Extra log data read: > > Output from runqemu: > runqemu - INFO - Received signal: 15 > runqemu - INFO - Cleaning up > runqemu - INFO - Host uptime: 14098967.33 > > tput: No value for $TERM and no -T specified > > Output from runqemu: > runqemu - INFO - Received signal: 15 > runqemu - INFO - Cleaning up > runqemu - INFO - Host uptime: 14098967.33 > > tput: No value for $TERM and no -T specified > > ERROR: core-image-base-1.0-r0 do_testimage: Error executing a python function in exec_func_python() autogenerated: > > The stack trace of python calls that resulted in this exception/failure was: > File: 'exec_func_python() autogenerated', lineno: 2, function: <module> > 0001: > *** 0002:do_testimage(d) > 0003: > File: '/home/builder/src/base/meta-arm/build/../poky/meta/classes-recipe/testimage.bbclass', lineno: 122, function: do_testimage > 0118: dump-guest-memory {"paging":false,"protocol":"file:%s.img"} > 0119:} > 0120: > 0121:python do_testimage() { > *** 0122: testimage_main(d) > 0123:} > 0124: > 0125:addtask testimage > 0126:do_testimage[nostamp] = "1" > File: '/home/builder/src/base/meta-arm/build/../poky/meta/classes-recipe/testimage.bbclass', lineno: 364, function: testimage_main > 0360: orig_sigterm_handler = signal.signal(signal.SIGTERM, sigterm_exception) > 0361: try: > 0362: # We need to check if runqemu ends unexpectedly > 0363: # or if the worker send us a SIGTERM > *** 0364: tc.target.start(params=d.getVar("TEST_QEMUPARAMS"), runqemuparams=d.getVar("TEST_RUNQEMUPARAMS")) > 0365: import threading > 0366: try: > 0367: threading.Timer(int(d.getVar("TEST_OVERALL_TIMEOUT")), handle_test_timeout, (int(d.getVar("TEST_OVERALL_TIMEOUT")),)).start( > ) > 0368: except ValueError: > File: '/home/builder/src/base/meta-arm/build/../poky/meta/lib/oeqa/core/target/qemu.py', lineno: 91, function: start > 0087: except (subprocess.CalledProcessError, subprocess.TimeoutExpired, FileNotFoundError) as err: > 0088: msg += "Error running command: %s\n%s\n" % (blcmd, err) > 0089: msg += "\n\n===== end: snippet =====\n" > 0090: > *** 0091: raise RuntimeError("FAILED to start qemu - check the task log and the boot log %s" % (msg)) > 0092: > 0093: def stop(self): > 0094: self.runner.stop() > Exception: RuntimeError: FAILED to start qemu - check the task log and the boot log > > ===== start: snippet ===== > > > > ===== end: snippet ===== > > ===== start: snippet ===== > > Error running command: ['tail', '-20', '/home/builder/src/base/meta-arm/build/tmp/work/qemuarm64_secureboot-poky-linux/core-image-base/1.0/te > stimage/qemu_boot_log.20241031135427'] > Command '['tail', '-20', '/home/builder/src/base/meta-arm/build/tmp/work/qemuarm64_secureboot-poky-linux/core-image-base/1.0/testimage/qemu_b > oot_log.20241031135427']' returned non-zero exit status 1. > > ===== end: snippet ===== > > > ERROR: Logfile of failure stored in: /home/builder/src/base/meta-arm/build/tmp/work/qemuarm64_secureboot-poky-linux/core-image-base/1.0/temp/ > log.do_testimage.1578012 > NOTE: recipe core-image-base-1.0-r0: task do_testimage: Failed > ERROR: Task (/home/builder/src/base/meta-arm/build/../poky/meta/recipes-core/images/core-image-base.bb:do_testimage) failed with exit code '1 > ' > NOTE: Tasks Summary: Attempted 7116 tasks of which 7081 didn't need to be rerun and 1 failed. > > Summary: 1 task failed: > /home/builder/src/base/meta-arm/build/../poky/meta/recipes-core/images/core-image-base.bb:do_testimage > > > Since this exit happens without catching RuntimeError, there is no cleanup > and Cooker, qemu-system-aarch64 etc processes get leaked again. Can you point at the "no cleanup"? The code in question has a try/except/finally clause and the finally should always run. I'm therefore struggling to work out which cleanup code isn't running. > If the RuntimeError is captured, then clean up is done and qemu-system-aarch64 > and Cooker processes get killed. Together with > 4b3b37156cdf70230bc74c647784da8514c284cf in poky > the failing oeqa runtime test under qemu was handled correctly > and rebuilds always ended up with same failure, specifically > from parselogs which is a bit greedy when detecting errors. > > Does this help? If not, I can try to replicate the setup and failures > again, and try different fixes. I can't really sed which cleanup code we're missing and I'd like to understand this before we take the patch as something isn't adding up. Cheers, Richard
Hi, On Tue, Jan 28, 2025 at 03:10:34PM +0000, Richard Purdie wrote: > On Tue, 2025-01-28 at 16:37 +0200, Mikko Rapeli wrote: > > On Tue, Jan 28, 2025 at 01:49:00PM +0000, Richard Purdie wrote: > > > On Tue, 2025-01-28 at 13:04 +0000, Richard Purdie via > > > lists.openembedded.org wrote: > > > > > > > > Sorry for the delay in looking at this patch. I'm a bit worried about > > > > there being a leaked processes and wanted to understand if there was > > > > other cleanup we should be doing. > > > > > > > > Instead of this path, would it make sense to move the results.stop() > > > > inside the finally? I'm worried that other forms of exception would > > > > also leak processes. > > > > > > I've looked more at this and I don't understand how this patch works. > > > > > > By adding the exception to the exception clause, it will trigger a > > > bb.error() call. The results.stop() call won't happen as results isn't > > > set in this failure. > > > > > > The only significant difference would therefore be the results = > > > tc.results assignment. Does that really stop processes? If so, should > > > we always be doing that? > > > > > > I'd like to understand what we need to do to "fix" things so we can > > > handle other exceptions. > > > > If I remember correctly, I saw this when building meta-arm qemuarm64-secureboot > > kas config which runs oeqa runtime tests with testimage.bbclass during build. > > > > Without fix 4b3b37156cdf70230bc74c647784da8514c284cf in poky a single failing test > > leaked Cooker processes. Then a rebuild of core-image-base connected to that Cooker > > process and somehow failed to re-generate rootfs which became an empty directory > > inside .wic image. This resulted qemu boot hanging which again throws RuntimeError > > from meta/lib/oeqa/core/target/qemu.py line 52. > > Just to be clear, is memory resident bitbake set or not > (BB_SERVER_TIMEOUT) ? meta-arm builds set: ci/base.yml: BB_SERVER_TIMEOUT = "300" I was building in kas shell. > > Example with poky from October 2024 (from old log files which I still had): > > > > NOTE: Reconnecting to bitbake server... > > ... > > NOTE: recipe core-image-base-1.0-r0: task do_create_image_spdx: Succeeded > > QMP Available for connection at /home/builder/src/base/meta-arm/build/tmp/.ndozl5yn > > QMP connected to QEMU at 10/31/24 13:55:56 and took 0.56 seconds > > QMP released QEMU at 10/31/24 13:55:56 and took 0.19 seconds from connect > > Bitbake still alive (no events for 600s). Active tasks: > > /home/builder/src/base/meta-arm/build/../poky/meta/recipes-core/images/core-image-base.bb:do_testimage > > WARNING: core-image-base-1.0-r0 do_testimage: Target didn't reach login banner in 1000 seconds (10/31/24 14:12:36) > > WARNING: core-image-base-1.0-r0 do_testimage: Last 25 lines of login console (22962): > > [��� 4.345255] Key type fscrypt-provisioning registered > > [��� 4.363120] Btrfs loaded, zoned=no, fsverity=no > > [��� 4.522255] input: QEMU QEMU USB Tablet as /devices/platform/4010000000.pcie/pci0000:00/0000:00:03.0/usb1/1-1/1-1:1.0/0003:0627:0001.0001/input > > /input0 > > [��� 4.528765] hid-generic 0003:0627:0001.0001: input: USB HID v0.01 Mouse [QEMU QEMU USB Tablet] on usb-0000:00:03.0-1/input0 > > [��� 4.636796] Key type encrypted registered > > [��� 4.644819] printk: legacy console [netcon0] enabled > > [��� 4.645638] netconsole: network logging started > > [��� 4.651065] clk: Disabling unused clocks > > [��� 4.651780] PM: genpd: Disabling unused power domains > > [��� 4.669045] usb 1-2: new high-speed USB device number 3 using xhci_hcd > > [��� 4.794967] Freeing unused kernel memory: 1856K > > [��� 4.803684] Run /init as init process > > [��� 4.847358] input: QEMU QEMU USB Keyboard as /devices/platform/4010000000.pcie/pci0000:00/0000:00:03.0/usb1/1-2/1-2:1.0/0003:0627:0001.0002/inp > > ut/input1 > > [��� 4.919633] hid-generic 0003:0627:0001.0002: input: USB HID v1.11 Keyboard [QEMU QEMU USB Keyboard] on usb-0000:00:03.0-2/input0 > > Starting systemd-udevd version 256.6 > > > > [��� 7.407191] virtio_net virtio0 enp0s1: renamed from eth0 > > [�� 12.633172] ext3: Unknown parameter 'umask' > > [�� 12.634318] ext2: Unknown parameter 'umask' > > [�� 12.634924] ext4: Unknown parameter 'umask' > > [�� 12.676391] EXT4-fs (vda2): mounted filesystem 41dc66e6-d693-46b3-a5d6-a46998500609 r/w with ordered data mode. Quota mode: disabled. > > [�� 13.080423] /dev/disk/by-label/root: Can't open blockdev > > [�� 13.081201] /dev/disk/by-label/root: Can't open blockdev > > ERROR: There's no '/dev' on rootfs. > > > > WARNING: core-image-base-1.0-r0 do_testimage: Check full boot log: /home/builder/src/base/meta-arm/build/tmp/work/qemuarm64_secureboot-poky-l > > inux/core-image-base/1.0/testimage/qemu_boot_log.20241031135427 > > Output from runqemu: > > runqemu - INFO - Received signal: 15 > > runqemu - INFO - Cleaning up > > runqemu - INFO - Host uptime: 14098967.33 > > > > tput: No value for $TERM and no -T specified > > WARNING: core-image-base-1.0-r0 do_testimage: Extra log data read: > > > > Output from runqemu: > > runqemu - INFO - Received signal: 15 > > runqemu - INFO - Cleaning up > > runqemu - INFO - Host uptime: 14098967.33 > > > > tput: No value for $TERM and no -T specified > > > > Output from runqemu: > > runqemu - INFO - Received signal: 15 > > runqemu - INFO - Cleaning up > > runqemu - INFO - Host uptime: 14098967.33 > > > > tput: No value for $TERM and no -T specified > > > > ERROR: core-image-base-1.0-r0 do_testimage: Error executing a python function in exec_func_python() autogenerated: > > > > The stack trace of python calls that resulted in this exception/failure was: > > File: 'exec_func_python() autogenerated', lineno: 2, function: <module> > > ���� 0001: > > �*** 0002:do_testimage(d) > > ���� 0003: > > File: '/home/builder/src/base/meta-arm/build/../poky/meta/classes-recipe/testimage.bbclass', lineno: 122, function: do_testimage > > ���� 0118:��� dump-guest-memory {"paging":false,"protocol":"file:%s.img"} > > ���� 0119:} > > ���� 0120: > > ���� 0121:python do_testimage() { > > �*** 0122:��� testimage_main(d) > > ���� 0123:} > > ���� 0124: > > ���� 0125:addtask testimage > > ���� 0126:do_testimage[nostamp] = "1" > > File: '/home/builder/src/base/meta-arm/build/../poky/meta/classes-recipe/testimage.bbclass', lineno: 364, function: testimage_main > > ���� 0360:��� orig_sigterm_handler = signal.signal(signal.SIGTERM, sigterm_exception) > > ���� 0361:��� try: > > ���� 0362:������� # We need to check if runqemu ends unexpectedly > > ���� 0363:������� # or if the worker send us a SIGTERM > > �*** 0364:������� tc.target.start(params=d.getVar("TEST_QEMUPARAMS"), runqemuparams=d.getVar("TEST_RUNQEMUPARAMS")) > > ���� 0365:������� import threading > > ���� 0366:������� try: > > ���� 0367:����������� threading.Timer(int(d.getVar("TEST_OVERALL_TIMEOUT")), handle_test_timeout, (int(d.getVar("TEST_OVERALL_TIMEOUT")),)).start( > > ) > > ���� 0368:������� except ValueError: > > File: '/home/builder/src/base/meta-arm/build/../poky/meta/lib/oeqa/core/target/qemu.py', lineno: 91, function: start > > ���� 0087:����������� except (subprocess.CalledProcessError, subprocess.TimeoutExpired, FileNotFoundError) as err: > > ���� 0088:��������������� msg += "Error running command: %s\n%s\n" % (blcmd, err) > > ���� 0089:����������� msg += "\n\n===== end: snippet =====\n" > > ���� 0090: > > �*** 0091:����������� raise RuntimeError("FAILED to start qemu - check the task log and the boot log %s" % (msg)) > > ���� 0092: > > ���� 0093:��� def stop(self): > > ���� 0094:������� self.runner.stop() > > Exception: RuntimeError: FAILED to start qemu - check the task log and the boot log > > > > ===== start: snippet ===== > > > > > > > > ===== end: snippet ===== > > > > ===== start: snippet ===== > > > > Error running command: ['tail', '-20', '/home/builder/src/base/meta-arm/build/tmp/work/qemuarm64_secureboot-poky-linux/core-image-base/1.0/te > > stimage/qemu_boot_log.20241031135427'] > > Command '['tail', '-20', '/home/builder/src/base/meta-arm/build/tmp/work/qemuarm64_secureboot-poky-linux/core-image-base/1.0/testimage/qemu_b > > oot_log.20241031135427']' returned non-zero exit status 1. > > > > ===== end: snippet ===== > > > > > > ERROR: Logfile of failure stored in: /home/builder/src/base/meta-arm/build/tmp/work/qemuarm64_secureboot-poky-linux/core-image-base/1.0/temp/ > > log.do_testimage.1578012 > > NOTE: recipe core-image-base-1.0-r0: task do_testimage: Failed > > ERROR: Task (/home/builder/src/base/meta-arm/build/../poky/meta/recipes-core/images/core-image-base.bb:do_testimage) failed with exit code '1 > > ' > > NOTE: Tasks Summary: Attempted 7116 tasks of which 7081 didn't need to be rerun and 1 failed. > > > > Summary: 1 task failed: > > � /home/builder/src/base/meta-arm/build/../poky/meta/recipes-core/images/core-image-base.bb:do_testimage > > > > > > Since this exit happens without catching RuntimeError, there is no cleanup > > and Cooker, qemu-system-aarch64 etc processes get leaked again. > > Can you point at the "no cleanup"? The code in question has a > try/except/finally clause and the finally should always run. I'm > therefore struggling to work out which cleanup code isn't running. > > > If the RuntimeError is captured, then clean up is done and qemu-system-aarch64 > > and Cooker processes get killed. Together with > > 4b3b37156cdf70230bc74c647784da8514c284cf in poky > > the failing oeqa runtime test under qemu was handled correctly > > and rebuilds always ended up with same failure, specifically > > from parselogs which is a bit greedy when detecting errors. > > > > Does this help? If not, I can try to replicate the setup and failures > > again, and try different fixes. > > I can't really sed which cleanup code we're missing and I'd like to > understand this before we take the patch as something isn't adding up. With TESTIMAGE_AUTO = "1" in local.conf, I have seen a lot of failures to kill Cooker and qemu-system-aarch64 processes. Running tests with slirp networking over ssh and even pressing ctrl-c multiple times on the build leaves those into the background. Manual cleanup is needed. I will setup the builds again to test these out. Cheers, -Mikko
diff --git a/meta/classes-recipe/testimage.bbclass b/meta/classes-recipe/testimage.bbclass index 19075ce1f3..a9b031093a 100644 --- a/meta/classes-recipe/testimage.bbclass +++ b/meta/classes-recipe/testimage.bbclass @@ -371,7 +371,7 @@ def testimage_main(d): complete = True if results.hasAnyFailingTest(): run_failed_tests_post_actions(d, tc) - except (KeyboardInterrupt, BlockingIOError) as err: + except (KeyboardInterrupt, BlockingIOError, RuntimeError) as err: if isinstance(err, KeyboardInterrupt): bb.error('testimage interrupted, shutting down...') else:
runqemu can fail with RuntimeError exception. Non-cought exception causes cooker process leaks which bind to successive bitbake command line calls and that can cause really odd errors to users, e.g. when build/tmp is wiped and cooker processes expect files to be there. Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org> --- meta/classes-recipe/testimage.bbclass | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)