Message ID | 20230525102105.1480610-1-michalwsieron@gmail.com
---|---
State | New
Series | bitbake: Add task timeout support
Hi,

On Thu, May 25, 2023 at 12:21:05PM +0200, Michal Sieron wrote:
> By setting `BB_TASK_TIMEOUT` you can control the timeout for tasks.
> Using the flag `BB_TASK_TIMEOUT[do_mytask]` you can override the timeout
> settings for specific tasks.
>
> This is especially useful when some server doesn't work properly and the
> `do_fetch` task takes too long. We may want to kill it early instead
> of waiting for a fetch that won't happen.

Is the problem related to specific tools and network protocols? Which?

An alternative would be to configure the tool-specific settings (git, svn,
curl etc.), or even the default socket timeouts on the running system.

I've seen this kind of problem too, but I'm not sure this task timeout is
the correct way to solve it. Time is too relative :)

Cheers,

-Mikko
I tend to agree: if the problem is that some server is too slow or
unresponsive, this does not solve the problem. If you want to forcibly
terminate stuff (which is still not a good idea; you should look into why
something 'doesn't work properly', not work around it), you can terminate
the whole bitbake process instead.

Alex

On Thu, 25 May 2023 at 12:45, Mikko Rapeli <mikko.rapeli@linaro.org> wrote:
> [...]
Hello,

let me give some more context on the patch. We use self-hosted Gerrit and
Artifactory (with git-lfs content and tarballs with source code) instances
which tend to be slow sometimes due to overloads and other reasons. We have
a 3h timeout for the whole build. Now let's assume do_fetch tasks would take
60 hours instead of 10 minutes. With the timeout solution proposed by Michał
we can detect that quite fast, stop the build job, and unblock the Jenkins
nodes for other builds, or just retry the build in case the fetch takes less
time on the second run. Without that, with a timeout only on the whole
bitbake process, we're still holding the Jenkins node for another 2-3 hours
even though we know this build has no future.

Best regards,
Tomasz Dziendzielski
Yocto has a download cache mechanism, so you should not have to fetch more
than once. Are you throwing away that cache between builds?

Again, the actual problem is slow gerrit/artifactory instances, not stuck
fetch tasks. The idea of 'retrying' the fetch so that it maybe works better
on the second try is especially off-putting.

Alex

On Thu, 25 May 2023 at 13:41, Tomasz Dziendzielski
<tomasz.dziendzielski@gmail.com> wrote:
> [...]
> Yocto has a download cache mechanism, so you should not have to fetch
> more than once. Are you throwing away that cache between builds?

Do you mean the DL_DIR or something else as the cache mechanism? We do keep
DL_DIR between builds, but we're talking about cases where there was a
change in the repository.

> the actual problem is slow gerrit/artifactory instances, not
> stuck fetch tasks

Totally agree, but we're not able to fix the slow gerrit/artifactory in the
short term.

> The idea of 'retrying' the fetch

Well, it's not like I'm presenting this as a solution; usually we just
disable jobs until the slowness is gone. But isn't that what we already do
with wget in bitbake? "wget -t 2 -T 30" from bitbake/lib/bb/fetch2/wget.py,
where -t is tries and -T is timeout. The problem is that we get the slowness
for git repositories, and git does not support such a thing.

Best regards,
Tomasz Dziendzielski
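For reference, those wget defaults can be overridden without patching
bitbake; a minimal sketch, assuming the FETCHCMD_wget variable consulted in
bitbake/lib/bb/fetch2/wget.py (the values are illustrative, not a
recommendation):

```
# local.conf: raise the wget retry count and timeout used by the fetcher
FETCHCMD_wget = "/usr/bin/env wget -t 5 -T 60 --passive-ftp"
```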
On Thu, 25 May 2023 at 14:09, Tomasz Dziendzielski
<tomasz.dziendzielski@gmail.com> wrote:
> Well, it's not like I'm presenting this as a solution; usually we just
> disable jobs until the slowness is gone. But isn't that what we already
> do with wget in bitbake?
> "wget -t 2 -T 30" from bitbake/lib/bb/fetch2/wget.py, where -t is tries
> and -T is timeout.
> The problem is that we get the slowness for git repositories, and git
> does not support such a thing.

But then you can simply wrap git with a timeout command.
https://man7.org/linux/man-pages/man1/timeout.1.html

No?

Alex
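As a concrete illustration of the timeout(1) suggestion (the repository URL
is hypothetical and the limits are illustrative):

```
# Give the clone at most 5 minutes; if git ignores SIGTERM,
# send SIGKILL 10 seconds later
timeout --kill-after=10 300 git clone https://example.com/repo.git
```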
> But then you can simply wrap git with a timeout command.
> https://man7.org/linux/man-pages/man1/timeout.1.html

Well, this is a good point. I don't know how we could not come up with that
idea :D We can simply use the FETCHCMD_git variable with timeout. Please
disregard this patch then.

Best regards,
Tomasz Dziendzielski
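A minimal sketch of that idea, assuming the fetcher's default git command
line shown later in the thread (the 300 s limit is illustrative):

```
# local.conf: wrap every git invocation made by the fetcher in timeout(1)
FETCHCMD_git = "timeout 300 git -c core.fsyncobjectfiles=0"
```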
Hi,

On Thu, May 25, 2023 at 02:13:10PM +0200, Tomasz Dziendzielski wrote:
> > Yocto has a download cache mechanism, so you should not have to fetch
> > more than once. Are you throwing away that cache between builds?
>
> Do you mean the DL_DIR or something else as the cache mechanism? We do
> keep DL_DIR between builds, but we're talking about cases where there was
> a change in the repository.
>
> [...]
>
> The problem is that we get the slowness for git repositories, and git
> does not support such a thing.

git uses curl, can't we shorten the timeouts or even add retries there?
Maybe even specific ones for the lfs support?

I've also seen a lot of problematic artifactory and gerrit instances, in
addition to maven, npm etc...

Cheers,

-Mikko
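One knob along those lines is git's curl-backed stall detection; note that
it aborts on sustained low transfer speed rather than bounding the whole
operation (the values are illustrative):

```
# Abort any transfer that stays below 1000 bytes/s for more than 60 s
git config --global http.lowSpeedLimit 1000
git config --global http.lowSpeedTime 60
```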
Hi,

On Thu, 25 May 2023 15:57:12 +0300 Mikko Rapeli <mikko.rapeli@linaro.org> wrote:
> git uses curl, can't we shorten the timeouts or even add retries
> there? Maybe even specific ones for the lfs support?
>
> I've also seen a lot of problematic artifactory and gerrit instances,
> in addition to maven, npm etc...

I checked this and couldn't find any environment variables that would
control the timeout for the whole operation. The only solution I found
(which doesn't require patching/recompiling) was using the previously
mentioned timeout command like so:

```
FETCHCMD_git = "git \
    -c core.fsyncobjectfiles=0 \
    -c filter.lfs.clean='/usr/bin/timeout 300 git-lfs clean -- %f' \
    -c filter.lfs.smudge='/usr/bin/timeout 300 git-lfs smudge -- %f' \
    -c filter.lfs.process='/usr/bin/timeout 300 git-lfs filter-process' \
"
```

You can of course put those settings in your gitconfig file instead.
However, this method is not so nice: when the fetch task ends and prints
the error message from git, we get "fatal: the remote end hung up
unexpectedly", which doesn't really explain what went wrong.

Michal
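For the lfs side specifically, git-lfs documents its own connection and
activity settings; like the curl knobs above, they detect stalls rather
than bound the whole operation (keys as documented in git-lfs-config;
values illustrative):

```
# Fail if a connection cannot be established within 30 s,
# or if an active transfer sees no data for 60 s
git config --global lfs.dialtimeout 30
git config --global lfs.activitytimeout 60
```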
diff --git a/bitbake/doc/bitbake-user-manual/bitbake-user-manual-ref-variables.rst b/bitbake/doc/bitbake-user-manual/bitbake-user-manual-ref-variables.rst
index 01d4f8d14a..eaaa307960 100644
--- a/bitbake/doc/bitbake-user-manual/bitbake-user-manual-ref-variables.rst
+++ b/bitbake/doc/bitbake-user-manual/bitbake-user-manual-ref-variables.rst
@@ -686,6 +686,13 @@ overview of their function and contents.
       in images is given a higher priority as compared to build tasks
       to ensure that images do not suffer timeouts on loaded systems.
 
+   :term:`BB_TASK_TIMEOUT`
+      Specifies time after which still running tasks will be terminated.
+
+      Time is provided as an integer number of seconds.
+      You can set timeout for specific task using ``BB_TASK_TIMEOUT[do_mytask]``.
+      To remove timeout use an empty value ``BB_TASK_TIMEOUT = ""``.
+
    :term:`BB_TASKHASH`
       Within an executing task, this variable holds the hash of the task as
       returned by the currently enabled signature generator.
diff --git a/bitbake/lib/bb/build.py b/bitbake/lib/bb/build.py
index 44d08f5c55..2057dd9415 100644
--- a/bitbake/lib/bb/build.py
+++ b/bitbake/lib/bb/build.py
@@ -17,6 +17,7 @@ import sys
 import logging
 import glob
 import itertools
+import multiprocessing
 import time
 import re
 import stat
@@ -594,7 +595,7 @@ def _task_data(fn, task, d):
     bb.data.expandKeys(localdata)
     return localdata
 
-def _exec_task(fn, task, d, quieterr):
+def __exec_task(fn, task, d, quieterr):
     """Execute a BB 'task'
 
     Execution of a task involves a bit more setup than executing a function,
@@ -761,6 +762,46 @@ def _exec_task(fn, task, d, quieterr):
 
     return 0
 
+def get_task_timeout(d, task):
+    # task specific timeout
+    timeout = d.getVarFlag("BB_TASK_TIMEOUT", task)
+
+    if timeout is None:
+        # any task timeout
+        timeout = d.getVar("BB_TASK_TIMEOUT")
+
+    if timeout is None or timeout == "":
+        return None
+    else:
+        try:
+            return int(timeout)
+        except ValueError as e:
+            raise ValueError("invalid `BB_TASK_TIMEOUT` value '%s'" % timeout)
+
+def _exec_task(fn, task, d, quieterr):
+    """Intermediate function between exec_task and __exec_task, to support task timeouts."""
+
+    def targetFunc(func):
+        def inner(*args, **kwargs):
+            sys.exit(func(*args, **kwargs))
+
+        return inner
+
+    try:
+        timeout = get_task_timeout(d, task)
+    except ValueError as e:
+        logger.error("%s: %s" % (d.getVar("PF"), e.args[0]))
+        return 1
+
+    task_proc = multiprocessing.Process(target=targetFunc(__exec_task), args=(fn, task, d, quieterr))
+    task_proc.start()
+    task_proc.join(timeout)
+    if task_proc.exitcode is None:
+        logger.error("timeout exceeded (%s s)" % (str(timeout)))
+        return 1
+    else:
+        return task_proc.exitcode
+
 def exec_task(fn, task, d, profile = False):
     try:
         quieterr = False
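For reference, had the patch been applied, configuration would have looked
like this (a sketch derived from the documentation hunk above; the task
names and values are examples):

```
# local.conf: terminate any task still running after one hour...
BB_TASK_TIMEOUT = "3600"
# ...but give do_fetch only ten minutes
BB_TASK_TIMEOUT[do_fetch] = "600"
# ...and let do_compile run without any timeout
BB_TASK_TIMEOUT[do_compile] = ""
```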