Message ID | 7c42e340baae4038324e250583e9395fb8d90463.1732725209.git.joerg.sommer@navimatix.de
---|---
State | New
Series | lib/oe/utils.cpu_count: Raise maximum count from 64 to 192
Please read the original rationale for picking 64, and provide benchmarks that actually show better completion times with the new setting.

Alex

On Wed 27. Nov 2024 at 17.33, Jörg Sommer via lists.openembedded.org
<joerg.sommer=navimatix.de@lists.openembedded.org> wrote:

> From: Jörg Sommer <joerg.sommer@navimatix.de>
>
> We have a system with 96 CPUs and 128 are not uncommon. The border of 64
> limits the number of parallel tasks make or ninja spawns, because the value
> goes into `PARALLEL_MAKE ?= "-j ${@oe.utils.cpu_count()}"`. If only a single
> recipe gets build (e.g. rust, due to dependencies) this leaves one third of
> our CPUs idle.
>
> Signed-off-by: Jörg Sommer <joerg.sommer@navimatix.de>
> ---
>  meta/lib/oe/utils.py | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/meta/lib/oe/utils.py b/meta/lib/oe/utils.py
> index c9c7a47041..354237b643 100644
> --- a/meta/lib/oe/utils.py
> +++ b/meta/lib/oe/utils.py
> @@ -251,7 +251,7 @@ def trim_version(version, num_parts=2):
>      trimmed = ".".join(parts[:num_parts])
>      return trimmed
>
> -def cpu_count(at_least=1, at_most=64):
> +def cpu_count(at_least=1, at_most=192):
>      cpus = len(os.sched_getaffinity(0))
>      return max(min(cpus, at_most), at_least)
>
> --
> 2.45.2
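For reference, the clamp being changed is small; below is a minimal standalone sketch of it (the simulate() helper and the example core counts are illustrative additions, not part of the patch or of oe.utils) showing which -j value the current and proposed caps produce for machines of different sizes:

# Minimal standalone sketch of the clamp in meta/lib/oe/utils.py.
# Only cpu_count() mirrors the code under discussion; simulate() and the
# example core counts below are illustrative additions.
import os

def cpu_count(at_least=1, at_most=64):
    # Linux-only, like the original helper.
    cpus = len(os.sched_getaffinity(0))
    return max(min(cpus, at_most), at_least)

def simulate(cpus, at_least=1, at_most=64):
    # Same clamping, but with the CPU count passed in explicitly so the
    # effect of the cap can be shown without owning such a machine.
    return max(min(cpus, at_most), at_least)

print(f"this machine (current default cap): -j{cpu_count()}")
for cpus in (8, 64, 96, 128):
    print(f"{cpus:3d} CPUs: -j{simulate(cpus, at_most=64)} (current cap) "
          f"vs -j{simulate(cpus, at_most=192)} (proposed cap)")

On the 96-CPU machine from the commit message this is -j64 today versus -j96 with the patch, which is where the "one third of our CPUs idle" figure comes from.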
Thank you for your submission. Patchtest identified one or more issues with the patch. Please see the log below for more information:

---
Testing patch /home/patchtest/share/mboxes/lib-oe-utils.cpu_count-Raise-maximum-count-from-64-to-192.patch

FAIL: test commit message user tags: Mbox includes one or more GitHub-style username tags. Ensure that any "@" symbols are stripped out of usernames (test_mbox.TestMbox.test_commit_message_user_tags)

PASS: pretest pylint (test_python_pylint.PyLint.pretest_pylint)
PASS: test Signed-off-by presence (test_mbox.TestMbox.test_signed_off_by_presence)
PASS: test author valid (test_mbox.TestMbox.test_author_valid)
PASS: test commit message presence (test_mbox.TestMbox.test_commit_message_presence)
PASS: test max line length (test_metadata.TestMetadata.test_max_line_length)
PASS: test mbox format (test_mbox.TestMbox.test_mbox_format)
PASS: test non-AUH upgrade (test_mbox.TestMbox.test_non_auh_upgrade)
PASS: test pylint (test_python_pylint.PyLint.test_pylint)
PASS: test shortlog format (test_mbox.TestMbox.test_shortlog_format)
PASS: test shortlog length (test_mbox.TestMbox.test_shortlog_length)
PASS: test target mailing list (test_mbox.TestMbox.test_target_mailing_list)

SKIP: pretest src uri left files: No modified recipes, skipping pretest (test_metadata.TestMetadata.pretest_src_uri_left_files)
SKIP: test CVE check ignore: No modified recipes or older target branch, skipping test (test_metadata.TestMetadata.test_cve_check_ignore)
SKIP: test CVE tag format: No new CVE patches introduced (test_patch.TestPatch.test_cve_tag_format)
SKIP: test Signed-off-by presence: No new CVE patches introduced (test_patch.TestPatch.test_signed_off_by_presence)
SKIP: test Upstream-Status presence: No new CVE patches introduced (test_patch.TestPatch.test_upstream_status_presence_format)
SKIP: test bugzilla entry format: No bug ID found (test_mbox.TestMbox.test_bugzilla_entry_format)
SKIP: test lic files chksum modified not mentioned: No modified recipes, skipping test (test_metadata.TestMetadata.test_lic_files_chksum_modified_not_mentioned)
SKIP: test lic files chksum presence: No added recipes, skipping test (test_metadata.TestMetadata.test_lic_files_chksum_presence)
SKIP: test license presence: No added recipes, skipping test (test_metadata.TestMetadata.test_license_presence)
SKIP: test series merge on head: Merge test is disabled for now (test_mbox.TestMbox.test_series_merge_on_head)
SKIP: test src uri left files: No modified recipes, skipping pretest (test_metadata.TestMetadata.test_src_uri_left_files)
SKIP: test summary presence: No added recipes, skipping test (test_metadata.TestMetadata.test_summary_presence)
---

Please address the issues identified and submit a new revision of the patch, or alternatively, reply to this email with an explanation of why the patch should be accepted. If you believe these results are due to an error in patchtest, please submit a bug at https://bugzilla.yoctoproject.org/ (use the 'Patchtest' category under 'Yocto Project Subprojects'). For more information on specific failures, see: https://wiki.yoctoproject.org/wiki/Patchtest.

Thank you!
On Wed, 27 Nov 2024 at 17:43, Alexander Kanavin via lists.openembedded.org
<alex.kanavin=gmail.com@lists.openembedded.org> wrote:
>
> Please read the original rationale for picking 64, and provide benchmarks
> that actually show better completion times with the new setting.

This was the rationale:
https://git.yoctoproject.org/poky/commit/?id=c6f23f1f0fad29da4dee27a9cb8219ae05a8bfd5

Alex
On 27 Nov 2024, at 16:33, Jörg Sommer via lists.openembedded.org <joerg.sommer=navimatix.de@lists.openembedded.org> wrote:
>
> From: Jörg Sommer <joerg.sommer@navimatix.de>
>
> We have a system with 96 CPUs and 128 are not uncommon. The border of 64
> limits the number of parallel tasks make or ninja spawns, because the value
> goes into `PARALLEL_MAKE ?= "-j ${@oe.utils.cpu_count()}"`. If only a single
> recipe gets build (e.g. rust, due to dependencies) this leaves one third of
> our CPUs idle.

192 seems like it was arbitrarily chosen as “more than your current system”: if we’re doing that then we should just remove the maximum cap by reverting the commit that added it in the first place.

The point of the default is to be reasonable, and in my benchmarking on a system with 128 cores going beyond 64 only gives you more chance of OOMs, I/O contention, and other users of the presumably shared machine being angry.

If you have a powerful server that only does a single build then you’re welcome to set PARALLEL_MAKE = “-j128” in your build environment to take full advantage of it.

Ross
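For anyone who wants to follow Ross's suggestion without changing the default for everyone, a minimal site-local override in local.conf could look like the sketch below (the value 128 is only an example for a dedicated 128-core builder; both forms use the PARALLEL_MAKE variable and the at_most argument already shown in this thread):

# conf/local.conf -- site-specific override; 128 is an example value
PARALLEL_MAKE = "-j 128"

# Alternatively, keep the helper but raise only this builder's cap:
# PARALLEL_MAKE = "-j ${@oe.utils.cpu_count(at_most=128)}"

Since the distro-wide default is only a ?= fallback, a hard assignment in local.conf wins for that build directory alone.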
Ross Burton schrieb am Do 28. Nov, 13:36 (+0000):
> On 27 Nov 2024, at 16:33, Jörg Sommer via lists.openembedded.org <joerg.sommer=navimatix.de@lists.openembedded.org> wrote:
> >
> > From: Jörg Sommer <joerg.sommer@navimatix.de>
> >
> > We have a system with 96 CPUs and 128 are not uncommon. The border of 64
> > limits the number of parallel tasks make or ninja spawns, because the value
> > goes into `PARALLEL_MAKE ?= "-j ${@oe.utils.cpu_count()}"`. If only a single
> > recipe gets build (e.g. rust, due to dependencies) this leaves one third of
> > our CPUs idle.
>
> 192 seems like it was arbitrarily chosen as “more than your current
> system”: if we’re doing that then we should just remove the maximum cap by
> reverting the commit that added it in the first place.
>
> The point of the default is to be reasonable, and in my benchmarking on a
> system with 128 cores going beyond 64 only gives you more chance of OOMs,
> I/O contention, and other users of the presumably shared machine being
> angry.

How much RAM did this system have? Ours has 128GB and it's more or less empty during the build. The NVMe RAID also seldom shows signs of exhaustion, to the point that I can build at least two of our images in parallel in a tmpfs.

I am trying to record some graphs with systemd-bootchart. It's a bit tricky because I have to split the recording. Do you know a better tool? When analysing a system I watch it with atop, but that doesn't produce graphs, and our monitoring system contains too much information.

> If you have a powerful server that only does a single build then you’re
> welcome to set PARALLEL_MAKE = “-j128” in your build environment to take
> full advantage of it.

My other solution is to run three or more builds in parallel, but often there are not many images to build at the same time.

Regards,
Jörg
I'll share some data from my daily work. My server has 128 CPUs and 256G of memory. When there were two world builds running (40000+ tasks each), I got OOM kills several times. I once got an OOM even with only one world build running, so I mounted an extra 256G of swap on that server.

I just noticed that the 64 cap has been there since 2021, so all of these OOMs happened even with the cap in place. We'd better not increase this default cap unless it's proven to be really necessary.

Regards,
Qi

On 11/29/24 02:49, Jörg Sommer via lists.openembedded.org wrote:
> Ross Burton schrieb am Do 28. Nov, 13:36 (+0000):
>> On 27 Nov 2024, at 16:33, Jörg Sommer via lists.openembedded.org <joerg.sommer=navimatix.de@lists.openembedded.org> wrote:
>>> From: Jörg Sommer <joerg.sommer@navimatix.de>
>>>
>>> We have a system with 96 CPUs and 128 are not uncommon. The border of 64
>>> limits the number of parallel tasks make or ninja spawns, because the value
>>> goes into `PARALLEL_MAKE ?= "-j ${@oe.utils.cpu_count()}"`. If only a single
>>> recipe gets build (e.g. rust, due to dependencies) this leaves one third of
>>> our CPUs idle.
>>
>> 192 seems like it was arbitrarily chosen as “more than your current
>> system”: if we’re doing that then we should just remove the maximum cap by
>> reverting the commit that added it in the first place.
>>
>> The point of the default is to be reasonable, and in my benchmarking on a
>> system with 128 cores going beyond 64 only gives you more chance of OOMs,
>> I/O contention, and other users of the presumably shared machine being
>> angry.
>
> How much RAM did this system have? Ours has 128GB and it's more or less
> empty during the build. The NVMe RAID also seldom shows signs of exhaustion,
> to the point that I can build at least two of our images in parallel in a
> tmpfs.
>
> I am trying to record some graphs with systemd-bootchart. It's a bit tricky
> because I have to split the recording. Do you know a better tool? When
> analysing a system I watch it with atop, but that doesn't produce graphs,
> and our monitoring system contains too much information.
>
>> If you have a powerful server that only does a single build then you’re
>> welcome to set PARALLEL_MAKE = “-j128” in your build environment to take
>> full advantage of it.
>
> My other solution is to run three or more builds in parallel, but often
> there are not many images to build at the same time.
>
> Regards, Jörg
On Thu, 28 Nov 2024 at 19:50, Jörg Sommer via lists.openembedded.org
<joerg.sommer=navimatix.de@lists.openembedded.org> wrote:
> How much RAM did this system have? Ours has 128GB and it's more or less
> empty during the build. The NVMe RAID also seldom shows signs of exhaustion,
> to the point that I can build at least two of our images in parallel in a
> tmpfs.

It's not difficult to exhaust that much RAM, or come close to it: run do_compile for webkitgtk and llvm at the same time, with 64 threads each.

Alex
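To put rough numbers on that point, here is a back-of-envelope sketch only; the per-job memory figure is an assumption for illustration, not a measurement reported in this thread:

# Back-of-envelope only: assumed_gb_per_job is an illustrative assumption,
# not a measured value. It shows how two heavy do_compile tasks at -j64
# each can approach the 128 GB mentioned above.
jobs_per_recipe = 64        # -j value under the current cap
heavy_recipes = 2           # e.g. webkitgtk and llvm compiling concurrently
assumed_gb_per_job = 1.5    # assumption: peak memory of one large C++ compile job

total_gb = jobs_per_recipe * heavy_recipes * assumed_gb_per_job
print(f"~{total_gb:.0f} GB of peak demand against 128 GB of RAM")  # ~192 GB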
diff --git a/meta/lib/oe/utils.py b/meta/lib/oe/utils.py
index c9c7a47041..354237b643 100644
--- a/meta/lib/oe/utils.py
+++ b/meta/lib/oe/utils.py
@@ -251,7 +251,7 @@ def trim_version(version, num_parts=2):
     trimmed = ".".join(parts[:num_parts])
     return trimmed
 
-def cpu_count(at_least=1, at_most=64):
+def cpu_count(at_least=1, at_most=192):
     cpus = len(os.sched_getaffinity(0))
     return max(min(cpus, at_most), at_least)