
lib/oe/utils.cpu_count: Raise maximum count from 64 to 192

Message ID 7c42e340baae4038324e250583e9395fb8d90463.1732725209.git.joerg.sommer@navimatix.de
State New
Series lib/oe/utils.cpu_count: Raise maximum count from 64 to 192

Commit Message

Jörg Sommer Nov. 27, 2024, 4:33 p.m. UTC
From: Jörg Sommer <joerg.sommer@navimatix.de>

We have a system with 96 CPUs, and 128 are not uncommon. The cap of 64
limits the number of parallel tasks make or ninja spawns, because the value
goes into `PARALLEL_MAKE ?= "-j ${@oe.utils.cpu_count()}"`. If only a single
recipe gets built (e.g. rust, due to dependencies), this leaves one third of
our CPUs idle.

Signed-off-by: Jörg Sommer <joerg.sommer@navimatix.de>
---
 meta/lib/oe/utils.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
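
For context, cpu_count() just clamps the scheduler-visible CPU count between a
floor and a cap. A minimal sketch of what this change does on the 96-CPU host
from the commit message (the function body mirrors the patch):

    import os

    def cpu_count(at_least=1, at_most=64):
        # CPUs the build process is actually allowed to run on
        cpus = len(os.sched_getaffinity(0))
        return max(min(cpus, at_most), at_least)

    # On a 96-CPU machine:
    #   at_most=64  -> max(min(96, 64), 1)  == 64  (32 CPUs stay idle for a lone recipe)
    #   at_most=192 -> max(min(96, 192), 1) == 96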

Comments

Alexander Kanavin Nov. 27, 2024, 4:43 p.m. UTC | #1
Please read the original rationale for picking 64, and provide benchmarks
that actually show better completion times with the new setting.

Alex

On Wed 27. Nov 2024 at 17.33, Jörg Sommer via lists.openembedded.org
<joerg.sommer=navimatix.de@lists.openembedded.org> wrote:

> From: Jörg Sommer <joerg.sommer@navimatix.de>
>
> We have a system with 96 CPUs, and 128 are not uncommon. The cap of 64
> limits the number of parallel tasks make or ninja spawns, because the value
> goes into `PARALLEL_MAKE ?= "-j ${@oe.utils.cpu_count()}"`. If only a single
> recipe gets built (e.g. rust, due to dependencies), this leaves one third of
> our CPUs idle.
>
> Signed-off-by: Jörg Sommer <joerg.sommer@navimatix.de>
> ---
>  meta/lib/oe/utils.py | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/meta/lib/oe/utils.py b/meta/lib/oe/utils.py
> index c9c7a47041..354237b643 100644
> --- a/meta/lib/oe/utils.py
> +++ b/meta/lib/oe/utils.py
> @@ -251,7 +251,7 @@ def trim_version(version, num_parts=2):
>      trimmed = ".".join(parts[:num_parts])
>      return trimmed
>
> -def cpu_count(at_least=1, at_most=64):
> +def cpu_count(at_least=1, at_most=192):
>      cpus = len(os.sched_getaffinity(0))
>      return max(min(cpus, at_most), at_least)
>
> --
> 2.45.2
>
patchtest@automation.yoctoproject.org Nov. 27, 2024, 4:46 p.m. UTC | #2
Thank you for your submission. Patchtest identified one
or more issues with the patch. Please see the log below for
more information:

---
Testing patch /home/patchtest/share/mboxes/lib-oe-utils.cpu_count-Raise-maximum-count-from-64-to-192.patch

FAIL: test commit message user tags: Mbox includes one or more GitHub-style username tags. Ensure that any "@" symbols are stripped out of usernames (test_mbox.TestMbox.test_commit_message_user_tags)

PASS: pretest pylint (test_python_pylint.PyLint.pretest_pylint)
PASS: test Signed-off-by presence (test_mbox.TestMbox.test_signed_off_by_presence)
PASS: test author valid (test_mbox.TestMbox.test_author_valid)
PASS: test commit message presence (test_mbox.TestMbox.test_commit_message_presence)
PASS: test max line length (test_metadata.TestMetadata.test_max_line_length)
PASS: test mbox format (test_mbox.TestMbox.test_mbox_format)
PASS: test non-AUH upgrade (test_mbox.TestMbox.test_non_auh_upgrade)
PASS: test pylint (test_python_pylint.PyLint.test_pylint)
PASS: test shortlog format (test_mbox.TestMbox.test_shortlog_format)
PASS: test shortlog length (test_mbox.TestMbox.test_shortlog_length)
PASS: test target mailing list (test_mbox.TestMbox.test_target_mailing_list)

SKIP: pretest src uri left files: No modified recipes, skipping pretest (test_metadata.TestMetadata.pretest_src_uri_left_files)
SKIP: test CVE check ignore: No modified recipes or older target branch, skipping test (test_metadata.TestMetadata.test_cve_check_ignore)
SKIP: test CVE tag format: No new CVE patches introduced (test_patch.TestPatch.test_cve_tag_format)
SKIP: test Signed-off-by presence: No new CVE patches introduced (test_patch.TestPatch.test_signed_off_by_presence)
SKIP: test Upstream-Status presence: No new CVE patches introduced (test_patch.TestPatch.test_upstream_status_presence_format)
SKIP: test bugzilla entry format: No bug ID found (test_mbox.TestMbox.test_bugzilla_entry_format)
SKIP: test lic files chksum modified not mentioned: No modified recipes, skipping test (test_metadata.TestMetadata.test_lic_files_chksum_modified_not_mentioned)
SKIP: test lic files chksum presence: No added recipes, skipping test (test_metadata.TestMetadata.test_lic_files_chksum_presence)
SKIP: test license presence: No added recipes, skipping test (test_metadata.TestMetadata.test_license_presence)
SKIP: test series merge on head: Merge test is disabled for now (test_mbox.TestMbox.test_series_merge_on_head)
SKIP: test src uri left files: No modified recipes, skipping pretest (test_metadata.TestMetadata.test_src_uri_left_files)
SKIP: test summary presence: No added recipes, skipping test (test_metadata.TestMetadata.test_summary_presence)

---

Please address the issues identified and
submit a new revision of the patch, or alternatively, reply to this
email with an explanation of why the patch should be accepted. If you
believe these results are due to an error in patchtest, please submit a
bug at https://bugzilla.yoctoproject.org/ (use the 'Patchtest' category
under 'Yocto Project Subprojects'). For more information on specific
failures, see: https://wiki.yoctoproject.org/wiki/Patchtest. Thank
you!
Alexander Kanavin Nov. 27, 2024, 8:43 p.m. UTC | #3
On Wed, 27 Nov 2024 at 17:43, Alexander Kanavin via
lists.openembedded.org <alex.kanavin=gmail.com@lists.openembedded.org>
wrote:
>
> Please read the original rationale for picking 64, and provide benchmarks that actually show better completion times with the new setting.

This was the rationale:
https://git.yoctoproject.org/poky/commit/?id=c6f23f1f0fad29da4dee27a9cb8219ae05a8bfd5

Alex
Ross Burton Nov. 28, 2024, 1:36 p.m. UTC | #4
On 27 Nov 2024, at 16:33, Jörg Sommer via lists.openembedded.org <joerg.sommer=navimatix.de@lists.openembedded.org> wrote:
> 
> From: Jörg Sommer <joerg.sommer@navimatix.de>
> 
> We have a system with 96 CPUs, and 128 are not uncommon. The cap of 64
> limits the number of parallel tasks make or ninja spawns, because the value
> goes into `PARALLEL_MAKE ?= "-j ${@oe.utils.cpu_count()}"`. If only a single
> recipe gets built (e.g. rust, due to dependencies), this leaves one third of
> our CPUs idle.

192 seems like it was arbitrarily chosen as “more than your current system”: if we’re doing that then we should just remove the maximum cap by reverting the commit that added it in the first place.

The point of the default is to be reasonable, and in my benchmarking on a system with 128 cores going beyond 64 only gives you more chance of OOMs, I/O contention, and other users of the presumably shared machine being angry.

If you have a powerful server that only does a single build then you’re welcome to set PARALLEL_MAKE = “-j128” in your build environment to take full advantage of it.

Ross
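
For reference, Ross's suggestion written out as a minimal local.conf sketch;
the "-j 128" value is purely illustrative for a dedicated 128-core builder,
not a recommended default:

    # conf/local.conf: a plain "=" assignment overrides the "?=" default
    # quoted in the commit message
    PARALLEL_MAKE = "-j 128"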
Jörg Sommer Nov. 28, 2024, 6:49 p.m. UTC | #5
Ross Burton schrieb am Do 28. Nov, 13:36 (+0000):
> On 27 Nov 2024, at 16:33, Jörg Sommer via lists.openembedded.org <joerg.sommer=navimatix.de@lists.openembedded.org> wrote:
> > 
> > From: Jörg Sommer <joerg.sommer@navimatix.de>
> > 
> > We have a system with 96 CPUs, and 128 are not uncommon. The cap of 64
> > limits the number of parallel tasks make or ninja spawns, because the value
> > goes into `PARALLEL_MAKE ?= "-j ${@oe.utils.cpu_count()}"`. If only a single
> > recipe gets built (e.g. rust, due to dependencies), this leaves one third of
> > our CPUs idle.
> 
> 192 seems like it was arbitrarily chosen as “more than your current
> system”: if we’re doing that then we should just remove the maximum cap by
> reverting the commit that added it in the first place.
> 
> The point of the default is to be reasonable, and in my benchmarking on a
> system with 128 cores going beyond 64 only gives you more chance of OOMs,
> I/O contention, and other users of the presumably shared machine being
> angry.

How much RAM did this system have? Ours has 128GB and it's more or less
unused during the build. The NVMe RAID also rarely shows any saturation, to
the point that I can build at least two of our images in parallel in a tmpfs.

I'll try to record some graphs with systemd-bootchart. It's a bit tricky,
because I have to split the recording. Do you know a better tool? When
analysing a system I usually watch it with atop, but that doesn't produce
graphs, and our monitoring system contains too much information.

> If you have a powerful server that only does a single build then you’re
> welcome to set PARALLEL_MAKE = “-j128” in your build environment to take
> full advantage of it.

My other option is to run three or more builds in parallel, but there are
rarely that many images to build at the same time.


Regards, Jörg
ChenQi Nov. 29, 2024, 3:10 a.m. UTC | #6
I'll share some data from my daily work.
My server has 128 CPUs and 256G of memory. When there were two world builds
running (40000+ tasks for each world build), I hit OOM several times. I once
got an OOM even when there was only one world build. So I mounted an extra
256G of swap on that server.

I just noticed that the 64 cap has been there since 2021, so all of these
OOMs happened even with that cap in place. We'd better not increase this
default unless it's proven to be really necessary.

Regards,
Qi


On 11/29/24 02:49, Jörg Sommer via lists.openembedded.org wrote:
> Ross Burton schrieb am Do 28. Nov, 13:36 (+0000):
>> On 27 Nov 2024, at 16:33, Jörg Sommer via lists.openembedded.org <joerg.sommer=navimatix.de@lists.openembedded.org> wrote:
>>> From: Jörg Sommer <joerg.sommer@navimatix.de>
>>>
>>> We have a system with 96 CPUs, and 128 are not uncommon. The cap of 64
>>> limits the number of parallel tasks make or ninja spawns, because the value
>>> goes into `PARALLEL_MAKE ?= "-j ${@oe.utils.cpu_count()}"`. If only a single
>>> recipe gets built (e.g. rust, due to dependencies), this leaves one third of
>>> our CPUs idle.
>> 192 seems like it was arbitrarily chosen as “more than your current
>> system”: if we’re doing that then we should just remove the maximum cap by
>> reverting the commit that added it in the first place.
>>
>> The point of the default is to be reasonable, and in my benchmarking on a
>> system with 128 cores going beyond 64 only gives you more chance of OOMs,
>> I/O contention, and other users of the presumably shared machine being
>> angry.
> How much RAM did this system have? Ours has 128GB and it's more or less
> unused during the build. The NVMe RAID also rarely shows any saturation, to
> the point that I can build at least two of our images in parallel in a tmpfs.
>
> I'll try to record some graphs with systemd-bootchart. It's a bit tricky,
> because I have to split the recording. Do you know a better tool? When
> analysing a system I usually watch it with atop, but that doesn't produce
> graphs, and our monitoring system contains too much information.
>
>> If you have a powerful server that only does a single build then you’re
>> welcome to set PARALLEL_MAKE = “-j128” in your build environment to take
>> full advantage of it.
> My other option is to run three or more builds in parallel, but there are
> rarely that many images to build at the same time.
>
>
> Regards, Jörg
>
Alexander Kanavin Nov. 29, 2024, 9:25 a.m. UTC | #7
On Thu, 28 Nov 2024 at 19:50, Jörg Sommer via lists.openembedded.org
<joerg.sommer=navimatix.de@lists.openembedded.org> wrote:
> How much RAM did this system have? Ours has 128GB and it's more or less
> empty during the build. Also the NVME-RAID shows seldom an exhaustion. To
> the point that I an build at least two of our images in parallel in a tmpfs.

It's not difficult to exhaust that much RAM, or come close to it: run
do_compile for webkitgtk and llvm at the same time, with 64 threads
each.

Alex
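
A rough back-of-the-envelope for the scenario Alex describes; the per-process
memory figure is an assumption for illustration, not a measurement:

    # Two heavy C++ do_compile tasks running at once (e.g. webkitgtk and llvm),
    # 64 compiler processes each; assume roughly 1 GB peak per compiler process.
    threads_per_recipe = 64
    recipes_in_parallel = 2
    gb_per_process = 1  # assumed average; large C++ translation units can need more
    print(threads_per_recipe * recipes_in_parallel * gb_per_process, "GB")  # -> 128 GB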

Patch

diff --git a/meta/lib/oe/utils.py b/meta/lib/oe/utils.py
index c9c7a47041..354237b643 100644
--- a/meta/lib/oe/utils.py
+++ b/meta/lib/oe/utils.py
@@ -251,7 +251,7 @@  def trim_version(version, num_parts=2):
     trimmed = ".".join(parts[:num_parts])
     return trimmed
 
-def cpu_count(at_least=1, at_most=64):
+def cpu_count(at_least=1, at_most=192):
     cpus = len(os.sched_getaffinity(0))
     return max(min(cpus, at_most), at_least)