Message ID | 20250515144455.2799533-2-philip.lorenz@bmw.de |
---|---|
State | New |
Headers | show |
Series | siggen: Support non-compressed sigdata files | expand |
On Thu, 2025-05-15 at 16:44 +0200, Philip Lorenz via lists.openembedded.org wrote: > Compression utilities such as `zstd` can achieve significant compression > ratio improvements even when the low entropy data is spread over a > larger input window. As a result compressing a tarball containing > uncompressed sigdata files with zstd can reduce the amount by a factor > of 6-16. > > Support this by introducing the `BB_SIGDATA_COMPRESSION` variable which > can be used to control compression of sigdata files. Setting the > variable to "none" disables compression, setting it to "zstd" (its > default value) retains the current behaviour. > > All functions used to consume sigdata files are extended to > automatically detect whether input files are compressed and > transparently select the correct behaviour depending on that. > > Additionally, expose the compression paramater in all functions to > enable overriding the global setting in some scenarios (e.g. for > .siginfo files created by sstate.bbclass). > > Compression results for core-image-sato sigdata files produced using > `bitbake -S none core-image-sato` on poky > 122e9a49614b2ddedaae1d90c06004a7a4c43998. > > Tarball containing compressed sigdata files: 98 MB > Compressed tarball (zstd -3) containing compressed sigdata files: 76 MB > Compressed tarball (zstd -17) containing compressed sigdata files: 65 MB > Tarball containing uncompressed sigdata files: 310 MB > Compressed tarball (zstd -3) containing uncompressed sigdata files: 12 MB > Compressed tarball (zstd -17) containing uncompressed sigdata files: 4 MB > > Signed-off-by: Philip Lorenz <philip.lorenz@bmw.de> > --- > lib/bb/siggen.py | 84 +++++++++++++++++++++++++++++++++++++++++++----- > 1 file changed, 76 insertions(+), 8 deletions(-) Whilst this improves the "tarball up a stamps directory" workflow, it comes at the expense of a significant amount more IO and disk usage of the build tree during the build itself. For that reason, I'd not recommend anyone to be running with this set by default or in general. I'm not a big fan of adding config options which add non-default workflows, mainly due to the doubling of the test matrix and the fact that one of the code paths ends up stale/broken. Further, once you expose this to the siginfo files, you then have to either add multiple queries for sstate objects, or have extra configuration to setup sstate feeds. I'm not sure either is a good outcome. This would also imply there would be a further patch to adapt the code in oe-core to use this new option for the siginfo files and again, this is complexity I'd really prefer simply not to have. How many users are regularly taring up sstate feeds or stamps directories? Cheers, Richard
Hi Richard, On 16.05.25 08:39, Richard Purdie wrote: > On Thu, 2025-05-15 at 16:44 +0200, Philip Lorenz via lists.openembedded.org wrote: >> Compression utilities such as `zstd` can achieve significant compression >> ratio improvements even when the low entropy data is spread over a >> larger input window. As a result compressing a tarball containing >> uncompressed sigdata files with zstd can reduce the amount by a factor >> of 6-16. >> >> Support this by introducing the `BB_SIGDATA_COMPRESSION` variable which >> can be used to control compression of sigdata files. Setting the >> variable to "none" disables compression, setting it to "zstd" (its >> default value) retains the current behaviour. >> >> All functions used to consume sigdata files are extended to >> automatically detect whether input files are compressed and >> transparently select the correct behaviour depending on that. >> >> Additionally, expose the compression paramater in all functions to >> enable overriding the global setting in some scenarios (e.g. for >> .siginfo files created by sstate.bbclass). >> >> Compression results for core-image-sato sigdata files produced using >> `bitbake -S none core-image-sato` on poky >> 122e9a49614b2ddedaae1d90c06004a7a4c43998. >> >> Tarball containing compressed sigdata files: 98 MB >> Compressed tarball (zstd -3) containing compressed sigdata files: 76 MB >> Compressed tarball (zstd -17) containing compressed sigdata files: 65 MB >> Tarball containing uncompressed sigdata files: 310 MB >> Compressed tarball (zstd -3) containing uncompressed sigdata files: 12 MB >> Compressed tarball (zstd -17) containing uncompressed sigdata files: 4 MB >> >> Signed-off-by: Philip Lorenz <philip.lorenz@bmw.de> >> --- >> lib/bb/siggen.py | 84 +++++++++++++++++++++++++++++++++++++++++++----- >> 1 file changed, 76 insertions(+), 8 deletions(-) > Whilst this improves the "tarball up a stamps directory" workflow, it > comes at the expense of a significant amount more IO and disk usage of > the build tree during the build itself. For that reason, I'd not > recommend anyone to be running with this set by default or in general. I benchmarked these changes and couldn't find a significant difference in both IO (my test core-image-sato-minimal build causes 80GB disk writes) and build time with uncompressed files. Disk usage of the .sigdata files increases by a factor of 3 (in our typical builds this boils down to 400 MB added) which is a trade off that we're willing to accept in the CI context. Are there any code paths that I'm missing where this would cause significant slow downs / disadvantages? > I'm not a big fan of adding config options which add non-default > workflows, mainly due to the doubling of the test matrix and the fact > that one of the code paths ends up stale/broken. I fully understand that (in particular because this change is mainly beneficial in one particular context). I wanted to share to see if there is interest from others, if there isn't its completely understandable to reduce complexity. > Further, once you expose this to the siginfo files, you then have to > either add multiple queries for sstate objects, or have extra > configuration to setup sstate feeds. I'm not sure either is a good > outcome. This would also imply there would be a further patch to adapt > the code in oe-core to use this new option for the siginfo files and > again, this is complexity I'd really prefer simply not to have. For feeds nothing would change. I didn't include the patch to oe-core (which would use the exposed compression parameter) before the applicability of the main patch was discussed. Thanks, Philip
On Fri, 2025-05-16 at 19:29 +0200, Philip Lorenz wrote: > On 16.05.25 08:39, Richard Purdie wrote: > > On Thu, 2025-05-15 at 16:44 +0200, Philip Lorenz via > > lists.openembedded.org wrote: > > > Compression utilities such as `zstd` can achieve significant > > > compression > > > ratio improvements even when the low entropy data is spread over > > > a > > > larger input window. As a result compressing a tarball containing > > > uncompressed sigdata files with zstd can reduce the amount by a > > > factor > > > of 6-16. > > > > > > Support this by introducing the `BB_SIGDATA_COMPRESSION` variable > > > which > > > can be used to control compression of sigdata files. Setting the > > > variable to "none" disables compression, setting it to "zstd" > > > (its > > > default value) retains the current behaviour. > > > > > > All functions used to consume sigdata files are extended to > > > automatically detect whether input files are compressed and > > > transparently select the correct behaviour depending on that. > > > > > > Additionally, expose the compression paramater in all functions > > > to > > > enable overriding the global setting in some scenarios (e.g. for > > > .siginfo files created by sstate.bbclass). > > > > > > Compression results for core-image-sato sigdata files produced > > > using > > > `bitbake -S none core-image-sato` on poky > > > 122e9a49614b2ddedaae1d90c06004a7a4c43998. > > > > > > Tarball containing compressed sigdata files: 98 MB > > > Compressed tarball (zstd -3) containing compressed sigdata files: > > > 76 MB > > > Compressed tarball (zstd -17) containing compressed sigdata > > > files: 65 MB > > > Tarball containing uncompressed sigdata files: 310 MB > > > Compressed tarball (zstd -3) containing uncompressed sigdata > > > files: 12 MB > > > Compressed tarball (zstd -17) containing uncompressed sigdata > > > files: 4 MB > > > > > > Signed-off-by: Philip Lorenz <philip.lorenz@bmw.de> > > > --- > > > lib/bb/siggen.py | 84 > > > +++++++++++++++++++++++++++++++++++++++++++----- > > > 1 file changed, 76 insertions(+), 8 deletions(-) > > Whilst this improves the "tarball up a stamps directory" workflow, > > it > > comes at the expense of a significant amount more IO and disk usage > > of > > the build tree during the build itself. For that reason, I'd not > > recommend anyone to be running with this set by default or in > > general. > > I benchmarked these changes and couldn't find a significant > difference > in both IO (my test core-image-sato-minimal build causes 80GB disk > writes) and build time with uncompressed files. Disk usage of the > .sigdata files increases by a factor of 3 (in our typical builds this > boils down to 400 MB added) which is a trade off that we're willing > to > accept in the CI context. Are there any code paths that I'm missing > where this would cause significant slow downs / disadvantages? A better test would be "bitbake core-image-minimal -S none" which writes out the files without the "noise" of there rest of the build. It basically slows down task completion and hence is the bottleneck in task execution overhead of bitbake itself on builds. This is more significant in builds from sstate or where large numbers of tasks are being run through with little work being needed. > > I'm not a big fan of adding config options which add non-default > > workflows, mainly due to the doubling of the test matrix and the > > fact > > that one of the code paths ends up stale/broken. > I fully understand that (in particular because this change is mainly > beneficial in one particular context). I wanted to share to see if > there is interest from others, if there isn't its completely > understandable to reduce complexity. Fair enough, lets see if there is wide interest for doing this. > > Further, once you expose this to the siginfo files, you then have > > to > > either add multiple queries for sstate objects, or have extra > > configuration to setup sstate feeds. I'm not sure either is a good > > outcome. This would also imply there would be a further patch to > > adapt > > the code in oe-core to use this new option for the siginfo files > > and > > again, this is complexity I'd really prefer simply not to have. > > For feeds nothing would change. I didn't include the patch to oe-core > (which would use the exposed compression parameter) before the > applicability of the main patch was discussed. It should at least be mentioned as the maintenance burden and impact is larger if the other patch is considered with this too. I suspect the oecore change would attract more negative review due to the "forking" of the sstate config/format and the need for more config. Cheers, Richard
diff --git a/lib/bb/siggen.py b/lib/bb/siggen.py index a6163b55e..5c954228f 100644 --- a/lib/bb/siggen.py +++ b/lib/bb/siggen.py @@ -4,7 +4,10 @@ # SPDX-License-Identifier: GPL-2.0-only # +import contextlib +import enum import hashlib +import io import logging import os import re @@ -68,6 +71,67 @@ def init(d): ', '.join(obj.name for obj in siggens)) return SignatureGenerator(d) +class SigdataCompression(enum.Enum): + """Enumeration of sigdata compression / decompression schemes""" + + AUTO = "auto" + """ + Compression mode is automatically determined by file content (when reading files) or based + on the value of BB_SIGDATA_COMPRESSION (when writing files). BB_SIGDATA_COMPRESSION may be + set to any of the other constants defined in the enumeration converted to lower case (e.g. none + or zstd). + """ + + NONE = "none" + """ + Sigdata files are stored uncompressed. + """ + + ZSTD = "zstd" + """ + Sigdata files are stored using zstd compression. + """ + +@contextlib.contextmanager +def open_sigdata(path, mode, compression=SigdataCompression.AUTO): + """Open the given sigdata file at path with the given mode. + + The compression parameter shall be used to specify which compression mode should be applied to + the sigdata file. See SigdataCompressionMode for a list of available modes. + + The special compression mode `auto` may be used to automatically determine the compression mode + depending on existing sigdata content. As such `auto` is only supported when opening sigdata + files for reading. + """ + ZSTD_MAGIC_BYTES = b"\x28\xb5\x2f\xfd" + + # The underlying file is always opened in binary mode as we may need to pass the file descriptor + # to a subprocess for description. In case text decoding is required (e.g. for uncompressed + # files) a TextIOWrapper will be used. + file_mode = mode.replace("t", "") + if "b" not in file_mode: + file_mode += "b" + + # The fileobj will be passed to Popen which then continues to operate on the raw file + # descriptor. Turn off buffering as this interferes with this assumption and operations such as + # peek() or seek() no longer move the file descriptor to the expected position. + with open(path, file_mode, buffering=0) as f: + if "r" in mode and compression == SigdataCompression.AUTO: + compression = SigdataCompression.ZSTD if f.read(4) == ZSTD_MAGIC_BYTES else SigdataCompression.NONE + f.seek(0) + + if compression == SigdataCompression.NONE: + if "t" in mode: + f = io.TextIOWrapper( + f, encoding="utf-8", write_through=True + ) + + yield f + return + + with bb.compress.zstd.open(f, mode=mode, encoding="utf-8", num_threads=1) as zf: + yield zf + class SignatureGenerator(object): """ """ @@ -172,7 +236,7 @@ class SignatureGenerator(object): def stampcleanmask(self, stampbase, file_name, taskname, extrainfo): return ("%s.%s.%s" % (stampbase, taskname, extrainfo)).rstrip('.') - def dump_sigtask(self, mcfn, task, stampbase, runtime): + def dump_sigtask(self, mcfn, task, stampbase, runtime, compression=None): return def invalidate_task(self, task, mcfn): @@ -239,6 +303,7 @@ class SignatureGeneratorBasic(SignatureGenerator): self.unitaskhashes = self.unihash_cache.init_cache(data, "bb_unihashes.dat", {}) self.localdirsexclude = (data.getVar("BB_SIGNATURE_LOCAL_DIRS_EXCLUDE") or "CVS .bzr .git .hg .osc .p4 .repo .svn").split() self.tidtopn = {} + self.sigdata_compression = SigdataCompression(data.getVar("BB_SIGDATA_COMPRESSION") or SigdataCompression.ZSTD) def init_rundepcheck(self, data): self.taskhash_ignore_tasks = data.getVar("BB_TASKHASH_IGNORE_TASKS") or None @@ -415,7 +480,9 @@ class SignatureGeneratorBasic(SignatureGenerator): def save_unitaskhashes(self): self.unihash_cache.save(self.unitaskhashes) - def dump_sigtask(self, mcfn, task, stampbase, runtime): + def dump_sigtask(self, mcfn, task, stampbase, runtime, compression=SigdataCompression.AUTO): + compression = compression if compression != SigdataCompression.AUTO \ + else self.sigdata_compression tid = mcfn + ":" + task mc = bb.runqueue.mc_from_tid(mcfn) referencestamp = stampbase @@ -478,7 +545,7 @@ class SignatureGeneratorBasic(SignatureGenerator): fd, tmpfile = bb.utils.mkstemp(dir=os.path.dirname(sigfile), prefix="sigtask.") try: - with bb.compress.zstd.open(fd, "wt", encoding="utf-8", num_threads=1) as f: + with open_sigdata(fd, "wt", compression) as f: json.dump(data, f, sort_keys=True, separators=(",", ":"), cls=SetEncoder) f.flush() os.chmod(tmpfile, 0o664) @@ -880,12 +947,13 @@ def clean_checksum_file_path(file_checksum_tuple): return "./" + f.split("/./")[1] return os.path.basename(f) -def dump_this_task(outfile, d): +def dump_this_task(outfile, d, compression=SigdataCompression.AUTO): import bb.parse mcfn = d.getVar("BB_FILENAME") task = "do_" + d.getVar("BB_CURRENTTASK") referencestamp = bb.parse.siggen.stampfile_base(mcfn) - bb.parse.siggen.dump_sigtask(mcfn, task, outfile, "customfile:" + referencestamp) + bb.parse.siggen.dump_sigtask(mcfn, task, outfile, "customfile:" + referencestamp, + compression=compression) def init_colors(enable_color): """Initialise colour dict for passing to compare_sigfiles()""" @@ -969,13 +1037,13 @@ def compare_sigfiles(a, b, recursecb=None, color=False, collapsed=False): return formatstr.format(**formatparams) try: - with bb.compress.zstd.open(a, "rt", encoding="utf-8", num_threads=1) as f: + with open_sigdata(a, "rt") as f: a_data = json.load(f, object_hook=SetDecoder) except (TypeError, OSError) as err: bb.error("Failed to open sigdata file '%s': %s" % (a, str(err))) raise err try: - with bb.compress.zstd.open(b, "rt", encoding="utf-8", num_threads=1) as f: + with open_sigdata(b, "rt") as f: b_data = json.load(f, object_hook=SetDecoder) except (TypeError, OSError) as err: bb.error("Failed to open sigdata file '%s': %s" % (b, str(err))) @@ -1218,7 +1286,7 @@ def dump_sigfile(a): output = [] try: - with bb.compress.zstd.open(a, "rt", encoding="utf-8", num_threads=1) as f: + with open_sigdata(a, "rt") as f: a_data = json.load(f, object_hook=SetDecoder) except (TypeError, OSError) as err: bb.error("Failed to open sigdata file '%s': %s" % (a, str(err)))
Compression utilities such as `zstd` can achieve significant compression ratio improvements even when the low entropy data is spread over a larger input window. As a result compressing a tarball containing uncompressed sigdata files with zstd can reduce the amount by a factor of 6-16. Support this by introducing the `BB_SIGDATA_COMPRESSION` variable which can be used to control compression of sigdata files. Setting the variable to "none" disables compression, setting it to "zstd" (its default value) retains the current behaviour. All functions used to consume sigdata files are extended to automatically detect whether input files are compressed and transparently select the correct behaviour depending on that. Additionally, expose the compression paramater in all functions to enable overriding the global setting in some scenarios (e.g. for .siginfo files created by sstate.bbclass). Compression results for core-image-sato sigdata files produced using `bitbake -S none core-image-sato` on poky 122e9a49614b2ddedaae1d90c06004a7a4c43998. Tarball containing compressed sigdata files: 98 MB Compressed tarball (zstd -3) containing compressed sigdata files: 76 MB Compressed tarball (zstd -17) containing compressed sigdata files: 65 MB Tarball containing uncompressed sigdata files: 310 MB Compressed tarball (zstd -3) containing uncompressed sigdata files: 12 MB Compressed tarball (zstd -17) containing uncompressed sigdata files: 4 MB Signed-off-by: Philip Lorenz <philip.lorenz@bmw.de> --- lib/bb/siggen.py | 84 +++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 76 insertions(+), 8 deletions(-)