diff mbox series

[RFC,1/1] siggen: Support non-compressed sigdata files

Message ID 20250515144455.2799533-2-philip.lorenz@bmw.de
State New
Headers show
Series siggen: Support non-compressed sigdata files | expand

Commit Message

Philip Lorenz May 15, 2025, 2:44 p.m. UTC
Compression utilities such as `zstd` can achieve significant compression
ratio improvements even when the low entropy data is spread over a
larger input window. As a result compressing a tarball containing
uncompressed sigdata files with zstd can reduce the amount by a factor
of 6-16.

Support this by introducing the `BB_SIGDATA_COMPRESSION` variable which
can be used to control compression of sigdata files. Setting the
variable to "none" disables compression, setting it to "zstd" (its
default value) retains the current behaviour.

All functions used to consume sigdata files are extended to
automatically detect whether input files are compressed and
transparently select the correct behaviour depending on that.

Additionally, expose the compression paramater in all functions to
enable overriding the global setting in some scenarios (e.g. for
.siginfo files created by sstate.bbclass).

Compression results for core-image-sato sigdata files produced using
`bitbake -S none core-image-sato` on poky
122e9a49614b2ddedaae1d90c06004a7a4c43998.

Tarball containing compressed sigdata files: 98 MB
Compressed tarball (zstd -3) containing compressed sigdata files: 76 MB
Compressed tarball (zstd -17) containing compressed sigdata files: 65 MB
Tarball containing uncompressed sigdata files: 310 MB
Compressed tarball (zstd -3) containing uncompressed sigdata files: 12 MB
Compressed tarball (zstd -17) containing uncompressed sigdata files: 4 MB

Signed-off-by: Philip Lorenz <philip.lorenz@bmw.de>
---
 lib/bb/siggen.py | 84 +++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 76 insertions(+), 8 deletions(-)

Comments

Richard Purdie May 16, 2025, 6:39 a.m. UTC | #1
On Thu, 2025-05-15 at 16:44 +0200, Philip Lorenz via lists.openembedded.org wrote:
> Compression utilities such as `zstd` can achieve significant compression
> ratio improvements even when the low entropy data is spread over a
> larger input window. As a result compressing a tarball containing
> uncompressed sigdata files with zstd can reduce the amount by a factor
> of 6-16.
> 
> Support this by introducing the `BB_SIGDATA_COMPRESSION` variable which
> can be used to control compression of sigdata files. Setting the
> variable to "none" disables compression, setting it to "zstd" (its
> default value) retains the current behaviour.
> 
> All functions used to consume sigdata files are extended to
> automatically detect whether input files are compressed and
> transparently select the correct behaviour depending on that.
> 
> Additionally, expose the compression paramater in all functions to
> enable overriding the global setting in some scenarios (e.g. for
> .siginfo files created by sstate.bbclass).
> 
> Compression results for core-image-sato sigdata files produced using
> `bitbake -S none core-image-sato` on poky
> 122e9a49614b2ddedaae1d90c06004a7a4c43998.
> 
> Tarball containing compressed sigdata files: 98 MB
> Compressed tarball (zstd -3) containing compressed sigdata files: 76 MB
> Compressed tarball (zstd -17) containing compressed sigdata files: 65 MB
> Tarball containing uncompressed sigdata files: 310 MB
> Compressed tarball (zstd -3) containing uncompressed sigdata files: 12 MB
> Compressed tarball (zstd -17) containing uncompressed sigdata files: 4 MB
> 
> Signed-off-by: Philip Lorenz <philip.lorenz@bmw.de>
> ---
>  lib/bb/siggen.py | 84 +++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 76 insertions(+), 8 deletions(-)

Whilst this improves the "tarball up a stamps directory" workflow, it
comes at the expense of a significant amount more IO and disk usage of
the build tree during the build itself. For that reason, I'd not
recommend anyone to be running with this set by default or in general.

I'm not a big fan of adding config options which add non-default
workflows, mainly due to the doubling of the test matrix and the fact
that one of the code paths ends up stale/broken.

Further, once you expose this to the siginfo files, you then have to
either add multiple queries for sstate objects, or have extra
configuration to setup sstate feeds. I'm not sure either is a good
outcome. This would also imply there would be a further patch to adapt
the code in oe-core to use this new option for the siginfo files and
again, this is complexity I'd really prefer simply not to have.

How many users are regularly taring up sstate feeds or stamps
directories?

Cheers,

Richard
Philip Lorenz May 16, 2025, 5:29 p.m. UTC | #2
Hi Richard,

On 16.05.25 08:39, Richard Purdie wrote:
> On Thu, 2025-05-15 at 16:44 +0200, Philip Lorenz via lists.openembedded.org wrote:
>> Compression utilities such as `zstd` can achieve significant compression
>> ratio improvements even when the low entropy data is spread over a
>> larger input window. As a result compressing a tarball containing
>> uncompressed sigdata files with zstd can reduce the amount by a factor
>> of 6-16.
>>
>> Support this by introducing the `BB_SIGDATA_COMPRESSION` variable which
>> can be used to control compression of sigdata files. Setting the
>> variable to "none" disables compression, setting it to "zstd" (its
>> default value) retains the current behaviour.
>>
>> All functions used to consume sigdata files are extended to
>> automatically detect whether input files are compressed and
>> transparently select the correct behaviour depending on that.
>>
>> Additionally, expose the compression paramater in all functions to
>> enable overriding the global setting in some scenarios (e.g. for
>> .siginfo files created by sstate.bbclass).
>>
>> Compression results for core-image-sato sigdata files produced using
>> `bitbake -S none core-image-sato` on poky
>> 122e9a49614b2ddedaae1d90c06004a7a4c43998.
>>
>> Tarball containing compressed sigdata files: 98 MB
>> Compressed tarball (zstd -3) containing compressed sigdata files: 76 MB
>> Compressed tarball (zstd -17) containing compressed sigdata files: 65 MB
>> Tarball containing uncompressed sigdata files: 310 MB
>> Compressed tarball (zstd -3) containing uncompressed sigdata files: 12 MB
>> Compressed tarball (zstd -17) containing uncompressed sigdata files: 4 MB
>>
>> Signed-off-by: Philip Lorenz <philip.lorenz@bmw.de>
>> ---
>>   lib/bb/siggen.py | 84 +++++++++++++++++++++++++++++++++++++++++++-----
>>   1 file changed, 76 insertions(+), 8 deletions(-)
> Whilst this improves the "tarball up a stamps directory" workflow, it
> comes at the expense of a significant amount more IO and disk usage of
> the build tree during the build itself. For that reason, I'd not
> recommend anyone to be running with this set by default or in general.

I benchmarked these changes and couldn't find a significant difference 
in both IO (my test core-image-sato-minimal build causes 80GB disk 
writes) and build time with uncompressed files. Disk usage of the 
.sigdata files increases by a factor of 3 (in our typical builds this 
boils down to 400 MB added) which is a trade off that we're willing to 
accept in the CI context. Are there any code paths that I'm missing 
where this would cause significant slow downs / disadvantages?

> I'm not a big fan of adding config options which add non-default
> workflows, mainly due to the doubling of the test matrix and the fact
> that one of the code paths ends up stale/broken.
I fully understand that (in particular because this change is mainly 
beneficial in one particular context). I wanted to share to see if there 
is interest from others, if there isn't its completely understandable to 
reduce complexity.
> Further, once you expose this to the siginfo files, you then have to
> either add multiple queries for sstate objects, or have extra
> configuration to setup sstate feeds. I'm not sure either is a good
> outcome. This would also imply there would be a further patch to adapt
> the code in oe-core to use this new option for the siginfo files and
> again, this is complexity I'd really prefer simply not to have.

For feeds nothing would change. I didn't include the patch to oe-core 
(which would use the exposed compression parameter) before the 
applicability of the main patch was discussed.

Thanks,

Philip
Richard Purdie May 19, 2025, 1:19 p.m. UTC | #3
On Fri, 2025-05-16 at 19:29 +0200, Philip Lorenz wrote:
> On 16.05.25 08:39, Richard Purdie wrote:
> > On Thu, 2025-05-15 at 16:44 +0200, Philip Lorenz via
> > lists.openembedded.org wrote:
> > > Compression utilities such as `zstd` can achieve significant
> > > compression
> > > ratio improvements even when the low entropy data is spread over
> > > a
> > > larger input window. As a result compressing a tarball containing
> > > uncompressed sigdata files with zstd can reduce the amount by a
> > > factor
> > > of 6-16.
> > > 
> > > Support this by introducing the `BB_SIGDATA_COMPRESSION` variable
> > > which
> > > can be used to control compression of sigdata files. Setting the
> > > variable to "none" disables compression, setting it to "zstd"
> > > (its
> > > default value) retains the current behaviour.
> > > 
> > > All functions used to consume sigdata files are extended to
> > > automatically detect whether input files are compressed and
> > > transparently select the correct behaviour depending on that.
> > > 
> > > Additionally, expose the compression paramater in all functions
> > > to
> > > enable overriding the global setting in some scenarios (e.g. for
> > > .siginfo files created by sstate.bbclass).
> > > 
> > > Compression results for core-image-sato sigdata files produced
> > > using
> > > `bitbake -S none core-image-sato` on poky
> > > 122e9a49614b2ddedaae1d90c06004a7a4c43998.
> > > 
> > > Tarball containing compressed sigdata files: 98 MB
> > > Compressed tarball (zstd -3) containing compressed sigdata files:
> > > 76 MB
> > > Compressed tarball (zstd -17) containing compressed sigdata
> > > files: 65 MB
> > > Tarball containing uncompressed sigdata files: 310 MB
> > > Compressed tarball (zstd -3) containing uncompressed sigdata
> > > files: 12 MB
> > > Compressed tarball (zstd -17) containing uncompressed sigdata
> > > files: 4 MB
> > > 
> > > Signed-off-by: Philip Lorenz <philip.lorenz@bmw.de>
> > > ---
> > >   lib/bb/siggen.py | 84
> > > +++++++++++++++++++++++++++++++++++++++++++-----
> > >   1 file changed, 76 insertions(+), 8 deletions(-)
> > Whilst this improves the "tarball up a stamps directory" workflow,
> > it
> > comes at the expense of a significant amount more IO and disk usage
> > of
> > the build tree during the build itself. For that reason, I'd not
> > recommend anyone to be running with this set by default or in
> > general.
> 
> I benchmarked these changes and couldn't find a significant
> difference 
> in both IO (my test core-image-sato-minimal build causes 80GB disk 
> writes) and build time with uncompressed files. Disk usage of the 
> .sigdata files increases by a factor of 3 (in our typical builds this
> boils down to 400 MB added) which is a trade off that we're willing
> to 
> accept in the CI context. Are there any code paths that I'm missing 
> where this would cause significant slow downs / disadvantages?

A better test would be "bitbake core-image-minimal -S none" which
writes out the files without the "noise" of there rest of the build.

It basically slows down task completion and hence is the bottleneck in
task execution overhead of bitbake itself on builds. This is more
significant in builds from sstate or where large numbers of tasks are
being run through with little work being needed.

> > I'm not a big fan of adding config options which add non-default
> > workflows, mainly due to the doubling of the test matrix and the
> > fact
> > that one of the code paths ends up stale/broken.
> I fully understand that (in particular because this change is mainly 
> beneficial in one particular context). I wanted to share to see if
> there  is interest from others, if there isn't its completely
> understandable to reduce complexity.

Fair enough, lets see if there is wide interest for doing this.

> > Further, once you expose this to the siginfo files, you then have
> > to
> > either add multiple queries for sstate objects, or have extra
> > configuration to setup sstate feeds. I'm not sure either is a good
> > outcome. This would also imply there would be a further patch to
> > adapt
> > the code in oe-core to use this new option for the siginfo files
> > and
> > again, this is complexity I'd really prefer simply not to have.
> 
> For feeds nothing would change. I didn't include the patch to oe-core
> (which would use the exposed compression parameter) before the 
> applicability of the main patch was discussed.

It should at least be mentioned as the maintenance burden and impact is
larger if the other patch is considered with this too. I suspect the
oecore change would attract more negative review due to the "forking"
of the sstate config/format and the need for more config.

Cheers,

Richard
diff mbox series

Patch

diff --git a/lib/bb/siggen.py b/lib/bb/siggen.py
index a6163b55e..5c954228f 100644
--- a/lib/bb/siggen.py
+++ b/lib/bb/siggen.py
@@ -4,7 +4,10 @@ 
 # SPDX-License-Identifier: GPL-2.0-only
 #
 
+import contextlib
+import enum
 import hashlib
+import io
 import logging
 import os
 import re
@@ -68,6 +71,67 @@  def init(d):
                      ', '.join(obj.name for obj in siggens))
         return SignatureGenerator(d)
 
+class SigdataCompression(enum.Enum):
+    """Enumeration of sigdata compression / decompression schemes"""
+
+    AUTO = "auto"
+    """
+    Compression mode is automatically determined by file content (when reading files) or based
+    on the value of BB_SIGDATA_COMPRESSION (when writing files). BB_SIGDATA_COMPRESSION may be
+    set to any of the other constants defined in the enumeration converted to lower case (e.g. none
+    or zstd).
+    """
+
+    NONE = "none"
+    """
+    Sigdata files are stored uncompressed.
+    """
+
+    ZSTD = "zstd"
+    """
+    Sigdata files are stored using zstd compression.
+    """
+
+@contextlib.contextmanager
+def open_sigdata(path, mode, compression=SigdataCompression.AUTO):
+    """Open the given sigdata file at path with the given mode.
+
+    The compression parameter shall be used to specify which compression mode should be applied to
+    the sigdata file. See SigdataCompressionMode for a list of available modes.
+
+    The special compression mode `auto` may be used to automatically determine the compression mode
+    depending on existing sigdata content. As such `auto` is only supported when opening sigdata
+    files for reading.
+    """
+    ZSTD_MAGIC_BYTES = b"\x28\xb5\x2f\xfd"
+
+    # The underlying file is always opened in binary mode as we may need to pass the file descriptor
+    # to a subprocess for description. In case text decoding is required (e.g. for uncompressed
+    # files) a TextIOWrapper will be used.
+    file_mode = mode.replace("t", "")
+    if "b" not in file_mode:
+        file_mode += "b"
+
+    # The fileobj will be passed to Popen which then continues to operate on the raw file
+    # descriptor. Turn off buffering as this interferes with this assumption and operations such as
+    # peek() or seek() no longer move the file descriptor to the expected position.
+    with open(path, file_mode, buffering=0) as f:
+        if "r" in mode and compression == SigdataCompression.AUTO:
+            compression = SigdataCompression.ZSTD if f.read(4) == ZSTD_MAGIC_BYTES else SigdataCompression.NONE
+            f.seek(0)
+
+        if compression == SigdataCompression.NONE:
+            if "t" in mode:
+                f = io.TextIOWrapper(
+                    f, encoding="utf-8", write_through=True
+                )
+
+            yield f
+            return
+
+        with bb.compress.zstd.open(f, mode=mode, encoding="utf-8", num_threads=1) as zf:
+            yield zf
+
 class SignatureGenerator(object):
     """
     """
@@ -172,7 +236,7 @@  class SignatureGenerator(object):
     def stampcleanmask(self, stampbase, file_name, taskname, extrainfo):
         return ("%s.%s.%s" % (stampbase, taskname, extrainfo)).rstrip('.')
 
-    def dump_sigtask(self, mcfn, task, stampbase, runtime):
+    def dump_sigtask(self, mcfn, task, stampbase, runtime, compression=None):
         return
 
     def invalidate_task(self, task, mcfn):
@@ -239,6 +303,7 @@  class SignatureGeneratorBasic(SignatureGenerator):
         self.unitaskhashes = self.unihash_cache.init_cache(data, "bb_unihashes.dat", {})
         self.localdirsexclude = (data.getVar("BB_SIGNATURE_LOCAL_DIRS_EXCLUDE") or "CVS .bzr .git .hg .osc .p4 .repo .svn").split()
         self.tidtopn = {}
+        self.sigdata_compression = SigdataCompression(data.getVar("BB_SIGDATA_COMPRESSION") or SigdataCompression.ZSTD)
 
     def init_rundepcheck(self, data):
         self.taskhash_ignore_tasks = data.getVar("BB_TASKHASH_IGNORE_TASKS") or None
@@ -415,7 +480,9 @@  class SignatureGeneratorBasic(SignatureGenerator):
     def save_unitaskhashes(self):
         self.unihash_cache.save(self.unitaskhashes)
 
-    def dump_sigtask(self, mcfn, task, stampbase, runtime):
+    def dump_sigtask(self, mcfn, task, stampbase, runtime, compression=SigdataCompression.AUTO):
+        compression = compression if compression != SigdataCompression.AUTO \
+            else self.sigdata_compression
         tid = mcfn + ":" + task
         mc = bb.runqueue.mc_from_tid(mcfn)
         referencestamp = stampbase
@@ -478,7 +545,7 @@  class SignatureGeneratorBasic(SignatureGenerator):
 
         fd, tmpfile = bb.utils.mkstemp(dir=os.path.dirname(sigfile), prefix="sigtask.")
         try:
-            with bb.compress.zstd.open(fd, "wt", encoding="utf-8", num_threads=1) as f:
+            with open_sigdata(fd, "wt", compression) as f:
                 json.dump(data, f, sort_keys=True, separators=(",", ":"), cls=SetEncoder)
                 f.flush()
             os.chmod(tmpfile, 0o664)
@@ -880,12 +947,13 @@  def clean_checksum_file_path(file_checksum_tuple):
         return "./" + f.split("/./")[1]
     return os.path.basename(f)
 
-def dump_this_task(outfile, d):
+def dump_this_task(outfile, d, compression=SigdataCompression.AUTO):
     import bb.parse
     mcfn = d.getVar("BB_FILENAME")
     task = "do_" + d.getVar("BB_CURRENTTASK")
     referencestamp = bb.parse.siggen.stampfile_base(mcfn)
-    bb.parse.siggen.dump_sigtask(mcfn, task, outfile, "customfile:" + referencestamp)
+    bb.parse.siggen.dump_sigtask(mcfn, task, outfile, "customfile:" + referencestamp,
+                                 compression=compression)
 
 def init_colors(enable_color):
     """Initialise colour dict for passing to compare_sigfiles()"""
@@ -969,13 +1037,13 @@  def compare_sigfiles(a, b, recursecb=None, color=False, collapsed=False):
         return formatstr.format(**formatparams)
 
     try:
-        with bb.compress.zstd.open(a, "rt", encoding="utf-8", num_threads=1) as f:
+        with open_sigdata(a, "rt") as f:
             a_data = json.load(f, object_hook=SetDecoder)
     except (TypeError, OSError) as err:
         bb.error("Failed to open sigdata file '%s': %s" % (a, str(err)))
         raise err
     try:
-        with bb.compress.zstd.open(b, "rt", encoding="utf-8", num_threads=1) as f:
+        with open_sigdata(b, "rt") as f:
             b_data = json.load(f, object_hook=SetDecoder)
     except (TypeError, OSError) as err:
         bb.error("Failed to open sigdata file '%s': %s" % (b, str(err)))
@@ -1218,7 +1286,7 @@  def dump_sigfile(a):
     output = []
 
     try:
-        with bb.compress.zstd.open(a, "rt", encoding="utf-8", num_threads=1) as f:
+        with open_sigdata(a, "rt") as f:
             a_data = json.load(f, object_hook=SetDecoder)
     except (TypeError, OSError) as err:
         bb.error("Failed to open sigdata file '%s': %s" % (a, str(err)))