
[1/3] sbom30.py: reduce redundant spdxid-hash symlinks to save inode on host

Message ID 20241110030741.4108407-1-hongxu.jia@windriver.com
State New

Commit Message

Hongxu Jia Nov. 10, 2024, 3:07 a.m. UTC
To collect all in-scope SPDX data into a single JSON-LD file for
SPDX 3.0.1, Yocto's SBOM support works as follows:
- In each native/target/nativesdk recipe, it creates an spdxid-hash
  symlink for each element, pointing to the JSON-LD file that
  contains the element's details;
- In the image recipe, it follows the spdxid-hash symlinks to collect
  element details from the various JSON-LD files.

When SPDX_INCLUDE_SOURCES = "1", sources are added to the JSON-LD file
and at least 2N spdxid-hash symlinks are created for N source files
(N for the software_File elements, N for the hasDeclaredLicense Relationships).

With a large number of source files, every extra symlink -> real file
occupies one more inode, which in turn needs a slot in the OS's inode
cache. As a result, disk performance degrades and inodes are used up
quickly.
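
The inode cost described above can be observed with a small experiment. The
following is only a sketch, assuming a typical Linux filesystem where each
symlink is allocated its own inode; the helper name count_inodes and the file
names are made up for illustration and are not part of sbom30.py.

```python
import os
import tempfile


def count_inodes(n_links):
    """Create one real file plus n_links symlinks to it and return the
    number of distinct inodes they occupy."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "recipe.spdx.json")
        with open(target, "w") as f:
            f.write("{}")
        inodes = {os.lstat(target).st_ino}
        for i in range(n_links):
            link = os.path.join(tmp, f"link-{i}.spdx.json")
            os.symlink(os.path.basename(target), link)
            # lstat() inspects the symlink itself, not the file it points to
            inodes.add(os.lstat(link).st_ino)
        return len(inodes)


print(count_inodes(5))
```

On such a filesystem the count is the number of symlinks plus one for the real
file, so 2N symlinks per N source files translate directly into 2N extra inodes.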

When add_package_files adds source files to a JSON-LD file, the
spdxid-hash symlinks for all of those source files point to the same
JSON-LD file. Consider the format of the spdxId:

- spdxId of a source file:
http://spdx.org/spdxdocs/shadow-10e66933-65cf-5a2d-9a1d-99b12a405441/0838759b8d71923d250a0813dda7356ffd309576115bbf8ed7e266cf4aed86a5/sourcefile/1

If the count number ('/1') is removed from the spdxId suffix before
hashing, all source files in one recipe share one spdxid-hash symlink.

The same reasoning applies to sysroot and package files:

- spdxId of a sysroot file:
http://spdx.org/spdxdocs/shadow-10e66933-65cf-5a2d-9a1d-99b12a405441/0838759b8d71923d250a0813dda7356ffd309576115bbf8ed7e266cf4aed86a5/sysroot/1

- spdxId of a package file:
http://spdx.org/spdxdocs/shadow-10e66933-65cf-5a2d-9a1d-99b12a405441/0838759b8d71923d250a0813dda7356ffd309576115bbf8ed7e266cf4aed86a5/package/shadow-src/file/1
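
To illustrate why dropping the counter collapses the symlinks, here is a
minimal sketch that hashes a few spdxIds with and without the suffix. It
mirrors the regular expression and SHA-256 hashing used in the patch, but the
function hash_path and its strip_counter flag are hypothetical names chosen
for this example, and the path layout is simplified.

```python
import hashlib
import os
import re
from pathlib import Path

# Same pattern as in the patch: a trailing "/<word>" after one of the
# known per-file prefixes.
COUNTER_SUFFIX = re.compile(r".*/(sourcefile|sysroot|file)/\w+$")


def hash_path(spdxid, strip_counter):
    """Return the by-spdxid-hash path for an spdxId, optionally dropping
    the trailing counter before hashing."""
    if strip_counter and COUNTER_SUFFIX.match(spdxid):
        spdxid = os.path.dirname(spdxid)
    h = hashlib.sha256(spdxid.encode("utf-8")).hexdigest()
    return Path("by-spdxid-hash") / h[:2] / h


base = (
    "http://spdx.org/spdxdocs/shadow-10e66933-65cf-5a2d-9a1d-99b12a405441/"
    "0838759b8d71923d250a0813dda7356ffd309576115bbf8ed7e266cf4aed86a5/sourcefile"
)
ids = [f"{base}/{i}" for i in range(1, 4)]

paths_without = {hash_path(i, strip_counter=False) for i in ids}
paths_with = {hash_path(i, strip_counter=True) for i in ids}
print(len(paths_without), len(paths_with))
```

Three distinct spdxIds produce three hash paths when the counter is kept and a
single shared path when it is stripped, which is where the inode saving comes
from.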

Build core-image-minimal with and without this commit and compare the
number of spdxid-hash symlinks: 7,281,824 -> 70,508.

echo 'SPDX_INCLUDE_SOURCES = "1"' >> local.conf

With this commit:
$ time bitbake core-image-minimal
real    95m6.960s
user    0m22.832s
sys     0m4.087s

$ find tmp/deploy/spdx/3.0.1/*/by-spdxid-hash/ -name "*.spdx.json" |wc -l
70508

Without this commit:
$ time bitbake core-image-minimal
real    100m17.769s
user    0m24.516s
sys     0m4.334s

$ find tmp/deploy/spdx/3.0.1/*/by-spdxid-hash -name "*.json" |wc -l
7281824
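
For reference, the relative improvement implied by the two runs above can be
worked out as follows. This is plain arithmetic on the reported numbers; the
variable and helper names are chosen for this sketch only.

```python
def to_seconds(minutes, seconds):
    """Convert a 'XmY.ZZZs' style duration to seconds."""
    return minutes * 60 + seconds


links_before = 7281824
links_after = 70508
real_before = to_seconds(100, 17.769)  # without this commit
real_after = to_seconds(95, 6.960)     # with this commit

reduction_factor = links_before / links_after
time_saved = real_before - real_after

print(f"symlinks reduced by a factor of {reduction_factor:.1f}")
print(f"wall-clock time saved: {time_saved:.1f} s ({time_saved / real_before:.1%})")
```

This works out to roughly a hundredfold reduction in symlinks and about five
minutes less wall-clock time, based on a single run of each configuration.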

Signed-off-by: Hongxu Jia <hongxu.jia@windriver.com>
---
 meta/lib/oe/sbom30.py | 9 +++++++++
 1 file changed, 9 insertions(+)

Comments

Joshua Watt Nov. 18, 2024, 9:52 p.m. UTC | #1
I think what you are trying to get at here is that the same file is
present in multiple different "packages", so all references are
equally interchangeable?

I'm fine if we can figure out a reasonable way to do that, but I don't
think this is the correct approach. A better option would be to simply
reference the SPDX ID of the previously described file instead of
making a new one each time. I don't really like "magic" in the
jsonld_hash_path() which really hides what we are actually after (only
creating a single file element and referencing it multiple times).

This would also conveniently solve the license problem since only one
file element would be created per hash.

However, I think the reason it's done in the manner it is, is because
each instance of the file is in a different path, so you'd lose that
information by combining them all into the same file element;
although, you might still be able to deduplicate the license
information.
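
One way to picture the alternative suggested above (a single file element per
content hash, referenced multiple times) is the sketch below. It is purely
illustrative: the FileRegistry class, its methods, and the spdxId layout are
hypothetical and do not correspond to anything in sbom30.py. It keeps a list
of paths per element to show how the path information mentioned above could
still be retained.

```python
import hashlib


class FileRegistry:
    """Hypothetical registry that creates one file element per content hash."""

    def __init__(self, namespace):
        self.namespace = namespace
        self.by_hash = {}  # sha256 hex digest -> spdxId
        self.paths = {}    # spdxId -> every path where the content was seen

    def add(self, path, content):
        """Return the spdxId for this content, reusing an existing element
        if the same bytes were already described."""
        digest = hashlib.sha256(content).hexdigest()
        spdxid = self.by_hash.get(digest)
        if spdxid is None:
            # Assumed ID scheme for this sketch only
            spdxid = f"{self.namespace}/file/{digest}"
            self.by_hash[digest] = spdxid
        self.paths.setdefault(spdxid, []).append(path)
        return spdxid


registry = FileRegistry("http://example.com/spdxdocs/demo")
first = registry.add("usr/share/doc/a/COPYING", b"license text")
second = registry.add("usr/share/doc/b/COPYING", b"license text")
other = registry.add("usr/bin/tool", b"binary")
print(first == second, first == other, len(registry.by_hash))
```

With this shape, license relationships would only need to be attached once per
element, while the per-path information stays available alongside it.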


Patch

diff --git a/meta/lib/oe/sbom30.py b/meta/lib/oe/sbom30.py
index e3a9428668..4efeaae3a0 100644
--- a/meta/lib/oe/sbom30.py
+++ b/meta/lib/oe/sbom30.py
@@ -911,6 +911,10 @@  def jsonld_arch_path(d, arch, subdir, name, deploydir=None):
 
 
 def jsonld_hash_path(_id):
+    # For the spdxId added by add_package_files, remove the trailing count number
+    if re.match(r".*/(sourcefile|sysroot|file)/\w+$", _id):
+        _id = os.path.dirname(_id)
+
     h = hashlib.sha256(_id.encode("utf-8")).hexdigest()
 
     return Path("by-spdxid-hash") / h[:2], h
@@ -992,6 +996,11 @@  def write_recipe_jsonld_doc(
             *hash_path,
             deploydir=deploydir,
         )
+
+        # Return if expected symlink exists
+        if link_name.is_symlink() and link_name.resolve() == dest:
+            return hash_path[-1]
+
         try:
             link_name.parent.mkdir(exist_ok=True, parents=True)
             link_name.symlink_to(os.path.relpath(dest, link_name.parent))
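
The early-return check added in the second hunk can be tried in isolation with
a sketch like the one below. The helper ensure_symlink is a hypothetical name,
not a function in sbom30.py, and unlike the patch it resolves both sides of
the comparison so that the check also works when dest is not already an
absolute, symlink-free path.

```python
import os
import tempfile
from pathlib import Path


def ensure_symlink(link_name, dest):
    """Create link_name -> dest unless that exact link already exists.

    Returns True if a new symlink was created, False if it was already
    in place.
    """
    if link_name.is_symlink() and link_name.resolve() == dest.resolve():
        return False
    link_name.parent.mkdir(exist_ok=True, parents=True)
    # Relative link target, as in the surrounding code of the patch
    link_name.symlink_to(os.path.relpath(dest, link_name.parent))
    return True


with tempfile.TemporaryDirectory() as tmp:
    dest = Path(tmp) / "recipe.spdx.json"
    dest.write_text("{}")
    link = Path(tmp) / "by-spdxid-hash" / "ab" / "abcd.spdx.json"
    print(ensure_symlink(link, dest))  # first call creates the link
    print(ensure_symlink(link, dest))  # second call finds it in place
```

Because many spdxIds now map to the same hash path, this check is what keeps
repeated calls from attempting to recreate a link that already exists. A link
that exists but points elsewhere would still raise FileExistsError in this
sketch.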