| Message ID | 20241110030741.4108407-1-hongxu.jia@windriver.com |
|---|---|
| State | New |
| Series | [1/3] sbom30.py: reduce redundant spdxid-hash symlinks to save inode on host |
I think what you are trying to get at here is that the same file is present in multiple different "packages", so all references are equally interchangeable? I'm fine if we can figure out a reasonable way to do that, but I don't think this is the correct approach. A better option would be to simply reference the SPDX ID of the previously described file instead of making a new one each time. I don't really like the "magic" in jsonld_hash_path(), which hides what we are actually after (only creating a single file element and referencing it multiple times). This would also conveniently solve the license problem, since only one file element would be created per hash.

However, I think the reason it's done the way it is, is that each instance of the file is at a different path, so you'd lose that information by combining them all into the same file element; although you might still be able to deduplicate the license information.

On Sat, Nov 9, 2024 at 8:07 PM Hongxu Jia <hongxu.jia@windriver.com> wrote:
>
> In order to support all in-scope SPDX data within a single
> JSON-LD file for SPDX 3.0.1, Yocto's SBOM:
> - In native/target/nativesdk recipe, created spdxid-hash symlink
>   for each element to point to the JSON-LD file that contains
>   element details;
> - In image recipe, use spdxid-hash symlink to collect element
>   details from various JSON-LD files
>
> While SPDX_INCLUDE_SOURCES = "1", it adds sources to JSON-LD file
> and create 2N+ spdxid-hash symlinks for N source files.
> (N for software_File, N for hasDeclaredLicense's Relationship)
>
> For large numbers of source files, adding an extra symlink -> real file
> will occupy one more inode (per file), which will need a slot in
> the OS's inode cache. In this situation, disk performance is slow
> and inode is used up quickly
>
> While using function add_package_files to add source files to JSON-LD file,
> the spdxid-hash symlinks for source files point to the same JSON-LD file,
> then according to the format of spdxId
>
> - spdxId of source file:
>   http://spdx.org/spdxdocs/shadow-10e66933-65cf-5a2d-9a1d-99b12a405441/0838759b8d71923d250a0813dda7356ffd309576115bbf8ed7e266cf4aed86a5/sourcefile/1
>
> Remove the count number ('/1') from spdxId suffix, then all
> source files in one recipe will share one spdxid-hash symlink.
>
> The same reason to sysroot and package files
>
> - spdxId of sysroot file:
>   http://spdx.org/spdxdocs/shadow-10e66933-65cf-5a2d-9a1d-99b12a405441/0838759b8d71923d250a0813dda7356ffd309576115bbf8ed7e266cf4aed86a5/sysroot/1
>
> - spdxId of package file:
>   http://spdx.org/spdxdocs/shadow-10e66933-65cf-5a2d-9a1d-99b12a405441/0838759b8d71923d250a0813dda7356ffd309576115bbf8ed7e266cf4aed86a5/package/shadow-src/file/1
>
> Build core-image-minimal with/without this commit, comparing the
> spdxid-hash number, 7 281 824 -> 70 508
>
> echo 'SPDX_INCLUDE_SOURCES = "1"' >> local.conf
>
> With this commit:
> $ time bitbake core-image-minimal
> real    95m6.960s
> user    0m22.832s
> sys     0m4.087s
>
> $ find tmp/deploy/spdx/3.0.1/*/by-spdxid-hash/ -name "*.spdx.json" |wc -l
> 70508
>
> Without this commit:
> $ time bitbake core-image-minimal
> real    100m17.769s
> user    0m24.516s
> sys     0m4.334s
>
> $ find tmp/deploy/spdx/3.0.1/*/by-spdxid-hash -name "*.json" |wc -l
> 7281824
>
> Signed-off-by: Hongxu Jia <hongxu.jia@windriver.com>
> ---
>  meta/lib/oe/sbom30.py | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/meta/lib/oe/sbom30.py b/meta/lib/oe/sbom30.py
> index e3a9428668..4efeaae3a0 100644
> --- a/meta/lib/oe/sbom30.py
> +++ b/meta/lib/oe/sbom30.py
> @@ -911,6 +911,10 @@ def jsonld_arch_path(d, arch, subdir, name, deploydir=None):
>
>
>  def jsonld_hash_path(_id):
> +    # For the spdId added by add_package_files, remove suffix count number
> +    if re.match(r".*/(sourcefile|sysroot|file)/\w+$", _id):
> +        _id = os.path.dirname(_id)
> +
>      h = hashlib.sha256(_id.encode("utf-8")).hexdigest()
>
>      return Path("by-spdxid-hash") / h[:2], h
> @@ -992,6 +996,11 @@ def write_recipe_jsonld_doc(
>          *hash_path,
>          deploydir=deploydir,
>      )
> +
> +    # Return if expected symlink exists
> +    if link_name.is_symlink() and link_name.resolve() == dest:
> +        return hash_path[-1]
> +
>      try:
>          link_name.parent.mkdir(exist_ok=True, parents=True)
>          link_name.symlink_to(os.path.relpath(dest, link_name.parent))
> --
> 2.25.1
>
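For illustration only, the alternative raised in the review (create one file element per content hash and reference its SPDX ID from every place the file appears) might look roughly like the sketch below. `FileRegistry`, `add_file` and the `http://example.invalid/doc` namespace are hypothetical names invented for this example; they are not part of sbom30.py or any SPDX library. Paths are recorded separately per element, which is one possible answer to the concern about losing per-path information.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FileRegistry:
    """Hypothetical registry that keeps one file element per content hash."""

    doc_namespace: str
    by_hash: dict = field(default_factory=dict)  # sha256 hex digest -> spdxId
    paths: dict = field(default_factory=dict)    # spdxId -> list of paths seen

    def add_file(self, path: str, content: bytes) -> str:
        """Return the spdxId for this content, creating it only on first sight."""
        digest = hashlib.sha256(content).hexdigest()
        spdx_id = self.by_hash.get(digest)
        if spdx_id is None:
            # First time this content is seen: allocate a new (made-up) ID.
            spdx_id = f"{self.doc_namespace}/file/{len(self.by_hash) + 1}"
            self.by_hash[digest] = spdx_id
        # Remember every path at which this content was found.
        self.paths.setdefault(spdx_id, []).append(path)
        return spdx_id


reg = FileRegistry("http://example.invalid/doc")
a = reg.add_file("usr/bin/tool", b"same bytes")
b = reg.add_file("usr/sbin/tool", b"same bytes")
c = reg.add_file("etc/tool.conf", b"other bytes")
print(a == b, a != c, len(reg.by_hash))  # True True 2
```

With this shape, license information would be attached once per element rather than once per path; whether that fits the existing object model in sbom30.py is an open question that the sketch does not settle.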
diff --git a/meta/lib/oe/sbom30.py b/meta/lib/oe/sbom30.py
index e3a9428668..4efeaae3a0 100644
--- a/meta/lib/oe/sbom30.py
+++ b/meta/lib/oe/sbom30.py
@@ -911,6 +911,10 @@ def jsonld_arch_path(d, arch, subdir, name, deploydir=None):
 
 
 def jsonld_hash_path(_id):
+    # For the spdId added by add_package_files, remove suffix count number
+    if re.match(r".*/(sourcefile|sysroot|file)/\w+$", _id):
+        _id = os.path.dirname(_id)
+
     h = hashlib.sha256(_id.encode("utf-8")).hexdigest()
 
     return Path("by-spdxid-hash") / h[:2], h
@@ -992,6 +996,11 @@ def write_recipe_jsonld_doc(
         *hash_path,
         deploydir=deploydir,
     )
+
+    # Return if expected symlink exists
+    if link_name.is_symlink() and link_name.resolve() == dest:
+        return hash_path[-1]
+
     try:
         link_name.parent.mkdir(exist_ok=True, parents=True)
         link_name.symlink_to(os.path.relpath(dest, link_name.parent))
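To make the effect of the first hunk concrete, here is a small standalone sketch that copies the body of jsonld_hash_path() before and after the patch and feeds both versions IDs in the format quoted in the commit message. The function names `jsonld_hash_path_original` and `jsonld_hash_path_patched` and the `BASE` constant are labels chosen for this example only; the counter values 1 to 3 are made up.

```python
import hashlib
import os
import re
from pathlib import Path


def jsonld_hash_path_original(_id):
    # Behaviour before the patch: every distinct spdxId gets its own hash.
    h = hashlib.sha256(_id.encode("utf-8")).hexdigest()
    return Path("by-spdxid-hash") / h[:2], h


def jsonld_hash_path_patched(_id):
    # Behaviour after the patch: strip the trailing counter from file IDs
    # so that all files in one group hash to the same value.
    if re.match(r".*/(sourcefile|sysroot|file)/\w+$", _id):
        _id = os.path.dirname(_id)
    h = hashlib.sha256(_id.encode("utf-8")).hexdigest()
    return Path("by-spdxid-hash") / h[:2], h


BASE = (
    "http://spdx.org/spdxdocs/shadow-10e66933-65cf-5a2d-9a1d-99b12a405441/"
    "0838759b8d71923d250a0813dda7356ffd309576115bbf8ed7e266cf4aed86a5"
)
ids = [f"{BASE}/sourcefile/{n}" for n in range(1, 4)]

original_links = {jsonld_hash_path_original(i)[1] for i in ids}
patched_links = {jsonld_hash_path_patched(i)[1] for i in ids}
print(len(original_links), len(patched_links))  # 3 1
```

The patched function no longer maps an spdxId to a unique link name, so a lookup by hash can only say which JSON-LD file holds the element, not which element it is. That is the "magic" the review comment objects to.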
To support all in-scope SPDX data within a single JSON-LD file for
SPDX 3.0.1, Yocto's SBOM:
- In native/target/nativesdk recipes, creates a spdxid-hash symlink
  for each element, pointing to the JSON-LD file that contains the
  element details;
- In image recipes, uses the spdxid-hash symlinks to collect element
  details from various JSON-LD files.

With SPDX_INCLUDE_SOURCES = "1", it adds sources to the JSON-LD file
and creates 2N+ spdxid-hash symlinks for N source files
(N for software_File, N for hasDeclaredLicense's Relationship).

For large numbers of source files, each extra symlink -> real file
occupies one more inode (per file), which needs a slot in the OS's
inode cache. In this situation, disk performance is slow and inodes
are used up quickly.

When add_package_files is used to add source files to a JSON-LD file,
the spdxid-hash symlinks for those source files all point to the same
JSON-LD file. Given the format of the spdxId:

- spdxId of source file:
  http://spdx.org/spdxdocs/shadow-10e66933-65cf-5a2d-9a1d-99b12a405441/0838759b8d71923d250a0813dda7356ffd309576115bbf8ed7e266cf4aed86a5/sourcefile/1

removing the count number ('/1') from the spdxId suffix lets all
source files in one recipe share one spdxid-hash symlink.

The same applies to sysroot and package files:

- spdxId of sysroot file:
  http://spdx.org/spdxdocs/shadow-10e66933-65cf-5a2d-9a1d-99b12a405441/0838759b8d71923d250a0813dda7356ffd309576115bbf8ed7e266cf4aed86a5/sysroot/1

- spdxId of package file:
  http://spdx.org/spdxdocs/shadow-10e66933-65cf-5a2d-9a1d-99b12a405441/0838759b8d71923d250a0813dda7356ffd309576115bbf8ed7e266cf4aed86a5/package/shadow-src/file/1

Build core-image-minimal with and without this commit and compare the
number of spdxid-hash symlinks: 7 281 824 -> 70 508

echo 'SPDX_INCLUDE_SOURCES = "1"' >> local.conf

With this commit:
$ time bitbake core-image-minimal
real    95m6.960s
user    0m22.832s
sys     0m4.087s

$ find tmp/deploy/spdx/3.0.1/*/by-spdxid-hash/ -name "*.spdx.json" |wc -l
70508

Without this commit:
$ time bitbake core-image-minimal
real    100m17.769s
user    0m24.516s
sys     0m4.334s

$ find tmp/deploy/spdx/3.0.1/*/by-spdxid-hash -name "*.json" |wc -l
7281824

Signed-off-by: Hongxu Jia <hongxu.jia@windriver.com>
---
 meta/lib/oe/sbom30.py | 9 +++++++++
 1 file changed, 9 insertions(+)
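As a quick sanity check on the figures reported in the commit message, the arithmetic can be spelled out as below. The inputs are taken directly from the message; the variable names and the `to_seconds` helper are only for this example.

```python
# Symlink counts reported in the commit message.
links_before = 7_281_824
links_after = 70_508

removed = links_before - links_after
ratio = links_before / links_after


def to_seconds(minutes, seconds):
    """Convert a 'real' time from `time` output (XmY.YYYs) to seconds."""
    return minutes * 60 + seconds


# Wall-clock build times reported in the commit message.
before_s = to_seconds(100, 17.769)
after_s = to_seconds(95, 6.960)
saved_s = before_s - after_s
saved_pct = 100 * saved_s / before_s

print(f"symlinks removed: {removed}")          # symlinks removed: 7211316
print(f"reduction factor: {ratio:.1f}x")       # reduction factor: 103.3x
print(f"wall time saved:  {saved_s:.1f} s")    # wall time saved:  310.8 s
print(f"relative saving:  {saved_pct:.1f} %")  # relative saving:  5.2 %
```

Each timing comes from a single build, so the roughly 5 % wall-time difference may partly reflect run-to-run variation. The two counts were also collected with slightly different `-name` patterns ("*.spdx.json" versus "*.json"), as shown in the message.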