[RFC] Implementation of a submodule fetcher

Message ID 20211215142449.13469-1-pontusja@lap5cg1077w4w.se.axis.com
State New
Headers show
Series [RFC] Implementation of a submodule fetcher | expand

Commit Message

Pontus Jaensson Dec. 15, 2021, 2:24 p.m. UTC
From: Pontus Jaensson <pontusja@axis.com>

This patch consists of a Submodule class which inherits from and extends the FetchMethod-class
for retrieving, initializing and updating a submodule directory. The main reason for implementing
it is to make it possible to track the version of a packet with git instead of just a recipe.

The submodule fetcher is used by using the following syntax in the SRC_URI field in the recipe:
SRC_URI = submodule://<PATH TO TOP LEVEL DIRECTORY>;localpath=<NAME OF THE WANTED SUBMODULE>.
Where the submodule-keyword is necessary to use in order to call this specific fetcher.

In the download-method the URI is then parsed and the relevant parts from it are extracted and
used to find information about the module in the .gitmodules-file in the top level git directory.
The final thing that happens in the download-method is that git submodule init & git submodule
update are called and executed at the right location for the specified submodule.

In the unpack-method the empty directory, that is being created during the process but contains no
information, is removed and replaced with a symlink to the directory that contains the actual
source code.

The code has been tested and so far it seems to work fine.
---
 bitbake/lib/bb/fetch2/__init__.py  |   2 +
 bitbake/lib/bb/fetch2/submodule.py | 123 +++++++++++++++++++++++++++++
 2 files changed, 125 insertions(+)
 create mode 100644 bitbake/lib/bb/fetch2/submodule.py

Comments

Quentin Schulz Dec. 15, 2021, 2:42 p.m. UTC | #1
Hi Pontus,

On Wed, Dec 15, 2021 at 03:24:50PM +0100, Pontus Jaensson wrote:
> From: Pontus Jaensson <pontusja@axis.com>
> 
> This patch consists of a Submodule class which inherits from and extends the FetchMethod-class
> for retrieving, initializing and updating a submodule directory. The main reason for implementing
> it is to make it possible to track the version of a packet with git instead of just a recipe.
> 
> The submodule fetcher is used by using the following syntax in the SRC_URI field in the recipe:
> SRC_URI = submodule://<PATH TO TOP LEVEL DIRECTORY>;localpath=<NAME OF THE WANTED SUBMODULE>.
> Where the submodule-keyword is necessary to use in order to call this specific fetcher.
> 
> In the download-method the URI is then parsed and the relevant parts from it are extracted and
> used to find information about the module in the .gitmodules-file in the top level git directory.
> The final thing that happens in the download-method is that git submodule init & git submodule
> update are called and executed at the right location for the specified submodule.
> 
> In the unpack-method the empty directory, that is being created during the process but contains no
> information, is removed and replaced with a symlink to the directory that contains the actual
> source code.
> 
> The code has been tested and so far it seems to work fine.

Your Signed-off-by is missing. Please read https://www.openembedded.org/wiki/How_to_submit_a_patch_to_OpenEmbedded

We already have a gitsm fetcher, c.f.
https://docs.yoctoproject.org/bitbake/bitbake-user-manual/bitbake-user-manual-fetching.html#git-submodule-fetcher-gitsm
what is missing in this fetcher to be able to support your usecase? I
think it makes more sense to extend it than create a new fetcher if only
a few features are missing.

Cheers,
Quentin
Alexander Kanavin Dec. 15, 2021, 2:44 p.m. UTC | #2
Apologies, but this implements non-reproducible fetching (one submodule
revision today, another tomorrow), and therefore unlikely to be accepted.
You can use gitsm:// to check out a specific top level revision, then build
just the submodule.

Alex

On Wed, 15 Dec 2021 at 15:25, Pontus Jaensson <pontus.jaensson@axis.com>
wrote:

> From: Pontus Jaensson <pontusja@axis.com>
>
> This patch consists of a Submodule class which inherits from and extends
> the FetchMethod-class
> for retrieving, initializing and updating a submodule directory. The main
> reason for implementing
> it is to make it possible to track the version of a packet with git
> instead of just a recipe.
>
> The submodule fetcher is used by using the following syntax in the SRC_URI
> field in the recipe:
> SRC_URI = submodule://<PATH TO TOP LEVEL DIRECTORY>;localpath=<NAME OF THE
> WANTED SUBMODULE>.
> Where the submodule-keyword is necessary to use in order to call this
> specific fetcher.
>
> In the download-method the URI is then parsed and the relevant parts from
> it are extracted and
> used to find information about the module in the .gitmodules-file in the
> top level git directory.
> The final thing that happens in the download-method is that git submodule
> init & git submodule
> update are called and executed at the right location for the specified
> submodule.
>
> In the unpack-method the empty directory, that is being created during the
> process but contains no
> information, is removed and replaced with a symlink to the directory that
> contains the actual
> source code.
>
> The code has been tested and so far it seems to work fine.
> ---
>  bitbake/lib/bb/fetch2/__init__.py  |   2 +
>  bitbake/lib/bb/fetch2/submodule.py | 123 +++++++++++++++++++++++++++++
>  2 files changed, 125 insertions(+)
>  create mode 100644 bitbake/lib/bb/fetch2/submodule.py
>
> diff --git a/bitbake/lib/bb/fetch2/__init__.py
> b/bitbake/lib/bb/fetch2/__init__.py
> index 6a38cb0955..a87e0b5338 100644
> --- a/bitbake/lib/bb/fetch2/__init__.py
> +++ b/bitbake/lib/bb/fetch2/__init__.py
> @@ -1923,6 +1923,7 @@ class FetchConnectionCache(object):
>  from . import cvs
>  from . import git
>  from . import gitsm
> +from . import submodule
>  from . import gitannex
>  from . import local
>  from . import svn
> @@ -1945,6 +1946,7 @@ methods.append(wget.Wget())
>  methods.append(svn.Svn())
>  methods.append(git.Git())
>  methods.append(gitsm.GitSM())
> +methods.append(submodule.Submodule())
>  methods.append(gitannex.GitANNEX())
>  methods.append(cvs.Cvs())
>  methods.append(ssh.SSH())
> diff --git a/bitbake/lib/bb/fetch2/submodule.py
> b/bitbake/lib/bb/fetch2/submodule.py
> new file mode 100644
> index 0000000000..ef46404c2e
> --- /dev/null
> +++ b/bitbake/lib/bb/fetch2/submodule.py
> @@ -0,0 +1,123 @@
> +"""
> +BitBake 'Fetch' implementation for local git submodules.
> +
> +Inherits from and extends the FetchMethod for retrieving, initializing and
> +updating a submodule directory.
> +
> +SRC_URI = "submodule://<PATH TO TOP LEVEL GIT
> DIRECTORY>;localpath=<SUBMODULE NAME>"
> +
> +NOTE: If the name of the submodule contains space(s), please indicate
> this by placing
> +the <SUBMODULE NAME> within a single quotation mark and replace the
> space(s) with a
> +backslash (\). Ex) If the submodule name is: name with spaces, then write
> the
> +following: localpath='name\with\spaces'.
> +
> +"""
> +
> +# Pontus Jaensson 2021
> +#
> +# SPDX-License-Identifier: GPL-2.0-only
> +#
> +
> +import os
> +import subprocess
> +from re import compile
> +from bb.fetch2 import FetchMethod
> +
> +class Submodule(FetchMethod):
> +
> +    def supports(self, ud, d):
> +        """
> +        Checks if the given url can be fetched with submodule.
> +        """
> +        return ud.type in ['submodule']
> +
> +    def download(self, ud, d):
> +        """
> +        Finds the path and the location to the submodule in the local
> .gitmodules
> +        file and then passes it to the execution method.
> +        """
> +        module_path, module_location = self._prepare_module_data(ud.url)
> +
> +        self._run_command(f'git submodule init {module_path}',
> module_location)
> +        self._run_command(f'git submodule update {module_path}',
> module_location)
> +
> +    def unpack(self, ud, root, d):
> +        """
> +        Is called once the 'download' of the submodule has been completed
> in order
> +        to ensure that the source code is in the right location thanks to
> symlinks.
> +        """
> +        path, location = self._prepare_module_data(ud.url)
> +        abs_path = location + path
> +
> +        try:
> +            subprocess.check_output(['rm', '-r', 'gitmodule'])
> +            subprocess.check_output(['ln', '-s', abs_path, 'gitmodule'])
> +        except subprocess.CalledProcessError as e:
> +            raise Exception(f'Failed setting up the symlinks correctly')
> from e
> +
> +    def _run_command(self, cmd, location):
> +        """
> +        Runs the provided cmd command at the path specified by the
> location argument.
> +        Raises an error if the location is unvalid or if the cmd command
> fails.
> +        """
> +        try:
> +            modules_git_command = cmd.split(' ')[:3]
> +            module_path = cmd.split(' ')[3:]
> +            os.chdir(location)
> +            if len(module_path) > 1:
> +                module_path = [' '.join(module_path)]
> +            cmd_array = modules_git_command + module_path
> +            subprocess.check_output(cmd_array)
> +        except OSError as e:
> +            raise Exception(f'Not able to change current working
> directory to: {location}') from e
> +        except subprocess.CalledProcessError as e:
> +            raise Exception(f'Not able to run command: {cmd} at
> {location}') from e
> +
> +    def _parse_url(self, url):
> +        """
> +        Helper method for deconstructing the url from the user data.
> +        """
> +        res =
> compile('([^:]*)://(([^/;]+)@)?(?P<location>[^;]+)(;(?P<submodule>.*))?').match(url)
> +        module_location = res.group('location') + '/'
> +        submodule = res.group('submodule').split('=')[1]
> +        return module_location, submodule
> +
> +    def _prepare_module_data(self, url):
> +        """
> +        Helper method for preparing and processing the submodule data
> based on the
> +        url provided.
> +        """
> +        module_location, name = self._parse_url(url)
> +        modules_path = os.path.abspath(module_location + '.gitmodules')
> +        try:
> +            with open(modules_path) as file:
> +                gitmodule_data = file.readlines()
> +        except:
> +            raise Exception(f'Could not open .gitmodules file located at:
> {modules_path}')
> +
> +        return self._process_submodules_info(gitmodule_data, name),
> module_location
> +
> +    def _process_submodules_info(self, gitmodule_data, wanted_module):
> +        """
> +        Helper method for iterating through the provided gitmodule_data
> in order to find
> +        the path to the wanted_module. Returns it if it is found
> otherwise raises an Exception.
> +        """
> +        module_path = ''
> +
> +        if wanted_module[0] == wanted_module[-1] == "'":
> +            wanted_module = wanted_module[1:-1].replace('\\', ' ')
> +
> +        for line in gitmodule_data:
> +            if line.startswith('[submodule'):
> +                name = line.split('"')[1]
> +                if name == wanted_module:
> +                    current_module = True
> +                else:
> +                    current_module = False
> +            elif line.strip().startswith('path') and current_module:
> +                module_path = line.split('=')[1].strip()
> +                break
> +        if module_path:
> +            return module_path
> +        else:
> +            raise ValueError(f'{wanted_module} is not a valid submodule
> name.')
> --
> 2.20.1
>
>
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
> View/Reply Online (#13168):
> https://lists.openembedded.org/g/bitbake-devel/message/13168
> Mute This Topic: https://lists.openembedded.org/mt/87744763/1686489
> Group Owner: bitbake-devel+owner@lists.openembedded.org
> Unsubscribe: https://lists.openembedded.org/g/bitbake-devel/unsub [
> alex.kanavin@gmail.com]
> -=-=-=-=-=-=-=-=-=-=-=-
>
>
Alexander Kanavin Dec. 16, 2021, 8:15 a.m. UTC | #3
On Thu, 16 Dec 2021 at 06:31, Fredrik Gustafsson <
Fredrik.Gustafsson@axis.com> wrote:

> bitbake has already support for non reproduceable builds, you can set
> SRCREV to a branch, which will change. This patch is actually for
> introducing reproducible builds, however the information is in git and not
> in bb-files.
>
> The usecase here is traceability. Let's say you're developing a package
> and you want to integrate that to your bitbake system. It will require two
> commits, one to the package repository and one to the layer repository. You
> can get around this by settings SRCREV to a branch but then you no longer
> get reproduceable builds.
>
> To solve this, we have built a service that once a package is updated, the
> bb-file in the layer repository is automatically rewritten so match the new
> version of the package. This is error prone and cumbersome. Using the layer
> repository as a superproject will made it possible to use gerrits
> superproject subscription that makes the layer repository to be updated
> each time the submodule (the package) is updated. This gives us
> reproduceable builds, only one commit to introduce a change and no inhouse
> developed services.
>

Apologies again, but without organizationally enforcing submodule structure
this fetcher is not providing reproducible behaviour. If there are no
source code checksums available for bitbake to verify during fetch, then
there's no guarantee that the component repo has not been modified, and
that guarantee must be technical, not organizational. The requirement for
all fetchers is that they provide that ability: you can switch it off if
you want to, but that is not allowed to be the default (or only) behavior.

If the problem is updating SRCREVs in a recipe, there is a standard,
official way to do it: 'devtool upgrade'. Would that work for you?

Alex
Richard Purdie Dec. 16, 2021, 9:16 a.m. UTC | #4
On Wed, 2021-12-15 at 15:24 +0100, Pontus Jaensson wrote:
> From: Pontus Jaensson <pontusja@axis.com>
> 
> This patch consists of a Submodule class which inherits from and extends the FetchMethod-class
> for retrieving, initializing and updating a submodule directory. The main reason for implementing
> it is to make it possible to track the version of a packet with git instead of just a recipe.
> 
> The submodule fetcher is used by using the following syntax in the SRC_URI field in the recipe:
> SRC_URI = submodule://<PATH TO TOP LEVEL DIRECTORY>;localpath=<NAME OF THE WANTED SUBMODULE>.
> Where the submodule-keyword is necessary to use in order to call this specific fetcher.
> 
> In the download-method the URI is then parsed and the relevant parts from it are extracted and
> used to find information about the module in the .gitmodules-file in the top level git directory.
> The final thing that happens in the download-method is that git submodule init & git submodule
> update are called and executed at the right location for the specified submodule.
> 
> In the unpack-method the empty directory, that is being created during the process but contains no
> information, is removed and replaced with a symlink to the directory that contains the actual
> source code.
> 
> The code has been tested and so far it seems to work fine.
> ---
>  bitbake/lib/bb/fetch2/__init__.py  |   2 +
>  bitbake/lib/bb/fetch2/submodule.py | 123 +++++++++++++++++++++++++++++
>  2 files changed, 125 insertions(+)
>  create mode 100644 bitbake/lib/bb/fetch2/submodule.py

As others have said, why do we need two different fetchers to work with git
submodules? I really dislike having two ways of doing something, it is bad for
usability and suggests a problem in the underlying code, particularly when it
isn't clear when one should be used and when the other should.

Does the fetcher meet the design goals in:

https://git.yoctoproject.org/poky/tree/bitbake/lib/bb/fetch2/README

?

I'd also note there are no tests, not that adding any would change my review
given the above points.

Cheers,

Richard
Pontus Jaensson Dec. 16, 2021, 1:36 p.m. UTC | #5
Hi Richard,

Thanks for your comments. Regarding your question whether or not the design goals are followed I have the following motivation:

a) Yes, the fetcher follows the same structure as the other fetchers.
b) Done.
c) You can’t for now. DL_DIR is not needed to be inside a layer. Fredrik has two ideas about how to solve it but we would very much want to hear your opinion about any of them is a viable solution. The idea is that it is possible to solve the problem by a symlink or by copying the content of the submodule to DL_DIR. Do you think either of these suggestions can solve the problem?
d) I don’t completely understand this one but once the do_fetch() is done, there is no more network access.
e) It should be. As long as the layer is a git repository and still has its git-database it’s reproducible.
f) done.
g) done.
h) Not yet implemented. I can look up how it is done in other git:// recipes in the meta/layer and do my best to implement it here as well.
i) Should already be done since no new tools are used.
j) Not implemented??
k) Not implemented yet.
l) It does not use any tools.

Regarding your other question why we do need two different fetchers to work with git
submodules I use the same answer as I did to Quentin. All though they are both using submodule I don't think they are related and should have a shared functionality. gitsm:// is allowing a package to have submodules, my version submodules:// is allowing a layer to have submodules. I am aware that the naming is not perfect for the intended purpose and as Fredrik answered you before I am happy to receive suggestions for a better naming. I personally can see a reason for using the submodule:// fetcher but I don't understand why the gitsm:// fetcher is not just an option to the git:// fetcher.

Hope this answered your questions, please come back if you have more or if there is something that I need to explain further.

Best Regards
/ Pontus
Pontus Jaensson Dec. 16, 2021, 1:37 p.m. UTC | #6
Hi Quentin,

Thank you for your feedback and thanks for noticing that I missed the Signed-off. The answer to your question about why I just not simply extend the already existing gitsm fetcher instead of writing my own is that all though they are both using submodule I don't think they are related and should have a shared functionality. gitsm:// is allowing a package to have submodules, my version submodules:// is allowing a layer to have submodules. I am aware that the naming is not perfect for the intended purpose, it's a quite big difference. I personally can see a reason for using the submodule:// fetcher but I don't understand why the gitsm:// fetcher is not just an option to the git:// fetcher.

If I have understood the current implementation of gitsm it fetches information about the submodules from the main git repository. My implementation uses the local .gitmodules file for initializing and updating the submodules. For this reason we won't have to rely on the SRCREV-tag in recipes and it can be removed completely. Also as previously mentioned by Fredrik, the purpose of this patch is to introduce reproducible builds where the information is stored in git and not in the bb-files. This will significantly improve the traceability since when you are developing a package that should be integrated to the bitbake system you will no longer need to commit both to the package- and the layer repository. To solve this problem, we have developed a service that once a package is updated, the bb-file in the layer repository is automatically rewritten so match the new version of the package. But by using the layer repository as a super project it is possible to use gerrits superproject subscription that makes the layer repository to be updated each time the submodule (the package) is updated. This gives us reproducible builds, only one commit to introduce a change and no in-house developed services.

Hope this clarified my intentions
Best Regards
/ Pontus

Patch

diff --git a/bitbake/lib/bb/fetch2/__init__.py b/bitbake/lib/bb/fetch2/__init__.py
index 6a38cb0955..a87e0b5338 100644
--- a/bitbake/lib/bb/fetch2/__init__.py
+++ b/bitbake/lib/bb/fetch2/__init__.py
@@ -1923,6 +1923,7 @@  class FetchConnectionCache(object):
 from . import cvs
 from . import git
 from . import gitsm
+from . import submodule
 from . import gitannex
 from . import local
 from . import svn
@@ -1945,6 +1946,7 @@  methods.append(wget.Wget())
 methods.append(svn.Svn())
 methods.append(git.Git())
 methods.append(gitsm.GitSM())
+methods.append(submodule.Submodule())
 methods.append(gitannex.GitANNEX())
 methods.append(cvs.Cvs())
 methods.append(ssh.SSH())
diff --git a/bitbake/lib/bb/fetch2/submodule.py b/bitbake/lib/bb/fetch2/submodule.py
new file mode 100644
index 0000000000..ef46404c2e
--- /dev/null
+++ b/bitbake/lib/bb/fetch2/submodule.py
@@ -0,0 +1,123 @@ 
+"""
+BitBake 'Fetch' implementation for local git submodules.
+
+Inherits from and extends the FetchMethod for retrieving, initializing and
+updating a submodule directory.
+
+SRC_URI = "submodule://<PATH TO TOP LEVEL GIT DIRECTORY>;localpath=<SUBMODULE NAME>"
+
+NOTE: If the name of the submodule contains space(s), please indicate this by placing
+the <SUBMODULE NAME> within a single quotation mark and replace the space(s) with a
+backslash (\). Ex) If the submodule name is: name with spaces, then write the
+following: localpath='name\with\spaces'.
+
+"""
+
+# Pontus Jaensson 2021
+#
+# SPDX-License-Identifier: GPL-2.0-only
+#
+
+import os
+import subprocess
+from re import compile
+from bb.fetch2 import FetchMethod
+
+class Submodule(FetchMethod):
+
+    def supports(self, ud, d):
+        """
+        Checks if the given url can be fetched with submodule.
+        """
+        return ud.type in ['submodule']
+
+    def download(self, ud, d):
+        """
+        Finds the path and the location to the submodule in the local .gitmodules
+        file and then passes it to the execution method.
+        """
+        module_path, module_location = self._prepare_module_data(ud.url)
+
+        self._run_command(f'git submodule init {module_path}', module_location)
+        self._run_command(f'git submodule update {module_path}', module_location)
+
+    def unpack(self, ud, root, d):
+        """
+        Is called once the 'download' of the submodule has been completed in order
+        to ensure that the source code is in the right location thanks to symlinks.
+        """
+        path, location = self._prepare_module_data(ud.url)
+        abs_path = location + path
+
+        try:
+            subprocess.check_output(['rm', '-r', 'gitmodule'])
+            subprocess.check_output(['ln', '-s', abs_path, 'gitmodule'])
+        except subprocess.CalledProcessError as e:
+            raise Exception(f'Failed setting up the symlinks correctly') from e
+
+    def _run_command(self, cmd, location):
+        """
+        Runs the provided cmd command at the path specified by the location argument.
+        Raises an error if the location is unvalid or if the cmd command fails.
+        """
+        try:
+            modules_git_command = cmd.split(' ')[:3]
+            module_path = cmd.split(' ')[3:]
+            os.chdir(location)
+            if len(module_path) > 1:
+                module_path = [' '.join(module_path)]
+            cmd_array = modules_git_command + module_path
+            subprocess.check_output(cmd_array)
+        except OSError as e:
+            raise Exception(f'Not able to change current working directory to: {location}') from e
+        except subprocess.CalledProcessError as e:
+            raise Exception(f'Not able to run command: {cmd} at {location}') from e
+
+    def _parse_url(self, url):
+        """
+        Helper method for deconstructing the url from the user data.
+        """
+        res = compile('([^:]*)://(([^/;]+)@)?(?P<location>[^;]+)(;(?P<submodule>.*))?').match(url)
+        module_location = res.group('location') + '/'
+        submodule = res.group('submodule').split('=')[1]
+        return module_location, submodule
+
+    def _prepare_module_data(self, url):
+        """
+        Helper method for preparing and processing the submodule data based on the
+        url provided.
+        """
+        module_location, name = self._parse_url(url)
+        modules_path = os.path.abspath(module_location + '.gitmodules')
+        try:
+            with open(modules_path) as file:
+                gitmodule_data = file.readlines()
+        except:
+            raise Exception(f'Could not open .gitmodules file located at: {modules_path}')
+
+        return self._process_submodules_info(gitmodule_data, name), module_location
+
+    def _process_submodules_info(self, gitmodule_data, wanted_module):
+        """
+        Helper method for iterating through the provided gitmodule_data in order to find
+        the path to the wanted_module. Returns it if it is found otherwise raises an Exception.
+        """
+        module_path = ''
+
+        if wanted_module[0] == wanted_module[-1] == "'":
+            wanted_module = wanted_module[1:-1].replace('\\', ' ')
+
+        for line in gitmodule_data:
+            if line.startswith('[submodule'):
+                name = line.split('"')[1]
+                if name == wanted_module:
+                    current_module = True
+                else:
+                    current_module = False
+            elif line.strip().startswith('path') and current_module:
+                module_path = line.split('=')[1].strip()
+                break
+        if module_path:
+            return module_path
+        else:
+            raise ValueError(f'{wanted_module} is not a valid submodule name.')