diff mbox series

fetch2: Ensure GCP fetcher checks if file exists before download.

Message ID 20231127131147.1128798-1-charlie.johnston@loftorbital.com
State Accepted, archived
Commit 1ab1d36c0af6fc58a974106b61ff4d37da6cb229
Headers show
Series fetch2: Ensure GCP fetcher checks if file exists before download. | expand

Commit Message

Charlie Johnston Nov. 27, 2023, 1:11 p.m. UTC
The GCP fetcher was calling bb.fetch2.check_network_access with
"gsutil stat" as the command, but then never actually ran that
command to check if the file exists. In cases where the file did
not exist in a gs:// premirror, this would lead to an unhandled
exception from do_fetch when the GCP python API tried to perform
the download.

This change resolves that issue by adding a runfetchcmd to call
gsutil.

Signed-off-by: Charlie Johnston <charlie.johnston@loftorbital.com>
---
 lib/bb/fetch2/gcp.py | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Comments

Etienne Cordonnier Aug. 21, 2024, 11:13 a.m. UTC | #1
Hi,
sorry, I'm testing this only now, long after it has been merged.
The second addition of runfetchcmd inside the function checkstatus is not
good in my opinion. It is redundant with the call
to self.gcp_client.bucket(ud.host).blob(path).exists() which already
returns True/False and does not throw an exception in case the file does
not exist. Also the call to gsutil stat is much slower than using the
python client to call exists() so we should not replace the call to
exists() with a call to gsutil stat. I think the intent of calling
check_network_access in checkstatus() was to error-out in case the error is
disabled. We can rather change the string "gsutil stat" to something else
to make the code more readable.

In the download function, I also think it would be better to add a
try/catch block around download_to_filename and throw a FetchError instead
of adding this extra slow [1] call to gsutil stat.

[1]: 2x to 4x slower according to
https://blog.gdeltproject.org/simple-experiments-in-gcs-ls-performance-comparing-gcloud-python-the-gcs-json-api/#:~:text=Unsurprisingly%2C%20the%20GCLOUD%20CLI%20is,library%20offering%202%2D4x%20speedup


Étienne

On Mon, Nov 27, 2023 at 7:14 PM Charlie Johnston <
charlie.johnston@loftorbital.com> wrote:

> The GCP fetcher was calling bb.fetch2.check_network_access with
> "gsutil stat" as the command, but then never actually ran that
> command to check if the file exists. In cases where the file did
> not exist in a gs:// premirror, this would lead to an unhandled
> exception from do_fetch when the GCP python API tried to perform
> the download.
>
> This change resolves that issue by adding a runfetchcmd to call
> gsutil.
>
> Signed-off-by: Charlie Johnston <charlie.johnston@loftorbital.com>
> ---
>  lib/bb/fetch2/gcp.py | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/lib/bb/fetch2/gcp.py b/lib/bb/fetch2/gcp.py
> index f42c81fd..f40ce2ea 100644
> --- a/lib/bb/fetch2/gcp.py
> +++ b/lib/bb/fetch2/gcp.py
> @@ -47,6 +47,7 @@ class GCP(FetchMethod):
>              ud.basename = os.path.basename(ud.path)
>
>          ud.localfile = d.expand(urllib.parse.unquote(ud.basename))
> +        ud.basecmd = "gsutil stat"
>
>      def get_gcp_client(self):
>          from google.cloud import storage
> @@ -61,7 +62,8 @@ class GCP(FetchMethod):
>          if self.gcp_client is None:
>              self.get_gcp_client()
>
> -        bb.fetch2.check_network_access(d, "gsutil stat", ud.url)
> +        bb.fetch2.check_network_access(d, ud.basecmd,
> f"gs://{ud.host}{ud.path}")
> +        runfetchcmd("%s %s" % (ud.basecmd, f"gs://{ud.host}{ud.path}"), d)
>
>          # Path sometimes has leading slash, so strip it
>          path = ud.path.lstrip("/")
> @@ -88,7 +90,8 @@ class GCP(FetchMethod):
>          if self.gcp_client is None:
>              self.get_gcp_client()
>
> -        bb.fetch2.check_network_access(d, "gsutil stat", ud.url)
> +        bb.fetch2.check_network_access(d, ud.basecmd,
> f"gs://{ud.host}{ud.path}")
> +        runfetchcmd("%s %s" % (ud.basecmd, f"gs://{ud.host}{ud.path}"), d)
>
>          # Path sometimes has leading slash, so strip it
>          path = ud.path.lstrip("/")
> --
> 2.39.2
>
>
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
> View/Reply Online (#15576):
> https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.openembedded.org_g_bitbake-2Ddevel_message_15576&d=DwIFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=9t4tVhNJHm_a-j81RmtjjjGs-dK8z3mbf1k2KnTXY2QSV7n3ZFg3rZKGKnIzpw7L&s=Ufc44GrQXbX_l46qNc5Y5lxJcIMmKJ5bdnDVQW4EFu8&e=
> Mute This Topic:
> https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.openembedded.org_mt_102834834_7048771&d=DwIFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=9t4tVhNJHm_a-j81RmtjjjGs-dK8z3mbf1k2KnTXY2QSV7n3ZFg3rZKGKnIzpw7L&s=9hBl_2FJgpF4ZQnus1TTZe8pJVWYK-n-Wv8mBBTNYyc&e=
> Group Owner: bitbake-devel+owner@lists.openembedded.org
> Unsubscribe:
> https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.openembedded.org_g_bitbake-2Ddevel_unsub&d=DwIFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=9t4tVhNJHm_a-j81RmtjjjGs-dK8z3mbf1k2KnTXY2QSV7n3ZFg3rZKGKnIzpw7L&s=gvGokflWmVxRC7GwObSJV_2_JvuDVfe-sh2jVMegqDU&e=
> [ecordonnier@snap.com]
> -=-=-=-=-=-=-=-=-=-=-=-
>
>
Etienne Cordonnier Aug. 21, 2024, 11:19 a.m. UTC | #2
"I think the intent of calling check_network_access in checkstatus() was to
error-out in case the error is disabled"
There is a typo: I meant "in case the network is disabled"

Étienne

On Wed, Aug 21, 2024 at 1:13 PM Etienne Cordonnier via
lists.openembedded.org <ecordonnier=snap.com@lists.openembedded.org> wrote:

> Hi,
> sorry, I'm testing this only now, long after it has been merged.
> The second addition of runfetchcmd inside the function checkstatus is not
> good in my opinion. It is redundant with the call
> to self.gcp_client.bucket(ud.host).blob(path).exists() which already
> returns True/False and does not throw an exception in case the file does
> not exist. Also the call to gsutil stat is much slower than using the
> python client to call exists() so we should not replace the call to
> exists() with a call to gsutil stat. I think the intent of calling
> check_network_access in checkstatus() was to error-out in case the error is
> disabled. We can rather change the string "gsutil stat" to something else
> to make the code more readable.
>
> In the download function, I also think it would be better to add a
> try/catch block around download_to_filename and throw a FetchError instead
> of adding this extra slow [1] call to gsutil stat.
>
> [1]: 2x to 4x slower according to
> https://blog.gdeltproject.org/simple-experiments-in-gcs-ls-performance-comparing-gcloud-python-the-gcs-json-api/#:~:text=Unsurprisingly%2C%20the%20GCLOUD%20CLI%20is,library%20offering%202%2D4x%20speedup
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__blog.gdeltproject.org_simple-2Dexperiments-2Din-2Dgcs-2Dls-2Dperformance-2Dcomparing-2Dgcloud-2Dpython-2Dthe-2Dgcs-2Djson-2Dapi_-23-3A-7E-3Atext-3DUnsurprisingly-252C-2520the-2520GCLOUD-2520CLI-2520is-2Clibrary-2520offering-25202-252D4x-2520speedup&d=DwMFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=FS9gvIUFPiKVDga08z06aINbaxjo5mnE40H4SNdvI43ssnr5oya_Wybo84xMWtdH&s=ibgZhJYtKknknj7Hy4XRLEYowUfGpK2h0LsAyxnmZNs&e=>
>
>
> Étienne
>
> On Mon, Nov 27, 2023 at 7:14 PM Charlie Johnston <
> charlie.johnston@loftorbital.com> wrote:
>
>> The GCP fetcher was calling bb.fetch2.check_network_access with
>> "gsutil stat" as the command, but then never actually ran that
>> command to check if the file exists. In cases where the file did
>> not exist in a gs:// premirror, this would lead to an unhandled
>> exception from do_fetch when the GCP python API tried to perform
>> the download.
>>
>> This change resolves that issue by adding a runfetchcmd to call
>> gsutil.
>>
>> Signed-off-by: Charlie Johnston <charlie.johnston@loftorbital.com>
>> ---
>>  lib/bb/fetch2/gcp.py | 7 +++++--
>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/lib/bb/fetch2/gcp.py b/lib/bb/fetch2/gcp.py
>> index f42c81fd..f40ce2ea 100644
>> --- a/lib/bb/fetch2/gcp.py
>> +++ b/lib/bb/fetch2/gcp.py
>> @@ -47,6 +47,7 @@ class GCP(FetchMethod):
>>              ud.basename = os.path.basename(ud.path)
>>
>>          ud.localfile = d.expand(urllib.parse.unquote(ud.basename))
>> +        ud.basecmd = "gsutil stat"
>>
>>      def get_gcp_client(self):
>>          from google.cloud import storage
>> @@ -61,7 +62,8 @@ class GCP(FetchMethod):
>>          if self.gcp_client is None:
>>              self.get_gcp_client()
>>
>> -        bb.fetch2.check_network_access(d, "gsutil stat", ud.url)
>> +        bb.fetch2.check_network_access(d, ud.basecmd,
>> f"gs://{ud.host}{ud.path}")
>> +        runfetchcmd("%s %s" % (ud.basecmd, f"gs://{ud.host}{ud.path}"),
>> d)
>>
>>          # Path sometimes has leading slash, so strip it
>>          path = ud.path.lstrip("/")
>> @@ -88,7 +90,8 @@ class GCP(FetchMethod):
>>          if self.gcp_client is None:
>>              self.get_gcp_client()
>>
>> -        bb.fetch2.check_network_access(d, "gsutil stat", ud.url)
>> +        bb.fetch2.check_network_access(d, ud.basecmd,
>> f"gs://{ud.host}{ud.path}")
>> +        runfetchcmd("%s %s" % (ud.basecmd, f"gs://{ud.host}{ud.path}"),
>> d)
>>
>>          # Path sometimes has leading slash, so strip it
>>          path = ud.path.lstrip("/")
>> --
>> 2.39.2
>>
>>
>>
>>
>>
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
> View/Reply Online (#16508):
> https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.openembedded.org_g_bitbake-2Ddevel_message_16508&d=DwIFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=FS9gvIUFPiKVDga08z06aINbaxjo5mnE40H4SNdvI43ssnr5oya_Wybo84xMWtdH&s=9ve9G6OxtTRxfAM7OJKYpbrpPAr0DtKiclOarHuYIMM&e=
> Mute This Topic:
> https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.openembedded.org_mt_102834834_7048771&d=DwIFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=FS9gvIUFPiKVDga08z06aINbaxjo5mnE40H4SNdvI43ssnr5oya_Wybo84xMWtdH&s=wIeCjd0uxsM__rakYQuKwjp7vPi-fBe5s6ieUEf0rV4&e=
> Group Owner: bitbake-devel+owner@lists.openembedded.org
> Unsubscribe:
> https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.openembedded.org_g_bitbake-2Ddevel_unsub&d=DwIFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=FS9gvIUFPiKVDga08z06aINbaxjo5mnE40H4SNdvI43ssnr5oya_Wybo84xMWtdH&s=EGE4mfxQZBOZuuRais6PCFAwDMGDIrGnieFa9dJa4ZQ&e=
> [ecordonnier@snap.com]
> -=-=-=-=-=-=-=-=-=-=-=-
>
>
Charlie Johnston Aug. 21, 2024, 5:41 p.m. UTC | #3
Hello,

I actually ran into the slowness of gsutil stat recently and have been testing a patch locally that does something similar to what you describe. The difference is that I modified it to drop the runfetchcomd and raise a FetchError if self.gcp_client.bucket(ud.host).blob(path).exists() is false.
Is that approach acceptable, or is it better to bundle it into the download_to_filename logic as you describe?

Regards,
Charlie
Etienne Cordonnier Aug. 22, 2024, 10:37 a.m. UTC | #4
Hi Charlie,
I tested this patch locally, but I'm using google cloud only as state-cache
mirror, and I tested that it works when I try to call download() on a file
which doesn't exist in the google cloud bucket. Does it work for your
use-case?

Étienne

From 2d10a74b01e4b6d9a7b7f94e62784e345ec85f0a Mon Sep 17 00:00:00 2001
From: Etienne Cordonnier <ecordonnier@snap.com>
Date: Wed, 21 Aug 2024 13:34:34 +0200
Subject: [PATCH] gcp.py: remove slow calls to gsutil stat

Signed-off-by: Etienne Cordonnier <ecordonnier@snap.com>
---
 bitbake/lib/bb/fetch2/gcp.py | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/bitbake/lib/bb/fetch2/gcp.py b/bitbake/lib/bb/fetch2/gcp.py
index eb3e0c6a6bb..9ac2f752f6b 100644
--- a/bitbake/lib/bb/fetch2/gcp.py
+++ b/bitbake/lib/bb/fetch2/gcp.py
@@ -19,11 +19,11 @@ Additionally, gsutil must also be installed.

 import os
 import bb
+from google.api_core.exceptions import NotFound
 import urllib.parse, urllib.error
 from bb.fetch2 import FetchMethod
 from bb.fetch2 import FetchError
 from bb.fetch2 import logger
-from bb.fetch2 import runfetchcmd

 class GCP(FetchMethod):
     """
@@ -48,7 +48,6 @@ class GCP(FetchMethod):
             ud.basename = os.path.basename(ud.path)

         ud.localfile = d.expand(urllib.parse.unquote(ud.basename))
-        ud.basecmd = "gsutil stat"

     def get_gcp_client(self):
         from google.cloud import storage
@@ -63,13 +62,15 @@ class GCP(FetchMethod):
         if self.gcp_client is None:
             self.get_gcp_client()

-        bb.fetch2.check_network_access(d, ud.basecmd,
f"gs://{ud.host}{ud.path}")
-        runfetchcmd("%s %s" % (ud.basecmd, f"gs://{ud.host}{ud.path}"), d)
+        bb.fetch2.check_network_access(d, "blob.download_to_filename",
f"gs://{ud.host}{ud.path}")

         # Path sometimes has leading slash, so strip it
         path = ud.path.lstrip("/")
         blob = self.gcp_client.bucket(ud.host).blob(path)
-        blob.download_to_filename(ud.localpath)
+        try:
+            blob.download_to_filename(ud.localpath)
+        except NotFound:
+            raise FetchError("The GCP API threw a NotFound exception")

         # Additional sanity checks copied from the wget class (although
there
         # are no known issues which mean these are required, treat the GCP
API
@@ -91,8 +92,7 @@ class GCP(FetchMethod):
         if self.gcp_client is None:
             self.get_gcp_client()

-        bb.fetch2.check_network_access(d, ud.basecmd,
f"gs://{ud.host}{ud.path}")
-        runfetchcmd("%s %s" % (ud.basecmd, f"gs://{ud.host}{ud.path}"), d)
+        bb.fetch2.check_network_access(d,
"gcp_client.bucket(ud.host).blob(path).exists()",
f"gs://{ud.host}{ud.path}")

         # Path sometimes has leading slash, so strip it
         path = ud.path.lstrip("/")
Etienne Cordonnier Aug. 22, 2024, 11:30 a.m. UTC | #5
"Is that approach acceptable, or is it better to bundle it into the
download_to_filename logic as you describe?"

-> If I understand correctly you added a call to exists() inside the
function download(), before doing the call to download_to_filename()? The
advantage of catching the exception thrown by download_to_filename instead
of adding an additional call to exists() is that it does one network
transaction instead of two in the case where the file really exists. I
think the performance is important there, because there are potentially
thousands of calls being done when using it as a fetch pre-mirror or
sstate-cache mirror. Maybe we need to catch more exceptions than only the
"NotFound" one however.

Étienne


On Thu, Aug 22, 2024 at 12:38 PM Etienne Cordonnier via
lists.openembedded.org <ecordonnier=snap.com@lists.openembedded.org> wrote:

> Hi Charlie,
> I tested this patch locally, but I'm using google cloud only as
> state-cache mirror, and I tested that it works when I try to call
> download() on a file which doesn't exist in the google cloud bucket. Does
> it work for your use-case?
>
> Étienne
>
> From 2d10a74b01e4b6d9a7b7f94e62784e345ec85f0a Mon Sep 17 00:00:00 2001
> From: Etienne Cordonnier <ecordonnier@snap.com>
> Date: Wed, 21 Aug 2024 13:34:34 +0200
> Subject: [PATCH] gcp.py: remove slow calls to gsutil stat
>
> Signed-off-by: Etienne Cordonnier <ecordonnier@snap.com>
> ---
>  bitbake/lib/bb/fetch2/gcp.py | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/bitbake/lib/bb/fetch2/gcp.py b/bitbake/lib/bb/fetch2/gcp.py
> index eb3e0c6a6bb..9ac2f752f6b 100644
> --- a/bitbake/lib/bb/fetch2/gcp.py
> +++ b/bitbake/lib/bb/fetch2/gcp.py
> @@ -19,11 +19,11 @@ Additionally, gsutil must also be installed.
>
>  import os
>  import bb
> +from google.api_core.exceptions import NotFound
>  import urllib.parse, urllib.error
>  from bb.fetch2 import FetchMethod
>  from bb.fetch2 import FetchError
>  from bb.fetch2 import logger
> -from bb.fetch2 import runfetchcmd
>
>  class GCP(FetchMethod):
>      """
> @@ -48,7 +48,6 @@ class GCP(FetchMethod):
>              ud.basename = os.path.basename(ud.path)
>
>          ud.localfile = d.expand(urllib.parse.unquote(ud.basename))
> -        ud.basecmd = "gsutil stat"
>
>      def get_gcp_client(self):
>          from google.cloud import storage
> @@ -63,13 +62,15 @@ class GCP(FetchMethod):
>          if self.gcp_client is None:
>              self.get_gcp_client()
>
> -        bb.fetch2.check_network_access(d, ud.basecmd,
> f"gs://{ud.host}{ud.path}")
> -        runfetchcmd("%s %s" % (ud.basecmd, f"gs://{ud.host}{ud.path}"), d)
> +        bb.fetch2.check_network_access(d, "blob.download_to_filename",
> f"gs://{ud.host}{ud.path}")
>
>          # Path sometimes has leading slash, so strip it
>          path = ud.path.lstrip("/")
>          blob = self.gcp_client.bucket(ud.host).blob(path)
> -        blob.download_to_filename(ud.localpath)
> +        try:
> +            blob.download_to_filename(ud.localpath)
> +        except NotFound:
> +            raise FetchError("The GCP API threw a NotFound exception")
>
>          # Additional sanity checks copied from the wget class (although
> there
>          # are no known issues which mean these are required, treat the
> GCP API
> @@ -91,8 +92,7 @@ class GCP(FetchMethod):
>          if self.gcp_client is None:
>              self.get_gcp_client()
>
> -        bb.fetch2.check_network_access(d, ud.basecmd,
> f"gs://{ud.host}{ud.path}")
> -        runfetchcmd("%s %s" % (ud.basecmd, f"gs://{ud.host}{ud.path}"), d)
> +        bb.fetch2.check_network_access(d,
> "gcp_client.bucket(ud.host).blob(path).exists()",
> f"gs://{ud.host}{ud.path}")
>
>          # Path sometimes has leading slash, so strip it
>          path = ud.path.lstrip("/")
> --
> 2.43.0
>
>
> On Wed, Aug 21, 2024 at 7:41 PM Charlie Johnston via
> lists.openembedded.org
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.openembedded.org&d=DwMFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=hqGfVHbClPTscdMaVha_b7aC2hJiz3JOLacaaSf0Ybjrzp2ypdUAKDjAqWArfIIg&s=dTQvc7buPp4YC9A0C-kX75ciZrOsa0UEcDEInebJzFU&e=>
> <charlie.johnston=loftorbital.com@lists.openembedded.org> wrote:
>
>> Hello,
>>
>> I actually ran into the slowness of gsutil stat recently and have been
>> testing a patch locally that does something similar to what you describe.
>> The difference is that I modified it to drop the runfetchcomd and raise a
>> FetchError if self.gcp_client.bucket(ud.host).blob(path).exists()  is
>> false.
>> Is that approach acceptable, or is it better to bundle it into the
>> download_to_filename logic as you describe?
>>
>> Regards,
>> Charlie
>>
>>
>>
>>
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
> View/Reply Online (#16515):
> https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.openembedded.org_g_bitbake-2Ddevel_message_16515&d=DwIFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=hqGfVHbClPTscdMaVha_b7aC2hJiz3JOLacaaSf0Ybjrzp2ypdUAKDjAqWArfIIg&s=kCznyCMpR5CzmDe3KeIwDR5JM24ELmq8QZguI06JUuk&e=
> Mute This Topic:
> https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.openembedded.org_mt_102834834_7048771&d=DwIFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=hqGfVHbClPTscdMaVha_b7aC2hJiz3JOLacaaSf0Ybjrzp2ypdUAKDjAqWArfIIg&s=tfd7JFbaM14OIscUydQKKmlSYkXlw08lV8SfY0gN-_s&e=
> Group Owner: bitbake-devel+owner@lists.openembedded.org
> Unsubscribe:
> https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.openembedded.org_g_bitbake-2Ddevel_unsub&d=DwIFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=hqGfVHbClPTscdMaVha_b7aC2hJiz3JOLacaaSf0Ybjrzp2ypdUAKDjAqWArfIIg&s=c9W4p9VYP2ytBvDF3f0gpr_0bPqKxzKbTjzalHAwGyQ&e=
> [ecordonnier@snap.com]
> -=-=-=-=-=-=-=-=-=-=-=-
>
>
Charlie Johnston Aug. 22, 2024, 6:30 p.m. UTC | #6
Yes, you understood correctly. Here's a small snippet from my local change:
         # Path sometimes has leading slash, so strip it
         path = ud.path.lstrip("/")
         blob = self.gcp_client.bucket(ud.host).blob(path)
+        if blob.exists() == False:
+            raise FetchError(f"The GCP API reported that
gs://{ud.host}{ud.path} does not exist")
         blob.download_to_filename(ud.localpath)

I have the same use case as you (sstate cache and mirrors), plus a few
public keys stored in gcp for recipes to install internal public keys
on our lab devices. I think your patch is better, and it definitely
was more performant after trying it locally.

Thanks for catching this!
Charlie

On Thu, Aug 22, 2024 at 5:30 AM Etienne Cordonnier <ecordonnier@snap.com> wrote:
>
> "Is that approach acceptable, or is it better to bundle it into the download_to_filename logic as you describe?"
>
> -> If I understand correctly you added a call to exists() inside the function download(), before doing the call to download_to_filename()? The advantage of catching the exception thrown by download_to_filename instead of adding an additional call to exists() is that it does one network transaction instead of two in the case where the file really exists. I think the performance is important there, because there are potentially thousands of calls being done when using it as a fetch pre-mirror or sstate-cache mirror. Maybe we need to catch more exceptions than only the "NotFound" one however.
>
> Étienne
>
>
> On Thu, Aug 22, 2024 at 12:38 PM Etienne Cordonnier via lists.openembedded.org <ecordonnier=snap.com@lists.openembedded.org> wrote:
>>
>> Hi Charlie,
>> I tested this patch locally, but I'm using google cloud only as state-cache mirror, and I tested that it works when I try to call download() on a file which doesn't exist in the google cloud bucket. Does it work for your use-case?
>>
>> Étienne
>>
>> From 2d10a74b01e4b6d9a7b7f94e62784e345ec85f0a Mon Sep 17 00:00:00 2001
>> From: Etienne Cordonnier <ecordonnier@snap.com>
>> Date: Wed, 21 Aug 2024 13:34:34 +0200
>> Subject: [PATCH] gcp.py: remove slow calls to gsutil stat
>>
>> Signed-off-by: Etienne Cordonnier <ecordonnier@snap.com>
>> ---
>>  bitbake/lib/bb/fetch2/gcp.py | 14 +++++++-------
>>  1 file changed, 7 insertions(+), 7 deletions(-)
>>
>> diff --git a/bitbake/lib/bb/fetch2/gcp.py b/bitbake/lib/bb/fetch2/gcp.py
>> index eb3e0c6a6bb..9ac2f752f6b 100644
>> --- a/bitbake/lib/bb/fetch2/gcp.py
>> +++ b/bitbake/lib/bb/fetch2/gcp.py
>> @@ -19,11 +19,11 @@ Additionally, gsutil must also be installed.
>>
>>  import os
>>  import bb
>> +from google.api_core.exceptions import NotFound
>>  import urllib.parse, urllib.error
>>  from bb.fetch2 import FetchMethod
>>  from bb.fetch2 import FetchError
>>  from bb.fetch2 import logger
>> -from bb.fetch2 import runfetchcmd
>>
>>  class GCP(FetchMethod):
>>      """
>> @@ -48,7 +48,6 @@ class GCP(FetchMethod):
>>              ud.basename = os.path.basename(ud.path)
>>
>>          ud.localfile = d.expand(urllib.parse.unquote(ud.basename))
>> -        ud.basecmd = "gsutil stat"
>>
>>      def get_gcp_client(self):
>>          from google.cloud import storage
>> @@ -63,13 +62,15 @@ class GCP(FetchMethod):
>>          if self.gcp_client is None:
>>              self.get_gcp_client()
>>
>> -        bb.fetch2.check_network_access(d, ud.basecmd, f"gs://{ud.host}{ud.path}")
>> -        runfetchcmd("%s %s" % (ud.basecmd, f"gs://{ud.host}{ud.path}"), d)
>> +        bb.fetch2.check_network_access(d, "blob.download_to_filename", f"gs://{ud.host}{ud.path}")
>>
>>          # Path sometimes has leading slash, so strip it
>>          path = ud.path.lstrip("/")
>>          blob = self.gcp_client.bucket(ud.host).blob(path)
>> -        blob.download_to_filename(ud.localpath)
>> +        try:
>> +            blob.download_to_filename(ud.localpath)
>> +        except NotFound:
>> +            raise FetchError("The GCP API threw a NotFound exception")
>>
>>          # Additional sanity checks copied from the wget class (although there
>>          # are no known issues which mean these are required, treat the GCP API
>> @@ -91,8 +92,7 @@ class GCP(FetchMethod):
>>          if self.gcp_client is None:
>>              self.get_gcp_client()
>>
>> -        bb.fetch2.check_network_access(d, ud.basecmd, f"gs://{ud.host}{ud.path}")
>> -        runfetchcmd("%s %s" % (ud.basecmd, f"gs://{ud.host}{ud.path}"), d)
>> +        bb.fetch2.check_network_access(d, "gcp_client.bucket(ud.host).blob(path).exists()", f"gs://{ud.host}{ud.path}")
>>
>>          # Path sometimes has leading slash, so strip it
>>          path = ud.path.lstrip("/")
>> --
>> 2.43.0
>>
>>
>> On Wed, Aug 21, 2024 at 7:41 PM Charlie Johnston via lists.openembedded.org <charlie.johnston=loftorbital.com@lists.openembedded.org> wrote:
>>>
>>> Hello,
>>>
>>> I actually ran into the slowness of gsutil stat recently and have been testing a patch locally that does something similar to what you describe. The difference is that I modified it to drop the runfetchcomd and raise a FetchError if self.gcp_client.bucket(ud.host).blob(path).exists()  is false.
>>> Is that approach acceptable, or is it better to bundle it into the download_to_filename logic as you describe?
>>>
>>> Regards,
>>> Charlie
>>>
>>>
>>>
>>
>> -=-=-=-=-=-=-=-=-=-=-=-
>> Links: You receive all messages sent to this group.
>> View/Reply Online (#16515): https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.openembedded.org_g_bitbake-2Ddevel_message_16515&d=DwIFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=hqGfVHbClPTscdMaVha_b7aC2hJiz3JOLacaaSf0Ybjrzp2ypdUAKDjAqWArfIIg&s=kCznyCMpR5CzmDe3KeIwDR5JM24ELmq8QZguI06JUuk&e=
>> Mute This Topic: https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.openembedded.org_mt_102834834_7048771&d=DwIFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=hqGfVHbClPTscdMaVha_b7aC2hJiz3JOLacaaSf0Ybjrzp2ypdUAKDjAqWArfIIg&s=tfd7JFbaM14OIscUydQKKmlSYkXlw08lV8SfY0gN-_s&e=
>> Group Owner: bitbake-devel+owner@lists.openembedded.org
>> Unsubscribe: https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.openembedded.org_g_bitbake-2Ddevel_unsub&d=DwIFaQ&c=ncDTmphkJTvjIDPh0hpF_4vCHvabgGkICC2epckfdiw&r=AhkbNonVuMIGRfPx_Qj9TsRih1DULJTKUkSGa66m67E&m=hqGfVHbClPTscdMaVha_b7aC2hJiz3JOLacaaSf0Ybjrzp2ypdUAKDjAqWArfIIg&s=c9W4p9VYP2ytBvDF3f0gpr_0bPqKxzKbTjzalHAwGyQ&e=  [ecordonnier@snap.com]
>> -=-=-=-=-=-=-=-=-=-=-=-
>>
diff mbox series

Patch

diff --git a/lib/bb/fetch2/gcp.py b/lib/bb/fetch2/gcp.py
index f42c81fd..f40ce2ea 100644
--- a/lib/bb/fetch2/gcp.py
+++ b/lib/bb/fetch2/gcp.py
@@ -47,6 +47,7 @@  class GCP(FetchMethod):
             ud.basename = os.path.basename(ud.path)
 
         ud.localfile = d.expand(urllib.parse.unquote(ud.basename))
+        ud.basecmd = "gsutil stat"
 
     def get_gcp_client(self):
         from google.cloud import storage
@@ -61,7 +62,8 @@  class GCP(FetchMethod):
         if self.gcp_client is None:
             self.get_gcp_client()
 
-        bb.fetch2.check_network_access(d, "gsutil stat", ud.url)
+        bb.fetch2.check_network_access(d, ud.basecmd, f"gs://{ud.host}{ud.path}")
+        runfetchcmd("%s %s" % (ud.basecmd, f"gs://{ud.host}{ud.path}"), d)
 
         # Path sometimes has leading slash, so strip it
         path = ud.path.lstrip("/")
@@ -88,7 +90,8 @@  class GCP(FetchMethod):
         if self.gcp_client is None:
             self.get_gcp_client()
 
-        bb.fetch2.check_network_access(d, "gsutil stat", ud.url)
+        bb.fetch2.check_network_access(d, ud.basecmd, f"gs://{ud.host}{ud.path}")
+        runfetchcmd("%s %s" % (ud.basecmd, f"gs://{ud.host}{ud.path}"), d)
 
         # Path sometimes has leading slash, so strip it
         path = ud.path.lstrip("/")