diff mbox series

fetch2/wget: Add user_agent parameter so it can be used optionally

Message ID 20240606203103.910-2-egyszeregy@freemail.hu
State New

Commit Message

Livius June 6, 2024, 8:31 p.m. UTC

From: Benjamin Szőke <egyszeregy@freemail.hu>

Add an optional "user_agent" parameter to the wget fetcher so that it
can be enabled when HTTP servers block requests with the default wget
user agent.

Signed-off-by: Benjamin Szőke <egyszeregy@freemail.hu>
---
 .../bitbake-user-manual-fetching.rst          | 20 ++++++++++++-------
 .../bitbake-user-manual-ref-variables.rst     |  4 ++++
 lib/bb/fetch2/wget.py                         | 11 +++++++++-
 3 files changed, 27 insertions(+), 8 deletions(-)

cko/20100101 Firefox/126.0"
 
     def check_certs(self, d):
         """
@@ -89,6 +89,15 @@ class Wget(FetchMethod):
 
         self.basecmd = d.getVar("FETCHCMD_wget") or "/usr/bin/env wget -t 2 -T 30"
 
+        is_user_agent_enabled = ud.parm.get("user_agent","0") == "1"
+        if is_user_agent_enabled:
+            bb_user_agent = d.getVar("BB_USER_AGENT")
+            if bb_user_agent is not None:
+                cmd_user_agent = bb_user_agent
+            else:
+                cmd_user_agent = self.user_agent
+            self.basecmd += f" --user-agent='{cmd_user_agent}'"
+
         if ud.type == 'ftp' or ud.type == 'ftps':
             self.basecmd += " --passive-ftp"
 
-- 
2.45.2.windows.1

Comments

Quentin Schulz June 7, 2024, 9:34 a.m. UTC | #1
Hi Benjamin,

On 6/6/24 10:31 PM, Livius via lists.openembedded.org wrote:
> From: Benjamin Szőke <egyszeregy@freemail.hu>
> 
> Add an optional "user_agent" parameter to the wget fetcher so that it
> can be enabled when HTTP servers block requests with the default wget
> user agent.
> 
> Signed-off-by: Benjamin Szőke <egyszeregy@freemail.hu>
> ---
>   .../bitbake-user-manual-fetching.rst          | 20 ++++++++++++-------
>   .../bitbake-user-manual-ref-variables.rst     |  4 ++++
>   lib/bb/fetch2/wget.py                         | 11 +++++++++-
>   3 files changed, 27 insertions(+), 8 deletions(-)
> 
> diff --git a/doc/bitbake-user-manual/bitbake-user-manual-fetching.rst b/doc/bitbake-user-manual/bitbake-user-manual-fetching.rst
> index fb4f0a23d..899fa2f33 100644
> --- a/doc/bitbake-user-manual/bitbake-user-manual-fetching.rst
> +++ b/doc/bitbake-user-manual/bitbake-user-manual-fetching.rst
> @@ -221,13 +221,18 @@ HTTP/FTP wget fetcher (``http://``, ``ftp://``, ``https://``)
>   This fetcher obtains files from web and FTP servers. Internally, the
>   fetcher uses the wget utility.
>   
> -The executable and parameters used are specified by the
> -``FETCHCMD_wget`` variable, which defaults to sensible values. The
> -fetcher supports a parameter "downloadfilename" that allows the name of
> -the downloaded file to be specified. Specifying the name of the
> -downloaded file is useful for avoiding collisions in
> -:term:`DL_DIR` when dealing with multiple files that
> -have the same name.
> +The executable and parameters used are specified by the ``FETCHCMD_wget``
> +variable, which defaults to sensible values. The fetcher supports
> +parameters, "downloadfilename" that allows the name of the downloaded
> +file to be specified and "user_agent" parameter which enable to use
> +a default ``Mozilla/5.0`` user-agent or a custom string value
> +via usage of :term:`BB_USER_AGENT`.
> +
> +Specifying the name of the downloaded file is useful for avoiding
> +collisions in :term:`DL_DIR` when dealing with multiple files
> +that have the same name. A few HTTP servers block requests with
> +the default wget user-agent, in this case specifying a valid
> +user-agent can solve this issue.
>   

If I may suggest, could you please make a list of all supported 
parameters, the same way it is currently done for 
https://docs.yoctoproject.org/bitbake/bitbake-user-manual/bitbake-user-manual-fetching.html#the-unpack 
or 
https://docs.yoctoproject.org/bitbake/bitbake-user-manual/bitbake-user-manual-fetching.html#cvs-fetcher-cvs 
?

>   If a username and password are specified in the ``SRC_URI``, a Basic
>   Authorization header will be added to each request, including across redirects.
> @@ -239,6 +244,7 @@ Some example URLs are as follows::
>      SRC_URI = "http://oe.handhelds.org/not_there.aac"
>      SRC_URI = "ftp://oe.handhelds.org/not_there_as_well.aac"
>      SRC_URI = "ftp://you@oe.handhelds.org/home/you/secret.plan"
> +   SRC_URI = "https://oe.handhelds.org/not_there.aac;user_agent=1"
>   
>   .. note::
>   
> diff --git a/doc/bitbake-user-manual/bitbake-user-manual-ref-variables.rst b/doc/bitbake-user-manual/bitbake-user-manual-ref-variables.rst
> index 899e584f9..a6c05a6bf 100644
> --- a/doc/bitbake-user-manual/bitbake-user-manual-ref-variables.rst
> +++ b/doc/bitbake-user-manual/bitbake-user-manual-ref-variables.rst
> @@ -699,6 +699,10 @@ overview of their function and contents.
>         Within an executing task, this variable holds the hash of the task as
>         returned by the currently enabled signature generator.
>   
> +   :term:`BB_USER_AGENT`
> +      Specifies a user-agent string which BitBake uses if "user_agent"
> +      parameter is enabled for HTTP/FTP wget fetcher.
> +
>      :term:`BB_VERBOSE_LOGS`
>         Controls how verbose BitBake is during builds. If set, shell scripts
>         echo commands and shell script output appears on standard out
> diff --git a/lib/bb/fetch2/wget.py b/lib/bb/fetch2/wget.py
> index d76b1d0d3..db4327ead 100644
> --- a/lib/bb/fetch2/wget.py
> +++ b/lib/bb/fetch2/wget.py
> @@ -56,7 +56,7 @@ class Wget(FetchMethod):
>       # CDNs like CloudFlare may do a 'browser integrity test' which can fail
>       # with the standard wget/urllib User-Agent, so pretend to be a modern
>       # browser.
> -    user_agent = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0"
> +    user_agent = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0"
>   

What about moving this to conf/bitbake.conf as

BB_USER_AGENT ?= "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0"

or even ??=

(I see that we don't have weak operators in that file and it's pretty 
small, so maybe it isn't the right place?)

>       def check_certs(self, d):
>           """
> @@ -89,6 +89,15 @@ class Wget(FetchMethod):
>   
>           self.basecmd = d.getVar("FETCHCMD_wget") or "/usr/bin/env wget -t 2 -T 30"
>   
> +        is_user_agent_enabled = ud.parm.get("user_agent","0") == "1"
> +        if is_user_agent_enabled:
> +            bb_user_agent = d.getVar("BB_USER_AGENT")
> +            if bb_user_agent is not None:
> +                cmd_user_agent = bb_user_agent
> +            else:
> +                cmd_user_agent = self.user_agent

With a default set in the configuration, the code would no longer need 
to check whether BB_USER_AGENT is defined; it could just use it.

I think we should probably expand the tests to account for this new 
parameter, e.g. in lib/bb/tests/fetch.py?
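
As a rough illustration of what such tests could check, here is a minimal, stand-alone sketch. It does not use the real Wget class or the existing test suite: select_user_agent() is a hypothetical helper that only mirrors the selection logic in this patch, and DEFAULT_USER_AGENT stands in for the class attribute.

```python
import unittest

# Stand-in for Wget.user_agent from the patch (assumption: same default string).
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:126.0) "
    "Gecko/20100101 Firefox/126.0"
)


def select_user_agent(parm, bb_user_agent=None):
    """Hypothetical helper mirroring the patch logic.

    parm is the dict of URL parameters; bb_user_agent is the value of
    BB_USER_AGENT, or None when the variable is unset.
    Returns the user agent to pass to wget, or None if the option is off.
    """
    if parm.get("user_agent", "0") != "1":
        return None
    return bb_user_agent if bb_user_agent is not None else DEFAULT_USER_AGENT


class UserAgentParamTest(unittest.TestCase):
    def test_disabled_by_default(self):
        self.assertIsNone(select_user_agent({}))

    def test_enabled_uses_default(self):
        self.assertEqual(select_user_agent({"user_agent": "1"}),
                         DEFAULT_USER_AGENT)

    def test_enabled_prefers_bb_user_agent(self):
        self.assertEqual(
            select_user_agent({"user_agent": "1"}, "bitbake/2.8.0"),
            "bitbake/2.8.0")
```

Saved to a file, this can be run with "python3 -m unittest <file>". A real test would of course exercise the fetcher itself rather than a copy of its logic.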

> +            self.basecmd += f" --user-agent='{cmd_user_agent}'"
> +

Should we quote it with shlex (https://docs.python.org/3/library/shlex.html), 
so that a quote character in the BB_USER_AGENT string cannot break the command?
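
A minimal sketch of what I mean, assuming the value is appended to a command string that later goes through a shell. user_agent_option() is a made-up helper name for illustration, not something that exists in wget.py:

```python
import shlex


def user_agent_option(user_agent: str) -> str:
    """Return a shell-safe --user-agent option (hypothetical helper)."""
    # shlex.quote() wraps the value in single quotes and escapes any
    # single quote it contains, so the shell sees exactly one argument.
    return " --user-agent=" + shlex.quote(user_agent)


basecmd = "/usr/bin/env wget -t 2 -T 30"
basecmd += user_agent_option(
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:126.0) "
    "Gecko/20100101 Firefox/126.0"
)

# shlex.split() follows POSIX shell splitting rules, which makes it a
# convenient way to check that the option stays a single argument.
args = shlex.split(basecmd)
```

With the f-string from the patch, a value such as "it's an agent" would end the quoted section early; with shlex.quote() it survives as one argument.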

Cheers,
Quentin
Ross Burton June 7, 2024, 3:23 p.m. UTC | #2
On 6 Jun 2024, at 21:31, Livius via lists.openembedded.org <egyszeregy=freemail.hu@lists.openembedded.org> wrote:
> 
> From: Benjamin Szőke <egyszeregy@freemail.hu>
> 
> Add an optional "user_agent" parameter to the wget fetcher so that it
> can be enabled when HTTP servers block requests with the default wget
> user agent.

What servers, why are they blocking, and what user-agent can fool them?  As you’ve seen there is already code to change the user-agent for some codepaths but not all of them: we should probably ensure the same logic is used everywhere.

What happens if we’re mostly truthful and have a useragent of “Bitbake” instead of wget (which could be throttled) or faking a browser?

Ross
Livius June 9, 2024, 9:07 p.m. UTC | #3
There are many servers that fail, seemingly at random, with wget's default user-agent. There was a commit about this about half a year ago:
https://github.com/openembedded/bitbake/commit/d6fa261a9603677f0b3abbd309c1ca6073b63f4c

But it was later reverted because JFrog Artifactory does not like it.
https://github.com/openembedded/bitbake/commit/feef5cd12e877f42ffcace168d44b0e6eb80a907

There is also an AMD-Xilinx link that fails to download without a browser user-agent:
https://support.xilinx.com/s/question/0D54U00008RolRMSAZ/yocto-metaxilinxcore-cannot-find-pmuromnative-url?language=en_US

I think the best practice is to use a recent Firefox user-agent such as "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0"; that one also worked fine for me.
Livius June 10, 2024, 11:34 p.m. UTC | #4
On Fri, Jun  7, 2024 at 05:23 PM, Ross Burton wrote:

> 
> What happens if we’re mostly truthful and have a useragent of “Bitbake”
> instead of wget (which could be throttled) or faking a browser?
> 
> Ross
> 
>

The User-Agent conventions say that libraries and tools such as wget and curl only need to send their name and version number, so for BitBake it could be BB_FETCH_USER_AGENT ??= "bitbake/2.8.0".
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent#library_and_net_tool_ua_strings

But with badly behaved HTTP servers it is completely unpredictable how well this will work in general. For example, wget and curl also use this format, yet some servers have denied them access based on their user-agent.
All in all, BitBake should use a default like BB_FETCH_USER_AGENT ??= "bitbake/2.8.0", and any recipe that has a problem with it can use a fake browser user-agent to work around it.

Patch

diff --git a/doc/bitbake-user-manual/bitbake-user-manual-fetching.rst b/doc/bitbake-user-manual/bitbake-user-manual-fetching.rst
index fb4f0a23d..899fa2f33 100644
--- a/doc/bitbake-user-manual/bitbake-user-manual-fetching.rst
+++ b/doc/bitbake-user-manual/bitbake-user-manual-fetching.rst
@@ -221,13 +221,18 @@  HTTP/FTP wget fetcher (``http://``, ``ftp://``, ``https://``)
 This fetcher obtains files from web and FTP servers. Internally, the
 fetcher uses the wget utility.
 
-The executable and parameters used are specified by the
-``FETCHCMD_wget`` variable, which defaults to sensible values. The
-fetcher supports a parameter "downloadfilename" that allows the name of
-the downloaded file to be specified. Specifying the name of the
-downloaded file is useful for avoiding collisions in
-:term:`DL_DIR` when dealing with multiple files that
-have the same name.
+The executable and parameters used are specified by the ``FETCHCMD_wget``
+variable, which defaults to sensible values. The fetcher supports
+parameters, "downloadfilename" that allows the name of the downloaded
+file to be specified and "user_agent" parameter which enable to use
+a default ``Mozilla/5.0`` user-agent or a custom string value
+via usage of :term:`BB_USER_AGENT`.
+
+Specifying the name of the downloaded file is useful for avoiding
+collisions in :term:`DL_DIR` when dealing with multiple files
+that have the same name. A few HTTP servers block requests with
+the default wget user-agent, in this case specifying a valid
+user-agent can solve this issue.
 
 If a username and password are specified in the ``SRC_URI``, a Basic
 Authorization header will be added to each request, including across redirects.
@@ -239,6 +244,7 @@  Some example URLs are as follows::
    SRC_URI = "http://oe.handhelds.org/not_there.aac"
    SRC_URI = "ftp://oe.handhelds.org/not_there_as_well.aac"
    SRC_URI = "ftp://you@oe.handhelds.org/home/you/secret.plan"
+   SRC_URI = "https://oe.handhelds.org/not_there.aac;user_agent=1"
 
 .. note::
 
diff --git a/doc/bitbake-user-manual/bitbake-user-manual-ref-variables.rst b/doc/bitbake-user-manual/bitbake-user-manual-ref-variables.rst
index 899e584f9..a6c05a6bf 100644
--- a/doc/bitbake-user-manual/bitbake-user-manual-ref-variables.rst
+++ b/doc/bitbake-user-manual/bitbake-user-manual-ref-variables.rst
@@ -699,6 +699,10 @@  overview of their function and contents.
       Within an executing task, this variable holds the hash of the task as
       returned by the currently enabled signature generator.
 
+   :term:`BB_USER_AGENT`
+      Specifies a user-agent string which BitBake uses if "user_agent"
+      parameter is enabled for HTTP/FTP wget fetcher.
+
    :term:`BB_VERBOSE_LOGS`
       Controls how verbose BitBake is during builds. If set, shell scripts
       echo commands and shell script output appears on standard out
diff --git a/lib/bb/fetch2/wget.py b/lib/bb/fetch2/wget.py
index d76b1d0d3..db4327ead 100644
--- a/lib/bb/fetch2/wget.py
+++ b/lib/bb/fetch2/wget.py
@@ -56,7 +56,7 @@  class Wget(FetchMethod):
     # CDNs like CloudFlare may do a 'browser integrity test' which can fail
     # with the standard wget/urllib User-Agent, so pretend to be a modern
     # browser.
-    user_agent = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0"
+    user_agent =3D "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:126.0) Ge=