Message ID | 20240606203103.910-2-egyszeregy@freemail.hu |
---|---|
State | New |
Series | fetch2/wget: Add user_agent parameter so it can be used optionally |
Hi Benjamin,

On 6/6/24 10:31 PM, Livius via lists.openembedded.org wrote:
> From: Benjamin Szőke <egyszeregy@freemail.hu>
>
> Add the "user_agent" optional parameter for wget fetcher to able
> to use it if HTTP servers block requests with the default wget
> user agent.
>
> Signed-off-by: Benjamin Szőke <egyszeregy@freemail.hu>
> ---
>  .../bitbake-user-manual-fetching.rst | 20 ++++++++++++-------
>  .../bitbake-user-manual-ref-variables.rst | 4 ++++
>  lib/bb/fetch2/wget.py | 11 +++++++++-
>  3 files changed, 27 insertions(+), 8 deletions(-)
>
> diff --git a/doc/bitbake-user-manual/bitbake-user-manual-fetching.rst b/doc/bitbake-user-manual/bitbake-user-manual-fetching.rst
> index fb4f0a23d..899fa2f33 100644
> --- a/doc/bitbake-user-manual/bitbake-user-manual-fetching.rst
> +++ b/doc/bitbake-user-manual/bitbake-user-manual-fetching.rst
> @@ -221,13 +221,18 @@ HTTP/FTP wget fetcher (``http://``, ``ftp://``, ``https://``)
>  This fetcher obtains files from web and FTP servers. Internally, the
>  fetcher uses the wget utility.
>
> -The executable and parameters used are specified by the
> -``FETCHCMD_wget`` variable, which defaults to sensible values. The
> -fetcher supports a parameter "downloadfilename" that allows the name of
> -the downloaded file to be specified. Specifying the name of the
> -downloaded file is useful for avoiding collisions in
> -:term:`DL_DIR` when dealing with multiple files that
> -have the same name.
> +The executable and parameters used are specified by the ``FETCHCMD_wget``
> +variable, which defaults to sensible values. The fetcher supports
> +parameters, "downloadfilename" that allows the name of the downloaded
> +file to be specified and "user_agent" parameter which enable to use
> +a default ``Mozilla/5.0`` user-agent or a custom string value
> +via usage of :term:`BB_USER_AGENT`.
> +
> +Specifying the name of the downloaded file is useful for avoiding
> +collisions in :term:`DL_DIR` when dealing with multiple files
> +that have the same name. A few HTTP servers block requests with
> +the default wget user-agent, in this case specifying a valid
> +user-agent can solve this issue.
>

If I may suggest, could you please make a list of all supported parameters, the same way it is currently done for
https://docs.yoctoproject.org/bitbake/bitbake-user-manual/bitbake-user-manual-fetching.html#the-unpack
or
https://docs.yoctoproject.org/bitbake/bitbake-user-manual/bitbake-user-manual-fetching.html#cvs-fetcher-cvs
?

>  If a username and password are specified in the ``SRC_URI``, a Basic
>  Authorization header will be added to each request, including across redirects.
> @@ -239,6 +244,7 @@ Some example URLs are as follows::
>     SRC_URI = "http://oe.handhelds.org/not_there.aac"
>     SRC_URI = "ftp://oe.handhelds.org/not_there_as_well.aac"
>     SRC_URI = "ftp://you@oe.handhelds.org/home/you/secret.plan"
> +   SRC_URI = "https://oe.handhelds.org/not_there.aac;user_agent=1"
>
>  .. note::
>
> diff --git a/doc/bitbake-user-manual/bitbake-user-manual-ref-variables.rst b/doc/bitbake-user-manual/bitbake-user-manual-ref-variables.rst
> index 899e584f9..a6c05a6bf 100644
> --- a/doc/bitbake-user-manual/bitbake-user-manual-ref-variables.rst
> +++ b/doc/bitbake-user-manual/bitbake-user-manual-ref-variables.rst
> @@ -699,6 +699,10 @@ overview of their function and contents.
>        Within an executing task, this variable holds the hash of the task as
>        returned by the currently enabled signature generator.
>
> +   :term:`BB_USER_AGENT`
> +      Specifies a user-agent string which BitBake uses if "user_agent"
> +      parameter is enabled for HTTP/FTP wget fetcher.
> +
>     :term:`BB_VERBOSE_LOGS`
>        Controls how verbose BitBake is during builds. If set, shell scripts
>        echo commands and shell script output appears on standard out
> diff --git a/lib/bb/fetch2/wget.py b/lib/bb/fetch2/wget.py
> index d76b1d0d3..db4327ead 100644
> --- a/lib/bb/fetch2/wget.py
> +++ b/lib/bb/fetch2/wget.py
> @@ -56,7 +56,7 @@ class Wget(FetchMethod):
>      # CDNs like CloudFlare may do a 'browser integrity test' which can fail
>      # with the standard wget/urllib User-Agent, so pretend to be a modern
>      # browser.
> -    user_agent = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0"
> +    user_agent = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0"
>

What about moving this to conf/bitbake.conf as

BB_USER_AGENT ?= "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0"

or even ??= (I see that we don't have weak operators in that file and it's pretty small, so maybe it isn't the right place?)

>      def check_certs(self, d):
>          """
> @@ -89,6 +89,15 @@ class Wget(FetchMethod):
>
>          self.basecmd = d.getVar("FETCHCMD_wget") or "/usr/bin/env wget -t 2 -T 30"
>
> +        is_user_agent_enabled = ud.parm.get("user_agent","0") == "1"
> +        if is_user_agent_enabled:
> +            bb_user_agent = d.getVar("BB_USER_AGENT")
> +            if bb_user_agent is not None:
> +                cmd_user_agent = bb_user_agent
> +            else:
> +                cmd_user_agent = self.user_agent

It would allow to not have to check that BB_USER_AGENT is defined, just use it.

I think we should probably expand the tests to account for this new parameter, e.g. in lib/bb/tests/fetch.py?

> +            self.basecmd += f" --user-agent='{cmd_user_agent}'"
> +

Should we shlex (https://docs.python.org/3/library/shlex.html) it? to avoid the quote in the string in BB_USER_AGENT to break the command?

Cheers,
Quentin
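
The quoting concern raised at the end of this review can be illustrated in isolation. The following is only a minimal sketch, not BitBake code: `append_user_agent` is a hypothetical helper, and the base command string is simply the fallback value shown in the patch. `shlex.quote` and `shlex.split` are standard-library functions.

```python
import shlex


def append_user_agent(basecmd: str, user_agent: str) -> str:
    """Hypothetical helper (not part of BitBake): append a safely quoted
    --user-agent option to a shell command string."""
    # shlex.quote escapes embedded single quotes, so the shell sees the
    # whole user-agent as part of exactly one argument.
    return f"{basecmd} --user-agent={shlex.quote(user_agent)}"


# A user-agent containing a single quote would terminate the quoting early
# in the patch's f" --user-agent='{cmd_user_agent}'" form.
tricky = "Mozilla/5.0 (it's a test)"
cmd = append_user_agent("/usr/bin/env wget -t 2 -T 30", tricky)

# Splitting the command as a POSIX shell would recovers the original string.
args = shlex.split(cmd)
assert args[-1] == f"--user-agent={tricky}"
```

With this approach the value of the variable can contain quotes or spaces without changing how the rest of the command line is parsed.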
On 6 Jun 2024, at 21:31, Livius via lists.openembedded.org <egyszeregy=freemail.hu@lists.openembedded.org> wrote:
>
> From: Benjamin Szőke <egyszeregy@freemail.hu>
>
> Add the "user_agent" optional parameter for wget fetcher to able
> to use it if HTTP servers block requests with the default wget
> user agent.

What servers, why are they blocking, and what user-agent can fool them?

As you’ve seen there is already code to change the user-agent for some codepaths but not all of them: we should probably ensure the same logic is used everywhere.

What happens if we’re mostly truthful and have a useragent of “Bitbake” instead of wget (which could be throttled) or faking a browser?

Ross
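
The point about using the same logic everywhere can be sketched as follows. This is an illustration under stated assumptions, not the fetcher's actual structure: `USER_AGENT`, `build_wget_command` and `build_status_request` are hypothetical names, and the URL is a placeholder. The idea is simply that one value feeds both the wget command line and any urllib-based request.

```python
import shlex
import urllib.request

# Assumption for illustration: the "mostly truthful" value floated above.
USER_AGENT = "Bitbake"


def build_wget_command(url: str) -> str:
    """Hypothetical download path: wget with the shared user-agent."""
    return f"wget --user-agent={shlex.quote(USER_AGENT)} {shlex.quote(url)}"


def build_status_request(url: str) -> urllib.request.Request:
    """Hypothetical status-check path: a HEAD request carrying the same
    user-agent. The request is only constructed here, never sent."""
    return urllib.request.Request(
        url, headers={"User-Agent": USER_AGENT}, method="HEAD"
    )


url = "https://example.com/file.tar.gz"  # placeholder URL
request = build_status_request(url)
command = build_wget_command(url)

# Both code paths present the same identity to the server.
assert request.get_header("User-agent") == USER_AGENT
assert f"--user-agent={USER_AGENT}" in command
```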
Many servers fail, seemingly at random, with wget's default user-agent. There was a commit about this half a year ago:
https://github.com/openembedded/bitbake/commit/d6fa261a9603677f0b3abbd309c1ca6073b63f4c

It was later reverted because JFrog Artifactory did not like it:
https://github.com/openembedded/bitbake/commit/feef5cd12e877f42ffcace168d44b0e6eb80a907

There is also an AMD-Xilinx link that fails to download without a browser user-agent:
https://support.xilinx.com/s/question/0D54U00008RolRMSAZ/yocto-metaxilinxcore-cannot-find-pmuromnative-url?language=en_US

The best practice is to use a recent Firefox user-agent such as "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0"; that worked fine for me as well.
On Fri, Jun 7, 2024 at 05:23 PM, Ross Burton wrote:
>
> What happens if we’re mostly truthful and have a useragent of “Bitbake”
> instead of wget (which could be throttled) or faking a browser?
>
> Ross

The user-agent conventions say that libraries and tools such as wget and curl only need to send their name and version number, so for BitBake it could be

BB_FETCH_USER_AGENT ??= "bitbake/2.8.0"

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent#library_and_net_tool_ua_strings

With badly behaved HTTP servers, however, it is completely unpredictable how well this will work in general. For example, wget and curl also use this format, yet some servers have denied them based on their user-agent.

All in all, BitBake should use a default like BB_FETCH_USER_AGENT ??= "bitbake/2.8.0", and any recipe that runs into trouble can use a fake browser user-agent to solve it.
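
The fallback behaviour proposed here can be sketched in a few lines. This is a hypothetical illustration, not BitBake code: plain dictionaries stand in for the datastore, `resolve_user_agent` is an invented name, and `BB_FETCH_USER_AGENT` is the variable name suggested in this message rather than an existing variable.

```python
from typing import Mapping

# Library-style default suggested in the thread ("name/version").
DEFAULT_USER_AGENT = "bitbake/2.8.0"

# Browser-style string quoted earlier in the thread.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:126.0) "
    "Gecko/20100101 Firefox/126.0"
)


def resolve_user_agent(variables: Mapping[str, str]) -> str:
    """Hypothetical lookup: use BB_FETCH_USER_AGENT when it is set to a
    non-empty value, otherwise fall back to the weak default."""
    value = variables.get("BB_FETCH_USER_AGENT")
    return value if value else DEFAULT_USER_AGENT


# Stand-ins for the datastore: nothing set globally, while one problematic
# recipe overrides the variable with a browser user-agent.
global_conf: dict = {}
problem_recipe = {"BB_FETCH_USER_AGENT": BROWSER_USER_AGENT}

assert resolve_user_agent(global_conf) == DEFAULT_USER_AGENT
assert resolve_user_agent(problem_recipe) == BROWSER_USER_AGENT
```

The design choice is that the truthful value is the default everywhere and the browser string is an explicit, per-recipe opt-in.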
diff --git a/doc/bitbake-user-manual/bitbake-user-manual-fetching.rst b/doc/bitbake-user-manual/bitbake-user-manual-fetching.rst
index fb4f0a23d..899fa2f33 100644
--- a/doc/bitbake-user-manual/bitbake-user-manual-fetching.rst
+++ b/doc/bitbake-user-manual/bitbake-user-manual-fetching.rst
@@ -221,13 +221,18 @@ HTTP/FTP wget fetcher (``http://``, ``ftp://``, ``https://``)
 This fetcher obtains files from web and FTP servers. Internally, the
 fetcher uses the wget utility.

-The executable and parameters used are specified by the
-``FETCHCMD_wget`` variable, which defaults to sensible values. The
-fetcher supports a parameter "downloadfilename" that allows the name of
-the downloaded file to be specified. Specifying the name of the
-downloaded file is useful for avoiding collisions in
-:term:`DL_DIR` when dealing with multiple files that
-have the same name.
+The executable and parameters used are specified by the ``FETCHCMD_wget``
+variable, which defaults to sensible values. The fetcher supports
+parameters, "downloadfilename" that allows the name of the downloaded
+file to be specified and "user_agent" parameter which enable to use
+a default ``Mozilla/5.0`` user-agent or a custom string value
+via usage of :term:`BB_USER_AGENT`.
+
+Specifying the name of the downloaded file is useful for avoiding
+collisions in :term:`DL_DIR` when dealing with multiple files
+that have the same name. A few HTTP servers block requests with
+the default wget user-agent, in this case specifying a valid
+user-agent can solve this issue.

 If a username and password are specified in the ``SRC_URI``, a Basic
 Authorization header will be added to each request, including across redirects.
@@ -239,6 +244,7 @@ Some example URLs are as follows::
    SRC_URI = "http://oe.handhelds.org/not_there.aac"
    SRC_URI = "ftp://oe.handhelds.org/not_there_as_well.aac"
    SRC_URI = "ftp://you@oe.handhelds.org/home/you/secret.plan"
+   SRC_URI = "https://oe.handhelds.org/not_there.aac;user_agent=1"

 .. note::

diff --git a/doc/bitbake-user-manual/bitbake-user-manual-ref-variables.rst b/doc/bitbake-user-manual/bitbake-user-manual-ref-variables.rst
index 899e584f9..a6c05a6bf 100644
--- a/doc/bitbake-user-manual/bitbake-user-manual-ref-variables.rst
+++ b/doc/bitbake-user-manual/bitbake-user-manual-ref-variables.rst
@@ -699,6 +699,10 @@ overview of their function and contents.
       Within an executing task, this variable holds the hash of the task as
       returned by the currently enabled signature generator.

+   :term:`BB_USER_AGENT`
+      Specifies a user-agent string which BitBake uses if "user_agent"
+      parameter is enabled for HTTP/FTP wget fetcher.
+
    :term:`BB_VERBOSE_LOGS`
       Controls how verbose BitBake is during builds. If set, shell scripts
       echo commands and shell script output appears on standard out
diff --git a/lib/bb/fetch2/wget.py b/lib/bb/fetch2/wget.py
index d76b1d0d3..db4327ead 100644
--- a/lib/bb/fetch2/wget.py
+++ b/lib/bb/fetch2/wget.py
@@ -56,7 +56,7 @@ class Wget(FetchMethod):
     # CDNs like CloudFlare may do a 'browser integrity test' which can fail
     # with the standard wget/urllib User-Agent, so pretend to be a modern
     # browser.
-    user_agent = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0"
+    user_agent = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:126.0) Ge=