diff mbox series

[1/2] fetch2: Revert decodeurl() to not use the URI class

Message ID 20250228231501.3905069-1-pkj@axis.com
State New
Headers show
Series [1/2] fetch2: Revert decodeurl() to not use the URI class | expand

Commit Message

Peter Kjellerstedt Feb. 28, 2025, 11:15 p.m. UTC
This reverts the recent change of decodeurl() to use the URI class to
parse the URL instead of doing it itself. While reusing code is
generally a good idea, using urllib.parse.urlparse() (which the URI
class does) to parse the regular expression "URLs" that are used in
PREMIRRORS and MIRRORS does not work. A regular expression URL
containing https?://... would be silently ignored, while a URL using a
negative lookahead such as git://(?!internal\.git\.server).*/.* would
result in a cryptic error:

  Exception: re.error: missing ), unterminated subpattern at position 0

The problem is that urllib.parse.urlparse() treats the ? as the start of
URL parameters and thus stops parsing whatever part of the URL it was
parsing.

Signed-off-by: Peter Kjellerstedt <peter.kjellerstedt@axis.com>
---
 bitbake/lib/bb/fetch2/__init__.py | 46 +++++++++++++++++++++++++++++--
 1 file changed, 43 insertions(+), 3 deletions(-)

Comments

Richard Purdie March 2, 2025, 8:21 a.m. UTC | #1
On Sat, 2025-03-01 at 00:15 +0100, Peter Kjellerstedt via lists.openembedded.org wrote:
> This reverts the recent change of decodeurl() to use the URI class to
> parse the URL instead of doing it itself. While reusing code is
> generally a good idea, using urllib.parse.urlparse() (which the URI
> class does) to parse the regular expression "URLs" that are used in
> PREMIRRORS and MIRRORS does not work. A regular expression URL
> containing https?://... would be silently ignored, while a URL using a
> negative lookahead such as git://(?!internal\.git\.server).*/.* would
> result in a cryptic error:
> 
>   Exception: re.error: missing ), unterminated subpattern at position 0
> 
> The problem is that urllib.parse.urlparse() treats the ? as the start of
> URL parameters and thus stops parsing whatever part of the URL it was
> parsing.
> 
> Signed-off-by: Peter Kjellerstedt <peter.kjellerstedt@axis.com>
> ---
>  bitbake/lib/bb/fetch2/__init__.py | 46 +++++++++++++++++++++++++++++--
>  1 file changed, 43 insertions(+), 3 deletions(-)

I'm forever coming under pressure to take a patch like the one this
reverts, it is the first thing most people new to the fetcher dislike.
I can understand why.

I wonder if we should restore the function but only use it for parsing
mirror urls, and comment it accordingly?

Cheers,

Richard
diff mbox series

Patch

diff --git a/bitbake/lib/bb/fetch2/__init__.py b/bitbake/lib/bb/fetch2/__init__.py
index 93fe012ec3..464b9ee4c3 100644
--- a/bitbake/lib/bb/fetch2/__init__.py
+++ b/bitbake/lib/bb/fetch2/__init__.py
@@ -353,9 +353,49 @@  def decodeurl(url):
     user, password, parameters).
     """
 
-    uri = URI(url)
-    path = uri.path if uri.path else "/"
-    return uri.scheme, uri.hostport, path, uri.username, uri.password, uri.params
+    m = re.compile('(?P<type>[^:]*)://((?P<user>[^/;]+)@)?(?P<location>[^;]+)(;(?P<parm>.*))?').match(url)
+    if not m:
+        raise MalformedUrl(url)
+
+    type = m.group('type')
+    location = m.group('location')
+    if not location:
+        raise MalformedUrl(url)
+    user = m.group('user')
+    parm = m.group('parm')
+
+    locidx = location.find('/')
+    if locidx != -1 and type.lower() != 'file':
+        host = location[:locidx]
+        path = location[locidx:]
+    elif type.lower() == 'file':
+        host = ""
+        path = location
+        if user:
+            path = user + '@' + path
+            user = ""
+    else:
+        host = location
+        path = "/"
+    if user:
+        m = re.compile('(?P<user>[^:]+)(:?(?P<pswd>.*))').match(user)
+        if m:
+            user = m.group('user')
+            pswd = m.group('pswd')
+    else:
+        user = ''
+        pswd = ''
+
+    p = collections.OrderedDict()
+    if parm:
+        for s in parm.split(';'):
+            if s:
+                if not '=' in s:
+                    raise MalformedUrl(url, "The URL: '%s' is invalid: parameter %s does not specify a value (missing '=')" % (url, s))
+                s1, s2 = s.split('=', 1)
+                p[s1] = s2
+
+    return type, host, urllib.parse.unquote(path), user, pswd, p
 
 def encodeurl(decoded):
     """Encodes a URL from tokens (scheme, network location, path,