Message ID | 20250228231501.3905069-1-pkj@axis.com |
---|---|
State | New |
Headers | show |
Series | [1/2] fetch2: Revert decodeurl() to not use the URI class | expand |
On Sat, 2025-03-01 at 00:15 +0100, Peter Kjellerstedt via lists.openembedded.org wrote: > This reverts the recent change of decodeurl() to use the URI class to > parse the URL instead of doing it itself. While reusing code is > generally a good idea, using urllib.parse.urlparse() (which the URI > class does) to parse the regular expression "URLs" that are used in > PREMIRRORS and MIRRORS does not work. A regular expression URL > containing https?://... would be silently ignored, while a URL using a > negative lookahead such as git://(?!internal\.git\.server).*/.* would > result in a cryptic error: > > Exception: re.error: missing ), unterminated subpattern at position 0 > > The problem is that urllib.parse.urlparse() treats the ? as the start of > URL parameters and thus stops parsing whatever part of the URL it was > parsing. > > Signed-off-by: Peter Kjellerstedt <peter.kjellerstedt@axis.com> > --- > bitbake/lib/bb/fetch2/__init__.py | 46 +++++++++++++++++++++++++++++-- > 1 file changed, 43 insertions(+), 3 deletions(-) I'm forever coming under pressure to take a patch like the one this reverts, it is the first thing most people new to the fetcher dislike. I can understand why. I wonder if we should restore the function but only use it for parsing mirror urls, and comment it accordingly? Cheers, Richard
diff --git a/bitbake/lib/bb/fetch2/__init__.py b/bitbake/lib/bb/fetch2/__init__.py index 93fe012ec3..464b9ee4c3 100644 --- a/bitbake/lib/bb/fetch2/__init__.py +++ b/bitbake/lib/bb/fetch2/__init__.py @@ -353,9 +353,49 @@ def decodeurl(url): user, password, parameters). """ - uri = URI(url) - path = uri.path if uri.path else "/" - return uri.scheme, uri.hostport, path, uri.username, uri.password, uri.params + m = re.compile('(?P<type>[^:]*)://((?P<user>[^/;]+)@)?(?P<location>[^;]+)(;(?P<parm>.*))?').match(url) + if not m: + raise MalformedUrl(url) + + type = m.group('type') + location = m.group('location') + if not location: + raise MalformedUrl(url) + user = m.group('user') + parm = m.group('parm') + + locidx = location.find('/') + if locidx != -1 and type.lower() != 'file': + host = location[:locidx] + path = location[locidx:] + elif type.lower() == 'file': + host = "" + path = location + if user: + path = user + '@' + path + user = "" + else: + host = location + path = "/" + if user: + m = re.compile('(?P<user>[^:]+)(:?(?P<pswd>.*))').match(user) + if m: + user = m.group('user') + pswd = m.group('pswd') + else: + user = '' + pswd = '' + + p = collections.OrderedDict() + if parm: + for s in parm.split(';'): + if s: + if not '=' in s: + raise MalformedUrl(url, "The URL: '%s' is invalid: parameter %s does not specify a value (missing '=')" % (url, s)) + s1, s2 = s.split('=', 1) + p[s1] = s2 + + return type, host, urllib.parse.unquote(path), user, pswd, p def encodeurl(decoded): """Encodes a URL from tokens (scheme, network location, path,
This reverts the recent change of decodeurl() to use the URI class to parse the URL instead of doing it itself. While reusing code is generally a good idea, using urllib.parse.urlparse() (which the URI class does) to parse the regular expression "URLs" that are used in PREMIRRORS and MIRRORS does not work. A regular expression URL containing https?://... would be silently ignored, while a URL using a negative lookahead such as git://(?!internal\.git\.server).*/.* would result in a cryptic error: Exception: re.error: missing ), unterminated subpattern at position 0 The problem is that urllib.parse.urlparse() treats the ? as the start of URL parameters and thus stops parsing whatever part of the URL it was parsing. Signed-off-by: Peter Kjellerstedt <peter.kjellerstedt@axis.com> --- bitbake/lib/bb/fetch2/__init__.py | 46 +++++++++++++++++++++++++++++-- 1 file changed, 43 insertions(+), 3 deletions(-)