Message ID | 20250502203501.3364973-1-ejo@pengutronix.de |
---|---|
State | New |
Series | conf.py: tweak SearchLanguage to be hyphen-friendly |

Hi Enrico,

On 5/2/25 10:35 PM, Enrico Jörns via lists.yoctoproject.org wrote:
> This modifies the default indexer split() and js splitQuery()
> methods to support searching for words with hyphens.
>
> While this might not be an ideal, rock-solid, or fully future-proof
> solution, it at least allows searching for strings that include hyphens,
> such as 'bitbake-layers', 'send-error-report', or 'oe-core'.
>
> Below is a bit more detailed explanation of the two modifications done:
>
> 1) The default split regex in the sphinx-doc SearchLanguage class is:
>
> | _word_re = re.compile(r'\w+')
>
> which we simply extend to include hyphens '-'.
>
> This will result in a searchindex.js that contains words with hyphens,
> too.
>
> 2) The 'searchtool.js' code notes for its splitQuery() implementation:
>
> | /**
> |  * Default splitQuery function. Can be overridden in ``sphinx.search`` with a
> |  * custom function per language.
> |  *
> |  * The regular expression works by splitting the string on consecutive characters
> |  * that are not Unicode letters, numbers, underscores, or emoji characters.
> |  * This is the same as ``\W+`` in Python, preserving the surrogate pair area.
> |  */
> | if (typeof splitQuery === "undefined") {
> |   var splitQuery = (query) => query
> |     .split(/[^\p{Letter}\p{Number}_\p{Emoji_Presentation}]+/gu)
> |     .filter(term => term) // remove remaining empty strings
> | }
>
> The hook for this is documented in the sphinx-docs 'SearchLanguage'
> class.
>
> | .. attribute:: js_splitter_code
> |
> |    Return splitter function of JavaScript version. The function should be
> |    named as ``splitQuery``. And it should take a string and return list of
> |    strings.
> |
> |    .. versionadded:: 3.0
>
> We use this to define a simplified splitQuery() function with a split
> argument that splits on empty spaces only.
>
> [YOCTO #14534]
>
> Signed-off-by: Enrico Jörns <ejo@pengutronix.de>
> ---
>  documentation/conf.py | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
>
> diff --git a/documentation/conf.py b/documentation/conf.py
> index 2aceeb8e7..02397cd20 100644
> --- a/documentation/conf.py
> +++ b/documentation/conf.py
> @@ -13,6 +13,7 @@
>  # documentation root, use os.path.abspath to make it absolute, like shown here.
>  #
>  import os
> +import re
>  import sys
>  import datetime
>  try:
> @@ -173,6 +174,26 @@ latex_elements = {
>      'preamble': '\\usepackage[UTF8]{ctex}\n\\setcounter{tocdepth}{2}',
>  }
>
> +
> +from sphinx.search import SearchLanguage
> +from sphinx.search import languages
> +class DashFriendlySearchLanguage(SearchLanguage):

Could you extend SearchEnglish instead? I believe that is what we should be using today.

> +    lang = 'en'
> +

This would then be redundant.

> +    # Accept words that can include hyphens
> +    _word_re = re.compile(r'[\w-]+')

I would recommend putting the dash first in the character class, so that we don't inadvertently turn it into a range if we ever extend this regex; cf. the [] entry in https://docs.python.org/3/library/re.html#module-re. Or simply escape it so it's explicit what we're expecting from it :)

> +
> +    def split(self, input: str) -> list[str]:
> +        return self._word_re.findall(input)
> +

This is already what the split method from SearchLanguage does, so there is no need to override it.

> +    js_splitter_code = """
> +function splitQuery(query) {
> +    return query.split(/\\s+/g).filter(term => term.length > 0);

Why not simply add '-' to the default splitQuery from sphinx/themes/basic/static/searchtools.js, so we stay as close as possible to the original behavior, just with the added dash?

Cheers,
Quentin
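
To make the character-class remark concrete, here is a small standard-library sketch (not part of the patch). It compares the default `\w+` pattern with two hyphen-aware variants and with a deliberately broken pattern, `[\w+-/]+`, a made-up example of what could happen if someone later adds characters around an unescaped hyphen. The constant names and the sample text are assumptions chosen purely for illustration.

```python
import re

# Default tokenizer pattern, as quoted in the patch description.
DEFAULT_WORD_RE = re.compile(r"\w+")

# Hyphen placed first in the class: it is always a literal there.
HYPHEN_FIRST_RE = re.compile(r"[-\w]+")

# Escaped hyphen: a literal regardless of its position in the class.
HYPHEN_ESCAPED_RE = re.compile(r"[\w\-]+")

# Hypothetical later edit: "+" and "/" added around the hyphen.
# "+-/" is now a range covering "+", ",", "-", "." and "/".
ACCIDENTAL_RANGE_RE = re.compile(r"[\w+-/]+")

text = "run bitbake-layers, then send-error-report"

print(DEFAULT_WORD_RE.findall(text))
# ['run', 'bitbake', 'layers', 'then', 'send', 'error', 'report']

print(HYPHEN_FIRST_RE.findall(text))
# ['run', 'bitbake-layers', 'then', 'send-error-report']

print(HYPHEN_ESCAPED_RE.findall(text))
# ['run', 'bitbake-layers', 'then', 'send-error-report']

print(ACCIDENTAL_RANGE_RE.findall(text))
# ['run', 'bitbake-layers,', 'then', 'send-error-report']
```

The hyphen-first and escaped forms behave identically today; the difference only shows once the class grows, where the accidental range silently pulls the trailing comma into the indexed word.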

diff --git a/documentation/conf.py b/documentation/conf.py
index 2aceeb8e7..02397cd20 100644
--- a/documentation/conf.py
+++ b/documentation/conf.py
@@ -13,6 +13,7 @@
 # documentation root, use os.path.abspath to make it absolute, like shown here.
 #
 import os
+import re
 import sys
 import datetime
 try:
@@ -173,6 +174,26 @@ latex_elements = {
     'preamble': '\\usepackage[UTF8]{ctex}\n\\setcounter{tocdepth}{2}',
 }
 
+
+from sphinx.search import SearchLanguage
+from sphinx.search import languages
+class DashFriendlySearchLanguage(SearchLanguage):
+    lang = 'en'
+
+    # Accept words that can include hyphens
+    _word_re = re.compile(r'[\w-]+')
+
+    def split(self, input: str) -> list[str]:
+        return self._word_re.findall(input)
+
+    js_splitter_code = """
+function splitQuery(query) {
+    return query.split(/\\s+/g).filter(term => term.length > 0);
+}
+"""
+
+languages['en'] = DashFriendlySearchLanguage
+
 # Make the EPUB builder prefer PNG to SVG because of issues rendering Inkscape SVG
 from sphinx.builders.epub3 import Epub3Builder
 Epub3Builder.supported_image_types = ['image/png', 'image/gif', 'image/jpeg']

This modifies the default indexer split() and js splitQuery()
methods to support searching for words with hyphens.

While this might not be an ideal, rock-solid, or fully future-proof
solution, it at least allows searching for strings that include hyphens,
such as 'bitbake-layers', 'send-error-report', or 'oe-core'.

Below is a bit more detailed explanation of the two modifications done:

1) The default split regex in the sphinx-doc SearchLanguage class is:

| _word_re = re.compile(r'\w+')

which we simply extend to include hyphens '-'.

This will result in a searchindex.js that contains words with hyphens,
too.

2) The 'searchtool.js' code notes for its splitQuery() implementation:

| /**
|  * Default splitQuery function. Can be overridden in ``sphinx.search`` with a
|  * custom function per language.
|  *
|  * The regular expression works by splitting the string on consecutive characters
|  * that are not Unicode letters, numbers, underscores, or emoji characters.
|  * This is the same as ``\W+`` in Python, preserving the surrogate pair area.
|  */
| if (typeof splitQuery === "undefined") {
|   var splitQuery = (query) => query
|     .split(/[^\p{Letter}\p{Number}_\p{Emoji_Presentation}]+/gu)
|     .filter(term => term) // remove remaining empty strings
| }

The hook for this is documented in the sphinx-docs 'SearchLanguage'
class.

| .. attribute:: js_splitter_code
|
|    Return splitter function of JavaScript version. The function should be
|    named as ``splitQuery``. And it should take a string and return list of
|    strings.
|
|    .. versionadded:: 3.0

We use this to define a simplified splitQuery() function with a split
argument that splits on empty spaces only.

[YOCTO #14534]

Signed-off-by: Enrico Jörns <ejo@pengutronix.de>
---
 documentation/conf.py | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)
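
The two modifications only help if the index side and the query side produce the same tokens. The following standard-library sketch emulates both sides in Python to show where they can diverge. It is an approximation under stated assumptions: the function names are hypothetical, Python's `\w` only roughly mirrors the JavaScript character class quoted above (no emoji handling), and the lowercasing and stemming that Sphinx applies are left out.

```python
import re


def split_index_text(text: str) -> list[str]:
    """Index side: keep runs of word characters and hyphens."""
    return re.findall(r"[-\w]+", text)


def split_query_whitespace(query: str) -> list[str]:
    """Query side as described in the patch: split on whitespace only."""
    return [term for term in re.split(r"\s+", query) if term]


def split_query_word_or_hyphen(query: str) -> list[str]:
    """Query side, alternative: split on anything that is neither a word
    character nor a hyphen (roughly the default splitter plus '-')."""
    return [term for term in re.split(r"[^-\w]+", query) if term]


# Tokens that would end up in the (simplified) index.
indexed = set(split_index_text("Use bitbake-layers (see oe-core)."))

query = "bitbake-layers, oe-core"

print(split_query_whitespace(query))
# ['bitbake-layers,', 'oe-core']

print(split_query_word_or_hyphen(query))
# ['bitbake-layers', 'oe-core']

# Punctuation stuck to a term prevents an exact match against the index.
print([term in indexed for term in split_query_whitespace(query)])
# [False, True]
print([term in indexed for term in split_query_word_or_hyphen(query)])
# [True, True]
```

With whitespace-only splitting, the comma stays attached to the first term, so it no longer matches the indexed word. Splitting on the complement of the index pattern keeps the two sides symmetric.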