Message ID | 20250502203501.3364973-1-ejo@pengutronix.de |
---|---|
State | Under Review |
Series | conf.py: tweak SearchLanguage to be hyphen-friendly |
Hi Enrico,

On 5/2/25 10:35 PM, Enrico Jörns via lists.yoctoproject.org wrote:
> This modifies the default indexer split() and js splitQuery()
> methods to support searching for words with hyphens.
>
> While this might not be an ideal, rock-solid, or fully future-proof
> solution, it at least allows searching for strings that include hyphens,
> such as 'bitbake-layers', 'send-error-report', or 'oe-core'.
>
> [...]
>
> +
> +from sphinx.search import SearchLanguage
> +from sphinx.search import languages
> +class DashFriendlySearchLanguage(SearchLanguage):

Could you extend SearchEnglish instead, which is I believe what we should be
using today?

> +    lang = 'en'
> +

This would then be redundant.

> +    # Accept words that can include hyphens
> +    _word_re = re.compile(r'[\w-]+')

I would recommend to have the dash as first character in case we ever want to
expand this regex to avoid inadvertently making it a range, c.f.
https://docs.python.org/3/library/re.html#module-re [] section. Or simply
escape it so it's explicit what we're expecting from it :)

> +
> +    def split(self, input: str) -> list[str]:
> +        return self._word_re.findall(input)
> +

This is already what the split method from SearchLanguage does, therefore no
need to override it.

> +    js_splitter_code = """
> +function splitQuery(query) {
> +    return query.split(/\\s+/g).filter(term => term.length > 0);

Why not simply add '-' to the default splitQuery as in
sphinx/themes/basic/static/searchtools.js so we keep as close as possible to
the original behavior, just with the added dash?

Cheers,
Quentin
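For readers following the thread, here is a minimal sketch of what Quentin's
last suggestion could look like in conf.py. It reuses the upstream splitQuery()
regex quoted in the patch description and only adds an escaped dash to the
character class; this is an assumed illustration, not the actual follow-up
patch:

# Sketch only: keep the upstream splitQuery() from
# sphinx/themes/basic/static/searchtools.js and merely add the escaped
# dash '\-' to the negated character class.
js_splitter_code = r"""
function splitQuery(query) {
    return query
        .split(/[^\p{Letter}\p{Number}_\-\p{Emoji_Presentation}]+/gu)
        .filter(term => term.length > 0);  // drop remaining empty strings
}
"""

Compared to the whitespace-only splitter in the patch, this keeps the upstream
behavior of also splitting queries on punctuation such as '.', ',' or '/', with
the dash as the only new word character.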
Hi Enrico,

On Fri May 2, 2025 at 10:35 PM CEST, Enrico Jörns wrote:
> This modifies the default indexer split() and js splitQuery()
> methods to support searching for words with hyphens.
>
> While this might not be an ideal, rock-solid, or fully future-proof
> solution, it at least allows searching for strings that include hyphens,
> such as 'bitbake-layers', 'send-error-report', or 'oe-core'.
>
> [...]
>
> +languages['en'] = DashFriendlySearchLanguage
> +
>  # Make the EPUB builder prefer PNG to SVG because of issues rendering Inkscape SVG
>  from sphinx.builders.epub3 import Epub3Builder
>  Epub3Builder.supported_image_types = ['image/png', 'image/gif', 'image/jpeg']

This is nice, thanks. I've tested it with "bitbake-layers" or "ide-sdk", and
it gives me much better results.

The searchindex.js generated file grows from 936 bytes to 1180 bytes, I think
we can live with that though - just an observation.

Also a side-note: last time I checked, the kernel documentation had the same
problem. Maybe a worthy contribution over there too?

Antonin
Hi Quentin,

On Mon May 5, 2025 at 11:11 AM CEST, Quentin Schulz via lists.yoctoproject.org wrote:
> Hi Enrico,
>
> On 5/2/25 10:35 PM, Enrico Jörns via lists.yoctoproject.org wrote:
>> This modifies the default indexer split() and js splitQuery()
>> methods to support searching for words with hyphens.
>>
>> [...]
>>
>> +from sphinx.search import SearchLanguage
>> +from sphinx.search import languages
>> +class DashFriendlySearchLanguage(SearchLanguage):
>
> Could you extend SearchEnglish instead, which is I believe what we
> should be using today?

A question on top of this: using SearchLanguage instead of SearchEnglish
results in _not_ using the stopwords and stemmer code defined in
SearchEnglish. Is that intentional? Maybe there is a good reason for not
using SearchEnglish, but I'm curious.

Antonin
Hi folks,

On Monday, 12.05.2025 at 17:19 +0200, Antonin Godard wrote:
> > > Signed-off-by: Enrico Jörns <ejo@pengutronix.de>
> > > ---
> > >  documentation/conf.py | 21 +++++++++++++++++++++
> > >  1 file changed, 21 insertions(+)
> > >
> > > [...]
> > >
> > > +from sphinx.search import SearchLanguage
> > > +from sphinx.search import languages
> > > +class DashFriendlySearchLanguage(SearchLanguage):
> >
> > Could you extend SearchEnglish instead, which is I believe what we
> > should be using today?
>
> A question on top of this: using SearchLanguage instead of SearchEnglish
> results in _not_ using the stopwords and stemmer code defined in
> SearchEnglish. Is that intentional? Maybe there is a good reason for not
> using SearchEnglish, but I'm curious.

I have no strong opinion on this or what's better for a technical
documentation.

Searching for 'effectiveness' in the documentation with stemmer enabled gives
you results like 'effectively', etc. Surprisingly, the exact occurrence does
not match for me anymore and the other results are not highlighted...

But maybe with the intention to be least invasive, using SearchEnglish is fine
anyway.

Regards,
Enrico

> Antonin
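For context on the stemming behaviour Enrico describes: Sphinx's SearchEnglish
stems both the indexed words and the query terms (recent Sphinx releases rely
on the snowballstemmer package for this), so different word forms collapse
onto one index entry. A standalone illustration, not part of the patch:

import snowballstemmer

stemmer = snowballstemmer.stemmer('english')
print(stemmer.stemWords(['effectiveness', 'effectively', 'effective']))
# All three are expected to reduce to the same stem (roughly 'effect'),
# which is why a search for 'effectiveness' also surfaces pages that
# only contain 'effectively'.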
Hi Quentin,

On Monday, 05.05.2025 at 11:11 +0200, Quentin Schulz wrote:
> > ---
> >  documentation/conf.py | 21 +++++++++++++++++++++
> >  1 file changed, 21 insertions(+)
> >
> > [...]
> >
> > +from sphinx.search import SearchLanguage
> > +from sphinx.search import languages
> > +class DashFriendlySearchLanguage(SearchLanguage):
>
> Could you extend SearchEnglish instead, which is I believe what we
> should be using today?

yes, fine for me.

> > +    lang = 'en'
> > +
>
> This would then be redundant.

Ok.

> > +    # Accept words that can include hyphens
> > +    _word_re = re.compile(r'[\w-]+')
>
> I would recommend to have the dash as first character in case we ever
> want to expand this regex to avoid inadvertently making it a range, c.f.
> https://docs.python.org/3/library/re.html#module-re [] section. Or
> simply escape it so it's explicit what we're expecting from it :)

Yes, might save us from later surprises.

> > +
> > +    def split(self, input: str) -> list[str]:
> > +        return self._word_re.findall(input)
> > +
>
> This is already what the split method from SearchLanguage does,
> therefore no need to override it.

Indeed. No idea why I thought this is required.

> > +    js_splitter_code = """
> > +function splitQuery(query) {
> > +    return query.split(/\\s+/g).filter(term => term.length > 0);
>
> Why not simply add '-' to the default splitQuery as in
> sphinx/themes/basic/static/searchtools.js so we keep as close as
> possible to the original behavior, just with the added dash?

Wasn't sure if including or excluding is better, but maybe having it closer
to the original one makes sense.

Will send a v2 soon.

Regards,
Enrico

> Cheers,
> Quentin
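The v2 itself is not part of this thread; purely for orientation, here is a
sketch of how the review comments above could combine in documentation/conf.py.
The class name DashFriendlySearchEnglish and the exact regex escaping are
assumptions, not the actual follow-up patch:

import re

from sphinx.search import languages
from sphinx.search.en import SearchEnglish  # keeps English stopwords and stemmer

class DashFriendlySearchEnglish(SearchEnglish):
    # Accept words that may include hyphens; the dash is escaped so it can
    # never be misread as a character range if the class is extended later.
    _word_re = re.compile(r'[\w\-]+')

    # Stay close to the upstream splitQuery() from searchtools.js and only
    # add the dash to the character class.
    js_splitter_code = r"""
function splitQuery(query) {
    return query
        .split(/[^\p{Letter}\p{Number}_\-\p{Emoji_Presentation}]+/gu)
        .filter(term => term.length > 0);
}
"""

languages['en'] = DashFriendlySearchEnglish

Subclassing SearchEnglish would also address Antonin's stopword/stemmer
question: both are inherited unchanged, and only the notion of a 'word' is
widened.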
diff --git a/documentation/conf.py b/documentation/conf.py
index 2aceeb8e7..02397cd20 100644
--- a/documentation/conf.py
+++ b/documentation/conf.py
@@ -13,6 +13,7 @@
 # documentation root, use os.path.abspath to make it absolute, like shown here.
 #
 import os
+import re
 import sys
 import datetime
 try:
@@ -173,6 +174,26 @@ latex_elements = {
     'preamble': '\\usepackage[UTF8]{ctex}\n\\setcounter{tocdepth}{2}',
 }
 
+
+from sphinx.search import SearchLanguage
+from sphinx.search import languages
+class DashFriendlySearchLanguage(SearchLanguage):
+    lang = 'en'
+
+    # Accept words that can include hyphens
+    _word_re = re.compile(r'[\w-]+')
+
+    def split(self, input: str) -> list[str]:
+        return self._word_re.findall(input)
+
+    js_splitter_code = """
+function splitQuery(query) {
+    return query.split(/\\s+/g).filter(term => term.length > 0);
+}
+"""
+
+languages['en'] = DashFriendlySearchLanguage
+
 # Make the EPUB builder prefer PNG to SVG because of issues rendering Inkscape SVG
 from sphinx.builders.epub3 import Epub3Builder
 Epub3Builder.supported_image_types = ['image/png', 'image/gif', 'image/jpeg']
This modifies the default indexer split() and js splitQuery()
methods to support searching for words with hyphens.

While this might not be an ideal, rock-solid, or fully future-proof
solution, it at least allows searching for strings that include hyphens,
such as 'bitbake-layers', 'send-error-report', or 'oe-core'.

Below is a more detailed explanation of the two modifications:

1) The default split regex in the sphinx-doc SearchLanguage class is:

| _word_re = re.compile(r'\w+')

which we simply extend to include hyphens '-'.

This will result in a searchindex.js that contains words with hyphens,
too.

2) The 'searchtools.js' code notes for its splitQuery() implementation:

| /**
|  * Default splitQuery function. Can be overridden in ``sphinx.search`` with a
|  * custom function per language.
|  *
|  * The regular expression works by splitting the string on consecutive characters
|  * that are not Unicode letters, numbers, underscores, or emoji characters.
|  * This is the same as ``\W+`` in Python, preserving the surrogate pair area.
|  */
| if (typeof splitQuery === "undefined") {
|   var splitQuery = (query) => query
|       .split(/[^\p{Letter}\p{Number}_\p{Emoji_Presentation}]+/gu)
|       .filter(term => term)  // remove remaining empty strings
| }

The hook for this is documented in the sphinx-docs 'SearchLanguage'
class:

| .. attribute:: js_splitter_code
|
|    Return splitter function of JavaScript version. The function should be
|    named as ``splitQuery``. And it should take a string and return list of
|    strings.
|
|    .. versionadded:: 3.0

We use this to define a simplified splitQuery() function whose split()
call splits on whitespace only.

[YOCTO #14534]

Signed-off-by: Enrico Jörns <ejo@pengutronix.de>
---
 documentation/conf.py | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)
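To make the effect of modification (1) concrete, here is a small standalone
comparison of the stock regex and the patched one (plain Python, illustration
only):

import re

default_word_re = re.compile(r'\w+')     # stock SearchLanguage behaviour
patched_word_re = re.compile(r'[\w-]+')  # the regex added by this patch

text = "Use bitbake-layers and send-error-report from oe-core."

print(default_word_re.findall(text))
# ['Use', 'bitbake', 'layers', 'and', 'send', 'error', 'report', 'from', 'oe', 'core']

print(patched_word_re.findall(text))
# ['Use', 'bitbake-layers', 'and', 'send-error-report', 'from', 'oe-core']

With the stock regex, 'bitbake-layers' is indexed as the two separate terms
'bitbake' and 'layers', which is why a query for the hyphenated name finds no
exact match.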