Message ID | 20250520094514.2672646-1-ejo@pengutronix.de |
---|---|
State | Accepted |
Series | [v2] conf.py: tweak SearchEnglish to be hyphen-friendly |
On Tue, 20 May 2025 11:45:14 +0200, Enrico Jörns wrote:
> This modifies the default indexer split() and js splitQuery()
> methods to support searching for words with hyphens.
>
> While this might not be an ideal, rock solid, and fully future-proof
> solution, it allows at least to search for strings including hyphens,
> such as 'bitbake-layers', 'send-error-report', or 'oe-core'.
>
> [...]

Applied, thanks!

[1/1] conf.py: tweak SearchEnglish to be hyphen-friendly
      commit: d4a98ee19e0cbd6be96923dc72faee143a6b294b

Best regards,
On Tue, 2025-05-20 at 11:45 +0200, Enrico Jörns via lists.yoctoproject.org wrote:
> This modifies the default indexer split() and js splitQuery()
> methods to support searching for words with hyphens.
>
> [...]
>
> +from sphinx.search import SearchEnglish
> +from sphinx.search import languages
> +class DashFriendlySearchEnglish(SearchEnglish):
> +
> +    # Accept words that can include hyphens
> +    _word_re = re.compile(r'[\w\-]+')
> +
> +    js_splitter_code = """
> +function splitQuery(query) {
> +    return query
> +        .split(/[^\p{Letter}\p{Number}_\p{Emoji_Presentation}-]+/gu)
> +        .filter(term => term.length > 0);
> +}
> +"""

With this patch I get this warning:

Running Sphinx v7.3.7
...poky/documentation/conf.py:188: SyntaxWarning: invalid escape sequence '\p'
  .split(/[^\p{Letter}\p{Number}_\p{Emoji_Presentation}-]+/gu)

Should that be a raw string? Like this:

+    js_splitter_code = r"""

Regards,
Adrian

> +
> +languages['en'] = DashFriendlySearchEnglish
> +
> [...]
Hi Adrian,

On 6/6/25 5:18 PM, Adrian Freihofer via lists.yoctoproject.org wrote:
> On Tue, 2025-05-20 at 11:45 +0200, Enrico Jörns via lists.yoctoproject.org wrote:
[...]
>> +    js_splitter_code = """
>> +function splitQuery(query) {
>> +    return query
>> +        .split(/[^\p{Letter}\p{Number}_\p{Emoji_Presentation}-]+/gu)
>> +        .filter(term => term.length > 0);
>> +}
>> +"""
>
> With this patch I get this warning:
>
> Running Sphinx v7.3.7
> ...poky/documentation/conf.py:188: SyntaxWarning: invalid escape sequence '\p'
>   .split(/[^\p{Letter}\p{Number}_\p{Emoji_Presentation}-]+/gu)
>
> Should that be a raw string? Like this:
> +    js_splitter_code = r"""
>

I believe so: https://lore.kernel.org/yocto-docs/20250606-conf-syntax-warning-v1-1-4ebc90ae7d69@cherry.de/T/#u

Cheers,
Quentin
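The warning discussed above can be reproduced outside of Sphinx. The following is only a minimal sketch: the two source snippets are shortened, hypothetical stand-ins for the `js_splitter_code` assignment, and the helper name `compile_and_run` is made up for illustration. It assumes a Python version that still only warns about unknown escapes such as `\p` (a `DeprecationWarning` on older 3.x releases, a `SyntaxWarning` since 3.12).

```python
import warnings

# Hypothetical, shortened stand-ins for the conf.py assignment. They are kept
# as source text so they can be compiled while warnings are being recorded.
PLAIN_SOURCE = r'js = "split(/[^\p{Letter}-]+/gu)"' + "\n"
RAW_SOURCE = r'js = r"split(/[^\p{Letter}-]+/gu)"' + "\n"


def compile_and_run(source):
    """Return the value bound to `js` and the warning categories seen while compiling."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        code = compile(source, "<conf-sketch>", "exec")
    namespace = {}
    exec(code, namespace)
    return namespace["js"], [w.category for w in caught]


plain_value, plain_warnings = compile_and_run(PLAIN_SOURCE)
raw_value, raw_warnings = compile_and_run(RAW_SOURCE)

print("plain string warnings:", plain_warnings)
print("raw string warnings:  ", raw_warnings)
print("same resulting text:  ", plain_value == raw_value)
```

Both variants currently produce the same JavaScript text, because Python leaves an unknown escape in place. The raw string only removes the warning and keeps the code valid should unknown escapes become an error later.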
diff --git a/documentation/conf.py b/documentation/conf.py
index 2aceeb8e7..ad60d9113 100644
--- a/documentation/conf.py
+++ b/documentation/conf.py
@@ -13,6 +13,7 @@
 # documentation root, use os.path.abspath to make it absolute, like shown here.
 #
 import os
+import re
 import sys
 import datetime
 try:
@@ -173,6 +174,24 @@ latex_elements = {
     'preamble': '\\usepackage[UTF8]{ctex}\n\\setcounter{tocdepth}{2}',
 }

+
+from sphinx.search import SearchEnglish
+from sphinx.search import languages
+class DashFriendlySearchEnglish(SearchEnglish):
+
+    # Accept words that can include hyphens
+    _word_re = re.compile(r'[\w\-]+')
+
+    js_splitter_code = """
+function splitQuery(query) {
+    return query
+        .split(/[^\p{Letter}\p{Number}_\p{Emoji_Presentation}-]+/gu)
+        .filter(term => term.length > 0);
+}
+"""
+
+languages['en'] = DashFriendlySearchEnglish
+
 # Make the EPUB builder prefer PNG to SVG because of issues rendering Inkscape SVG
 from sphinx.builders.epub3 import Epub3Builder
 Epub3Builder.supported_image_types = ['image/png', 'image/gif', 'image/jpeg']
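The effect of the two overrides in the diff can be sketched with the standard `re` module alone. This is an illustration under stated assumptions, not Sphinx code: it assumes the indexer collects words with a `findall` over `_word_re`, and it approximates the JavaScript split regex with Python's `\W` (the searchtool.js comment describes the default as "the same as ``\W+`` in Python"). The helper names `index_words` and `query_terms` are hypothetical, and stemming, lowercasing, and stopword removal are not modelled.

```python
import re

# Indexer side: the default word pattern and the hyphen-friendly one from the patch.
DEFAULT_WORD_RE = re.compile(r'\w+')
HYPHEN_WORD_RE = re.compile(r'[\w\-]+')

# Query side: Python approximations of the JavaScript split regexes (assumption).
DEFAULT_QUERY_SPLIT_RE = re.compile(r'\W+')
HYPHEN_QUERY_SPLIT_RE = re.compile(r'[^\w\-]+')


def index_words(text, word_re):
    """Collect the words the indexer would see for a piece of text."""
    return word_re.findall(text)


def query_terms(query, split_re):
    """Split a search query and drop empty strings, like splitQuery() does."""
    return [term for term in split_re.split(query) if term]


text = "Use bitbake-layers from oe-core"
query = "bitbake-layers oe-core"

print(index_words(text, DEFAULT_WORD_RE))
# ['Use', 'bitbake', 'layers', 'from', 'oe', 'core']
print(index_words(text, HYPHEN_WORD_RE))
# ['Use', 'bitbake-layers', 'from', 'oe-core']
print(query_terms(query, DEFAULT_QUERY_SPLIT_RE))
# ['bitbake', 'layers', 'oe', 'core']
print(query_terms(query, HYPHEN_QUERY_SPLIT_RE))
# ['bitbake-layers', 'oe-core']
```

The point of changing both sides together is that a hyphenated query term only finds a hit if the index contains the same hyphenated word.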
This modifies the default indexer split() and js splitQuery() methods to
support searching for words with hyphens.

While this might not be an ideal, rock solid, and fully future-proof
solution, it at least allows searching for strings including hyphens,
such as 'bitbake-layers', 'send-error-report', or 'oe-core'.

Below is a more detailed explanation of the two modifications:

1) The default split regex in the sphinx-doc SearchLanguage base class is:

   | _word_re = re.compile(r'\w+')

   which we simply extend to include hyphens '-'.

   This will result in a searchindex.js that contains words with hyphens, too.

2) The 'searchtool.js' code notes for its splitQuery() implementation:

   | /**
   |  * Default splitQuery function. Can be overridden in ``sphinx.search`` with a
   |  * custom function per language.
   |  *
   |  * The regular expression works by splitting the string on consecutive characters
   |  * that are not Unicode letters, numbers, underscores, or emoji characters.
   |  * This is the same as ``\W+`` in Python, preserving the surrogate pair area.
   |  */
   | if (typeof splitQuery === "undefined") {
   |   var splitQuery = (query) => query
   |       .split(/[^\p{Letter}\p{Number}_\p{Emoji_Presentation}]+/gu)
   |       .filter(term => term)  // remove remaining empty strings
   | }

   The hook for this is documented in the sphinx-docs 'SearchLanguage' base class.

   | .. attribute:: js_splitter_code
   |
   |    Return splitter function of JavaScript version. The function should be
   |    named as ``splitQuery``. And it should take a string and return list of
   |    strings.
   |
   |    .. versionadded:: 3.0

   We use this to define a splitQuery() function whose split regex extends
   the default one so that hyphens are no longer treated as separators.

We extend SearchEnglish (which extends SearchLanguage) here to retain the
stemmer code and stopwords for English.
[YOCTO #14534]

Signed-off-by: Enrico Jörns <ejo@pengutronix.de>
---

Changes v1 -> v2

* extend SearchEnglish instead of SearchLanguage to retain stemmer code
  and stopword handling (rename class accordingly)
* drop "lang = 'en'"
* Escape '-' in _word_re to prevent future misinterpretation as range symbol
* drop useless split() method override
* Use extended original regex for splitQuery() instead of using a custom one

 documentation/conf.py | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)
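The changelog item about escaping '-' can be illustrated with a small sketch. The scenario is hypothetical and not part of the patch: it assumes someone later extends the character class with '+' and '.', and it shows how an unescaped hyphen between two characters silently turns into a range.

```python
import re

# Hypothetical later extension of the word pattern (not in the patch):
# with the hyphen escaped, '+', '-' and '.' are three separate literals.
escaped = re.compile(r'[\w+\-.]+')

# Same edit without the escape: '+-.' is now the range from '+' to '.',
# which also covers ',' (the code point between '+' and '-').
unescaped = re.compile(r'[\w+-.]+')

text = "gtk+3.0,oe-core"

print(escaped.findall(text))    # ['gtk+3.0', 'oe-core']
print(unescaped.findall(text))  # ['gtk+3.0,oe-core']
```

In the pattern from the patch the hyphen follows `\w` and would be read literally either way, so the escape is a precaution against edits like the one sketched here.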