
conf.py: tweak SearchLanguage to be hyphen-friendly

Message ID 20250502203501.3364973-1-ejo@pengutronix.de
State New
Series conf.py: tweak SearchLanguage to be hyphen-friendly

Commit Message

Enrico Jörns May 2, 2025, 8:35 p.m. UTC
This modifies the default indexer split() and js splitQuery()
methods to support searching for words with hyphens.

While this might not be an ideal, rock-solid, or fully future-proof
solution, it at least allows searching for strings that include hyphens,
such as 'bitbake-layers', 'send-error-report', or 'oe-core'.

Below is a bit more detailed explanation of the two modifications done:

1) The default split regex in the sphinx-doc SearchLanguage class is:

   | _word_re = re.compile(r'\w+')

   which we simply extend to include hyphens '-'.

   This will result in a searchindex.js that contains words with hyphens,
   too.

2) The 'searchtools.js' code notes for its splitQuery() implementation:

   | /**
   |  * Default splitQuery function. Can be overridden in ``sphinx.search`` with a
   |  * custom function per language.
   |  *
   |  * The regular expression works by splitting the string on consecutive characters
   |  * that are not Unicode letters, numbers, underscores, or emoji characters.
   |  * This is the same as ``\W+`` in Python, preserving the surrogate pair area.
   |  */
   | if (typeof splitQuery === "undefined") {
   |   var splitQuery = (query) => query
   |       .split(/[^\p{Letter}\p{Number}_\p{Emoji_Presentation}]+/gu)
   |       .filter(term => term)  // remove remaining empty strings
   | }

   The hook for this is documented in the sphinx-doc 'SearchLanguage'
   class:

   |    .. attribute:: js_splitter_code
   |
   |       Return splitter function of JavaScript version.  The function should be
   |       named as ``splitQuery``.  And it should take a string and return list of
   |       strings.
   |
   |       .. versionadded:: 3.0

   We use this to define a simplified splitQuery() function that splits on
   whitespace only.

[YOCTO #14534]

Signed-off-by: Enrico Jörns <ejo@pengutronix.de>
---
 documentation/conf.py | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

Comments

Quentin Schulz May 5, 2025, 9:11 a.m. UTC | #1
Hi Enrico,

On 5/2/25 10:35 PM, Enrico Jörns via lists.yoctoproject.org wrote:
> This modifies the default indexer split() and js splitQuery()
> methods to support searching for words with hyphens.
> 
> While this might not be an ideal, rock-solid, or fully future-proof
> solution, it at least allows searching for strings that include hyphens,
> such as 'bitbake-layers', 'send-error-report', or 'oe-core'.
> 
> Below is a bit more detailed explanation of the two modifications done:
> 
> 1) The default split regex in the sphinx-doc SearchLanguage class is:
> 
>     | _word_re = re.compile(r'\w+')
> 
>     which we simply extend to include hyphens '-'.
> 
>     This will result in a searchindex.js that contains words with hyphens,
>     too.
> 
> 2) The 'searchtools.js' code notes for its splitQuery() implementation:
> 
>     | /**
>     |  * Default splitQuery function. Can be overridden in ``sphinx.search`` with a
>     |  * custom function per language.
>     |  *
>     |  * The regular expression works by splitting the string on consecutive characters
>     |  * that are not Unicode letters, numbers, underscores, or emoji characters.
>     |  * This is the same as ``\W+`` in Python, preserving the surrogate pair area.
>     |  */
>     | if (typeof splitQuery === "undefined") {
>     |   var splitQuery = (query) => query
>     |       .split(/[^\p{Letter}\p{Number}_\p{Emoji_Presentation}]+/gu)
>     |       .filter(term => term)  // remove remaining empty strings
>     | }
> 
>     The hook for this is documented in the sphinx-doc 'SearchLanguage'
>     class:
> 
>     |    .. attribute:: js_splitter_code
>     |
>     |       Return splitter function of JavaScript version.  The function should be
>     |       named as ``splitQuery``.  And it should take a string and return list of
>     |       strings.
>     |
>     |       .. versionadded:: 3.0
> 
>     We use this to define a simplified splitQuery() function that splits on
>     whitespace only.
> 
> [YOCTO #14534]
> 
> Signed-off-by: Enrico Jörns <ejo@pengutronix.de>
> ---
>   documentation/conf.py | 21 +++++++++++++++++++++
>   1 file changed, 21 insertions(+)
> 
> diff --git a/documentation/conf.py b/documentation/conf.py
> index 2aceeb8e7..02397cd20 100644
> --- a/documentation/conf.py
> +++ b/documentation/conf.py
> @@ -13,6 +13,7 @@
>   # documentation root, use os.path.abspath to make it absolute, like shown here.
>   #
>   import os
> +import re
>   import sys
>   import datetime
>   try:
> @@ -173,6 +174,26 @@ latex_elements = {
>       'preamble': '\\usepackage[UTF8]{ctex}\n\\setcounter{tocdepth}{2}',
>   }
>   
> +
> +from sphinx.search import SearchLanguage
> +from sphinx.search import languages
> +class DashFriendlySearchLanguage(SearchLanguage):

Could you extend SearchEnglish instead, which I believe is what we 
should be using today?

> +    lang = 'en'
> +

This would then be redundant.

> +    # Accept words that can include hyphens
> +    _word_re = re.compile(r'[\w-]+')

I would recommend putting the dash first in the character class, so 
that we don't inadvertently turn it into a range if we ever expand this 
regex; cf. the [] section of 
https://docs.python.org/3/library/re.html#module-re. Or simply escape 
it so it's explicit what we're expecting from it :)
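
To illustrate the range pitfall, here is a small standalone sketch. The 
extension with '+' and '.' is purely hypothetical (nothing in the patch 
adds them), and the variable names are made up for the example.

```python
import re

# Three spellings that all treat '-' as a literal today.
dash_last = re.compile(r'[\w-]+')      # as in the patch
dash_first = re.compile(r'[-\w]+')     # dash as first character
dash_escaped = re.compile(r'[\w\-]+')  # dash escaped

sample = "bitbake-layers send-error-report oe-core"
tokens = [p.findall(sample) for p in (dash_last, dash_first, dash_escaped)]
# Each entry is ['bitbake-layers', 'send-error-report', 'oe-core'].

# Hypothetical later extension that also accepts '+' and '.':
# placing them around the dash turns '+-.' into a range from '+' to '.',
# which silently includes ','.
accidental_range = re.compile(r'[\w+-.]+')
dash_first_extended = re.compile(r'[-\w+.]+')

range_tokens = accidental_range.findall("foo,bar")    # ['foo,bar']
safe_tokens = dash_first_extended.findall("foo,bar")  # ['foo', 'bar']
print(tokens[0], range_tokens, safe_tokens)
```

With the dash first (or escaped), appending characters later cannot 
change its meaning.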

> +
> +    def split(self, input: str) -> list[str]:
> +        return self._word_re.findall(input)
> +

This is already what the split method from SearchLanguage does, so 
there is no need to override it.
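
A rough sketch of what I mean. To keep it runnable without Sphinx 
installed, BaseSearchLanguage below is only a stand-in that mirrors the 
two relevant bits of SearchLanguage as described above; in conf.py the 
base class would be the real one (e.g. sphinx.search.en.SearchEnglish), 
and the subclass name is just an example.

```python
import re


class BaseSearchLanguage:
    """Stand-in for sphinx.search.SearchLanguage (assumption: only the
    default word regex and split() are reproduced here)."""

    _word_re = re.compile(r'\w+')

    def split(self, input: str) -> list[str]:
        return self._word_re.findall(input)


class DashFriendlySearch(BaseSearchLanguage):
    # Only the regex changes; the inherited split() picks it up
    # through self._word_re.
    _word_re = re.compile(r'[-\w]+')


default_words = BaseSearchLanguage().split("run bitbake-layers show-layers")
words = DashFriendlySearch().split("run bitbake-layers show-layers")
print(default_words)  # ['run', 'bitbake', 'layers', 'show', 'layers']
print(words)          # ['run', 'bitbake-layers', 'show-layers']
```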

> +    js_splitter_code = """
> +function splitQuery(query) {
> +    return query.split(/\\s+/g).filter(term => term.length > 0);

Why not simply add '-' to the default splitQuery from 
sphinx/themes/basic/static/searchtools.js, so we stay as close as 
possible to the original behavior, just with the added dash?
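
Something along these lines, as an untested sketch: the JavaScript is 
the default splitter quoted in the commit message with an escaped dash 
added to the character class. JS_SPLITTER_CODE is a made-up name for the 
example; in conf.py the string would be assigned to js_splitter_code in 
the class.

```python
# Hypothetical constant for illustration; in conf.py this string would be
# the value of the js_splitter_code class attribute.
# A raw string is used so the backslashes reach the JavaScript unchanged.
JS_SPLITTER_CODE = r"""
function splitQuery(query) {
    return query
        .split(/[^\p{Letter}\p{Number}_\p{Emoji_Presentation}\-]+/gu)
        .filter(term => term);  // remove remaining empty strings
}
"""

print(JS_SPLITTER_CODE)
```

That keeps splitting on punctuation other than the dash, which the 
whitespace-only version does not.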

Cheers,
Quentin

Patch

diff --git a/documentation/conf.py b/documentation/conf.py
index 2aceeb8e7..02397cd20 100644
--- a/documentation/conf.py
+++ b/documentation/conf.py
@@ -13,6 +13,7 @@ 
 # documentation root, use os.path.abspath to make it absolute, like shown here.
 #
 import os
+import re
 import sys
 import datetime
 try:
@@ -173,6 +174,26 @@  latex_elements = {
     'preamble': '\\usepackage[UTF8]{ctex}\n\\setcounter{tocdepth}{2}',
 }
 
+
+from sphinx.search import SearchLanguage
+from sphinx.search import languages
+class DashFriendlySearchLanguage(SearchLanguage):
+    lang = 'en'
+
+    # Accept words that can include hyphens
+    _word_re = re.compile(r'[\w-]+')
+
+    def split(self, input: str) -> list[str]:
+        return self._word_re.findall(input)
+
+    js_splitter_code = """
+function splitQuery(query) {
+    return query.split(/\\s+/g).filter(term => term.length > 0);
+}
+"""
+
+languages['en'] = DashFriendlySearchLanguage
+
 # Make the EPUB builder prefer PNG to SVG because of issues rendering Inkscape SVG
 from sphinx.builders.epub3 import Epub3Builder
 Epub3Builder.supported_image_types = ['image/png', 'image/gif', 'image/jpeg']