[v2] conf.py: tweak SearchEnglish to be hyphen-friendly

Message ID 20250520094514.2672646-1-ejo@pengutronix.de
State Accepted

Commit Message

Enrico Jörns May 20, 2025, 9:45 a.m. UTC
This modifies the default indexer split() and js splitQuery()
methods to support searching for words with hyphens.

While this might not be an ideal, rock-solid, fully future-proof
solution, it at least makes it possible to search for strings including
hyphens, such as 'bitbake-layers', 'send-error-report', or 'oe-core'.

Below is a more detailed explanation of the two modifications:

1) The default split regex in the sphinx-doc SearchLanguage base class
   is:

   | _word_re = re.compile(r'\w+')

   which we simply extend to include hyphens '-'.

   This will result in a searchindex.js that contains words with hyphens,
   too.

2) The 'searchtool.js' code notes for its splitQuery() implementation:

   | /**
   |  * Default splitQuery function. Can be overridden in ``sphinx.search`` with a
   |  * custom function per language.
   |  *
   |  * The regular expression works by splitting the string on consecutive characters
   |  * that are not Unicode letters, numbers, underscores, or emoji characters.
   |  * This is the same as ``\W+`` in Python, preserving the surrogate pair area.
   |  */
   | if (typeof splitQuery === "undefined") {
   |   var splitQuery = (query) => query
   |       .split(/[^\p{Letter}\p{Number}_\p{Emoji_Presentation}]+/gu)
   |       .filter(term => term)  // remove remaining empty strings
   | }

   The hook for this is documented in the sphinx-docs 'SearchLanguage'
   base class.

   |    .. attribute:: js_splitter_code
   |
   |       Return splitter function of JavaScript version.  The function should be
   |       named as ``splitQuery``.  And it should take a string and return list of
   |       strings.
   |
   |       .. versionadded:: 3.0

   We use this to define a splitQuery() function that reuses the default
   split regex, extended so that hyphens are no longer treated as
   separators.
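
As a rough illustration of both changes, the following Python sketch compares the two patterns on a sample string. It assumes the indexer applies `_word_re` with `findall()`, and it approximates the JavaScript splitter with Python's `\w` class because the `re` module does not support the `\p{...}` properties, so it mirrors the idea rather than the exact Unicode behaviour. The sample sentence and the `split_query` helper are made up for this example.

```python
import re

# Pattern quoted from the SearchLanguage base class, and the extended one.
default_word_re = re.compile(r'\w+')
hyphen_word_re = re.compile(r'[\w\-]+')

sample = "Run bitbake-layers or send-error-report from oe-core"

# Index side: the words each pattern extracts (assuming findall() is used).
default_words = default_word_re.findall(sample)
hyphen_words = hyphen_word_re.findall(sample)


def split_query(query):
    """Hypothetical Python stand-in for the hyphen-friendly splitQuery()."""
    # Split on runs of characters that are neither word characters nor '-',
    # then drop the empty strings left at the edges.
    return [term for term in re.split(r'[^\w\-]+', query) if term]


print(default_words)  # hyphenated names are broken into separate words
print(hyphen_words)   # hyphenated names stay whole
print(split_query("bitbake-layers  show-recipes"))
```

The point of changing both sides together is that the index and the query must agree: if only the index kept 'bitbake-layers' whole, a query split into 'bitbake' and 'layers' would no longer match that entry.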

We extend SearchEnglish (which extends SearchLanguage) here to retain
the stemmer code and stopwords for English.

[YOCTO #14534]

Signed-off-by: Enrico Jörns <ejo@pengutronix.de>
---

Changes v1 -> v2

* extend SearchEnglish instead of SearchLanguage to retain stemmer code
  and stopword handling (rename class accordingly)
* drop "lang = 'en'"
* Escape '-' in _word_re to prevent future misinterpretation as range
  symbol
* drop useless split() method override
* Use extended original regex for splitQuery() instead of using a custom
  one
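
To illustrate the concern behind escaping '-': inside a character class, an unescaped hyphen between two characters denotes a range. In `[\w\-]` the hyphen comes last, so it would be literal even without the backslash; the escape only guards against characters being appended to the class later. The patterns below are made up for this sketch and are not part of the patch.

```python
import re

# Unescaped: '+-.' is the range from '+' (0x2B) to '.' (0x2E),
# which also contains ',' (0x2C).
as_range = re.compile(r'[+-.]')

# Escaped: '-' is a literal hyphen, so only '+', '-' and '.' match.
as_literal = re.compile(r'[+\-.]')

text = "a+b,c-d.e"
range_hits = as_range.findall(text)
literal_hits = as_literal.findall(text)

print(range_hits)    # includes the comma
print(literal_hits)  # comma is not matched
```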

 documentation/conf.py | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

Comments

Antonin Godard May 27, 2025, 6:33 a.m. UTC | #1
On Tue, 20 May 2025 11:45:14 +0200, Enrico Jörns wrote:
> This modifies the default indexer split() and js splitQuery()
> methods to support searching for words with hyphens.
> 
> While this might not be an ideal, rock solid, and fully future-proof
> solution, it allows at least to search for strings including hyphens,
> such as 'bitbake-layers', 'send-error-report', or 'oe-core'.
> 
> [...]

Applied, thanks!

[1/1] conf.py: tweak SearchEnglish to be hyphen-friendly
      commit: d4a98ee19e0cbd6be96923dc72faee143a6b294b

Best regards,
Freihofer, Adrian June 6, 2025, 3:18 p.m. UTC | #2
On Tue, 2025-05-20 at 11:45 +0200, Enrico Jörns via
lists.yoctoproject.org wrote:
> [...]
>
> diff --git a/documentation/conf.py b/documentation/conf.py
> index 2aceeb8e7..ad60d9113 100644
> --- a/documentation/conf.py
> +++ b/documentation/conf.py
> @@ -13,6 +13,7 @@
>  # documentation root, use os.path.abspath to make it absolute, like shown here.
>  #
>  import os
> +import re
>  import sys
>  import datetime
>  try:
> @@ -173,6 +174,24 @@ latex_elements = {
>      'preamble': '\\usepackage[UTF8]{ctex}\n\\setcounter{tocdepth}{2}',
>  }
>  
> +
> +from sphinx.search import SearchEnglish
> +from sphinx.search import languages
> +class DashFriendlySearchEnglish(SearchEnglish):
> +
> +    # Accept words that can include hyphens
> +    _word_re = re.compile(r'[\w\-]+')
> +
> +    js_splitter_code = """
> +function splitQuery(query) {
> +    return query
> +        .split(/[^\p{Letter}\p{Number}_\p{Emoji_Presentation}-]+/gu)
> +        .filter(term => term.length > 0);
> +}
> +"""

With this patch I get this warning:

Running Sphinx v7.3.7
...poky/documentation/conf.py:188: SyntaxWarning: invalid escape sequence '\p'
  .split(/[^\p{Letter}\p{Number}_\p{Emoji_Presentation}-]+/gu)

Should that be a raw string? Like this:
+    js_splitter_code = r"""

Regards,
Adrian



> +
> +languages['en'] = DashFriendlySearchEnglish
> +
>  # Make the EPUB builder prefer PNG to SVG because of issues rendering Inkscape SVG
>  from sphinx.builders.epub3 import Epub3Builder
>  Epub3Builder.supported_image_types = ['image/png', 'image/gif', 'image/jpeg']
> 
Quentin Schulz June 6, 2025, 3:21 p.m. UTC | #3
Hi Adrian,

On 6/6/25 5:18 PM, Adrian Freihofer via lists.yoctoproject.org wrote:
> On Tue, 2025-05-20 at 11:45 +0200, Enrico Jörns via
> lists.yoctoproject.org wrote:
[...]
>> +    js_splitter_code = """
>> +function splitQuery(query) {
>> +    return query
>> +        .split(/[^\p{Letter}\p{Number}_\p{Emoji_Presentation}-]+/gu)
>> +        .filter(term => term.length > 0);
>> +}
>> +"""
> 
> With this patch I get this warning:
> 
> Running Sphinx v7.3.7
> ...poky/documentation/conf.py:188: SyntaxWarning: invalid escape sequence '\p'
>    .split(/[^\p{Letter}\p{Number}_\p{Emoji_Presentation}-]+/gu)
> 
> Should that be a raw string? Like this:
> +    js_splitter_code = r"""
> 

I believe so: 
https://lore.kernel.org/yocto-docs/20250606-conf-syntax-warning-v1-1-4ebc90ae7d69@cherry.de/T/#u

Cheers,
Quentin
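
The warning discussed in this thread can be reproduced without Sphinx. The sketch below compiles two one-line stand-ins for the conf.py assignment, one with a plain string and one with a raw string, and records the warnings each produces. The snippet text, the `<conf-snippet>` filename and the helper name are invented for this example; the warning category depends on the Python version (SyntaxWarning on recent releases, DeprecationWarning on older ones).

```python
import warnings

# Two one-line stand-ins for the conf.py assignment; only the r prefix differs.
PLAIN_SRC = r'''js = "/[^\p{Letter}-]+/gu"'''
RAW_SRC = r'''js = r"/[^\p{Letter}-]+/gu"'''


def compile_and_run(source):
    """Compile and execute source; return (warning category names, value of js)."""
    namespace = {}
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        code = compile(source, "<conf-snippet>", "exec")
        exec(code, namespace)
    return [w.category.__name__ for w in caught], namespace["js"]


plain_warnings, plain_value = compile_and_run(PLAIN_SRC)
raw_warnings, raw_value = compile_and_run(RAW_SRC)

print(plain_warnings)              # at least one invalid-escape warning
print(raw_warnings)                # no warnings
print(plain_value == raw_value)    # the resulting strings are identical
```

Because Python currently keeps the backslash of an unrecognized escape such as `\p`, both variants produce the same JavaScript text; the raw string simply states that intent explicitly and silences the warning.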

Patch

diff --git a/documentation/conf.py b/documentation/conf.py
index 2aceeb8e7..ad60d9113 100644
--- a/documentation/conf.py
+++ b/documentation/conf.py
@@ -13,6 +13,7 @@ 
 # documentation root, use os.path.abspath to make it absolute, like shown here.
 #
 import os
+import re
 import sys
 import datetime
 try:
@@ -173,6 +174,24 @@  latex_elements = {
     'preamble': '\\usepackage[UTF8]{ctex}\n\\setcounter{tocdepth}{2}',
 }
 
+
+from sphinx.search import SearchEnglish
+from sphinx.search import languages
+class DashFriendlySearchEnglish(SearchEnglish):
+
+    # Accept words that can include hyphens
+    _word_re = re.compile(r'[\w\-]+')
+
+    js_splitter_code = """
+function splitQuery(query) {
+    return query
+        .split(/[^\p{Letter}\p{Number}_\p{Emoji_Presentation}-]+/gu)
+        .filter(term => term.length > 0);
+}
+"""
+
+languages['en'] = DashFriendlySearchEnglish
+
 # Make the EPUB builder prefer PNG to SVG because of issues rendering Inkscape SVG
 from sphinx.builders.epub3 import Epub3Builder
 Epub3Builder.supported_image_types = ['image/png', 'image/gif', 'image/jpeg']