conf.py: tweak SearchLanguage to be hyphen-friendly

Message ID 20250502203501.3364973-1-ejo@pengutronix.de
State Under Review

Commit Message

Enrico Jörns May 2, 2025, 8:35 p.m. UTC
This modifies the default indexer split() and js splitQuery()
methods to support searching for words with hyphens.

While this might not be an ideal, rock-solid, or fully future-proof
solution, it at least allows searching for strings that include hyphens,
such as 'bitbake-layers', 'send-error-report', or 'oe-core'.

Below is a more detailed explanation of the two modifications:

1) The default split regex in the sphinx-doc SearchLanguage class is:

   | _word_re = re.compile(r'\w+')

   which we simply extend to include hyphens '-'.

   This will result in a searchindex.js that contains words with hyphens,
   too.

2) The 'searchtools.js' code notes for its splitQuery() implementation:

   | /**
   |  * Default splitQuery function. Can be overridden in ``sphinx.search`` with a
   |  * custom function per language.
   |  *
   |  * The regular expression works by splitting the string on consecutive characters
   |  * that are not Unicode letters, numbers, underscores, or emoji characters.
   |  * This is the same as ``\W+`` in Python, preserving the surrogate pair area.
   |  */
   | if (typeof splitQuery === "undefined") {
   |   var splitQuery = (query) => query
   |       .split(/[^\p{Letter}\p{Number}_\p{Emoji_Presentation}]+/gu)
   |       .filter(term => term)  // remove remaining empty strings
   | }

   The hook for this is documented in the sphinx-docs 'SearchLanguage'
   class.

   |    .. attribute:: js_splitter_code
   |
   |       Return splitter function of JavaScript version.  The function should be
   |       named as ``splitQuery``.  And it should take a string and return list of
   |       strings.
   |
   |       .. versionadded:: 3.0

   We use this to define a simplified splitQuery() function that splits
   the query on whitespace only.
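
   As a rough illustration of the indexer-side change in 1), the two word
   patterns can be compared outside of Sphinx. This is only a sketch: the
   sample sentence and the variable names are made up for the example, and
   only the two regular expressions come from the text above.

```python
import re

# Default word pattern of the Sphinx indexer, as quoted above.
default_word_re = re.compile(r'\w+')
# Extended pattern from this patch: hyphens count as word characters.
hyphen_word_re = re.compile(r'[\w-]+')

# Illustrative sample text (not taken from the documentation).
sample = "Run bitbake-layers show-layers from oe-core"

default_words = default_word_re.findall(sample)
hyphen_words = hyphen_word_re.findall(sample)

print(default_words)  # hyphenated names are broken into their parts
print(hyphen_words)   # hyphenated names are kept as single words
```

   With the default pattern, 'bitbake-layers' only reaches the index as the
   separate words 'bitbake' and 'layers'; with the extended pattern it is
   indexed as one word.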

[YOCTO #14534]

Signed-off-by: Enrico Jörns <ejo@pengutronix.de>
---
 documentation/conf.py | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

Comments

Quentin Schulz May 5, 2025, 9:11 a.m. UTC | #1
Hi Enrico,

On 5/2/25 10:35 PM, Enrico Jörns via lists.yoctoproject.org wrote:
> This modifies the default indexer split() and js splitQuery()
> methods to support searching for words with hyphens.
> 
> While this might not be an ideal, rock-solid, or fully future-proof
> solution, it at least allows searching for strings that include hyphens,
> such as 'bitbake-layers', 'send-error-report', or 'oe-core'.
> 
> Below is a bit more detailed explanation of the two modifications done:
> 
> 1) The default split regex in the sphinx-doc SearchLanguage class is:
> 
>     | _word_re = re.compile(r'\w+')
> 
>     which we simply extend to include hyphens '-'.
> 
>     This will result in a searchindex.js that contains words with hyphens,
>     too.
> 
> 2) The 'searchtool.js' code notes for its splitQuery() implementation:
> 
>     | /**
>     |  * Default splitQuery function. Can be overridden in ``sphinx.search`` with a
>     |  * custom function per language.
>     |  *
>     |  * The regular expression works by splitting the string on consecutive characters
>     |  * that are not Unicode letters, numbers, underscores, or emoji characters.
>     |  * This is the same as ``\W+`` in Python, preserving the surrogate pair area.
>     |  */
>     | if (typeof splitQuery === "undefined") {
>     |   var splitQuery = (query) => query
>     |       .split(/[^\p{Letter}\p{Number}_\p{Emoji_Presentation}]+/gu)
>     |       .filter(term => term)  // remove remaining empty strings
>     | }
> 
>     The hook for this is documented in the sphinx-docs 'SearchLanguage'
>     class.
> 
>     |    .. attribute:: js_splitter_code
>     |
>     |       Return splitter function of JavaScript version.  The function should be
>     |       named as ``splitQuery``.  And it should take a string and return list of
>     |       strings.
>     |
>     |       .. versionadded:: 3.0
> 
>     We use this to define a simplified splitQuery() function with a split
>     argument that splits on empty spaces only.
> 
> [YOCTO #14534]
> 
> Signed-off-by: Enrico Jörns <ejo@pengutronix.de>
> ---
>   documentation/conf.py | 21 +++++++++++++++++++++
>   1 file changed, 21 insertions(+)
> 
> diff --git a/documentation/conf.py b/documentation/conf.py
> index 2aceeb8e7..02397cd20 100644
> --- a/documentation/conf.py
> +++ b/documentation/conf.py
> @@ -13,6 +13,7 @@
>   # documentation root, use os.path.abspath to make it absolute, like shown here.
>   #
>   import os
> +import re
>   import sys
>   import datetime
>   try:
> @@ -173,6 +174,26 @@ latex_elements = {
>       'preamble': '\\usepackage[UTF8]{ctex}\n\\setcounter{tocdepth}{2}',
>   }
>   
> +
> +from sphinx.search import SearchLanguage
> +from sphinx.search import languages
> +class DashFriendlySearchLanguage(SearchLanguage):

Could you extend SearchEnglish instead? I believe that is what we
should be using today.

> +    lang = 'en'
> +

This would then be redundant.

> +    # Accept words that can include hyphens
> +    _word_re = re.compile(r'[\w-]+')

I would recommend putting the dash first in the character class, so that
it cannot inadvertently become a range if we ever extend this regex; cf.
the [] section of https://docs.python.org/3/library/re.html#module-re.
Or simply escape it, so it's explicit what we expect from it :)
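
A quick sketch of the pitfall, assuming a hypothetical later edit that adds
'.' and '/' to the character class (these extensions are not part of this
patch and only serve as an example):

```python
import re

# Hypothetical extension with the hyphen left in the middle: ".-/" is now
# a range covering only '.' and '/', so '-' itself no longer matches.
accidental_range = re.compile(r'[\w.-/]+')
# Hyphen as the first character of the class: always a literal '-'.
hyphen_first = re.compile(r'[-\w./]+')
# Escaped hyphen: a literal '-' regardless of its position.
hyphen_escaped = re.compile(r'[\w.\-/]+')

term = 'oe-core'
range_result = accidental_range.findall(term)
first_result = hyphen_first.findall(term)
escaped_result = hyphen_escaped.findall(term)

print(range_result, first_result, escaped_result)
```

The first pattern compiles without complaint yet splits 'oe-core' again,
which is the kind of silent regression I'd like to rule out.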

> +
> +    def split(self, input: str) -> list[str]:
> +        return self._word_re.findall(input)
> +

This is already what the split method from SearchLanguage does, 
therefore no need to override it.

> +    js_splitter_code = """
> +function splitQuery(query) {
> +    return query.split(/\\s+/g).filter(term => term.length > 0);

Why not simply add '-' to the default splitQuery as in 
sphinx/themes/basic/static/searchtools.js so we keep as close as 
possible to the original behavior, just with the added dash?
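
To make the difference concrete, here is a rough Python emulation of the two
query splitters. It is only a sketch: Python's \w merely approximates the
JavaScript character class quoted above, and the function names and the
sample query are made up for the example.

```python
import re

def split_on_whitespace(query: str) -> list[str]:
    # Emulates the splitQuery() from this patch: split on whitespace only.
    return [term for term in re.split(r'\s+', query) if term]

def split_default_plus_hyphen(query: str) -> list[str]:
    # Emulates the default splitter with '-' added to the kept characters.
    return [term for term in re.split(r'[^\w-]+', query) if term]

# Illustrative query containing punctuation around the hyphenated terms.
query = 'bitbake-layers, "oe-core"'

whitespace_terms = split_on_whitespace(query)
default_like_terms = split_default_plus_hyphen(query)

print(whitespace_terms)
print(default_like_terms)
```

Splitting on whitespace only leaves the comma and the quotes attached to the
terms, whereas the default-plus-hyphen variant still strips them.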

Cheers,
Quentin
Antonin Godard May 12, 2025, 3:19 p.m. UTC | #2
Hi Enrico,

On Fri May 2, 2025 at 10:35 PM CEST, Enrico Jörns wrote:
> This modifies the default indexer split() and js splitQuery()
> methods to support searching for words with hyphens.
>
> While this might not be an ideal, rock-solid, or fully future-proof
> solution, it at least allows searching for strings that include hyphens,
> such as 'bitbake-layers', 'send-error-report', or 'oe-core'.
>
> Below is a bit more detailed explanation of the two modifications done:
>
> 1) The default split regex in the sphinx-doc SearchLanguage class is:
>
>    | _word_re = re.compile(r'\w+')
>
>    which we simply extend to include hyphens '-'.
>
>    This will result in a searchindex.js that contains words with hyphens,
>    too.
>
> 2) The 'searchtool.js' code notes for its splitQuery() implementation:
>
>    | /**
>    |  * Default splitQuery function. Can be overridden in ``sphinx.search`` with a
>    |  * custom function per language.
>    |  *
>    |  * The regular expression works by splitting the string on consecutive characters
>    |  * that are not Unicode letters, numbers, underscores, or emoji characters.
>    |  * This is the same as ``\W+`` in Python, preserving the surrogate pair area.
>    |  */
>    | if (typeof splitQuery === "undefined") {
>    |   var splitQuery = (query) => query
>    |       .split(/[^\p{Letter}\p{Number}_\p{Emoji_Presentation}]+/gu)
>    |       .filter(term => term)  // remove remaining empty strings
>    | }
>
>    The hook for this is documented in the sphinx-docs 'SearchLanguage'
>    class.
>
>    |    .. attribute:: js_splitter_code
>    |
>    |       Return splitter function of JavaScript version.  The function should be
>    |       named as ``splitQuery``.  And it should take a string and return list of
>    |       strings.
>    |
>    |       .. versionadded:: 3.0
>
>    We use this to define a simplified splitQuery() function with a split
>    argument that splits on empty spaces only.
>
> [YOCTO #14534]
>
> Signed-off-by: Enrico Jörns <ejo@pengutronix.de>
> ---
>  documentation/conf.py | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
>
> diff --git a/documentation/conf.py b/documentation/conf.py
> index 2aceeb8e7..02397cd20 100644
> --- a/documentation/conf.py
> +++ b/documentation/conf.py
> @@ -13,6 +13,7 @@
>  # documentation root, use os.path.abspath to make it absolute, like shown here.
>  #
>  import os
> +import re
>  import sys
>  import datetime
>  try:
> @@ -173,6 +174,26 @@ latex_elements = {
>      'preamble': '\\usepackage[UTF8]{ctex}\n\\setcounter{tocdepth}{2}',
>  }
>  
> +
> +from sphinx.search import SearchLanguage
> +from sphinx.search import languages
> +class DashFriendlySearchLanguage(SearchLanguage):
> +    lang = 'en'
> +
> +    # Accept words that can include hyphens
> +    _word_re = re.compile(r'[\w-]+')
> +
> +    def split(self, input: str) -> list[str]:
> +        return self._word_re.findall(input)
> +
> +    js_splitter_code = """
> +function splitQuery(query) {
> +    return query.split(/\\s+/g).filter(term => term.length > 0);
> +}
> +"""
> +
> +languages['en'] = DashFriendlySearchLanguage
> +
>  # Make the EPUB builder prefer PNG to SVG because of issues rendering Inkscape SVG
>  from sphinx.builders.epub3 import Epub3Builder
>  Epub3Builder.supported_image_types = ['image/png', 'image/gif', 'image/jpeg']

This is nice, thanks. I've tested it with "bitbake-layers" and "ide-sdk", and it
gives me much better results.

The generated searchindex.js file grows from 936 bytes to 1180 bytes. I think we
can live with that, though; just an observation.

Also a side-note: last time I checked, the kernel documentation had the same
problem. Maybe a worthy contribution over there too?

Antonin
Antonin Godard May 12, 2025, 3:19 p.m. UTC | #3
Hi Quentin,

On Mon May 5, 2025 at 11:11 AM CEST, Quentin Schulz via lists.yoctoproject.org wrote:
> Hi Enrico,
>
> On 5/2/25 10:35 PM, Enrico Jörns via lists.yoctoproject.org wrote:
>> This modifies the default indexer split() and js splitQuery()
>> methods to support searching for words with hyphens.
>> 
>> While this might not be an ideal, rock-solid, or fully future-proof
>> solution, it at least allows searching for strings that include hyphens,
>> such as 'bitbake-layers', 'send-error-report', or 'oe-core'.
>> 
>> Below is a bit more detailed explanation of the two modifications done:
>> 
>> 1) The default split regex in the sphinx-doc SearchLanguage class is:
>> 
>>     | _word_re = re.compile(r'\w+')
>> 
>>     which we simply extend to include hyphens '-'.
>> 
>>     This will result in a searchindex.js that contains words with hyphens,
>>     too.
>> 
>> 2) The 'searchtool.js' code notes for its splitQuery() implementation:
>> 
>>     | /**
>>     |  * Default splitQuery function. Can be overridden in ``sphinx.search`` with a
>>     |  * custom function per language.
>>     |  *
>>     |  * The regular expression works by splitting the string on consecutive characters
>>     |  * that are not Unicode letters, numbers, underscores, or emoji characters.
>>     |  * This is the same as ``\W+`` in Python, preserving the surrogate pair area.
>>     |  */
>>     | if (typeof splitQuery === "undefined") {
>>     |   var splitQuery = (query) => query
>>     |       .split(/[^\p{Letter}\p{Number}_\p{Emoji_Presentation}]+/gu)
>>     |       .filter(term => term)  // remove remaining empty strings
>>     | }
>> 
>>     The hook for this is documented in the sphinx-docs 'SearchLanguage'
>>     class.
>> 
>>     |    .. attribute:: js_splitter_code
>>     |
>>     |       Return splitter function of JavaScript version.  The function should be
>>     |       named as ``splitQuery``.  And it should take a string and return list of
>>     |       strings.
>>     |
>>     |       .. versionadded:: 3.0
>> 
>>     We use this to define a simplified splitQuery() function with a split
>>     argument that splits on empty spaces only.
>> 
>> [YOCTO #14534]
>> 
>> Signed-off-by: Enrico Jörns <ejo@pengutronix.de>
>> ---
>>   documentation/conf.py | 21 +++++++++++++++++++++
>>   1 file changed, 21 insertions(+)
>> 
>> diff --git a/documentation/conf.py b/documentation/conf.py
>> index 2aceeb8e7..02397cd20 100644
>> --- a/documentation/conf.py
>> +++ b/documentation/conf.py
>> @@ -13,6 +13,7 @@
>>   # documentation root, use os.path.abspath to make it absolute, like shown here.
>>   #
>>   import os
>> +import re
>>   import sys
>>   import datetime
>>   try:
>> @@ -173,6 +174,26 @@ latex_elements = {
>>       'preamble': '\\usepackage[UTF8]{ctex}\n\\setcounter{tocdepth}{2}',
>>   }
>>   
>> +
>> +from sphinx.search import SearchLanguage
>> +from sphinx.search import languages
>> +class DashFriendlySearchLanguage(SearchLanguage):
>
> Could you extend SearchEnglish instead, which is I believe what we 
> should be using today?

A question on top of this: using SearchLanguage instead of SearchEnglish results
in _not_ using the stopwords and stemmer code defined in SearchEnglish. Is that
intentional? Maybe there is a good reason for not using SearchEnglish, but I'm
curious.

Antonin
Patch

diff --git a/documentation/conf.py b/documentation/conf.py
index 2aceeb8e7..02397cd20 100644
--- a/documentation/conf.py
+++ b/documentation/conf.py
@@ -13,6 +13,7 @@ 
 # documentation root, use os.path.abspath to make it absolute, like shown here.
 #
 import os
+import re
 import sys
 import datetime
 try:
@@ -173,6 +174,26 @@  latex_elements = {
     'preamble': '\\usepackage[UTF8]{ctex}\n\\setcounter{tocdepth}{2}',
 }
 
+
+from sphinx.search import SearchLanguage
+from sphinx.search import languages
+class DashFriendlySearchLanguage(SearchLanguage):
+    lang = 'en'
+
+    # Accept words that can include hyphens
+    _word_re = re.compile(r'[\w-]+')
+
+    def split(self, input: str) -> list[str]:
+        return self._word_re.findall(input)
+
+    js_splitter_code = """
+function splitQuery(query) {
+    return query.split(/\\s+/g).filter(term => term.length > 0);
+}
+"""
+
+languages['en'] = DashFriendlySearchLanguage
+
 # Make the EPUB builder prefer PNG to SVG because of issues rendering Inkscape SVG
 from sphinx.builders.epub3 import Epub3Builder
 Epub3Builder.supported_image_types = ['image/png', 'image/gif', 'image/jpeg']