Message ID | 20250502203501.3364973-1-ejo@pengutronix.de |
---|---|
State | Under Review |
Series | conf.py: tweak SearchLanguage to be hyphen-friendly |
Hi Enrico,

On 5/2/25 10:35 PM, Enrico Jörns via lists.yoctoproject.org wrote:
> This modifies the default indexer split() and js splitQuery()
> methods to support searching for words with hyphens.
>
> While this might not be an ideal, rock-solid, or fully future-proof
> solution, it at least allows searching for strings that include hyphens,
> such as 'bitbake-layers', 'send-error-report', or 'oe-core'.
>
> [...]
>
> +
> +from sphinx.search import SearchLanguage
> +from sphinx.search import languages
> +class DashFriendlySearchLanguage(SearchLanguage):

Could you extend SearchEnglish instead, which is I believe what we should be
using today?

> +    lang = 'en'
> +

This would then be redundant.

> +    # Accept words that can include hyphens
> +    _word_re = re.compile(r'[\w-]+')

I would recommend to have the dash as first character in case we ever want to
expand this regex to avoid inadvertently making it a range, c.f.
https://docs.python.org/3/library/re.html#module-re [] section. Or simply
escape it so it's explicit what we're expecting from it :)

> +
> +    def split(self, input: str) -> list[str]:
> +        return self._word_re.findall(input)
> +

This is already what the split method from SearchLanguage does, therefore no
need to override it.

> +    js_splitter_code = """
> +function splitQuery(query) {
> +    return query.split(/\\s+/g).filter(term => term.length > 0);

Why not simply add '-' to the default splitQuery as in
sphinx/themes/basic/static/searchtools.js so we keep as close as possible to
the original behavior, just with the added dash?

Cheers,
Quentin
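For readers following the thread, here is a minimal sketch of what Quentin's
last suggestion could look like in conf.py. It reuses the upstream splitQuery()
regex quoted in the patch description and only adds an escaped dash to the
character class; this is an assumed illustration, not the actual follow-up
patch:

# Sketch only: keep the upstream splitQuery() from
# sphinx/themes/basic/static/searchtools.js and merely add the escaped
# dash '\-' to the negated character class.
js_splitter_code = r"""
function splitQuery(query) {
    return query
        .split(/[^\p{Letter}\p{Number}_\-\p{Emoji_Presentation}]+/gu)
        .filter(term => term.length > 0);  // drop remaining empty strings
}
"""

Compared to the whitespace-only splitter in the patch, this keeps the upstream
behavior of also splitting queries on punctuation such as '.', ',' or '/', with
the dash as the only new word character.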
Hi Enrico,

On Fri May 2, 2025 at 10:35 PM CEST, Enrico Jörns wrote:
> This modifies the default indexer split() and js splitQuery()
> methods to support searching for words with hyphens.
>
> While this might not be an ideal, rock-solid, or fully future-proof
> solution, it at least allows searching for strings that include hyphens,
> such as 'bitbake-layers', 'send-error-report', or 'oe-core'.
>
> [...]
>
> +languages['en'] = DashFriendlySearchLanguage
> +
>  # Make the EPUB builder prefer PNG to SVG because of issues rendering Inkscape SVG
>  from sphinx.builders.epub3 import Epub3Builder
>  Epub3Builder.supported_image_types = ['image/png', 'image/gif', 'image/jpeg']

This is nice, thanks. I've tested it with "bitbake-layers" or "ide-sdk", and
it gives me much better results.

The searchindex.js generated file grows from 936 bytes to 1180 bytes, I think
we can live with that though - just an observation.

Also a side-note: last time I checked, the kernel documentation had the same
problem. Maybe a worthy contribution over there too?

Antonin
Hi Quentin,

On Mon May 5, 2025 at 11:11 AM CEST, Quentin Schulz via lists.yoctoproject.org wrote:
> Hi Enrico,
>
> On 5/2/25 10:35 PM, Enrico Jörns via lists.yoctoproject.org wrote:
>> This modifies the default indexer split() and js splitQuery()
>> methods to support searching for words with hyphens.
>>
>> [...]
>>
>> +from sphinx.search import SearchLanguage
>> +from sphinx.search import languages
>> +class DashFriendlySearchLanguage(SearchLanguage):
>
> Could you extend SearchEnglish instead, which is I believe what we
> should be using today?

A question on top of this: using SearchLanguage instead of SearchEnglish
results in _not_ using the stopwords and stemmer code defined in
SearchEnglish. Is that intentional? Maybe there is a good reason for not
using SearchEnglish, but I'm curious.

Antonin
Hi folks,

On Monday, 12.05.2025 at 17:19 +0200, Antonin Godard wrote:
> > > Signed-off-by: Enrico Jörns <ejo@pengutronix.de>
> > > ---
> > >  documentation/conf.py | 21 +++++++++++++++++++++
> > >  1 file changed, 21 insertions(+)
> > >
> > > [...]
> > >
> > > +from sphinx.search import SearchLanguage
> > > +from sphinx.search import languages
> > > +class DashFriendlySearchLanguage(SearchLanguage):
> >
> > Could you extend SearchEnglish instead, which is I believe what we
> > should be using today?
>
> A question on top of this: using SearchLanguage instead of SearchEnglish
> results in _not_ using the stopwords and stemmer code defined in
> SearchEnglish. Is that intentional? Maybe there is a good reason for not
> using SearchEnglish, but I'm curious.

I have no strong opinion on this or what's better for a technical
documentation.

Searching for 'effectiveness' in the documentation with stemmer enabled gives
you results like 'effectively', etc. Surprisingly, the exact occurrence does
not match for me anymore and the other results are not highlighted...

But maybe with the intention to be least invasive, using SearchEnglish is fine
anyway.

Regards,
Enrico

> Antonin
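For context on the stemming behaviour Enrico describes: Sphinx's SearchEnglish
stems both the indexed words and the query terms (recent Sphinx releases rely
on the snowballstemmer package for this), so different word forms collapse
onto one index entry. A standalone illustration, not part of the patch:

import snowballstemmer

stemmer = snowballstemmer.stemmer('english')
print(stemmer.stemWords(['effectiveness', 'effectively', 'effective']))
# All three are expected to reduce to the same stem (roughly 'effect'),
# which is why a search for 'effectiveness' also surfaces pages that
# only contain 'effectively'.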
Hi Quentin,

On Monday, 05.05.2025 at 11:11 +0200, Quentin Schulz wrote:
> > ---
> >  documentation/conf.py | 21 +++++++++++++++++++++
> >  1 file changed, 21 insertions(+)
> >
> > [...]
> >
> > +from sphinx.search import SearchLanguage
> > +from sphinx.search import languages
> > +class DashFriendlySearchLanguage(SearchLanguage):
>
> Could you extend SearchEnglish instead, which is I believe what we
> should be using today?

yes, fine for me.

> > +    lang = 'en'
> > +
>
> This would then be redundant.

Ok.

> > +    # Accept words that can include hyphens
> > +    _word_re = re.compile(r'[\w-]+')
>
> I would recommend to have the dash as first character in case we ever
> want to expand this regex to avoid inadvertently making it a range, c.f.
> https://docs.python.org/3/library/re.html#module-re [] section. Or
> simply escape it so it's explicit what we're expecting from it :)

Yes, might save us from later surprises.

> > +
> > +    def split(self, input: str) -> list[str]:
> > +        return self._word_re.findall(input)
> > +
>
> This is already what the split method from SearchLanguage does,
> therefore no need to override it.

Indeed. No idea why I thought this is required.

> > +    js_splitter_code = """
> > +function splitQuery(query) {
> > +    return query.split(/\\s+/g).filter(term => term.length > 0);
>
> Why not simply add '-' to the default splitQuery as in
> sphinx/themes/basic/static/searchtools.js so we keep as close as
> possible to the original behavior, just with the added dash?

Wasn't sure if including or excluding is better, but maybe having it closer
to the original one makes sense.

Will send a v2 soon.

Regards,
Enrico

> Cheers,
> Quentin
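The v2 itself is not part of this thread; purely for orientation, here is a
sketch of how the review comments above could combine in documentation/conf.py.
The class name DashFriendlySearchEnglish and the exact regex escaping are
assumptions, not the actual follow-up patch:

import re

from sphinx.search import languages
from sphinx.search.en import SearchEnglish  # keeps English stopwords and stemmer

class DashFriendlySearchEnglish(SearchEnglish):
    # Accept words that may include hyphens; the dash is escaped so it can
    # never be misread as a character range if the class is extended later.
    _word_re = re.compile(r'[\w\-]+')

    # Stay close to the upstream splitQuery() from searchtools.js and only
    # add the dash to the character class.
    js_splitter_code = r"""
function splitQuery(query) {
    return query
        .split(/[^\p{Letter}\p{Number}_\-\p{Emoji_Presentation}]+/gu)
        .filter(term => term.length > 0);
}
"""

languages['en'] = DashFriendlySearchEnglish

Subclassing SearchEnglish would also address Antonin's stopword/stemmer
question: both are inherited unchanged, and only the notion of a 'word' is
widened.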
diff --git a/documentation/conf.py b/documentation/conf.py
index 2aceeb8e7..02397cd20 100644
--- a/documentation/conf.py
+++ b/documentation/conf.py
@@ -13,6 +13,7 @@
 # documentation root, use os.path.abspath to make it absolute, like shown here.
 #
 import os
+import re
 import sys
 import datetime
 try:
@@ -173,6 +174,26 @@ latex_elements = {
     'preamble': '\\usepackage[UTF8]{ctex}\n\\setcounter{tocdepth}{2}',
 }
 
+
+from sphinx.search import SearchLanguage
+from sphinx.search import languages
+class DashFriendlySearchLanguage(SearchLanguage):
+    lang = 'en'
+
+    # Accept words that can include hyphens
+    _word_re = re.compile(r'[\w-]+')
+
+    def split(self, input: str) -> list[str]:
+        return self._word_re.findall(input)
+
+    js_splitter_code = """
+function splitQuery(query) {
+    return query.split(/\\s+/g).filter(term => term.length > 0);
+}
+"""
+
+languages['en'] = DashFriendlySearchLanguage
+
 # Make the EPUB builder prefer PNG to SVG because of issues rendering Inkscape SVG
 from sphinx.builders.epub3 import Epub3Builder
 Epub3Builder.supported_image_types = ['image/png', 'image/gif', 'image/jpeg']
This modifies the default indexer split() and js splitQuery()
methods to support searching for words with hyphens.

While this might not be an ideal, rock-solid, or fully future-proof
solution, it at least allows searching for strings that include hyphens,
such as 'bitbake-layers', 'send-error-report', or 'oe-core'.

Below is a more detailed explanation of the two modifications:

1) The default split regex in the sphinx-doc SearchLanguage class is:

| _word_re = re.compile(r'\w+')

which we simply extend to include hyphens '-'.

This will result in a searchindex.js that contains words with hyphens,
too.

2) The 'searchtools.js' code notes for its splitQuery() implementation:

| /**
|  * Default splitQuery function. Can be overridden in ``sphinx.search`` with a
|  * custom function per language.
|  *
|  * The regular expression works by splitting the string on consecutive characters
|  * that are not Unicode letters, numbers, underscores, or emoji characters.
|  * This is the same as ``\W+`` in Python, preserving the surrogate pair area.
|  */
| if (typeof splitQuery === "undefined") {
|   var splitQuery = (query) => query
|       .split(/[^\p{Letter}\p{Number}_\p{Emoji_Presentation}]+/gu)
|       .filter(term => term)  // remove remaining empty strings
| }

The hook for this is documented in the sphinx-docs 'SearchLanguage'
class:

| .. attribute:: js_splitter_code
|
|    Return splitter function of JavaScript version. The function should be
|    named as ``splitQuery``. And it should take a string and return list of
|    strings.
|
|    .. versionadded:: 3.0

We use this to define a simplified splitQuery() function whose split()
call splits on whitespace only.

[YOCTO #14534]

Signed-off-by: Enrico Jörns <ejo@pengutronix.de>
---
 documentation/conf.py | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)
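To make the effect of modification (1) concrete, here is a small standalone
comparison of the stock regex and the patched one (plain Python, illustration
only):

import re

default_word_re = re.compile(r'\w+')     # stock SearchLanguage behaviour
patched_word_re = re.compile(r'[\w-]+')  # the regex added by this patch

text = "Use bitbake-layers and send-error-report from oe-core."

print(default_word_re.findall(text))
# ['Use', 'bitbake', 'layers', 'and', 'send', 'error', 'report', 'from', 'oe', 'core']

print(patched_word_re.findall(text))
# ['Use', 'bitbake-layers', 'and', 'send-error-report', 'from', 'oe-core']

With the stock regex, 'bitbake-layers' is indexed as the two separate terms
'bitbake' and 'layers', which is why a query for the hyphenated name finds no
exact match.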