[PATCH] docs: conf.py: tweak SearchEnglish to be hyphen- and dot-friendly

mail archive of the barebox mailing list

From: Enrico Jörns @ 2025-05-27  7:52 UTC
  To: barebox; +Cc: ejo

This modifies the default indexer split() method and the JavaScript
splitQuery() function to support searching for words with 'inner'
hyphens or dots.

While this is not a rock-solid, fully future-proof solution, since it
relies on some upstream sphinx-doc methods continuing to exist, it
allows searching for strings that include hyphens and dots, such as
'OP-TEE', 'nv.bootchooser.last_chosen', or 'barebox-state'.

The two modifications in more detail:

1) The default split regex in the sphinx-doc SearchLanguage base class
   is:

   | _word_re = re.compile(r'\w+')

   which we extend to include words with inner hyphens '-' and dots '.':

   | _word_re = re.compile(r'\w+(?:[\.\-]\w+)*')

   This will result in a searchindex.js that contains words with hyphens
   and dots.

2) The 'searchtool.js' code notes for its splitQuery() implementation:

   | /**
   |  * Default splitQuery function. Can be overridden in ``sphinx.search`` with a
   |  * custom function per language.
   |  *
   |  * The regular expression works by splitting the string on consecutive characters
   |  * that are not Unicode letters, numbers, underscores, or emoji characters.
   |  * This is the same as ``\W+`` in Python, preserving the surrogate pair area.
   |  */
   | if (typeof splitQuery === "undefined") {
   |   var splitQuery = (query) => query
   |       .split(/[^\p{Letter}\p{Number}_\p{Emoji_Presentation}]+/gu)
   |       .filter(term => term)  // remove remaining empty strings
   | }

   The hook for overriding it is documented in the sphinx-doc
   'SearchLanguage' base class:

   |    .. attribute:: js_splitter_code
   |
   |       Return splitter function of JavaScript version.  The function should be
   |       named as ``splitQuery``.  And it should take a string and return list of
   |       strings.
   |
   |       .. versionadded:: 3.0

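   We use this to define a splitQuery() function whose split regex
   additionally keeps hyphens '-' and dots '.' as part of a term, so
   that a query such as 'barebox-state' is looked up as a single word
   instead of being split at the hyphen.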
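
The effect of both changes can be tried outside of Sphinx. The following
is only a sketch: index_words() and split_query() are hypothetical helper
names, index_words() assumes that the indexer simply collects all matches
of _word_re, and split_query() is a rough Python approximation of the
JavaScript regex ('\w' stands in for the \p{...} classes, emoji handling
is ignored).

```python
import re

# Default word regex of the SearchLanguage base class (quoted above).
DEFAULT_WORD_RE = re.compile(r'\w+')
# Extended regex from this patch: words may contain inner '-' and '.'.
EXTENDED_WORD_RE = re.compile(r'\w+(?:[\.\-]\w+)*')


def index_words(text, word_re):
    # Assumption: the indexer collects every match of the word regex.
    return word_re.findall(text)


def split_query(query):
    # Approximation of the patched splitQuery(): split on runs of
    # characters that are neither word characters nor '-' or '.',
    # then drop empty strings.
    return [term for term in re.split(r'[^\w\-.]+', query) if term]


text = "Set nv.bootchooser.last_chosen, then use barebox-state with OP-TEE."

print(index_words(text, DEFAULT_WORD_RE))
print(index_words(text, EXTENDED_WORD_RE))
print(split_query("OP-TEE nv.bootchooser.last_chosen"))
```

With the default regex the sample text is indexed as 'nv', 'bootchooser',
'last_chosen', 'barebox', 'state', 'OP', 'TEE', and so on. With the
extended regex, 'nv.bootchooser.last_chosen', 'barebox-state', and
'OP-TEE' each stay one word, while the trailing '.' of the sentence is
still dropped because it is not followed by a word character.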

We extend SearchEnglish (which extends SearchLanguage) here to retain
the stemmer code and stopwords for English.

Signed-off-by: Enrico Jörns <ejo@pengutronix.de>
---
 Documentation/conf.py | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/Documentation/conf.py b/Documentation/conf.py
index 5fb8b07c38..01c430dfa6 100644
--- a/Documentation/conf.py
+++ b/Documentation/conf.py
@@ -14,6 +14,7 @@

 import sys
 import os
+import re

 # If extensions (or modules to document with autodoc) are in another directory,
 # add these directories to sys.path here. If the directory is relative to the
@@ -260,3 +261,20 @@ texinfo_documents = [
 #texinfo_no_detailmenu = False

 highlight_language = 'none'
+
+from sphinx.search import SearchEnglish
+from sphinx.search import languages
+class DashFriendlySearchEnglish(SearchEnglish):
+
+    # Accept words that can include 'inner' hyphens or dots
+    _word_re = re.compile(r'[\w]+(?:[\.\-][\w]+)*')
+
+    js_splitter_code = r"""
+function splitQuery(query) {
+    return query
+        .split(/[^\p{Letter}\p{Number}_\p{Emoji_Presentation}\-\.]+/gu)
+        .filter(term => term.length > 0);
+}
+"""
+
+languages['en'] = DashFriendlySearchEnglish
-- 
2.39.5
