Skip to content

Conversation

@PouyaMohseni
Copy link
Contributor

  • Added apply_highlights method to InstrumentNameSet to apply Solr highlight snippets to each instrument name.
  • Replaced matching parts of instrument names with highlighted <b> text from Solr.
  • Updated SolrInstrument to store highlights and pass them to InstrumentNameSet.
  • Updated template to render instrument names with applied highlights.

resolves #343

example

]

# Extract highlight info
highlights = getattr(solr_response, "highlighting", {})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally prefer to rename highlights to highlight_info or highlight_data.

"hl.fl": "text",
"hl.simple.pre": "<b>",
"hl.simple.post": "</b>",
"hl.snippets": 1000,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1000 is too much...50 should be sufficient for most instruments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some instruments like guitar have labels in 200 languages, adding aliases it can increase even more (many of them share a common ar). Although for one language we might not have more than 10 labels, we are not sure if our highlight query hits all of the labels in the selected language.

As an example, 50 would not be enough for a query of ar in guitar as it hits more then 100 times.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, switching to token-level highlights makes more sense.

Comment on lines 303 to 306
highlight_map = {
snippet.replace("<b>", "").replace("</b>", ""): snippet
for snippet in hl_snippets
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This maps the whole snippets, which can be unreliable. For example, instead of creating a map

hl_snippets = [ 
       "Stradivarius <b>Violin</b>",  
      "<b>Violin</b> Concerto in D Major" 
     ]  

It's better to use

"Violin": "<b>Violin</b>" 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand why this would be unreliable. Aren't both highlight and instrumentname_set solr outputs? Also, if we want to reduce these hits ("Stradivarius <b>Violin</b>", "<b>Violin</b> Concerto in D Major" ) to (<b>Violin</b>), isn't it better to define a new field in solr storing tokenized?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, both values come from Solr. My concern is that full highlight snippets are context-dependent and may include surrounding text, which makes them brittle as mapping keys. Mapping at the matched-term level (e.g. "Violin" → "Violin") is more robust and less coupled to Solr’s snippet construction. And yes, tokenizers would be a better solution.

Comment on lines 303 to 306
highlight_map = {
snippet.replace("<b>", "").replace("</b>", ""): snippet
for snippet in hl_snippets
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, both values come from Solr. My concern is that full highlight snippets are context-dependent and may include surrounding text, which makes them brittle as mapping keys. Mapping at the matched-term level (e.g. "Violin" → "Violin") is more robust and less coupled to Solr’s snippet construction. And yes, tokenizers would be a better solution.

"hl.fl": "text",
"hl.simple.pre": "<b>",
"hl.simple.post": "</b>",
"hl.snippets": 1000,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, switching to token-level highlights makes more sense.

Comment on lines 71 to 77
for name in self._names:
for orig, highlighted in highlight_dict.items():
if orig and orig.lower() in name.lower():
name = re.sub(
re.escape(orig), highlighted, name, flags=re.IGNORECASE
)
updated_names.append(name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also realize this is O(nm)...not great. And this doesn't highlight harp in harpsichord

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed it would be a bad idea to create a different tokenization for highlighted field in Solr indexing as this field would need to replicate our query structure to able to highlight as intended i(e.g. containing Ngram). Maybe it is better to keep the highlighted field as our query field but apply deduplication/tokenization as post-processing for a more robust highlighting.

- Added `apply_highlights` method to `InstrumentNameSet` to apply Solr highlight snippets to each instrument name.
- Replaced matching parts of instrument names with highlighted `<b>` text from Solr.
- Updated `SolrInstrument` to store highlights and pass them to `InstrumentNameSet`.

resolves #343
@PouyaMohseni PouyaMohseni force-pushed the highlight-matched-test branch from bbe332f to f6db6fe Compare December 21, 2025 16:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Search] Show the word that matches the search query

3 participants