-
Notifications
You must be signed in to change notification settings - Fork 2
feat: render Solr highlights as bold in instrument list #466
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
| ] | ||
|
|
||
| # Extract highlight info | ||
| highlights = getattr(solr_response, "highlighting", {}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally prefer to rename highlights to highlight_info or highlight_data.
| "hl.fl": "text", | ||
| "hl.simple.pre": "<b>", | ||
| "hl.simple.post": "</b>", | ||
| "hl.snippets": 1000, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1000 is too much...50 should be sufficient for most instruments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some instruments like guitar have labels in 200 languages, adding aliases it can increase even more (many of them share a common ar). Although for one language we might not have more than 10 labels, we are not sure if our highlight query hits all of the labels in the selected language.
As an example, 50 would not be enough for a query of ar in guitar as it hits more then 100 times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, switching to token-level highlights makes more sense.
| highlight_map = { | ||
| snippet.replace("<b>", "").replace("</b>", ""): snippet | ||
| for snippet in hl_snippets | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This maps the whole snippets, which can be unreliable. For example, instead of creating a map
hl_snippets = [
"Stradivarius <b>Violin</b>",
"<b>Violin</b> Concerto in D Major"
]
It's better to use
"Violin": "<b>Violin</b>"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not understand why this would be unreliable. Aren't both highlight and instrumentname_set solr outputs? Also, if we want to reduce these hits ("Stradivarius <b>Violin</b>", "<b>Violin</b> Concerto in D Major" ) to (<b>Violin</b>), isn't it better to define a new field in solr storing tokenized?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, both values come from Solr. My concern is that full highlight snippets are context-dependent and may include surrounding text, which makes them brittle as mapping keys. Mapping at the matched-term level (e.g. "Violin" → "Violin") is more robust and less coupled to Solr’s snippet construction. And yes, tokenizers would be a better solution.
| highlight_map = { | ||
| snippet.replace("<b>", "").replace("</b>", ""): snippet | ||
| for snippet in hl_snippets | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, both values come from Solr. My concern is that full highlight snippets are context-dependent and may include surrounding text, which makes them brittle as mapping keys. Mapping at the matched-term level (e.g. "Violin" → "Violin") is more robust and less coupled to Solr’s snippet construction. And yes, tokenizers would be a better solution.
| "hl.fl": "text", | ||
| "hl.simple.pre": "<b>", | ||
| "hl.simple.post": "</b>", | ||
| "hl.snippets": 1000, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, switching to token-level highlights makes more sense.
| for name in self._names: | ||
| for orig, highlighted in highlight_dict.items(): | ||
| if orig and orig.lower() in name.lower(): | ||
| name = re.sub( | ||
| re.escape(orig), highlighted, name, flags=re.IGNORECASE | ||
| ) | ||
| updated_names.append(name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also realize this is O(nm)...not great. And this doesn't highlight harp in harpsichord
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed it would be a bad idea to create a different tokenization for highlighted field in Solr indexing as this field would need to replicate our query structure to able to highlight as intended i(e.g. containing Ngram). Maybe it is better to keep the highlighted field as our query field but apply deduplication/tokenization as post-processing for a more robust highlighting.
- Added `apply_highlights` method to `InstrumentNameSet` to apply Solr highlight snippets to each instrument name. - Replaced matching parts of instrument names with highlighted `<b>` text from Solr. - Updated `SolrInstrument` to store highlights and pass them to `InstrumentNameSet`. resolves #343
bbe332f to
f6db6fe
Compare
apply_highlightsmethod toInstrumentNameSetto apply Solr highlight snippets to each instrument name.<b>text from Solr.SolrInstrumentto store highlights and pass them toInstrumentNameSet.resolves #343