This tutorial teaches you how to perform advanced searches using the Multi-Modal Academic Research System. You'll learn about query syntax, field boosting, filters, OpenSearch DSL, and optimizing search relevance.
- Basic Search Concepts
- Advanced Query Syntax
- Field Boosting
- Combining Filters
- OpenSearch Query DSL
- Optimizing Search Relevance
- Practical Examples
The system uses hybrid search combining:
- Keyword Matching (BM25): Traditional text search
- Semantic Similarity: Embedding-based similarity using SentenceTransformer
- Field Weighting: Different fields have different importance
When you search, the system looks through these fields:
- title (3x weight): Paper/video/podcast title
- abstract (2x weight): Paper abstract or description
- content: Full text content
- transcript: Video/podcast transcripts
- key_concepts (2x weight): Extracted concepts
- authors: Author names
- metadata: Additional metadata fields
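The tutorial does not give the exact formula used to merge keyword and semantic scores. As a minimal sketch, assuming both scores are normalized to a comparable range and combined by a weighted sum, the idea looks roughly like this. The function names and the `keyword_weight` parameter are hypothetical and not part of the system's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def hybrid_score(keyword_score, query_vec, doc_vec, keyword_weight=0.5):
    """Weighted sum of a keyword score and embedding similarity.

    Assumption: keyword_score has already been normalized to [0, 1];
    raw BM25 scores are unbounded and would dominate otherwise.
    """
    semantic = cosine_similarity(query_vec, doc_vec)
    return keyword_weight * keyword_score + (1 - keyword_weight) * semantic


# A document with a strong keyword match but an unrelated embedding
print(hybrid_score(0.8, [1.0, 0.0], [0.0, 1.0]))
```

Raising `keyword_weight` favors exact term matches; lowering it favors documents that are semantically close even when the wording differs.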
Using the Gradio UI Research tab:
Query: "machine learning transformers"
Result: Searches across all fields, prioritizing title matches
Query: "attention mechanisms in neural networks"
Result: Semantic search finds related papers even with different wording
Combine terms using Boolean logic:
AND operator:
machine learning AND transformers
Both terms must appear in the document.
OR operator:
transformers OR attention mechanisms
Either term can appear in the document.
NOT operator:
neural networks NOT convolutional
Excludes documents containing "convolutional".
Grouping with parentheses:
(deep learning OR neural networks) AND (computer vision OR image recognition)
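The tutorial does not show how the Research tab parses these operators. Assuming they follow the OpenSearch `query_string` syntax (which supports uppercase `AND`, `OR`, `NOT`, and parentheses), the same expressions could be sent through the Python API with a small helper like the one below. `build_query_string` is a hypothetical name used only for illustration.

```python
def build_query_string(expression, fields=None, default_operator="OR"):
    """Wrap a Boolean search expression in an OpenSearch query_string body.

    Assumption: the index accepts query_string queries; the operators
    must be written in uppercase to be recognized.
    """
    clause = {"query": expression, "default_operator": default_operator}
    if fields:
        clause["fields"] = list(fields)
    return {"query": {"query_string": clause}}


body = build_query_string(
    "(deep learning OR neural networks) AND (computer vision OR image recognition)",
    fields=["title^3", "abstract^2", "content"],
)
print(body)
```

The resulting dictionary can be passed as the `body` argument of `manager.client.search(...)`, in the same way as the DSL examples later in this tutorial.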
Search for exact phrases using quotes:
"attention is all you need"
Matches only documents containing this exact phrase.
"generative adversarial network"
More precise than individual words.
Use wildcards for pattern matching:
Asterisk (*): Matches zero or more characters
transform*
Matches: transformer, transformers, transformation, transformed
Question mark (?): Matches exactly one character
neural networ?
Matches: neural network (any single character can fill the last position)
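To see the difference between the two wildcards, the sketch below uses Python's standard `fnmatch` module, whose `*` and `?` behave the same way for simple patterns. It is only an illustration of the matching rules, not how the search engine evaluates wildcards internally; `wildcard_matches` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def wildcard_matches(pattern, terms):
    """Return the terms that match a pattern where * is any run of characters and ? is one character."""
    return [term for term in terms if fnmatchcase(term, pattern)]


vocabulary = ["transformer", "transformed", "transfer", "network", "networks", "networ"]

# * matches zero or more trailing characters
print(wildcard_matches("transform*", vocabulary))

# ? requires exactly one character, so "networks" and "networ" are excluded
print(wildcard_matches("networ?", vocabulary))
```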
Handle typos and variations:
transformer~
Finds similar words like: transformers, transformer, transformed
Specify edit distance:
transformer~2
Allows up to 2 single-character edits (insertions, deletions, or substitutions).
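The number after `~` is an edit distance. As a simplified sketch, the function below computes the classic Levenshtein distance between two words; the search engine's own fuzzy matching may additionally count a swap of two adjacent characters as a single edit, which this version does not. `edit_distance` is a name chosen for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


for candidate in ["transformers", "transformed", "transform", "transfer"]:
    print(candidate, edit_distance("transformer", candidate))
```

Words whose distance from the query term is within the allowed limit are treated as matches.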
Search within date or numeric ranges:
Date ranges:
publication_date:[2023-01-01 TO 2023-12-31]
Numeric ranges:
views:[1000 TO *]
Videos with 1000+ views.
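In the bracket syntax above, both bounds are inclusive and `*` means "no limit". Assuming the same constraint is expressed through the Query DSL, it corresponds to a `range` clause with `gte` and `lte`. The helper name `range_clause` is hypothetical.

```python
def range_clause(field, lower=None, upper=None):
    """Build an inclusive OpenSearch range clause; a bound of None is left open (like *)."""
    bounds = {}
    if lower is not None:
        bounds["gte"] = lower
    if upper is not None:
        bounds["lte"] = upper
    return {"range": {field: bounds}}


# publication_date:[2023-01-01 TO 2023-12-31]
print(range_clause("publication_date", "2023-01-01", "2023-12-31"))

# views:[1000 TO *]
print(range_clause("views", lower=1000))
```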
Boost the importance of specific fields in your searches.
The system uses these default weights:
- title: 3x
- abstract: 2x
- key_concepts: 2x
- content: 1x
- transcript: 1x
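In an OpenSearch `multi_match` query, a per-field weight is written as `field^weight`. A small helper can turn a weight table like the defaults above into that form. `DEFAULT_WEIGHTS` and `boosted_fields` are hypothetical names for this sketch and are not part of `OpenSearchManager`.

```python
# Default weights as listed in this tutorial
DEFAULT_WEIGHTS = {
    "title": 3,
    "abstract": 2,
    "key_concepts": 2,
    "content": 1,
    "transcript": 1,
}


def boosted_fields(weights):
    """Convert {field: weight} into the 'field^weight' strings used by multi_match."""
    return [f"{field}^{weight}" for field, weight in weights.items()]


def multi_match_body(text, weights=DEFAULT_WEIGHTS, size=10):
    """Build a search body that applies the given field weights to a text query."""
    return {
        "query": {"multi_match": {"query": text, "fields": boosted_fields(weights)}},
        "size": size,
    }


print(boosted_fields(DEFAULT_WEIGHTS))
```

Keeping the weights in one dictionary makes it easy to experiment with different profiles without editing every query by hand.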
Search specific fields with custom weights using the Python API:
from multi_modal_rag.indexing.opensearch_manager import OpenSearchManager
manager = OpenSearchManager()
# Custom search query
search_query = {
'query': {
'multi_match': {
'query': 'deep learning',
'fields': [
'title^5', # 5x weight on title
'abstract^3', # 3x weight on abstract
'content^1', # 1x weight on content
'key_concepts^4' # 4x weight on concepts
],
'type': 'best_fields'
}
},
'size': 20
}
results = manager.client.search(
index='research_assistant',
body=search_query
)
Search only in specific fields:
# Search only in titles
search_query = {
'query': {
'match': {
'title': {
'query': 'transformers',
'boost': 1.0
}
}
}
}
# Search only in abstracts
search_query = {
'query': {
'match': {
'abstract': 'attention mechanisms'
}
}
}
# Search only in transcripts (for videos)
search_query = {
'query': {
'match': {
'transcript': 'neural network architectures'
}
}
}
# Title must match, abstract should match
search_query = {
'query': {
'bool': {
'must': [
{'match': {'title': 'transformer'}}
],
'should': [
{'match': {'abstract': 'attention mechanisms'}}
]
}
}
}
Filters narrow down results without affecting relevance scores.
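The pattern behind filtering is a `bool` query whose scored clauses go under `must` and whose non-scored restrictions go under `filter`. As a sketch under that assumption, the helper below assembles such a query from optional arguments. `build_filtered_query` and its parameters are hypothetical; the field names `content_type` and `publication_date` are the ones used throughout this tutorial.

```python
def build_filtered_query(text, fields, content_type=None, date_from=None, date_to=None):
    """Combine a scored multi_match clause with non-scoring filters in one bool query."""
    filters = []
    if content_type is not None:
        filters.append({"term": {"content_type": content_type}})
    date_bounds = {}
    if date_from is not None:
        date_bounds["gte"] = date_from
    if date_to is not None:
        date_bounds["lte"] = date_to
    if date_bounds:
        filters.append({"range": {"publication_date": date_bounds}})
    return {
        "query": {
            "bool": {
                # Scored: decides the ranking
                "must": [{"multi_match": {"query": text, "fields": list(fields)}}],
                # Not scored: only includes or excludes documents
                "filter": filters,
            }
        }
    }


query = build_filtered_query(
    "machine learning", ["title", "abstract"],
    content_type="paper", date_from="2023-01-01",
)
print(query)
```

Because filter clauses do not contribute to the score, adding or removing them changes which documents are returned but not how the remaining ones are ranked relative to each other.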
Filter by content type (paper, video, podcast):
# Only papers
search_query = {
'query': {
'bool': {
'must': [
{'multi_match': {'query': 'machine learning', 'fields': ['title', 'abstract']}}
],
'filter': [
{'term': {'content_type': 'paper'}}
]
}
}
}
# Only videos
search_query = {
'query': {
'bool': {
'must': [
{'multi_match': {'query': 'deep learning tutorial', 'fields': ['title', 'transcript']}}
],
'filter': [
{'term': {'content_type': 'video'}}
]
}
}
}
# Papers or videos (exclude podcasts)
search_query = {
'query': {
'bool': {
'must': [
{'match': {'title': 'neural networks'}}
],
'filter': [
{'terms': {'content_type': ['paper', 'video']}}
]
}
}
}
Filter by publication date:
# Papers published in 2023
search_query = {
'query': {
'bool': {
'must': [
{'match': {'title': 'transformers'}}
],
'filter': [
{
'range': {
'publication_date': {
'gte': '2023-01-01',
'lte': '2023-12-31'
}
}
}
]
}
}
}
# Recent papers (last 6 months)
from datetime import datetime, timedelta
six_months_ago = (datetime.now() - timedelta(days=180)).isoformat()
search_query = {
'query': {
'bool': {
'must': [
{'match': {'content': 'deep learning'}}
],
'filter': [
{
'range': {
'publication_date': {
'gte': six_months_ago
}
}
}
]
}
}
}
Filter by specific authors:
# Papers by a specific author (a term filter is an exact match against the indexed value)
search_query = {
'query': {
'bool': {
'must': [
{'match': {'content': 'attention mechanisms'}}
],
'filter': [
{'term': {'authors': 'Vaswani'}}
]
}
}
}
# Multiple authors
search_query = {
'query': {
'bool': {
'must': [
{'match': {'title': 'transformers'}}
],
'filter': [
{'terms': {'authors': ['Vaswani', 'Hinton', 'Bengio']}}
]
}
}
}
Filter by ArXiv categories or tags:
# Machine learning papers
search_query = {
'query': {
'bool': {
'must': [
{'match': {'title': 'neural networks'}}
],
'filter': [
{'term': {'metadata.categories': 'cs.LG'}}
]
}
}
}
# Multiple categories
search_query = {
'query': {
'bool': {
'must': [
{'match_all': {}}
],
'filter': [
{
'terms': {
'metadata.categories': ['cs.LG', 'cs.AI', 'cs.CV']
}
}
]
}
}
}
# Papers from 2023, in computer vision, with high relevance
search_query = {
'query': {
'bool': {
'must': [
{
'multi_match': {
'query': 'object detection',
'fields': ['title^3', 'abstract^2', 'content']
}
}
],
'filter': [
{'term': {'content_type': 'paper'}},
{
'range': {
'publication_date': {
'gte': '2023-01-01',
'lte': '2023-12-31'
}
}
},
{'term': {'metadata.categories': 'cs.CV'}}
]
}
},
'size': 50
}
For maximum control, use OpenSearch Query DSL directly.
from multi_modal_rag.indexing.opensearch_manager import OpenSearchManager
manager = OpenSearchManager()
query_dsl = {
'query': {
# Query goes here
},
'size': 10, # Number of results
'from': 0, # Offset for pagination
'sort': [], # Sorting criteria
'_source': [] # Fields to return
}
results = manager.client.search(
index='research_assistant',
body=query_dsl
)
Simple text matching:
query_dsl = {
'query': {
'match': {
'title': {
'query': 'machine learning',
'operator': 'and', # Both words must appear
'fuzziness': 'AUTO' # Allow typos
}
}
}
}
Search across multiple fields:
query_dsl = {
'query': {
'multi_match': {
'query': 'deep learning neural networks',
'fields': ['title^3', 'abstract^2', 'content'],
'type': 'best_fields', # Use best matching field
'tie_breaker': 0.3, # Consider other fields
'minimum_should_match': '75%'
}
}
}
Match types:
- best_fields: Use highest scoring field (default)
- most_fields: Combine scores from all fields
- cross_fields: Treat fields as one big field
- phrase: Match as phrase across fields
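With `best_fields`, the `tie_breaker` setting controls how much the non-winning fields still contribute: the final score is the best field's score plus `tie_breaker` times the sum of the other fields' scores. The sketch below reproduces only that arithmetic with made-up per-field scores; `best_fields_score` is a hypothetical name, and real per-field scores come from the engine.

```python
def best_fields_score(field_scores, tie_breaker=0.0):
    """Combine per-field scores the way a best_fields query with a tie_breaker does."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie_breaker * others


# Illustrative per-field scores for title, abstract, content
scores = [3.0, 1.0, 2.0]
print(best_fields_score(scores, tie_breaker=0.0))  # only the best field counts
print(best_fields_score(scores, tie_breaker=0.3))  # other fields add a fraction
print(best_fields_score(scores, tie_breaker=1.0))  # equivalent to summing all fields
```

A `tie_breaker` of 0 ranks purely by the single best field, while values closer to 1 reward documents that match in several fields at once.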
Combine multiple query clauses:
query_dsl = {
'query': {
'bool': {
'must': [
# Must match all of these
{'match': {'title': 'transformers'}}
],
'should': [
# Should match some of these (boosts score)
{'match': {'abstract': 'attention'}},
{'match': {'content': 'self-attention'}}
],
'must_not': [
# Must not match any of these
{'match': {'content': 'deprecated'}}
],
'filter': [
# Must match but doesn't affect score
{'term': {'content_type': 'paper'}},
{'range': {'publication_date': {'gte': '2023-01-01'}}}
],
'minimum_should_match': 1 # At least 1 should clause
}
}
}
Boost or demote documents:
query_dsl = {
'query': {
'boosting': {
'positive': {
# Boost documents matching this
'match': {'title': 'neural networks'}
},
'negative': {
# Demote documents matching this
'match': {'content': 'outdated'}
},
'negative_boost': 0.3 # Reduce score to 30%
}
}
}
Get statistics about your search results:
query_dsl = {
'query': {
'match_all': {}
},
'size': 0, # Don't return documents
'aggs': {
'by_content_type': {
'terms': {
'field': 'content_type',
'size': 10
}
},
'by_year': {
'date_histogram': {
'field': 'publication_date',
'calendar_interval': 'year'
}
},
'by_author': {
'terms': {
'field': 'authors',
'size': 20
}
}
}
}
results = manager.client.search(index='research_assistant', body=query_dsl)
aggregations = results['aggregations']
Custom sort order:
query_dsl = {
'query': {
'match': {'title': 'machine learning'}
},
'sort': [
{'publication_date': {'order': 'desc'}}, # Newest first
{'_score': {'order': 'desc'}}, # Then by relevance
{'title.keyword': {'order': 'asc'}} # Then alphabetically
]
}
Highlight matching terms in results:
query_dsl = {
'query': {
'match': {'content': 'neural networks'}
},
'highlight': {
'fields': {
'content': {
'fragment_size': 150,
'number_of_fragments': 3,
'pre_tags': ['<strong>'],
'post_tags': ['</strong>']
}
}
}
}
results = manager.client.search(index='research_assistant', body=query_dsl)
for hit in results['hits']['hits']:
    if 'highlight' in hit:
        print(hit['highlight']['content'])
Modify the hybrid search in opensearch_manager.py:
# Assumes `from typing import Dict, List` is present in opensearch_manager.py
def custom_hybrid_search(self, index_name: str, query: str, k: int = 10) -> List[Dict]:
    """Custom hybrid search with adjusted weights"""
    # Generate query embedding (not used until a vector clause is added below)
    query_embedding = self.embedding_model.encode(query).tolist()
    search_query = {
        'size': k,
        'query': {
            'bool': {
                'should': [
                    # Text search with custom weights
                    {
                        'multi_match': {
                            'query': query,
                            'fields': [
                                'title^5',         # Increase title weight
                                'abstract^3',      # Increase abstract weight
                                'key_concepts^4',  # Increase concepts weight
                                'content^1',
                                'transcript^1'
                            ],
                            'type': 'best_fields',
                            'tie_breaker': 0.3,
                            'fuzziness': 'AUTO',
                            'boost': 1.0  # Text search weight (boost must sit inside multi_match)
                        }
                    },
                    # Add vector search if needed
                ]
            }
        }
    }
    response = self.client.search(index=index_name, body=search_query)
    return [{'score': hit['_score'], 'source': hit['_source']}
            for hit in response['hits']['hits']]
1. Adjust field weights based on your use case:
For technical papers:
'fields': ['abstract^4', 'title^3', 'key_concepts^3', 'content^1']
For finding specific concepts:
'fields': ['key_concepts^5', 'title^2', 'abstract^2', 'content^1']
For broad topic search:
'fields': ['title^2', 'abstract^2', 'content^2', 'transcript^2']
2. Use minimum_should_match:
'multi_match': {
'query': 'machine learning neural networks deep',
'fields': ['title', 'abstract'],
'minimum_should_match': '75%' # 75% of 5 terms, rounded down: at least 3 must match
}
3. Enable fuzzy matching for typo tolerance:
'match': {
'title': {
'query': 'transformr', # Typo
'fuzziness': 'AUTO'
}
}
4. Use phrase matching for exact terms:
'bool': {
'must': [
{'match_phrase': {'title': 'attention is all you need'}}
]
}
Implement custom re-ranking logic:
from multi_modal_rag.indexing.opensearch_manager import OpenSearchManager
from datetime import datetime
def rerank_results(results, query, boost_recent=True):
    """Custom re-ranking function"""
    reranked = []
    for result in results:
        score = result['score']
        source = result['source']
        # Boost recent papers
        if boost_recent and 'publication_date' in source:
            pub_date = datetime.fromisoformat(source['publication_date'])
            days_old = (datetime.now() - pub_date).days
            # Boost papers less than 1 year old
            if days_old < 365:
                recency_boost = 1.0 + (365 - days_old) / 365 * 0.5
                score *= recency_boost
        # Boost papers with more authors (collaborative work)
        if 'authors' in source and len(source['authors']) > 3:
            score *= 1.1
        # Boost papers with key concepts matching query terms
        if 'key_concepts' in source:
            query_terms = set(query.lower().split())
            concepts = set(c.lower() for c in source['key_concepts'])
            overlap = len(query_terms & concepts)
            if overlap > 0:
                score *= 1.0 + (overlap * 0.1)
        reranked.append({
            'score': score,
            'source': source
        })
    # Sort by new scores
    reranked.sort(key=lambda x: x['score'], reverse=True)
    return reranked
# Usage
manager = OpenSearchManager()
results = manager.hybrid_search('research_assistant', 'machine learning', k=50)
reranked = rerank_results(results, 'machine learning', boost_recent=True)
top_10 = reranked[:10]
from multi_modal_rag.indexing.opensearch_manager import OpenSearchManager
from datetime import datetime, timedelta
manager = OpenSearchManager()
# Search for papers from last 3 months
three_months_ago = (datetime.now() - timedelta(days=90)).isoformat()
query = {
'query': {
'bool': {
'must': [
{
'multi_match': {
'query': 'large language models',
'fields': ['title^3', 'abstract^2', 'content'],
'fuzziness': 'AUTO'
}
}
],
'filter': [
{'term': {'content_type': 'paper'}},
{
'range': {
'publication_date': {
'gte': three_months_ago
}
}
}
]
}
},
'sort': [
{'publication_date': {'order': 'desc'}},
{'_score': {'order': 'desc'}}
],
'size': 20
}
results = manager.client.search(index='research_assistant', body=query)
print(f"Found {results['hits']['total']['value']} recent papers")
for hit in results['hits']['hits']:
    source = hit['_source']
    print(f"\n{source['title']}")
    print(f"Published: {source['publication_date']}")
    print(f"Authors: {', '.join(source['authors'][:3])}")
# Find papers by specific author on a topic
query = {
'query': {
'bool': {
'must': [
{
'multi_match': {
'query': 'reinforcement learning',
'fields': ['title', 'abstract']
}
}
],
'filter': [
{'term': {'authors': 'Sutton'}}
]
}
}
}
results = manager.client.search(index='research_assistant', body=query)
# Search across all content types, group by type
query = {
'query': {
'multi_match': {
'query': 'neural network architectures',
'fields': ['title^3', 'abstract^2', 'content', 'transcript']
}
},
'aggs': {
'by_type': {
'terms': {
'field': 'content_type'
},
'aggs': {
'top_by_type': {
'top_hits': {
'size': 5,
'sort': [{'_score': {'order': 'desc'}}]
}
}
}
}
},
'size': 30
}
results = manager.client.search(index='research_assistant', body=query)
# Extract top items per type
for bucket in results['aggregations']['by_type']['buckets']:
    content_type = bucket['key']
    count = bucket['doc_count']
    top_items = bucket['top_by_type']['hits']['hits']
    print(f"\n{content_type.upper()} ({count} total):")
    for item in top_items:
        print(f" - {item['_source']['title']}")
Find papers similar to a specific paper:
# Get a reference paper
reference_query = {
'query': {
'match': {'title': 'Attention is All You Need'}
},
'size': 1
}
ref_result = manager.client.search(index='research_assistant', body=reference_query)
reference_paper = ref_result['hits']['hits'][0]['_source']
# Search for similar papers using key concepts and abstract
similar_query = {
'query': {
'more_like_this': {
'fields': ['abstract', 'key_concepts', 'content'],
'like': [
{
'_index': 'research_assistant',
'_id': ref_result['hits']['hits'][0]['_id']
}
],
'min_term_freq': 1,
'max_query_terms': 25,
'min_doc_freq': 1
}
},
'size': 10
}
similar_results = manager.client.search(index='research_assistant', body=similar_query)
print(f"Papers similar to '{reference_paper['title']}':")
for hit in similar_results['hits']['hits']:
    print(f" - {hit['_source']['title']} (score: {hit['_score']:.2f})")
# Faceted search with multiple filters
query = {
'query': {
'match': {'content': 'computer vision'}
},
'aggs': {
'content_types': {
'terms': {'field': 'content_type'}
},
'publication_years': {
'date_histogram': {
'field': 'publication_date',
'calendar_interval': 'year'
}
},
'top_authors': {
'terms': {
'field': 'authors',
'size': 10
}
},
'categories': {
'terms': {
'field': 'metadata.categories',
'size': 20
}
}
},
'size': 0 # Only get facets, not documents
}
results = manager.client.search(index='research_assistant', body=query)
# Display facets
print("Content Types:")
for bucket in results['aggregations']['content_types']['buckets']:
    print(f" {bucket['key']}: {bucket['doc_count']}")
print("\nTop Authors:")
for bucket in results['aggregations']['top_authors']['buckets']:
    print(f" {bucket['key']}: {bucket['doc_count']} papers")
- Learn about Exporting Citations for your research
- Explore Visualization Dashboard to analyze search patterns
- Check Extending the System to customize search behavior
- Review Collecting Papers to build your research database