Tutorial: Collecting Academic Papers

This tutorial provides step-by-step instructions for collecting academic papers with the Multi-Modal Academic Research System. You'll learn how to gather papers from multiple sources using both the Gradio UI and the Python API.

Table of Contents

  1. Prerequisites
  2. Using the Gradio UI
  3. Using the Python API
  4. Different Search Strategies
  5. Troubleshooting

Prerequisites

Before collecting papers, ensure:

  1. Your virtual environment is activated:

    source venv/bin/activate  # Mac/Linux
    venv\Scripts\activate     # Windows
  2. OpenSearch is running (for automatic indexing):

    docker run -p 9200:9200 -e "discovery.type=single-node" opensearchproject/opensearch:latest
  3. Your .env file contains your Gemini API key:

    GEMINI_API_KEY=your_api_key_here
    OPENSEARCH_HOST=localhost
    OPENSEARCH_PORT=9200
    
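Optionally, run a quick sanity check before collecting. This is a small sketch that only confirms the variables from your .env are visible to Python; it does not validate the API key itself:

import os
from dotenv import load_dotenv

load_dotenv()

# Report which of the expected variables are present
for var in ("GEMINI_API_KEY", "OPENSEARCH_HOST", "OPENSEARCH_PORT"):
    print(f"{var}: {'set' if os.getenv(var) else 'MISSING'}")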

Using the Gradio UI

Step 1: Launch the Application

Start the main application:

python main.py

You should see output like:

Starting Multi-Modal Research Assistant Application
Connected to OpenSearch at localhost:9200
Research Assistant ready!
Opening web interface...
Running on local URL: http://0.0.0.0:7860
Running on public URL: https://xxxxx.gradio.live

Step 2: Navigate to Data Collection Tab

  1. Open the local URL (http://localhost:7860) in your web browser
  2. Click on the "Data Collection" tab at the top
  3. You'll see two main panels:
    • Left panel: Collection options and controls
    • Right panel: Status updates and results

Step 3: Configure Your Search

Select Data Source:

  • Choose "ArXiv Papers" from the radio button options
  • Other options include "YouTube Lectures" and "Podcasts"

Enter Search Query:

  • Type your research topic in the "Search Query" field
  • Examples:
    • machine learning
    • quantum computing
    • natural language processing
    • computer vision transformers

Set Maximum Results:

  • Use the slider to choose how many papers to collect (5-100)
  • Start with 10-20 for testing
  • Note: Larger values take longer to process

Step 4: Collect Papers

  1. Click the "Collect Data" button

  2. Watch the status updates in the "Collection Status" box:

    Collecting papers from ArXiv...
    Collected 15 papers
    
    Indexing data into OpenSearch...
    Indexed 15 items into OpenSearch
    
    Collection and indexing complete!
    
  3. Review the results in the "Collection Results" JSON output:

    {
      "papers_collected": 15,
      "items_indexed": 15
    }

Step 5: Verify Collection

Option 1: Check the Data Visualization Tab

  1. Click on the "Data Visualization" tab
  2. Click "Refresh Statistics"
  3. You should see your newly collected papers in the totals

Option 2: Use the Research Tab

  1. Go to the "Research" tab
  2. Enter a query related to your collected papers
  3. The system should retrieve relevant papers from your collection

What Happens Behind the Scenes

When you collect papers via the UI:

  1. Collection Phase:

    • System queries the ArXiv API with your search terms
    • Downloads PDFs to data/papers/ directory
    • Extracts metadata (title, authors, abstract, etc.)
  2. Database Tracking:

    • Each paper is logged in the SQLite database
    • Collection statistics are recorded
    • Metadata is stored for future reference
  3. Indexing Phase:

    • Paper content is formatted for OpenSearch
    • Embeddings are generated using SentenceTransformer (see the sketch after this list)
    • Documents are bulk-indexed for fast retrieval
  4. Completion:

    • Papers marked as indexed in the database
    • Ready for searching and querying
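For reference, the embedding step in the indexing phase looks roughly like the sketch below. This is illustrative only: the exact model name is an assumption (the system's default may differ), but the shape of the operation is the same.

from sentence_transformers import SentenceTransformer

# Model name is an assumption for illustration; the system's default may differ
model = SentenceTransformer('all-MiniLM-L6-v2')

def embed_paper(paper: dict) -> list:
    """Embed title + abstract into a single dense vector for retrieval."""
    text = f"{paper['title']}\n{paper['abstract']}"
    return model.encode(text).tolist()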

Using the Python API

For more control or automation, use the Python API directly.

Basic Example

Create a Python script (collect_papers.py):

from dotenv import load_dotenv
from multi_modal_rag.data_collectors.paper_collector import AcademicPaperCollector
from multi_modal_rag.indexing.opensearch_manager import OpenSearchManager
from multi_modal_rag.database import CollectionDatabaseManager

# Load environment variables
load_dotenv()

# Initialize components
paper_collector = AcademicPaperCollector()
opensearch_manager = OpenSearchManager()
db_manager = CollectionDatabaseManager()

# Collect papers from ArXiv
query = "deep learning neural networks"
max_results = 20

print(f"Collecting papers for: {query}")
papers = paper_collector.collect_arxiv_papers(query, max_results)

print(f"Collected {len(papers)} papers")

# Process and track each paper
for paper in papers:
    # Add to database
    collection_id = db_manager.add_collection(
        content_type='paper',
        title=paper['title'],
        source='arxiv',
        url=paper.get('pdf_url', ''),
        metadata={'query': query}
    )

    # Store paper details
    db_manager.add_paper(collection_id, paper)

    # Index in OpenSearch
    document = {
        'content_type': 'paper',
        'title': paper['title'],
        'abstract': paper['abstract'],
        'authors': paper['authors'],
        'url': paper.get('pdf_url', ''),
        'publication_date': paper['published'],
        'metadata': {
            'arxiv_id': paper['arxiv_id'],
            'categories': paper['categories']
        }
    }

    opensearch_manager.index_document('research_assistant', document)
    db_manager.mark_as_indexed(collection_id)

    print(f"  - {paper['title'][:80]}...")

# Log collection statistics
db_manager.log_collection_stats('paper', query, len(papers), 'arxiv')

print("Collection complete!")

Run the script:

python collect_papers.py

Advanced Example: Batch Collection

Collect papers for multiple topics:

from multi_modal_rag.data_collectors.paper_collector import AcademicPaperCollector
from multi_modal_rag.indexing.opensearch_manager import OpenSearchManager
from multi_modal_rag.database import CollectionDatabaseManager
from dotenv import load_dotenv
import time

load_dotenv()

# Initialize
paper_collector = AcademicPaperCollector()
opensearch_manager = OpenSearchManager()
db_manager = CollectionDatabaseManager()

# Define topics
topics = [
    "transformer architecture attention mechanisms",
    "reinforcement learning robotics",
    "graph neural networks",
    "few-shot learning meta-learning",
    "generative adversarial networks"
]

# Collect for each topic
for topic in topics:
    print(f"\nCollecting papers for: {topic}")

    papers = paper_collector.collect_arxiv_papers(topic, max_results=10)

    # Prepare documents for bulk indexing, tracking database IDs as we go
    documents = []
    collection_ids = []
    for paper in papers:
        # Track in database
        collection_id = db_manager.add_collection(
            content_type='paper',
            title=paper['title'],
            source='arxiv',
            url=paper.get('pdf_url', ''),
            metadata={'query': topic, 'categories': paper['categories']}
        )
        db_manager.add_paper(collection_id, paper)
        collection_ids.append(collection_id)

        # Prepare for indexing
        doc = {
            'content_type': 'paper',
            'title': paper['title'],
            'abstract': paper['abstract'],
            'authors': paper['authors'],
            'url': paper.get('pdf_url', ''),
            'publication_date': paper['published'],
            'metadata': {
                'arxiv_id': paper['arxiv_id'],
                'categories': paper['categories']
            }
        }
        documents.append(doc)

    # Bulk index
    if documents:
        opensearch_manager.bulk_index('research_assistant', documents)
        print(f"Indexed {len(documents)} papers")

        # Mark each tracked paper as indexed
        for cid in collection_ids:
            db_manager.mark_as_indexed(cid)

    # Log stats
    db_manager.log_collection_stats('paper', topic, len(papers), 'arxiv')

    # Be respectful to the API
    time.sleep(3)

print("\nBatch collection complete!")

Using Collection Filters

Filter papers by specific criteria:

# Collect only recent papers (ArXiv example)
papers = paper_collector.collect_arxiv_papers(
    "machine learning",
    max_results=50
)

# Filter by publication year
# (assumes 'published' is an ISO-8601 string; if it includes a timezone
# offset, compare against a timezone-aware "now" instead)
from datetime import datetime, timedelta

recent_papers = [
    p for p in papers
    if datetime.fromisoformat(p['published']) > datetime.now() - timedelta(days=365)
]

print(f"Found {len(recent_papers)} papers from the last year")

# Filter by category
ml_papers = [
    p for p in papers
    if any(cat.startswith('cs.LG') for cat in p['categories'])
]

print(f"Found {len(ml_papers)} machine learning papers")

Different Search Strategies

1. ArXiv Papers

Best for: Computer science, physics, mathematics, quantitative biology

Search Tips:

  • Use specific technical terms: "attention mechanisms" instead of "AI"
  • Include category codes: cat:cs.LG for machine learning
  • Use Boolean operators: machine learning AND interpretability
  • Filter by date: submittedDate:[20230101 TO 20231231]

Example Searches:

# Specific subfield
papers = paper_collector.collect_arxiv_papers(
    "cat:cs.CV AND (object detection OR semantic segmentation)",
    max_results=30
)

# Recent papers on a topic
papers = paper_collector.collect_arxiv_papers(
    "large language models AND submittedDate:[20230101 TO *]",
    max_results=50
)

# Papers by specific author
papers = paper_collector.collect_arxiv_papers(
    "au:Hinton AND cat:cs.LG",
    max_results=20
)

2. Semantic Scholar

Best for: Open access papers across all disciplines

Search Tips:

  • Broader coverage than ArXiv
  • Automatically filters for open access PDFs
  • Good for interdisciplinary research

Example:

# Collect from Semantic Scholar
papers = paper_collector.collect_semantic_scholar(
    "climate change machine learning",
    max_results=30
)

# Filter for papers with PDFs
papers_with_pdfs = [p for p in papers if p.get('pdf_url')]

3. PubMed Central

Best for: Biomedical and life sciences research

Search Tips:

  • Use MeSH terms for better results
  • Filter for open access content
  • Combine with Boolean operators

Example:

# Collect biomedical papers
papers = paper_collector.collect_pubmed_central(
    "CRISPR gene editing",
    max_results=25
)

# Search with MeSH terms
papers = paper_collector.collect_pubmed_central(
    '"Machine Learning"[MeSH] AND "Cancer"[MeSH]',
    max_results=30
)

4. Combined Strategy

Collect from multiple sources:

from multi_modal_rag.data_collectors.paper_collector import AcademicPaperCollector

collector = AcademicPaperCollector()
all_papers = []

# Collect from ArXiv
arxiv_papers = collector.collect_arxiv_papers("neural networks", 20)
all_papers.extend(arxiv_papers)

# Collect from Semantic Scholar
ss_papers = collector.collect_semantic_scholar("neural networks", 20)
all_papers.extend(ss_papers)

# Collect from PubMed Central (if biomedical topic)
pmc_papers = collector.collect_pubmed_central("neural networks medical imaging", 15)
all_papers.extend(pmc_papers)

# Deduplicate by title
unique_papers = []
seen_titles = set()

for paper in all_papers:
    title = paper['title'].lower().strip()
    if title not in seen_titles:
        unique_papers.append(paper)
        seen_titles.add(title)

print(f"Collected {len(unique_papers)} unique papers from {len(all_papers)} total")

Troubleshooting

Common Issues and Solutions

Issue 1: No Papers Collected

Symptoms:

Collected 0 papers

Solutions:

  1. Check your query: Make it more general

    # Too specific
    papers = collector.collect_arxiv_papers("very specific rare topic XYZ123", 50)
    
    # Better
    papers = collector.collect_arxiv_papers("machine learning", 50)
  2. Verify internet connection: ArXiv requires network access

    curl https://arxiv.org
  3. Check for API rate limits: Add delays between requests

    import time
    time.sleep(3)  # Wait 3 seconds between collections

Issue 2: OpenSearch Not Available

Symptoms:

Cannot index document - OpenSearch not connected

Solutions:

  1. Start OpenSearch:

    docker run -p 9200:9200 -e "discovery.type=single-node" opensearchproject/opensearch:latest
  2. Verify OpenSearch is running:

    curl http://localhost:9200
  3. Check your .env file:

    OPENSEARCH_HOST=localhost
    OPENSEARCH_PORT=9200
    
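You can also probe the connection from Python. The sketch below uses the opensearch-py client directly and assumes security is disabled (as in the single-node Docker command above):

from opensearchpy import OpenSearch

# Assumes a local, security-disabled single-node cluster
client = OpenSearch(
    hosts=[{'host': 'localhost', 'port': 9200}],
    use_ssl=False,
    verify_certs=False,
)
print(client.ping())  # True if the cluster is reachable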

Issue 3: PDF Download Failures

Symptoms:

Error downloading PDF for paper: Connection timeout

Solutions:

  1. Increase timeout in paper_collector.py:

    result.download_pdf(dirpath=self.save_dir, timeout=120)
  2. Skip PDFs and index metadata only:

    # Don't download PDFs, just index metadata
    for result in search.results():
        paper_data = {
            'title': result.title,
            'abstract': result.summary,
            # ... other metadata
            'local_path': None  # Skip PDF download
        }
        papers.append(paper_data)
  3. Check disk space:

    df -h data/papers/

Issue 4: Duplicate Papers

Symptoms: Multiple copies of the same paper in search results

Solutions:

  1. Implement deduplication:

    def deduplicate_papers(papers):
        seen = set()
        unique = []
    
        for paper in papers:
            # Use arxiv_id or URL as unique identifier
            identifier = paper.get('arxiv_id') or paper.get('pdf_url')
    
            if identifier and identifier not in seen:
                seen.add(identifier)
                unique.append(paper)
    
        return unique
    
    papers = deduplicate_papers(collected_papers)
  2. Check database for existing entries:

    from multi_modal_rag.database import CollectionDatabaseManager
    
    db = CollectionDatabaseManager()
    
    # Before adding, check if URL exists
    existing = db.search_collections(paper['pdf_url'], limit=1)
    if not existing:
        # Add paper
        pass

Issue 5: Memory Issues with Large Collections

Symptoms:

MemoryError: Unable to allocate array

Solutions:

  1. Process papers in batches:

    batch_size = 10

    # Collect metadata once, then index in small batches so only a few
    # documents are embedded and held in memory at a time
    papers = collector.collect_arxiv_papers(query, max_results=100)

    for i in range(0, len(papers), batch_size):
        batch_papers = papers[i:i + batch_size]
        opensearch_manager.bulk_index('research_assistant', batch_papers)

        # Release the batch before moving on
        del batch_papers
  2. Reduce embedding model memory:

    # Use smaller model
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('all-MiniLM-L6-v2')  # Smaller, faster

Issue 6: Slow Collection

Symptoms: Collecting papers takes a very long time

Solutions:

  1. Reduce max_results:

    # Instead of
    papers = collector.collect_arxiv_papers(query, max_results=100)
    
    # Try
    papers = collector.collect_arxiv_papers(query, max_results=20)
  2. Skip PDF processing (index metadata only):

    # Modify collection to skip downloads
    for result in search.results():
        # Don't call result.download_pdf()
        papers.append(metadata_only)
  3. Use parallel processing:

    from concurrent.futures import ThreadPoolExecutor
    
    def collect_batch(query_batch):
        return collector.collect_arxiv_papers(query_batch, 10)
    
    queries = ["ML topic 1", "ML topic 2", "ML topic 3"]
    
    # Keep max_workers modest to respect ArXiv's rate limits
    with ThreadPoolExecutor(max_workers=3) as executor:
        results = list(executor.map(collect_batch, queries))

Getting Help

If you encounter issues not covered here:

  1. Check the logs:

    tail -f logs/research_assistant_YYYYMMDD_HHMMSS.log
  2. Enable debug logging: Edit multi_modal_rag/logging_config.py and set level to DEBUG

  3. Test components individually:

    # Test collector
    from multi_modal_rag.data_collectors.paper_collector import AcademicPaperCollector
    collector = AcademicPaperCollector()
    papers = collector.collect_arxiv_papers("test", 1)
    print(papers)
    
    # Test OpenSearch
    from multi_modal_rag.indexing.opensearch_manager import OpenSearchManager
    manager = OpenSearchManager()
    print(manager.connected)
  4. Check database integrity:

    from multi_modal_rag.database import CollectionDatabaseManager
    db = CollectionDatabaseManager()
    stats = db.get_statistics()
    print(stats)

Best Practices

  1. Start Small: Begin with 10-20 papers to test your setup
  2. Use Specific Queries: More specific queries yield better results
  3. Monitor Resources: Watch disk space and memory usage
  4. Regular Backups: Backup your data/ directory regularly
  5. Respect APIs: Use appropriate delays between requests
  6. Verify Indexing: Always check that papers are indexed successfully
  7. Track Collections: Use the database to avoid duplicate collections (see the sketch below)
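A minimal sketch of practice 7, reusing the search_collections check from the troubleshooting section to skip papers that are already tracked:

from multi_modal_rag.data_collectors.paper_collector import AcademicPaperCollector
from multi_modal_rag.database import CollectionDatabaseManager

collector = AcademicPaperCollector()
db = CollectionDatabaseManager()

papers = collector.collect_arxiv_papers("machine learning", 10)

# Keep only papers whose URL is present and not already in the database
new_papers = [
    p for p in papers
    if p.get('pdf_url') and not db.search_collections(p['pdf_url'], limit=1)
]

print(f"{len(new_papers)} of {len(papers)} papers are new")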

Next Steps