Debugging Summary - Multi-Modal Research System

Issues Found and Fixes Applied

1. ✅ Logging System Issues

Problem: Log files were created but remained empty (0 bytes) because log output was buffered and never flushed to disk.

Fix Applied: Updated multi_modal_rag/logging_config.py to flush log entries immediately after writing.

Result: Logs now write to logs/research_system_TIMESTAMP.log in real-time.
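
The contents of multi_modal_rag/logging_config.py are not reproduced in this summary, so the following is only a minimal sketch of one way to get flush-on-write behavior. The class name FlushingFileHandler, the function setup_logging(), and the logger name "research_system" are assumptions made for illustration; only the logs/research_system_TIMESTAMP.log naming comes from this document.

```python
import logging
import os
from datetime import datetime


class FlushingFileHandler(logging.FileHandler):
    """File handler that pushes every record to disk as soon as it is written."""

    def emit(self, record):
        # FileHandler.emit() writes the record and flushes Python's buffer.
        super().emit(record)
        if self.stream is not None:
            # fsync additionally asks the OS to commit the data to disk.
            os.fsync(self.stream.fileno())


def setup_logging(log_dir="logs"):
    """Create logs/research_system_YYYYMMDD_HHMMSS.log and attach a handler.

    Hypothetical helper: the real logging_config.py may be organized differently.
    """
    os.makedirs(log_dir, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_path = os.path.join(log_dir, f"research_system_{timestamp}.log")

    handler = FlushingFileHandler(log_path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
    )

    logger = logging.getLogger("research_system")  # assumed logger name
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger, log_path
```

With a setup along these lines, the file size grows as soon as the first message is logged, which is a quick way to confirm that the empty-file symptom is gone.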


2. ✅ YouTube Collection Broken (FIXED)

Problems Found:

  1. Error: TypeError: post() got an unexpected keyword argument 'proxies'

    • Cause: youtube-search-python library incompatible with newer httpx
  2. Error: HTTPError: HTTP Error 400: Bad Request

    • Cause: pytube no longer works against YouTube (YouTube changed their API)
    • The installed pytube release has not kept up with that change and can no longer fetch video metadata

Fix Applied:

  • Completely replaced pytube and youtube-search-python with yt-dlp
  • yt-dlp is actively maintained and ships frequent releases that track YouTube changes
  • Added yt-dlp==2024.3.10 to requirements.txt
  • Rewrote both search_youtube_lectures() and collect_video_metadata() methods
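
The rewritten methods are not shown in this summary. As a rough sketch of how a yt-dlp based search can look, the example below uses yt-dlp's "ytsearchN:" search prefix with extract_info(). The helper names build_search_query() and entry_to_metadata() and the returned field names are assumptions for illustration, and the body of search_youtube_lectures() here is not the project's actual implementation.

```python
def build_search_query(query, max_results=5):
    """Build a yt-dlp search pseudo-URL; 'ytsearchN:terms' asks for N results."""
    return f"ytsearch{max_results}:{query}"


def entry_to_metadata(entry):
    """Keep a small subset of the fields yt-dlp returns (field choice is assumed)."""
    video_id = entry.get("id")
    return {
        "video_id": video_id,
        "title": entry.get("title"),
        "url": entry.get("webpage_url")
        or entry.get("url")
        or f"https://www.youtube.com/watch?v={video_id}",
        "channel": entry.get("channel") or entry.get("uploader"),
        "duration": entry.get("duration"),
    }


def search_youtube_lectures(query, max_results=5):
    """Search YouTube via yt-dlp without downloading any media."""
    # Imported lazily so the helpers above work even if yt-dlp is not installed.
    import yt_dlp

    options = {
        "quiet": True,          # suppress console output
        "skip_download": True,  # metadata only
        "extract_flat": True,   # list search results without resolving each video
    }
    with yt_dlp.YoutubeDL(options) as ydl:
        info = ydl.extract_info(build_search_query(query, max_results), download=False)
    return [entry_to_metadata(e) for e in (info.get("entries") or []) if e]
```

Calling search_youtube_lectures("Machine Learning") requires network access and an installed yt-dlp; the two helpers can be exercised offline.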

Action Required:

pip install yt-dlp==2024.3.10

Then restart the application (Ctrl+C and run python main.py again)


3. ⚠️ Podcast Collection - Potential Issues

Warning: pydub cannot find ffmpeg or avconv

Impact: Audio transcription with Whisper will fail if podcast audio needs format conversion.

Fix Required:

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt-get install ffmpeg

# Windows
# Download from: https://ffmpeg.org/download.html
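
To confirm that the install worked before starting a podcast collection, a small check along these lines can be run. It is a sketch that only looks for the two executables named in the pydub warning; the function name find_audio_converter() is hypothetical.

```python
import shutil


def find_audio_converter():
    """Return the path of ffmpeg or avconv if either is on PATH, else None."""
    for tool in ("ffmpeg", "avconv"):
        path = shutil.which(tool)
        if path:
            return path
    return None


if __name__ == "__main__":
    converter = find_audio_converter()
    if converter:
        print(f"Audio converter found: {converter}")
    else:
        print("ffmpeg/avconv not found on PATH; install ffmpeg before collecting podcasts")
```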

4. ✅ Search Functionality - Working (returns 0 results as expected)

Status: OpenSearch connected successfully. Search is working but returns 0 results.

Reason: The collected videos haven't been indexed into OpenSearch yet.

What's Happening:

  • YouTube collection works ✅ (5 videos collected successfully)
  • Collected videos were not being indexed ⚠️ (this was the root cause)
  • Search returned 0 results because the index was empty ✅ (expected behavior for an empty index)

Fix Applied:

  • Data collection now automatically indexes collected items into OpenSearch
  • Added _index_data() and _format_document() methods to handle indexing
  • Added handle_reindex() method for the "Reindex All Data" button
  • Connected the reindex button to its handler
  • Added comprehensive logging to track indexing progress

After restart: When you collect data, it will be automatically indexed and searchable!
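
The collect-then-index flow described above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the standalone functions format_document() and index_data() stand in for the _format_document() and _index_data() methods, the index name "research_assistant" and the document fields are made up, and the client is expected to expose an index(index=..., body=...) method as opensearch-py's OpenSearch client does.

```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger(__name__)


def format_document(item, content_type):
    """Map a collected item to a flat document (field names are assumptions)."""
    return {
        "content_type": content_type,
        "title": item.get("title", ""),
        "url": item.get("url", ""),
        "content": item.get("description") or item.get("transcript") or "",
        "indexed_at": datetime.now(timezone.utc).isoformat(),
    }


def index_data(client, items, content_type, index_name="research_assistant"):
    """Index each item and return how many succeeded.

    One failing item is logged and skipped so it does not abort the batch.
    """
    indexed = 0
    for item in items:
        doc = format_document(item, content_type)
        try:
            client.index(index=index_name, body=doc)
            indexed += 1
        except Exception:
            logger.exception("Failed to index item: %s", doc["title"])
    logger.info("Indexed %d of %d %s items", indexed, len(items), content_type)
    return indexed
```

Returning the count makes it easy to log indexing progress and to notice when a collection run produced items that never reached the index.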


5. ✅ Gemini API Model Error - FIXED (Updated to Newer SDK!)

Problem:

  • Error: 404 models/gemini-1.5-flash is not found for API version v1beta
  • The code was using the older google-generativeai SDK with outdated model names

Fix Applied - Upgraded to Newer Gemini SDK:

  • Added google-genai package (newer, better SDK)
  • Updated PDF Processor to use new SDK pattern:
    • Text analysis: gemini-2.0-flash-lite (fastest free model)
    • Vision analysis: gemini-2.0-flash-exp (supports images)
    • Uses new genai.Client() and types.Content() patterns
  • Updated Video Processor to use new SDK pattern:
    • Model: gemini-2.0-flash-lite
  • Updated Research Orchestrator:
    • Model: gemini-1.5-flash-latest (LangChain compatible)

Inspired by: User-provided code showing proper Gemini SDK usage

Action Required:

  1. Install new SDK: pip install google-genai
  2. Restart the application
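
For reference, the genai.Client() and types.Content() pattern mentioned above looks roughly like this with the google-genai package. The sketch is not taken from the project's processors: pick_model(), analyze_text(), and the GEMINI_API_KEY environment variable are assumptions for illustration, while the model names are the ones listed in this section.

```python
import os

# Model names as listed in this summary.
MODEL_FOR_TASK = {
    "text": "gemini-2.0-flash-lite",
    "vision": "gemini-2.0-flash-exp",
}


def pick_model(task):
    """Return the model configured for a task ('text' or 'vision')."""
    try:
        return MODEL_FOR_TASK[task]
    except KeyError:
        raise ValueError(f"unknown task: {task!r}") from None


def analyze_text(prompt, api_key=None):
    """Send a text prompt to Gemini and return the response text."""
    # Imported lazily so pick_model() works without google-genai installed.
    from google import genai
    from google.genai import types

    # GEMINI_API_KEY is an assumed variable name; use whatever the project configures.
    client = genai.Client(api_key=api_key or os.environ["GEMINI_API_KEY"])
    contents = [
        types.Content(role="user", parts=[types.Part.from_text(text=prompt)])
    ]
    response = client.models.generate_content(
        model=pick_model("text"),
        contents=contents,
    )
    return response.text
```

Keeping the model names in one mapping means a future 404 caused by a retired model name can be fixed in a single place.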

Installation Instructions

Step 1: Install the Python packages (yt-dlp and the new Gemini SDK)

pip install yt-dlp==2024.3.10 google-genai

Step 2: Install ffmpeg (for podcast audio processing)

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt-get install ffmpeg

Step 3: Restart the application

Stop the current running instance (Ctrl+C) and run:

python main.py

Step 4: Test YouTube Collection

Try collecting YouTube videos through the UI to verify the fix works.

Step 5: Check Logs

After testing, review the log file:

# Find the latest log file
ls -lt logs/

# View the log
tail -f logs/research_system_YYYYMMDD_HHMMSS.log
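
On systems without ls and tail (for example a plain Windows shell), the same check can be done from Python. This is a small sketch; latest_log() and tail() are hypothetical helper names, and it relies on the research_system_YYYYMMDD_HHMMSS.log naming so that sorting by file name is also sorting by time.

```python
from collections import deque
from pathlib import Path


def latest_log(log_dir="logs"):
    """Return the newest research_system_*.log file, or None if there is none.

    Timestamps in the file name sort lexicographically in chronological order.
    """
    logs = sorted(Path(log_dir).glob("research_system_*.log"))
    return logs[-1] if logs else None


def tail(path, n=20):
    """Return the last n lines of a text file without loading all of it."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]


if __name__ == "__main__":
    log_file = latest_log()
    if log_file is None:
        print("No log files found in logs/")
    else:
        print(f"Latest log: {log_file}")
        for line in tail(log_file):
            print(line)
```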

Log File Locations

All logs are written to: logs/research_system_YYYYMMDD_HHMMSS.log

The application displays the log file path when it starts:

📝 Logs are being written to: logs/research_system_20251002_221805.log

What to Look For in Logs

YouTube Collection

  • Search: Look for "Starting YouTube search for query"
  • Success: "Successfully collected N videos"
  • Errors: "Error searching YouTube lectures"

Podcast Collection

  • RSS Parsing: "Collecting podcast episodes from RSS"
  • Audio Download: "Downloading audio from"
  • Transcription: "Transcribing audio (this may take several minutes)"
  • Errors: "Error collecting podcast episodes" or "Error transcribing audio"

Search Operations

  • Query: "Processing research query"
  • Results: "Retrieved N search results"
  • LLM Response: "Generated response"
  • Errors: "Error processing query" or "Cannot search - OpenSearch not connected"
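
The messages listed above can also be checked automatically. The sketch below counts progress messages and collects error lines per area using plain substring matching on the message fragments from this section; the scan_log() function and the shape of its report are assumptions for illustration.

```python
# Message fragments taken from the lists above, grouped by area.
MARKERS = {
    "youtube": {
        "progress": ("Starting YouTube search for query", "Successfully collected"),
        "error": ("Error searching YouTube lectures",),
    },
    "podcast": {
        "progress": (
            "Collecting podcast episodes from RSS",
            "Downloading audio from",
            "Transcribing audio",
        ),
        "error": ("Error collecting podcast episodes", "Error transcribing audio"),
    },
    "search": {
        "progress": ("Processing research query", "search results", "Generated response"),
        "error": ("Error processing query", "Cannot search - OpenSearch not connected"),
    },
}


def scan_log(lines):
    """Return {area: {"progress": count, "errors": [lines]}} for the given log lines."""
    report = {area: {"progress": 0, "errors": []} for area in MARKERS}
    for line in lines:
        for area, kinds in MARKERS.items():
            # Check errors first so an error line is never counted as progress.
            if any(marker in line for marker in kinds["error"]):
                report[area]["errors"].append(line.strip())
            elif any(marker in line for marker in kinds["progress"]):
                report[area]["progress"] += 1
    return report
```

Passing the lines of the latest log file to scan_log() gives a quick per-area summary; the collected error lines are the excerpts worth including when reporting a problem.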

Known Limitations

  1. YouTube Search: Now uses yt-dlp which is more reliable but slightly slower than the old library
  2. Podcast Transcription: Requires Whisper model download (happens automatically on first use)
  3. Search: Requires data to be indexed first - papers, videos, or podcasts must be collected before searching

Testing Checklist

  • Install the Python packages: pip install yt-dlp==2024.3.10 google-genai
  • Install ffmpeg (if using podcast features)
  • Restart application
  • Try YouTube collection with query "Machine Learning"
  • Try searching (if you have indexed data)
  • Check log file for detailed error information
  • Report any new errors with log file excerpts