Users reported "all methods failed" errors when extracting video information, particularly for YouTube videos. The extraction was failing silently without providing enough detail about which method failed and why.
- Insufficient fallback methods: Only 2 extraction methods (YouTube API and oEmbed)
- Poor error reporting: Generic failures without detailed error messages
- No HTML scraping: When APIs failed, there was no fallback to parse HTML
- Silent failures: Errors were logged but not tracked or reported comprehensively
Implemented a comprehensive 3-method fallback system:
# Method 1: YouTube Data API (if API key provided)
# - Most comprehensive data
# - Requires API key
# - Rate limited
# Method 2: YouTube oEmbed API
# - Good basic metadata
# - No API key required
# - Publicly available
# Method 3: HTML Scraping
# - Last resort fallback
# - Parses YouTube HTML directly
# - Always available but less reliableAdded _extract_from_html() method with multiple extraction strategies:
def _extract_from_html(self, video_id: str) -> Optional[Dict[str, Any]]:
# Try multiple HTML parsing methods:
# 1. Extract from <title> tag
# 2. Extract from og:title meta tag
# 3. Extract from JSON-LD structured data
# 4. Extract author from itemprop='name'
# 5. Extract thumbnail from og:imageTrack errors from each method and report them:
errors = []
# Try method 1
try:
...
except Exception as e:
errors.append(f"YouTube API: {str(e)}")
# Try method 2
try:
...
except Exception as e:
errors.append(f"oEmbed: {str(e)}")
# Try method 3
try:
...
except Exception as e:
errors.append(f"HTML scraping: {str(e)}")
# Report all errors if all methods fail
if all_failed:
error_msg = 'All extraction methods failed: ' + '; '.join(errors)Added detailed debug logging for each step:
self.logger.debug(f"Successfully extracted via YouTube API: {video_id}")
self.logger.debug(f"Successfully extracted via oEmbed: {video_id}")
self.logger.debug(f"Successfully extracted via HTML scraping: {video_id}")
self.logger.warning(f"All extraction methods failed for {video_id}: {errors}")Even when all methods fail, return basic video information:
# Return minimal info with clear error message
return {
'video_id': video_id,
'url': f"https://www.youtube.com/watch?v={video_id}",
'title': f"YouTube Video {video_id}",
'status': 'failed',
'error': 'All extraction methods failed: ...',
'data_source': 'none',
'author_name': None,
'platform': 'youtube'
}WARNING: Could not extract video info for: https://youtube.com/watch?v=ABC123
- No video info returned
- No details about what failed
- Processing stopped
DEBUG: YouTube API: 403 Quota exceeded
DEBUG: oEmbed: HTTP 429 Too Many Requests
DEBUG: Successfully extracted via HTML scraping: ABC123
- Video info extracted successfully
- Detailed error messages for debugging
- Processing continues with fallback method
When all methods fail:
{
"video_id": "ABC123",
"title": "YouTube Video ABC123",
"status": "failed",
"error": "All extraction methods failed: YouTube API: 403 Quota exceeded; oEmbed: HTTP 429; HTML scraping: No title found",
"data_source": "none"
}The HTML scraping method extracts data from multiple sources in the YouTube HTML:
<title>tag: Basic page title (e.g., "Video Title - YouTube")- Open Graph meta tag:
<meta property="og:title" content="..."> - JSON-LD structured data:
<script type="application/ld+json">
- Link itemprop:
<link itemprop="name" content="Channel Name">
- Open Graph image:
<meta property="og:image" content="...">
YouTube APIs have rate limits:
- Quota: 10,000 units per day (default)
- Per video: 1 unit
- Solution: Use HTML scraping as fallback when quota exhausted
- Rate limit: ~100 requests per minute
- HTTP 429: Too Many Requests
- Solution: Automatic fallback to HTML scraping
- Rate limit: Subject to YouTube's general rate limiting
- Advantage: No API key required
- Disadvantage: Less stable, may break if YouTube changes HTML structure
- Provide YouTube API key for best results
- Monitor quota usage
- Implement request batching
- HTML scraping works without API key
- May be slower (needs full page fetch)
- Subject to HTML structure changes
- Always check
statusfield - Parse
errorfield for debugging - Log
data_sourceto understand which method succeeded
Test the improved extraction:
from logseq_py.utils import LogseqUtils
# Test with API key
video_info = LogseqUtils.get_video_info(
"https://youtube.com/watch?v=ABC123",
youtube_api_key="YOUR_API_KEY"
)
# Test without API key (uses oEmbed + HTML fallback)
video_info = LogseqUtils.get_video_info(
"https://youtube.com/watch?v=ABC123"
)
# Check result
print(f"Status: {video_info['status']}")
print(f"Source: {video_info['data_source']}")
print(f"Title: {video_info['title']}")
if video_info['status'] == 'failed':
print(f"Error: {video_info['error']}")Potential improvements:
- Cache extraction results to reduce API calls
- Batch API requests for multiple videos
- Add more platforms (Vimeo, TikTok HTML scraping)
- Implement retry logic with exponential backoff
- Parse more metadata from HTML (views, upload date, etc.)
logseq_py/pipeline/extractors.py- Main implementationlogseq_py/utils.py- Utility wrapper for extractionSUBTITLE_EXTRACTION.md- Related subtitle extraction documentation