28 commits
c0a00b6
feat: add Xiaohongshu data viewer frontend
J1anYi Apr 20, 2026
c5644c4
docs: add AGENTS.md with git and PR guidelines
J1anYi Apr 20, 2026
3cb6601
feat(viewer): add Zhihu data display tab
J1anYi Apr 20, 2026
e603239
feat: implement real-time data sync - file watcher + WebSocket push
J1anYi Apr 20, 2026
2824237
fix: fix modal close button not working
J1anYi Apr 21, 2026
c9bcb6e
fix: fix WebSocket data push issue
J1anYi Apr 22, 2026
f329e93
fix: fix refresh button and improve data display
J1anYi Apr 22, 2026
088465f
fix(xhs): remove notification trigger from stats_update
J1anYi Apr 22, 2026
880efb6
fix(dy): remove notification trigger from stats_update
J1anYi Apr 22, 2026
acc13ec
fix(bili): remove notification trigger from stats_update
J1anYi Apr 22, 2026
dcf5c01
fix(zhihu): remove notification trigger from stats_update
J1anYi Apr 22, 2026
a1367ef
feat(notifications): add notification deduplication data structure
J1anYi Apr 22, 2026
dd008f3
feat(notifications): add deduplication hash function and check function
J1anYi Apr 22, 2026
297e9b7
feat(notifications): apply deduplication logic in showDataNotification
J1anYi Apr 22, 2026
46d337b
feat(notifications): support titles parameter for displaying a title list
J1anYi Apr 22, 2026
78f46ef
feat(api): extend DataUpdateMessage with new_count and titles fields
J1anYi Apr 22, 2026
9e67a7a
feat(ws): add per-platform record count tracking
J1anYi Apr 22, 2026
4bfa4d7
feat(api): add helper functions for fetching the latest records and counts
J1anYi Apr 22, 2026
b3b308a
feat(ws): include new record count and titles in broadcast_platform_update
J1anYi Apr 22, 2026
5fd055f
feat(viewer): implement infinite scroll loading and sort selector UI
J1anYi Apr 22, 2026
98650fe
fix(viewer): fix two issues found in UAT
J1anYi Apr 22, 2026
f1ff371
fix(viewer): fix time display and infinite scroll issues
J1anYi Apr 22, 2026
f63c1cd
debug(viewer): add more logging for infinite scroll debugging
J1anYi Apr 22, 2026
2632eb7
fix(viewer): fix sentinel position issue
J1anYi Apr 22, 2026
905e516
feat(api): add image download task queue system
J1anYi Apr 23, 2026
604b493
feat(api): add local image URL support with frontend fallback
J1anYi Apr 23, 2026
117b158
feat(store): integrate image queue into xhs store
J1anYi Apr 24, 2026
c74f473
feat: Phase 4-8 image download queue and frontend display optimization
J1anYi Apr 24, 2026
118 changes: 118 additions & 0 deletions AGENTS.md
@@ -0,0 +1,118 @@
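# MediaCrawler Project Guide

## Project Overview

MediaCrawler is a Xiaohongshu crawler project that crawls note data and provides a web-based visualization interface.

## Tech Stack

- **Backend**: Python + FastAPI
- **Frontend**: plain HTML/CSS/JavaScript (no framework)
- **Data storage**: JSONL files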
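
Since the data is stored as JSONL (one JSON object per line), reading a file back could look roughly like the sketch below. The helper name `read_jsonl` and the sample fields `note_id` and `title` are hypothetical and chosen for illustration; they are not taken from the project.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Iterator


def read_jsonl(path: Path) -> Iterator[dict[str, Any]]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                yield json.loads(line)


# Demo on a throwaway file; the field names are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "notes.jsonl"
    sample.write_text(
        '{"note_id": "a1", "title": "first"}\n\n{"note_id": "b2", "title": "second"}\n',
        encoding="utf-8",
    )
    records = list(read_jsonl(sample))
```

Reading line by line keeps memory use flat for large files, which is one common reason to choose JSONL over a single JSON array.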

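## Development Guidelines

### Git Branch and PR Conventions

#### Branch Naming

- `feat/xxx` - new feature development
- `fix/xxx` - bug fixes
- `refactor/xxx` - code refactoring
- `docs/xxx` - documentation updates

#### PR Target Branch

**Important**: All PRs must target the `main` branch, not another feature branch.

```
Correct: feature/xxx → main
Wrong: feature/xxx → feat/xxx
```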

#### Remote Repository Configuration

The project has two remotes:

| Name | URL | Purpose |
|------|------|------|
| `origin` | git@github.com:NanmiCoder/MediaCrawler.git | Original upstream repository |
| `myfork` | git@github.com:J1anYi/MediaCrawler.git | Personal fork |

Push code to the fork:
```bash
git push myfork feat/xxx:feat/xxx
```

When creating a PR, make sure:
- **Base repository**: J1anYi/MediaCrawler
- **Base branch**: main
- **Head branch**: feat/xxx

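### Code Style

#### Python

- Use 4-space indentation
- Follow PEP 8
- Functions must have type annotations
- Security-related input validation must be thorough

#### JavaScript

- Use 2-space indentation
- Use ES6+ syntax
- Avoid polluting the global namespace; export via `window.xxx`

#### CSS

- Use the BEM naming convention
- Responsive design first

### Security Guidelines

1. **Path traversal protection**: all user input involving file paths must be validated
2. **Input validation**: use FastAPI Query/Path validators
3. **Type annotations**: functions must have return type annotations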
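
One way to read the path traversal rule is sketched below, using only the standard library. The helper name `resolve_under` and the `DATA_DIR` value are assumptions made for this example, not the project's actual validation code; FastAPI `Query`/`Path` validators would sit in front of a check like this rather than replace it.

```python
from pathlib import Path


def resolve_under(base_dir: Path, user_path: str) -> Path:
    """Return user_path resolved inside base_dir, or raise ValueError."""
    base = base_dir.resolve()
    candidate = (base / user_path).resolve()
    # is_relative_to (Python 3.9+) is False when ".." segments or an
    # absolute path move the result outside of base.
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes data directory: {user_path!r}")
    return candidate


DATA_DIR = Path("data")  # hypothetical base directory

safe = resolve_under(DATA_DIR, "xhs/jsonl/notes.jsonl")

try:
    resolve_under(DATA_DIR, "../secrets.txt")
    escaped = True
except ValueError:
    escaped = False
```

Resolving both paths before comparing them matters: a plain string prefix check would accept `data/../secrets.txt`.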

## Directory Structure

```
MediaCrawler/
├── api/                    # FastAPI backend
│   └── routers/
│       └── notes.py        # Notes API
├── viewer/                 # Frontend visualization interface
│   ├── index.html
│   └── static/
│       ├── css/
│       └── js/
│           ├── app.js      # Main application logic
│           ├── api.js      # API wrapper
│           ├── modal.js    # Modal component
│           └── monitor.js  # Monitor component
├── data/                   # Data directory
│   └── xhs/
│       ├── jsonl/          # Note data
│       └── images/         # Image assets
└── docs/                   # Documentation
```

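## Development Workflow

1. Create a feature branch from `main`
2. Develop and test
3. Run code review
4. Push to the fork
5. Open a PR against the `main` branch

## Local Development

Start the development server:
```bash
python -m api.main
```

Visit:
- Visualization interface: http://localhost:8081/viewer/index.html
- API docs: http://localhost:8081/docs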
4 changes: 2 additions & 2 deletions README.md
@@ -161,13 +161,13 @@ MediaCrawler provides a web-based visual operation interface; even without the command line

```shell
# Start the API server (default port 8080)
uv run uvicorn api.main:app --port 8080 --reload
uv run uvicorn api.main:app --port 8081 --reload

# Or start it as a module
uv run python -m api.main
```

After a successful start, visit `http://localhost:8080` to open the WebUI interface.
After a successful start, visit `http://localhost:8081` to open the WebUI interface.

#### WebUI Features

6 changes: 3 additions & 3 deletions README_en.md
@@ -142,14 +142,14 @@ MediaCrawler provides a web-based visual operation interface, allowing you to ea
#### Start WebUI Service

```shell
# Start API server (default port 8080)
uv run uvicorn api.main:app --port 8080 --reload
# Start API server (default port 8081)
uv run uvicorn api.main:app --port 8081 --reload

# Or start using module method
uv run python -m api.main
```

After successful startup, visit `http://localhost:8080` to open the WebUI interface.
After successful startup, visit `http://localhost:8081` to open the WebUI interface.

#### WebUI Features

4 changes: 2 additions & 2 deletions README_es.md
@@ -143,13 +143,13 @@ MediaCrawler provides a web-based visual operation interface, allowing

```shell
# Start API server (default port 8080)
uv run uvicorn api.main:app --port 8080 --reload
uv run uvicorn api.main:app --port 8081 --reload

# Or start using the module method
uv run python -m api.main
```

After a successful start, visit `http://localhost:8080` to open the WebUI interface.
After a successful start, visit `http://localhost:8081` to open the WebUI interface.

#### WebUI Features

139 changes: 92 additions & 47 deletions api/main.py
@@ -23,19 +23,78 @@
"""
import asyncio
import os
import shutil
import subprocess
from contextlib import asynccontextmanager
from pathlib import Path

import uvicorn
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from fastapi.staticfiles import StaticFiles
from fastapi.responses import FileResponse

from .routers import crawler_router, data_router, websocket_router
from .routers import crawler_router, data_router, websocket_router, notes_router, zhihu_router, bilibili_router, douyin_router, subscriptions_router, trends_router, image_queue_router
from .services import file_watcher, image_task_db, image_downloader, image_queue_service, image_scheduler
from .services.file_watcher import PLATFORMS
from .routers.websocket import broadcast_stats_update, broadcast_platform_update


# Data directory for JSONL files
DATA_DIR = Path(__file__).parent.parent / "data"


async def on_file_change_callback(platform: str):
    """
    Callback for file watcher when a platform's data changes.
    Broadcasts both platform-specific update and global stats update.

    Args:
        platform: The platform identifier (e.g., "xhs", "dy", "bili", "zhihu")
    """
    # Broadcast platform-specific update (for viewer components)
    await broadcast_platform_update(platform)
    # Broadcast global stats update (for WebUI compatibility)
    await broadcast_stats_update(platform)


@asynccontextmanager
async def lifespan(app: FastAPI):
    """Application lifespan manager - start/stop file watcher for all platforms."""
    # Initialize task database
    await image_task_db.init_db()

    # Initialize image downloader
    await image_downloader.init()

    # Start queue service (launches consumers)
    image_queue_service.start()

    # Start scheduler (scans for timeout and retry tasks)
    image_scheduler.start()

    # Startup - watch all platform directories
    file_watcher.start(
        platforms=PLATFORMS,
        base_callback=on_file_change_callback,
        base_path=str(DATA_DIR)
    )
    app.state.file_watcher = file_watcher

    yield

    # Shutdown
    image_scheduler.stop()
    image_queue_service.stop()
    await image_downloader.close()
    file_watcher.stop()


app = FastAPI(
    title="MediaCrawler WebUI API",
    description="API for controlling MediaCrawler from WebUI",
    version="1.0.0"
    version="1.0.0",
    lifespan=lifespan
)

# Get webui static files directory
@@ -59,6 +118,13 @@
app.include_router(crawler_router, prefix="/api")
app.include_router(data_router, prefix="/api")
app.include_router(websocket_router, prefix="/api")
app.include_router(notes_router, prefix="/api")
app.include_router(zhihu_router, prefix="/api")
app.include_router(bilibili_router, prefix="/api")
app.include_router(douyin_router, prefix="/api")
app.include_router(subscriptions_router, prefix="/api")
app.include_router(trends_router, prefix="/api")
app.include_router(image_queue_router, prefix="/api")


@app.get("/")
@@ -83,50 +149,14 @@ async def health_check():
@app.get("/api/env/check")
async def check_environment():
    """Check if MediaCrawler environment is configured correctly"""
    try:
        # Run uv run main.py --help command to check environment
        process = await asyncio.create_subprocess_exec(
            "uv", "run", "main.py", "--help",
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            cwd="."  # Project root directory
        )
        stdout, stderr = await asyncio.wait_for(
            process.communicate(),
            timeout=30.0  # 30 seconds timeout
        )

        if process.returncode == 0:
            return {
                "success": True,
                "message": "MediaCrawler environment configured correctly",
                "output": stdout.decode("utf-8", errors="ignore")[:500]  # Truncate to first 500 characters
            }
        else:
            error_msg = stderr.decode("utf-8", errors="ignore") or stdout.decode("utf-8", errors="ignore")
            return {
                "success": False,
                "message": "Environment check failed",
                "error": error_msg[:500]
            }
    except asyncio.TimeoutError:
        return {
            "success": False,
            "message": "Environment check timeout",
            "error": "Command execution exceeded 30 seconds"
        }
    except FileNotFoundError:
        return {
            "success": False,
            "message": "uv command not found",
            "error": "Please ensure uv is installed and configured in system PATH"
        }
    except Exception as e:
        return {
            "success": False,
            "message": "Environment check error",
            "error": str(e)
        }
    # Simple check - just verify uv is available
    uv_path = shutil.which("uv")

    return {
        "success": uv_path is not None,
        "message": "Environment ready" if uv_path else "uv not found",
        "uv_path": uv_path
    }
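
The simplified check only tests whether `uv` is on `PATH`. Unlike the subprocess version it replaces, it does not confirm that `main.py --help` actually runs. A standalone sketch of the same lookup follows; the helper name `check_tool` and the `path` key are hypothetical and introduced only for this example.

```python
import shutil
from typing import Any


def check_tool(name: str) -> dict[str, Any]:
    """Report whether an executable called `name` can be found on PATH."""
    tool_path = shutil.which(name)  # returns None when nothing is found
    return {
        "success": tool_path is not None,
        "message": "Environment ready" if tool_path else f"{name} not found",
        "path": tool_path,
    }


# A name that should not exist on any machine, to show the failure shape.
missing = check_tool("definitely-not-a-real-tool-12345")
```

The lookup does not spawn a process, so it returns immediately and needs no timeout handling, at the cost of a weaker guarantee.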


@app.get("/api/config/platforms")
@@ -182,6 +212,21 @@ async def get_config_options():
# Mount other static files (e.g., vite.svg)
app.mount("/static", StaticFiles(directory=WEBUI_DIR), name="webui-static")

# Mount viewer static files
VIEWER_DIR = os.path.join(os.path.dirname(__file__), "..", "viewer", "static")
if os.path.exists(VIEWER_DIR):
    app.mount("/viewer", StaticFiles(directory=VIEWER_DIR, html=True), name="viewer")

# Mount images directory for note images
IMAGES_DIR = os.path.join(os.path.dirname(__file__), "..", "data", "xhs", "images")
if os.path.exists(IMAGES_DIR):
    app.mount("/images", StaticFiles(directory=IMAGES_DIR), name="images")

# Mount data directory for local images (downloaded images with hash-based paths)
LOCAL_DATA_DIR = os.path.join(os.path.dirname(__file__), "..", "data")
if os.path.exists(LOCAL_DATA_DIR):
    app.mount("/local-images", StaticFiles(directory=LOCAL_DATA_DIR), name="local-images")


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)
    uvicorn.run(app, host="0.0.0.0", port=8081)
9 changes: 8 additions & 1 deletion api/routers/__init__.py
@@ -19,5 +19,12 @@
from .crawler import router as crawler_router
from .data import router as data_router
from .websocket import router as websocket_router
from .notes import router as notes_router
from .zhihu import router as zhihu_router
from .bilibili import router as bilibili_router
from .douyin import router as douyin_router
from .subscriptions import router as subscriptions_router
from .trends import router as trends_router
from .image_queue import router as image_queue_router

__all__ = ["crawler_router", "data_router", "websocket_router"]
__all__ = ["crawler_router", "data_router", "websocket_router", "notes_router", "zhihu_router", "bilibili_router", "douyin_router", "subscriptions_router", "trends_router", "image_queue_router"]