41 changes: 40 additions & 1 deletion .gitignore
@@ -179,4 +179,43 @@ agent_zone
 debug_tools

 database/*.db
-.omx/
+
+# === auto-added by ensure-repo-hygiene.sh ===
+# Sensitive files
+.env
+.env.*
+!.env.example
+*.key
+*.pem
+credentials.json
+credentials.*
+secrets.*
+
+# Python
+__pycache__/
+*.py[cod]
+*.egg-info/
+.venv/
+venv/
+.pytest_cache/
+.mypy_cache/
+.ruff_cache/
+
+# Node
+node_modules/
+*.log
+npm-debug.log*
+yarn-debug.log*
+.pnpm-debug.log*
+
+# IDE
+.vscode/
+.idea/
+*.swp
+.DS_Store
+
+# Build
+dist/
+build/
+*.egg
+.omx/
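
The interaction between `.env.*` and `!.env.example` can be illustrated with a small sketch. It is a deliberately simplified model of ignore rules (basename matching only, last matching rule wins), not Git's actual matcher; `RULES` and `is_ignored` are hypothetical names introduced for illustration.

```python
from fnmatch import fnmatch

# Subset of the "sensitive files" rules from the hunk, kept in file order.
RULES = [".env", ".env.*", "!.env.example", "*.key", "*.pem",
         "credentials.json", "credentials.*", "secrets.*"]


def is_ignored(filename: str, rules: list[str] = RULES) -> bool:
    """Simplified model: match on the basename only; the last matching rule decides."""
    ignored = False
    for rule in rules:
        negated = rule.startswith("!")
        pattern = rule[1:] if negated else rule
        if fnmatch(filename, pattern):
            # A "!" rule re-includes a file that an earlier rule ignored.
            ignored = not negated
    return ignored


if __name__ == "__main__":
    for name in (".env", ".env.local", ".env.example", "server.pem", "README.md"):
        print(name, is_ignored(name))
```

Under this model `.env.example` stays tracked because the negation comes after `.env.*`. The sketch also suggests that `credentials.json` is already covered by `credentials.*`, so the explicit entry appears redundant but harmless.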
34 changes: 34 additions & 0 deletions docs/business-product-logic.md
@@ -0,0 +1,34 @@
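# MediaCrawler Business and Product Logic

## What this is

MediaCrawler is a framework for collecting public data from multiple self-media platforms. Its core value is not the reverse-engineering details of any single platform, but the way it unifies "multi-platform collection, login state, storage, proxies, and anti-crawling handling" under one set of abstractions.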

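## Core product logic

- Users choose the platform, login method, and crawl type through a single entry point; the framework then dispatches to the platform-specific implementation.
- Playwright/CDP login state, the proxy pool, and signing logic belong to the infrastructure layer and should not be scattered across business scripts.
- Content, comments, creator information, and media files are all treated as outputs of one collection pipeline, not as disconnected scripts.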

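## Core business loop

1. Read command-line arguments and configuration.
2. Select the platform crawler and its corresponding API client.
3. Set up the browser context, login state, and proxy capability.
4. Run search / detail / creator crawling.
5. Parse content, comments, and media resources.
6. Write to the file or database storage layer.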
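
The six steps above can be sketched as a small pipeline. This is a minimal illustration, not MediaCrawler's actual code: `CrawlConfig`, `BaseCrawler`, `DemoCrawler`, `CRAWLERS`, and `MemoryStore` are hypothetical names, the flag names are illustrative, and the browser, login, and proxy setup is reduced to a stub.

```python
import argparse
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class CrawlConfig:
    platform: str
    login_type: str
    crawl_type: str  # "search", "detail", or "creator"


class BaseCrawler(ABC):
    """Hypothetical platform-agnostic interface (an assumption, not the real class)."""

    def __init__(self, config: CrawlConfig) -> None:
        self.config = config

    @abstractmethod
    async def fetch(self) -> list[dict]:
        """Steps 3-5: set up the session, crawl, and parse into records."""


class DemoCrawler(BaseCrawler):
    async def fetch(self) -> list[dict]:
        # A real crawler would open a browser context and log in here.
        return [{"platform": self.config.platform, "type": self.config.crawl_type}]


# Step 2 is a registry lookup, so platform special cases stay out of the entry layer.
CRAWLERS: dict[str, type[BaseCrawler]] = {"demo": DemoCrawler}


@dataclass
class MemoryStore:
    """Stand-in for the file or database storage layer."""
    rows: list[dict] = field(default_factory=list)

    async def save(self, records: list[dict]) -> None:
        self.rows.extend(records)


def parse_args(argv: list[str]) -> CrawlConfig:
    parser = argparse.ArgumentParser()
    parser.add_argument("--platform", required=True)
    parser.add_argument("--lt", default="qrcode")
    parser.add_argument("--type", default="search")
    ns = parser.parse_args(argv)
    return CrawlConfig(ns.platform, ns.lt, ns.type)


async def run(argv: list[str], store: MemoryStore) -> None:
    config = parse_args(argv)                    # step 1
    crawler = CRAWLERS[config.platform](config)  # step 2
    records = await crawler.fetch()              # steps 3-5
    await store.save(records)                    # step 6


if __name__ == "__main__":
    demo_store = MemoryStore()
    asyncio.run(run(["--platform", "demo", "--type", "detail"], demo_store))
    print(demo_store.rows)
```

Adding a platform in this sketch means registering one more class in `CRAWLERS`; `run` itself does not change.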

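## Current business boundaries

- This is a general-purpose collection framework; it does not make the final decisions for any specific content business inside the Workspace.
- The unified multi-platform abstraction is a first-principles design choice; platform-specific special cases should not be written back into the entry layer.
- Both the README and the architecture document stress "learning/research use"; the framework should not be presented as the current Workspace's online production middle-platform service.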

## Recommended reading order

1. `README.md`
2. `docs/项目架构文档.md`
3. `docs/项目代码结构.md`
4. `docs/CDP模式使用指南.md`

58 changes: 58 additions & 0 deletions docs/dev-task-closeout-plan-2026-05-20.md
@@ -0,0 +1,58 @@
# Dev Task Closeout 2026-05-20

## Discovery Gate

### User goal

- Finish the MediaCrawler engineering closeout that the stop hook judged incomplete: clean up the dirty tree, commit, push, and confirm the PR and the validation gate.

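### Task classification

- C: closeout and conflict self-healing on an existing branch; no new architecture or third-party integration.
- Research conclusion: close out using the existing branch, the Workspace finisher, and PR validation; do not implement from scratch.
- Missing codebase map: no `docs/codebase-map.md` was found in the repository; this round fills in the relevant surfaces ad hoc from git diff, PR metadata, `uv.lock`, and the test files.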

### AI-inferred list of relevant surfaces

- source / caller: `media_platform/douyin/login.py`, `media_platform/douyin/client.py`, `media_platform/tieba/*`, `cmd_arg/arg.py`
- tests / fixtures / contracts: `test/test_utils.py`, `tests/test_cmd_arg_tieba.py`, `tests/test_tieba_client_pagination.py`, `tests/test_tieba_extractor.py`
- docs / historical decisions: `README.md`, `README_en.md`, `docs/business-product-logic.md`, the CDP guide, and the FAQ document
- config / flags / env: `pyproject.toml`, `uv.lock`, `.gitignore`; no env or secrets edited
- runtime entrypoint / release surface: no local launchd or release entrypoint touched
- external side effect / online write path: GitHub branch push and PR update only

### Evidence

- `codex-dev-task-finisher.py` initially reported dirty tree and missing PR recognition.
- `gh pr view` confirmed PR #880, fork head branch, and earlier `mergeStateStatus=DIRTY`.
- `upstream/main` was merged into the task branch to resolve the dirty merge state.
- `.gitignore` conflict kept both repository hygiene rules and upstream `.omx/`.
- The `uv.lock` conflict was resolved; stale duplicate tuna package records, which were inconsistent with the default PyPI index now set in `pyproject.toml`, were then removed so that `uv` could parse and normalize the lockfile.

### Open questions / decisions needed from the user

- None for local closure.
- Merge remains subject to upstream repository permissions, review, and checks.

### Validation matrix

- local: `git diff --check`
- contract: `uv lock --check`; `uv lock`
- smoke: `uv run pytest test/test_utils.py tests/test_cmd_arg_tieba.py tests/test_tieba_client_pagination.py tests/test_tieba_extractor.py`
- online regression: not applicable; no live runtime, launchd, release entrypoint, or external owner write path touched
- external endstate: branch pushed to `origin/chore/dirty-worktree-checkpoint-20260422`; PR #880 updated
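
The local, contract, and smoke rows of the matrix could be driven from one script. The sketch below is hypothetical and not part of this PR: `build_matrix`, `run_matrix`, and `shell_execute` are illustrative names, and the executor is injectable so the gate logic can be exercised without `git` or `uv` installed.

```python
import subprocess
from typing import Callable

SMOKE_TESTS = [
    "test/test_utils.py",
    "tests/test_cmd_arg_tieba.py",
    "tests/test_tieba_client_pagination.py",
    "tests/test_tieba_extractor.py",
]


def build_matrix() -> list[tuple[str, list[str]]]:
    """Return (gate, command) pairs mirroring the matrix rows above."""
    return [
        ("local", ["git", "diff", "--check"]),
        ("contract", ["uv", "lock", "--check"]),
        ("contract", ["uv", "lock"]),
        ("smoke", ["uv", "run", "pytest", *SMOKE_TESTS]),
    ]


def run_matrix(execute: Callable[[list[str]], int]) -> dict[str, bool]:
    """Run every command; a gate passes only if all of its commands exit with 0."""
    results: dict[str, bool] = {}
    for gate, command in build_matrix():
        ok = execute(command) == 0
        results[gate] = results.get(gate, True) and ok
    return results


def shell_execute(command: list[str]) -> int:
    """Real executor: run the command and return its exit code."""
    return subprocess.run(command, check=False).returncode


if __name__ == "__main__":
    for gate, passed in run_matrix(shell_execute).items():
        print(f"{gate}: {'ok' if passed else 'FAILED'}")
```

Every command runs even after a failure, so a single pass reports the state of all gates.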

### Compatibility Oracle

- legacy callers: existing CLI and platform crawler callers remain on the same public module paths; no caller contract was intentionally changed in this closeout slice.
- legacy fixture / oracle: existing tests plus upstream-added Tieba tests are the oracle for preserved behavior in this branch.
- behavior change: no behavior change was introduced by the closeout work itself; only merge conflict resolution, lockfile normalization, hygiene ignore rules, and documentation evidence were added.

## TODO

- [x] Commit dirty worktree changes with precise staged files.
- [x] Push branch and read back remote branch head.
- [x] Confirm PR #880 exists and update PR evidence.
- [x] Resolve upstream merge conflicts and push updated head.
- [x] Run local contract and smoke validation.
- [ ] Re-run PR validation and final finisher after this plan file is committed.
9 changes: 5 additions & 4 deletions media_platform/douyin/login.py
@@ -248,18 +248,19 @@ async def move_slider(self, back_selector: str, gap_selector: str, move_step: in
         element = await self.context_page.query_selector(gap_selector)
         bounding_box = await element.bounding_box() # type: ignore

-        await self.context_page.mouse.move(bounding_box["x"] + bounding_box["width"] / 2, # type: ignore
-                                           bounding_box["y"] + bounding_box["height"] / 2) # type: ignore
+        slider_center_x = bounding_box["x"] + bounding_box["width"] / 2 # type: ignore
+        slider_center_y = bounding_box["y"] + bounding_box["height"] / 2 # type: ignore
+        await self.context_page.mouse.move(slider_center_x, slider_center_y)
         # Get x coordinate center position
-        x = bounding_box["x"] + bounding_box["width"] / 2 # type: ignore
+        x = slider_center_x
         # Simulate sliding operation
         await element.hover() # type: ignore
         await self.context_page.mouse.down()

         for track in tracks:
             # Loop mouse movement according to trajectory
             # steps controls the ratio of single movement speed, default is 1, meaning the distance moves in 0.1 seconds no matter how far, larger value means slower
-            await self.context_page.mouse.move(x + track, 0, steps=move_step)
+            await self.context_page.mouse.move(x + track, slider_center_y, steps=move_step)
             x += track
         await self.context_page.mouse.up()
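
To see what this hunk changes, it can help to compute the pointer targets without a browser. The sketch below mirrors the loop arithmetic only; `box_center` and `drag_path` are hypothetical helpers and the numbers are made up. Before the change every drag step targeted `y = 0`, which in viewport coordinates is the top edge rather than the slider; the change keeps `y` at the slider's vertical center.

```python
def box_center(box: dict[str, float]) -> tuple[float, float]:
    """Center of a bounding box given as x / y / width / height."""
    return box["x"] + box["width"] / 2, box["y"] + box["height"] / 2


def drag_path(start_x: float, y: float, tracks: list[float]) -> list[tuple[float, float]]:
    """Absolute (x, y) targets for a horizontal drag; each track is a relative x offset."""
    points = []
    x = start_x
    for track in tracks:
        points.append((x + track, y))
        x += track
    return points


if __name__ == "__main__":
    # Made-up bounding box and track values, for illustration only.
    center_x, center_y = box_center({"x": 100, "y": 300, "width": 40, "height": 40})
    print(drag_path(center_x, 0, [5, 10, 15]))         # old behavior: y pinned to 0
    print(drag_path(center_x, center_y, [5, 10, 15]))  # new behavior: y at slider center
```

With the old call the x targets are identical, but the pointer leaves the slider row on the first step, which may be enough for the drag to be rejected.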

2 changes: 1 addition & 1 deletion pyproject.toml
@@ -43,5 +43,5 @@ dependencies = [
 ]

 [[tool.uv.index]]
-url = "https://pypi.tuna.tsinghua.edu.cn/simple"
+url = "https://pypi.org/simple"
 default = true