|
1 | | -# CleanBook —— 智能书签清理与分类 |
| 1 | +# CleanBook — Smart Bookmark Cleaning & Classification |
2 | 2 |
|
3 | 3 | [](https://lessup.github.io/bookmarks-cleaner/) |
4 | 4 |
|
5 | | -简体中文 | [English](README.en.md) |
| 5 | +English | [简体中文](README.zh-CN.md) |
6 | 6 |
|
7 | | -KISS:规则 + 机器学习 + 可选 LLM,默认离线可用。统一清理标题 emoji,强力去重,输出 HTML/Markdown/JSON。 |
| 7 | +KISS: Rules + ML + optional LLM, offline-ready by default. Unified title emoji cleanup, powerful deduplication, outputs HTML/Markdown/JSON. |
8 | 8 |
|
9 | | -## 特性 |
| 9 | +## Features |
10 | 10 |
|
11 | | -- 规则优先,ML/语义辅助,LLM 可选接入(失败自动降级)。 |
12 | | -- 统一标题清理,避免 emoji 前缀叠加。 |
13 | | -- 去重全时开启,跨浏览器导出合并更稳。 |
14 | | -- 输出分类结构最多两级,结果更简洁。 |
| 11 | +- Rules first, ML/semantic assist, optional LLM integration (auto-fallback on failure) |
| 12 | +- Unified title cleaning to avoid stacked emoji prefixes |
| 13 | +- Always-on deduplication for stable cross-browser export merging |
| 14 | +- Output classification limited to two levels for cleaner results |
15 | 15 |
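The README does not say how the always-on deduplication decides that two bookmarks are the same. One common approach, shown here as a minimal sketch and not as CleanBook's actual implementation, is to normalize each URL and keep the first bookmark seen per normalized key. The function names `normalize_url` and `dedupe`, the `url` dictionary key, and the specific normalization rules (lowercasing the host, dropping `www.`, trailing slashes, and fragments) are all assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Reduce a URL to a comparison key (hypothetical rules, not CleanBook's)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Treat "www.example.com" and "example.com" as the same host.
    if netloc.startswith("www."):
        netloc = netloc[4:]
    # Ignore trailing slashes; an empty path becomes "/".
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment, keep the query string.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def dedupe(bookmarks):
    """Keep the first bookmark for each normalized URL, preserving order."""
    seen = set()
    unique = []
    for bookmark in bookmarks:
        key = normalize_url(bookmark["url"])
        if key not in seen:
            seen.add(key)
            unique.append(bookmark)
    return unique
```

Keeping the first occurrence preserves the original export order, which helps when the same bookmark appears in exports from several browsers.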
|
16 | | -## 安装(推荐 pipx) |
| 16 | +## Installation (pipx Recommended) |
17 | 17 |
|
18 | 18 | ```powershell |
19 | 19 | python -m pip install --user pipx |
20 | 20 | python -m pipx ensurepath |
21 | 21 | pipx install . |
22 | 22 | ``` |
23 | 23 |
|
24 | | -安装后将得到两个命令: |
| 24 | +Installation provides two commands:
25 | 25 |
|
26 | | -- `cleanbook`:命令式处理(等价于 `python main.py`) |
27 | | -- `cleanbook-wizard`:向导式体验(交互菜单) |
| 26 | +- `cleanbook`: Command-line processing (equivalent to `python main.py`) |
| 27 | +- `cleanbook-wizard`: Interactive wizard experience |
28 | 28 |
|
29 | | -## 最小示例 |
| 29 | +## Quick Example |
30 | 30 |
|
31 | 31 | ```powershell |
32 | 32 | cleanbook -i examples/demo_bookmarks.html -o output |
33 | 33 | cleanbook -i "tests/input/*.html" --train |
34 | 34 | cleanbook-wizard |
35 | 35 | ``` |
36 | 36 |
|
37 | | -常用参数:`--workers` 并行,`--train` 训练 ML,`--no-ml` 禁用 ML,`--health-check` 可达性巡检。 |
| 37 | +Common flags: `--workers` (parallel workers), `--train` (train the ML model), `--no-ml` (disable ML), `--health-check` (check URL reachability).
38 | 38 |
|
39 | | -## LLM(可选) |
| 39 | +## LLM (Optional) |
40 | 40 |
|
41 | | -编辑 `config.json` 启用: |
| 41 | +Edit `config.json` to enable: |
42 | 42 |
|
43 | 43 | ```json |
44 | 44 | "llm": { |
45 | 45 | "enable": true, |
46 | 46 | "provider": "openai", |
47 | 47 | "base_url": "https://api.openai.com", |
48 | 48 | "model": "gpt-4o-mini", |
49 | | - "api_key_env": "OPENAI_API_KEY", |
50 | | - "prompt": { |
51 | | - "task_description": "请作为 CleanBook-Agent,精确完成浏览器书签分类与信心标注。", |
52 | | - "steps": [ |
53 | | - "分析书签的标题、URL、域名、关键词,推测主题、意图、受众", |
54 | | - "从提供的分类库中寻找最合适的主/子类,必要时选择 '未分类'", |
55 | | - "输出格式化 JSON,包含 category, confidence, reasons, facets 等字段" |
56 | | - ], |
57 | | - "scoring_notes": "当存在不确定性时,降低 confidence 并指出疑惑点;若需人工复核,可在 facets.priority_tags 中加入 'review'。", |
58 | | - "force_json": true |
59 | | - }, |
60 | | - "organizer": { |
61 | | - "enable": true, |
62 | | - "max_examples_per_category": 5, |
63 | | - "max_domains_per_category": 5, |
64 | | - "max_tokens": 1800 |
65 | | - } |
| 49 | + "api_key_env": "OPENAI_API_KEY" |
66 | 50 | } |
67 | 51 | ``` |
68 | 52 |
|
69 | | -然后设置环境变量(PowerShell): |
| 53 | +Then set the environment variable (PowerShell):
70 | 54 |
|
71 | 55 | ```powershell |
72 | | -$env:OPENAI_API_KEY = "你的_API_Key" |
| 56 | +$env:OPENAI_API_KEY = "your_api_key" |
73 | 57 | ``` |
74 | 58 |
|
75 | | -未设置 Key 或失败时自动回退到离线分类。 |
| 59 | +CleanBook falls back to offline classification automatically when the key is unset or the API call fails.
76 | 60 |
|
77 | | -开启 `organizer.enable` 后,会在分类完成后调用同一套 OpenAI 兼容接口,对书签类别进行二次聚类、排序与总结: |
| 61 | +With `organizer.enable`, a secondary LLM pass clusters, sorts and summarizes categories after classification. |
78 | 62 |
|
79 | | -- 自动生成更紧凑的主分类、子分类,并带有洞察说明; |
80 | | -- 失败或未开启时自动回退至传统分类树; |
81 | | -- 统计信息与导出报告会附带 LLM 参与情况,便于复盘。 |
82 | | - |
83 | | -此外,可通过 `llm.prompt` 定制提示词模板: |
84 | | - |
85 | | -- `task_description`:高层任务定义,可结合团队术语; |
86 | | -- `steps`:指引 LLM 按步骤执行(支持多条); |
87 | | -- `scoring_notes`:补充置信度与复核策略; |
88 | | -- `few_shots`(可选):提供示例输入/输出,提升对特定类别的识别准确率。 |
89 | | - |
90 | | -## 目录结构(建议) |
| 63 | +## Project Structure |
91 | 64 |
|
92 | 65 | ``` |
93 | 66 | . |
94 | 67 | ├─ src/ |
95 | | -│ ├─ cleanbook/ # 统一 CLI 包装 |
| 68 | +│ ├─ cleanbook/ # Unified CLI wrapper |
96 | 69 | │ │ └─ cli.py |
97 | | -│ ├─ ai_classifier.py # 规则+ML+语义+用户画像+LLM(可选) |
| 70 | +│ ├─ ai_classifier.py # Rules + ML + semantic + user profile + LLM |
98 | 71 | │ ├─ enhanced_classifier.py |
99 | 72 | │ ├─ enhanced_clean_tidy.py |
100 | 73 | │ ├─ bookmark_processor.py |
101 | | -│ ├─ placeholder_modules.py # 导出、占位模块 |
102 | | -│ ├─ emoji_cleaner.py # 标题 emoji 清理 |
| 74 | +│ ├─ emoji_cleaner.py # Title emoji cleaning |
103 | 75 | │ └─ ... |
104 | | -├─ models/ # 模型与缓存 |
| 76 | +├─ models/ # Models & cache |
105 | 77 | ├─ examples/ |
106 | 78 | ├─ docs/ |
107 | | -│ └─ quickstart_zh.md |
108 | 79 | ├─ config.json |
109 | | -├─ main.py # 顶层入口 |
110 | | -├─ pyproject.toml # 打包与命令入口 |
| 80 | +├─ main.py # Top-level entry |
| 81 | +├─ pyproject.toml # Packaging & CLI entry points |
111 | 82 | └─ changelog/ |
112 | 83 | ``` |
113 | 84 |
|
114 | | -不必要的历史/重复文件建议迁入 `legacy/` 或删除(如旧版文档 `doc/` 目录)。 |
115 | | - |
116 | | -## 发布与分发 |
117 | | - |
118 | | -- 本地/团队:推荐 `pipx install .`,获得全局命令且环境隔离。 |
119 | | -- 开源分发: |
120 | | - - GitHub 发布源码与 Release 附带示例数据; |
121 | | - - 可选发布到 PyPI(`python -m build && twine upload dist/*`)。 |
122 | | -- Windows 免 Python:可选使用 `PyInstaller` 打包单文件 EXE(进阶)。 |
| 85 | +## Distribution |
123 | 86 |
|
124 | | -更多细节见 `docs/quickstart_zh.md`。 |
| 87 | +- **Local/Team**: `pipx install .` for isolated global commands |
| 88 | +- **Open Source**: GitHub Release with example data; optionally publish to PyPI |
| 89 | +- **Windows standalone**: Optional PyInstaller single-file EXE |
125 | 90 |
|
126 | | -## 许可证 |
| 91 | +## License |
127 | 92 |
|
128 | | -MIT,见 `LICENSE`。 |
| 93 | +MIT — see `LICENSE`. |