This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
This project follows Spec-Driven Development (SDD) methodology. Before writing any code, always review the specification documents in the /specs directory:
- `/specs/product/`: Product requirements and acceptance criteria
- `/specs/rfc/`: Technical design documents and architecture proposals (see RFC 0001 for the architecture upgrade)
- `/specs/testing/`: BDD test specifications and acceptance tests
AI Agent Workflow:
- Review Specs First: Read relevant specs before coding
- Spec-First Update: For new features or interface changes, update specs first and wait for confirmation
- Implement to Spec: Code must fully comply with the specs (no gold-plating)
- Test against Spec: Write tests based on spec acceptance criteria
For complete AI agent workflow instructions, see AGENTS.md.
```bash
# Install dependencies
pip install -r requirements.txt

# Install in development mode (creates CLI commands: cleanbook, cleanbook-wizard)
pip install -e .

# Run interactive mode
python main.py --interactive
python main.py -i examples/demo_bookmarks.html

# Health check
python main.py --health-check

# Run focused runtime tests
pytest -q tests/test_runtime_paths.py

# Run broader test suite
pytest -q

# Test with sample data
python main.py -i examples/demo_bookmarks.html -o output/
python main.py -i examples/demo_bookmarks.html --train

# Process bookmarks with ML training
python main.py -i bookmarks.html --train

# Batch process with custom settings
python main.py -i file1.html file2.html -o results/ --workers 8 --threshold 0.8

# Debug mode
python main.py -i bookmarks.html --log-level DEBUG --limit 100

# Disable ML to save memory
python main.py -i bookmarks.html --no-ml
```

This is an AI-powered bookmark classification system with a plugin-based architecture. The system uses a pipeline pattern where multiple classifiers process bookmarks and their results are fused.
- Main Entry: `main.py` - Handles CLI parsing and delegates to components
- AI Classifier: `src/ai_classifier.py` - Central orchestrator that manages multiple classification strategies
- Bookmark Processor: `src/bookmark_processor.py` - Coordinates batch processing, parallelization, and I/O
- Plugin System: `src/plugins/` - Modular classifiers with a registry pattern
  - `rule_classifier.py` - Fast pattern matching (domain, title, URL)
  - `ml_classifier.py` - scikit-learn based ML classification
  - `embedding_classifier.py` - Semantic similarity using embeddings
  - `llm_classifier.py` - Optional LLM integration (OpenAI-compatible)
- Services Layer: `src/services/` - Cross-cutting concerns: active learning, taxonomy standardization, embedding service, performance monitoring
```
HTML Bookmark → BookmarkFeatures (dataclass) → Plugin Pipeline → ClassificationResult (dataclass) → Exporter
                                                      ↓
                                    Parallel Processing (ThreadPoolExecutor)
                                                      ↓
                                  Result Fusion (weighted voting by confidence)
```
- Plugin Registry: Classifiers are registered in `src/plugins/registry.py` and discovered dynamically
- Lazy Loading: Components are initialized on first use to minimize startup time
- Fallback Chain: If ML fails, the system falls back to rule-based classification
- Confidence Scoring: All classifiers return confidence scores (0.0-1.0) for result fusion
- Caching: LRU cache for repeated classifications and URL validation
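The weighted-voting fusion described above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the `ClassificationResult` dataclass is named in this document, but its fields and the exact fusion weighting are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class ClassificationResult:
    category: str
    confidence: float  # 0.0-1.0, per the confidence-scoring convention above

def fuse_results(results):
    """Weighted vote: sum confidence per category, then normalize the winner.

    Sketch only -- the real fusion logic lives in the plugin pipeline.
    """
    if not results:
        return None
    scores = defaultdict(float)
    for r in results:
        scores[r.category] += r.confidence
    best = max(scores, key=scores.get)
    total = sum(scores.values())
    return ClassificationResult(best, scores[best] / total if total else 0.0)
```

With three classifier votes, two for the same category outweigh one higher-confidence dissent, and the fused confidence stays within 0.0-1.0.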
```
├── main.py                   # CLI entry point
├── pyproject.toml            # Project metadata, CLI scripts
├── config.json               # Main configuration (rules, categories, ML settings)
├── src/
│   ├── ai_classifier.py      # Orchestrates classification pipeline
│   ├── bookmark_processor.py # Batch processing coordinator
│   ├── plugins/
│   │   ├── pipeline.py       # Execution pipeline
│   │   ├── registry.py       # Plugin registration
│   │   └── classifiers/      # Individual classifier plugins
│   └── services/
│       ├── embedding_service.py
│       ├── taxonomy_service.py
│       └── performance_monitor.py
├── tests/                    # Property-based tests (Hypothesis)
└── examples/                 # Demo bookmark files
```
The config.json file controls:
- Classification Rules: Domain patterns, keywords, priorities
- AI Settings: Confidence thresholds, cache sizes, worker counts
- Category Taxonomy: Ordered list of categories/subcategories
- LLM Settings: Optional OpenAI-compatible API integration
See existing config.json for examples. The TaxonomyStandardizer in src/taxonomy_standardizer.py enforces naming consistency.
The codebase uses property-based testing with Hypothesis:
- Test files follow `test_*.py` naming in `/tests`
- Property tests validate invariants (e.g., "confidence scores always 0.0-1.0")
- Mock external dependencies (LLM APIs, ML libraries)
- Test data generation in `tests/output-round-2/generate_bookmarks.py`
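A minimal property test in this style might look like the following. The `clamp_confidence` helper is hypothetical, used only to illustrate the "confidence always 0.0-1.0" invariant named above.

```python
from hypothesis import given, strategies as st

def clamp_confidence(score: float) -> float:
    """Hypothetical helper: clamp a raw classifier score into the valid range."""
    return max(0.0, min(1.0, score))

# Hypothesis generates many float inputs and checks the invariant for each.
@given(st.floats(allow_nan=False, allow_infinity=False))
def test_confidence_always_in_range(raw):
    assert 0.0 <= clamp_confidence(raw) <= 1.0
```

Run with `pytest -q`; Hypothesis will shrink any failing input to a minimal counterexample.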
To add a new classifier:
- Create a class in `src/plugins/classifiers/your_classifier.py`
- Inherit from `src/plugins/base.py::BaseClassifier`
- Implement a `classify()` method returning `ClassificationResult`
- Register it in `src/plugins/registry.py::CLASSIFIER_REGISTRY`
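A skeleton of such a plugin is sketched below. The stand-in types mimic the ones named above (`BookmarkFeatures`, `ClassificationResult`, `BaseClassifier`), but their real fields and signatures live in `src/plugins/base.py` and may differ — check them before implementing.

```python
from dataclasses import dataclass

# Stand-ins for the real types in src/plugins/base.py (fields assumed).
@dataclass
class BookmarkFeatures:
    url: str
    title: str

@dataclass
class ClassificationResult:
    category: str
    confidence: float

class BaseClassifier:
    name = "base"
    def classify(self, features: BookmarkFeatures) -> ClassificationResult:
        raise NotImplementedError

class DocsClassifier(BaseClassifier):
    """Example plugin: flags documentation sites by simple keyword rules."""
    name = "docs"
    def classify(self, features: BookmarkFeatures) -> ClassificationResult:
        if "docs." in features.url or "documentation" in features.title.lower():
            return ClassificationResult("Documentation", 0.8)
        return ClassificationResult("Unknown", 0.0)

# Registration would mirror src/plugins/registry.py::CLASSIFIER_REGISTRY:
CLASSIFIER_REGISTRY = {DocsClassifier.name: DocsClassifier}
```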
- Large datasets (>1000 bookmarks): Use `--workers 4-8` for parallelism
- Memory constraints: Use `--no-ml` to disable ML classification
- The system automatically caches repeated classifications
LLM classification is optional and falls back to ML/rules if:
- API key not configured
- API request fails
- Response parsing fails
Configure it in `config.json` under the `llm` section.
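The fallback behavior described above can be sketched as a simple try-next chain. This is a simplified, standalone illustration — the real chain lives in the plugin pipeline, and the strategy signatures here are assumptions.

```python
def classify_with_fallback(bookmark, llm, ml, rules):
    """Try each strategy in priority order; any failure falls through.

    Each strategy is a callable taking a bookmark and returning a result,
    or None if that strategy is unavailable (e.g. no API key configured).
    """
    for strategy in (llm, ml, rules):
        if strategy is None:
            continue  # strategy not configured
        try:
            result = strategy(bookmark)
            if result is not None:
                return result
        except Exception:
            continue  # e.g. API request failed or response parsing failed
    return None
```

An LLM strategy that raises (failed request, bad parse) is silently skipped and the rule-based strategy still produces a result.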
- Language: Mixed Chinese/English codebase (comments, variable names, docs)
- Type Hints: Widely used throughout for clarity
- Error Handling: Graceful degradation (e.g., ML fails → rules work)
- Logging: Centralized in `logs/ai_classifier.log` with configurable levels
- Dependencies: Heavy ML stack (scikit-learn, numpy, jieba for Chinese)
- `src/ai_classifier.py::AIBookmarkClassifier` - Central orchestrator
- `src/bookmark_processor.py::BookmarkProcessor` - Batch processing
- `src/plugins/pipeline.py` - Plugin execution flow
- `src/config_manager.py` - Configuration loading
- `docs/design/system_architecture.md` - Detailed architecture
Edit the `config.json` → `category_rules` mapping:

```json
{
  "💻 编程/代码仓库": {
    "rules": [
      {
        "match": "domain",
        "keywords": ["github.com"],
        "weight": 20
      }
    ]
  }
}
```

- Add a method to `src/bookmark_processor.py::BookmarkProcessor`
- Update the `supported_formats` list
- Handle it in CLI argument parsing (`main.py`)
- Adjust `max_workers` in config based on CPU cores
- Tune `cache_size` for the memory/performance tradeoff
- Use `confidence_threshold` to control precision vs. recall
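As a sketch, these knobs might sit in `config.json` roughly like this — the `ai_settings` key name and nesting are assumptions; check the existing `config.json` for the actual layout:

```json
{
  "ai_settings": {
    "max_workers": 8,
    "cache_size": 2048,
    "confidence_threshold": 0.75
  }
}
```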
- Core ML Stack: scikit-learn, numpy, pandas, jieba (Chinese text)
- CLI/UI: rich (terminal formatting), click, tqdm
- Web/Parse: beautifulsoup4, requests, lxml
- Testing: pytest, pytest-cov, hypothesis (property-based)
See requirements.txt and pyproject.toml for exact versions.