This project automates the process of collecting, processing, and organizing your liked content from X (Twitter) into a structured Knowledge Base.
- Ingestion: Fetches liked tweets via Twitter API (or uses mock data for testing).
- Extraction: Fetches and extracts content from linked URLs (HTML, PDF).
- Analysis: Uses LLM (via LM Studio) to classify content, generate summaries, and extract key takeaways.
- First-pass X classification: `post` vs `article`.
- Post subtype classification: `normal_post`, `comment_or_reply`, `paper_or_project_reco`, `course_teaching`, `release`. `comment_or_reply` posts are ignored.
- Article outputs include title, abstract, and ~5 keywords.
- Paper/project/course/release posts include structured links and metadata when available.
- Coarse content classification is still retained: `paper`, `article` (X Article), `long_blog`, `blog`, `thread`, `comment`, etc.
- Fine AI taxonomy (for `paper`/`article`/`long_blog`/`blog`): e.g. `LLM`, `RAG`, `Agent`, `MLOps`, `Evaluation`.
- Knowledge Base: Generates a Markdown-based knowledge base with:
- Individual item files
- Tag index
- Weekly digests
- Tag Control: Applies a tag allowlist from `keep.txt` (or `TAGS_KEEP_FILE`) to prevent noisy tags.
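The allowlist step can be sketched roughly as below. This is a minimal illustration, not the project's actual code: the `load_allowlist`/`filter_tags` helpers are hypothetical names, but the `keep.txt` default and the `TAGS_KEEP_FILE` override match the behavior described above.

```python
import os
from pathlib import Path
from typing import List, Optional, Set

def load_allowlist(path: Optional[str] = None) -> Set[str]:
    """Read one tag per line from keep.txt (or TAGS_KEEP_FILE), skipping blanks and # comments."""
    p = Path(path or os.environ.get("TAGS_KEEP_FILE", "keep.txt"))
    if not p.exists():
        return set()  # no allowlist file -> keep all tags
    return {
        line.strip()
        for line in p.read_text(encoding="utf-8").splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    }

def filter_tags(tags: List[str], allow: Set[str]) -> List[str]:
    """Drop tags not on the allowlist; an empty allowlist keeps everything."""
    return tags if not allow else [t for t in tags if t in allow]
```

With `keep.txt` containing `LLM` and `RAG`, a noisy tag list like `["LLM", "RAG", "random-noise"]` is reduced to `["LLM", "RAG"]`.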
- Python 3.8+
- LM Studio running locally (compatible with the OpenAI API).
  - Start the LM Studio server on `http://127.0.0.1:1234`.
  - Load a model (e.g., Mistral, Llama 3).
- Twitter API Bearer Token (optional, for real data).
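A quick way to confirm the LM Studio server is reachable before running the pipeline is to query its OpenAI-compatible model list endpoint. A stdlib-only sketch (the `lm_studio_ready` helper is hypothetical; `/v1/models` is the standard OpenAI-compatible route):

```python
import json
import urllib.request

def lm_studio_ready(base_url: str = "http://127.0.0.1:1234") -> bool:
    """True when the OpenAI-compatible /v1/models endpoint responds with a loaded model."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            return bool(json.load(resp).get("data"))
    except (OSError, ValueError):
        return False
```

If this returns `False`, check that the server is started and a model is loaded before running `start.sh`.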
- Clone the repository.
- Set up the environment:
```bash
# Create virtual environment
python3 -m venv xlike
source xlike/bin/activate

# Install dependencies
pip install -r requirements.txt

# Install Playwright browsers
playwright install
```
- Configure environment:
```bash
cp .env.example .env
# Edit .env with your configuration
```
You can use the helper script `start.sh` to activate the environment and run the pipeline automatically.
Run with mock data to test the pipeline end to end; this generates 5 mock items:

```bash
./start.sh --mock --limit 5
```

Ensure `TWITTER_BEARER_TOKEN` is set in `.env` for API mode, or `TWITTER_USERNAME`/`PASSWORD` for Browser mode.
API Mode (fast, but limited metadata):

```bash
./start.sh --api --limit 10
```

Browser Mode (recommended for full classification) uses Playwright to log in and scrape likes, handling thread/blog classification:

```bash
./start.sh --browser --limit 10

# Run in visible mode to debug login
./start.sh --browser --visible --limit 10
```

Use your normal Chrome profile to avoid repeated login challenges.
- Fully quit Chrome first (important, or you may hit the `ProcessSingleton` lock).
- Run with your system Chrome profile path:

```bash
BROWSER_USER_DATA_DIR="$HOME/Library/Application Support/Google/Chrome" \
CHROME_PROFILE_DIR="Default" \
PROXY_URL='' \
HEADLESS=false \
./start.sh --browser --visible --all --since 2026-01-01 --until 2026-03-09
```

If your daily profile is not `Default`, set `CHROME_PROFILE_DIR` to `Profile 1`, `Profile 2`, etc.
If direct profile mode runs into Chrome profile locks or startup errors, use CDP mode instead.
Practical setup tested on 2026-03-10:
```bash
# 1) Prepare a reusable profile clone (one-time or occasional refresh)
mkdir -p /tmp/chrome_cdp_userdata
rsync -a --delete "$HOME/Library/Application Support/Google/Chrome/Default" /tmp/chrome_cdp_userdata/
cp "$HOME/Library/Application Support/Google/Chrome/Local State" /tmp/chrome_cdp_userdata/Local\ State

# 2) Start real Chrome with CDP
open -na "Google Chrome" --args \
  --remote-debugging-port=9222 \
  --user-data-dir=/tmp/chrome_cdp_userdata \
  --profile-directory=Default

# 3) Run x-likes via CDP for a date range
CHROME_CDP_URL='http://127.0.0.1:9222' PROXY_URL='' HEADLESS=false \
./start.sh --browser --all --since 2026-03-09 --until 2026-03-10
```

Legacy direct-CDP command (may fail on newer Chrome builds when using the default profile dir directly):
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" \
--remote-debugging-port=9222 \
--user-data-dir="$HOME/Library/Application Support/Google/Chrome" \
--profile-directory=Default
CHROME_CDP_URL="http://127.0.0.1:9222" ./start.sh --browser --visible --limit 10- Range tested:
2026-03-09to2026-03-10. - Ingestion + classification + summarization completed successfully in CDP mode.
- 5 items were processed (all matched items were on
2026-03-09). - Post/Article outputs were generated as expected:
article: 1post: 4 (release: 1,normal_post: 3)
- Example outputs:
  - `output/items/2026/03/2026-03-09-gpt-oss-inference-from-scratch-2030893100181401958.md` (article with Abstract/Keywords)
  - `output/items/2026/03/2026-03-09-2031120122619060445.md` (post release with Release Info)
- `--limit N`: Number of items to process.
- `--all`: Try to fetch all likes in the date range.
- `--since YYYY-MM-DD`: Inclusive start date filter.
- `--until YYYY-MM-DD`: Inclusive end date filter.
- `--visible`: Run the browser in visible mode.
- `--sync-obsidian`: After processing, sync markdown reports to Obsidian.
- `--sync-only`: Only sync existing markdown reports to Obsidian (no ingestion/classification/summarization).
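The inclusive `--since`/`--until` semantics can be illustrated with a small filter. This is a sketch (the `in_range` helper is hypothetical), but both bounds are inclusive as the flag descriptions state:

```python
from datetime import date
from typing import Optional

def in_range(day: date, since: Optional[date] = None, until: Optional[date] = None) -> bool:
    """Both bounds are inclusive, matching --since/--until."""
    if since is not None and day < since:
        return False
    if until is not None and day > until:
        return False
    return True
```

So an item liked on `2026-03-10` still matches `--since 2026-03-09 --until 2026-03-10`.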
```bash
./start.sh --browser --all --since 2026-01-01 --until 2026-03-09 --visible
```

- `src/agents`: LLM agents for classification and summarization.
- `src/browser`: Browser automation (Playwright) and HTTP fetching.
- `src/ingest`: Data ingestion (Twitter API, Mock).
- `src/knowledge_base`: Markdown generation logic.
- `src/parser`: Content extraction (Readability, PyPDF).
- `data/`: Raw data storage (optional).
- `output/`: Generated Knowledge Base.
Check the output/ directory for the generated Markdown files.
- `output/items/YYYY/MM/`: Individual Markdown files for each liked item, partitioned by year/month.
- `output/tags/`: Index files for each tag.
- `output/weekly/`: Weekly digest files.
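The year/month partitioning can be sketched as follows. The `item_path` helper is hypothetical; the filename shape (`YYYY-MM-DD[-slug]-<id>.md`) follows the example outputs shown earlier:

```python
from datetime import date
from pathlib import Path
from typing import Optional

def item_path(root: Path, liked_on: date, item_id: str, slug: Optional[str] = None) -> Path:
    """Build output/items/YYYY/MM/YYYY-MM-DD[-slug]-<id>.md."""
    middle = f"-{slug}" if slug else ""
    name = f"{liked_on.isoformat()}{middle}-{item_id}.md"
    return root / "items" / f"{liked_on:%Y}" / f"{liked_on:%m}" / name
```

For example, an article liked on 2026-03-09 with a title slug lands next to slug-less posts from the same day, under `output/items/2026/03/`.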
This project supports syncing generated markdown reports into an Obsidian vault via Obsidian CLI.
- Enable/install Obsidian CLI in Obsidian (1.12+).
- Configure the vault path in `.env`:

```bash
OBSIDIAN_VAULT_PATH=/absolute/path/to/your/vault
OBSIDIAN_XLIKE_ROOT=x-like
OBSIDIAN_SYNC_ITEMS_LIMIT=20
OBSIDIAN_CLI_BIN=obsidian
```
- Run sync:
```bash
# Run the full x-likes pipeline and then sync
./start.sh --browser --limit 10 --sync-obsidian

# Sync only existing output markdown (no new fetch/analyze)
./start.sh --sync-only
```
Sync policy:
- `items`: sync only the most recent N markdown files (`OBSIDIAN_SYNC_ITEMS_LIMIT`), append-only (existing files are skipped).
- `weekly`: sync all weekly markdown files, with overwrite enabled to keep weekly reports up to date.
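The append-only item policy can be approximated like this. A sketch under assumptions: the `items_to_sync` helper and mtime-based recency ordering are illustrative, while `OBSIDIAN_SYNC_ITEMS_LIMIT` and its default of 20 come from the `.env` example above:

```python
import os
from pathlib import Path
from typing import List

def items_to_sync(output_items: Path, vault_items: Path) -> List[Path]:
    """Pick the most recent N item files, then drop any already present in the vault."""
    limit = int(os.environ.get("OBSIDIAN_SYNC_ITEMS_LIMIT", "20"))
    recent = sorted(output_items.rglob("*.md"),
                    key=lambda p: p.stat().st_mtime, reverse=True)[:limit]
    return [p for p in recent if not (vault_items / p.relative_to(output_items)).exists()]
```

Files already in the vault are never rewritten, which is what "append-only" means here; weekly reports go through a separate overwrite path.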
Real vault integration was tested against:
/Users/wonster/Library/Mobile Documents/iCloud~md~obsidian/Documents/Notes
What is working:
- The `x-like/` root folder is created in the vault.
- `weekly` report sync works and supports overwrite.
- Most `items` files sync correctly in append-only mode.
- Sync logic now verifies that target file content was actually written after each CLI `create` call.
- Sync logic now uses command timeouts to prevent indefinite hanging.
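The verify-after-`create` behavior can be sketched as a wrapper like the one below. This is hypothetical (the real sync invokes the Obsidian CLI with its own arguments); it only illustrates combining a subprocess timeout with a post-write content check instead of trusting the CLI exit code alone:

```python
import subprocess
from pathlib import Path
from typing import List

def run_and_verify(cmd: List[str], target: Path, expected: str, timeout: float = 30.0) -> None:
    """Run a sync command with a hard timeout, then confirm the target file was
    actually written with the expected content; raise instead of silently passing."""
    subprocess.run(cmd, check=True, timeout=timeout)  # TimeoutExpired on hang
    if not target.exists() or target.read_text(encoding="utf-8") != expected:
        raise RuntimeError(f"verification failed: {target} not written as expected")
```

A hang raises `subprocess.TimeoutExpired` and a missing or wrong file raises `RuntimeError`, which is the fail-fast behavior described in the known-issues section below.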
Known issue observed on current Obsidian install:
- Obsidian CLI prints an installer warning (`out of date`) and can become unstable.
- In test runs, two target item files repeatedly failed to land in the vault despite prior "created" logs:
  - `items/2026/03/2026-03-09-2030896015830827038.md`
  - `items/2026/03/2026-03-09-gpt-oss-inference-from-scratch-2030893100181401958.md`
- With strict verification enabled, these cases now fail fast with an explicit timeout/error instead of silently passing.
Operational note:
- Before debugging, a vault backup was created at: `/tmp/obsidian-vault-Notes-backup-20260310-153844`