Summarize, explain, fact-check, or translate any text, URL, or file. No GPU. No cloud. One command.
fftext s https://en.wikipedia.org/wiki/Llama.cppThree bullet points, streamed to your terminal, generated on your CPU. No API key. No round-trip to anyone's server.
πΊ Watch the demo on YouTube
- β‘ Fast on CPU. Powered by a quantized 0.8B Qwen3.5 (Q4_K_M GGUF, ~500 MB) running through
llama.cpp. Streams tokens as they're generated so you see the answer build, not a spinner. No CUDA. No Metal-only tricks. Plain old cores. - π Files, URLs, or raw strings. Point it at a
.txt, paste an article URL, or just type the text inline. URLs get fetched, run throughreadability-lxmlfor main-content extraction, and stripped to clean prose before the model sees them. - π΄ Offline after first run. The model downloads once to your Hugging Face cache and stays there. Your text never leaves your machine (except for
check, which needs the web β see below). - πͺΆ Lean deps.
llama-cpp-python,requests,beautifulsoup4,readability-lxml,lxml. That's it. No PyTorch, no LangChain, no cloud SDKs. - π§ Four tasks, four prompts, one binary. Summarize, explain like I'm five, fact-check against the live web, or translate into any language or register you can describe. Each task is a separate, focused prompt β not one mega-prompt trying to do everything.
- π£ Translate into anything you can describe.
--lang "formal German",--lang "casual Japanese",--lang "Brazilian Portuguese"β whatever string you pass goes straight into the prompt. You drive the register and dialect. - π Fact-check with citations.
fftext checkextracts claims, ranks them, web-searches each one (Mojeek and Startpage, rotated), and labels them SUPPORTED, REFUTED, CONFLICTING, or INSUFFICIENT β with a source URL per claim. CPU-only, no API key, no Google.
# Install
pip install .
# Try the four tasks
fftext s notes.txt # summarize a file
fftext e https://en.wikipedia.org/wiki/Photosynthesis # ELI5 a URL
fftext c "The Eiffel Tower was built in 1822." # fact-check a string
fftext t --lang "French" "How are you today?" # translateFirst run downloads ~500 MB of model weights. Every run after is offline (except check, which searches the web).
| Subcommand | Alias(es) | What it does |
|---|---|---|
summarize |
s |
Three short bullet points. Concrete and specific, no preamble. |
explain |
e, eli5 |
Plain-language explanation, 4β6 sentences, like to a curious kid. |
check |
c |
Extract claims β web-search each β label SUPPORTED / REFUTED / CONFLICTING / INSUFFICIENT. |
translate |
t |
Translate into any language/register you describe via --lang. |
Every task accepts the same three input shapes β file, URL, or raw string β resolved in that order.
# Summarize anything
fftext s notes.txt
fftext s https://example.com/post
fftext s "Paste a long block of text right here on the command line."
# Explain it like I'm ten
fftext e paper.pdf.txt
fftext eli5 https://en.wikipedia.org/wiki/Quantum_entanglement
# Fact-check
fftext c article.txt
fftext c "The Roman Empire fell in 476 AD."
fftext c --debug article.txt # show ranking, queries, snippets, raw verdicts
# Translate
fftext t hello.txt # defaults to English
fftext t --lang "formal German" hello.txt
fftext t --lang "casual Japanese" "How are you today?"
fftext t --lang "polite Brazilian Portuguese" letter.txt
fftext t -l "Egyptian Arabic" "Where is the train station?"<input> for any subcommand is resolved in this order:
- Starts with
http://orhttps://β fetched withrequests, parsed withreadability-lxmlto isolate the main article body, then stripped to plain text with paragraph breaks preserved. Falls back to a light tag-strip if readability can't find an article (common on docs pages and indexes). - Looks like an existing file path β read as UTF-8 (errors replaced).
- Anything else β treated literally as a string.
Long inputs are head-and-tail clipped to ~10,000 characters (~2,500 tokens) so prompt + generation + chat template fit comfortably in the 4,096-token context. You'll see a [note: input clipped...] line on stderr when that happens. The clip keeps the start and end of the document, which preserves intros and conclusions β what summaries and explanations care about most.
Streamed to stdout as it's generated. Notes and timing info go to stderr, so you can pipe just the answer:
fftext s long-doc.txt > summary.txt
fftext t --lang French letter.txt | tee letter.fr.txt- The author argues that small local models are now good enough for routine text tasks.
- Speed gains come from quantization and streaming, not better hardware.
- The main remaining gap is multilingual quality below 7B parameters.
A neural network is like a giant calculator that learns by example. You show it lots
of pictures of cats and dogs, and it slowly figures out which patterns mean "cat" and
which mean "dog." Each time it gets one wrong, it nudges its internal numbers a tiny
bit so it'll do better next time. After millions of nudges, it gets pretty good.
One line per claim, with a verdict label and the top supporting URL:
SUPPORTED The Eiffel Tower was completed in 1889. [https://en.wikipedia.org/wiki/Eiffel_Tower]
REFUTED It was built by Thomas Edison. [https://www.britannica.com/biography/Gustave-Eiffel]
INSUFFICIENT It is currently the tallest structure in Paris. [-]
Run with -v for timings and --debug to see ranked claims, generated search queries, raw snippets, and the model's reasoning before each verdict.
The translation, and nothing else. No "Here's the translation:" preamble, no original text echoed back, no transliteration unless the target language genuinely calls for it. Paragraph breaks and markdown formatting are preserved.
| Flag | Description |
|---|---|
-v, --verbose |
Print timing info to stderr (token rate, per-stage timings on check). |
-d, --debug |
check only. Dump claims, queries, snippets, verdicts, and dropped reasons. |
-l, --lang |
translate only. Target language description. Default: English. |
-h, --help |
Show usage and exit. |
Flags can appear anywhere on the command line. The subcommand has to come first.
One LLM call, streamed. The whole trick is keeping the system prompt short β a 0.8B model gets confused by long instructions and burns tokens echoing them back. Each task has its own tight system prompt (3β4 lines) and a sane max_tokens cap so the model doesn't ramble. Sampling is temperature=0.3, top_p=0.9, repeat_penalty=1.1 β faithful, not creative.
Per run:
- Extract claims. LLM emits a JSON array of factual statements (names, numbers, dates, roles, events). Robust parser tolerates trailing commas, smart quotes, missing brackets, and falls through to a numbered-list scrape if the JSON is hopeless. Deduped against normalized lowercase + whitespace.
- Rank. LLM picks the top three most fact-checkable claims out of up to twelve. Each surviving claim costs ~4 more LLM calls, so ranking 9β3 saves ~24 calls.
- Rewrite as keyword queries. One LLM call per claim turns
"James Talarico is a Presbyterian seminarian."into"James Talarico" Presbyterian seminarian. Real search engines weight rare tokens; sending whole sentences with stopwords tanks recall. Heuristic stopword-strip fallback if the rewrite looks suspicious. - Search. Mojeek and Startpage, rotated by claim index, with fallback to the other on empty. Jittered sleeps and a generic desktop UA. Sanitized queries to avoid tripping WAFs on
$, backticks, pipes, etc. Eight-thread pool, ~8s timeout per request. - Summarize evidence. LLM compresses each snippet into one sentence about the claim. Irrelevant snippets are dropped here, not at the judge stage.
- Synthesize. LLM lays out what supports, what contradicts, and what's missing β short and structured.
- Evaluate. Deterministic shortcuts handle the obvious cases (no support β REFUTED; nothing either way β INSUFFICIENT). Genuinely mixed evidence goes to one more LLM call with
<think>reasoning enabled, picking one of four labels.
Per-claim total: about four LLM calls and one search round-trip. The ranker keeps the bill from exploding on long inputs.
- Threads. Detected from
os.cpu_count()and halved βos.cpu_count()returns logical cores, and oversubscribing hyperthreads runs slower than just using the physical ones. Override withQWEN_THREADS=Nif you know your physical core count and want to skip the heuristic. - Context. Fixed at 4,096 tokens. Per-token generation cost scales with filled context, not the cap, so the cap itself is nearly free β what costs you is filling it via bigger inputs. The 10,000-character clip keeps that under control.
- Streaming. Matters more on CPU than on GPU. Total latency is what it is, but perceived latency drops a lot when the first token arrives in under a second.
- C-level log silencing.
llama.cppprints warnings via Cprintfthat bypass Python'sverbose=False. fftext installs a null log callback to kill then_ctx_seq < n_ctx_trainnag and friends. Trade-off: real C-level errors get swallowed too, but Python-level exceptions still propagate fine.
The default model is unsloth/Qwen3.5-0.8B-GGUF (Qwen3.5-0.8B-Q4_K_M.gguf, ~500 MB), downloaded on first run via huggingface-hub to your standard HF cache:
- macOS / Linux β
~/.cache/huggingface/hub/ - Windows β
%USERPROFILE%\.cache\huggingface\hub\
To use a different GGUF model, edit load_model() in llm.py and swap the repo_id / filename. Anything llama.cpp β₯ the bundled version can load (Qwen, Llama, Mistral, Gemma, Phi, etc.) should work, but the prompt templates and stop sequences are tuned for the Qwen3.5 chat format β your mileage on other families will vary.
Before fftext had subcommands it was a small wrapper around llama-cpp-python for testing. Those modes still work:
python main.py # canned demo prompt
python main.py "your prompt here" # one-shot
python main.py -i # interactive chat (Ctrl-C to quit)Mostly useful for sanity-checking the model load and sampling parameters when you change something.
- 0.8B is small. It's good enough for the four tasks above, and it's fast enough to actually be useful on a laptop. But it's not GPT-4. Long, complex documents get clipped, and the model occasionally hallucinates on edge-case claims.
checkexists precisely because the model can't be trusted as a one-shot oracle β let it propose, let the web dispose. checkdepends on scraping. Mojeek and Startpage rotate, with jittered sleeps and a desktop UA, but if both go down or both start serving captchas you'll see empty results andINSUFFICIENTverdicts. Run with--debugto confirm whether you're being blocked vs. just hitting a thin topic.- Translation works best between major languages. A 0.8B model handles English β French, Spanish, German, Italian, Portuguese, and Chinese well; smaller languages and complex register requests degrade more.
- URL parsing is best-effort.
readability-lxmlis strong on articles, weaker on docs pages, listings, and SPAs. The fallback tag-strip catches the rest. If you get garbage out of a particular URL, save the page as text first and pass the file.
Apache-2.0 for this project. The Qwen3.5 model is distributed under its own license β see the model card. Powered by llama.cpp via llama-cpp-python, with URL parsing courtesy of readability-lxml.
