This project provides tools to process and summarize academic papers for various downstream tasks.
summarize_chunks.py is a Python script that processes academic papers in Markdown format. It chunks the text and generates detailed, structured JSON summaries for each chunk using large language models via the OpenRouter API.
- Intelligent Chunking: Splits long Markdown files into smaller, overlapping chunks based on token count, so context is preserved across chunk boundaries (see the chunking sketch after this list).
- Robust Summarization ("Reliable Recipe"): Implements a two-step process to generate high-quality summaries efficiently (both steps are sketched after this list):
  - Gist Generation: First generates a concise, ~120–180 token summary (the "gist") for a text chunk, with an automatic retry if the initial summary is too short.
  - Structured Data Extraction: Then uses the generated gist to extract structured data (key claims, figure references, and equations) into a clean JSON object. This step is isolated from the original text so the model focuses only on extraction.
- Validation: Ensures the long-form gist is preserved in the final JSON, normalizing whitespace to tolerate minor formatting differences and substituting the provided gist when the model drops it. Accepts flexible figure references and coerces fields to the correct types.
- Model Flexibility: Supports any model available through the OpenRouter API, allowing users to balance cost, speed, and quality by specifying a model at runtime.
- Structured Output: Produces a `chunk_summaries.jsonl` file containing one JSON object per chunk, with fields like `id`, `page`, `gist`, `claims`, `figs`, `eqs`, `key_terms`, and `anchors`.
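
A minimal sketch of the overlapping, token-based chunking described above, assuming tiktoken's `cl100k_base` encoding (the script's actual encoding and helper names may differ):

```python
import tiktoken

def chunk_text(text: str, chunk_size: int = 1024, overlap: int = 128) -> list[str]:
    """Split text into token-based chunks that overlap to preserve context."""
    enc = tiktoken.get_encoding("cl100k_base")  # assumption: actual encoding may differ
    tokens = enc.encode(text)
    chunks, start = [], 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(enc.decode(tokens[start:end]))
        if end == len(tokens):
            break
        start = end - overlap  # step back so consecutive chunks share `overlap` tokens
    return chunks
```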
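And a sketch of the two-step "reliable recipe"; `call_llm` is a hypothetical helper wrapping the OpenRouter call, and the retry threshold is illustrative, not the script's exact cutoff:

```python
import json

MIN_GIST_WORDS = 100  # illustrative threshold; the script's actual cutoff may differ

def summarize_chunk(chunk: str, call_llm) -> dict:
    # Step 1: generate the gist, retrying once if it comes back too short.
    gist = call_llm(f"Summarize in 120-180 tokens:\n\n{chunk}")
    if len(gist.split()) < MIN_GIST_WORDS:
        gist = call_llm(f"Summarize in 120-180 tokens (be thorough):\n\n{chunk}")

    # Step 2: extract structured fields from the gist alone, not the raw chunk.
    raw = call_llm(
        'From this summary, return JSON with keys "claims", "figs", "eqs", '
        '"key_terms":\n\n' + gist
    )
    record = json.loads(raw)
    record["gist"] = gist  # validation: the long-form gist is preserved verbatim
    return record
```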
- Environment: The script requires Python 3; a virtual environment is recommended.
- API Key: Create a `.env` file in the root of the project and add your OpenRouter API key: `OPENROUTER_API_KEY='your_api_key_here'` (a minimal call sketch follows this list).
- Dependencies: A `requirements.txt` is not provided; the script depends on the following packages: `pip install python-dotenv httpx tqdm tiktoken`
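
For orientation, a minimal sketch of how these pieces fit together, loading the key with python-dotenv and calling OpenRouter's chat-completions endpoint over httpx (the model slug and prompt are placeholders):

```python
import os

import httpx
from dotenv import load_dotenv

load_dotenv()  # reads OPENROUTER_API_KEY from .env

resp = httpx.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "meta-llama/llama-3.2-3b-instruct",  # any OpenRouter slug works
        "messages": [{"role": "user", "content": "Summarize: ..."}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```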
To run the script, use the following command structure:
```bash
source .venv/bin/activate
python summarize_chunks.py --pages-dir <path_to_markdown_files> [OPTIONS]
```

Required Arguments:
- `--pages-dir`: The path to the directory containing the page-wise Markdown files (e.g., `page_001.md`, `page_002.md`).

Common Options:
- `--force`: Force re-summarization of all chunks, even if a summary file already exists.
- `--verbose`: Enable verbose logging to see detailed progress and potential warnings.
- `--chunk-size`: Chunk size in tokens (default: 1024).
- `--overlap`: Token overlap between chunks (default: 128).
Model Selection:
- Use the `--model` argument to specify an OpenRouter model slug, or a comma-separated list for fallback (a fallback sketch follows the example).
- Default: `meta-llama/llama-3.2-3b-instruct,google/gemma-2-9b-it`.
- The model can also be set via the `OPENROUTER_SUMMARIZE_MODEL`, `OPENROUTER_EXTRACTOR_MODEL`, or `OPENROUTER_MODEL` environment variables.
- Token cap: `--max-tokens` (default: unlimited; omit to uncap).
- Example:

```bash
python summarize_chunks.py \
  --pages-dir mistral_responses/test_paper/markdown \
  --outdir artifacts/test_paper \
  --model meta-llama/llama-3.2-3b-instruct,google/gemma-2-9b-it
```
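
A sketch of how a comma-separated model list can act as a fallback chain; `call_model` is a hypothetical single-model wrapper, and the script's actual error handling may differ:

```python
def call_with_fallback(models_arg: str, prompt: str, call_model) -> str:
    """Try each comma-separated model slug in order until one succeeds."""
    last_err = None
    for slug in models_arg.split(","):
        try:
            return call_model(slug.strip(), prompt)
        except Exception as err:  # e.g. rate limit, empty response
            last_err = err  # remember the failure and try the next slug
    raise RuntimeError(f"All models failed: {last_err}")
```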
plan_slides.py plans a sequence of slide sections from the chunk summaries. Instead of planning one slide at a time, it groups slides into logical sections, each with a title and a list of topics to be covered.
- Prerequisite: Run `make_paper_card.py` first to create a governance card (`paper_card.json`).
- Inputs (from a single artifacts subdir): `chunk_summaries.jsonl`, `chunk_index.jsonl`, `paper_card.json` (from the pre-pass)
- Output: `slide_plan.json`: a list of slide sections with fields `section_title`, `slide_topics`, `plan`, `learning_objective`, `references` (≥2 chunk IDs per section), and `figures` (filtered from referenced chunks only). A hypothetical entry is sketched after the usage block.
- Env: `OPENROUTER_API_KEY` must be set in `.env`. Optional `OPENROUTER_PLANNER_MODEL` (default: `qwen/qwen-2.5-7b-instruct,mistralai/mixtral-8x7b-instruct`).
- Model Override: pass `--model` to override the default/env model.
- Token Cap: `--max-tokens` (default: unlimited).
- Usage:

```bash
source .venv/bin/activate
python plan_slides.py \
  --summaries-dir artifacts/test_paper \
  --outdir artifacts/test_paper \
  --verbose --force
```
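
A hypothetical shape for one `slide_plan.json` entry, using the field names listed above (the values are invented for illustration):

```python
example_section = {
    "section_title": "Method",
    "slide_topics": ["Problem setup", "Model architecture"],
    "plan": "Walk through the core method before presenting results.",
    "learning_objective": "Viewers can explain the method at a high level.",
    "references": ["chunk_007", "chunk_009"],  # at least 2 chunk IDs per section
    "figures": ["fig_2"],  # only figures mentioned in the referenced chunks
}
```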
Notes:
- The planner follows the canonical section order from the Paper Card (default: Overview → Method → Results → Discussion → Limitations → Conclusion).
- It enforces evidence: each section must include at least 2 `references` to chunk IDs.
- Figure IDs are validated against figures mentioned in the referenced chunks; unrelated figures are dropped (sketched below).
- Adds a `learning_objective` per section (1–2 sentences).
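
A minimal sketch of the evidence and figure checks described in these notes, assuming each chunk summary carries the `id` and `figs` fields listed earlier; helper and variable names are illustrative:

```python
def validate_section(section: dict, chunks_by_id: dict) -> dict:
    """Enforce >=2 references and drop figures not backed by referenced chunks."""
    refs = section.get("references", [])
    if len(refs) < 2:
        raise ValueError(f"Section {section['section_title']!r} needs >=2 references")

    # Union of figures mentioned across the section's referenced chunks.
    allowed = {fig for ref in refs for fig in chunks_by_id[ref].get("figs", [])}
    section["figures"] = [f for f in section.get("figures", []) if f in allowed]
    return section
```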
generate_slides.py generates per-slide content JSON (Title, Content bullets, Audio narration, Figures) using the slide plan and source text.
- Inputs:
  - `--artifacts-dir` containing: `paper_card.json`, `slide_plan.json`, `chunk_summaries.jsonl`, `chunk_index.jsonl`
  - OCR Markdown under `--ocr-dir/<pdf-name>/markdown/*.md` (used to pull exact chunk text spans)
- Output: `presentation.json` (a hypothetical slide entry is sketched after this list)
- Env: `OPENROUTER_API_KEY` must be set in `.env`. Optional `OPENROUTER_GENERATOR_MODEL` (default: `mistralai/mistral-small-24b-instruct-2501,meta-llama/llama-3.2-3b-instruct`).
- Model Override: pass `--model` to override the default/env model.
- Token Cap: `--max-tokens` (default: unlimited).
- Figure Reuse: `--figure-reuse-limit` (default: -1 for unlimited reuse across the deck).
- Usage (Option A: keep outputs isolated per paper):

```bash
source .venv/bin/activate
python generate_slides.py \
  --ocr-dir mistral_responses \
  --pdf-name test_paper \
  --artifacts-dir artifacts/test_paper \
  --outdir artifacts/test_paper \
  --verbose --force
```

- Usage (Option B: write to repo-level `artifacts/presentation.json`):

```bash
source .venv/bin/activate
python generate_slides.py \
  --ocr-dir mistral_responses \
  --pdf-name test_paper \
  --artifacts-dir artifacts/test_paper \
  --outdir artifacts \
  --verbose --force
```
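
A hypothetical shape for one slide in `presentation.json`, combining the fields named in this section with the continuity fields described in the notes below (values invented for illustration):

```python
example_slide = {
    "Title": "Why the method scales",
    "Content": ["First bullet", "Second bullet"],
    "Audio": "Spoken narration for this slide...",
    "Figures": ["fig_2"],
    "WhyThisSlide": "Motivates the method before the results.",
    "BridgeFromPrevious": "We just saw the problem setup...",
    "BridgeToNext": "Next, the experimental results.",
}
```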
Notes:
- Figures are selected from the union of figures in the slide's `references`; planner suggestions are respected only if they intersect with referenced figures.
- Figure reuse is unlimited by default; you can cap it with `--figure-reuse-limit`.
- If no figures are attached to a slide, the generator explicitly instructs the model not to mention figures, to avoid mismatches.
- The generator enforces JSON-only output, with a fallback to non-JSON mode and light repairs to handle model quirks such as empty content or minor JSON issues (a repair sketch follows this list).
- The slide prompt enforces narrative continuity via `WhyThisSlide`, `BridgeFromPrevious`, and `BridgeToNext` fields.
- De-duplication: the generator tracks previously used `claims` and prefers novel claims for each slide.
- Context discipline: only the last 2 slide summaries plus compact checkpoint notes are passed to the LLM to avoid drift.
- Works well: `mistralai/mistral-small-24b-instruct-2501` (consistent JSON with `response_format={"type": "json_object"}`).
- Caveats: `openai/gpt-oss-120b` sometimes returns empty content under JSON mode; the generator now falls back to non-JSON mode and repairs responses, but you may still prefer Mistral for reliability/cost.
- Note: slugs without a `:free` suffix are recommended for reliability.
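
One common flavor of the "light repair" mentioned above is pulling a JSON object out of a response that wrapped it in prose or code fences; a minimal sketch (the script's actual repairs may go further):

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Parse strict JSON, falling back to the first {...} block in the text."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", raw, flags=re.DOTALL)  # grab outermost braces
        if match:
            return json.loads(match.group(0))
        raise
```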
make_paper_card.py creates a governance card (`paper_card.json`) from the earliest and latest chunk summaries and figure captions, to guide planning.
- Input: `chunk_summaries.jsonl` (in an artifacts subdir)
- Output: `paper_card.json` with keys `tldr`, `contributions`, `method_oneliner`, `key_results`, `limitations`, and `section_order` (canonical deck order); a hypothetical card follows the usage block.
- Env: `OPENROUTER_API_KEY` must be set in `.env`. Optional `OPENROUTER_CARD_MODEL` (default: `mistralai/mistral-small-24b-instruct-2501,meta-llama/llama-3.2-3b-instruct`).
- Usage:

```bash
source .venv/bin/activate
python make_paper_card.py \
  --artifacts-dir artifacts/test_paper \
  --outdir artifacts/test_paper \
  --verbose --force
```
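
A hypothetical `paper_card.json`, using the keys listed above (values invented for illustration):

```python
example_card = {
    "tldr": "One-paragraph takeaway of the paper.",
    "contributions": ["Contribution one", "Contribution two"],
    "method_oneliner": "A single sentence describing the method.",
    "key_results": ["Headline result with its metric"],
    "limitations": ["Known limitation worth flagging"],
    "section_order": ["Overview", "Method", "Results",
                      "Discussion", "Limitations", "Conclusion"],
}
```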
generate_audio.py generates audio files from `presentation.json` using the Sarvam TTS API. The script includes a robust pipeline to handle texts that exceed the API's character limit (sketched after this list):
- Sentence Splitting: The script first splits the slide's narration text into individual sentences.
- Per-Sentence Audio Generation: To work around potential API bugs with specific sentence combinations, each sentence is sent to the Sarvam API as a separate request to generate an audio chunk.
- Concatenation: The individual audio chunks for each sentence are then seamlessly concatenated into a single, complete WAV file for the slide using `ffmpeg`.
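
A sketch of that split-synthesize-concatenate pipeline; `synthesize_sentence` stands in for the Sarvam API call (its real client interface is not shown here), while the ffmpeg concat-demuxer invocation is standard:

```python
import re
import subprocess
from pathlib import Path

def narration_to_wav(text: str, out_path: Path, synthesize_sentence) -> None:
    # 1) Split the narration into sentences (naive split on end punctuation).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    # 2) One API request per sentence, each producing a small WAV chunk.
    chunk_paths = []
    for i, sentence in enumerate(sentences):
        chunk = out_path.with_name(f"{out_path.stem}_part{i:03d}.wav")
        chunk.write_bytes(synthesize_sentence(sentence))  # hypothetical TTS call
        chunk_paths.append(chunk)

    # 3) Concatenate the chunks with ffmpeg's concat demuxer.
    list_file = out_path.with_suffix(".txt")
    list_file.write_text("".join(f"file '{p.name}'\n" for p in chunk_paths))
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
         "-i", str(list_file), "-c", "copy", str(out_path)],
        check=True, cwd=out_path.parent,
    )
```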
Renders a PNG deck from a `presentation.json` file using Marp. `generate_slide_pngs.py` reads the slide titles and content and creates `deck.XXX.png` images in `artifacts/<paper>/pngs/`.
- Inputs:
  - `--presentation-file`: path to the JSON deck (from `generate_slides.py`)
  - `--output-dir`: where to write `deck.md` and PNGs
  - `--paper-name`: used to resolve figure paths
- Cover Slide Tip: If your `presentation.json` already contains a cover slide (see below), pass `--no-cover` to avoid adding another cover during rendering.
- OCR Dir: If you want the renderer to build a cover from OCR (not recommended when the generator already added one), pass `--ocr-dir` and omit `--no-cover`.
Example:
```bash
python generate_slide_pngs.py \
  --presentation-file artifacts/test_paper/presentation.json \
  --output-dir artifacts/test_paper \
  --paper-name test_paper \
  --no-cover
```

Notes:
- We fixed an empty-first-slide issue by not inserting a leading slide separator in the generated Marp markdown.
- Headings are sanitized to avoid stray Markdown `#` prefixes in titles (both fixes are reflected in the sketch below).
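
A minimal sketch of assembling `deck.md` from the slides, reflecting both notes above: no leading `---` separator, and titles stripped of stray `#` prefixes (the real script also handles figures and styling):

```python
def build_marp_markdown(slides: list[dict]) -> str:
    """Join slides with '---' separators, but never start with one."""
    pages = []
    for slide in slides:
        title = slide["Title"].lstrip("# ").strip()  # sanitize stray '#' prefixes
        bullets = "\n".join(f"- {b}" for b in slide["Content"])
        pages.append(f"# {title}\n\n{bullets}\n")
    # Separator only *between* pages; a leading one creates an empty first slide.
    return "\n---\n\n".join(pages)
```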
You can ask the generator to prepend a cover slide that introduces the paper with title, authors, and narration (Audio).
- Flags:
  - `--add-cover` to enable cover generation
  - `--cover-model` to select the OpenRouter model for extracting the title/authors from OCR page 01
- Env:
  - `OPENROUTER_API_KEY` must be set
  - Optional `OPENROUTER_COVER_MODEL` can override the default cover model
- How it works:
  - Extracts the title/authors from `mistral_responses/<paper>/markdown/<paper>_page_01.md` using an LLM (OpenRouter JSON mode) with a heuristic fallback (sketched below)
  - Prepends a cover slide JSON with `Title`, `Content` (bullets), and `Audio` (spoken intro)
  - Seeds previous-slide context so slide 2 bridges naturally from the cover
- Renderer tip: When the cover is generated here, pass `--no-cover` to `generate_slide_pngs.py` to avoid a duplicate cover
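
When the LLM extraction fails, a heuristic fallback along these lines can recover the title and authors from the page-01 Markdown; this is a guess at the behavior, not the script's exact logic:

```python
def heuristic_title_authors(page_md: str) -> tuple[str, str]:
    """Guess: title = first heading (or first line), authors = the next line."""
    lines = [ln.strip() for ln in page_md.splitlines() if ln.strip()]
    if not lines:
        return "", ""
    idx = next((i for i, ln in enumerate(lines) if ln.startswith("#")), 0)
    title = lines[idx].lstrip("# ")
    authors = lines[idx + 1] if idx + 1 < len(lines) else ""
    return title, authors
```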
Example:
```bash
python generate_slides.py \
  --ocr-dir mistral_responses \
  --pdf-name test_paper \
  --artifacts-dir artifacts/test_paper \
  --outdir artifacts/test_paper \
  --add-cover \
  --model mistralai/mistral-small-24b-instruct-2501 \
  --max-tokens 600 \
  --force
```

stitch_video.py combines rendered PNG slides (`deck.XXX.png`) and generated WAV files (`slide_XXX.wav`) into a single MP4 using `ffmpeg`.
- Assumptions:
  - PNGs: `artifacts/<paper>/pngs/deck.001.png`, `deck.002.png`, ...
  - Audio: `artifacts/<paper>/audio/slide_001.wav`, `slide_002.wav`, ...
  - Pairs are matched by index; the script uses the intersection of indices found in both folders (a stitching sketch follows this list)
- Usage:

```bash
python stitch_video.py --paper-name test_paper
# or explicit directories:
python stitch_video.py \
  --png-dir artifacts/test_paper/pngs \
  --audio-dir artifacts/test_paper/audio \
  --output artifacts/test_paper/video.mp4
```

- Troubleshooting (stale files): If you previously generated audio without a cover and have now added one, the first WAV may belong to the old first content slide. Regenerate audio to match the current `presentation.json`, or remove extra WAV/PNG files beyond your slide count.
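
A sketch of the index matching and per-slide stitching; the ffmpeg flags shown are one standard way to pair a still image with audio and may differ from the script's exact invocation:

```python
import re
import subprocess
from pathlib import Path

def stitch(png_dir: Path, audio_dir: Path, output: Path) -> None:
    # Index PNGs and WAVs by the numeric part of their filenames.
    pngs = {re.search(r"(\d+)", p.stem).group(1): p for p in png_dir.glob("deck.*.png")}
    wavs = {re.search(r"(\d+)", w.stem).group(1): w for w in audio_dir.glob("slide_*.wav")}

    segments = []
    for idx in sorted(pngs.keys() & wavs.keys()):  # intersection of indices
        seg = output.with_name(f"seg_{idx}.mp4")
        # Loop the still PNG for the duration of the slide's audio.
        subprocess.run(
            ["ffmpeg", "-y", "-loop", "1", "-i", str(pngs[idx]), "-i", str(wavs[idx]),
             "-c:v", "libx264", "-tune", "stillimage", "-c:a", "aac",
             "-pix_fmt", "yuv420p", "-shortest", str(seg)],
            check=True,
        )
        segments.append(seg)

    # Concatenate all per-slide segments into the final MP4.
    list_file = output.with_suffix(".txt")
    list_file.write_text("".join(f"file '{s.name}'\n" for s in segments))
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", str(list_file),
         "-c", "copy", str(output)],
        check=True, cwd=output.parent,
    )
```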
```bash
# 1) Plan slides (after make_paper_card.py)
python plan_slides.py --summaries-dir artifacts/test_paper --outdir artifacts/test_paper --verbose --force

# 2) Generate slides with cover and reliable model
python generate_slides.py \
  --ocr-dir mistral_responses \
  --pdf-name test_paper \
  --artifacts-dir artifacts/test_paper \
  --outdir artifacts/test_paper \
  --add-cover \
  --model mistralai/mistral-small-24b-instruct-2501 \
  --force

# 3) Render PNGs (no extra cover)
python generate_slide_pngs.py \
  --presentation-file artifacts/test_paper/presentation.json \
  --output-dir artifacts/test_paper \
  --paper-name test_paper \
  --no-cover

# 4) Generate audio WAVs (requires SARVAM_API_KEY)
python generate_audio.py \
  --presentation-file artifacts/test_paper/presentation.json \
  --output-dir artifacts/test_paper/audio \
  --paper-name test_paper

# 5) Stitch to MP4
python stitch_video.py --paper-name test_paper
```

Tips:
- If you change the deck, regenerate audio to avoid stale WAVs.
- When a cover is added in the generator, always pass `--no-cover` to the renderer.
- Generator (stable): `mistralai/mistral-small-24b-instruct-2501`
  - Consistently returns valid JSON with `response_format={"type": "json_object"}`
  - Leave `--max-tokens` unset for uncapped output; set it only if you need to constrain cost
- Generator (budget/backup): `meta-llama/llama-3.2-3b-instruct`
  - Lower cost; acceptable for drafts, though JSON reliability can vary
- Cover Extractor: Same as the generator by default; override with `--cover-model` or `OPENROUTER_COVER_MODEL`
- Avoid `:free` suffixes unless verified available for your account