philo-benchmark

philo-benchmark is an evaluation pipeline for measuring physics reasoning failures in video generation models. It targets the plausibility of gravitational and collision physics: it generates hundreds of prompt variants from a small set of seed scenarios, renders videos through multiple model APIs, and scores the results with a hybrid pipeline that combines optical-flow analysis with vision-language model (VLM) assessment. The benchmark produces a single TSV file of per-task, per-model scores across six rubric dimensions: trajectory fit, energy conservation, object permanence, physical consistency, causal reasoning, and surface realism.

The pipeline is designed to be fully automated and CLI-driven. Seed tasks are expanded through parameterization (varying objects, heights, materials) and perturbation (adding visual distractors), then optionally augmented with compositional chained-event variants and adversarial prompt rewrites. Rule-based scoring uses OpenCV optical flow with camera-motion subtraction and scipy curve fitting; VLM scoring uses Gemini 2.0 Flash with contrastive chain-of-thought prompting. A self-validation suite (validate_scorer.py) verifies the scoring pipeline against synthetic videos before any real API calls are made.
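
The parameterization and perturbation steps described above can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the SeedTask fields, the parameter lists, the distractor text, and the shape of the variant dictionaries are hypothetical and may not match seeds.py or expand.py.

```python
from dataclasses import dataclass
from itertools import islice, product


@dataclass(frozen=True)
class SeedTask:
    # Hypothetical minimal seed; the real dataclass in seeds.py may differ.
    seed_id: str
    template: str


# Illustrative parameter values, not the benchmark's actual lists.
OBJECTS = ["rubber ball", "steel cube"]
HEIGHTS_M = [1, 3]
MATERIALS = ["concrete", "grass"]
# An empty string means "no distractor"; the other entry is a perturbation.
DISTRACTORS = ["", " A flag waves in the background."]


def parameterize(seed, max_variants=500):
    """Build prompt variants from every parameter combination, capped at max_variants."""
    combos = product(OBJECTS, HEIGHTS_M, MATERIALS, DISTRACTORS)
    variants = []
    for i, (obj, height, material, distractor) in enumerate(islice(combos, max_variants)):
        prompt = seed.template.format(obj=obj, height=height, material=material)
        variants.append({
            "task_id": f"{seed.seed_id}_{i:03d}",
            "seed_id": seed.seed_id,
            "prompt": prompt + distractor,
            "perturbed": bool(distractor),
        })
    return variants


seed = SeedTask("gravity_freefall", "A {obj} is dropped from {height} m onto {material}.")
variants = parameterize(seed, max_variants=10)
print(len(variants), variants[0]["prompt"])
```

With two values per parameter the full product has 16 combinations, so the cap of 10 is what limits the output here, in the same way --max-variants limits a real run.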

Project Structure

philo-benchmark/
├── seeds.py                # 6 seed tasks as frozen dataclasses
├── expand.py               # parameterize, perturb, compose, calibrate
├── generate.py             # Vertex AI Veo 2/3 + fal.ai Kling 1.6
├── score.py                # hybrid optical-flow + Gemini VLM scorer
├── additional_scorers.py   # contact timing, direction, temporal scorers
├── export.py               # writes tasks_and_rubrics.tsv
├── run_pipeline.py         # single CLI entry-point (expand→generate→score→export)
├── adversarial.py          # adversarial prompt rewrites (3 types)
├── validate_scorer.py      # synthetic-video self-tests for the scorer
├── analyze_results.py      # post-hoc analysis and summary tables
├── difficulty_predictor.py # LogisticRegression difficulty predictor
├── score_drift.py          # reproducibility checker (scorer drift detection)
├── elo_rating.py           # Physics Elo ratings across dimensions and seeds
├── vlm_top5.py             # standalone Gemini VLM scorer for existing videos
├── physics_simulator.py    # pymunk ground-truth reference trajectory generator
├── merge_tsvs.py           # merge multiple scored TSV files
├── report/main.tex         # LaTeX report comparing models
├── requirements.txt
├── .env                    # credentials (not committed)
└── README.md

Installation

python3 -m venv .venv
source .venv/bin/activate
uv pip install -r requirements.txt

Create a .env file with your credentials:

GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_CLOUD_LOCATION=us-central1
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
FAL_KEY=your-fal-api-key
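
Before a real run it can help to confirm that these four variables are actually set. The helper below is a hypothetical sketch using only the standard library; it is not part of the repository, and it does not load the .env file itself (the pipeline is described as doing that with python-dotenv).

```python
import os

# Variable names taken from the .env example above.
REQUIRED_VARS = [
    "GOOGLE_CLOUD_PROJECT",
    "GOOGLE_CLOUD_LOCATION",
    "GOOGLE_APPLICATION_CREDENTIALS",
    "FAL_KEY",
]


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing credentials:", ", ".join(missing))
    else:
        print("All credentials are set.")
```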

Usage

Dry run (no API calls)

python3 run_pipeline.py --dry-run --max-variants 10

Real benchmark

python3 run_pipeline.py --max-variants 5 --model kling,veo3

With compositional and adversarial variants

python3 run_pipeline.py --max-variants 5 --model kling,veo3 \
    --compositional --adversarial --output full_benchmark.tsv

Validate the scorer (no API calls needed)

python3 validate_scorer.py

Analyze results

python3 analyze_results.py                    # reads tasks_and_rubrics.tsv
python3 analyze_results.py test_run.tsv       # or specify a different TSV

Reproducibility check (scorer drift detection)

python3 score_drift.py --record-all          # save baselines for all videos
python3 score_drift.py --check-all           # re-score and compare against baselines
python3 score_drift.py --record outputs/video.mp4 seed_id   # single video
python3 score_drift.py --check  outputs/video.mp4 seed_id   # single check

The scorer is deterministic: re-running it on the same video produces identical scores (delta = 0.000000 across all dimensions on all 8 test videos). score_drift.py verifies this by saving score baselines to score_baseline.json and flagging any deviation beyond a configurable tolerance (default: 0.01). A scorer change that alters output on existing videos is therefore caught the next time the check is run.
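
The record-then-check idea can be sketched in a few lines. This is an illustration under assumptions, not the code in score_drift.py: the JSON layout (video id mapped to a dictionary of dimension scores) and the function names are hypothetical.

```python
import json
import tempfile
from pathlib import Path

TOLERANCE = 0.01  # default tolerance quoted above


def record_baseline(path, video_id, scores):
    """Store the scores for one video in the baseline file (hypothetical layout)."""
    data = json.loads(path.read_text()) if path.exists() else {}
    data[video_id] = scores
    path.write_text(json.dumps(data, indent=2, sort_keys=True))


def check_drift(path, video_id, scores, tolerance=TOLERANCE):
    """Return {dimension: delta} for every dimension that moved more than tolerance."""
    baseline = json.loads(path.read_text())[video_id]
    deltas = {dim: scores[dim] - baseline[dim] for dim in baseline}
    return {dim: d for dim, d in deltas.items() if abs(d) > tolerance}


with tempfile.TemporaryDirectory() as tmp:
    baseline_path = Path(tmp) / "score_baseline.json"
    original = {"trajectory_score": 0.42, "energy_score": 0.46}
    record_baseline(baseline_path, "video.mp4", original)

    # Identical scores: nothing is reported.
    same = check_drift(baseline_path, "video.mp4", original)
    # One dimension moved by 0.04, which exceeds the tolerance.
    changed = {"trajectory_score": 0.42, "energy_score": 0.50}
    drifted = check_drift(baseline_path, "video.mp4", changed)

print(same, sorted(drifted))
```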

CLI flags

| Flag | Description |
|------|-------------|
| `--dry-run` | Run without calling any APIs |
| `--model kling,veo3` | Comma-separated model list (`veo2`, `veo3`, `kling`, `all`) |
| `--seed-id gravity_freefall` | Restrict to a single seed |
| `--max-variants 500` | Max variants to expand |
| `--skip-vlm` | Skip Gemini VLM scoring (rule-based only) |
| `--compositional` | Include chained-event compositional variants |
| `--adversarial` | Generate adversarial prompt rewrites |
| `--force-regenerate` | Re-generate videos even if cached |
| `--output FILE` | Output TSV filename |
| `--budget 50` | Stop generating if estimated cost exceeds $50 |
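
For readers who want to see how these flags fit together, here is a minimal argparse sketch that mirrors the table. It is an assumption about how such a parser could look, not the parser in run_pipeline.py; in particular the defaults and the resolve_models helper are hypothetical.

```python
import argparse

KNOWN_MODELS = ["veo2", "veo3", "kling"]


def build_parser():
    """Build a parser with the flags listed in the table (defaults are assumptions)."""
    parser = argparse.ArgumentParser(prog="run_pipeline.py")
    parser.add_argument("--dry-run", action="store_true", help="run without calling any APIs")
    parser.add_argument("--model", default="all", help="comma-separated model list")
    parser.add_argument("--seed-id", default=None, help="restrict to a single seed")
    parser.add_argument("--max-variants", type=int, default=500, help="max variants to expand")
    parser.add_argument("--skip-vlm", action="store_true", help="rule-based scoring only")
    parser.add_argument("--compositional", action="store_true")
    parser.add_argument("--adversarial", action="store_true")
    parser.add_argument("--force-regenerate", action="store_true")
    parser.add_argument("--output", default="tasks_and_rubrics.tsv", help="output TSV filename")
    parser.add_argument("--budget", type=float, default=None, help="cost ceiling in dollars")
    return parser


def resolve_models(spec):
    """Turn 'kling,veo3' or 'all' into a list of model names."""
    names = [m.strip() for m in spec.split(",") if m.strip()]
    if "all" in names:
        return list(KNOWN_MODELS)
    unknown = [m for m in names if m not in KNOWN_MODELS]
    if unknown:
        raise ValueError(f"unknown model(s): {', '.join(unknown)}")
    return names


args = build_parser().parse_args(["--max-variants", "5", "--model", "kling,veo3", "--dry-run"])
print(args.max_variants, resolve_models(args.model), args.dry_run)
```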

Scoring Dimensions

| Dimension | Method | Description |
|-----------|--------|-------------|
| `trajectory_score` | Rule-based | Quadratic/parabolic fit R² with acceleration verification |
| `energy_score` | Rule-based | Pearson correlation between height loss and velocity² |
| `permanence_score` | Rule-based | Lucas-Kanade sparse tracking continuity |
| `consistency_score` | VLM | Physical consistency via contrastive CoT |
| `causal_score` | VLM | Causal chain plausibility |
| `surface_score` | VLM | Surface/material realism |
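
The two curve-based quantities in the table can be illustrated on synthetic data. The sketch below uses only the standard library, whereas the README states that score.py uses scipy curve fitting on tracked positions; function names and units are hypothetical. It fits y = a·t² + b·t + c by least squares and reports R², then computes the Pearson correlation between height loss and squared velocity.

```python
import math


def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            factor = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= factor * M[col][c]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        x[i] = (M[i][3] - sum(M[i][j] * x[j] for j in range(i + 1, 3))) / M[i][i]
    return x


def quadratic_fit(t, y):
    """Least-squares fit of y = a*t^2 + b*t + c; returns (a, R^2)."""
    s = [sum(ti ** p for ti in t) for p in range(5)]               # s[p] = sum of t^p
    sy = [sum(ti ** p * yi for ti, yi in zip(t, y)) for p in range(3)]
    normal = [[s[4], s[3], s[2]], [s[3], s[2], s[1]], [s[2], s[1], s[0]]]
    a, b, c = solve3(normal, [sy[2], sy[1], sy[0]])
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (a * ti * ti + b * ti + c)) ** 2 for ti, yi in zip(t, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return a, (1.0 - ss_res / ss_tot if ss_tot else 0.0)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy) if sx and sy else 0.0


G = 9.81
t = [i / 24 for i in range(24)]                 # one second at 24 fps
fall = [0.5 * G * ti ** 2 for ti in t]          # distance fallen under gravity
glide = [2.0 * ti for ti in t]                  # constant velocity, no acceleration
v_squared = [(G * ti) ** 2 for ti in t]         # squared speed during the fall

print(quadratic_fit(t, fall))      # a close to G/2, R^2 close to 1
print(quadratic_fit(t, glide))     # a close to 0, R^2 still close to 1
print(pearson(fall, v_squared))    # close to 1, since v^2 = 2*G*height_loss
```

The constant-velocity case shows why R² alone is not enough: a straight line is fitted perfectly by a quadratic with a ≈ 0, so checking the fitted acceleration is what separates real free fall from uniform drift.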

Three additional rule-based dimensions are computed by additional_scorers.py and included in the TSV but not in total_score (available for extended analysis):

| Dimension | Description |
|-----------|-------------|
| `contact_timing_score` | Whether contact/collision occurs at the expected time in the video |
| `direction_score` | Whether dominant motion direction matches physics (e.g., downward for freefall) |
| `temporal_score` | Whether the acceleration profile is monotonic (not reversed) |
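
One possible reading of the direction and temporal checks, for a free-fall clip, is sketched below. This is an interpretation, not the logic in additional_scorers.py: it assumes a list of tracked vertical positions in image coordinates (y grows downward) and treats "monotonic" as "speed never decreases between frames".

```python
def direction_score(ys):
    """Fraction of frame-to-frame steps that move downward (y increasing)."""
    steps = [b - a for a, b in zip(ys, ys[1:])]
    return sum(1 for d in steps if d > 0) / len(steps) if steps else 0.0


def temporal_score(ys):
    """Fraction of consecutive frame pairs where the speed does not decrease."""
    speeds = [abs(b - a) for a, b in zip(ys, ys[1:])]
    changes = [b - a for a, b in zip(speeds, speeds[1:])]
    return sum(1 for d in changes if d >= 0) / len(changes) if changes else 0.0


# Synthetic free fall sampled at 24 fps, and the same clip played backwards.
falling = [0.5 * 9.81 * (i / 24) ** 2 for i in range(24)]
played_backwards = falling[::-1]

print(direction_score(falling), temporal_score(falling))
print(direction_score(played_backwards), temporal_score(played_backwards))
```

A time-reversed clip moves upward and decelerates, so both checks drop to zero for it while the forward clip passes both.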

Results Summary (32 videos · Kling 1.6 vs Veo 3 · 6 seeds)

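| Model | N | Trajectory | Energy | Consistency | Causal | Surface | Permanence | Total |
|-------|---|------------|--------|-------------|--------|---------|------------|-------|
| Kling | 16 | 0.420±0.330 | 0.462±0.261 | 0.988±0.034 | 0.537±0.449 | 0.800±0.398 | 0.942±0.046 | 0.692±0.128 |
| Veo 3 | 16 | 0.533±0.274 | 0.345±0.324 | 0.994±0.025 | 0.456±0.452 | 0.731±0.442 | 0.849±0.206 | 0.651±0.182 |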

Key finding: Kling leads overall (0.692 vs 0.651) despite trailing on trajectory, although the gap is smaller than the spread reported for either model (±0.128 and ±0.182). The mean intrinsic/superficial faithfulness gap is +0.484 for Kling and +0.417 for Veo 3, meaning both models appear more physically plausible than they are. Causal score is the weakest VLM dimension for both models (Kling 0.537, Veo 3 0.456), as it requires the correct physical effect, not just a visually convincing one.
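
As a sanity check on the table, the reported totals are consistent with an unweighted mean of the six per-dimension means. The weights actually used to compute the total are not stated in this README, so equal weighting is an inference from the published numbers rather than a documented fact.

```python
from statistics import fmean

# Per-dimension means copied from the results table, in column order:
# trajectory, energy, consistency, causal, surface, permanence.
MEANS = {
    "Kling": [0.420, 0.462, 0.988, 0.537, 0.800, 0.942],
    "Veo 3": [0.533, 0.345, 0.994, 0.456, 0.731, 0.849],
}
REPORTED_TOTAL = {"Kling": 0.692, "Veo 3": 0.651}

# Equal-weight total per model (an assumption about the weighting).
computed = {model: fmean(values) for model, values in MEANS.items()}

for model, total in computed.items():
    print(f"{model}: mean of six = {total:.4f}, reported = {REPORTED_TOTAL[model]:.3f}")
```

The computed values are about 0.6915 and 0.6513, which match the reported 0.692 and 0.651 to within rounding.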

Total API spend for this benchmark run: ~$6.10 (within the $200 reimbursement limit).

Prompts Used with Claude Code

Below is the complete list of prompts used to build this project, in order:

Project setup & seeds.py plan: "I'm building a video generation reasoning benchmark for a research take-home. Here's the full context: GOAL: Build an evaluation pipeline for physics reasoning failures in video generation models. Specifically: gravitational & collision physics plausibility. [...] Start by creating the project folder structure and seeds.py with all 5 seed tasks fully defined as Python dataclasses. Do not start coding yet — first show me the plan for seeds.py and confirm the structure looks right."
Implement seeds.py: "This looks good. Go ahead and create the project folder structure and implement seeds.py in full. After creating the file, run: python seeds.py to verify it imports cleanly with no errors."
Implement expand.py: "seeds.py looks good. Now implement expand.py only. Import the SeedTask dataclasses from seeds.py. It should have: parameterize(seed), perturb(variant), expand_all(seeds, max_variants=500). After writing it, run a quick test: expand seed_1 and print how many variants were generated."
Implement generate.py: "Now implement generate.py with two functions: generate_veo2(task, project_id, location) using Vertex AI VideoGenerationModel (veo-002), generate_kling(task, fal_api_key) using fal.ai. Both should save the video to outputs/{task_id}_{model}.mp4 and return the file path. Add a --dry-run flag. Test with --dry-run first."
Implement score.py: "Now implement score.py with: rule_based_scorer(video_path, seed) using OpenCV optical flow + scipy curve fitting, vlm_scorer(video_path, seed, project_id) using Gemini 2.0 Flash on Vertex AI with contrastive chain-of-thought prompting, final_scorer() combining both. All scores must be float 0.0-1.0."
Credentials setup: "Before we run anything, I need to set up credentials. What environment variables does this project need? List them all with descriptions so I can set them up."
Wire pipeline & dry run: "Credentials are fully set up now. Please: Make sure python-dotenv is in requirements.txt and .env is loaded in run_pipeline.py. Add .env to .gitignore. Run: python run_pipeline.py --dry-run."
Fix export.py & verify pipeline: "Dependencies are now installed via uv. We skipped two steps. Please: Create export.py that writes tasks_and_rubrics.tsv with columns [...]. Verify run_pipeline.py wires all 4 stages together. Then run: python3 run_pipeline.py --dry-run."
Real test with Kling: "Now run a real test with 2 videos on Kling only: python3 run_pipeline.py --seed-id incline_roll --max-variants 2 --model kling --output test_run.tsv."
Add Veo 3 & full benchmark: "Kling is working and generated 2 videos. Now: Wait for the current run to finish and confirm test_run.tsv is populated with scores. Then add Veo 3 as the second model alongside Kling. In generate.py add generate_veo3() using google-genai SDK with model 'veo-003'. Run the full benchmark with both models. Show me tasks_and_rubrics.tsv contents when done. Then write report/main.tex comparing Kling vs Veo 3 results across all 5 rubric dimensions."
9 code fixes + report update: "Fix broken deduplication in expand.py. Add camera motion subtraction to score.py. Replace dense flow centroid with Lucas-Kanade sparse tracking. Add energy_conservation_score() as a 6th scoring dimension. Verify score.py dispatches to seed-specific physics models. Add ScoreResult dataclass with weighted total property. Add --force-regenerate flag. Add calibrate_difficulty() function to expand.py. Update report/main.tex to include a VBVR-EvalKit alignment section."
Compositional expansion: "Add a new expansion method to expand.py called compose(seed_a, seed_b). It should take two seed tasks and chain their prompts with a causal connector. The composed rubric should require BOTH physics models to pass: composed_score = min(score_a, score_b). Add 10 composed variants using specific pairs. Tag them with 'category': 'compositional'. Add a --compositional flag to run_pipeline.py."
Adversarial expansion: "Create a new file adversarial.py with a function adversarialize(variant: dict) -> list[dict]. It should return 3 adversarial versions of any variant: UNDERSPECIFIED, CONTRADICTORY, ANTHROPOMORPHIZED. Each adversarial variant should inherit all scoring rubric from the original and be flagged in the TSV with a new column 'adversarial_type'. Add --adversarial flag to run_pipeline.py."
Scorer self-tests: "Create validate_scorer.py that self-tests the scoring pipeline using programmatically generated synthetic videos. Generate 4 synthetic test cases with OpenCV VideoWriter: PERFECT_FREEFALL, CONSTANT_VELOCITY, PERFECT_PERMANENCE, TELEPORTING_OBJECT. For each test case: generate a 3-second synthetic .mp4 at 24fps, run rule_based_scorer() on it, assert the expected score range, print PASS/FAIL."
Analysis & README: "Create analyze_results.py that reads tasks_and_rubrics.tsv and prints: MODEL COMPARISON TABLE, SEED DIFFICULTY RANKING, DISTRACTOR SENSITIVITY ANALYSIS, WINNER PER SEED, CALIBRATION REPORT. Also write a summary to results_summary.txt. Then update README.md to include project overview, installation, usage, results summary table, and list of ALL prompts used."
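
The compositional rule quoted in the prompts above (composed_score = min(score_a, score_b)) can be sketched as follows. The dictionary fields, the connector wording, and the example prompts are hypothetical; only the min rule and the 'compositional' category tag come from the prompt text.

```python
def compose(seed_a, seed_b, connector="and then"):
    """Chain two seed prompts with a causal connector (field names are assumptions)."""
    first = seed_a["prompt"].rstrip(".")
    second = seed_b["prompt"][0].lower() + seed_b["prompt"][1:]
    return {
        "task_id": f"{seed_a['seed_id']}+{seed_b['seed_id']}",
        "prompt": f"{first}, {connector} {second}",
        "category": "compositional",
        "parents": (seed_a["seed_id"], seed_b["seed_id"]),
    }


def composed_score(score_a, score_b):
    """Both physics models must pass, so the weaker score decides."""
    return min(score_a, score_b)


roll = {"seed_id": "incline_roll", "prompt": "A ball rolls down a ramp."}
drop = {"seed_id": "gravity_freefall", "prompt": "The ball falls off the table edge."}

variant = compose(roll, drop)
print(variant["prompt"])
print(composed_score(0.9, 0.4))
```

Taking the minimum means a video cannot compensate for a failed second event with a well-rendered first one, which is the stated intent of requiring both physics models to pass.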
