philo-benchmark is an evaluation pipeline for measuring physics reasoning failures in video generation models. It targets gravitational and collision physics plausibility: it generates hundreds of prompt variants from a small set of seed scenarios, renders videos through multiple model APIs, and scores the results with a hybrid pipeline that combines optical-flow analysis with vision-language model (VLM) assessment. The benchmark produces a single TSV file of per-task, per-model scores across six rubric dimensions: trajectory fit, energy conservation, object permanence, physical consistency, causal reasoning, and surface realism.
The pipeline is designed to be fully automated and CLI-driven. Seed tasks are expanded through parameterization (varying objects, heights, materials) and perturbation (adding visual distractors), then optionally augmented with compositional chained-event variants and adversarial prompt rewrites. Rule-based scoring uses OpenCV optical flow with camera-motion subtraction and scipy curve fitting; VLM scoring uses Gemini 2.0 Flash with contrastive chain-of-thought prompting. A self-validation suite (validate_scorer.py) verifies the scoring pipeline against synthetic videos before any real API calls are made.
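The parameterize-then-perturb expansion described above can be pictured as a Cartesian product over parameter values, followed by an optional distractor sentence. The sketch below is only an illustration under assumptions: the parameter lists, the template, and the simplified string-based signatures are hypothetical and are not taken from expand.py.

```python
from itertools import product

# Hypothetical parameter values; the real seeds define their own.
OBJECTS = ["steel ball", "rubber ball"]
HEIGHTS_M = [1, 5]
MATERIALS = ["concrete", "sand"]
DISTRACTORS = ["", " A flock of birds flies past in the background."]

# Hypothetical prompt template for a freefall-style seed.
TEMPLATE = "A {obj} is dropped from {height} m onto {material}."


def parameterize(template):
    """Yield one prompt per combination of object, height and material."""
    for obj, height, material in product(OBJECTS, HEIGHTS_M, MATERIALS):
        yield template.format(obj=obj, height=height, material=material)


def perturb(prompt):
    """Yield the prompt once per distractor (the empty one leaves it unchanged)."""
    for distractor in DISTRACTORS:
        yield prompt + distractor


def expand_all(template, max_variants=500):
    """Parameterize, perturb, drop duplicates, and cap at max_variants."""
    seen, variants = set(), []
    for base in parameterize(template):
        for prompt in perturb(base):
            if prompt not in seen:
                seen.add(prompt)
                variants.append(prompt)
            if len(variants) >= max_variants:
                return variants
    return variants


variants = expand_all(TEMPLATE)
print(len(variants))  # 2 objects * 2 heights * 2 materials * 2 distractors = 16
```

Even with two values per parameter, one seed already yields 16 variants, which is why a small seed set can grow to hundreds of prompts and why a cap such as --max-variants is useful.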
philo-benchmark/
├── seeds.py # 6 seed tasks as frozen dataclasses
├── expand.py # parameterize, perturb, compose, calibrate
├── generate.py # Vertex AI Veo 2/3 + fal.ai Kling 1.6
├── score.py # hybrid optical-flow + Gemini VLM scorer
├── additional_scorers.py # contact timing, direction, temporal scorers
├── export.py # writes tasks_and_rubrics.tsv
├── run_pipeline.py # single CLI entry-point (expand→generate→score→export)
├── adversarial.py # adversarial prompt rewrites (3 types)
├── validate_scorer.py # synthetic-video self-tests for the scorer
├── analyze_results.py # post-hoc analysis and summary tables
├── difficulty_predictor.py # LogisticRegression difficulty predictor
├── score_drift.py # reproducibility checker (scorer drift detection)
├── elo_rating.py # Physics Elo ratings across dimensions and seeds
├── vlm_top5.py # standalone Gemini VLM scorer for existing videos
├── physics_simulator.py # pymunk ground-truth reference trajectory generator
├── merge_tsvs.py # merge multiple scored TSV files
├── report/main.tex # LaTeX report comparing models
├── requirements.txt
├── .env # credentials (not committed)
└── README.md
python3 -m venv .venv
source .venv/bin/activate
uv pip install -r requirements.txt

Create a .env file with your credentials:
GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_CLOUD_LOCATION=us-central1
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
FAL_KEY=your-fal-api-key
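
Before spending API budget, it can help to confirm that these four variables are actually set. The following is a minimal sketch using only the standard library; the helper name missing_credentials is hypothetical, and it assumes the variables have already been loaded into the environment (for example by python-dotenv, which the project lists as a dependency).

```python
import os

# Variable names taken from the .env example above.
REQUIRED_VARS = (
    "GOOGLE_CLOUD_PROJECT",
    "GOOGLE_CLOUD_LOCATION",
    "GOOGLE_APPLICATION_CREDENTIALS",
    "FAL_KEY",
)


def missing_credentials(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {
    "GOOGLE_CLOUD_PROJECT": "your-project-id",
    "GOOGLE_CLOUD_LOCATION": "us-central1",
    "FAL_KEY": "",
}
print(missing_credentials(example_env))
```

Passing a plain dictionary keeps the check testable without touching real credentials; calling missing_credentials() with no argument inspects the live environment.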
python3 run_pipeline.py --dry-run --max-variants 10

python3 run_pipeline.py --max-variants 5 --model kling,veo3

python3 run_pipeline.py --max-variants 5 --model kling,veo3 \
--compositional --adversarial --output full_benchmark.tsv

python3 validate_scorer.py

python3 analyze_results.py # reads tasks_and_rubrics.tsv
python3 analyze_results.py test_run.tsv # or specify a different TSV

python3 score_drift.py --record-all # save baselines for all videos
python3 score_drift.py --check-all # re-score and compare against baselines
python3 score_drift.py --record outputs/video.mp4 seed_id # single video
python3 score_drift.py --check outputs/video.mp4 seed_id # single check

The scorer is fully deterministic: re-running on the same video produces identical scores (delta = 0.000000 across all dimensions and all 8 test videos). This is verified by score_drift.py, which saves score baselines to score_baseline.json and detects any deviation beyond a configurable tolerance (default: 0.01). Any code change to the scorer that alters output on existing videos will be caught immediately.
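
The baseline comparison behind the drift check can be sketched as follows. This is an illustration only: the JSON layout (video path mapped to per-dimension scores) and the function name find_drift are assumptions, not the actual format of score_baseline.json or the code in score_drift.py. Only the 0.01 default tolerance comes from the description above.

```python
import json

# Hypothetical baseline layout: {video_path: {dimension: score}}.
baseline_json = json.dumps({
    "outputs/video.mp4": {"trajectory_score": 0.42, "energy_score": 0.46},
})


def find_drift(baseline, current, tolerance=0.01):
    """Return {dimension: delta} for scores that moved by more than tolerance."""
    drift = {}
    for dimension, old in baseline.items():
        delta = abs(current[dimension] - old)
        if delta > tolerance:
            drift[dimension] = delta
    return drift


baseline = json.loads(baseline_json)["outputs/video.mp4"]

# Re-scoring with identical results reports no drift.
unchanged = find_drift(baseline, dict(baseline))

# A scorer change that shifts one dimension by 0.05 is reported.
changed = find_drift(baseline, {"trajectory_score": 0.47, "energy_score": 0.46})

print(unchanged)
print(sorted(changed))
```

Comparing with a tolerance rather than strict equality leaves room for harmless floating-point differences across machines while still flagging real changes in scorer behavior.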
| Flag | Description |
|---|---|
| --dry-run | Run without calling any APIs |
| --model kling,veo3 | Comma-separated model list (veo2, veo3, kling, all) |
| --seed-id gravity_freefall | Restrict to a single seed |
| --max-variants 500 | Max variants to expand |
| --skip-vlm | Skip Gemini VLM scoring (rule-based only) |
| --compositional | Include chained-event compositional variants |
| --adversarial | Generate adversarial prompt rewrites |
| --force-regenerate | Re-generate videos even if cached |
| --output FILE | Output TSV filename |
| --budget 50 | Stop generating if estimated cost exceeds $50 |
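
As a rough picture of how a few of these flags could be parsed, here is a small argparse sketch. The flag names and the accepted model names come from the table above; the defaults, the helper parse_models, and the overall structure are assumptions for illustration and may differ from run_pipeline.py.

```python
import argparse

# Model names listed in the flag table; "all" expands to every one of them.
KNOWN_MODELS = ("veo2", "veo3", "kling")


def parse_models(value):
    """Turn 'kling,veo3' into ['kling', 'veo3'] and expand 'all'."""
    names = [name.strip() for name in value.split(",") if name.strip()]
    if "all" in names:
        return list(KNOWN_MODELS)
    unknown = [name for name in names if name not in KNOWN_MODELS]
    if unknown:
        raise argparse.ArgumentTypeError(f"unknown model(s): {', '.join(unknown)}")
    return names


def build_parser():
    """Build a parser for a subset of the flags (defaults are assumptions)."""
    parser = argparse.ArgumentParser(prog="run_pipeline.py")
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--model", type=parse_models, default=list(KNOWN_MODELS))
    parser.add_argument("--max-variants", type=int, default=500)
    parser.add_argument("--budget", type=float, default=None)
    return parser


args = build_parser().parse_args(
    ["--dry-run", "--model", "kling,veo3", "--max-variants", "10"]
)
print(args.dry_run, args.model, args.max_variants, args.budget)
```

Validating model names at parse time means a typo such as `--model klng` fails before any variants are expanded or any API is called.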
| Dimension | Method | Description |
|---|---|---|
| trajectory_score | Rule-based | Quadratic/parabolic fit R² with acceleration verification |
| energy_score | Rule-based | Pearson correlation between height loss and velocity² |
| permanence_score | Rule-based | Lucas-Kanade sparse tracking continuity |
| consistency_score | VLM | Physical consistency via contrastive CoT |
| causal_score | VLM | Causal chain plausibility |
| surface_score | VLM | Surface/material realism |
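
To make the two curve-based rule checks concrete, the sketch below fits a quadratic to a tracked trajectory, reports R², and computes the Pearson correlation between height loss and velocity². It is a simplified illustration under assumptions: it uses ideal analytic trajectories in metres rather than tracked pixel positions, a hand-written least-squares solve instead of scipy, and hypothetical function names. It also shows why R² alone is not enough: constant-velocity motion fits a quadratic perfectly too, so the fitted acceleration term has to be checked as well.

```python
G = 9.81  # gravitational acceleration in m/s^2


def fit_quadratic(ts, ys):
    """Least-squares fit of y = a*t^2 + b*t + c via the 3x3 normal equations."""
    s = [sum(t ** k for t in ts) for k in range(5)]
    r = [sum((t ** k) * y for t, y in zip(ts, ys)) for k in range(3)]
    # Augmented matrix for the unknowns (c, b, a).
    m = [
        [s[0], s[1], s[2], r[0]],
        [s[1], s[2], s[3], r[1]],
        [s[2], s[3], s[4], r[2]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [x - factor * p for x, p in zip(m[row], m[col])]
    c, b, a = (m[i][3] / m[i][i] for i in range(3))
    return a, b, c


def r_squared(ts, ys, coeffs):
    """Coefficient of determination of the quadratic fit."""
    a, b, c = coeffs
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a * t * t + b * t + c)) ** 2 for t, y in zip(ts, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


ts = [i / 24 for i in range(24)]       # one second sampled at 24 fps
fall = [0.5 * G * t * t for t in ts]   # height lost in ideal freefall from rest
glide = [2.0 * t for t in ts]          # constant-velocity motion at 2 m/s

fall_fit = fit_quadratic(ts, fall)
glide_fit = fit_quadratic(ts, glide)
fall_r2 = r_squared(ts, fall, fall_fit)
glide_r2 = r_squared(ts, glide, glide_fit)

# Energy check: for freefall from rest, v^2 = 2*g*(height loss), so the two
# series are linearly related. Velocities here are analytic (v = g*t).
energy_r = pearson(fall, [(G * t) ** 2 for t in ts])

print(fall_fit[0], fall_r2)    # acceleration term near g/2, R^2 near 1
print(glide_fit[0], glide_r2)  # acceleration term near 0, R^2 still near 1
print(energy_r)
```

In a real scorer the positions and velocities come from noisy tracking, so both the fitted acceleration and the correlation would fall below these ideal values.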
Three additional rule-based dimensions are computed by additional_scorers.py and included in the TSV but not in total_score (available for extended analysis):
| Dimension | Description |
|---|---|
| contact_timing_score | Whether contact/collision occurs at the expected time in the video |
| direction_score | Whether dominant motion direction matches physics (e.g., downward for freefall) |
| temporal_score | Whether the acceleration profile is monotonic (not reversed) |
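
One possible reading of the direction and temporal checks is sketched below for a freefall clip. The function bodies are assumptions written for illustration and are not the code in additional_scorers.py. They take a list of tracked vertical positions in image coordinates, where y grows downward.

```python
def direction_score(ys):
    """Fraction of non-zero frame-to-frame steps that point downward."""
    steps = [b - a for a, b in zip(ys, ys[1:])]
    moving = [step for step in steps if step != 0]
    if not moving:
        return 0.0
    return sum(1 for step in moving if step > 0) / len(moving)


def temporal_score(ys):
    """Fraction of consecutive step pairs whose speed does not decrease."""
    speeds = [abs(b - a) for a, b in zip(ys, ys[1:])]
    pairs = list(zip(speeds, speeds[1:]))
    if not pairs:
        return 0.0
    return sum(1 for s0, s1 in pairs if s1 >= s0) / len(pairs)


# Synthetic freefall: the pixel row grows quadratically with the frame index.
falling = [i * i for i in range(10)]
# The same clip played backwards: the object rises and slows down.
reversed_clip = falling[::-1]

print(direction_score(falling), temporal_score(falling))
print(direction_score(reversed_clip), temporal_score(reversed_clip))
```

A time-reversed freefall still traces a clean parabola, so a pure curve fit would score it well. Checks like these are what separate it from the forward clip.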
| Model | N | Trajectory | Energy | Consistency | Causal | Surface | Permanence | Total |
|---|---|---|---|---|---|---|---|---|
| Kling | 16 | 0.420±0.330 | 0.462±0.261 | 0.988±0.034 | 0.537±0.449 | 0.800±0.398 | 0.942±0.046 | 0.692±0.128 |
| Veo 3 | 16 | 0.533±0.274 | 0.345±0.324 | 0.994±0.025 | 0.456±0.452 | 0.731±0.442 | 0.849±0.206 | 0.651±0.182 |
Key finding: Kling leads overall (0.692 vs 0.651) despite trailing on trajectory. Mean intrinsic/superficial faithfulness gap: Kling +0.484, Veo 3 +0.417. In other words, both models appear more physically plausible than they are. Causal score is the weakest VLM dimension for both models (Kling 0.537, Veo 3 0.456), as it requires the correct physical effect, not just a visually convincing one.
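
As a quick arithmetic check, the Total column in the results table is consistent with an unweighted mean of the six dimension means. This is an inference from the published numbers, not a statement about how score.py weights dimensions internally. The snippet below recomputes it from the rounded table values.

```python
from statistics import mean

# Mean scores copied from the results table above.
REPORTED = {
    "Kling": {
        "trajectory": 0.420, "energy": 0.462, "consistency": 0.988,
        "causal": 0.537, "surface": 0.800, "permanence": 0.942,
        "total": 0.692,
    },
    "Veo 3": {
        "trajectory": 0.533, "energy": 0.345, "consistency": 0.994,
        "causal": 0.456, "surface": 0.731, "permanence": 0.849,
        "total": 0.651,
    },
}

recomputed = {}
for model, row in REPORTED.items():
    dims = [value for name, value in row.items() if name != "total"]
    recomputed[model] = mean(dims)
    # The inputs are rounded to three decimals, so allow a small tolerance.
    agrees = abs(recomputed[model] - row["total"]) < 1e-3
    print(f"{model}: recomputed {recomputed[model]:.4f}, reported {row['total']}, agrees={agrees}")
```

Both recomputed values land within rounding error of the reported totals, so Kling's overall lead comes from its energy, causal, surface, and permanence means outweighing Veo 3's advantage on trajectory.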
Total API spend for this benchmark run: ~$6.10 (within the $200 reimbursement limit).
Below is the complete list of prompts used to build this project, in order:
- Project setup & seeds.py plan: "I'm building a video generation reasoning benchmark for a research take-home. Here's the full context: GOAL: Build an evaluation pipeline for physics reasoning failures in video generation models. Specifically: gravitational & collision physics plausibility. [...] Start by creating the project folder structure and seeds.py with all 5 seed tasks fully defined as Python dataclasses. Do not start coding yet — first show me the plan for seeds.py and confirm the structure looks right."
- Implement seeds.py: "This looks good. Go ahead and create the project folder structure and implement seeds.py in full. After creating the file, run: python seeds.py to verify it imports cleanly with no errors."
- Implement expand.py: "seeds.py looks good. Now implement expand.py only. Import the SeedTask dataclasses from seeds.py. It should have: parameterize(seed), perturb(variant), expand_all(seeds, max_variants=500). After writing it, run a quick test: expand seed_1 and print how many variants were generated."
- Implement generate.py: "Now implement generate.py with two functions: generate_veo2(task, project_id, location) using Vertex AI VideoGenerationModel (veo-002), generate_kling(task, fal_api_key) using fal.ai. Both should save the video to outputs/{task_id}_{model}.mp4 and return the file path. Add a --dry-run flag. Test with --dry-run first."
- Implement score.py: "Now implement score.py with: rule_based_scorer(video_path, seed) using OpenCV optical flow + scipy curve fitting, vlm_scorer(video_path, seed, project_id) using Gemini 2.0 Flash on Vertex AI with contrastive chain-of-thought prompting, final_scorer() combining both. All scores must be float 0.0-1.0."
- Credentials setup: "Before we run anything, I need to set up credentials. What environment variables does this project need? List them all with descriptions so I can set them up."
- Wire pipeline & dry run: "Credentials are fully set up now. Please: Make sure python-dotenv is in requirements.txt and .env is loaded in run_pipeline.py. Add .env to .gitignore. Run: python run_pipeline.py --dry-run."
- Fix export.py & verify pipeline: "Dependencies are now installed via uv. We skipped two steps. Please: Create export.py that writes tasks_and_rubrics.tsv with columns [...]. Verify run_pipeline.py wires all 4 stages together. Then run: python3 run_pipeline.py --dry-run."
- Real test with Kling: "Now run a real test with 2 videos on Kling only: python3 run_pipeline.py --seed-id incline_roll --max-variants 2 --model kling --output test_run.tsv."
- Add Veo 3 & full benchmark: "Kling is working and generated 2 videos. Now: Wait for the current run to finish and confirm test_run.tsv is populated with scores. Then add Veo 3 as the second model alongside Kling. In generate.py add generate_veo3() using google-genai SDK with model 'veo-003'. Run the full benchmark with both models. Show me tasks_and_rubrics.tsv contents when done. Then write report/main.tex comparing Kling vs Veo 3 results across all 5 rubric dimensions."
- 9 code fixes + report update: "Fix broken deduplication in expand.py. Add camera motion subtraction to score.py. Replace dense flow centroid with Lucas-Kanade sparse tracking. Add energy_conservation_score() as a 6th scoring dimension. Verify score.py dispatches to seed-specific physics models. Add ScoreResult dataclass with weighted total property. Add --force-regenerate flag. Add calibrate_difficulty() function to expand.py. Update report/main.tex to include a VBVR-EvalKit alignment section."
- Compositional expansion: "Add a new expansion method to expand.py called compose(seed_a, seed_b). It should take two seed tasks and chain their prompts with a causal connector. The composed rubric should require BOTH physics models to pass: composed_score = min(score_a, score_b). Add 10 composed variants using specific pairs. Tag them with 'category': 'compositional'. Add a --compositional flag to run_pipeline.py."
- Adversarial expansion: "Create a new file adversarial.py with a function adversarialize(variant: dict) -> list[dict]. It should return 3 adversarial versions of any variant: UNDERSPECIFIED, CONTRADICTORY, ANTHROPOMORPHIZED. Each adversarial variant should inherit all scoring rubric from the original and be flagged in the TSV with a new column 'adversarial_type'. Add --adversarial flag to run_pipeline.py."
- Scorer self-tests: "Create validate_scorer.py that self-tests the scoring pipeline using programmatically generated synthetic videos. Generate 4 synthetic test cases with OpenCV VideoWriter: PERFECT_FREEFALL, CONSTANT_VELOCITY, PERFECT_PERMANENCE, TELEPORTING_OBJECT. For each test case: generate a 3-second synthetic .mp4 at 24fps, run rule_based_scorer() on it, assert the expected score range, print PASS/FAIL."
- Analysis & README: "Create analyze_results.py that reads tasks_and_rubrics.tsv and prints: MODEL COMPARISON TABLE, SEED DIFFICULTY RANKING, DISTRACTOR SENSITIVITY ANALYSIS, WINNER PER SEED, CALIBRATION REPORT. Also write a summary to results_summary.txt. Then update README.md to include project overview, installation, usage, results summary table, and list of ALL prompts used."
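
The compositional rule quoted in the prompts above, composed_score = min(score_a, score_b), can be illustrated with a short sketch. The Seed dataclass, the connector wording, the example prompts, and the returned dictionary layout are hypothetical stand-ins. Only the min rule, the 'category': 'compositional' tag, and the seed ids incline_roll and gravity_freefall come from this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seed:
    """Hypothetical stand-in for the frozen seed dataclasses in seeds.py."""
    seed_id: str
    prompt: str


def compose(seed_a, seed_b, connector="so that"):
    """Chain two prompts with a causal connector and tag the result."""
    first = seed_a.prompt.rstrip(".")
    second = seed_b.prompt[0].lower() + seed_b.prompt[1:]
    return {
        "task_id": f"{seed_a.seed_id}+{seed_b.seed_id}",
        "prompt": f"{first}, {connector} {second}",
        "category": "compositional",
    }


def composed_score(score_a, score_b):
    """Both physics models must pass, so the weaker score decides."""
    return min(score_a, score_b)


roll = Seed("incline_roll", "A ball rolls down a ramp.")
fall = Seed("gravity_freefall", "It falls off the edge of a table.")

task = compose(roll, fall)
print(task["prompt"])
print(composed_score(0.9, 0.4))
```

Taking the minimum rather than the mean means a video cannot compensate for a physically wrong second event with a convincing first one.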