
nevasini1/philo-benchmark


philo-benchmark

philo-benchmark is an evaluation pipeline for measuring physics reasoning failures in video generation models. It targets gravitational and collision physics plausibility — generating hundreds of prompt variants from a small set of seed scenarios, rendering videos through multiple model APIs, and scoring the results with a hybrid pipeline that combines optical-flow analysis with vision-language model (VLM) assessment. The benchmark produces a single TSV file of per-task, per-model scores across six rubric dimensions: trajectory fit, energy conservation, object permanence, physical consistency, causal reasoning, and surface realism.

The pipeline is designed to be fully automated and CLI-driven. Seed tasks are expanded through parameterization (varying objects, heights, materials) and perturbation (adding visual distractors), then optionally augmented with compositional chained-event variants and adversarial prompt rewrites. Rule-based scoring uses OpenCV optical flow with camera-motion subtraction and scipy curve fitting; VLM scoring uses Gemini 2.0 Flash with contrastive chain-of-thought prompting. A self-validation suite (validate_scorer.py) verifies the scoring pipeline against synthetic videos before any real API calls are made.
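The parameterize-then-perturb expansion described above can be sketched as a Cartesian product over parameter grids followed by an optional distractor suffix. This is a minimal illustration only: the grids, the prompt template, the distractor text, and the simplified `expand_all` signature are assumptions, not the contents of seeds.py or expand.py.

```python
from itertools import product

# Hypothetical parameter grids and template; the real values live in
# seeds.py / expand.py and are not shown in this README.
OBJECTS = ["rubber ball", "steel ball", "wooden block"]
HEIGHTS_M = [1, 2, 5]
MATERIALS = ["concrete", "grass"]
DISTRACTORS = ["", " A flock of birds crosses the background."]
TEMPLATE = "A {obj} is dropped from {height} m onto {material}."


def parameterize(seed_id: str, template: str) -> list[dict]:
    """Expand one seed template over the full parameter grid."""
    variants = []
    for obj, height, material in product(OBJECTS, HEIGHTS_M, MATERIALS):
        variants.append({
            "seed_id": seed_id,
            "prompt": template.format(obj=obj, height=height, material=material),
            "params": {"object": obj, "height_m": height, "material": material},
        })
    return variants


def perturb(variant: dict) -> list[dict]:
    """Return the variant once without and once with a visual distractor."""
    return [
        {**variant, "prompt": variant["prompt"] + d, "distractor": bool(d)}
        for d in DISTRACTORS
    ]


def expand_all(seed_id: str, template: str, max_variants: int = 500) -> list[dict]:
    """Parameterize, perturb, de-duplicate by prompt text, and cap the count."""
    seen, out = set(), []
    for base in parameterize(seed_id, template):
        for variant in perturb(base):
            if variant["prompt"] not in seen:
                seen.add(variant["prompt"])
                out.append(variant)
    return out[:max_variants]


variants = expand_all("gravity_freefall", TEMPLATE, max_variants=10)
print(len(variants), variants[0]["prompt"])
```

With these illustrative grids, one seed yields 3 × 3 × 2 × 2 = 36 unique prompts before the `max_variants` cap is applied.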

Project Structure

philo-benchmark/
├── seeds.py                # 6 seed tasks as frozen dataclasses
├── expand.py               # parameterize, perturb, compose, calibrate
├── generate.py             # Vertex AI Veo 2/3 + fal.ai Kling 1.6
├── score.py                # hybrid optical-flow + Gemini VLM scorer
├── additional_scorers.py   # contact timing, direction, temporal scorers
├── export.py               # writes tasks_and_rubrics.tsv
├── run_pipeline.py         # single CLI entry-point (expand→generate→score→export)
├── adversarial.py          # adversarial prompt rewrites (3 types)
├── validate_scorer.py      # synthetic-video self-tests for the scorer
├── analyze_results.py      # post-hoc analysis and summary tables
├── difficulty_predictor.py # LogisticRegression difficulty predictor
├── score_drift.py          # reproducibility checker (scorer drift detection)
├── elo_rating.py           # Physics Elo ratings across dimensions and seeds
├── vlm_top5.py             # standalone Gemini VLM scorer for existing videos
├── physics_simulator.py    # pymunk ground-truth reference trajectory generator
├── merge_tsvs.py           # merge multiple scored TSV files
├── report/main.tex         # LaTeX report comparing models
├── requirements.txt
├── .env                    # credentials (not committed)
└── README.md

Installation

python3 -m venv .venv
source .venv/bin/activate
uv pip install -r requirements.txt

Create a .env file with your credentials:

GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_CLOUD_LOCATION=us-central1
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
FAL_KEY=your-fal-api-key
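Before a real run, it can help to confirm that all four variables are actually set. The helper below is a hypothetical standard-library sketch (it is not part of the repository) and assumes the .env file has already been loaded into the process environment.

```python
import os

# The four variables listed in the .env example above.
REQUIRED_VARS = [
    "GOOGLE_CLOUD_PROJECT",
    "GOOGLE_CLOUD_LOCATION",
    "GOOGLE_APPLICATION_CREDENTIALS",
    "FAL_KEY",
]


def missing_credentials(env=None) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Missing credentials:", ", ".join(missing))
else:
    print("All credentials are set.")
```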

Usage

Dry run (no API calls)

python3 run_pipeline.py --dry-run --max-variants 10

Real benchmark

python3 run_pipeline.py --max-variants 5 --model kling,veo3

With compositional and adversarial variants

python3 run_pipeline.py --max-variants 5 --model kling,veo3 \
    --compositional --adversarial --output full_benchmark.tsv

Validate the scorer (no API calls needed)

python3 validate_scorer.py

Analyze results

python3 analyze_results.py                    # reads tasks_and_rubrics.tsv
python3 analyze_results.py test_run.tsv       # or specify a different TSV

Reproducibility check (scorer drift detection)

python3 score_drift.py --record-all          # save baselines for all videos
python3 score_drift.py --check-all           # re-score and compare against baselines
python3 score_drift.py --record outputs/video.mp4 seed_id   # single video
python3 score_drift.py --check  outputs/video.mp4 seed_id   # single check

The scorer is deterministic on the 8 test videos: re-running on the same video produces identical scores (delta = 0.000000 across all dimensions). score_drift.py verifies this by saving score baselines to score_baseline.json and flagging any deviation beyond a configurable tolerance (default: 0.01). A scorer change that shifts the score of an existing video by more than the tolerance is therefore caught on the next --check or --check-all run.
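The record-and-check idea can be sketched as follows. The JSON layout (video path mapped to a dictionary of per-dimension scores) and the function names are assumptions for illustration; only the score_baseline.json filename and the 0.01 default tolerance come from the description above.

```python
import json
import tempfile
from pathlib import Path

TOLERANCE = 0.01  # default tolerance stated above


def record(baseline_path: Path, video: str, scores: dict) -> None:
    """Store the scores for one video in the baseline file (assumed layout)."""
    baselines = json.loads(baseline_path.read_text()) if baseline_path.exists() else {}
    baselines[video] = scores
    baseline_path.write_text(json.dumps(baselines, indent=2, sort_keys=True))


def check(baseline_path: Path, video: str, scores: dict,
          tolerance: float = TOLERANCE) -> dict:
    """Return {dimension: delta} for each dimension that drifted past tolerance."""
    baseline = json.loads(baseline_path.read_text())[video]
    deltas = {dim: scores[dim] - baseline[dim] for dim in baseline}
    return {dim: d for dim, d in deltas.items() if abs(d) > tolerance}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "score_baseline.json"
    original = {"trajectory_score": 0.42, "energy_score": 0.46}
    record(path, "outputs/video.mp4", original)

    unchanged = check(path, "outputs/video.mp4", original)
    drifted = check(path, "outputs/video.mp4",
                    {"trajectory_score": 0.47, "energy_score": 0.46})

print("unchanged:", unchanged)
print("drifted:", drifted)
```

An empty result means every dimension stayed within tolerance; any non-empty result names the dimensions that moved and by how much.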

CLI flags

| Flag | Example value | Description |
|---|---|---|
| --dry-run | | Run without calling any APIs |
| --model | kling,veo3 | Comma-separated model list (veo2, veo3, kling, all) |
| --seed-id | gravity_freefall | Restrict to a single seed |
| --max-variants | 500 | Max variants to expand |
| --skip-vlm | | Skip Gemini VLM scoring (rule-based only) |
| --compositional | | Include chained-event compositional variants |
| --adversarial | | Generate adversarial prompt rewrites |
| --force-regenerate | | Re-generate videos even if cached |
| --output | FILE | Output TSV filename |
| --budget | 50 | Stop generating if estimated cost exceeds $50 |
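For orientation, the flags above map naturally onto an argparse parser. The sketch below is not the code in run_pipeline.py: the defaults (for example `all` for `--model` and 500 for `--max-variants`) and the `resolve_models` helper are assumptions made for illustration.

```python
import argparse

ALL_MODELS = ["veo2", "veo3", "kling"]


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the flag table; default values are assumed."""
    p = argparse.ArgumentParser(prog="run_pipeline.py")
    p.add_argument("--dry-run", action="store_true", help="run without calling any APIs")
    p.add_argument("--model", default="all", help="comma-separated model list")
    p.add_argument("--seed-id", default=None, help="restrict to a single seed")
    p.add_argument("--max-variants", type=int, default=500, help="max variants to expand")
    p.add_argument("--skip-vlm", action="store_true", help="rule-based scoring only")
    p.add_argument("--compositional", action="store_true", help="include chained-event variants")
    p.add_argument("--adversarial", action="store_true", help="generate adversarial rewrites")
    p.add_argument("--force-regenerate", action="store_true", help="ignore cached videos")
    p.add_argument("--output", default="tasks_and_rubrics.tsv", help="output TSV filename")
    p.add_argument("--budget", type=float, default=None, help="cost ceiling in USD")
    return p


def resolve_models(spec: str) -> list[str]:
    """Turn 'kling,veo3' or 'all' into a validated list of model names."""
    names = [m.strip() for m in spec.split(",") if m.strip()]
    if "all" in names:
        return list(ALL_MODELS)
    unknown = [m for m in names if m not in ALL_MODELS]
    if unknown:
        raise ValueError(f"unknown model(s): {', '.join(unknown)}")
    return names


args = build_parser().parse_args(
    ["--max-variants", "5", "--model", "kling,veo3", "--budget", "50"]
)
print(resolve_models(args.model), args.max_variants, args.budget)
```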

Scoring Dimensions

| Dimension | Method | Description |
|---|---|---|
| trajectory_score | Rule-based | Quadratic/parabolic fit R² with acceleration verification |
| energy_score | Rule-based | Pearson correlation between height loss and velocity² |
| permanence_score | Rule-based | Lucas-Kanade sparse tracking continuity |
| consistency_score | VLM | Physical consistency via contrastive CoT |
| causal_score | VLM | Causal chain plausibility |
| surface_score | VLM | Surface/material realism |
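The two curve-based dimensions can be illustrated on a tracked height series. This is a minimal sketch, not the code in score.py: it swaps in `np.polyfit` where the pipeline uses scipy curve fitting, assumes the object track has already been extracted as height versus time, and the clipping to [0, 1] and the zero-variance guard are assumptions.

```python
import numpy as np


def trajectory_fit(t, y):
    """Return (R² of a quadratic fit, fitted acceleration)."""
    t, y = np.asarray(t, dtype=float), np.asarray(y, dtype=float)
    coeffs = np.polyfit(t, y, deg=2)  # y ~ a*t^2 + b*t + c
    fitted = np.polyval(coeffs, t)
    ss_res = float(np.sum((y - fitted) ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return float(np.clip(r2, 0.0, 1.0)), float(2.0 * coeffs[0])


def energy_score(t, y):
    """Pearson correlation between height loss and squared speed, clipped to [0, 1]."""
    t, y = np.asarray(t, dtype=float), np.asarray(y, dtype=float)
    speed_sq = np.gradient(y, t) ** 2
    height_loss = y[0] - y
    # Correlation is undefined when either series is constant.
    if np.std(speed_sq) < 1e-9 or np.std(height_loss) < 1e-9:
        return 0.0
    r = np.corrcoef(height_loss, speed_sq)[0, 1]
    return float(np.clip(r, 0.0, 1.0))


t = np.linspace(0.0, 1.0, 25)
freefall = 5.0 - 0.5 * 9.81 * t**2   # height in metres, released from rest
constant = 5.0 - 2.0 * t             # constant-velocity descent

ff_r2, ff_acc = trajectory_fit(t, freefall)
cv_r2, cv_acc = trajectory_fit(t, constant)
print(f"freefall: R2={ff_r2:.3f} acc={ff_acc:.2f} energy={energy_score(t, freefall):.3f}")
print(f"constant: R2={cv_r2:.3f} acc={cv_acc:.2f} energy={energy_score(t, constant):.3f}")
```

The constant-velocity case shows why R² alone is insufficient: a straight line is fitted perfectly by a quadratic, so the fitted acceleration has to be checked separately, and the energy correlation collapses because the speed never changes.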

Three additional rule-based dimensions are computed by additional_scorers.py and included in the TSV but not in total_score (available for extended analysis):

| Dimension | Description |
|---|---|
| contact_timing_score | Whether contact/collision occurs at the expected time in the video |
| direction_score | Whether dominant motion direction matches physics (e.g., downward for freefall) |
| temporal_score | Whether the acceleration profile is monotonic (not reversed) |

Results Summary (32 videos · Kling 1.6 vs Veo 3 · 6 seeds)

| Model | N | Trajectory | Energy | Consistency | Causal | Surface | Permanence | Total |
|---|---|---|---|---|---|---|---|---|
| Kling | 16 | 0.420±0.330 | 0.462±0.261 | 0.988±0.034 | 0.537±0.449 | 0.800±0.398 | 0.942±0.046 | 0.692±0.128 |
| Veo 3 | 16 | 0.533±0.274 | 0.345±0.324 | 0.994±0.025 | 0.456±0.452 | 0.731±0.442 | 0.849±0.206 | 0.651±0.182 |

Key finding: Kling leads overall (0.692 vs 0.651) despite trailing on trajectory (0.420 vs 0.533), although the overall margin is smaller than the ± spread reported for either model. The mean intrinsic/superficial faithfulness gap is positive for both models (Kling +0.484, Veo 3 +0.417), meaning both look more physically plausible than they are. Causal score is the weakest VLM dimension for both models (Kling 0.537, Veo 3 0.456), as it requires the correct physical effect, not just a visually convincing one.
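As a rough check on how much weight the overall margin can bear, the summary statistics in the table can be fed into a Welch t-statistic. This sketch assumes the ± values are sample standard deviations and treats the two groups of 16 videos as independent. If the same tasks were rendered by both models, a paired comparison on per-task scores would be more appropriate.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch t-statistic and Welch-Satterthwaite degrees of freedom."""
    var_a, var_b = sd_a**2 / n_a, sd_b**2 / n_b
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a**2 / (n_a - 1) + var_b**2 / (n_b - 1))
    return t, df


# Total column from the results table: Kling vs Veo 3.
t_stat, dof = welch_t(0.692, 0.128, 16, 0.651, 0.182, 16)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

Under these assumptions t is about 0.74 with roughly 27 degrees of freedom, well below the two-sided 5% critical value of about 2.05, so the overall ranking is best read as suggestive at this sample size.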

Total API spend for this benchmark run: ~$6.10 (within the $200 reimbursement limit).

Prompts Used with Claude Code

Below is the complete list of prompts used to build this project, in order:

  1. Project setup & seeds.py plan: "I'm building a video generation reasoning benchmark for a research take-home. Here's the full context: GOAL: Build an evaluation pipeline for physics reasoning failures in video generation models. Specifically: gravitational & collision physics plausibility. [...] Start by creating the project folder structure and seeds.py with all 5 seed tasks fully defined as Python dataclasses. Do not start coding yet — first show me the plan for seeds.py and confirm the structure looks right."

  2. Implement seeds.py: "This looks good. Go ahead and create the project folder structure and implement seeds.py in full. After creating the file, run: python seeds.py to verify it imports cleanly with no errors."

  3. Implement expand.py: "seeds.py looks good. Now implement expand.py only. Import the SeedTask dataclasses from seeds.py. It should have: parameterize(seed), perturb(variant), expand_all(seeds, max_variants=500). After writing it, run a quick test: expand seed_1 and print how many variants were generated."

  4. Implement generate.py: "Now implement generate.py with two functions: generate_veo2(task, project_id, location) using Vertex AI VideoGenerationModel (veo-002), generate_kling(task, fal_api_key) using fal.ai. Both should save the video to outputs/{task_id}_{model}.mp4 and return the file path. Add a --dry-run flag. Test with --dry-run first."

  5. Implement score.py: "Now implement score.py with: rule_based_scorer(video_path, seed) using OpenCV optical flow + scipy curve fitting, vlm_scorer(video_path, seed, project_id) using Gemini 2.0 Flash on Vertex AI with contrastive chain-of-thought prompting, final_scorer() combining both. All scores must be float 0.0-1.0."

  6. Credentials setup: "Before we run anything, I need to set up credentials. What environment variables does this project need? List them all with descriptions so I can set them up."

  7. Wire pipeline & dry run: "Credentials are fully set up now. Please: Make sure python-dotenv is in requirements.txt and .env is loaded in run_pipeline.py. Add .env to .gitignore. Run: python run_pipeline.py --dry-run."

  8. Fix export.py & verify pipeline: "Dependencies are now installed via uv. We skipped two steps. Please: Create export.py that writes tasks_and_rubrics.tsv with columns [...]. Verify run_pipeline.py wires all 4 stages together. Then run: python3 run_pipeline.py --dry-run."

  9. Real test with Kling: "Now run a real test with 2 videos on Kling only: python3 run_pipeline.py --seed-id incline_roll --max-variants 2 --model kling --output test_run.tsv."

  10. Add Veo 3 & full benchmark: "Kling is working and generated 2 videos. Now: Wait for the current run to finish and confirm test_run.tsv is populated with scores. Then add Veo 3 as the second model alongside Kling. In generate.py add generate_veo3() using google-genai SDK with model 'veo-003'. Run the full benchmark with both models. Show me tasks_and_rubrics.tsv contents when done. Then write report/main.tex comparing Kling vs Veo 3 results across all 5 rubric dimensions."

  11. 9 code fixes + report update: "Fix broken deduplication in expand.py. Add camera motion subtraction to score.py. Replace dense flow centroid with Lucas-Kanade sparse tracking. Add energy_conservation_score() as a 6th scoring dimension. Verify score.py dispatches to seed-specific physics models. Add ScoreResult dataclass with weighted total property. Add --force-regenerate flag. Add calibrate_difficulty() function to expand.py. Update report/main.tex to include a VBVR-EvalKit alignment section."

  12. Compositional expansion: "Add a new expansion method to expand.py called compose(seed_a, seed_b). It should take two seed tasks and chain their prompts with a causal connector. The composed rubric should require BOTH physics models to pass: composed_score = min(score_a, score_b). Add 10 composed variants using specific pairs. Tag them with 'category': 'compositional'. Add a --compositional flag to run_pipeline.py."

  13. Adversarial expansion: "Create a new file adversarial.py with a function adversarialize(variant: dict) -> list[dict]. It should return 3 adversarial versions of any variant: UNDERSPECIFIED, CONTRADICTORY, ANTHROPOMORPHIZED. Each adversarial variant should inherit all scoring rubric from the original and be flagged in the TSV with a new column 'adversarial_type'. Add --adversarial flag to run_pipeline.py."

  14. Scorer self-tests: "Create validate_scorer.py that self-tests the scoring pipeline using programmatically generated synthetic videos. Generate 4 synthetic test cases with OpenCV VideoWriter: PERFECT_FREEFALL, CONSTANT_VELOCITY, PERFECT_PERMANENCE, TELEPORTING_OBJECT. For each test case: generate a 3-second synthetic .mp4 at 24fps, run rule_based_scorer() on it, assert the expected score range, print PASS/FAIL."

  15. Analysis & README: "Create analyze_results.py that reads tasks_and_rubrics.tsv and prints: MODEL COMPARISON TABLE, SEED DIFFICULTY RANKING, DISTRACTOR SENSITIVITY ANALYSIS, WINNER PER SEED, CALIBRATION REPORT. Also write a summary to results_summary.txt. Then update README.md to include project overview, installation, usage, results summary table, and list of ALL prompts used."
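The compositional rule requested in prompt 12 (a chained prompt whose score is the minimum of its two component scores) can be sketched in a few lines. The dictionary layout, the connector wording, and the example prompts are hypothetical; only the `min` rule, the 'category': 'compositional' tag, and the seed identifiers come from the text above.

```python
def compose(seed_a: dict, seed_b: dict,
            connector: str = "As a direct result, ") -> dict:
    """Chain two seed prompts with a causal connector (wording is assumed)."""
    first = seed_a["prompt"].rstrip(". ")
    second = seed_b["prompt"][0].lower() + seed_b["prompt"][1:]
    return {
        "task_id": f"{seed_a['seed_id']}__{seed_b['seed_id']}",
        "prompt": f"{first}. {connector}{second}",
        "category": "compositional",
        "components": [seed_a["seed_id"], seed_b["seed_id"]],
    }


def composed_score(score_a: float, score_b: float) -> float:
    """Both physics models must pass, so the weaker score decides."""
    return min(score_a, score_b)


# Hypothetical prompts attached to two seed identifiers used in this README.
incline = {"seed_id": "incline_roll", "prompt": "A ball rolls down a ramp."}
freefall = {"seed_id": "gravity_freefall", "prompt": "The ball falls off the table edge."}

task = compose(incline, freefall)
print(task["prompt"])
print(composed_score(0.9, 0.4))
```

Taking the minimum rather than the mean means a video cannot compensate for one broken event with a convincing second one.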
