
feat: add SQLite persistent storage for benchmark results (#26) #34

Open
Sugaria0427 wants to merge 1 commit into Neal006:main from Sugaria0427:feat/sqlite-persistent-storage


@Sugaria0427

Summary

Replace flat JSON/CSV log storage with a queryable SQLite database for benchmark results, while maintaining full backward compatibility.

Related issue

Closes #26

Type of change

  • API or infrastructure

How was this tested?

python tests/test_pipeline.py  (29/29 pass)
python main.py --turns 10 --checkpoints 5 10 --backends naive --log
python utils/migrate_legacy_logs.py

Checklist

  • All existing tests pass
  • New tests added (6 new tests)
  • Docstrings added on all new public classes and functions
  • Type hints used on all new function signatures
  • No API key required to run any new tests
  • Backward compatible — JSON/CSV output unchanged

Replaces flat JSON/CSV log storage with a queryable SQLite database
while maintaining full backward compatibility.

New files:
- utils/storage.py — Storage class wrapping Python stdlib sqlite3.
  Schema: runs (run_id, timestamp, config_json) + results (run_id FK,
  backend, turn, metric, value). API: save_run, get_run, list_runs,
  compare_runs.
- utils/migrate_legacy_logs.py — one-shot idempotent migration script
  that imports existing experiment_logs/*.json into SQLite.

Modified files:
- evaluation/logger.py — log_run() now calls Storage().save_run()
  alongside existing JSON+CSV writes. list_runs() queries SQLite
  first, falls back to filesystem scan. Fixes pre-existing
  has_llm_eval bug in _append_csv_summary.
- tests/test_pipeline.py — 6 new tests (4 Storage CRUD + 2 logger
  integration).
- CHANGELOG.md — documented the new feature.
- .gitignore — added experiment_logs/memorylens.db.

Closes Neal006#26
@Neal006
Owner

Neal006 commented Jun 3, 2026

Code Review — PR #34 (SQLite persistent storage)

Thanks for the contribution! The overall design is clean and the backward-compatibility approach (JSON/CSV output unchanged, SQLite written alongside it) is exactly right. I found 3 high-severity bugs (data corruption, a crash, and a connection leak) that need to be fixed before merge, plus 2 lower-priority issues.


🔴 Bug 1 — Duplicate results rows corrupt get_run() on run-id collision

File: utils/storage.py, save_run() (~line 1508)

self.conn.execute(
    "INSERT OR REPLACE INTO runs (run_id, timestamp, config_json) VALUES (?, ?, ?)",
    ...
)
# then: self.conn.executemany("INSERT INTO results ...", rows)

INSERT OR REPLACE replaces the row in the runs table but does not cascade-delete the existing rows in results (there is no ON DELETE CASCADE on the FK). On a second call with the same run_id (e.g., a re-run, a test cleanup failure, or the migration script running twice), results accumulates duplicate metric rows. get_run() then sees 2× the data per checkpoint and reconstructs wrong metric arrays.

Fix: delete old results before inserting new ones:

# inside `with self.conn:`, before executemany
self.conn.execute("DELETE FROM results WHERE run_id = ?", (run_id,))

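Or add ON DELETE CASCADE to the FK definition and keep INSERT OR REPLACE. Note that SQLite only enforces foreign keys (including cascades) when PRAGMA foreign_keys = ON has been executed on the connection, so Storage would need to set it on every connection it opens.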
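
The collision can be reproduced with the standard-library sqlite3 module alone. The sketch below is an illustration, not the PR's code: the schema is inferred from the PR description, and save_run_buggy / save_run_fixed are hypothetical stand-ins for Storage.save_run().

```python
import sqlite3

# Schema inferred from the PR description (an assumption, not copied from the PR).
SCHEMA = """
CREATE TABLE runs (
    run_id TEXT PRIMARY KEY,
    timestamp TEXT,
    config_json TEXT
);
CREATE TABLE results (
    run_id TEXT REFERENCES runs(run_id),  -- no ON DELETE CASCADE
    backend TEXT,
    turn INTEGER,
    metric TEXT,
    value REAL
);
"""

INSERT_RUN = (
    "INSERT OR REPLACE INTO runs (run_id, timestamp, config_json) "
    "VALUES (?, ?, ?)"
)
INSERT_RESULT = (
    "INSERT INTO results (run_id, backend, turn, metric, value) "
    "VALUES (?, ?, ?, ?, ?)"
)


def make_conn():
    """Open a fresh in-memory database with the assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def save_run_buggy(conn, run_id, rows):
    """Replaces the runs row but leaves old results rows in place."""
    with conn:
        conn.execute(INSERT_RUN, (run_id, "t0", "{}"))
        conn.executemany(INSERT_RESULT, [(run_id, *r) for r in rows])


def save_run_fixed(conn, run_id, rows):
    """Deletes stale results for run_id in the same transaction."""
    with conn:
        conn.execute("DELETE FROM results WHERE run_id = ?", (run_id,))
        conn.execute(INSERT_RUN, (run_id, "t0", "{}"))
        conn.executemany(INSERT_RESULT, [(run_id, *r) for r in rows])


def count_results(conn, run_id):
    """Number of results rows stored for run_id."""
    query = "SELECT COUNT(*) FROM results WHERE run_id = ?"
    return conn.execute(query, (run_id,)).fetchone()[0]


rows = [("naive", 5, "recall", 0.5), ("naive", 10, "recall", 0.6)]

buggy = make_conn()
save_run_buggy(buggy, "run-1", rows)
save_run_buggy(buggy, "run-1", rows)   # same run_id saved twice

fixed = make_conn()
save_run_fixed(fixed, "run-1", rows)
save_run_fixed(fixed, "run-1", rows)

print(count_results(buggy, "run-1"))   # 4: every metric row is duplicated
print(count_results(fixed, "run-1"))   # 2: one row per (turn, metric)
```

Running the DELETE first also keeps the write valid if foreign-key enforcement is later switched on without a cascade.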


🔴 Bug 2 — IndexError crash in _append_csv_summary when rows is empty

File: evaluation/logger.py, _append_csv_summary() (~line 442)

with open(csv_path, "a", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=rows[0].keys())  # ← crashes if rows == []

If checkpoints is [] or every key in display_data is filtered out ("checkpoints" / "has_llm_eval"), rows is empty and rows[0] raises IndexError. The PR adds the has_llm_eval filter (good fix for the pre-existing TypeError) but still doesn't guard against empty rows.

Fix:

if not rows:
    return

Add this guard immediately before the with open(...) block.
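
For reference, here is a self-contained sketch of the guarded write. append_csv_rows is a hypothetical helper, not the PR's _append_csv_summary, and the header handling is an assumption added only so the example can be read back.

```python
import csv
from pathlib import Path


def append_csv_rows(csv_path, rows):
    """Append dict rows to csv_path and return how many rows were written.

    Returns 0 without touching the file when rows is empty, so rows[0]
    is never evaluated on an empty list.
    """
    if not rows:
        return 0

    path = Path(csv_path)
    # Assumption: write a header only when the file is new or empty.
    write_header = not path.exists() or path.stat().st_size == 0

    with open(path, "a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Returning early before open() also avoids creating an empty CSV file as a side effect of a run with no checkpoints.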


🔴 Bug 3 — Storage() connections are never closed → file handle leak

File: evaluation/logger.py, log_run() and list_runs()

# log_run:
Storage().save_run(run_id, config, display_data)   # connection never closed

# list_runs:
store = Storage()
runs = store.list_runs(limit=50)
if runs:
    return runs   # returns without calling store.close()

Every call to log_run() (and every SQLite-path call to list_runs()) leaks a sqlite3.Connection. In the Streamlit dashboard, each benchmark run fires log_run and a re-render can trigger list_runs multiple times. On Windows, open SQLite file handles also block file deletion/migration.

Fix — option A (minimal): add try/finally:

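# Create the Storage before the try block: if Storage() itself raises,
# `store` would otherwise be unbound when the finally clause runs.

# log_run:
store = Storage()
try:
    store.save_run(run_id, config, display_data)
finally:
    store.close()

# list_runs:
store = Storage()
try:
    runs = store.list_runs(limit=50)
    if runs:
        return runs
finally:
    store.close()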

Fix — option B (cleaner): add __enter__/__exit__ to Storage so callers can use with Storage() as store:.
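
A minimal sketch of option B follows. The Storage class here is a hypothetical stand-in that only opens a connection; the db_path parameter and its default are assumptions, and the real class in utils/storage.py would keep its existing methods.

```python
import sqlite3


class Storage:
    """Hypothetical minimal stand-in for the PR's Storage class."""

    def __init__(self, db_path=":memory:"):
        # Assumption: the real class decides its own database path.
        self.conn = sqlite3.connect(db_path)

    def close(self):
        """Release the underlying SQLite connection."""
        self.conn.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Always close, whether or not the body raised.
        self.close()
        return False  # do not swallow exceptions from the with-body


with Storage() as store:
    store.conn.execute("SELECT 1")
# store.conn is closed here, even if the body had raised
```

Note that using a sqlite3.Connection directly as a context manager only commits or rolls back the transaction; it does not close the connection, so the explicit close() in __exit__ is still needed.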


🟡 Issue 4 — get_run() returns None in metric arrays; downstream callers may crash

File: utils/storage.py, get_run() (~line 1574)

display[backend][metric] = [
    turn_map[t].get(metric) if metric in turn_map.get(t, {}) else None
    for t in cps
]

Metrics absent for a checkpoint are returned as None. compare_runs() and the dashboard iterate these arrays expecting floats. For example, compare_runs does list(v["recall"]), which works, but any caller doing arithmetic on the returned list (sum(arr), Plotly chart) can hit a TypeError such as: unsupported operand type(s) for +: 'int' and 'NoneType'.

Suggested fix: filter None at the point of reconstruction, or document clearly that consumers must handle None.
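
If None is kept so that each array stays aligned with the checkpoint list, consumers need a None-aware aggregation. The helper below is a hypothetical example of that contract, not code from the PR.

```python
from typing import Optional, Sequence


def mean_ignoring_none(values: Sequence[Optional[float]]) -> Optional[float]:
    """Average the non-None entries; return None if there are none.

    The input list is left untouched, so its positions still line up
    with the checkpoint list it was built from.
    """
    present = [v for v in values if v is not None]
    if not present:
        return None
    return sum(present) / len(present)


recall = [0.5, None, 1.0]            # metric missing at the second checkpoint
print(mean_ignoring_none(recall))    # 0.75
```

Dropping None inside get_run() itself would shorten the array and break its one-to-one correspondence with checkpoints, so skipping at the point of use is the safer of the two options.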


🟡 Issue 5 — Tests leak Storage connections and use raw SQL for cleanup

File: tests/test_pipeline.py, test_logger_writes_sqlite and test_list_runs_returns_sqlite_runs

Both tests do cleanup via store.conn.execute("DELETE FROM ...") but never call store.close(). The raw-SQL cleanup also bypasses the public API, making the tests fragile if the schema changes.

Suggested fix: call store.close() in each test's finally block (consistent with the other storage tests that already do this correctly).
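
One way to remove the raw-SQL cleanup entirely is to give each test its own throwaway database file. The sketch below uses sqlite3 directly; run_with_temp_db is a hypothetical helper, and applying the same pattern to Storage assumes it can be pointed at a custom database path, which the PR description does not confirm.

```python
import os
import sqlite3
import tempfile


def run_with_temp_db(test_body):
    """Run test_body(conn) against a throwaway SQLite file, then clean up."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)  # sqlite3 opens the file itself; release our handle
    conn = sqlite3.connect(path)
    try:
        return test_body(conn)
    finally:
        # Close before deleting: open handles block removal on Windows.
        conn.close()
        os.remove(path)


def example_test(conn):
    """A stand-in test body that writes one row and reads it back."""
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.execute("INSERT INTO t VALUES (1)")
    return conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]


run_with_temp_db(example_test)
```

Because the file is deleted afterwards, no DELETE statements are needed and the tests no longer depend on table names.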


Summary table

| # | Severity | File | Issue |
|---|----------|------|-------|
| 1 | 🔴 High | utils/storage.py | Duplicate results rows corrupt get_run() on repeated run_id |
| 2 | 🔴 High | evaluation/logger.py | IndexError when rows is empty in _append_csv_summary |
| 3 | 🔴 High | evaluation/logger.py | Storage() connections never closed → file handle leak |
| 4 | 🟡 Medium | utils/storage.py | None values in metric arrays break arithmetic downstream |
| 5 | 🟡 Low | tests/test_pipeline.py | Integration tests leak connections and use raw SQL for cleanup |

Once the three 🔴 issues are fixed I'm happy to approve. The design and test coverage are solid — these are all mechanical fixes.



Development

Successfully merging this pull request may close these issues.

feat: SQLite persistent storage — replace flat JSON logs with a queryable database

2 participants