feat: add SQLite persistent storage for benchmark results (#26) by Sugaria0427 · Pull Request #34 · Neal006/memorylens

Sugaria0427 · 2026-05-24T12:49:43Z

Summary

Replace flat JSON/CSV log storage with a queryable SQLite database for benchmark results, while maintaining full backward compatibility.

Related issue

Closes #26

Type of change

API or infrastructure

How was this tested?

python tests/test_pipeline.py  (29/29 pass)
python main.py --turns 10 --checkpoints 5 10 --backends naive --log
python utils/migrate_legacy_logs.py

Checklist

All existing tests pass
New tests added (6 new tests)
Docstrings added on all new public classes and functions
Type hints used on all new function signatures
No API key required to run any new tests
Backward compatible — JSON/CSV output unchanged

Replaces flat JSON/CSV log storage with a queryable SQLite database while maintaining full backward compatibility.

New files:
- utils/storage.py: Storage class wrapping the Python stdlib sqlite3 module. Schema: runs (run_id, timestamp, config_json) + results (run_id FK, backend, turn, metric, value). API: save_run, get_run, list_runs, compare_runs.
- utils/migrate_legacy_logs.py: one-shot idempotent migration script that imports existing experiment_logs/*.json into SQLite.

Modified files:
- evaluation/logger.py: log_run() now calls Storage().save_run() alongside the existing JSON+CSV writes. list_runs() queries SQLite first and falls back to a filesystem scan. Fixes the pre-existing has_llm_eval bug in _append_csv_summary.
- tests/test_pipeline.py: 6 new tests (4 Storage CRUD + 2 logger integration).
- CHANGELOG.md: documented the new feature.
- .gitignore: added experiment_logs/memorylens.db.

Closes Neal006#26

Neal006 · 2026-06-03T04:13:35Z

Code Review — PR #34 (SQLite persistent storage)

Thanks for the contribution! The overall design is clean and the backward-compatibility approach (JSON/CSV unchanged, SQLite as opt-in) is exactly right. I found 3 bugs that can cause crashes, corrupted results, or leaked file handles in production, plus 2 lower-priority issues; these need addressing before merge.

🔴 Bug 1 — Duplicate `results` rows corrupt `get_run()` on run-id collision

File: utils/storage.py, save_run() (~line 1508)

self.conn.execute(
    "INSERT OR REPLACE INTO runs (run_id, timestamp, config_json) VALUES (?, ?, ?)",
    ...
)
# then: self.conn.executemany("INSERT INTO results ...", rows)

INSERT OR REPLACE replaces the row in the runs table but does not cascade-delete the existing rows in results (there is no ON DELETE CASCADE on the FK). On a second call with the same run_id (e.g., a re-run, a test cleanup failure, or the migration script running twice), results accumulates duplicate metric rows. get_run() then sees 2× the data per checkpoint and reconstructs wrong metric arrays.

Fix: delete old results before inserting new ones:

# inside `with self.conn:`, before executemany
self.conn.execute("DELETE FROM results WHERE run_id = ?", (run_id,))

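Alternatively, add ON DELETE CASCADE to the FK definition and keep INSERT OR REPLACE. Note that SQLite only enforces foreign keys (and therefore only cascades) when PRAGMA foreign_keys = ON has been set on the connection; enforcement is off by default.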
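A minimal sketch of the delete-then-insert fix, assuming the two-table schema from the PR description. The standalone `save_run` function, the column types, and the sample rows are illustrative assumptions, not the PR's actual code.

```python
import json
import sqlite3

# Assumed schema, reconstructed from the PR description (column types are guesses).
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    run_id TEXT PRIMARY KEY,
    timestamp TEXT NOT NULL,
    config_json TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS results (
    run_id TEXT NOT NULL REFERENCES runs(run_id),
    backend TEXT NOT NULL,
    turn INTEGER NOT NULL,
    metric TEXT NOT NULL,
    value REAL
);
"""


def save_run(conn, run_id, timestamp, config, rows):
    """Replace a run and its result rows in one transaction.

    `rows` is a list of (backend, turn, metric, value) tuples (assumed shape).
    """
    with conn:  # commits on success, rolls back if any statement fails
        conn.execute(
            "INSERT OR REPLACE INTO runs (run_id, timestamp, config_json) VALUES (?, ?, ?)",
            (run_id, timestamp, json.dumps(config)),
        )
        # Clear rows left over from an earlier save of the same run_id.
        conn.execute("DELETE FROM results WHERE run_id = ?", (run_id,))
        conn.executemany(
            "INSERT INTO results (run_id, backend, turn, metric, value) VALUES (?, ?, ?, ?, ?)",
            [(run_id, *row) for row in rows],
        )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
rows = [("naive", 5, "recall", 0.5), ("naive", 10, "recall", 0.75)]
save_run(conn, "run-1", "2026-05-24T12:00:00Z", {"turns": 10}, rows)
save_run(conn, "run-1", "2026-05-24T12:05:00Z", {"turns": 10}, rows)  # same run_id again
result_count = conn.execute(
    "SELECT COUNT(*) FROM results WHERE run_id = ?", ("run-1",)
).fetchone()[0]
conn.close()
```

Saving the same run_id twice leaves two result rows rather than four, because the DELETE runs inside the same transaction as the inserts.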

🔴 Bug 2 — `IndexError` crash in `_append_csv_summary` when `rows` is empty

File: evaluation/logger.py, _append_csv_summary() (~line 442)

with open(csv_path, "a", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=rows[0].keys())  # ← crashes if rows == []

If checkpoints is [] or every key in display_data is filtered out ("checkpoints" / "has_llm_eval"), rows is empty and rows[0] raises IndexError. The PR adds the has_llm_eval filter (a good fix for the pre-existing TypeError) but still doesn't guard against empty rows.

Fix:

if not rows:
    return

Add this guard immediately before the with open(...) block.
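A small sketch of the guard in context. The function below is a simplified stand-in for _append_csv_summary: it takes an already-open file handle and returns the number of rows written, both of which are assumptions made so the example is self-contained.

```python
import csv
import io


def append_csv_summary(fh, rows):
    """Append summary rows to an open CSV handle; do nothing when there are none."""
    if not rows:
        return 0  # nothing to write, and rows[0] below would raise IndexError
    writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
    writer.writerows(rows)
    return len(rows)


buffer = io.StringIO()
written_empty = append_csv_summary(buffer, [])  # returns early, no exception
written_one = append_csv_summary(buffer, [{"backend": "naive", "turn": 5, "recall": 0.5}])
csv_text = buffer.getvalue()
```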

🔴 Bug 3 — `Storage()` connections are never closed → file handle leak

File: evaluation/logger.py, log_run() and list_runs()

# log_run:
Storage().save_run(run_id, config, display_data)   # connection never closed

# list_runs:
store = Storage()
runs = store.list_runs(limit=50)
if runs:
    return runs   # returns without calling store.close()

Every call to log_run() (and every SQLite-path call to list_runs()) leaks a sqlite3.Connection. In the Streamlit dashboard, each benchmark run fires log_run and a re-render can trigger list_runs multiple times. On Windows, open SQLite file handles also block file deletion/migration.

Fix — option A (minimal): add try/finally:

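# log_run:
store = Storage()  # construct outside try so `store` is always bound in finally
try:
    store.save_run(run_id, config, display_data)
finally:
    store.close()

# list_runs:
store = Storage()
try:
    runs = store.list_runs(limit=50)
    if runs:
        return runs  # finally still runs and closes the connection
finally:
    store.close()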

Fix — option B (cleaner): add __enter__/__exit__ to Storage so callers can use with Storage() as store:.
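Option B might look roughly like the sketch below. The Storage class here is a stripped-down stand-in that only handles the connection, and the `db_path` parameter is an assumption; the real class in utils/storage.py would keep its existing methods.

```python
import sqlite3


class Storage:
    """Stand-in for the PR's Storage class, reduced to connection handling."""

    def __init__(self, db_path=":memory:"):  # db_path is a hypothetical parameter
        self.conn = sqlite3.connect(db_path)

    def close(self):
        self.conn.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions raised inside the with block


with Storage() as store:
    answer = store.conn.execute("SELECT 1").fetchone()[0]

# After the block the connection is closed, so further use raises ProgrammingError.
try:
    store.conn.execute("SELECT 1")
    closed = False
except sqlite3.ProgrammingError:
    closed = True
```

One reason to define these methods on Storage itself: using a sqlite3.Connection directly as a context manager only commits or rolls back the transaction and does not close the connection.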

🟡 Issue 4 — `get_run()` returns `None` in metric arrays; downstream callers may crash

File: utils/storage.py, get_run() (~line 1574)

display[backend][metric] = [
    turn_map[t].get(metric) if metric in turn_map.get(t, {}) else None
    for t in cps
]

Metrics absent for a checkpoint are returned as None. compare_runs() and the dashboard iterate these arrays expecting floats. For example, compare_runs does list(v["recall"]), which works, but any caller doing arithmetic (sum(arr), Plotly chart) on the returned list will encounter TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'.

Suggested fix: filter None at the point of reconstruction, or document clearly that consumers must handle None.
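A sketch of what consumer-side handling could look like, using a made-up metric array with one missing checkpoint. The helper name `mean_ignoring_none` is hypothetical.

```python
def mean_ignoring_none(values):
    """Average the numeric entries of a metric array, skipping None gaps."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


recall = [0.5, None, 0.75]  # made-up array; the middle checkpoint has no value

# Plain arithmetic on the raw array fails as described above.
try:
    sum(recall)
    raised = False
except TypeError:
    raised = True

mean_recall = mean_ignoring_none(recall)
```

One trade-off to weigh: dropping None inside get_run() would shorten the arrays so they no longer line up with the checkpoint list, whereas handling None in the consumer keeps positions aligned.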

🟡 Issue 5 — Tests leak `Storage` connections and use raw SQL for cleanup

File: tests/test_pipeline.py, test_logger_writes_sqlite and test_list_runs_returns_sqlite_runs

Both tests do cleanup via store.conn.execute("DELETE FROM ...") but never call store.close(). The raw-SQL cleanup also bypasses the public API, making the tests fragile if the schema changes.

Suggested fix: call store.close() in each test's finally block (consistent with the other storage tests that already do this correctly).
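One possible shape for such a test is sketched below. It assumes Storage can be pointed at a custom database path (the `db_path` parameter is hypothetical), which lets each test use a throwaway file instead of deleting rows with raw SQL. The Storage class shown is a minimal stand-in, not the PR's implementation.

```python
import os
import sqlite3
import tempfile


class Storage:
    """Minimal stand-in for the PR's Storage class; db_path is an assumption."""

    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS runs (run_id TEXT PRIMARY KEY)")

    def save_run(self, run_id):
        with self.conn:
            self.conn.execute("INSERT OR REPLACE INTO runs (run_id) VALUES (?)", (run_id,))

    def list_runs(self):
        return [r[0] for r in self.conn.execute("SELECT run_id FROM runs ORDER BY run_id")]

    def close(self):
        self.conn.close()


def test_list_runs_returns_saved_run():
    # A fresh database per test means no cleanup SQL is needed afterwards.
    with tempfile.TemporaryDirectory() as tmp:
        store = Storage(os.path.join(tmp, "test.db"))
        try:
            store.save_run("run-1")
            assert store.list_runs() == ["run-1"]
        finally:
            store.close()  # release the file handle before the directory is removed


test_list_runs_returns_saved_run()
```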

Summary table

#	Severity	File	Issue
1	🔴 High	`utils/storage.py`	Duplicate `results` rows corrupt `get_run()` on repeated `run_id`
2	🔴 High	`evaluation/logger.py`	`IndexError` when `rows` is empty in `_append_csv_summary`
3	🔴 High	`evaluation/logger.py`	`Storage()` connections never closed → file handle leak
4	🟡 Medium	`utils/storage.py`	`None` values in metric arrays break arithmetic downstream
5	🟡 Low	`tests/test_pipeline.py`	Integration tests leak connections and use raw SQL for cleanup

Once the three 🔴 issues are fixed I'm happy to approve. The design and test coverage are solid — these are all mechanical fixes.

Sugaria0427 wants to merge 1 commit into Neal006:main from Sugaria0427:feat/sqlite-persistent-storage
