Conversation


@mann-patwa mann-patwa commented Dec 19, 2025

resolves #157

Database Update Improvements

Thread-Safe Database Switching

  • Added a dbMutex (sync.RWMutex) to the Manager to protect access to the GtfsDB client.
  • Read operations now acquire a read lock, preventing conflicts during database swaps.

Zero-Downtime Hot-Swap

  • The updateStaticGTFS process builds a completely new SQLite database (gtfs.db.tmp) in the background.
  • Once the build is complete:
    • The in-memory client pointer is atomically swapped.
    • The temporary database file is renamed to overwrite gtfs.db, ensuring zero downtime.
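
A minimal sketch of this pattern, assuming simplified names (Client stands in for the project's gtfsdb client type; withDB and swapDB are illustrative helpers, not the PR's exact code):

import "sync"

type Manager struct {
    dbMutex sync.RWMutex
    GtfsDB  *Client
}

// Read path: hold the read lock for the duration of the query so a
// swap cannot happen mid-read.
func (m *Manager) withDB(fn func(*Client) error) error {
    m.dbMutex.RLock()
    defer m.dbMutex.RUnlock()
    return fn(m.GtfsDB)
}

// Swap path: the new DB is fully built in the background first, so the
// write lock is held only for the pointer assignment.
func (m *Manager) swapDB(newDB *Client) (old *Client) {
    m.dbMutex.Lock()
    old, m.GtfsDB = m.GtfsDB, newDB
    m.dbMutex.Unlock()
    return old
}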


@mann-patwa mann-patwa reopened this Dec 22, 2025

mann-patwa commented Dec 22, 2025

@amrhossamdev @aaronbrethorst

After reviewing the concurrency model, I decided to proceed with using a single lock to protect both staticData and GtfsDB.
This approach significantly simplifies the logic and completely eliminates the risk of deadlocks caused by lock-ordering issues.

Changelog

  • Refactored buildGtfsDB

    • Updated the function signature to accept a path argument, allowing explicit control over where the database is built.
  • Client Struct Update

    • Added a helper method to retrieve the current database path (GetDBPath()).
  • Implemented ForceUpdate

    • Introduced a new method to safely handle live updates with the following sequence (condensed sketch after this list):
      1. Hot-swap active GtfsDB
        • Atomically replaces the active GtfsDB client in the manager.
        • New requests immediately use the updated database with zero downtime.
      2. Sync in-memory static data
        • Reloads and realigns in-memory static structures to ensure consistency with the newly swapped SQLite database.
      3. Clean up old database
        • Gracefully closes the old database connection and deletes the previous SQLite file once the swap is confirmed successful.
  • Added hot-swap concurrency test

    • Introduced TestManager_HotSwapConcurrency to validate safe concurrent reads during ForceUpdate.
    • The test spins up multiple concurrent readers accessing both static data and DB-backed paths while a live hot-swap is performed.
    • Verifies that:
      • No panics or race conditions occur
      • ForceUpdate completes successfully
      • Post-swap static data remains consistent and correct
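
A condensed sketch of that sequence; staticMutex, GtfsDB, staticData, buildGtfsDB, and GetDBPath appear in this PR, while loadStaticData, Close, and the exact signatures are simplified stand-ins:

func (manager *Manager) ForceUpdate(ctx context.Context) error {
    // Build the replacement database off to the side.
    tempDBPath := manager.config.GTFSDataPath + ".tmp"
    newGtfsDB, err := buildGtfsDB(ctx, tempDBPath)
    if err != nil {
        return err
    }

    // 1. Hot-swap the active GtfsDB under the write lock.
    manager.staticMutex.Lock()
    oldGtfsDB := manager.GtfsDB
    manager.GtfsDB = newGtfsDB
    // 2. Sync in-memory static data before releasing the lock, so
    //    readers never observe a DB/static-data mismatch.
    manager.staticData = loadStaticData(newGtfsDB)
    manager.staticMutex.Unlock()

    // 3. Clean up the old database once the swap has succeeded.
    oldPath := oldGtfsDB.GetDBPath()
    oldGtfsDB.Close()
    return os.Remove(oldPath)
}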

Next Steps

  • Update all handler functions to acquire this single lock whenever accessing either staticData or GtfsDB.

Once you confirm that this approach looks correct, I’ll proceed with updating the handlers accordingly.


mann-patwa commented Dec 25, 2025

Hey @aaronbrethorst,

I think we’ll also need to update DirectionCalculator to accept gtfsManager itself as a dependency, rather than gtfsManager.GtfsDB.Queries (rough sketch below). Otherwise, we’d have to either re-instantiate DirectionCalculator or update its internal queries reference on every static GTFS update.

Please correct me if I’m missing something here.
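
A rough sketch of that change; the constructor and the queries helper are assumed names for illustration:

type DirectionCalculator struct {
    gtfsManager *Manager // instead of a *gtfsdb.Queries snapshot
}

func NewDirectionCalculator(m *Manager) *DirectionCalculator {
    return &DirectionCalculator{gtfsManager: m}
}

// Each call resolves Queries through the manager, so a hot swap is
// picked up automatically on the next lookup.
func (dc *DirectionCalculator) queries() *gtfsdb.Queries {
    return dc.gtfsManager.GtfsDB.Queries
}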

@Ahmedhossamdev Ahmedhossamdev self-requested a review December 31, 2025 01:57

Ahmedhossamdev commented Dec 31, 2025

@mann-patwa Could you please update the config.example.json file?

Also, please run make lint.

@mann-patwa

@Ahmedhossamdev what needs to be changed in the config.example.json file?

@Ahmedhossamdev

> @Ahmedhossamdev what needs to be changed in the config.example.json file?

Oops, I thought you had added a new attribute to config.json that doesn't exist in the example file.

@mann-patwa

@Ahmedhossamdev You can run the lint now; it should work.

@Ahmedhossamdev Ahmedhossamdev left a comment

Overall, this is great work. Thanks a lot for your contribution! <3

// 2. Build new DB
// Generate a unique filename for the new DB
timestamp := time.Now().Format("20060102_150405")
newDBPath := fmt.Sprintf("%s_%s.db", manager.config.GTFSDataPath, timestamp)

@Ahmedhossamdev:

I think we can simplify the update flow by always ending up with a stable database name, gtfs.db, the same as the old DB after we delete it.

This keeps the application logic simple, since the server always opens gtfs.db, and it avoids dynamic path handling.

@mann-patwa:

I did think about this approach, but it feels a bit finicky: opening a database under an already-existing name, or opening it under a different name and then renaming it while it's in use, doesn't seem ideal.

That said, I can implement it this way if you think it's the better approach; let me know your preference.

@Ahmedhossamdev:

You can build the new DB separately, then atomically rename it to gtfs.db once it’s ready. Old DB queries can finish safely, and in-memory data is swapped together, so nothing will break.
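
A sketch of that flow, with illustrative names; swapping the in-memory client pointer is elided here:

func (manager *Manager) rebuildAtStableName(ctx context.Context) error {
    tempDBPath := manager.config.GTFSDataPath + ".tmp"
    newGtfsDB, err := buildGtfsDB(ctx, tempDBPath) // build off to the side
    if err != nil {
        return err
    }
    _ = newGtfsDB // swap the in-memory client pointer here as well
    // Atomic on POSIX when both paths are on the same filesystem; queries
    // already running against the old descriptor can finish, because an
    // open file follows its inode, not its name.
    return os.Rename(tempDBPath, manager.config.GTFSDataPath)
}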

@mann-patwa:

ok, will do that!

timestamp := time.Now().Format("20060102_150405")
newDBPath := fmt.Sprintf("%s_%s.db", manager.config.GTFSDataPath, timestamp)

// FIX: Use manager.isLocalFile here too

@Ahmedhossamdev:

Could you remove comments that look like this one?

blockLayoverIndices map[string][]*BlockLayoverIndex
}

func (manager *Manager) RUnlock() {

@Ahmedhossamdev:

Do we need these RLock / RUnlock helper methods? Since they just delegate to staticMutex, we might be able to simplify by using the mutex directly.

And I notice that we only use them in the swap tests.

@mann-patwa:

Yes, I'll remove the helper methods.

I had one question regarding the locking strategy: should we acquire the locks at the handler level (for example, via middleware)? That would ensure we wait for all ongoing requests to finish before acquiring the write lock for the database swap.

The alternative is to acquire a read lock (RLock) only when accessing GtfsManager.GtfsDB. However, this could lead to a deadlock. For example (see the sketch below):

  • func1 acquires an RLock

  • func1 calls func2, which attempts to acquire another RLock

  • A writer arrives between these two acquisitions and blocks further RLocks

  • The writer then waits for func1 to release its RLock

  • func2 is blocked waiting for the RLock, resulting in a deadlock

Let me know if I’m missing something here or if there’s a safer approach you’d recommend.
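
A minimal reproduction of that scenario (illustrative; func1/func2 follow the list above, and a concurrent writer calling mu.Lock() is assumed):

var mu sync.RWMutex

func func2() {
    mu.RLock() // blocks: a queued writer stops new read locks
    defer mu.RUnlock()
}

func func1() {
    mu.RLock()
    defer mu.RUnlock()
    func2() // writer waits on func1; func2 waits on the writer: deadlock
}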

@Ahmedhossamdev:

That's why we use sync.RWMutex: it allows multiple readers to hold RLocks concurrently, even when they're acquired sequentially.

see: https://medium.com/@madhavj211/understanding-sync-mutex-vs-sync-rwmutex-in-go-with-benchmarks-bd9eddc46fb9


@mann-patwa mann-patwa Dec 31, 2025

Yes, it’s true that multiple readers can hold an RLock at the same time. However, if we try to acquire an RLock inside a function that already holds one, we can still deadlock once a writer attempts to acquire the write lock: recursive read-locking is the problem.

How should we handle this? I was thinking we acquire the RLock at the handler level (rough middleware sketch below), so that calls and accesses to gtfsManager.GtfsDB or gtfsManager.staticData inside the handler funcs don't need to take the lock themselves; they rely on the handler already holding it.

Let me know if I am missing something.
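
A hypothetical middleware sketch; the RLock/RUnlock helpers are assumed to be exported here purely for illustration:

func withGTFSReadLock(m *gtfs.Manager, next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        m.RLock() // one read lock per request
        defer m.RUnlock()
        next.ServeHTTP(w, r) // handlers read GtfsDB/staticData lock-free
    })
}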

@Ahmedhossamdev:

Yes that's true, we can use the locks at the handler level.

@mann-patwa:

Should I open a new PR for that, or change it in this one?

func (manager *Manager) RUnlock() {
    manager.staticMutex.RUnlock()
}
func (manager *Manager) RLock() {

@Ahmedhossamdev:

Same here

@mann-patwa

@Ahmedhossamdev Take a look: I updated the dbPath to be gtfs.db after the hot swap and removed the Lock/Unlock helper functions. Any more changes needed?

@mann-patwa

Hey @Ahmedhossamdev

The failures are due to tests in the internal/restapi package.

1. Resource Leak in Test Setup

Issue: createTestApiWithRealTimeData was initializing a GTFSManager but failing to shut it down. This caused background goroutines and database connections to leak across tests.
Fix: Added gtfsManager.Shutdown() to the test cleanup function (sketch below).
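
A sketch of that cleanup, with hypothetical helper names (newTestGTFSManager and newRestAPI are stand-ins):

func createTestApiWithRealTimeData(t *testing.T) *RestAPI {
    gtfsManager := newTestGTFSManager(t) // hypothetical setup helper
    t.Cleanup(func() {
        // Shut the manager down so background goroutines and DB
        // connections don't leak across tests.
        gtfsManager.Shutdown()
    })
    return newRestAPI(gtfsManager) // hypothetical constructor
}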

2. Date-Dependent Test Failures

Issue: TestScheduleForRouteHandler and TestScheduleForStopHandler were failing because they relied on time.Now(). The static test data (raba.zip) only contains services for a specific date range, causing these tests to fail when run outside that window.
Fix: Updated these tests to explicitly use a valid date (2025-06-12) that is guaranteed to be supported by the test fixture.

@Ahmedhossamdev

@mann-patwa Great work!


@aaronbrethorst aaronbrethorst left a comment

Thanks for tackling this important issue! Supporting hot swap for the SQLite database during daily GTFS updates is a valuable improvement that will ensure SQL-backed endpoints always serve fresh data. The overall architecture of building a temporary database, then atomically swapping it in, is the right approach.

I found a few issues we'll need to address before merging.


Issues to Fix

1. Race Condition in DirectionCalculator

The DirectionCalculator now stores a reference to *Manager and directly accesses dc.gtfsManager.GtfsDB.Queries without holding any lock:

In internal/gtfs/direction_calculator.go:

func (dc *DirectionCalculator) CalculateStopDirection(ctx context.Context, stopID string) string {
    // Strategy 1: Check database for precomputed direction (O(1) lookup)
    stop, err := dc.gtfsManager.GtfsDB.Queries.GetStop(ctx, stopID)
    // ...
}

During a hot swap, ForceUpdate acquires staticMutex.Lock() and replaces manager.GtfsDB. But DirectionCalculator methods don't acquire any lock before accessing GtfsDB. This creates a data race where:

  • A handler calls DirectionCalculator.CalculateStopDirection()
  • Concurrently, ForceUpdate replaces manager.GtfsDB
  • The calculator may read a partially-swapped or closed database

Suggested fix: Either have DirectionCalculator methods acquire the read lock, or provide a thread-safe accessor method on Manager:

func (manager *Manager) GetQueries() *gtfsdb.Queries {
    manager.staticMutex.RLock()
    defer manager.staticMutex.RUnlock()
    return manager.GtfsDB.Queries
}

Then have DirectionCalculator call this method instead of directly accessing the field.


2. Potential Data Loss if Final DB Open Fails

In internal/gtfs/static.go, the ForceUpdate method renames the temp DB to the final path before verifying the new DB can be opened:

// Rename temp to final - this REPLACES the existing file
if err := os.Rename(tempDBPath, finalDBPath); err != nil {
    // ...
}

// Open the final DB - if this fails, we've already lost the old file!
finalGtfsDB, err := gtfsdb.NewClient(dbConfig)
if err != nil {
    logging.LogError(logger, "Error opening final GTFS DB", err)
    return err  // Bad state: old file is gone, new file can't be opened
}

If os.Rename succeeds but NewClient fails, the system is in a broken state - the old database file is gone and the manager is left with a stale GtfsDB pointer that references a deleted file.

Suggested fix: Keep the temp DB open, swap in-memory pointers first, then do file operations:

// Don't close the temp DB yet - keep it open and ready
manager.staticMutex.Lock()
oldGtfsDB := manager.GtfsDB
manager.GtfsDB = newGtfsDB // still pointing to tempDBPath
manager.staticMutex.Unlock()

// Now close the old DB, then rename files.
// If the rename fails, the new DB is still functional at tempDBPath.
Or alternatively, open the temp DB as the final DB first, verify it works, then update the config's path tracking.


3. Old Database File Cleanup Logic is Ineffective

In internal/gtfs/static.go:

if oldDBPath != "" && oldDBPath != finalDBPath {
    if err := os.Remove(oldDBPath); err != nil {
        // ...
    }
}

The condition oldDBPath != finalDBPath will almost always be false because:

  1. Initial load uses manager.config.GTFSDataPath → stored in GtfsDB
  2. Hot swap renames temp to finalDBPath (same as GTFSDataPath)
  3. GetDBPath() returns the path the client was opened with

Since the old client was opened with the same finalDBPath, and the new client is also opened with finalDBPath, this cleanup block never executes. The file cleanup happens implicitly via os.Rename overwriting, but this comment/code is misleading about what actually happens.

Suggested fix: Either remove this dead code block or clarify the intent. If the goal is to support different paths, ensure GetDBPath() reflects the actual file being used.


4. Missing Context Cancellation Checks

The ForceUpdate method accepts a context.Context but never checks for cancellation during its long-running operations:

func (manager *Manager) ForceUpdate(ctx context.Context) error {
    // No ctx.Err() checks throughout the function
    newStaticData, err := loadGTFSData(...)  // Could take minutes
    newGtfsDB, err := buildGtfsDB(...)       // Could take minutes
    // ...
}

The context is passed down to buildStopSpatialIndex but not to loadGTFSData or buildGtfsDB. If the caller cancels (e.g., server shutdown with the 5-minute timeout in updateStaticGTFS), the operation continues anyway.

Suggested fix: Add cancellation checks at key points:

if err := ctx.Err(); err != nil {
    return err
}

Minor Suggestions

Test Hardcoding

In internal/restapi/schedule_for_route_handler_test.go and schedule_for_stop_handler_test.go, you've added hardcoded dates:

"/api/where/schedule-for-route/"+routeID+".json?key=TEST&date=2025-06-12"

This fixes the tests for now, but they'll break when the test data changes or expires. Consider either:

  • Using a date derived from the test data's calendar validity period (rough sketch below)
  • Adding a comment explaining why the specific date was chosen
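
A hypothetical sketch of the first option; GetCalendarRange is an assumed query returning the fixture's service window:

func validServiceDate(t *testing.T, q *gtfsdb.Queries) string {
    t.Helper()
    start, end, err := q.GetCalendarRange(context.Background()) // assumed query
    if err != nil {
        t.Fatalf("calendar range: %v", err)
    }
    // Use the midpoint of the window so the date stays valid even as
    // the fixture's range shifts.
    mid := start.Add(end.Sub(start) / 2)
    return mid.Format("2006-01-02")
}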

Import Ordering

The import reordering in cmd/api/app.go and vehicles_for_agency_handler_test.go is fine (stdlib first, then external), though it's unrelated to the feature. Just mentioning for awareness.


What's Working Well

  • The ForceUpdate method provides a clean public API for triggering updates
  • The TestManager_HotSwapConcurrency test is a good approach for validating thread safety
  • Adding GetDBPath() to the client is a nice utility method
  • The cleanup function addition to createTestApiWithRealTimeData fixes a resource leak

Once the race condition in DirectionCalculator and the failure-recovery issue are addressed, this PR will be in good shape.

@mann-patwa

Hey @aaronbrethorst, thanks for the review!

  1. DirectionCalculator and locking
    Regarding DirectionCalculator acquiring the lock — my thought was to let the handlers acquire the lock instead. Since DirectionCalculator is always used through these handlers, all DB reads would already be thread-safe.

    If we acquire the lock at the DirectionCalculator level, there’s a possibility of inconsistent data, especially if a WriteLock is taken in between DB accesses within the handler. Additionally, a deadlock could arise if both the handler and DirectionCalculator attempt to acquire locks (recursive lock issue).

  2. Renaming the DB file vs swapping pointers
    I initially chose to rename the file before opening a client based on Unix/macOS behavior:
    Unix-like systems use file descriptors (inodes) that point to the file’s data on disk, not its name. Renaming a file only changes the directory entry; the application keeps its connection via the inode.

    That said, if we don’t switch the GtfsDB pointer and the file is deleted, the process will still hold a reference to the inode — but we effectively lose the file on disk.

    Your approach makes sense though. I’ll keep the tempDB open, swap the pointers, and then rename the file. We’ll also need to update the config, since it would still think we’re pointing to tempDBPath.

  3. Cleanup logic
    Agreed, the explicit cleanup isn’t necessary since os.Rename implicitly handles it. I’ll remove that dead code.

  4. Context cancellation
    You’re right — I missed the context cancellation checks. I’ll add those.

I'll need clarification on point 1 before proceeding.

Linked issue: feat: Support SQLite GTFS Hot-Swap on Daily Static Updates (#157)