Conversation


@mann-patwa mann-patwa commented Dec 19, 2025

resolves #157

Database Update Improvements

Thread-Safe Database Switching

  • Added a dbMutex (sync.RWMutex) to the Manager to protect access to the GtfsDB client.
  • Read operations now acquire a read lock, preventing conflicts during database swaps.

Zero-Downtime Hot-Swap

  • The updateStaticGTFS process builds a completely new SQLite database (gtfs.db.tmp) in the background.
  • Once the build is complete:
    • The in-memory client pointer is atomically swapped.
    • The temporary database file is renamed to overwrite gtfs.db, ensuring zero downtime.
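
A minimal sketch of this pattern, assuming simplified names (Client stands in for the project's gtfsdb client type; withDB and swapDB are illustrative helpers, not the PR's exact code):

import "sync"

type Manager struct {
    dbMutex sync.RWMutex
    GtfsDB  *Client
}

// Read path: hold the read lock for the duration of the query so a
// swap cannot happen mid-read.
func (m *Manager) withDB(fn func(*Client) error) error {
    m.dbMutex.RLock()
    defer m.dbMutex.RUnlock()
    return fn(m.GtfsDB)
}

// Swap path: the new DB is fully built in the background first, so the
// write lock is held only for the pointer assignment.
func (m *Manager) swapDB(newDB *Client) (old *Client) {
    m.dbMutex.Lock()
    old, m.GtfsDB = m.GtfsDB, newDB
    m.dbMutex.Unlock()
    return old
}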


@mann-patwa mann-patwa reopened this Dec 22, 2025

mann-patwa commented Dec 22, 2025

@amrhossamdev @aaronbrethorst

After reviewing the concurrency model, I decided to proceed with using a single lock to protect both staticData and GtfsDB.
This approach significantly simplifies the logic and completely eliminates the risk of deadlocks caused by lock-ordering issues.

Changelog

  • Refactored buildGtfsDB

    • Updated the function signature to accept a path argument, allowing explicit control over where the database is built.
  • Client Struct Update

    • Added a helper method to retrieve the current database path (GetDBPath()).
  • Implemented ForceUpdate

    • Introduced a new method to safely handle live updates with the following sequence (condensed sketch after this list):
      1. Hot-swap active GtfsDB
        • Atomically replaces the active GtfsDB client in the manager.
        • New requests immediately use the updated database with zero downtime.
      2. Sync in-memory static data
        • Reloads and realigns in-memory static structures to ensure consistency with the newly swapped SQLite database.
      3. Clean up old database
        • Gracefully closes the old database connection and deletes the previous SQLite file once the swap is confirmed successful.
  • Added hot-swap concurrency test

    • Introduced TestManager_HotSwapConcurrency to validate safe concurrent reads during ForceUpdate.
    • The test spins up multiple concurrent readers accessing both static data and DB-backed paths while a live hot-swap is performed.
    • Verifies that:
      • No panics or race conditions occur
      • ForceUpdate completes successfully
      • Post-swap static data remains consistent and correct
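
A condensed sketch of that sequence; staticMutex, GtfsDB, staticData, buildGtfsDB, and GetDBPath appear in this PR, while loadStaticData, Close, and the exact signatures are simplified stand-ins:

func (manager *Manager) ForceUpdate(ctx context.Context) error {
    // Build the replacement database off to the side.
    tempDBPath := manager.config.GTFSDataPath + ".tmp"
    newGtfsDB, err := buildGtfsDB(ctx, tempDBPath)
    if err != nil {
        return err
    }

    // 1. Hot-swap the active GtfsDB under the write lock.
    manager.staticMutex.Lock()
    oldGtfsDB := manager.GtfsDB
    manager.GtfsDB = newGtfsDB
    // 2. Sync in-memory static data before releasing the lock, so
    //    readers never observe a DB/static-data mismatch.
    manager.staticData = loadStaticData(newGtfsDB)
    manager.staticMutex.Unlock()

    // 3. Clean up the old database once the swap has succeeded.
    oldPath := oldGtfsDB.GetDBPath()
    oldGtfsDB.Close()
    return os.Remove(oldPath)
}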

Next Steps

  • Update all handler functions to acquire this single lock whenever accessing either staticData or GtfsDB.

Once you confirm that this approach looks correct, I’ll proceed with updating the handlers accordingly.


mann-patwa commented Dec 25, 2025

Hey @aaronbrethorst,

I think we’ll also need to update DirectionCalculator to accept gtfsManager itself as a dependency, rather than gtfsManager.GtfsDB.Queries (rough sketch below). Otherwise, we’d have to either re-instantiate DirectionCalculator or update its internal queries reference on every static GTFS update.

Please correct me if I’m missing something here.
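
A rough sketch of that change; the constructor and the queries helper are assumed names for illustration:

type DirectionCalculator struct {
    gtfsManager *Manager // instead of a *gtfsdb.Queries snapshot
}

func NewDirectionCalculator(m *Manager) *DirectionCalculator {
    return &DirectionCalculator{gtfsManager: m}
}

// Each call resolves Queries through the manager, so a hot swap is
// picked up automatically on the next lookup.
func (dc *DirectionCalculator) queries() *gtfsdb.Queries {
    return dc.gtfsManager.GtfsDB.Queries
}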

@Ahmedhossamdev Ahmedhossamdev self-requested a review December 31, 2025 01:57

Ahmedhossamdev commented Dec 31, 2025

@mann-patwa Could you please update the config.example.json file?

Also, please run make lint.

@mann-patwa

@Ahmedhossamdev what needs to be changed in the config.example.json file?

@Ahmedhossamdev

> @Ahmedhossamdev what needs to be changed in the config.example.json file?

Oops, I thought you had added a new attribute to config.json that doesn't exist in the example file.

@mann-patwa

@Ahmedhossamdev You can run the lint now; it should work.

@Ahmedhossamdev Ahmedhossamdev left a comment

Overall, this is great work. Thanks a lot for your contribution! <3

// 2. Build new DB
// Generate a unique filename for the new DB
timestamp := time.Now().Format("20060102_150405")
newDBPath := fmt.Sprintf("%s_%s.db", manager.config.GTFSDataPath, timestamp)

@Ahmedhossamdev:

I think we can simplify the update flow by always ending up with a stable database name, gtfs.db, the same as the old DB after we delete it.

This keeps the application logic simple, since the server always opens gtfs.db, and it avoids dynamic path handling.

@mann-patwa:

I did think about this approach, but it feels a bit finicky: opening a database under an already-existing name, or opening it under a different name and then renaming it while it's in use, doesn't seem ideal.

That said, I can implement it this way if you think it's the better approach; let me know your preference.

@Ahmedhossamdev:

You can build the new DB separately, then atomically rename it to gtfs.db once it’s ready. Old DB queries can finish safely, and in-memory data is swapped together, so nothing will break.
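
A sketch of that flow, with illustrative names; swapping the in-memory client pointer is elided here:

func (manager *Manager) rebuildAtStableName(ctx context.Context) error {
    tempDBPath := manager.config.GTFSDataPath + ".tmp"
    newGtfsDB, err := buildGtfsDB(ctx, tempDBPath) // build off to the side
    if err != nil {
        return err
    }
    _ = newGtfsDB // swap the in-memory client pointer here as well
    // Atomic on POSIX when both paths are on the same filesystem; queries
    // already running against the old descriptor can finish, because an
    // open file follows its inode, not its name.
    return os.Rename(tempDBPath, manager.config.GTFSDataPath)
}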

@mann-patwa:

ok, will do that!

timestamp := time.Now().Format("20060102_150405")
newDBPath := fmt.Sprintf("%s_%s.db", manager.config.GTFSDataPath, timestamp)

// FIX: Use manager.isLocalFile here too

@Ahmedhossamdev:

Could you remove comments that look like this one?

blockLayoverIndices map[string][]*BlockLayoverIndex
}

func (manager *Manager) RUnlock() {

@Ahmedhossamdev:

Do we need these RLock / RUnlock helper methods? Since they just delegate to staticMutex, we might be able to simplify by using the mutex directly.

And I notice that we only use them in the swap tests.

@mann-patwa:

Yes, I'll remove the helper methods.

I had one question regarding the locking strategy: should we acquire the locks at the handler level (for example, via middleware)? That would ensure we wait for all ongoing requests to finish before acquiring the write lock for the database swap.

The alternative is to acquire a read lock (RLock) only when accessing GtfsManager.GtfsDB. However, this could lead to a deadlock. For example (see the sketch below):

  • func1 acquires an RLock

  • func1 calls func2, which attempts to acquire another RLock

  • A writer arrives between these two acquisitions and blocks further RLocks

  • The writer then waits for func1 to release its RLock

  • func2 is blocked waiting for the RLock, resulting in a deadlock

Let me know if I’m missing something here or if there’s a safer approach you’d recommend.
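
A minimal reproduction of that scenario (illustrative; func1/func2 follow the list above, and a concurrent writer calling mu.Lock() is assumed):

var mu sync.RWMutex

func func2() {
    mu.RLock() // blocks: a queued writer stops new read locks
    defer mu.RUnlock()
}

func func1() {
    mu.RLock()
    defer mu.RUnlock()
    func2() // writer waits on func1; func2 waits on the writer: deadlock
}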

@Ahmedhossamdev:

That's why we use sync.RWMutex: it allows multiple readers to hold RLocks concurrently, even when they're acquired sequentially.

see: https://medium.com/@madhavj211/understanding-sync-mutex-vs-sync-rwmutex-in-go-with-benchmarks-bd9eddc46fb9


@mann-patwa mann-patwa Dec 31, 2025

Yes, it’s true that multiple readers can hold an RLock at the same time. However, if we try to acquire an RLock inside a function that already holds one, we can still deadlock once a writer attempts to acquire the write lock: recursive read-locking is the problem.

How should we handle this? I was thinking we acquire the RLock at the handler level (rough middleware sketch below), so that calls and accesses to gtfsManager.GtfsDB or gtfsManager.staticData inside the handler funcs don't need to take the lock themselves; they rely on the handler already holding it.

Let me know if I am missing something.
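
A hypothetical middleware sketch; the RLock/RUnlock helpers are assumed to be exported here purely for illustration:

func withGTFSReadLock(m *gtfs.Manager, next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        m.RLock() // one read lock per request
        defer m.RUnlock()
        next.ServeHTTP(w, r) // handlers read GtfsDB/staticData lock-free
    })
}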

@Ahmedhossamdev:

Yes that's true, we can use the locks at the handler level.

@mann-patwa:

Should I open a new PR for that, or change it in this one?

func (manager *Manager) RUnlock() {
    manager.staticMutex.RUnlock()
}
func (manager *Manager) RLock() {

@Ahmedhossamdev:

Same here

@mann-patwa

@Ahmedhossamdev Take a look: I updated the dbPath to be gtfs.db after the hot swap and removed the Lock/Unlock helper functions. Any more changes needed?

@mann-patwa

Hey @Ahmedhossamdev

The failures are due to tests in the internal/restapi package.

1. Resource Leak in Test Setup

Issue: createTestApiWithRealTimeData was initializing a GTFSManager but failing to shut it down. This caused background goroutines and database connections to leak across tests.
Fix: Added gtfsManager.Shutdown() to the test cleanup function (sketch below).
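
A sketch of that cleanup, with hypothetical helper names (newTestGTFSManager and newRestAPI are stand-ins):

func createTestApiWithRealTimeData(t *testing.T) *RestAPI {
    gtfsManager := newTestGTFSManager(t) // hypothetical setup helper
    t.Cleanup(func() {
        // Shut the manager down so background goroutines and DB
        // connections don't leak across tests.
        gtfsManager.Shutdown()
    })
    return newRestAPI(gtfsManager) // hypothetical constructor
}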

2. Date-Dependent Test Failures

Issue: TestScheduleForRouteHandler and TestScheduleForStopHandler were failing because they relied on time.Now(). The static test data (raba.zip) only contains services for a specific date range, causing these tests to fail when run outside that window.
Fix: Updated these tests to explicitly use a valid date (2025-06-12) that is guaranteed to be supported by the test fixture.

@Ahmedhossamdev

@mann-patwa Great work!


@aaronbrethorst aaronbrethorst left a comment

Thanks for tackling this important issue! Supporting hot swap for the SQLite database during daily GTFS updates is a valuable improvement that will ensure SQL-backed endpoints always serve fresh data. The overall architecture of building a temporary database, then atomically swapping it in, is the right approach.

I found a few issues we'll need to address before merging.


Issues to Fix

1. Race Condition in DirectionCalculator

The DirectionCalculator now stores a reference to *Manager and directly accesses dc.gtfsManager.GtfsDB.Queries without holding any lock:

In internal/gtfs/direction_calculator.go:

func (dc *DirectionCalculator) CalculateStopDirection(ctx context.Context, stopID string) string {
    // Strategy 1: Check database for precomputed direction (O(1) lookup)
    stop, err := dc.gtfsManager.GtfsDB.Queries.GetStop(ctx, stopID)
    // ...
}

During a hot swap, ForceUpdate acquires staticMutex.Lock() and replaces manager.GtfsDB. But DirectionCalculator methods don't acquire any lock before accessing GtfsDB. This creates a data race where:

  • A handler calls DirectionCalculator.CalculateStopDirection()
  • Concurrently, ForceUpdate replaces manager.GtfsDB
  • The calculator may read a partially-swapped or closed database

Suggested fix: Either have DirectionCalculator methods acquire the read lock, or provide a thread-safe accessor method on Manager:

func (manager *Manager) GetQueries() *gtfsdb.Queries {
    manager.staticMutex.RLock()
    defer manager.staticMutex.RUnlock()
    return manager.GtfsDB.Queries
}

Then have DirectionCalculator call this method instead of directly accessing the field.


2. Potential Data Loss if Final DB Open Fails

In internal/gtfs/static.go, the ForceUpdate method renames the temp DB to the final path before verifying the new DB can be opened:

// Rename temp to final - this REPLACES the existing file
if err := os.Rename(tempDBPath, finalDBPath); err != nil {
    // ...
}

// Open the final DB - if this fails, we've already lost the old file!
finalGtfsDB, err := gtfsdb.NewClient(dbConfig)
if err != nil {
    logging.LogError(logger, "Error opening final GTFS DB", err)
    return err  // Bad state: old file is gone, new file can't be opened
}

If os.Rename succeeds but NewClient fails, the system is in a broken state - the old database file is gone and the manager is left with a stale GtfsDB pointer that references a deleted file.

Suggested fix: Keep the temp DB open, swap in-memory pointers first, then do file operations:

// Don't close the temp DB yet - keep it open and ready
manager.staticMutex.Lock()
oldGtfsDB := manager.GtfsDB
manager.GtfsDB = newGtfsDB // still pointing to tempDBPath
manager.staticMutex.Unlock()

// Now close the old DB, then rename files.
// If the rename fails, the new DB is still functional at tempDBPath.
Or alternatively, open the temp DB as the final DB first, verify it works, then update the config's path tracking.


3. Old Database File Cleanup Logic is Ineffective

In internal/gtfs/static.go:

if oldDBPath != "" && oldDBPath != finalDBPath {
    if err := os.Remove(oldDBPath); err != nil {
        // ...
    }
}

The condition oldDBPath != finalDBPath will almost always be false because:

  1. Initial load uses manager.config.GTFSDataPath → stored in GtfsDB
  2. Hot swap renames temp to finalDBPath (same as GTFSDataPath)
  3. GetDBPath() returns the path the client was opened with

Since the old client was opened with the same finalDBPath, and the new client is also opened with finalDBPath, this cleanup block never executes. The file cleanup happens implicitly via os.Rename overwriting, but this comment/code is misleading about what actually happens.

Suggested fix: Either remove this dead code block or clarify the intent. If the goal is to support different paths, ensure GetDBPath() reflects the actual file being used.


4. Missing Context Cancellation Checks

The ForceUpdate method accepts a context.Context but never checks for cancellation during its long-running operations:

func (manager *Manager) ForceUpdate(ctx context.Context) error {
    // No ctx.Err() checks throughout the function
    newStaticData, err := loadGTFSData(...)  // Could take minutes
    newGtfsDB, err := buildGtfsDB(...)       // Could take minutes
    // ...
}

The context is passed down to buildStopSpatialIndex but not to loadGTFSData or buildGtfsDB. If the caller cancels (e.g., server shutdown with the 5-minute timeout in updateStaticGTFS), the operation continues anyway.

Suggested fix: Add cancellation checks at key points:

if err := ctx.Err(); err != nil {
    return err
}

Minor Suggestions

Test Hardcoding

In internal/restapi/schedule_for_route_handler_test.go and schedule_for_stop_handler_test.go, you've added hardcoded dates:

"/api/where/schedule-for-route/"+routeID+".json?key=TEST&date=2025-06-12"

This fixes the tests for now, but they'll break when the test data changes or expires. Consider either:

  • Using a date derived from the test data's calendar validity period (rough sketch below)
  • Adding a comment explaining why the specific date was chosen
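
A hypothetical sketch of the first option; GetCalendarRange is an assumed query returning the fixture's service window:

func validServiceDate(t *testing.T, q *gtfsdb.Queries) string {
    t.Helper()
    start, end, err := q.GetCalendarRange(context.Background()) // assumed query
    if err != nil {
        t.Fatalf("calendar range: %v", err)
    }
    // Use the midpoint of the window so the date stays valid even as
    // the fixture's range shifts.
    mid := start.Add(end.Sub(start) / 2)
    return mid.Format("2006-01-02")
}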

Import Ordering

The import reordering in cmd/api/app.go and vehicles_for_agency_handler_test.go is fine (stdlib first, then external), though it's unrelated to the feature. Just mentioning for awareness.


What's Working Well

  • The ForceUpdate method provides a clean public API for triggering updates
  • The TestManager_HotSwapConcurrency test is a good approach for validating thread safety
  • Adding GetDBPath() to the client is a nice utility method
  • The cleanup function addition to createTestApiWithRealTimeData fixes a resource leak

Once the race condition in DirectionCalculator and the failure-recovery issue are addressed, this PR will be in good shape.

@mann-patwa

Hey @aaronbrethorst, thanks for the review!

  1. DirectionCalculator and locking
    Regarding DirectionCalculator acquiring the lock — my thought was to let the handlers acquire the lock instead. Since DirectionCalculator is always used through these handlers, all DB reads would already be thread-safe.

    If we acquire the lock at the DirectionCalculator level, there’s a possibility of inconsistent data, especially if a WriteLock is taken in between DB accesses within the handler. Additionally, a deadlock could arise if both the handler and DirectionCalculator attempt to acquire locks (recursive lock issue).

  2. Renaming the DB file vs swapping pointers
    I initially chose to rename the file before opening a client based on Unix/macOS behavior:
    Unix-like systems use file descriptors (inodes) that point to the file’s data on disk, not its name. Renaming a file only changes the directory entry; the application keeps its connection via the inode.

    That said, if we don’t switch the GtfsDB pointer and the file is deleted, the process will still hold a reference to the inode — but we effectively lose the file on disk.

    Your approach makes sense though. I’ll keep the tempDB open, swap the pointers, and then rename the file. We’ll also need to update the config, since it would still think we’re pointing to tempDBPath.

  3. Cleanup logic
    Agreed, the explicit cleanup isn’t necessary since os.Rename implicitly handles it. I’ll remove that dead code.

  4. Context cancellation
    You’re right — I missed the context cancellation checks. I’ll add those.

I'll need clarification on point 1 before proceeding.

Linked issue: feat: Support SQLite GTFS Hot-Swap on Daily Static Updates (#157)