Hot Swap DB on static updates of gtfs source #158
base: main
Conversation
After reviewing the concurrency model, I decided to proceed with using a single lock to protect both.

Changelog

Next Steps

Once you confirm that this approach looks correct, I'll proceed with updating the handlers accordingly.
Hey @aaronbrethorst, I think we'll also need to update [...]. Please correct me if I'm missing something here.
@mann-patwa Could you please update the config.example.json file? Also run make lint.
@Ahmedhossamdev what needs to be changed in the config.example.json file?
Oops, I thought you had added a new attribute to the config.json file that doesn't exist in the example file.
@Ahmedhossamdev You can run the lint now; it should work.
Ahmedhossamdev left a comment
Overall, this is great work. Thanks a lot for your contribution! <3
internal/gtfs/static.go
Outdated
```go
// 2. Build new DB
// Generate a unique filename for the new DB
timestamp := time.Now().Format("20060102_150405")
newDBPath := fmt.Sprintf("%s_%s.db", manager.config.GTFSDataPath, timestamp)
```
I think we can simplify the update flow by always ending up with a stable database name like gtfs.db, the same name as the old DB after we delete it.
This keeps the application logic simple, since the server always opens gtfs.db, and it avoids dynamic path handling.
I did think about this approach, but it feels a bit finicky. Opening a database with an already existing name, or opening it under a different name and then renaming it while it’s in use, doesn’t seem ideal.
That said, I can implement it this way if you think it's the better approach; let me know your preference.
You can build the new DB separately, then atomically rename it to gtfs.db once it’s ready. Old DB queries can finish safely, and in-memory data is swapped together, so nothing will break.
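A rough sketch of what I mean (the helper names here are illustrative, not from the codebase):

```go
package gtfs

import (
	"os"
	"path/filepath"
)

// rebuildAtStablePath builds the new database next to gtfs.db and renames it
// into place. os.Rename is atomic on the same filesystem, so gtfs.db always
// points at either the old complete file or the new complete file.
func rebuildAtStablePath(dataDir string, build func(path string) error) (string, error) {
	finalPath := filepath.Join(dataDir, "gtfs.db")
	tempPath := finalPath + ".tmp"

	if err := build(tempPath); err != nil { // long-running; no lock held here
		os.Remove(tempPath)
		return "", err
	}
	// The caller opens the new client and swaps the in-memory pointer under
	// the write lock right after this rename, so in-flight queries against
	// the old handle can still finish.
	if err := os.Rename(tempPath, finalPath); err != nil {
		os.Remove(tempPath)
		return "", err
	}
	return finalPath, nil
}
```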
ok, will do that!
internal/gtfs/static.go
Outdated
```go
timestamp := time.Now().Format("20060102_150405")
newDBPath := fmt.Sprintf("%s_%s.db", manager.config.GTFSDataPath, timestamp)

// FIX: Use manager.isLocalFile here too
```
Could you remove comments that look like this one?
internal/gtfs/gtfs_manager.go
Outdated
```go
	blockLayoverIndices map[string][]*BlockLayoverIndex
}

func (manager *Manager) RUnlock() {
```
Do we need these RLock / RUnlock helper methods? Since they just delegate to staticMutex, we might be able to simplify by using the mutex directly.
And I notice that we only use them in the swap tests.
Yes, I'll remove the helper methods.
I had one question regarding the locking strategy: should we acquire the locks at the handler level (for example, via middleware)? This approach would ensure that we wait for all ongoing requests to finish before acquiring the write lock (WLock) for the database change.
The alternative approach is to acquire a read lock (RLock) only when accessing GtfsManager.GtfsDB. However, this could potentially lead to a deadlock scenario. For example:
- func1 acquires an RLock
- func1 calls func2, which attempts to acquire another RLock
- A writer arrives between these two acquisitions and blocks further RLocks
- The writer then waits for func1 to release its RLock
- func2 is blocked waiting for the RLock, resulting in a deadlock
Let me know if I’m missing something here or if there’s a safer approach you’d recommend.
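To illustrate the scenario (a self-contained sketch with made-up names, not code from this PR; Go's sync.RWMutex documentation explicitly prohibits recursive read locking for exactly this reason):

```go
package rlockdemo

import "sync"

var staticMutex sync.RWMutex

func handlerLevelRead() {
	staticMutex.RLock()
	defer staticMutex.RUnlock()
	helperRead() // if ForceUpdate calls Lock() right now, helperRead can block forever
}

func helperRead() {
	staticMutex.RLock() // recursive read lock: unsafe once a writer is queued
	defer staticMutex.RUnlock()
	// ... read GtfsDB / staticData ...
}
```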
That's why we use sync.RWMutex: it allows multiple readers to hold RLocks concurrently, even if acquired sequentially.
Yes, it’s true that multiple readers can hold an RLock at the same time. However, if we attempt to acquire an RLock inside a function that already holds an RLock, we could still run into a deadlock scenario once a writer tries to acquire the WLock.
Recursive read-locking is the problem
How should we handle this? I was thinking that we acquire the RLock at the handler level, and that calls and accesses to gtfsManager.GtfsDB or gtfsManager.staticData inside the handler funcs don't need to acquire the lock, since we depend on the handler already holding the RLock.
Let me know if I'm missing something.
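Roughly what I'm picturing, as a sketch (the middleware and the Manager shape below are illustrative, not the actual code):

```go
package restapi

import (
	"net/http"
	"sync"
)

type Manager struct {
	staticMutex sync.RWMutex
	// GtfsDB, staticData, ...
}

// withStaticData holds the read lock for the whole request, so handler code
// can read GtfsDB freely while ForceUpdate's Lock() waits for in-flight
// requests to drain.
func withStaticData(m *Manager, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		m.staticMutex.RLock()
		defer m.staticMutex.RUnlock()
		next.ServeHTTP(w, r) // handlers never take the lock again themselves
	})
}
```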
Yes that's true, we can use the locks at the handler level.
I'll open a new PR for that, or should I change it in this one only?
internal/gtfs/gtfs_manager.go
Outdated
```go
func (manager *Manager) RUnlock() {
	manager.staticMutex.RUnlock()
}

func (manager *Manager) RLock() {
```
Same here
@Ahmedhossamdev take a look, I updated the dbPath to be [...]
Hey @Ahmedhossamdev, the failures are due to tests in [...]

1. Resource Leak in Test Setup. Issue: [...]
2. Date-Dependent Test Failures. Issue: [...]
@mann-patwa Great work!
aaronbrethorst left a comment
Thanks for tackling this important issue! Supporting hot swap for the SQLite database during daily GTFS updates is a valuable improvement that will ensure SQL-backed endpoints always serve fresh data. The overall architecture of building a temporary database, then atomically swapping it in, is the right approach.
I found a few issues we'll need to address before merging.
Issues to Fix
1. Race Condition in DirectionCalculator
The DirectionCalculator now stores a reference to *Manager and directly accesses dc.gtfsManager.GtfsDB.Queries without holding any lock:
In internal/gtfs/direction_calculator.go:
```go
func (dc *DirectionCalculator) CalculateStopDirection(ctx context.Context, stopID string) string {
	// Strategy 1: Check database for precomputed direction (O(1) lookup)
	stop, err := dc.gtfsManager.GtfsDB.Queries.GetStop(ctx, stopID)
	// ...
}
```
During a hot swap, ForceUpdate acquires staticMutex.Lock() and replaces manager.GtfsDB. But DirectionCalculator methods don't acquire any lock before accessing GtfsDB. This creates a data race where:
- A handler calls DirectionCalculator.CalculateStopDirection()
- Concurrently, ForceUpdate replaces manager.GtfsDB
- The calculator may read a partially-swapped or closed database
Suggested fix: Either have DirectionCalculator methods acquire the read lock, or provide a thread-safe accessor method on Manager:
```go
func (manager *Manager) GetQueries() *gtfsdb.Queries {
	manager.staticMutex.RLock()
	defer manager.staticMutex.RUnlock()
	return manager.GtfsDB.Queries
}
```
Then have DirectionCalculator call this method instead of directly accessing the field.
2. Potential Data Loss if Final DB Open Fails
In internal/gtfs/static.go, the ForceUpdate method renames the temp DB to the final path before verifying the new DB can be opened:
```go
// Rename temp to final - this REPLACES the existing file
if err := os.Rename(tempDBPath, finalDBPath); err != nil {
	// ...
}

// Open the final DB - if this fails, we've already lost the old file!
finalGtfsDB, err := gtfsdb.NewClient(dbConfig)
if err != nil {
	logging.LogError(logger, "Error opening final GTFS DB", err)
	return err // Bad state: old file is gone, new file can't be opened
}
```
If os.Rename succeeds but NewClient fails, the system is in a broken state: the old database file is gone and the manager is left with a stale GtfsDB pointer that references a deleted file.
Suggested fix: Keep the temp DB open, swap in-memory pointers first, then do file operations:
```go
// Don't close temp DB yet - keep it open and ready
// manager.staticMutex.Lock()
// Swap the in-memory client pointer to the new DB
// manager.GtfsDB = newGtfsDB (still pointing to tempDBPath)
// manager.staticMutex.Unlock()
// Now close old DB, then rename files
// If rename fails, the new DB is still functional at tempDBPath
```
Or alternatively, open the temp DB as the final DB first, verify it works, then update the config's path tracking.
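As a rough illustration of the first option (the client type and helper names below are placeholders, not the project's actual API; only the ordering matters):

```go
package gtfs

import (
	"os"
	"sync"
)

type client struct{ path string }

func (c *client) Close() error { return nil } // stand-in for the real client

type Manager struct {
	staticMutex sync.RWMutex
	GtfsDB      *client
}

// swapThenRename publishes the already-open temp DB to readers before touching
// the filesystem, so a failed rename never leaves the manager pointing at a
// missing file.
func (m *Manager) swapThenRename(newDB *client, tempDBPath, finalDBPath string) error {
	m.staticMutex.Lock()
	oldDB := m.GtfsDB
	m.GtfsDB = newDB // still backed by tempDBPath and already verified to open
	m.staticMutex.Unlock()

	_ = oldDB.Close() // release the old handle before replacing the file

	// If this fails, the manager keeps serving from tempDBPath and the next
	// update cycle can retry; nothing references a deleted file.
	return os.Rename(tempDBPath, finalDBPath)
}
```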
3. Old Database File Cleanup Logic is Ineffective
In internal/gtfs/static.go:
```go
if oldDBPath != "" && oldDBPath != finalDBPath {
	if err := os.Remove(oldDBPath); err != nil {
		// ...
	}
}
```
The condition oldDBPath != finalDBPath will almost always be false because:
- Initial load uses manager.config.GTFSDataPath, which is stored in GtfsDB
- Hot swap renames the temp file to finalDBPath (the same as GTFSDataPath)
- GetDBPath() returns the path the client was opened with
Since the old client was opened with the same finalDBPath, and the new client is also opened with finalDBPath, this cleanup block never executes. The file cleanup happens implicitly via os.Rename overwriting, but this comment/code is misleading about what actually happens.
Suggested fix: Either remove this dead code block or clarify the intent. If the goal is to support different paths, ensure GetDBPath() reflects the actual file being used.
4. Missing Context Cancellation Checks
The ForceUpdate method accepts a context.Context but never checks for cancellation during its long-running operations:
```go
func (manager *Manager) ForceUpdate(ctx context.Context) error {
	// No ctx.Err() checks throughout the function
	newStaticData, err := loadGTFSData(...) // Could take minutes
	newGtfsDB, err := buildGtfsDB(...)      // Could take minutes
	// ...
}
```
The context is passed down to buildStopSpatialIndex but not to loadGTFSData or buildGtfsDB. If the caller cancels (e.g., server shutdown with the 5-minute timeout in updateStaticGTFS), the operation continues anyway.
Suggested fix: Add cancellation checks at key points:
```go
if err := ctx.Err(); err != nil {
	return err
}
```
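As a self-contained illustration of the pattern (the phase functions below stand in for the real loadGTFSData / buildGtfsDB steps):

```go
package gtfs

import "context"

// forceUpdateSketch runs each expensive phase in order, checking ctx.Err()
// between phases so a cancelled caller (e.g. server shutdown) stops the
// update early instead of letting it run to completion.
func forceUpdateSketch(ctx context.Context, phases []func() error) error {
	for _, phase := range phases {
		if err := ctx.Err(); err != nil {
			return err // caller cancelled or timed out; abandon the update
		}
		if err := phase(); err != nil {
			return err
		}
	}
	return nil
}
```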
Minor Suggestions
Test Hardcoding
In internal/restapi/schedule_for_route_handler_test.go and schedule_for_stop_handler_test.go, you've added hardcoded dates:
"/api/where/schedule-for-route/"+routeID+".json?key=TEST&date=2025-06-12"This fixes the tests for now, but they'll break when the test data changes or expires. Consider either:
- Using a date derived from the test data's calendar validity period
- Adding a comment explaining why the specific date was chosen
Import Ordering
The import reordering in cmd/api/app.go and vehicles_for_agency_handler_test.go is fine (stdlib first, then external), though it's unrelated to the feature. Just mentioning for awareness.
What's Working Well
- The ForceUpdate method provides a clean public API for triggering updates
- The TestManager_HotSwapConcurrency test is a good approach for validating thread safety
- Adding GetDBPath() to the client is a nice utility method
- The cleanup function addition to createTestApiWithRealTimeData fixes a resource leak
Once the race condition in DirectionCalculator and the failure-recovery issue are addressed, this PR will be in good shape.
Hey @aaronbrethorst, thanks for the review!
Will need clarification on [...]
resolves #157
Database Update Improvements

Thread-Safe Database Switching
- Added dbMutex (sync.RWMutex) to the Manager to protect access to the GtfsDB client.

Zero-Downtime Hot-Swap
- The updateStaticGTFS process builds a completely new SQLite database (gtfs.db.tmp) in the background.
- Once ready, it is atomically swapped in as gtfs.db, ensuring zero downtime.
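A condensed sketch of this design (only dbMutex and GtfsDB are named in this PR; the other identifiers below are illustrative):

```go
package gtfs

import "sync"

type DBClient struct{ /* queries, connection, ... */ }

type Manager struct {
	dbMutex sync.RWMutex
	GtfsDB  *DBClient
}

// db is what request handlers would call instead of reading GtfsDB directly;
// readers only ever take the read lock.
func (m *Manager) db() *DBClient {
	m.dbMutex.RLock()
	defer m.dbMutex.RUnlock()
	return m.GtfsDB
}

// swapDB publishes a freshly built client; the write lock is held only for
// the brief pointer exchange, and the caller closes the old client afterwards.
func (m *Manager) swapDB(newClient *DBClient) (old *DBClient) {
	m.dbMutex.Lock()
	defer m.dbMutex.Unlock()
	old, m.GtfsDB = m.GtfsDB, newClient
	return old
}
```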