
Retry peeruserimport task on Database or connection errors#13821

Merged
nucleogenesis merged 22 commits into learningequality:release-v0.19.x from AlexVelezLl:fix-lod-import-multi-users
Feb 24, 2026

Conversation

Member

@AlexVelezLl AlexVelezLl commented Oct 8, 2025

Summary

  • Adds support for a retry_on argument in the @task decorator to specify a list of transient, non-deterministic exceptions on which a failed job can be retried.
    • The user won't see the task as failed until it has been re-attempted 3 times.
  • Updates the setupwizard frontend to handle failed tasks.
  • Updates the setupwizard frontend to persist the users being imported.
  • Disables the back button on the import users page while users are being imported, to prevent unexpected page layouts.
  • Adds a semaphore on the frontend so that only 3 task creation requests run at a time.
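The retry_on idea can be sketched as toy code (this is not Kolibri's actual decorator; MAX_RETRIES, run_with_retries, and flaky_import are illustrative names for the pattern the PR describes):

```python
MAX_RETRIES = 3  # the PR surfaces failure only after 3 attempts


def task(retry_on=None):
    """Toy stand-in for the @task decorator's new retry_on argument."""
    def decorator(func):
        func.retry_on = tuple(retry_on or ())
        return func
    return decorator


def run_with_retries(func, *args, **kwargs):
    """Re-run func while it raises one of its retryable exceptions."""
    attempts = 0
    while True:
        try:
            return func(*args, **kwargs)
        except func.retry_on:
            attempts += 1
            if attempts >= MAX_RETRIES:
                raise  # give up: surface the failure to the user


calls = {"n": 0}


@task(retry_on=[ConnectionError])
def flaky_import():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return "imported"


print(run_with_retries(flaky_import))  # succeeds on the third attempt
```

The key design point is that only exceptions named in retry_on are retried; a deterministic error (say, a ValueError) still fails immediately.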
grab.mov

References

Closes #11836.

Reviewer guidance

@AlexVelezLl AlexVelezLl requested a review from rtibbles October 8, 2025 23:22
@github-actions github-actions Bot added DEV: backend Python, databases, networking, filesystem... APP: Setup Wizard Re: Setup Wizard (facility import, superuser creation, settings, etc.) DEV: frontend SIZE: medium labels Oct 8, 2025
@AlexVelezLl AlexVelezLl force-pushed the fix-lod-import-multi-users branch from 4e2fbc9 to 236f654 on October 8, 2025 23:26

@rtibbles rtibbles self-assigned this Oct 9, 2025
Member

@rtibbles rtibbles left a comment

I think we can maintain the current separation of concerns, and it may be worth the effort of adding a new column to track the retries rather than keeping it in the extra_metadata.

To allow us to migrate the SQLAlchemy table, adding alembic as a dependency feels a bit heavy duty. So perhaps the answer is to clear the jobs table of any finished tasks, then dump the remainder to a temporary CSV, clear the table, recreate, and then reload the data?
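The prune/dump/recreate/reload idea can be sketched with sqlite3 (the schema, column names, and state names here are hypothetical stand-ins for the real jobs table, and an in-memory list stands in for the temporary CSV):

```python
import sqlite3

# Hypothetical state names; the real jobs table has more columns.
FINISHED_STATES = ("COMPLETED", "CANCELED")


def migrate_jobs_table(conn):
    cur = conn.cursor()
    # 1. Prune finished tasks.
    placeholders = ",".join("?" * len(FINISHED_STATES))
    cur.execute(
        "DELETE FROM jobs WHERE state IN (%s)" % placeholders,
        FINISHED_STATES,
    )
    # 2. Dump the remainder (a temporary CSV in the real proposal).
    rows = cur.execute("SELECT id, state FROM jobs").fetchall()
    # 3. Drop, recreate with the new column, and reload.
    cur.execute("DROP TABLE jobs")
    cur.execute(
        "CREATE TABLE jobs (id TEXT PRIMARY KEY, state TEXT, "
        "error_retries INTEGER NOT NULL DEFAULT 0)"
    )
    cur.executemany("INSERT INTO jobs (id, state) VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY, state TEXT)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?)",
    [("a", "COMPLETED"), ("b", "QUEUED"), ("c", "RUNNING")],
)
migrate_jobs_table(conn)
print(conn.execute("SELECT id, state, error_retries FROM jobs").fetchall())
```

Only the unfinished jobs survive, and they pick up the new column's default.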

permission_classes=None,
long_running=False,
status_fn=None,
retry_on=None,
Member

Good job avoiding a classic Python gotcha! (Passing mutable values such as [] as default arguments is a very common mistake that can cause issues.)
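The gotcha the reviewer is praising the author for avoiding, in its classic form (illustrative function names):

```python
def append_bad(item, bucket=[]):
    # The default list is created ONCE, at function definition time,
    # so every call without an explicit bucket shares the same list.
    bucket.append(item)
    return bucket


def append_good(item, bucket=None):
    # The idiomatic fix: use None as a sentinel and build a fresh list.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


print(append_bad(1))   # [1]
print(append_bad(2))   # [1, 2]  <- state leaked between calls
print(append_good(1))  # [1]
print(append_good(2))  # [2]
```

This is why retry_on=None (with a later `retry_on or ()`) is the right default rather than retry_on=[].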

Comment thread kolibri/core/tasks/job.py Outdated
total_progress=0,
result=None,
long_running=False,
retry_on=None,
Member

I feel like we don't need to store this in the job object - we're not allowing this to be customized per job, only per task - so I think we can just reference this from the task itself, rather than having to pass it in at job initialization. This also saves us having to coerce the exception classes to import paths.

Comment thread kolibri/core/tasks/job.py Outdated
)
setattr(current_state_tracker, "job", None)

def should_retry(self, exception):
Member

I think I'd rather defer all this logic to the reschedule_finished_job_if_needed method on the storage class, rather than having it in the job class.

Comment thread kolibri/core/tasks/job.py Outdated

def should_retry(self, exception):
retries = self.extra_metadata.get("retries", 0) + 1
self.extra_metadata["retries"] = retries
Member

I am a bit iffy about using extra_metadata for tracking this - I think if we want to hack the existing schema, 'repeat' is probably a better place for this, but I wonder if instead we should add to the job table schema to add error_retries so that we can put a sensible default in place for failing tasks so they don't endlessly repeat.

I also think I'd rather have the retry interval defined by the task registration (we could also set a sensible default if retryable exceptions are set).
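The reviewer's suggestion (a dedicated error_retries column with a hard cap, instead of a counter in extra_metadata) might look roughly like this; the column name, cap, and signature are hypothetical:

```python
MAX_ERROR_RETRIES = 3  # hypothetical sensible default so tasks can't retry forever


def should_retry(job, exception, retryable_exceptions):
    """Decide retry eligibility from a dedicated counter, not extra_metadata."""
    # Only retry exceptions the task registration declared as transient.
    if not isinstance(exception, tuple(retryable_exceptions)):
        return False
    # A real DB column with a default gives every job a bounded retry budget.
    return job["error_retries"] < MAX_ERROR_RETRIES


job = {"error_retries": 0}
assert should_retry(job, ConnectionError(), [ConnectionError])

job["error_retries"] = MAX_ERROR_RETRIES
assert not should_retry(job, ConnectionError(), [ConnectionError])

# Deterministic errors are never retried, regardless of the counter.
assert not should_retry({"error_retries": 0}, ValueError(), [ConnectionError])
```

Keeping the counter in a schema column (rather than JSON metadata) also lets the storage layer enforce the cap with a default value.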

Comment on lines -145 to -149
from django import db

# Destroy current connections and create new ones:
db.connections.close_all()
db.connections = db.ConnectionHandler()
Member Author

I have removed these db.connections overrides and used patch("django.db.connections") instead. The overrides had side effects on the job tests that involve multiple threads, and were messing things up during teardown.

However, I'm not sure whether removing these lines might somehow cause a false positive in the test.

Comment on lines -206 to -208
from django import db

db.connections["default"].connection = None
Member Author

@AlexVelezLl AlexVelezLl Dec 10, 2025

Same as above: I will instead rely on the django.db.connections patch. But I'm not sure whether this may cause false positives.

@AlexVelezLl AlexVelezLl requested a review from rtibbles December 10, 2025 22:25
@rtibbles rtibbles changed the base branch from develop to release-v0.19.x December 17, 2025 21:39
@AlexVelezLl AlexVelezLl force-pushed the fix-lod-import-multi-users branch 2 times, most recently from cd9821f to 402dadd on January 5, 2026 19:29
Member

@rtibbles rtibbles left a comment

The Exception/BaseException validation needs to be cleaned up, as well as the DatabaseLockedError, as I don't think it will catch what we are hoping it will catch.

The Pragma setting, if it's not being done for the additional databases can be deferred to follow up.

Import of storage from main is not a blocker, just a thought.

Comment thread kolibri/core/tasks/registry.py Outdated
if not isinstance(retry_on, list):
raise TypeError("retry_on must be a list of exceptions")
for item in retry_on:
if not issubclass(item, Exception):
Member

We should change this to BaseException - it's a little uncommon, but sometimes exceptions are subclassed from this rather than the Exception class: https://docs.python.org/3/library/exceptions.html#BaseException
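The suggested validation, sketched (validate_retry_on is an illustrative name): checking against BaseException rather than Exception matters because some exception classes subclass BaseException directly.

```python
def validate_retry_on(retry_on):
    if not isinstance(retry_on, list):
        raise TypeError("retry_on must be a list of exceptions")
    for item in retry_on:
        # BaseException, not Exception: KeyboardInterrupt and SystemExit
        # subclass BaseException directly and would otherwise be rejected.
        if not (isinstance(item, type) and issubclass(item, BaseException)):
            raise TypeError("retry_on items must be exception classes")
    return retry_on


validate_retry_on([OSError, KeyboardInterrupt])  # accepted

# The original issubclass(item, Exception) check would reject KeyboardInterrupt:
print(issubclass(KeyboardInterrupt, Exception))      # False
print(issubclass(KeyboardInterrupt, BaseException))  # True
```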

Comment thread kolibri/core/tasks/storage.py Outdated
def set_sqlite_pragmas(self):
"""
Sets the connection PRAGMAs for the sqlalchemy engine stored in self.engine.
Sets the connection PRAGMAs for the sqlite database.
Member

Now this is managed via Django... I think we should be doing this already, and if we're not doing it for all of the additional DBs, we should be.

Member Author

Yes! I recall that we were only doing this for the default db; that's why I kept this function here.

Member Author

I think be1438b may solve this.


def _update_job(self, job_id, state=None, **kwargs):
with self.session_scope() as session:
with transaction.atomic(using=self._get_job_database_alias()):
Member

I assume this is needed because transaction.atomic by default only operates on the default database?

Member Author

Yes!

"saved_job": job.to_json(),
}

if orm_job:
Member

Could potentially use update_or_create here - but given that we already know, this seems fine to me.

Comment thread kolibri/core/tasks/utils.py Outdated
return executor(max_workers=max_workers)


class DatabaseLockedError(OperationalError):
Member

I am not sure when this would ever get raised, because we have defined it here, but then we are never using it?

For this to work, it would have to be raised by the sync task that has it as an exception that it can retry on? We have some similar logic in our middleware that raises 502s on requests - perhaps we could create a broader context manager that catches OperationalErrors and reraises them as DatabaseLockedErrors if it meets the criterion?
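The context-manager idea can be sketched with sqlite3's OperationalError (the real code deals with Django/SQLAlchemy exceptions; the DatabaseLockedError class and the message check here are illustrative):

```python
import sqlite3
from contextlib import contextmanager


class DatabaseLockedError(sqlite3.OperationalError):
    """Narrowed, retryable variant of a generic OperationalError."""


@contextmanager
def reraise_locked():
    # Catch generic OperationalErrors and re-raise the lock-contention
    # case as a dedicated type that retry_on can target.
    try:
        yield
    except sqlite3.OperationalError as exc:
        if "database is locked" in str(exc):
            raise DatabaseLockedError(str(exc)) from exc
        raise  # anything else is a real error; let it propagate


try:
    with reraise_locked():
        raise sqlite3.OperationalError("database is locked")
except DatabaseLockedError:
    print("lock contention -> retryable")
```

Since DatabaseLockedError subclasses OperationalError, existing handlers that catch the broad type keep working, while retry logic can target only the lock case.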

Member Author

I was a bit confused when creating this class; I've removed it and used the OperationalError class instead!

raise TypeError("time delay must be a datetime.timedelta object")


def validate_exception(value):
Member

Is this being used? It seems that this validation was happening inline elsewhere? (noting that here BaseException is being used though!)

Member Author

Yes! It is being used here https://github.com/AlexVelezLl/kolibri/blob/402dadd608f01e48679bb4d528389d5ee93553f4/kolibri/core/tasks/storage.py#L517.

I think the inline validation you are talking about is this one https://github.com/AlexVelezLl/kolibri/blob/fix-lod-import-multi-users/kolibri/core/tasks/registry.py#L270, but that one is validating the class; this validate_exception is validating the object.

Comment thread kolibri/core/tasks/worker.py Outdated
connection = db_connection()

storage = Storage(connection)
storage = Storage()
Member

I wonder.. could we just import the storage object from main here?

Member Author

Seems like a good idea! 😅

Comment thread kolibri/core/tasks/worker.py Outdated
self.future_job_mapping = {}

self.storage = Storage(connection)
self.storage = Storage()
Member

Likewise here.

@AlexVelezLl AlexVelezLl requested a review from rtibbles January 5, 2026 21:28
@AlexVelezLl
Member Author

Thanks @rtibbles! I have addressed all your comments!

Member

@rtibbles rtibbles left a comment

All my comments are addressed! Let's get this QAed.

@rtibbles rtibbles dismissed their stale review January 5, 2026 23:07

All comments addressed.

@rtibbles
Member

rtibbles commented Jan 5, 2026

@pcenov @radinamatic hopefully the issue has the details needed for replication - this has ended up being a slightly larger refactor, so doing some additional smoke tests of some async tasks, like content imports, and also checking some different syncs also.

There's also a possibility for regression in the Android App, so if we can test the import workflows on Android as well, that would be very helpful.

@pcenov pcenov self-requested a review January 6, 2026 08:59
@AlexVelezLl AlexVelezLl force-pushed the fix-lod-import-multi-users branch from 6982ecd to 6c5c778 on February 23, 2026 19:52
@AlexVelezLl
Member Author

Hi @pcenov! This should be ready for another round of QA. I have made some changes that may have fixed these two comments: #13821 (comment) and #13821 (comment).

In general, it should be okay if some tasks fail due to server overload, but if we retry them manually, we should eventually be able to import the learners. Infinite loads should be fixed now.

@pcenov
Member

pcenov commented Feb 24, 2026

Hi @AlexVelezLl - it seems that this improvement should be discussed further with @rtibbles so that we can get a better idea of exactly what the expected behaviour is and how many users it should be possible to import without too much trouble.

Currently, while I am technically able to import a few users simultaneously, when I try to do that with, say, about 20 users, the import takes a very long time, the CPU usage of the devices increases, I am still getting multiple errors in the console, and many of the users are not getting imported:

multiple.users.mp4

WindowsServer.zip
UbuntuLOD.zip

Unless we have clarity on the number of users that we can handle reliably, I can't confirm through manual testing that this is an actual improvement, as I am constantly getting inconsistent results.

@AlexVelezLl
Member Author

AlexVelezLl commented Feb 24, 2026

Thanks @pcenov - yes, it will depend heavily on the device and server resources. We had discussed with @rtibbles that we could potentially build a bulk import flow, where the admin selects multiple users at the same time and imports them in a single action, but that will require some more decisions.

Another option, @rtibbles, would be to run just one task at a time, making each create-task request only after the previous task has completed; that would take much longer but would reduce errors much more.

The other thing is that if the devices don't have many resources, could we prompt the admin to adjust the REGULAR_PRIORITY_WORKERS option?
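The frontend throttling being discussed is implemented in JavaScript, but the pattern (cap concurrent task-creation requests with a semaphore) can be sketched in Python with asyncio; MAX_IN_FLIGHT and the function names are illustrative:

```python
import asyncio

MAX_IN_FLIGHT = 3  # mirrors the frontend cap on concurrent task requests


async def create_import_task(user, sem):
    async with sem:                # at most MAX_IN_FLIGHT run concurrently
        await asyncio.sleep(0.01)  # stand-in for the HTTP request
        return "queued:" + user


async def import_all(users):
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    # gather fires all coroutines, but the semaphore throttles the
    # expensive section; results come back in submission order.
    return await asyncio.gather(*(create_import_task(u, sem) for u in users))


results = asyncio.run(import_all(["ana", "ben", "cai", "dee"]))
print(results)
```

Dropping MAX_IN_FLIGHT to 1 gives the strictly sequential option discussed above, trading throughput for fewer server-side errors.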

@AlexVelezLl
Member Author

Follow up for further improvement: #14238

Member

@rtibbles rtibbles left a comment

This is an improvement on the current workflow. We've added a follow up to handle this completely robustly, but there's a lot of work here that will be more generally useful, and I don't want the perfect to be the enemy of the good.

@nucleogenesis nucleogenesis merged commit b1624d0 into learningequality:release-v0.19.x Feb 24, 2026
57 checks passed
@AlexVelezLl AlexVelezLl deleted the fix-lod-import-multi-users branch February 24, 2026 21:44

Labels

APP: Setup Wizard Re: Setup Wizard (facility import, superuser creation, settings, etc.) DEV: backend Python, databases, networking, filesystem... DEV: dev-ops Continuous integration & deployment DEV: frontend SIZE: large SIZE: medium SIZE: very large

Development

Successfully merging this pull request may close these issues.

Setup Wizard - Confusing behavior when importing multiple learners

6 participants