Retry peeruserimport task on Database or connection errors#13821
Conversation
Force-pushed from 4e2fbc9 to 236f654
rtibbles left a comment
I think we can maintain the current separation of concerns, and it may be worth the effort of adding a new column to track the retries rather than keeping it in the extra_metadata.
To allow us to migrate the SQLAlchemy table, adding alembic as a dependency feels a bit heavy duty. So perhaps the answer is to clear the jobs table of any finished tasks, then dump the remainder to a temporary CSV, clear the table, recreate, and then reload the data?
    permission_classes=None,
    long_running=False,
    status_fn=None,
    retry_on=None,
Good job avoiding a classic Python gotcha! (Passing mutable values such as [] as default arguments is a very common mistake that can cause issues.)
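A minimal illustration of the gotcha, with generic names rather than the PR's actual code:

```python
def register(retry_on=[]):  # BAD: the default list is created once and shared
    retry_on.append(ValueError)
    return retry_on

def register_safe(retry_on=None):  # GOOD: None default, as in this PR
    if retry_on is None:
        retry_on = []
    retry_on.append(ValueError)
    return retry_on

assert register() == [ValueError]
assert register() == [ValueError, ValueError]  # state leaked between calls!
assert register_safe() == [ValueError]  # fresh list every time
```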
    total_progress=0,
    result=None,
    long_running=False,
    retry_on=None,
I feel like we don't need to store this in the job object - we're not allowing this to be customized per job, only per task - so I think we can just reference this from the task itself, rather than having to pass it in at job initialization. This also saves us having to coerce the exception classes to import paths.
    )
    setattr(current_state_tracker, "job", None)

def should_retry(self, exception):
I think I'd rather defer all this logic to the reschedule_finished_job_if_needed method on the storage class, rather than having it in the job class.
def should_retry(self, exception):
    retries = self.extra_metadata.get("retries", 0) + 1
    self.extra_metadata["retries"] = retries
I am a bit iffy about using extra_metadata for tracking this - if we want to hack the existing schema, I think 'repeat' is probably a better place for it, but I wonder whether we should instead add an error_retries column to the job table schema, so that we can put a sensible default in place for failing tasks and they don't retry endlessly.
I also think I'd rather have the retry interval defined by the task registration (we could also set a sensible default when retryable exceptions are set).
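As a sketch of that suggestion (the error_retries column, the task lookup, and the cap value are all illustrative assumptions, not the PR's final design):

```python
MAX_ERROR_RETRIES = 3  # illustrative bounded default

class Job:
    # ... other fields elided ...
    def should_retry(self, exception):
        # retry_on would come from the task registration, not the job
        retry_on = tuple(self.task.retry_on or ())
        if not isinstance(exception, retry_on):
            return False
        # error_retries would be a dedicated column on the job table,
        # replacing the extra_metadata counter shown in the diff above
        return self.error_retries < MAX_ERROR_RETRIES
```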
Force-pushed from 6a0c872 to 55e3dba
Force-pushed from a2765cb to 1a2f204
from django import db

# Destroy current connections and create new ones:
db.connections.close_all()
db.connections = db.ConnectionHandler()
I have removed these db.connections overrides and used patch("django.db.connections") instead. The overrides were having side effects on the job tests that involve multiple threads, and were messing things up in the teardown process.
However, I am not sure whether removing these lines might somehow cause a false positive in the test.
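A minimal sketch of the patching approach described, assuming a hypothetical run_import_job helper and an illustrative assertion:

```python
from unittest import mock

# Patch the handler for the duration of the test instead of reassigning
# django.db.connections, which leaked across threads and broke teardown.
with mock.patch("django.db.connections") as mock_connections:
    run_import_job()  # hypothetical helper exercising the retry path
    mock_connections.close_all.assert_called()  # illustrative assertion
```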
from django import db

db.connections["default"].connection = None
Same as above: I will rely on the django.db.connections patch instead, but I am not sure whether this may cause false positives.
Force-pushed from cd9821f to 402dadd
rtibbles left a comment
The Exception/BaseException validation needs to be cleaned up, as does the DatabaseLockedError, as I don't think it will catch what we are hoping it will catch.
The Pragma setting, if it's not being done for the additional databases, can be deferred to a follow-up.
Importing storage from main is not a blocker, just a thought.
if not isinstance(retry_on, list):
    raise TypeError("retry_on must be a list of exceptions")
for item in retry_on:
    if not issubclass(item, Exception):
We should change this to BaseException - it's a little uncommon, but sometimes exceptions are subclassed from this rather than the Exception class: https://docs.python.org/3/library/exceptions.html#BaseException
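A sketch of the suggested check; the isinstance(item, type) guard is an added assumption so that issubclass does not raise a TypeError on non-class values:

```python
def validate_retry_on(retry_on):
    if not isinstance(retry_on, list):
        raise TypeError("retry_on must be a list of exceptions")
    for item in retry_on:
        # BaseException also covers exception classes that deliberately
        # bypass Exception, e.g. KeyboardInterrupt and SystemExit.
        if not (isinstance(item, type) and issubclass(item, BaseException)):
            raise TypeError("retry_on must be a list of exception classes")
```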
 def set_sqlite_pragmas(self):
     """
-    Sets the connection PRAGMAs for the sqlalchemy engine stored in self.engine.
+    Sets the connection PRAGMAs for the sqlite database.
Now this is managed via Django... I think we should be doing this already, and if we're not doing it for all of the additional DBs, we should be.
Yes! I recall that we were just doing this for the default DB - that's why I kept this function here.
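One way to do this for every database via Django is the connection_created signal; a sketch, with illustrative pragma values:

```python
from django.db.backends.signals import connection_created

def apply_sqlite_pragmas(sender, connection, **kwargs):
    # Fires for every connection Django opens, so additional databases
    # receive the same pragmas as the default one.
    if connection.vendor == "sqlite":
        with connection.cursor() as cursor:
            cursor.execute("PRAGMA journal_mode = WAL;")

connection_created.connect(apply_sqlite_pragmas)
```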
 def _update_job(self, job_id, state=None, **kwargs):
-    with self.session_scope() as session:
+    with transaction.atomic(using=self._get_job_database_alias()):
I assume this is needed because transaction.atomic by default only operates on the default database?
| "saved_job": job.to_json(), | ||
| } | ||
|
|
||
| if orm_job: |
Could potentially use update_or_create here - but given that we already know whether the row exists, this seems fine to me.
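For reference, a sketch of the update_or_create alternative; ORMJob and the field names are assumptions based on the surrounding diff:

```python
# update_or_create collapses the "if orm_job: update / else: create"
# branches into a single call.
orm_job, created = ORMJob.objects.update_or_create(
    id=job.job_id,  # hypothetical primary key field
    defaults={
        "state": job.state,
        "saved_job": job.to_json(),
    },
)
```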
return executor(max_workers=max_workers)


class DatabaseLockedError(OperationalError):
I am not sure when this would ever get raised: we have defined it here, but we never actually use it.
For this to work, it would have to be raised by the sync task that has it as an exception it can retry on. We have some similar logic in our middleware that raises 502s on requests - perhaps we could create a broader context manager that catches OperationalErrors and reraises them as DatabaseLockedErrors if they meet the criterion?
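A minimal sketch of the suggested context manager; matching on the message text is an assumption about how SQLite reports lock contention:

```python
from contextlib import contextmanager
from django.db.utils import OperationalError

class DatabaseLockedError(OperationalError):
    """Narrowed, retryable variant of OperationalError."""

@contextmanager
def reraise_locked_errors():
    try:
        yield
    except OperationalError as e:
        # SQLite reports lock contention in the error message; other
        # OperationalErrors propagate unchanged.
        if "database is locked" in str(e):
            raise DatabaseLockedError(str(e)) from e
        raise
```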
I was a bit confused when creating this class, so I removed it and used the OperationalError class instead!
raise TypeError("time delay must be a datetime.timedelta object")


def validate_exception(value):
Is this being used? It seems that this validation was happening inline elsewhere? (noting that here BaseException is being used though!)
Yes! It is being used here: https://github.com/AlexVelezLl/kolibri/blob/402dadd608f01e48679bb4d528389d5ee93553f4/kolibri/core/tasks/storage.py#L517.
I think the inline validation you are talking about is this one: https://github.com/AlexVelezLl/kolibri/blob/fix-lod-import-multi-users/kolibri/core/tasks/registry.py#L270 - but that one validates the exception class, whereas this validate_exception validates the exception instance.
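To make the distinction concrete, a small illustrative sketch (the body is an assumption, not the linked code):

```python
def validate_exception(value):
    # storage-side check: value is an exception *instance*, e.g.
    # OperationalError("database is locked"), not an exception class
    if not isinstance(value, BaseException):
        raise TypeError("value must be an exception instance")
```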
-connection = db_connection()
-
-storage = Storage(connection)
+storage = Storage()
I wonder... could we just import the storage object from main here?
Seems like a good idea! 😅
 self.future_job_mapping = {}

-self.storage = Storage(connection)
+self.storage = Storage()
Thanks @rtibbles! I have addressed all your comments!
rtibbles left a comment
All my comments are addressed! Let's get this QAed.
@pcenov @radinamatic hopefully the issue has the details needed for replication. This has ended up being a slightly larger refactor, so it is worth doing some additional smoke tests of async tasks, like content imports, and also checking some different syncs. There's also a possibility of regression in the Android app, so if we can test the import workflows on Android as well, that would be very helpful.
Force-pushed from 6982ecd to 6c5c778
Hi @pcenov! This should be ready for another round of QA. I have made some changes that may have fixed these two comments: #13821 (comment) and #13821 (comment). In general, it should be okay if some tasks fail due to server overload; if we retry them manually, we should eventually be able to import the learners. Infinite loads should be fixed now.
Hi @AlexVelezLl - it seems that this improvement should be discussed further with @rtibbles so that we can get a better idea of exactly what the expected behaviour is and how many users it should be possible to import without too much trouble. Currently, while I am technically able to import a few users simultaneously, when I try to do that with, let's say, about 20 users, the import takes a very long time, it increases the CPU usage of the devices, I am still getting multiple errors in the console, and many of the users are not getting imported:
multiple.users.mp4
WindowsServer.zip
Unless we have clarity on the number of users we can handle reliably, I can't reliably confirm through manual testing that this is an actual improvement, as I am constantly getting inconsistent results.
Thanks @pcenov - yes, it will heavily depend on the device and server resources. We had discussed with @rtibbles that we could potentially build a bulk import flow, where the admin can select multiple users at the same time and import them in a single action, but that will require some more decisions. Another option, @rtibbles, would be to run just one task at a time and make the create-task request only after the previous task has completed; that would take much longer, but would reduce errors much more. The other thing is that if the devices don't have many resources, could we prompt the admin to adjust the …
Follow-up for further improvement: #14238
rtibbles left a comment
This is an improvement on the current workflow. We've added a follow-up to handle this completely robustly, but there's a lot of work here that will be more generally useful, and I don't want the perfect to be the enemy of the good.
Merged b1624d0 into learningequality:release-v0.19.x
Summary
- Adds a retry_on argument to the @task decorator to specify a list of potential non-deterministic exceptions that can be retried if the job failed because of them.
grab.mov
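For illustration, a hedged usage sketch of the new argument; the exception list is an assumption, and the @task import is omitted:

```python
from django.db.utils import OperationalError

@task(retry_on=[OperationalError, ConnectionError])
def peeruserimport(**kwargs):
    # If the job fails with one of the retry_on exceptions, it is
    # rescheduled instead of being marked permanently failed.
    ...
```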
References
Closes #11836.
Reviewer guidance