add script to create training data#2642

Merged
chris48s merged 3 commits into master from 20260112-training on Jan 26, 2026

@chris48s
Member

This PR adds a management command which sets up some test data we can use for training.

I've based it off this SOPN

local.havering.2022-05-05.pdf

because it contains a number of useful features:

  1. It is for a council in London
  2. The whole council is one PDF and we need to split it into pages
  3. Most of the wards are split across more than one page
  4. It has several large wards: the biggest has 14 candidates, and several others have 12 or more
  5. It covers most of the common patterns for names that aren't just "SMITH John" etc

I think we can use the following 5 wards for training, and that would cover some large wards and the following names:

  • Harold Wood (Sally Omosun Onaiwu, Ian Sanderson)
  • Heaton (Wendy Brice-Thompson)
  • Mawneys (Alison De Melo, Christine Ann McGeary)
  • St Andrew's (Gerry O'Sullivan)
  • Gooshays (Grant Edward MacMaster)

I've set up Harold Wood with 0 suggested candidates.
Heaton, Mawneys and Gooshays have all got some suggested candidates on them, and they all match the SOPN.
St Andrew's has suggested candidates on it which cover various classes of mistakes/errors.

The one case this SOPN doesn't cover is a withdrawn candidacy. One option is to add test data for another election with a real SOPN that has a withdrawal on it. Another option is to produce a version of this PDF, for training purposes only, that has been doctored to include a withdrawal or two in our test wards. That would keep things pretty neat. I think the second option is probably better, but let me know what you think.


The way I've done this management command is:
You can run it, and it will remove the test election plus any related ballot/membership objects. This means we can dry run the training session and then "factory reset" the data, or run the session multiple times, resetting the data between each run.
However, this script doesn't create everything from scratch, so it will not run against a completely empty/clean DB. I have tried to minimise the dependencies by making relations null where possible, but it does expect certain DB objects to already exist, so we need to import a real dump of some description and then run this on top of that. It should be relatively agnostic about the point in time the dump was taken, though: if we take another export in a month's time, this should still run on top of that base.
Specifically, this script assumes the existence of:

  • Organization object for local-authority:havering
  • Post objects for the 2022-onwards wards of Havering
  • Party objects for:
    • Conservative Party (PP52)
    • Labour Party (PP53)
    • Labour & Co-op Party (joint-party:53-119)
  • Various Person objects
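
As a rough illustration of the prerequisites listed above, a fail-fast check might be sketched like this. It is plain Python with an in-memory set standing in for the imported DB dump; `FAKE_DB`, `require`, `check_prerequisites` and `MissingPrerequisite` are hypothetical names for this sketch, not part of YNR. The real command fetches the objects through the Django ORM instead.

```python
class MissingPrerequisite(Exception):
    """Raised when an object the script depends on is absent."""


# Hypothetical stand-in for the imported DB dump: (kind, identifier) pairs.
FAKE_DB = {
    ("Organization", "local-authority:havering"),
    ("Party", "PP52"),
    ("Party", "PP53"),
    ("Party", "joint-party:53-119"),
}


def require(kind, identifier, db=FAKE_DB):
    """Return the key if it exists, otherwise fail loudly instead of ploughing on."""
    key = (kind, identifier)
    if key not in db:
        raise MissingPrerequisite(
            f"{kind} {identifier!r} not found: import a real DB dump first"
        )
    return key


def check_prerequisites(db=FAKE_DB):
    """Check the organisation and the three parties named in the list above."""
    require("Organization", "local-authority:havering", db)
    for party_id in ("PP52", "PP53", "joint-party:53-119"):
        require("Party", party_id, db)
    return True
```

Running `check_prerequisites(db=set())` raises `MissingPrerequisite` on the first missing object, which is the behaviour you would want when the wrong dump (or no dump) has been imported.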

There are probably some places I could DRY up or whatever, but I'm treating this more like throwaway code we will run once than code we plan to maintain for a long time.

Comment on lines +47 to +48
polling_day = date(2026, 5, 8)
polling_day_text = "2026-05-08"
Member Author

I've hard-coded the date as 8 May 2026 (i.e. 1 day after the "real" elections). We could read the date in as a param if you wanted.
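
For reference, reading the date in as a param could look roughly like the sketch below. It uses plain `argparse` so it runs on its own; in a Django management command the same `add_argument` call would sit inside `add_arguments()`. The `--polling-day` flag and the `build_parser` and `ballot_id` helpers are hypothetical, and the merged script hard-codes the date instead.

```python
import argparse
from datetime import date

DEFAULT_POLLING_DAY = date(2026, 5, 8)


def build_parser():
    parser = argparse.ArgumentParser(description="Create training data (sketch)")
    # Hypothetical flag: parses YYYY-MM-DD into a datetime.date.
    parser.add_argument(
        "--polling-day",
        type=date.fromisoformat,
        default=DEFAULT_POLLING_DAY,
        help="Polling day in YYYY-MM-DD format",
    )
    return parser


def ballot_id(ward_slug, polling_day):
    """Build a ballot ID in the same shape as the IDs used in this PR."""
    return f"local.havering.{ward_slug}.{polling_day.isoformat()}"


args = build_parser().parse_args(["--polling-day", "2026-05-08"])
print(ballot_id("st-andrews", args.polling_day))
# prints: local.havering.st-andrews.2026-05-08
```

Deriving the ID text from the `date` object means `polling_day` and `polling_day_text` cannot drift apart.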

Comment on lines +168 to +177
# This person is standing in this ward, but we've put them down as
# standing for Labour and Co-operative Party (joint-party:53-119)
# whereas they are actually standing for Labour Party (PP53).
# Easy mistake, but we will need to fix it.
self.create_membership(
    person_id=91518,
    party=labour_and_coop_party,
    ballot_id=f"local.havering.st-andrews.{polling_day_text}",
    org=org,
)
Member Author

Looking back at the real 2022 ballots, we didn't actually handle this Labour/Labour & Co-op issue correctly in all cases for the 2022 election.
Is that a function of the featureset of YNR at the time and something we do better now, or is this not actually a useful edge case to train on?

Member

I think this is a mistake in our data from back then. I wonder how we still have this problem, given the work Will DM did on importing results.

Member

(It's a useful thing for us to catch, but FWIW, the BBC will squash the parties together in the end product)

Member Author

@chris48s chris48s Jan 19, 2026

OK. Do you think it would be more useful to set up a case where the person already in the DB is standing for a totally different party from the one they are listed with on the SOPN, to add that case as well as this one, or to just stick with this?

Comment on lines +125 to +133
(f"local.havering.heaton.{polling_day_text}", 91359),
(f"local.havering.heaton.{polling_day_text}", 91363),
(f"local.havering.heaton.{polling_day_text}", 43250),
(f"local.havering.mawneys.{polling_day_text}", 42931),
(f"local.havering.mawneys.{polling_day_text}", 42935),
(f"local.havering.mawneys.{polling_day_text}", 42937),
(f"local.havering.gooshays.{polling_day_text}", 91387),
(f"local.havering.gooshays.{polling_day_text}", 42361),
(f"local.havering.gooshays.{polling_day_text}", 91389),
Member Author

Person IDs are pretty load-bearing, so I'm just hard-coding magic numbers here. Do you know off the top of your head what would happen if we merged one of these into another person record in future? Would that just work, or would we need to update this script if that happens?

Member

No, it won't just work. I guess at this stage we can just raise a useful exception, but if you wanted you could catch that and look up the old ID from the PersonRedirect model.
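
A minimal sketch of that fallback, assuming a redirect is just a mapping from an old person ID to the ID it was merged into. Plain dicts and sets stand in for the Person and PersonRedirect tables; `resolve_person_id`, `PersonNotFound` and the ID `999999` are hypothetical, and the real model's field names are not shown in this PR.

```python
class PersonNotFound(Exception):
    """Raised when a hard-coded person ID cannot be resolved."""


def resolve_person_id(person_id, people, redirects, max_hops=10):
    """Follow old -> new redirects until an existing person ID is found.

    `people` is a set of existing IDs and `redirects` maps a merged (old) ID
    to its replacement. Giving up after `max_hops` lookups also guards
    against redirect loops.
    """
    current = person_id
    for _ in range(max_hops):
        if current in people:
            return current
        if current not in redirects:
            break
        current = redirects[current]
    raise PersonNotFound(
        f"Person {person_id} does not exist and has no usable redirect; "
        "update the hard-coded ID in the script"
    )


# Hypothetical merge: 91359 was merged into a made-up record 999999.
people = {43250, 999999}
redirects = {91359: 999999}
print(resolve_person_id(91359, people, redirects))
```

Either way the script fails with a message that says which hard-coded ID needs attention, which is the "useful exception" part.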

Member

@symroe symroe left a comment

I think this is ok, and I assume you've tested it.

Have you, e.g., added some election results and then tried to delete the objects? More generally, I think we might have some issues where we can't cascade the deletes we need.

I also get that this isn't meant to be run from an empty DB, but it might be useful to document the data that needs to exist. For example the org is fetched with a .get, so it will fail if we don't have the right DB dump imported. This isn't a major problem, but it might be handy in the future to know what the pre-requisites are.

@chris48s
Member Author

> I also get that this isn't meant to be run from an empty DB, but it might be useful to document the data that needs to exist. For example the org is fetched with a .get, so it will fail if we don't have the right DB dump imported. This isn't a major problem, but it might be handy in the future to know what the pre-requisites are.

Added in b05d5ae
I think fetching the DB objects we need with a .get() so we fail if they aren't there is preferable to attempting to plough on.

@chris48s
Member Author

> I think we might have some issues where we can't cascade the deletes we need

Following on from our conversation about this, I did some testing and realised that just calling .delete() on the election object does CASCADE and SET_NULL everything as expected because I'm not calling the safe_delete() methods on Election or Ballot.

So I think the teardown code is actually good as it stands.
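
To make the CASCADE / SET_NULL behaviour concrete, here is a small stand-alone analogy using the standard-library `sqlite3` module. The table names and the election slug are hypothetical and are not YNR's schema. Note that Django normally emulates `on_delete` in the ORM rather than relying on database-level constraints like these, so this only illustrates the end result of a plain `.delete()`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript(
    """
    CREATE TABLE election (id INTEGER PRIMARY KEY, slug TEXT);
    CREATE TABLE ballot (
        id INTEGER PRIMARY KEY,
        election_id INTEGER NOT NULL REFERENCES election(id) ON DELETE CASCADE
    );
    CREATE TABLE membership (
        id INTEGER PRIMARY KEY,
        ballot_id INTEGER NOT NULL REFERENCES ballot(id) ON DELETE CASCADE
    );
    -- A nullable relation: the row should survive with its link cleared.
    CREATE TABLE log_entry (
        id INTEGER PRIMARY KEY,
        ballot_id INTEGER REFERENCES ballot(id) ON DELETE SET NULL
    );
    INSERT INTO election VALUES (1, 'local.havering.2026-05-08');
    INSERT INTO ballot VALUES (1, 1);
    INSERT INTO membership VALUES (1, 1);
    INSERT INTO log_entry VALUES (1, 1);
    """
)

# Deleting the election removes its ballots and memberships, and nulls the
# nullable reference, without touching the log row itself.
conn.execute("DELETE FROM election WHERE id = 1")

counts = {
    table: conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    for table in ("ballot", "membership")
}
log_ballot = conn.execute("SELECT ballot_id FROM log_entry").fetchone()[0]
print(counts, log_ballot)
# prints: {'ballot': 0, 'membership': 0} None
```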

I reckon this PR is probably good to go.

@chris48s chris48s merged commit 06db28b into master Jan 26, 2026
6 checks passed