-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathtscs_database_builder.tex
More file actions
963 lines (798 loc) · 45.2 KB
/
tscs_database_builder.tex
File metadata and controls
963 lines (798 loc) · 45.2 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
% if there's been a data update, run stats_and_visuals.ipynb to regenerate tables and figures
\documentclass{article}
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{enumitem}
\usepackage{tabularx}
\usepackage[export]{adjustbox} % to keep graphics centred even if they're wider than \textwidth
\usepackage[utf8]{inputenc}
\usepackage{geometry}
\geometry{
a4paper,
left=20mm,
right=20mm,
top=25mm,
bottom=25mm
}
\setlength{\parindent}{0pt}
\setlength{\parskip}{1em}
\title{Database of third sector organisations: schema and process}
\author{Fiona Couper Kenney, Alan Duggan}
\date{Revised March 3, 2026, FCK}
\begin{document}
\maketitle
\section{Introduction}
This document is a guide to the UK Third Sector and Civil Society Database, a ‘spine’ of all
registered organisations comprising the third sector and civil society in the United Kingdom, along with
linkages between organisations which appear in more than one register. The dataset is available at the project website\footnote{\url{https://uk-third-sector-database.github.io/}}.
Ten data sources are combined in order to build the organisational spine, listed in Table \ref{tab:sources}, and further detailed in appendices.
This document briefly introduces the processes of data collection, pre-processing of
data per source, and combining all sources to create one master list. The spine files are first introduced, followed by a description of the process of matching organisations between sources.
Preprocessing of each source is detailed in the appendix. The project source code is available on GitHub\footnote{\url{https://github.com/uk-third-sector-database/tso-database-builder}}.
\section{Building the organisational spine}
Ten data sources are combined to create the organisational spine. All data in the spine is open and publicly available. The sources and
their online location are listed in Table \ref{tab:sources}.
\begin{table}[h]
\centering
\begin{tabularx}{\textwidth}{|l|X|}
\hline
Data Source & URL \\ \hline
Charity Commission, England and Wales & \url{www.gov.uk/government/organisations/charity-commission}\\ \hline
Scottish Charity Register & \url{www.oscr.org.uk/}\\ \hline
Charity Commission, Northern Ireland & \url{www.charitycommissionni.org.uk/} \\ \hline
Mutuals Public Register & \url{mutuals.fca.org.uk} \\ \hline
Companies House & \url{www.gov.uk/government/organisations/companies-house} \\ \hline
Co-operatives & \url{www.uk.coop/} \\ \hline
Scottish Housing Register & \url{www.housingregulator.gov.scot/} \\ \hline
Social Housing England & \url{www.gov.uk/government/publications/registered-providers-of-social-housing} \\ \hline
Care Inspectorate Scotland & \url{www.careinspectorate.com} \\ \hline
Care Quality Commission & \url{www.cqc.org.uk}\\ \hline
\end{tabularx}
\caption{Data sources for the organisational spine}
\label{tab:sources}
\end{table}
The organisational spine consists of three csv files:
\begin{itemize}
\item TSCS\_spine.spine.csv contains a row for each organisation considered to make up the UK's third sector, with name, address and date details. The data dictionary
is found in Table \ref{tab:spine_fields}.
\item TSCS\_spine.supplementary.csv contains any name, address and date information associated with an organisation which is additional to what is in the spine,
such as 'Also Known As' names, previous addresses. See Table \ref{tab:supplementary_fields} for data dictionary
\item TSCS\_spine.matches.csv lists the organisations which are linked or matched according to our searches (see \ref{sec:match}), along with the type of match. Data dictionary
in Table \ref{tab:match_fields}
\end{itemize}
\begin{table}[h]
\centering
\begin{tabularx}{\textwidth}{|l|X|}
\hline
Fieldname & Notes \\
\hline \hline
uid & UID created for each source, e.g. GB-CHC-1234\\ \hline
organisationname & Name as primary in source\\ \hline
normalisedname & Name normalised to capitalise, remove extra spaces and punctuation - to aid matching \\ \hline
fulladdress & Full address as found in source with highest precedence across any matched organisations\\ \hline
city & City matching full address as defined above\\ \hline
postcode & Postcode matching full address as defined above\\ \hline
registerdate & Earliest registration date found in records for this organisation across all sources\\ \hline
removeddate & Latest dissolution date found in records for this organisation across all sources\\ \hline
source\_register & Source of the data, e.g. Charity Commission for England and Wales \\ \hline
is\_cic & Boolean indicating if organisation is a Community Interest Company\\ \hline
\end{tabularx}
\caption{Public spine csv fields}
\label{tab:spine_fields}
\end{table}
\begin{table}[h]
\centering
\begin{tabularx}{\textwidth}{|l|X|}
\hline
Fieldname & Notes \\ \hline \hline
uid & UID created for each source using source and id from original source \\ \hline
organisationname & Alternative name \\ \hline
normalisedname & (Alt name normalised) \\ \hline
fulladdress & Alternative address \\
city & Alternative address\\
postcode & Alternative address\\ \hline
registerdate & Alternative registration date \\ \hline
removeddate & Alternative removed / dissolved date \\ \hline
source\_register & Source of the data, e.g. 'Companies House', 'Charity Commission for England and Wales' \\ \hline
\end{tabularx}
\caption{Supplementary csv fields}
\label{tab:supplementary_fields}
\end{table}
\begin{table}[h]
\centering
\begin{tabularx}{\textwidth}{|l|X|}
\hline
Fieldname & Notes \\ \hline \hline
uid & UID created for each source, matches Table \ref{tab:spine_fields} - if match is unambiguous; blank otherwise\\ \hline
orgA\_id\_in\_source & ID in original source, for matched organisation A\\ \hline
orgA\_source & original source of organisation A\\ \hline
orgA\_uid & UID created for organisation A\\ \hline
orgB\_id\_in\_source & ID in original source, for matched organisation B\\ \hline
orgB\_source & original source of organisation B\\ \hline
orgB\_uid & UID created for organisation B\\ \hline
match\_type & Type of match (e.g. by name, by companyid)\\ \hline
\end{tabularx}
\caption{Matched organisations fields}
\label{tab:match_fields}
\end{table}
\clearpage
\subsection{What constitutes a match?}
\label{sec:match}
The main dataset has one row per organisation. If an organisation appears in two datasets,
the linked organisation is noted by a row in the matches lookup table. We prioritise those organisations on the
charity register, and add new organisations to the main dataset from other sources only if they do not match one already present in the set.
The following routes lead to a match:
\begin{itemize}
\item Matching by Company ID:
The charity regulator in England and Wales, CCEW, collects information about organisations' associated registration at Companies House.
If two charities share the same companyid they match the same companies house record and are therefore linked.
This same companyid is used to generate matches between CCEW and Companies House directly.
Our historic data from Scotland's charity regulator, OSCR, contains a company identifier for some organisations. This is used
for a deterministic match between OSCR and Companies House.
Organisations in the Cooperatives register may have an additional id, 'Registered Number', which represents its linkage to
the Mutuals Public Register. Matches are thus sought between CoOps and Mutuals using this id.
\item Matching by Normalised Name:
If two organisations have the same normalised name they are tentatively matched, but require an additional check.
Any of the following being true will create a match:
\begin{enumerate}
\item Organisation is in Care Quality Commission (CQC) data, and the matched core organisation is flagged as having a CQC registration,
which is a flag in CCEW data.
\item Organisation is in the Scottish charity regular (OSCR) register and has a 'cross border' flag,
indicating that a match is expected between itself and an organisation in CCEW.
\item Organisation is in Scottish Housing Regulator or Care Inspectorate Scotland, and normalised name match is with an organisation in OSCR.
\item Organisation is in Social Housing England Regulator or Care Quality Commission, and matches an organisation in CCEW.
\end{enumerate}
\item Matching using work by Find That Charity:
A lookup table provided by David Kane\footnote{\url{https://raw.githubusercontent.com/drkane/charity-lookups/master/relationships/_sameas.csv}} is also used to generate matches.
These links include historic re-registrations, allowing
this dataset to be further deduplicated.
\end{itemize}
The order in Table \ref{tab:order} shows the precedence given to sources where matches occur: the higher precedence
source will provide the data and unique identifier (uid) in the main spine table. Matches will be found in the matches table, and any data
found in addition to the primary fields will be in the supplementary table, indexed by the uid given at source. The first eight sources are processed such that any organisations not already present
in the dataset are added to it. Organisations registered with CIS and CQC are only noted in the matches table where they link to a source present.
\begin{table}[h]
\centering
\begin{tabularx}{\textwidth}{|c|X|c|}
\hline
Order & Source & Code \\ \hline \hline
1 & Charity Commission, England and Wales & CHC \\ \hline
2 & Scottish Charity Register & SC \\ \hline
3 & Charity Commission, Northern Ireland & NIC \\ \hline
4 & Mutuals Public Register & MPR \\ \hline
5 & Companies House & COH \\ \hline
6 & Co-operatives & COOP \\ \hline
7 & Scottish Housing Register & SHR \\ \hline
8 & Social Housing England & SHPE \\ \hline
- & Care Inspectorate Scotland & CIS \\ \hline
- & Care Quality Commission & CQC \\ \hline
\hline
\end{tabularx}
\caption{Precedence order of sources for matched organisations}
\label{tab:order}
\end{table}
\clearpage
\subsection{Match types visualised}
\label{sec:match_vis}
The plots in Figures \ref{fig:Matches1}, \ref{fig:Matches2} and \ref{fig:Matches3}
show the number of matches between sources in the public spine.
Figure \ref{fig:Matches1} is an UpSetPlot, which is representation of intersection between all registers (like a venn
diagram), shows how many matches are made
between the different registers. Figure \ref{fig:Matches2} is the same plot, without organisations which did not match
with any other organisation, allowing for greater visualisation of the size of the matches on each axis.
The bar chart in Figure \ref{fig:Matches3} shows the number of matches by match type,
showing the distribution of the types of linkages.
\begin{figure}[h]
\centering
\includegraphics[width=1.2\textwidth,center]{../public_spine_data/TSCS_spine.all_orgs.upsetplot.png}
\caption{Sources each organisation is registered with}
\label{fig:Matches1}
\end{figure}
\begin{figure}[h]
\centering
\includegraphics[width=1.2\textwidth,center]{../public_spine_data/TSCS_spine.matched_orgs.upsetplot.png}
\caption{Sources each organisation is registered with, for all organisations in at least 2 registers}
\label{fig:Matches2}
\end{figure}
\begin{figure}[h]
\centering
\includegraphics[width=0.6\textwidth,center]{../public_spine_data/TSCS_spine.matchtypes.png}
\caption{Distribution of match types used to deduplicate the spine}
\label{fig:Matches3}
\end{figure}
\subsection{Determining primary organisations}
The above techniques are used to match organisations between sources. Where a match occurs from within the same source regulator,
the primary organisation is determined by the following rules:
\begin{itemize}
\item If an organisation from the Charity Commission for England and Wales matches an organisation from the same commission,
the primary organisation uses the Find That Charity lookup to identify which of the two is the 'transferee' in a merger. The
transferor is then linked to the transferee in the matches table, with the transferee being the primary organisation.
\item In other cases where there are two organisations from the same regulator which match, the organisation considered primary
is the one without a removal date (and is therefore still active), or the one with the most recent removal date (and was therefore
active most recently).
\end{itemize}
If neither of these conditions lead to a decision, in particular where neither of the matched organisation have a removal date,
the organisation with the earliest registration date is considered primary.
\clearpage
\section{Ingesting data from each source}
Each of the ten sources use different conventions for their databases. These needed to be pre-processed individually to accommodate different fieldnames,
encodings, systems of marking alternative names, and so on. The pre-process step identified primary names and addresses for each organisation,
choosing the most recent where multiple iterations exist, and storing alternatives for completeness.
More information about the processes for each source are found in the appendix, as well as the project source code\footnote{\url{https://github.com/uk-third-sector-database/tso-database-builder}}.
\begin{figure}[h]
\centering
\includegraphics[width=0.8\textwidth]{../public_spine_data/TSCS_spine.sources.png}
\caption{Sources found in public spine}
\label{fig:Sources}
\end{figure}
\begin{table}[h]
\centering
\input{../public_spine_data/TSCS_spine.all_orgs.spine_match.tab.tex}
\caption{Counts of organisations in the spine, and matched to the spine, from each source register}
\label{tab:counts_per_source}
\end{table}
\clearpage
\appendix
\section*{Appendix: Data Sources}
All data in the spine comes from publicly available sources. The number of organisations in the database (whether in the list of
core organisation or via matching) are shown in Table \ref{tab:counts_per_source}.
Details for each source are given in the sections below.
\subsection{Charity Commission for England and Wales (CCEW)}
\url{https://www.gov.uk/government/organisations/charity-commission}
\url{https://register-of-charities.charitycommission.gov.uk/en/register/full-register-download}
\subsubsection{Recent CCEW downloads}
'Public extract' data downloads are available at the CCEW website (above). We collected regular data downloads during 2024 and 2025,
and in pre-processing we extract fields for name, address, registration date and removal date, along with charity number and linked charity number.
We retain the date of the download, and where there are multiple options for any details we choose the most recent for the spine, and
retain other information to be provided as supplementary data. All data is normalised, deduplicated and divided between primary and
supplementary per charity number.
Typically the most recent data per charity number is allocated to the spine, and other data to the supplementary file. The exception to this
is where the charity contains an entity with a 'linked charity number' of zero. CCEW uses this to indicate an overarching identity for
a charity, with subsidiary entities being assigned non-zero integer values for linked charity number. The spine reflects this by
using the data for the overarching charity, with subsidiaries added to supplementary data.
\subsubsection{Historical CCEW data}
This section outlines the use of CCEW regulatory data in the construction of the third sector
organisational spine\footnote {For further detail on how this data was used, please see the Stata file
ccew\_publicspine\_prep.do in the project GitHub repository.}.
The initial processing of CCEW data for this project comprised 27 individual data files that were obtained
between 2001 and 2023, along with updates lat. These files were used to compile a list of unique charities, and to
extract as many unique names, addresses, registration dates, and removal dates for each
organisation. These files are detailed below:
\begin{itemize}
\item CC\_appended\_March23.dta [English and Welsh Charity Register - March 2023]
\item CC\_charity\_other\_names.dta [File containing additional names for English and Welsh
charities - March 2023]
\item ccr2001\_repaired.dta [Historical file from 2001, used to extract charity names,
addresses, company numbers, dates of registration and removal]
\item 24 historical snapshots from June 2011, March 2012 to 2019, April 2020 to 21, and March
2022 [Used to extract charity names, addresses, dates of registration and removal]
\end{itemize}
In order to compile all available information from these files, we created separate lists of unique
names, addresses, registration dates, and removal dates. These lists were grouped using the
identification number assigned to each charity by the CCEW (e.g., Stroud Earth Community in
Gloucestershire operates under the charity number 1199586). These lists were then normalised
in order to de-duplicate the information. For charity names, this process entailed removing
leading spaces, removing trailing spaces, collapsing consecutive internal spaces, and
converting all characters to lower case. In the case of addresses, the same process was followed
using a concatenated string comprising the first address line and the postcode. For registration
and removal dates, we converted all dates to the same format (dd/mm/yyyy).
Once this normalisation was complete, we de-duplicated each list to keep only unique pieces of
information. This de-duplication process keeps the first appearance of each piece of
information and we provide variables in the dataset to show from which data iteration each
value originates. These separate list of unique names, addresses, and dates are then compiled
into a single dataset which is later merged with data from all other data sources. Using the
historical file from 2001, we also extract 4,700 Companies House identification numbers that
are absent from the later files we use (i.e., 2011–2023). Later in the process, we use Companies
House IDs to link together the various data sources that produce the third sector organisational
spine.
The codebook of variables is found in Table \ref{tab:ccew_preprocess}.
\subsubsection{Limitations }
Despite the extensive temporal coverage of these data files (2001–2023), there are 13,243
unique charity numbers in the data for which we do not have address information. A large
majority of these organisations were removed from the CCEW register before 1997. Similarly,
three of these charity numbers do not have an organisation name in any of the data iterations
used.
\subsubsection{Notes}
For a subset of charity numbers, the 2001-2023 data contain multiple dates of registration and/or
removal. The data contains 341,990 unique charity numbers with 12,148 (3.55\%) having more
than one registration date. Of the 177,706 charity numbers that have a removal date, 4,074
(2.29\%) have more than one.
Below are some possible explanations for a charity having more than one date of registration or
removal, though these are not exhaustive.
\begin{enumerate}
\item Data entry errors that are corrected in later data iterations.
\item The CCEW removes organisations that they believe have ceased operating. In some
cases, the organisations are still active and the CCEW adds them back to the register
once they have confirmed their continued activity. This creates multiple distinct dates
for the same charity number.
\item These multiple values show the registration/removal dates for distinct organisations that
have since merged into a single organisation with a single charity number. In this case,
the component parts of the merger retain their distinct dates in historical data files and
these are kept in the final organisational spine dataset.
Related to point 3 above, there are charity numbers and organisations in the organisational spine
that have since been merged into single entities. For more detail on this, please see data on
merged charities held by the CCEW (\url{https://www.gov.uk/government/publications/register-of-
merged-charities}).
\end{enumerate}
\begin{table}[h]
\centering
\begin{tabularx}{\textwidth}{|l|X|}
\hline
Fieldname & Notes \\
\hline \hline
uid & Unique ID assigned to each organisation in compiled data \\ \hline
charitynumber & ID assigned to each organisation by CCEW \\ \hline
organisationname & Name of organisation \\ \hline
normalisedname & \\ \hline
companyid & ID assigned to organisation by Companies House \\ \hline
housenumber & \\ \hline
addressline1 & First line of address \\ \hline
addressline2 & Second line of address \\ \hline
addressline3 & Third line of address \\ \hline
addressline4 & Fourth line of address \\ \hline
addressline5 & Fifth line of address \\ \hline
city & City \\ \hline
localauthority & Local Authority area \\ \hline
postcode & Postcode \\ \hline
registerdate & Date of registration \\ \hline
removeddate & Date of removal \\ \hline
name\_origin & Iteration from which the organisation name originates \\ \hline
address\_origin & Iteration from which the organisation address originates \\ \hline
regdate\_origin & Iteration from which the registration date originates \\ \hline
remdate\_origin & Iteration from which the removal date originates \\ \hline
iteration & Iteration from which all data in a given row originates \\ \hline
source & Variable to indicate the data source (in this case, CCEW) \\ \hline
%
\end{tabularx}
\caption{Fields in pre-processing of CCEW data}
\label{tab:ccew_preprocess}
\end{table}
\clearpage%
\subsection{Scottish Charity Regulator (OSCR)}
\url{https://www.oscr.org.uk/}
This section outlines the use of OSCR regulatory data in the construction of the third sector
organisational spine\footnote{ For further detail on how this data was used, please see the accompanying
oscr\_publicspine\_prep.do file.}.
The OSCR data used in this project comprised data files that were obtained
between 2012 and 2025. These files were used to compile a list of unique charities, and to
extract as many unique names, addresses, registration dates, and removal dates for each
organisation. These files are detailed below:
\begin{itemize}
\item CharityExport-Removed-05-Dec-2025.csv : List of removed charities, December 2025
\item CharityExport-03-Nov-2025.csv: Scottish Charity Register November 2025
\item CharityExport-28-Jul-2025.csv: Scottish Charity Register July 2025
\item CharityExport-05-Feb-2025.csv : Scottish Charity Register, February 2025
\item CharityExport-Removed-05-Feb-2025.csv : List of removed charities, February 2025
\item CharityExport-12-Apr-2023.csv : Scottish Charity Register, April 2023
\item CharityExport-Removed-12-Apr-2023.csv : List of removed charities, April 2023
\item oscr\_20210909.xlsx : Scottish Charity Register and Annual Returns, September 2021
\item oscr\_20210203.xlsx : Scottish Charity Register and Annual Returns, February 2021
\item ocsr\_scr-and-ar\_20190109.xlsx : Scottish Charity Register and Annual Returns,
January 2019
\item sc\_chars2012.xlsx [Historical file from 2012]
\end{itemize}
In order to compile all available information from these files, we created separate lists of unique
names, addresses, registration dates, and removal dates. These lists were grouped using the
identification number assigned to each charity by OSCR (e.g., Inverness Civic Trust operates
under the charity number SC021748). These lists were then normalised in order to de-duplicate
the information. For charity names, this process entailed removing leading spaces, removing
trailing spaces, collapsing consecutive internal spaces, and converting all characters to lower
case. In the case of addresses, the same process was followed using a concatenated string
comprising the first address line and the postcode. For registration and removal dates, we
converted all dates to the same format (dd/mm/yyyy).
Once this normalisation was complete, we de-duplicated each list to keep only unique pieces of
information. This de-duplication process keeps the first appearance of each piece of
information and we provide variables in the dataset to show from which data iteration each
value originates. These separate list of unique names, addresses, and dates are then compiled
into a single dataset which is later merged with data from all other data sources.
Finally, the 2012 historical data file includes additional charity regulator and companies house
ID numbers. These additional IDs were gathered at the time the data file was created and we
include them in the final dataset in case they prove useful for future research.
The codebook of variables is found in Table \ref{tab:oscr_preprocess}.
\begin{table}[h]
\centering
\begin{tabularx}{\textwidth}{|l|X|}
\hline
Fieldname & Notes \\
\hline \hline
uid & Unique ID assigned to each organisation in compiled data \\ \hline
charitynumber & ID assigned to each organisation by OSCR \\ \hline
organisationname & Name of organisation \\ \hline
normalisedname & \\ \hline
companyid & ID assigned to organisation by Companies House \\ \hline
address & Full street address \\ \hline
housenumber & \\ \hline
addressline1 & First line of address \\ \hline
addressline2 & Second line of address \\ \hline
addressline3 & Third line of address \\ \hline
addressline4 & Fourth line of address \\ \hline
addressline5 & Fifth line of address \\ \hline
addressline6 & Sixth line of address \\ \hline
addressline7 & Seventh line of address \\ \hline
addressline8 & Eighth line of address \\ \hline
city & City \\ \hline
localauthority & Local Authority area \\ \hline
postcode & Postcode \\ \hline
registerdate & Date of registration \\ \hline
removeddate & Date of removal \\ \hline
name\_origin & Iteration from which the organisation name originates \\ \hline
iteration & Iteration from which all data in a given row originates \\ \hline
charitynumber\_2012 & Alternative OSCR ID (originates in 2012 data) \\ \hline
companyid1\_2012 & Alternative Companies House ID (originates in 2012 data) \\ \hline
companyid2\_2012 & Alternative Companies House ID (originates in 2012 data) \\ \hline
companyid3\_2012 & Alternative Companies House ID (originates in 2012 data) \\ \hline
source & Variable to indicate the data source (in this case, OSCR) \\ \hline
crossborder & Boolean variable, reflecting whether 'Regulatory Type' is 'Cross Border' for this organisation \\ \hline
\end{tabularx}
\caption{Fields in pre-processing of OSCR data}
\label{tab:oscr_preprocess}
\end{table}
\subsubsection{Limitations}
There are 1,776 unique charity numbers in this data that don’t have organisation names or any
address information. Similarly, there is no useful data on registration or removal dates for these
organisations. OSCR uses default values for registration (01/01/1992) and removal
(02/06/2012) dates for many organisations in their data. For these 1,776 charities, all have the
default removal date and all but two have the default registration date. Due to the lack of
organisation names, addresses, and dates for these organisations, we exclude them from the
final dataset.
In addition to the 1,776 charities outlined above, there are 7,591 charities for which we don’t
have any address data. Of these 7,591 charities, 7,572 were removed from the register on the
default date of 02/06/2012. Despite these limitations, these charities are retained in the final
data because we do have info on charity name, registration date, and removal date.
Finally, there are 194 charities for which we have street address information, but the data files
do not contain the accompanying postcode. Of these 194 charities, 189 were removed from the
charity register on the default date of 02/06/2012. These charities are also retained in the final
data because we do have info on charity name, registration date, removal date, and partial
address data.
\subsubsection{Notes}
The 2012 historical data file does not include any information on addresses, registration dates,
or removal dates. As such, many of the names in our data file are first observed in 2012 while
the earliest point of observation for the other key pieces of information is 2019.
\clearpage
\subsection{Charity Commission for Northern Ireland (CCNI)}
\url{https://www.charitycommissionni.org.uk/}
This section outlines the use of CCNI regulatory data in the construction of the third sector
organisational spine\footnote{For further detail on how this data was used, please see the accompanying
ccni\_publicspine\_prep.do file.}.
The CCNI data used in this project comprised of open data files obtained in
\begin{itemize}
\item
\item November 2025 (ccni-charitydetails\_2025\_11\_03.csv) [data download]
\item November 2025 (ni-removals-2025-11-03.csv [removals from the register, compiled via web scrape])
\item May 2025 (charitydetails\_2025\_05\_19.csv [data download])
\item May 2025 (ni-removals-2025-05-16.csv [removals from the register, compiled via web scrape]
\item January 2025 (ccni-charitydetails\_2025\_01\_29\_14\_23\_47.csv [CCNI charity details - 29 January 2025])
\item July 2024 (ni-removals-2024-07-18.csv [removals from the register, compiled via web scrape])
\item April 2023 (ccni-charity-register-2023-04-19.csv [CCNI charity register - 19 April 2023])
\end{itemize}
These files were used to compile a list of unique charities with accompanying names, addresses, and
registration dates for each organisation.
Unlike the preceding sections on CCEW and OSCR data, the research team did not have
historical files to supplement the open date provided by the CCNI. Additionally, the CCNI data
provide only one organisation name, address, and registration date per CCNI charity number
(e.g., S\'{o}l\'{a}s in Belfast operates under the charity number 100114). For organisations whose details differed
between the two files, the most recent data was prioritised for the spine, and older data retained in the
'supplementary' table. This dataset is later merged with data from all other data sources.
The codebook of variables is found in Table \ref{tab:ccni_preprocess}.
\subsubsection{Limitations}
Unlike the other charity regulator data sources, our data for CCNI does not alternative or historic
unique names, addresses, or registration dates.
At this point in time, the CCNI data is less comprehensive than its counterparts in England, Wales,
and Scotland. Additionally, the CCNI does not provide information on the date that a charity is
removed from the charity register. This further limits the CCNI data at this time.
\begin{table}[h]
\centering
\begin{tabularx}{\textwidth}{|l|X|}
\hline
Fieldname & Notes \\
\hline \hline
uid & Unique ID assigned to each organisation in compiled data \\ \hline
charitynumber & ID assigned to each organisation by CCNI \\ \hline
organisationname & Name of organisation \\ \hline
normalisedname & \\ \hline
companyid & ID assigned to organisation by Companies House \\ \hline
housenumber & \\ \hline
address & Full street address and postcode \\ \hline
city & City \\ \hline
localauthority & Local Authority area \\ \hline
postcode & Postcode \\ \hline
registerdate & Date of registration \\ \hline
source & Variable to indicate the data source (in this case - CCNI) \\ \hline
\end{tabularx}
\caption{Fields in pre-processing of CCNI data}
\label{tab:ccni_preprocess}
\end{table}
\clearpage
\subsection{Mutuals Public Register }
\url{https://mutuals.fca.org.uk/}
The Mutuals Public Register is a public record of mutual societies registered by the Financial Conduct Authority.
Openly available register data has information for societies currently registered, and those no longer registered.
For this project, downloads from June 2023, April 2024, and January, July and October 2025 are used. The field mapping is shown in Table
\ref{tab:mpr_preprocess}.
\begin{table}[h]
\centering
\begin{tabularx}{\textwidth}{|l|X|}
\hline
Fieldname & Notes \\
\hline \hline
uid & Unique ID assigned to each organisation using \\
&'GB-MPR-' + \textbf{Full Registration Number} \\ \hline
organisationname & \textbf{Society Name} \\ \hline
normalisedname & Normalised version of organisationname\\ \hline
companyid & \\ \hline
fulladdress & \textbf{Society Address} \\ \hline
city & \\ \hline
postcode & Postcode taken from \textbf{Society Address} : \\
& = last two 'words' of address if their lengths are at most 4 characters each\\ \hline
registerdate & \textbf{Registration Date} mapped from yyyy-mm-dd format to dd/mm/yyyy\\ \hline
removeddate & \textbf{Deregistration Date} mapped from yyyy-mm-dd format to dd/mm/yyyy\\ \hline
source & Variable to indicate the data source (in this case - MPR) \\ \hline
\hline
\end{tabularx}
\caption{Fields in pre-procesing of Mutuals data (bold text indicates fieldname in Mutuals Public Register data)}
\label{tab:mpr_preprocess}
\end{table}
Where an organisation has multiple options for name and address, the most recent
is used in the primary data, and alternatives are stored in supplementary data.
\clearpage
\subsection{Companies House}
\url{https://www.gov.uk/government/organisations/companies-house}
Companies House release a bulk download every quarter of all currently live companies\footnote{\url{https://download.companieshouse.gov.uk/en_output.html}}. The Alpha public
spine encorporates these downloads for April, June and October 2023, March, May and September 2024, and January, July and November 2025.
Using bulk downloads alone would mean organisations that were dissolved earlier than 2023 would
not be present in the database. Using the Companies House API
we found all companies registered and dissolved which fit our inclusion criteria (see Table \ref{tab:CH_companytypes} for included
company types).
The code used to search the API is found in our
github repository
\url{https://github.com/uk-third-sector-database/ch_adv_scraper}. Additionally, since the bulk download does not contain any removed organisations, it contains no
removal dates. The API scraped data is used to fill in this field for all removed organisations.
% inlcuded company types from CH
\subsubsection{Company types}
Companies house data includes a 'company category' for each organisation. Table \ref{tab:CH_companytypes} shows the
full list of these, indicating which have been included in the public spine data.
\begin{table}[h]
\centering
\begin{tabularx}{\textwidth}{|X|l|}
\hline
Company category & Included / Excluded \\ \hline
\hline
Charitable Incorporated Organisation & Included \\ \hline
Community Interest Company & Included \\ \hline
Industrial and Provident Society & Included \\ \hline
PRI/LBG/NSC (Private, Limited by guarantee, no share capital, use of 'Limited' exemption) & Included \\ \hline
PRI/LTD BY GUAR/NSC (Private, limited by guarantee, no share capital) & Included \\ \hline
Registered Society & Included \\ \hline
Scottish Charitable Incorporated Organisation & Included \\ \hline
Converted/Closed & Included \\ \hline
\hline
Private Limited Company & Excluded \\ \hline
Limited Partnership & Excluded \\ \hline
Limited Liability Partnership & Excluded \\ \hline
Public Limited Company & Excluded \\ \hline
Private Unlimited Company & Excluded \\ \hline
Scottish Partnership & Excluded \\ \hline
Private Unlimited & Excluded \\ \hline
Investment Company with Variable Capital(Umbrella) & Excluded \\ \hline
PRIV LTD SECT. 30 (Private limited company, section 30 of the Companies Act) & Excluded \\ \hline
Investment Company with Variable Capital (Securities) & Excluded \\ \hline
Investment Company with Variable Capital & Excluded \\ \hline
Overseas Entity & Excluded \\ \hline
United Kingdom Economic Interest Grouping & Excluded \\ \hline
Old Public Company & Excluded \\ \hline
United Kingdom Societas & Excluded \\ \hline
Other Company Type & Excluded \\ \hline
Protected Cell Company & Excluded \\ \hline
Royal Charter Company & Excluded \\ \hline
Further Education and Sixth Form College Corps & Excluded \\ \hline
Other company type & Excluded \\ \hline
\hline
\end{tabularx}
\caption{Companies house 'company category' field}
\label{tab:CH_companytypes}
\end{table}
Each of the companies house data sources were pre-processed to identify primary name and address for each
included organisation, with any alternative names and addresses also stored. The sources were then
also combined, since there is significant overlap between them, and the most recent name and address
per organisation was included in the companies house spine data, to be used for matching with other
registries to form the public spine.
%Counts of organisations ingested from each source, along with the overlaps in UIDS, are shown in
%Figure \ref{fig:ch_source_venn}.
%
%\begin{figure}[h]
% \centering
% \includegraphics[width=0.8\textwidth]{CH_venn.builder.png}
% \caption{Companies house organisations - intersections of UIDs across the three sources.}
% \label{fig:ch_source_venn}
%\end{figure}
%
%The distribution across company categories is seen in Table \ref{tab:CH_types_dist}. There are different abbreviations used, depending on
%source of the data. Additionally there are numbers of cases where the company type includes Community Interest Company along with another type.
%\begin{table}
%\input{CH.companytype.tab.tex}
%\caption{Companies house organisations by company type.}
%\label{tab:CH_companytypes}
%\end{table}
Compressing the company type categories such that all Community Interest Companies are grouped, and all remaining companies which are
limited by guarantee (with or without 'limited exemption', with or without 'private'),
gives the counts shown in Table \ref{tab:CH_types_compressed}, and in Figure \ref{fig:CH_types_compressed}.
\begin{table}[h]
\centering
\input{CH_dist.tab.tex}
\caption{Distribution of company types in dataset (compressed company types).}
\label{tab:CH_types_compressed}
\end{table}
\begin{figure}[h]
\centering
\includegraphics[width=0.8\textwidth]{CH_dist.png}
\caption{Companies house organisations by company type (compressed).}
\label{fig:CH_types_compressed}
\end{figure}
\clearpage
\clearpage
\subsection{Co-operatives}
\url{https://www.uk.coop/resources/open-data}
The Co-operatives UK organisations used in this project were from open data downloads in April 2022, April 2024 and
January, July and October 2025.
Open data from the Co-operatives UK organisation was mapped to the format required by the public spine
according to the rules shown in Table \ref{tab:coop_preprocess} below.
\begin{table}[h]
\centering
\begin{tabularx}{\textwidth}{|l|X|}
\hline
Fieldname & Notes \\
\hline \hline
uid & Unique ID assigned to each organisation using \\
&'GB-COOP-' + \textbf{CUK Organisation ID} \\ \hline
organisationname & \textbf{Registered Name} \\ \hline
normalisedname & Normalised version of organisationname\\ \hline
companyid & \textbf{Registered Number} \\
&(COOPs data links to Mutuals data using this field) \\ \hline
fulladdress & \textbf{Registered Street} \\ \hline
city & \textbf{Registered City} \\ \hline
postcode & \textbf{Registered Postcode} \\ \hline
registerdate & \textbf{Incorporation Date} \\ \hline
removeddate & \textbf{Dissolved Date} \\ \hline
source & Variable to indicate the data source (in this case - COOP) \\ \hline
\hline \hline
organisationname & \textbf{Trading Name} \\
(in supplementary data) & \\ \hline
\end{tabularx}
\caption{Fields in pre-procesing of COOP data (bold text indicates fieldname in co-op open data)}
\label{tab:coop_preprocess}
\end{table}
Where an organisation has multiple options for name and address, the most recent
is used in the primary data, and alternatives are stored in supplementary data.
\clearpage
\subsection{Scottish Housing Register }
\url{https://www.housingregulator.gov.scot/landlord-performance/statistical-information/}
The Scottish Housing Regulator for social landlords publishes organisational data which has been included in the spine.
Datasets used in this project were annual downloads collected in 2018, 2019, 2020, 2021, 2022, 2024, 2025.
Table \ref{tab:shr_preprocess} shows the how the spine fields relate to the provided data.
\begin{table}[h]
\centering
\begin{tabularx}{\textwidth}{|l|X|}
\hline
Fieldname & Notes \\
\hline \hline
uid & Unique ID assigned to each organisation using \\
&'GB-SHR-' + \textbf{Reg No}\\ \hline
organisationname & \textbf{Social Landlord} \\ \hline
normalisedname & Normalised version of organisationname\\ \hline
companyid & \\ \hline
fulladdress & \\ \hline
city & \\ \hline
postcode & \\ \hline
registerdate & \\ \hline
removeddate & \\ \hline
source & Variable to indicate the data source (in this case - SHR) \\ \hline
\hline
\end{tabularx}
\caption{Fields in pre-procesing of Scottish Housing Regulator data (bold text indicates fieldname in register)}
\label{tab:shr_preprocess}
\end{table}
Where an organisation has multiple options for name and address, the most recent
is used in the primary data, and alternatives are stored in supplementary data.
\clearpage
\subsection{Social Housing England }
\url{https://www.gov.uk/government/publications/registered-providers-of-social-housing}
The government publishes a download monthly of registered social housing providers in England. The data in the
spine is compiled from downloads accessed January 2023, January, July and October 2025. The mapping to the fields
seen in the spine is shown in Table \ref{tab:shpe_preprocess}.
\begin{table}[h]
\centering
\begin{tabularx}{\textwidth}{|l|X|}
\hline
Fieldname & Notes \\
\hline \hline
uid & Unique ID assigned to each organisation using \\
&'GB-SHPE-' + \textbf{Registration number}\\ \hline
organisationname & \textbf{Organisation name} \\ \hline
normalisedname & Normalised version of organisationname\\ \hline
companyid & \\ \hline
fulladdress & \\ \hline
city & \\ \hline
postcode & \\ \hline
registerdate & \textbf{Registration date} \\ \hline
removeddate & \\ \hline
source & Variable to indicate the data source (in this case - SHPE) \\ \hline
\hline
\end{tabularx}
\caption{Fields in pre-procesing of Social Housing Providers England data (bold text indicates fieldname in register)}
\label{tab:shpe_preprocess}
\end{table}
Where an organisation has multiple options for name and address, the most recent
is used in the primary data, and alternatives are stored in supplementary data.
\clearpage
\subsection{Care Inspectorate Scotland }
\url{https://www.careinspectorate.com/index.php/statistics-and-analysis/data-and-analysis}
Care Inspectorate Scotland release data at their financial year end (March 31st). This dataset uses data published in the
years 2015 to 2025 inclusive.
Organisations are included where field "Service Type" = 'Voluntary or Not for Profit'. Organisations
have both a 'Provider Name' and a 'Service Name' in the Care Inspectorate data. We have retained
Service Names, however the primary name for matching is the Provider Name. Table \ref{tab:cis_preprocess} shows
the fields used to develop the spine.
\begin{table}[h]
\centering
\begin{tabularx}{\textwidth}{|l|X|}
\hline
Fieldname & Notes \\
\hline \hline
uid & Unique ID assigned to each organisation using \\
&'GB-CIS-' + \textbf{CSNumber} or \textbf{CaseNumber}\\ \hline
organisationname & \textbf{ServiceProvider} \\ \hline
normalisedname & Normalised version of organisationname\\ \hline
companyid & \\ \hline
fulladdress & Concatenation of \\
& \textbf{Address\_line\_1}, \textbf{Address\_line\_2}, \\
& \textbf{Address\_line\_3}, and
\textbf{Address\_line\_4}\\
\\ \hline
city & \textbf{Service\_town}\\ \hline
postcode & \textbf{Service\_Postcode}\\ \hline
registerdate & \textbf{DateReg} \\ \hline
removeddate & \\ \hline
source & Variable to indicate the data source (in this case - CIS) \\ \hline
\hline \hline
organisationname & \textbf{ServiceName} \\
(in supplementary data) & \\ \hline
\hline
\end{tabularx}
\caption{Fields in pre-procesing of Care Inspectorate Scotland data (bold text indicates fieldname in register)}
\label{tab:cis_preprocess}
\end{table}
\clearpage
\subsection{Care Quality Commission }
\url{https://www.cqc.org.uk/about-us/transparency/using-cqc-data}
Data from the Care Quality Commission is available as a Directory Download from their website. Monthly snapshots
are retrievable from their linked drive space. For this product we have used downloads from January 2023, April 2024,
and January, July and October 2025. Table \ref{tab:cqc_preprocess} shows the mapping between CQC data fields and spine fields.
\begin{table}[h]
\centering
\begin{tabularx}{\textwidth}{|l|X|}
\hline
Fieldname & Notes \\
\hline \hline
uid & Unique ID assigned to each organisation using \\
&'GB-CQC-' + \textbf{CQC Provider ID (for office use only)}\\ \hline
organisationname & \textbf{Provider name} \\ \hline
normalisedname & Normalised version of organisationname\\ \hline
companyid & \\ \hline
fulladdress & \textbf{Address}\\ \hline
city & \\ \hline
postcode & \textbf{Postcode}\\ \hline
registerdate & \textbf{} \\ \hline
removeddate & \\ \hline
source & Variable to indicate the data source (in this case - CQC) \\ \hline
\hline \hline
organisationname & \textbf{Name} \\
(in supplementary data) & \\ \hline
organisationname & \textbf{Also known as} \\
(in supplementary data) & \\ \hline
\hline
\end{tabularx}
\caption{Fields in pre-procesing of Care Inspectorate Scotland data (bold text indicates fieldname in register)}
\label{tab:cqc_preprocess}
\end{table}
\end{document}