This project contains crawlers that import source data, such as sanctions lists and other KYC/AML screening data, into FollowTheMoney entities. It puts an emphasis on data cleaning. Much of the input data is semi-structured information published by government bodies, often rife with inconsistencies, manual data entry errors, etc. Our goal is to bring a strict interpretation to these source datasets.
- `zavod` contains an ETL framework for crawlers, including definitions for metadata (`zavod.meta`), entity structure (`zavod.entity.Entity`) and crawler context (`zavod.context.Context`).
- Documentation for the entity structure (available schemata and properties in `followthemoney`) is available at https://followthemoney.tech/explorer/schemata/ (sub-paths, e.g. https://followthemoney.tech/explorer/schemata/Person/). Property types are documented at https://followthemoney.tech/explorer/types/ (e.g. https://followthemoney.tech/explorer/types/name/).
- Data cleaning functions from `rigour` are documented at https://rigour.followthemoney.tech/
- Write tests for all zavod functions in `zavod/zavod/tests`.
- Run tests using `cd zavod && pytest zavod/tests/`.
- Run `cd zavod && mypy --strict --exclude zavod/tests zavod/` after each change to zavod.
- `datasets` contains crawlers. Each crawler is defined using a `.yml` file (e.g. `datasets/us/ofac/us_ofac_sdn.yml`) and a code file (often `crawler.py`, but defined using the `entry_point` key of the dataset `.yml`). The dataset has a `name`, which is based on the `.yml` file name stem (e.g. `us_ofac_sdn`).
- To run a crawler: `zavod crawl <file_path>` in the project root. Running crawl several times might re-use the same data fetched in the initial run (`context.fetch_resource`).
- When a crawler encounters uncertainty in any of the data it is parsing, it should crash or produce an error instead of emitting ambiguous data.
- Crawlers use `lookups` to override specific values for entity properties of a particular type. For ambiguous data, individual cases can be clarified by adding lookups.
- After running a crawler, output data is written to `data/datasets/<dataset_name>/`. The file `issues.log` contains line-based JSON of any warnings or errors produced by the crawler. Often the source data fetched by `context.fetch_resource` is also available in that folder.
- Crawlers commonly do `from zavod import helpers as h`. The relevant code is in `zavod/zavod/helpers`. Use this pattern over direct imports.
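The conventions above can be sketched as a hypothetical `crawler.py` entry point. This is not runnable outside the zavod framework, and the source URL, column names and `Person` mapping are invented for illustration; `context.fetch_resource`, `context.make`, `context.emit` and the helpers module follow the patterns used across `datasets/`:

```python
import csv
from zavod import Context
from zavod import helpers as h


def crawl(context: Context) -> None:
    # Fetch the source file; re-runs may re-use the cached copy.
    path = context.fetch_resource("source.csv", context.data_url)
    with open(path, "r", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            name = row.pop("Name")  # hypothetical source column
            if name is None or name.strip() == "":
                # Crash on unexpected input rather than emit ambiguous data.
                raise ValueError("Row without a name: %r" % row)
            entity = context.make("Person")
            entity.id = context.make_id(name, row.pop("ID"))
            entity.add("name", name)
            h.apply_date(entity, "birthDate", row.pop("DOB"))
            context.emit(entity)
            context.audit_data(row)  # flag source columns left unhandled
```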
- `ui` contains a NextJS user interface for reviewing and verifying information from crawlers. The contained table structures need to match those in `zavod.stateful`.
- `docs` contains documentation and best practices, especially with regard to semantic issues like Politically Exposed Persons (PEPs).
- Assume the venv you're running in has `zavod` configured.
- Write code that is specific (e.g. `if var is None:`, not `if var:`) and breaks with an error when encountering unexpected conditions. Distrust all input, especially from the source files.
- All zavod code needs to be fully typed, unit tested and thoroughly documented.
- When adding type hints, use the built-in container types such as `set`, `tuple` or `dict` instead of the now-deprecated `typing.Set`, `typing.Tuple` or `typing.Dict`.
- All new crawlers should be written using typed Python. Suggest adding types to existing ones.
- Be extremely conservative in bringing in new dependencies. We use `lxml` for parsing HTML/XML, and the `context.fetch_*` functions to retrieve online data. Other libraries are listed in `zavod/pyproject.toml`.