Description
DF.validate() does some basic checks but doesn't cover everything that Table Schema makes it possible to validate. In particular, it does not validate primary keys, and we have noticed that this causes other bugs we have not yet traced (e.g.: load from a package with invalid primary keys, then try to dump it again, and the resulting package is incomplete).
We need to explore one of:
- Support more features of Table Schema when validating rows, by enhancing the existing validator
- Use the resource validator in Frictionless (e.g.: the primary key check here https://github.com/frictionlessdata/frictionless-py/blob/1d8cc6cf2ad2521963fa82da8a78f368de4d1fd1/frictionless/resource.py#L934 )
The problem with adopting Frictionless is that, AFAIK, it can't be adopted incrementally: the validation is built into the Resource class, and I can't tell just from reading the code where that leads (whether and how it complicates our code if we use different libraries for managing Frictionless Data specs). Also, it keeps state in memory (the values already seen for primary keys and foreign keys), and I guess that, based on other patterns in Dataflows, we would want to store that data outside of the running Python process (e.g.: using https://github.com/akariv/kvfile ).
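
To make the first option more concrete, here is a minimal sketch of a primary key uniqueness check written as a plain row generator. It is not taken from Dataflows or Frictionless: `check_primary_key`, `DuplicatePrimaryKeyError` and the `seen` parameter are hypothetical names, and rows are assumed to be dicts keyed by field name. The seen-keys store is passed in so that the in-memory `set` used here could be swapped for something backed by disk; how that would be wired to kvfile is left open.

```python
from typing import Any, Dict, Iterable, Iterator, List, Optional


class DuplicatePrimaryKeyError(ValueError):
    """Hypothetical error raised when two rows share a primary key value."""


def check_primary_key(
    rows: Iterable[Dict[str, Any]],
    primary_key: List[str],
    seen: Optional[Any] = None,
) -> Iterator[Dict[str, Any]]:
    """Yield rows unchanged, raising on the first repeated primary key.

    `seen` is any object supporting `in` and `.add()`. It defaults to an
    in-memory set (an assumption for this sketch); a store living outside
    the Python process would need a thin adapter exposing the same two
    operations.
    """
    if seen is None:
        seen = set()
    for index, row in enumerate(rows):
        # Build a hashable key from the primary key fields, in order.
        key = tuple(row.get(field) for field in primary_key)
        if key in seen:
            raise DuplicatePrimaryKeyError(
                f"row {index}: duplicate primary key {key!r}"
            )
        seen.add(key)
        yield row


if __name__ == "__main__":
    rows = [
        {"id": 1, "name": "a"},
        {"id": 2, "name": "b"},
        {"id": 1, "name": "c"},  # repeats the key of the first row
    ]
    try:
        for row in check_primary_key(rows, ["id"]):
            print(row)
    except DuplicatePrimaryKeyError as error:
        print("validation failed:", error)
```

This only checks uniqueness; other Table Schema rules (and foreign keys, which need the referenced resource's keys as well) would need their own handling.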