I think there are some bits here that are being handled manually that should be handled in a more automated fashion.
With the hand-written tuples recording which producer can correctly encode which query, we have the means to track regressions (e.g., DuckDB suddenly fails to run `logb`) but not improvements (Isthmus can now encode `logb`).
This is a highly non-trivial problem, because the outcomes of the producer tests are essentially the test fixtures for the consumers.
We've been using pytest-snapshot to test that Ibis produces "good" or "golden" SQL for various expressions (https://pypi.org/project/pytest-snapshot/) and I wonder if that would be of help here.
Testing producers would mean generating substrait blobs, then comparing them to known good / valid snapshots of those blobs.
Testing consumers would consist of loading the snapshot blobs and attempting to execute them.
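As a rough sketch of that workflow (pytest-snapshot's `snapshot` fixture does essentially this via `snapshot.assert_match` and `--snapshot-update`; the helper below is a hand-rolled stand-in, and the file layout is just an assumption):

```python
from pathlib import Path


def assert_matches_snapshot(blob: bytes, path: Path, update: bool = False) -> None:
    """Compare a produced Substrait blob against its stored golden copy.

    With update=True (or on first run), write the snapshot instead of
    comparing -- the same record/verify split pytest-snapshot exposes
    through its --snapshot-update flag.
    """
    if update or not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(blob)
    else:
        assert path.read_bytes() == blob, f"snapshot {path} is stale"
```

A consumer test then never talks to a producer at all: it just does `path.read_bytes()` on the checked-in snapshot and hands the blob to the engine.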
I know I'm not covering everything that needs covering in the test matrix here, but I think it would be a very good idea to start sketching out more sustainable patterns.
Having said all of ^^^^that^^^^, I don't think that should block this PR.
I do think that we should be attempting to run all producer tests on all SQL snippets, and not manually filtering them down pre-test. If isthmus is going to fail one of those tests because it uses a different SQL dialect, so be it -- we can get creative in the xfail markers and distinguish between "tests that fail that should pass in the future" and "tests that fail that will always fail".
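One way to encode that distinction in pytest markers (the `run_producer` helper and the queries are made up for illustration): `xfail(strict=True)` turns an unexpected pass into a hard failure, so when a producer improves someone is forced to delete the marker, while `skip` covers the permanently-incompatible cases.

```python
import pytest


def run_producer(producer: str, sql: str) -> bytes:
    raise NotImplementedError  # stand-in for the real producer invocation


# "Fails today, should pass in the future": strict=True reports an
# unexpected pass (XPASS) as a failure, so improvements can't go unnoticed.
@pytest.mark.xfail(reason="isthmus cannot encode logb yet", strict=True)
def test_isthmus_logb():
    run_producer("isthmus", "SELECT logb(a, 2) FROM t")


# "Will always fail": dialect differences we never expect to resolve.
@pytest.mark.skip(reason="duckdb-specific syntax, no isthmus equivalent")
def test_isthmus_duckdb_syntax():
    run_producer("isthmus", "SELECT a FROM t USING SAMPLE 10%")
```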
Alternatively, we might make use of `sqlglot` to translate SQL strings between dialects -- it's very good at that.
Originally posted by @gforsyth in #6 (review)