2010YOUY01
left a comment
LGTM. I read through it and found the concepts well explained and easy to follow. One follow-up after publishing would be to link this blog from the doc comments of related APIs such as TableProvider.
pgwhalen
left a comment
As someone who struggled in the past, I'm thrilled to see this get created now! I added some comments that highlight my biggest struggles.
Co-authored-by: Yongting You <2010youy01@gmail.com>
Thanks everyone for the feedback. The post is updated in case anyone wants another look.
Starting to check this out
alamb
left a comment
Thank you so much @timsaucer -- this is really great and I think will help people write table providers a lot
The only thing I think we should be careful of is suggesting that people run CPU work on blocking threads, as I don't think that is necessarily best practice -- I left some comments to that effect inline
Also, once we publish this blog, I think it would be sweet to incorporate a bunch of its content into the https://datafusion.apache.org/library-user-guide/custom-table-providers.html section of the doc
@timsaucer how is this post going? Shall we publish it?
I was working on it as you pinged. I hope to get it wrapped today
- Clarify intro sentence to mention planning/execution work
- Label TableProvider as Logical Plan and ExecutionPlan as Physical Plan
- Change "four phases" to "several phases" (list has 5 items)
- "Some logical optimizations" and "rewrites such as" to signal non-exhaustive lists
- Clarify scan() comment: "don't do any execution work here"
- Rewrite partitioning section to lead with simple advice (match data layout) before covering target_partitions and hash partitioning subtleties
- Narrow CPU thread pool advice: spawn_blocking is for blocking/long-running work, not all CPU work
- Add "scan is single-threaded" as a reason to keep scan() lightweight
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Addresses alamb's suggestion to move the section earlier so readers understand what level of work is required before diving in.
- Moved section to just before Layer 1: TableProvider
- Trimmed the file-based path detail to a short paragraph with links (the full trait hierarchy was too deep for an intro-position section)
- Removed RecordBatchStreamAdapter reference (not yet introduced at that point in the article)
- Added a sentence orienting the reader to what the rest of the post covers
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix use-after-move bug in DatePartitionedExec construction (dirs.len() called after dirs moved into struct field)
- Fix incorrect import: SessionState → catalog::Session in CountingTable example
- Remove double space before scan_with_args link
- Add missing blank line before '### Using EXPLAIN' heading
- Split dense 'Only Push Down Filters' paragraph for readability
- Change 'full working example' to 'illustrative example' for the filter pushdown code that contains todo!() stubs
- Use 'Rerun is building' instead of repeating [Rerun.io] link
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix grammar: "Best practices are" → "Best practice is"
- Remove unused StringArray import from complete example
- Fix outdated arrow-datafusion repo link → apache/datafusion
- Add missing reviewers to acknowledgements: adriangb, kevinjqliu, Omega359
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
FYI filed apache/datafusion#21304 to track
Closes apache/datafusion#16821
This blog post is designed to help new users of DataFusion write their own table providers and understand some of the core concepts.
Preview site: https://datafusion.staged.apache.org/blog/2026/03/20/writing-table-providers/