Blog post on writing table providers#161

Merged
timsaucer merged 19 commits into main from site/writing-table-providers on Mar 31, 2026

@timsaucer
Member

@timsaucer timsaucer commented Mar 20, 2026

Closes apache/datafusion#16821

This blog post is designed to help new users of DataFusion write their own table providers and understand some of the core concepts.

Preview site: https://datafusion.staged.apache.org/blog/2026/03/20/writing-table-providers/

@timsaucer timsaucer marked this pull request as ready for review March 20, 2026 22:09

@stuhood stuhood left a comment

Thanks for doing this!

Contributor

@2010YOUY01 2010YOUY01 left a comment

LGTM. I read through it and found the concepts well explained and easy to follow. One follow-up after publishing would be to link this blog from the doc comments of related APIs such as TableProvider.

@pgwhalen pgwhalen left a comment

As someone who struggled in the past, I'm thrilled to see this get created now! I added some comments that highlight my biggest struggles.

@timsaucer
Member Author

Thanks everyone for the feedback. The post is updated in case anyone wants another look.

@alamb
Contributor

alamb commented Mar 24, 2026

Starting to check this out

Contributor

@alamb alamb left a comment

Thank you so much @timsaucer -- this is really great and I think will help people write table providers a lot

The only thing I think we should be careful of is suggesting that people run CPU work on blocking threads, as I don't think that is necessarily best practice -- I left some comments to that effect inline

Also, once we publish this blog, I think it would be sweet to incorporate a bunch of its content into the https://datafusion.apache.org/library-user-guide/custom-table-providers.html section of the doc

@alamb
Contributor

alamb commented Mar 31, 2026

@timsaucer how is this post going? Shall we publish it?

@timsaucer
Member Author

> @timsaucer how is this post going? Shall we publish it?

I was working on it as you pinged. I hope to get it wrapped today.

timsaucer and others added 9 commits March 31, 2026 14:48
- Clarify intro sentence to mention planning/execution work
- Label TableProvider as Logical Plan and ExecutionPlan as Physical Plan
- Change "four phases" to "several phases" (list has 5 items)
- "Some logical optimizations" and "rewrites such as" to signal non-exhaustive lists
- Clarify scan() comment: "don't do any execution work here"
- Rewrite partitioning section to lead with simple advice (match data layout)
  before covering target_partitions and hash partitioning subtleties
- Narrow CPU thread pool advice: spawn_blocking is for blocking/long-running
  work, not all CPU work
- Add "scan is single-threaded" as a reason to keep scan() lightweight

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Addresses alamb's suggestion to move the section earlier so readers
understand what level of work is required before diving in.

- Moved section to just before Layer 1: TableProvider
- Trimmed the file-based path detail to a short paragraph with links
  (the full trait hierarchy was too deep for an intro-position section)
- Removed RecordBatchStreamAdapter reference (not yet introduced at
  that point in the article)
- Added a sentence orienting the reader to what the rest of the post covers

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix use-after-move bug in DatePartitionedExec construction (dirs.len()
  called after dirs moved into struct field)
- Fix incorrect import: SessionState → catalog::Session in CountingTable
  example
- Remove double space before scan_with_args link
- Add missing blank line before '### Using EXPLAIN' heading
- Split dense 'Only Push Down Filters' paragraph for readability
- Change 'full working example' to 'illustrative example' for the
  filter pushdown code that contains todo!() stubs
- Use 'Rerun is building' instead of repeating [Rerun.io] link

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
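
The use-after-move fix described in the commit notes above (dirs.len() called after dirs was moved into a struct field) follows a common Rust pattern. The sketch below is only an illustration of that class of bug, not the code from the blog post: the struct layout, the `partition_count` field, and the `new` constructor are assumptions made for the example.

```rust
// Hypothetical stand-in for the post's DatePartitionedExec. The fields and
// constructor are assumed for illustration and are not taken from the post.
#[derive(Debug)]
struct DatePartitionedExec {
    dirs: Vec<String>,
    partition_count: usize,
}

impl DatePartitionedExec {
    fn new(dirs: Vec<String>) -> Self {
        // Buggy form (rejected by the compiler, "borrow of moved value"):
        //
        //     Self { dirs, partition_count: dirs.len() }
        //
        // Field initializers are evaluated in the order they are written,
        // so `dirs` has already been moved into the struct by the time
        // `dirs.len()` is evaluated.

        // Fix: read the length first, then move the vector into the struct.
        let partition_count = dirs.len();
        Self { dirs, partition_count }
    }
}
```

Calling `DatePartitionedExec::new` with a vector of two directory names would yield a value whose `partition_count` is 2. Reordering the initializers so the length is computed before the move works as well, but binding the length to a local first keeps the intent obvious.
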
timsaucer and others added 3 commits March 31, 2026 15:56
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix grammar: "Best practices are" → "Best practice is"
- Remove unused StringArray import from complete example
- Fix outdated arrow-datafusion repo link → apache/datafusion
- Add missing reviewers to acknowledgements: adriangb, kevinjqliu, Omega359

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@timsaucer timsaucer merged commit f4ee574 into main Mar 31, 2026
4 checks passed
@timsaucer timsaucer deleted the site/writing-table-providers branch March 31, 2026 20:13
@alamb
Contributor

alamb commented Apr 1, 2026

> Also, once we publish this blog, I think it would be sweet to incorporate a bunch of its content into the https://datafusion.apache.org/library-user-guide/custom-table-providers.html section of the doc

FYI filed apache/datafusion#21304 to track

Development

Successfully merging this pull request may close these issues.

RFC: What table provider features would be helpful in an example?

8 participants