Add kwarg to filter columns #412

JoaoAparicio · 2023-04-03T00:14:50Z

Currently we don't have the option to load just a subset of the columns. This matters, e.g., when decompression is the bottleneck.

For example, create a compressed Arrow file:

using Arrow
p = tempname();
N = 1000000
tbl = (
    a=rand(N),
    b=rand(N),
    c=rand(N),
    d=rand(N),
    e=rand(N),
    f=[rand(rand(0:100)) for _ in 1:N],
);
Arrow.write(p, tbl; compress=:zstd);

Column `f` is by far the largest: it holds an expected 50*N elements, versus N for each of the other columns. Sometimes we only care about some of the other columns, but currently we must decompress all columns regardless:

using BenchmarkTools
@btime tbl = Arrow.Table(p);  # 359.205 ms (530 allocations: 794.23 MiB)

With this commit we can load only some of the columns:

@btime tbl = Arrow.Table(p; filtercolumns=["a"]);  # 6.146 ms (231 allocations: 14.33 MiB)


JoaoAparicio · 2023-04-03T00:18:49Z

#340
#353

JoaoAparicio · 2023-04-04T17:49:25Z

Converting this to draft as I'm working on something that will supersede this.

codecov-commenter · 2023-11-04T00:28:37Z

Codecov Report

Merging #412 (bc9169e) into main (787768f) will decrease coverage by 1.67%.
The diff coverage is 15.58%.

@@            Coverage Diff             @@
##             main     #412      +/-   ##
==========================================
- Coverage   87.45%   85.78%   -1.67%     
==========================================
  Files          26       26              
  Lines        3283     3356      +73     
==========================================
+ Hits         2871     2879       +8     
- Misses        412      477      +65

Files	Coverage Δ
src/table.jl	`81.97% <15.58%> (-10.52%)`	⬇️


JoaoAparicio · 2023-11-25T13:06:04Z

Does anyone wanna re-run CI? Looks like macos got stuck

kou · 2023-11-27T02:14:57Z

Done.

Yuan-Ru-Lin · 2025-07-16T23:00:37Z

Hi, what's the status of this PR? Would love to see what I can do @JoaoAparicio

oschulz · 2025-07-17T11:16:18Z

This would be a very important feature for us, too.

ericphanson · 2025-07-17T12:24:41Z

For the API, `filtercolumns` seems a bit ambiguous to me (are we keeping them or removing them?). CSV.jl has `select` and `drop` (https://csv.juliadata.org/stable/reading.html#CSV.File), which seems nice. Maybe `select` or `select_columns` for the name here?
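For reference, a minimal sketch of how the two CSV.jl keywords mentioned above behave. This only illustrates CSV.jl's existing `select` and `drop` keywords on a made-up in-memory input; it is not Arrow.jl's API, and whether this PR would adopt the same names is still open.

```julia
using CSV

# Small made-up CSV held in memory, so the example needs no file on disk.
data = "a,b,c\n1,2,3\n4,5,6\n"

# `select` keeps only the listed columns.
kept = CSV.File(IOBuffer(data); select=[:a, :b])

# `drop` keeps everything except the listed columns.
dropped = CSV.File(IOBuffer(data); drop=[:c])

# Both calls end up with columns `a` and `b` only.
println(kept.a)
println(dropped.b)
```

With both keywords available, the call site states whether the listed columns are kept or removed, which is the ambiguity raised about `filtercolumns`.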

kou · 2025-07-18T00:19:33Z

We need to rebase on main to proceed with this.

JoaoAparicio mentioned this pull request Apr 3, 2023

Feather file with compression and larger than RAM #340

Open

JoaoAparicio marked this pull request as draft April 4, 2023 17:48

JoaoAparicio added 5 commits November 3, 2023 17:42

Merge branch 'main' into filtercolumns

c950862

format

9322cd5

typo

7f3ef77

Merge branch 'main' into filtercolumns

0fa6fba

update manual

bc9169e

JoaoAparicio marked this pull request as ready for review November 4, 2023 01:43
