Add kwarg to filter columns #412
Currently we don't have the option to load just a subset of the columns.
This matters, e.g., when decompression is the bottleneck.
For example, create a compressed Arrow file:
```julia
using Arrow
p = tempname();
N = 1000000
tbl = (
a=rand(N),
b=rand(N),
c=rand(N),
d=rand(N),
e=rand(N),
f=[rand(rand(0:100)) for _ in 1:N],
);
Arrow.write(p, tbl; compress=:zstd);
```
Column `f` is the longest: it has an expected 50*N elements, versus N for each of the others.
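The 50*N figure follows from `rand(0:100)` drawing each row's length uniformly from 0 to 100, which averages 50. A minimal sketch to check this, drawing only the lengths rather than the vectors themselves (the names `lens`, `mean_len`, `total_f` and `ratio` are illustrative and not part of the PR):

```julia
using Statistics  # standard library, provides `mean`

N = 1_000_000

# Draw only the per-row lengths that `rand(rand(0:100))` would use,
# without allocating the inner vectors.
lens = rand(0:100, N)

mean_len = mean(lens)   # close to 50, the mean of a uniform draw from 0:100
total_f = sum(lens)     # total number of Float64 values column `f` would hold
ratio = total_f / N     # how many times more values than a scalar column

println((mean_len, ratio))
```

So `f` alone holds roughly ten times as many values as columns `a` through `e` combined, which is why it dominates the load time.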
Sometimes we only care about some of the other columns. Currently we must
decompress all columns regardless:
```julia
using BenchmarkTools
@btime tbl = Arrow.Table(p); # 359.205 ms (530 allocations: 794.23 MiB)
```
With this commit we can load only some of the columns:
```julia
@btime tbl = Arrow.Table(p; filtercolumns=["a"]); # 6.146 ms (231 allocations: 14.33 MiB)
```
|
Converting this to draft as I'm working on something that will supersede this. |
Codecov Report
@@ Coverage Diff @@
## main #412 +/- ##
==========================================
- Coverage 87.45% 85.78% -1.67%
==========================================
Files 26 26
Lines 3283 3356 +73
==========================================
+ Hits 2871 2879 +8
- Misses 412 477 +65
|
Does anyone wanna re-run CI? Looks like macos got stuck |
|
Done. |
|
Hi, what's the status of this PR? Would love to see what I can do @JoaoAparicio |
|
This would be a very important feature for us, too. |
|
for the API, |
|
We need to rebase on main to proceed with this.