Thanks for this, @jayckaiser - exciting work. I'll dig into it, test, etc. probably the week after next. In the meantime, I want to ask/clarify two things:
1. I'm confused about which is faster... are you saying that with your PyArrow changes/optimizations, the runtime is slower on 3.12? (This seems counterintuitive to me.) Or is it the other way around, and it's faster on 3.12?
2. Suppose earthmover reads in a JSONL file containing a line/row/payload like (un-linearized)
{
  "field": {
    "some": {
      "deeply": {
        "nested": {
          "property": "value"
        }
      }
    }
  }
}
Are you saying that
(I'd really like to avoid changes to earthmover that would require changes to projects'
@jayckaiser I ran a test on a large dataset. With Python 3.10:
$ python3 -V
Python 3.10.12
$ /usr/bin/time -v earthmover run -c big_earthmover.yaml
2024-10-18 10:38:33.953 earthmover INFO starting...
2024-10-18 10:38:34.024 earthmover INFO skipping hashing and run-logging (no `state_file` defined in config)
2024-10-18 11:59:51.878 earthmover INFO done!
User time (seconds): 3684.89
System time (seconds): 81.00
Percent of CPU this job got: 77%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:21:19
Maximum resident set size (kbytes): 1347476
...
(So 1hr 21min, max 1.3GB memory used. This was with a 3.2GB input TSV file, producing a 28GB JSONL file.)

With Python 3.12:
$ python3 -V
Python 3.12.5
$ /usr/bin/time -v earthmover run -c big_earthmover.yaml
2024-10-18 12:22:34.749 earthmover INFO starting...
2024-10-18 12:22:34.813 earthmover INFO skipping hashing and run-logging (no `state_file` defined in config)
2024-10-18 14:16:41.572 earthmover INFO done!
User time (seconds): 5460.58
System time (seconds): 127.11
Percent of CPU this job got: 81%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:54:09
Maximum resident set size (kbytes): 1722360
...
So 1hr 54min (about 40% longer), max 1.7GB memory used (about 28% more). This confirms your result on a large dataset: slower and less memory efficient under Python 3.12 with PyArrow strings 😢.
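
For anyone who wants to recompute the relative differences, here is a small sketch that derives them from the `/usr/bin/time -v` figures quoted above. The helper names `to_seconds` and `pct_change` are just illustrative; only the elapsed times and maximum resident set sizes come from the two runs.

```python
def to_seconds(elapsed: str) -> float:
    """Convert an 'h:mm:ss' or 'm:ss' string from /usr/bin/time into seconds."""
    seconds = 0.0
    for part in elapsed.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


# Figures copied from the two runs above.
py310 = {"elapsed": "1:21:19", "max_rss_kb": 1347476}
py312 = {"elapsed": "1:54:09", "max_rss_kb": 1722360}

wall = pct_change(to_seconds(py310["elapsed"]), to_seconds(py312["elapsed"]))
rss = pct_change(py310["max_rss_kb"], py312["max_rss_kb"])

print(f"wall clock: {wall:+.1f}%")
print(f"max RSS:    {rss:+.1f}%")
```

Comparing the raw seconds and kilobytes rather than the rounded "1hr 21min" and "1.3GB" values avoids compounding rounding error in the percentages.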
This is a research branch trying to resolve the lack of PyArrow strings being used in Python 3.12. There are a few main changes:
- read_csv() from str (i.e., Python strings) to "string" (i.e., default string type).
- FileSource.execute().

That second change causes breakages when using nested JSON data (since we are no longer using generic object datatypes). The following fixes are required:
- fromjson() Jinja macro to use ast.literal_eval() when JSON has single-quotes.
- fromjson() be applied to Jinja templating in YAML when retrieving nested fields.

Some observations:
- earthmover -t.
- pyarrow backend, since this is turned on by default when available and raises an error in 3.8 otherwise.

Please let me know your thoughts.
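
To illustrate the single-quote issue mentioned in the fixes, here is a minimal Python sketch. It assumes the nested value reaches the template as the string form of a Python dict, which is an assumption, since the branch's actual code path is not shown here. `parse_nested` is a hypothetical helper for illustration, not earthmover's `fromjson()` macro.

```python
import ast
import json


def parse_nested(text: str):
    """Hypothetical helper: parse strict JSON, else fall back to a Python literal."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Single-quoted text such as "{'a': 1}" is not valid JSON, but it is a
        # valid Python literal, which ast.literal_eval() evaluates safely.
        return ast.literal_eval(text)


nested = {"field": {"some": {"deeply": {"nested": {"property": "value"}}}}}

as_json = json.dumps(nested)  # double-quoted, valid JSON
as_repr = str(nested)         # single-quoted Python repr, not valid JSON

assert parse_nested(as_json) == nested
assert parse_nested(as_repr) == nested
print(parse_nested(as_repr)["field"]["some"]["deeply"]["nested"]["property"])
```

One caveat with this kind of fallback: `ast.literal_eval()` only understands Python literals, so a single-quoted payload that also contains JSON's `true`, `false`, or `null` would still fail to parse.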