Now that we have basic functionality working, it would be good to optimize it to reduce disk space consumption and download volume.
Please see the full context in my comments on #4956. Relevant excerpts below for convenience.
(one)
W.r.t. my previous comment about disk space/inodes consumed - it would be better to do this unpacking and CVE import per-file (i.e., per-ecosystem), and delete both the ZIP (as you do here already) and the unpacked files (as I propose for the process_data_from_disk()) as soon as a given ecosystem is done. That way you won't have more than one ZIP/set of JSONs on disk at any given time, decreasing the disk space usage significantly.
Originally posted by @alex-ter in #4956 (comment)
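The per-ecosystem flow described above could look roughly like the sketch below. `process_data_from_disk()` is the existing import step; `import_one_ecosystem` and the exact cleanup strategy are illustrative names, not the actual implementation. The point is that the ZIP is deleted as soon as it is unpacked, and the unpacked JSONs are deleted as soon as the ecosystem is imported, so at most one ecosystem's worth of data sits on disk at any time:

```python
import shutil
import zipfile
from pathlib import Path
from typing import Callable

def import_one_ecosystem(zip_path: Path, process: Callable[[Path], None]) -> None:
    """Unpack one ecosystem ZIP, run the import step, then clean up.

    `process` would be process_data_from_disk() in the real code.
    """
    unpack_dir = zip_path.with_suffix("")  # e.g. PyPI.zip -> PyPI/
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(unpack_dir)
    zip_path.unlink()              # the ZIP is no longer needed once unpacked
    try:
        process(unpack_dir)        # import this ecosystem's CVEs
    finally:
        shutil.rmtree(unpack_dir)  # free the unpacked JSONs before the next ecosystem
```

Looping `import_one_ecosystem` over the ecosystems (downloading each ZIP just before its iteration) keeps peak disk usage at one ZIP plus one unpacked tree, instead of all of them at once.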
(two)
Aside from the other comment about correctness, this will fetch full ecosystem files every time, which AFAICS is about a gigabyte of data (judging by the undocumented all.zip at the root of the hierarchy, I presume it has all of them). It will then unpack and process all of them as well, which is quite inefficient.
EDIT: while testing, the unpacked size is 7.1 GB and there are about 600K files in the cache. [...]
Originally posted by @alex-ter in #4956 (comment)
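For the repeated full downloads in excerpt (two), one option is conditional HTTP requests, so an ecosystem archive is only re-fetched when it has actually changed. This is a sketch under assumptions: `fetch_if_changed` and the ETag cache file are hypothetical names, and it presumes the server returns `ETag`/`304 Not Modified` (a `Last-Modified`/`If-Modified-Since` pair would work the same way):

```python
import urllib.error
import urllib.request
from pathlib import Path

def fetch_if_changed(url: str, dest: Path, etag_file: Path) -> bool:
    """Download `url` to `dest` only when its ETag differs from the cached one.

    Returns True when a fresh copy was written, False on 304 Not Modified.
    """
    request = urllib.request.Request(url)
    if etag_file.exists():
        # Ask the server to skip the body if nothing changed since last time.
        request.add_header("If-None-Match", etag_file.read_text())
    try:
        with urllib.request.urlopen(request) as resp:
            dest.write_bytes(resp.read())
            new_etag = resp.headers.get("ETag")
            if new_etag:
                etag_file.write_text(new_etag)
            return True
    except urllib.error.HTTPError as err:
        if err.code == 304:  # not modified; keep the cached archive
            return False
        raise
```

Combined with the per-ecosystem processing above, an unchanged ecosystem costs one conditional request instead of a full archive download and unpack.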