
feat: optimizations for OSV data fetching #5569

@alex-ter

Description


Now that the basic functionality is working, it would be good to optimize the OSV data fetching to reduce disk space consumption and overall download volume.

Please see full context in my comments in #4956. Relevant excerpts below for convenience.

(one)
W.r.t. my previous comment about disk space/inodes consumed - it would be better to do this unpacking and CVE import per-file (i.e., per-ecosystem), and delete both the ZIP (as you do here already) and the unpacked files (as I propose for the process_data_from_disk()) as soon as a given ecosystem is done. That way you won't have more than one ZIP/set of JSONs on disk at any given time, decreasing the disk space usage significantly.

Originally posted by @alex-ter in #4956 (comment)
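The per-ecosystem flow proposed above could be sketched roughly as follows. This is a hypothetical illustration only: the helper names (`fetch_zip`, `process_json_dir`) are assumptions, not the project's actual API, and the real implementation would plug in its own download and import logic.

```python
# Hypothetical sketch: process one ecosystem at a time, deleting both the
# ZIP and the unpacked JSONs before moving on, so that at most one
# ecosystem's data occupies the disk at any given moment.
import tempfile
import zipfile
from pathlib import Path


def process_ecosystems(ecosystems, fetch_zip, process_json_dir):
    """Fetch, unpack, and import each ecosystem in turn.

    fetch_zip(eco) is assumed to download that ecosystem's archive
    (e.g. <eco>/all.zip) and return its local Path; process_json_dir(d)
    is assumed to import the unpacked JSON records from directory d.
    """
    for eco in ecosystems:
        zip_path = fetch_zip(eco)
        try:
            # TemporaryDirectory removes the unpacked JSONs on exit,
            # mirroring the cleanup proposed for process_data_from_disk()
            with tempfile.TemporaryDirectory() as tmp:
                with zipfile.ZipFile(zip_path) as zf:
                    zf.extractall(tmp)
                process_json_dir(Path(tmp))  # import this ecosystem's records
        finally:
            Path(zip_path).unlink()  # delete the ZIP immediately
```

With this shape, peak disk usage is bounded by the largest single ecosystem rather than the sum of all of them.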

(two)
Aside from the other comment about correctness, this will fetch full ecosystem files every time, which AFAICS is about a gigabyte of data (judging by the undocumented all.zip at the root of the hierarchy, I presume it has all of them). It will then unpack and process all of them as well, which is quite inefficient.

EDIT: while testing, the unpacked size is 7.1 GB and there are about 600K files in the cache. [...]

Originally posted by @alex-ter in #4956 (comment)
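One way to avoid re-downloading unchanged archives every run is a conditional GET keyed on the `ETag` from the previous fetch. Whether the host serving the OSV archives honors `If-None-Match` is an assumption here, and the function below is a minimal sketch, not the project's implementation:

```python
# Hypothetical sketch: skip re-downloading an ecosystem ZIP that has not
# changed since the last run, by replaying its saved ETag.
import json
import urllib.error
import urllib.request
from pathlib import Path


def fetch_if_changed(url: str, dest: Path, etag_store: Path) -> bool:
    """Download url to dest only if its ETag changed.

    Returns True if new data was written, False on 304 Not Modified.
    ETags are persisted as a small JSON map in etag_store.
    """
    etags = json.loads(etag_store.read_text()) if etag_store.exists() else {}
    req = urllib.request.Request(url)
    if url in etags:
        req.add_header("If-None-Match", etags[url])
    try:
        with urllib.request.urlopen(req) as resp:
            dest.write_bytes(resp.read())
            if resp.headers.get("ETag"):
                etags[url] = resp.headers["ETag"]
                etag_store.write_text(json.dumps(etags))
            return True
    except urllib.error.HTTPError as e:
        if e.code == 304:
            return False  # cached copy is still current; nothing written
        raise
```

Combined with per-ecosystem processing, this would cut both the download volume and the on-disk footprint for runs where most ecosystems are unchanged.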

Metadata


Assignees: no one assigned

Labels: enhancement (New feature or request)
