tarfile: memory exhaustion via oversized extended-header (GNU long name / pax) size field #151497

@iamsharduld

Bug report

Bug description

tarfile reads a member's extended header — a GNU long name/link (GNUTYPE_LONGNAME / GNUTYPE_LONGLINK) or a pax header (XHDTYPE / XGLTYPE) — with a single read sized directly by the header's size field:

https://github.com/python/cpython/blob/main/Lib/tarfile.py#L1427
https://github.com/python/cpython/blob/main/Lib/tarfile.py#L1483

buf = tarfile.fileobj.read(self._block(self.size))

self.size comes from the 12-byte size field of the extended-header member and is not validated against the data actually present. Plain octal encoding tops out at 8**11 - 1 (just under 8 GiB), but with base-256 encoding the field's 11 payload bytes can claim up to ~2**88 bytes. A crafted 512-byte archive therefore makes read() try to allocate a buffer of the claimed size (gigabytes or more). This happens on open / iterate (tarfile.open(...).getmembers()), before any extraction filter runs.
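As a small illustration of the size-field claim above, the following sketch builds a header that claims 1 TiB and inspects the raw field. It only uses TarInfo.tobuf and TarInfo.frombuf; the 124:136 slice is the size field's position in the ustar/GNU header layout, and the variable names are illustrative. Nothing large is allocated here, because no archive is opened.

```python
import tarfile

# Build a GNU long-name header whose size field claims 1 TiB.
ti = tarfile.TarInfo("A")
ti.type = tarfile.GNUTYPE_LONGNAME
ti.size = 2**40  # too large for 11 octal digits, so base-256 is used

header = ti.tobuf(format=tarfile.GNU_FORMAT)

# The size field occupies bytes 124..135 of the 512-byte header.
size_field = header[124:136]

# Base-256 fields start with 0x80; the remaining 11 bytes are a
# big-endian integer.
is_base256 = size_field[0] == 0x80
claimed = int.from_bytes(size_field[1:], "big")

# Parsing the header back yields the same claimed size; nothing checks
# it against the bytes that actually follow the header.
parsed = tarfile.TarInfo.frombuf(header, "utf-8", "surrogateescape")

print(len(header), is_base256, claimed == parsed.size == 2**40)
```

The header is still a single 512-byte block, which is why the claimed size and the file size can differ by many orders of magnitude.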

Reproducer

import os, resource, tarfile, tempfile

# A 512-byte tar whose single member is a GNU long-name header claiming ~1 GiB.
ti = tarfile.TarInfo("A")
ti.type = tarfile.GNUTYPE_LONGNAME
ti.size = 1_000_000_000          # claimed size; only the 512-byte header follows
data = ti.tobuf(format=tarfile.GNU_FORMAT)   # exactly 512 bytes

with tempfile.NamedTemporaryFile(suffix=".tar", delete=False) as f:
    f.write(data)
    path = f.name

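# ru_maxrss is reported in KiB on Linux (bytes on macOS), so the MiB
# conversion below assumes Linux.
before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss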
try:
    with tarfile.open(path, "r:") as t:
        t.getmembers()
except Exception as e:
    print(type(e).__name__, e)
after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS delta: {(after - before) / 1024:.0f} MiB from a {os.path.getsize(path)}-byte file")
os.unlink(path)

On current main this prints a peak RSS delta of ~950 MiB from a 512-byte file. The same applies to GNUTYPE_LONGLINK and to pax headers (XHDTYPE). Because the size field accepts base-256 encoding, a 512-byte file can claim, e.g., 1 TiB, raising MemoryError even on machines with plenty of RAM. The crafted header round-trips through TarInfo.frombuf, so it parses exactly like a normal archive.

Suggested fix

Read the extended-header bytes in bounded chunks instead of one read(size), so the claimed size can't force a huge up-front allocation. The returned bytes are unchanged for valid archives. I have a patch + regression tests ready and will open a PR.
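A minimal sketch of the chunked-read idea, for illustration only: this is not the patch mentioned above, and read_bounded and CHUNK_SIZE are hypothetical names chosen for the example. The point is that each read() asks for at most one chunk, so memory grows with the data actually present rather than with the claimed size.

```python
import io

# Hypothetical chunk size; any modest bound works for the illustration.
CHUNK_SIZE = 64 * 1024


def read_bounded(fileobj, size, chunk_size=CHUNK_SIZE):
    """Read up to `size` bytes from `fileobj` in chunks of at most
    `chunk_size`, stopping early at EOF.

    Returns the same bytes as fileobj.read(size) would, without ever
    requesting more than `chunk_size` bytes in a single call.
    """
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = fileobj.read(min(remaining, chunk_size))
        if not chunk:
            break  # EOF: the header claimed more data than exists
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


# A stream holding 100 bytes while the header "claims" 1 TB:
# only the 100 available bytes are read.
truncated = read_bounded(io.BytesIO(b"x" * 100), 10**12)
print(len(truncated))
```

A caller would still need to treat a short result as a truncated or malformed archive, as it would with a short read() today.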

CPython versions tested on

CPython main (also reproduces on the released branches).

Operating systems tested on

Linux

Labels: stdlib, type-bug
