Bug report
Bug description
tarfile reads a member's extended header — a GNU long name/link (GNUTYPE_LONGNAME / GNUTYPE_LONGLINK) or a pax header (XHDTYPE / XGLTYPE) — with a single read sized directly by the header's size field:
https://github.com/python/cpython/blob/main/Lib/tarfile.py#L1427
https://github.com/python/cpython/blob/main/Lib/tarfile.py#L1483
buf = tarfile.fileobj.read(self._block(self.size))
self.size comes from the 12-byte size field of the extended-header member and is not validated against the data actually present. Via base-256 encoding it can claim up to ~2**88 bytes. A ~512-byte crafted archive therefore makes read() pre-allocate gigabytes — and this happens on open / iterate (tarfile.open(...).getmembers()), before any extraction filter runs.
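To make the size-field arithmetic above concrete, here is a minimal sketch of how a 12-byte size field decodes. `decode_size_field` is a hypothetical helper written for this illustration, not tarfile's own parser, and it ignores the negative-number marker (0xFF) that GNU tar also defines.

```python
def decode_size_field(field: bytes) -> int:
    """Decode a 12-byte tar size field (illustrative helper, not tarfile's API)."""
    if len(field) != 12:
        raise ValueError("size field must be 12 bytes")
    if field[0] == 0x80:
        # Base-256: one marker byte, then an 11-byte big-endian integer (88 bits).
        return int.from_bytes(field[1:], "big")
    # Otherwise: ASCII octal digits terminated by NUL or space.
    digits = field.rstrip(b"\0 ").decode("ascii")
    return int(digits or "0", 8)


octal_512 = b"00000001000\0"                            # octal 1000 == 512
base256_1tib = b"\x80" + (1 << 40).to_bytes(11, "big")  # claims 1 TiB
base256_max = b"\x80" + b"\xff" * 11                    # largest base-256 value

print(decode_size_field(octal_512))
print(decode_size_field(base256_1tib) == 1 << 40)
print(decode_size_field(base256_max) == 2**88 - 1)
```

Even without base-256, the plain octal form can claim up to 8**11 - 1 bytes (just under 8 GiB), which is already far more than a 512-byte file can hold.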
Reproducer
import os, resource, tarfile, tempfile

# A 512-byte tar whose single member is a GNU long-name header claiming ~1 GiB.
ti = tarfile.TarInfo("A")
ti.type = tarfile.GNUTYPE_LONGNAME
ti.size = 1_000_000_000  # claimed size; no data follows the 512-byte header
data = ti.tobuf(format=tarfile.GNU_FORMAT)  # exactly 512 bytes

with tempfile.NamedTemporaryFile(suffix=".tar", delete=False) as f:
    f.write(data)
    path = f.name

# ru_maxrss is reported in KiB on Linux, hence the / 1024 below.
before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
try:
    with tarfile.open(path, "r:") as t:
        t.getmembers()
except Exception as e:
    print(type(e).__name__, e)
after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS delta: {(after - before) / 1024:.0f} MiB from a {os.path.getsize(path)}-byte file")
os.unlink(path)
On current main this prints a peak RSS delta of ~950 MiB from a 512-byte file. The same applies to GNUTYPE_LONGLINK and to pax headers (XHDTYPE). Because the size field accepts base-256 encoding, a 512-byte file can claim, e.g., 1 TiB, raising MemoryError even on machines with plenty of RAM. The crafted header round-trips through TarInfo.frombuf, so it parses exactly like a normal archive.
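The round-trip claim can be checked without triggering the large read. The sketch below builds a 1 TiB variant of the header and only parses it with `TarInfo.frombuf`; the "utf-8" / "surrogateescape" arguments are an assumption chosen for the example, not something the report specifies.

```python
import tarfile

# Same crafted member as in the reproducer, but claiming 1 TiB.
ti = tarfile.TarInfo("A")
ti.type = tarfile.GNUTYPE_LONGNAME
ti.size = 1 << 40  # too large for 11 octal digits, so GNU format uses base-256

header = ti.tobuf(format=tarfile.GNU_FORMAT)

# Parse the header block only; no member data is read here.
parsed = tarfile.TarInfo.frombuf(header, "utf-8", "surrogateescape")

print(len(header))                              # size of the crafted file
print(header[124] == 0x80)                      # base-256 marker in the size field
print(parsed.size == 1 << 40)                   # claimed size survives parsing
print(parsed.type == tarfile.GNUTYPE_LONGNAME)
```

Nothing here allocates a large buffer: the allocation only happens later, when tarfile reads the long-name payload sized by `parsed.size`.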
Suggested fix
Read the extended-header bytes in bounded chunks instead of one read(size), so the claimed size can't force a huge up-front allocation. The returned bytes are unchanged for valid archives. I have a patch + regression tests ready and will open a PR.
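As a rough illustration of the bounded-chunk idea, and not the actual patch, the read could look like the sketch below. `read_bounded` and the 64 KiB `CHUNK_SIZE` are hypothetical names and values picked for this example.

```python
import io

CHUNK_SIZE = 64 * 1024  # hypothetical per-read cap, chosen for illustration


def read_bounded(fileobj, size, chunk_size=CHUNK_SIZE):
    """Read up to `size` bytes, never requesting more than `chunk_size` at once."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = fileobj.read(min(remaining, chunk_size))
        if not chunk:
            break  # stream ended before the claimed size was reached
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


# A stream holding 1000 bytes whose header claims about 1 TB:
# memory use tracks the data actually present, not the claim.
truncated = io.BytesIO(b"x" * 1000)
print(len(read_bounded(truncated, 10**12)))

# A well-formed stream returns the same bytes as a single read(size).
print(read_bounded(io.BytesIO(b"abcdef"), 4, chunk_size=3))
```

With this shape, the caller can compare the length of the result with the claimed size and treat a short result as a truncated archive.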
CPython versions tested on
CPython main (also reproduces on the released branches).
Operating systems tested on
Linux
Linked PRs