
Conversation

@kwonkwonn
Contributor

resolves #7766 (BA-3744)

Implementation of BA-1040: a multiple-file ZIP downloader.

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue
  • Installer updates including:
    • Fixtures for db schema changes
    • New mandatory config options
  • Update of end-to-end CLI integration tests in ai.backend.test
  • API server-client counterparts (e.g., manager API -> client SDK)
  • Test case(s) to:
    • Demonstrate the difference of before/after
    • Demonstrate the flow of abstract/conceptual models with a concrete implementation
  • Documentation
    • Contents in the docs directory
    • docstrings in public interfaces and type annotations

Copilot AI review requested due to automatic review settings January 29, 2026 09:37
@github-actions github-actions bot added size:L 100~500 LoC comp:storage-proxy Related to Storage proxy component labels Jan 29, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR implements a new endpoint for downloading multiple files/directories as a single ZIP archive, addressing issue #7766 (BA-3744). The implementation introduces a streaming ZIP archive handler that leverages zipstream to avoid creating temporary files.

Changes:

  • Adds new ZipArchiveStreamReader class and ArchiveDownloadTokenData Pydantic model for JWT validation
  • Implements /folder/file/download-archive manager endpoint to generate download tokens for multiple files
  • Implements /download-archive client endpoint to stream requested files as a ZIP archive

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 16 comments.

File Description
src/ai/backend/storage/services/file_stream/zip.py New module containing ZIP streaming infrastructure with ZipArchiveStreamReader and ArchiveDownloadTokenData Pydantic model
src/ai/backend/storage/api/manager.py Adds create_archive_download_session endpoint and route for generating JWT tokens for multi-file downloads
src/ai/backend/storage/api/client.py Adds download_archive handler for streaming multiple files as ZIP using token-based authentication


Comment on lines +471 to +479
try:
    payload = jwt.decode(raw_token, secret, algorithms=["HS256"])
    token_data = ArchiveDownloadTokenData.model_validate(payload)
except jwt.ExpiredSignatureError as e:
    raise web.HTTPUnauthorized(reason="Token expired") from e
except jwt.InvalidTokenError as e:
    raise web.HTTPUnauthorized(reason="Invalid token") from e
except ValidationError as e:
    raise InvalidAPIParameters(extra_msg=str(e)) from e

Copilot AI Jan 29, 2026


Manual JWT validation is inconsistent with existing patterns in this file. The existing download and upload handlers use tx.JsonWebToken validator in the check_params context manager for automatic JWT decoding and validation (see lines 145-148, 309-312). This approach should be used here for consistency and better error handling.

Contributor Author

@kwonkwonn kwonkwonn Jan 30, 2026


The tx.JsonWebToken validator in check_params uses Trafaret-based validation.
Our ultimate goal is to migrate from Trafaret to Pydantic across the program, so it is better to implement this inline in the function for now and replace it with a Pydantic-based validator later.

But at least not in this PR.
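A rough sketch of the interim approach described here: decode the JWT manually with PyJWT, then validate the payload with a Pydantic model. The model and helper names below are illustrative stand-ins, not the actual codebase API.

```python
# Sketch: manual JWT decode followed by Pydantic payload validation.
# Names (TokenPayload, decode_token) are illustrative, not the real code.
import jwt  # PyJWT
from pydantic import BaseModel


class TokenPayload(BaseModel):
    operation: str
    files: list[str]


def decode_token(raw_token: str, secret: str) -> TokenPayload:
    # jwt.decode raises ExpiredSignatureError / InvalidTokenError on bad
    # tokens; pydantic.ValidationError surfaces schema mismatches separately.
    payload = jwt.decode(raw_token, secret, algorithms=["HS256"])
    return TokenPayload.model_validate(payload)
```

Keeping the decode and the schema validation as separate steps preserves the distinct error responses (401 for token problems, 400 for schema problems) used in this PR.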

Comment on lines 485 to 490
sanitized: list[Path] = []
for relpath in token_data.files:
    file_path = volume.sanitize_vfpath(token_data.vfolder_id, relpath)
    # Ensure the file_path is within the vfolder_root boundary
    if not file_path.exists():
        raise web.HTTPNotFound(reason=f"File not found: {relpath}")

Copilot AI Jan 29, 2026


The implementation doesn't handle duplicate paths in the input list. If the same file or directory appears multiple times in token_data.files, it will be added to the zip archive multiple times with the same arcname, which may cause issues or confusion. Consider deduplicating the file paths before processing, or at least documenting this behavior.

Suggested change
sanitized: list[Path] = []
for relpath in token_data.files:
    file_path = volume.sanitize_vfpath(token_data.vfolder_id, relpath)
    # Ensure the file_path is within the vfolder_root boundary
    if not file_path.exists():
        raise web.HTTPNotFound(reason=f"File not found: {relpath}")
sanitized: list[Path] = []
seen_paths: set[Path] = set()
for relpath in token_data.files:
    file_path = volume.sanitize_vfpath(token_data.vfolder_id, relpath)
    # Ensure the file_path is within the vfolder_root boundary
    if not file_path.exists():
        raise web.HTTPNotFound(reason=f"File not found: {relpath}")
    if file_path in seen_paths:
        continue
    seen_paths.add(file_path)

Contributor Author

@kwonkwonn kwonkwonn Jan 30, 2026


I will add a comment on this one, but we need to decide which part should be in charge of detecting duplicated files in the list.
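Whichever layer ends up owning the check (manager or storage-proxy), an order-preserving deduplication of the requested paths could look like this. The function and parameter names are illustrative; "relpaths" stands in for token_data.files.

```python
# Order-preserving deduplication of requested archive entries.
def dedup_relpaths(relpaths: list[str]) -> list[str]:
    # dict preserves insertion order, so each duplicate keeps the
    # position of its first occurrence in the archive listing.
    return list(dict.fromkeys(relpaths))
```

Deduplicating on the raw relative paths catches exact repeats; paths that alias the same file through different spellings would still need the resolved-path check discussed in the review suggestion above.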

    raise web.HTTPUnauthorized(reason="Invalid token") from e
except ValidationError as e:
    raise InvalidAPIParameters(extra_msg=str(e)) from e
# TODO: We need to implement a pydantic JWT decoder to avoid this manual validation, including extracting the secret key from ctx.

Copilot AI Jan 29, 2026


The TODO comment indicates that the current implementation is incomplete and needs to be refactored. Since the codebase already has a tx.JsonWebToken validator that handles JWT decoding with secret extraction from context, this manual approach should be replaced with the existing pattern before merging.

Contributor Author


It depends on Trafaret-based validation.

Collaborator


Please write it based on Pydantic this time.

@lablup lablup deleted a comment from Copilot AI Jan 29, 2026
@kwonkwonn kwonkwonn force-pushed the feat/multiple-file-stream-zipper branch from 2660427 to 0f00acc Compare January 30, 2026 08:28
        vfid: VFolderID
        files: list[str]

    async with cast(
Contributor Author


I was thinking of wrapping the API handlers with Pydantic, but it seems better to leave that for a future PR.

raw_token = request.query.get("token")
if not raw_token:
    raise InvalidAPIParameters(extra_msg="Missing 'token' query parameter")
try:
Contributor Author


I couldn't find a Pydantic wrapper to automate validation and error handling.

Contributor Author


This will be done in a follow-up issue, since the storage component needs to be migrated first.

Collaborator


You can use the api_handler decorator, which was also used in manager.

@kwonkwonn kwonkwonn requested a review from fregataa January 30, 2026 08:55
@kwonkwonn
Contributor Author

We can add three tests.

  • jwt generator: unit test with request -> response validation
  • download_archive: mock the StreamReader and file structure; verify how it detects files and how it responds (unit test?)
  • StreamReader: needs at least a component test with a running server; manual testing for now
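The first test in the list above (the JWT generator round-trip) could be sketched like this, assuming PyJWT. The payload fields mirror the token data in this PR; the secret and test name are illustrative.

```python
# Sketch of a JWT generator round-trip unit test, assuming PyJWT.
import jwt


def test_token_round_trip() -> None:
    payload = {"operation": "download", "volume": "vol1", "files": ["a.txt", "b/"]}
    token = jwt.encode(payload, "test-secret", algorithm="HS256")
    decoded = jwt.decode(token, "test-secret", algorithms=["HS256"])
    # The decoded claims must match what the generator signed.
    assert decoded == payload
```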

    raise InvalidAPIParameters(extra_msg=f"Path escapes vfolder boundary: {relpath}")
if not file_path.exists():
    raise web.HTTPNotFound(reason=f"File not found: {relpath}")

Contributor Author


sanitize_vfpath calls the mangle function, which translates the vfid into an absolute path; there is no need to use the entire function here.
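The boundary check being discussed can be reduced to just the path part, independent of the vfid mangling: resolve the candidate and verify it stays under the vfolder root. This is a minimal stdlib sketch, not the actual sanitize_vfpath API; the names are illustrative.

```python
# Sketch of a vfolder boundary check using only pathlib.
from pathlib import Path


def resolve_within(root: Path, relpath: str) -> Path:
    # Joining an absolute relpath replaces root entirely, and ".." segments
    # are collapsed by resolve(), so both escape routes are caught below.
    candidate = (root / relpath).resolve()
    if not candidate.is_relative_to(root.resolve()):  # Python 3.9+
        raise ValueError(f"Path escapes vfolder boundary: {relpath}")
    return candidate
```

Note that resolve() also follows symlinks, so a symlink inside the vfolder pointing outside it would be rejected as well.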

raw_token = request.query.get("token")
if not raw_token:
    raise InvalidAPIParameters(extra_msg="Missing 'token' query parameter")
try:
Collaborator


You can use the api_handler decorator, which was also used in manager.

    raise web.HTTPUnauthorized(reason="Invalid token") from e
except ValidationError as e:
    raise InvalidAPIParameters(extra_msg=str(e)) from e
# TODO: We need to implement a pydantic JWT decoder to avoid this manual validation, including extracting the secret key from ctx.
Collaborator


Please write it based on Pydantic this time.

Comment on lines +1092 to +1110
async def create_archive_download_session(request: web.Request) -> web.Response:
    class Params(TypedDict):
        volume: str
        vfid: VFolderID
        files: list[str]

    async with cast(
        AbstractAsyncContextManager[Params],
        check_params(
            request,
            t.Dict(
                {
                    t.Key("volume"): t.String(),
                    t.Key("vfid"): tx.VFolderID(),
                    t.Key("files"): t.List(t.String, min_length=1),
                },
            ),
        ),
    ) as params:
Collaborator


There's no reason to keep this implementation. Please define it based on pydantic.

Comment on lines +1113 to +1118
token_payload = ArchiveDownloadTokenData(
    operation="download",
    volume=params["volume"],
    vfolder_id=params["vfid"],
    files=params["files"],
)
Collaborator


Defining a Pydantic model but validating first with Trafaret is highly unusual.

Comment on lines +30 to +37
class ArchiveDownloadTokenData(BaseModel):
    """Pydantic model for validating the JWT payload of archive download tokens."""

    operation: Literal["download"]
    volume: str
    vfolder_id: VFolderIDField
    files: list[str] = Field(min_length=1)

Collaborator


Since the request must also be made in the manager, the values to be passed as DTOs should be defined in common/dto/storage.
And please define values like operation as an Enum rather than a Literal.
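An Enum-based version of the DTO suggested here might look like the sketch below. The field names follow the PR; the Enum name, DTO name, and the omission of the internal vfolder_id field (VFolderIDField is codebase-specific) are assumptions for illustration.

```python
# Sketch of the suggested DTO: operation as an Enum, not a Literal.
# vfolder_id is omitted here because VFolderIDField is internal to the codebase.
import enum

from pydantic import BaseModel, Field


class ArchiveOperation(str, enum.Enum):
    DOWNLOAD = "download"


class ArchiveDownloadTokenDTO(BaseModel):
    operation: ArchiveOperation
    volume: str
    files: list[str] = Field(min_length=1)
```

Subclassing str keeps the Enum JSON-serializable, so the DTO round-trips through the JWT payload without a custom encoder.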

Comment on lines +113 to +136
async def stream_archive_response(
    request: web.Request,
    reader: StreamReader,
    filename: str,
) -> web.StreamResponse:
    """Stream a StreamReader's output as an HTTP attachment response.

    This helper is format-agnostic: it works with any StreamReader (ZIP, tar, etc.).
    """
    ascii_filename = filename.encode("ascii", errors="ignore").decode("ascii").replace('"', r"\"")
    encoded_filename = urllib.parse.quote(filename, encoding="utf-8")
    response = web.StreamResponse(
        headers={
            hdrs.CONTENT_TYPE: reader.content_type() or "application/octet-stream",
            hdrs.CONTENT_DISPOSITION: " ".join([
                f'attachment;filename="{ascii_filename}";',  # RFC-2616 sec2.2
                f"filename*=UTF-8''{encoded_filename}",  # RFC-5987
            ]),
        },
    )
    await response.prepare(request)
    async for chunk in reader.read():
        await response.write(chunk)
    return response
Collaborator


Please ensure API requests are not intertwined here. The service layer should only return a StreamReader, and we want it to use only the values returned from the API layer.
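A miniature of the layering this review asks for: the service layer exposes only an async byte stream, with no HTTP objects in sight, and an API-layer function assembles the response from it. All names here are illustrative stand-ins for the real aiohttp handler and StreamReader.

```python
# Sketch of service/API layer separation for streamed downloads.
import asyncio
from typing import AsyncIterator


async def archive_stream(files: list[str]) -> AsyncIterator[bytes]:
    # Service layer: knows about the archive format, not about HTTP.
    for name in files:
        yield f"<zip entry: {name}>".encode()


async def build_response_body(files: list[str]) -> bytes:
    # API-layer stand-in: drains the reader, as the aiohttp handler would
    # do with response.write() after response.prepare().
    chunks = [chunk async for chunk in archive_stream(files)]
    return b"".join(chunks)
```

With this split, the service can be unit-tested by draining the generator directly, without spinning up an HTTP server.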


Labels

comp:storage-proxy Related to Storage proxy component size:L 100~500 LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Storage Proxy: Implement multiple files zip streaming API

3 participants