@JR-1991
Member

@JR-1991 JR-1991 commented Sep 24, 2025

This pull request addresses issue #40. Previously, the implementation opened a file handler for every file during File initialization, so Python raised an OSError ("too many open files") as soon as the number of open files exceeded the operating system's limit.

This PR refactors file handler management across several modules to improve reliability and consistency when accessing file data. The main change introduces a new get_handler method in the File class to standardize how file handlers are obtained, ensuring files are opened only when needed and properly closed after use. This update affects file reading, uploading, packaging, and checksum operations.

File handler management and API consistency:

  • Added a get_handler method to the File class in dvuploader/file.py, which opens the file if the handler is not already set, centralizing file handler access logic.
  • Updated all usages of direct file.handler access in dvuploader/directupload.py, dvuploader/nativeupload.py, and dvuploader/packaging.py to use file.get_handler(), ensuring consistent file opening and closing behavior.

Checksum and file reading improvements:

  • Refactored update_checksum_chunked in dvuploader/file.py to use get_handler() for reading file data, and added logic to reset or close handlers depending on how they were obtained.
  • Removed the assertion that required self.handler to be initialized before updating checksum, allowing for just-in-time file opening.
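
The lazy-opening behavior described in the bullets above can be sketched roughly as follows. This is a minimal illustration and not the actual dvuploader code: the class name LazyFile, the blocksize parameter, the choice of SHA-256, and the decision not to cache the freshly opened handler are all assumptions made for the example.

```python
import hashlib
from typing import IO, Optional


class LazyFile:
    """Hypothetical stand-in for the File class; not the real implementation."""

    def __init__(self, filepath: str) -> None:
        self.filepath = filepath
        # No file is opened at construction time.
        self.handler: Optional[IO[bytes]] = None
        self.checksum: Optional[str] = None

    def get_handler(self) -> IO[bytes]:
        # Reuse a handler that was set explicitly, otherwise open just in time.
        if self.handler is not None:
            return self.handler
        return open(self.filepath, "rb")

    def update_checksum_chunked(self, blocksize: int = 2**20) -> str:
        preset = self.handler is not None
        handler = self.get_handler()
        digest = hashlib.sha256()  # assumed algorithm, chosen for the example
        try:
            for chunk in iter(lambda: handler.read(blocksize), b""):
                digest.update(chunk)
        finally:
            if preset:
                handler.seek(0)   # caller-owned handler: rewind, keep it open
            else:
                handler.close()   # handler opened here: release the descriptor
        self.checksum = digest.hexdigest()
        return self.checksum
```

With this shape, constructing thousands of LazyFile objects holds no file descriptors; one descriptor is held only while a single file is being read.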

Closes issues

Introduces a get_handler method to manage file handler initialization and usage. Updates checksum chunked reading to use get_handler, ensuring proper resource management by seeking or closing handlers as appropriate. Removes direct handler assignment during size calculation for improved separation of concerns.
Replaced direct access to file.handler with file.get_handler() in directupload.py, nativeupload.py, and packaging.py for improved encapsulation and consistency. Also made minor formatting improvements in nativeupload.py.
@JR-1991 JR-1991 requested a review from Copilot September 24, 2025 09:11
@JR-1991 JR-1991 self-assigned this Sep 24, 2025
@JR-1991 JR-1991 added the bug Something isn't working label Sep 24, 2025
@JR-1991 JR-1991 linked an issue Sep 24, 2025 that may be closed by this pull request

Copilot AI left a comment

Pull Request Overview

This pull request addresses issue #40 by implementing just-in-time file opening to prevent OSError when the number of open files exceeds OS limits. The main change introduces a new get_handler() method that opens files only when needed rather than during initialization.

  • Refactors file handler management across multiple modules to use lazy file opening
  • Introduces standardized get_handler() method in the File class for consistent file access
  • Updates checksum operations to properly handle both pre-initialized and just-in-time opened handlers

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| dvuploader/file.py | Adds get_handler() method and removes automatic file opening during initialization |
| dvuploader/packaging.py | Updates zip file creation to use get_handler() instead of direct handler access |
| dvuploader/nativeupload.py | Replaces direct handler access with get_handler() for file uploads |
| dvuploader/directupload.py | Updates direct upload functionality to use get_handler() method |

@JR-1991 JR-1991 marked this pull request as ready for review September 24, 2025 10:59
@JR-1991 JR-1991 moved this to Ready for Review in PyDataverse Working Group Sep 24, 2025
@JonathanHungerland

Hi @JR-1991,
thank you for looking into this and sorry for taking so long to get back to you.

I first tested without any batching of files (I'm trying to upload 1290 files); this gave me an (unrelated) error:
File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 118, in map_httpcore_exceptions
raise mapped_exc(message) from exc
httpx.ReadTimeout: The read operation timed out

With batching (maximum of 20 files per dvuploader.upload() instance), things went quite smoothly. There were two non-fatal problems in between:

Error in batch 19: RetryError[<Future at 0x7b70f026bce0 state=finished raised ValueError>]
Uploading batch 20 of 65

╭──────── DVUploader ─────────╮
│ Server: https://dare.uol.de
│ PID: doi:10.57782/AGJV89 │
│ Files: 20 │
╰─────────────────────────────╯
Error in batch 20: The read operation timed out

After the read timeout, batch 21 began, which showed a similar error:
Zip package of 20 files ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:--
Error in batch 21: RetryError[<Future at 0x7b70f05e5ee0 state=finished raised HTTPStatusError>]

The "read operation timed out" errors occurred for a few batches in a row without any indication of a successful upload. I then decided to stop the process and simply restarted it. The same errors appeared. I tried reducing the batch size to 4 and setting PARALLEL_UPLOADS to 1, but the "read operation timed out" errors continued.

The wall-time for read operation might be a server-based setting (i.e. not an issue of DVUploader). So while there remain some issues for my case, I did not encounter the "too many open files" problem.

Reorders and cleans up import statements for consistency. Updates httpx timeout configuration to use explicit Timeout object with all parameters set to None for improved clarity.
@JR-1991
Member Author

JR-1991 commented Oct 22, 2025

@JonathanHungerland, thanks for getting back and testing the PR! I believe the issue might be that the DvUploader handles timeouts too strictly, so the retries are exhausted and a general error is raised. I’ve updated the PR to explicitly set a timeout of None to prevent it from failing due to a slow connection. Can you test it again with the updated code?

@JonathanHungerland

Unfortunately I still get
Error in batch 2: The read operation timed out
for every single batch, which I suppose is because my dataset already contains a large number of files. I don't get any httpx-related error anymore, though.

@JonathanHungerland

JonathanHungerland commented Oct 22, 2025

I now tried uploading all my files as a single zip file. When I used replace_existing = True, I got:

Traceback (most recent call last):
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 101, in map_httpcore_exceptions
    yield
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 250, in handle_request
    resp = self._pool.handle_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpcore/_sync/connection_pool.py", line 256, in handle_request
    raise exc from None
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpcore/_sync/connection_pool.py", line 236, in handle_request
    response = connection.handle_request(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpcore/_sync/connection.py", line 103, in handle_request
    return self._connection.handle_request(request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpcore/_sync/http11.py", line 136, in handle_request
    raise exc
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpcore/_sync/http11.py", line 106, in handle_request
    ) = self._receive_response_headers(**kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpcore/_sync/http11.py", line 177, in _receive_response_headers
    event = self._receive_event(timeout=timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpcore/_sync/http11.py", line 217, in _receive_event
    data = self._network_stream.read(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpcore/_backends/sync.py", line 126, in read
    with map_exceptions(exc_map):
  File "/usr/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.ReadTimeout: The read operation timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/qblstorage/guests/jiate/upload_to_dataverse.py", line 21, in <module>
    dvuploader.upload(
  File "/users/student/zaaf8531/python-dvuploader/dvuploader/dvuploader.py", line 103, in upload
    self._check_duplicates(
  File "/users/student/zaaf8531/python-dvuploader/dvuploader/dvuploader.py", line 213, in _check_duplicates
    ds_files = retrieve_dataset_files(
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/dvuploader/utils.py", line 61, in retrieve_dataset_files
    response = httpx.get(
               ^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_api.py", line 195, in get
    return request(
           ^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_api.py", line 109, in request
    return client.request(
           ^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_client.py", line 825, in request
    return self.send(request, auth=auth, follow_redirects=follow_redirects)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_client.py", line 914, in send
    response = self._send_handling_auth(
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_client.py", line 942, in _send_handling_auth
    response = self._send_handling_redirects(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_client.py", line 979, in _send_handling_redirects
    response = self._send_single_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_client.py", line 1014, in _send_single_request
    response = transport.handle_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 249, in handle_request
    with map_httpcore_exceptions():
  File "/usr/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 118, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ReadTimeout: The read operation timed out

When I instead set replace_existing=False, then I get a similar yet slightly different error message:

Traceback (most recent call last):
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 101, in map_httpcore_exceptions
    yield
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 394, in handle_async_request
    resp = await self._pool.handle_async_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpcore/_async/connection_pool.py", line 256, in handle_async_request
    raise exc from None
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpcore/_async/connection_pool.py", line 236, in handle_async_request
    response = await connection.handle_async_request(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpcore/_async/connection.py", line 103, in handle_async_request
    return await self._connection.handle_async_request(request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpcore/_async/http11.py", line 136, in handle_async_request
    raise exc
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpcore/_async/http11.py", line 106, in handle_async_request
    ) = await self._receive_response_headers(**kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpcore/_async/http11.py", line 177, in _receive_response_headers
    event = await self._receive_event(timeout=timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpcore/_async/http11.py", line 231, in _receive_event
    raise RemoteProtocolError(msg)
httpcore.RemoteProtocolError: Server disconnected without sending a response.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/qblstorage/guests/jiate/upload_to_dataverse.py", line 21, in <module>
    dvuploader.upload(
  File "/users/student/zaaf8531/python-dvuploader/dvuploader/dvuploader.py", line 140, in upload
    asyncio.run(
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/nest_asyncio.py", line 30, in run
    return loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/nest_asyncio.py", line 98, in run_until_complete
    return f.result()
           ^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/futures.py", line 203, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/usr/lib/python3.12/asyncio/tasks.py", line 316, in __step_run_and_handle_result
    result = coro.throw(exc)
             ^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/dvuploader/nativeupload.py", line 142, in native_upload
    responses = await asyncio.gather(*tasks)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/tasks.py", line 385, in __wakeup
    future.result()
  File "/usr/lib/python3.12/asyncio/tasks.py", line 314, in __step_run_and_handle_result
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 189, in async_wrapped
    return await copy(fn, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 111, in __call__
    do = await self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 153, in iter
    result = await action(retry_state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/tenacity/_utils.py", line 99, in inner
    return call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/tenacity/__init__.py", line 398, in <lambda>
    self._add_action_func(lambda rs: rs.outcome.result())
                                     ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 114, in __call__
    result = await fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/dvuploader/nativeupload.py", line 296, in _single_native_upload
    response = await session.post(
               ^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_client.py", line 1859, in post
    return await self.request(
           ^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_client.py", line 1540, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_client.py", line 1629, in send
    response = await self._send_handling_auth(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_client.py", line 1657, in _send_handling_auth
    response = await self._send_handling_redirects(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_client.py", line 1694, in _send_handling_redirects
    response = await self._send_single_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_client.py", line 1730, in _send_single_request
    response = await transport.handle_async_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 393, in handle_async_request
    with map_httpcore_exceptions():
  File "/usr/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 118, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.RemoteProtocolError: Server disconnected without sending a response.

@JR-1991
Member Author

JR-1991 commented Oct 22, 2025

@JonathanHungerland thanks for getting back so quickly. Regarding the single ZIP file, I have run into this a couple of times as well: Dataverse cuts the connection, quite possibly because it is overwhelmed. If I may ask, how big is the final ZIP file and for how long did the upload run?

Regarding "Error in batch 2: The read operation timed out", could you paste the complete trace? With the timeout set to None, the read operation in httpx should actually run indefinitely. This looks like the timeout is still in effect.

@JonathanHungerland

JonathanHungerland commented Oct 22, 2025

The single ZIP file is 12 GB in size. The upload did not complete (see the errors above), but the errors came quite quickly (within a few seconds).

For "Error in batch 2: The read operation timed out", that was actually the complete trace. There was nothing else being reported.

Maybe related: the DARE website says that dataset currently has status "Edit in progress" and it seems to be stuck like this. Can't fix it myself, but I approached our support.

@JonathanHungerland

JonathanHungerland commented Oct 22, 2025

Sorry, I was stupid. My try: ... except: ... block got rid of the full traceback. So the full traceback for "the read operation timed out" is:

Traceback (most recent call last):
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 101, in map_httpcore_exceptions
    yield
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 250, in handle_request
    resp = self._pool.handle_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpcore/_sync/connection_pool.py", line 256, in handle_request
    raise exc from None
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpcore/_sync/connection_pool.py", line 236, in handle_request
    response = connection.handle_request(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpcore/_sync/connection.py", line 103, in handle_request
    return self._connection.handle_request(request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpcore/_sync/http11.py", line 136, in handle_request
    raise exc
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpcore/_sync/http11.py", line 106, in handle_request
    ) = self._receive_response_headers(**kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpcore/_sync/http11.py", line 177, in _receive_response_headers
    event = self._receive_event(timeout=timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpcore/_sync/http11.py", line 217, in _receive_event
    data = self._network_stream.read(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpcore/_backends/sync.py", line 126, in read
    with map_exceptions(exc_map):
  File "/usr/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.ReadTimeout: The read operation timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/qblstorage/guests/jiate/upload_to_dataverse.py", line 35, in <module>
    dvuploader.upload(
  File "/users/student/zaaf8531/python-dvuploader/dvuploader/dvuploader.py", line 103, in upload
    self._check_duplicates(
  File "/users/student/zaaf8531/python-dvuploader/dvuploader/dvuploader.py", line 213, in _check_duplicates
    ds_files = retrieve_dataset_files(
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/dvuploader/utils.py", line 61, in retrieve_dataset_files
    response = httpx.get(
               ^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_api.py", line 195, in get
    return request(
           ^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_api.py", line 109, in request
    return client.request(
           ^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_client.py", line 825, in request
    return self.send(request, auth=auth, follow_redirects=follow_redirects)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_client.py", line 914, in send
    response = self._send_handling_auth(
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_client.py", line 942, in _send_handling_auth
    response = self._send_handling_redirects(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_client.py", line 979, in _send_handling_redirects
    response = self._send_single_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_client.py", line 1014, in _send_single_request
    response = transport.handle_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 249, in handle_request
    with map_httpcore_exceptions():
  File "/usr/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 118, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ReadTimeout: The read operation timed out
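
A batching loop that keeps going after a failed batch without swallowing the traceback could look roughly like this. It is only a sketch: upload_batch is a hypothetical callable standing in for whatever performs one upload call per batch, and the batch size of 20 simply mirrors the number mentioned earlier in the thread.

```python
import logging
import traceback
from typing import Callable, Iterator, List, Sequence, Tuple

logger = logging.getLogger("upload")


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    # Yield consecutive slices of at most `size` items.
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def upload_in_batches(
    files: Sequence[str],
    upload_batch: Callable[[List[str]], None],  # hypothetical uploader
    batch_size: int = 20,
) -> List[Tuple[int, str]]:
    failures: List[Tuple[int, str]] = []
    for number, batch in enumerate(chunked(files, batch_size), start=1):
        try:
            upload_batch(batch)
        except Exception:
            # logger.exception logs the message together with the full
            # traceback, so the origin of the error is not lost.
            logger.exception("Error in batch %d", number)
            failures.append((number, traceback.format_exc()))
    return failures
```

The returned list pairs each failed batch number with its formatted traceback, which makes it easy to paste the complete trace into a report afterwards.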

@JR-1991
Member Author

JR-1991 commented Oct 22, 2025

@JonathanHungerland, thanks for sharing this. It’s quite interesting! The actual error doesn’t originate from the upload, but rather from the initial duplication check. Based on the details you’ve provided, I believe this issue is related to the dataset lock. This lock might have occurred due to tab ingestion or unzipping/registering your data. Depending on the size of the data, this process can take some time. So, it’s possible that the lock will automatically resolve in the future.

In general, uploading 12GB of data via the native upload method is always a risky endeavor, as the HTTP connection can be quite fragile. If you have the opportunity, I strongly recommend using the direct upload feature. This feature is much more robust and can be enabled by your Dataverse Collection Admin.

To resolve this issue, I’ll conduct an adjacent test on our local instance here in Stuttgart. This will help us rule out any systematic errors in the code and provide a more realistic test environment compared to a simple CI or local tests. I’ll try to squeeze this into the week and get back to you as soon as I have more information.

@JonathanHungerland

JonathanHungerland commented Oct 27, 2025

The dataset has been "unlocked", I removed all files and tried uploading again. Already in my first batch I received an error:

('File TDDFT/rpc/prot_sol/stb_anly.tar.gz not found in Dataverse repository.', 'This may be due to the file not being uploaded to the repository:')
('File CDFT-CI/dark_state/ds_set2.tar.gz not found in Dataverse repository.', 'This may be due to the file not being uploaded to the repository:')
('File CDFT-CI/dark_state/ds_bun4.tar.gz not found in Dataverse repository.', 'This may be due to the file not being uploaded to the repository:')
Zip package of 9 files ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Zip package of 4 files ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Zip package of 4 files ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
ds_bun4.tar.gz         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Traceback (most recent call last):
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 114, in __call__
    result = await fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/dvuploader/nativeupload.py", line 462, in _update_single_metadata
    raise ValueError(f"Failed to update metadata for file {file.file_name}.")
ValueError: Failed to update metadata for file ds_set1.tar.gz.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/qblstorage/guests/jiate/upload_to_dataverse.py", line 35, in <module>
    dvuploader.upload(
  File "/users/student/zaaf8531/python-dvuploader/dvuploader/dvuploader.py", line 140, in upload
    asyncio.run(
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/nest_asyncio.py", line 30, in run
    return loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/nest_asyncio.py", line 98, in run_until_complete
    return f.result()
           ^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/futures.py", line 203, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/usr/lib/python3.12/asyncio/tasks.py", line 314, in __step_run_and_handle_result
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/dvuploader/nativeupload.py", line 145, in native_upload
    await _update_metadata(
  File "/users/student/zaaf8531/python-dvuploader/dvuploader/nativeupload.py", line 418, in _update_metadata
    await asyncio.gather(*tasks)
  File "/usr/lib/python3.12/asyncio/tasks.py", line 385, in __wakeup
    future.result()
  File "/usr/lib/python3.12/asyncio/tasks.py", line 314, in __step_run_and_handle_result
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 189, in async_wrapped
    return await copy(fn, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 111, in __call__
    do = await self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 153, in iter
    result = await action(retry_state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/tenacity/_utils.py", line 99, in inner
    return call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/tenacity/__init__.py", line 419, in exc_check
    raise retry_exc from fut.exception()
tenacity.RetryError: RetryError[<Future at 0x7e5aa40de8a0 state=finished raised ValueError>]

Enhances the ValueError raised in _update_single_metadata to include detailed error messages from the response, improving debugging and user feedback.
@JR-1991
Member Author

JR-1991 commented Oct 27, 2025

@JonathanHungerland thanks for sharing the traceback. The metadata update has failed for some reason, but the upload itself should have completed, following the code flow below. I have updated the uninformative exception in this PR to include the actual error message. Can you see the files in the dataset at least?

# In function `native_upload`
tasks = [
    _single_native_upload(
        session=session,
        file=file,
        persistent_id=persistent_id,
        pbar=pbar,  # type: ignore
        progress=progress,
    )
    for pbar, file in (packaged_files + replacable_files)
]

# ^ has completed

responses = await asyncio.gather(*tasks)
_validate_upload_responses(responses, files)

await _update_metadata(
    session=session,
    files=files_new + files_new_metadata,
    persistent_id=persistent_id,
    dataverse_url=dataverse_url,
    api_token=api_token,
)

# ^ This fails

@JonathanHungerland

JonathanHungerland commented Oct 27, 2025

I can indeed see the uploaded files. There was only one file that was not put into its correct folder and instead remained at the root location. I suppose that was a consequence of the metadata error. The dataset got into the "edit in progress" lock again. I'm currently waiting for the lock to be freed and then I'll try again.

@JR-1991
Member Author

JR-1991 commented Oct 27, 2025

That’s great to hear! Which directory_label did you assign? It’s possible that this folder name is invalid for some reason. Hopefully, the improved error message will provide some clarity :’)

@JonathanHungerland

JonathanHungerland commented Oct 27, 2025

I would be surprised if it's the directory labels. There are no characters except for [A-Z,a-z], underscores and dashes in all names and directory labels. I created a completely new dataset (because the old one is still locked). This time, the abort happened later and with a different error. Apparently the occurrence of a dataset lock prevented edits to the metadata. I really hope it's not a race condition.

Zip package of 10 files ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Traceback (most recent call last):
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 114, in __call__
    result = await fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/dvuploader/nativeupload.py", line 469, in _update_single_metadata
    raise ValueError(
ValueError: Failed to update metadata for file top_water_ions.str: Error adding metadata to DataFile: edu.harvard.iq.dataverse.api.AbstractApiBean$WrappedResponse: edu.harvard.iq.dataverse.engine.command.exception.IllegalCommandException: Dataset cannot be edited due to dataset lock.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/qblstorage/guests/jiate/upload_to_dataverse.py", line 35, in <module>
    dvuploader.upload(
  File "/users/student/zaaf8531/python-dvuploader/dvuploader/dvuploader.py", line 140, in upload
    asyncio.run(
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/nest_asyncio.py", line 30, in run
    return loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/nest_asyncio.py", line 98, in run_until_complete
    return f.result()
           ^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/futures.py", line 203, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/usr/lib/python3.12/asyncio/tasks.py", line 314, in __step_run_and_handle_result
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/dvuploader/nativeupload.py", line 145, in native_upload
    await _update_metadata(
  File "/users/student/zaaf8531/python-dvuploader/dvuploader/nativeupload.py", line 418, in _update_metadata
    await asyncio.gather(*tasks)
  File "/usr/lib/python3.12/asyncio/tasks.py", line 385, in __wakeup
    future.result()
  File "/usr/lib/python3.12/asyncio/tasks.py", line 314, in __step_run_and_handle_result
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 189, in async_wrapped
    return await copy(fn, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 111, in __call__
    do = await self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 153, in iter
    result = await action(retry_state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/tenacity/_utils.py", line 99, in inner
    return call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/users/student/zaaf8531/python-dvuploader/.venv/lib/python3.12/site-packages/tenacity/__init__.py", line 419, in exc_check
    raise retry_exc from fut.exception()
tenacity.RetryError: RetryError[<Future at 0x7751f4324d40 state=finished raised ValueError>]

@JR-1991
Member Author

JR-1991 commented Oct 27, 2025

Okay, I believe the retry approach using tenacity has reached its limits. This presents a good opportunity to explicitly check for locks. Although I had intended to be more general with the retry approach, it appears to lack sufficient robustness. I will update the PR today and notify you here.

@JonathanHungerland

Great, thanks a lot!

@JonathanHungerland

After creating some new datasets and finding a delicate balance between number of files and size of files, I finally managed to compile everything.

I'll create a test-version of the same dataset once you need it in order to test the lock-check.

Thank you so much for your quick and reliable help!

@JR-1991
Member Author

JR-1991 commented Nov 26, 2025

@JonathanHungerland, I apologize for my delayed response. I was on vacation.

It’s great to hear that everything has been resolved now! I’ll push through the remaining changes, including the lock check. However, I need to take some time to wrap my head around everything again after the weeks of absence 😅

Introduces lock wait and timeout configuration to ensure datasets are unlocked before file upload and metadata update operations. Adds utility functions for checking and waiting on dataset locks, and integrates these checks into direct and native upload processes. Also initializes logging for better debugging and monitoring.
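The lock wait described in this commit can be sketched as a polling loop with a deadline. This is an illustration under stated assumptions, not the merged implementation: `wait_for_unlock` and the injected `fetch_locks` callable are hypothetical names, and the default values are placeholders. In practice `fetch_locks` would wrap whatever request the uploader uses to list a dataset's locks.

```python
import asyncio
import time
from typing import Awaitable, Callable, List


async def wait_for_unlock(
    fetch_locks: Callable[[], Awaitable[List[dict]]],
    wait_time: float = 1.5,   # assumed pause between checks, in seconds
    timeout: float = 300.0,   # assumed overall limit, in seconds
) -> None:
    """Poll until `fetch_locks` reports no locks, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        locks = await fetch_locks()
        if not locks:
            return  # dataset is unlocked, safe to proceed
        if time.monotonic() + wait_time > deadline:
            raise TimeoutError(f"Dataset still locked after {timeout} s: {locks}")
        await asyncio.sleep(wait_time)


# Usage with a stand-in that reports a lock twice and then none.
async def _demo() -> int:
    calls = 0

    async def fake_fetch() -> List[dict]:
        nonlocal calls
        calls += 1
        return [] if calls >= 3 else [{"lockType": "EditInProgress"}]

    await wait_for_unlock(fake_fetch, wait_time=0.01, timeout=1.0)
    return calls


print(asyncio.run(_demo()))
```

Compared with retrying the failed request, checking the lock first avoids spending retry attempts on a condition that is known in advance and simply needs time to clear.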
Added DVUPLOADER_LOCK_WAIT_TIME and DVUPLOADER_LOCK_TIMEOUT to the README, including examples for environment and programmatic configuration. This clarifies new options for controlling dataset lock checks during uploads.
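As a rough illustration of environment-based configuration, the two variables named above could be read as follows. Only the variable names come from this PR; the helper `read_lock_settings`, the default values, and the assumption that both values are numbers of seconds are hypothetical, so the README remains the authority on actual behavior.

```python
import os
from typing import Mapping, Optional, Tuple


def read_lock_settings(
    env: Optional[Mapping[str, str]] = None,
    default_wait: float = 1.5,       # placeholder default, not from the README
    default_timeout: float = 300.0,  # placeholder default, not from the README
) -> Tuple[float, float]:
    """Return (wait_time, timeout) parsed from environment variables."""
    env = os.environ if env is None else env
    wait = float(env.get("DVUPLOADER_LOCK_WAIT_TIME", default_wait))
    timeout = float(env.get("DVUPLOADER_LOCK_TIMEOUT", default_timeout))
    if wait <= 0 or timeout <= 0:
        raise ValueError("Lock wait time and timeout must be positive")
    return wait, timeout


# Usage: pass a mapping explicitly to keep the example independent of the shell.
print(read_lock_settings({"DVUPLOADER_LOCK_WAIT_TIME": "2", "DVUPLOADER_LOCK_TIMEOUT": "60"}))
```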
Enhanced the create_dataset fixture with type overloads and a return_id parameter to optionally return both persistentId and id. Updated type hints for improved clarity and flexibility in test usage.
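The overload pattern mentioned here might look roughly like the sketch below. It is a simplified stand-in: the real fixture presumably takes connection arguments and calls the Dataverse API, whereas this version returns fixed placeholder values so that only the typing technique is shown.

```python
from typing import Literal, Tuple, Union, overload


@overload
def create_dataset(return_id: Literal[False] = False) -> str: ...
@overload
def create_dataset(return_id: Literal[True]) -> Tuple[str, int]: ...


def create_dataset(return_id: bool = False) -> Union[str, Tuple[str, int]]:
    """Return the persistentId, or (persistentId, id) when return_id is True."""
    # Placeholder values; a real fixture would create a dataset on a server.
    persistent_id, dataset_id = "doi:10.5072/FK2/EXAMPLE", 42
    return (persistent_id, dataset_id) if return_id else persistent_id


# Usage: a type checker infers `str` for the first call and a tuple for the second.
pid = create_dataset()
pid_and_id = create_dataset(return_id=True)
print(pid, pid_and_id)
```

With the overloads in place, tests that need only the persistent identifier keep their simple `str` type, while tests that also need the numeric id get a precisely typed tuple without casts.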
Added assertions to ensure required file attributes (file_id, file_name) are present and of correct type in dvuploader.py and packaging.py. Also reorganized import statements for consistency and readability.
Added mock responses for dataset retrieval and lock checks in directupload unit tests to better simulate Dataverse API interactions. Also set base_url for AsyncClient to ensure consistent request URLs.
Replaces 'directory_label' with 'directoryLabel' when instantiating File objects in both integration and unit tests to match the updated File class API. This ensures consistency and prevents argument errors.
Ensures that the 'recurse' parameter defaults to False if not provided, preventing potential issues when enumerating file paths. Also improves import ordering for consistency.
@JR-1991 JR-1991 merged commit 1b1397d into main Nov 26, 2025
12 checks passed
@github-project-automation github-project-automation bot moved this from Ready for Review to Done in PyDataverse Working Group Nov 26, 2025

Labels

bug Something isn't working

Projects

Development

Successfully merging this pull request may close these issues.

[Errno24] Too many open files:

3 participants