Open files just-in-time to prevent OSError
#45
Conversation
Introduces a get_handler method to manage file handler initialization and usage. Updates checksum chunked reading to use get_handler, ensuring proper resource management by seeking or closing handlers as appropriate. Removes direct handler assignment during size calculation for improved separation of concerns.
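A rough sketch of how such a chunked checksum update can rewind or close the handler depending on how it was obtained; the digest algorithm, chunk size, and attribute names are assumptions and not the actual dvuploader code:

```python
import hashlib


def update_checksum_chunked(self, chunk_size: int = 1024 * 1024) -> None:
    """Stream the file through get_handler() and hash it chunk by chunk (sketch of a File method)."""
    had_handler = self.handler is not None
    handler = self.get_handler()
    digest = hashlib.md5()
    for chunk in iter(lambda: handler.read(chunk_size), b""):
        digest.update(chunk)
    self.checksum = digest.hexdigest()
    if had_handler:
        handler.seek(0)   # pre-initialized handler: rewind so the upload can reuse it
    else:
        handler.close()   # handler opened just for the checksum: close it again
        self.handler = None
```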
Replaced direct access to file.handler with file.get_handler() in directupload.py, nativeupload.py, and packaging.py for improved encapsulation and consistency. Also made minor formatting improvements in nativeupload.py.
Pull Request Overview
This pull request addresses issue #40 by implementing just-in-time file opening to prevent OSError when the number of open files exceeds OS limits. The main change introduces a new get_handler() method that opens files only when needed rather than during initialization.
- Refactors file handler management across multiple modules to use lazy file opening
- Introduces a standardized `get_handler()` method in the `File` class for consistent file access
- Updates checksum operations to properly handle both pre-initialized and just-in-time opened handlers
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| dvuploader/file.py | Adds get_handler() method and removes automatic file opening during initialization |
| dvuploader/packaging.py | Updates zip file creation to use get_handler() instead of direct handler access |
| dvuploader/nativeupload.py | Replaces direct handler access with get_handler() for file uploads |
| dvuploader/directupload.py | Updates direct upload functionality to use get_handler() method |
Hi @JR-1991, I first tested without any batching of files (I'm trying to upload 1290 files); this gave me an (unrelated) error: With batching (maximum of 20 files per dvuploader.upload() instance), things went quite smoothly. There were two non-fatal problems in between: Error in batch 19: RetryError[<Future at 0x7b70f026bce0 state=finished raised ValueError>], and a read timeout. After the read timeout, batch 21 began, which showed a similar error. The "read operation timed out" errors occurred for a few batches in a row without any indication of a successful upload. I then decided to stop the process and simply restarted it. The same errors appeared. I tried reducing the batch size to 4 and setting PARALLEL_UPLOADS to 1, but the "read operation timed out" errors continued. The wall-time for read operations might be a server-side setting (i.e. not an issue of DVUploader). So while there remain some issues for my case, I did not encounter the "too many open files" problem.
Reorders and cleans up import statements for consistency. Updates httpx timeout configuration to use explicit Timeout object with all parameters set to None for improved clarity.
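For illustration, an explicit `httpx.Timeout` with every parameter set to `None` looks like the following; the client construction is an assumption to show where the timeout would be applied, not a quote from the PR:

```python
import httpx

# Explicit Timeout object: no connect, read, write, or pool limits.
timeout = httpx.Timeout(connect=None, read=None, write=None, pool=None)

# Hypothetical client construction using that timeout.
client = httpx.AsyncClient(timeout=timeout)
```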
@JonathanHungerland, thanks for getting back and testing the PR! I believe the issue might be that DVUploader handles timeouts too strictly, resulting in saturation of the retries, and thus a general error is raised. I’ve updated the PR to explicitly set a timeout of `None`.
Unfortunately I still get the same errors.
I now tried uploading all my files as a single ZIP file. When I used replace_existing=True, I got: When I instead set replace_existing=False, I get a similar yet slightly different error message:
@JonathanHungerland thanks for getting back so quickly. Regarding the single ZIP file, I have encountered this as well a couple of times already: Dataverse cuts the connection and is quite possibly overwhelmed. If I may ask, how big is the final ZIP file and for how long did the upload run? Regarding the "read operation timed out" error, could you share the full traceback?
The single ZIP file is 12 GB. The upload did not complete (see the errors above), but the errors came quite quickly (within a few seconds). For "Error in batch2: The read operation timed out", that was actually the complete trace; nothing else was reported. Maybe related: the DARE website says that the dataset currently has the status "Edit in progress" and it seems to be stuck like this. I can't fix it myself, but I approached our support.
Sorry, I was stupid. My try: ... except: ... block got rid of the full traceback. So the full traceback for "the read operation timed out" is:
@JonathanHungerland, thanks for sharing this. It’s quite interesting! The actual error doesn’t originate from the upload, but rather from the initial duplication check. Based on the details you’ve provided, I believe this issue is related to the dataset lock. This lock might have occurred due to tab ingestion or unzipping/registering your data. Depending on the size of the data, this process can take some time, so it’s possible that the lock will resolve automatically in the future. In general, uploading 12 GB of data via the native upload method is always a risky endeavor, as the HTTP connection can be quite fragile. If you have the opportunity, I strongly recommend using the direct upload feature. It is much more robust and can be enabled by your Dataverse Collection Admin. To resolve this issue, I’ll run an equivalent test on our local instance here in Stuttgart. This will help us rule out any systematic errors in the code and provide a more realistic test environment than simple CI or local tests. I’ll try to squeeze this into the week and get back to you as soon as I have more information.
The dataset has been "unlocked"; I removed all files and tried uploading again. Already in my first batch I received an error:
Enhances the ValueError raised in _update_single_metadata to include detailed error messages from the response, improving debugging and user feedback.
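As a hedged illustration of the kind of change described, an error that surfaces the response body might look like the following; the helper name and message wording are assumptions rather than the actual dvuploader code:

```python
import httpx


def raise_for_metadata_error(file_name: str, response: httpx.Response) -> None:
    """Hypothetical helper: turn a failed metadata update into a descriptive ValueError."""
    if response.status_code != 200:
        raise ValueError(
            f"Metadata update for '{file_name}' failed "
            f"with status {response.status_code}: {response.text}"
        )
```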
@JonathanHungerland thanks for sharing the traceback. The metadata update has failed for some reason, but the upload should have completed, following the code flow below. I updated the uninformative exception in this PR to include the actual error message. Can you see the files in the dataset at least?

# In function `_single_native_upload`
tasks = [
_single_native_upload(
session=session,
file=file,
persistent_id=persistent_id,
pbar=pbar, # type: ignore
progress=progress,
)
for pbar, file in (packaged_files + replacable_files)
]
# ^ has completed
responses = await asyncio.gather(*tasks)
_validate_upload_responses(responses, files)
await _update_metadata(
session=session,
files=files_new + files_new_metadata,
persistent_id=persistent_id,
dataverse_url=dataverse_url,
api_token=api_token,
)
# ^ This fails
I can indeed see the uploaded files. There was only one file that was not put into its correct folder and instead remained at the root location. I suppose that was a consequence of the metadata error. The dataset got into the "Edit in progress" lock again. I'm currently waiting for the lock to be freed and then I'll try again.
That’s great to hear! Which directory labels are you using?
I would be surprised if it's the directory labels. There are no characters except [A-Za-z], underscores, and dashes in any of the names and directory labels. I created a completely new dataset (because the old one is still locked). This time, the abort happened later and with a different error. Apparently the occurrence of a dataset lock prevented edits to the metadata. I really hope it's not a race condition.
Okay, I believe the retry approach using
Great, thanks a lot!
After creating some new datasets and finding a delicate balance between the number of files and the size of files, I finally managed to compile everything. I'll create a test version of the same dataset once you need it in order to test the lock check. Thank you so much for your quick and reliable help!
@JonathanHungerland, I apologize for my delayed response. I was on vacation. It’s great to hear that everything has been resolved now! I’ll push through the remaining changes, including the lock check. However, I need to take some time to wrap my head around everything again after the weeks of absence 😅
Introduces lock wait and timeout configuration to ensure datasets are unlocked before file upload and metadata update operations. Adds utility functions for checking and waiting on dataset locks, and integrates these checks into direct and native upload processes. Also initializes logging for better debugging and monitoring.
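A minimal sketch of what such a lock wait can look like, assuming the Dataverse `/api/datasets/{id}/locks` endpoint and the two environment variables documented in the README change below; this is not the actual dvuploader implementation:

```python
import asyncio
import os

import httpx

# Hypothetical defaults; the real configuration values may differ.
LOCK_WAIT_TIME = float(os.getenv("DVUPLOADER_LOCK_WAIT_TIME", "5"))
LOCK_TIMEOUT = float(os.getenv("DVUPLOADER_LOCK_TIMEOUT", "300"))


async def wait_for_unlock(session: httpx.AsyncClient, dataset_id: str) -> None:
    """Poll the dataset locks endpoint until no locks remain or the timeout is reached."""
    waited = 0.0
    while waited < LOCK_TIMEOUT:
        response = await session.get(f"/api/datasets/{dataset_id}/locks")
        response.raise_for_status()
        if not response.json().get("data"):
            return  # no active locks, safe to upload or update metadata
        await asyncio.sleep(LOCK_WAIT_TIME)
        waited += LOCK_WAIT_TIME
    raise TimeoutError(f"Dataset {dataset_id} is still locked after {LOCK_TIMEOUT} seconds")
```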
Added DVUPLOADER_LOCK_WAIT_TIME and DVUPLOADER_LOCK_TIMEOUT to the README, including examples for environment and programmatic configuration. This clarifies new options for controlling dataset lock checks during uploads.
Enhanced the create_dataset fixture with type overloads and a return_id parameter to optionally return both persistentId and id. Updated type hints for improved clarity and flexibility in test usage.
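A hypothetical sketch of how such type overloads can express the two return shapes; the actual fixture in the dvuploader test suite may be structured differently:

```python
from typing import Literal, Tuple, Union, overload


@overload
def create_dataset(return_id: Literal[False] = ...) -> str: ...
@overload
def create_dataset(return_id: Literal[True]) -> Tuple[str, int]: ...


def create_dataset(return_id: bool = False) -> Union[str, Tuple[str, int]]:
    """Create a test dataset and return its persistentId, optionally with the numeric id."""
    persistent_id, numeric_id = "doi:10.5072/FK2/EXAMPLE", 42  # placeholder values
    return (persistent_id, numeric_id) if return_id else persistent_id
```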
Added assertions to ensure required file attributes (file_id, file_name) are present and of correct type in dvuploader.py and packaging.py. Also reorganized import statements for consistency and readability.
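An illustrative form of such checks; the assertion messages and the helper wrapping them are assumptions:

```python
def check_file_attributes(file) -> None:
    """Hypothetical helper mirroring the added attribute assertions."""
    assert file.file_id is not None, "file_id must be set"
    assert isinstance(file.file_name, str), "file_name must be a string"
```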
Added mock responses for dataset retrieval and lock checks in directupload unit tests to better simulate Dataverse API interactions. Also set base_url for AsyncClient to ensure consistent request URLs.
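An illustrative way to simulate those endpoints with `httpx.MockTransport` and a fixed `base_url`; the exact routes and payloads used in the dvuploader tests are assumptions:

```python
import httpx


def _mock_dataverse(request: httpx.Request) -> httpx.Response:
    """Return canned responses for dataset retrieval and lock checks."""
    if request.url.path.endswith("/locks"):
        return httpx.Response(200, json={"status": "OK", "data": []})  # no active locks
    return httpx.Response(200, json={"status": "OK", "data": {"id": 42}})


client = httpx.AsyncClient(
    base_url="https://demo.dataverse.org",
    transport=httpx.MockTransport(_mock_dataverse),
)
```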
Replaces 'directory_label' with 'directoryLabel' when instantiating File objects in both integration and unit tests to match the updated File class API. This ensures consistency and prevents argument errors.
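In test code this amounts to renaming the keyword argument, roughly as follows; the file path and label are made-up values:

```python
from dvuploader import File

# Before: File(filepath="data/results.csv", directory_label="data")
example = File(filepath="data/results.csv", directoryLabel="data")
```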
Ensures that the 'recurse' parameter defaults to False if not provided, preventing potential issues when enumerating file paths. Also improves import ordering for consistency.
This pull request addresses issue #40. When the number of open files exceeds a certain limit, Python raises an `OSError` indicating an overflow. Previously, the implementation opened files as handlers during `File` initialization, which triggered this error once the number of open files exceeded the OS limit.

This PR refactors file handler management across several modules to improve reliability and consistency when accessing file data. The main change introduces a new `get_handler` method in the `File` class to standardize how file handlers are obtained, ensuring files are opened only when needed and properly closed after use. This update affects file reading, uploading, packaging, and checksum operations.

File handler management and API consistency:
- Added a `get_handler` method to the `File` class in `dvuploader/file.py`, which opens the file if the handler is not already set, centralizing file handler access logic.
- Replaced direct `file.handler` access in `dvuploader/directupload.py`, `dvuploader/nativeupload.py`, and `dvuploader/packaging.py` with `file.get_handler()`, ensuring consistent file opening and closing behavior. [1] [2] [3]

Checksum and file reading improvements:
- Updated `update_checksum_chunked` in `dvuploader/file.py` to use `get_handler()` for reading file data, and added logic to reset or close handlers depending on how they were obtained. [1] [2]
- No longer requires `self.handler` to be initialized before updating the checksum, allowing for just-in-time file opening.

Closes issue #40.
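As a minimal sketch of the lazy opening described above, assuming a reduced `File`-style class with only a `filepath` and an optional `handler` attribute (the real dvuploader `File` is a richer model):

```python
from typing import IO, Optional


class File:
    """Reduced illustration of just-in-time handler access; not the full dvuploader File."""

    def __init__(self, filepath: str):
        self.filepath = filepath
        self.handler: Optional[IO[bytes]] = None  # no longer opened during initialization

    def get_handler(self) -> IO[bytes]:
        """Open the file only when it is actually needed and reuse an existing handler."""
        if self.handler is None:
            self.handler = open(self.filepath, "rb")
        return self.handler
```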