handle virtuoso dump for initial sync #39

Merged
nbittich merged 15 commits into master from feature/initial-sync-with-virtuoso-dump on Jan 12, 2026
Contributor

@nbittich nbittich commented Oct 8, 2025

related to lblod/app-decide#8.

This PR should make the delta consumer more resilient when consuming large dumps generated with the Graph dump service. It adds the updateWithRecover utility function and the HTTP_MAX_QUERY_SIZE_BYTES environment variable, which should be used in custom dispatching, since it splits the query according to the size of the query itself and so reduces the number of retry attempts (I could be wrong, but in my tests a query over 100 kB fails). Gzipped delta files and dumps are now also supported.
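The size-based splitting mentioned above can be sketched roughly as follows. This is not the code from the PR: batchBySize, toInsertQuery and the 60000-byte fallback are hypothetical names and values chosen for illustration. Only the HTTP_MAX_QUERY_SIZE_BYTES variable name and the observation that queries above roughly 100 kB failed come from the description.

```javascript
// Hypothetical fallback value; the PR description only reports that
// queries above roughly 100 kB failed in testing.
const HTTP_MAX_QUERY_SIZE_BYTES =
  Number(process.env.HTTP_MAX_QUERY_SIZE_BYTES) || 60000;

/**
 * Groups serialized statements into batches whose UTF-8 size stays at or
 * under maxBytes. A single statement larger than maxBytes still gets a
 * batch of its own, so nothing is dropped.
 */
function batchBySize(statements, maxBytes = HTTP_MAX_QUERY_SIZE_BYTES) {
  const batches = [];
  let current = [];
  let currentSize = 0;
  for (const statement of statements) {
    // +1 accounts for the newline used when joining statements.
    const size = Buffer.byteLength(statement, 'utf8') + 1;
    if (current.length > 0 && currentSize + size > maxBytes) {
      batches.push(current);
      current = [];
      currentSize = 0;
    }
    current.push(statement);
    currentSize += size;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

// Hypothetical helper: wraps one batch in an INSERT DATA query.
function toInsertQuery(graph, batch) {
  return `INSERT DATA {\n  GRAPH <${graph}> {\n${batch.join('\n')}\n  }\n}`;
}
```

The byte budget here covers only the statements, so a real limit would need some headroom for the query wrapper. Measuring bytes rather than counting statements matters because a few long literals can push a small batch over the limit.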

Comment thread lib/dump-file.js Outdated
  });
  const distribution = await resultDistribution.json();
- return new DumpFile(distributionMetaData, distribution.data[0].relationships.subject.data);
+ const fileResponse = await fetcher(`${GET_FILE_ENDPOINT.replace(':id', distribution.data[0].relationships.subject.data.id)}`, {
Member

This just takes the first distribution found. It might be better to search for a specific one, similar to what is done here:
https://github.com/lblod/app-burgernabije-besluitendatabank/blob/master/scripts/import-dumps/run.rb#L130

Contributor Author

this should now be fixed

@nbittich nbittich marked this pull request as ready for review October 20, 2025 08:38
@nbittich nbittich force-pushed the feature/initial-sync-with-virtuoso-dump branch from b48f357 to 7d5a2f2 Compare October 20, 2025 08:43
@nbittich nbittich requested a review from nvdk October 20, 2025 08:44
Comment thread config.js
  export const SYNC_FILES_PATH = process.env.DCR_SYNC_FILES_PATH || '/sync/files';
- export const DOWNLOAD_FILE_PATH = process.env.DCR_DOWNLOAD_FILE_PATH || '/files/:id/download';
+ export const GET_FILE_PATH = process.env.DCR_GET_FILE_PATH || '/files/:id';
+ export const DOWNLOAD_FILE_PATH = process.env.DCR_DOWNLOAD_FILE_PATH || GET_FILE_PATH + '/download';
Member

Could you clarify why this split was necessary? It seems this would be a breaking change.

Contributor Author

We need to fetch metadata about the file, hence the new GET_FILE_PATH env var. I think the two could even be static, since probably no app overrides them (they are not documented in the README). Each one is only the relative path of a file endpoint. The previous variable was too specific (it pointed to the download endpoint), so we couldn't use it to fetch metadata about the file.
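As a rough illustration of how the two path templates relate, the sketch below fills in the :id placeholder for both endpoints. The defaults mirror the config.js lines quoted in this thread; fileUrls and the base URL are hypothetical and are not part of the PR.

```javascript
// Defaults copied from the config.js lines quoted in this thread.
const GET_FILE_PATH = process.env.DCR_GET_FILE_PATH || '/files/:id';
const DOWNLOAD_FILE_PATH =
  process.env.DCR_DOWNLOAD_FILE_PATH || GET_FILE_PATH + '/download';

// Hypothetical helper: builds the metadata and download URLs for one file.
function fileUrls(baseUrl, fileId) {
  const base = baseUrl.replace(/\/+$/, ''); // drop trailing slashes
  const id = encodeURIComponent(fileId);
  return {
    metadata: base + GET_FILE_PATH.replace(':id', id),
    download: base + DOWNLOAD_FILE_PATH.replace(':id', id),
  };
}
```

With the defaults, a deployment that only sets DCR_GET_FILE_PATH gets a matching download path derived from it, while one that already sets DCR_DOWNLOAD_FILE_PATH keeps its own value.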

Comment thread lib/delta-file.js Outdated
  try {
  await this.download();
- const changeSets = await fs.readJson(this.filePath, { encoding: 'utf-8' });
+ let fileStream = fs.createReadStream(this.filePath);
Member

I'm always a bit confused about streams and their events in an async method. I assume that the final await json will throw any pipeline errors as they cascade to the final consumer in the stream pipeline. If that's the case this looks fine.

Contributor Author

Yup, it's a helper function from the stream/consumers module provided by Node. I also get confused by the behavior of streams, which is why I used the json helper to turn this into a simple async/await call (and the error will indeed cascade).
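A minimal sketch of the pattern discussed here, assuming gzipped files are recognized by a .gz extension (the PR does not say how detection is done). readJsonFile is a hypothetical name, and this is not the code from lib/delta-file.js. It uses the json helper from stream/consumers so the caller only deals with a promise.

```javascript
const fs = require('node:fs');
const zlib = require('node:zlib');
const { json } = require('node:stream/consumers');

/**
 * Reads a JSON file, gunzipping it first when the name ends in .gz
 * (an assumption made for this sketch). The returned promise rejects
 * on read, decompression, and JSON parse errors.
 */
async function readJsonFile(filePath) {
  let stream = fs.createReadStream(filePath);
  if (filePath.endsWith('.gz')) {
    const source = stream;
    const gunzip = zlib.createGunzip();
    // pipe() does not forward source errors to the destination,
    // so hand read errors to the gunzip stream explicitly.
    source.on('error', (err) => gunzip.destroy(err));
    stream = source.pipe(gunzip);
  }
  // json() consumes the whole stream and parses it; an error on the
  // stream it consumes rejects the promise.
  return json(stream);
}
```

The sketch uses CommonJS require so that it runs standalone. Errors raised by the final stream reach the awaiting caller through json(), but an error on an earlier stream joined with plain pipe() only does so if it is forwarded, as done above.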

Comment thread package.json Outdated
"homepage": "https://github.com/lblod/delta-consumer#readme",
"dependencies": {
"@lblod/mu-auth-sudo": "0.6.1",
"@lblod/mu-auth-sudo": "0.6.2",
Member

Consider bumping the template to 1.9.1 and just using auth-sudo from the template.

Contributor Author

Bumped to 1.9.1, but I didn't switch to the JS template's sudo support yet, as it doesn't provide the ability to override the SPARQL endpoint and the retry mechanism.

Comment thread package.json Outdated
Member

@nvdk nvdk left a comment

Some minor remarks, but the changes seem sound. I haven't tested it yet, though.

@nbittich nbittich merged commit 7db7e99 into master Jan 12, 2026
1 check passed
@nbittich nbittich deleted the feature/initial-sync-with-virtuoso-dump branch January 12, 2026 10:51
@nbittich nbittich restored the feature/initial-sync-with-virtuoso-dump branch January 22, 2026 09:34
nbittich added a commit that referenced this pull request Jan 22, 2026
nbittich added a commit that referenced this pull request Jan 22, 2026