Some notes on the investigation so far. I know this is a lot of blather, so feel free to skim down to the last half for the juiciest bits. It's not completely solved, but I think I'm close, and I believe the actions I've listed at the bottom will get us to resolution.
Previous threads:
Filecoin Slack #ecosystem-dev channel has threads (here & here)
Thankfully Alvin (via stuberman and lodge) was able to provide a failing DAG so we can dig deeper into the nature of the failure: a ~32G DAG hanging off bafybeietzjzc4vudlevv6k6sxdixhp5nnmfblyjhqheyjyd4d3uluvqdgm. Comparing the CAR that ipfs dag export produces (a proper exhaustive-selector export, which is what we would expect for a well-formed CAR) with the one that Boost apparently has on its end, where it's reporting the mismatch, we can see:
They are the same size
They contain the same blocks
The blocks are out of order
The ordering problem can be seen just by looking at the first few blocks, comparing what we expect (ipfs dag export) with what we get in Boost.
This list of expected links can be confirmed by just looking at the root block's links with ipfs dag get bafybeietzjzc4vudlevv6k6sxdixhp5nnmfblyjhqheyjyd4d3uluvqdgm | jq .Links[].Hash[] | head.
The second and third blocks are the same in both lists, and then the lists diverge. Both of those initial links are just Bytes with no links of their own, so this isn't a case of a traverser deciding to go down a different pathway; both should simply be walking the links in the root block in order.
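The three observations above (same size, same blocks, different order) can be checked mechanically once the CID sequence of each CAR has been extracted. The following is a minimal sketch, assuming the two sequences are already available as string slices; the function names and the placeholder CIDs are hypothetical and are not taken from the failing DAG.

```go
package main

import (
	"fmt"
	"sort"
)

// sameBlocks reports whether a and b contain the same CIDs with the same
// multiplicity, ignoring order. The inputs are copied before sorting.
func sameBlocks(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	as := append([]string(nil), a...)
	bs := append([]string(nil), b...)
	sort.Strings(as)
	sort.Strings(bs)
	for i := range as {
		if as[i] != bs[i] {
			return false
		}
	}
	return true
}

// firstDivergence returns the index of the first position at which the two
// sequences differ, or -1 if they are identical.
func firstDivergence(a, b []string) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i] != b[i] {
			return i
		}
	}
	if len(a) != len(b) {
		return n
	}
	return -1
}

func main() {
	// Placeholder CIDs for illustration only.
	expected := []string{"root", "cidA", "cidB", "cidC", "cidD"}
	actual := []string{"root", "cidA", "cidB", "cidD", "cidC"}

	fmt.Println("same blocks:", sameBlocks(expected, actual))
	fmt.Println("first divergence at index:", firstDivergence(expected, actual))
}
```

With real data, a divergence index just past the first couple of child blocks would match what is described here: the root and the first two leaf blocks agree, and the order differs from there.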
I wrote a simple program to "traverse" these links in the various ways that may matter, just from that root block, and keep on getting the same, stable ordering:
Raw links list coming out of a go-codec-dagpb decode
Raw links list coming out of a go-merkledag DecodeProtobuf
Traversal using go-ipld-prime
Traversal using go-merkledag
Most of the tooling in the path to make CARs uses go-ipld-prime's traversals, which in turn rely on go-codec-dagpb. But there is a dependency in Boost on a custom branch of go-car @ ipld/go-car#290 that uses go-merkledag's Walk and other legacy pieces to load and decode blocks. So there's a suspicion that the use of the legacy stack may be involved here.
In version 0.4.0 of go-merkledag, the underlying mechanics of protobuf decode were swapped out to use go-codec-dagpb, so since that version we should even have the same decoding path.
BUT prior to 0.4.0 it turns out we had a sneaky decode-sort of links going on whenever you decoded a DAG-PB block. This is not something that we factored into the DAG-PB spec or go-codec-dagpb, where links are only sorted on encode. And in a go-ipld-prime world, your Node decode ordering will dictate your traversal ordering. I'm going to add some clarifications to the spec about this @ ipld/ipld#233.
This shouldn't be a problem under normal circumstances, but we also have to deal with badly sorted or unsorted DAG-PB Links, since we're not being strict about rejecting blocks with unsorted Links lists. And it turns out that the failure case we have here is one of those. If we pull out the Name for each of the links that appear in the first blocks past the root in the CAR (first for ipfs dag export, then for Boost), we can see what's going on:
The first list (ipfs dag export) gives the links in the order they appear in the bytes, but Boost is walking them in sorted order.
This isn't normally a problem because we expect DAG-PB encoders to sort before encoding, so the order they appear in the bytes is the sorted order, so in "normal" cases we wouldn't see this mismatch.
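To make the mismatch concrete, here is a minimal sketch of the two decode behaviours. It does not use go-merkledag or go-codec-dagpb; the Link type and both decode functions are hypothetical stand-ins, and the stable bytewise sort by Name is my assumption about what the legacy decode-sort amounts to.

```go
package main

import (
	"fmt"
	"sort"
)

// Link is a simplified stand-in for a DAG-PB link; only Name matters here.
type Link struct {
	Name string
}

// decodeAsEncoded keeps links in the order they appear in the encoded bytes,
// which is the behaviour described for go-codec-dagpb.
func decodeAsEncoded(encoded []Link) []Link {
	return append([]Link(nil), encoded...)
}

// decodeWithSort mimics a decoder that sorts links after decoding
// (assumption: a stable, bytewise sort by Name).
func decodeWithSort(encoded []Link) []Link {
	out := append([]Link(nil), encoded...)
	sort.SliceStable(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

// names extracts the Name of each link, i.e. the order a traversal would visit.
func names(links []Link) []string {
	out := make([]string, len(links))
	for i, l := range links {
		out[i] = l.Name
	}
	return out
}

func main() {
	// A block written by an encoder that did not sort its links.
	encoded := []Link{{Name: "b"}, {Name: "a"}, {Name: "c"}}

	fmt.Println("byte order:  ", names(decodeAsEncoded(encoded)))
	fmt.Println("decode-sort: ", names(decodeWithSort(encoded)))
}
```

For a block whose links were sorted at encode time the two functions return the same order, which is why the difference only surfaces for blocks from a non-conforming encoder.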
There obviously exists a DAG-PB encoder that's producing differently sorted (or unsorted) Links lists, and that's what is triggering these failures. This isn't awesome; it's why we have specs and also why we encourage use of existing, battle-hardened codecs. But to be clear: our systems should be able to account for this. The problems we are having arise when we have different decode paths in our tooling.
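One way to find the blocks that would trigger this is to flag any block whose link names are not already in sorted order, since those are the only ones where a sorting decoder and a non-sorting decoder disagree. A rough sketch, assuming the link names of each block have already been extracted; the block labels and names below are made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// hasCanonicalLinkOrder reports whether the link names are already in
// ascending bytewise order. Equal adjacent names count as sorted.
func hasCanonicalLinkOrder(names []string) bool {
	return sort.StringsAreSorted(names)
}

// block pairs a label with the link names found in that block.
type block struct {
	label string
	names []string
}

func main() {
	// Hypothetical blocks, not taken from the failing DAG.
	blocks := []block{
		{label: "block-1", names: []string{"a", "b", "c"}},
		{label: "block-2", names: []string{"b", "a", "c"}},
	}
	for _, b := range blocks {
		fmt.Printf("%s canonical order: %v\n", b.label, hasCanonicalLinkOrder(b.names))
	}
}
```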
Re-running my test program against go-merkledag@0.3.2 and doing a Walk produces the same order we're seeing out of Boost.
Unfortunately I haven't figured out why Boost is doing this sorting. Even in v1.0.0 I can only see it pulling in >v0.4.0 versions of go-merkledag, and I've confirmed that this effect only appears for versions <v0.4.0. Perhaps there's some dependency jumbling that's going on to bring it in.
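One way to rule dependency jumbling in or out would be to ask the built binary which go-merkledag version it actually linked. The sketch below uses the standard library's runtime/debug.ReadBuildInfo; it would need to be compiled into (or called from) the binary in question, and where to hook it into Boost is not something I've worked out. The helper name resolvedVersion is hypothetical.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// resolvedVersion looks up a module path in the dependency list recorded in
// the binary. If the module was swapped out via a replace directive, the
// replacement's path and version are reported instead.
func resolvedVersion(deps []*debug.Module, path string) (string, bool) {
	for _, m := range deps {
		if m.Path != path {
			continue
		}
		if m.Replace != nil {
			return m.Replace.Path + "@" + m.Replace.Version, true
		}
		return m.Version, true
	}
	return "", false
}

func main() {
	info, ok := debug.ReadBuildInfo()
	if !ok {
		fmt.Println("no build info available (binary not built with module support)")
		return
	}
	const target = "github.com/ipfs/go-merkledag"
	if v, found := resolvedVersion(info.Deps, target); found {
		fmt.Println(target, "resolved to", v)
	} else {
		fmt.Println(target, "is not linked into this binary")
	}
}
```

Run standalone, this reports that the module is not linked in, because the example itself has no dependencies; the interesting result only appears inside a binary that actually imports go-merkledag.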
I see three things to do next:
Figure out how/why Boost might be using an older go-merkledag to do this traversal (perhaps a weird Go dependency shuffle, perhaps this isn't actually coming out of CarOffsetWriter but some other CAR creation path I'm not seeing?)
I think we should prioritise getting ipld/go-car#291 ("Add a 'skip' parameter to writev1 so that the beginning of a car can …") over the line and replacing CarOffsetWriter here with that. We really shouldn't be using go-merkledag for these kinds of things; we've been using ipld-prime traversals for CAR creation since the Filecoin launch (primarily through go-car's SelectiveCar).
(lower priority) Figure out what DAG-PB encoder is producing these blocks: is it one that PL controls, or some other, or do we have a bug in sorting? As far as I'm aware we're sorting consistently across implementations and have been doing so forever. (Initial guess: https://github.com/Jorropo/linux2ipfs/blob/262ac5bb774b681babe85c944d69ee44f8505436/main.go#L504-L510 - @Jorropo - do you know if this is being used much in the wild? Do you have a way of checking whether it might be involved in these failing deals?).