Hello! Thank you for this excellent work on FastSAM3D.
I'm trying to use the FastSAM3D checkpoint from HuggingFace (https://huggingface.co/techlove/FastSAM3D, first download link "FASTSAM3D") with the code from this repository, but I'm running into a tensor dimension mismatch error.
**Environment:**
- `fastsam3d.pth` from HuggingFace (first download link)

**Error:**
```
RuntimeError: The size of tensor a (768) must match the size of tensor b (8) at non-singleton dimension 5
  File "segment_anything/modeling/mask_decoder3D.py", line 407, in predict_masks
    src = src + dense_prompt_embeddings
```
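For context, this is the generic PyTorch broadcasting failure: the addition fails because the two tensors disagree at a non-singleton dimension. A minimal sketch (the shapes here are placeholders, not the actual shapes inside `mask_decoder3D.py`):

```python
import torch

# Hypothetical shapes: the decoder feature map and the dense prompt
# embeddings disagree at dimension 5 (768 vs 8), so `+` cannot broadcast.
src = torch.zeros(1, 1, 1, 1, 1, 768)
dense_prompt_embeddings = torch.zeros(1, 1, 1, 1, 1, 8)

try:
    src + dense_prompt_embeddings
except RuntimeError as e:
    # "The size of tensor a (768) must match the size of tensor b (8)
    #  at non-singleton dimension 5"
    print(e)
```

So the 768 vs. 8 mismatch points at the image-embedding channel dimension meeting something shaped like a dense prompt mask, which is why this looks like an architecture/config mismatch rather than a corrupted file.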
**What I've tried:**
- Confirmed the checkpoint has a 6-layer encoder (student model)
- Built the model with a 6-layer `ImageEncoderViT3D`
- Used the `vit_b_ori` model type (as specified in the checkpoint args)
- Loaded with `strict=False`
- Verified the error occurs in the mask decoder, not the encoder
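In case it helps narrow things down, this is roughly how I checked the `strict=False` load: inspecting the `missing_keys`/`unexpected_keys` that `load_state_dict` returns. The snippet below is a self-contained sketch with a stand-in `nn.Linear` model and a fake state dict; in practice the model is the 6-layer FastSAM3D build and the state dict comes from `fastsam3d.pth`:

```python
import torch
import torch.nn as nn

# Stand-in for the FastSAM3D model built from this repository.
model = nn.Linear(4, 4)

# Fake checkpoint whose keys only partially match the model,
# simulating a checkpoint/architecture mismatch.
state_dict = {"weight": torch.zeros(4, 4), "extra.bias": torch.zeros(4)}

# With strict=False, mismatched keys are reported instead of raising.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing:", missing)        # ['bias'] -- expected by code, absent in file
print("unexpected:", unexpected)  # ['extra.bias'] -- in file, unknown to code
```

Non-empty lists here would confirm that the decoder weights in the checkpoint don't line up with the current `mask_decoder3D.py`.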
**Checkpoint inspection shows:**

```python
args.model_type = 'vit_b_ori'
args.checkpoint = './work_dir/SAM/sam_med3d_oringin.pth'  # Teacher checkpoint
```

- Encoder has 6 blocks (layers), confirmed
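For reference, I counted the encoder blocks from the parameter names. A sketch of that check, using an in-memory dict in place of `torch.load("fastsam3d.pth", map_location="cpu")` and assuming SAM-style `image_encoder.blocks.<i>.` key naming:

```python
import torch

# Stand-in for the loaded checkpoint; key names follow the assumed
# SAM-Med3D layout ("model_state_dict" wrapper, "image_encoder.blocks.<i>.").
ckpt = {
    "model_state_dict": {
        "image_encoder.blocks.0.attn.qkv.weight": torch.zeros(2304, 768),
        "image_encoder.blocks.5.mlp.lin1.weight": torch.zeros(3072, 768),
    },
}
state = ckpt.get("model_state_dict", ckpt)

for name, tensor in state.items():
    print(name, tuple(tensor.shape))

# Distinct block indices appearing in the encoder parameter names:
blocks = {n.split(".")[2] for n in state if n.startswith("image_encoder.blocks.")}
print("encoder blocks seen:", len(blocks))
```

Running this over the real `fastsam3d.pth` is how I confirmed the 6 blocks above.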
**Questions:**
1. Does the HuggingFace checkpoint match the current repository code, or was it trained with a modified version?
2. Which specific commit/branch should be used with this checkpoint?
3. Are there additional architectural modifications needed beyond the 6-layer encoder?
4. I noticed Issue #6 mentions different attention mechanisms (woatt vs. flash attention). Could this be related?

Any guidance would be greatly appreciated!