About the image reconstruction supervision

I am curious about the image reconstruction task you use for the detail preserver. As I understand it, the input to the U-Net is the concatenation of the noisy depth map and the RGB tokens. If so, how do you prevent the model from simply copying the RGB tokens from its input to its output? Could you provide more detail about this supervision?
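To make the concern concrete, here is a minimal sketch in plain Python. The shapes, the channel ordering, and the `copy_shortcut` and `mse` helpers are all hypothetical and are not taken from your code; the snippet only illustrates why I would expect a reconstruction loss to be trivially minimized when the target is also part of the input.

```python
import random

random.seed(0)

# Hypothetical shapes: 1 noisy-depth channel and 3 RGB-token channels,
# each with 4 spatial positions. These are assumptions for illustration only.
noisy_depth = [[random.gauss(0, 1) for _ in range(4)]]
rgb_tokens = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]

# Channel-wise concatenation, as I understand the U-Net input to be built.
unet_input = noisy_depth + rgb_tokens


def copy_shortcut(x, num_depth_channels=1):
    """Trivial stand-in 'model': drop the depth channels, return RGB unchanged."""
    return x[num_depth_channels:]


def mse(a, b):
    """Mean squared error over two equally shaped nested lists."""
    diffs = [(p - q) ** 2 for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


# The shortcut reproduces the target exactly, so the loss carries no signal.
reconstruction_loss = mse(copy_shortcut(unet_input), rgb_tokens)
print(reconstruction_loss)
```

In this toy setup the shortcut reaches zero loss without ever looking at the depth channel, which is the behavior I am asking about. I assume the actual supervision differs from this (for example, in what is reconstructed or in which features the reconstruction head can see), so any clarification would be appreciated.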