Currently, when the requested byte order matches the native byte order, the octets are copied straight from source to destination. But when the byte order is the inverse of the native byte order, the octets are first copied in native order and then mirrored. Why not copy the octets in reverse order in the first place?
Simplified suggestion for bf_ref_u32b() on a little-endian system, as an example:
dst[0U] = src[3U];
dst[1U] = src[2U];
dst[2U] = src[1U];
dst[3U] = src[0U];
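To make the comparison concrete, here is a minimal sketch of both variants side by side. The function names, the uint8_t pointer parameters, and the self-check are my own assumptions for illustration; I have not matched them against the actual signature of bf_ref_u32b(). Both variants assume dst and src do not overlap.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not the library's function): copy four octets
 * from src to dst in reversed order, in a single pass. */
static void copy_u32_reversed(uint8_t *dst, const uint8_t *src)
{
    dst[0U] = src[3U];
    dst[1U] = src[2U];
    dst[2U] = src[1U];
    dst[3U] = src[0U];
}

/* Hypothetical stand-in for the current two-step behaviour:
 * copy in native order first, then mirror the destination in place. */
static void copy_u32_then_mirror(uint8_t *dst, const uint8_t *src)
{
    uint8_t tmp;

    memcpy(dst, src, 4U);                 /* step 1: plain copy */

    tmp = dst[0U]; dst[0U] = dst[3U]; dst[3U] = tmp;  /* step 2: swap outer pair */
    tmp = dst[1U]; dst[1U] = dst[2U]; dst[2U] = tmp;  /* step 2: swap inner pair */
}

/* Self-check: both variants must produce the same destination octets. */
static void check_equivalence(void)
{
    const uint8_t src[4] = { 0x12U, 0x34U, 0x56U, 0x78U };
    uint8_t one_pass[4];
    uint8_t two_pass[4];

    copy_u32_reversed(one_pass, src);
    copy_u32_then_mirror(two_pass, src);

    assert(memcmp(one_pass, two_pass, sizeof one_pass) == 0);
}
```

The single-pass variant writes each destination octet exactly once, while the two-step variant writes each one twice. Whether that matters in practice depends on what the compiler makes of either form, so this is meant as an argument for simpler code rather than a measured speed-up.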