I've worked this out myself now. Each entry in the LUT now stores one linear index into the source image. To create the destination image from the LUT I just do B = A(LUT);
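Here is a minimal sketch of what I mean. It assumes the mapping starts out as per-pixel row and column coordinates; the names srcRow and srcCol and the left-right flip are made up for illustration and are not from my real code.

```matlab
% Hypothetical example data: a small source image and a mapping
% that mirrors it left to right.
A = magic(4);                           % source image (4x4)
[dstCol, dstRow] = meshgrid(1:4, 1:4);  % destination pixel coordinates
srcRow = dstRow;                        % source row for each destination pixel
srcCol = 5 - dstCol;                    % source column (mirrored)

% Collapse each (row, col) pair into one linear index per destination pixel.
LUT = sub2ind(size(A), srcRow, srcCol);

% Build the destination image with a single indexing operation.
% B takes the shape of LUT, so no reshape is needed.
B = A(LUT);
```

Because the LUT holds only integer indices, each destination pixel is a straight copy of exactly one source pixel.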
That is fast, but it only does nearest-neighbour interpolation. Is there a similarly fast method that can do bilinear or bicubic interpolation?