When the number of columns is a multiple of eight, we use eight-by-eight transposes with unaligned writes
For matrices with fewer than 16 rows or columns, we use bit interleaving or uninterleaving
For other matrices, we overtake along the last axis, padding it so that the number of columns is a multiple of eight, then use eight-by-eight transposes
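The eight-by-eight kernel and the padding step can be sketched as follows. This is a minimal illustration, not the implementation described above: it uses the well-known shift-and-mask transpose of an 8x8 bit block packed into a 64-bit integer, and the bit layout (bit 8*i + j holds row i, column j), the list-of-lists matrix representation, and the names `transpose8x8`, `pack_block`, and `transpose_padded` are all assumptions made for the example. For simplicity it pads both axes with zeros and trims afterwards, whereas the text only pads the last axis.

```python
# Each step swaps bit p with bit p + shift wherever the mask is set:
# 1x1 cells inside 2x2 blocks, then 2x2 blocks inside 4x4, then 4x4 inside 8x8.
STEPS = (
    (7, 0x00AA00AA00AA00AA),
    (14, 0x0000CCCC0000CCCC),
    (28, 0x00000000F0F0F0F0),
)


def transpose8x8(x):
    """Transpose an 8x8 bit matrix packed so that bit 8*i + j is row i, column j."""
    for shift, mask in STEPS:
        t = (x ^ (x >> shift)) & mask
        x ^= t ^ (t << shift)
    return x


def pack_block(m, r0, c0):
    """Pack the 8x8 block of m whose top-left corner is (r0, c0) into one integer."""
    x = 0
    for i in range(8):
        for j in range(8):
            x |= m[r0 + i][c0 + j] << (8 * i + j)
    return x


def transpose_padded(m):
    """Transpose a non-empty 0/1 matrix by zero-padding to multiples of eight."""
    rows, cols = len(m), len(m[0])
    pr, pc = -rows % 8, -cols % 8  # padding needed on each axis
    padded = [row + [0] * pc for row in m] + [[0] * (cols + pc)] * pr
    out = [[0] * (rows + pr) for _ in range(cols + pc)]
    for r0 in range(0, rows + pr, 8):
        for c0 in range(0, cols + pc, 8):
            y = transpose8x8(pack_block(padded, r0, c0))
            # The transposed block lands at the mirrored position (c0, r0).
            for i in range(8):
                for j in range(8):
                    out[c0 + i][r0 + j] = (y >> (8 * i + j)) & 1
    # Drop the padding again.
    return [row[:rows] for row in out[:cols]]
```

A real implementation would keep the bits packed throughout instead of unpacking them into lists; the sketch only shows how the block kernel and the padding fit together.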
Transposes with rank three or more can be decomposed into one transpose that rotates the axes (equivalent to a matrix transpose) and one that fixes the last axis (so it can move multiple bits at a time)
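The decomposition for higher ranks can be checked with a short NumPy sketch. It is only an illustration of the index arithmetic under stated assumptions: `perm` follows NumPy's `transpose` convention (result axis i is source axis `perm[i]`), the arrays hold ordinary integers instead of packed bits, and the function names are hypothetical. The rotation is done first, as a reshape to a matrix followed by a matrix transpose, and the remaining permutation leaves the last axis in place.

```python
import math

import numpy as np


def decompose(perm):
    """Split perm into an axis rotation and a permutation that fixes the last axis."""
    r = len(perm)
    k = perm[-1]  # source axis that must end up last
    # Rotation that brings axes k+1.. to the front, so source axis k becomes last.
    rot = list(range(k + 1, r)) + list(range(k + 1))
    # Permutation of the rotated axes that finishes the job; rest[-1] == r - 1.
    rest = [rot.index(p) for p in perm]
    return k, rot, rest


def rotate_as_matrix_transpose(a, k):
    """Rotate axes by viewing a as an (m, n) matrix and transposing it."""
    m = math.prod(a.shape[: k + 1])
    n = math.prod(a.shape[k + 1 :])
    return a.reshape(m, n).T.reshape(a.shape[k + 1 :] + a.shape[: k + 1])


def transpose_via_decomposition(a, perm):
    """Apply perm as a rotation followed by a last-axis-preserving transpose."""
    k, _, rest = decompose(perm)
    rotated = rotate_as_matrix_transpose(a, k)
    return rotated.transpose(rest)
```

Because the second step never moves the last axis, it copies whole runs along that axis, which is what allows multiple bits to be moved at a time in the packed setting.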