Transposing an 8x8 boolean matrix
All of the swaps can be done with two (fixed-width) shifts and some constant masks
We can get down to one mask and nine instructions per step using xor-based swapping
y = (x ^ (x>>sh)) & mask; x ^= y | y<<sh;