When the stride is large enough (we use a threshold of 256 bytes, much larger than the strides shown here), the best method for most operands is simply to merge one row at a time.
Rows can be combined using vector instructions on x86, ARM, or POWER.
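
As a rough illustration of the row-at-a-time approach, here is a minimal sketch in C. It rests on several assumptions that are not stated in the text: the merge operator is taken to be a bitwise OR, the operands are plain byte buffers, and the names `merge_row_or` and `merge_rows` are hypothetical. The inner loop is written in portable C so that a compiler can auto-vectorize it; a hand-tuned version would presumably use platform intrinsics instead. The 256-byte stride check is left out because the alternative path for small strides is not described here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Merge a single row of `width` bytes: dst[i] |= src[i].
 * ASSUMPTION: bitwise OR stands in for whatever merge operator is needed.
 * `restrict` tells the compiler the rows do not overlap, which makes it
 * easier for the loop to be auto-vectorized. */
static void merge_row_or(uint8_t *restrict dst,
                         const uint8_t *restrict src,
                         size_t width)
{
    for (size_t i = 0; i < width; i++)
        dst[i] |= src[i];
}

/* Merge `rows` rows, one row at a time.
 * Strides are in bytes and may exceed `width`; bytes between the end of
 * one row and the start of the next are never read or written. */
void merge_rows(uint8_t *dst, size_t dst_stride,
                const uint8_t *src, size_t src_stride,
                size_t width, size_t rows)
{
    assert(width <= dst_stride && width <= src_stride);

    for (size_t r = 0; r < rows; r++)
        merge_row_or(dst + r * dst_stride, src + r * src_stride, width);
}
```

Each call to the row helper works on one contiguous span of memory, which is the access pattern vector units handle well. This is why a long stride favors this layout of the loops: the per-row overhead is small compared with the work done inside each row.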