be: Lower Perms using copy instead of swap by default.

For architectures without a swap instruction (all except the general-purpose register classes on amd64 and ia32), this results in shorter code.
In many cases (probably all except swapping exactly two registers), it is also better on amd64/ia32, because copies need fewer uops and modern processors eliminate register-to-register movs at register renaming.
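
To illustrate the size argument, here is a minimal sketch in Python that lowers a single register cycle both ways and counts the resulting instructions. It is not the backend's actual code: the function names, the `tmp` scratch register, and the assumption that an emulated swap costs three instructions are all hypothetical and chosen only for illustration.

```python
def lower_cycle_with_copies(cycle, tmp="tmp"):
    """Lower a cycle [r0, r1, ..., rk-1], where the value in r_i must end up
    in r_(i+1) and the value in the last register must end up in r0.
    Uses one scratch register (hypothetical name `tmp`) and k + 1 copies."""
    if len(cycle) < 2:
        return []
    ops = [("copy", cycle[-1], tmp)]          # save the last value
    for i in range(len(cycle) - 2, -1, -1):   # shift values along the cycle
        ops.append(("copy", cycle[i], cycle[i + 1]))
    ops.append(("copy", tmp, cycle[0]))       # close the cycle
    return ops


def lower_cycle_with_swaps(cycle):
    """Lower the same cycle with k - 1 swaps and no scratch register."""
    return [("swap", cycle[i], cycle[i + 1])
            for i in range(len(cycle) - 2, -1, -1)]


def run(ops, regs):
    """Simulate (op, src, dst) tuples on a register dictionary."""
    regs = dict(regs)
    for op, src, dst in ops:
        if op == "copy":
            regs[dst] = regs[src]
        else:  # swap
            regs[src], regs[dst] = regs[dst], regs[src]
    return regs


def instruction_count(ops, swap_cost=3):
    """Count instructions. swap_cost=3 is an assumption modelling a target
    without a native swap (for example three xors); use swap_cost=1 for a
    target with a native swap instruction."""
    return sum(1 if op == "copy" else swap_cost for op, _, _ in ops)
```

Under these assumptions, a three-register cycle takes 4 copies but 2 swaps, which is 6 instructions when each swap must be emulated. For a two-register cycle the copy sequence needs 3 instructions, whereas a native swap needs only 1, which matches the exception mentioned above.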