

I've been investigating the benefits of SIMD algorithms in C# and C++, and found that in many cases using 128-bit registers on an AVX processor offers a better improvement than using 256-bit registers on a processor with AVX2, but I don't understand why.


By improvement I mean the speed-up of a SIMD algorithm relative to a non-SIMD algorithm on the same machine.
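As a rough illustration of that metric, here is a minimal C++ sketch. The helper names `seconds_taken` and `speed_up` are hypothetical, made up for this example; a real benchmark would also repeat each measurement and discard warm-up runs.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: wall-clock seconds taken by one call of `work`.
template <typename F>
double seconds_taken(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Speed-up as defined above: scalar time divided by SIMD time,
// both measured on the same machine.
double speed_up(double scalar_seconds, double simd_seconds) {
    assert(scalar_seconds > 0.0 && simd_seconds > 0.0);
    return scalar_seconds / simd_seconds;
}
```

A value above 1.0 means the SIMD version is faster; comparing this ratio across machines is what the question is about.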

On an AVX processor, the upper half of the 256-bit registers and floating point units are powered down by the CPU when not executing AVX instructions (VEX encoded opcodes). When code does use AVX instructions, the CPU has to power up the FP units - this takes about 70 microseconds, during which time AVX instructions are actually executed as two 128-bit micro-ops each.


When AVX instructions haven't been used for about 700 microseconds, the CPU powers down the upper half of the circuitry again.


Now it does this because the upper half of the circuitry consumes power (doh!), and so generates heat (double doh!). This means that the CPU runs hotter when AVX instructions are used. CPUs can only "turbo boost" when they have thermal headroom, so using AVX instructions makes a boost less likely, and in fact the CPU actually reduces the "base clock speed". So if you have, for example, a CPU officially clocked at 2.3GHz that can turbo boost to 2.7GHz, when you start using AVX instructions the chip is clocked down to 2.1GHz and boosted to only 2.3GHz, and in extreme cases the base clock may be reduced to 1.9GHz (see pages 2-4 of this).
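To put those example clock figures in proportion, the fraction of clock speed lost can be worked out directly. This is only arithmetic on the numbers quoted above; `clock_loss` is a hypothetical helper name.

```cpp
#include <cassert>

// Fraction of clock speed lost when the CPU drops from normal_ghz
// to avx_ghz (0.0 = no loss, 0.1 = 10% slower clock).
double clock_loss(double normal_ghz, double avx_ghz) {
    assert(normal_ghz > 0.0 && avx_ghz > 0.0);
    return 1.0 - avx_ghz / normal_ghz;
}
```

With the figures from the example, `clock_loss(2.3, 2.1)` is roughly 0.09 for the base clock, `clock_loss(2.7, 2.3)` is roughly 0.15 for the turbo clock, and `clock_loss(2.3, 1.9)` is roughly 0.17 for the extreme case.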


At this stage, your CPU is executing ALL instructions about 10-15%, maybe even 20% SLOWER than when not using AVX instructions. If you're doing loads of SIMD operations, the 256 bit wide instructions make this worthwhile. But if you're doing a few AVX instructions, then "normal" code, then a bit of AVX again, then this clock speed penalty will cost more than all the gains you can make from AVX alone.
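That trade-off can be sketched with a simple Amdahl-style model. This is a simplification under stated assumptions, not a measurement: it assumes the clock penalty applies to the whole run, it ignores the power-up delay, and the parameter values used below (a 4x speed-up for 128-bit code, 8x for 256-bit code, a 15% clock loss) are purely illustrative. `relative_time` is a hypothetical name.

```cpp
#include <cassert>

// Relative run time under a simple model, where 1.0 is the all-scalar
// code running at full clock speed. All inputs are illustrative.
//   simd_fraction: share of the scalar run time that can be vectorized
//   simd_speedup:  speed-up of that share once vectorized
//   clock_loss:    fraction of clock speed lost by the whole program
double relative_time(double simd_fraction, double simd_speedup,
                     double clock_loss) {
    assert(simd_fraction >= 0.0 && simd_fraction <= 1.0);
    assert(simd_speedup > 0.0);
    assert(clock_loss >= 0.0 && clock_loss < 1.0);
    // Scalar part is unchanged; the SIMD part shrinks by simd_speedup.
    const double work = (1.0 - simd_fraction) + simd_fraction / simd_speedup;
    // A slower clock stretches everything, scalar code included.
    return work / (1.0 - clock_loss);
}
```

With only 10% of the work vectorizable, `relative_time(0.1, 8.0, 0.15)` comes to about 1.07 (slower than not using SIMD at all), while `relative_time(0.1, 4.0, 0.0)` comes to about 0.93. With 90% vectorizable the picture flips: about 0.25 for the 256-bit case against about 0.33 for the 128-bit case.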


This can be why 128 bit wide SIMD can run faster than 256 bit wide unless you've got lengthy intensive bursts of SIMD-dominated operations. There is a price to using the rest of the silicon... (or perhaps more accurately, a reward for not using it that we sometimes forget we've been getting).