Optimizing constant bitshifts on AVR
The AVR instruction set used on microcontrollers from Atmel (now Microchip), does not have shift instructions like most other instruction sets. It can only shift by a single bit at a time, although compilers are often smart enough to shift in a more efficient way.
Lately I've been thinking about how to produce optimal shift instructions in the AVR backend of LLVM, the compiler toolkit that is used by TinyGo, Clang, Rust, Swift, and many other compilers. At the moment, it often produces pretty bad (but correct) code that's both big and slow. Recently Ben Shi has been working on improving this but we're not yet at the point of matching avr-gcc.
In this post I'll investigate how to emit the most efficient sequence of instructions to shift by a constant amount, for any supported integer width. I think there is a strategy to always emit the most compact and efficient instructions, especially if shift expansion is done before register allocation (which appears not to be the case in avr-gcc and is certainly not yet the case in LLVM).
Here I'm only going to describe left shift operations. Right shifting unsigned integers is almost identical but with everything in reverse. Right shifting signed integers is slightly more complicated. I won't cover right shifting here but the basic principles are more or less the same. I also will only cover shifting by a constant amount, as shifting by an amount unknown at compile time is basically just repeatedly shifting by one until enough bits are shifted. Yeah, that's pretty bad but sadly the AVR instruction set doesn't have a better way to do it.
I'll use i8
, i16
, i24
and i32
as used in LLVM, meaning that these are simple integers. You might be familiar with them if you know Rust, but compared to Rust these integers are sign-less (they are not signed or unsigned). Luckily this signed/unsigned distinction doesn't matter for left shifting.
Regarding registers, I will be using r0
..r3
with r0
containing the least significant byte and r3
containing the most significant byte. Because the AVR is an 8 bits processor, an i32
will be contained in four registers.
Warning: I have not tested any of the code samples in this post. While I think they are correct, I haven't verified them so they might be wrong.
Shift by 1
Let's take a look at what compilers can do to lower shift instructions. Shifting left an i8
is simple enough:
lsl r0
The lsl
instruction (logical shift left) simply shifts the value in r0
left by one, filling in the empty space with a zero bit. Now let's take a look at shifting an i16
:
lsl r0
rol r1
That's slightly more involved. The instruction lsl r0
not only shifts r0
left by one, it also stores the most significant bit that got shifted out in the carry flag (C) of the status register. The rol
instruction (rotate left) is not a regular rotate instruction, it does indeed rotate all the bits but instead of moving the most significant bit to the least significant bit it fills the least significant bit with the carry flag and stores the most significant bit in the carry flag. Therefore, it rotates around nine bits, not eight, including the carry flag. This behavior makes it possible to shift larger integers, for example an i24
:
lsl r0
rol r1
rol r2
Or an i32
:
lsl r0
rol r1
rol r2
rol r3
You get the idea. I won't be showing many i32
examples because from i16
and i24
it's usually easy to extend the behavior to any integer with a bit width that's a multiple of eight.
Shift by 2 and 3
Shifting by two or three isn't exactly sophisticated, it's basically the above but repeated. Shifting an i8
by two:
lsl r0
lsl r0
Shifting an i16
by two:
lsl r0
ror r1
lsl r0
ror r1
Shifting an i24
by three:
lsl r0
ror r1
ror r2
lsl r0
ror r1
ror r2
lsl r0
ror r1
ror r2
Sadly there's not a more sophisticated way to do it.
Shift by 4
Shifting an i8
can be done somewhat more efficiently using the swap
instruction:
swap r0
andi r0, 0xf0
The swap
instruction swaps the lower 4 bits and the upper 4 bits of the byte. The andi
instruction then clears the lower 4 bits. This has the same effect as shifting left by four. (Note that I'm only using r0
here for ease of reading, the andi
instruction only supports registers r16..r31
).
Doing the same for i16
and bigger is a lot more complicated:
swap r1
andi r1, 0xf0
swap r0
eor r1, r0
andi r0, 0xf0
eor r1, r0
The first two instructions are the same as for i8
and shift the value in r1
left by four. However, then there is a complicated dance of XORs:
swap r0
does a similar swap.eor r1, r0
effectively doesr1 ^= r0
(XOR). This moves the upper four bits ofr0
(that are now in the lower four bits ofr0
due to the swap) into the previously zero lower four bits ofr1
, thereby shifting left by four bits. The upper bits are now garbage, but that will be restored in step 4.andi r0, 0xf0
zeroes the lower four bits ofr0
, makingr0
fully shifted left.eor r1, r0
will do the samer1 ^= r0
operation again, but this time the bottom four bits ofr0
are zero. Remember thatr1
had the correct bits in the lower four bits and had the upper four bits XORed with the upper four bits ofr1
. This operation (eor r1, r0
) has no effect on the lower four bits inr0
(which were already correct) but fixes the upper four bits, which now contain the expected value.
All in all, this sequence of instructions can shift two bytes left by four in six instructions.
Going further towards i24
repeats the same pattern.
; shift r2
swap r2
andi r2, 0xf0
; shift r1
swap r1
eor r2, r1
andi r1, 0xf0
eor r2, r1
; shift r0
swap r0
eor r1, r0
andi r0, 0xf0
eor r1, r0
Shift by 5
Shifting by 5 is basically a shift by four plus a shift by one. For i8
:
swap r0
andi r0, 0xf0
lsl r0
For i16
:
; shift r1 by 4
swap r1
andi r1, 0xf0
; shift r0 by 4
swap r0
eor r1, r0
andi r0, 0xf0
eor r1, r0
; shift r1 and r0 by 1
lsl r0
rol r1
You can also do the lsl
and rol
at the start like GCC does.
Shift by 6
Shifting left an i8
is still relatively easy:
ror r0
ror r0
ror r0
andi r0, 0xc0
Remember that ror
rotates among nine bits, so to rotate two bits across eight bits you need three ror
operations. At that point the upper two bits contain the previously lower two bits (as intended) and the rest is unimportant so is zeroed using andi
. The lower six bits contain five of the old bits and the carry bit that was in the status register before so it is really just garbage.
Shifting an i16
or bigger left by 6 bits is more efficiently done by shifting right by two into a temporary register and then moving them all back. I will indicate the temporary register with __tmp_reg__
.
; clear temporary register
clr __tmp_reg__
; shift by one
lsr r1
ror r0
ror __tmp_reg__
; shift again by one
lsr r1
ror r0
ror __tmp_reg__
; move registers back
mov r1, r0
mov r0, __tmp_reg__
This could be done more efficiently before register allocation: you would create a new virtual register for the temporary register and could avoid the two mov
instructions at the end to move the register back. Later instructions would simply treat the value as being in different registers.
Shift by 7
Shifting an i8
is again relatively short and simple:
lsr r0
clr r0
ror r0
The first lsr
shift right by one, but more importantly moves the least significant bit into the carry bit of the status register. Then the clr
will clear the register (but not the carry bit) and the ror
will shift a zero register with the effect that it moves the carry bit into the most significant bit of r0
.
Note that we could use a different register after the first instruction. Because all data is contained in the status register as the carry flag, we could use something like this:
lsr r0
clr r5
ror r5
If you do this expansion before register allocation, you could assign a new virtual register here to allow the register allocator a bit more freedom in assigning registers. This will be important later for shifting an i16
by 15.
There are other variations possible too, like this:
bst r0, 0
clr r0
bld r0, 7
This loads the least significant bit into the T bit of the status register, clears the register, and then copies that bit again into bit 7 (the most significant bit).
Shifting more than 8 bits is a little more complicated, like shifting by 6 this involves shifting right by one and moving registers around, effectively shifting one to the right and then 8 back to the left:
; clear temporary register before use
clr __tmp_reg__
; shift 1 to the right into the temporary register
lsr r1
ror r0
ror __tmp_reg__
; shift left by 8
mov r1, r0
mov r0, __tmp_reg__
Or shifting an i24
:
; clear temporary register before use
clr __tmp_reg__
; shift 1 to the right into the temporary register
lsr r2
ror r1
ror r0
ror __tmp_reg__
; shift left by 8
mov r2, r1
mov r1, r0
mov r0, __tmp_reg__
Again, this can be optimized when it is done before register allocation. In fact, in that case it can be optimized even further by reusing the most significant register as the (new) least significant register. This is done as follows for i16
:
lsr r1
ror r0
clr newreg
ror newreg
In this case the register allocator can recognize that r1
is not used afterwards and optimize to code like this:
lsr r1
ror r0
clr r1
ror r1
After this operation, the registers are reversed: r0
contains the most significant bits and r1
contains the least significant bits.
Shifting multiples of 8
Shifting multiples of 8 is basically just moving registers around and clearing some registers. For example, to shift an i16
left by 8:
mov r1, r0
clr r0
And this is where doing these expansions before register allocation really shines. Instead of moving these registers around, the register allocator could just use r0
in later instructions that want the upper byte and use the fixed zero register as the lower byte. In the ideal case, that would make this shift zero cost! It even frees an existing register, lowering register pressure and potentially reducing necessary spills and reloads.
Shifting an i24
left by 8 is similar:
mov r2, r1
mov r1, r0
clr r0
Or shifting an i24
left by 16:
mov r2, r0
clr r1
clr r0
You get the idea. Shifting by multiples of 8 only involves register moves and zeroing registers so is very cheap.
Shift by more than 8
Shifting by different amounts is as simple as first doing a multiple-of-8 shift and then shifting the remaining registers. For example, shifting an i24
by 10:
; shift left by 8
mov r2, r1
mov r1, r0
clr r0
; shift left by 2 (top 16 bits)
lsl r1
ror r2
lsl r1
ror r2
Or shifting an i16
left by 15:
; shift left by 8
mov r1, r0
clr r0
; shift upper register by 7
lsr r1
clr r1
ror r1
GCC produces similar code:
; shift left by 8, leaving r0 dirty
mov r1,r0
; shift upper register by 7
ror r1
clr r1
ror r1
; zero lower register that was previously left dirty
ldi r0,0
However, this code is not perfect, this shift could have been done in just four instructions:
lsr r0
clr r1
ror r1
clr r0
I think a smart register allocator could figure this out when given the proper hints. As I described above, the two instructions after lsr
do not have to be the same register and should therefore be given a different virtual register. The register allocator therefore sees something like this:
lsr r0
clr newreg1 ; newreg is a new virtual register
ror newreg1
clr newreg2
Like before, the clr
to zero the lower bits could possibly be replaced with a direct use of the zero register by its users, possibly reducing this shift to just three instructions instead of the five that avr-gcc generates.
Shifting an i24
or larger left by 8+6 or 8+7 is more difficult. The naive (but correct) method to shift an i24
left by 8+7 (=15) would be as follows:
; shift left by 8
mov r2, r1
mov r1, r0
clr r0
; shift left by 7 (see above)
clr __tmp_reg__
lsr r2
ror r1
ror __tmp_reg__
mov r2, r1
mov r1, __tmp_reg__
There are four mov
instructions, certainly there must be a more efficient way to do it:
; shift left by 16
mov r2, r0
; shift 16 bits right by 1, using r1 as __tmp_reg__
lsr r1
ror r2
clr r1 ; clear "__tmp_reg__" before use
ror r1
; clear bottom 8 bits
clr r0
This makes use of the fact that a shift left by 7 is really just a shift right by 1 over one extra register.
I'm sure this can be further generalized, but I don't see an easy solution. The same pattern sadly doesn't work for shifting an i32
left by 15. But as always, there is the fallback of first doing the modulo-8 register move and then doing a shift left by 7:
; shift left by 8
mov r3, r2
mov r2, r1
mov r1, r0
clr r0
; shift left by 7 (see above)
clr __tmp_reg__
lsr r3
ror r2
ror r1
ror __tmp_reg__
mov r3, r2
mov r2, r1
mov r1, __tmp_reg__
I think a good register allocator could, like before, optimize this if given the proper hints. That said, I've checked avr-gcc and it doesn't do a good job of optimizing i24
and i32
either (__int24
and long
). For i24
it often produces calls to __mulpsi3
and for i32
it usually expands to a loop even though a faster solution should be available.
Code size optimizations
As you can see, these sequences of code can be relatively long. Instead, the above can in all cases be emitted as a loop instead. For example for an i8
:
ldi __tmp_reg__, 5 ; could be any shift amount
1:
lsl r0
dec __tmp_reg__
brne 1b
For i16
:
ldi __tmp_reg__, 5
1:
lsl r0
rol r1
dec __tmp_reg__
brne 1b
For i24
:
ldi __tmp_reg__, 5
1:
lsl r0
rol r1
rol r2
dec __tmp_reg__
brne 1b
This can easily be extended to any modulo-8 bit width.
The total number of instructions is the number of registers used by the number plus 3, so for shifting an i16
you need 2+3=5 instructions. For i8
you need 4 instructions. Because all expansions above for i8
only need up to 4 instructions, expanding to a loop is never advantageous. For i16
it can be advantageous sometimes, such as when shifting by 5 (which would otherwise take up 6 instructions). Of course, loops are practically always slower so there is a code size / performance tradeoff. Because of that, avr-gcc will usually expand inline when optimizing for speed (-O2
) and will use a loop when optimizing for size and the loop uses fewer instructions.
Comparison with GCC
While writing this post, I ran a lot of code snippets in avr-gcc, making extensive use of Compiler Explorer. I wanted to figure out whether there was any pattern in the instruction sequences it emits for various shift amounts. And I believe I've found some patterns, which I've already mostly described above. However, I also found a number of limitations (some of which I already mentioned previously). I'll go through them one by one.
First, avr-gcc appears to do shift expansion after register allocation. This is clearly visible in a sample like this:
void foo(int n, int *ptr) {
*ptr = n << 6;
}
The output (with n
in r24:r25
and ptr
in r22:r23
) is:
clr __tmp_reg__
lsr r25
ror r24
ror __tmp_reg__
lsr r25
ror r24
ror __tmp_reg__
mov r25, r24
mov r24, __tmp_reg__
movw r30, r22
std Z+1, r25
st Z, r24
ret
The two mov
instructions at the end aren't necessary, it would have been possible to store the underlying registers directly. Like this:
clr __tmp_reg__
lsr r25
ror r24
ror __tmp_reg__
lsr r25
ror r24
ror __tmp_reg__
movw r30, r22
std Z+1, r24
st Z, __tmp_reg__
ret
In some other examples, it does appear to be a bit smarter than this but still I suspect a lot of expansions are done after register allocation.
Another issue with avr-gcc is that while it supports shifting i8
and i16
values well, it barely supports i24
and i32
(usable as __int24
and long
). i24
is often lowered using calls to __mulpsi3
(an i24
multiply function) instead of doing an inline expansion, even with -O2
. And i32
is often lowered using many more instructions than necessary or using a loop even though more efficient solutions should be available.
Conclusion
I've dug deep into how arbitrary constant shifts can be implemented on AVR. I think that when these are implemented before register allocation, they can be as short and efficient as possible. However, even after register allocation I think shift instructions can be lowered close to the optimal sequence.