Optimizing constant bitshifts on AVR

2021-02-24, by Ayke van Laethem

The AVR instruction set used on microcontrollers from Atmel (now Microchip), does not have shift instructions like most other instruction sets. It can only shift by a single bit at a time, although compilers are often smart enough to shift in a more efficient way.

Lately I've been thinking about how to produce optimal shift instructions in the AVR backend of LLVM, the compiler toolkit that is used by TinyGo, Clang, Rust, Swift, and many other compilers. At the moment, it often produces pretty bad (but correct) code that's both big and slow. Recently Ben Shi has been working on improving this but we're not yet at the point of matching avr-gcc.

In this post I'll investigate how to emit the most efficient sequence of instructions to shift by a constant amount, for any supported integer width. I think there is a strategy to always emit the most compact and efficient instructions, especially if shift expansion is done before register allocation (which appears not to be the case in avr-gcc and is certainly not yet the case in LLVM).

Here I'm only going to describe left shift operations. Right shifting unsigned integers is almost identical but with everything in reverse. Right shifting signed integers is slightly more complicated. I won't cover right shifting here but the basic principles are more or less the same. I also will only cover shifting by a constant amount, as shifting by an amount unknown at compile time is basically just repeatedly shifting by one until enough bits are shifted. Yeah, that's pretty bad but sadly the AVR instruction set doesn't have a better way to do it.

I'll use i8, i16, i24 and i32 as used in LLVM, meaning that these are simple integers. You might be familiar with them if you know Rust, but compared to Rust these integers are sign-less (they are not signed or unsigned). Luckily this signed/unsigned distinction doesn't matter for left shifting.

Regarding registers, I will be using r0..r3 with r0 containing the least significant byte and r3 containing the most significant byte. Because the AVR is an 8 bits processor, an i32 will be contained in four registers.

Warning: I have not tested any of the code samples in this post. While I think they are correct, I haven't verified them so they might be wrong.

Shift by 1

Let's take a look at what compilers can do to lower shift instructions. Shifting left an i8 is simple enough:

lsl r0

The lsl instruction (logical shift left) simply shifts the value in r0 left by one, filling in the empty space with a zero bit. Now let's take a look at shifting an i16:

lsl r0
rol r1

That's slightly more involved. The instruction lsl r0 not only shifts r0 left by one, it also stores the most significant bit that got shifted out in the carry flag (C) of the status register. The rol instruction (rotate left) is not a regular rotate instruction, it does indeed rotate all the bits but instead of moving the most significant bit to the least significant bit it fills the least significant bit with the carry flag and stores the most significant bit in the carry flag. Therefore, it rotates around nine bits, not eight, including the carry flag. This behavior makes it possible to shift larger integers, for example an i24:

lsl r0
rol r1
rol r2

Or an i32:

lsl r0
rol r1
rol r2
rol r3

You get the idea. I won't be showing many i32 examples because from i16 and i24 it's usually easy to extend the behavior to any integer with a bit width that's a multiple of eight.

Shift by 2 and 3

Shifting by two or three isn't exactly sophisticated, it's basically the above but repeated. Shifting an i8 by two:

lsl r0
lsl r0

Shifting an i16 by two:

lsl r0
ror r1
lsl r0
ror r1

Shifting an i24 by three:

lsl r0
ror r1
ror r2
lsl r0
ror r1
ror r2
lsl r0
ror r1
ror r2

Sadly there's not a more sophisticated way to do it.

Shift by 4

Shifting an i8 can be done somewhat more efficiently using the swap instruction:

swap r0
andi r0, 0xf0

The swap instruction swaps the lower 4 bits and the upper 4 bits of the byte. The andi instruction then clears the lower 4 bits. This has the same effect as shifting left by four. (Note that I'm only using r0 here for ease of reading, the andi instruction only supports registers r16..r31).

Doing the same for i16 and bigger is a lot more complicated:

swap r1
andi r1, 0xf0
swap r0
eor r1, r0
andi r0, 0xf0
eor r1, r0

The first two instructions are the same as for i8 and shift the value in r1 left by four. However, then there is a complicated dance of XORs:

swap r0 does a similar swap.
eor r1, r0 effectively does r1 ^= r0 (XOR). This moves the upper four bits of r0 (that are now in the lower four bits of r0 due to the swap) into the previously zero lower four bits of r1, thereby shifting left by four bits. The upper bits are now garbage, but that will be restored in step 4.
andi r0, 0xf0 zeroes the lower four bits of r0, making r0 fully shifted left.
eor r1, r0 will do the same r1 ^= r0 operation again, but this time the bottom four bits of r0 are zero. Remember that r1 had the correct bits in the lower four bits and had the upper four bits XORed with the upper four bits of r1. This operation (eor r1, r0) has no effect on the lower four bits in r0 (which were already correct) but fixes the upper four bits, which now contain the expected value.

All in all, this sequence of instructions can shift two bytes left by four in six instructions.

Going further towards i24 repeats the same pattern.

; shift r2
swap r2
andi r2, 0xf0
; shift r1
swap r1
eor r2, r1
andi r1, 0xf0
eor r2, r1
; shift r0
swap r0
eor r1, r0
andi r0, 0xf0
eor r1, r0

Shift by 5

Shifting by 5 is basically a shift by four plus a shift by one. For i8:

swap r0
andi r0, 0xf0
lsl r0

For i16:

; shift r1 by 4
swap r1
andi r1, 0xf0
; shift r0 by 4
swap r0
eor r1, r0
andi r0, 0xf0
eor r1, r0
; shift r1 and r0 by 1
lsl r0
rol r1

You can also do the lsl and rol at the start like GCC does.

Shift by 6

Shifting left an i8 is still relatively easy:

ror r0
ror r0
ror r0
andi r0, 0xc0

Remember that ror rotates among nine bits, so to rotate two bits across eight bits you need three ror operations. At that point the upper two bits contain the previously lower two bits (as intended) and the rest is unimportant so is zeroed using andi. The lower six bits contain five of the old bits and the carry bit that was in the status register before so it is really just garbage.

Shifting an i16 or bigger left by 6 bits is more efficiently done by shifting right by two into a temporary register and then moving them all back. I will indicate the temporary register with __tmp_reg__.

; clear temporary register
clr __tmp_reg__
; shift by one
lsr r1
ror r0
ror __tmp_reg__
; shift again by one
lsr r1
ror r0
ror __tmp_reg__
; move registers back
mov r1, r0
mov r0, __tmp_reg__

This could be done more efficiently before register allocation: you would create a new virtual register for the temporary register and could avoid the two mov instructions at the end to move the register back. Later instructions would simply treat the value as being in different registers.

Shift by 7

Shifting an i8 is again relatively short and simple:

lsr r0
clr r0
ror r0

The first lsr shift right by one, but more importantly moves the least significant bit into the carry bit of the status register. Then the clr will clear the register (but not the carry bit) and the ror will shift a zero register with the effect that it moves the carry bit into the most significant bit of r0.

Note that we could use a different register after the first instruction. Because all data is contained in the status register as the carry flag, we could use something like this:

lsr r0
clr r5
ror r5

If you do this expansion before register allocation, you could assign a new virtual register here to allow the register allocator a bit more freedom in assigning registers. This will be important later for shifting an i16 by 15.

There are other variations possible too, like this:

bst r0, 0
clr r0
bld r0, 7

This loads the least significant bit into the T bit of the status register, clears the register, and then copies that bit again into bit 7 (the most significant bit).

Shifting more than 8 bits is a little more complicated, like shifting by 6 this involves shifting right by one and moving registers around, effectively shifting one to the right and then 8 back to the left:

; clear temporary register before use
clr __tmp_reg__
; shift 1 to the right into the temporary register
lsr r1
ror r0
ror __tmp_reg__
; shift left by 8
mov r1, r0
mov r0, __tmp_reg__

Or shifting an i24:

; clear temporary register before use
clr __tmp_reg__
; shift 1 to the right into the temporary register
lsr r2
ror r1
ror r0
ror __tmp_reg__
; shift left by 8
mov r2, r1
mov r1, r0
mov r0, __tmp_reg__

Again, this can be optimized when it is done before register allocation. In fact, in that case it can be optimized even further by reusing the most significant register as the (new) least significant register. This is done as follows for i16:

lsr r1
ror r0
clr newreg
ror newreg

In this case the register allocator can recognize that r1 is not used afterwards and optimize to code like this:

lsr r1
ror r0
clr r1
ror r1

After this operation, the registers are reversed: r0 contains the most significant bits and r1 contains the least significant bits.

Shifting multiples of 8

Shifting multiples of 8 is basically just moving registers around and clearing some registers. For example, to shift an i16 left by 8:

mov r1, r0
clr r0

And this is where doing these expansions before register allocation really shines. Instead of moving these registers around, the register allocator could just use r0 in later instructions that want the upper byte and use the fixed zero register as the lower byte. In the ideal case, that would make this shift zero cost! It even frees an existing register, lowering register pressure and potentially reducing necessary spills and reloads.

Shifting an i24 left by 8 is similar:

mov r2, r1
mov r1, r0
clr r0

Or shifting an i24 left by 16:

mov r2, r0
clr r1
clr r0

You get the idea. Shifting by multiples of 8 only involves register moves and zeroing registers so is very cheap.

Shift by more than 8

Shifting by different amounts is as simple as first doing a multiple-of-8 shift and then shifting the remaining registers. For example, shifting an i24 by 10:

; shift left by 8
mov r2, r1
mov r1, r0
clr r0
; shift left by 2 (top 16 bits)
lsl r1
ror r2
lsl r1
ror r2

Or shifting an i16 left by 15:

; shift left by 8
mov r1, r0
clr r0
; shift upper register by 7
lsr r1
clr r1
ror r1

GCC produces similar code:

; shift left by 8, leaving r0 dirty
mov r1,r0
; shift upper register by 7
ror r1
clr r1
ror r1
; zero lower register that was previously left dirty
ldi r0,0

However, this code is not perfect, this shift could have been done in just four instructions:

lsr r0
clr r1
ror r1
clr r0

I think a smart register allocator could figure this out when given the proper hints. As I described above, the two instructions after lsr do not have to be the same register and should therefore be given a different virtual register. The register allocator therefore sees something like this:

lsr r0
clr newreg1 ; newreg is a new virtual register
ror newreg1
clr newreg2

Like before, the clr to zero the lower bits could possibly be replaced with a direct use of the zero register by its users, possibly reducing this shift to just three instructions instead of the five that avr-gcc generates.

Shifting an i24 or larger left by 8+6 or 8+7 is more difficult. The naive (but correct) method to shift an i24 left by 8+7 (=15) would be as follows:

; shift left by 8
mov r2, r1
mov r1, r0
clr r0
; shift left by 7 (see above)
clr __tmp_reg__
lsr r2
ror r1
ror __tmp_reg__
mov r2, r1
mov r1, __tmp_reg__

There are four mov instructions, certainly there must be a more efficient way to do it:

; shift left by 16
mov r2, r0
; shift 16 bits right by 1, using r1 as __tmp_reg__
lsr r1
ror r2
clr r1 ; clear "__tmp_reg__" before use
ror r1
; clear bottom 8 bits
clr r0

This makes use of the fact that a shift left by 7 is really just a shift right by 1 over one extra register.

I'm sure this can be further generalized, but I don't see an easy solution. The same pattern sadly doesn't work for shifting an i32 left by 15. But as always, there is the fallback of first doing the modulo-8 register move and then doing a shift left by 7:

; shift left by 8
mov r3, r2
mov r2, r1
mov r1, r0
clr r0
; shift left by 7 (see above)
clr __tmp_reg__
lsr r3
ror r2
ror r1
ror __tmp_reg__
mov r3, r2
mov r2, r1
mov r1, __tmp_reg__

I think a good register allocator could, like before, optimize this if given the proper hints. That said, I've checked avr-gcc and it doesn't do a good job of optimizing i24 and i32 either (__int24 and long). For i24 it often produces calls to __mulpsi3 and for i32 it usually expands to a loop even though a faster solution should be available.

Code size optimizations

As you can see, these sequences of code can be relatively long. Instead, the above can in all cases be emitted as a loop instead. For example for an i8:

ldi __tmp_reg__, 5 ; could be any shift amount
1:
lsl r0
dec __tmp_reg__
brne 1b

For i16:

ldi __tmp_reg__, 5
1:
lsl r0
rol r1
dec __tmp_reg__
brne 1b

For i24:

ldi __tmp_reg__, 5
1:
lsl r0
rol r1
rol r2
dec __tmp_reg__
brne 1b

This can easily be extended to any modulo-8 bit width.

The total number of instructions is the number of registers used by the number plus 3, so for shifting an i16 you need 2+3=5 instructions. For i8 you need 4 instructions. Because all expansions above for i8 only need up to 4 instructions, expanding to a loop is never advantageous. For i16 it can be advantageous sometimes, such as when shifting by 5 (which would otherwise take up 6 instructions). Of course, loops are practically always slower so there is a code size / performance tradeoff. Because of that, avr-gcc will usually expand inline when optimizing for speed (-O2) and will use a loop when optimizing for size and the loop uses fewer instructions.

Comparison with GCC

While writing this post, I ran a lot of code snippets in avr-gcc, making extensive use of Compiler Explorer. I wanted to figure out whether there was any pattern in the instruction sequences it emits for various shift amounts. And I believe I've found some patterns, which I've already mostly described above. However, I also found a number of limitations (some of which I already mentioned previously). I'll go through them one by one.

First, avr-gcc appears to do shift expansion after register allocation. This is clearly visible in a sample like this:

void foo(int n, int *ptr) {
    *ptr = n << 6;
}

The output (with n in r24:r25 and ptr in r22:r23) is:

clr  __tmp_reg__
lsr  r25
ror  r24
ror  __tmp_reg__
lsr  r25
ror  r24
ror  __tmp_reg__
mov  r25, r24
mov  r24, __tmp_reg__
movw r30, r22
std  Z+1, r25
st   Z,   r24
ret

The two mov instructions at the end aren't necessary, it would have been possible to store the underlying registers directly. Like this:

clr  __tmp_reg__
lsr  r25
ror  r24
ror  __tmp_reg__
lsr  r25
ror  r24
ror  __tmp_reg__
movw r30, r22
std  Z+1, r24
st   Z,   __tmp_reg__
ret

In some other examples, it does appear to be a bit smarter than this but still I suspect a lot of expansions are done after register allocation.

Another issue with avr-gcc is that while it supports shifting i8 and i16 values well, it barely supports i24 and i32 (usable as __int24 and long). i24 is often lowered using calls to __mulpsi3 (an i24 multiply function) instead of doing an inline expansion, even with -O2. And i32 is often lowered using many more instructions than necessary or using a loop even though more efficient solutions should be available.

Conclusion

I've dug deep into how arbitrary constant shifts can be implemented on AVR. I think that when these are implemented before register allocation, they can be as short and efficient as possible. However, even after register allocation I think shift instructions can be lowered close to the optimal sequence.

Updated: 2021-02-24