Efficiency of globals vs local variables in ZSDCC

derekfountain
Member
Posts: 50
Joined: Mon Mar 26, 2018 1:49 pm

Post by derekfountain »

I've picked up the conventional wisdom that using global variables is better in Z88DK than locals. But am I now correct in thinking that that's dated information, and that ZSDCC produces better code if you use locals?

I have a large function which does collision detection. Its flow is to start with x,y coords, then add 8 to one of them, call a function, then add 1, call the function again, then add 8 to the other, call the function, and so on. There are lots of branches which depend on the way a sprite is facing, etc. I tried to optimise it by replacing all the local variables (most are temporaries holding the results of adding the offsets) with globals. The code size went from 861 bytes to 1159, and although the T-state difference was hard to measure empirically, I'm pretty sure it got slower.

So I wrote a test case and had a play with it. Consider this simple code:

Code: Select all

unsigned int test1( unsigned char x, unsigned int y )
{
  unsigned int result;

  unsigned char x_local = x;
  unsigned int  y_local = y;

  result = x_local + y_local;

  return result;
}
That compiles to:

Code: Select all

288   0031  DD E5               push        ix
289   0033  DD 21 00 00         ld        ix,0
290   0037  DD 39               add        ix,sp
291   0039              ;test.c:23: unsigned char x_local = x;
292   0039  DD 5E 04            ld        e,(ix+4)
293   003C              ;test.c:24: unsigned int  y_local = y;
294   003C  DD 4E 05            ld        c,(ix+5)
295   003F  DD 46 06            ld        b,(ix+6)
296   0042              ;test.c:26: result = x_local + y_local;
297   0042  26 00               ld        h,0x00
298   0044  6B                  ld        l, e
299   0045  09                  add        hl, bc
300   0046              ;test.c:28: return result;
301   0046              ;test.c:29: }
302   0046  DD E1               pop        ix
303   0048  C9                  ret
which is 24 bytes and 137 Ts (according to FUSE debugger).

The alternative:

Code: Select all

unsigned int result;
unsigned char x_local;
unsigned int  y_local;

unsigned int test2( unsigned char x, unsigned int y )
{
  x_local = x;
  y_local = y;

  result = x_local + y_local;

  return result;
}
which compiles to:

Code: Select all

257   0000  DD E5               push        ix
258   0002  DD 21 00 00         ld        ix,0
259   0006  DD 39               add        ix,sp
260   0008              ;test.c:11: x_local = x;
261   0008  DD 7E 04            ld        a,(ix+4)
262   000B  32 02 00            ld        (_x_local),a
263   000E              ;test.c:12: y_local = y;
264   000E  DD 6E 05            ld        l,(ix+5)
265   0011  DD 66 06            ld        h,(ix+6)
266   0014  22 03 00            ld        (_y_local),hl
267   0017              ;test.c:14: result = x_local + y_local;
268   0017  3A 02 00            ld        a,(_x_local)
269   001A  06 00               ld        b,0x00
270   001C  21 03 00            ld        hl,_y_local
271   001F  86                  add        a, (hl)
272   0020  32 00 00            ld        (_result),a
273   0023  78                  ld        a, b
274   0024  21 04 00            ld        hl,_y_local + 1
275   0027  8E                  adc        a, (hl)
276   0028  32 01 00            ld        (_result + 1),a
277   002B              ;test.c:16: return result;
278   002B  2A 00 00            ld        hl, (_result)
279   002E              ;test.c:17: }
280   002E  DD E1               pop        ix
281   0030  C9                  ret
which is 49 bytes (addresses 0000 to 0030 in the listing) and 244 Ts.

I can see the copying of the input values from the stack into the globals takes time and instructions, and that only happens once which makes it a bit of an unfair comparison in such a small, simple testcase. Even so, the pattern of loading a memory location, reading or writing from/to it, is clearly heavier than using the index register, and in my game code the difference is quite marked.
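As a cross-check on those figures, here is a minimal sketch that tallies the documented Z80 T-state costs for each instruction in the two listings, ignoring any memory contention. The function and constant names are made up for illustration. The totals come to 147 and 254, each exactly one ret (10 Ts) more than the FUSE numbers, which would be consistent with the measurement stopping before the final ret, though that is only a guess.

```c
#include <stddef.h>

/* Uncontended Z80 T-state costs for the instructions in the two listings. */
enum {
    PUSH_IX = 15, LD_IX_NN = 14, ADD_IX_SP = 15, POP_IX = 14, RET = 10,
    LD_R_IXD = 19,    /* ld r,(ix+d) */
    LD_R_N = 7,       /* ld r,n */
    LD_R_R = 4,       /* ld r,r' */
    ADD_HL_BC = 11,   /* add hl,bc */
    LD_A_ABS = 13,    /* ld a,(nn) and ld (nn),a */
    LD_HL_ABS = 16,   /* ld hl,(nn) and ld (nn),hl */
    LD_HL_NN = 10,    /* ld hl,nn */
    ALU_A_HL = 7      /* add a,(hl) and adc a,(hl) */
};

/* Add up a table of per-instruction costs. */
static unsigned total(const unsigned char *t, size_t n)
{
    unsigned sum = 0;
    size_t i;
    for (i = 0; i < n; ++i)
        sum += t[i];
    return sum;
}

/* Locals version: every instruction of test1, in listing order. */
unsigned test1_tstates(void)
{
    static const unsigned char t[] = {
        PUSH_IX, LD_IX_NN, ADD_IX_SP,
        LD_R_IXD, LD_R_IXD, LD_R_IXD,
        LD_R_N, LD_R_R, ADD_HL_BC,
        POP_IX, RET
    };
    return total(t, sizeof t);
}

/* Globals version: every instruction of test2, in listing order. */
unsigned test2_tstates(void)
{
    static const unsigned char t[] = {
        PUSH_IX, LD_IX_NN, ADD_IX_SP,
        LD_R_IXD, LD_A_ABS,
        LD_R_IXD, LD_R_IXD, LD_HL_ABS,
        LD_A_ABS, LD_R_N, LD_HL_NN, ALU_A_HL, LD_A_ABS,
        LD_R_R, LD_HL_NN, ALU_A_HL, LD_A_ABS,
        LD_HL_ABS,
        POP_IX, RET
    };
    return total(t, sizeof t);
}
```

Note that a single absolute access such as ld a,(nn) costs 13 Ts against 19 for ld a,(ix+d); the globals version loses here because it executes many more instructions, not because each access is dearer.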

So, what is the current conventional wisdom? Am I doing this wrong and drawing incorrect conclusions? Has the ZSDCC compiler been improved in this area or was it always advice only for sccz80 users? Most importantly, what's the current advice in this area? Globals or locals?
dom
Well known member
Posts: 1302
Joined: Sun Jul 15, 2007 10:01 pm

Post by dom »

As usual, the answer is "it depends".

Zsdcc is actually pretty good at picking variables up off the stack: ld l,(ix+dd) is much quicker than the sp relative work that sccz80 has to do if the variable is not in the bottom 2 stack locations. The sp relative handling of sccz80 is great when it comes to dealing with the Rabbit processor though!

Regular zsdcc has traditionally been bad at handling statics, though aralbrec has put a lot of effort into improving that, whereas sccz80 has historically tended to be a bit better, which is where that advice comes from. Even today it looks like sdcc tends to prefer using 8 bit operations for statics, whereas sccz80 usually prefers 16 bit operations.

In your snippet you've made the temporaries global - making them local static will generate a different code sequence for zsdcc though I've not seen it completely eliminate access.
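For reference, the local static form of the test case would look something like the sketch below. The name test3 is hypothetical, simply continuing the numbering of the original test case, and no claim is made here about the code either compiler generates for it; that still needs checking in the listing file.

```c
/* Same test case, with the temporaries as local statics instead of globals.
   test3 is a hypothetical name, not part of the original test file. */
unsigned int test3(unsigned char x, unsigned int y)
{
    /* Fixed addresses like globals, but visible only inside this function. */
    static unsigned int  result;
    static unsigned char x_local;
    static unsigned int  y_local;

    x_local = x;
    y_local = y;

    result = x_local + y_local;

    return result;
}
```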

I don't think I've really answered the question and I suspect that it's impossible to give a one-size-fits-all recommendation which sadly means it's back to relying on emulator profiling support.
alvin
Well known member
Posts: 1872
Joined: Mon Jul 16, 2007 7:39 pm

Post by alvin »

SCCZ80 picks up a statement, generates code for it and is done.
ZSDCC looks at all the code and decides how to allocate values to registers that persist across statements so that it doesn't have to constantly read/write to memory.

For ZSDCC you always want to use locals because then the compiler can hold those locals in registers instead of memory. But the Z80 does not have an infinite supply of registers available - in fact not that many at all. So if you have too many locals, the locals get spilled into the stack frame where they are accessed via (ix+n) addressing, which is slower than global or static memory if done too frequently. So you want only a small number of things local and the rest global for best performance. What the compiler can efficiently juggle in registers is the "working set" - the set of values it most frequently needs in the current block. So you want to write code in blocks that keep a working set local, and make static the values that are accessed less frequently or that are harder to access (due to limitations in Z80 addressing modes).
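A rough sketch of that split, loosely modelled on the collision check described in the first post: two hot temporaries stay local, and the rarely written bookkeeping is static. Every name here (is_solid, check_right_edge, last_hit_x and so on) is invented for illustration, and the wall rule is a placeholder. Whether the locals really end up in registers has to be confirmed in the listing.

```c
/* Hypothetical stand-in for the real tile test: x >= 200 counts as a wall. */
static unsigned char is_solid(unsigned char x, unsigned char y)
{
    (void)y;
    return x >= 200;
}

/* Rarely touched bookkeeping kept at fixed addresses, out of the stack frame. */
static unsigned char last_hit_x;
static unsigned char last_hit_y;
static unsigned int  hit_count;

/* Probe two pixels along the right edge of an 8 pixel wide sprite. */
unsigned char check_right_edge(unsigned char x, unsigned char y)
{
    /* Small working set: these two locals are used on every iteration. */
    unsigned char edge_x = (unsigned char)(x + 8);
    unsigned char i;

    for (i = 0; i != 2; ++i) {
        if (is_solid(edge_x, (unsigned char)(y + i))) {
            /* Written only on a hit, so static storage costs little here. */
            last_hit_x = edge_x;
            last_hit_y = (unsigned char)(y + i);
            ++hit_count;
            return 1;
        }
    }
    return 0;
}
```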

This for loop is always better in ZSDCC using a local:

Code: Select all

for (unsigned char i = 0; i != 20; ++i)
...
ZSDCC will very likely put "i" into a single 8-bit register for the duration of the loop.

With SCCZ80 the same loop may be worse than one that uses a local static for "i". The reason is that SCCZ80 will put "i" onto the stack, and constantly reading/writing to the stack is normally slower than just reading/writing an 8-bit value at a fixed memory address.
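To make the two variants concrete, here is a small sketch with the counter declared both ways. The function names and the 20 byte summing loop body are assumptions made for the example; which version is faster on a given compiler is exactly the thing to measure rather than take from this sketch.

```c
/* Local counter: the form suggested above for ZSDCC. */
unsigned int sum_local_counter(const unsigned char *p)
{
    unsigned int total = 0;

    for (unsigned char i = 0; i != 20; ++i)
        total += p[i];

    return total;
}

/* Local static counter: the form that may suit SCCZ80 better. */
unsigned int sum_static_counter(const unsigned char *p)
{
    static unsigned char i;   /* fixed address instead of a stack slot */
    unsigned int total = 0;

    for (i = 0; i != 20; ++i)
        total += p[i];

    return total;
}
```

Both functions return the same result, so they can be swapped in a test build and compared under the emulator's profiler.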