Optimize for speed? Unefficient code

Posted: Fri Mar 30, 2018 7:28 am
by siggi
When compiling my ZX81 midiplayer I also had a look at the generated code, because some parts of the player are time critical.
I found places where inefficient code has been generated.
This always occurs when a variable needs to be incremented or decremented, e.g.
"ev++;".

The generated code increments the value in memory via registers, but then decrements the register again. That is useful for comparisons based on the old value of that variable, e.g.
"if (ev++) ..."
but wastes time (and space) if only the increment is needed. The generated code is, e.g.:

;                debug_delay_calls++;
        C_LINE        211,"MidiPlayer.c"
        C_LINE        211,"MidiPlayer.c"
        ld        hl,(_debug_delay_calls)
        inc        hl
        ld        (_debug_delay_calls),hl
        dec        hl
The last "dec hl" is not necessary here. With bigger data types (e.g. long), the decrement is a call to a library routine (long dec), which is likewise unnecessary.

So I did some optimizations by pasting the generated assembler code into my C program (which is getting more and more ugly) and optimizing it by hand.
Is there any compiler option available to avoid this?

Siggi

Posted: Fri Mar 30, 2018 9:38 am
by dom
A couple of things.

Most of the dead assembler elimination for sccz80 is done by copt. Running with the -c-code-in-asm option will clobber some of the checks for a situation like this.

In this case use a pre-increment rather than a post-increment.

Posted: Fri Mar 30, 2018 11:20 am
by siggi
Thanks for that information.

Maybe you could put these kinds of things (like speed) into a WIKI page that could be easily found?

Siggi

Posted: Fri Mar 30, 2018 3:39 pm
by alvin
Although be careful because the post increment and pre increment don't mean the same thing.

if (ev++) ...

Means increment the value but use the old value in the if test. That's why the following "dec hl" must be there.

if (++ev) ...

Means increment the value and use the new value for the if test.
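
As a small illustration of the difference described above (a sketch only; the function names are made up for this example and are not from siggi's player):

```c
/* Hypothetical helpers illustrating post- vs pre-increment in a test. */

/* Post-increment: the if sees the OLD value, so the compiler
   has to keep that old value around after storing the new one. */
int taken_with_post_increment(int *ev)
{
    if ((*ev)++) {
        return 1;   /* branch taken: old value was non-zero */
    }
    return 0;       /* branch not taken: old value was zero */
}

/* Pre-increment: the if sees the NEW value, so nothing
   has to be undone after the store. */
int taken_with_pre_increment(int *ev)
{
    if (++(*ev)) {
        return 1;   /* branch taken: new value is non-zero */
    }
    return 0;       /* branch not taken: new value is zero */
}

/* As a plain statement both forms leave the same value in the
   variable, so the pre-increment form can be used safely here. */
void count_event(int *counter)
{
    ++(*counter);
}
```

Starting from ev = 0, the post-increment test is not taken while the pre-increment test is, and in both cases ev ends up as 1. So swapping the two forms is only safe where the value of the expression is not used.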

Posted: Sun Apr 01, 2018 9:01 pm
by dom
siggi wrote:Thanks for that information.

Maybe you could put these kinds of things (like speed) into a WIKI page that could be easily found?

Siggi
It's scattered about the place, but I've brought it together here: https://github.com/z88dk/z88dk/wiki/WritingOptimalCode

Some of the tips are based on my current conditional branch which will be merged in soon.

Posted: Tue Apr 03, 2018 5:41 pm
by siggi
Hi Dom
thanks again. That info helps a lot!

Siggi

Posted: Wed Apr 04, 2018 6:57 pm
by siggi
Hi Dom
here are my "benchmarks" using the latest compiler, giving strange results!

My midiplayer has a counter that is incremented whenever a delay between two MIDI events (defined in the MIDI file) could not be met, because the system (Zeddy player, USB stick holding the MIDI file, RS232 UART used to send the MIDI data) is too slow to get back from its work in time for the next MIDI event.
My first player version was compiled using the old Z88DK version dated December 2015. Playing my test song, I got a value of 825 delay violations (goal is 0).
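
To make clear what is being counted, here is a rough sketch of such a counter. It is not the actual player code: the names, the tick unit and the way time is passed in are all assumptions made for this example.

```c
/* Hypothetical sketch of a "delay violation" counter.
   Times are plain tick counts; timer wrap-around is ignored here. */

static unsigned int delay_violations = 0;

/* Returns the number of ticks still to wait before the next event.
   If the deadline has already passed, the player was too slow:
   count a violation and do not wait at all. */
unsigned int ticks_until(unsigned int now, unsigned int deadline)
{
    if (now > deadline) {
        ++delay_violations;   /* statement form, old value not needed */
        return 0;
    }
    return deadline - now;
}
```

A result of 0 violations would mean every event was sent on time.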

After that I optimized the code for speed (also using your hints) and built it with the latest compiler, using these options:
zcc +zx81 -startup=2 -m -O2 --opt-code-speed=all -create-app -Cz--disable-autorun -vn -o MidiPlay.bin MidiPlayer.c

Running my test song I got 800 delay violations (good progress :) )
But when I compiled that optimized program again using the old compiler version and ran my test song again, I got a better result: 785 delay violations!
It seems that the old compiler version creates faster code than the new version ...

Then I compiled an old project where the size of the program is critical. I compiled with these options:
zcc +zx81 -startup=2 -O3 -zorg=11192 -vn -DDRIVER=8192 -o ufm-11192.bin ufm-driver.c
which gave a file size of 5475 bytes (too big!), using the latest compiler.

When I used the old compiler version again, I got a file size of 5095 bytes (which is OK; the limit is 5192).

Thus the current state (at least for my projects) is:
the current compiler makes slower and bigger code than the old compiler (using the same source and compiler options).

???

Siggi

Posted: Wed Apr 04, 2018 8:28 pm
by dom
That's odd, I can believe that files may be bigger - more stuff is being inlined, but slower? That shouldn't be the case at all - unless you're calling a lot of routines that use the index register.

Can you send me (via email: dom /at/ z88dk /dot/ org) the sources and your binaries/.maps, and I'll take a look?

The good news is that my tips worked!

Posted: Wed Apr 04, 2018 8:52 pm
by siggi
Hi Dom
e-mail is sent!

Regards
Siggi

Posted: Wed Apr 04, 2018 11:32 pm
by dom
I'm working offline with Siggi on this, but it looks like the increase in size is due to library changes - probably the extra functionality within stdio and the importing of the new lib integer maths routines.

The slowdown may well be related to a library routine as well given the compiler generated code has only minor differences.

Posted: Tue May 15, 2018 9:16 am
by siggi
This is the current state (including speed optimizations) of my ZX81 midi-player:
https://youtu.be/kD9Tkxjx7yg

:-)
Siggi

Posted: Tue May 15, 2018 8:13 pm
by dom
That's really cool - I'm glad it's working so well.

Can you talk through that ZX81 setup? It doesn't look quite like my one.

Posted: Wed May 16, 2018 3:44 pm
by siggi
The Zeddy (a ZXNU: a ZX81 clone without ULA) is mounted to the left side of a small rack. Internally it has an interface for VDRIVE2 to use a USB stick as mass storage. About the ZXNU:
http://forum.tlienhard.com/phpBB3/viewt ... f=2&t=1029
"Out of the box" the ZXNU has 80 KB RAM, but I modded it to use 96 KB of the 128 KB RAM chip.

The backplane in the rack is connected to the Zeddy via a bus driver board (between the Zeddy and the left side of the rack). 7 cards can be connected to the backplane. Currently it is equipped with (from left to right):

- a sound card (AY compatible, active speakers on top of the rack)
- a keyboard buffer card for the external Memotech keyboard (see http://forum.tlienhard.com/phpBB3/viewt ... f=2&t=2745 )
- an MMC card interface used as MEFISDOS drive
- an RS232 card (with 8251 UART) used for MIDI output (see http://forum.tlienhard.com/phpBB3/viewt ... f=2&t=2404 )
- a ZeddyNet (Ethernet) card (see http://forum.tlienhard.com/phpBB3/viewt ... 19#p10835)

The serial board output (RS232 voltage level) is converted on a small Vero board into a current loop signal, as used by MIDI devices. The MIDI signal goes to a Yamaha synthesizer/keyboard.

Posted: Thu Oct 15, 2020 9:47 am
by cborn
Hello,
I don't know where to put my remark, so I'll try it here since the title says 'optimize'.
A thread on WOS mentions some z88dk asm code and how to optimize it.
It shows that the compiler (at that time) created an extra 'or a,a' after a 'dec a':
https://worldofspectrum.org/forums/disc ... ent_970778

I don't think the 'or a,a' is needed, but I really have no clue which part of which compiler this is, or whether it still works like that.
OR resets most flags, while DEC has multiple outcomes. Perhaps there are cases where the flags should be reset after a DEC A, but those will be rare and probably only occur between separately compiled parts, as far as I can imagine, and not inside an asm loop. I hope I see this right and that it is usable.

Posted: Thu Oct 15, 2020 1:30 pm
by dom
Why not just use memset for this "problem" - both compilers have logic to inline it where one or more parameters are const, using djnz as appropriate or ldir for longer blocks.

They can also inline:

memcpy
strcpy
strchr

Not that the z88dk library implementations are bad, but they are general purpose, so I think they assume 16-bit lengths, and you have to swallow the frame setup/call cost.
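
For illustration, a minimal sketch of the kind of call meant here, with a constant fill value and a constant length. The buffer name and size are made up for this example, and whether the call really gets inlined depends on the compiler, its version and the options used.

```c
#include <string.h>

/* Hypothetical buffer for this example. */
#define ROW_LEN 32
static unsigned char row[ROW_LEN];

/* Fill value and length are compile-time constants, which is the
   case where the compiler may inline memset (assumption: this
   depends on compiler version and options). */
void clear_row(void)
{
    memset(row, 0, ROW_LEN);
}

/* The hand-written 8-bit loop that such a call replaces. */
void clear_row_loop(void)
{
    unsigned char i;
    for (i = ROW_LEN; i != 0; --i) {
        row[i - 1] = 0;
    }
}
```

Both functions leave the buffer in the same state; the memset form leaves the choice of loop to the compiler and library.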

Posted: Thu Oct 15, 2020 5:11 pm
by cborn
Hi,
actually I was not reacting to optimizing the C commands from 16 to 8 bit; I was pointing to the compiler result, i.e. the asm output, doing the same thing twice.
2 points:
a) djnz does NOT touch the Zero flag
b) 'OR a,a' is used to do that after all

In the 8-bit variant djnz is avoided and a standard 'DEC A' is used.
2 points again:
a) DEC always influences the Zero flag
b) 'OR a,a' just repeats that setting or clearing of the flag, and removes all other flag results as well by resetting them to 0

The compiler routine itself can be shorter and quicker if, in the case of no djnz, the 'OR a,a' is removed.
If I coded it manually I would remove it.
So I think the compiler itself can be optimized IF the compiler really does (re)set the Zero flag twice like that.
And I will definitely look at the C functions you suggest, since I need to learn C instead of asm.
edit:
I think using 'OR a,a' is a kind of patch for 'djnz' and should only be used sometimes.

Posted: Sat Oct 17, 2020 8:20 pm
by dom
Yes, the or a is redundant, it's one of those things that would be taken out by the peepholer.

However, my point stands - it's always best to use the library to achieve things if possible. Both the standard library and the target-specific library have a lot of functionality that should be used in preference to writing it yourself.