[z88dk-dev] __z88dk_callee

Philipp Klaus Krause

Post by Philipp Klaus Krause »

In revision #9212, I implemented support for __z88dk_callee in sdcc on
the caller side. __z88dk_callee can be combined with __smallc.

Philipp
alvin
Well known member
Posts: 1872
Joined: Mon Jul 16, 2007 7:39 pm

Post by alvin »

> In revision #9212, I implemented support for __z88dk_callee in sdcc on
> the caller side. __z88dk_callee can be combined with __smallc.

Wow I didn't expect that. That means a whole lot of work just got piled in my direction :)

The changes for sdcc fastcall in the new c library are almost done. I've put an #if guard around them so that they have to be explicitly enabled on the command line until the peephole problem is fixed. To get correct code with the fastcall linkage, the peephole optimizer has to be disabled.

I think fastcall affected about 100 or so functions, but callee affects all the rest.


For the classic library, the same header can be used as with sccz80 with appropriate __smallc, __z88dk_fastcall, __z88dk_callee defined as macros and attached to the prototypes. The only problem is then the preservation of ix where needed.
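A minimal sketch of the shared-header idea, assuming the linkage keywords are hidden behind wrapper macros. The names LIB_SMALLC, LIB_FASTCALL, LIB_CALLEE, demo_len and demo_add are hypothetical and not the library's real names; the C bodies are host-only stand-ins for what would be asm in the library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical wrapper macros: the sdcc keywords appear only under sdcc,
   and expand to nothing for any other compiler. */
#ifdef __SDCC
#define LIB_SMALLC   __smallc
#define LIB_FASTCALL __z88dk_fastcall
#define LIB_CALLEE   __z88dk_callee
#else
#define LIB_SMALLC
#define LIB_FASTCALL
#define LIB_CALLEE
#endif

/* One set of prototypes; callee is combined with smallc on demo_add. */
size_t demo_len(const char *s) LIB_FASTCALL;
int demo_add(int a, int b) LIB_SMALLC LIB_CALLEE;

#ifndef __SDCC
/* Host-only stand-in bodies so the sketch can be compiled and run. */
size_t demo_len(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] != '\0') ++n;
    return n;
}

int demo_add(int a, int b)
{
    return a + b;
}
#endif
```

This leaves the ix preservation problem mentioned above untouched; it only shows how one set of prototypes might carry the linkage.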




Post by alvin »

In the new clib I've added callee linkage to sdcc for stdlib and malloc so far just to test things out. It seems to work perfectly.

Simple test program:


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <arch/spectrum.h>

#pragma output CRT_INITIALIZE_BSS = 1

unsigned char *strings[20];

int string_compare(unsigned char **a, unsigned char **b)
{
   return stricmp(*a, *b);
}

main()
{
   static unsigned int i;
   static unsigned int j;
   static unsigned int sz;

   zx_border(INK_WHITE);
   zx_cls(INK_BLACK | PAPER_WHITE);

   // read strings from stdin

   printf("ENTER UP TO %u LINES OF TEXT.\n", sizeof(strings) / sizeof(unsigned char *));
   printf("END EARLY WITH AN EMPTY LINE.\n\n");

   for (i = 0; i != sizeof(strings) / sizeof(unsigned char *); ++i)
   {
      printf("%2u: ", i+1);

      sz = 0;
      if ((getline(&strings[i], &sz, stdin) == -1) || (sz <= 2))
      {
         if (sz > 0) free(strings[i]);
         break;
      }

      strrstrip(strings[i]);
   }

   // show the strings

   printf("\n\nTHE STRINGS YOU ENTERED:\n\n");

   for (j = 0; j != i; ++j)
      printf("\"%.60s\"\n", strings[j]);

   // sort them

   printf("\nQSORT BEGINS\n\n");
   qsort(strings, i, sizeof(unsigned char *), string_compare);

   for (j = 0; j != i; ++j)
      printf("\"%.60s\"\n", strings[j]);

   printf("\n\n\n");
   return 0;
}


Most of the single parameter functions are also configured for sdcc fastcall (maybe another 30 to go) but it's disabled by default as the sdcc peephole optimizer has to be disabled while using it. To enable sdcc fastcall on the compile line add "-D__SDCC_ENABLE_FASTCALL --no-peep".

In this test I have fastcall disabled even though there are some single param functions in the program.

sdcc compile without callee (-D__SDCC_DISABLE_CALLEE):
8824 bytes

sdcc compile with callee:
8796 bytes

On the surface only qsort() from stdlib.h would seem to be callee, but other callee functions are attached to the binary indirectly. One is realloc() from malloc.h, which is called by the getline() implementation (getline() itself is not callee because I haven't done stdio yet). In the c library the asm implementation of a function is INCLUDEd into its callee C implementation, so a direct call made by the library to the asm entry point of realloc() also gets the callee function attached (or the more expensive standard linkage version if there is no callee).

The printf() here is the full-featured version so it has many fingers in the library with many functions (possibly non-callee) pulled in.

Compared to the sccz80 compile:
8553 bytes

But I think a good chunk of that is the callee+fastcall difference that isn't complete yet for sdcc.

Out of curiosity an sdcc compile with fastcall+callee enabled and peephole optimizer running:
8755 bytes

but of course the code is not correct as some params to fastcall functions are optimized out by the peephole optimizer.




Post by alvin »

The transition to sdcc fastcall is nearly complete. The peephole issue has been fixed so I made fastcall linkage the default which can be switched off with "-D__SDCC_DISABLE_FASTCALL".

Callee linkage is still being worked on. Of the standard portion of the clib, strings and stdlib are done. Callee is active by default and can be disabled with "-D__SDCC_DISABLE_CALLEE".
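A sketch of how a header might honour the two opt-out switches, assuming the linkage keywords are wrapped in macros. Only __SDCC_DISABLE_FASTCALL, __SDCC_DISABLE_CALLEE and the sdcc keywords come from this thread; every DEMO_* and demo_* name is hypothetical, and the real clib headers may be organised quite differently.

```c
#include <assert.h>
#include <limits.h>

/* Fastcall is on by default under sdcc unless explicitly disabled. */
#if defined(__SDCC) && !defined(__SDCC_DISABLE_FASTCALL)
#define DEMO_FASTCALL __z88dk_fastcall
#define DEMO_FASTCALL_ACTIVE 1
#else
#define DEMO_FASTCALL
#define DEMO_FASTCALL_ACTIVE 0
#endif

/* Callee is on by default under sdcc unless explicitly disabled. */
#if defined(__SDCC) && !defined(__SDCC_DISABLE_CALLEE)
#define DEMO_CALLEE __z88dk_callee
#define DEMO_CALLEE_ACTIVE 1
#else
#define DEMO_CALLEE
#define DEMO_CALLEE_ACTIVE 0
#endif

/* Hypothetical prototypes: one parameter -> fastcall, two -> callee. */
int demo_abs(int x) DEMO_FASTCALL;
int demo_max(int a, int b) DEMO_CALLEE;

/* Bit 0 set: fastcall linkage active. Bit 1 set: callee linkage active. */
int demo_linkage_flags(void)
{
    return DEMO_FASTCALL_ACTIVE | (DEMO_CALLEE_ACTIVE << 1);
}

#ifndef __SDCC
/* Host-only stand-in bodies so the sketch can be compiled and run. */
int demo_abs(int x)
{
    assert(x != INT_MIN);   /* -INT_MIN would overflow */
    return x < 0 ? -x : x;
}

int demo_max(int a, int b)
{
    return a > b ? a : b;
}
#endif
```

On a host compiler both flags read as 0, since neither keyword is in play there.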

A compile of the qsort example program is now down to 8742 bytes from 8824 bytes.

A new callee / peephole issue has been found concerning tail calls to a callee function. The peepholer tries to turn the final call into a jp and in so doing pops the saved frame pointer before making the jp. But the item popped is one of the callee params and not the frame pointer.
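For illustration, the C-level shape of a tail call to a callee function looks roughly like the sketch below. The names DEMO_CALLEE, demo_sum and demo_wrapper are hypothetical, and whether sdcc emits the problematic pop/jp sequence for this exact function has not been verified; it only shows the pattern the peephole rule is looking at.

```c
#include <assert.h>

/* Hypothetical wrapper: the keyword only exists under sdcc. */
#ifdef __SDCC
#define DEMO_CALLEE __z88dk_callee
#else
#define DEMO_CALLEE
#endif

/* A two-parameter library function with callee linkage: it removes its
   own parameters from the stack before returning. */
int demo_sum(int a, int b) DEMO_CALLEE;

/* The final action is a call to a callee function. Turning that call into
   a jp is only safe if the stack holds exactly what the target expects,
   which is what goes wrong when a parameter is popped in place of the
   saved frame pointer. */
int demo_wrapper(int a, int b)
{
    int scaled = a * 2;
    return demo_sum(scaled, b);
}

#ifndef __SDCC
/* Host-only stand-in body and a small self-check. */
int demo_sum(int a, int b)
{
    return a + b;
}

void demo_self_check(void)
{
    assert(demo_wrapper(3, 4) == 10);
}
#endif
```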



stefano
Well known member
Posts: 2137
Joined: Mon Jul 16, 2007 7:39 pm

Post by stefano »

Wonderful!



Philipp Klaus Krause

Post by Philipp Klaus Krause »

On 15.04.2015 00:10, alvin (alvin_albrecht@...) wrote:
> > In revision #9212, I implemented support for __z88dk_callee in sdcc
> > on the caller side. __z88dk_callee can be combined with __smallc.
>
> Wow I didn't expect that. That means a whole lot of work just got
> piled in my direction :)
>
> The changes for sdcc fastcall in the new c library are almost done.
> I've put an #if guard around them so that they have to be explicitly
> enabled on the command line until the peephole problem is fixed. To
> get correct code with the fastcall linkage, the peephole optimizer
> has to be disabled.

Unfortunately, my commit #9218 did not fully fix the issue; it worked
for 16-bit and 32-bit arguments when these were literals or global
variables, but some issues remained, in particular with 8-bit arguments
that are local variables. They should be fixed now. Please use #9221 or
newer for __z88dk_fastcall.

Philipp

Post by alvin »

I've completed the transition to callee and fastcall compilation when using sdcc in the new clib except for the sp1 sprite library which I will do when I have another block of free time.

Comparison for qsort_test.c:

sccz80: 8586 bytes
sdcc_iy: 8598 bytes
sdcc_ix: 8723 bytes

With fastcall and callee disabled:
(-D__SDCC_DISABLE_CALLEE -D__SDCC_DISABLE_FASTCALL)

sdcc_iy: 8674 bytes
sdcc_ix: 8798 bytes

The sdcc_iy version of the clib gives ix to sdcc (for frame pointer) and has the library using iy exclusively. This means the library does not have to preserve ix when called by sdcc and this puts sdcc on equal footing with sccz80 which does not reserve any registers. When sdcc_iy is used "--reserve-regs-iy" is always enabled in compiles. Anyway as you can see, sdcc code size is now similar to sccz80 in the test programs I have been trying. These programs are relatively short, use lots of statics and use library functions. Different results may come for code with other characteristics.

In the sdcc_ix compile I've also used "--reserve-regs-iy" for comparison purposes. sdcc will be generating the same code as the sdcc_iy compile but the result is 125 bytes larger! This is down to the extra code necessary to preserve ix around calls to library functions. This not only affects direct calls by the c program to library functions but also hidden calls that the library is making. For example, stdio maintains a linked list of files and to manage that it makes calls to the singly linked list type in the library via the asm interface. However that also pulls in the c interface for those functions and that will add to the binary size even though the c program is not using the linked list type.

For that reason I am wondering if the right decision was made on the structure of the clib. Right now for callee functions (anything more than one parameter) there are two c implementations:

1. Independent entry point for standard C linkage for function pointers. A few bytes that collects params into registers and jumps to the asm implementation. This code never gets added to the binary unless called through a function pointer.

2. Callee entry point that is almost always used. The C implementation gathers parameters into registers and then INCLUDES the asm implementation so that there is no extra jump to the asm code. The lack of jump saves three bytes and 10 cycles. But this also means if the program only calls a function via the asm entry point (as library code will or user asm code might) that C preamble will also be present even if not used. That's what is adding to the sdcc_ix compile size.
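The difference between the two entry points can be modelled in plain C with a toy stack. This is only a sketch of who removes the parameters; the return address is left out of the model, and all names (asm_demo_add, demo_add_standard, demo_add_callee and the call_* helpers) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the Z80 stack: 16-bit words, growing downward. */
#define STACK_WORDS 16
static uint16_t stack_mem[STACK_WORDS];
static int sp = STACK_WORDS;

static void push(uint16_t v) { assert(sp > 0); stack_mem[--sp] = v; }
static uint16_t pop(void) { assert(sp < STACK_WORDS); return stack_mem[sp++]; }

/* Stand-in for the asm implementation: parameters arrive "in registers". */
static uint16_t asm_demo_add(uint16_t hl, uint16_t de)
{
    return (uint16_t)(hl + de);
}

/* 1. Standard linkage entry: reads the parameters where they lie and
      leaves them on the stack for the caller to remove. */
static uint16_t demo_add_standard(void)
{
    uint16_t de = stack_mem[sp];       /* last parameter pushed */
    uint16_t hl = stack_mem[sp + 1];   /* first parameter pushed */
    return asm_demo_add(hl, de);
}

/* 2. Callee entry: pops its own parameters, then runs the asm code. */
static uint16_t demo_add_callee(void)
{
    uint16_t de = pop();
    uint16_t hl = pop();
    return asm_demo_add(hl, de);
}

/* Caller side, standard linkage: push, call, then clean up. */
uint16_t call_standard(uint16_t a, uint16_t b)
{
    uint16_t r;
    push(a);
    push(b);
    r = demo_add_standard();
    (void)pop();                       /* caller removes both parameters */
    (void)pop();
    return r;
}

/* Caller side, callee linkage: push and call, nothing to clean up. */
uint16_t call_callee(uint16_t a, uint16_t b)
{
    push(a);
    push(b);
    return demo_add_callee();
}

/* True when every pushed parameter has been removed again. */
int stack_is_balanced(void) { return sp == STACK_WORDS; }
```

Both callers leave the stack balanced; the callee version simply has no cleanup code at each call site.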

If I changed callee to end in a jump to the asm implementation in the library then calls to the asm entry point will not add the c preamble code to binaries. This will make sdcc_ix compiles a lot smaller and sccz80/sdcc_iy calls maybe a little bit smaller, assuming the number of hidden calls to library functions outweigh the number of explicit calls from the c code.

I've also been generating four different versions of the library: sdcc_ix, sdcc_iy, sccz80 and asm. "asm" is the asm version of the library without any c linkages available. It's meant to be used with asm only programs. The C linkage code in sdcc_ix, sdcc_iy and sccz80 is different in each. What this means is you can't mix C code compiled with sdcc with C code compiled with sccz80 (or rather you shouldn't; you can if you accept duplicate implementations).

If I make this change so that callee jumps to the asm implementation rather than includes the asm implementation, all these libraries can be condensed into one. This single library would contain the asm implementations as independent entities (callable from asm without being penalized with the addition of C preamble code) and uniquely named C linkage entry points for sdcc_ix, sdcc_iy and sccz80 that would jump into the asm implementation instead of including it. So, e.g., there would be one implementation of memcpy() called "asm_memcpy" and there would be three callee entry points from C named "memcpy_callee_sdccix", "memcpy_callee_sdcciy" and "memcpy_callee_sccz80". The header file would sort out the naming depending on which compiler was doing the compiling. With all that in one library, you could then mix compilers in a single project. You could also add more C compilers in a simpler manner (I have been wondering if Hitech C under cpm is now accessible with John Elliott's or udo's tools).
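A rough sketch of that naming scheme, using the entry point names from the post. The public name demo_memcpy and the DEMO_SDCC_IY selector macro are hypothetical, and the C bodies are host-side stand-ins for what would be a linkage preamble plus a jump into asm.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the one shared asm implementation. */
void *asm_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--) *d++ = *s++;
    return dst;
}

/* Per-compiler entry points: each adapts its own linkage, then hands
   over to the shared implementation instead of including a copy of it. */
void *memcpy_callee_sdccix(void *dst, const void *src, size_t n)
{
    return asm_memcpy(dst, src, n);
}

void *memcpy_callee_sdcciy(void *dst, const void *src, size_t n)
{
    return asm_memcpy(dst, src, n);
}

void *memcpy_callee_sccz80(void *dst, const void *src, size_t n)
{
    return asm_memcpy(dst, src, n);
}

/* Header side: pick the entry point for the compiler doing the compiling.
   DEMO_SDCC_IY is a hypothetical macro for selecting the sdcc_iy variant. */
#if defined(__SCCZ80)
#define demo_memcpy memcpy_callee_sccz80
#elif defined(__SDCC) && defined(DEMO_SDCC_IY)
#define demo_memcpy memcpy_callee_sdcciy
#elif defined(__SDCC)
#define demo_memcpy memcpy_callee_sdccix
#else
#define demo_memcpy memcpy_callee_sccz80   /* host build: arbitrary choice */
#endif

/* Copies a short string through the selected entry point.
   Returns 1 when the copy arrived intact. */
int demo_copy_check(void)
{
    char buf[6] = {0};
    void *r = demo_memcpy(buf, "hello", 6);
    assert(r == buf);
    return buf[0] == 'h' && buf[4] == 'o' && buf[5] == '\0';
}
```

Because the entry points have distinct names, object files from different compilers could in principle link against the same library without clashing.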

The cost is those extra 10 cycles (and 3 bytes) per C call. I think I've probably talked myself into this now but maybe there is some comment?




Post by stefano »

> If I make this change so that callee jumps to the asm implementation rather than includes the asm implementation, all these libraries can be condensed into one

Which is very good
Sorry, but I'm not sure I understood what is preventing you from making this change:

- the big amount of work ?
- three bytes and 10 cycles not being saved (I'm not sure I understood correctly) ?

In the latter case perhaps there are different saving opportunities, e.g. BDS C had some optimization moved into the linker.
One thing that tempts me every time is a recurring sequence of calls which an optimizer could identify and replace with one call to a single properly grouped subroutine. Obviously not all the subroutines can be handled like this, and it doesn't help the speed!




Post by alvin »

> Sorry, but I'm not sure I understood what is preventing you from making this change:
>
> - the big amount of work ?
> - three bytes and 10 cycles not being saved (I'm not sure I understood correctly) ?

We're losing 10 cycles and 3 bytes on the most common use case for calls from C. It doesn't seem like a lot, but I'm pretty much sold on the idea that we probably save more bytes than we lose by not having unnecessary C preamble code attach when only the asm entry points are called (by the library). The likelihood that this is true reduces as the program size increases, which is exactly when you want to save those bytes :/ A callee preamble for a two-parameter function can be just 31 cycles, and increasing that to 41 cycles is about +32%. But because they take more than one parameter, callee functions should be large enough that those 10 cycles are swamped out. If this affected fastcall functions this might not be the case -- there are fastcall functions in the library where a simple CALL/RET is adding 10-15% to execution time. I have one specific case with a tight loop plotting points where the CALL/RET to the plot-related functions slows things down by 15%. But that's fastcall. (Incidentally, sdcc does a good job of inlining fastcall functions and I may try that with this example.)
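The arithmetic behind those percentages can be checked with a few lines of C. The 31-cycle preamble and the 10-cycle jump are the figures from the post (a Z80 JP nn is 3 bytes and 10 T-states); the 500-cycle function body is a made-up figure used only to show how the relative cost shrinks.

```c
#include <assert.h>

/* Percentage increase caused by adding extra_cycles to base_cycles,
   rounded to the nearest whole percent. */
unsigned overhead_percent(unsigned base_cycles, unsigned extra_cycles)
{
    assert(base_cycles > 0);
    return (100u * extra_cycles + base_cycles / 2u) / base_cycles;
}

/* Cost of the extra jump measured against the preamble alone. */
unsigned preamble_overhead(void)
{
    return overhead_percent(31u, 10u);
}

/* Cost of the same jump measured against the preamble plus a
   hypothetical 500-cycle function body. */
unsigned whole_call_overhead(void)
{
    return overhead_percent(31u + 500u, 10u);
}
```

Measured against the preamble alone the jump costs roughly a third more; measured against a whole call of a few hundred cycles it drops to a couple of percent.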

Anyway that's what I was thinking about when the library structure was laid out. But the benefit of having a smaller number of libraries -- it will have to be two as I forgot that sdcc_ix and sdcc_iy are incompatible -- and especially the ability to use any C compiler to compile portions of a project really outweigh the possibility that code size will go up a bit. I'm thinking of using sdcc to compile one C file to an object file (zcc -c), sccz80 to compile another, and then linking all of it together. You might want to do this because sdcc almost always generates faster code but sccz80 usually generates smaller code, especially when longs & statics are involved.

At some point we should also look to see if we can get hitech-c for cpm to generate a rel file under native windows / linux (john elliott's & udo monk's recent stuff) and then translate that to a z80asm library for linking, which I think you've done already for the older library format. Actually I don't believe hitech generates code that is as much better as some claim -- I think people are confusing the small output with hitech's leverage of cpm os code. But it might be fun to look into.

For headers I am thinking of changing them into macros. So they might look something like:

#undef __callee_linkage
#undef __fastcall_linkage

#ifdef __SDCC

#define __callee_linkage(function, p0, p1, p2) some magic here
#define __fastcall_linkage(...) ...

#endif

#ifdef __SCCZ80

#define __callee_linkage(function, p0, p1, p2) some magic here
#define __fastcall_linkage(...) ...

#endif

__fastcall_linkage(strlen, void *s)
__callee_linkage(memcpy, void *dst, void *src, size_t len)


But something better than that. Then there is only one set of function declarations. Right now there are two independent sections in each header, one per compiler, and it's becoming unwieldy as well as non-trivial to add another compiler option.
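One possible shape for the "magic", sketched with C99 variadic macros. This is a guess rather than what the library ended up doing: the DEMO_* macro names and demo_* functions are hypothetical, a return type argument has been added that the outline above does not have, and whether the sccz80 preprocessor accepts variadic macros is not checked here.

```c
#include <assert.h>
#include <stddef.h>

#undef DEMO_CALLEE_LINKAGE
#undef DEMO_FASTCALL_LINKAGE

#ifdef __SDCC
/* sdcc: the linkage keyword follows the parameter list. */
#define DEMO_FASTCALL_LINKAGE(ret, name, ...) \
    extern ret name(__VA_ARGS__) __z88dk_fastcall;
#define DEMO_CALLEE_LINKAGE(ret, name, ...) \
    extern ret name(__VA_ARGS__) __z88dk_callee;
#else
/* Any other compiler: plain prototypes in this sketch. */
#define DEMO_FASTCALL_LINKAGE(ret, name, ...) extern ret name(__VA_ARGS__);
#define DEMO_CALLEE_LINKAGE(ret, name, ...)   extern ret name(__VA_ARGS__);
#endif

/* A single set of declarations shared by every compiler. */
DEMO_FASTCALL_LINKAGE(size_t, demo_strlen, const char *s)
DEMO_CALLEE_LINKAGE(void *, demo_copy, void *dst, const void *src, size_t len)

#ifndef __SDCC
/* Host-only stand-in bodies so the sketch can be compiled and run. */
size_t demo_strlen(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] != '\0') ++n;
    return n;
}

void *demo_copy(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *p = src;
    while (len--) *d++ = *p++;
    return dst;
}
#endif
```

Adding another compiler would then mean adding one more branch of macro definitions rather than another full section of prototypes.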



BTW, siggi has a zx81 question in the bugs section.


