Supporting far memory

Post by **obiwanjacobi** » Thu Apr 21, 2016 11:30 am

I am researching how to implement/support far memory. I have read the wiki http://www.z88dk.org/wiki/doku.php?id=porting:farmemory and can support the suggested functions. However, because the virtual contiguous memory assumed by the far ptr is not a physical representation, some sort of mapping must be applied. It is this mapping that I am trying to figure out.

My memory management unit hardware (MMU) organizes all available memory in 4k blocks. Any of these 4k blocks can be mapped to any 4k boundary in the Z80 address space - obviously with a maximum of 64k. Currently I have 4 banks of 64k of memory available but can support a maximum of 1MB (20 address lines).

I thought that I could (mis)use the extra address byte to indicate a block number (fits in 8 bits/256 blocks) and use the lower 16 address bits to indicate the address in real Z80 memory address space.
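A minimal sketch of this encoding in C, assuming the far pointer is carried in a 32-bit integer with the 4k block number in bits 16..23 and the Z80 address in bits 0..15. The type `far_ptr_t` and the `far_*` helpers are hypothetical names for illustration only; they are not part of z88dk.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 24-bit far pointer held in a 32-bit integer:
   bits 16..23 = 4k block number, bits 0..15 = Z80 address. */
typedef uint32_t far_ptr_t;

static far_ptr_t far_make(uint8_t block, uint16_t addr)
{
    return ((far_ptr_t)block << 16) | addr;
}

static uint8_t far_block(far_ptr_t p) { return (uint8_t)(p >> 16); }
static uint16_t far_addr(far_ptr_t p) { return (uint16_t)(p & 0xFFFFu); }

/* The MMU maps on 4k boundaries, so the top nibble of the Z80 address
   names the slot (0..15) the block has to be mapped into. */
static uint8_t far_slot(far_ptr_t p) { return (uint8_t)(far_addr(p) >> 12); }

/* 20-bit physical address: block number replaces A12-A15. */
static uint32_t far_physical(far_ptr_t p)
{
    return ((uint32_t)far_block(p) << 12) | (far_addr(p) & 0x0FFFu);
}

/* Pointer arithmetic must not carry into the block byte. */
static far_ptr_t far_add(far_ptr_t p, uint16_t delta)
{
    assert((uint32_t)far_addr(p) + delta <= 0xFFFFu);
    return p + delta;
}
```

One consequence of this encoding is that arithmetic crossing a 4k boundary changes the slot nibble rather than the block number, so an allocation larger than 4k is only contiguous if consecutive blocks are mapped into consecutive slots.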

I just wanted to throw this up and see if any other suggestions would come up.
Disclaimer: I am just starting to think about this, so there may be some obvious flaw in my reasoning I haven't seen yet...

Post by **dom** » Thu Apr 21, 2016 8:15 pm

It's been a while since I looked at the far code, hopefully I've understood it correctly....

Far memory is only supported on the z88. The memory model there is that there are 256 x 16k pages that can be paged in at 0x0000, 0x4000, 0x8000, 0xc000. You can ask for memory from the OS, but it will only give you 256 bytes at a time (OZ <4.6). There are more details on the implementation here: http://z88dk.org/previous/far_z88.html - the key thing is that you can allocate more than 64k in one go.

Far pointers are held in ehl. For the z88 they are "virtual", with a lookup table to convert them to a physical address. Far memory is always paged into the address space at 0x4000 which means that you don't need to have the pointer store a physical address. The lookup table consumes 2 bytes for every 256 bytes allocated.

The lookup table has: 8 bits = bank, 8 bits = address high byte within bank (mapped to 0x4000 for convenience). When you allocate you find an appropriately sized hole in the malloc table and write into it the bank+offset.

For the z88, eh = offset/2 into malloc table. l = offset within each 256 byte chunk.

That way each far pointer address within an allocation is contiguous and pointer arithmetic (within an allocation) works as expected.
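The lookup described above can be sketched in C as follows. This is an illustration under stated assumptions, not the z88 library code: the table size, the struct names and `far_resolve` are hypothetical, and `addr_hi` is assumed to be stored relative to the start of the bank, so the 0x4000 window base is added during translation.

```c
#include <assert.h>
#include <stdint.h>

#define FAR_WINDOW    0x4000u  /* far memory is always paged in here */
#define TABLE_ENTRIES 16       /* hypothetical table size for the sketch */

/* One entry per 256-byte chunk: 2 bytes, as in the description above. */
struct chunk_entry {
    uint8_t bank;     /* physical bank holding the chunk */
    uint8_t addr_hi;  /* high byte of the chunk's address within the bank */
};

static struct chunk_entry malloc_table[TABLE_ENTRIES];

struct near_ref {
    uint8_t bank;     /* bank to page in at FAR_WINDOW */
    uint16_t addr;    /* Z80 address once that bank is paged in */
};

/* eh (upper 16 bits) selects the chunk, l (low 8 bits) is the offset in it. */
static struct near_ref far_resolve(uint32_t far)
{
    uint16_t chunk = (uint16_t)(far >> 8);
    uint8_t offset = (uint8_t)(far & 0xFFu);
    struct near_ref r;

    assert(chunk < TABLE_ENTRIES);
    r.bank = malloc_table[chunk].bank;
    r.addr = (uint16_t)(FAR_WINDOW
                        + ((uint16_t)malloc_table[chunk].addr_hi << 8)
                        + offset);
    return r;
}
```

Two far addresses that are adjacent (for example 0x0002FF and 0x000300) can resolve to chunks in different banks, which is what lets an allocation look contiguous to the program even when the physical chunks are scattered.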

Looking at your setup, I think the same scheme will work, the downside is that the minimum allocation size is 256 bytes which isn't too memory efficient but if I sort out the malloc.h header you should be able to use both near and far heaps at the same time.

Post by **alvin** » Thu Apr 21, 2016 8:31 pm

You should be aware that the 24-bit far pointer type is only supported by sccz80 and not sdcc. sccz80 will make calls to the functions you found in the wiki whenever it needs to read or write C data in memory (ie char, int, long, etc). So sccz80 imposes a linear address model on bankswitched memory.

sdcc is following what recently came out of the C Embedded Technical Report ( section 5 of http://z88dk.cvs.sourceforge.net/viewvc ... vision=1.1 ) and sdcc's brief description can be found in section 3.5.4 of http://sdcc.sourceforge.net/doc/sdccman.pdf . It is advocating named address spaces where you assign an address space to a specific region of bankswitched memory and you can declare C variables that are located there. Each time sdcc needs to access C data, it will generate a call to a function you supply that allows access to the corresponding address space. The difference is that instead of linearizing the entire bankswitched memory space, you're giving names to portions of the bankswitched space, and none of these portions can exceed the uP's address range (so on the z80, a named address space must be <= 64k in size).

You can support both, of course, to use bankswitched memory (differently) on both compilers. I think the linearized address space sccz80 does is easier. Within z88dk there is only one target that supports sccz80's far pointer (the z88) and no one has much experience with it but it should be easy to implement. With sdcc I'm not aware of anyone having used the named address space feature on the z80. For sdcc I think it's a regular feature of 8051 code which has many native address spaces already.

Both these methods work the same way. The compilers will call bankswitching code which they expect to be in a page that is always present. At the moment we're calling that section "crt_code_common" in the new c library. So after the code jumps to this page, the banking code that you write can page in the required page (at the top of memory?), read/write the data, then restore the page there before returning. You have to be sure that you don't bank out the page containing the stack pointer while the stack is used.
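The page-in, access, restore sequence can be sketched in C against a simulated MMU. Everything here is an assumption made for illustration: the window slot (top 4k), the arrays standing in for hardware, and the name `far_read_byte`. A real primitive would be assembly living in the always-present page and would use output instructions instead of array writes.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE  0x1000u
#define NUM_BLOCKS  256u
#define WINDOW_SLOT 15u      /* assumed window: top 4k of the address space */

/* Simulated hardware: physical RAM and the 16-entry mapping table. */
static uint8_t phys_ram[NUM_BLOCKS][BLOCK_SIZE];
static uint8_t mmu_map[16];

/* What the CPU sees for a 16-bit address under the current mapping. */
static uint8_t cpu_read(uint16_t addr)
{
    return phys_ram[mmu_map[addr >> 12]][addr & 0x0FFFu];
}

/* Read one byte of far memory: page in, read, restore. */
static uint8_t far_read_byte(uint8_t block, uint16_t offset)
{
    uint8_t saved, value;

    assert(offset < BLOCK_SIZE);
    saved = mmu_map[WINDOW_SLOT];          /* remember what was there */
    mmu_map[WINDOW_SLOT] = block;          /* page in the required block */
    value = cpu_read((uint16_t)(WINDOW_SLOT * BLOCK_SIZE + offset));
    mmu_map[WINDOW_SLOT] = saved;          /* restore before returning */
    return value;
}
```

The stack must not live in the slot used as the window, since it would disappear for the duration of the access.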

For sccz80 that banking code is going to be the lp_* primitives you found. These are specific to each target so must be added to the target's sccz80 library specifically and I'm not sure where the best place is for them but probably someplace in the target subdir (sccz80 and sdcc compiler primitives are located in z88dk/libsrc/_DEVELOPMENT/l/sccz80, /sdcc but that wouldn't be a good place for target specific code)

And yes I think you can steal that extra byte to select bank number with your hardware. When you start allocating pages to programs as they are loaded, you may have to perform a translation step on that number for each program in the banking code because programs may not always occupy the same page numbers.

For sdcc the banking code will be a call to some function BANK_M_N. I'm not quite sure of the details because I have not tried this yet but it will be clear what's happening once it's tried and we may have to make a few minor adjustments in the clib side to get the details right.

The next problem is how to place variables in these memory banks. With sdcc you just declare variables as normal with the address space qualifier and these will be assigned to a new section with a name derived from the address space name. Again we haven't tried this yet so we may need to make a few small adjustments to make sure the sections get the right names. Anyway, after a compile these named sections will be output as separate binaries which you will then have to place in the correct memory bank(s) when the program is loaded. If an address space is only going to hold uninitialized data (note it won't even be zeroed by the crt -- if you need to do that your program will have to do it) then you don't have to bother with loading the section separately into memory.

For sccz80 and far pointers: you don't declare variables in banked memory. Everything is accessed through a pointer that has at some point been assigned an absolute address. However you can get some help with variable placement by creating a new section for banked page(s) with org 0. Then you can assign the address of variables in that section to the far pointer, remembering to manually add the bank number to the most significant byte.

Outside of all that, you can implement far versions of memcpy, memset, strcpy, etc. that take far pointers as arguments and carry out the requested function. Again this code should be located in a page that is always paged in (crt_code_common section). You're free to implement whatever functions you think you need and again these are target-specific. However we will probably define some of these functions (and maybe do it now if you make progress) and have a place in the source tree for target-specific implementations.
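As a rough sketch of what such a far function might look like, the following copies from far memory into a near buffer one 4k block at a time. It assumes a linear far address (block number times 4096 plus offset) and simulates physical memory with an array; `far_memcpy_to_near` is a hypothetical name, not an existing library function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 0x1000u
#define NUM_BLOCKS 16u       /* small simulated memory */

static uint8_t phys_ram[NUM_BLOCKS][BLOCK_SIZE];

/* Copy n bytes from a linear far address to a near buffer,
   never touching more than one 4k block per step. */
static void far_memcpy_to_near(void *dst, uint32_t src, size_t n)
{
    uint8_t *out = dst;

    while (n > 0) {
        uint32_t block = src / BLOCK_SIZE;
        uint32_t offset = src % BLOCK_SIZE;
        size_t chunk = BLOCK_SIZE - offset;   /* bytes left in this block */

        if (chunk > n)
            chunk = n;
        assert(block < NUM_BLOCKS);
        /* On hardware: page `block` into the window here. */
        memcpy(out, &phys_ram[block][offset], chunk);
        out += chunk;
        src += (uint32_t)chunk;
        n -= chunk;
    }
}
```

Splitting at block boundaries means only one window slot is needed, at the cost of one remap per 4k crossed.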

Post by **obiwanjacobi** » Fri Apr 22, 2016 9:53 am

Ok, that is some new info to take in, thanx.

I also have the ability to have 256 memory-mapping tables to select from - where a memory-mapping table is 16 bytes that translate A12-A15 to A12'-A19' (the extended address space). Switching mem-map tables (1 output instruction) is faster than altering the table itself (2 output instructions). So the trick is to find some algorithm to utilize this. Initially I would like to do this for one program but I want to scale up to a multi-tasking system in the future.
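The translation described here can be modelled in a few lines of C. This is a simulation sketch only: the arrays stand in for the MMU hardware, and `translate`, `select_table` and `remap_slot` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_TABLES 256
#define SLOTS      16

/* 256 tables of 16 bytes each; each byte supplies A12'-A19' for one 4k slot. */
static uint8_t mem_map[NUM_TABLES][SLOTS];
static uint8_t active_table;

/* Translate a 16-bit Z80 address to a 20-bit physical address. */
static uint32_t translate(uint16_t addr)
{
    uint8_t slot = (uint8_t)(addr >> 12);          /* A12-A15 */
    uint8_t block = mem_map[active_table][slot];   /* A12'-A19' */
    return ((uint32_t)block << 12) | (addr & 0x0FFFu);
}

/* Switch the whole memory map at once. */
static void select_table(uint8_t table)
{
    active_table = table;
}

/* Change a single slot of the active table. */
static void remap_slot(uint8_t slot, uint8_t block)
{
    assert(slot < SLOTS);
    mem_map[active_table][slot] = block;
}
```

For example, with slot 8 of table 0 set to block 0x10, the Z80 address 0x8ABC resolves to physical 0x10ABC; selecting a different table changes the result without touching the table contents.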

I like the named address spaces - that maps the cleanest onto my hardware. I could let the loader read the memory map sections from the program and prepare the mem-map tables accordingly. I would need to add some sort of header to the program and scoop up all the separate binary output files from the compile. But that is doable.

I think I would also create a memory API that would allow allocating memory in switched-out blocks. Then when the application actually needs it, it can be brought into the active address space and the program can resolve the handle it got from the allocation into an actual physical pointer. Closing the 'pointer' would allow that block to be swapped out when other memory needs to be brought in. That way you can manage the extra memory optimally and deliberately (I think).

I also would like to figure out if it is possible to have dynamically linked libraries stored in a common (to all code running) memory segment. That way frequently used code can be shared when multiple programs are running. Not something I will work out in detail now, but I wouldn't want to decide on something now, that would make that scenario hard or impossible.

This all will require a bit of housekeeping to keep track of active memory blocks that belong to a thread/process.

I will have to give this some more thought because there are still a couple of grey areas... ;-)

Post by **alvin** » Sat Apr 23, 2016 5:11 pm

obiwanjacobi wrote:I like the named address spaces - that maps the cleanest onto my hardware. I could let the loader read the memory map sections from the program and prepare the mem-map tables accordingly. I would need to add some sort of header to the program and scoop up all the separate binary output files from the compile. But that is doable.

It sounds like sdcc wants to manage what pages are currently visible so that it minimizes the amount of bankswitching that goes on. So maybe the idea of having a quick switch of the entire memory map won't work? But I am not sure. Some test code may have to be written to verify how things work and maybe we'll need to bother Philip at sdcc for more details.

obiwanjacobi wrote:I also would like to figure out if it is possible to have dynamically linked libraries stored in a common (to all code running) memory segment. That way frequently used code can be shared when multiple programs are running. Not something I will work out in detail now, but I wouldn't want to decide on something now, that would make that scenario hard or impossible.

Easiest would be statically linking against a single instance of library code at a fixed address. Then all programs can share that single instance if you write library entry stubs for each exported function that go into the application's library (so you would compile application code against a library containing these stubs instead of the real library functions). Then the application code would call these stubs which would do the necessary bankswitch and jump into the static library code. You may want to try to split the library code into a bunch of individual self-contained 4k pages so that this code only has to occupy 4k of the application's space while it is running.
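The stub idea can be illustrated with a simulated mapping table. This is only a sketch under assumptions: the slot reserved for library pages, the block number 0x20, and the names `call_banked`, `lib_add` and `real_add` are all made up for the example, and on real hardware the stub would be a few assembly instructions in the common page.

```c
#include <assert.h>
#include <stdint.h>

#define LIB_SLOT 14u                 /* assumed slot reserved for library pages */

static uint8_t mmu_map[16];          /* simulated mapping table */

typedef int (*lib_fn)(int, int);

/* Where the real function lives: its 4k block and entry point. */
struct lib_entry {
    uint8_t block;
    lib_fn fn;
};

/* The real library function; it only works while its block is mapped. */
static int real_add(int a, int b)
{
    assert(mmu_map[LIB_SLOT] == 0x20);
    return a + b;
}

static const struct lib_entry lib_add_entry = { 0x20, real_add };

/* Bankswitch, jump into the library, restore the caller's mapping. */
static int call_banked(const struct lib_entry *e, int a, int b)
{
    uint8_t saved = mmu_map[LIB_SLOT];
    int result;

    mmu_map[LIB_SLOT] = e->block;
    result = e->fn(a, b);
    mmu_map[LIB_SLOT] = saved;
    return result;
}

/* Stub the application links against instead of the real function. */
static int lib_add(int a, int b)
{
    return call_banked(&lib_add_entry, a, b);
}
```

The application only ever sees `lib_add`, so the library page can be shared by every program and occupies the slot only for the duration of the call.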

This is fresh territory so I'm not sure what the best way to go about things would be.

dom wrote:Looking at your setup, I think the same scheme will work, the downside is that the minimum allocation size is 256 bytes which isn't too memory efficient but if I sort out the malloc.h header you should be able to use both near and far heaps at the same time.

The flat memory model is a good one - I think we should take a look at generalizing it so it doesn't contain z88 specific code. Even though sdcc can't do far pointers, it could still use the far pointer functions in particular malloc and far strings.h if we make a far ptr a long on sdcc. Adding memcpy(), memset() and maybe non-standard memswap() would make it really useful.

Post by **dom** » Sat Apr 23, 2016 7:28 pm

alvin wrote:
obiwanjacobi wrote:I also would like to figure out if it is possible to have dynamically linked libraries stored in a common (to all code running) memory segment. That way frequently used code can be shared when multiple programs are running. Not something I will work out in detail now, but I wouldn't want to decide on something now, that would make that scenario hard or impossible.
Easiest would be statically linking against a single instance of library code at a fixed address. Then all programs can share that single instance if you write library entry stubs for each exported function that go into the application's library (so you would compile application code against a library containing these stubs instead of the real library functions). Then the application code would call these stubs which would do the necessary bankswitch and jump into the static library code. You may want to try to split the library code into a bunch of individual self-contained 4k pages so that this code only has to occupy 4k of the application's space while it is running.

This is fresh territory so I'm not sure what the best way to go about things would be.

We did something like this for both Residos and z88; the calls were via a rst. Effectively the program pulls in a stub which then trampolines to the correct implementation. Obviously this sort of thing works best with pure assembly library functions. If you know the stack offset that the rst implements, then with sccz80 you can configure a "shared offset" when compiling the dll files, which causes any stacked parameters to be offset. ZSock uses this technique to provide a dll service.

alvin wrote:The flat memory model is a good one - I think we should take a look at generalizing it so it doesn't contain z88 specific code. Even though sdcc can't do far pointers, it could still use the far pointer functions in particular malloc and far strings.h if we make a far ptr a long on sdcc. Adding memcpy(), memset() and maybe non-standard memswap() would make it really useful.

I think far is an sdcc feature, it's just not implemented in the z80 port. The general getters/setters can be made generic with a small bit of work - the only specific code is keeping track of the original binding for the paging segment. The actual malloc (really sbrk) and free would probably have to be machine specific.

Post by **obiwanjacobi** » Sun Apr 24, 2016 11:57 am

I don't think I will put in the effort to support both far ptrs as well as named address regions (I am probably the only user of my target). I will have to choose which compiler it is going to be and stick with that. However that choice has not been made yet.

I am currently writing down some stuff on memory management. I find that writing it down focuses my mind and exposes flaws earlier, so... During this I was wondering how large a stack and a heap should be? I was thinking of putting a default heap and stack in one 4k block. I will have to try to work something out to detect collision. Would that be reasonable? If you have a very recursive function call with a lot of (stack) params it can add up quite fast. But I have no idea on how feasible/real this number is... Way too small, smallish, about right, on the large side, way too big?

I am trying to keep the number of fixed 4k blocks to a minimum. I need at least one 4k block at 0x0000 for the RST and NMI handlers and some basic bios entry points. This can grow as needed, but I will try to keep the number of blocks as small as possible. Then the one 4k block of default heap and stack and perhaps the 256 bytes of interrupt vectors (IM2). This should also be fixed, although it doesn't matter where in the address space it lives. I would like the program's memory to be as large and contiguous as possible. So either everything is at the bottom or the stack block is at the top of the address space.

Creating proxies for shared/system methods sounds like a good plan. I think an RST dispatcher would be the shortest code. I will try to make that mechanism generic and reusable for other 'libraries' as well - not sure how to do that at this time though. I have seen RST mechanisms that put some sort of literal code byte or word right after the RST instruction. The RST handler will pop the return address, read the byte/word and push back an adjusted return address and then go off doing what the code indicated. I don't really see the benefit of this system compared to the normal mechanisms available to pass parameters (like used by z88dk)... You will need one extra (16 bits?) parameter to indicate the operation itself, but that seems like a small price to pay compared to the overhead of that other method.
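For reference, the inline-code RST mechanism described above (pop the return address, read the byte, push back the adjusted address) can be simulated in C. The memory array, the stack helpers and `rst_fetch_inline_code` are hypothetical stand-ins for what would be a short assembly handler; only the byte-sized variant is shown.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t mem[0x10000];   /* simulated 64k address space */
static uint16_t sp;            /* simulated stack pointer */

/* Z80-style push: low byte at the lower address. */
static void push16(uint16_t v)
{
    assert(sp >= 2);
    sp = (uint16_t)(sp - 2);
    mem[sp] = (uint8_t)(v & 0xFFu);
    mem[(uint16_t)(sp + 1)] = (uint8_t)(v >> 8);
}

static uint16_t pop16(void)
{
    uint16_t v = (uint16_t)(mem[sp] | (mem[(uint16_t)(sp + 1)] << 8));
    sp = (uint16_t)(sp + 2);
    return v;
}

/* On entry the return address on the stack points at the inline byte
   that follows the RST instruction. */
static uint8_t rst_fetch_inline_code(void)
{
    uint16_t ret = pop16();          /* address of the inline byte */
    uint8_t code = mem[ret];         /* operation selector */

    push16((uint16_t)(ret + 1));     /* resume after the inline byte */
    return code;
}
```

The pop, read and push steps are the overhead being weighed here against simply passing the operation number as an ordinary parameter.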

For switching memory banks and blocks into the address space, I think I will reserve my 256 tables for 256 tasks. Meaning each task/thread will have one dedicated memory-map table (16 bytes) that is always active as long as that thread is running. Changing the mapping data on a fixed mem-map table will be one output instruction - so that is no worse than selecting an entirely new mem-map table. When a thread context switch is performed (future version) switching to the correct mem-map table is part of that context switch. Sounds reasonable. At this time I could not figure out an algorithm that used those mem-map tables effectively. Shows you how a cool hardware feature might be totally pointless in software.

There will also be an API to allow the program to manage its memory needs. I was thinking of something like (using std types):

```c
#include <stdint.h>  /* uint8_t, uint16_t */

// reserves one or more 4k blocks of memory.
// flags determine if the block is fixed or movable and what access (read/write/execute).
// returns a handle
uint16_t AllocMem(uint8_t flags, uint16_t size);

// locks the memory block down into the active address space and returns the ptr to it.
void* LockMem(uint16_t handle);

// invalidates all ptrs and releases the block.
void UnlockMem(uint16_t handle);

// lifts the reservation so memory is free to be reused.
void FreeMem(uint16_t handle);
```

This is very much like the win32 api...

More methods may be needed but you get the idea.
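To make the intended behaviour concrete, here is a toy implementation of the four calls over simulated memory. It is a sketch under several assumptions that are not in the post: handle 0 means failure, blocks are found by first fit, `flags` is stored but not interpreted, and `LockMem` returns a pointer into the simulated RAM where real hardware would map the blocks into the Z80 address space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE  0x1000u
#define NUM_BLOCKS  32u      /* small simulated memory */
#define MAX_HANDLES 8u

static uint8_t phys_ram[NUM_BLOCKS][BLOCK_SIZE];
static uint8_t block_used[NUM_BLOCKS];
static struct { uint8_t used, flags, first, count, locks; } handles[MAX_HANDLES];

/* Returns a handle (1..MAX_HANDLES) or 0 when no memory or handle is free. */
uint16_t AllocMem(uint8_t flags, uint16_t size)
{
    uint32_t need = ((uint32_t)size + BLOCK_SIZE - 1u) / BLOCK_SIZE;
    uint32_t h, first, i;

    if (need == 0)
        return 0;
    for (h = 0; h < MAX_HANDLES && handles[h].used; h++) { }
    if (h == MAX_HANDLES)
        return 0;
    for (first = 0; first + need <= NUM_BLOCKS; first++) {
        for (i = 0; i < need && !block_used[first + i]; i++) { }
        if (i == need)
            break;                       /* found a free run of blocks */
    }
    if (first + need > NUM_BLOCKS)
        return 0;
    for (i = 0; i < need; i++)
        block_used[first + i] = 1;
    handles[h].used = 1;
    handles[h].flags = flags;
    handles[h].first = (uint8_t)first;
    handles[h].count = (uint8_t)need;
    handles[h].locks = 0;
    return (uint16_t)(h + 1u);
}

void *LockMem(uint16_t handle)
{
    assert(handle >= 1 && handle <= MAX_HANDLES && handles[handle - 1].used);
    handles[handle - 1].locks++;
    /* Real hardware would map the blocks into the address space here. */
    return phys_ram[handles[handle - 1].first];
}

void UnlockMem(uint16_t handle)
{
    assert(handle >= 1 && handle <= MAX_HANDLES && handles[handle - 1].locks > 0);
    handles[handle - 1].locks--;
}

void FreeMem(uint16_t handle)
{
    uint32_t i;

    assert(handle >= 1 && handle <= MAX_HANDLES && handles[handle - 1].used);
    assert(handles[handle - 1].locks == 0);   /* must be unlocked first */
    for (i = 0; i < handles[handle - 1].count; i++)
        block_used[handles[handle - 1].first + i] = 0;
    handles[handle - 1].used = 0;
}
```

A typical sequence would be `AllocMem`, then `LockMem` to get a pointer, work on the data, `UnlockMem`, and finally `FreeMem`. The lock count is what would tell a memory manager which blocks are safe to move or swap out.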

With this basis it should be possible to load libraries, to let programs allocate large amounts of data and have the memory manager make the correct decisions on how, when and where to move the 4k blocks of memory. It should also be a basis for integrating into the CRT - I hope.

I have some more details on the data structures for threads, memory manager and mem blocks but that might be too much for now.

Post by **alvin** » Mon Apr 25, 2016 7:33 pm

obiwanjacobi wrote:During this I was wondering how large a stack and a heap should be? I was thinking of putting a default heap and stack in one 4k block. I will have to try to work something out to detect collision. Would that be reasonable? If you have a very recursive function call with a lot of (stack) params it can add up quite fast. But I have no idea on how feasible/real this number is... Way too small, smallish, about right, on the large side, way too big?

For stack size, 256 bytes is probably enough and 512 will be comfortable. The exception is, of course, highly recursive functions. printf of floats is probably the biggest non-recursive function stack user in the standard library, pushing 80 bytes maybe plus whatever the driver may use. Most of those 80 bytes are workspaces allocated on the stack. Quicksort (shellsort is actually the default algorithm used for qsort) is the iterative type but the iterative type still has to push pointers to the largest partition not immediately pursued. There is a practical upper limit to stack size set by memory size and I think this is probably no worse than 64 bytes or so but if it's important, it should be checked. There is one recursive function in the library for flood filling images on screen but that wouldn't apply here unless you plan to add a pixel display. This one has the caller specify the amount of stack space it is allowed to use so you can get it to stay within available stack space at runtime. But for typical 80s era computers, stack space requirement to fill an arbitrary shape is probably in the 900 bytes region.

For the heap, size really depends on the application. 4k may be too small for some programs. The library creates a heap out of a block of memory handed to it. The heap will never exceed its assigned block so there is no danger that it will overrun the stack. The stack may grow down into it, however. The library is also able to create many different heaps addressed by name (the heap used by malloc & co has an internal name "_heap") so the program could create multiple heaps in different 4k pages. At the moment there is no way to grow an existing heap beyond its size (as with brk or sbrk) although it's not impossible to add that behaviour.
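Since the stack can still grow down into the heap, one common way to detect that after the fact is a guard pattern between the two. This is a swapped-in general technique, not something the library is stated to do. The sketch below assumes a 4k block laid out as heap, guard, stack, with the 512-byte stack figure mentioned above; all names and the guard bytes are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE  0x1000u
#define STACK_SIZE  512u                                    /* "comfortable" size from above */
#define GUARD_SIZE  4u
#define HEAP_SIZE   (BLOCK_SIZE - STACK_SIZE - GUARD_SIZE)  /* 3580 bytes */
#define GUARD_START HEAP_SIZE
#define STACK_LIMIT (GUARD_START + GUARD_SIZE)              /* lowest legal stack offset */

static uint8_t block[BLOCK_SIZE];   /* simulated 4k block: heap | guard | stack */
static const uint8_t guard_pattern[GUARD_SIZE] = { 0xDE, 0xAD, 0xBE, 0xEF };

/* Write the guard bytes once, before the stack is in use. */
static void guard_init(void)
{
    memcpy(&block[GUARD_START], guard_pattern, GUARD_SIZE);
}

/* Returns 1 while the guard is untouched, 0 once the stack has run into it. */
static int guard_intact(void)
{
    return memcmp(&block[GUARD_START], guard_pattern, GUARD_SIZE) == 0;
}

/* Bytes left before a stack pointer (as an offset into the block) reaches the guard. */
static uint16_t stack_headroom(uint16_t sp)
{
    assert(sp >= STACK_LIMIT && sp <= BLOCK_SIZE);
    return (uint16_t)(sp - STACK_LIMIT);
}
```

Checking `guard_intact()` periodically, for instance from a timer interrupt, only reports a collision after it has happened, so it is a debugging aid rather than protection.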

For stdio, the printf/scanf sides and the stdio data structures and drivers would have to be addressable all at once. This probably would require two 4k pages to fit. If you start looking at disks, the library is heading toward caching sectors which will typically be 512 bytes in size each. If you're caching 8 sectors, that's a 4k page for that. There will be options to control how much caching is done, including none. In the latter case, the library will allocate 512 bytes on the stack to read in a sector and then copy out of that to the ultimate destination. This is how the classic library is currently working.

Post by **feilipu** » Tue Nov 15, 2016 5:25 am

I'm interested in a path forward on this discussion, because it is at the intersection of the design of two components of my current project. I'm building a [Z8S180 based board](https://feilipu.me/2016/05/23/another-z80-project/) that can access up to 1MB of writable memory (RAM + Flash), and I intend to use FreeRTOS with it.

FreeRTOS has the concept of Task Control Blocks (TCB) which contain the information relevant to each Task. It would be simple to add the MMU Registers to the Task swap, driven by the Scheduler. But, this would require that all Interrupt code remain in Page 0, and would limit the number of Tasks to a multiple of the size of the largest memory space required for a Task. Not a very flexible solution.
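A rough model of what adding the MMU registers to the task swap could look like, assuming the Z180-family register model (CBAR, BBR, CBR, with base registers in 4k units). The struct and function names are hypothetical and are not FreeRTOS APIs; the `mmu_regs` variable stands in for the hardware registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* MMU state that would be saved and restored with each task. */
struct task_mmu {
    uint8_t cbar;  /* high nibble: start of Common Area 1, low nibble: start of Bank Area */
    uint8_t bbr;   /* Bank Base Register, in 4k units */
    uint8_t cbr;   /* Common Base Register, in 4k units */
};

static struct task_mmu mmu_regs;   /* stands in for the hardware registers */

/* Called from the context switch with the incoming task's MMU state. */
static void task_switch_mmu(const struct task_mmu *next)
{
    assert(next != NULL);
    mmu_regs = *next;
}

/* Logical (16-bit) to physical (20-bit) translation under the current registers. */
static uint32_t z180_translate(uint16_t logical)
{
    uint8_t page = (uint8_t)(logical >> 12);
    uint8_t bank_start = (uint8_t)(mmu_regs.cbar & 0x0Fu);
    uint8_t common1_start = (uint8_t)(mmu_regs.cbar >> 4);
    uint32_t base = 0;                       /* Common Area 0: no relocation */

    if (page >= common1_start)
        base = mmu_regs.cbr;
    else if (page >= bank_start)
        base = mmu_regs.bbr;
    return (((uint32_t)base << 12) + logical) & 0xFFFFFu;
}
```

With two tasks that differ only in `bbr`, the same logical address in the bank area lands in different physical memory, while Common Area 0 (where the interrupt code would have to stay) is unaffected by the swap.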

I think it would be more useful to build a flat or linear address model solution that sits "below" the OS layer. This looks like the eZ80 memory solution, and it would therefore help in maintaining consistency between Z80, Z8S180, and eZ80 hardware. I guess the disadvantage is inefficiency, since every call would have to go via an address trampoline to set the MMU correctly.

There are some old C compilers that have addressed and solved this problem previously. Is there any source code in the public domain, or abandonware category, which can be used as a design guide or kick start?
For example [SCZ180](http://www.softools.com/scz180.htm) or [Jack Ganssle's examples](http://www.ganssle.com/articles/ammu.htm)

Post by **obiwanjacobi** » Tue Nov 15, 2016 11:09 am

Nice board!

I got side-tracked with writing a Z80 simulator* and designing a generic high-speed IO bus. So I have not looked at this problem any further. I have not yet made a choice between supporting far pointers or named memory blocks.

As for references and samples, the links I mention in the beginning of this thread are the only clues I could find at that moment.

*) I am writing the Z80 simulator because I wanted to simulate my board and the BIOS I have to write without going to hardware. I expect I can produce much more stable code this way.

Post by **alvin** » Thu Nov 17, 2016 5:34 am

feilipu wrote:There are some old C compilers that have addressed and solved this problem previously. Is there any source code in the public domain, or abandonware category, which can be used as a design guide or kick start?
For example [SCZ180](http://www.softools.com/scz180.htm) or [Jack Ganssle's examples](http://www.ganssle.com/articles/ammu.htm)

z88dk is very close to being able to do this - it just needs some attention to get it done. The assembler is getting a rewrite but that's subject to Paulo's free time. But I don't think we need a rewrite to do this, just a new tool that manipulates the object files generated by z80asm. This could also act as a prototype for z80asm's rewrite.

Right now you can write bankswitched programs if you manually assign data and functions to named banks. You can create as many sections as you want by name, placed at any logical address. The linker will output one binary file per section given an org (sections not given an org merge with others -- this is how we create memory maps that broadly consist of CODE,DATA,BSS made up of many smaller sections). Then those banks can be loaded into physical memory appropriately.

z80asm can create consolidated object files which are complete object files generated from multiple source files minus linking of library functions. We could get z80asm to fully link to create an object file that only needs patching to generate a binary. A separate program could sift through that object file and create an automatically banked executable. You could use the same scheme as the previously mentioned ones in the Ganssle article or allow more flexible schemes. Common code would be assigned to banks named "COMMON" and banked code to banks named "BANKED", eg. A special patching program, given a description of memory and the consolidated object, could assign code and data to pages such that the amount of banking is minimized, then if a function is sometimes called across banks, place a trampoline for it in the common area that can be called when needed.
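One small piece of such a patching program, deciding which functions need a trampoline in the common area, could look like the sketch below. The data model is an assumption for illustration (a function table with bank numbers and a list of call edges); it does not reflect the z80asm object file format, and the names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define BANK_COMMON 0   /* always-present memory */

struct func {
    const char *name;
    int bank;           /* BANK_COMMON or a banked page number */
};

struct call {
    size_t caller;      /* index into the function table */
    size_t callee;
};

/* A banked function needs a trampoline in COMMON if any caller
   lives in a different bank (including the common area itself). */
static int needs_trampoline(const struct func *funcs, size_t nfuncs,
                            const struct call *calls, size_t ncalls,
                            size_t target)
{
    size_t i;

    assert(target < nfuncs);
    if (funcs[target].bank == BANK_COMMON)
        return 0;                       /* common code is callable from anywhere */
    for (i = 0; i < ncalls; i++) {
        if (calls[i].callee == target &&
            funcs[calls[i].caller].bank != funcs[target].bank)
            return 1;
    }
    return 0;
}
```

A function that is only ever called from within its own bank keeps a plain call, so the banking overhead is paid only on the edges that actually cross banks.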

We'd just need to iron out the details and get down to writing it. Any volunteers?