Geoff Streeter
				Dyalog Ltd













As a tribute to John Scholes (1948-2019) I thought I would do this in pared-back
text. John and I first met in 1975 when we both worked for Atkins. We worked on
Dyalog APL together from 1981.

I can't claim to have John's presentation ability. That is part of the reason
I am not usually seen on these podia. I think this is only my 4th presentation
in 42 years.

Some brains - mine included - respond better to written text than to the spoken
word. So you get both.


















Back in 2013 we had a request from a customer. The customer was suffering from
slow startup of his application. The application had several hundred functions
in its workspace. His request was for a facility to start with a bare workspace
and load functions as their names were referenced.

This had echoes of my (and Dave Crossley's) paper at APL 1979, but I had the germ
of an idea. I raised the idea, which received no enthusiasm and an expressed
desire to focus on what was requested.

However, I am a natural, non-conformist dissenter. I still thought it was worth
pursuing, and we have a policy of a "Friday Afternoon project" for experiments
that lack management approval. It actually took me a day to produce a first cut.
That first cut - a demonstrable prototype - made the others realise that this
idea might be worth a bit more effort.

So what had I done:







Our workspace is divided into "pockets". This is our unit of memory allocation. 
A pocket has a length, a refcount, a descriptor and some content.

	┌────────┬────────┬────────┬────────────────────┐
	│LENGTH  │REFCOUNT│ZONES   │ any old stuff      │
	└────────┴────────┴────────┴────────────────────┘

The ZONES are a bunch of fields which say what to expect in the "any old stuff" -
let's call that AOS for short.

The AOS can contain pointers to other pockets, and the REFCOUNT keeps track of
how many pointers point to this pocket. For a lot of pockets, when the REFCOUNT
reaches zero the pocket can be discarded and its memory reused for another pocket.
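
As a concrete sketch, the header might be pictured like this minimal C outline
(field names and widths are invented here; this is illustration, not the actual
Dyalog layout):

	#include <stdint.h>

	typedef struct pocket {
	    uint64_t length;      /* total size of the pocket in bytes        */
	    uint64_t refcount;    /* how many pointers refer to this pocket   */
	    uint64_t zones;       /* descriptor: what to expect in the AOS    */
	    unsigned char aos[];  /* "any old stuff": data, pointers, text    */
	} pocket;

	/* When the last reference goes away the space can be reclaimed. */
	static void release(pocket *p)
	{
	    if (--p->refcount == 0) {
	        /* hand p->length bytes back to the workspace allocator */
	    }
	}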








So for a traditional (not Dfn) function we have several pockets:

	┌────────┬────────┬────────┬────────┬────────┬────────┐
	│LENGTH  │REFCOUNT│DEFUNCT │LST     │FNBODY  │...     │
	└────────┴────────┴────────┴───*────┴───*────┴────────┘

The stars mark fields that are pointers to other pockets.

The LST (Local Symbol Table) is a pocket of pointers to other pockets:
	symbols: the names the function uses - think ⎕REFS
	constants: the explicit constants the function uses - 1, 2, 'HELP' ...
	comments: the comments the function has.

The FNBODY is a pocket that does not point elsewhere and defines how the
function is split into lines and how it has been tokenised.
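
Again as an invented-names C sketch: the point is that LST and FNBODY are
pointers to further pockets, while FNBODY itself has no outgoing pointers.

	#include <stdint.h>

	struct pocket;                    /* generic pocket, as sketched above   */

	typedef struct fn_pocket {
	    uint64_t length;
	    uint64_t refcount;
	    uint64_t defunct;             /* descriptor marking "defined fn"     */
	    struct pocket *lst;           /* -> names, constants, comments       */
	    struct pocket *fnbody;        /* -> line table + tokenised code;     */
	                                  /*    no outgoing pointers of its own  */
	} fn_pocket;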













The next concept you need is that of a symbol table. For all versions of Dyalog
to date these are binary trees.

	┌────────┬────────┬────────┬────────┬────────┬────────┬   ┬─────────┐
	│LENGTH  │REFCOUNT│SYMBOL  │LEFT    │RIGHT   │VALUE   │...│name...\0│
	└────────┴────────┴────────┴───*────┴───*────┴───*────┴   ┴─────────┘

Each namespace has its own symbol table including # and ⎕SE.
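
A hedged C sketch of a tree node and its lookup (field names are assumptions):
each namespace's names are found by following LEFT/RIGHT links until the name
matches.

	#include <stdint.h>
	#include <string.h>

	struct pocket;

	typedef struct symbol {
	    uint64_t length, refcount, zones;
	    struct symbol *left, *right;   /* binary tree links              */
	    struct pocket *value;          /* the name's current definition  */
	    char name[];                   /* NUL-terminated name            */
	} symbol;

	static symbol *lookup(symbol *root, const char *name)
	{
	    while (root) {
	        int c = strcmp(name, root->name);
	        if (c == 0) return root;
	        root = (c < 0) ? root->left : root->right;
	    }
	    return NULL;                   /* not defined in this namespace  */
	}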

So that gives you some insight into a workspace.
However ...











Dyalog APL has always supported several workspaces. Originally, this came from
how we support )COPY.
To do )COPY we allocate memory for another workspace, load the required
workspace into that extra memory, and then copy information from it to the main
workspace.

(In fact, originally these two workspaces were in different processes - because
the Zilog Z8000 was a 16-bit machine we only had 64KiB of data per process.
We used pipes to transfer the information.)













So we can now come to the germ of my idea:
If I save the user's workspace in a form where it always loads at a known
address, I can set up pointers from my main workspace to addresses in that
workspace. Consider our SYMBOL pocket.

	┌────────┬     ┬────────┐
    ... │VALUE   │ ... │foo     │
	└───*────┴     └────────┘
            │
============┼========= workspace boundary =======================
            │
	┌───┴────┬────────┬────────┬────────┬────────┬────────┐
	│LENGTH  │REFCOUNT│DEFUNCT │LST     │FNBODY  │...     │
	└────────┴────────┴────────┴───*────┴───*────┴────────┘

That has not gained us much. We still have to read the user's workspace, just
into a different location.

But suppose ...



We map the workspace into that location. 

For those not familiar with mapped files: a file is mapped into the address
space of the process and portions (pages) of that file are only read on
reference.

So now we can have that user's workspace loaded and nothing has been read. That
is fast.
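
On Linux this boils down to an mmap call. Here is a minimal sketch with an
invented path and base address: MAP_FIXED pins the image at the address the
saved pockets expect, and pages are faulted in only when first touched.

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <unistd.h>

	static void *map_shared_ws(const char *path, void *base)
	{
	    int fd = open(path, O_RDONLY);
	    if (fd < 0) return NULL;

	    struct stat st;
	    if (fstat(fd, &st) < 0) { close(fd); return NULL; }

	    /* MAP_PRIVATE gives copy-on-write later on; MAP_FIXED assumes  */
	    /* the target range has been reserved and is free for our use.  */
	    void *p = mmap(base, (size_t)st.st_size, PROT_READ | PROT_WRITE,
	                   MAP_PRIVATE | MAP_FIXED, fd, 0);
	    close(fd);                     /* the mapping outlives the fd    */
	    return (p == MAP_FAILED) ? NULL : p;
	}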

Clearly, we do have to make some adjustment to the main workspace or no code
will see the change. All we have to do is create the symbols in the main
workspace and point them at the appropriate pockets in the mapped workspace.
Only when those symbols are used will the mapped space be referenced.









Before, in a normal loaded workspace:
	┌────────┬────────┬────────┬────────┬────────┬────────┬   ┬─────────┐
	│LENGTH  │REFCOUNT│SYMBOL  │LEFT    │RIGHT   │VALUE   │...│name...\0│
	└────────┴────────┴────────┴───*────┴───*────┴───┬────┴   ┴─────────┘
	    ┌────────────────────────────────────────────┘
	┌───┴────┬────────┬────────┬────────┬────────┬────────┐
	│LENGTH  │REFCOUNT│DEFUNCT │LST     │FNBODY  │...     │
	└────────┴────────┴────────┴───*────┴───*────┴────────┘
After, in a workspace with shared code:
	┌────────┬────────┬────────┬────────┬────────┬────────┬   ┬─────────┐
	│LENGTH  │REFCOUNT│SYMBOL  │LEFT    │RIGHT   │VALUE   │...│name...\0│
	└────────┴────────┴────────┴───*────┴───*────┴───┬────┴   ┴─────────┘
      ┌──────────────────────────────────────────────────┘
      │								main workspace
======│=============== workspace boundary =======================
      │								shared workspace
      │ ┌────────┬────────┬────────┬────────┬────────┬────────┬   ┬─────────┐
      │ │LENGTH  │REFCOUNT│SYMBOL  │LEFT    │RIGHT   │VALUE   │...│name...\0│
      │ └────────┴────────┴────────┴───*────┴───*────┴───┬────┴   ┴─────────┘
      │     ┌────────────────────────────────────────────┘
      │ ┌───┴────┬────────┬────────┬────────┬────────┬────────┐
      └─┤LENGTH  │REFCOUNT│DEFUNCT │LST     │FNBODY  │...     │
	└────────┴────────┴────────┴───*────┴───*────┴────────┘

So we come back to my lashed-up experiment. I took a copy of the code that
supports 0 ⎕SAVE. I took some of the code we use for fixing up workspaces as we
load them, and used it to create the workspace at a known fixed address and
save it to file.

I could then map that workspace (I actually used dfns.dws) at that known
address. I could walk the symbol table and copy the appropriate symbols into
the main workspace. Lo and behold, I had a working dfns.dws which only gets
paged in on demand.

This worked excellently ... on Linux. On Windows it worked, but not as fast as
I had hoped. The customer is committed to Windows - poor benighted soul.

Why was this? Windows does not page as well as Linux. It also does not create
processes as well. The two aspects may be linked as process code is mapped at
kernel level.






There are things I can do to mitigate that.

I have lots of time as we prepare the workspace (at save time) so I chose to 
sort the pockets. The first idea was to sort all the symbol pockets together so
that they would be grouped in as few pages as possible.

Since I was sorting pockets I could also sort data pockets by length so that all
those small scalar pockets are packed into only a few pages. That customer has
lots of these.
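
As a sketch of that save-time ordering (the pocket classification here is
invented for illustration), a comparator that puts symbol pockets first and then
groups data pockets by ascending length would look like this:

	#include <stdlib.h>

	enum { KIND_SYMBOL, KIND_DATA, KIND_OTHER };

	typedef struct { int kind; size_t length; void *pocket; } entry;

	static int by_locality(const void *a, const void *b)
	{
	    const entry *x = a, *y = b;
	    if (x->kind != y->kind) return x->kind - y->kind;  /* symbols first  */
	    if (x->length != y->length)                        /* then by length */
	        return (x->length < y->length) ? -1 : 1;
	    return 0;
	}

	/* usage: qsort(entries, n, sizeof(entry), by_locality); */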

That was enough to have a working system that met the customer's desire for fast
startup on his system.












Then we started to realise that we had other benefits that were not originally
envisaged. The customer's system looks like this:
			┌──────────┐
			│ SERVER   │
			│ FOR CODE │
			└──────────┘
			     │
			piece of wet string - slow connection
			     │
			┌──────────┐
			│ LOCAL    │
			│ SERVER   │
			│ ROOM     │
			└────┬─────┘
			     │
			fast connection -  on the same machine
			     │
	    ┌────────────────┴──────────────┐
	┌───┴────┐			┌───┴────┐
	│ USER   │			│ USER   │
	│ PROCESS│	    ...		│ PROCESS│
	│ 0      │			│ N      │
	└────────┘			└────────┘

So now user process 0 starts up and pages are dragged across the piece of wet
string to the local server. These pages are cached there and when the next
user process starts they are already accessible. So user 1 gets even faster
startup than user 0. The cache is maintained by the operating system - I do not
have to do anything.


















Let's revisit the way we allocate memory in the workspaces - note the plural.
	┌────────┬────────┬────────┬────────────────────┐
	│LENGTH  │REFCOUNT│ZONES   │ any old stuff      │
	└────────┴────────┴────────┴────────────────────┘
With our new memory-mapped approach:
	┬────────┬     ┬────────┐
    ... │VALUE   │ ... │foo     │
	┴───*────┴     ┴────────┘
            │
============┼========= workspace boundary =======================
            │
	┌───┴────┬────────┬────────┬────────┬────────┬────────┐
	│LENGTH  │REFCOUNT│DEFUNCT │LST     │FNBODY  │...     │
	└────────┴────────┴────────┴───*────┴───*────┴────────┘

That pointer from VALUE to the function pocket has caused the REFCOUNT to
increase.

However, this is shared memory, so you might think that all users of that memory
would see the same increased value of the REFCOUNT and, worse, that we might
have a race condition as multiple users increase and decrease the REFCOUNT.

But ... as a brilliant part of paging systems we have COPY ON WRITE (COW). This
means that if a mapped page is modified then the original does not change;
instead a local page in the user's swap space is allocated. So each process
that modifies a page gets its own copy.
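
Here is a tiny stand-alone demonstration (not Dyalog code) of that behaviour:
with a private mapping, the child's "refcount" bump lands on its own copy of the
page and the parent never sees it.

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <unistd.h>

	int main(void)
	{
	    int fd = open("cow-demo.bin", O_RDWR | O_CREAT, 0600);
	    ftruncate(fd, 4096);                       /* one zero-filled page  */
	    int *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
	                     MAP_PRIVATE, fd, 0);

	    if (fork() == 0) {                         /* child                 */
	        page[0] += 1;                          /* write -> private copy */
	        _exit(0);
	    }
	    wait(NULL);
	    printf("parent still sees %d\n", page[0]); /* prints 0              */
	    return 0;
	}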

I tried to find the history (and inventor) of COW but it seems to have had a
slow gestation. Hardware support seems to come from the DEC VAX; software
support seems to come from BSD 4.4. Today it is pretty universal. For Intel it
goes back to the 386.
The driving force for the development was the Unix fork() system call that is
fundamental to Unix (and Linux) process creation. fork() creates a new process
identical (nearly - lots of subtleties) to the current process. It is nearly
always followed by an exec() call to replace the current program with a new,
different, program. Replicating all of the current process's pages just to
dispose of them was too expensive.
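
For reference, the fork()/exec() shape that made COW worthwhile looks like this:
the child shares the parent's pages copy-on-write and then immediately throws
them all away.

	#include <sys/types.h>
	#include <sys/wait.h>
	#include <unistd.h>

	static int run(char *const argv[])
	{
	    pid_t pid = fork();
	    if (pid == 0) {                 /* child: a cheap COW copy       */
	        execvp(argv[0], argv);      /* replaces the whole image      */
	        _exit(127);                 /* only reached if exec fails    */
	    }
	    int status;
	    waitpid(pid, &status, 0);       /* parent waits for the child    */
	    return status;
	}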







I use COW both for aspects of ⎕MAP and for the shared code file implementation
presented here. 

The use of COW means that all of our normal code just works. REFCOUNTs get
adjusted correctly. The penalty, of course, is that the memory footprint of each
task increases as pages are copied.

















There are of course some specific adjustments needed.

That function pocket:

	┌────────┬────────┬────────┬────────┬────────┬────────┐
	│LENGTH  │REFCOUNT│DEFUNCT │LST     │FNBODY  │...     │
	└────────┴────────┴────────┴───*────┴───*────┴────────┘

The FNBODY is not a problem. It points to a pocket that does not change and can
remain in shared (read only) memory. Every user sees the same pocket (apart from
its REFCOUNT and that does not change until the function executes). The FNBODY
contains the tokenised code and the definition of the lines in the function.

The LST (Local Symbol Table) is more of an issue. It contains pointers to
SYMBOLS and CONSTANTS. For this purpose COMMENTS are CONSTANTS. However, each
SYMBOL pocket has a pointer to its namespace (# is a namespace, ⎕SE is a
namespace, anything created with ⎕NS is a namespace, ...).

If your function creates a name, A←1, then the A is a SYMBOL but when it is
accessed it belongs to a namespace.

In a shared code file the SYMBOL points to a namespace in the shared code file,
but when the function executes it needs to access the name in the appropriate
namespace in the main workspace. We already have code that copes with this,
because it can already happen that code created in one namespace executes in
another. So we walk the LST and make all of the symbols point to the correct
namespace. That is OK but expensive. It is most likely that the next invocation
of the function will be in the same namespace, so we write the new LST pointer
back into the function pocket. That write would be OK in our shared code file
(COW would cope), but we avoid creating pointers in the shared code file
workspace that point back into the main workspace, because such pointers would
force operations on that workspace - compaction, garbage collection - that
involve accessing all of its pockets.

OK:
	┬────────┬
    ... │pointer │ ...
	┴───┬────┴						main workspace
============┼========= workspace boundary =======================
            ↓
	┌───┴────┬────────┬────────┬				shared workspace
	│LENGTH  │REFCOUNT│ZONES   │ ...
	└────────┴────────┴────────┴
Not OK:
	┌────────┬────────┬────────┬
	│LENGTH  │REFCOUNT│ZONES   │ ...
	└───┬────┴────────┴────────┴				main workspace
============┼========= workspace boundary =======================
            ↑
	┬───┴────┬						shared workspace
    ... │pointer │ ...
	┴────────┴

So ... we pay the cost, on function invocation, of checking whether the function
pocket is in a shared code file and, if so, creating a copy in the main
workspace. Then the replacement of the LST is not a problem.
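
As a sketch (names invented), the rule in those diagrams comes down to a simple
address-range test; the refused direction is what forces the copy just
described.

	#include <stdbool.h>
	#include <stdint.h>

	typedef struct { uintptr_t base, limit; } region;

	static bool points_into(region r, const void *p)
	{
	    uintptr_t a = (uintptr_t)p;
	    return a >= r.base && a < r.limit;
	}

	/* allowed: source in the main workspace, target in the shared slot    */
	/* refused: source in the shared slot, target in the main workspace -  */
	/* in that case the function pocket is first copied into the main      */
	/* workspace, where its LST pointer can safely be rewritten.           */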

Back to the original request. The customer wanted to load FUNCTIONS on demand.
We have also delivered loading DATA on demand - another bonus. However, never
satisfied, the customer wanted namespaces (actually classes), but for me that is
the same thing.

Not so easy. For instance, similarly to symbols, the namespace has a pointer to
its parent (##). That parent is in the shared code file, not the user's
workspace. So while we walk the global symbols to put them in the main
workspace we also copy across VALUEs that are namespaces. That copy is not
complete: some things in the space can remain shared - FUNCTIONs and DATA - but
not sub-namespaces, although their FUNCTIONs and DATA can remain shared.

We also, at shared code file creation time, can mark SYMBOL values that contain
namespaces at a lower level. So the SYMBOL may point to a nested array that
contains namespaces. These can also be pulled across as the shared code file is
attached.  Again we do not have to drag FUNCTIONs and DATA across.







Morten now realised that this could give him benefits for isolates. Isolates are
separate processes with which APL communicates. They may require a lot of
supporting APL code. If each isolate loads a shared code file then it gets the
benefit of not accessing stuff it does not use and not impacting the system
memory use for stuff it does not use. I don't think the original customer uses
isolates - yet.
However, really good ideas feed through into strange and unplanned places. I
first learned this when I. P. Sharp introduced "packages" in the late 1970s. I
am sure the developers did not expect the deluge of applications that bent their
use.












We have seen what you can do. Let's examine what you can't do.

At this point I had wanted to wave around a copy of our book "Shared Code Files
User Guide", but LuLu (who do our print on demand) will not print fewer than 50
pages, so I will have to rely on using the screen to show the PDF. Credit Richard
Smith and Fiona Smith. In it we list the current restrictions. Some of those
may be temporary. Some are just us nudging you in the right direction (no
Classic). Some are for good, hard reasons: 64-bit only, because mapped memory
does not go well with a 32-bit address space. We allow the use of 8 shared
workspaces concurrently. The 8 slots allocated are at 1TiB intervals; I wanted
to leave some space for workspaces to grow over the next year or two. No ⎕WC
(or .Net) objects is likely to be the one restriction you may notice.
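
As a sketch of the slot arithmetic implied above (the base address is a
placeholder, not the real one), eight attachment points 1TiB apart only make
sense with 64-bit addressing:

	#include <stdint.h>

	#define SLOT_COUNT  8
	#define SLOT_STRIDE (1ULL << 40)          /* 1 TiB per slot           */
	#define SLOT_BASE   0x100000000000ULL     /* placeholder base address */

	static inline uint64_t slot_address(unsigned slot)   /* slot 0..7     */
	{
	    return SLOT_BASE + (uint64_t)slot * SLOT_STRIDE;
	}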

Was the customer happy with the solution he had not asked for? Happy enough to
be pushing for some further extensions - so I guess the answer is yes.