Expanding Bits in Shrinking Time

perf0
The chart above compares the performance of ∘.∧ in Dyalog versions 16.0 and 17.0. The improvement is a factor of 3-4 across the whole range (except at multiple-of-8 lengths, where 16.0 spikes up). That’s not easy to get, especially for a function like ∘.∧ which was already the target of a fair amount of special code. And the improvements are completely general. They apply to any scalar dyadic which has a Boolean result on Booleans, which is all but +-○.

How did it happen? The answer, at least for the parts of the graph left of 1024, follows. It takes us through a surprising number of APL primitives, which I think is a great demonstration of the ways APL thinking can lead to faster algorithms even in other languages.

Binary choices

One of the many performance bumps in Dyalog version 17 is that selecting from an array of two options using a Boolean array is much faster. Dyalog 17 also takes advantage of this speed improvement by using it for a few other functions: scalar dyadics when one argument is a singleton and the other is Boolean, (or ) where the right (left) argument is Boolean, and outer products where the left argument is Boolean.

Selection has been sped up for all kinds of indexed array, but the most interesting case is that in which the cells of the indexed array are Boolean, with a cell shape that’s not a multiple of 8 (when working with Boolean algorithms, we tend to call such shapes “odd”). In this case the rows of the result aren’t byte-aligned, so to generate them quickly we will need some Boolean trickery. As it turns out, a lot of Boolean trickery.

The centre of our attention will be the function Replicate (/) when the right argument is Boolean and the left is a scalar. That means something like 3/1 0 0 1 0. Replicate doesn’t seem obviously connected to indexing, but like indexing, it turns each bit of the argument into multiple bits in the result. In fact, this is by far the most important and difficult step in implementing Index, since Index may be phrased straightforwardly in terms of Replicate using exclusive or ():

      y[x;]  ←→  (((≢y)/⍪x) ∧⍤1 ≠⌿y) ≠⍤1 ⊣⌿y

If for some reason that doesn’t look straightforward to you, I’ll explain how it all fits together later. But first, Replicate!

The trouble with Replicate

Just implementing Replicate on a Boolean array isn’t difficult. Read one bit at a time from the argument, and then write it to the result the required number of times. However, reading and writing one bit at a time is slow. A fast algorithm for Replicate will work an entire machine word at a time (that is, 64 bits on 64-bit systems).

But even fairly well-implemented word-at-a-time strategies will quickly run into another problem: branching. A simple if/then construct in C can be very cheap or very expensive depending on whether the CPU is able to predict in advance which path will be taken. With odd-sized replicates the not-quite-periodic way that the writes interact with byte boundaries will make nearly all useful branching unpredictable. But there are many techniques which can allow the CPU to make choices without having to change the path it takes through a program. Coming up is one rather exotic entry in the field of branchless programming.

Small expansion factors

When beginning work on fixed-size replicates, I quickly found that, in Dyalog version 16, it was much faster to add a length-1 axis before the first, replicate that, and transpose it to the end than to replicate the last axis directly.

      ⎕IO←0 ⋄ x←?1e4⍴2
      cmpx '5/⍪x' 'x,,⍨,[0.5]⍨x' '⍉5⌿(1,⍴x)⍴x'
  5/⍪x         → 9.5E¯5 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  x,,⍨,[0.5]⍨x → 3.4E¯5 | -64% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  ⍉5⌿(1,⍴x)⍴x  → 1.3E¯5 | -86% ⎕⎕⎕⎕⎕⎕

This is largely because I had spent a lot of time improving Transpose with Boolean arguments of various shapes in version 16 (the comparison above looks very different in version 15). Transpose on a short but wide Boolean matrix uses a technique that I call bit interleaving, which would unfortunately require another blog post to explain. To transpose a matrix with r rows, I move across the rows, using some special tricks to insert (r-1) zeroes after each bit in each one, and combine them using bitwise or. To instead replicate a vector by r, I modified this to expand the one argument as though it were going to be combined with other rows, but instead of doing that, I combined it with itself by multiplying by the number whose base-2 representation consists of r ones. (C programmers may notice that multiplying by this number is the same as shifting to the left by r and then subtracting the original value. In fact, using multiplication is faster—x86 variable shifts are slow!)

Bit interleaving is a somewhat complicated algorithm, but it had all been done by the time I started on Replicate, and modifying it didn’t take long.

Large expansion factors, the obvious way

The bit-interleaving algorithm is very fast for small numbers and gets slower as the expansion factor increases. Above 64 it is no longer usable because it only writes one word at a time, while each bit should expand to more than a word. I chose to stop using it after 32, as it turns out we can get just about the same performance at 33 using a different approach.

The old approach, on the other hand, is only about half as fast at this expansion factor. Version 16’s algorithm reads from the right argument one bit at a time, then writes that bit the appropriate number of times to the result by writing one partial byte and then using memset (which may spill over into the next bit’s area since memset can only write whole bytes; it will be overwritten when we get to the next bit). This works fine for large enough left arguments, and scales up very well: memset is provided by the C standard library and will push bytes to memory as fast as possible. But for small values it is not so good: the call to memset branches a few times based on the unpredictable size of the write, which incurs a steep branch misprediction penalty.

There’s no easy way to get around the penalty for writing some bits at a variable offset. If the expansion factor isn’t a multiple of eight, then the number of bytes each expanded bit touches will always vary by one, and having to check whether to write that byte results in an unpredictable branch. There are ways to avoid a check like this (overwriting is one), but these tend to be fast only on a small range of expansion factors and to be very tricky to write and bug-prone, especially at the boundaries of the result.

A new idea

So the problem is that writing a weird number of bits at a time is expensive. But it’s possible to get away with only writing one bit for each “memset” we want to perform! The trick is found in a pair of inverse functions well-known for their usefulness in Boolean APL code. These are {2≠/0,⍵} and ≠\, and they are analogous to the pairwise difference {¯2-/0,⍵} and progressive sum +\. Dyalog 16 has fast code for ≠\ provided by Bob Bernecky, so I knew it was a good building block for other high-performance primitives.

The diagram above illustrates the relationships between scans and pairwise differences. In the bitwise world consisting only of even and odd, is like +. But each number is its own negative, so it’s also like -. More precisely, since ⍺≠⍵ is the same as 2|⍺+⍵ on Booleans (check for yourself!) and modulus distributes over addition when the final result has a modulus applied to it, ≠\ is equivalent to 2|+\. Similarly, {2≠/⍵} or {¯2≠/⍵} is the same as {2|¯2-/⍵}. The diagram includes the extra zero in {2≠/0,⍵} as part of the vectors in the top half of the diagram, but omits it from the functions used to label arrows to avoid clutter.

Let’s take a look at the pairwise difference after we replicate a vector:
fig0
It’s very sparse—the only one bits come at indices which are multiples of five. This is because expanding v multiplies the indices of its pairwise difference by five:

      v←1 1 0 1 0 0 0 1
      ⍸ 2≠/0, v  ⍝ ⍸⍵  ←→  ⍵/⍳⍴⍵
0 2 3 4 7
      ⍸ 2≠/0, 5/v
0 10 15 20 35

We can recover 5/v using the inverse ≠\ of pairwise difference. So to expand, we can add four zeros after each bit under the ≠\⍣¯1 function (which is the same as pairwise difference):

      ≠\ ,5↑[1]⍪ ≠\⍣¯1⊢ v
1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1

Of course isn’t particularly fast on these arguments, so we will have to write our own version, which sets the entire result to zero and then writes bits in the appropriate places. Implemented properly, this goes pretty quick, the only substantial costs being a variable shift and one byte-sized write per element on the right. It’s completely branchless. Follow up with the fast ≠\ algorithm and we have a pretty good solution.

This solution performs more writes than it has to, since it writes each difference bit even if it is zero. But trying to get rid of the extra writes is a losing battle. Checking the value before each write takes much longer than just writing it, and even a fancy strategy, like using the count-leading-zeros instruction to quickly skip over all zeros before each one, will be slower for arguments without many zeros in the difference vector. But there is another way to make this algorithm a little faster, coming up after we tie up some loose ends…

Back to Index, and more xor

What about my unexplained formula for indexing? The expression

      y[x;]  ←→  (((≢y)/⍪x) ∧⍤1 ≠⌿y) ≠⍤1 ⊣⌿y

also relies on insight about , but in a completely different context. Let’s simplify it a bit to see what’s going on. Suppose we have two Boolean arrays b and c of the same shape, and we want to use the bit a to select one of them: we want an expression that returns b if a=0 and c if a=1. For real numbers, we might use a linear combination:

      b + a×(c-b)

If a=0, then the c-b term is removed and we get b. If a=1, then it stays and we have (b+(c-b)) = c.
This formula is also valid for Boolean arrays, of course, but introducing an integer c-b would be a poor choice from a performance standpoint. Instead, we can use the insight that performs the function of both + and - in the bitwise world, and transliterate our formula into

      b ≠ a∧(c≠b)

Again, if a=0, then the c≠b is zeroed out, and we end up with b≠0, which is the same as b. If a=1, then it stays, leaving (b≠(c≠b)), which gives c.

If y is a shape 2 m array y←b,[¯0.5]c, then we have b ≡ ⊣⌿y and (c≠b) ≡ ≠⌿y. So b ≠ a∧(c≠b) is, rearranging some, the same as (a ∧ ≠⌿y) ≠ ⊣⌿y. But there is an implicit scalar extension in the function; we can make this explicit by writing ((m/a) ∧ ≠⌿y) ≠ ⊣⌿y. Now it’s clear that to select using each bit of a vector x, we need to turn each bit into a row with length m, and then apply the and functions on rows, giving the formula I showed earlier.

How about Outer Product? That one’s easier, with no Boolean magic required. For vector arguments, ⍺ ∘.f ⍵ is identical to ((≢⍵)/⍪⍺) f ((≢⍺),≢⍵)⍴⍵, and the same computation works for general arrays if we use the ravel length ×/∘⍴ instead of the count . When the right argument has more than around 1024 bits, replicate becomes slower than the old way of just pairing each bit of the left argument with the whole of the right, so we use that method instead.

Speeding up ≠\

We left off with an expansion algorithm whose cost is dominated by the time to compute ≠\ on a vector. That computation is based on a fast method for ≠\ on a single machine word. It moves along the argument vector (which is the same as the result vector in this case) one word at a time, keeping track of the most recent bit of the result. Then each word of the result is obtained from the fast xor-scan of that word in the argument, xor-ed with the carried bit. For arbitrary input vectors, this is the best algorithm. But the word-length ≠\ computation is slowing us down, and it turns out we can do a lot better.

Writing a single bit has the same cost as writing an entire aligned machine word. So writing one bit at a time, while much better than writing a variable number of bits with arbitrary alignment, represents some wasted effort. Can we make the impending ≠\ computation easier by using full-word writes?

A definite “yes” for that question. In fact, we can eliminate all of the intra-word computation. Consider the effect of a bit that we write on the final result after applying ≠\. That bit will be spread out over all of the subsequent bits, flipping each one. In order to produce this effect on a single word (leaving the task of flipping subsequent words for later), we can thus xor an aligned word with a word which is zero before that bit and one at that bit and later. We can get this word just by shifting the all-ones word right by an appropriate amount. And then we can omit the expensive word-size ≠\ computation entirely, so that to perform the final pass we simply repeatedly xor the next word with the current one’s last bit. This greatly reduces the cost of the final pass at very little expense to the pass before it (xor-ing rather than setting a word requires reading that word’s value first, so it’s not quite free).
fig2

To summarize, there are three passes:

  • Set all bits of the result to zero.
  • For each bit in the right argument, xor the appropriate word in the result with a word which is all ones after the place where that bit starts in the result.
  • Iterate across words of the result, xor-ing each with the last bit of the previous word (after finishing computation on that word).

For expansion factors between 32 and 256, this is the fastest method I know of. Below 32, interleaving is faster, and above 256 simply using memset will do the trick.

The xor-based algorithm also works for Replicate or Expand when the left argument is a vector, and in Dyalog 17 it is used for both of these whenever the average expansion factor, which is equal to the ratio of result length to argument length, is 256 or less.

The end result

perf1
The graph above shows the gains in scalar-Boolean replicate between versions 16.0 and 17.0. Note the huge scale of the vertical axis—replicate is just short of a hundred times faster at the far left! Version 17’s Replicate is split into three parts, with the performance of each individual algorithm shown under the overall result. From left to right, these are the algorithm with bit interleaving, the xor-based algorithm described in the last section, and an algorithm with memset. The last of these is the same basic strategy used by version 16, but there are some improvements in version 17 that make it about 45% faster for short arguments.

The seams between these algorithms are not quite perfect on my machine, but they are not off by enough to damage performance significantly. The cutoffs favor the middle, xor-based algorithm a little. I think this is appropriate because that algorithm is likely to have consistent performance even on older machines and different architectures. In contrast, bit interleaving uses the BMI2 instruction set (available on x86 processors since 2013) and is slower if it is not present, and memset may perform differently with other implementations of the standard library or depending on available vector instruction sets. Our xor-based algorithm uses only very widely available instructions, and one of the costs—the variable shift used on the all-ones word—is actually much faster on other architectures because of a design flaw in x86. An excellent addition to Dyalog APL!

Is it Sorted?

Motivation

I have been working on the Dyalog APL quicksort implementation. The following programming puzzle arose in the process of doing the QA for this work.

is a simple array. Write a function sorted, without using or , such that sorted ⍵ is 1 if is sorted in ascending order and 0 otherwise.

The point about not using grade is that this is supposed to be an independent check that grade is correct (remains correct) after the new work.

Real Vectors

The simplest case is when is a numeric vector. If furthermore are not complex numbers (a case addressed later), then

   ∧/ 2≤/⍵

each item being less than or equal to the next one, checks that is sorted. Since uses exact comparisons, here we must set ⎕ct←⎕dct←0. Morever, in order that decimal floating-point numbers (DECFs) be compared correctly, here ⎕fr←1287.

Real Arrays

More generally, when is a non-complex numeric matrix, we must check that each row precedes or is equal to the next row. If c and d are consecutive rows, then corresponding items are compared and at the first item where they differ, c[i] must be less than d[i].

   ~ 0 ∊ (2>⌿⍪⍵) ⍲ <\ 2≠⌿⍪⍵

The expression incorporates two refinements:

  • If is not a matrix, first apply ⍪⍵.
  • Instead of checking c[i] is less than d[i], check that c[i] is not greater than d[i]. This finesses the case where c≡d and there is no first item where they differ; that is, the case where <\2≠⌿⍪⍵ is all 0s for that row.

<\on a boolean vector has 0s after the first 1, (and is all 0 if there are no 1s). Therefore, <\2≠⌿⍪⍵ finds the first item (if any) where one cell differs from the next cell, and that item must not be greater than the corresponding item in the next cell.

For example:

   x←?97 3⍴10

   {~ 0 ∊ (2>⌿⍪⍵) ⍲ <\ 2≠⌿⍪⍵} x
0
   {~ 0 ∊ (2>⌿⍪⍵) ⍲ <\ 2≠⌿⍪⍵} x[⍋x;]
1

(Muse: since x above are random numbers, there is a possibility that it is sorted and the first test above can be 1. But: if each elementary particle in the visible universe were a computer and every nanosecond each of them creates a random matrix and tests it for sortedness as above, running from the beginning of the time to the end of our lives, it is still a very safe bet that no 1 would result.)

For integer arrays, there is an alternative of using the signs of the arithmetic difference between successive cells:

   {~ 0 ∊ 1≠t×<\0≠t← × 2-⌿⍪⍵} x[⍋x;]
1

(The sign where consecutive cells first differ must not be 1.) However, computing the difference on floating point numbers can founder on overflow:

   ⊢ x←¯1 1×⌊/⍬
¯1.79769E308 1.79769E308

   {~ 0 ∊ 1≠t×<\0≠t← × 2-⌿⍪⍵} x
DOMAIN ERROR
   {~0∊1≠t×<\0≠t←×2-⌿⍪⍵}x
                   ∧

Complex Numbers

Two complex numbers are ordered first by the real parts and then by the imaginary parts. (This is part of the TAO extension implemented in Dyalog APL version 17.0.) Therefore, a complex array can be tested for sortedness by testing an equivalent real array with each number replaced by their real and imaginary parts, thus:

   (¯1⌽⍳1+⍴⍴⍵) ⍉ 9 11∘.○⍵
   ↑9 11∘○¨⍵
   9 11○⍤1 0⊢⍵

Although the second expression is the shortest, it is less efficient in time, space, and number of getspace calls. The last expression is favored for its brevity and performance.

The number of getspace is a worthwhile measure. Part of the QA process is a rather stringent procedure called the “Shuffle QA”. The entire Shuffle QA takes several weeks to run and its running time is directly related to the number of getspace.

Character Arrays

None of the functions < ≤ ≥ > - × are permitted on characters. This is solved by application of ⎕ucs, converting characters to integers while preserving the ordering.

Putting It All Together

sorted←{
  ⎕ct←⎕dct←0
  ⎕fr←1287
  d←10|⎕dr ⍵
  d∊0 2: ∇ ⎕ucs ⍵
  d=9:   ∇ 9 11○⍤1 0⊢⍵
  ~ 0 ∊ (2>⌿⍪⍵) ⍲ <\ 2≠⌿⍪⍵
}

Other Considerations

That ⍵⌷⍨⊂⍋⍵ is sorted is a necessary but not sufficient condition that ⍋⍵ is correct. For example, an “adversary” can supply the following results for ⍋⍵ so that ⍵⌷⍨⊂⍋⍵ is sorted:

?≢⍵
(≢⍵)⍴?≢⍵
¯1↓⍋⍵
∊ i {⊂⍵[?⍨≢⍵]}⌸⍨ ⍵⌷⍨⊂i←⍋⍵

The last expression randomly permutes the grade indices of equal cells, a result which violates the requirement that grade indices of equal cells are in ascending order. That is, grade must be stable.

In Dyalog APL version 17.0, grade has been extended to work on non-simple arrays, the much discussed TAO, total array ordering. Checking that a non-simple array is sorted without using grade requires facilities discussed in the paper TAO Axioms and is beyond the scope of this note.

Dyadic Grade

⎕io=0 is assumed throughout. The essay talks only about but the same ideas apply to .

Background

has the distinction of being the first (in 1980) APL primitive function defined on major cells: the result orders items of a vector, rows of a matrix, planes of a 3-d array, etc. In the ordering major cells are compared in ravelled order, with leading items being more significant than trailing (lexicographic ordering). Moreover, in dyadic grade ⍺⍋⍵, specifies “alphabets” to be used in comparing the items of character array .

Dyadic grade has always been an APL primitive which is hard for me to understand, in that way kind of like dyadic transpose ☺. I sat down to really understand it, starting from the simplest cases to the general case. The following is a record of my explorations.

Vector Left Argument

   gv← {⍋⍺⍳⍵}

   a0← 'abcdefghij'
   x0← 'chthonic'
      
   a0 gv x0
0 7 1 3 6 2 4 5
   a0 ⍋ x0
0 7 1 3 6 2 4 5

   x0 ⌷⍨ ⊂ a0 gv x0
cchhiton

That is, grade the indices of in . If an item of is not in then its index is ≢⍺.

Higher-Rank Left Argument with Unique Items

The coordinates of A[i;j;k;…] or A[⊂i,j,k,…] is the vector i,j,k,…. The phrase ⍳⍴A produces the array of coordinates. For example, if is the (2 26)-matrix of the upper and lower case English letters,

   ABCDEFGHIJKLMNOPQRSTUVWXYZ
   abcdefghijklmnopqrstuvwxyz

the corresponding coordinates are

   ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬   ┬────┬────┐
   │0 0│0 1│0 2│0 3│0 4│0 5│0 6│0 7│0 8│   │0 24│0 25│
   ├───┼───┼───┼───┼───┼───┼───┼───┼───┼ … ├────┼────┤
   │1 0│1 1│1 2│1 3│1 4│1 5│1 6│1 7│1 8│   │1 24│1 25│
   └───┴───┴───┴───┴───┴───┴───┴───┴───┴   ┴────┴────┘

If the items of are unique,

   gu← {⍋ 0 2 1 ⍉ (⊂(,⍺)⍳⍪⍵) ⌷ ⌽ (⍴⍺) ⍪⍨ ⍉(⍴⍺)⊤⍳×/⍴⍺}

That is, ⍺⍋⍵ obtains as the grade of the reversed coordinates of in . (If an item does not occur in , its coordinates are ⍴⍺.) The implements that in , the first axis is least significant and the last axis is most significant. For the (2 26)-matrix above, case (the first axis) is less significant than A-Z and a-z (the last axis).

   ⊢ a1←' ',⎕av[(⎕av⍳'Aa')∘.+⍳26]
 ABCDEFGHIJKLMNOPQRSTUVWXYZ
 abcdefghijklmnopqrstuvwxyz

   ⊢ x1←↑' '(≠⊆⊢)' Jay roger Roger adam Adam jay' 
Jay  
roger
Roger
adam 
Adam 
jay  

   a1 gu x1
4 3 0 5 2 1
   a1 ⍋ x1
4 3 0 5 2 1

   x1 ⌷⍨ ⊂ a1 gu x1
Adam 
adam 
Jay  
jay  
Roger
roger

Higher-Rank Left Arguments

Suppose does have duplicates? For purposes of , the coordinates of an item c are

   ⌊⌿(c=,⍺)⌿↑,⍳⍴⍺

That is, the minimum of coordinates of all items equal to c. Note that the expression also works if c is a unique item. Therefore, for a general , with or without duplicates, ⍺⍋⍵ obtains as

   gr← {⍋ 0 2 1 ⍉ (⊂(∪,⍺)⍳⍪⍵) ⌷ ⌽ (⍴⍺) ⍪⍨ (,⍺) {⌊⌿⍵}⌸ ⍉(⍴⍺)⊤⍳×/⍴⍺}

The “minimum of coordinates” computation is exploited to effect equal coodinates for disparate characters. For example, an ordering where upper and lower case are significant but diacritical marks are not, can be implemented as follows:

   A    ⍝ A has a leading blank column
 AÀÁÂÃÄÅBCÇDEÈÉÊËFGHIÌÍÎÏJKLMNÑOÒÓÔÕÖØPQRSTUÙÚÛÜVWXYÝZ
 aàáâãäåbcçdeèéêëfghiìíîïjklmnñoòóôõöøpqrstuùúûüvwxyýz
 À       Ç  È       Ì        Ñ Ò                   Ý  
 Á       ç  É       Í        ñ Ó                   ý  
 Â          Ê       Î          Ô                      
 Ã          Ë       Ï          Ö                      
 Ä          è       ì          Õ                      
 Å          é       í          Ø                      
 à          ê       î          ò                      
 á          ë       ï          ó                      
 â                             ô                      
 ã                             õ                      
 ä                             ö                      
 å                             ø                      
   ⍴A
14 54

   ('È'=,A)⌿↑,⍳⍴A                ('è'=,A)⌿↑,⍳⍴A
0 13                          1 13
2 12                          6 12
   ⌊⌿('È'=,A)⌿↑,⍳⍴A              ⌊⌿('è'=,A)⌿↑,⍳⍴A
0 12                          1 12

   ('E'=,A)⌿↑,⍳⍴A                ('e'=,A)⌿↑,⍳⍴A
0 12                          1 12
   ⌊⌿('E'=,A)⌿↑,⍳⍴A              ⌊⌿('e'=,A)⌿↑,⍳⍴A
0 12                          1 12

'È' occurs twice in A with coordinates 0 13 and 2 12. The coordinates assigned to 'È' are the minimum of these, 0 12. In contrast, 'E' occurs once and its coordinates are 0 12, the same as those for 'È'. Therefore, 'E' and 'È' are considered equal for purposes of dyadic grade. Similarly, 'e' and 'è' have coordinates 1 12 and are considered equal by , but they follow 'E' and 'È' (because their coordinates are 0 12).

For example:

   ⊢ x← ↑' '(≠⊆⊢)' roger adàm Röger rÖger Adåm JÃY JAY JÃY adåm adàm'
roger
adàm 
Röger
rÖger
Adåm 
JÃY  
JAY  
JÃY  
adåm 
adàm 

   A gr x
4 1 8 9 5 6 7 2 3 0
   A ⍋ x
4 1 8 9 5 6 7 2 3 0

   x ⌷⍨⊂ A gr x
Adåm 
adàm 
adåm 
adàm 
JÃY  
JAY  
JÃY  
Röger
rÖger
roger

Lest you think that dyadic grade in its full generality suffices to implement any ordering: in “telephone book” ordering, “1600 Pennsylvania Avenue” and “Unter den Linden 23” are ordered as if 1600 were spelled out as “Sixteen Hundred” and 23 as “Dreiundzwanzig”. A program to do that ought to be très amusant.

Code Archeology

The above code are improved versions of what appeared in Peter Wooster, Extended Upgrade and Downgrade, SHARP APL Technical Notes 35, I.P. Sharp Associates, 1980-09-15. It is interesting to study the code from the two eras. (The code from 1980 is lightly edited for executability and clarity.)

2018

gv← {⍋⍺⍳⍵}
gu← {⍋ 0 2 1 ⍉ (⊂(,⍺)⍳⍪⍵) ⌷ ⌽ (⍴⍺) ⍪⍨ ⍉(⍴⍺)⊤⍳×/⍴⍺}
gr← {⍋ 0 2 1 ⍉ (⊂(∪,⍺)⍳⍪⍵) ⌷ ⌽ (⍴⍺) ⍪⍨ (,⍺) {⌊⌿⍵}⌸ ⍉(⍴⍺)⊤⍳×/⍴⍺}

1980

eu← {d⊤⍳×/d←⍴⍵}
er← {¯1+÷(÷1+d⊤⍳×/d←⍴⍵)⌈.×a∘.=a←,⍵}

fv← {⍋⍺⍳⍵}
fu← {⍋(⍒0 1,1↓0×⍳⍴⍴⍵)⍉(⊖(eu ⍺),⍴⍺)[;(,⍺)⍳⍵]}
fr← {⍋(⍒0 1,1↓0×⍳⍴⍴⍵)⍉(⊖(er ⍺),⍴⍺)[;(,⍺)⍳⍵]}
gv, fv vector left argument
gu, fu higher-ranked left argument with unique items
gr, fr higher-ranked left argument

In the sequence gv gu gr, a function is more general than the preceding one and subsumes it. Likewise fv fu fr.

Comparison of the code illustrates advances in APL between 1980 and 2018:

  • {⌊⌿⍵}⌸ minimum of major cells corresponding to identical keys
  • ∪      unique items
  • ⍪⍵     ravel major cells
  • ⍺⍪⍵    catenate on first axis
  • ⍨      commute operator
  • dfns

Alternatives

If a left argument is large and complicated and is used repeatedly, it may be worthwhile for the APL interpreter to perform precomputations on it. Thus:

   U← ∪,A
   C← ⌽ (⍴A) ⍪⍨ (,A) {⌊⌿⍵}⌸ ⍉(⍴A)⊤⍳×/⍴A

   ⍴U        ⍴C
107       108 2

   ⍪U        C
           0  0
A          1  0
À          1  0
Á          1  0
          1  0
à         1  0
Ä          1  0
Å          1  0
B          8  0
C          9  0
Ç          9  0
…           …
x         50  1
y         51  1
ý         51  1
z         53  1
          14 54

   gp← (U C)∘{U C←⍺ ⋄ ⍋0 2 1⍉C[U⍳⍪⍵;]}

   gp x
4 1 8 9 5 6 7 2 3 0
   A ⍋ x
4 1 8 9 5 6 7 2 3 0

It makes sense that the number of columns in the coordinate matrix C is equal to the rank of the alphabet array A: The rank is the number of dimensions, a-z, upper/lower case, color, etc.; each row of C is a vector of the numeric value for each dimension.

With 20/20 hindsight, the above code can be seen as an argument against defining dyadic grade to do ordering with specified alphabets. After all,

   ⍺⍋⍵  ←→  ⍋0 2 1⍉C[U⍳⍪⍵;]

and specifying U and C directly makes the computation easier to understand, easier to use, and as it happens is faster than the primitive in the current implementation. The inverse calculation, from U C to the alphabet array A, is an amusing bit of code left as an exercise for the reader☺.

One can further argue that the current definition of dyadic grade precludes an alternative attractive but incompatible definition:

   ⍺⍋⍵  ←→  ⍺⌷⍨⊂⍋⍵

That is, order by the grade of , whence ⍋⍨ sorts. In Dyalog APL version 17.0, monadic grade is extended to work with a TAO (total array ordering). With a TAO and this alternative definition, ⍋⍨ sorts any array.

The present exposition exposes a difficulty with extending the current dyadic grade to work with TAO: It is axiomatic that monadic grade compares cells itemwise, stopping at the first pair of unequal items. Dyadic grade does not do that in general. For example, with an upper and lower case alphabet, you don’t stop comparing 'Rogerz' and 'rogers' on encountering 'R' and 'r'.

Linear Interpolation

⎕io=0 assumed throughout; works in 1-origin with the obvious modifications.

Introduction

On Wednesday, a question arrived via Dyalog Support from an intern in Africa: If M is the matrix on the left, use linear interpolation to compute the result on the right.

   1 20         1 20
   4 80         2 40
   6 82         3 60
                4 80
                5 81
                6 82

Linear Interpolation

Two points (x0,y0) and (x1,y1) specify a line; for any x there is a unique y on that line (assuming x0≠x1). The equation for the line derives as follows, starting from its slope m:

   m = (y1-y0) ÷ (x1-x0)
   (y-y0) = m × (x-x0)
   y = y0 + m × (x-x0)

Therefore, if is a 2-by-2 matrix of the two points and are the x-values to be interpolated, then:

   g ← {(⊃⌽⍺)+(⍵-⊃⍺)÷÷/-⌿⍺}

   ⊢ M←1 4 6,⍪20 80 82
1 20
4 80
6 82

   M[0 1;] g 2 3
40 60
   M[1 2;] g 5
81

A New Twist, A New Solution

The problem as posed implicitly required that:

  • The x-values are the positive integers bounded by ⊃⊖M.
  • Appropriate rows of the matrix are selected for a given x-value.
  • The missing x-values and their interpolations are “slotted back” into the argument matrix.

These requirements are best met by , interval index, a relatively new primitive function introduced in Dyalog APL version 16.0. The left argument must be sorted and partitions the universe into disjoint contiguous intervals; ⍺⍸⍵ finds the index of the interval which contains an item of . The result is ⎕io dependent.

For the given matrix M, the partition (of the real numbers in this case) is depicted below. As in conventional mathematical notation, [ denotes that the interval includes the left end-point and ) denotes that the interval excludes the right end-point.

          1        4      6
─────────)[───────)[─────)[──────────
     ¯1       0       1       2

   v←¯5 0 1 2.5 6 3 4 5 9 8 7

   1 4 6 ⍸ v
¯1 ¯1 0 0 2 0 1 1 2 2 2

   v ,[¯0.5] 1 4 6 ⍸ v
¯5  0 1 2.5 6 3 4 5 9 8 7
¯1 ¯1 0 0   2 0 1 1 2 2 2

With in hand, the problem can be solved as follows:

interpol←{
  (x y)←↓⍉⍵
  m←m,⊃⌽m←(2-/y)÷(2-/x)
  j←0⌈x⍸i←1+⍳⊃⌽x
  i,⍪y[j]+m[j]×i-x[j]
}

   interpol M
1 20
2 40
3 60
4 80
5 81
6 82

The problem of x-values less than the first end-point is finessed by applying 0⌈ to the interval indices, and that of x-values greater than or equal to the last end-point is finessed by repeating the last slope m←m,⊃⌽m.

It is possible to do the interpolation only on the missing indices (2 3 5 in this case) and insert them into the argument matrix. It seems neater to simply interpolate everything, and in so doing provide a check that the interpolated values equal the values given in the argument.

An Alternative Interpolation

Interpolating according to two selected rows of a matrix of points treats the function as piecewise linear, with sharp inflection points where the lines join (different slopes between adjacent lines). A “holistic” alternative approach is possible: the matrix can be interpreted as specifying a single line and the interpolation is according to this single line. The primitive function computes the coefficients of the line which best fits the points:

   ⎕rl←7*5  ⍝ for reproducible random numbers

   ⊢ M←t,⍪(?7⍴5)+¯17+3×t←?7⍴100
35  89
98 278
19  44
 4  ¯5
62 170
49 133
25  59

   M[;1] ⌹ 1,M[;,0]    ⍝ y-intercept and slope
¯15.3164 2.99731

   interpola ← {(1,⍤0⊢⍵)+.×⍺[;1]⌹1,⍺[;,0]}

   M[;1] ,[¯0.5] M interpola M[;0]
89      278    44      ¯5       170     133     59     
89.5895 278.42 41.6325 ¯3.32713 170.517 131.552 59.6164

   M interpola 33 35 37 39.7
83.5949 89.5895 95.5841 103.677

Finally

Our best wishes to the intern. Welcome to APL!

Phil Goacher (05-11-40 – 09-03-18)

Phil Goacher
I attended Phil Goacher’s funeral yesterday. He probably had no idea of just how much he would influence the development of APL. I first met Phil at W S Atkins which is a large UK engineering consultancy. There was a sub unit called Atkins Computing which provided internal computing support to the engineers and, crucially, provided a time sharing service to external users. He became my manager when I moved from supporting the transportation program suite to the brand new APL team.

Phil was more manager and business man than programmer but he had an engineering background in Gas distribution and thus the mathematical bias that seems common to APL users.

Phil was the sort of man with an eye for a business opportunity and when a advert appeared from someone who wanted to set up an APL consultancy with a limited objective to provide resources to Rank Xerox (UK arm of Xerox Corp) he turned up along with half his team. Phil was a wheeler dealer type and had already talked to the recruiter, persuaded him that he could manage the team, and then interviewed the candidates. So I wound up being interviewed by my existing manager. Following this Phil decided we only needed a salesman and we could set the consultancy up with a much wider remit. Phil recruited Ted Hare who was one of Atkin’s top salesmen and Dyadic Systems was born.

Phil did all of the necessary legal and administration to get Dyadic off the ground and became the company’s administration guy. He bought an Apple 2 with VisiCalc so was early into the spreadsheet concept. It must be said that all five of the people who started Dyadic were “early adopters” by nature.

Dyadic was quite successful as an APL consultancy but wanted to do more. I am not sure who instigated it (I was too busy consulting) but it was probably Phil’s idea to develop an APL of our own. Knowing Phil this was probably a business eye and a request from Pam Geisler who was Pauline Brand’s sister (Pauline was an early recruit that we “acquired” from Atkin’s) but, more importantly, high up in the management of Zilog. Zilog was developing a new 16 bit chip, Z8000, that was going to push them up market from the 8-bit chip with which they had had enormous success – Z80. IBM had been pushing APL as the future, and it was incumbent on competitors to provide an APL offering.

Phil and Dave Crossley will have been the instigators of recruiting John Scholes, who had also been at Atkin’s, and teaming him with me to develop Dyalog (Dyadic + Zilog). Dave provided the management overview of that project while Phil and Ted had to worry about financing John and I, who were not bringing in consultancy money. It was at this point that Dyalog APL became a UNIX and C project. It was much later when I read Brian Kernighan’s obituary that I realised just how early we had got into UNIX and C.

Dyalog APL was not the instant success that Dyadic hoped for. It had increased costs. We needed an office and the paraphernalia that goes with selling a product rather than brains. Dyadic ran into financial issues and was taken over by Lynwood Scientific, who made terminals with Zilog Z8000 chips inside. I think they really wanted the Z8000 expertise that John and I had acquired. As part of that takeover Phil, Ted and Dave had to leave with a derisory pay-off and I lost touch until I learnt of his death a fortnight ago.

From the perspective of APL language development Phil is not a tower of strength. However, from the perspective of having provided the impetus and drive to get Dyalog APL developed, he is very important.

NOTE: For anyone interested in reading more about the early days of APL, see the Vector special on the first 25 years of Dyalog Ltd.

Permuting Internal Letters

Friday Afternoon

It’s something of a custom in Dyalog to send a “fun” e-mail to the group on Friday afternoons. My gambit for this past Friday was:

   x ←' according to research it doesn''t matter'
   x,←' what order the letters in a word are'
   x,←' the human mind can still read it'
   x,←' the only important thing is that'
   x,←' the first and the last letters are in the right place'

   ∊ ' ',¨ {(1↑⍵),({⍵[?⍨≢⍵]}1↓¯1↓⍵),(-1<≢⍵)↑⍵}¨ (' '∘≠ ⊆ ⊢) x
 aroidnccg to reasrech it dneso't mttear waht order the ltreets 
      in a wrod are the hmaun mnid can sltil read it the olny 
      imartpnot tihng is that the fsrit and the lsat lterets 
      are in the rhigt palce

(The research: http://www.mrc-cbu.cam.ac.uk/people/matt.davis/Cmabrigde/rawlinson/)

Code Golfing

The code golfers were not long in responding:

   ∊{' ',⍵[1,(1+?⍨0⌈¯2+n),n/⍨1<n←≢⍵]}¨(' '∘≠⊆⊢)x     ⍝ fa
   ∊{' ',⍵[∪1,{⍵,⍨?⍨⍵-1}≢⍵]}¨(' '∘≠⊆⊢)x              ⍝ fb
   '\S+'⎕r{{⍵[∪1,{⍵,⍨?⍨⍵-1}≢⍵]}⍵.Match}x             ⍝ fc
   ∊{' ',⍵[n↑1,(1+?⍨0⌈¯2+n),n←≢⍵]}¨(' '∘≠⊆⊢)x        ⍝ fd

In defense of the original expression, it can be said that it follows the pattern:

  • cut into words            (' '∘≠ ⊆ ⊢) x  also ' '(≠⊆⊢)x
  • permute each word      {(1↑⍵),({⍵[?⍨≢⍵]}1↓¯1↓⍵),(-1<≢⍵)↑⍵}¨
  • undo the cut             ∊ ' ',¨

That is, the pattern is “permute under cut”. Moving the ' ', into the permuting function, while shortening the overall expression, obscures this pattern.

Golfing for Speed

One of the code golfers was undaunted by the highfalutin’ functional programming argument. Changing tack, he claimed to be golfing for speed (while himself travelling at 900 kph):

fe←{
  p←1+0,⍸⍵=' '           ⍝ start of each word
  n←0⌈¯3+¯2-/p,2+≢⍵      ⍝ sizes of windows to be shuffled
  ⍵[∊p+?⍨¨n]@((⍳¯1+≢⍵)~p∘.-0 1 2)⊢⍵
}

   cmpx 'fe x' 'fb x' 'fc x' 'fd x'
  fe x → 4.27E¯5 |    0% ⎕⎕⎕⎕
* fb x → 1.05E¯4 | +145% ⎕⎕⎕⎕⎕⎕⎕⎕⎕
* fc x → 3.39E¯4 | +692% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* fd x → 1.11E¯4 | +159% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

The question was then posed whether the algorithm can be “flattened” further; that is, whether the expression ∊p+?⍨¨n in function fe can be done without each. The answer is of course yes (else I wouldn’t be writing this blog post :-). Flattening, or avoiding the creation of nested arrays, has the potential to reduce memory consumption and increase efficiency, because there is more potential for the interpreter to perform optimized sequential or even parallel operations.

Partitioned Random Permutations

In the old days, before general arrays, before the each and rank operators, ingenious techniques were devised for working with “partitioned” arrays: A boolean vector with a leading 1 specifies a partition on a corresponding array with the same length, basically what we can now do with or similar facility. A detailed description can be found in Bob Smith’s APL79 paper A Programming Technique for Non-Rectangular Data.

   p←1 0 0 1 1 0 0 0 0 0
   v←3 1 4 1 5 9 2 6 53 58

   p⊂v
┌─────┬─┬─────────────┐
│3 1 4│1│5 9 2 6 53 58│
└─────┴─┴─────────────┘

   +/¨ p⊂v
8 1 133
   t-¯1↓0,t←(1⌽p)/+\v
8 1 133

   ∊ +\¨ p⊂v
3 4 8 1 5 14 16 22 75 133
   s-(t-¯1↓0,t←(1⌽p)/⍳⍴p)/¯1↓0,(1⌽p)/s←+\v
3 4 8 1 5 14 16 22 75 133

The last two sets of expressions illustrate how “partitioned plus reduce” and “partitioned plus scan” can be computed, without use of general arrays and without each.

At issue is how to do “partitioned random permute”. Answer: ⍋(n?n)+n×+\p ⊣ n←≢p.

p                1  0  0    1    1  0  0  0  0  0
n?n              5  4 10    6    2  9  8  1  3  7
n×+\p           10 10 10   20   30 30 30 30 30 30
(n?n)+n×+\p     15 14 20   26   32 39 38 31 33 37
⍋(n?n)+n×+\p     2  1  3    4    8  5  9 10  7  6

Therefore:

ff←{
  b←~(⊢ ∨ 1∘⌽ ∨ ¯1∘⌽)' '=⍵  ⍝ internal letters
  n←≢p←b/b>¯1⌽b             ⍝ partition vector for groups of same
  (b/⍵)[⍋(n?n)+n×+\p]@{b}⍵
}

   cmpx 'ff x' 'fe x'
  ff x → 1.46E¯5 |    0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* fe x → 4.19E¯5 | +187% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

   x6←1e6⍴x

   cmpx 'ff x6' 'fe x6'
  ff x6 → 5.16E¯2 |    0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕
* fe x6 → 1.77E¯1 | +243% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

There is a puzzle from 1980 which similarly involved randomly permuting items within groups. See A History of APL in 50 Functions, Chapter 47, Pyramigram. The puzzle solution is amusing also for that fact that it used -\.

Afterword

Plobssiy, a txet wshoe words are ilivuddinaly ptemrued is rdeaable mnaliy bseacue in odirrany txet many wrods, eelpclsaiy iptnroamt wrods, are sorht. A curops wtih long and ufnmiailar wdors wuold lkiely be hard to raed atfer scuh pittarmoeun. For eapxmle:

   ff t
 eyonelsmeary dpihnoeissopt siiuqpeedlasan pniormoasaac oopimoentaoa
   ff t
 emlresaneyoy dnpepoihissot seilesiqaadupn poimrnaaasoc oaioomtpneoa
   ff t
 erloymneseay dhsiepopiosnt seildspqiauean praisoaaonmc oomatneiopoa