Is it Sorted?

Motivation

I have been working on the Dyalog APL quicksort implementation. The following programming puzzle arose in the process of doing the QA for this work.

is a simple array. Write a function sorted, without using or , such that sorted ⍵ is 1 if is sorted in ascending order and 0 otherwise.

The point about not using grade is that this is supposed to be an independent check that grade is correct (remains correct) after the new work.

Real Vectors

The simplest case is when is a numeric vector. If furthermore are not complex numbers (a case addressed later), then

   ∧/ 2≤/⍵

each item being less than or equal to the next one, checks that is sorted. Since uses exact comparisons, here we must set ⎕ct←⎕dct←0. Morever, in order that decimal floating-point numbers (DECFs) be compared correctly, here ⎕fr←1287.

Real Arrays

More generally, when is a non-complex numeric matrix, we must check that each row precedes or is equal to the next row. If c and d are consecutive rows, then corresponding items are compared and at the first item where they differ, c[i] must be less than d[i].

   ~ 0 ∊ (2>⌿⍪⍵) ⍲ <\ 2≠⌿⍪⍵

The expression incorporates two refinements:

  • If is not a matrix, first apply ⍪⍵.
  • Instead of checking c[i] is less than d[i], check that c[i] is not greater than d[i]. This finesses the case where c≡d and there is no first item where they differ; that is, the case where <\2≠⌿⍪⍵ is all 0s for that row.

<\on a boolean vector has 0s after the first 1, (and is all 0 if there are no 1s). Therefore, <\2≠⌿⍪⍵ finds the first item (if any) where one cell differs from the next cell, and that item must not be greater than the corresponding item in the next cell.

For example:

   x←?97 3⍴10

   {~ 0 ∊ (2>⌿⍪⍵) ⍲ <\ 2≠⌿⍪⍵} x
0
   {~ 0 ∊ (2>⌿⍪⍵) ⍲ <\ 2≠⌿⍪⍵} x[⍋x;]
1

(Muse: since x above are random numbers, there is a possibility that it is sorted and the first test above can be 1. But: if each elementary particle in the visible universe were a computer and every nanosecond each of them creates a random matrix and tests it for sortedness as above, running from the beginning of the time to the end of our lives, it is still a very safe bet that no 1 would result.)

For integer arrays, there is an alternative of using the signs of the arithmetic difference between successive cells:

   {~ 0 ∊ 1≠t×<\0≠t← × 2-⌿⍪⍵} x[⍋x;]
1

(The sign where consecutive cells first differ must not be 1.) However, computing the difference on floating point numbers can founder on overflow:

   ⊢ x←¯1 1×⌊/⍬
¯1.79769E308 1.79769E308

   {~ 0 ∊ 1≠t×<\0≠t← × 2-⌿⍪⍵} x
DOMAIN ERROR
   {~0∊1≠t×<\0≠t←×2-⌿⍪⍵}x
                   ∧

Complex Numbers

Two complex numbers are ordered first by the real parts and then by the imaginary parts. (This is part of the TAO extension implemented in Dyalog APL version 17.0.) Therefore, a complex array can be tested for sortedness by testing an equivalent real array with each number replaced by their real and imaginary parts, thus:

   (¯1⌽⍳1+⍴⍴⍵) ⍉ 9 11∘.○⍵
   ↑9 11∘○¨⍵
   9 11○⍤1 0⊢⍵

Although the second expression is the shortest, it is less efficient in time, space, and number of getspace calls. The last expression is favored for its brevity and performance.

The number of getspace is a worthwhile measure. Part of the QA process is a rather stringent procedure called the “Shuffle QA”. The entire Shuffle QA takes several weeks to run and its running time is directly related to the number of getspace.

Character Arrays

None of the functions < ≤ ≥ > - × are permitted on characters. This is solved by application of ⎕ucs, converting characters to integers while preserving the ordering.

Putting It All Together

sorted←{
  ⎕ct←⎕dct←0
  ⎕fr←1287
  d←10|⎕dr ⍵
  d∊0 2: ∇ ⎕ucs ⍵
  d=9:   ∇ 9 11○⍤1 0⊢⍵
  ~ 0 ∊ (2>⌿⍪⍵) ⍲ <\ 2≠⌿⍪⍵
}

Other Considerations

That ⍵⌷⍨⊂⍋⍵ is sorted is a necessary but not sufficient condition that ⍋⍵ is correct. For example, an “adversary” can supply the following results for ⍋⍵ so that ⍵⌷⍨⊂⍋⍵ is sorted:

?≢⍵
(≢⍵)⍴?≢⍵
¯1↓⍋⍵
∊ i {⊂⍵[?⍨≢⍵]}⌸⍨ ⍵⌷⍨⊂i←⍋⍵

The last expression randomly permutes the grade indices of equal cells, a result which violates the requirement that grade indices of equal cells are in ascending order. That is, grade must be stable.

In Dyalog APL version 17.0, grade has been extended to work on non-simple arrays, the much discussed TAO, total array ordering. Checking that a non-simple array is sorted without using grade requires facilities discussed in the paper TAO Axioms and is beyond the scope of this note.

Dyadic Grade

⎕io=0 is assumed throughout. The essay talks only about but the same ideas apply to .

Background

has the distinction of being the first (in 1980) APL primitive function defined on major cells: the result orders items of a vector, rows of a matrix, planes of a 3-d array, etc. In the ordering major cells are compared in ravelled order, with leading items being more significant than trailing (lexicographic ordering). Moreover, in dyadic grade ⍺⍋⍵, specifies “alphabets” to be used in comparing the items of character array .

Dyadic grade has always been an APL primitive which is hard for me to understand, in that way kind of like dyadic transpose ☺. I sat down to really understand it, starting from the simplest cases to the general case. The following is a record of my explorations.

Vector Left Argument

   gv← {⍋⍺⍳⍵}

   a0← 'abcdefghij'
   x0← 'chthonic'
      
   a0 gv x0
0 7 1 3 6 2 4 5
   a0 ⍋ x0
0 7 1 3 6 2 4 5

   x0 ⌷⍨ ⊂ a0 gv x0
cchhiton

That is, grade the indices of in . If an item of is not in then its index is ≢⍺.

Higher-Rank Left Argument with Unique Items

The coordinates of A[i;j;k;…] or A[⊂i,j,k,…] is the vector i,j,k,…. The phrase ⍳⍴A produces the array of coordinates. For example, if is the (2 26)-matrix of the upper and lower case English letters,

   ABCDEFGHIJKLMNOPQRSTUVWXYZ
   abcdefghijklmnopqrstuvwxyz

the corresponding coordinates are

   ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬   ┬────┬────┐
   │0 0│0 1│0 2│0 3│0 4│0 5│0 6│0 7│0 8│   │0 24│0 25│
   ├───┼───┼───┼───┼───┼───┼───┼───┼───┼ … ├────┼────┤
   │1 0│1 1│1 2│1 3│1 4│1 5│1 6│1 7│1 8│   │1 24│1 25│
   └───┴───┴───┴───┴───┴───┴───┴───┴───┴   ┴────┴────┘

If the items of are unique,

   gu← {⍋ 0 2 1 ⍉ (⊂(,⍺)⍳⍪⍵) ⌷ ⌽ (⍴⍺) ⍪⍨ ⍉(⍴⍺)⊤⍳×/⍴⍺}

That is, ⍺⍋⍵ obtains as the grade of the reversed coordinates of in . (If an item does not occur in , its coordinates are ⍴⍺.) The implements that in , the first axis is least significant and the last axis is most significant. For the (2 26)-matrix above, case (the first axis) is less significant than A-Z and a-z (the last axis).

   ⊢ a1←' ',⎕av[(⎕av⍳'Aa')∘.+⍳26]
 ABCDEFGHIJKLMNOPQRSTUVWXYZ
 abcdefghijklmnopqrstuvwxyz

   ⊢ x1←↑' '(≠⊆⊢)' Jay roger Roger adam Adam jay' 
Jay  
roger
Roger
adam 
Adam 
jay  

   a1 gu x1
4 3 0 5 2 1
   a1 ⍋ x1
4 3 0 5 2 1

   x1 ⌷⍨ ⊂ a1 gu x1
Adam 
adam 
Jay  
jay  
Roger
roger

Higher-Rank Left Arguments

Suppose does have duplicates? For purposes of , the coordinates of an item c are

   ⌊⌿(c=,⍺)⌿↑,⍳⍴⍺

That is, the minimum of coordinates of all items equal to c. Note that the expression also works if c is a unique item. Therefore, for a general , with or without duplicates, ⍺⍋⍵ obtains as

   gr← {⍋ 0 2 1 ⍉ (⊂(∪,⍺)⍳⍪⍵) ⌷ ⌽ (⍴⍺) ⍪⍨ (,⍺) {⌊⌿⍵}⌸ ⍉(⍴⍺)⊤⍳×/⍴⍺}

The “minimum of coordinates” computation is exploited to effect equal coodinates for disparate characters. For example, an ordering where upper and lower case are significant but diacritical marks are not, can be implemented as follows:

   A    ⍝ A has a leading blank column
 AÀÁÂÃÄÅBCÇDEÈÉÊËFGHIÌÍÎÏJKLMNÑOÒÓÔÕÖØPQRSTUÙÚÛÜVWXYÝZ
 aàáâãäåbcçdeèéêëfghiìíîïjklmnñoòóôõöøpqrstuùúûüvwxyýz
 À       Ç  È       Ì        Ñ Ò                   Ý  
 Á       ç  É       Í        ñ Ó                   ý  
 Â          Ê       Î          Ô                      
 Ã          Ë       Ï          Ö                      
 Ä          è       ì          Õ                      
 Å          é       í          Ø                      
 à          ê       î          ò                      
 á          ë       ï          ó                      
 â                             ô                      
 ã                             õ                      
 ä                             ö                      
 å                             ø                      
   ⍴A
14 54

   ('È'=,A)⌿↑,⍳⍴A                ('è'=,A)⌿↑,⍳⍴A
0 13                          1 13
2 12                          6 12
   ⌊⌿('È'=,A)⌿↑,⍳⍴A              ⌊⌿('è'=,A)⌿↑,⍳⍴A
0 12                          1 12

   ('E'=,A)⌿↑,⍳⍴A                ('e'=,A)⌿↑,⍳⍴A
0 12                          1 12
   ⌊⌿('E'=,A)⌿↑,⍳⍴A              ⌊⌿('e'=,A)⌿↑,⍳⍴A
0 12                          1 12

'È' occurs twice in A with coordinates 0 13 and 2 12. The coordinates assigned to 'È' are the minimum of these, 0 12. In contrast, 'E' occurs once and its coordinates are 0 12, the same as those for 'È'. Therefore, 'E' and 'È' are considered equal for purposes of dyadic grade. Similarly, 'e' and 'è' have coordinates 1 12 and are considered equal by , but they follow 'E' and 'È' (because their coordinates are 0 12).

For example:

   ⊢ x← ↑' '(≠⊆⊢)' roger adàm Röger rÖger Adåm JÃY JAY JÃY adåm adàm'
roger
adàm 
Röger
rÖger
Adåm 
JÃY  
JAY  
JÃY  
adåm 
adàm 

   A gr x
4 1 8 9 5 6 7 2 3 0
   A ⍋ x
4 1 8 9 5 6 7 2 3 0

   x ⌷⍨⊂ A gr x
Adåm 
adàm 
adåm 
adàm 
JÃY  
JAY  
JÃY  
Röger
rÖger
roger

Lest you think that dyadic grade in its full generality suffices to implement any ordering: in “telephone book” ordering, “1600 Pennsylvania Avenue” and “Unter den Linden 23” are ordered as if 1600 were spelled out as “Sixteen Hundred” and 23 as “Dreiundzwanzig”. A program to do that ought to be très amusant.

Code Archeology

The above code are improved versions of what appeared in Peter Wooster, Extended Upgrade and Downgrade, SHARP APL Technical Notes 35, I.P. Sharp Associates, 1980-09-15. It is interesting to study the code from the two eras. (The code from 1980 is lightly edited for executability and clarity.)

2018

gv← {⍋⍺⍳⍵}
gu← {⍋ 0 2 1 ⍉ (⊂(,⍺)⍳⍪⍵) ⌷ ⌽ (⍴⍺) ⍪⍨ ⍉(⍴⍺)⊤⍳×/⍴⍺}
gr← {⍋ 0 2 1 ⍉ (⊂(∪,⍺)⍳⍪⍵) ⌷ ⌽ (⍴⍺) ⍪⍨ (,⍺) {⌊⌿⍵}⌸ ⍉(⍴⍺)⊤⍳×/⍴⍺}

1980

eu← {d⊤⍳×/d←⍴⍵}
er← {¯1+÷(÷1+d⊤⍳×/d←⍴⍵)⌈.×a∘.=a←,⍵}

fv← {⍋⍺⍳⍵}
fu← {⍋(⍒0 1,1↓0×⍳⍴⍴⍵)⍉(⊖(eu ⍺),⍴⍺)[;(,⍺)⍳⍵]}
fr← {⍋(⍒0 1,1↓0×⍳⍴⍴⍵)⍉(⊖(er ⍺),⍴⍺)[;(,⍺)⍳⍵]}
gv, fv vector left argument
gu, fu higher-ranked left argument with unique items
gr, fr higher-ranked left argument

In the sequence gv gu gr, a function is more general than the preceding one and subsumes it. Likewise fv fu fr.

Comparison of the code illustrates advances in APL between 1980 and 2018:

  • {⌊⌿⍵}⌸ minimum of major cells corresponding to identical keys
  • ∪      unique items
  • ⍪⍵     ravel major cells
  • ⍺⍪⍵    catenate on first axis
  • ⍨      commute operator
  • dfns

Alternatives

If a left argument is large and complicated and is used repeatedly, it may be worthwhile for the APL interpreter to perform precomputations on it. Thus:

   U← ∪,A
   C← ⌽ (⍴A) ⍪⍨ (,A) {⌊⌿⍵}⌸ ⍉(⍴A)⊤⍳×/⍴A

   ⍴U        ⍴C
107       108 2

   ⍪U        C
           0  0
A          1  0
À          1  0
Á          1  0
          1  0
à         1  0
Ä          1  0
Å          1  0
B          8  0
C          9  0
Ç          9  0
…           …
x         50  1
y         51  1
ý         51  1
z         53  1
          14 54

   gp← (U C)∘{U C←⍺ ⋄ ⍋0 2 1⍉C[U⍳⍪⍵;]}

   gp x
4 1 8 9 5 6 7 2 3 0
   A ⍋ x
4 1 8 9 5 6 7 2 3 0

It makes sense that the number of columns in the coordinate matrix C is equal to the rank of the alphabet array A: The rank is the number of dimensions, a-z, upper/lower case, color, etc.; each row of C is a vector of the numeric value for each dimension.

With 20/20 hindsight, the above code can be seen as an argument against defining dyadic grade to do ordering with specified alphabets. After all,

   ⍺⍋⍵  ←→  ⍋0 2 1⍉C[U⍳⍪⍵;]

and specifying U and C directly makes the computation easier to understand, easier to use, and as it happens is faster than the primitive in the current implementation. The inverse calculation, from U C to the alphabet array A, is an amusing bit of code left as an exercise for the reader☺.

One can further argue that the current definition of dyadic grade precludes an alternative attractive but incompatible definition:

   ⍺⍋⍵  ←→  ⍺⌷⍨⊂⍋⍵

That is, order by the grade of , whence ⍋⍨ sorts. In Dyalog APL version 17.0, monadic grade is extended to work with a TAO (total array ordering). With a TAO and this alternative definition, ⍋⍨ sorts any array.

The present exposition exposes a difficulty with extending the current dyadic grade to work with TAO: It is axiomatic that monadic grade compares cells itemwise, stopping at the first pair of unequal items. Dyadic grade does not do that in general. For example, with an upper and lower case alphabet, you don’t stop comparing 'Rogerz' and 'rogers' on encountering 'R' and 'r'.

Linear Interpolation

⎕io=0 assumed throughout; works in 1-origin with the obvious modifications.

Introduction

On Wednesday, a question arrived via Dyalog Support from an intern in Africa: If M is the matrix on the left, use linear interpolation to compute the result on the right.

   1 20         1 20
   4 80         2 40
   6 82         3 60
                4 80
                5 81
                6 82

Linear Interpolation

Two points (x0,y0) and (x1,y1) specify a line; for any x there is a unique y on that line (assuming x0≠x1). The equation for the line derives as follows, starting from its slope m:

   m = (y1-y0) ÷ (x1-x0)
   (y-y0) = m × (x-x0)
   y = y0 + m × (x-x0)

Therefore, if is a 2-by-2 matrix of the two points and are the x-values to be interpolated, then:

   g ← {(⊃⌽⍺)+(⍵-⊃⍺)÷÷/-⌿⍺}

   ⊢ M←1 4 6,⍪20 80 82
1 20
4 80
6 82

   M[0 1;] g 2 3
40 60
   M[1 2;] g 5
81

A New Twist, A New Solution

The problem as posed implicitly required that:

  • The x-values are the positive integers bounded by ⊃⊖M.
  • Appropriate rows of the matrix are selected for a given x-value.
  • The missing x-values and their interpolations are “slotted back” into the argument matrix.

These requirements are best met by , interval index, a relatively new primitive function introduced in Dyalog APL version 16.0. The left argument must be sorted and partitions the universe into disjoint contiguous intervals; ⍺⍸⍵ finds the index of the interval which contains an item of . The result is ⎕io dependent.

For the given matrix M, the partition (of the real numbers in this case) is depicted below. As in conventional mathematical notation, [ denotes that the interval includes the left end-point and ) denotes that the interval excludes the right end-point.

          1        4      6
─────────)[───────)[─────)[──────────
     ¯1       0       1       2

   v←¯5 0 1 2.5 6 3 4 5 9 8 7

   1 4 6 ⍸ v
¯1 ¯1 0 0 2 0 1 1 2 2 2

   v ,[¯0.5] 1 4 6 ⍸ v
¯5  0 1 2.5 6 3 4 5 9 8 7
¯1 ¯1 0 0   2 0 1 1 2 2 2

With in hand, the problem can be solved as follows:

interpol←{
  (x y)←↓⍉⍵
  m←m,⊃⌽m←(2-/y)÷(2-/x)
  j←0⌈x⍸i←1+⍳⊃⌽x
  i,⍪y[j]+m[j]×i-x[j]
}

   interpol M
1 20
2 40
3 60
4 80
5 81
6 82

The problem of x-values less than the first end-point is finessed by applying 0⌈ to the interval indices, and that of x-values greater than or equal to the last end-point is finessed by repeating the last slope m←m,⊃⌽m.

It is possible to do the interpolation only on the missing indices (2 3 5 in this case) and insert them into the argument matrix. It seems neater to simply interpolate everything, and in so doing provide a check that the interpolated values equal the values given in the argument.

An Alternative Interpolation

Interpolating according to two selected rows of a matrix of points treats the function as piecewise linear, with sharp inflection points where the lines join (different slopes between adjacent lines). A “holistic” alternative approach is possible: the matrix can be interpreted as specifying a single line and the interpolation is according to this single line. The primitive function computes the coefficients of the line which best fits the points:

   ⎕rl←7*5  ⍝ for reproducible random numbers

   ⊢ M←t,⍪(?7⍴5)+¯17+3×t←?7⍴100
35  89
98 278
19  44
 4  ¯5
62 170
49 133
25  59

   M[;1] ⌹ 1,M[;,0]    ⍝ y-intercept and slope
¯15.3164 2.99731

   interpola ← {(1,⍤0⊢⍵)+.×⍺[;1]⌹1,⍺[;,0]}

   M[;1] ,[¯0.5] M interpola M[;0]
89      278    44      ¯5       170     133     59     
89.5895 278.42 41.6325 ¯3.32713 170.517 131.552 59.6164

   M interpola 33 35 37 39.7
83.5949 89.5895 95.5841 103.677

Finally

Our best wishes to the intern. Welcome to APL!

Phil Goacher (05-11-40 – 09-03-18)

Phil Goacher
I attended Phil Goacher’s funeral yesterday. He probably had no idea of just how much he would influence the development of APL. I first met Phil at W S Atkins which is a large UK engineering consultancy. There was a sub unit called Atkins Computing which provided internal computing support to the engineers and, crucially, provided a time sharing service to external users. He became my manager when I moved from supporting the transportation program suite to the brand new APL team.

Phil was more manager and business man than programmer but he had an engineering background in Gas distribution and thus the mathematical bias that seems common to APL users.

Phil was the sort of man with an eye for a business opportunity and when a advert appeared from someone who wanted to set up an APL consultancy with a limited objective to provide resources to Rank Xerox (UK arm of Xerox Corp) he turned up along with half his team. Phil was a wheeler dealer type and had already talked to the recruiter, persuaded him that he could manage the team, and then interviewed the candidates. So I wound up being interviewed by my existing manager. Following this Phil decided we only needed a salesman and we could set the consultancy up with a much wider remit. Phil recruited Ted Hare who was one of Atkin’s top salesmen and Dyadic Systems was born.

Phil did all of the necessary legal and administration to get Dyadic off the ground and became the company’s administration guy. He bought an Apple 2 with VisiCalc so was early into the spreadsheet concept. It must be said that all five of the people who started Dyadic were “early adopters” by nature.

Dyadic was quite successful as an APL consultancy but wanted to do more. I am not sure who instigated it (I was too busy consulting) but it was probably Phil’s idea to develop an APL of our own. Knowing Phil this was probably a business eye and a request from Pam Geisler who was Pauline Brand’s sister (Pauline was an early recruit that we “acquired” from Atkin’s) but, more importantly, high up in the management of Zilog. Zilog was developing a new 16 bit chip, Z8000, that was going to push them up market from the 8-bit chip with which they had had enormous success – Z80. IBM had been pushing APL as the future, and it was incumbent on competitors to provide an APL offering.

Phil and Dave Crossley will have been the instigators of recruiting John Scholes, who had also been at Atkin’s, and teaming him with me to develop Dyalog (Dyadic + Zilog). Dave provided the management overview of that project while Phil and Ted had to worry about financing John and I, who were not bringing in consultancy money. It was at this point that Dyalog APL became a UNIX and C project. It was much later when I read Brian Kernighan’s obituary that I realised just how early we had got into UNIX and C.

Dyalog APL was not the instant success that Dyadic hoped for. It had increased costs. We needed an office and the paraphernalia that goes with selling a product rather than brains. Dyadic ran into financial issues and was taken over by Lynwood Scientific, who made terminals with Zilog Z8000 chips inside. I think they really wanted the Z8000 expertise that John and I had acquired. As part of that takeover Phil, Ted and Dave had to leave with a derisory pay-off and I lost touch until I learnt of his death a fortnight ago.

From the perspective of APL language development Phil is not a tower of strength. However, from the perspective of having provided the impetus and drive to get Dyalog APL developed, he is very important.

NOTE: For anyone interested in reading more about the early days of APL, see the Vector special on the first 25 years of Dyalog Ltd.

Permuting Internal Letters

Friday Afternoon

It’s something of a custom in Dyalog to send a “fun” e-mail to the group on Friday afternoons. My gambit for this past Friday was:

   x ←' according to research it doesn''t matter'
   x,←' what order the letters in a word are'
   x,←' the human mind can still read it'
   x,←' the only important thing is that'
   x,←' the first and the last letters are in the right place'

   ∊ ' ',¨ {(1↑⍵),({⍵[?⍨≢⍵]}1↓¯1↓⍵),(-1<≢⍵)↑⍵}¨ (' '∘≠ ⊆ ⊢) x
 aroidnccg to reasrech it dneso't mttear waht order the ltreets 
      in a wrod are the hmaun mnid can sltil read it the olny 
      imartpnot tihng is that the fsrit and the lsat lterets 
      are in the rhigt palce

(The research: http://www.mrc-cbu.cam.ac.uk/people/matt.davis/Cmabrigde/rawlinson/)

Code Golfing

The code golfers were not long in responding:

   ∊{' ',⍵[1,(1+?⍨0⌈¯2+n),n/⍨1<n←≢⍵]}¨(' '∘≠⊆⊢)x     ⍝ fa
   ∊{' ',⍵[∪1,{⍵,⍨?⍨⍵-1}≢⍵]}¨(' '∘≠⊆⊢)x              ⍝ fb
   '\S+'⎕r{{⍵[∪1,{⍵,⍨?⍨⍵-1}≢⍵]}⍵.Match}x             ⍝ fc
   ∊{' ',⍵[n↑1,(1+?⍨0⌈¯2+n),n←≢⍵]}¨(' '∘≠⊆⊢)x        ⍝ fd

In defense of the original expression, it can be said that it follows the pattern:

  • cut into words            (' '∘≠ ⊆ ⊢) x  also ' '(≠⊆⊢)x
  • permute each word      {(1↑⍵),({⍵[?⍨≢⍵]}1↓¯1↓⍵),(-1<≢⍵)↑⍵}¨
  • undo the cut             ∊ ' ',¨

That is, the pattern is “permute under cut”. Moving the ' ', into the permuting function, while shortening the overall expression, obscures this pattern.

Golfing for Speed

One of the code golfers was undaunted by the highfalutin’ functional programming argument. Changing tack, he claimed to be golfing for speed (while himself travelling at 900 kph):

fe←{
  p←1+0,⍸⍵=' '           ⍝ start of each word
  n←0⌈¯3+¯2-/p,2+≢⍵      ⍝ sizes of windows to be shuffled
  ⍵[∊p+?⍨¨n]@((⍳¯1+≢⍵)~p∘.-0 1 2)⊢⍵
}

   cmpx 'fe x' 'fb x' 'fc x' 'fd x'
  fe x → 4.27E¯5 |    0% ⎕⎕⎕⎕
* fb x → 1.05E¯4 | +145% ⎕⎕⎕⎕⎕⎕⎕⎕⎕
* fc x → 3.39E¯4 | +692% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* fd x → 1.11E¯4 | +159% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

The question was then posed whether the algorithm can be “flattened” further; that is, whether the expression ∊p+?⍨¨n in function fe can be done without each. The answer is of course yes (else I wouldn’t be writing this blog post :-). Flattening, or avoiding the creation of nested arrays, has the potential to reduce memory consumption and increase efficiency, because there is more potential for the interpreter to perform optimized sequential or even parallel operations.

Partitioned Random Permutations

In the old days, before general arrays, before the each and rank operators, ingenious techniques were devised for working with “partitioned” arrays: A boolean vector with a leading 1 specifies a partition on a corresponding array with the same length, basically what we can now do with or similar facility. A detailed description can be found in Bob Smith’s APL79 paper A Programming Technique for Non-Rectangular Data.

   p←1 0 0 1 1 0 0 0 0 0
   v←3 1 4 1 5 9 2 6 53 58

   p⊂v
┌─────┬─┬─────────────┐
│3 1 4│1│5 9 2 6 53 58│
└─────┴─┴─────────────┘

   +/¨ p⊂v
8 1 133
   t-¯1↓0,t←(1⌽p)/+\v
8 1 133

   ∊ +\¨ p⊂v
3 4 8 1 5 14 16 22 75 133
   s-(t-¯1↓0,t←(1⌽p)/⍳⍴p)/¯1↓0,(1⌽p)/s←+\v
3 4 8 1 5 14 16 22 75 133

The last two sets of expressions illustrate how “partitioned plus reduce” and “partitioned plus scan” can be computed, without use of general arrays and without each.

At issue is how to do “partitioned random permute”. Answer: ⍋(n?n)+n×+\p ⊣ n←≢p.

p                1  0  0    1    1  0  0  0  0  0
n?n              5  4 10    6    2  9  8  1  3  7
n×+\p           10 10 10   20   30 30 30 30 30 30
(n?n)+n×+\p     15 14 20   26   32 39 38 31 33 37
⍋(n?n)+n×+\p     2  1  3    4    8  5  9 10  7  6

Therefore:

ff←{
  b←~(⊢ ∨ 1∘⌽ ∨ ¯1∘⌽)' '=⍵  ⍝ internal letters
  n←≢p←b/b>¯1⌽b             ⍝ partition vector for groups of same
  (b/⍵)[⍋(n?n)+n×+\p]@{b}⍵
}

   cmpx 'ff x' 'fe x'
  ff x → 1.46E¯5 |    0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* fe x → 4.19E¯5 | +187% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

   x6←1e6⍴x

   cmpx 'ff x6' 'fe x6'
  ff x6 → 5.16E¯2 |    0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕
* fe x6 → 1.77E¯1 | +243% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

There is a puzzle from 1980 which similarly involved randomly permuting items within groups. See A History of APL in 50 Functions, Chapter 47, Pyramigram. The puzzle solution is amusing also for that fact that it used -\.

Afterword

Plobssiy, a txet wshoe words are ilivuddinaly ptemrued is rdeaable mnaliy bseacue in odirrany txet many wrods, eelpclsaiy iptnroamt wrods, are sorht. A curops wtih long and ufnmiailar wdors wuold lkiely be hard to raed atfer scuh pittarmoeun. For eapxmle:

   ff t
 eyonelsmeary dpihnoeissopt siiuqpeedlasan pniormoasaac oopimoentaoa
   ff t
 emlresaneyoy dnpepoihissot seilesiqaadupn poimrnaaasoc oaioomtpneoa
   ff t
 erloymneseay dhsiepopiosnt seildspqiauean praisoaaonmc oomatneiopoa

Stackless Traversal

Enlist (∊) is twice as fast in Dyalog 16.0 as it was in Dyalog 15.0. Pretty much across the board: ∊⍳100 is not going to be any faster, but whenever the argument is a nested array and the simple arrays it contains are reasonably small, there are huge performance improvements. How did we achieve the huge speedup?

Constraints

The usual way for a C programmer to write the traversal used in Enlist would be a simple recursive function: If the current array is simple, handle it, and if it is nested, traverse on each array it contains. We can’t do this, though, because it would break our product. The specific problem is that the C stack used for function recursion has a limited (and fairly small) size, while the depth of nesting in an array is limited only by available memory. When passed a tree a few thousand layers deep, an Enlist implemented using the C stack would run out of space, then attempt to write past the end of the stack, resulting in a segfault and a syserror 999.
So keeping memory on the stack is out. The only place we can store an arbitrary amount of memory is the workspace itself. But storing things in the workspace is harder than it sounds. When Dyalog fails to do a memory allocation, it tries to get more space by compacting pockets of memory and squeezing arrays. This happens rarely, but any operation that allocates memory has to take it into account. That means it has to register all of the memory it uses so it can find it again when memory is shuffled around.

The old approach: trav()

Dyalog has a general function for traversing arrays, which will take care of all of the problems above. It’s very flexible, but in enlist it’s used for a single purpose: to call a function on each simple array (“leaf”) in a nested array. The old Enlist used it twice, first to calculate the type and length of the final result, and again to move all of the elements into a result array after it’s allocated.
Trav works very well, but is also very slow. It stores a lot of information, and constantly allocates memory to do so. It also is much more general than it needs to be for enlist, so it spends a lot of time checking for options we never use. We can do better.

A different strategy

Rather than allocating a bunch of APL pockets to emulate the C stack, what if we just didn’t use the stack at all? Sounds kind of difficult, since we definitely need some sort of stack to remember where we are during traversal. Once we finish counting or copying the last leaf in a branch, we have to find our way back up to the pocket we started at. This pocket could be anywhere from one to a billion levels above us, and we have to remember the location of every pocket in between. And once we return to a pocket, we have to remember how many children we have already traversed, so we can move on to the next one. Where do we store all of this information?

Fun with pointers

Suppose we are in the middle of traversing a nested array p. An array in Dyalog is a pointer to a pocket of memory, which contains a header followed by some other pocket pointers p[0], p[1], and so on. We consider the pocket pointers themselves, rather than the data that they point to, to be arrays since pointers can be stored and manipulated directly and pockets cannot. If the rank of p is more than one, p[i] refers to the i’th array in ,p.
fig0
Let’s assume we have gone through the first array p[0] in p and now want to begin traversing p[1], which we rename q. In order to move on to p[2] after we finish with q we will need some stored information—a stack-based algorithm might push p and the current index 1 onto the stack. A variable-size stack is needed rather than some constant amount of memory because this information has to be stored at every level of the arbitrarily-deep traversal. We don’t want to use such a stack, but there is a redundant value we can use: the pointer p[1]=q stored in p. When we finish traversing q and come back to p, we know the address of q, since we just traversed it! So we can safely overwrite that address. But all of the other nearby pointers are in use. It would be very difficult to find space for another word. So instead of storing the location of p, we store a slightly different address: ptr, which points to p[1] (we’ll discuss how to recover the pointer p from ptr later). It won’t help to store ptr in p, since it would just point to its own location! Instead we use a temporary variable prev to hold onto it while we traverse q, and store the previous value of prev, which points to a location inside p’s parent, in the location that used to hold q. By the time we begin traversing q’s first child, r, the picture looks like this:
fig1
We have reversed all of the pointers above us, and now we have a trail leading all the way back up to the array we started at. To do this we needed an extra temporary value, prev. This uses more memory, but only a constant amount (one word), so it won’t cause problems for us. However, it might be helpful to consider why we need that extra word. In a stack-based traversal only one temporary pointer would be needed to iterate over the array, and in fact we do get away with only writing one word of data in each array during traversal. But the first array doesn’t have any parent array, so there is no pointer to write in it. Instead we write a zero (which is not a valid pointer) to indicate that fact. This zero is an extra word that doesn’t correspond to any array, so to compensate for it we use an extra temporary variable.

When to stop

When we finish traversing a pocket, we end up with another problem. We don’t know we’re done! Following the algorithm as described so far, we would end up with ptr pointing at the last element of q, which looks like all of the others, and we would keep moving past the end of q, landing on the header of the next pocket, which probably isn’t even a pointer. Segfault—do not pass go; do not collect the elements of the original array into a list.
fig2

A typical algorithm would just store the length of the array on the stack, which we can’t do. However, there is a little more space hiding in the array q which we can use. A pointer gives the location of a particular byte in memory. But the pointers in q all point to the beginning of pockets, which are word-aligned: they begin at addresses which are all multiples of 4 bytes (in 32-bit builds) or 8 bytes (64-bit builds). So we have at least two bits at the bottom of each pointer which are guaranteed to be zero. As long as we remember to clear those bits once we finish, and before following a pointer, we can write anything we want in them!
The first thing to write is a stop bit. We mark the last pointer q with some two bits which aren’t zero (we’ll decide what later). While shuffling pointers around, we always leave the bottom two bits where they were, so that when we get back to q[n] we see the two bits we wrote when we started traversing q. If we haven’t reached the end, that could be anything other than the stop code, and if we have reached the end, it must be the stop code. So now we know when to stop, and won’t segfault by running past the end of a pocket.
However, in order to get back to the parent array with all of our state intact (specifically, the overwritten pointer to the current array), we need to know the address of the beginning of our pocket, not the end. And the shape of the pocket, which we would need to find the pocket’s length, is kept at the beginning of the array as well. How do we get back to the start?

Retracing our steps

Obviously we need to start moving backwards. But once again, we need to know when to stop. Since we’re about to run into it face-first, maybe we should discuss what comes before the pointers in a pocket?
fig3

The contents of an array pocket are pretty simple. There are three words at the beginning: the length and reference count of the pocket (which we don’t need to worry about), and the zones field, which contains the type and rank of the pocket along with potentially some other data. Next is the shape (which might be empty!), and then the data in the array—for a nested pocket, pointers to other pockets. There are three cases we want to consider:

  1. The shape is empty, which means q contains only one pointer. We will run into the zones field right before the last (and first) pointer.
    fig4
  2. The shape is not empty, but the total number of pointers is small. We will run into the shape after a short amount of backwards searching.
  3. The total number of pointers is large. It would take a while to get back to the beginning one pointer at a time, but we also have a lot of spare bits at the bottom of those pointers…
  4. (Extra credit) The pocket is empty, but it still has a prototype to traverse. For enlist we ignore these pockets, but for other purposes we have to handle this case. The shape has to have a zero in it, but the other axes could have any size that fits in a word.

For the first two cases, we mark the shape and the zones field, respectively. By a lucky coincidence, the bottom two bits of the zones field are used to store the type of the pocket we are in, and for nested arrays, the two bits we care about are always both 1. So let’s make 3 (in binary, 11) the indicator for the zones field. In case 2, the last item of the shape is small, so there are free bits at the top. We’ll shift the shape up to move those to the bottom, leaving room to store the rank and a marker. We’ll choose 1 for the marker.
In case 3, there are enough pointers to store the length of the array in the bottom bits. Since we can’t use the stop marker to encode the length, we opt to just use the bottom bit, that is, storing zeros and ones in the bottom two bits, of each pointer. To distinguish from case 2 we mark the beginning of this offset with a 2, the only unused code so far. We mark the end of the offset with a 2 as well, and then to read the offset we just move backwards, adding each bit to the end of an integer, until we reach that 2. Then we use the length to get back to the beginning of an array. While moving backwards, we also clear those bits to leave the pointers pointing at the exact, byte aligned start of pockets like they did when we found them.
Case 4 is best left as an exercise, I think. There’s a whole empty word in the shape for you to use—what could be easier? Granted, it’s stashed away in some random location in the shape, but that shouldn’t be an obstacle.

Conditions of use

The traversal described above is much faster than trav, but it requires more care to use. Since it scribbles notes on parts of arrays during traversal, the workspace isn’t necessarily in a valid state until it finishes. So it’s not safe to call memory management functions during traversal. Any memory used needs to be allocated before the start of traversal, which means it has to have a constant size: we can’t use anything that will grow as needed.
Enlist is made up of two very simple functions which meet this requirements in most cases: the first goes through and records what types are in the array and how many total elements there are, and the second actually moves the elements into a result array. For the counting function, we just need a little extra storage to fit the type and the count. That’s a constant amount, so it’s fine. For the moving function, we can’t promote an element of a non-mixed array to an element of a mixed array, since this requires us to allocate a new array for it. But this is not a very common case, and when it does happen it will be slow anyway. This means it’s okay to use trav for that case, and use the faster traversal otherwise.

Results

Here are timings showing how long it takes to enlist various arrays, and the improvement from version 15.0 to version 16.0. Timings are measured on a modern CPU, an Intel Kaby Lake i7, but I don’t expect the ratios to change very much on other machines.

APL expression Description 15.0 16.0 Ratio
{(?⍵⍴10){⊂⍵}⌸?⍵⍴99} 1e3 1e3 bytes in 10 groups 6.138E¯7 2.898E¯7 2.118
{(?⍵⍴10){⊂⍵}⌸?⍵⍴99} 1e4 1e4 bytes in 10 groups 1.174E¯6 9.439E¯7 1.244
{(?⍵⍴10){⊂⍵}⌸?⍵⍴99} 1e5 1e5 bytes in 10 groups 8.799E¯6 8.168E¯6 1.077
{⍵⊂⍨1,1↓1=?(≢⍵)⍴2}⍣10?1e4⍴99 1e4 bytes nested 10 deep 8.540E¯4 3.464E¯4 2.466
⊂⍣1e3⊢2 3 A 1000-deep nesting 1.032E¯4 1.533E¯5 6.730
,⍨∘⊂⍣20⊢2 3 Duplicated references 1.648E¯1 3.595E¯2 4.584

The first row is probably the most typical case: a single array with a little bit of nesting, but fairly small leaves. In this case, we can expect to see about a factor of two improvement. As the leaves grow larger, the running time becomes dominated by the time to actually move data to the result, and the improvement falls off, dropping to 25% around 1000 bytes per leaf and 0.8% at 1e4 bytes per leaf. The new code is faster even at very small numbers of leaves. In fact, it takes less time to start up an array traversal than the old method, so it’s many times faster on very deeply nested arrays with few elements in each array. We can see from the final row that the new method has no issues when it encounters the same array many times in one traversal.
The same algorithm is also used for hashing arrays in Dyalog 16.0, which in turn is used for dyadic iota and related functions. The impact on individual operations is not quite as much as the improvements to enlist shown above, but they are still substantial, and being able to search for a few strings in a list more quickly will surely help a lot of applications.