Shuffle Faster Please!

Andy reported that in the shuffle QA some functions take a long time:

m9249   “4½ days so far”
rankop  21.5 hours
m12466  26.3 hours
points  7.2 hours

Background: Shuffle QA is an intensification of the normal QA. The suite of over 1800 QA functions is run under shuffle, whereby every getspace (memory allocation) is followed by every APL array in the workspace being shifted (“shuffled”) and by a simple workspace integrity check. The purpose is to uncover memory reads/writes which are rendered unsafe by events which in normal operations are rare.

m9249

m9249 tests +\[a], ×\[a], ⌈\[a], and ⌊\[a]. It ran in under 3 seconds in normal operations; for it to take more than 4½ days under shuffle, getspace is clearly the dominant resource and hereafter we focus on that.

m9249 used expressions like:

assert (+⍀x) ≡ {⍺+⍵}⍀x

Since {⍺+⍵} is not recognized by the scan operator and therefore not supported by special code, {⍺+⍵}⍀x is done by the general O(n*2) algorithm for scans, AND on each scalar the operand {⍺+⍵} is invoked and a getspace is done. The improved, faster QA does this:

F ← {1≥≢⍵:⍵ ⋄ x ⍪ (¯1↑⍵) ⍺⍺⍨ ¯1↑x←⍺⍺ ∇∇ ¯1↓⍵}
assert (+⍀x) ≡ +F x

With such a change, the faster m9249 runs in 22 minutes under shuffle instead of over 4½ days, without any reduction in coverage.

The number of getspace are as follows: +⍀x takes O(1). The second expression {⍺+⍵}⍀x does ×\1↓⍴x separate scans, and each one takes O((≢x)*2) getspace; the total is therefore O(((≢x)*2)××/1↓⍴x) ←→ O((≢x)××/⍴x). The third expression +F is linear in ≢x and is therefore O(≢x) in the number of getspace. In m9249 the largest x has size 33 5 7. The following table shows the orders of complexity and what they imply for this largest x:

+⍀x O(1) 1
{⍺+⍵}⍀x O((≢x)××/⍴x) 38115 = 33 × ×/ 33 5 7
+F x O(≢x) 33

The ratio between 38115 and 33 goes a long way towards explaining why the time to run m9249 under shuffle was over 4½ days and is now 22 minutes. (Why does m9249 still take 22 minutes? It tests for the four functions + × ⌈ ⌊, for various axes, for different datatypes, and for different axis lengths.)

rankop

Another shuffle sluggard rankop took 47 hours. rankop tests expressions of the form f⍤r⊢yand x f⍤r⊢y for various functions f. The techniques used for reducing the number of getspace differ from function to function. We look at x↑⍤r⊢y as an illustration.

assert (4↑⍤1⊢y) ≡ 4{⍺↑⍵}⍤1⊢y

⍴y was 2 7 1 8 2 8. The expression took 7.6e¯5 seconds normally and 7.5 seconds under shuffle. The left hand side requires O(1) getspace, the right hand side O(×/¯1↓⍴x). The expression was changed to:

assert (4↑⍤1⊢y) ≡ 2 7 1 8 2 4↑y

Feels like cheating, doesn’t it? :-) The new expression takes 0.115 seconds under shuffle.

m12466

m12466 tests n⌈/[a]x (and n⌊/[a]x). Previously, it did this by comparing n⌈/[a]x against n{⍺⌈⍵}/[a]x for x with shape 7 17 13 11, for each of the four axes, for two different values of n, and for the 1-, 2-, 4-, and 8-byte numeric datatypes. It required 5238426 getspace and 26.3 hours to run under shuffle.

F←{p⍉⍺⍺⌿(⊂(⎕io-⍨⍳|⍺)∘.+⍳(1-|⍺)+(⍴⍵)[⍵⍵])⌷(⍋p←⍒⍵⍵=⍳⍴⍴⍵)⍉⍵}

F is a dyadic operator. The left operand ⍺⍺ is or ; the right operand ⍵⍵ is the axis; and n(⌈ F a)x is the same as n⌈/[a]x. n(⌈ F a)x mainly does n⌈⌿x; other axes are handled by transposing the axis of interest to be the first, then doing n⌈⌿x, then transposing the axes back into the required order. n(⌈ F a)x is much more complicated than n{⍺⌈⍵}/[a]x but has the advantage of using only O(1) getspace.

The revamped m12466 function uses 13717 getspace and 2 minutes to run under shuffle.

points

points is a function to test monadic format () for different values of ⎕pp.

points←{⍺←17 ⋄ ⎕pp←⍺
  tt←{(⌽∨\⌽⍵≠⍺)/⍵}                   ⍝ trim trailing ⍺s.    
  ~0∊∘.{                             ⍝ all pairs (⍳⍵)       
    num←'0'tt↑,/⍕¨⍺'.'⍵              ⍝ eg '9.3'             
    num≡⍕⍎num                        ⍝ check round trip     
  }⍨⍳⍵
}

The expression is 14 15 16 17 points¨ 100, meaning that for each of the 100 numbers num as text,

  1.1   1.2   1.3   1.4   1.5     1.100
  2.1   2.2   2.3   2.4   2.5     2.100
  3.1   3.2   3.3   3.4   3.5 ... 3.100
...
 99.1  99.2  99.3  99.4  99.5    99.100
100.1 100.2 100.3 100.4 100.5   100.100

a “round-trip” num≡⍕⍎num is evaluated. The expression requires 1.3 million getspace and 7.2 hours to run under shuffle.

A much more efficient version with respect to getspace is possible. Let x be a vector. ⍕¨x requires O(≢x) getspace; ⍕x requires O(1) getspace and, like ⍕¨x, formats each individual element of x independently.

points1←{
  ⎕io←⎕ml←1                                                 
  tt←{(⌽∨\⌽⍵≠⍺)/⍵}                   ⍝ trim trailing ⍺s     
  x←1↓∊(' ',¨s,¨'.')∘.,'0'tt¨s←⍕¨⍳⍵  ⍝ numbers as text      
  y←⍎x                                                      
  {⎕pp←⍵ ⋄ x≡⍕y}¨⍺                   ⍝ check round trip     
}

14 15 16 17 points1 100 requires 11050 getspace and less than 2 minutes to run under shuffle.

What to do?

Andy says that the shuffle QA takes over 2000 hours (!). The shuffle sluggards will be tackled as done here by using more parsimonious APL expressions to compare against the aspect being QA-ed. Going forward, a new QA function will be run under shuffle as it is being developed. The developers will be unlikely to tolerate a running time of 4½ days.

Loops, Folds and Tuples

For-loops
Given an initial state defined by a number of variables, a for-loop iterates through its argument array modifying the state.

    A←... ⋄ B←... ⋄ C←...       ⍝ initial state
    :For item :In items         ⍝ iterating through array "items"
        A←A ... item            ⍝ new value for A depending on item
        C←C ... A ... item      ⍝ new value for C depending on A and item
        ...                     ⍝ state updated
    :EndFor
    A B C                       ⍝ final state

In the above example the state comprises just three variables. In general, it may be arbitrarily complex, as can the interactions between its various components within the body of the loop.

Dfns
Dfns don’t have for-loops. Instead, we can use reduction (or “fold”) with an accumulating vector “tuple” of values representing the state. Here is the D-equivalent of the above code:

    ⊃{                      ⍝ next item is ⍺
        (A B C)←⍵           ⍝ named items of tuple ⍵
        A∆←A ... ⍺          ⍝ new value for A depending on item ⍺
        C∆←C ... A∆ ... ⍺   ⍝ new value for C depending on A∆ and item ⍺
        ...                 ⍝ ...
        A∆ B C∆             ⍝ "successor" tuple (A and C changed)
    }/(⌽items),⊂A B C       ⍝ for each item and initial state tuple A B C

In this coding, the accumulating tuple arrives as the right argument (⍵) of the operand function, with the next “loop item” on the left (⍺). Notice how the items vector is reversed (⌽items) so that the items arrive in index-order in the right-to-left reduction.

If you prefer your accumulator to be on the left, you can replace the primitive reduction operator (/) with Phil Last’s foldl operator, which also presents its loop items in index-order:

    foldl←{⊃⍺⍺⍨/(⌽⍵),⊂⍺}    ⍝ fold left

then:

    A B C {                 ⍝ final state from initial state (left argument)
        (A B C)←⍺           ⍝ named items of tuple ⍺
        A∆←A ... ⍵          ⍝ new value for A depending on item ⍵
        C∆←C ... A∆ ... ⍵   ⍝ new value for C depending on A∆ and item ⍵
        A∆ B C∆             ⍝ successor tuple (A and C changed)
    } foldl items

If the number of elements in the state tuple is large, it can become unwieldy to name each on entry and exit to the operand function. In this case it simplifies the code to name the indices of the tuple vector, together with an at operator to denote the items of its successor:

    T ← (...) (...) ...         ⍝ intitial state tuple T
    A B C D ... ← ⍳⍴T           ⍝ A B C D ... are tuple item "names"

    T {                         ⍝ final state from initial state (left argument)
        A∆←(A⊃⍺) ... ⍵          ⍝ new value for A depending on item ⍵
        C∆←(C⊃⍺) ... A∆ ... ⍵   ⍝ new value for C depending on A∆ and item ⍵
        A∆ C∆(⊣at A C)⍺         ⍝ successor tuple (A and C changed)
    } foldl items

where:

    at←{A⊣A[⍵⍵]←⍺ ⍺⍺(A←⍵)[⍵⍵]}   ⍝ (⍺ ⍺⍺ ⍵) at ⍵⍵ in ⍵

There is some discussion about providing a primitive operator @ for at in Dyalog V16.

Examples
Function kk illustrates naming tuple elements (S E K Q) at the start of the operand function.
Function scc accesses tuple elements using named indices (C L X x S) and an at operator.

Tuples: a postscript
We might define a “tuple” in APL as a vector in which we think of each item as having a name, rather than an index position.

    bob         ⍝ Tuple: name gender age
┌───┬─┬──┐
│Bob│M│39│
└───┴─┴──┘

    folk        ⍝ Vector of tuples
┌──────────┬────────────┬──────────┬────────────┬─
│┌───┬─┬──┐│┌─────┬─┬──┐│┌───┬─┬──┐│┌─────┬─┬──┐│
││Bob│M│39│││Carol│F│31│││Ted│M│31│││Alice│F│32││ ...
│└───┴─┴──┘│└─────┴─┴──┘│└───┴─┴──┘│└─────┴─┴──┘│
└──────────┴────────────┴──────────┴────────────┴─

    ↓⍉↑ folk    ⍝ Tuple of vectors
┌─────────────────────────┬────────┬───────────────┐
│┌───┬─────┬───┬─────┬─   │MFMF ...│39 31 31 32 ...│
││Bob│Carol│Ted│Alice│ ...│        │               │
│└───┴─────┴───┴─────┴─   │        │               │
└─────────────────────────┴────────┴───────────────┘

    ⍪∘↑¨ ↓⍉↑ folk   ⍝ Tuple of matrices
┌─────┬─┬──┐
│Bob  │M│39│
│Carol│F│31│
│Ted  │M│31│
│Alice│F│32│
 ...   . ..
└─────┴─┴──┘

APLers sometimes talk about “inverted files”. In this sense, a “regular” file is a vector-of-tuples and an inverted file (or more recently: “column store”) is a tuple-of-vectors (or matrices).