Primitive Performance
Roger Hui



14.0 v 13.2

x+.×y   2.5
≤\b 1.2-1.4

bit mover 1.3-23
b⍳0  b⍳1 780
n↑b 6-8
+\[a]b 1.5-28
+/[a]b 1-8
+/∧\b 18-36
b=c 1.2
x∊y 1-2

x(⍳∘1 ≥)y 1.3-∞
x(∧/≥)y  x∧.≥y 2-∞
x(1∊≥)y 2-∞
x(+/≥)y 1.4
x≡⍤1⊢y 7-∞
{≢⍵}⌸y 2.5-3
{⍺(≢⍵)}⌸y 1.5

⌊0.5+y  ⌈y  ⌊y 1.2-1.6
↑[⎕io-0.5]y 2-11
x[i;j;…] 1-1.17
b/[a]y 2-17
+\[a]y  ⌈\  ⌊\ 2-22
v[i]+←1 2.8

Dyalog v14.0 has a number of performance improvements. There are three underlying themes that bring order to the list of individual improvements. They are: Visual Studio 2012, boolean functions, and new constructs.



Visual Studio 2012

x+.×y   2.5
≤\b 1.2-1.4

• auto-vectorization
• auto-parallelization

These speed-ups are noteworthy because they did not require any changes to the C source, but obtained by compiling under Visual Studio 2012.

Two features of VS2012 are relevant: auto-vectorization uses vector registers and vector instructions. This is particulary effective for +.× because of the non-traditional way Dyalog coded +.× . The other feature, auto-parallelization, uses multi-cores, again without any changes to the C source. We have not yet explored this second feature.

≤\b is listed not because it is an important function, but because we happened to find that it has been sped up. We have just began exploring VS2012, including verifying that there are no significant slow-downs. So far it looks good.



Booleans

bit mover 1.3-23
b⍳0  b⍳1 780
n↑b 6-8
+\[a]b 1.5-28
+/[a]b 1-8
+/∧\b 18-36
b=c 1.2
x∊y 1-2

The initial impetus for this work came in March 2012 when Morten was preparing to present the Game of Life at the Software Passions Summit conference in Göteborg, Sweden. We examined the performance of boolean functions, and some speed-ups were released in v13.2. Version 14.0 continues the program.

We now look at two boolean speed-ups in particular.



Booleans: Bit Mover

bit mover 1.3-23

x⍴b
x↑b
x↓b
bv⌿b
bv⍀b
b,c
b⍪c
x⊖b
b,←c
b⍪←c
b[x;]
(⊂x)⌷b
{⍵[⍋⍵;]}b
{⍵[⍒⍵;]}b
    b c boolean array
bv  boolean vector
x   arbitrary array

The interpreter has a function called bMb for moving bits. It is used in a number of primitives, and by speeding up bMb a number of primitives are now faster on boolean arrays.



Boolean: b⍳0 and b⍳1

b⍳0  b⍳1 780

(b⍳1)↓x   v   (∨\b)/x
(b⍳0)↑x   v   (∧\b)/x

∧\b ←→ (⍳n)<b⍳0  0 at or after the first 0
∨\b ←→ (⍳n)≥b⍳11 at or after the first 1
<\b ←→ (⍳n)=b⍳10 after the first 1

On 2012-05-08, Arthur Whitney wrote in the k mailing list that most APL boolean scans are not useful and that b⍳0 and b⍳1 are much better. This led me to do timings on them in Dyalog:

   b←(1e6⍴0),1
   10 timer 'b⍳1'
0.0195893
   c←~b
   10 timer 'c⍳0'
0.0186658

In J it takes roughly 0.00255 . Dyalog should beat J by a factor of about 8 because it has bit booleans whereas J has byte booleans, so b⍳1 in Dyalog can be faster by a factor of 0.019÷0.00255÷8 or about 60. We actually achieved a factor of 780.

Arthur did not elaborate on his comment, but the reasoning is probably as follows: ∨\b and b⍳1 contain the same information, but the former takes ⍴b bits whereas the latter takes 64 or 32 bits. As well, there are follow-on effects. For example, (b⍳1)↓x is more efficient than (∨\b)/x.

b⍳0 and b⍳1 are key computations in several boolean scans.

∧\  (⍳n)<b⍳00 at or after the first 0
∨\ (⍳n)≥b⍳11 at or after the first 1
<\ (⍳n)=b⍳10 after the first 1
≤\ (⍳n)≠b⍳01 after the first 0
>\ n⍴(j⍴1 0),(0≠2|j),n⍴0≠2|j←b⍳0  
≥\ n⍴(j⍴0 1),(1≠2|j),n⍴1≠2|j←b⍳1
⍱\ n⍴(j⍴0 1),(1≠2|j),n⍴0≠2|j←b⍳1
⍲\ n⍴(j⍴1 0),(0≠2|j),n⍴1≠2|j←b⍳0

Therefore, ∧\ ∨\ <\ , last sped up in 2008 (v12.1), are sped up yet again.



New Constructs

x(⍳∘1 ≥)y 1.3-∞
x(∧/≥)y  x∧.≥y 2-∞
x(1∊≥)y 2-∞
x(+/≥)y 1.4
x≡⍤1⊢y 7-∞
{≢⍵}⌸y 2.5-3
{⍺(≢⍵)}⌸y 1.5

Version 14.0 has the language extensions trains (forks), rank, key. They not only enhance the expressiveness of the language but are faster in some common and useful computations.



New Constructs: Index of Compare

x(⍳∘1 ≥)y 1.3-∞

  x (f g) y ←→ f (x g y)
∴ x (⍳∘1 ≥) y ←→ ⍳∘1 (x≥y) ←→ (x≥y)⍳1
 
In general, the pattern is x (⍳∘b comp) y
  • b is 0 1
  • comp is < ≤ = ≠ ≥ >
 
  • make idioms of {(⍺≥⍵)⍳1}, {(⍺≠⍵)⍳0}, …
  • index-of last occurrence:

The expression at the top of the slide computes the index of the first place where x is greater than or equal to y . If the hit happens near the beginning, you win big, potentially an “infinite” speed-up. Even if the hit happens near the end (or not at all), you still save on the cost of creating the boolean vector and then looking for the 1. (The factor of 1.3 is so small only because b⍳1 has been sped up.)

In general, you can look for 0 or 1, and the comparison can be any of the arithmetic comparisons.

Now that we see the pattern, a couple of bright ideas: First, we should make idioms of the equivalent Dfns. Second, wouldn’t it be nice if there is a primitive function,say, that finds the index of the last occurrence.



New Constructs: All/Any/Count of Compare

x(∧/≥)y  x∧.≥y 2-∞
x(1∊≥)y 2-∞
x(+/≥)y 1.4

Similarly,

  x (f g) y ←→ f (x g y)
∴ x (∧/≥) y ←→ ∧/(x≥y) ←→ ∧/x≥y (←→ x∧.≥y)
 
  x (a f g) y ←→ a f (x g y)
∴ x (1∊≥) y ←→ 1∊(x≥y) ←→ 1∊x≥y
 
In general, the patterns are x (f/ comp) y and x (b ∊ comp) y
  • f    is ∨ ∧ +
  • comp is < ≤ = ≠ ≥ >
  • b    is 0 or 1

The expression at the top of the slide answers the question, are all of x greater than or equal to y ? In general, the questions can be, are all, are any, and how many, and the comparison can be any of the arithmetic comparisons.

There is no “infinite” speed-up for +/ because there is no early exit from +/ . The factor of 1.4 is small because +/ on booleans, already phenomenally fast, has been sped up in v14.0.

After we saw the pattern, we thought to check that the equivalent inner products also have early exits. They did not. They do now.



New Constructs: Match with Rank

x≡⍤1⊢y 7-∞

Are corresponding rows of matrices x and y the same?

   x y ← ⊂⍤2 ⊢u[?2 2000⍴≢u←(⎕a,⎕d)[?10 29⍴36];]

   cmpx 'x≡⍤1⊢y' '∧/x=y'
x≡⍤1⊢y → 1.14E¯5 |    0% ⎕⎕⎕
∧/x=y  → 7.72E¯5 | +576% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
                         
   x y ← ⊂⍤2 ⊢(⎕a,⎕d)[?2 2000 100⍴36]

   cmpx 'x≡⍤1⊢y' '∧/x=y'
x≡⍤1⊢y → 1.01E¯5 |     0% ⎕
∧/x=y  → 2.47E¯4 | +2333% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

x≡⍤1⊢y is faster than ∧/x=y because match can exit early. This point is illustrated by using matrices with long random rows.



New Constructs: Tally with Key I

{≢⍵}⌸y 2.5-3

# of occurrences of unique elements of x←?1e6⍴1000

a.   ¯1+{≢⍵}⌸(⍳1000),x       1.00
b.   ¯1+{≢⍵}⌸(⍳1+⌈/x),x 1.62
c.   c ⊣ c[x]+←1 ⊣ c←1000⍴0 2.62
d.   t-¯1↓¯1,t←(s≠1↓s,¯1)/⍳⍴s 1.81
e.   -2-/¯1,(s≠1↓s,¯1)/⍳⍴s 1.81
f.   -2-/¯1,(s≠1↓s,¯1)/⍳⍴s←{⍵[⍋⍵]}x 2.93
g.   -2-/¯1,(s≠1↓s,¯1)/⍳⍴s←x[⍋x] 9.13
h.   +/(⍳1000)∘.=x 778.66

This is a problem encountered in July in a tuning project with a client.

a.  Fast solution using key. Prefacing the “universe” to x has the advantage of (0) generating the counts in a known order; and (1) including a count of 0 for elements of the universe that do not occur in the sample.
b.The same algorithm, but assumes that you don’t know the range.
c.Using modified indexed assignment v[x]+←1 whose performance has also been improved in v14.0. I believe it can be faster still because a. and c. should be equally fast.
d.About the best that you can do if you don’t use key or modified index assignment, but can assume that the sample is sorted.
e.A marginal gain by using pairwise reduction -2-/⍵ to find successive differences, perhaps suggesting that 2-/⍵ can be improved.
f.What the previous solution looks like if you can not assume that the sample is sorted. Note that for this x the interpreter is able to use a fast sort algorithm.
g.What happens if you don’t use the sort idiom {⍵[⍋⍵]} .
h.A shorter solution than key, but impractical because of abysmal performance.

Bottom line: for this problem, key offers the fastest and shortest practical solution.



New Constructs: Tally with Key II

{≢⍵}⌸y 2.5-3

   ?0
0.549364

   ?2 3⍴0
0.384692 0.331927 0.026112 
0.773553 0.784359 0.0516709

   ¯1+{≢⍵}⌸(⍳5),⌊5×?1e6⍴0
200742 199084 199985 200288 199901

In v14.0, ?0 returns a uniform random number between 0 and 1. Tally with key is useful for testing this new facility: If r is a unform random number between 0 and 1, then ⌊5×r should be uniformly distributed among ⍳5 . And so it is, for this sample.



New Constructs: Tally with Key III

{⍺(≢⍵)}⌸y 1.5

   hexf ⍪ 0.1 0.5 0.9
3FB999999999999A
3FE0000000000000
3FECCCCCCCCCCCCD

   x ← hexf ⍪?1e6⍴0

   ⊖ sort ⍕ {⍺(≢⍵)}⌸ x[;⍳3]
 3FE  499256
 3FD  250610
 3FC  124865
 3FB   62471
 3FA   31330
 3F9   15803
 3F8    7877
 3F7    3883
 3F6    1934
 3F5    1033
 3F4     465
 3F3     227
 3F2     117
 3F1      65
 3F0      33
 3EF      18
 3EE       5
 3ED       7
 3EC       1

hexf from the Dfns workspace computes a character vector of the hexadecimal representation of a floating point number. The first 3 characters are the binary exponent (also the sign, always positive for our data). For example, 3FE is the binary exponent of the numbers between 0.5 and 1.

The interval size for the next smaller exponent is halved; therefore, starting down from 3FE , the tally should be halved from one exponent to the next. And so it is, for this sample.



New Constructs: Tally with Key IV

{⍺(≢⍵)}⌸y 1.5

{⊖sort⍕{⍺(≢⍵)}⌸⍵[;⍳3]}
{⊖⍕(↓b⌿t),⍪-2-/¯1,b/⍳⍴b←∨/(1↓t⍪' ')≠t←sort ⍵[;⍳3]}

Using key is not only shorter and simpler but also faster. Currently there is no special code for this case ({⍺(≢⍵)}⌸) of key. When there is, it will be faster still.



New Constructs: Special Code

Key
⊢∘(f/)  ⊢∘(f⌿)  ⊢∘≢   ⊢∘⊂
{f/⍵}   {f⌿⍵}   {≢⍵}  {⊂⍵}  {⍺(≢⍵)}
 
Rank   monad    
⊂  ⊖  f⌿  f⍀  {⍵[⍋⍵]} {⍵[⍋⍵;]} {(⊂⍋⍵)⌷⍵}
dyad
↑  ↓  ≡  ≢  ⌿  
both    
,  ⍪  ⍉
 
Train   dyad    
(⍳∘1 ≥)  (1 ∊ ≥)  (∧/ ≥)  (⍳ < (≢⊣))

f is a primitive scalar dyadic function.
 

More generally, the new language features rank, key, and trains each have code for the general case, plus special code (idioms) for specific useful cases. If you find a compelling example which you think should be faster, please let us know.

(The grey items do not have special code yet.)



created:  2013-10-01 17:00
updated:2013-10-22 08:55