Primitive Performance
Roger Hui
14.0 v 13.2
x+.×y | |
2.5 |
≤\b | |
1.2-1.4 |
|
bit mover | |
1.3-23 |
b⍳0 b⍳1 | |
780 |
n↑b | |
6-8 |
+\[a]b | |
1.5-28 |
+/[a]b | |
1-8 |
+/∧\b | |
18-36 |
b=c | |
1.2 |
x∊y | |
1-2 |
|
x(⍳∘1 ≥)y | |
1.3-∞ |
x(∧/≥)y x∧.≥y | |
2-∞ |
x(1∊≥)y | |
2-∞ |
x(+/≥)y | |
1.4 |
x≡⍤1⊢y | |
7-∞ |
{≢⍵}⌸y | |
2.5-3 |
{⍺(≢⍵)}⌸y | |
1.5 |
|
⌊0.5+y ⌈y ⌊y | |
1.2-1.6 |
↑[⎕io-0.5]y | |
2-11 |
x[i;j;
] | |
1-1.17 |
b/[a]y | |
2-17 |
+\[a]y ⌈\ ⌊\ | |
2-22 |
v[i]+←1 | |
2.8 |
Dyalog v14.0 has a number of performance improvements.
There are three underlying themes that bring order to
the list of individual improvements. They are:
Visual Studio 2012, boolean functions, and new constructs.
Visual Studio 2012
auto-vectorization |
auto-parallelization |
These speed-ups are noteworthy because they did not require
any changes to the C source, but obtained by compiling
under Visual Studio 2012.
Two features of VS2012 are relevant:
auto-vectorization uses vector registers and vector instructions.
This is particulary effective for +.× because
of the non-traditional way Dyalog coded +.× .
The other feature, auto-parallelization, uses multi-cores,
again without any changes to the C source.
We have not yet explored this second feature.
≤\b is listed not because it is an important function,
but because we happened to find that it has been sped up.
We have just began exploring VS2012, including verifying that
there are no significant slow-downs.
So far it looks good.
Booleans
bit mover | |
1.3-23 |
b⍳0 b⍳1 | |
780 |
n↑b | |
6-8 |
+\[a]b | |
1.5-28 |
+/[a]b | |
1-8 |
+/∧\b | |
18-36 |
b=c | |
1.2 |
x∊y | |
1-2 |
|
The initial impetus for this work came in March 2012
when Morten was preparing to present the Game of Life at the
Software Passions Summit conference in Göteborg, Sweden.
We examined the performance of boolean functions,
and some speed-ups were released in v13.2.
Version 14.0 continues the program.
We now look at two boolean speed-ups in particular.
Booleans: Bit Mover
x⍴b
x↑b
x↓b
bv⌿b
bv⍀b
b,c
b⍪c
x⊖b
b,←c
b⍪←c
b[x;]
(⊂x)⌷b
{⍵[⍋⍵;]}b
{⍵[⍒⍵;]}b
| |
b c boolean array |
bv boolean vector |
x arbitrary array |
The interpreter has a function called bMb for moving bits.
It is used in a number of primitives, and by speeding up bMb
a number of primitives are now faster on boolean arrays.
Boolean: b⍳0 and b⍳1
(b⍳1)↓x v (∨\b)/x
(b⍳0)↑x v (∧\b)/x
∧\b ←→ (⍳n)<b⍳0 | 0 at or after the first 0 |
∨\b ←→ (⍳n)≥b⍳1 | 1 at or after the first 1 |
<\b ←→ (⍳n)=b⍳1 | 0 after the first 1 |
On 2012-05-08,
Arthur Whitney wrote in the k mailing list that most APL boolean scans
are not useful and that b⍳0 and b⍳1 are much better.
This led me to do timings on them in Dyalog:
b←(1e6⍴0),1
10 timer 'b⍳1'
0.0195893
c←~b
10 timer 'c⍳0'
0.0186658
In J it takes roughly 0.00255 .
Dyalog should beat J by a factor of about 8
because it has bit booleans whereas J has byte booleans,
so b⍳1 in Dyalog can be faster by a factor of 0.019÷0.00255÷8 or about 60.
We actually achieved a factor of 780.
Arthur did not elaborate on his comment,
but the reasoning is probably as follows: ∨\b and b⍳1 contain the same information,
but the former takes ⍴b bits whereas the latter takes 64 or 32 bits.
As well, there are follow-on effects.
For example, (b⍳1)↓x is more efficient than (∨\b)/x.
b⍳0 and b⍳1 are key computations in several boolean scans.
∧\ | (⍳n)<b⍳0 | 0 at or after the first 0 |
∨\ | (⍳n)≥b⍳1 | 1 at or after the first 1 |
<\ | (⍳n)=b⍳1 | 0 after the first 1 |
≤\ | (⍳n)≠b⍳0 | 1 after the first 0 |
>\ | n⍴(j⍴1 0),(0≠2|j),n⍴0≠2|j←b⍳0 | |
≥\ | n⍴(j⍴0 1),(1≠2|j),n⍴1≠2|j←b⍳1 | |
⍱\ | n⍴(j⍴0 1),(1≠2|j),n⍴0≠2|j←b⍳1 | |
⍲\ | n⍴(j⍴1 0),(0≠2|j),n⍴1≠2|j←b⍳0 | |
Therefore, ∧\ ∨\ <\ ,
last sped up in 2008 (v12.1), are sped up yet again.
New Constructs
x(⍳∘1 ≥)y | |
1.3-∞ |
x(∧/≥)y x∧.≥y | |
2-∞ |
x(1∊≥)y | |
2-∞ |
x(+/≥)y | |
1.4 |
x≡⍤1⊢y | |
7-∞ |
{≢⍵}⌸y | |
2.5-3 |
{⍺(≢⍵)}⌸y | |
1.5 |
|
Version 14.0 has the language extensions
trains (forks), rank, key.
They not only enhance the expressiveness of the language but
are faster in some common and useful computations.
New Constructs: Index of Compare
x (f g) y ←→ f (x g y) |
∴ x (⍳∘1 ≥) y ←→ ⍳∘1 (x≥y) ←→ (x≥y)⍳1 |
|
In general, the pattern is x (⍳∘b comp) y |
b is 0 1 |
comp is < ≤ = ≠ ≥ > |
|
 |
make idioms of {(⍺≥⍵)⍳1}, {(⍺≠⍵)⍳0},
|
index-of last occurrence: ⍸ |
The expression at the top of the slide computes the index
of the first place where x is greater than or equal to y .
If the hit happens near the beginning, you win big, potentially
an “infinite” speed-up.
Even if the hit happens near the end (or not at all),
you still save on the cost of creating the boolean vector
and then looking for the 1.
(The factor of 1.3 is so small
only because b⍳1 has been sped up.)
In general, you can look for 0 or 1,
and the comparison can be any of the arithmetic comparisons.
Now that we see the pattern, a couple of bright ideas:
First, we should make idioms of the equivalent Dfns.
Second, wouldn’t it be nice if there is
a primitive function, ⍸ say,
that finds the index of the last occurrence.
New Constructs: All/Any/Count of Compare
x(∧/≥)y x∧.≥y | |
2-∞ |
x(1∊≥)y | |
2-∞ |
x(+/≥)y | |
1.4 |
|
Similarly,
x (f g) y ←→ f (x g y) |
∴ x (∧/≥) y ←→ ∧/(x≥y) ←→ ∧/x≥y (←→ x∧.≥y) |
|
x (a f g) y ←→ a f (x g y) |
∴ x (1∊≥) y ←→ 1∊(x≥y) ←→ 1∊x≥y |
|
In general, the patterns are x (f/ comp) y
and x (b ∊ comp) y |
f is ∨ ∧ + |
comp is < ≤ = ≠ ≥ > |
b is 0 or 1 |
The expression at the top of the slide answers the question,
are all of x greater than or equal to y ?
In general, the questions can be, are all, are any, and how many,
and the comparison can be any of the arithmetic comparisons.
There is no “infinite” speed-up for +/ because
there is no early exit from +/ .
The factor of 1.4 is small because +/ on booleans,
already phenomenally fast, has been sped up in v14.0.
After we saw the pattern, we thought to check that the equivalent
inner products also have early exits.
They did not. They do now.
New Constructs: Match with Rank
Are corresponding rows of
matrices x and y the same?
x y ← ⊂⍤2 ⊢u[?2 2000⍴≢u←(⎕a,⎕d)[?10 29⍴36];]
cmpx 'x≡⍤1⊢y' '∧/x=y'
x≡⍤1⊢y → 1.14E¯5 | 0% ⎕⎕⎕
∧/x=y → 7.72E¯5 | +576% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
x y ← ⊂⍤2 ⊢(⎕a,⎕d)[?2 2000 100⍴36]
cmpx 'x≡⍤1⊢y' '∧/x=y'
x≡⍤1⊢y → 1.01E¯5 | 0% ⎕
∧/x=y → 2.47E¯4 | +2333% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
x≡⍤1⊢y is faster than ∧/x=y because
match can exit early.
This point is illustrated by using matrices with long random rows.
New Constructs: Tally with Key I
# of occurrences of unique elements of x←?1e6⍴1000
a. | ¯1+{≢⍵}⌸(⍳1000),x |
| 1.00 |
b. | ¯1+{≢⍵}⌸(⍳1+⌈/x),x |
| 1.62 |
c. | c ⊣ c[x]+←1 ⊣ c←1000⍴0 |
| 2.62 |
d. | t-¯1↓¯1,t←(s≠1↓s,¯1)/⍳⍴s |
| 1.81 |
e. | -2-/¯1,(s≠1↓s,¯1)/⍳⍴s |
| 1.81 |
f. | -2-/¯1,(s≠1↓s,¯1)/⍳⍴s←{⍵[⍋⍵]}x |
| 2.93 |
g. | -2-/¯1,(s≠1↓s,¯1)/⍳⍴s←x[⍋x] |
| 9.13 |
h. | +/(⍳1000)∘.=x |
| 778.66 |
This is a problem encountered in July in a tuning project with a client.
a. | Fast solution using key.
Prefacing the “universe” to x has the advantage of
(0) generating the counts in a known order;
and (1) including a count of 0 for elements of the universe that do not occur in the sample.
|
b. | The same algorithm, but assumes that you don’t know the range.
|
c. | Using modified indexed assignment v[x]+←1 whose performance
has also been improved in v14.0.
I believe it can be faster still because a. and c. should be equally fast.
|
d. | About the best that you can do
if you don’t use key or modified index assignment,
but can assume that the sample is sorted.
|
e. | A marginal gain by using pairwise reduction -2-/⍵
to find successive differences, perhaps suggesting that 2-/⍵ can be improved.
|
f. | What the previous solution looks like if you
can not assume that the sample is sorted. Note that for this x
the interpreter is able to use a fast sort algorithm.
|
g. | What happens if you don’t use the sort idiom {⍵[⍋⍵]} .
|
h. | A shorter solution than key,
but impractical because of abysmal performance.
|
Bottom line: for this problem, key offers the fastest and shortest practical solution.
New Constructs: Tally with Key II
?0
0.549364
?2 3⍴0
0.384692 0.331927 0.026112
0.773553 0.784359 0.0516709
¯1+{≢⍵}⌸(⍳5),⌊5×?1e6⍴0
200742 199084 199985 200288 199901
In v14.0, ?0 returns a uniform random number between 0 and 1.
Tally with key is useful for testing this new facility:
If r is a unform random number between 0 and 1,
then ⌊5×r should be uniformly distributed
among ⍳5 .
And so it is, for this sample.
New Constructs: Tally with Key III
hexf ⍪ 0.1 0.5 0.9
3FB999999999999A
3FE0000000000000
3FECCCCCCCCCCCCD
x ← hexf ⍪?1e6⍴0
⊖ sort ⍕ {⍺(≢⍵)}⌸ x[;⍳3]
3FE 499256
3FD 250610
3FC 124865
3FB 62471
3FA 31330
3F9 15803
3F8 7877
3F7 3883
3F6 1934
3F5 1033
3F4 465
3F3 227
3F2 117
3F1 65
3F0 33
3EF 18
3EE 5
3ED 7
3EC 1
hexf from the Dfns workspace computes
a character vector of the hexadecimal representation
of a floating point number.
The first 3 characters are the binary exponent
(also the sign, always positive for our data).
For example, 3FE is the binary exponent
of the numbers between 0.5 and 1.
The interval size for the next smaller exponent is halved;
therefore, starting down from 3FE ,
the tally should be halved from one exponent to the next.
And so it is, for this sample.
New Constructs: Tally with Key IV
{⊖sort⍕{⍺(≢⍵)}⌸⍵[;⍳3]}
{⊖⍕(↓b⌿t),⍪-2-/¯1,b/⍳⍴b←∨/(1↓t⍪' ')≠t←sort ⍵[;⍳3]}
Using key is not only shorter and simpler but also faster.
Currently there is no special code for this
case ({⍺(≢⍵)}⌸)
of key. When there is, it will be faster still.
New Constructs: Special Code
Key | ⊢∘(f/) ⊢∘(f⌿) ⊢∘≢ ⊢∘⊂
{f/⍵} {f⌿⍵} {≢⍵} {⊂⍵} {⍺(≢⍵)} |
|
Rank |
monad | ⊂ ⊖ f⌿ f⍀ {⍵[⍋⍵]} {⍵[⍋⍵;]} {(⊂⍋⍵)⌷⍵} |
|
dyad | ↑ ↓ ≡ ≢ ⌿ ⍳ |
|
both | , ⍪ ⍉ |
|
Train | dyad |
(⍳∘1 ≥) (1 ∊ ≥) (∧/ ≥) (⍳ < (≢⊣)) |
f is a primitive scalar dyadic function.
More generally, the new language features
rank, key, and trains
each have code for the general case,
plus special code (idioms) for specific useful cases.
If you find a compelling example
which you think should be faster,
please let us know.
(The grey items do not have special code yet.)
created: | 2013-10-01 17:00 |
updated: | 2013-10-22 08:55 |
|