﻿ Roger Hui - Dyalog08
 Performance Improvements: A Case Study Preliminary Version initial writing: 2008-07-15last updated: 2008-10-06

Contents

0. Basic Properties of Boolean Arrays

• Boolean arrays have one bit per element.
• Arrays are stored in row-major order, one bit after another.
• Bit access is slower than byte or word access.
```Dyalog APL/W Version 12.0.1
clear ws

⎕io←0

)copy t randbool timer

randbool 3 5
1 0 0 1 1
1 0 0 1 0
0 0 0 0 1

x←randbool 8e5  8 ⋄ 100 timer '≠⌿x'
0.03422
x←randbool 4e5 16 ⋄ 100 timer '≠⌿x'
0.01766
x←randbool 2e5 32 ⋄ 100 timer '≠⌿x'
0.00906
x←randbool 1e5 64 ⋄ 100 timer '≠⌿x'
0.00516

6.4e6 ÷ 14
457142.8571
x←randbool 457143 14 ⋄ 100 timer '≠⌿x'
0.04985
```

The goal is to speed up f⌿x where x is an n,c boolean array with c not a multiple of 8.

1. Initial Ideas

Idea 0: pad each row up to next byte
```      (≠⌿x) ≡ 14 ⍴ ≠⌿ 457143 16 ↑ x
1
100 timer '14 ⍴ ≠⌿ 457143 16 ↑ x'
0.16657
```
Idea 1: use integers
```      (≠⌿x) ≡ 2|+⌿x
1
100 timer '2|+⌿x'
0.07031
```

2. Key Idea

Idea 2: treat argument as if it had c∧8 number of columns
```      ⎕←t←17 14⍴⍳14
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
14∧8
56
56 ÷ 14
4
⎕←t1←5 56⍴t⍪7 14⍴0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 0 0 0 0 0 0 0 0 0  0  0  0  0 0 0 0 0 0 0 0 0 0 0  0  0  0  0 0 0 0 0 0 0 0 0 0 0  0  0  0  0
+⌿t1
0 5 10 15 20 25 30 35 40 45 50 55 60 65 0 4 8 12 16 20 24 28 32 36 40 44 48 52 0 4 8 12 16 20 24 28 32 36 40 44 48 52 0 4 8 12 16 20 24 28 32 36 40 44 48 52
4 14⍴+⌿t1
0 5 10 15 20 25 30 35 40 45 50 55 60 65
0 4  8 12 16 20 24 28 32 36 40 44 48 52
0 4  8 12 16 20 24 28 32 36 40 44 48 52
0 4  8 12 16 20 24 28 32 36 40 44 48 52
+⌿ 4 14 ⍴ +⌿t1
0 17 34 51 68 85 102 119 136 153 170 187 204 221
+⌿t
0 17 34 51 68 85 102 119 136 153 170 187 204 221

457143 ÷ 4
114285.75

(≠⌿x) ≡ ≠⌿ 4 14 ⍴ ≠⌿ 114286 56 ⍴ x⍪7 14⍴0
1
100 timer '≠⌿ 4 14 ⍴ ≠⌿ 114286 56 ⍴ x⍪7 14⍴0'
0.00765

0.04985 ÷ 0.00765
6.516339869
```

In general:

``` slash0←{
c←¯1↑⍴⍵                ⍝ # columns
m←8∧c                  ⍝ LCM
d←m÷c                  ⍝ req'd multiple of # columns
n←d×⌈(1↑⍴⍵)÷d          ⍝ round up to next multiple of d
⍺⍺⌿(d,c)⍴⍺⍺⌿((n÷d),m)⍴⍵⍪((n-1↑⍴⍵),c)⍴⍺⍺⌿⍳0
}

(≠⌿x) ≡ ≠ slash0 x
1
(=⌿x) ≡ = slash0 x
1
(∧⌿x) ≡ ∧ slash0 x
1
(∨⌿x) ≡ ∨ slash0 x
1
(+⌿x) ≡ + slash0 x
1

100 timer '≠ slash0 x'
0.00766

÷/ 100 timer¨ '≠⌿x' '≠ slash0 x'
7.211062591
÷/ 100 timer¨ '=⌿x' '= slash0 x'
7.270666667
÷/ 100 timer¨ '∧⌿x' '∧ slash0 x'
6.910987483
÷/ 100 timer¨ '∨⌿x' '∨ slash0 x'
6.974965229
÷/ 100 timer¨ '+⌿x' '+ slash0 x'
3.094653812
```

3. Implementation

```Dyalog 12.1.x

x←randbool 457143 14
100 timer '≠⌿x'
0.00204

0.04985 ÷ 0.00204
24.43627451
```
J6.02
```   timer=: 6!:2
x=: ? 457143 14 \$ 2
100 timer '~:/x'
0.00477299
```

APL should be faster than J, because APL has bit booleans whereas J has byte booleans.

4. Testing

The expressive power of APL and the computational power of modern machines facilitate thorough testing.

```      )copy t assert

⎕cr 'assert'
assert x
'assertion failure'⎕SIGNAL(0∊x)⍴11

assert 0=2|2 4 6 8 10
assert 0=2|2 4 6 8 11
assertion failure
assert 0=2|2 4 6 8 11
∧

nereduce0←{2|+⌿2+⍵}
assert 1 29 113∘.{(≠⌿x)≡nereduce0+x←randbool 137,⍵}1+⍳20
```

5. Functional Interface

For an interpreter with a functional interface, translating from slash0 to C takes a few minutes:

``` slash0←{
c←¯1↑⍴⍵                ⍝ number of columns
m←8∧c                  ⍝ LCM
d←m÷c                  ⍝ req'd multiple of # columns
n←d×⌈(1↑⍴⍵)÷d          ⍝ round up to next multiple of m
⍺⍺⌿(d,c)⍴⍺⍺⌿((n÷d),m)⍴⍵⍪((n-1↑⍴⍵),c)⍴⍺⍺⌿⍳0
}

#include "globals.h"

static I gcd[8]={8,1,2,1,4,1,2,1};

F1(neslash0){I c,m,n,*s;
s=y->s;                // shape
c=s[1];                // # columns
m=8*c/gcd[c%8];        // LCM(8,c)
d=m/c;                 // req‘d multiple of # columns
n=d*((*s+d-1)/d);      // round up to next multiple of m
return neslash(reshape(v2(d,c),neslash(reshape(v2(n/d,m),cat0(y,reshape(v2(n-*s,c),zero))))));
}
```
globals.h
```typedef long I;
typedef struct{I t,c,n,r,s[1];}*A;

#define F1(f) A f(A y)
#define F2(f) A f(A x,A y);

extern A zero;
extern F1(neslash);
extern F2(cat0);
extern F2(reshape);
extern A v2(I,I);
```

This implementation would have obtained nearly all of the speed-up from the key idea.

```Dyalog 12.1.x

100 timer '≠ slash0 x'
0.00266
```
Timings:
 0.04985 v12.0 ≠⌿x 0.00766 v12.0 ≠ slash0 x 0.00204 v12.1 ≠⌿x 0.00266 v12.1 ≠ slash0 x 0.00477 J6.02