Performance Improvements: A Case Study
Preliminary Version
initial writing: 2008-07-15
last updated: 2008-10-06

Contents

0. Basic Properties of Boolean Arrays
1. Initial Ideas
2. Key Idea
3. Implementation
4. Testing
5. Functional Interface


0. Basic Properties of Boolean Arrays

Dyalog APL/W Version 12.0.1
clear ws

      ⎕io←0

      )copy t randbool timer

      randbool 3 5
1 0 0 1 1
1 0 0 1 0
0 0 0 0 1

      x←randbool 8e5  8 ⋄ 100 timer '≠⌿x'
0.03422
      x←randbool 4e5 16 ⋄ 100 timer '≠⌿x'
0.01766
      x←randbool 2e5 32 ⋄ 100 timer '≠⌿x'
0.00906
      x←randbool 1e5 64 ⋄ 100 timer '≠⌿x'
0.00516

      6.4e6 ÷ 14
457142.8571
      x←randbool 457143 14 ⋄ 100 timer '≠⌿x'
0.04985

The goal is to speed up f⌿x where x is an n,c boolean array with c not a multiple of 8.


1. Initial Ideas

Idea 0: pad each row up to next byte
      (≠⌿x) ≡ 14 ⍴ ≠⌿ 457143 16 ↑ x
1
      100 timer '14 ⍴ ≠⌿ 457143 16 ↑ x'
0.16657
Idea 1: use integers
      (≠⌿x) ≡ 2|+⌿x
1
      100 timer '2|+⌿x'
0.07031

2. Key Idea

Idea 2: treat argument as if it had c∧8 number of columns
      ⎕←t←17 14⍴⍳14
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
      14∧8
56
      56 ÷ 14
4
      ⎕←t1←5 56⍴t⍪7 14⍴0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 0 0 0 0 0 0 0 0 0  0  0  0  0 0 0 0 0 0 0 0 0 0 0  0  0  0  0 0 0 0 0 0 0 0 0 0 0  0  0  0  0
      +⌿t1
0 5 10 15 20 25 30 35 40 45 50 55 60 65 0 4 8 12 16 20 24 28 32 36 40 44 48 52 0 4 8 12 16 20 24 28 32 36 40 44 48 52 0 4 8 12 16 20 24 28 32 36 40 44 48 52
      4 14⍴+⌿t1
0 5 10 15 20 25 30 35 40 45 50 55 60 65
0 4  8 12 16 20 24 28 32 36 40 44 48 52
0 4  8 12 16 20 24 28 32 36 40 44 48 52
0 4  8 12 16 20 24 28 32 36 40 44 48 52
      +⌿ 4 14 ⍴ +⌿t1
0 17 34 51 68 85 102 119 136 153 170 187 204 221
      +⌿t
0 17 34 51 68 85 102 119 136 153 170 187 204 221

      457143 ÷ 4
114285.75

      (≠⌿x) ≡ ≠⌿ 4 14 ⍴ ≠⌿ 114286 56 ⍴ x⍪7 14⍴0
1
      100 timer '≠⌿ 4 14 ⍴ ≠⌿ 114286 56 ⍴ x⍪7 14⍴0'
0.00765

      0.04985 ÷ 0.00765
6.516339869

In general:

 slash0←{
     c←¯1↑⍴⍵                ⍝ # columns
     m←8∧c                  ⍝ LCM
     d←m÷c                  ⍝ req'd multiple of # columns
     n←d×⌈(1↑⍴⍵)÷d          ⍝ round up to next multiple of d
     ⍺⍺⌿(d,c)⍴⍺⍺⌿((n÷d),m)⍴⍵⍪((n-1↑⍴⍵),c)⍴⍺⍺⌿⍳0
 }

      (≠⌿x) ≡ ≠ slash0 x
1
      (=⌿x) ≡ = slash0 x
1
      (∧⌿x) ≡ ∧ slash0 x
1
      (∨⌿x) ≡ ∨ slash0 x
1
      (+⌿x) ≡ + slash0 x
1

      100 timer '≠ slash0 x'
0.00766

      ÷/ 100 timer¨ '≠⌿x' '≠ slash0 x'
7.211062591
      ÷/ 100 timer¨ '=⌿x' '= slash0 x'
7.270666667
      ÷/ 100 timer¨ '∧⌿x' '∧ slash0 x'
6.910987483
      ÷/ 100 timer¨ '∨⌿x' '∨ slash0 x'
6.974965229
      ÷/ 100 timer¨ '+⌿x' '+ slash0 x'
3.094653812

3. Implementation

Dyalog 12.1.x

      x←randbool 457143 14
      100 timer '≠⌿x'
0.00204

      0.04985 ÷ 0.00204
24.43627451
J6.02
   timer=: 6!:2
   x=: ? 457143 14 $ 2
   100 timer '~:/x'
0.00477299

APL should be faster than J, because APL has bit booleans whereas J has byte booleans.


4. Testing

The expressive power of APL and the computational power of modern machines facilitate thorough testing.

      )copy t assert

      ⎕cr 'assert'
 assert x                          
 'assertion failure'⎕SIGNAL(0∊x)⍴11

      assert 0=2|2 4 6 8 10
      assert 0=2|2 4 6 8 11
assertion failure
      assert 0=2|2 4 6 8 11
     ∧

      nereduce0←{2|+⌿2+⍵}
      assert 1 29 113∘.{(≠⌿x)≡nereduce0+x←randbool 137,⍵}1+⍳20

5. Functional Interface

For an interpreter with a functional interface, translating from slash0 to C takes a few minutes:

 slash0←{                                                  
     c←¯1↑⍴⍵                ⍝ number of columns             
     m←8∧c                  ⍝ LCM                           
     d←m÷c                  ⍝ req'd multiple of # columns   
     n←d×⌈(1↑⍴⍵)÷d          ⍝ round up to next multiple of m
     ⍺⍺⌿(d,c)⍴⍺⍺⌿((n÷d),m)⍴⍵⍪((n-1↑⍴⍵),c)⍴⍺⍺⌿⍳0             
 }           


#include "globals.h"

static I gcd[8]={8,1,2,1,4,1,2,1};

F1(neslash0){I c,m,n,*s;
 s=y->s;                // shape
 c=s[1];                // # columns
 m=8*c/gcd[c%8];        // LCM(8,c)
 d=m/c;                 // req‘d multiple of # columns
 n=d*((*s+d-1)/d);      // round up to next multiple of m
 return neslash(reshape(v2(d,c),neslash(reshape(v2(n/d,m),cat0(y,reshape(v2(n-*s,c),zero))))));
}
globals.h
typedef long I;
typedef struct{I t,c,n,r,s[1];}*A;

#define F1(f) A f(A y)
#define F2(f) A f(A x,A y);

extern A zero;
extern F1(neslash);
extern F2(cat0);
extern F2(reshape);
extern A v2(I,I);

This implementation would have obtained nearly all of the speed-up from the key idea.

Dyalog 12.1.x

      100 timer '≠ slash0 x'
0.00266
Timings:
0.04985   v12.0 ≠⌿x
0.00766 v12.0 ≠ slash0 x
0.00204 v12.1 ≠⌿x
0.00266 v12.1 ≠ slash0 x
0.00477 J6.02