Making Lasagne
Participants in the SA2 Performance Tuning workshop at the Dyalog ’18 User Meeting were encouraged to bring their own problems for the group to work on. Diane Hymas of ExxonMobil brought a good one. The oneliner computation is as follows:
lasagne0 ← {groups {+⌿⍵}⌸ amts ×[⎕io] spices[inds;]}
where
n ← 8e5
spices ← ?6000 44⍴0
groups ← +\(16↑1 2)[?n⍴16]
inds ← ?n⍴≢spices
amts ← ?n⍴0
Applying lasagne0
to this dataset:
⍴ lasagne0 ⍬
100015 44
≢ ∪ groups
100015
)copy dfns wsreq cmpx
wsreq 'lasagne0 ⍬'
844799820
cmpx 'lasagne0 ⍬'
2.12E¯1
The problem with lasagne0
is space rather than time. The 845 MB required for this dataset may be acceptable, but we can be called upon to cook up large batches of lasagne in a smallish workspace, on a machine with limited RAM. (Large n
and large ≢∪groups
.)
All benchmarks in this document were run in Dyalog APL version 17.0, in a 2 GB workspace, on a machine with generous RAM.
Solutions
Marshall Lochbaum solved the problem. The alternative solutions are as follows:
lasagne0 ← {groups {+⌿⍵}⌸ amts ×[⎕io] spices[inds;]}
lasagne1 ← {↑ (groups{⊂⍵}⌸amts) {+⌿⍺×[⎕io]spices[⍵;]}¨ groups{⊂⍵}⌸inds}
lasagne2 ← {↑ (groups{⊂⍵}⌸amts) {⍺+.×spices[⍵;]}¨ groups{⊂⍵}⌸inds}
lasagne3 ← {↑ {amts[⍵]+.×spices[inds[⍵];]}¨ {⊂⍵}⌸groups}
lasagne0
is the original expression; lasagne1
and lasagne2
were derived by Marshall during the workshop; lasagne3
was suggested by a participant in the workshop. The four functions produce matching results. Comparing the space and time:
space (MB)  time  
lasagne0 
845 
2.29e¯1 ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕ 

lasagne1 
74 
3.60e¯1 ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕ 

lasagne2 
74 
2.39e¯1 ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕ 

lasagne3 
74 
2.93e¯1 ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕ 
lasagne0 v lasagne1
Nearly all of the space required to evaluate lasagne0
is accounted for by the space for computing the right argument to key:
wsreq 'lasagne0 ⍬'
844799820
wsreq 'amts ×[⎕io] spices[inds;]'
844799548
In fact, the array spices[inds;]
, all by itself, is already very large. It has shape (⍴inds),1↓⍴spices
(≡ 8e5 44)
, each item requiring 8 bytes.
wsreq 'spices[inds;]'
281599556
⍴ spices[inds;]
800000 44
8 × ×/ ⍴ spices[inds;]
281600000
qsize←{⎕size '⍵'} ⍝ # bytes for array ⍵
qsize spices[inds;]
281600040
lasagne1
avoids creating these large intermediate results, by first partitioning the arguments (groups{⊂⍵}⌸amts
and groups{⊂⍵}⌸inds
) and then applying a computation to each partition. In that computation, the operand function {+⌿⍺×[⎕io]spices[⍵;]}
, ⍺
is a partition of amts
and ⍵
is the corresponding partition of inds
.
lasagne1
, lasagne2
and lasagne3
require the same amount of space to run, so the comparison among them is on time.
lasagne1 v lasagne2
The only change is from +⌿⍺×[⎕io]spices[⍵;]
to ⍺+.×spices[⍵;]
, which are equivalent when ⍺
is a vector. The interpreter can compute +.×
in one go rather than doing +⌿
separately after doing ×[⎕io]
; in such computation the interpreter can and does exploit the additional information afforded by +.×
and is faster by a factor of 1.5
(= 2.39 ÷⍨ 3.60)
.
lasagne2 v lasagne3
The idea in lasagne3
is doing one key operation rather than the two in lasagne2
. Therefore, the changes between lasagne2
v lasagne3
are:
lasagne2 
groups{⊂⍵}⌸amts 
⍺ 
spices[⍵;] 
lasagne3 
{⊂⍵}⌸groups 
amts[⍵] 
spices[inds[⍵];] 
All three key operations involve {⊂⍵}⌸
with groups
as the key, and are roughly equally fast, each taking up no more than 7% of the total time.
cmpx 'groups{⊂⍵}⌸amts' 'groups{⊂⍵}⌸inds' '{⊂⍵}⌸groups'
groups{⊂⍵}⌸amts → 1.69E¯2  0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* groups{⊂⍵}⌸inds → 1.39E¯2  18% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* {⊂⍵}⌸groups → 1.36E¯2  20% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
lasagne3
is doing one less key operation than lasagne2
in exchange for doing, for each of the ≢∪groups
(= 100015)
executions of the operand function, amt[⍵]
v ⍺
and spices[ind[⍵];]
v spices[⍵;]
. Indexing is by no means slow, but it’s not as fast as references to ⍺
and ⍵
. Therefore, lasagne2
is faster.
The tradeoff may differ for different values of groups
. In this case groups
are smallrange integers so operations using it as the key value are fast.