Core Performance 2015

Abstract

Recent and future versions of Dyalog APL have benefited from several implementation techniques, including:

- range finding with vector instructions
- CRC
- special codes in operators
- magic functions
Acknowledgments

The vector range-finder and CRC code were written by Jay Foad. Magic functions are made possible by work done by John Scholes and Richard Smith.
1. Range Finding with Vector Instructions

Functions in the index-of family on small-range arguments have faster implementations than on general arguments [0, §10]. For example:

      s←100×x←?1e6⍴8e5
      t←100×y←?1e6⍴8e5
      cmpx 'x⍳y' 's⍳t'
  x⍳y → 7.91E¯3 |    0% ⎕⎕⎕⎕⎕
  s⍳t → 5.17E¯2 | +553% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

But how do you know when something is small-range?
Intel SSE2 and other CPUs of similar vintage (2001 or later) have vector instructions for finding the min and the max. The vector range-finding costs less than 2% of the time required for the general algorithm, and is used to advantage in primitives such as ⍳ ∊ ∪ ⍋ ⍒ {⍵[⍋⍵]} {⍵[⍒⍵]} f⌸ ⍺[⍵], the last for checking index bounds.
2. CRC

Modern CPUs (SSE4.2 or later) also have instructions for computing the CRC (cyclic redundancy check). CRCs are useful for functions in the index-of family because the probability of CRCs being equal on unequal arguments is small, and because CRCs are computable in real time [0, §8]. The following functions use CRC:
A benchmark comparing a relational matrix vs an ordinary nested matrix:

      x←1e5 3⍴⊂⍬
      x[;1]←u[?1e5⍴≢u←' '~¨⍨↓⎕nl 2 3]
      x[;2]←15+?1e5⍴60
      x[;3]←?1e5⍴2e9
      3↑x   ⍝ x is a relational matrix
┌───────────────┬──┬──────────┐
│s              │41│196549161 │
├───────────────┼──┼──────────┤
│ben9949        │16│1426044578│
├───────────────┼──┼──────────┤
│rank_transpose2│25│1926599481│
└───────────────┴──┴──────────┘
      y←x
      y[1;3]←⊂'not a relational matrix'
      cmpx 'x⍳x' 'y⍳y'
  x⍳x → 6.50E¯3 |    0% ⎕⎕⎕
  y⍳y → 5.82E¯2 | +795% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

If a relational matrix is made into an inverted table [1], the performance is even better:

      t←↑¨↓⍉x   ⍝ inverted table
      ⍪¨ 3↑¨t
┌───────────────┬──┬──────────┐
│s              │41│ 196549161│
│ben9949        │16│1426044578│
│rank_transpose2│25│1926599481│
└───────────────┴──┴──────────┘
      cmpx 'x⍳x' 't(8⌶)t'
  x⍳x    → 7.98E¯3 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  t(8⌶)t → 3.37E¯3 | -58% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

When CRC is not available on the current CPU, the APL primitives use a C-coded hash function instead. In the following benchmarks, hashCRC computes the CRC for each record and hashC computes the C-coded hash. Analogously, 8⌶ computes index-of using CRC and iC computes index-of using the C-coded hash. (hashCRC, hashC, and iC are private facilities.)

      cmpx 'hashCRC⍨t' 'hashC ⍨t'
  hashCRC⍨t → 1.09E¯3 |    0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* hashC ⍨t  → 3.40E¯3 | +212% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
      cmpx '8⌶⍨t' 'iC⍨t'
  8⌶⍨t → 3.36E¯3 |    0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  iC⍨t → 6.73E¯3 | +100% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

The benchmarks also show how much of the time for index-of is taken up by computing the hash.
3. Special Codes in Operators

An operator has a potentially infinite number of cases, but only relatively few of them are commonly useful. The strategy has been to provide special code for the commonly useful cases. For example, for the key operator, the special cases are as follows. Additional cases can be implemented as required.
For example, {⍺,≢⍵}⌸ returns a 2-column matrix of the unique items and the corresponding counts, and is supported by special code. The following benchmark illustrates the difference special code can make: the parenthesized {⍺,(≢⍵)}⌸, though equivalent, is not recognized by the special-case matching and runs the general case.

      x←?1e6⍴100
      {⍺,≢⍵}⌸x
44 10132
 8  9873
39  9948
53  9826
79 10068
      cmpx '{⍺,≢⍵}⌸x' '{⍺,(≢⍵)}⌸x'
  {⍺,≢⍵}⌸x   → 7.50E¯4 |     0% ⎕
  {⍺,(≢⍵)}⌸x → 1.55E¯2 | +1960% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

Another comparison is to contrast the special code for {⍺,≢⍵}⌸x against the fastest APL functions for solving the problem on this data. In the following, the results of functions f and g differ from that for {⍺,≢⍵}⌸x only in being sorted.

      f ← {(b/s),⍤0-2-/(⎕io+⍴b),⍨b/⍳⍴b←s≠¯1↓¯1,s←{⍵[⍋⍵]}⍵}
      g ← {c←(1+⌈/⍵)⍴0 ⋄ c[⍵]+←1 ⋄ (b/⍳⍴b),⍤0⊢(b←0≠c)/c}
      cmpx '{⍺,≢⍵}⌸x' 'f x' 'g x'
  {⍺,≢⍵}⌸x → 7.42E¯4 |    0% ⎕⎕⎕⎕⎕⎕⎕
* f x      → 2.57E¯3 | +246% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* g x      → 3.31E¯3 | +345% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

(A complete list of special codes is presented and discussed in the workshop TP4 on 2015-09-10 [2].) The general case of key is implemented by a magic function.
4. Magic Functions

A magic function is an APL-coded (dfn) computation in the C source of the interpreter [3]. The magic function for the general case of key:

{
  0=⎕nc'⍺':⍵ ∇ ⍳≢⍵   ⍝ monadic case: f⌸ x ←→ x f⌸ ⍳≢x
  ⍺ ⍺⍺{⍺ ⍺⍺ ⍵}{
    ⎕ml←1            ⍝ needed by ↑⍵ and ⍺⊂⍵
    j←⍋i⊣⎕dr i←⍳⍨⍺
    ↑ (⊂[1↓⍳⍴⍴⍺]((⍳≢i)=⍳⍨i)⌿⍺) ⍺⍺¨ (2≠/¯1,i[j]) ⊂[⎕io] ⍵⌷⍨⊂j
  }⍵
}

Interpreter performance work in recent and future versions has been informed by usage in these magic functions, including the primitives used above (colored red in the original presentation).

It is instructive to examine what would have happened if magic functions had been used for some older operators. The dot (inner product) operator has a possible magic function implementation [4]:

      dot ← {⍺ (⍺⍺⌿⍵⍵¨⍤¯1)⍤1 255 ⊢⍵}
      x←?40 100⍴0
      y←?100 80⍴0
      (x+.×y) ≡ x +dot× y
1

+.× is implemented by special code that has received lavish attention, and:

      cmpx 'x+.×y' 'x+dot×y'
  x+.×y   → 2.03E¯4 |      0%
  x+dot×y → 5.52E¯2 | +27076% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

We note that the × in +dot× is applied through ¨, and we know furthermore that ¨ does not yet make a special case for scalar functions. (You can verify the latter by doing some benchmarks.) Suppose we make a special version of dot for scalar functions, simulating a future version of the interpreter in which ¨ will have special code for scalar functions:

      dot0 ← {⍺ (⍺⍺⌿⍵⍵⍤¯1)⍤1 255 ⊢⍵}
      cmpx 'x+.×y' 'x+dot0×y'
  x+.×y    → 2.08E¯4 |    0% ⎕⎕⎕⎕
  x+dot0×y → 1.68E¯3 | +710% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

Not bad for 5 minutes of analysis and coding (from +27076% down to +710%). Finally, what about the general case, for which . does not have special code?

      cmpx 'x{⍺+⍵}.{⍺×⍵}y' 'x{⍺+⍵}dot{⍺×⍵}y'
  x{⍺+⍵}.{⍺×⍵}y   → 3.44E¯1 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  x{⍺+⍵}dot{⍺×⍵}y → 2.72E¯1 | -22% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

References
Presented at Dyalog ’15, Naxos Beach, Sicily, 2015-09-07.