Tolerant Unique

Roger K.W. Hui
 




⎕io←0

In this presentation, the index origin is 0, although the ideas, algorithms, code, examples, …, work in 1-origin with obvious modifications.
 

The Problem 0

It’s undesirable that the unique of x does not contain all of x . The problem is the same if, instead of 10, we’d used 1e6.

For this problem, the best solution is to set ⎕ct to 0, that is, use exact comparisons. The APL user can make this choice depending on the application. The APL implementer can not.
 

The Problem 1

The bad implementation of unique apparently stems from the bad meme {((⍳≢⍵)=⍵⍳⍵)⌿⍵} , found in APL textbooks, papers, idiom lists, folklore, etc., dating from the earliest days of APL. The expression depends on = being transitive, which it is if comparisons are exact but not if 0≠⎕ct and floating point numbers are involved. With one exception, every APL and APL dialect which has the unique primitive, has it wrong. (The one exception is because we told them about it and they fixed it [0].)

Fixing unique is a non-compatible change. It also affects key (). We propose to apply the bug fix starting with version 17.0.
 

The Problem 2

We also propose to extend the definition ofso that it finds the unique major cells. For example, on a matrix argumentwould find the unique rows.
 

Desiderata 0

Since tolerant unique is so problematic, it behooves us to specify desiderata that any unique function should satisfy. The set of desiderata need not be minimal but ideally should reject an incorrect unique function, and on a failure it should be obvious what the problem is.

Regarding the monadic operator UD :

⍺⍺
       a prospective unique function
(≢⍴u)=1⌈≢⍴⍵
the rank of the unique should be the rank of , but at least 1
(≢u)>u⍳⍵
the major cells ofshould be (tolerantly) in the unique. ⍵∊u would have been a shorter way to say this, butcan not be extended to look for non-scalars because it behaves as though the right argument were ravelled. Original sin from APL\360.
(≢⍵)>i←⍵ ⍳ Tol 0⊢u
every major cell of u should exactly match a major cell of
2</i
moreover, the indices of the unique in , using exact comparisons, should be strictly increasing
u ≡ Tol 0 ⍺⍺ u
⍺⍺ should be idempotent. (This is probably redundant but I want to use the word “idempotent” :-)
(⍳≢u)≡u⍳u
a more efficient way of saying that ∘.≡⍨⊂⍤¯1⊢u should be the identity matrix
 

Desiderata 1

Another desideratum is that unique should be O(n) . In the operator UT , a benchmark is run on arguments which doubles in size at each step. The execution time should no more than double.
 

Specifications

Specifications for computing unique can be found in two sources: the ISO standard on extended APL [1] and the J dictionary [2]. The ISO APL standard specs work only on scalar or vector arguments; the J specs work on arguments of any rank. (In J, “item” means major cell.)

We choose to use the J specs because they are easier to understand.
 

Implementation 0

The specs are readily implemented as uniq1 . It is instructive to compare it to a version uniq1v which works only on vectors and scalars.

There is a prima facie case that uniq1 satisfies the desiderata UD . Unfortunately, it does not run in O(n) time, with the running time more than doubling as the argument size doubles.
 

Implementation 1

The clustering/hashing technique developed for Index-Of on Multiple Floats [3] presented earlier in this User Meeting is applicable. Looking at Uniq :

•   ⍺⍺ ⍵ applies the hash function (the clustering) to the argument.
{⊂⍵}⌸ computes the clusters c in terms of indices into the argument.
An anonymous local function is applied to each cluster. It computes the indices of the unique major cells with respect to the original right argument (X) rather than the actual unique major cells.
The result indices from each cluster are collected together, sorted, and indexed into the original argument to obtain the final result.

The local function is O(n*2) within each cluster, but the overall time for Uniq is O(n) , unless the clustering is “unlucky”.

The function cluster makes essential use of interval index (), which requires that the argument be sortable. For real arguments, using Jay Foad’s key insight, cluster is applied independently to each column to result in an integer array k . You can then cluster the argument according to k or, equivalently, according to k⍳k , and in the result (tolerantly) equal cells are in the same cluster.

For complex arguments, cluster can not be used directly because complex numbers can not be sorted so that equal elements are contiguous. Instead, it is applied to the magnitudes. Per the analysis in [3], the clustering uses 2×⎕ct to overcome the vicissitudes of fixed precision floating point arithmetic.

In both the real and the complex case, we have O(n) algorithms which satisfy the desiderata. Problem solved.
 

References

[0]   Hui, Roger K.W., e-mail to Bob Smith, 2013-04-22.
[1] International Standard ISO/IEC 13751:2001(E), Programming Language Extended APL, 2001-02-01.
[2] Hui, Roger K.W., and Kenneth E. Iverson, J Introduction and Dictionary, 2001-05-03; ~. entry.
[3] Hui, Roger K.W., Index-Of on Multiple Floats, Dyalog User Meeting, Helsingør, Denmark, 2017-09-11.



created:  2013-04-22 09:10
updated: