Tolerant Unique In this presentation, the index origin is 0,
although the ideas, algorithms, code, examples,
,
work in 1-origin with obvious modifications.
It’s undesirable that the unique of x does not contain all of x . The problem is the same if, instead of 10, we’d used 1e6. For this problem, the best solution is to set ⎕ct to 0,
that is, use exact comparisons.
The APL user can make this choice depending on the application.
The APL implementer can not.
The bad implementation of unique apparently stems from the bad meme {((⍳≢⍵)=⍵⍳⍵)⌿⍵} , found in APL textbooks, papers, idiom lists, folklore, etc., dating from the earliest days of APL. The expression depends on = being transitive, which it is if comparisons are exact but not if 0≠⎕ct and floating point numbers are involved. With one exception, every APL and APL dialect which has the unique primitive, has it wrong. (The one exception is because we told them about it and they fixed it [0].) Fixing unique is a non-compatible change.
It also affects key (⌸).
We propose to apply the bug fix starting with version 17.0.
We also propose to extend the definition of ∪ so that it
finds the unique major cells.
For example, on a matrix argument ∪ would find the unique rows.
Since tolerant unique is so problematic, it behooves us to specify desiderata that any unique function should satisfy. The set of desiderata need not be minimal but ideally should reject an incorrect unique function, and on a failure it should be obvious what the problem is. Regarding the monadic operator UD :
Another desideratum is that unique should be O(n) .
In the operator UT ,
a benchmark is run on arguments which doubles in size at each step.
The execution time should no more than double.
Specifications for computing unique can be found in two sources: the ISO standard on extended APL [1] and the J dictionary [2]. The ISO APL standard specs work only on scalar or vector arguments; the J specs work on arguments of any rank. (In J, “item” means major cell.) We choose to use the J specs because they are easier to understand.
The specs are readily implemented as uniq1 . It is instructive to compare it to a version uniq1v which works only on vectors and scalars. There is a prima facie case that uniq1
satisfies the desiderata UD .
Unfortunately, it does not run in O(n) time,
with the running time more than doubling as the argument size doubles.
The clustering/hashing technique developed for Index-Of on Multiple Floats [3] presented earlier in this User Meeting is applicable. Looking at Uniq :
The local function is O(n*2) within each cluster, but the overall time for Uniq is O(n) , unless the clustering is “unlucky”. The function cluster makes essential use of interval index (⍸), which requires that the argument be sortable. For real arguments, using Jay Foad’s key insight, cluster is applied independently to each column to result in an integer array k . You can then cluster the argument according to k or, equivalently, according to k⍳k , and in the result (tolerantly) equal cells are in the same cluster. For complex arguments, cluster can not be used directly because complex numbers can not be sorted so that equal elements are contiguous. Instead, it is applied to the magnitudes. Per the analysis in [3], the clustering uses 2×⎕ct to overcome the vicissitudes of fixed precision floating point arithmetic. In both the real and the complex case,
we have O(n) algorithms which satisfy the desiderata.
Problem solved.
References
|