Finding duplicates in vector
25 posts
• Page 2 of 3 • 1, 2, 3
Re: Finding duplicates in vector
∪⊢(/⍨)⍳⍨≠⍳∘⍴
It was basically the first thing I thought of, so it could probably have been clearer, as well as faster, as we've seen. As you can see from the tree, the outside is a 2-tine "atop" where unique is applied to the result of the rest, which is a 3-tine "fork":
┌─┴─┐
∪ ┌─┼───┐
⊢ ⍨ ┌─┼─┐
┌─┘ ⍨ ≠ ∘
/ ┌─┘ ┌┴┐
⍳ ⍳ ⍴
I've tried before, and the only way I've discovered to express
(f ⍵)/⍵
as a fork is, because of the "hybrid" nature of "/",
⊢(/⍨)f
where the right tack "⊢" stands in for "⍵". So "f" is the right fork
⍳⍨≠⍳∘⍴
which, redundantly parenthesised, is
(⍳⍨)≠(⍳∘⍴)
the jot "∘" holding the right tine together, which alternatively could have been another atop
(⍳⍴)
with the required parentheses.
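For readers without an APL session to hand, here is a minimal Python sketch of what the fork computes, step by step; the function name `duplicates` and the list-based representation are my own, not from the thread:

```python
def duplicates(v):
    """Distinct items of v that occur more than once, in order of appearance."""
    # ⍳⍨ (self-index, x⍳x): index of the first occurrence of each item
    first = [v.index(e) for e in v]
    # ≠ against ⍳∘⍴ (own positions): mark items whose first occurrence is earlier
    mask = [f != i for i, f in enumerate(first)]
    # /⍨ (replicate-commute): keep only the marked repeats
    repeats = [e for e, m in zip(v, mask) if m]
    # ∪ (unique): deduplicate, preserving first-seen order
    return list(dict.fromkeys(repeats))

print(duplicates([3, 1, 4, 1, 5, 9, 2, 6, 53, 58, 9, 7, 9]))  # → [1, 9]
```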
-
Phil Last - Posts: 628
- Joined: Thu Jun 18, 2009 6:29 pm
- Location: Wessex
Re: Finding duplicates in vector
Slightly off topic, but the above reminded me that on 27/08/2010, in a comp.lang.apl thread called "what makes it APL?", I wrote with respect to the binding of trains, wherein tines that are themselves derived functions do not need to be parenthesised:
    even if I could be persuaded that the parsing is consistent I'd never be comfortable with it.
To which Roger replied:
    "Never" is a long time, and your comfort may increase with time and usage.
Well, it seems that Roger was correct.
-
Phil Last - Posts: 628
- Joined: Thu Jun 18, 2009 6:29 pm
- Location: Wessex
Re: Finding duplicates in vector
Hi Phil, thanks for the explanations! It took me about half an hour to understand it partly (experimenting along the way). But the final (or perhaps initial) part I don't really understand is this equivalence:
⊢(/⍨)f iff (f ⍵)/⍵
Can you provide some more detailed insights? Otherwise I start to feel that my mind is blowing up :)
- alexeyv
- Posts: 56
- Joined: Tue Nov 17, 2015 4:18 pm
Re: Finding duplicates in vector
To make a concrete example:
f←1=2∘| ⍝ 1 iff odd
x←3 1 4 1 5 9 2 6 53 58 9 7 9
Now a sequence of equivalences is displayed. If you enter them in your APL session each line should give the same result as the previous:
(⊢ (/⍨) f) x
(⊢x) (/⍨) (f x) ⍝ definition of fork, (f g h) x ←→ (f x) g (h x)
(f x) / (⊢x) ⍝ definition of ⍨
(f x) / x ⍝ definition of ⊢
QED
In this case there is a complication in that / is a "hybrid", sometimes an operator (as in +/) and sometimes a function (as in /⍨). That this is so is due to historical reasons. If / were replaced by a syntactically orthodox function, such as {⍺/⍵}, you can get a simpler derivation:
(f {⍺/⍵} ⊢) x
(f x) {⍺/⍵} (⊢x) ⍝ definition of fork, (f g h) x ←→ (f x) g (h x)
(f x) {⍺/⍵} x ⍝ definition of ⊢
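The equivalence chain can also be checked mechanically. Here is a hedged Python sketch of the same derivation, with `compress` standing in for {⍺/⍵} (my naming, for illustration only):

```python
x = [3, 1, 4, 1, 5, 9, 2, 6, 53, 58, 9, 7, 9]

def f(w):
    # f←1=2∘|  ⍝ 1 iff odd
    return [n % 2 for n in w]

def compress(mask, w):
    # {⍺/⍵}: select the items of w where mask is 1
    return [e for e, m in zip(w, mask) if m]

# (f {⍺/⍵} ⊢) x  ←→  (f x) {⍺/⍵} (⊢x)  ←→  (f x) {⍺/⍵} x
print(compress(f(x), x))  # → [3, 1, 1, 5, 9, 53, 9, 7, 9]
```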
- Roger|Dyalog
- Posts: 238
- Joined: Thu Jul 28, 2011 10:53 am
Re: Finding duplicates in vector
I've just realised that I've taken so long to answer that Roger has beaten me to it. Oh well. I wrote all following this sentence before I noticed. First let me say that this search was more of an intellectual conceit than a necessary development.
Expressions similar to (f w)/w are ubiquitous in APL but have the awkward property of using the argument twice, necessitating its being named or, if it is the result of an expression, making it subject to parameter substitution in a dfn as {(f ⍵)/⍵}. Now this obviously isn't good form, because we are now using an unknown function f inside a dfn, so for form's sake we should parameterise that as well and code f{(⍺⍺ ⍵)/⍵}.
Forks are useful for algorithms that re-use a value by applying two functions to the same one or two arguments before applying a third between the two results:
(f0 f1 f2)w ←→ (f0 w)f1(f2 w)
Now fitting (f w)/w into that tripartite pattern as (f w) (/) (w) requires the mappings
f0 ←→ f ⋄ f1 ←→ / ⋄ f2 ←→ ⊢
where (⊢) merely returns its argument, leading to the fork (f / ⊢). Unfortunately the binding rules tell us this isn't a fork for the reason that the f and / bind first to become (f /) because the hybrid (/) has a function to its left and we get the 2-train ((f /) ⊢) that is merely (f /) which we don't want.
No amount of parenthesising can unwind this. (f (/) ⊢) is still the same, because parenthesising (/) no longer distinguishes it as a function, if it ever did, as adverbs (monadic operators) are now also allowed to be parenthesised and assigned.
As things currently stand we cannot parenthesise a pair of adverbs (try each commute, ¨⍨) even though doing so would merely produce another adverb, so if we code (/⍨) the parser has the choice of SYNTAX ERROR or allowing it as the function replicate-commute.
If ever adverb binding is permitted, and there are good reasons for doing so, then (/⍨) may have to change its meaning.
Nevertheless, pro tem, we have it as a function, so we can switch our two outer functions (f) and (⊢) to accommodate the switching being done by commute (⍨) and write
(⊢(/⍨)f)
which I read as "right where f". And it's actually shorter and faster than
f{(⍺⍺ ⍵)/⍵}
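The contrast between the dfn-operator form and the fork can be sketched in Python, where the operator form becomes a higher-order function; the name `filter_by` is my own, not from the thread:

```python
def filter_by(f):
    # analogue of the dfn-operator f{(⍺⍺ ⍵)/⍵}: the operand f produces
    # a boolean mask from ⍵, and the result replicates ⍵ by that mask
    return lambda w: [e for e, m in zip(w, f(w)) if m]

odd = lambda w: [n % 2 for n in w]
print(filter_by(odd)([3, 1, 4, 1, 5, 9, 2, 6]))  # → [3, 1, 1, 5, 9]
```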
-
Phil Last - Posts: 628
- Joined: Thu Jun 18, 2009 6:29 pm
- Location: Wessex
Re: Finding duplicates in vector
Sorry Phil for "butting in". I thought you may have been away on the long long weekend.
Phil Last wrote: "shorter and faster"
When J was first implemented it was Eugene McDonnell who discovered that forks are faster than alternatives, to my surprise. This still holds true today in Dyalog APL:
x←3 1 4 1 5 9 2 6 53 58 9 7 9
d ← {((⍳≢⍵)≠⍵⍳⍵)/⍵}
f ← ⊢ (⌿⍨) ⍳∘≢ ≠ ⍳⍨
cmpx 'd x' 'f x'
d x → 1.34E¯6 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
f x → 1.23E¯6 | -9% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
The reason for the time difference is that d is parsed each time it is executed, but f is not. (I distinguish between parsing and tokenizing; d is tokenized once but parsed each time.) I conjecture that the internal form of f is close to what you get from the Dyalog compiler, so that it's as if d were compiled. On smaller arguments this is significant because on smaller arguments interpretive overhead is relatively larger. I expect that on larger arguments the time difference between d and f would be less. And so it is:
y←?1e4⍴9e3
cmpx 'd y' 'f y'
d y → 4.93E¯5 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
f y → 4.93E¯5 | -1% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
It would be interesting to do a timing comparison between the compiled form of d, and f.
- Roger|Dyalog
- Posts: 238
- Joined: Thu Jul 28, 2011 10:53 am
Re: Finding duplicates in vector
Roger wrote: "butting in"
Not at all! If only I were so succinct.
-
Phil Last - Posts: 628
- Joined: Thu Jun 18, 2009 6:29 pm
- Location: Wessex
Re: Finding duplicates in vector
Thanks a lot for the detailed explanations! Now I understand that the right tack ⊢ is used as an identity function, as typically used in functional programming for situations like this, and I understand the role of commute here.
One more thing: in the book "APL: An Interactive Approach" (3rd edition) by Gilman and Rose, page 130, they write about a unique function ∩ (not ∪), which returns 1s marking the first occurrence of each item in an array. Here is the example from the book:
∩R←3 2 3 4 2 1 0 1
1 1 0 1 0 1 1 0
(∩R)/R
3 2 4 1 0
It seems to do the same as
{(⍵⍳⍵)=⍳⍴⍵}.
They claim that this function is available in APL2. However, neither Dyalog APL nor GNU APL seems to have it. Why is that, and what is the history behind it?
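The claimed equivalence is easy to check on the book's example. A Python transliteration of {(⍵⍳⍵)=⍳⍴⍵} (list-based, my own sketch):

```python
def nub_mask(r):
    # (⍵⍳⍵)=⍳⍴⍵: 1 where an item's first-occurrence index equals its own index
    return [int(r.index(e) == i) for i, e in enumerate(r)]

R = [3, 2, 3, 4, 2, 1, 0, 1]
print(nub_mask(R))                               # → [1, 1, 0, 1, 0, 1, 1, 0]
print([e for e, m in zip(R, nub_mask(R)) if m])  # (∩R)/R → [3, 2, 4, 1, 0]
```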
- alexeyv
- Posts: 56
- Joined: Tue Nov 17, 2015 4:18 pm
Re: Finding duplicates in vector
That's right, Dyalog APL does not yet have a function that returns a mask for selecting the distinct items. Iverson called it the "nubsieve", the sieve that selects the nub (AKA unique).
The nubsieve function will use the extended definition of ⍳, one which works on major cells rather than merely scalars. Thus:
nubsieve ← ⍳∘≢ = ⍳⍨
⊢ x ← ↑ 3 ⍴¨ 'kakistocracy'
kkk
aaa
kkk
iii
sss
ttt
ooo
ccc
rrr
aaa
ccc
yyy
nubsieve x
1 1 0 1 1 1 1 1 1 0 0 1
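A hedged Python sketch of the same nubsieve on major cells, representing the character matrix as a list of row strings (my representation, not Dyalog's array model):

```python
def nubsieve(cells):
    # ⍳∘≢ = ⍳⍨ : 1 where a major cell's first-occurrence index
    # equals its own position
    return [int(cells.index(c) == i) for i, c in enumerate(cells)]

x = [c * 3 for c in 'kakistocracy']  # rows of ↑ 3 ⍴¨ 'kakistocracy'
print(nubsieve(x))  # → [1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1]
```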
- Roger|Dyalog
- Posts: 238
- Joined: Thu Jul 28, 2011 10:53 am
Re: Finding duplicates in vector
Timings on a dfn variant d1 which more closely mimics the fork f, and on the compiled versions of the dfns:
d ← {((⍳≢⍵)≠⍵⍳⍵)/⍵}
f ← ⊢ (⌿⍨) ⍳∘≢ ≠ ⍳⍨
d1 ← {⍵⌿⍨(⍳≢⍵)≠⍳⍨⍵}
c ←d ⋄ 2(400⌶)'c'
c1←d1 ⋄ 2(400⌶)'c1'
cmpx 'd x' 'd1 x' 'f x' 'c x' 'c1 x'
d x → 1.35E¯6 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
d1 x → 1.60E¯6 | +18% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
f x → 1.24E¯6 | -9% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
c x → 1.13E¯6 | -17% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
c1 x → 1.13E¯6 | -17% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
Compiler wins!
Timing improvements on small arguments are very hard to come by because there isn't much wriggle room, not as many elements over which to amortize the fixed cost of interpretive overhead.
- Roger|Dyalog
- Posts: 238
- Joined: Thu Jul 28, 2011 10:53 am