Finding duplicates in vector

by **alexeyv** on Tue Mar 22, 2016 10:10 pm

Hi,

I've decided to write a function to find duplicates (all elements which appears more than once) in a vector. It is a purely educational task.
I tried to make it generic, to support both vector of strings and vector of numbers. Here is my implementation:

      dups←{(1<+/u∘.≡⍵)/u←∪⍵}
      dups 1 2 3 2 2 3 4 3 2 5
┌→──┐
│2 3│
└~──┘
      dups 'aa' 'bbb' 'cc' 'aa'
┌→─────┐
│ ┌→─┐ │
│ │aa│ │
│ └──┘ │
└∊─────┘

The idea was to first get a list of unique elements, then build a table comparing these elements to source array and filter out those who have more than 1 element in the row of the table. Here I tried to follow the APL approach to add a new dimension to the source data and try to reduce it back.

Please provide any suggestions or alternative more elegant implementations!

by **Phil Last** on Tue Mar 22, 2016 10:36 pm

A more usual solution would be:

      {∪((⍵⍳⍵)≠⍳⍴⍵)/⍳⍴⍵}

Here's a tacit equivalent:

      ∪⊢(/⍨)⍳⍨≠⍳∘⍴

by **Roger|Dyalog** on Wed Mar 23, 2016 3:35 am

Phil, I think you meant

      {∪((⍵⍳⍵)≠⍳⍴⍵)/⍵}

Now for something completely different:

      x←?20⍴10
      x
3 7 3 3 4 7 8 3 2 8 7 1 2 4 9 7 4 1 4 4
      {⍺,≢⍵}⌸x
3 4
7 4
4 5
8 2
2 2
1 2
9 1
      (1≠t[;1])/t[;0]⊣t←{⍺,≢⍵}⌸x
3 7 4 8 2 1

by **Veli-Matti** on Wed Mar 23, 2016 1:41 pm

Phil, I didn't quite get your last expression, all I get is

      ∪⊢(/⍨)⍳⍨≠⍳∘⍴x
SYNTAX ERROR: The function requires a left argument

..but perhaps tacits are not meant for us mere mortals :)

I'd like also rewrite Roger's idea in a little bit more aplitically correct form:

      {(1<⊢/⍵)/⊣/⍵}{⍺,≢⍵}⌸x

For my amusement there were quite remarkable performance differences between some keyed approaches, such as

      ↑,/{1<≢⍵:⍺ ⋄ ⊂⍬}⌸x

and

      ∊(//)⌽{⍺,1<≢⍵}⌸x

-Veli-Matti

by **AndyS|Dyalog** on Wed Mar 23, 2016 2:12 pm

Quite coincidentally this week's #onelinerwednesday Tweet (which also appears on http://www.dyalog.com) deals with this question.

The one-liner that has been tweeted is

      (∪(/⍨){1≠≢⍵}⌸) 4 3 2 3 5 2 3

It has the merits of being ⎕io-indepedent, and also, like Roger's expression above, returns the non-unique elements in the order in which they first appear in the original vector.

by **Phil Last** on Wed Mar 23, 2016 2:19 pm

Roger. Yes, I changed the dfn at the last minute from

      {∪⍵/⍨(⍵⍳⍵)≠⍳⍴⍵}

(always a mistake.)
Veli-Matti.

      ∪⊢(/⍨)⍳⍨≠⍳∘⍴x

is a function. It requires either assignment

      dups←∪⊢(/⍨)⍳⍨≠⍳∘⍴

or parentheses and an argument

      (∪⊢(/⍨)⍳⍨≠⍳∘⍴)arg

to run it in-line.

by **Veli-Matti** on Wed Mar 23, 2016 2:31 pm

I couldn't but run a little test:

      x←?2000⍴10000
      cmpx'(∪(/⍨){1≠≢⍵}⌸)x' '∊{1<≢⍵:⍺ ⋄ ⊂⍬}⌸x' '(1≠t[;2])/t[;1]⊣t←{⍺,≢⍵}⌸x' '{(1<⊢/⍵)/⊣/⍵}{⍺,≢⍵}⌸x' '{∪((⍵⍳⍵)≠⍳⍴⍵)/⍵}x'
  (∪(/⍨){1≠≢⍵}⌸)x            → 1.3E¯3 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕       
  ∊{1<≢⍵:⍺ ⋄ ⊂⍬}⌸x           → 1.6E¯3 | +19% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  (1≠t[;2])/t[;1]⊣t←{⍺,≢⍵}⌸x → 3.8E¯5 | -98% ⎕                                       
  {(1<⊢/⍵)/⊣/⍵}{⍺,≢⍵}⌸x      → 3.8E¯5 | -98% ⎕                                       
* {∪((⍵⍳⍵)≠⍳⍴⍵)/⍵}x          → 1.5E¯5 | -99%                                         
      10↑(∪(/⍨){1≠≢⍵}⌸)x  
5874 6973 8115 8713 1386 8735 5212 5708 8672 2474
      10↑{∪((⍵⍳⍵)≠⍳⍴⍵)/⍵}x
8672 2944 2650 2369 5874 6592 8413 3907 8568 8050

The shortest is not always the quickest, and quite interestingly the results are in a different order.

-wm

by **Roger|Dyalog** on Wed Mar 23, 2016 3:08 pm

Interesting benchmarks. I picked the "key" approach because it uses one "expensive" operation ({⍺,≢⍵}⌸) whereas {∪((⍵⍳⍵)≠⍳⍴⍵)/⍵} uses two such operations (∪ and ⍵⍳⍵), but it looks like the "unique" approach is better overall.

There are arguments for which the key approach is faster, illustrating my claim about "expensive operations":

      u←?100⍴2e9
      x←u[?4e6⍴≢u]
      cmpx '{(1<⊢/⍵)/⊣/⍵}{⍺,≢⍵}⌸x' '{∪((⍵⍳⍵)≠⍳⍴⍵)/⍵}x'
  {(1<⊢/⍵)/⊣/⍵}{⍺,≢⍵}⌸x → 2.80E¯2 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕   
* {∪((⍵⍳⍵)≠⍳⍴⍵)/⍵}x     → 3.12E¯2 | +11% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

A couple of side notes:

I'd probably rewrite the unique approach as {∪⍵⌿⍨(⍳≢⍵)=⍳⍨⍵}. (a) It'd work for higher-ranked arguments if/when ∪ is extended to higher-ranked arguments; (b) It puts ⍳⍨ on the right where it could theoretically run faster, in case the implementation chooses a slower algorithm due to lack of space; (c) It's shorter.

Secondly, f⌸ has special code for a few cases, and when those kick in the performance difference can be significant. {⍺,≢⍵}⌸ is one of the special cases. A complete list of such cases can be found in the TP4 workshop materials from Dyalog '15 (it's in special.htm in the "special" folder).

Also, APL generally is faster if you give it larger arguments to work with. For example, if you say 1<≢⍵ in the key operand it's operating on a little piece, whereas if you say 1< on the result of f⌸ is operating on a larger piece. This is a point I made in the TP4 workshop. I believe it is one of the goals of the compiler (see Foad, Jay) to address this effect.

by **Roger|Dyalog** on Wed Mar 23, 2016 6:48 pm

f⌸ has special code for a few cases, and when those kick in the performance difference can be significant. {⍺,≢⍵}⌸ is one of the special cases.

For example:

      x←?1e6⍴4e5
      cmpx '{⍺,≢⍵}⌸x' '{⊢⍺,≢⍵}⌸x'
  {⍺,≢⍵}⌸x  → 2.29E¯2 |     0% ⎕⎕                            
  {⊢⍺,≢⍵}⌸x → 3.40E¯1 | +1385% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

by **alexeyv** on Wed Mar 23, 2016 7:30 pm

Wow thanks for these amazing replies!
Phil, thanks for the solution with ⍵⍳⍵! I completely forgot this idiom and its use-cases. It really looks like idiomatic way to do it!
However I'm having a hard time trying to understand tacit version. Maybe it is just me but tacit version doesn't look readable even if I draw it like a tree

      ∪⊢(/⍨)⍳⍨≠⍳∘⍴
┌─┴─┐       
∪ ┌─┼───┐   
  ⊢ ⍨ ┌─┼─┐ 
  ┌─┘ ⍨ ≠ ∘ 
  / ┌─┘  ┌┴┐
    ⍳    ⍳ ⍴

Can you explain what is happening here? For example why right tack is used and why compose is used as well.
Probably it could be helpful to have an option in interpreter to rewrite expression in tacit notation in the typical lambda-function form?

Roger, the idea of using keys (or hashtables) was something I thought about (and actually implemented in another language - just a hashtable holding number of elements every input element is encountered), but forgot what it is available in Dyalog APL. It is really interesting and educational to see how it is used here.

The tool of thought for

software solutions

Finding duplicates in vector

Finding duplicates in vector

Re: Finding duplicates in vector

Re: Finding duplicates in vector

Re: Finding duplicates in vector

Re: Finding duplicates in vector

Re: Finding duplicates in vector

Re: Finding duplicates in vector

Re: Finding duplicates in vector

Re: Finding duplicates in vector

Re: Finding duplicates in vector

Who is online

QUICK LINKS