Finding duplicates in vector

Learning APL or new to Dyalog? Ask "silly" questions here, without fear...

Finding duplicates in vector

Postby alexeyv on Tue Mar 22, 2016 10:10 pm

Hi,

I've decided to write a function to find duplicates (all elements which appears more than once) in a vector. It is a purely educational task.
I tried to make it generic, to support both vector of strings and vector of numbers. Here is my implementation:
      dups←{(1<+/u∘.≡⍵)/u←∪⍵}
dups 1 2 3 2 2 3 4 3 2 5
┌→──┐
│2 3│
└~──┘
dups 'aa' 'bbb' 'cc' 'aa'
┌→─────┐
│ ┌→─┐ │
│ │aa│ │
│ └──┘ │
└∊─────┘

The idea was to first get a list of unique elements, then build a table comparing these elements to source array and filter out those who have more than 1 element in the row of the table. Here I tried to follow the APL approach to add a new dimension to the source data and try to reduce it back.

Please provide any suggestions or alternative more elegant implementations!
alexeyv
 
Posts: 56
Joined: Tue Nov 17, 2015 4:18 pm

Re: Finding duplicates in vector

Postby Phil Last on Tue Mar 22, 2016 10:36 pm

A more usual solution would be:
      {∪((⍵⍳⍵)≠⍳⍴⍵)/⍳⍴⍵}
Here's a tacit equivalent:
      ∪⊢(/⍨)⍳⍨≠⍳∘⍴
User avatar
Phil Last
 
Posts: 628
Joined: Thu Jun 18, 2009 6:29 pm
Location: Wessex

Re: Finding duplicates in vector

Postby Roger|Dyalog on Wed Mar 23, 2016 3:35 am

Phil, I think you meant
      {∪((⍵⍳⍵)≠⍳⍴⍵)/⍵}


Now for something completely different:
      x←?20⍴10
x
3 7 3 3 4 7 8 3 2 8 7 1 2 4 9 7 4 1 4 4
{⍺,≢⍵}⌸x
3 4
7 4
4 5
8 2
2 2
1 2
9 1
(1≠t[;1])/t[;0]⊣t←{⍺,≢⍵}⌸x
3 7 4 8 2 1
Roger|Dyalog
 
Posts: 238
Joined: Thu Jul 28, 2011 10:53 am

Re: Finding duplicates in vector

Postby Veli-Matti on Wed Mar 23, 2016 1:41 pm

Phil, I didn't quite get your last expression, all I get is
      ∪⊢(/⍨)⍳⍨≠⍳∘⍴x
SYNTAX ERROR: The function requires a left argument

..but perhaps tacits are not meant for us mere mortals :)

I'd like also rewrite Roger's idea in a little bit more aplitically correct form:
      {(1<⊢/⍵)/⊣/⍵}{⍺,≢⍵}⌸x

For my amusement there were quite remarkable performance differences between some keyed approaches, such as
      ↑,/{1<≢⍵:⍺ ⋄ ⊂⍬}⌸x
and
      ∊(//)⌽{⍺,1<≢⍵}⌸x


-Veli-Matti
Veli-Matti
 
Posts: 93
Joined: Sat Nov 28, 2009 3:12 pm

Re: Finding duplicates in vector

Postby AndyS|Dyalog on Wed Mar 23, 2016 2:12 pm

Quite coincidentally this week's #onelinerwednesday Tweet (which also appears on http://www.dyalog.com) deals with this question.

The one-liner that has been tweeted is
      (∪(/⍨){1≠≢⍵}⌸) 4 3 2 3 5 2 3
It has the merits of being ⎕io-indepedent, and also, like Roger's expression above, returns the non-unique elements in the order in which they first appear in the original vector.
User avatar
AndyS|Dyalog
 
Posts: 257
Joined: Tue May 12, 2009 6:06 pm

Re: Finding duplicates in vector

Postby Phil Last on Wed Mar 23, 2016 2:19 pm

Roger. Yes, I changed the dfn at the last minute from
      {∪⍵/⍨(⍵⍳⍵)≠⍳⍴⍵}
(always a mistake.)
Veli-Matti.
      ∪⊢(/⍨)⍳⍨≠⍳∘⍴x
is a function. It requires either assignment
      dups←∪⊢(/⍨)⍳⍨≠⍳∘⍴
or parentheses and an argument
      (∪⊢(/⍨)⍳⍨≠⍳∘⍴)arg
to run it in-line.
User avatar
Phil Last
 
Posts: 628
Joined: Thu Jun 18, 2009 6:29 pm
Location: Wessex

Re: Finding duplicates in vector

Postby Veli-Matti on Wed Mar 23, 2016 2:31 pm

I couldn't but run a little test:
      x←?2000⍴10000
cmpx'(∪(/⍨){1≠≢⍵}⌸)x' '∊{1<≢⍵:⍺ ⋄ ⊂⍬}⌸x' '(1≠t[;2])/t[;1]⊣t←{⍺,≢⍵}⌸x' '{(1<⊢/⍵)/⊣/⍵}{⍺,≢⍵}⌸x' '{∪((⍵⍳⍵)≠⍳⍴⍵)/⍵}x'
(∪(/⍨){1≠≢⍵}⌸)x → 1.3E¯3 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
∊{1<≢⍵:⍺ ⋄ ⊂⍬}⌸x → 1.6E¯3 | +19% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
(1≠t[;2])/t[;1]⊣t←{⍺,≢⍵}⌸x → 3.8E¯5 | -98% ⎕
{(1<⊢/⍵)/⊣/⍵}{⍺,≢⍵}⌸x → 3.8E¯5 | -98% ⎕
* {∪((⍵⍳⍵)≠⍳⍴⍵)/⍵}x → 1.5E¯5 | -99%
10↑(∪(/⍨){1≠≢⍵}⌸)x
5874 6973 8115 8713 1386 8735 5212 5708 8672 2474
10↑{∪((⍵⍳⍵)≠⍳⍴⍵)/⍵}x
8672 2944 2650 2369 5874 6592 8413 3907 8568 8050

The shortest is not always the quickest, and quite interestingly the results are in a different order.

-wm
Veli-Matti
 
Posts: 93
Joined: Sat Nov 28, 2009 3:12 pm

Re: Finding duplicates in vector

Postby Roger|Dyalog on Wed Mar 23, 2016 3:08 pm

Interesting benchmarks. I picked the "key" approach because it uses one "expensive" operation ({⍺,≢⍵}⌸) whereas {∪((⍵⍳⍵)≠⍳⍴⍵)/⍵} uses two such operations (∪ and ⍵⍳⍵), but it looks like the "unique" approach is better overall.

There are arguments for which the key approach is faster, illustrating my claim about "expensive operations":

      u←?100⍴2e9
x←u[?4e6⍴≢u]
cmpx '{(1<⊢/⍵)/⊣/⍵}{⍺,≢⍵}⌸x' '{∪((⍵⍳⍵)≠⍳⍴⍵)/⍵}x'
{(1<⊢/⍵)/⊣/⍵}{⍺,≢⍵}⌸x → 2.80E¯2 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* {∪((⍵⍳⍵)≠⍳⍴⍵)/⍵}x → 3.12E¯2 | +11% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

A couple of side notes:

I'd probably rewrite the unique approach as {∪⍵⌿⍨(⍳≢⍵)=⍳⍨⍵}. (a) It'd work for higher-ranked arguments if/when ∪ is extended to higher-ranked arguments; (b) It puts ⍳⍨ on the right where it could theoretically run faster, in case the implementation chooses a slower algorithm due to lack of space; (c) It's shorter.

Secondly, f⌸ has special code for a few cases, and when those kick in the performance difference can be significant. {⍺,≢⍵}⌸ is one of the special cases. A complete list of such cases can be found in the TP4 workshop materials from Dyalog '15 (it's in special.htm in the "special" folder).

Also, APL generally is faster if you give it larger arguments to work with. For example, if you say 1<≢⍵ in the key operand it's operating on a little piece, whereas if you say 1< on the result of f⌸ is operating on a larger piece. This is a point I made in the TP4 workshop. I believe it is one of the goals of the compiler (see Foad, Jay) to address this effect.
Roger|Dyalog
 
Posts: 238
Joined: Thu Jul 28, 2011 10:53 am

Re: Finding duplicates in vector

Postby Roger|Dyalog on Wed Mar 23, 2016 6:48 pm

f⌸ has special code for a few cases, and when those kick in the performance difference can be significant. {⍺,≢⍵}⌸ is one of the special cases.

For example:

      x←?1e6⍴4e5
cmpx '{⍺,≢⍵}⌸x' '{⊢⍺,≢⍵}⌸x'
{⍺,≢⍵}⌸x → 2.29E¯2 | 0% ⎕⎕
{⊢⍺,≢⍵}⌸x → 3.40E¯1 | +1385% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
Roger|Dyalog
 
Posts: 238
Joined: Thu Jul 28, 2011 10:53 am

Re: Finding duplicates in vector

Postby alexeyv on Wed Mar 23, 2016 7:30 pm

Wow thanks for these amazing replies!
Phil, thanks for the solution with ⍵⍳⍵! I completely forgot this idiom and its use-cases. It really looks like idiomatic way to do it!
However I'm having a hard time trying to understand tacit version. Maybe it is just me but tacit version doesn't look readable even if I draw it like a tree
      ∪⊢(/⍨)⍳⍨≠⍳∘⍴
┌─┴─┐
∪ ┌─┼───┐
⊢ ⍨ ┌─┼─┐
┌─┘ ⍨ ≠ ∘
/ ┌─┘ ┌┴┐
⍳ ⍳ ⍴


Can you explain what is happening here? For example why right tack is used and why compose is used as well.
Probably it could be helpful to have an option in interpreter to rewrite expression in tacit notation in the typical lambda-function form?

Roger, the idea of using keys (or hashtables) was something I thought about (and actually implemented in another language - just a hashtable holding number of elements every input element is encountered), but forgot what it is available in Dyalog APL. It is really interesting and educational to see how it is used here.
alexeyv
 
Posts: 56
Joined: Tue Nov 17, 2015 4:18 pm

Next

Return to New to Dyalog?

Who is online

Users browsing this forum: No registered users and 1 guest