Finding duplicates in vector
25 posts
• Page 1 of 3 • 1, 2, 3
Finding duplicates in vector
Hi,
I've decided to write a function to find duplicates (all elements which appears more than once) in a vector. It is a purely educational task.
I tried to make it generic, to support both vector of strings and vector of numbers. Here is my implementation:
The idea was to first get a list of unique elements, then build a table comparing these elements to source array and filter out those who have more than 1 element in the row of the table. Here I tried to follow the APL approach to add a new dimension to the source data and try to reduce it back.
Please provide any suggestions or alternative more elegant implementations!
I've decided to write a function to find duplicates (all elements which appears more than once) in a vector. It is a purely educational task.
I tried to make it generic, to support both vector of strings and vector of numbers. Here is my implementation:
dups←{(1<+/u∘.≡⍵)/u←∪⍵}
dups 1 2 3 2 2 3 4 3 2 5
┌→──┐
│2 3│
└~──┘
dups 'aa' 'bbb' 'cc' 'aa'
┌→─────┐
│ ┌→─┐ │
│ │aa│ │
│ └──┘ │
└∊─────┘
The idea was to first get a list of unique elements, then build a table comparing these elements to source array and filter out those who have more than 1 element in the row of the table. Here I tried to follow the APL approach to add a new dimension to the source data and try to reduce it back.
Please provide any suggestions or alternative more elegant implementations!
- alexeyv
- Posts: 56
- Joined: Tue Nov 17, 2015 4:18 pm
Re: Finding duplicates in vector
A more usual solution would be:
{∪((⍵⍳⍵)≠⍳⍴⍵)/⍳⍴⍵}Here's a tacit equivalent:
∪⊢(/⍨)⍳⍨≠⍳∘⍴
-
Phil Last - Posts: 628
- Joined: Thu Jun 18, 2009 6:29 pm
- Location: Wessex
Re: Finding duplicates in vector
Phil, I think you meant
Now for something completely different:
{∪((⍵⍳⍵)≠⍳⍴⍵)/⍵}
Now for something completely different:
x←?20⍴10
x
3 7 3 3 4 7 8 3 2 8 7 1 2 4 9 7 4 1 4 4
{⍺,≢⍵}⌸x
3 4
7 4
4 5
8 2
2 2
1 2
9 1
(1≠t[;1])/t[;0]⊣t←{⍺,≢⍵}⌸x
3 7 4 8 2 1
- Roger|Dyalog
- Posts: 238
- Joined: Thu Jul 28, 2011 10:53 am
Re: Finding duplicates in vector
Phil, I didn't quite get your last expression, all I get is
..but perhaps tacits are not meant for us mere mortals :)
I'd like also rewrite Roger's idea in a little bit more aplitically correct form:
For my amusement there were quite remarkable performance differences between some keyed approaches, such as
-Veli-Matti
∪⊢(/⍨)⍳⍨≠⍳∘⍴x
SYNTAX ERROR: The function requires a left argument
..but perhaps tacits are not meant for us mere mortals :)
I'd like also rewrite Roger's idea in a little bit more aplitically correct form:
{(1<⊢/⍵)/⊣/⍵}{⍺,≢⍵}⌸x
For my amusement there were quite remarkable performance differences between some keyed approaches, such as
↑,/{1<≢⍵:⍺ ⋄ ⊂⍬}⌸xand
∊(//)⌽{⍺,1<≢⍵}⌸x
-Veli-Matti
- Veli-Matti
- Posts: 93
- Joined: Sat Nov 28, 2009 3:12 pm
Re: Finding duplicates in vector
Quite coincidentally this week's #onelinerwednesday Tweet (which also appears on http://www.dyalog.com) deals with this question.
The one-liner that has been tweeted is
The one-liner that has been tweeted is
(∪(/⍨){1≠≢⍵}⌸) 4 3 2 3 5 2 3It has the merits of being ⎕io-indepedent, and also, like Roger's expression above, returns the non-unique elements in the order in which they first appear in the original vector.
-
AndyS|Dyalog - Posts: 257
- Joined: Tue May 12, 2009 6:06 pm
Re: Finding duplicates in vector
Roger. Yes, I changed the dfn at the last minute from
Veli-Matti.
{∪⍵/⍨(⍵⍳⍵)≠⍳⍴⍵}(always a mistake.)
Veli-Matti.
∪⊢(/⍨)⍳⍨≠⍳∘⍴xis a function. It requires either assignment
dups←∪⊢(/⍨)⍳⍨≠⍳∘⍴or parentheses and an argument
(∪⊢(/⍨)⍳⍨≠⍳∘⍴)argto run it in-line.
-
Phil Last - Posts: 628
- Joined: Thu Jun 18, 2009 6:29 pm
- Location: Wessex
Re: Finding duplicates in vector
I couldn't but run a little test:
The shortest is not always the quickest, and quite interestingly the results are in a different order.
-wm
x←?2000⍴10000
cmpx'(∪(/⍨){1≠≢⍵}⌸)x' '∊{1<≢⍵:⍺ ⋄ ⊂⍬}⌸x' '(1≠t[;2])/t[;1]⊣t←{⍺,≢⍵}⌸x' '{(1<⊢/⍵)/⊣/⍵}{⍺,≢⍵}⌸x' '{∪((⍵⍳⍵)≠⍳⍴⍵)/⍵}x'
(∪(/⍨){1≠≢⍵}⌸)x → 1.3E¯3 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
∊{1<≢⍵:⍺ ⋄ ⊂⍬}⌸x → 1.6E¯3 | +19% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
(1≠t[;2])/t[;1]⊣t←{⍺,≢⍵}⌸x → 3.8E¯5 | -98% ⎕
{(1<⊢/⍵)/⊣/⍵}{⍺,≢⍵}⌸x → 3.8E¯5 | -98% ⎕
* {∪((⍵⍳⍵)≠⍳⍴⍵)/⍵}x → 1.5E¯5 | -99%
10↑(∪(/⍨){1≠≢⍵}⌸)x
5874 6973 8115 8713 1386 8735 5212 5708 8672 2474
10↑{∪((⍵⍳⍵)≠⍳⍴⍵)/⍵}x
8672 2944 2650 2369 5874 6592 8413 3907 8568 8050
The shortest is not always the quickest, and quite interestingly the results are in a different order.
-wm
- Veli-Matti
- Posts: 93
- Joined: Sat Nov 28, 2009 3:12 pm
Re: Finding duplicates in vector
Interesting benchmarks. I picked the "key" approach because it uses one "expensive" operation ({⍺,≢⍵}⌸) whereas {∪((⍵⍳⍵)≠⍳⍴⍵)/⍵} uses two such operations (∪ and ⍵⍳⍵), but it looks like the "unique" approach is better overall.
There are arguments for which the key approach is faster, illustrating my claim about "expensive operations":
A couple of side notes:
I'd probably rewrite the unique approach as {∪⍵⌿⍨(⍳≢⍵)=⍳⍨⍵}. (a) It'd work for higher-ranked arguments if/when ∪ is extended to higher-ranked arguments; (b) It puts ⍳⍨ on the right where it could theoretically run faster, in case the implementation chooses a slower algorithm due to lack of space; (c) It's shorter.
Secondly, f⌸ has special code for a few cases, and when those kick in the performance difference can be significant. {⍺,≢⍵}⌸ is one of the special cases. A complete list of such cases can be found in the TP4 workshop materials from Dyalog '15 (it's in special.htm in the "special" folder).
Also, APL generally is faster if you give it larger arguments to work with. For example, if you say 1<≢⍵ in the key operand it's operating on a little piece, whereas if you say 1< on the result of f⌸ is operating on a larger piece. This is a point I made in the TP4 workshop. I believe it is one of the goals of the compiler (see Foad, Jay) to address this effect.
There are arguments for which the key approach is faster, illustrating my claim about "expensive operations":
u←?100⍴2e9
x←u[?4e6⍴≢u]
cmpx '{(1<⊢/⍵)/⊣/⍵}{⍺,≢⍵}⌸x' '{∪((⍵⍳⍵)≠⍳⍴⍵)/⍵}x'
{(1<⊢/⍵)/⊣/⍵}{⍺,≢⍵}⌸x → 2.80E¯2 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* {∪((⍵⍳⍵)≠⍳⍴⍵)/⍵}x → 3.12E¯2 | +11% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
A couple of side notes:
I'd probably rewrite the unique approach as {∪⍵⌿⍨(⍳≢⍵)=⍳⍨⍵}. (a) It'd work for higher-ranked arguments if/when ∪ is extended to higher-ranked arguments; (b) It puts ⍳⍨ on the right where it could theoretically run faster, in case the implementation chooses a slower algorithm due to lack of space; (c) It's shorter.
Secondly, f⌸ has special code for a few cases, and when those kick in the performance difference can be significant. {⍺,≢⍵}⌸ is one of the special cases. A complete list of such cases can be found in the TP4 workshop materials from Dyalog '15 (it's in special.htm in the "special" folder).
Also, APL generally is faster if you give it larger arguments to work with. For example, if you say 1<≢⍵ in the key operand it's operating on a little piece, whereas if you say 1< on the result of f⌸ is operating on a larger piece. This is a point I made in the TP4 workshop. I believe it is one of the goals of the compiler (see Foad, Jay) to address this effect.
- Roger|Dyalog
- Posts: 238
- Joined: Thu Jul 28, 2011 10:53 am
Re: Finding duplicates in vector
f⌸ has special code for a few cases, and when those kick in the performance difference can be significant. {⍺,≢⍵}⌸ is one of the special cases.
For example:
x←?1e6⍴4e5
cmpx '{⍺,≢⍵}⌸x' '{⊢⍺,≢⍵}⌸x'
{⍺,≢⍵}⌸x → 2.29E¯2 | 0% ⎕⎕
{⊢⍺,≢⍵}⌸x → 3.40E¯1 | +1385% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
- Roger|Dyalog
- Posts: 238
- Joined: Thu Jul 28, 2011 10:53 am
Re: Finding duplicates in vector
Wow thanks for these amazing replies!
Phil, thanks for the solution with ⍵⍳⍵! I completely forgot this idiom and its use-cases. It really looks like idiomatic way to do it!
However I'm having a hard time trying to understand tacit version. Maybe it is just me but tacit version doesn't look readable even if I draw it like a tree
Can you explain what is happening here? For example why right tack is used and why compose is used as well.
Probably it could be helpful to have an option in interpreter to rewrite expression in tacit notation in the typical lambda-function form?
Roger, the idea of using keys (or hashtables) was something I thought about (and actually implemented in another language - just a hashtable holding number of elements every input element is encountered), but forgot what it is available in Dyalog APL. It is really interesting and educational to see how it is used here.
Phil, thanks for the solution with ⍵⍳⍵! I completely forgot this idiom and its use-cases. It really looks like idiomatic way to do it!
However I'm having a hard time trying to understand tacit version. Maybe it is just me but tacit version doesn't look readable even if I draw it like a tree
∪⊢(/⍨)⍳⍨≠⍳∘⍴
┌─┴─┐
∪ ┌─┼───┐
⊢ ⍨ ┌─┼─┐
┌─┘ ⍨ ≠ ∘
/ ┌─┘ ┌┴┐
⍳ ⍳ ⍴
Can you explain what is happening here? For example why right tack is used and why compose is used as well.
Probably it could be helpful to have an option in interpreter to rewrite expression in tacit notation in the typical lambda-function form?
Roger, the idea of using keys (or hashtables) was something I thought about (and actually implemented in another language - just a hashtable holding number of elements every input element is encountered), but forgot what it is available in Dyalog APL. It is really interesting and educational to see how it is used here.
- alexeyv
- Posts: 56
- Joined: Tue Nov 17, 2015 4:18 pm
25 posts
• Page 1 of 3 • 1, 2, 3
Who is online
Users browsing this forum: No registered users and 1 guest
Powered by phpBB © 2000, 2002, 2005, 2007 phpBB Group