Finding duplicates in vector
25 posts
• Page 2 of 3 • 1, 2, 3
Re: Finding duplicates in vector
∪⊢(/⍨)⍳⍨≠⍳∘⍴
It was basically the first thing I thought of, so it could probably have been clearer, as well as faster, as we've seen. As you can see from the tree, the outside is a 2-tine "atop" where unique is applied to the result of the rest, which is a 3-tine "fork":
┌─┴─┐
∪ ┌─┼───┐
⊢ ⍨ ┌─┼─┐
┌─┘ ⍨ ≠ ∘
/ ┌─┘ ┌┴┐
⍳ ⍳ ⍴
I've tried before, and the only way I've discovered to express
(f ⍵)/⍵
as a fork is, because of the "hybrid" nature of "/",
⊢(/⍨)f
where the right tack "⊢" stands in for "⍵". So "f" is the right fork
⍳⍨≠⍳∘⍴
which, redundantly parenthesised, is
(⍳⍨)≠(⍳∘⍴)
the jot "∘" holding the right tine together, which alternatively could have been another atop
(⍳⍴)
with the required parentheses.
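For readers without an APL session to hand, here is a minimal Python sketch of what the fork computes, step by step; the function name `duplicates` and the list-based representation are my own, not from the thread:

```python
def duplicates(v):
    """Distinct items of v that occur more than once, in order of appearance."""
    # ⍳⍨ (self-index, x⍳x): index of the first occurrence of each item
    first = [v.index(e) for e in v]
    # ≠ against ⍳∘⍴ (own positions): mark items whose first occurrence is earlier
    mask = [f != i for i, f in enumerate(first)]
    # /⍨ (replicate-commute): keep only the marked repeats
    repeats = [e for e, m in zip(v, mask) if m]
    # ∪ (unique): deduplicate, preserving first-seen order
    return list(dict.fromkeys(repeats))

print(duplicates([3, 1, 4, 1, 5, 9, 2, 6, 53, 58, 9, 7, 9]))  # → [1, 9]
```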
-
Phil Last - Posts: 628
- Joined: Thu Jun 18, 2009 6:29 pm
- Location: Wessex
Re: Finding duplicates in vector
Slightly off topic, but the above reminded me that on 27/08/2010, in a comp.lang.apl thread called "what makes it APL?", I wrote with respect to the binding of trains, wherein tines that are themselves derived functions do not need to be parenthesised:
    even if I could be persuaded that the parsing is consistent I'd never be comfortable with it.
To which Roger replied:
    "Never" is a long time, and your comfort may increase with time and usage.
Well, it seems that Roger was correct.
-
Phil Last - Posts: 628
- Joined: Thu Jun 18, 2009 6:29 pm
- Location: Wessex
Re: Finding duplicates in vector
Hi Phil, thanks for the explanations! It took me about half an hour to understand it partly (experimenting along the way). But the final (or perhaps initial) part I don't really understand is this equivalence:
⊢(/⍨)f iff (f ⍵)/⍵
Can you provide some more detailed insights? Otherwise I start to feel that my mind is blowing up :)
- alexeyv
- Posts: 56
- Joined: Tue Nov 17, 2015 4:18 pm
Re: Finding duplicates in vector
To make a concrete example:
f←1=2∘| ⍝ 1 iff odd
x←3 1 4 1 5 9 2 6 53 58 9 7 9
Now a sequence of equivalences is displayed. If you enter them in your APL session each line should give the same result as the previous:
(⊢ (/⍨) f) x
(⊢x) (/⍨) (f x) ⍝ definition of fork, (f g h) x ←→ (f x) g (h x)
(f x) / (⊢x) ⍝ definition of ⍨
(f x) / x ⍝ definition of ⊢
QED
In this case there is a complication in that / is a "hybrid", sometimes an operator (as in +/) and sometimes a function (as in /⍨). That this is so is due to historical reasons. If / were replaced by a syntactically orthodox function, such as {⍺/⍵}, you can get a simpler derivation:
(f {⍺/⍵} ⊢) x
(f x) {⍺/⍵} (⊢x) ⍝ definition of fork, (f g h) x ←→ (f x) g (h x)
(f x) {⍺/⍵} x ⍝ definition of ⊢
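The equivalence chain can also be checked mechanically. Here is a hedged Python sketch of the same derivation, with `compress` standing in for {⍺/⍵} (my naming, for illustration only):

```python
x = [3, 1, 4, 1, 5, 9, 2, 6, 53, 58, 9, 7, 9]

def f(w):
    # f←1=2∘|  ⍝ 1 iff odd
    return [n % 2 for n in w]

def compress(mask, w):
    # {⍺/⍵}: select the items of w where mask is 1
    return [e for e, m in zip(w, mask) if m]

# (f {⍺/⍵} ⊢) x  ←→  (f x) {⍺/⍵} (⊢x)  ←→  (f x) {⍺/⍵} x
print(compress(f(x), x))  # → [3, 1, 1, 5, 9, 53, 9, 7, 9]
```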
- Roger|Dyalog
- Posts: 238
- Joined: Thu Jul 28, 2011 10:53 am
Re: Finding duplicates in vector
I've just realised that I've taken so long to answer that Roger has beaten me to it. Oh well. I wrote all following this sentence before I noticed. First let me say that this search was more of an intellectual conceit than a necessary development.
Expressions similar to (f w)/w are ubiquitous in APL but have the awkward property of using the argument twice, necessitating its being named or, if it is the result of an expression, making it subject to parameter substitution in a dfn as {(f ⍵)/⍵}. Now this obviously isn't good form, because we are now using an unknown function f inside a dfn, so for form's sake we should parameterise that as well and code f{(⍺⍺ ⍵)/⍵}.
Forks are useful for algorithms that re-use a value by applying two functions to the same one or two arguments before applying a third between the two results:
(f0 f1 f2)w ←→ (f0 w)f1(f2 w)
Now fitting (f w)/w into that tripartite pattern as (f w) (/) (w) requires the mappings
f0 ←→ f ⋄ f1 ←→ / ⋄ f2 ←→ ⊢
where (⊢) merely returns its argument, leading to the fork (f / ⊢). Unfortunately the binding rules tell us this isn't a fork for the reason that the f and / bind first to become (f /) because the hybrid (/) has a function to its left and we get the 2-train ((f /) ⊢) that is merely (f /) which we don't want.
No amount of parenthesising can unwind this. (f (/) ⊢) is still the same, because parenthesising (/) no longer distinguishes it as a function, if it ever did, as adverbs (monadic operators) are now also allowed to be parenthesised and assigned.
As things currently stand we cannot parenthesise a pair of adverbs (try each commute, ¨⍨) even though doing so would merely produce another adverb, so if we code (/⍨) the parser has the choice of SYNTAX ERROR or allowing it as the function replicate-commute.
If ever adverb binding is permitted, and there are good reasons for doing so, then (/⍨) may have to change its meaning.
Nevertheless, pro tem, we have it as a function, so we can switch our two outer functions (f) and (⊢) to accommodate the switching being done by commute (⍨) and write
(⊢(/⍨)f)
which I read as "right where f". And it's actually shorter and faster than
f{(⍺⍺ ⍵)/⍵}
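The contrast between the dfn-operator form and the fork can be sketched in Python, where the operator form becomes a higher-order function; the name `filter_by` is my own, not from the thread:

```python
def filter_by(f):
    # analogue of the dfn-operator f{(⍺⍺ ⍵)/⍵}: the operand f produces
    # a boolean mask from ⍵, and the result replicates ⍵ by that mask
    return lambda w: [e for e, m in zip(w, f(w)) if m]

odd = lambda w: [n % 2 for n in w]
print(filter_by(odd)([3, 1, 4, 1, 5, 9, 2, 6]))  # → [3, 1, 1, 5, 9]
```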
-
Phil Last - Posts: 628
- Joined: Thu Jun 18, 2009 6:29 pm
- Location: Wessex
Re: Finding duplicates in vector
Sorry Phil for "butting in". I thought you may have been away on the long long weekend.
Phil Last wrote: "shorter and faster"
When J was first implemented it was Eugene McDonnell who discovered that forks are faster than alternatives, to my surprise. This still holds true today in Dyalog APL:
x←3 1 4 1 5 9 2 6 53 58 9 7 9
d ← {((⍳≢⍵)≠⍵⍳⍵)/⍵}
f ← ⊢ (⌿⍨) ⍳∘≢ ≠ ⍳⍨
cmpx 'd x' 'f x'
d x → 1.34E¯6 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
f x → 1.23E¯6 | -9% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
The reason for the time difference is that d is parsed each time it is executed, but f is not. (I distinguish between parsing and tokenizing; d is tokenized once but parsed each time.) I conjecture that the internal form of f is close to what you get from the Dyalog compiler, so that it's as if d were compiled. On smaller arguments this is significant because on smaller arguments interpretive overhead is relatively larger. I expect that on larger arguments the time difference between d and f would be less. And so it is:
y←?1e4⍴9e3
cmpx 'd y' 'f y'
d y → 4.93E¯5 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
f y → 4.93E¯5 | -1% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
It would be interesting to do a timing comparison between the compiled form of d, and f.
- Roger|Dyalog
- Posts: 238
- Joined: Thu Jul 28, 2011 10:53 am
Re: Finding duplicates in vector
Roger wrote: "butting in"
Not at all! If only I were so succinct.
-
Phil Last - Posts: 628
- Joined: Thu Jun 18, 2009 6:29 pm
- Location: Wessex
Re: Finding duplicates in vector
Thanks a lot for the detailed explanations! Now I understand that the right tack ⊢ is used as an identity function, as typically used in functional programming for situations like this, and I understand the role of commute here.
One more thing: in the book "APL: An Interactive Approach" (3rd edition) by Gilman and Rose, page 130, they write about a unique function ∩ (not ∪), which returns 1s marking the first occurrence of each item in an array. Here is the example from the book:
∩R←3 2 3 4 2 1 0 1
1 1 0 1 0 1 1 0
(∩R)/R
3 2 4 1 0
It seems to do the same as
{(⍵⍳⍵)=⍳⍴⍵}.
They claim that this function is available in APL2. However, neither Dyalog APL nor GNU APL seems to have it. Why is that, and what is the history behind it?
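The claimed equivalence is easy to check on the book's example. A Python transliteration of {(⍵⍳⍵)=⍳⍴⍵} (list-based, my own sketch):

```python
def nub_mask(r):
    # (⍵⍳⍵)=⍳⍴⍵: 1 where an item's first-occurrence index equals its own index
    return [int(r.index(e) == i) for i, e in enumerate(r)]

R = [3, 2, 3, 4, 2, 1, 0, 1]
print(nub_mask(R))                               # → [1, 1, 0, 1, 0, 1, 1, 0]
print([e for e, m in zip(R, nub_mask(R)) if m])  # (∩R)/R → [3, 2, 4, 1, 0]
```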
- alexeyv
- Posts: 56
- Joined: Tue Nov 17, 2015 4:18 pm
Re: Finding duplicates in vector
That's right, Dyalog APL does not yet have a function that returns a mask for selecting the distinct items. Iverson called it the "nubsieve", the sieve that selects the nub (AKA unique).
The nubsieve function will use the extended definition of ⍳, one which works on major cells rather than merely scalars. Thus:
nubsieve ← ⍳∘≢ = ⍳⍨
⊢ x ← ↑ 3 ⍴¨ 'kakistocracy'
kkk
aaa
kkk
iii
sss
ttt
ooo
ccc
rrr
aaa
ccc
yyy
nubsieve x
1 1 0 1 1 1 1 1 1 0 0 1
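A hedged Python sketch of the same nubsieve on major cells, representing the character matrix as a list of row strings (my representation, not Dyalog's array model):

```python
def nubsieve(cells):
    # ⍳∘≢ = ⍳⍨ : 1 where a major cell's first-occurrence index
    # equals its own position
    return [int(cells.index(c) == i) for i, c in enumerate(cells)]

x = [c * 3 for c in 'kakistocracy']  # rows of ↑ 3 ⍴¨ 'kakistocracy'
print(nubsieve(x))  # → [1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1]
```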
- Roger|Dyalog
- Posts: 238
- Joined: Thu Jul 28, 2011 10:53 am
Re: Finding duplicates in vector
Timings on a dfn variant d1 which more closely mimics the fork f, and on the compiled versions of the dfns:
d ← {((⍳≢⍵)≠⍵⍳⍵)/⍵}
f ← ⊢ (⌿⍨) ⍳∘≢ ≠ ⍳⍨
d1 ← {⍵⌿⍨(⍳≢⍵)≠⍳⍨⍵}
c ←d ⋄ 2(400⌶)'c'
c1←d1 ⋄ 2(400⌶)'c1'
cmpx 'd x' 'd1 x' 'f x' 'c x' 'c1 x'
d x → 1.35E¯6 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
d1 x → 1.60E¯6 | +18% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
f x → 1.24E¯6 | -9% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
c x → 1.13E¯6 | -17% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
c1 x → 1.13E¯6 | -17% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
Compiler wins!
Timing improvements on small arguments are very hard to come by because there isn't much wriggle room, not as many elements over which to amortize the fixed cost of interpretive overhead.
- Roger|Dyalog
- Posts: 238
- Joined: Thu Jul 28, 2011 10:53 am