To Each or Not to Each
To Each or Not to Each
I'm contemplating ¨ with ⎕S.
My main use-case for regex is a character matrix, where I am searching each row. I first convert this to a vector of vectors V.
My first instinct was to pass V to ⎕S in line mode. However, there is a fair amount of work involved in lining up the results with the original input, which is critical for my use case. So I tried it with an ¨, which eliminates the lining-up issue. I assumed this would be much slower, and sometimes it is (though really not by much). Oddly, on very simple search patterns the ¨ appears to be faster, even ignoring the extra work involved in lining up the results in the no-each case.
In addition to the advantage of lining up results with input, the ¨ allows the use of the Document mode variant, which means you can search for line ending chars.
Aside from speed (a con most of the time) and searching for newlines (a pro - I think), I'm wondering if I am missing anything with respect to the difference between ¨ and no ¨.
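For concreteness, the two approaches might look roughly like the sketch below. The sample matrix M, the pattern 'two' and the names flat, perRow and perRowD are made up for illustration; transformation code 0 asks ⎕S for the offset of each match, and the last line assumes the usual ⍠'Mode' 'D' spelling of the Document mode variant.

```apl
M←↑'one two' 'three' 'two two'         ⍝ hypothetical sample matrix; short rows are padded with blanks
V←↓M                                    ⍝ vector of character vectors, one per row of M
flat←('two' ⎕S 0) V                     ⍝ one call in line mode (the default): a single flat list of offsets
perRow←('two' ⎕S 0)¨V                   ⍝ one call per row: one result item per row, so it lines up with M
perRowD←('two' ⎕S 0 ⍠'Mode' 'D')¨V      ⍝ as above, but each row is searched as a whole document
```

In the flat result nothing says which row a given offset came from, which is the lining-up work mentioned above; the ¨ forms keep one (possibly empty) result per row.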
- paulmansour
- Posts: 420
- Joined: Fri Oct 03, 2008 4:14 pm
Re: To Each or Not to Each
The most likely explanation for the speed-up you are seeing is that, by simplifying the input to ⎕S to single character vectors, you allow ⎕S to switch to an optimised mode in which it uses ⍷ to find matches in the data rather than using the PCRE search engine. Using PCRE is slower for a number of reasons: principally, PCRE is itself slower than ⍷ because it can do far more complex searches, and it also requires the data in the workspace to be re-encoded and reconstructed into a format suitable for it to use.
⎕S is quite a complicated function, and passing data to it in sections (using each) has some significant effects on what you can and cannot do with the data, which may or may not be beneficial to you. ⎕S in line mode is not equivalent to ⎕S¨ with document mode set, because the former splits the data into logical lines wherever it finds a line-ending character, both implicitly between the items of a vector of character vectors and explicitly within the data itself. Document mode does not split the data, which is why you can search for line-ending characters there. However, ⎕S allows you to enquire which "block" (line) the match occurred on, which may be useful, especially when you use a transformation function, and this will not be available if ⎕S is invoked multiple times. Which form suits you better will very much depend on what you are trying to do.
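As a rough illustration of the block-number enquiry, a single line-mode call can return the line each match came from alongside its offset. This is only a sketch: the sample rows and the names hits and rows are hypothetical, and it assumes transformation codes 2 (block number) and 0 (offset), both of which are, to my understanding, counted from 0.

```apl
V←'one two' 'three' 'two two'      ⍝ hypothetical sample rows
hits←('two' ⎕S 2 0) V               ⍝ one item per match: (block number, offset within that line)
rows←⊃¨hits                         ⍝ block (row) number of each match, usable to index back into V
```

With ⎕S¨ each call sees only one row, so the block number it reports says nothing about the row's position in V; that position comes from the shape of the each result instead.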
- Richard|Dyalog
- Posts: 44
- Joined: Thu Oct 02, 2008 11:11 am