Regular Expressions

Post by paulmansour on Tue May 16, 2017 2:24 pm

I'm finally getting around to revamping various string searching functions with ⎕S.

It is no surprise that ⍷ is much faster given its relatively limited functionality.

Here is my question: if my search pattern has no regex meta characters, for character matrix m and search pattern p, can this expression:

      (⍳≢ m)∊(p ⎕S 2) ↓m


be replaced with this:

      ∨/p⍷m


Also is this the complete set of meta characters?

      c←'\^$*+?.|{}[]()'

Re: Regular Expressions

Post by paulmansour on Wed May 17, 2017 5:32 pm

I have another question.

There is a variant for ignoring case, IC. However, I can also change case sensitivity within the pattern using (?i) and (?-i).

As I am struggling to provide useful cover functions for third-party users, this is useful: I can get rid of an argument and let the user (easily) embed the option in the search pattern.

Two questions:

1. Anything else going on with respect to IC? Is it just a convenience for putting (?i) at the start of each search pattern?

2. Is there something similar in PCRE for the ML variant, or is this totally outside the scope of the pattern? At first glance this looks like it is much more complicated to embed in the search pattern, if at all.

Re: Regular Expressions

Post by Richard|Dyalog on Thu May 18, 2017 9:59 am

Hi, Paul - I'll try to go through your questions in turn.

> can this expression ... be replaced with this?

Your expression using ⍷ is not exactly equivalent to the one using ⎕S, though it may be that it actually suits your needs better. The reason is that ⎕R and ⎕S operate on lines. Your matrix is split into vectors, each of which is treated as a line (that is, there are implicit line endings between the vectors), but if m also contains explicit line ending characters, there will be more logical lines in the data than there are rows in the matrix.

Under the covers, ⎕R and ⎕S use the PCRE search engine to locate matches in the data but they do try to avoid doing that for simple searches - they, too, call ⍷ when they can. One requirement for this optimisation is that there is only a single logical line in the input so your example would not be optimised - though if you were to construct an expression using each to pass the data line-by-line to ⎕S it might well be. If you want more details about when the optimisation does and does not take place, let me know.
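To make the difference concrete, here is a minimal sketch. It is an illustration only: the sample matrix is made up, ⎕IO←0 is assumed, and the comments describe what I would expect rather than verified session output. The names m and p follow the question.

```apl
      ⎕IO←0
      p←'abc'
      m←2 6⍴'abc   de',(⎕UCS 10),'abc'    ⍝ row 1 is 'de', a line feed, then 'abc'

      (p ⎕S 2)↓m                   ⍝ line numbers of the matches; the line feed splits row 1
                                   ⍝ into two lines, so a number past the last row can appear
      (⍳≢m)∊(p ⎕S 2)↓m             ⍝ which means this can miss the match in row 1
      ∨/p⍷m                        ⍝ one Boolean per matrix row; line endings play no part
      {×≢(p ⎕S 0⍠'Mode' 'D')⍵}¨↓m  ⍝ each row as its own document: again one Boolean per row
```

For a pattern with no meta characters, the last two expressions should agree; the first two only line up with the rows when the data contains no line ending characters.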

> Is this the complete set of meta characters?

As with your optimisation, another requirement for the interpreter optimisation is that there are no meta characters in the search pattern. The list it uses is almost the same as yours; in fact, you don't need to include } or ] because these have no special meaning without the leading { or [. The list of characters used by the interpreter is: \^$.[|()?*+{
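A cover function could use that list to decide when a plain ⍷ is enough. The following is only a sketch under that assumption: IsLiteral and Rows are hypothetical names, not part of the interpreter, and Rows falls back to one ⎕S call per row in Document mode.

```apl
      meta←'\^$.[|()?*+{'              ⍝ the list quoted above
      IsLiteral←{~∨/⍵∊meta}            ⍝ 1 if the pattern contains none of them

      Rows←{                           ⍝ ⍺: pattern   ⍵: character matrix
          IsLiteral ⍺:∨/⍺⍷⍵            ⍝ plain text: search the matrix directly
          p←⍺
          {×≢(p ⎕S 0⍠'Mode' 'D')⍵}¨↓⍵  ⍝ otherwise: each row as its own document
      }

      IsLiteral 'abc'                  ⍝ no meta characters
      IsLiteral 'a.c'                  ⍝ contains .
      'abc' Rows 2 6⍴'xabcx abxcx '    ⍝ one Boolean per row
```

Note that a pattern carrying an embedded option such as (?i) contains ( and ?, so it would take the ⎕S branch, which is what you want because ⍷ is always case sensitive.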

> Anything else going on with respect to IC? Is it just a convenience for putting (?i) at the start of each search pattern?

PCRE is passed various options when invoked by the interpreter and one of these is the case sensitivity flag. Many of these options, including that one, can also be set within the search pattern. As far as I know, PCRE behaves identically regardless of how the flag is set, and I can confirm that the interpreter does not change its behaviour depending on it, so I believe they are exactly equivalent.
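A quick way to check this on your own data is to compare the two forms directly. This is a small sketch with a made-up test string t; the comments state what I would expect, not recorded output.

```apl
      t←'xxABCxx'
      (('abc' ⎕S 0⍠'IC' 1)t)≡('(?i)abc' ⎕S 0)t   ⍝ expected to be 1 if the two are equivalent

      ('(?i)ab(?-i)c' ⎕S '&')'ABC ABc'            ⍝ case is ignored for "ab" only,
                                                  ⍝ so only 'ABc' should match
```

The second expression shows what the embedded form can do that the IC variant cannot: switch case sensitivity part-way through a pattern.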

> Is there something similar in PCRE for the ML variant, or is this totally outside the scope of the pattern? At first glance this looks like it is much more complicated to embed in the search pattern, if at all.

The way the PCRE engine is invoked is, simplistically, to pass it the (precompiled) pattern, a buffer of text and a start offset within the buffer, and it gives back the offset of the next match, if any. The interpreter keeps invoking it to find all the matches. Therefore, the ML option entirely controls the way the interpreter calls PCRE; there is no equivalent option you can put in the search pattern. Your only way of achieving the effect you want with the search pattern alone would be to construct a suitable pattern which actually matched the number of times you wanted, which is a far more difficult undertaking than simply setting the option.
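For reference, this is roughly how the ML variant looks in use. It is a minimal sketch on a made-up string, and the comments reflect my reading of the documented behaviour (a positive value caps the number of matches processed, a negative value selects a single match by position).

```apl
      ('a' ⎕S 0)'banana'            ⍝ offsets of every match
      ('a' ⎕S 0⍠'ML' 1)'banana'     ⍝ ML is 1: processing stops after the first match
      ('a' ⎕S 0⍠'ML' ¯2)'banana'    ⍝ negative ML: only the second match is processed
```

In each case the pattern itself is unchanged; only the number of times the interpreter goes back to PCRE differs.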

I hope this helps!

Re: Regular Expressions

Post by paulmansour on Thu May 18, 2017 1:27 pm

Richard,

Thank you very much for your detailed response, it helps a lot.

I was unaware that I could still get more logical lines than matrix rows when using ⎕S on the equivalent vector of vectors. That's critical to know, as I think lining up my results to correspond to the original char mat would be impossible without testing each row for embedded line endings (or pre-searching and prohibiting them). I think that is yet another reason for me to use it with ¨ in Document mode for my use case, which is essentially searching a character column in a database. (Interestingly, with ⎕R, due to its nature, no such issue arises, I think.)

If I use ¨, even though you are optimizing with ⍷ under the covers for simple searches, I assume there will still be an opportunity for me to optimize by doing ⍷ on the original matrix, with no need for me to nest it, let alone call ⎕S with ¨.

The info on ML is very helpful too. I assume something similar is going on when there are multiple search patterns - there are (possibly multiple) calls to PCRE for each pattern.

Thanks again, and thanks for ⎕S and ⎕R. I don't think I realized how useful these (and regex) were going to be!

Re: Regular Expressions

Post by paulmansour on Thu May 18, 2017 5:56 pm

> (Interestingly, with ⎕R, due to its nature, no such issue arises, I think)

I take that back.

I think one really has to use ¨ and Mode=D when doing any search or replace on the individual rows of a char mat (or a vec of vecs) whenever it is crucial that the result has a one-to-one correspondence with the input.

I don't think there is any way around it, short of prohibiting line ending chars in the original input matrix.
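Something like the following is what I have in mind for ⎕R. It is just a sketch with a made-up matrix whose second row contains a line feed; the comments are what I expect to happen, not checked output.

```apl
      m←2 6⍴'abc   de',(⎕UCS 10),'abc'
      new←('abc' ⎕R 'X'⍠'Mode' 'D')¨↓m   ⍝ one result per row, whatever the rows contain
      (≢new)=≢m                           ⍝ the each (¨) guarantees this
      ≢('abc' ⎕R 'X')↓m                   ⍝ without ¨, the line count may exceed the row count
```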

