Regex in Dyalog

Postby kai on Sat Feb 06, 2016 11:24 am

Is this an APL topic? Not sure, but it does not fit elsewhere, so here we go.

If you just want to help but are not interested in regular expressions (regex), don't read on: as so often, I solved my problem while writing this post. I am leaving it up anyway because

  • it might be of interest,
  • it is one of the rare cases where an APL solution would need more code, and
  • I accidentally pressed the "Submit" button at a very early stage.

Imagine an HTML file: self-constructed, valid HTML5. It may contain "---" (three dashes), which shall be converted to "—"; "--", which shall be converted to "–"; or "...", which shall be converted to "…".

With a little help from Richard Smith (I am anything but an expert on regex) I solved it like this:

Code: Select all
iot←'<.([^>]*?)>'       ⍝ ignoreOpeningTags
icc←'<code>.*?</code>'  ⍝ ignoreCodeContent
o←('Mode' 'D')('DotAll' 1)
h←icc iot '---'⎕R'\0' '\0' '\&mdash;'⍠ o ⊣ h
h←icc iot '--'⎕R'\0' '\0' '\&ndash;'⍠ o ⊣ h
h←icc iot '\.\.\.'⎕R'\0' '\0' '\&mldr;'⍠ o ⊣ h


This is fine for something like this (I know, the attribute does not make sense, but it illustrates the point anyway):

      h←'<div class="---"> --- <code> translates --- </code> </div>'


which results in

Code: Select all
<div class="---"> &mdash; <code> translates --- </code> </div>


Which is exactly what I want.

However, ..., well, not however, while explaining my problem I found the solution!

Originally, the last step mutilated everything. In this last step I tried to replace every pair of double quotes (") with &ldquo; and &rdquo; by doing this:

Code: Select all
icc iot '"((.|\s)*?)"'⎕R'\0' '\0' '\&ldquo;\1\&rdquo;'⍠ o ⊣ h


and indeed that now works perfectly well.

What did I change? I realized that the order of the search patterns matters:

Code: Select all
icc iot '---'⎕R ... ⍝ is okay
iot icc '---'⎕R ... ⍝ is NOT okay
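
A small sketch of why the order matters, using the sample string and the two patterns from earlier in this post. The patterns are tried left to right at each position, so with iot first the opening <code> tag is consumed as an ordinary tag and the text after it is no longer protected. The remarks in the comments are what I would expect from that rule; only the first result is confirmed in the thread.

```apl
h←'<div class="---"> --- <code> translates --- </code> </div>'
icc←'<code>.*?</code>'   ⍝ a whole <code>...</code> block
iot←'<.([^>]*?)>'        ⍝ a single tag

⍝ icc first: the code block is matched as one unit and copied unchanged,
⍝ so only the '---' between <div ...> and <code> becomes &mdash;
icc iot '---' ⎕R '\0' '\0' '\&mdash;' ⊢ h

⍝ iot first: '<code>' alone is matched as a plain tag, so the '---'
⍝ inside the code block is reached by the third pattern and replaced too
iot icc '---' ⎕R '\0' '\0' '\&mdash;' ⊢ h
```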


Now this input:

Code: Select all
html←'<a href="apl---wiki">The --- APL wiki</a>'


results in this:

Code: Select all
<a href="apl---wiki">The &mdash; APL wiki</a>


Excellent. At least as long as nothing like this comes along:

Code: Select all
<dic attr="<>">

Re: Regex in Dyalog

Postby DanB|Dyalog on Sun Feb 07, 2016 3:52 am

There are many ways to code regexes (just like in APL).

I am not sure if I understand your problem correctly but it seems to me that this expression

Code: Select all
'(<code>.*?</code>|<.*?>)' '---' '--' '\.{3}' ⎕r '&' '\&mdash;' '\&ndash;' '\&mldr;'⊢H


should do the trick. Unless your tags may include CRs you should be able to drop the 'Mode' 'D' and 'DotAll' (but it doesn't hurt to add them).

It does not solve the <dic attr="<>" problem.
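
One detail of the expression above that is easy to miss: '---' is listed before '--'. Since the patterns are tried in order at each position, swapping the two would turn the first two dashes of every '---' into an en dash. A minimal sketch, assuming a made-up sample string s that is not from the thread:

```apl
s←'a -- b --- c ... d'   ⍝ hypothetical sample text

⍝ '---' before '--': expected  a &ndash; b &mdash; c &mldr; d
'---' '--' '\.{3}' ⎕R '\&mdash;' '\&ndash;' '\&mldr;' ⊢ s

⍝ '--' before '---': the en dash pattern wins at the start of '---',
⍝ leaving a stray '-' behind:  a &ndash; b &ndash;- c &mldr; d
'--' '---' '\.{3}' ⎕R '\&ndash;' '\&mdash;' '\&mldr;' ⊢ s
```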

Re: Regex in Dyalog

Postby kai on Sat Feb 13, 2016 12:06 pm

Thanks Dan.

However, I am not sure whether I want to know even more ways to achieve the same result ;)

Anyway, I have another problem!

Given this character vector:

Code: Select all
q←'<p>This "is" an <abbr title="This and that">Abbr</abbr></p>'


I want to make sure that all double quotes (") are replaced by their typographically correct counterparts, but of course not those inside a tag. So the quotes around "is" shall be converted, but not those inside the <abbr> tag.

This expression:

Code: Select all
cot←'<.([^>]*?)>'  ⍝ Catch Opening Tags


is supposed to catch a tag from start (<) to end (>), so it can be paired with the replacement \0, which stands for the whole match, meaning nothing changes.

And it seemed to work:

Code: Select all
      cot'"((.|\s)*?)"'⎕R '\0' '\&ldquo;\1\&rdquo;'⍠('Mode' 'D')('DotAll' 1)⊣q
<p>This &ldquo;is&rdquo; an <abbr title="This and that">Abbr</abbr></p>


Now when I change the input string slightly:

Code: Select all
<p>This "is" an <abbr title="This and that">Abbr</abbr></p>     ⍝ old
<p>This "is" an "<abbr title="This and that">Abbr</abbr>"</p>   ⍝ New


it stops working - the result is now:

Code: Select all
<p>This &ldquo;is&rdquo; an &ldquo;<abbr title=&rdquo;This and that&ldquo;>Abbr</abbr>&rdquo;</p>


Now the two double quotes inside the <abbr> tag got changed as well, and I have no idea why that is!

Anybody?!

Re: Regex in Dyalog

Postby DanB|Dyalog on Sat Feb 13, 2016 5:49 pm

When you use several elements in the left operand, ⎕R proceeds from one character to the next in the right argument, testing each element of the left operand in turn: if the first one fails to match at that position, it tests the next one, and so on.
For example:
Code: Select all
      ('AA' 'BA' ⎕r 'xx' 'yy') 'AABAABAA'
xxyyAyyA


Here 'AA' matched the first 2 characters and ⎕R did the replacement, giving 'xxBAABAA'. It then looked at the 3rd character, 'B', which did not match 'AA', so it tried 'BA'; that worked, so the next replacement generated 'xxyyABAA', and so on.

This is what happened here in your case: the "<abbr... matched the 2nd operand element '"((.|\s)*?)"' and not the first one '<.([^>]*?)>'.
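
To see which text each pattern picks up, one option is to search with ⎕S instead of replacing. The sketch below uses the modified input from the previous post and the simplified quote pattern; the list of matches in the comment is derived from the scanning rule described above, not from a session in the thread. At the quote in front of <abbr the tag pattern cannot match (there is no '<' at that position), so the quote pattern takes over and runs up to the next quote, which sits inside the tag.

```apl
q←'<p>This "is" an "<abbr title="This and that">Abbr</abbr>"</p>'
cot←'<.([^>]*?)>'   ⍝ Catch Opening Tags, as defined in the question

⍝ ⎕S returns the matches instead of replacing them; '&' is the matched text
cot '"(.*?)"' ⎕S '&' '&' ⊢ q
⍝ expected matches, in order:
⍝   <p>   "is"   "<abbr title="   ">Abbr</abbr>"   </p>
```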

The reason why it worked in
Code: Select all
      q←'<p>This "is" an <abbr title="This and that">Abbr</abbr></p>'
      cot'"((.|\s)*?)"'⎕R '\0' '\&ldquo;\1\&rdquo;'⍠('Mode' 'D')('DotAll' 1)⊣q
<p>This &ldquo;is&rdquo; an <abbr title="This and that">Abbr</abbr></p>
is because everything was paired properly.

What would you have liked to happen?


BTW
Code: Select all
'<.([^>]*?)>'
means "look for a '<', look for ANY character, look for as few as possible of NOT '>', then a '>'"
This can be simplified and sped up by using the expression
Code: Select all
'<.+?>'
which is "look for a '<', look for at least one character until a '>'"
Note that both would match '<><>' in the expression 'xxx<><>yyy'.
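
A quick way to check this is to list the matches of both patterns with ⎕S; per the note above, each expression should return the single match '<><>':

```apl
⍝ both patterns need at least one character after '<', so the first '>'
⍝ is consumed as that character and the match runs on to the second '>'
'<.([^>]*?)>' ⎕S '&' ⊢ 'xxx<><>yyy'
'<.+?>'       ⎕S '&' ⊢ 'xxx<><>yyy'
```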

And the expression
Code: Select all
'"((.|\s)*?)"'
can be simplified as
Code: Select all
'"(.*?)"'
since, with 'DotAll' 1 as in your options, (.|\s) is the same as (.): the dot then matches any character, line endings included, so |\s (or |anything) is superfluous.

Re: Regex in Dyalog

Postby kai on Mon Feb 15, 2016 7:23 am

Dan, invaluable. Thanks a lot.

I think it is fair to say that I am slowly starting to understand how this stuff really works...

