Regex in Dyalog
Is this an APL topic? Not sure, but it does not fit elsewhere, so here we go.
If you just want to help but are not interested in regular expressions (regex), then don't read this: as so often, I solved my problem while writing the post. I am leaving it up anyway because
- it might be of interest,
- it is one of the rare cases where an APL solution would need more code,
- I pressed the "Submit" button accidentally at a very early stage.
Here it is:
Imagine an HTML file: self-constructed, valid HTML5. It may contain "---" (three dashes), which shall be converted to "—", or "--", which shall be converted to "–", or "...", which shall be converted to "…".
With a little help from Richard Smith (I am anything but an expert on regex) I solved this:
- Code: Select all
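iot←'<.([^>]*?)>'       ⍝ ignoreOpeningTags: a tag from < to the next >
icc←'<code>.*?</code>'  ⍝ ignoreCodeContent: everything inside <code>...</code>
o←('Mode' 'D')('DotAll' 1)
⍝ '\0' puts the two protected matches back unchanged; only the third pattern is replaced
h←icc iot '---'⎕R'\0' '\0' '\—'⍠ o ⊣ h
h←icc iot '--'⎕R'\0' '\0' '\–'⍠ o ⊣ h
h←icc iot '\.\.\.'⎕R'\0' '\0' '\…'⍠ o ⊣ h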
This is fine for something like this (I know, the attribute does not make sense, but it illustrates the point anyway):
h←'<div class="---"> --- <code> translates --- </code> </div>'
which results in
- Code: Select all
<div class="---"> — <code> translates --- </code> </div>
Which is exactly what I want.
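For readers who want to try the idea outside APL, here is a minimal Python sketch of the same three passes. It assumes that an alternation tried left to right behaves like the list of patterns handed to ⎕R, and that re.DOTALL stands in for ('DotAll' 1). The names replace_outside and typographic are made up for this illustration and are not part of the original solution.

```python
import re

ICC = r'<code>.*?</code>'  # ignoreCodeContent, as in the post
IOT = r'<.([^>]*?)>'       # ignoreOpeningTags, as in the post

def replace_outside(target, replacement, html):
    """Replace regex `target` except inside <code>...</code> or inside a tag.

    Assumption: `target` contains no capturing groups of its own.
    """
    combined = re.compile(f'({ICC})|({IOT})|{target}', re.DOTALL)

    def fix(m):
        # Group 1 (code block) or group 2 (tag) matched: put the text back unchanged.
        if m.group(1) is not None or m.group(2) is not None:
            return m.group(0)
        return replacement

    return combined.sub(fix, html)

def typographic(html):
    # Longest dash sequence first, so '---' is not consumed as '--' plus '-'.
    html = replace_outside('---', '—', html)
    html = replace_outside('--', '–', html)
    return replace_outside(r'\.\.\.', '…', html)

h = '<div class="---"> --- <code> translates --- </code> </div>'
print(typographic(h))  # <div class="---"> — <code> translates --- </code> </div>
```

The "keep" branches do the same job as the '\0' replacements: they consume protected text so that the target pattern never gets to look inside it.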
However, ..., well, not however, while explaining my problem I found the solution!
Originally, the last step mutilated everything. In this last step I tried to replace every " with “ and ” by doing this:
- Code: Select all
icc iot '"((.|\s)*?)"'⎕R'\0' '\0' '\“\1\”'⍠ o ⊣ h
and indeed that now works perfectly well.
What did I change? I realized that the order of the patterns is important:
- Code: Select all
icc iot '---'⎕R ... ⍝ is okay
iot icc '---'⎕R ... ⍝ is NOT okay
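One plausible reading of why the order matters, sketched in Python rather than APL: when two patterns can both match at the same position, the one listed first wins. The sketch assumes that a left-to-right alternation mirrors ⎕R's pattern list; protect_then_replace is a hypothetical helper name. With the tag pattern first, `<code>` is consumed as an ordinary tag, the `<code>...</code>` pattern never fires, and the dashes inside the code block are converted after all.

```python
import re

ICC = r'<code>.*?</code>'  # ignoreCodeContent
IOT = r'<.([^>]*?)>'       # ignoreOpeningTags

def protect_then_replace(protected, target, replacement, text):
    """Keep matches of the `protected` patterns (tried in list order), replace `target`."""
    alternation = '|'.join(f'(?:{p})' for p in protected)
    pattern = re.compile(f'(?P<keep>{alternation})|{target}', re.DOTALL)

    def fix(m):
        return m.group(0) if m.group('keep') is not None else replacement

    return pattern.sub(fix, text)

h = '<div class="---"> --- <code> translates --- </code> </div>'

# Code-content pattern first: the whole <code>...</code> block is protected.
good = protect_then_replace([ICC, IOT], '---', '—', h)

# Tag pattern first: '<code>' alone is protected, its content is not.
bad = protect_then_replace([IOT, ICC], '---', '—', h)

print(good)  # <div class="---"> — <code> translates --- </code> </div>
print(bad)   # <div class="---"> — <code> translates — </code> </div>
```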
Now this input:
- Code: Select all
html←'<a href="apl---wiki">The --- APL wiki</a>'
results in this:
- Code: Select all
<a href="apl---wiki">The — APL wiki</a>
Excellent. At least as long as nothing like this comes along:
- Code: Select all
<dic attr="<>">
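A possible way around that last case, offered only as a sketch and not something from this thread: let the tag pattern step over quoted attribute values as a unit, so that a > inside quotes does not end the tag. The example is in Python; QUOTE_AWARE_TAG is a hypothetical pattern that has not been tried with ⎕R, and it assumes that attribute quotes are always balanced.

```python
import re

# Pattern from the post: stops at the first '>' even inside an attribute value.
NAIVE_TAG = re.compile(r'<.([^>]*?)>', re.DOTALL)

# Hypothetical alternative: inside a tag, accept either a complete quoted string
# or any single character that is neither a quote nor '>'.
QUOTE_AWARE_TAG = re.compile(r'''<(?:"[^"]*"|'[^']*'|[^>"'])+>''', re.DOTALL)

s = '<dic attr="<>"> --- </dic>'

print(NAIVE_TAG.match(s).group(0))        # <dic attr="<>
print(QUOTE_AWARE_TAG.match(s).group(0))  # <dic attr="<>">
```

The three alternatives start with different characters, so the pattern does not need to backtrack between them.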
- kai (Posts: 137; Joined: Thu Jun 18, 2009 5:10 pm; Location: Hillesheim / Germany)
Re: Regex in Dyalog
There are many ways to code regexes (just like in APL).
I am not sure if I understand your problem correctly but it seems to me that this expression
- Code: Select all
'(<code>.*?</code>|<.*?>)' '---' '--' '\.{3}' ⎕r '&' '\—' '\–' '\…'⊢H
should do the trick. Unless your tags may include CRs you should be able to drop the 'Mode' 'D' and 'DotAll' (but it doesn't hurt to add them).
It does not solve the <dic attr="<>" problem.
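The single-pass shape of that expression can be tried in Python as well. This is a rough equivalent under stated assumptions, not a translation checked against ⎕R: each top-level group plays the role of one pattern in the list, the first group is put back unchanged like '&', and re.DOTALL is added although the post says it is optional. The names PATTERN, REPLACEMENTS and typography are invented for the example.

```python
import re

# Group 1 protects code blocks and tags; groups 2 to 4 are the text to convert.
PATTERN = re.compile(r'(<code>.*?</code>|<.*?>)|(---)|(--)|(\.{3})', re.DOTALL)
REPLACEMENTS = {2: '—', 3: '–', 4: '…'}

def typography(html):
    def fix(m):
        # m.lastindex is the number of the top-level group that matched.
        if m.lastindex == 1:
            return m.group(0)
        return REPLACEMENTS[m.lastindex]

    return PATTERN.sub(fix, html)

print(typography('<div class="---"> --- <code> translates --- </code> </div>'))
# <div class="---"> — <code> translates --- </code> </div>
```

Because everything happens in one scan, '---' has to be listed before '--'; otherwise the shorter pattern would win at the same position.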
- DanB|Dyalog
Re: Regex in Dyalog
Thanks Dan.
However, I am not sure whether I want to know even more ways to achieve the same result ;)
Anyway, I have another problem!
Given this character vector:
- Code: Select all
q←'<p>This "is" an <abbr title="This and that">Abbr</abbr></p>'
I want to make sure that any double quotes (") are replaced by their typographically correct counterparts, but of course not those within a tag definition. So the quotes around "is" shall be converted, but not those inside the <abbr> tag.
This expression:
- Code: Select all
cot←'<.([^>]*?)>' ⍝ Catch Opening Tags
is supposed to catch tags from start (<) to end (>), so the catch can be replaced with \0, which stands for the catch itself, meaning nothing changes.
And it seemed to work:
- Code: Select all
cot'"((.|\s)*?)"'⎕R '\0' '\“\1\”'⍠('Mode' 'D')('DotAll' 1)⊣q
<p>This “is” an <abbr title="This and that">Abbr</abbr></p>
Now when I change the input string slightly:
- Code: Select all
<p>This "is" an <abbr title="This and that">Abbr</abbr></p> ⍝ old
<p>This "is" an "<abbr title="This and that">Abbr</abbr>"</p> ⍝ New
it stops working - the result is now:
- Code: Select all
<p>This “is” an “<abbr title=”This and that“>Abbr</abbr>”</p>
Now the two double quotes inside the <abbr> tag got changed as well, and I have no idea why that is!
Anybody?!
- kai
Re: Regex in Dyalog
When you use several elements in the left operand, ⎕R proceeds from one character to the next in the right argument,
testing each element of the left operand in turn: if one does not match at that position, it tries the next one, and so on.
For example
- Code: Select all
('AA' 'BA' ⎕r 'xx' 'yy') 'AABAABAA'
xxyyAyyA
Here 'AA' matched the first two characters and ⎕R made the replacement, giving 'xxBAABAA'. It then looked at the 3rd character, 'B';
that did not match 'AA', so it tried 'BA', which worked, so the next replacement gave 'xxyyABAA', and so on.
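The same scan can be imitated in Python with one alternation, which makes the rule easy to experiment with. This is only an analogy under the assumption that alternatives tried left to right behave like the elements of ⎕R's left operand; multi_replace is a made-up helper, and the patterns passed to it must not contain capturing groups.

```python
import re

def multi_replace(patterns, replacements, text):
    """At each position try the patterns in order; the first one that matches wins."""
    combined = re.compile('|'.join(f'({p})' for p in patterns))
    # m.lastindex is the 1-based number of the group (pattern) that matched.
    return combined.sub(lambda m: replacements[m.lastindex - 1], text)

print(multi_replace(['AA', 'BA'], ['xx', 'yy'], 'AABAABAA'))  # xxyyAyyA

# When two patterns match at the same position, list order decides:
print(multi_replace(['AA', 'A'], ['x', 'y'], 'AAA'))  # xy
print(multi_replace(['A', 'AA'], ['y', 'x'], 'AAA'))  # yyy
```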
This is what happened in your case: the scan reached the " in front of <abbr before it reached the <, so the text "<abbr title=" matched the 2nd operand element '"((.|\s)*?)"' and not the first one '<.([^>]*?)>'.
The reason why it worked in
- Code: Select all
q←'<p>This "is" an <abbr title="This and that">Abbr</abbr></p>'
cot'"((.|\s)*?)"'⎕R '\0' '\“\1\”'⍠('Mode' 'D')('DotAll' 1)⊣q
<p>This “is” an <abbr title="This and that">Abbr</abbr></p>
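is that in this string every quoted phrase outside a tag is closed before the next tag begins, so the scan reaches the < of <abbr ...> before any quote inside it, and the first pattern consumes the whole tag.
What would you have liked to happen?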
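To see which stretch of text each pattern actually grabs, both inputs can be run through a Python imitation of the two-pattern call. It assumes, as a simplification, that a left-to-right alternation with re.DOTALL behaves like the ⎕R call with ('DotAll' 1); curl and matches are hypothetical helper names.

```python
import re

TAG = r'<.([^>]*?)>'       # first pattern: kept unchanged
QUOTED = r'"((.|\s)*?)"'   # second pattern: straight quotes become curly
PATTERN = re.compile(f'(?P<tag>{TAG})|(?P<quoted>{QUOTED})', re.DOTALL)

def curl(text):
    def fix(m):
        if m.group('tag') is not None:
            return m.group(0)
        return '“' + m.group(0)[1:-1] + '”'
    return PATTERN.sub(fix, text)

def matches(text):
    """List the matched substrings in scan order."""
    return [m.group(0) for m in PATTERN.finditer(text)]

old = '<p>This "is" an <abbr title="This and that">Abbr</abbr></p>'
new = '<p>This "is" an "<abbr title="This and that">Abbr</abbr>"</p>'

print(curl(old))     # <p>This “is” an <abbr title="This and that">Abbr</abbr></p>
print(curl(new))     # <p>This “is” an “<abbr title=”This and that“>Abbr</abbr>”</p>
print(matches(new))  # ['<p>', '"is"', '"<abbr title="', '">Abbr</abbr>"', '</p>']
```

The list of matches shows the tag pattern never seeing <abbr ...> in the second input: the quote pattern has already consumed its opening <.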
BTW
- Code: Select all
'<.([^>]*?)>'
This can be simplified and sped up by using the expression
- Code: Select all
'<.+?>'
Note that both would match '<><>' in the expression 'xxx<><>yyy'.
And the expression
- Code: Select all
'"((.|\s)*?)"'
can, since 'DotAll' is set, be simplified to
- Code: Select all
'"(.*?)"'
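Both simplifications can be checked quickly in Python, whose re module is close enough to PCRE for these particular patterns (an assumption; ⎕R itself was not run here). all_matches is a small helper written for this example. The last call hints at why the short quote pattern relies on DotAll: without it, . stops at a line break.

```python
import re

def all_matches(pattern, text, flags=0):
    """Return the full text of every match of `pattern` in `text`."""
    return [m.group(0) for m in re.finditer(pattern, text, flags)]

# Both tag patterns need at least one character after '<', so they swallow '<><>' whole.
print(all_matches(r'<.([^>]*?)>', 'xxx<><>yyy'))  # ['<><>']
print(all_matches(r'<.+?>', 'xxx<><>yyy'))        # ['<><>']

text = 'say "one\ntwo" and "three"'
long_form = all_matches(r'"((.|\s)*?)"', text, re.DOTALL)
short_form = all_matches(r'"(.*?)"', text, re.DOTALL)
no_dotall = all_matches(r'"(.*?)"', text)

print(long_form == short_form)  # True
print(no_dotall)                # ['" and "']
```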
- DanB|Dyalog
Re: Regex in Dyalog
Dan, invaluable. Thanks a lot.
I think it is fair to say that I am slowly starting to understand how this stuff really works...
- kai