RegEx: identify fenced code blocks in Markdown
Forum rules
This forum is for discussing APL-related issues. If you think that the subject is off-topic, then the Chat forum is probably a better place for your thoughts !
RegEx: identify fenced code blocks in Markdown
I have this with ⎕IO←0 and ⎕ML←3:
- Code: Select all
q←1↓∊(⎕UCS 10),¨'Para 1 ' '' ' ~~~' '{' '+/⍳⍵' '}' '   ~~~' '' 'para 2' '' '    ~~~'
q
Para 1
~~~
{
+/⍳⍵
}
~~~
para 2
~~~
The rules for identifying a fenced code block are:
It starts with a line that has zero to three whitespace characters, followed by at least 3 "~" characters (= no upper limit) followed by a newline character. The same rule defines the end of a code block.
The number of whitespace characters as well as the number of "~" defining the fence can vary between start and end definition.
This regular expression seems to work fine:
- Code: Select all
'^\s{0,3}~{3,}[^~].*^\s{0,3}~{3,}$'⎕S 0 1 ⍠('Mode' 'M')('DotAll' 1)⊣q
8 21
21↑8↓q
~~~
{
+/⍳⍵
}
~~~
<Explanation for the interested reader>
- ('Mode' 'M') makes sure that ^ represents the beginning of a line (rather than the beginning of the whole document) and $ the end of a line (rather than the end of the whole document).
- ('DotAll' 1) means that .* includes the newline character.
- ^ means start at the beginning of a line
- \s means "single whitespace character"
- {0,3} means that the whitespace character may occur at least zero times, up to a maximum of 3.
- ~ means we are looking for a ~ character
- {3,} means at least three of them, with no upper limit
- [^~] means every character but a ~
- .* means any character, including newline (because of ('DotAll' 1))
- ^ means position at the beginning of (the next) line
- \s, {0,3}, ~ and {3,} repeat the pattern listed above
- $ means "end of line".
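For readers who want to try the pattern outside APL, here is a minimal sketch using Python's `re` module instead of ⎕S. It assumes that `re.MULTILINE` and `re.DOTALL` play roughly the roles of ('Mode' 'M') and ('DotAll' 1). The indentation of the three fences (one, three and four spaces) is also an assumption, inferred from later replies in this thread.

```python
import re

# Sample document; the indentation of the three fences is an assumption.
lines = ['Para 1 ', '', ' ~~~', '{', '+/⍳⍵', '}', '   ~~~', '', 'para 2', '', '    ~~~']
q = '\n'.join(lines)  # lines separated by a single LF

# MULTILINE: ^ and $ match at line boundaries; DOTALL: . also matches newline.
pattern = re.compile(r'^\s{0,3}~{3,}[^~].*^\s{0,3}~{3,}$', re.MULTILINE | re.DOTALL)

m = pattern.search(q)
offset, length = m.start(), m.end() - m.start()
print(offset, length)
```

Under these assumptions the sketch reports offset 8 and length 21. The match starts at offset 8 rather than 9 because \s also matches the newline of the empty line before the fence.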
That seems to work fine:
- Code: Select all
'^\s{0,3}~{3,}[^~].*^\s{0,3}~{3,}$'⎕S 0 1 ⍠('Mode' 'M')('DotAll' 1)⊣q
8 21
21↑8↓q
~~~
{
+/⍳⍵
}
~~~
However, when I convert q into a nested variable:
- Code: Select all
q2←(⎕UCS 10){⎕ML←1 ⋄ 1↓¨⍺{⍵⊂⍨⍺=⍵}⍺,⍵}q
and then try again:
- Code: Select all
'^\s{0,3}~{3,}[^~].*^\s{0,3}~{3,}$'⎕S 0 1 ⍠('Mode' 'M')('DotAll' 1)⊣q2
9 26
The result has changed, and it is wrong! I have no idea why that is. Surely the nested version (q2) should be treated like the original version (q)!
May I ask the RegEx authorities for an explanation for this?
A second obstacle: if the APL code block contains a "~" character then I expected the regular expression to go wrong. For example:
- Code: Select all
q←1↓∊(⎕UCS 10),¨'Para 1 ' '' ' ~~~' '{' '~0 1⍷⍵' '}' '   ~~~' '' 'para 2' '' '    ~~~'
It should go wrong because [^~] now fails when it reaches the ~ in ~0 1⍷⍵ but it keeps working:
- Code: Select all
'^\s{0,3}~{3,}[^~].*^\s{0,3}~{3,}$'⎕S 0 1 ⍠('Mode' 'M')('DotAll' 1)⊣q
8 23
23↑8↓q
~~~
{
~0 1⍷⍵
}
~~~
It seems as if the [^~] is not needed and indeed it isn't:
- Code: Select all
'^\s{0,3}~{3,}[^~].*^\s{0,3}~{3,}$'⎕S 0 1 ⍠('Mode' 'M')('DotAll' 1)⊣q
8 23
'^\s{0,3}~{3,}.*^\s{0,3}~{3,}$'⎕S 0 1 ⍠('Mode' 'M')('DotAll' 1)⊣q
8 23
But shouldn't the .* then consume all characters until the end of the document because it is greedy?
-
kai - Posts: 137
- Joined: Thu Jun 18, 2009 5:10 pm
- Location: Hillesheim / Germany
Re: RegEx: identify fenced code blocks in Markdown
Hi Kai,
there is a small problem with your q: the "~~~" below para 2 is prefixed with 4 whitespace characters, so the regex doesn't match that part! If you remove that one whitespace and add a
- Code: Select all
('Greedy' 0)
- Code: Select all
('Greedy' 1)
-
MBaas - Posts: 156
- Joined: Thu Oct 16, 2008 1:17 am
- Location: Gründau / Germany
Re: RegEx: identify fenced code blocks in Markdown
Yes, but that has a purpose: it confirms that it is NOT found because of the 4 white spaces.
-
kai - Posts: 137
- Joined: Thu Jun 18, 2009 5:10 pm
- Location: Hillesheim / Germany
Re: RegEx: identify fenced code blocks in Markdown
Michael has a point regarding greediness. If I make sure that even the third fence has just three leading whitespace characters, things get worse:
- Code: Select all
q←1↓∊(⎕UCS 10),¨'Para 1 ' '' ' ~~~' '{' '+/⍳⍵' '}' '   ~~~' '' 'para 2' '' '   ~~~'
'^\s{0,3}~{3,}[^~].*^\s{0,3}~{3,}$'⎕S 0 1 ⍠('Mode' 'M')('DotAll' 1)⊣q
0 37
That's obviously because it's greedy, so we must improve by making it non-greedy:
- Code: Select all
'^\s{0,3}~{3,}[^~].*?^\s{0,3}~{3,}$'⎕S 0 1 ⍠('Mode' 'M')('DotAll' 1)⊣q
8 21
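The difference between `.*` and `.*?` can be reproduced in Python's `re` as a rough analogy (not ⎕S itself). The sketch assumes one space before the opening fence and three before each of the other two, so that the last line now qualifies as a fence.

```python
import re

# All three fences are valid here; the exact indentation is an assumption.
lines = ['Para 1 ', '', ' ~~~', '{', '+/⍳⍵', '}', '   ~~~', '', 'para 2', '', '   ~~~']
q = '\n'.join(lines)
flags = re.MULTILINE | re.DOTALL

# Greedy .* runs to the end of the text and backtracks to the LAST fence.
greedy = re.search(r'^\s{0,3}~{3,}[^~].*^\s{0,3}~{3,}$', q, flags)
# Lazy .*? grows one character at a time and stops at the FIRST closing fence.
lazy = re.search(r'^\s{0,3}~{3,}[^~].*?^\s{0,3}~{3,}$', q, flags)

greedy_len = greedy.end() - greedy.start()
lazy_len = lazy.end() - lazy.start()
print(greedy_len, lazy_len)
```

With this input the greedy match is 37 characters long and swallows "para 2", while the lazy match covers only the 21 characters of the real code block.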
Okay, that's fine, but my problem with the nested array remains unsolved:
- Code: Select all
q←,¨'Para 1 ' '' ' ~~~' '{' '+/⍳⍵' '}' '   ~~~' '' 'para 2' '' '   ~~~'
'^\s{0,3}~{3,}[^~].*?^\s{0,3}~{3,}$'⎕S 0 1⍠('Mode' 'M')('DotAll' 1)⊣q
9 26
-
kai - Posts: 137
- Joined: Thu Jun 18, 2009 5:10 pm
- Location: Hillesheim / Germany
Re: RegEx: identify fenced code blocks in Markdown
Ok, sorry - I misunderstood the part about ".*" consuming everything...
The nested case is interesting indeed and I look forward to a Guru-explanation for that one ;-)
-
MBaas - Posts: 156
- Joined: Thu Oct 16, 2008 1:17 am
- Location: Gründau / Germany
Re: RegEx: identify fenced code blocks in Markdown
Kai,
searching through a string and searching through a list of strings (VTV) is not entirely the same.
For VTVs the line separator (the EOL option) is by default CR,LF which means that you get 1 extra character per "between line" matches because you used a single LF (UCS 10) in your string.
You can see this with a simple example:
- Code: Select all
⎕ucs⊃'.*'⎕s'&'⎕OPT('Mode' 'M')('DotAll' 1) ,7⍴'aaa',⎕ucs 10
97 97 97 10 97 97 97
⎕ucs⊃'.*'⎕s'&'⎕OPT('Mode' 'M')('DotAll' 1) ,'aaa' 'aaa'
97 97 97 13 10 97 97 97
You need to add the EOL option:
- Code: Select all
⎕UCS⊃'∧\s{0,3}~{3,}[∧~].*∧\s{0,3}~{3,}$'⎕S'&'⎕OPT('Mode' 'M')('DotAll' 1)('EOL' 'LF')⊣q
10 32 126 126 126 10 123 10 43 47 9075 9077 10 125 10 32 32 32 126 126 126
⎕UCS⊃'∧\s{0,3}~{3,}[∧~].*∧\s{0,3}~{3,}$'⎕S'&'⎕OPT('Mode' 'M')('DotAll' 1)('EOL' 'LF')⊣q2
10 32 126 126 126 10 123 10 43 47 9075 9077 10 125 10 32 32 32 126 126 126
EOL is useless in the string case but it doesn't hurt to add it.
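The effect of the line separator on offsets and lengths can be sketched in Python, where the list of lines is joined by hand with either LF or CR,LF. This is only an analogy for what ⎕S does with a VTV. The fence indentation is an assumption, and the trailing $ is left out because Python's $ only treats LF as a line end (a property of Python's `re`, not of ⎕S).

```python
import re

# One string per line, like the nested q2; fence indentation is an assumption.
lines = ['Para 1 ', '', ' ~~~', '{', '+/⍳⍵', '}', '   ~~~', '', 'para 2', '', '    ~~~']

fence_block = re.compile(r'^\s{0,3}~{3,}.*?^\s{0,3}~{3,}', re.MULTILINE | re.DOTALL)

def offset_and_length(text):
    """Return (offset, length) of the first fenced block in text."""
    m = fence_block.search(text)
    return m.start(), m.end() - m.start()

lf_result = offset_and_length('\n'.join(lines))      # separator: LF
crlf_result = offset_and_length('\r\n'.join(lines))  # separator: CR,LF

# The simple 'aaa' example: CR,LF adds one code point per line break.
joined = [ord(c) for c in '\r\n'.join(['aaa', 'aaa'])]
print(lf_result, crlf_result, joined)
```

Joined with LF, the block is found at offset 8 with length 21. Joined with CR,LF, the same block is found at offset 9 with length 26, because every line break inside and before the match contributes one extra character.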
As for the [^~] "not working" it is actually working.
You specified "...~{3,}[∧~].*..." and the engine happily matched a minimum of 3 ~s. It stopped as soon as it found a character that was NOT ~. For sure the next character after that match was NOT a ~ and the [^~] matched naturally. It had to, unless we were at the end of the whole string. Here it matched the UCS 10 between the lines. It was superfluous as you found out.
It won't mind the ~ in "~0 1⍷⍵". The requirement is "∧\s{0,3}~{3,}", a minimum of 3 times, and that doesn't match so it matches a single "." (any char) instead.
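To see that [^~] simply consumes the newline after the opening fence, here is a small Python sketch (Python's `re` as a stand-in for ⎕S; fence indentation assumed as one, three and four spaces). It captures the character matched by [^~] and compares the match with and without that class.

```python
import re

# The code block now contains a ~ character; indentation is an assumption.
lines = ['Para 1 ', '', ' ~~~', '{', '~0 1⍷⍵', '}', '   ~~~', '', 'para 2', '', '    ~~~']
q = '\n'.join(lines)
flags = re.MULTILINE | re.DOTALL

# Parentheses around [^~] are added only to inspect what it matched.
with_class = re.search(r'^\s{0,3}~{3,}([^~]).*^\s{0,3}~{3,}$', q, flags)
without_class = re.search(r'^\s{0,3}~{3,}.*^\s{0,3}~{3,}$', q, flags)

matched_by_class = with_class.group(1)
print(repr(matched_by_class), with_class.span(), without_class.span())
```

The captured character is the LF that ends the fence line, and both patterns return the same span, so the single ~ inside the code block never comes into play.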
You are right, the .* should consume all characters until the end of the document because it is greedy BUT, as Michael points out, your last line doesn't match because of the extra space at the beginning of the line.
Here is a simpler example:
- Code: Select all
'∧D.*?∧D'⎕S'&'⎕OPT('Mode' 'M')('DotAll' 1),¨'D' 'l2' 'D' 'l4' 'D' 'l6' 'D'
┌──┬──┐
│D │D │
│ │ │
│l2│l6│
│ │ │
│D │D │
└──┴──┘
Here the text delimiter is 'D', which MUST start at the beginning of a line. Instead of using the 'Greedy' 0 option I used a 'local' lazy quantifier (the ? after .*, which only applies to that .*). If you remove it, the pattern will match the entire text.
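The same delimiter example can be tried in Python as a rough equivalent. The lines are joined with LF here, which is an assumption (a VTV passed to ⎕S would get the default separator instead).

```python
import re

# Seven lines; 'D' on its own line acts as the delimiter.
text = '\n'.join(['D', 'l2', 'D', 'l4', 'D', 'l6', 'D'])
flags = re.MULTILINE | re.DOTALL

lazy_matches = re.findall(r'^D.*?^D', text, flags)   # pairs up delimiters
greedy_matches = re.findall(r'^D.*^D', text, flags)  # first D to last D

print(lazy_matches)
print(greedy_matches)
```

The lazy version returns two matches, one around l2 and one around l6, while l4 sits between two delimiters that have already been used. The greedy version returns a single match spanning the whole text.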
So your delimiter, here, is "^\s{0,3}~{3,}" and your text is ".*?". Instead of repeating the delimiter (and risk typos) you can ask the engine to reuse it by grouping it in parentheses and referring to it a second time with (?1) like this:
- Code: Select all
'(∧\s{0,3}~{3,}).*?(?1)' ⎕S '&' ⎕opt ('Mode' 'M')('DotAll' 1)⊣q
This reads:
(∧\s{0,3}~{3,}) define and use group 1
.*? match as little as possible
(?1) reuse group 1 to match
You don't really need the $
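Python's standard `re` module has no (?1) subroutine call, so a comparable way to avoid typing the delimiter twice there is to keep it in a variable and build the pattern from it. The sketch below does that. The name FENCE and the fence indentation in the sample text are assumptions made for illustration.

```python
import re

# The delimiter is written once and reused, in the spirit of (?1).
FENCE = r'^\s{0,3}~{3,}'
block = re.compile(f'{FENCE}.*?{FENCE}', re.MULTILINE | re.DOTALL)

lines = ['Para 1 ', '', ' ~~~', '{', '+/⍳⍵', '}', '   ~~~', '', 'para 2', '', '    ~~~']
q = '\n'.join(lines)

m = block.search(q)
print(m.span())
print(m.group())
```

String interpolation copies the text of the delimiter, which for this pattern has the same effect as re-running group 1.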
Hope this helps.
p.s. I used Classic to test my assertions and ^ doesn't work the same way there as in Unicode (the keyboard enters the APL ∧ but ⎕S needs the ASCII ^), so be careful when you cut and paste :(
- DanB|Dyalog
Re: RegEx: identify fenced code blocks in Markdown
Thanks for the explanation. Very helpful.
That a different result is returned in the case of the nested array is, in my opinion, a bug.
The `^∧` business is one more good reason to bury the Classic version sooner rather than later.
-
kai - Posts: 137
- Joined: Thu Jun 18, 2009 5:10 pm
- Location: Hillesheim / Germany
Re: RegEx: identify fenced code blocks in Markdown
> That in case of the nested array a different result is returned is in my opinion a bug.
Kai - let me expand on Dan's explanation to clarify why it is working correctly:
In specifying mixed mode (Mode M) you instructed the interpreter (using the PCRE search engine) to process the text in its entirety rather than line-by-line (line mode). Line mode reduces memory requirements and is set by default, but does not allow a search pattern to match across multiple lines because the search engine never sees beyond the end of any one line at a time - and you correctly chose a non-line mode because your search pattern needs to do exactly that.
However, in specifying that you wanted the text to be processed in its entirety but providing it as a vector of vectors (i.e. separate lines) it was necessary for the interpreter to construct the entire text and this required that the missing line ending characters be added. The default line ending CRLF (13 10) was assumed but you had previously used 10 (LF) - thus the text being processed was different in your two examples, and the results were different (and correct in each case).
When you removed the line endings from q to create q2, an essential piece of information was taken away; when q2 was presented to ⎕S it had no indication of what line ending you had in mind and assumed CRLF by default. However, ⎕S does allow you to specify the line ending using the EOL variant option that Dan mentioned: you'll get the behaviour you expected if you specify that the missing line-ending characters are LF:
- Code: Select all
'^\s{0,3}~{3,}[^~].*^\s{0,3}~{3,}$'⎕S 0 1 ⍠('Mode' 'M')('DotAll' 1)('EOL' 'LF')⊣q2
8 21
-
Richard|Dyalog - Posts: 44
- Joined: Thu Oct 02, 2008 11:11 am
Re: RegEx: identify fenced code blocks in Markdown
You don't get a different result if you use the EOL option.
You will get a different result only if you don't. And that is because YOU chose LF as line delimiter. The program has no idea what you will choose.
You may not like the default but this is not a bug, it's a feature :)
- DanB|Dyalog
Re: RegEx: identify fenced code blocks in Markdown
Dan and Richard: point taken.
I wonder whether that should be mentioned somewhere in the documentation.
-
kai - Posts: 137
- Joined: Thu Jun 18, 2009 5:10 pm
- Location: Hillesheim / Germany