RegEx: identifying list markers

by **kai** on Sat Mar 19, 2016 10:03 am

I have the following problem: I want to count white space characters and list identifiers for list items. The following table shows input and desired result:

Code:
    2 ← '* Item'
    3 ← ' * Item'
    4 ← ' * Item'
    5 ← ' * Item '
    ⍬ ← ' ** Item'             ⍝ No hit: invalid
    ⍬ ← ' # Item'              ⍝ No hit: invalid
    3 ← ' + Item'
    3 ← ' - Item'
    3 ← '1. Item'
    4 ← '11. Item'
    5 ← ' 11) Item'
    6 ← ' 11) Item'
    ⍬ ← ' 1234567890) Item'    ⍝ No hit: max 9 digits

For ordered lists this works:

Code:
      ⍬≡'\s*?\b\d{1,9}\b[.)]\s+' ⎕S 1⊢' 1234567890. Item'
1
      '\s*?\b\d{1,9}\b[.)]\s+' ⎕S 1⊢' 123. Item'
9

For unordered lists I expected this to work:

Code:
      '\s*?([-*+]{1})\s+'⎕S 1⊢' * Item'
6
      ⍬≡ '\s*?([-*+]{1})\s+'⎕S 1⊢' # Item'
1

That's fine but this is not:

Code:
      '\s*?([-*+]{1})\s+'⎕S 1⊢'  **   Item'
4

It seems that the {1} is ignored.

It's not a missing feature: it works for ordered lists, restricting the number of digits to a minimum of 1 and a maximum of 9.

I wonder what is going on here...

by **DanB|Dyalog** on Sun Mar 20, 2016 12:02 pm

The {1} is unnecessary: a single character (or character set) already means exactly ONE occurrence, so

Code:
    '\s*?([-*+]{1})\s+'

can be simplified to

Code:
    '\s*?([-*+])\s+'

and since you are not capturing the matched character you can get rid of the parens too.

The engine worked fine. Looking at '  **   Item', it first tried to match from the beginning: spaces, then a star, then at least one space. That failed because the first star is followed by another star, not a space.
It then tried matching starting at the 2nd character from the beginning, and that failed for the same reason.
It then tried at the 3rd character, and that failed for the same reason too.
At the 4th character (the 2nd star) the engine matched NO spaces, a single star and the following spaces, so a match was found: the star plus the 3 spaces after it (4 characters in total).
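The walk-through above can be reproduced outside APL. The following is a minimal sketch in Python, whose `re` module is used here only as a stand-in for the PCRE engine behind ⎕S (the two are assumed to behave alike for a pattern this simple). The input string, with two leading spaces and three spaces after the stars, is an assumption taken from the description of the 4th character.

```python
import re

# Assumed input: two leading spaces, two stars, three spaces, then the text.
text = "  **   Item"

# Unanchored patterns, with and without the redundant {1}.
with_quantifier = re.compile(r"\s*?([-*+]{1})\s+")
without_quantifier = re.compile(r"\s*?[-*+]\s+")

# search() retries at each start position until one succeeds,
# just as described in the post above.
m = without_quantifier.search(text)
start, length = m.start(), len(m.group())

# start is 0-based, so 3 means "the 4th character" (the 2nd star).
print(start, length)  # prints: 3 4

# The {1} makes no difference to where or how much is matched.
same_span = with_quantifier.search(text).span() == m.span()
print(same_span)  # prints: True
```

Because nothing forces the match to begin at the first character, the engine is free to skip the first star and report a hit further along.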

I think what you wanted was a single match FROM THE BEGINNING and you should have used the ^ at the beginning for that, as in

Code:
    '^\s*?[-*+]\s+'
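To see the effect of the anchor on the cases from the first post, here is a small sketch, again in Python's `re` as a stand-in for ⎕S. The helper name `marker_width` is hypothetical, and combining the unordered and ordered markers into one alternation is an assumption of this sketch rather than something proposed in the thread. The test strings use explicit single or double spaces chosen for illustration.

```python
import re

# Anchored pattern: optional leading whitespace, then either a single
# bullet character or 1 to 9 digits followed by '.' or ')', then whitespace.
# The alternation is an illustrative assumption, not taken from the thread.
LIST_MARKER = re.compile(r"^\s*?(?:[-*+]|\d{1,9}[.)])\s+")


def marker_width(line):
    """Return the length of the leading list marker, or None if invalid."""
    m = LIST_MARKER.match(line)
    return len(m.group()) if m else None


print(marker_width("* Item"))             # 2
print(marker_width("  * Item"))           # 4
print(marker_width(" ** Item"))           # None: second star is not a space
print(marker_width(" # Item"))            # None: '#' is not a valid marker
print(marker_width(" 11) Item"))          # 5
print(marker_width(" 1234567890) Item"))  # None: more than 9 digits
```

With the `^` in place the engine has only one starting position to try, so the double star can no longer produce a late match.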

by **kai** on Sun Mar 20, 2016 6:26 pm

How many years does it take to master this stuff?

Thanks a lot Dan.

by **crishog** on Mon Mar 21, 2016 1:04 pm

How long does it take to master APL? :-)

After several years of often scratching my head I'm reasonably comfortable with regex: you have to be if you're fiddling with tools which generate web pages, or scan for spam, which I seem to have done a lot recently. Fortunately there are a lot of resources covering regular expressions on the Internet.

Once I got over the fact that this isn't a linear set of symbols like an APL function, it became much easier.

by **kai** on Tue Mar 22, 2016 5:12 pm

Just read this in an excellent book ("Regular Expressions Cookbook" by Jan Goyvaerts & Steven Levithan):

When confronted with a problem a guy thought "I am going to solve this with Regular Expressions".

Now he has two problems.

by **AndyS|Dyalog** on Wed Mar 23, 2016 11:22 am

I've been using regular expressions for over 25 years; I feel I have only scratched the surface of what is possible. Indeed, one of the early bug reporting systems used for Dyalog APL was written as a small number of UNIX scripts, each of which had one or two highly complex uses of regular expressions and the UNIX command sed.

What I have learned is that often it is more advantageous to use multiple regular expressions to massage the data in stages, rather than trying to use one more complex expression to do everything; it might be slightly slower, but it's considerably easier for anyone (including me) to work out what is going on. In Kai's example I might be inclined to use regular expressions to either extract valid list items, or filter out bad ones, and then use APL to calculate the correct result.
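The staged idea could look roughly like the sketch below. It is written in Python rather than APL purely for illustration, and the function name `marker_width_staged` and the exact split into stages are assumptions of this sketch. A deliberately simple regular expression only separates indent, candidate marker and spacing; ordinary code then decides whether the marker is valid.

```python
import re

# Stage 1: a deliberately simple pattern that only splits the line into
# leading whitespace, a run of non-space characters, and the spacing after it.
SPLIT = re.compile(r"^(\s*)(\S+)(\s+)")


def marker_width_staged(line):
    """Return the width of a valid list marker (with spacing), else None."""
    m = SPLIT.match(line)
    if m is None:
        return None
    marker = m.group(2)

    # Stage 2: validate the candidate marker with plain code, no regex.
    if marker in ("-", "*", "+"):
        return len(m.group())

    digits, closer = marker[:-1], marker[-1]
    is_number = digits.isascii() and digits.isdigit() and len(digits) <= 9
    if closer in ".)" and is_number:
        return len(m.group())

    return None


print(marker_width_staged(" ** Item"))           # None
print(marker_width_staged("11. Item"))           # 4
print(marker_width_staged(" 1234567890) Item"))  # None
```

Each stage is easy to read on its own, at the cost of a little extra code compared with a single larger expression.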
