RegEx: identifying list markers


Post by kai on Sat Mar 19, 2016 10:03 am

I have the following problem: for list items I want to measure the length of the prefix, that is, any leading white space, the list marker itself and the white space that follows it. The following table shows input and desired result:

Code:
2 ← '* Item'
3 ← ' * Item'
4 ← '  * Item'
5 ← '  *  Item   '
⍬ ← ' **  Item'   ⍝ No hit: invalid
⍬ ← ' #  Item'   ⍝ No hit: invalid
3 ← ' + Item'
3 ← ' - Item'
3 ← '1. Item'
4 ← '11. Item'
5 ← ' 11) Item'
6 ← ' 11)  Item'
⍬ ← ' 1234567890) Item'   ⍝ No hit: max 9 digits


For ordered lists this works:

Code:
      ⍬≡'\s*?\b\d{1,9}\b[.)]\s+' ⎕S 1⊢'  1234567890.   Item'
1
      '\s*?\b\d{1,9}\b[.)]\s+' ⎕S 1⊢'  123.   Item'
9


For unordered lists I expected this to work:

Code:
      '\s*?([-*+]{1})\s+'⎕S 1⊢'  *   Item'
6
      ⍬≡ '\s*?([-*+]{1})\s+'⎕S 1⊢'  #   Item'
1


That's fine but this is not:

Code:
      '\s*?([-*+]{1})\s+'⎕S 1⊢'  **   Item'
4


It seems that the {1} is ignored.

It's not a missing feature: it works for ordered lists, restricting the number of digits to a minimum of 1 and a maximum of 9.

I wonder what is going on here...

Re: RegEx: identifying list markers

Post by DanB|Dyalog on Sun Mar 20, 2016 12:02 pm

The {1} is unnecessary: a single character (or character set) already matches exactly once, so
Code:
'\s*?([-*+]{1})\s+'
can be simplified to
Code:
'\s*?([-*+])\s+'
Since you are not using the captured character, you can drop the parentheses too.

The engine worked fine. Looking at '  **   Item', it first tried from the beginning: it matched the spaces, then a star, and then needed at least one space; that failed because the star is followed by another star.
It then tried matching from the 2nd character, which failed for the same reason.
It then tried from the 3rd character (the 1st star), and that failed for the same reason too.
At the 4th character (the 2nd star) the engine matched NO spaces, a single star and the 3 spaces after it, so a match was found with a length of 4.
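
The scan described above can be replayed outside APL. This is only a sketch in Python, on the assumption that its `re` module handles this particular pattern the same way as the PCRE engine behind ⎕S; `attempt` is a made-up helper name, not anything from Dyalog.

```python
import re

# Kai's unordered-list pattern, without the redundant {1} and parentheses.
pattern = re.compile(r"\s*?[-*+]\s+")
text = "  **   Item"

def attempt(start):
    """Try to match at exactly this offset; return the match length or None."""
    m = pattern.match(text, start)
    return None if m is None else m.end() - m.start()

# Offsets 0..3 correspond to the 1st..4th character in the walk-through.
attempts = [attempt(i) for i in range(4)]
print(attempts)  # [None, None, None, 4]

# An unanchored search reports the first offset at which an attempt succeeds.
found = pattern.search(text)
print(found.start(), len(found.group()))  # 3 4
```

The first three attempts fail and the fourth succeeds with length 4, which is the value ⎕S returned in the original post.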

I think what you wanted was a single match FROM THE BEGINNING of the string; for that you should anchor the pattern with ^, as in
Code:
'^\s*?[-*+]\s+'
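
To see the effect of the anchor, here is a small sketch, again in Python rather than APL and again assuming `re` agrees with PCRE for this pattern. `prefix_length` is a hypothetical name used only for illustration.

```python
import re

# The anchored pattern: a match may only begin at the start of the string.
anchored = re.compile(r"^\s*?[-*+]\s+")

def prefix_length(line):
    """Length of indentation + marker + following white space, or None."""
    m = anchored.search(line)
    return None if m is None else len(m.group())

samples = ["  *   Item", "  **   Item", "  #   Item"]
results = {line: prefix_length(line) for line in samples}
print(results)
# {'  *   Item': 6, '  **   Item': None, '  #   Item': None}
```

With the anchor in place the engine can no longer restart at the second star, so the '**' case is rejected instead of reporting 4.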

Re: RegEx: identifying list markers

Post by kai on Sun Mar 20, 2016 6:26 pm

How many years does it take to master this stuff?

Thanks a lot Dan.

Re: RegEx: identifying list markers

Post by crishog on Mon Mar 21, 2016 1:04 pm

How long does it take to master APL? :-)

After several years of often scratching my head I'm reasonably comfortable with regex: you have to be if you're fiddling with tools which generate web pages, or scan for spam, which I seem to have done a lot recently. Fortunately there are a lot of resources covering regular expressions on the Internet.

Once I got over the fact that this isn't a linear set of symbols like an APL function, it became much easier.

Re: RegEx: identifying list markers

Post by kai on Tue Mar 22, 2016 5:12 pm

Just read this in an excellent book ("Regular Expressions Cookbook" by Jan Goyvaerts & Steven Levithan):

When confronted with a problem a guy thought "I am going to solve this with Regular Expressions".

Now he has two problems.

Re: RegEx: identifying list markers

Post by AndyS|Dyalog on Wed Mar 23, 2016 11:22 am

I've been using regular expressions for over 25 years; I feel I have only scratched the surface of what is possible. Indeed, one of the early bug reporting systems used for Dyalog APL was written as a small number of UNIX scripts, each of which had one or two highly complex uses of regular expressions and the UNIX command sed.

What I have learned is that often it is more advantageous to use multiple regular expressions to massage the data in stages, rather than trying to use one more complex expression to do everything; it might be slightly slower, but it's considerably easier for anyone (including me) to work out what is going on. In Kai's example I might be inclined to use regular expressions to either extract valid list items, or filter out bad ones, and then use APL to calculate the correct result.
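
One possible reading of this staged approach, sketched in Python rather than APL: plain string handling splits the line into indentation, first token and gap, two very small patterns only decide whether the token is a valid marker, and ordinary arithmetic produces the result. The names `BULLET`, `ORDERED` and `marker_width` are hypothetical, and the sketch assumes that only blanks count as white space (unlike `\s`, which also accepts tabs).

```python
import re

# Two tiny patterns, each answering one question about the marker token.
BULLET = re.compile(r"[-*+]")          # exactly one bullet character
ORDERED = re.compile(r"[0-9]{1,9}[.)]")  # 1 to 9 digits, then '.' or ')'

def marker_width(line):
    """Return len(indent + marker + gap) for a list item, or None if invalid."""
    # Stage 1: peel off the indentation without any regex.
    body = line.lstrip(" ")
    indent = len(line) - len(body)
    # Stage 2: the marker is everything up to the first blank.
    token, sep, rest = body.partition(" ")
    if not sep:
        return None  # a marker must be followed by at least one blank
    # Stage 3: validate the marker on its own.
    if not (BULLET.fullmatch(token) or ORDERED.fullmatch(token)):
        return None
    # Stage 4: count the blanks after the marker (the separator plus any more).
    gap = 1 + len(rest) - len(rest.lstrip(" "))
    return indent + len(token) + gap

print(marker_width("  *  Item   "))       # 5
print(marker_width(" **  Item"))          # None
print(marker_width(" 11)  Item"))         # 6
print(marker_width(" 1234567890) Item"))  # None
```

Each stage is simple enough to check by eye, at the cost of a few more lines than the single anchored pattern.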

