Zero-length Regular Expression Matches Considered Harmful

I was asked by a colleague why ⎕S reports two matches in the following example:

      ('\d*'⎕S 0 1)'321'
┌───┬───┐
│0 3│3 0│
└───┴───┘

Here we are asking for the position and length of sequences of zero or more digits in an input document containing three numeric characters. Intuitively there is just one match of all three characters from the start of the sequence, but ⎕S reports an additional match of zero characters at the end.

The short explanation is this: when there is a match in the input, the matching characters are “consumed” and searching continues from after them. In this example the first match consumes all three characters in the input, leaving it empty. Searching then resumes and, as the pattern matches the zero remaining characters, there is a second match.

If it seems odd that we search when the input is empty then consider the case where the input is empty to start with:

      ('\d*'⎕S 0 1)''
┌───┐
│0 0│
└───┘

In this case we should not be surprised there is a match – the pattern explicitly allows it and, indeed, we would have coded the search pattern differently, perhaps as '\d+', if that was not wanted. It is therefore correct and consistent that the first example matches zero characters as well.

My colleague was not yet convinced. In this case, he asked, why are there not an infinite number of matches at the end?

It’s a good question. Zero-length matches have to be treated specially by the search engine to prevent exactly that, wherever they appear. The rule used by the PCRE search engine, which Dyalog uses, is that any pattern that results in a zero-length match is prohibited from generating another zero-length match until more characters are consumed from the input. PCRE is widely used so this behaviour is commonplace, but other search engine implementations take a different view and simply consume a single character following a zero-length match to prevent the repetition.

That difference in the rules would not have affected our example but consider a slightly more complex search pattern containing alternatives – here, zero or more digits, or one word character:

      ('\d*|\w'⎕S 0 1)'x321'
┌───┬───┬───┬───┐
│0 0│0 1│1 3│4 0│
└───┴───┴───┴───┘

In this example the first match is of zero digits at the start of the input – that is, at offset zero and zero characters long. PCRE then resumes searching at offset zero but without allowing further zero-length matches. Consequently the first alternative pattern cannot match this time but the second can. The next match is therefore a single word character (the 'x'), also at the start of the input. The 'x' is consumed and searching resumes with '321' as the input where, as explained above, there are two further matches making four in total.

Search engines which simply consume a character after a zero-length match would give a different result. After the same initial match of zero characters at the start of the input they would consume the 'x' and resume searching at '321' to give two further matches and a total of only three.

So, patterns which allow zero-length matches appear to be counter-intuitive and the cause of incompatibilities with other search engines. Should we avoid '*' and other qualifiers which allow zero repetitions of a pattern? Of course not – used carefully they can be very effective.

Firstly, a pattern which allows a sequence of zero characters within it can still be constructed so that the overall match can never have a length of zero. For example, 'colou?r' allows zero or one 'u' characters after the second 'o' so that it matches both the British English spelling 'colour' and the American English spelling 'color' – but it never results in a zero-length match, and should never surprise anyone in the way it works.

Secondly, anchors can often be used to prevent those unintuitive extra matches. Anchors in search patterns don’t match actual characters – rather, they allow the match to succeed only at specific locations in the input. '^' and '$' mean the start and end of a line respectively which, used in our very first example, prevent anything from preceding or following a match within a line and exclude the zero-length sequence because it does not satisfy that requirement:

      ('^\d*$'⎕S 0 1)'321'
┌───┐
│0 3│
└───┘

Thirdly – it often doesn’t matter that zero-length matches occur. In this example the search pattern matches any character except newline zero or more times and we are asking for it to be folded to upper-case:

      ('.*' ⎕R '\u&') 'Hello there'
HELLO THERE

The first match is of all 11 characters which are folded to upper-case and replaced. Then is there a second match of zero characters at the end and this too is folded and replaced, having no further effect. You can see this is so by using a different replacement pattern:

      ('.*' ⎕R '[\u&]') 'Hello there'
[HELLO THERE][]

As an aside, a ⎕R search pattern constructed so that it has no visible effect at all can be very useful for doing a quick and simple read of a text file into the workspace. In this example:

      tieno←⎕NTIE 'input.txt'
      ''⎕R''⊢tieno

text is read from the file without modification into a vector of character vectors, one line per element. There are many search and replace patterns that could be used to do this but this is the shortest, and it makes use of zero-length matches and replacements.

Perhaps, after all, the title of this blog should have been “zero-length regex matches considered helpful“?

Further Reading

Further information about the Dyalog search and replace operators is available in the following documents:

Footnote

Edsger W. Dijkstra wrote a paper in 1968 for Communications of the ACM entitled a “Case against the GO TO Statement” but it was published as a letter under the title “The goto statement considered harmful”. “Considered harmful” became a popular phrase in articles written in response to this and in other unrelated papers. There is an article about the phrase in Wikipedia.

Dijkstra also wrote a 1975 paper in which he criticised a number of computer languages he had used. Of APL he said, “APL is a mistake, carried through to perfection. It is the language of the future for the programming techniques of the past: it creates a new generation of coding bums”.

This author is not afraid to use goto statements or APL.

Three-and-a-bit

The most obvious expression for computing π in APL is ○1. But what if you can’t remember how works, or your O key is broken, or you feel like taking the road less travelled? With thanks to Wikipedia’s excellent list of Approximations of π, here are some short sweet APL expressions for three-and-a-bit:

      3                                 ⍝ very short
3
      4                                 ⍝ not so sweet
4
      s←*∘0.5                           ⍝ let's allow ourselves some square roots
      +/s 2 3
3.14626436994197234232913506571557
      31*÷3
3.141380652391393004493075896462748
      +/1.8*1 0.5                       ⍝ Ramanujan
3.141640786499873817845504201238766
      s 7+s 6+s 5
3.141632544503617840472137945142766
      ÷/7 4*7 9
3.141567230224609375
      9⍟995
3.141573605337628094187009177086444
      355÷113
3.141592920353982300884955752212389
      s s 2143÷22                       ⍝ Ramanujan again
3.141592652582646125206037179644022
      +∘÷/3 7 15 1 292                  ⍝ continued fraction
3.14159265301190260407226149477373
      ÷/63 25×17 7+15×s 5
3.14159265380568820189839000630151
      (1E100÷11222.11122)*÷193
3.141592653643822210363178893440074
      (⍟744+640320*3)÷s 163             ⍝ Ramanujan yet again
3.141592653589793238462643383279727

This last one is accurate to more places than I ever learned in my youth!

Technical note: to get plenty of precision, these examples were evaluated with 128-bit decimal floating-point, by setting ⎕FR←1287 and ⎕PP←34.

For more on continued fractions, see cfract in the dfns workspace.

Solving the 2014 APL Problem Solving Competition – it’s as easy as 1 1 2 3…

Competition LogoThe winners of the 2014 APL Problem Solving Competition were recognized at the Dyalog ’14 user meeting and had a chance to share their competition experiences. Emil Bremer Orloff won the student competition and received $2500 USD and an expenses-paid trip to Dyalog ’14, while Iryna Pashenkovska took first place among the non-student entries and received a complimentary registration to Dyalog ’14. Our congratulations to them for their fine efforts! This post is the first of a series where we will examine some of the problems selected for this year’s competition. The problem we’ll discuss first is Problem 3 from Phase I dealing with the Fibonacci sequence.

Problem 3 – Tell a Fib Write a dfn that takes an integer right argument and returns that number of terms in the Fibonacci sequence. Test cases:

      {your_solution} 10
 1 1 2 3 5 8 13 21 34 55
      {your_solution} 1
 1
      {your_solution} 0   ⍝ should return an empty vector

Essentially, you start with the sequence 1 1 and concatenate the sum of the last 2 numbers to the end and repeat until the sequence has the correct number of terms. In Python, a solution might look something like this:

def fib(n): # return the first n elements of the Fibonacci sequence
    result = []
    a, b = 0, 1
    while 0 < n:
        result.append(b)
        a, b = b, a+b
        n -= 1
    return result

You can write nearly the same code in APL:

r←fibLooping n;a;b
  r←⍬
  (a b)←0 1
:While 0<n
  r,←b
  (a b)←b,a+b
  n-←1
:EndWhile

While it’s possible to write APL code that looks like Python (or many other languages), one of the judging criteria for the competition is the effective use of APL syntax – in this case, by avoiding the explicit loop. Two ways to do this are 1) using recursion and 2) using the power operator ().

Recursive Solution

      fibRecursive←{⍵<3:⍵⍴1 ⋄ {⍵,+/¯2↑⍵}∇⍵-1}

The neat thing about recursion is that the function keeps calling itself with a “simpler” argument until it reaches a base case and then passes the results back up the stack. Here the base case occurs when the argument () is less than 3 and the function returns ⍵⍴1 – it’s either an empty vector, 1, or 1 1 for values of of 0, 1 and 2 respectively. It’s easy to illustrate the recursive process by adding some code (⎕←) to display information at each level of recursion.

      fibRecursive←{⍵<3:⎕←⍵⍴1 ⋄ {⎕←⍵,+/¯2↑⍵}∇⎕←⍵-1}
      fibRecursive 10
9
8
7
6
5
4
3
2
1 1
1 1 2
1 1 2 3
1 1 2 3 5
1 1 2 3 5 8
1 1 2 3 5 8 13
1 1 2 3 5 8 13 21
1 1 2 3 5 8 13 21 34
1 1 2 3 5 8 13 21 34 55

The recursive solution above is termed “stack recursive” in that it creates a new level on the function call stack for each recursive call. We can modify the code slightly to implement a “tail recursive” solution which Dyalog can detect and optimize by avoiding creating those additional function call stack levels. You can see the effect of this as each level computes its result immediately:

      fibTail←{⍺←0 1 ⋄ ⍵=0:⎕←1↓¯1↓⍺ ⋄ (⎕←⍺,+/¯2↑⍺)∇ ⎕←⍵-1}
      fibTail 10
        fibTail 10
9
0 1 1
8
0 1 1 2
7
0 1 1 2 3
6
0 1 1 2 3 5
5
0 1 1 2 3 5 8
4
0 1 1 2 3 5 8 13
3
0 1 1 2 3 5 8 13 21
2
0 1 1 2 3 5 8 13 21 34
1
0 1 1 2 3 5 8 13 21 34 55
0
0 1 1 2 3 5 8 13 21 34 55 89
1 1 2 3 5 8 13 21 34 55

Power Operator Solution

      fibPower←{⍵⍴({⍵,+/¯2↑⍵}⍣(0⌈⍵-1))1}

The power operator is defined as {R}←{X}(f⍣g)Y. When the right operand g is a numeric integer scalar – in this case (0⌈⍵-1), the power operator applies its left operand function f{⍵,+/¯2↑⍵} cumulatively g times to its argument Y, which in this case is 1. Similar to the recursive solution, we can add ⎕← to see what happens at each application:

      fibPower←{⍵⍴({⎕←⍵,+/¯2↑⍵}⍣(0⌈⍵-1))1}
      fibPower 10
1 1
1 1 2
1 1 2 3
1 1 2 3 5
1 1 2 3 5 8
1 1 2 3 5 8 13
1 1 2 3 5 8 13 21
1 1 2 3 5 8 13 21 34
1 1 2 3 5 8 13 21 34 55
1 1 2 3 5 8 13 21 34 55

In discussing this blog post with my fellow Dyalogers, Roger Hui showed me a rather neat construct:

      fibRoger←{1∧+∘÷\⍵⍴1}     ⍝ Roger's original version
      fibBrian←{1∧÷(+∘÷)\⍵⍴1}  ⍝ tweaked to make result match contest examples
      fibRoger 10
1 2 3 5 8 13 21 34 55 89
      fibBrian 10
1 1 2 3 5 8 13 21 34 55

I’ve added the parentheses around +∘÷ to make it a bit clearer what the left operand to the scan operator \ is. Let’s break the code down…

      ⍵⍴1 ⍝ produces a vector of ⍵ 1's 
1 1 1 1 1 ⍝ for ⍵=5

Then we apply scan with the function +∘÷, which in this case has the effect of adding 1 to the reciprocal of the previous result:

      (+∘÷)\1 1 1 1 1
1 2 1.5 1.666666667 1.6

Note that the denominator of each term is the corresponding element of the Fibonacci series…

1 ÷ 1 = 1
2 ÷ 1 = 2
3 ÷ 2 = 1.5
5 ÷ 3 = 1.6666666667
8 ÷ 5 = 1.6

To extract them, take the reciprocal and then the LCM (lowest common multiple) with 1.

      1∧÷1 2 1.5 1.666666667 1.6
1 1 2 3 5

What happens if you want to accurately compute larger Fibonacci numbers? There’s a limit to the precision based on the internal number representations. Because fibRoger and fibBrian delve into floating point representation, they’re the most limited. (Roger points out that the J language has native support for extended precision and does not suffer from this limitation.)

Dyalog has a system variable, ⎕FR, that sets the how floating-point operations are performed using either IEEE 754 64-bit floating-point operations or IEEE 754-2008 128-bit decimal floating-point operation. For most applications, 64-bit operations are perfectly acceptable, but in applications that demand a very high degree of precision, 128-bit operations can be used, albeit at some performance degradation. Using 64-bit operations, fibBrian loses accuracy after the 35th term while using 128-bits extends the accuracy to 68 terms.

Even setting ⎕FR to use 128-bit operations and the print precision, ⎕PP, to its maximum, we can only compute up to the 164th element 34-digit 8404037832974134882743767626780173 before being forced into scientific notation.

Can we go further easily? Yes we can! Using Dyalog for Microsoft Windows’ .NET integration, we can seamlessly make use of the BigInteger class in the System.Numerics .NET namespace.
First we need to specify the namespace and where to look for it…

      ⎕USING←'System.Numerics,system.numerics.dll'

Then we make a trivial change to our code to use BigInteger instead of native data types…

      fibRecursive←{⍵<3:⍵⍴⎕NEW BigInteger 1 ⋄ {⍵,+/¯2↑⍵}∇ ⍵-1}

And we’re done!

So the next time someone walks up to you and asks “What can you tell me about the 1,000th element of the Fibonacci series?”, you can confidently reply that it has 209 digits and a value of
43466557686937456435688527675040625802564660517371780402481729
08953655541794905189040387984007925516929592259308032263477520
96896232398733224711616429964409065331879382989696499285160037
04476137795166849228875.

For other APL Fibonacci implementations, check out this page.

A Speed-Up Story

The first e-mail of the work week came from Nicolas Delcros. He wondered whether anything clever can be done with ∘.≡ on enclosed character strings. I immediately thought of using “magic functions“, an implementation technique whereby interpreter facilities are coded as dfns. I thought of magic functions because the APL expressions involved in this case are particularly terse:

   t ←' zero one two three four five six seven eight nine'
   t,←' zéro un deux trois quatre cinq six sept huit neuf'
   t,←' zero eins zwei drei vier fünf sechs sieben acht neun'
   b←⌽a←1↓¨(' '=t)⊂t

   cmpx 'a∘.≡a' '∘.=⍨⍳⍨a'
  a∘.≡a   → 9.69E¯5 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  ∘.=⍨⍳⍨a → 8.21E¯6 | -92% ⎕⎕
   cmpx 'a∘.≡b' '(a⍳a)∘.=(a⍳b)'
  a∘.≡b         → 9.70E¯5 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  (a⍳a)∘.=(a⍳b) → 1.43E¯5 | -86% ⎕⎕⎕⎕

   y←⌽x←300⍴a

   cmpx 'x∘.≡x' '∘.=⍨⍳⍨x'
  x∘.≡x   → 9.53E¯3 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  ∘.=⍨⍳⍨x → 1.52E¯4 | -99% ⎕
   cmpx 'x∘.≡y' '(x⍳x)∘.=(x⍳y)'
  x∘.≡y         → 9.55E¯3 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  (x⍳x)∘.=(x⍳y) → 1.95E¯4 | -98% ⎕

The advantage will be even greater if/when we speed up . (And, obviously, the idea is also applicable to ∘.≢ : just replace with and = with .)

Jay Foad objected that the comparisons above aren’t quite fair as the more verbose expressions should check that either ⎕ct←0 or that the arguments do not contain any elements subject to tolerant comparison. I countered that the checking in C would not greatly affect the benchmarks as the time to do the checking is O(m+n) but the benefits are O(m×n) .

Since the factors are so promising and the coding relatively easy, I went ahead and did the work, with the following results:

14.1 14.0 ratio cost of checking
a∘.≡a 9.16e¯6 9.69e¯5 10.58 1.09
a∘.≡b 1.53e¯5 9.70e¯5 6.34 1.08
x∘.≡x 1.48e¯4 9.53e¯3 64.39 1.00
x∘.≡y 1.94e¯4 9.55e¯3 49.23 1.01

The last column (the cost of checking that tolerant comparison is not involved) is under 10% and decreases as the argument sizes increase.

This work is another illustration of the ubiquity and practical usefulness of the “selfie” concept — in the new (14.1) implementation, x∘.≡y is faster when the left and right arguments are the same than when they are not. In particular, the selfie x⍳x or ⍳⍨x occurs twice, and bolsters the point made in item 3 of Sixteen APL Amuse-Bouches:

x⍳x are like ID numbers; questions of identity on x can often be answered more efficiently on x⍳x than on x itself.

Finally, after all that, I recommend that one should consider using x⍳y before using x∘.≡y . The latter requires O(m×n) space and O(m×n) time, and is inherently inefficient.

Creating Shortcuts (Microsoft Windows)

At Dyalog, a developer not only needs access to all of the readily available editions of the interpreter but also to earlier versions that are no longer officially supported. On my Microsoft Windows Desktop I have a folder that contains a shortcut to my developer builds of all these interpreters.

I’ve recently suffered a complete failure of my C drive which, of course, contained my Windows desktop and all my shortcuts – which were lost.

The Dyalog interpreter exists in Classic and Unicode editions and 32 and 64 bit flavours…for versions from 12.0 to 14.1 this is a total of 28 interpreters. I couldn’t bring myself to create each of those 28 shortcuts by hand; let’s not even consider that if I were to build both debug and optimised binaries that would be 56 shortcuts!

Fortunately we have an old COM example (in the shortcut.dws workspace) which is shipped with the product. It’s a trivial exercise to call the supplied SHORTCUT function in one or more (ahem!) loops to create all the required shortcuts:

 Dyalogs;vers;chars;bits;location;v;c;b;dyalog;dir;name

 vers←'12.0.dss' '12.1.dss' '13.0.dss' '13.1.dss' '13.2.dss' '14.0.dss' 'trunk'
 chars←'classic' 'unicode'
 bits←'32' '64'

 ⎕NA'u shell32|SHGetFolderPath* u u u u >0T'
 location←'\Dyalog',⍨2⊃SHGetFolderPath 0 0 0 0 255

 :For v :In vers
     :For c :In chars
         :For b :In bits

             dir←'e:\obj\',v,'\2005\apl\win\',b,'\',c,'\winapi\dev\dbg'
             dyalog←dir,'\dyalog.exe'
             name←'.lnk',⍨location,'\',({v↓⍨('.dss'≡¯4↑v)/¯4}v),' ',b,' ',c

             name SHORTCUT dyalog dir'' 1('' 0)0 ''
         :End
     :End
 :End

shortcuts