Newlines etc.

by **petermsiegel** on Mon Aug 21, 2017 5:16 am

Does anyone know which characters the Dyalog system alters or treats as line terminators, i.e. which ones it does not treat as literals in quoted text? I'm especially interested in what is transformed by ⎕FMT, ⎕FX, ⎕ED, )EDIT, etc.

My goal is to create quoted strings on the fly and then fix them, within quotes, in functions. While it is trivial to turn strings with special chars (like NEL = ⎕UCS 133) into concatenated strings such as 'TEXT...',(⎕UCS 133),'MORE TEXT', knowing which characters won't be altered by system routines would be helpful, and far more efficient than concatenating little strings each time through. That way, I need only escape those that cause errors or unexpected newlines.

(I am assuming that ⎕R and the I/O functions make no quiet changes to these characters, beyond the defined EOL character, if normalization is kept off.)

Typically, the universe of newline characters is CR LF CRLF NEL VT FF LS PS. How are they treated internally, viz. in quoted strings, when entered by external editors or by construction?

by **AndyS|Dyalog** on Mon Aug 21, 2017 9:15 am

I'm not sure that there is a simple answer to this question, and in particular, there isn't a simple answer that is future-proof. Can you give a more exact example of what you're trying to do?

What about examples like

      str←'''',(⎕ucs 13 10),''''
      ⎕fx'fn' str
      ⍴⎕cr'fn'
2 4
      ⍴⎕fmt ⎕cr'fn'
4 4

Are you OK with this, or do you need to preserve the text '⎕ucs 13 10' so that the number of lines in ⍴⎕fmt ⎕cr is 2?

As a first pass, I'd treat 133 and everything less than 32 specially. As a more thorough test I would write some code which applies ⎕FX/⎕FMT/⎕JSON etc to a series of strings, each of which contains one of the characters that you consider might be included in a string, and check to see which ones "work" and which don't. Such a test may well run sufficiently quickly that you could run it each time and build up a list of good and bad characters on the fly.
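A test of that kind might look like the sketch below. It is only an illustration for ⎕FX: the names `SurvivesFX` and `probe∆fn` are hypothetical, it assumes a Unicode edition of Dyalog with default ⎕ML, and it assumes the name `probe∆fn` is free to be overwritten. A character "works" here if the function fixes, is not split into extra lines, and returns the original string when run.

```apl
SurvivesFX←{
    ⍝ ⍵: a Unicode code point. Result: 1 if the character survives inside a
    ⍝ quoted string through ⎕FX, ⎕CR and execution, else 0.
    ⍝ Hypothetical helper; overwrites the (assumed free) name probe∆fn.
    0::0                                   ⍝ any error counts as "does not work"
    c←⎕UCS ⍵                               ⍝ the candidate character
    lit←'a',c,'b'                          ⍝ text we want between the quotes
    rc←⎕FX'r←probe∆fn'('r←''',lit,'''')    ⍝ header line plus one body line
    'probe∆fn'≢rc:0                        ⍝ ⎕FX returns a line index on failure
    2≠≢⎕CR'probe∆fn':0                     ⍝ the body was split into extra lines
    lit≡⍎'probe∆fn'                        ⍝ running it must give back the text
}
```

Applied to a list of suspects, for example `cps←9 10 11 12 13 133 8232 8233`, the expression `cps,⍪SurvivesFX¨cps` tabulates each code point against its result. The same pattern could be repeated with ⎕FMT or ⎕JSON in place of ⎕FX.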

Be aware that even if the string "works", you can have problems further down the line when you display that string: what looks like a space isn't, and you start to get what appear to be random SYNTAX ERRORs when you copy and paste into the session.

If there's any chance that your code would need to run in a Classic interpreter, you also need to check that the characters exist in the ⎕AVU of the target namespace.
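As a rough illustration of that check, the hypothetical helper `InAVU` below reports which code points have a slot in the ⎕AVU of the namespace it runs in. Whether a given code point is present depends on how ⎕AVU has been set there, so no result is shown.

```apl
InAVU←{
    ⍝ ⍵: vector of Unicode code points.
    ⍝ Result: boolean mask, 1 where the code point appears in the current ⎕AVU.
    ⍵∊⎕AVU
}
```

For example, `InAVU 10 11 12 13 133 8232 8233` tests LF VT FF CR NEL LS PS in one go.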

As it is, we've been talking about altering the way the interpreter deals internally with EOLs for a long time, and might even make some changes in 17.0. This could potentially affect your code too.

by **petermsiegel** on Mon Aug 21, 2017 6:12 pm

Thanks for your analysis... Answer to your question at bottom of note...

I think I've figured out where NELs and other chars disappear; happy to have any errors pointed out to me. I believe that ⎕R replaces NELs even if ('EOL' 'LF') is set and there is no EOL normalization, whenever ⎕R would by default (or by option) create a Nested result. (EOL LF) should mean that ONLY LFs are treated as EOL chars in this instance. To prevent the replacement (since I have multiple actual lines with newlines), specify ('ResultText' 'Simple') and split via APL primitives.

At first glance, this makes sense, but ResultText Nested should only (IMHO) create separate vectors (strings) where the current newline appears. If ('EOL' 'LF') is specified, that should ONLY be when linefeeds (newlines) are seen. From the documentation: "The implicit line ending character may be set using the EOL option. Explicit line ending characters may also be replaced by this character - so that all line endings are normalised - using the NEOL option."

Here are the options I use with ⎕R; with them, NELs and PSs are preserved properly.

      R∆OptsModeL←('Mode' 'L')('ResultText' 'Simple')('EOL' 'LF')('IC' 1)('UCP' 1)

But here's a simpler test case:

      ⍝ TEST CASE
      NEL←⎕UCS 133
      test←'Line 1',NEL,'Line 2'
      test
Line 1 Line 2
      NEL∊test
1
      test2←'L'⎕R'l'⊣test
      test2
line 1 line 2
      NEL∊test2
1
      test3←'L'⎕R'l'⍠('ResultText' 'Nested')⊣test
      test3
line 1 line 2
      NEL∊∊test3
0
      test4←'L'⎕R'l'⊣test 'And another'
      test4
line 1 line 2 And another
      NEL∊test4
0
      test5←'L'⎕R'l'⍠('ResultText' 'Simple')⊣test 'And another'
      NEL∊∊test5   ⍝ ResultText Simple forces ⎕R not to eat the NEL
1
      test5
line 1 line 2 And another

My preference would be simply to standardize and document: perhaps, handle end of lines as narrowly and as consistently as possible and let everything else through. Whatever the case, I'd do so by FIAT, rather than programming drift. (If you specify that ALL UNICODE-std EOL chars are treated as newlines or otherwise not passed through intact, that's fine-- but that then means ALL of them.)

This document lays out the problem:

http://unicode.org/standard/reports/tr13/tr13-5.html

by **Richard|Dyalog** on Wed Aug 23, 2017 1:50 pm

Many thanks for your analysis of this. It is the case that there is some inconsistency in the way some Dyalog functions work with line endings - these historical differences have come about from the prevailing standards at the time they were implemented, the behaviour of the original Classic interpreter when converted to Unicode, and so on. When ⎕R and ⎕S were implemented, we tried to faithfully honour the Unicode standard requirements - but it's really quite complicated and there may be errors.

You said:

I believe that ⎕R replaces NELs even if ('EOL' 'LF') is set. (EOL LF) should mean that ONLY LFs are treated as EOL chars in this instance.

You are correct in your analysis of what is happening; the EOL option only specifies which line ending sequence is used when one needs to be implicitly generated or, when normalisation takes place, which line ending sequence to normalise to. Your assertion that it should also influence how the contents of the input are interpreted contradicts my own reading of the Unicode document you cite, most specifically:

A readline function should stop at NLF [CR, LF, CRLF, or NEL], LS, FF, or PS.

Here is a brief overview of how a document "flows" through ⎕R. (Actually, there are two models but you are exclusively using Line mode so I will focus on that one):

1. The document (that is, the input) is a stream of characters which may derive from either (a) a single character vector (which may or may not contain line ending characters) ("simple"), (b) a vector of character vectors (which have an implicit line ending between them, and may also contain explicit line ending characters) ("nested"), or (c) a file.
2. Regardless of the document format, this stream of characters is then broken up into individual lines by splitting on any and all line ending sequences, as mandated by the Unicode specification. Note that the individual lines do not include the line ending sequences themselves (although each line "remembers" which line ending sequence terminated it in case it is needed later in stage 4).
3. The lines are individually passed to the PCRE search engine and modified as directed. (This could introduce new line ending sequences which again cause the text to be split into separate lines.)
4. If the result is nested then each line is used to create one character vector in the output. If the result is simple then the lines are "glued back together". The line ending sequence used to "glue" them is the one that had been "remembered" for that line.

Note that regardless of whether the input document was simple or nested, it is broken up and processed in lines. The default output format matches the input format; the ResultText option simply overrides that default.
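
The stages above can be mimicked with a small model written in APL. This is only a sketch of the flow as described, not Dyalog's implementation: the operator name `LineMode` is hypothetical, it accepts only a simple character vector, it stands in an arbitrary per-line function for the PCRE step, and the set of line ending characters (LF VT FF CR NEL LS PS, with CRLF treated as one sequence) is an assumption taken from the lists quoted earlier in this thread. It assumes default ⎕ML.

```apl
LineMode←{
    ⍝ Hypothetical model of Line mode.
    ⍝ ⍺⍺: function applied to each line (stands in for the PCRE step)
    ⍝ ⍺ : 'Simple' or 'Nested' (the result format)
    ⍝ ⍵ : simple character vector (the document)
    eols←⎕UCS 10 11 12 13 133 8232 8233      ⍝ assumed set: LF VT FF CR NEL LS PS
    isEol←⍵∊eols
    crThenLf←(⍵=⎕UCS 13)∧1↓(⍵=⎕UCS 10),0     ⍝ CR that starts a CRLF pair
    ends←isEol∧~crThenLf                     ⍝ last character of each line ending
    segs←((≢⍵)↑1,ends)⊂⍵                     ⍝ stage 2: one segment per line
    lines←{(~⍵∊eols)/⍵}¨segs                 ⍝ line text without its ending
    terms←{(⍵∊eols)/⍵}¨segs                  ⍝ the "remembered" ending, maybe empty
    new←⍺⍺¨lines                             ⍝ stage 3: process each line
    ⍺≡'Nested':new                           ⍝ stage 4: endings are not output
    ∊new,¨terms                              ⍝ stage 4: glue with remembered endings
}

      NEL←⎕UCS 133
      test←'Line 1',NEL,'Line 2'
      NEL∊∊'Nested'⌽LineMode test
0
      NEL∊'Simple'⌽LineMode test
1
```

In this model the NEL vanishes from a nested result simply because line endings are never part of a line, while a simple result puts back whichever ending was remembered. That matches the behaviour seen in the test case earlier in the thread.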

I hope this helps clarify!

by **petermsiegel** on Wed Aug 23, 2017 5:36 pm

Much obliged. It's all a problem of standards. As the old quip (from Tanenbaum) goes: "The nice thing about standards is that you have so many to choose from."
