nums ← {utab} ##.utf8 chrs                  ⍝ ⎕AV → UTF-8 translation.
chrs ← {utab} ##.utf8 nums                  ⍝ UTF-8 → ⎕AV translation.

[chrs]  is  a  simple  character  vector and [nums] is its UTF-8 encoding into a
vector of numbers in the range 0..255.

To write the resulting UTF-8 numbers to a native file using ⎕nappend, first
"cast" them to 1-byte integers in the range ¯128..127 using →int←; functions
→utf8put← and →utf8get← do this.
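
As a rough illustration of the cast (not the definition of →int←), the sketch
below uses a hypothetical helper, cast, which maps 0..255 onto ¯128..127 by
subtracting 256 from any value above 127; 256| undoes it:

    cast ← {⍵-256×⍵>127}            ⍝ hypothetical stand-in for →int←.

    cast 104 226 141 179 255        ⍝ unsigned → signed bytes.
104 ¯30 ¯115 ¯77 ¯1

    256|cast 104 226 141 179 255    ⍝ and back again.
104 226 141 179 255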

Building the UTF-8 encoding table takes a significant, though fixed, time (of
the order of a millisecond). For casual use this may not be a problem, but for
repeated use we might prefer to generate the table once and supply it as utf8's
optional left argument on subsequent calls. To return the table, rather than a
translation, call utf8 with a null '' left argument. See example below.

(muse:

    If Dyalog were extended to allow function-returning functions, we could
    write a utf8∆ that returned either a ⎕AV-to-UTF-8 or a UTF-8-to-⎕AV
    function, each bound with the encoding vector to form a closure:

    utf8∆ ← {⎕io ⎕ml←0 1                        ⍝ UTF-8 ←→ ⎕AV translation.
    ·
    ·   utab←{                                  ⍝ 256-item encoding vector.
    ·   ·   ⍵<128:⍵                             ⍝ 0000 .. 007f
    ·   ·   ⍵<2048:192 128+0 64⊤⍵               ⍝ 0080 .. 07ff
    ·   ·   224 128 128+0 64 64⊤⍵               ⍝ 0800 .. ffff
    ·   }¨{                                     ⍝ Unicode codepoints / ⎕AV char.
    ·   ·   16⊥'0123456789abcdef'⍳⍉⍵            ⍝ hex to decimal.
    ·   }{                                      ⍝
    ·   ·   ↑(1↓¨(' '=⍵)⊂⍵)~⊂''                 ⍝ 256×4 char matrix.
    ·   }{
    ·   ·   ' 0000 0008 000a 000d 0020 000c e000 e001 ',⍵}{    ⍝ 00-07  ········
    ·   ·   ' 001b 0009 2336 026b 0025 0027 237a 2375 ',⍵}{    ⍝ 08-0f  ··⌶ɫ%'⍺⍵
    ·   ...
    ·   ·   ' 003a 2377 00bf 00a1 22c4 2190 2192 235d ',⍵}{    ⍝ f0-f7  :⍷¿¡⋄←→⍝
    ·   ·   ' 0029 005d e009 e00a 00a7 2395 235e 2363 ',⍵}''   ⍝ f8-ff  )]§⎕⍞⍣
    ·
    ·   'a2u'≡⍵:{∊utab[⎕av⍳⍵]}                  ⍝ ⎕AV to UTF-8 function.
    ·
    ·   'u2a'≡⍵:{                               ⍝ UTF-8 to ⎕AV function.
    ·   ·   ⎕av[+⌿0⌈(∨⌿¯1≠⍵)/⍵]                 ⍝ ⎕av posn per sequence.
    ·   }∘{
    ·   ·   ¯1+↑257|1+utab∘⍳¨1 2 3,/¨⊂⍵         ⍝ sequences per sequence length.
    ·   }
    }

    Then:

        u2a ← utf8∆'u2a'        ⍝ generate UTF-8 to ⎕AV function.

        a2u ← utf8∆'a2u'        ⍝ generate ⎕AV to UTF-8 function.

        a2u'hello⍳world'        ⍝ apply ⎕AV to UTF-8 function.
    104 101 108 108 111 226 141 179 119 111 114 108 100

        u2a a2u'hello⍳world'    ⍝ and its inverse.
    hello⍳world

        ⎕av≡ u2a a2u ⎕av        ⍝ round-trip all ⎕AV characters.
    1

    See: http://www.dyalog.com/dfnsdws/fre.pdf
)
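
As a worked check of the three guards in the muse's encoding vector, here they
are lifted into a stand-alone function. The name enc is an assumption made for
illustration; the expressions are those shown above, which encode a codepoint
by splitting it into base-64 digits and adding the UTF-8 lead and continuation
markers:

    enc ← {                                 ⍝ codepoint → UTF-8 bytes (sketch).
        ⍵<128:⍵                             ⍝ 0000 .. 007f: 1 byte.
        ⍵<2048:192 128+0 64⊤⍵               ⍝ 0080 .. 07ff: 2 bytes.
        224 128 128+0 64 64⊤⍵               ⍝ 0800 .. ffff: 3 bytes.
    }

    enc 97                                  ⍝ U+0061 a
97

    enc 168                                 ⍝ U+00A8 ¨
194 168

    0 64 64⊤9075                            ⍝ U+2373 ⍳ as base-64 digits.
2 13 51

    enc 9075                                ⍝ ... plus 224 128 128.
226 141 179

The last two results agree with the bytes shown for ¨ and ⍳ in the examples
below.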

Examples:

    4 0⍕ utf8 'hello world'
 104 101 108 108 111  32 119 111 114 108 100
⍝<h> <e> <l> <l> <o> < > <w> <o> <r> <l> <d>

    4 0⍕ utf8 'hello⍳world'
 104 101 108 108 111 226 141 179 119 111 114 108 100
⍝<h> <e> <l> <l> <o> <⍳········> <w> <o> <r> <l> <d>

    4 0⍕ utf8 'a⍳⍴¨¯2',⊃⎕av
  97 226 141 179 226 141 180 194 168 194 175  50   0
⍝<a> <⍳········> <⍴········> <¨····> <¯····> <2> <>

    utf8 utf8 'a⍳⍴¨¯2'              ⍝ round-trip some APL chars.
a⍳⍴¨¯2

    ⎕av ≡ utf8 utf8 ⎕av             ⍝ round-trip all APL chars.
1

    {⍵≡utf8 utf8 ⍵} notes.utf8      ⍝ round-trip char vec with newlines.
1

    utab ← '' utf8 ''               ⍝ save encoding table.

    ufast8 ← utab∘utf8              ⍝ bind table to avoid setup cost.

    ⎕av ≡ ufast8 ufast8 ⎕av         ⍝ round-trip all APL chars.
1

See also: utf8get utf8put int xhtml
