nums ← {utab} ##.utf8 chrs ⍝ ⎕AV → UTF-8 translation.
chrs ← {utab} ##.utf8 nums ⍝ UTF-8 → ⎕AV translation.
[chrs] is a simple character vector and [nums] is its UTF-8 encoding into a
vector of numbers in the range 0..255.
To write the resulting UTF-8 numbers to a native file using ⎕nappend, first
"cast" them to 1-byte integers in the range ¯128..127 using →int←; functions
→utf8put← and →utf8get← do this.
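The cast is only a change of range: values above 127 wrap round to negative numbers. As a rough illustration, sketched in Python rather than APL, the mapping might look like the following. The names to_signed_bytes and to_unsigned_bytes are hypothetical and are not the source of →int←.

```python
def to_signed_bytes(nums):
    """Map numbers in 0..255 to 1-byte signed integers in -128..127."""
    return [n - 256 if n > 127 else n for n in nums]


def to_unsigned_bytes(ints):
    """Inverse mapping: -128..127 back to 0..255."""
    return [i % 256 for i in ints]


# 'h' followed by the 3-byte UTF-8 sequence for the iota character.
utf8_nums = [104, 226, 141, 179]
signed = to_signed_bytes(utf8_nums)
restored = to_unsigned_bytes(signed)
```

Values below 128 (plain ASCII) are unchanged by the cast, so only multi-byte sequences are affected.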
Building the UTF-8 encoding table takes a significant, though fixed, time (on the
order of a millisecond). For casual use this may not be a problem, but for
repeated use we might prefer to generate the table once and supply it as utf8's
optional left argument on subsequent calls. To return the table rather than a
translation, call utf8 with a null ('') left argument. See the example below.
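The build-once, bind-for-reuse pattern can be sketched outside APL too. The Python below is an illustration only: CHARSET is a hypothetical stand-in for ⎕AV (a handful of characters, not the real 256-item vector), and build_table, utf8 and ufast8 are made-up names that mirror the idea, not the workspace function itself.

```python
from functools import partial

# Hypothetical stand-in for the atomic vector: just a few characters.
CHARSET = "helo wrd⍳⍴¨¯2"


def build_table(charset=CHARSET):
    """Build the per-character encoding table (the fixed setup cost)."""
    return {ch: list(ch.encode("utf-8")) for ch in charset}


def utf8(chars, table=None):
    """Encode chars to UTF-8 numbers, rebuilding the table only if none is given."""
    if table is None:
        table = build_table()
    return [b for ch in chars for b in table[ch]]


table = build_table()                 # pay the setup cost once
ufast8 = partial(utf8, table=table)   # bind the table for later calls

slow = utf8("hello⍳world")            # rebuilds the table on every call
fast = ufast8("hello⍳world")          # reuses the bound table
```

Both calls give the same numbers; the bound version simply skips the table construction each time.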
(muse:
If Dyalog were extended to allow function-returning functions, we could make
a utf8∆, which returned either a ⎕AV_to_UTF-8 or UTF-8_to_⎕AV function, each
bound with the encoding vector to form a closure:
utf8∆ ← {⎕io ⎕ml←0 1 ⍝ UTF-8 ←→ ⎕AV translation.
·
· utab←{ ⍝ 256-item encoding vector.
· · ⍵<128:⍵ ⍝ 0000 .. 007f
· · ⍵<2048:192 128+0 64⊤⍵ ⍝ 0080 .. 07ff
· · 224 128 128+0 64 64⊤⍵ ⍝ 0800 .. ffff
· }¨{ ⍝ Unicode codepoints / ⎕AV char.
· · 16⊥'0123456789abcdef'⍳⍉⍵ ⍝ hex to decimal.
· }{ ⍝
· · ↑(1↓¨(' '=⍵)⊂⍵)~⊂'' ⍝ 256×4 char matrix.
· }{
· · ' 0000 0008 000a 000d 0020 000c e000 e001 ',⍵}{ ⍝ 00-07 ········
· · ' 001b 0009 2336 026b 0025 0027 237a 2375 ',⍵}{ ⍝ 08-0f ··⌶ɫ%'⍺⍵
· ...
· · ' 003a 2377 00bf 00a1 22c4 2190 2192 235d ',⍵}{ ⍝ f0-f7 :⍷¿¡⋄←→⍝
· · ' 0029 005d e009 e00a 00a7 2395 235e 2363 ',⍵}'' ⍝ f8-ff )]§⎕⍞⍣
·
· 'a2u'≡⍵:{∊utab[⎕av⍳⍵]} ⍝ ⎕AV to UTF-8 function.
·
· 'u2a'≡⍵:{ ⍝ UTF-8 to ⎕AV function.
· · ⎕av[+⌿0⌈(∨⌿¯1≠⍵)/⍵] ⍝ ⎕av posn per sequence.
· }∘{
· · ¯1+↑257|1+utab∘⍳¨1 2 3,/¨⊂⍵ ⍝ sequences per sequence length.
· }
}
Then:
u2a ← utf8∆'u2a' ⍝ generate UTF-8 to ⎕AV function.
a2u ← utf8∆'a2u' ⍝ generate ⎕AV to UTF-8 function.
a2u'hello⍳world' ⍝ apply ⎕AV to UTF-8 function.
104 101 108 108 111 226 141 179 119 111 114 108 100
u2a a2u'hello⍳world' ⍝ and its inverse.
hello⍳world
⎕av≡ u2a a2u ⎕av ⍝ round-trip all ⎕AV characters.
1
See: http://www.dyalog.com/dfnsdws/fre.pdf
)
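The three cases in the utab inner function above (one, two or three bytes per code point) come down to base-64 digit splitting plus a fixed offset per byte. A minimal sketch of that arithmetic, in Python rather than APL and with the hypothetical name encode_codepoint, might be:

```python
def encode_codepoint(cp):
    """UTF-8 numbers for one code point up to U+FFFF, in three cases."""
    if cp < 0x80:                        # 0000 .. 007f: one byte
        return [cp]
    if cp < 0x800:                       # 0080 .. 07ff: two bytes
        hi, lo = divmod(cp, 64)
        return [192 + hi, 128 + lo]
    hi, rest = divmod(cp, 64 * 64)       # 0800 .. ffff: three bytes
    mid, lo = divmod(rest, 64)
    return [224 + hi, 128 + mid, 128 + lo]


iota = encode_codepoint(0x2373)          # the iota character, U+2373
dieresis = encode_codepoint(0x00A8)      # the dieresis character, U+00A8
```

Here divmod plays the part of the 0 64⊤ and 0 64 64⊤ encodes: 0x2373 splits into the digits 2 13 51, which the offsets 224 128 128 turn into 226 141 179.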
Examples:
4 0⍕ utf8 'hello world'
104 101 108 108 111 32 119 111 114 108 100
⍝<h> <e> <l> <l> <o> < > <w> <o> <r> <l> <d>
4 0⍕ utf8 'hello⍳world'
104 101 108 108 111 226 141 179 119 111 114 108 100
⍝<h> <e> <l> <l> <o> <⍳········> <w> <o> <r> <l> <d>
4 0⍕ utf8 'a⍳⍴¨¯2',⊃⎕av
97 226 141 179 226 141 180 194 168 194 175 50 0
⍝<a> <⍳········> <⍴········> <¨····> <¯····> <2> <>
utf8 utf8 'a⍳⍴¨¯2' ⍝ round-trip some APL chars.
a⍳⍴¨¯2
⎕av ≡ utf8 utf8 ⎕av ⍝ round-trip all APL chars.
1
{⍵≡utf8 utf8 ⍵} notes.utf8 ⍝ round-trip char vec with newlines.
1
utab ← '' utf8 '' ⍝ save encoding table.
ufast8 ← utab∘utf8 ⍝ bind table to avoid setup cost.
⎕av ≡ ufast8 ufast8 ⎕av ⍝ round-trip all APL chars.
1
See also: utf8get utf8put int xhtml