Inverted Tables

Roger K.W. Hui





⎕io ← ⎕ct ← 0

Here, the index origin is 0, although the ideas, algorithms, code, examples, …, work in 1-origin with obvious modifications.

As well, we assume ⎕ct←0 . If ⎕ct≠0 , that is, if tolerant comparison is used, computing the analogs ofand related functions on tables is a difficult problem. It can be solved using techniques described in Index-Of on Multiple Floats in Dyalog ’17 but is beyond the scope of this note.
 

Introduction I

A table is 2-dimensional data in which the items of a column have the same data type; an inverted table is a table in which the columns are nested simple arrays. An inverted table is more agreeably displayed by application of ⍪¨ .

X is an example of a table. It has 8 rows and 5 columns. T is the equivalent inverted table. The function invert applies to a table to produce the equivalent inverted table.

(We sometimes call a table a “verted table”, to clearly distinguish it from an inverted table.)
 

Introduction II

A verted table can be created by repeated comma-bar assign. An inverted table can be created by repeated comma-bar each assign.

Note that for an inverted table, items which are vectors must be fixed-length vectors.

Two puzzles: The first puzzle is for Array Notation gurus. How would you express an inverted table in Array Notation? The second puzzle is for APL2 array prototype phanatics: What is the prototype for an inverted table? For that matter, what is the prototype for a verted table?
 

Space

An inverted table is more efficient in space than a verted table, in this case, by a factor of about 10.
 

Time I

An inverted table is usually more efficient in time than a verted table. The first and second benchmarks are a queries. The third benchmark finds the index of one table in another; the function 8⌶ implements index-of on inverted tables. The fourth benchmark indexes into a table, and in such indexing a verted table is faster than an inverted table, but in both cases the computation takes a tiny amount of time.
 

Time II

Alas, the news isn’t all good. Appending a new row to a verted table is much more efficient than doing the same to an inverted table. The benchmark on X1 v T1 shows that ⍪← is supported by special code but ⍪¨← is not. An RFE has been filed on this.
 

Internal Representation: Verted

The internal representations explain why there are such vast differences in space and time. In APL, each item of a nested array is a pointer to an array, and every array has a header and data. On a 64-bit system, a pointer takes 64 bits. The size in bytes of a simple array is:

   size←{24+(8×≢⍴⍵) + 8×⌈64÷⍨(×/⍴⍵)×⌊0.1×⎕dr ⍵}

The components of size are, from right to left:

⌊0.1×⎕dr ⍵  # of bits per item
×/⍴⍵ # of items
8×⌈64÷⍨ # of bytes for the items, rounded up to the next multiple of 8
24+(8×≢⍴⍵) # of bytes for the header

The space for all those headers adds up.

This same diagram also explains why verted tables are worse in time. The data for the verted table items are scattered in memory, and access to an item requires going through a pointer.

This “mechanical sympathy” with APL implementations suggests that it might be beneficial to have an i-beam (a system facility) to compact the array items of a nested array. (This is different from workspace compaction, which coalesces free blocks of memory but does not reorder existing arrays.) The following benchmarks support this argument:

foo n;_
p←(n,8)⍴⊂⍬
_←{0⊣p[;⍵]←⍕¨?n⍴2e9}¨⍳8
q←⍕¨?(n,8)⍴2e9
r←(n,8)⍴⊂⍬
r[n?n;]←⍕¨?(n,8)⍴2e9

foo 1e6 creates verted tables p , q , and r with the same size, 1e6 rows and 8 columns.

•   Vectors in a column of p are very near each other in memory.
Vectors in a column of q are somewhat near each other in memory.
Vectors in a column of r are unlikely to be near each other in memory.
   ⎕wa  ⍝ force workspace compaction
2069598432
   foo
   ⎕wa
735192904

   cmpx 'p[;2]' 'q[;2]' 'r[;2]'
  p[;2] → 1.88E¯2 |    0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* q[;2] → 4.05E¯2 | +115% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* r[;2] → 5.60E¯2 | +197% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

The array compactor should allow optimization for row-major or column-major access.
 

Internal Representation: Inverted

In contrast, for an inverted table, each column is a simple array, so

•  the number of pointers is the same as the number of columns
the number of headers is the same as the number of columns, plus 1 for the table itself.

Moreover, the data for each column are contiguous in memory, resulting in faster access.

We saw earlier that vector data items for an inverted table must be fixed-length. In a verted table, the overhead for vector data items is 8 bytes for the pointer, plus 32 bytes for the header. Therefore, the padded required for fixed-length vectors is “wasteful” only if the average length is greater than 40.
 

Utilities

The remainder of this note presents utilities for working with inverted tables. If a table has “key” columns, then computations such as index should be applied to those columns rather than to the entire table.
 

Assertions

tassert asserts that the argument is an inverted table: a non-empty vector whose items have the same number of major cells, and each item is simple.
 

Indexing

⍺ tindex ⍵ indexes row(s)from inverted table .

   tindex ← {(⊂⊂⍺)⌷¨⍵}

   ⍪¨ 3 1 1 2 tindex T
┌──────┬───────┬─┬──┬────┐
│Wilson│Diana  │1│23│1.25│
│Jones │Dakota │1│29│0.97│
│Jones │Dakota │1│29│0.97│
│Chan  │Wilson │2│47│2.11│
└──────┴───────┴─┴──┴────┘

 

Index-of & Co. I

Index-of and utilities defined in terms of it, require non-tolerant comparisons (if there are floating point data then ⎕ct must be 0).

One method for doing index-of is to substitute column i by its indices in column i of T , then computeas usual on the transposed tables of integers.

   S← 3 1 1 2 tindex T

   ⍉ ↑ T ⍳¨ S
3 3 1 0 0
1 1 1 1 1
1 1 1 1 1
2 2 0 2 2
   ⍉ ↑ T ⍳¨ T
0 0 0 0 0
1 1 1 1 1
2 2 0 2 2
3 3 1 0 0
4 4 1 4 4
5 5 0 5 5
6 6 0 6 6
7 7 1 7 7
   (⍉ ↑ T ⍳¨ T) ⍳ (⍉ ↑ T ⍳¨ S)
3 1 1 2

   tix1 ← {(⍉↑⍺⍳¨⍺) ⍳ (⍉↑⍺⍳¨⍵)}

   T tix1 S
3 1 1 2

Index-of & Co. II

The function 8⌶ is available for doing index-of on inverted tables. (8 “i-beam”; 8 ←→ IIX , Inverted table IndeX, get it? :-)

   tix ← 8⌶
   T tix S
3 1 1 2

The i-beam function is faster than the APL-defined function (tix1), and faster than the equivalent computation on a verted table, which runs at the speed it does only because Dyalog APL recognizes it. Index-of on a general nested matrix is slower.

   cmpx 'TT tix  TT' 'TT tix1 TT' 'XX ⍳ XX'
  TT tix  TT → 9.38E¯4 |    0% ⎕⎕⎕⎕⎕⎕⎕
  TT tix1 TT → 4.27E¯3 | +355% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  XX ⍳ XX    → 1.61E¯3 |  +71% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

   QQ←XX ⋄ QQ[9999;]←34  ⍝ QQ is a general nested matrix

   cmpx 'TT tix TT ' 'XX ⍳ XX' 'QQ ⍳ QQ'
  TT tix  TT → 9.34E¯4 |     0% ⎕⎕⎕
  XX ⍳ XX    → 1.61E¯3 |   +71% ⎕⎕⎕⎕⎕
* QQ ⍳ QQ    → 1.03E¯2 | +1004% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

Index-of & Co. III

Membership

⍺ teps ⍵ are those indices in ⍵ tix ⍺ that are less than the number of rows in .

   teps← {(≢⊃⍵) > ⍵ tix ⍺}

   T teps S
0 1 1 1 0 0 0 0

Without

⍺ twithout ⍵ are rows ofwhich are not rows of . Thus:

   twithout ← {⍺ ⌿¨⍨ ⊂~⍺ teps ⍵}

   ⍪¨ T twithout S
┌──────┬───────┬─┬──┬────┐
│Smith │John   │2│23│1.25│
│Saxon │Joan   │1│31│2.8 │
│Angelo│Roberto│2│19│1.11│
│Smits │Jack   │2│27│3.14│
│Franck│Donna  │1│38│2.72│
└──────┴───────┴─┴──┴────┘

Unique

The unique rows of an inverted table obtains by an analog of the well-known idiom for finding the unique.

   tunique ← {⍵ ⌿¨⍨ ⊂(⍳≢⊃⍵)=tix⍨ ⍵}

   T ≡ tunique 2⌿¨T
1

Key

andare inverted tables; rows ofspecify keys for the corresponding rows of . Then: ⍺ ⍺⍺ tkey ⍵ applies ⍺⍺ to rows ofhaving identical keys.

   tkey ← {(⊂tix⍨⍺) ⍺⍺⌸¨ ⍵}

   ⍪¨ T
┌──────┬───────┬─┬──┬────┐
│Smith │John   │2│23│1.25│
│Jones │Dakota │1│29│0.97│
│Chan  │Wilson │2│47│2.11│
│Wilson│Diana  │1│23│1.25│
│Saxon │Joan   │1│31│2.8 │
│Angelo│Roberto│2│19│1.11│
│Smits │Jack   │2│27│3.14│
│Franck│Donna  │1│38│2.72│
└──────┴───────┴─┴──┴────┘

   ⍪¨ T[,2] {⊂⍵} tkey T[2 3 4]
┌─────────┬─────────────┬─────────────────────┐
│┌───────┐│┌───────────┐│┌───────────────────┐│
││2 2 2 2│││23 47 19 27│││1.25 2.11 1.11 3.14││
│├───────┤│├───────────┤│├───────────────────┤│
││1 1 1 1│││29 23 31 38│││0.97 1.25 2.8 2.72 ││
│└───────┘│└───────────┘│└───────────────────┘│
└─────────┴─────────────┴─────────────────────┘

   ⍪¨ T[,2] {⌈/⍵} tkey T[2 3 4]
┌─┬──┬────┐
│2│47│3.14│
│1│38│2.8 │
└─┴──┴────┘

Grade & Sort

An inverted table can be graded as follows: starting with the identity permutation and proceeding from right to left, derive the next permutation by grading the permutation of the current column. This is “lexicographic grading”.

   tgr ← {⊃ {⍵[⍋(⊂⍵)⌷⍺]}/ ⍵,⊂⍳≢⊃⍵}

   tgr T
5 2 7 1 4 0 6 3

   ⍪¨ (tgr T) tindex T
┌──────┬───────┬─┬──┬────┐
│Angelo│Roberto│2│19│1.11│
│Chan  │Wilson │2│47│2.11│
│Franck│Donna  │1│38│2.72│
│Jones │Dakota │1│29│0.97│
│Saxon │Joan   │1│31│2.8 │
│Smith │John   │2│23│1.25│
│Smits │Jack   │2│27│3.14│
│Wilson│Diana  │1│23│1.25│
└──────┴───────┴─┴──┴────┘

Grade can also be computed by replacing each column with the rankings within each column, with equal items being assigned equal rankings. Thus:

   ↑ {(⍋⍋⍵)[⍳⍨⍵]}¨ T
4 2 1 6 3 0 4 6
3 0 7 1 2 6 3 3
0 4 0 4 4 0 0 4
1 5 7 1 6 0 1 1
2 0 6 2 7 1 2 2
   ⍋ ⍉ ↑ {(⍋⍋⍵)[⍳⍨⍵]}¨ T
5 2 7 1 4 0 6 3

   tgr1 ← {⍋ ⍉ ↑ {(⍋⍋⍵)[⍳⍨⍵]}¨ ⍵}

   tgr1 T
5 2 7 1 4 0 6 3

Grade and sort can be packaged as a single function: In the monadic case, torder ⍵ computes the grade of . In the dyadic case, ⍺ torder ⍵ ordersby the grade of .

   torder ← {0=⎕nc '⍺':tgr ⍵ ⋄ (tgr ⍵)tindex ⍺}

   torder T
5 2 7 1 4 0 6 3

   ⍪¨ T torder T
┌──────┬───────┬─┬──┬────┐
│Angelo│Roberto│2│19│1.11│
│Chan  │Wilson │2│47│2.11│
│Franck│Donna  │1│38│2.72│
│Jones │Dakota │1│29│0.97│
│Saxon │Joan   │1│31│2.8 │
│Smith │John   │2│23│1.25│
│Smits │Jack   │2│27│3.14│
│Wilson│Diana  │1│23│1.25│
└──────┴───────┴─┴──┴────┘

   ⍪¨ T torder T[1 0]
┌──────┬───────┬─┬──┬────┐
│Jones │Dakota │1│29│0.97│
│Wilson│Diana  │1│23│1.25│
│Franck│Donna  │1│38│2.72│
│Smits │Jack   │2│27│3.14│
│Saxon │Joan   │1│31│2.8 │
│Smith │John   │2│23│1.25│
│Angelo│Roberto│2│19│1.11│
│Chan  │Wilson │2│47│2.11│
└──────┴───────┴─┴──┴────┘

The definition of torder can be made more elegant if there is a dyadic operator(U+2834) such that

     M⠴D ⍵  ←→  M ⍵
   ⍺ M⠴D ⍵  ←→  ⍺ D ⍵

whence torder obtains as tgr⠴{(tgr ⍵) tindex ⍺} or, equivalently, tgr⠴((tgr ⊢) tindex ⊣) .
 

Collected Definitions

⎕ct      ← 0
invert   ← {↑¨ ⊂⍤¯1 ⍉ ⍵}
assert   ← {⍺←'assertion failure' ⋄ 0∊⍵:⍺ ⎕signal 8 ⋄ shy←0}

tassert  ← {
  assert (1≤≢⍵)∧1=⍴⍴⍵ :  ⍝ non-empty vector
  assert (⊃=⊢)≢¨⍵ :      ⍝ equal tally in each item
  assert 2=≡⍵ :          ⍝ nested array with simple items
  1
}

tindex   ← {(⊂⊂⍺)⌷¨⍵}
  tix1   ← {(⍉↑⍺⍳¨⍺)⍳(⍉↑⍺⍳¨⍵)}
tix      ← 8⌶
teps     ← {(≢⊃⍵) > ⍵ tix ⍺}
twithout ← {⍺ ⌿¨⍨ ⊂~⍺ teps ⍵}
tunique  ← {⍵ ⌿¨⍨ ⊂(⍳≢⊃⍵)=tix⍨ ⍵}
tkey     ← {(⊂tix⍨⍺) ⍺⍺⌸¨ ⍵}
tgr      ← {⊃ {⍵[⍋(⊂⍵)⌷⍺]}/ ⍵,⊂⍳≢⊃⍵}
  tgr1   ← {⍋ ⍉ ↑ {(⍋⍋⍵)[⍳⍨⍵]}¨ ⍵}
torder   ← {0=⎕nc '⍺':tgr ⍵ ⋄ (tgr ⍵)tindex ⍺}

Translated and adapted from the J Wiki Essay Inverted Table, 2007.



created:  2018-03-23 15:15
updated: