Data Analysis with βŽ•CSV and ⌸¢

In this workshop we're going to learn to use βŽ•CSV and ⌸ to analyse a real dataset. The dataset is adapted from this freely available dataset, and the workshop itself is adapted from a 2023 workshop by Dyalog's Rich Park.

InΒ [1]:
βŽ•IO←0 ⍝ feel free to use βŽ•IO←1 if you prefer
βŽ•PW←12345
]BOX on
Was OFF

βŽ•CSV and Inverted TablesΒΆ

We'll be working with the dataset given in the CSV (comma separated values) file order_data.csv. It consists of a header row, giving titles to each column of our dataset, and rows of data values separated by commas.

Click here to download order_data.csv.

InΒ [2]:
βͺ10β†‘βŠƒβŽ•NGET'order_data.csv'1
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚id,timestamp,city,state,payment,category                       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚1,2017-10-02 10:56:33,sao paulo,SP,18.12,housewares            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚2,2018-07-24 20:41:37,barreiras,BA,141.46,perfumery            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚3,2018-08-08 08:38:49,vianopolis,GO,179.12,auto                β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚4,2017-11-18 19:28:06,sao goncalo do amarante,RN,72.20,pet_shopβ”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚5,2018-02-13 21:18:39,santo andre,SP,28.62,stationery          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚6,2017-07-09 21:57:05,congonhinhas,PR,175.26,auto              β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚7,2017-05-16 13:10:30,nilopolis,RJ,75.16,auto                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚8,2017-01-23 18:29:09,faxinalzinho,RS,35.95,furniture_decor    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚9,2017-07-29 11:55:02,sorocaba,SP,161.42,office_furniture      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Now, let's use βŽ•CSV to load our data into format we can work with.

InΒ [3]:
(orders_matrix headers)β†βŽ•CSV 'order_data.csv' ⍬ (2 1 1 1 2 1) 1

Here we're using βŽ•CSV to load our dataset into a matrix.

  • 'order_data.csv' is the path to the CSV file, but you can also provide CSV data directly here.

  • The second part of the argument (⍬) specifies the encoding of the file. My providing a ⍬, we are allowing βŽ•CSV to detect the encoding.

  • 2 1 1 1 2 1 specifies the data types of the columns of our file.

    • 0 means ignore this column.
    • 1 means this column is character data.
    • 2 means this column is numeric.
    • 3, 4, or 5 mean this column is numeric, but with tolerance for certain empty and/or invalid entries.

    So in our case, the first and fifth columns (id and payment) are numeric, while the rest are character data.

  • The final 1 in the argument indicates that our file includes the header row id,timestamp,city,state,payment,category, and that this should not be interpreted as data. Instead, the header row is returned separately to us in the result.

You can also export data from the workspace to a CSV file with the dyadic form. Full documentation for βŽ•CSV can be found here.

InΒ [4]:
headers
5↑orders_matrix
β”Œβ”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚idβ”‚timestampβ”‚cityβ”‚stateβ”‚paymentβ”‚categoryβ”‚
β””β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”Œβ”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚1β”‚2017-10-02 10:56:33β”‚sao paulo              β”‚SPβ”‚18.12 β”‚housewaresβ”‚
β”œβ”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚2β”‚2018-07-24 20:41:37β”‚barreiras              β”‚BAβ”‚141.46β”‚perfumery β”‚
β”œβ”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚3β”‚2018-08-08 08:38:49β”‚vianopolis             β”‚GOβ”‚179.12β”‚auto      β”‚
β”œβ”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚4β”‚2017-11-18 19:28:06β”‚sao goncalo do amaranteβ”‚RNβ”‚72.2  β”‚pet_shop  β”‚
β”œβ”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚5β”‚2018-02-13 21:18:39β”‚santo andre            β”‚SPβ”‚28.62 β”‚stationeryβ”‚
β””β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

This is not the only format that βŽ•CSV supports. We can also import the data as an inverted table, using the Invert option, which can use less memory and be more efficient to query.

By giving the option 'Invert' 1, character data is imported as matrices. By giving 'Invert' 2, character data is imported as vectors of character vectors.

InΒ [5]:
(orders_columns   _)β†βŽ•CSVβŽ•OPT('Invert' 1)⊒'order_data.csv' ⍬ (2 1 1 1 2 1) 1
(orders_columns_2 _)β†βŽ•CSVβŽ•OPT('Invert' 2)⊒'order_data.csv' ⍬ (2 1 1 1 2 1) 1
βͺ5↑¨orders_columns
βͺ5↑¨orders_columns_2
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚1 2 3 4 5                              β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚2017-10-02 10:56:33                    β”‚
β”‚2018-07-24 20:41:37                    β”‚
β”‚2018-08-08 08:38:49                    β”‚
β”‚2017-11-18 19:28:06                    β”‚
β”‚2018-02-13 21:18:39                    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚sao paulo                              β”‚
β”‚barreiras                              β”‚
β”‚vianopolis                             β”‚
β”‚sao goncalo do amarante                β”‚
β”‚santo andre                            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚SP                                     β”‚
β”‚BA                                     β”‚
β”‚GO                                     β”‚
β”‚RN                                     β”‚
β”‚SP                                     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚18.12 141.46 179.12 72.2 28.62         β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚housewares                             β”‚
β”‚perfumery                              β”‚
β”‚auto                                   β”‚
β”‚pet_shop                               β”‚
β”‚stationery                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚1 2 3 4 5                                                                                            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚
β”‚β”‚2017-10-02 10:56:33β”‚2018-07-24 20:41:37β”‚2018-08-08 08:38:49β”‚2017-11-18 19:28:06β”‚2018-02-13 21:18:39β”‚β”‚
β”‚β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                 β”‚
β”‚β”‚sao pauloβ”‚barreirasβ”‚vianopolisβ”‚sao goncalo do amaranteβ”‚santo andreβ”‚                                 β”‚
β”‚β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚β”Œβ”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”                                                                                     β”‚
β”‚β”‚SPβ”‚BAβ”‚GOβ”‚RNβ”‚SPβ”‚                                                                                     β”‚
β”‚β””β”€β”€β”΄β”€β”€β”΄β”€β”€β”΄β”€β”€β”΄β”€β”€β”˜                                                                                     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚18.12 141.46 179.12 72.2 28.62                                                                       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                                      β”‚
β”‚β”‚housewaresβ”‚perfumeryβ”‚autoβ”‚pet_shopβ”‚stationeryβ”‚                                                      β”‚
β”‚β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

In our case, using an inverted table with matrices for character columns uses the least space (by far).

InΒ [6]:
βͺβŽ•SIZE 'orders_matrix' 'orders_columns' 'orders_columns_2'
28717912
 9888840
22252568

Note: there is more to inverted tables than we will cover today. See this talk from Dyalog '18 for more information.

We're going to use the inverted table format (with character matrices) for the rest of this workshop. We could just pass around all the columns in a vector and unpack them as need be:

(col1 col2 ...)←columns

But this makes our code brittle - every time the schema of our data changes we need to change every place we unpack these columns. So instead, let's extract all our columns into variables in a namespace for ease of access.

InΒ [7]:
ordersβ†βŽ•NS ⍬
orders.(id timestamp city state payment category)←orders_columns

And now it's nice and easy to work with columns by name.

InΒ [8]:
⌈/orders.payment ⍝ (blimey)
13664.08

PreprocessingΒΆ

Right now, our orders namespace has a timestamp variable, with the character data for the timestamps of each purchase. To work with this more easily, we're going to need to parse this into numeric columns.

Exercise: Parse timestamp into separate year, month, day, hour, minute, and second columns. You may want to use βŽ•VFI (documentation here).

SPOILERS AHEAD





















InΒ [9]:
⍝ simple solution
orders.(year month day hour minute second←↓⍉{1βŠƒ' -:'βŽ•VFI⍡}⍀1⊒timestamp)

⍝ fast solution - only possible with the inverted table format
zeroβ†βŽ•UCS '0'
timestamp_numericβ†β‰βŽ•UCS orders.timestamp
Extract←{10βŠ₯timestamp_numeric[⍡;]-zero}
orders.year  ←Extract 0 1 2 3
orders.month ←Extract 5 6
orders.day   ←Extract 8 9
orders.hour  ←Extract 11 12
orders.minute←Extract 14 15
orders.second←Extract 17 18
InΒ [10]:
Slow←{
    orders.(year month day hour minute second←↓⍉{1βŠƒ' -:'βŽ•VFI⍡}⍀1⊒timestamp)
    1
}
Fast←{
    zeroβ†βŽ•UCS '0'
    timestamp_numericβ†β‰βŽ•UCS orders.timestamp
    Extract←{10βŠ₯timestamp_numeric[⍡;]-zero}
    orders.year  ←Extract 0 1 2 3
    orders.month ←Extract 5 6
    orders.day   ←Extract 8 9
    orders.hour  ←Extract 11 12
    orders.minute←Extract 14 15
    orders.second←Extract 17 18
    1
}

]Runtime -c 'Slow ⍬' 'Fast ⍬'
                                                                  
  Slow ⍬ β†’ 1.3EΒ―1 |   0% βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ• 
  Fast ⍬ β†’ 4.7EΒ―3 | -97% βŽ•                                        

KeyΒΆ

⌸ is Dyalog's solution to SQL's GROUP BY clause. It is called in the form

⍺ ⍺⍺⌸ ⍡

Essentially, this groups ⍡ by the corresponding ⍺, and applies ⍺⍺ to each group. For example:

InΒ [11]:
numbers←3 1 4 1 5 9 2 6 5 3 5 9
odds←2|numbers
odds {⍺ ⍡}⌸ numbers
β”Œβ”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚1β”‚3 1 1 5 9 5 3 5 9β”‚
β”œβ”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚0β”‚4 2 6            β”‚
β””β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Time for some details. When ⍺ is not provided, ⍡ used as a default. ⍺ and ⍡ must have the same number of major cells (the same β‰’). The function operand ⍺⍺ given to ⌸ is given the key (from ⍺) as left argument and the cells of ⍡ corresponding to that key as right argument. The results of ⍺⍺ applied to all keys are mixed to give the final result. The order of groups in the output is the order of the corresponding keys in ⍺.

For an example with our dataset, let's see what the average payment is for each state.

InΒ [12]:
orders.state {(βŠ‚βΊ),(+/Γ·β‰’)⍡}⌸ orders.payment
β”Œβ”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚SPβ”‚140.3866109β”‚
β”œβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚BAβ”‚178.0894336β”‚
β”œβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚GOβ”‚167.417267 β”‚
β”œβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚RNβ”‚207.1084402β”‚
β”œβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚PRβ”‚156.8674639β”‚
β”œβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚RJβ”‚163.8149305β”‚
β”œβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚RSβ”‚158.1150865β”‚
β”œβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚MGβ”‚158.2128593β”‚
β”œβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚SCβ”‚166.0354887β”‚
β”œβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚RRβ”‚224.0435897β”‚
β”œβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚PEβ”‚190.4632381β”‚
β”œβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚TOβ”‚214.2737269β”‚
β”œβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚CEβ”‚202.896166 β”‚
β”œβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚DFβ”‚163.9466195β”‚
β”œβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚SEβ”‚208.2690991β”‚
β”œβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚MTβ”‚204.492849 β”‚
β”œβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚PBβ”‚256.0266012β”‚
β”œβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚PAβ”‚214.0911746β”‚
β”œβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ROβ”‚233.8208403β”‚
β”œβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ESβ”‚157.3739535β”‚
β”œβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚APβ”‚239.33     β”‚
β”œβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚MSβ”‚191.1419364β”‚
β”œβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚MAβ”‚204.0557284β”‚
β”œβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚PIβ”‚217.0201493β”‚
β”œβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ALβ”‚233.3682697β”‚
β”œβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ACβ”‚241.6105128β”‚
β”œβ”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚AMβ”‚185.7943056β”‚
β””β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Exercises:

  • Find how many purchases were made in the most popular category (by number of purchases).
  • Which category/ies are the most popular (again by number of purchases)?

SPOILERS AHEAD





















InΒ [13]:
⍝ exercise 1
⌈/{≒⍡}⌸orders.category

⍝ exercise 2
(category count)←↓⍉{⍺(≒⍡)}⌸orders.category
(count=⌈/count)/category
9166
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚bed_bath_table                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Special FormsΒΆ

Some special forms of ⌸ are optimised. These are

    {βŠ‚β΅}⌸
    {⍺⍡}⌸
     {⍺}⌸
       ⊣⌸

    {≒⍡}⌸
 {⍺(≒⍡)}⌸
  {⍺,≒⍡}⌸

   {F/⍡}⌸
   {F⌿⍡}⌸
{⍺(F/⍡)}⌸
{⍺(F⌿⍡)}⌸
 {⍺,F/⍡}⌸
 {⍺,F⌿⍡}⌸

where

  • F is one of +, ⌈, or ⌊ when ⍡ is numeric;
  • F is one of ∧, ∨, =, β‰ , + when ⍡ is Boolean.

These are senstive to exactly how you spell them!

InΒ [14]:
numbers←10000?10000
]runtime -c '(2|numbers){+/⍡}⌸numbers' '(2|numbers){⊒+/⍡}⌸numbers'
                                                                                      
  (2|numbers){+/⍡}⌸numbers  β†’ 1.6EΒ―5 |    0% βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•                                
  (2|numbers){⊒+/⍡}⌸numbers β†’ 7.1EΒ―5 | +334% βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ•βŽ• 

OrderΒΆ

You might have noticed that ⌸ orders its output by the order they appeared in the input. Sometimes this isn't what we want, and we need to do some extra steps.

Exercise: Write a function PaymentPerState which takes

  • as left argument, a list of state codes, for instance 'GO' 'TO' 'SC',
  • as right argument, the orders table,

and returns the total payment made in each state, in the same order as the left argument. There are two ways you could go about fixing this order:

  1. sorting after the fact, or
  2. prepending some data in the correct order.

SPOILERS AHEAD





















InΒ [15]:
⍝ solution 1 - using ⍳ to get the right totals
PaymentPerState_1←{
	(state payment)←⍡.(state payment)
	(state total)  ←↓⍉state{⍺(+/⍡)}⌸payment
	total[state⍳⍺] ⍝ we are lucky that all states appear!
}

⍝ solution 2 - prepending ⍺ to get the right order
PaymentPerState_2←{
	(state payment)←⍡.(state payment)
	state  βͺ⍨←↑⍺    ⍝ prepend ⍺
	payment,⍨←0⍴⍨≒⍺ ⍝ prepend a 0 for each state in ⍺
	(≒⍺)↑state{+/⍡}⌸payment
}

'TO' 'GO' 'SC' PaymentPerState_1 orders
'TO' 'GO' 'SC' PaymentPerState_2 orders
58068.18 319766.98 579297.82
58068.18 319766.98 579297.82

In Dyalog '23 - D13: Giving Key a Vocabulary, an extension to ⌸ was proposed. This extension would allow us to give ⌸ a 'vocabulary' as left operand, instead of a function. This would control exactly the order that groups appear in the result.

InΒ [16]:
⍝ source: https://raw.githubusercontent.com/abrudz/dyalog_vision/main/QuadEqual.aplo
QuadEqual←{ ⍝ ⌸ allowing operand to be vocabulary (and then using {βŠ‚β΅} as internal operand)
	βΊβ†βŠ’  ⍝
	3=40 βŽ•ATX'⍺⍺':⍺ ⍺⍺⌸⍡  ⍝ fn operand: current definition
	3=40 βŽ•ATX'⍺':⍡ βˆ‡β³β‰’β΅   ⍝ monadic
	
	βŽ•IO←1
	uvoc←βˆͺ⍺⍺
	mask←(β‰’uvoc)β‰₯uvoc⍳⍺  ⍝ high-rank ∊
	keys←uvocβͺmask⌿⍺
	values←mask⌿⍡
	values⍴⍨←(β‰’uvoc)+@1⍴values  ⍝ append filler values while guarding against NONCE ERRORs
	valuesβŠ–β¨β†-(β‰’uvoc)           ⍝ move them to front

	(1↓¨keys{βŠ‚β΅}⌸values)[uvoc⍳⍺⍺]
}

This makes PaymentPerState much easier.

InΒ [17]:
PaymentPerState_3←{
	(state payment)←⍡.(state payment)
	+/Β¨(↓state) (⍺ QuadEqual) payment
}
'TO' 'GO' 'SC' PaymentPerState_3 orders
58068.18 319766.98 579297.82

Multiple KeysΒΆ

Sometimes we need to group our data by more than one than one key. In this case we really have two options:

  1. Use ⌸ multiple times - once for each column we're grouping by,
  2. Create a compound key from the individual keys.

Exercise: Write a function PaymentPerMonthByState which takes

  • as left argument, a list of state codes,
  • as right argument, the orders table,

and returns a matrix of total payments whose columns correspond to the given state codes and whose rows correspond to months in 2017. For example, 'SP' 'RJ' 'PI' 'MT' PaymentPerMonthByState orders should return

 43103.53  13139.53 1453.98  1922.78
 80348.6   33197.29 3298.4   3583.36
140767.23  59495.67 2582.92  2702.55
130989.25  61960.3  2288.91  3912.86
188394.13  75293.52 6679.58  7560.36
185274.77  59246.08 2626.96  4788.16
197902.88  84167.86 2938.77 11235.49
212931.9   85555.98 5072.72  6939.29
231109.84 104566.94 3242.68  8101.66
239321.27 108026.61 4544.47 12828.51
391137.77 166838.56 3745.39 13144.66
301554.04 124615.01 3482    10432.55

For instance, a total of 43103.53 was spent in the state SP in January of 2017.

SPOILERS AHEAD





















InΒ [18]:
⍝ solution 1 - compound key
PaymentPerMonthByState_1←{
	(state payment year month)←⍡.(state payment year month)

	⍝ get only 2017
	mask←year=2017
	state  βŒΏβ¨β†mask
	paymentβŒΏβ¨β†mask
	month  βŒΏβ¨β†mask

	⍝ prepend for ordering
	n←12×≒⍺
	state  βͺ⍨←↑n⍴⍺
	month  ,⍨←(≒⍺)/1+⍳12
	payment,⍨←n⍴0

	⍝ compound key
	month_state←⍉↑month (↓state)

	⍝ compute matrix
	12 (≒⍺)⍴month_state {+/⍡}⌸ payment
}

⍝ solution 2 - a faster version
PaymentPerMonthByState_2←{
	(state payment year month)←⍡.(state payment year month)

	⍝ get only 2017
	mask←2017=year
	state  βŒΏβ¨β†mask
	paymentβŒΏβ¨β†mask
	month  βŒΏβ¨β†mask

	⍝ prepend for ordering
	n←12×≒⍺
	state  βͺ⍨←↑n⍴⍺
	month  ,⍨←(≒⍺)/1+⍳12
	payment,⍨←n⍴0

	⍝ compound key
	state←⍳⍨state             ⍝ ids idiom
	month_state←⍉↑month state ⍝ now flat

	⍝ compute matrix
	12 (≒⍺)⍴month_state {+/⍡}⌸ payment
}

⍝ solution 3 - double key
PaymentPerMonthByState_3←{
	(state payment year month)←⍡.(state payment year month)

	⍝ get only 2017
	mask←2017=year
	state  βŒΏβ¨β†mask
	paymentβŒΏβ¨β†mask
	month  βŒΏβ¨β†mask

	month,⍨←1+⍳12
	i←month {βŠ‚1↓⍡}⌸ (12⍴0),⍳≒payment
	β†‘βΊβˆ˜{
		state  ←   (↑⍺)βͺstate[⍡;]
		payment←(0⍴⍨≒⍺),payment[⍡]
		(≒⍺)↑state {+/⍡}⌸ payment
	}Β¨i
}

'SP' 'RJ' 'PI' 'MT' PaymentPerMonthByState_1 orders
'SP' 'RJ' 'PI' 'MT' PaymentPerMonthByState_2 orders
'SP' 'RJ' 'PI' 'MT' PaymentPerMonthByState_3 orders
 43103.53  13139.53 1453.98  1922.78
 80348.6   33197.29 3298.4   3583.36
140767.23  59495.67 2582.92  2702.55
130989.25  61960.3  2288.91  3912.86
188394.13  75293.52 6679.58  7560.36
185274.77  59246.08 2626.96  4788.16
197902.88  84167.86 2938.77 11235.49
212931.9   85555.98 5072.72  6939.29
231109.84 104566.94 3242.68  8101.66
239321.27 108026.61 4544.47 12828.51
391137.77 166838.56 3745.39 13144.66
301554.04 124615.01 3482    10432.55
 43103.53  13139.53 1453.98  1922.78
 80348.6   33197.29 3298.4   3583.36
140767.23  59495.67 2582.92  2702.55
130989.25  61960.3  2288.91  3912.86
188394.13  75293.52 6679.58  7560.36
185274.77  59246.08 2626.96  4788.16
197902.88  84167.86 2938.77 11235.49
212931.9   85555.98 5072.72  6939.29
231109.84 104566.94 3242.68  8101.66
239321.27 108026.61 4544.47 12828.51
391137.77 166838.56 3745.39 13144.66
301554.04 124615.01 3482    10432.55
 43103.53  13139.53 1453.98  1922.78
 80348.6   33197.29 3298.4   3583.36
140767.23  59495.67 2582.92  2702.55
130989.25  61960.3  2288.91  3912.86
188394.13  75293.52 6679.58  7560.36
185274.77  59246.08 2626.96  4788.16
197902.88  84167.86 2938.77 11235.49
212931.9   85555.98 5072.72  6939.29
231109.84 104566.94 3242.68  8101.66
239321.27 108026.61 4544.47 12828.51
391137.77 166838.56 3745.39 13144.66
301554.04 124615.01 3482    10432.55