Data Analysis with βCSV
and βΈ
ΒΆ
In this workshop we're going to learn to use βCSV
and βΈ
to analyse a real dataset. The dataset is adapted from this freely available dataset, and the workshop itself is adapted from a 2023 workshop by Dyalog's Rich Park.
βIOβ0 β feel free to use βIOβ1 if you prefer
βPWβ12345
]BOX on
Was OFF
βCSV
and Inverted TablesΒΆ
We'll be working with the dataset given in the CSV (comma separated values) file order_data.csv
. It consists of a header row, giving titles to each column of our dataset, and rows of data values separated by commas.
βͺ10βββNGET'order_data.csv'1
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ βid,timestamp,city,state,payment,category β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β1,2017-10-02 10:56:33,sao paulo,SP,18.12,housewares β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β2,2018-07-24 20:41:37,barreiras,BA,141.46,perfumery β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β3,2018-08-08 08:38:49,vianopolis,GO,179.12,auto β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β4,2017-11-18 19:28:06,sao goncalo do amarante,RN,72.20,pet_shopβ βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β5,2018-02-13 21:18:39,santo andre,SP,28.62,stationery β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β6,2017-07-09 21:57:05,congonhinhas,PR,175.26,auto β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β7,2017-05-16 13:10:30,nilopolis,RJ,75.16,auto β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β8,2017-01-23 18:29:09,faxinalzinho,RS,35.95,furniture_decor β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β9,2017-07-29 11:55:02,sorocaba,SP,161.42,office_furniture β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Now, let's use βCSV
to load our data into format we can work with.
(orders_matrix headers)ββCSV 'order_data.csv' β¬ (2 1 1 1 2 1) 1
Here we're using βCSV
to load our dataset into a matrix.
'order_data.csv'
is the path to the CSV file, but you can also provide CSV data directly here.The second part of the argument (
β¬
) specifies the encoding of the file. My providing aβ¬
, we are allowingβCSV
to detect the encoding.2 1 1 1 2 1
specifies the data types of the columns of our file.0
means ignore this column.1
means this column is character data.2
means this column is numeric.3
,4
, or5
mean this column is numeric, but with tolerance for certain empty and/or invalid entries.
So in our case, the first and fifth columns (
id
andpayment
) are numeric, while the rest are character data.The final
1
in the argument indicates that our file includes the header rowid,timestamp,city,state,payment,category
, and that this should not be interpreted as data. Instead, the header row is returned separately to us in the result.
You can also export data from the workspace to a CSV file with the dyadic form. Full documentation for βCSV
can be found here.
headers
5βorders_matrix
ββββ¬ββββββββββ¬βββββ¬ββββββ¬ββββββββ¬βββββββββ βidβtimestampβcityβstateβpaymentβcategoryβ ββββ΄ββββββββββ΄βββββ΄ββββββ΄ββββββββ΄βββββββββ
βββ¬ββββββββββββββββββββ¬ββββββββββββββββββββββββ¬βββ¬βββββββ¬βββββββββββ β1β2017-10-02 10:56:33βsao paulo βSPβ18.12 βhousewaresβ βββΌββββββββββββββββββββΌββββββββββββββββββββββββΌβββΌβββββββΌβββββββββββ€ β2β2018-07-24 20:41:37βbarreiras βBAβ141.46βperfumery β βββΌββββββββββββββββββββΌββββββββββββββββββββββββΌβββΌβββββββΌβββββββββββ€ β3β2018-08-08 08:38:49βvianopolis βGOβ179.12βauto β βββΌββββββββββββββββββββΌββββββββββββββββββββββββΌβββΌβββββββΌβββββββββββ€ β4β2017-11-18 19:28:06βsao goncalo do amaranteβRNβ72.2 βpet_shop β βββΌββββββββββββββββββββΌββββββββββββββββββββββββΌβββΌβββββββΌβββββββββββ€ β5β2018-02-13 21:18:39βsanto andre βSPβ28.62 βstationeryβ βββ΄ββββββββββββββββββββ΄ββββββββββββββββββββββββ΄βββ΄βββββββ΄βββββββββββ
This is not the only format that βCSV
supports. We can also import the data as an inverted table, using the Invert
option, which can use less memory and be more efficient to query.
By giving the option 'Invert' 1
, character data is imported as matrices. By giving 'Invert' 2
, character data is imported as vectors of character vectors.
(orders_columns _)ββCSVβOPT('Invert' 1)β’'order_data.csv' β¬ (2 1 1 1 2 1) 1
(orders_columns_2 _)ββCSVβOPT('Invert' 2)β’'order_data.csv' β¬ (2 1 1 1 2 1) 1
βͺ5βΒ¨orders_columns
βͺ5βΒ¨orders_columns_2
βββββββββββββββββββββββββββββββββββββββββ β1 2 3 4 5 β βββββββββββββββββββββββββββββββββββββββββ€ β2017-10-02 10:56:33 β β2018-07-24 20:41:37 β β2018-08-08 08:38:49 β β2017-11-18 19:28:06 β β2018-02-13 21:18:39 β βββββββββββββββββββββββββββββββββββββββββ€ βsao paulo β βbarreiras β βvianopolis β βsao goncalo do amarante β βsanto andre β βββββββββββββββββββββββββββββββββββββββββ€ βSP β βBA β βGO β βRN β βSP β βββββββββββββββββββββββββββββββββββββββββ€ β18.12 141.46 179.12 72.2 28.62 β βββββββββββββββββββββββββββββββββββββββββ€ βhousewares β βperfumery β βauto β βpet_shop β βstationery β βββββββββββββββββββββββββββββββββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β1 2 3 4 5 β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ ββββββββββββββββββββββ¬ββββββββββββββββββββ¬ββββββββββββββββββββ¬ββββββββββββββββββββ¬βββββββββββββββββββββ ββ2017-10-02 10:56:33β2018-07-24 20:41:37β2018-08-08 08:38:49β2017-11-18 19:28:06β2018-02-13 21:18:39ββ ββββββββββββββββββββββ΄ββββββββββββββββββββ΄ββββββββββββββββββββ΄ββββββββββββββββββββ΄βββββββββββββββββββββ βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ ββββββββββββ¬ββββββββββ¬βββββββββββ¬ββββββββββββββββββββββββ¬ββββββββββββ β ββsao pauloβbarreirasβvianopolisβsao goncalo do amaranteβsanto andreβ β ββββββββββββ΄ββββββββββ΄βββββββββββ΄ββββββββββββββββββββββββ΄ββββββββββββ β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ βββββ¬βββ¬βββ¬βββ¬βββ β ββSPβBAβGOβRNβSPβ β βββββ΄βββ΄βββ΄βββ΄βββ β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β18.12 141.46 179.12 72.2 28.62 β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ βββββββββββββ¬ββββββββββ¬βββββ¬βββββββββ¬βββββββββββ β ββhousewaresβperfumeryβautoβpet_shopβstationeryβ β βββββββββββββ΄ββββββββββ΄βββββ΄βββββββββ΄βββββββββββ β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
In our case, using an inverted table with matrices for character columns uses the least space (by far).
βͺβSIZE 'orders_matrix' 'orders_columns' 'orders_columns_2'
28717912 9888840 22252568
Note: there is more to inverted tables than we will cover today. See this talk from Dyalog '18 for more information.
We're going to use the inverted table format (with character matrices) for the rest of this workshop. We could just pass around all the columns in a vector and unpack them as need be:
(col1 col2 ...)βcolumns
But this makes our code brittle - every time the schema of our data changes we need to change every place we unpack these columns. So instead, let's extract all our columns into variables in a namespace for ease of access.
ordersββNS β¬
orders.(id timestamp city state payment category)βorders_columns
And now it's nice and easy to work with columns by name.
β/orders.payment β (blimey)
13664.08
PreprocessingΒΆ
Right now, our orders
namespace has a timestamp
variable, with the character data for the timestamps of each purchase. To work with this more easily, we're going to need to parse this into numeric columns.
Exercise: Parse timestamp
into separate year
, month
, day
, hour
, minute
, and second
columns. You may want to use βVFI
(documentation here).
SPOILERS AHEAD
β simple solution
orders.(year month day hour minute secondβββ{1β' -:'βVFIβ΅}β€1β’timestamp)
β fast solution - only possible with the inverted table format
zeroββUCS '0'
timestamp_numericβββUCS orders.timestamp
Extractβ{10β₯timestamp_numeric[β΅;]-zero}
orders.year βExtract 0 1 2 3
orders.month βExtract 5 6
orders.day βExtract 8 9
orders.hour βExtract 11 12
orders.minuteβExtract 14 15
orders.secondβExtract 17 18
Slowβ{
orders.(year month day hour minute secondβββ{1β' -:'βVFIβ΅}β€1β’timestamp)
1
}
Fastβ{
zeroββUCS '0'
timestamp_numericβββUCS orders.timestamp
Extractβ{10β₯timestamp_numeric[β΅;]-zero}
orders.year βExtract 0 1 2 3
orders.month βExtract 5 6
orders.day βExtract 8 9
orders.hour βExtract 11 12
orders.minuteβExtract 14 15
orders.secondβExtract 17 18
1
}
]Runtime -c 'Slow β¬' 'Fast β¬'
Slow β¬ β 1.3EΒ―1 | 0% ββββββββββββββββββββββββββββββββββββββββ Fast β¬ β 4.7EΒ―3 | -97% β
KeyΒΆ
βΈ
is Dyalog's solution to SQL's GROUP BY
clause. It is called in the form
βΊ βΊβΊβΈ β΅
Essentially, this groups β΅
by the corresponding βΊ
, and applies βΊβΊ
to each group. For example:
numbersβ3 1 4 1 5 9 2 6 5 3 5 9
oddsβ2|numbers
odds {βΊ β΅}βΈ numbers
βββ¬ββββββββββββββββββ β1β3 1 1 5 9 5 3 5 9β βββΌββββββββββββββββββ€ β0β4 2 6 β βββ΄ββββββββββββββββββ
Time for some details. When βΊ
is not provided, β΅
used as a default. βΊ
and β΅
must have the same number of major cells (the same β’
). The function operand βΊβΊ
given to βΈ
is given the key (from βΊ
) as left argument and the cells of β΅
corresponding to that key as right argument. The results of βΊβΊ
applied to all keys are mixed to give the final result. The order of groups in the output is the order of the corresponding keys in βΊ
.
For an example with our dataset, let's see what the average payment is for each state.
orders.state {(ββΊ),(+/Γ·β’)β΅}βΈ orders.payment
ββββ¬ββββββββββββ βSPβ140.3866109β ββββΌββββββββββββ€ βBAβ178.0894336β ββββΌββββββββββββ€ βGOβ167.417267 β ββββΌββββββββββββ€ βRNβ207.1084402β ββββΌββββββββββββ€ βPRβ156.8674639β ββββΌββββββββββββ€ βRJβ163.8149305β ββββΌββββββββββββ€ βRSβ158.1150865β ββββΌββββββββββββ€ βMGβ158.2128593β ββββΌββββββββββββ€ βSCβ166.0354887β ββββΌββββββββββββ€ βRRβ224.0435897β ββββΌββββββββββββ€ βPEβ190.4632381β ββββΌββββββββββββ€ βTOβ214.2737269β ββββΌββββββββββββ€ βCEβ202.896166 β ββββΌββββββββββββ€ βDFβ163.9466195β ββββΌββββββββββββ€ βSEβ208.2690991β ββββΌββββββββββββ€ βMTβ204.492849 β ββββΌββββββββββββ€ βPBβ256.0266012β ββββΌββββββββββββ€ βPAβ214.0911746β ββββΌββββββββββββ€ βROβ233.8208403β ββββΌββββββββββββ€ βESβ157.3739535β ββββΌββββββββββββ€ βAPβ239.33 β ββββΌββββββββββββ€ βMSβ191.1419364β ββββΌββββββββββββ€ βMAβ204.0557284β ββββΌββββββββββββ€ βPIβ217.0201493β ββββΌββββββββββββ€ βALβ233.3682697β ββββΌββββββββββββ€ βACβ241.6105128β ββββΌββββββββββββ€ βAMβ185.7943056β ββββ΄ββββββββββββ
Exercises:
- Find how many purchases were made in the most popular category (by number of purchases).
- Which category/ies are the most popular (again by number of purchases)?
SPOILERS AHEAD
β exercise 1
β/{β’β΅}βΈorders.category
β exercise 2
(category count)βββ{βΊ(β’β΅)}βΈorders.category
(count=β/count)/category
9166
βββββββββββββββββββββββββββββββββββββββββ βbed_bath_table β βββββββββββββββββββββββββββββββββββββββββ
Special FormsΒΆ
Some special forms of βΈ
are optimised. These are
{ββ΅}βΈ
{βΊβ΅}βΈ
{βΊ}βΈ
β£βΈ
{β’β΅}βΈ
{βΊ(β’β΅)}βΈ
{βΊ,β’β΅}βΈ
{F/β΅}βΈ
{FβΏβ΅}βΈ
{βΊ(F/β΅)}βΈ
{βΊ(FβΏβ΅)}βΈ
{βΊ,F/β΅}βΈ
{βΊ,FβΏβ΅}βΈ
where
F
is one of+
,β
, orβ
whenβ΅
is numeric;F
is one ofβ§
,β¨
,=
,β
,+
whenβ΅
is Boolean.
These are senstive to exactly how you spell them!
numbersβ10000?10000
]runtime -c '(2|numbers){+/β΅}βΈnumbers' '(2|numbers){β’+/β΅}βΈnumbers'
(2|numbers){+/β΅}βΈnumbers β 1.6EΒ―5 | 0% βββββββββ (2|numbers){β’+/β΅}βΈnumbers β 7.1EΒ―5 | +334% ββββββββββββββββββββββββββββββββββββββββ
OrderΒΆ
You might have noticed that βΈ
orders its output by the order they appeared in the input. Sometimes this isn't what we want, and we need to do some extra steps.
Exercise: Write a function PaymentPerState
which takes
- as left argument, a list of state codes, for instance
'GO' 'TO' 'SC'
, - as right argument, the
orders
table,
and returns the total payment made in each state, in the same order as the left argument. There are two ways you could go about fixing this order:
- sorting after the fact, or
- prepending some data in the correct order.
SPOILERS AHEAD
β solution 1 - using β³ to get the right totals
PaymentPerState_1β{
(state payment)ββ΅.(state payment)
(state total) βββstate{βΊ(+/β΅)}βΈpayment
total[stateβ³βΊ] β we are lucky that all states appear!
}
β solution 2 - prepending βΊ to get the right order
PaymentPerState_2β{
(state payment)ββ΅.(state payment)
state βͺβ¨βββΊ β prepend βΊ
payment,β¨β0β΄β¨β’βΊ β prepend a 0 for each state in βΊ
(β’βΊ)βstate{+/β΅}βΈpayment
}
'TO' 'GO' 'SC' PaymentPerState_1 orders
'TO' 'GO' 'SC' PaymentPerState_2 orders
58068.18 319766.98 579297.82
58068.18 319766.98 579297.82
In Dyalog '23 - D13: Giving Key a Vocabulary, an extension to βΈ
was proposed. This extension would allow us to give βΈ
a 'vocabulary' as left operand, instead of a function. This would control exactly the order that groups appear in the result.
β source: https://raw.githubusercontent.com/abrudz/dyalog_vision/main/QuadEqual.aplo
QuadEqualβ{ β βΈ allowing operand to be vocabulary (and then using {ββ΅} as internal operand)
βΊββ’ β
3=40 βATX'βΊβΊ':βΊ βΊβΊβΈβ΅ β fn operand: current definition
3=40 βATX'βΊ':β΅ ββ³β’β΅ β monadic
βIOβ1
uvocββͺβΊβΊ
maskβ(β’uvoc)β₯uvocβ³βΊ β high-rank β
keysβuvocβͺmaskβΏβΊ
valuesβmaskβΏβ΅
valuesβ΄β¨β(β’uvoc)+@1β΄values β append filler values while guarding against NONCE ERRORs
valuesββ¨β-(β’uvoc) β move them to front
(1βΒ¨keys{ββ΅}βΈvalues)[uvocβ³βΊβΊ]
}
This makes PaymentPerState
much easier.
PaymentPerState_3β{
(state payment)ββ΅.(state payment)
+/Β¨(βstate) (βΊ QuadEqual) payment
}
'TO' 'GO' 'SC' PaymentPerState_3 orders
58068.18 319766.98 579297.82
Multiple KeysΒΆ
Sometimes we need to group our data by more than one than one key. In this case we really have two options:
- Use
βΈ
multiple times - once for each column we're grouping by, - Create a compound key from the individual keys.
Exercise: Write a function PaymentPerMonthByState
which takes
- as left argument, a list of state codes,
- as right argument, the
orders
table,
and returns a matrix of total payments whose columns correspond to the given state codes and whose rows correspond to months in 2017. For example, 'SP' 'RJ' 'PI' 'MT' PaymentPerMonthByState orders
should return
43103.53 13139.53 1453.98 1922.78
80348.6 33197.29 3298.4 3583.36
140767.23 59495.67 2582.92 2702.55
130989.25 61960.3 2288.91 3912.86
188394.13 75293.52 6679.58 7560.36
185274.77 59246.08 2626.96 4788.16
197902.88 84167.86 2938.77 11235.49
212931.9 85555.98 5072.72 6939.29
231109.84 104566.94 3242.68 8101.66
239321.27 108026.61 4544.47 12828.51
391137.77 166838.56 3745.39 13144.66
301554.04 124615.01 3482 10432.55
For instance, a total of 43103.53
was spent in the state SP
in January of 2017.
SPOILERS AHEAD
β solution 1 - compound key
PaymentPerMonthByState_1β{
(state payment year month)ββ΅.(state payment year month)
β get only 2017
maskβyear=2017
state βΏβ¨βmask
paymentβΏβ¨βmask
month βΏβ¨βmask
β prepend for ordering
nβ12Γβ’βΊ
state βͺβ¨ββnβ΄βΊ
month ,β¨β(β’βΊ)/1+β³12
payment,β¨βnβ΄0
β compound key
month_stateβββmonth (βstate)
β compute matrix
12 (β’βΊ)β΄month_state {+/β΅}βΈ payment
}
β solution 2 - a faster version
PaymentPerMonthByState_2β{
(state payment year month)ββ΅.(state payment year month)
β get only 2017
maskβ2017=year
state βΏβ¨βmask
paymentβΏβ¨βmask
month βΏβ¨βmask
β prepend for ordering
nβ12Γβ’βΊ
state βͺβ¨ββnβ΄βΊ
month ,β¨β(β’βΊ)/1+β³12
payment,β¨βnβ΄0
β compound key
stateββ³β¨state β ids idiom
month_stateβββmonth state β now flat
β compute matrix
12 (β’βΊ)β΄month_state {+/β΅}βΈ payment
}
β solution 3 - double key
PaymentPerMonthByState_3β{
(state payment year month)ββ΅.(state payment year month)
β get only 2017
maskβ2017=year
state βΏβ¨βmask
paymentβΏβ¨βmask
month βΏβ¨βmask
month,β¨β1+β³12
iβmonth {β1ββ΅}βΈ (12β΄0),β³β’payment
ββΊβ{
state β (ββΊ)βͺstate[β΅;]
paymentβ(0β΄β¨β’βΊ),payment[β΅]
(β’βΊ)βstate {+/β΅}βΈ payment
}Β¨i
}
'SP' 'RJ' 'PI' 'MT' PaymentPerMonthByState_1 orders
'SP' 'RJ' 'PI' 'MT' PaymentPerMonthByState_2 orders
'SP' 'RJ' 'PI' 'MT' PaymentPerMonthByState_3 orders
43103.53 13139.53 1453.98 1922.78 80348.6 33197.29 3298.4 3583.36 140767.23 59495.67 2582.92 2702.55 130989.25 61960.3 2288.91 3912.86 188394.13 75293.52 6679.58 7560.36 185274.77 59246.08 2626.96 4788.16 197902.88 84167.86 2938.77 11235.49 212931.9 85555.98 5072.72 6939.29 231109.84 104566.94 3242.68 8101.66 239321.27 108026.61 4544.47 12828.51 391137.77 166838.56 3745.39 13144.66 301554.04 124615.01 3482 10432.55
43103.53 13139.53 1453.98 1922.78 80348.6 33197.29 3298.4 3583.36 140767.23 59495.67 2582.92 2702.55 130989.25 61960.3 2288.91 3912.86 188394.13 75293.52 6679.58 7560.36 185274.77 59246.08 2626.96 4788.16 197902.88 84167.86 2938.77 11235.49 212931.9 85555.98 5072.72 6939.29 231109.84 104566.94 3242.68 8101.66 239321.27 108026.61 4544.47 12828.51 391137.77 166838.56 3745.39 13144.66 301554.04 124615.01 3482 10432.55
43103.53 13139.53 1453.98 1922.78 80348.6 33197.29 3298.4 3583.36 140767.23 59495.67 2582.92 2702.55 130989.25 61960.3 2288.91 3912.86 188394.13 75293.52 6679.58 7560.36 185274.77 59246.08 2626.96 4788.16 197902.88 84167.86 2938.77 11235.49 212931.9 85555.98 5072.72 6939.29 231109.84 104566.94 3242.68 8101.66 239321.27 108026.61 4544.47 12828.51 391137.77 166838.56 3745.39 13144.66 301554.04 124615.01 3482 10432.55