Data formats for frequencies

Frequencies are actually not raw data: they are the counts of data belonging to a certain cell of your design. As such, they are summary statistics, a bit like the mean is a summary statistic of data. In this vignette, we review various ways that data can be coded in a data frame.

Throughout, we use a simple example: (fictitious) data of speakers classified according to their ability to play high-intensity sounds (low ability, medium ability, or high ability; three levels) and the pitch with which they play these sounds (soft or hard; two levels). This is therefore a design with two factors, noted in brief a $3 \times 2$ design. A total of 20 speakers have been examined ($N=20$).

Before we begin, we load the package ANOFA (if it is not present on your computer, first install it from CRAN or from the source repository with devtools::install_github("dcousin3/ANOFA")):

library(ANOFA)

First format: Raw data format

In this format, there is one line per subject and one column for each possible category (one for low, one for medium, etc.). The column contains a 1 (a checkmark if you wish) if the subject is classified in this category and zero for the alternative categories. In a $3 \times 2$ design, there is therefore a total of $3+2 = 5$ columns.

The raw data are:

dataRaw

##    Low Medium High Soft Hard
## 1    1      0    0    1    0
## 2    1      0    0    1    0
## 3    0      1    0    1    0
## 4    0      1    0    1    0
## 5    0      1    0    1    0
## 6    0      0    1    1    0
## 7    0      0    1    1    0
## 8    0      0    1    1    0
## 9    0      0    1    1    0
## 10   0      0    1    1    0
## 11   1      0    0    0    1
## 12   1      0    0    0    1
## 13   1      0    0    0    1
## 14   1      0    0    0    1
## 15   0      1    0    0    1
## 16   0      1    0    0    1
## 17   0      0    1    0    1
## 18   0      0    1    0    1
## 19   0      0    1    0    1
## 20   0      0    1    0    1

To provide raw data to anofa(), the formula must be given as

w <- anofa( ~ cbind(Low,Medium,High) + cbind(Soft,Hard), dataRaw)

where cbind() is used to group the categories of a single factor. The formula has no left-hand side (lhs) term because the categories are signaled by the columns named on the right-hand side.

This format is actually the closest to how the data are recorded: if you were coding the data manually, you would have a score sheet and place checkmarks where appropriate.
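As an illustration of the score-sheet idea, here is a minimal sketch that rebuilds a raw-format data frame by hand with base R only. The object name myRaw and the intermediate vectors intensity and pitch are hypothetical; the example data in this vignette are referred to as dataRaw, and this sketch merely reproduces the same pattern of checkmarks.

```r
# Hypothetical reconstruction of the 20 fictitious speakers shown above,
# built with base R only (no ANOFA functions are used here).
intensity <- rep(c("Low", "Medium", "High", "Low", "Medium", "High"),
                 times = c(2, 3, 5, 4, 2, 4))
pitch     <- rep(c("Soft", "Hard"), times = c(10, 10))

# One 0/1 indicator column per category: 3 + 2 = 5 columns.
myRaw <- data.frame(
  Low    = as.integer(intensity == "Low"),
  Medium = as.integer(intensity == "Medium"),
  High   = as.integer(intensity == "High"),
  Soft   = as.integer(pitch == "Soft"),
  Hard   = as.integer(pitch == "Hard")
)

# Sanity check: each subject has exactly one checkmark per factor.
stopifnot(all(rowSums(myRaw[, c("Low", "Medium", "High")]) == 1))
stopifnot(all(rowSums(myRaw[, c("Soft", "Hard")]) == 1))
```

The two stopifnot() calls are a cheap way to catch a score sheet where a subject was ticked twice, or not at all, within a factor.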

Second format: Wide data format

In this format, instead of coding checkmarks under the relevant category (using 1s), only the applicable category is recorded. Hence, if ability to play high intensity is 1 (and the others zero), this format only keeps “High” in the record. Consequently, for a design with two factors, there are only two columns, and as many lines as there are subjects:

dataWide

##    Intensity Pitch
## 1        Low  Soft
## 2        Low  Soft
## 3     Medium  Soft
## 4     Medium  Soft
## 5     Medium  Soft
## 6       High  Soft
## 7       High  Soft
## 8       High  Soft
## 9       High  Soft
## 10      High  Soft
## 11       Low  Hard
## 12       Low  Hard
## 13       Low  Hard
## 14       Low  Hard
## 15    Medium  Hard
## 16    Medium  Hard
## 17      High  Hard
## 18      High  Hard
## 19      High  Hard
## 20      High  Hard

To use this format in anofa, use

w <- anofa( ~ Intensity * Pitch, dataWide)

(You can verify that the results are identical, whatever the format, by checking summary(w).)
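To make the link between the raw and wide formats concrete, the following sketch turns indicator columns into category labels using base R. The data frame myRaw (four made-up subjects) and the helper toLabel() are hypothetical and are not part of ANOFA.

```r
# Hypothetical raw-format data for four subjects (not the package's dataRaw).
myRaw <- data.frame(
  Low  = c(1, 0, 0, 0), Medium = c(0, 1, 0, 0), High = c(0, 0, 1, 1),
  Soft = c(1, 1, 0, 1), Hard   = c(0, 0, 1, 0)
)

# For each factor, keep the name of the column holding the 1.
# (Hypothetical helper; assumes exactly one 1 per row within the factor.)
toLabel <- function(indicators) {
  names(indicators)[max.col(as.matrix(indicators), ties.method = "first")]
}

# One column per factor, one line per subject: the wide format.
myWide <- data.frame(
  Intensity = toLabel(myRaw[, c("Low", "Medium", "High")]),
  Pitch     = toLabel(myRaw[, c("Soft", "Hard")])
)
myWide
```

The five indicator columns collapse into two label columns, one per factor, while the number of lines stays equal to the number of subjects.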

Third format: Compiled data format

This format is compiled, in the sense that the frequencies have been counted for each cell of the design. Hence, we no longer have access to the raw data. In this format, there are $3 \times 2 = 6$ lines, one for each combination of the factor levels, and $2+1 = 3$ columns, two for the factor levels and one for the count in that cell (aka the frequency). Thus,

dataCompiled

##   Intensity Pitch Frequency
## 1       Low  Soft         2
## 2    Medium  Soft         3
## 3      High  Soft         5
## 4       Low  Hard         4
## 5    Medium  Hard         2
## 6      High  Hard         4

To use a compiled format in anofa, use

w <- anofa(Frequency ~ Intensity * Pitch, dataCompiled )

where Frequency identifies in which column the counts are stored.
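For readers who want to see where such counts come from, here is a minimal base R sketch that compiles wide-format data into cell frequencies. The names myWide and myCompiled are hypothetical; the data are rebuilt from the counts listed above rather than taken from the package.

```r
# Hypothetical wide-format data rebuilt from the counts listed above.
myWide <- data.frame(
  Intensity = rep(c("Low", "Medium", "High", "Low", "Medium", "High"),
                  times = c(2, 3, 5, 4, 2, 4)),
  Pitch     = rep(c("Soft", "Hard"), times = c(10, 10))
)

# table() counts the subjects in each cell; as.data.frame() turns the
# contingency table into one row per cell with a Frequency column.
myCompiled <- as.data.frame(table(myWide), responseName = "Frequency")
myCompiled
```

The resulting rows may come out in a different order than in dataCompiled, because table() sorts the levels, but the six cell counts should be the same and sum to $N=20$.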

Fourth format: Long data format

This format may be preferred by linear modelers (but it may rapidly become very long!). There are always the same three columns: one Id column, one column to indicate a factor, and one column to indicate the observed level of that factor for that subject. There are $20 \times 2 = 40$ lines in the present example (number of subjects times number of factors).

dataLong

##    Id    Factor  Level
## 1   1 Intensity    Low
## 2   1     Pitch   Soft
## 3   2 Intensity    Low
## 4   2     Pitch   Soft
## 5   3 Intensity Medium
## 6   3     Pitch   Soft
## 7   4 Intensity Medium
## 8   4     Pitch   Soft
## 9   5 Intensity Medium
## 10  5     Pitch   Soft
## 11  6 Intensity   High
## 12  6     Pitch   Soft
## 13  7 Intensity   High
## 14  7     Pitch   Soft
## 15  8 Intensity   High
## 16  8     Pitch   Soft
## 17  9 Intensity   High
## 18  9     Pitch   Soft
## 19 10 Intensity   High
## 20 10     Pitch   Soft
## 21 11 Intensity    Low
## 22 11     Pitch   Hard
## 23 12 Intensity    Low
## 24 12     Pitch   Hard
## 25 13 Intensity    Low
## 26 13     Pitch   Hard
## 27 14 Intensity    Low
## 28 14     Pitch   Hard
## 29 15 Intensity Medium
## 30 15     Pitch   Hard
## 31 16 Intensity Medium
## 32 16     Pitch   Hard
## 33 17 Intensity   High
## 34 17     Pitch   Hard
## 35 18 Intensity   High
## 36 18     Pitch   Hard
## 37 19 Intensity   High
## 38 19     Pitch   Hard
## 39 20 Intensity   High
## 40 20     Pitch   Hard

To analyse such data format within anofa(), use

w <- anofa( Level ~ Factor | Id, dataLong)

The vertical line symbol indicates that the observations are nested within Id (i.e., all the lines with the same Id are actually the same subject).
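The nesting of lines within Id can be illustrated with a small base R sketch that gathers the lines of each subject back into a single line. The data frame myLong (three made-up subjects) and the name myWide are hypothetical, and reshape() from base R is used here only for illustration; it is not what anofa() is documented to do internally.

```r
# Hypothetical long-format data for three subjects (not the package's dataLong).
myLong <- data.frame(
  Id     = rep(1:3, each = 2),
  Factor = rep(c("Intensity", "Pitch"), times = 3),
  Level  = c("Low", "Soft", "Medium", "Soft", "High", "Hard")
)

# Number of lines = number of subjects times number of factors.
stopifnot(nrow(myLong) == 3 * 2)

# reshape() spreads the Factor/Level pairs into one column per factor,
# giving one line per Id (the wide format).
myWide <- reshape(myLong, idvar = "Id", timevar = "Factor",
                  v.names = "Level", direction = "wide")

# Drop the "Level." prefix that reshape() adds to the new column names.
names(myWide) <- sub("^Level\\.", "", names(myWide))
myWide
```

All the lines sharing an Id end up on the same line of the result, which is what the vertical line in the formula expresses.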

Converting between formats

Once entered in an anofa() structure, it is possible to convert to any format using toRaw(), toWide(), toCompiled() and toLong(). For example:

toCompiled(w)

##   Intensity Pitch Frequency
## 1      High  Hard         4
## 2       Low  Hard         4
## 3    Medium  Hard         2
## 4      High  Soft         5
## 5       Low  Soft         2
## 6    Medium  Soft         3

The compiled format is probably the most compact format, but the raw format is the most explicit format (as we see all the subjects and all the checkmarks for each).

The only limitation is with regard to the raw format: it is not possible to guess the names of the factors from the names of the columns. By default, anofa() will use uppercase letters to identify the factors.

w <- anofa( ~ cbind(Low,Medium,High) + cbind(Soft,Hard), dataRaw)
toCompiled(w)

##        A    B Frequency
## 1   High Hard         4
## 2    Low Hard         4
## 3 Medium Hard         2
## 4   High Soft         5
## 5    Low Soft         2
## 6 Medium Soft         3

To overcome this limit, you can manually provide factor names with

w <- anofa( ~ cbind(Low,Medium,High) + cbind(Soft,Hard), dataRaw, 
            factors = c("Intensity","Pitch")
          )
toCompiled(w)

##   Intensity Pitch Frequency
## 1      High  Hard         4
## 2       Low  Hard         4
## 3    Medium  Hard         2
## 4      High  Soft         5
## 5       Low  Soft         2
## 6    Medium  Soft         3

To learn more about analyzing frequency data with ANOFA, refer to Laurencelle & Cousineau (2023) or to "What is an ANOFA?".
