1.5 Factors and Levels

In data analysis many times you will deal with categorical data such as Boolean TRUE/FALSE, types of cars, gender, a.k.a., Qualitative variables. They cannot be meaningfully expressed in numbers. But how to handle them in R? The answer is factors and levels. Factors and levels are often useful for Statistical Modeling and Plotting data. Since categorical variables are differnent than continuous variables, storing data as factors ensures that the modeling functions will treat such data correctly9.

In R, A factor is a data structure used to represent a vector as categorical data. You can work with factors and levels in base-R. But some tidyverse packages: dplyr, tidyr, forcats, readr also help to deal with them.

1.5.1 Creating a factor

x_month <- c(
    "June", "July", "August", "September",
    "August", "July", "July", "August"
)

Using a string to record such variable has two major problems10:

  • You might make typos

    x_month_2 <- c("July", "August", "Jamuary", "Septenber")
  • And, the vector doesn’t get sorted in a useful way!

    sort(x_month)
    #> [1] "August"    "August"    "August"    "July"      "July"      "July"     
    #> [7] "June"      "September"

    To solve this problem, you need create factor with valid levels.

Get used to

The function factor() is used create a factor. The only required argument to factor() is a vector of values which will be returned as a vector of factor values. Levels are the unique set of values whose order directs the ordering of the factor.

month_levels <- c(
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December"
)
fac_x_month <- factor(x_month, levels = month_levels)
# All the levels are stored with the created factor
fac_x_month
#> [1] June      July      August    September August    July      July     
#> [8] August   
#> 12 Levels: January February March April May June July August ... December

sort(fac_x_month)
#> [1] June      July      July      July      August    August    August   
#> [8] September
#> 12 Levels: January February March April May June July August ... December

And any values not in the set are silently converted to NA.

factor(x_month_2, levels = month_levels)
#> [1] July   August <NA>   <NA>  
#> 12 Levels: January February March April May June July August ... December

You can use readr::parse_factor() to get warning.

parse_factor(x_month_2, levels = month_levels)
#> Warning: 2 parsing failures.
#> row col           expected    actual
#>   3  -- value in level set Jamuary  
#>   4  -- value in level set Septenber
#> [1] July   August <NA>   <NA>  
#> attr(,"problems")
#> # A tibble: 2 × 4
#>     row   col expected           actual   
#>   <int> <int> <chr>              <chr>    
#> 1     3    NA value in level set Jamuary  
#> 2     4    NA value in level set Septenber
#> 12 Levels: January February March April May June July August ... December

By default, levels of any vector are determined by the as.character() function which sorts the values in dictionary order.

fac_y_char <- factor(x_month)
fac_y_char
#> [1] June      July      August    September August    July      July     
#> [8] August   
#> Levels: August July June September
Numeric data to factor

Both numeric and character variables can be stored as factors, but levels of a factor are always character values.

x_num <- c(1, 2, 2, 3, 1, 2, 3, 3, 1, 2, 3, 3, 1)
fac_x_num <- factor(x_num)

fac_x_num
#>  [1] 1 2 2 3 1 2 3 3 1 2 3 3 1
#> Levels: 1 2 3
mode(levels(fac_x_num))
#> [1] "character"

mean(x_num)
#> [1] 2.076923

mean(fac_x_num)
#> Warning in mean.default(fac_x_num): argument is not numeric or logical:
#> returning NA
#> [1] NA

fac_x_num is not numeric, so mean(fac_x_num) has returned NA.

To convert fac_x_num into original numeric values, use levels() and as.numeric() functions.

numeric_fac_x_num <- as.numeric(levels(fac_x_num)[fac_x_num])
mode(numeric_fac_x_num)
#> [1] "numeric"
mean(numeric_fac_x_num)
#> [1] 2.076923

You might think the above conversion to numeric vector is equivlent to as.numeric(fac_x_num). But NO! it returns a numeric vector different than the desired11.

1.5.2 Levels and labels in factor

The labels = and levels = arguments in the factor() function are quite confusing. The labels = argument allows you to modify the factor levels names, in other words, it labels the levels. Hence, labels is related to output.

factor(x_month,
    levels = month_levels,
    labels = c(
        "Jan", "Feb", "Mar", "Apr", "May", "Jun",
        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
    )
)
#> [1] Jun Jul Aug Sep Aug Jul Jul Aug
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Whereas, levels is related to input, allowing you to specify how the levels are coded.

Length of labels

labels must have same length as the number of unique groups of the input vector.

gender <- c("male", "female", "female", "male")

# otherwise numbers will be concatinated with it
factor(gender, labels = c("f"))
#> [1] f2 f1 f1 f2
#> Levels: f1 f2

# alphabetical ordering of factors
factor(gender, labels = c("m", "f"))
#> [1] f m m f
#> Levels: m f

# the right way
fac_gender <- factor(gender, levels = c("male", "female"), labels = c("m", "f"))
fac_gender
#> [1] m f f m
#> Levels: m f

The specified labels must have the desired order as levels.

1.5.3 Relevel, Relabel, Reorder

Ordered levels

Even if you ceate a factor with desired levels in order, you won’t be able to do boolean operation on the factor elements.

fac_x_month[1] < fac_x_month[2]
#> Warning in Ops.factor(fac_x_month[1], fac_x_month[2]): '<' not meaningful for
#> factors
#> [1] NA

To fix this issue, you need to set ordered = TRUE which treats the levels as ordered (in the given order).

fac_x_month_order <- factor(x_month, levels = month_levels, ordered = TRUE)
fac_x_month_order
#> [1] June      July      August    September August    July      July     
#> [8] August   
#> 12 Levels: January < February < March < April < May < June < ... < December
fac_x_month_order[3] < fac_x_month_order[4]
#> [1] TRUE
Relevel

Sometimes you might need to change order of some levels. The relevel() and forcats::fct_relevel() functions can help you12.

relevel(fac_gender, ref = "f")
#> [1] m f f m
#> Levels: f m
fct_relevel(fac_gender, "m", after = 1)
#> [1] m f f m
#> Levels: f m

# Only one level can be put at first
relevel(fac_x_month, "December")
#> [1] June      July      August    September August    July      July     
#> [8] August   
#> 12 Levels: December January February March April May June July ... November

# forcats::fct_relevel() function extends the capabilities of relevel()
fct_relevel(fac_x_month, "November", "December")
#> [1] June      July      August    September August    July      July     
#> [8] August   
#> 12 Levels: November December January February March April May June ... October
Reverse the order of levels
factor(fac_x_month, levels = rev(levels(fac_x_month)))
#> [1] June      July      August    September August    July      July     
#> [8] August   
#> 12 Levels: December November October September August July June May ... January
Drop levels
months_random <- factor(x_month,
    levels = c(
        "Garbage", "January", "February",
        "March", "April", "May", "June", "July", "August",
        "September", "October", "November", "December"
    )
)

droplevels(months_random, exclude = "Garbage")
#> [1] June      July      August    September August    July      July     
#> [8] August   
#> Levels: June July August September
Get info about the levels the in the factor

For a factor, the table() function displays all the levels present in the factor and respective counts.

table(months_random)
#> months_random
#>   Garbage   January  February     March     April       May      June      July 
#>         0         0         0         0         0         0         1         3 
#>    August September   October  November  December 
#>         3         1         0         0         0

When a new factor is created from another factor, it retains only the levels present in the old factor without affecting the order of the levels.

factor(months_random)
#> [1] June      July      August    September August    July      July     
#> [8] August   
#> Levels: June July August September
table(factor(months_random))
#> 
#>      June      July    August September 
#>         1         3         3         1

1.5.4 Continuous to Categorical

In data analysis, sometimes it is useful to turn continuous numeric variables into categorical ones. For example, ages of people, exam scores include certain set labels for a certain range of values:

  • “baby” for 0-5 years olds, “teen” for 13-19 years olds and so on.
  • “A+” for 90-100% marks, “A” for 80-90% marks and blah blah blah.
Using cut()

The cut() function is used to convert a numeric variable into a factor. With breaks = argument, you can specify the number of intervals( \(\geq\) 2) into which the input vector is to be cut.

# 10 iid samples from U(0,1000) distribution
x_unif <- round(1000 * runif(10))

# Breaking into 4 intervals
fac_x <- cut(x_unif, breaks = 4)
fac_x
#>  [1] (550,726] (198,375] (726,903] (550,726] (375,550] (550,726] (375,550]
#>  [8] (198,375] (375,550] (726,903]
#> Levels: (198,375] (375,550] (550,726] (726,903]
table(fac_x)
#> fac_x
#> (198,375] (375,550] (550,726] (726,903] 
#>         2         3         3         2

# Labeling the intervals
cut(x_unif, breaks = 3, labels = c("L", "M", "H"))
#>  [1] M L H M M M L L M H
#> Levels: L M H

breaks = also takes a valid numeric vector containing the specified breakpoints.

cut(x_unif, breaks = c(0, 1, 8, 900, 1000))
#>  [1] (8,900]     (8,900]     (900,1e+03] (8,900]     (8,900]     (8,900]    
#>  [7] (8,900]     (8,900]     (8,900]     (8,900]    
#> Levels: (0,1] (1,8] (8,900] (900,1e+03]
Break at nice points

The pretty() function yields a nicer set of intervals. It computes a sequence of about \(n+1\) equally spaced rounded numbers covering the range of the values in the input vector. Most of the times the chosen values are 1, 2 or 5 times a power of 10.

xp_factor <- cut(x_unif, breaks = pretty(x_unif, n = 4))
xp_factor
#>  [1] (400,600]   (200,400]   (800,1e+03] (400,600]   (400,600]   (600,800]  
#>  [7] (400,600]   (0,200]     (400,600]   (600,800]  
#> Levels: (0,200] (200,400] (400,600] (600,800] (800,1e+03]
table(xp_factor)
#> xp_factor
#>     (0,200]   (200,400]   (400,600]   (600,800] (800,1e+03] 
#>           1           1           5           2           1

But it may or may not provide the desired levels!

Factor from Quantiles

Quantiles are cut points dividing the observations in a sample into continuous intervals with equal probabilities. You can produce factors based on quantiles of your data to ensure an “equal” distribution.

xq_factor <- cut(x_unif,
    breaks = quantile(x_unif,
        probs = seq(0, 1, by = 0.25)
    )
)
table(xq_factor)
#> xq_factor
#> (199,432] (432,528] (528,628] (628,902] 
#>         2         2         2         3

By default, cut() provides half-close intervals as levels. Consequently, it doesn’t include the lowest number! You can fix it by setting include.lowest = TRUE.

xq_factor
#>  [1] (528,628] (199,432] (628,902] (528,628] (432,528] (628,902] (199,432]
#>  [8] <NA>      (432,528] (628,902]
#> Levels: (199,432] (432,528] (528,628] (628,902]

# The fix
xq_factor <- cut(x_unif,
    breaks = quantile(x_unif,
        probs = seq(0, 1, by = 0.25)
    ), include.lowest = TRUE
)
xq_factor
#>  [1] (528,628] [199,432] (628,902] (528,628] (432,528] (628,902] [199,432]
#>  [8] [199,432] (432,528] (628,902]
#> Levels: [199,432] (432,528] (528,628] (628,902]

1.5.5 Date & Time with factors

Converting date and time into factors is often useful for labelling certain period of a year.

First create a date vector of the year 2021

everyday_2021 <- seq(
    from = as.Date("2021-1-1"),
    to = as.Date("2021-12-31"),
    by = "day"
)

format(), strptime(), strftime() can be used to extract the month from each date using "%b"

month_everyday_2021 <- format(everyday_2021, format = "%b")
head(month_everyday_2021, 3)
#> [1] "Jan" "Jan" "Jan"
df_m_2021 <- as.data.frame(table(month_everyday_2021))

# Changing the names of the variables of the dataframe
names(df_m_2021) <- c("Month", "Freq")
df_m_2021
#>    Month Freq
#> 1    Apr   30
#> 2    Aug   31
#> 3    Dec   31
#> 4    Feb   28
#> 5    Jan   31
#> 6    Jul   31
#> 7    Jun   30
#> 8    Mar   31
#> 9    May   31
#> 10   Nov   30
#> 11   Oct   31
#> 12   Sep   30

You can use unique() function which returns a factor of unique values in the order they are encountered. So the levels argument will provide the month abbreviations in the correct order to produce an properly ordered factor.

# Months as factors
months_2021 <- factor(month_everyday_2021,
    levels = unique(month_everyday_2021),
    ordered = TRUE
)

df_m_2021 <- as.data.frame(table(months_2021))
names(df_m_2021) <- c("Month", "Freq")

df_m_2021
#>    Month Freq
#> 1    Jan   31
#> 2    Feb   28
#> 3    Mar   31
#> 4    Apr   30
#> 5    May   31
#> 6    Jun   30
#> 7    Jul   31
#> 8    Aug   31
#> 9    Sep   30
#> 10   Oct   31
#> 11   Nov   30
#> 12   Dec   31

Let’s see the factored dates by breaking them in week, quarters13

  • levels - 53 dates of each week of the year 2021
weeks_2021 <- cut(everyday_2021, breaks = "week")
levels(weeks_2021)
#>  [1] "2020-12-28" "2021-01-04" "2021-01-11" "2021-01-18" "2021-01-25"
#>  [6] "2021-02-01" "2021-02-08" "2021-02-15" "2021-02-22" "2021-03-01"
#> [11] "2021-03-08" "2021-03-15" "2021-03-22" "2021-03-29" "2021-04-05"
#> [16] "2021-04-12" "2021-04-19" "2021-04-26" "2021-05-03" "2021-05-10"
#> [21] "2021-05-17" "2021-05-24" "2021-05-31" "2021-06-07" "2021-06-14"
#> [26] "2021-06-21" "2021-06-28" "2021-07-05" "2021-07-12" "2021-07-19"
#> [31] "2021-07-26" "2021-08-02" "2021-08-09" "2021-08-16" "2021-08-23"
#> [36] "2021-08-30" "2021-09-06" "2021-09-13" "2021-09-20" "2021-09-27"
#> [41] "2021-10-04" "2021-10-11" "2021-10-18" "2021-10-25" "2021-11-01"
#> [46] "2021-11-08" "2021-11-15" "2021-11-22" "2021-11-29" "2021-12-06"
#> [51] "2021-12-13" "2021-12-20" "2021-12-27"
  • levels - 4 quarters of the year 2021
qtrs_2021 <- cut(everyday_2021,
    breaks = "3 months",
    labels = paste("Q", 1:4, sep = "")
)
levels(qtrs_2021)
#> [1] "Q1" "Q2" "Q3" "Q4"

1.5.6 An Example in Data Visualization: Bar-chart of monthly deceased count

The Master.csv file contains deceased data of COVID-19 patients curated from Government of Karnataka COVID-19 Bulletin. We would like to see a bar chart of monthly deceased count. First load the file:

deceased_data <- read.csv(
    file = "../../assets/datasets/Master.csv",
    header = TRUE
)

You can use months() to extract months of reporting date

deceased_data$month <- months(as.Date(deceased_data$MB.Date))

To plot the data, you need to convert the table into a dataframe.

deceased_df <- as.data.frame(table(deceased_data$month))
names(deceased_df) <- c("month_name", "count")

deceased_df
#>    month_name count
#> 1       April  2974
#> 2      August  4108
#> 3    December   430
#> 4    February   483
#> 5     January   843
#> 6        July  3561
#> 7        June  6049
#> 8       March   239
#> 9         May 13599
#> 10   November   737
#> 11    October  2593
#> 12  September  3643

# Plotting
ggplot(data = deceased_df, aes(x = month_name, y = count, fill = month_name)) +
    geom_bar(stat = "identity", alpha = .6, width = .4) +
    coord_flip() +
    scale_fill_viridis_d() +
    xlab("") +
    theme_bw()

But the months doesn’t seem to be in order. ggplot() uses ordering alphabetically and treats Month as factor and default level ordering.

Let’s try different approach. Try to arrange the dataframe in a preferred ordering.

deceased_df <- arrange(deceased_df, count)

ggplot(data = deceased_df, aes(x = month_name, y = count, fill = month_name)) +
    geom_bar(stat = "identity", alpha = .6, width = .4) +
    scale_fill_viridis_d() +
    coord_flip() +
    xlab("") +
    theme_bw()

The plot is still the same as before. ggplot() takes ordering into account from the factor given by its levels & NOT as we see in the dataframe!

Ah! Factors can solve this problem. Reorder the levels with month_levels.

deceased_df$month_name <- factor(deceased_df$month_name, levels = month_levels)

ggplot(data = deceased_df, aes(x = month_name, y = count, fill = month_name)) +
    geom_bar(stat = "identity", alpha = .6, width = .4) +
    coord_flip() +
    scale_fill_viridis_d() +
    xlab("") +
    theme_bw()

Hurray! ggplot() obliges the order now. We get our desired plot.

Some more

You can use forcats::fct_infreq() function to get months according to the order of count.

deceased_data <- read.csv(
    file = "../../assets/datasets/Master.csv",
    header = TRUE
)
deceased_data$month <- months(as.Date(deceased_data$MB.Date))

# Plot
ggplot(
    data = deceased_data,
    mapping = aes(
        x = fct_infreq(month),
        fill = cut(Age.In.Years, pretty(Age.In.Years, 4))
    )
) +
    geom_bar(stat = "count", alpha = .6, width = .4) +
    scale_fill_viridis_d() +
    xlab("") +
    coord_flip() +
    theme_bw()

Some more customization to get a nicer plot

ggplot(
    data = deceased_data,
    mapping = aes(
        x = fct_infreq(month),
        fill = cut(Age.In.Years, pretty(Age.In.Years, 4))
        # cut() is used to add Age.In.Years bin layer
    )
) +
    geom_bar(stat = "count", width = .4) +
    labs(fill = "Age") +
    scale_fill_viridis_d() +
    coord_flip() +
    ggtitle("Deceased Month Count- Age Stacked") +
    ylab("Count of Deceased") +
    xlab("Month") +
    theme_bw() +
    theme(
        plot.title = element_text(
            color = "#e95462",
            size = 14,
            face = "bold",
            hjust = 0.5
        ),
        axis.title.x = element_text(
            color = "#e95462",
            size = 14, vjust = 0.5, face = "bold"
        ),
        axis.title.y = element_text(
            color = "#e95462",
            size = 14, face = "bold"
        )
    )

Exercise

Exercise 1.17 (Function definition syntax in R)

function_name <- function(<comma_separated_parameters>) {
   # function_statements
}

Write functions using the above syantax in R that do each the following: [Decide appropriately what the input and outputs should be. Each item must be separate function that can be called.]

  1. Find the levels of factor of a given vector.
  2. Change the first level of a factor with another level of a given factor.
  3. Create an ordered factor from data consisting of the names of months.
  4. Concatenate two given factors in a single factor.
  5. Convert a given vector of integers to an ordered factor.
  6. Extract the five of the levels of factor created from a random sample from the LETTERS.
  7. Create a factor corresponding to height in women data set, which contains height and weights for a sample of women.

References

Bryan, Jenny. 2017. STAT 545. https://stat545.com/factors-boss.html#change-order-of-the-levels-because-i-said-so.
González, José Carlos Soage. 2022. Factor in R.”
———. 2011b. Factors in R.”
———. 2017d. R for Data Science.” In. https://r4ds.had.co.nz/factors.html.

  1. Spector (2011b)↩︎

  2. Wickham and Grolemund (2017d)↩︎

  3. González (2022)↩︎

  4. Bryan (2017), Spector (2011b)↩︎

  5. Spector (2011b)↩︎