3 Analysis population

Following ICH E3 guidance, we need to summarize the number of participants included in each efficacy analysis in Section 11.1, Data Sets Analysed.

library(haven) # Read SAS data
library(dplyr) # Manipulate data
library(tidyr) # Manipulate data
library(r2rtf) # Reporting in RTF format

In this chapter, we illustrate how to create a summary table for the analysis population for a study.

The first step is to read relevant datasets into R. For the analysis population table, all the required information is saved in the ADSL dataset. We can use the haven package to read the dataset.

adsl <- read_sas("data-adam/adsl.sas7bdat")

We illustrate how to prepare a report data for a simplified analysis population table using variables below:

  • USUBJID: unique subject identifier
  • ITTFL: intent-to-treat population flag
  • EFFFL: efficacy population flag
  • SAFFL: safty population flag
adsl %>%
  select(USUBJID, ITTFL, EFFFL, SAFFL) %>%
  head(4)
#> # A tibble: 4 × 4
#>   USUBJID     ITTFL EFFFL SAFFL
#>   <chr>       <chr> <chr> <chr>
#> 1 01-701-1015 Y     Y     Y    
#> 2 01-701-1023 Y     Y     Y    
#> 3 01-701-1028 Y     Y     Y    
#> 4 01-701-1033 Y     Y     Y

3.1 Helper functions

Before we write the analysis code, let’s discuss the possibility of reusing R code by writing helper functions.

As discussed in R for data science, “You should consider writing a function whenever you’ve copied and pasted a block of code more than twice”.

In Chapter 2, there are a few repeating steps to:

  • Format the percentages using the formatC() function.
  • Calculate the numbers and percentages by treatment arm.

We create two ad-hoc functions and use them to create the tables in the rest of this book.

To format numbers and percentages, we create a function called fmt_num(). It is a very simple function wrapping formatC().

fmt_num <- function(x, digits, width = digits + 4) {
  formatC(x,
    digits = digits,
    format = "f",
    width = width
  )
}

The main reason to create the fmt_num() function is to enhance the readability of the analysis code.

For example, we can compare the two versions of code to format the percentage used in Chapter 2 and fmt_num.

formatC(n / n() * 100,
  digits = 1, format = "f", width = 5
)
fmt_num(n / n() * 100, digits = 1)

To calculate the numbers and percentages of participants by groups, we provide a simple (but not robust) wrapper function, count_by(), using the dplyr and tidyr package.

The function can be enhanced in multiple ways, but here we only focus on simplicity and readability. More details about writing R functions can be found in the STAT 545 course.

count_by <- function(data, # Input data set
                     grp, # Group variable
                     var, # Analysis variable
                     var_label = var, # Analysis variable label
                     id = "USUBJID") { # Subject ID variable
  data <- data %>% rename(grp = !!grp, var = !!var, id = !!id)

  left_join(
    count(data, grp, var),
    count(data, grp, name = "tot"),
    by = "grp",
  ) %>%
    mutate(
      pct = fmt_num(100 * n / tot, digits = 1),
      n = fmt_num(n, digits = 0),
      npct = paste0(n, " (", pct, ")")
    ) %>%
    pivot_wider(
      id_cols = var,
      names_from = grp,
      values_from = c(n, pct, npct),
      values_fill = list(n = "0", pct = fmt_num(0, digits = 0))
    ) %>%
    mutate(var_label = var_label)
}

By using the count_by() function, we can simplify the analysis code as below.

count_by(adsl, "TRT01PN", "EFFFL") %>%
  select(-ends_with(c("_54", "_81")))
#> # A tibble: 2 × 5
#>   var   n_0    pct_0   npct_0         var_label
#>   <chr> <chr>  <chr>   <chr>          <chr>    
#> 1 N     "   7" "  8.1" "   7 (  8.1)" EFFFL    
#> 2 Y     "  79" " 91.9" "  79 ( 91.9)" EFFFL

3.2 Analysis code

With the helper function count_by, we can easily prepare a report dataset as

# Derive a randomization flag
adsl <- adsl %>% mutate(RANDFL = "Y")

pop <- count_by(adsl, "TRT01PN", "RANDFL",
  var_label = "Participants in Population"
) %>%
  select(var_label, starts_with("n_"))
pop1 <- bind_rows(
  count_by(adsl, "TRT01PN", "ITTFL",
    var_label = "Participants included in ITT population"
  ),
  count_by(adsl, "TRT01PN", "EFFFL",
    var_label = "Participants included in efficacy population"
  ),
  count_by(adsl, "TRT01PN", "SAFFL",
    var_label = "Participants included in safety population"
  )
) %>%
  filter(var == "Y") %>%
  select(var_label, starts_with("npct_"))

Now we combine individual rows into one table for reporting purpose. tbl_pop is used as input for r2rtf to create the final report.

names(pop) <- gsub("n_", "npct_", names(pop))
tbl_pop <- bind_rows(pop, pop1)

tbl_pop %>% select(var_label, npct_0)
#> # A tibble: 4 × 2
#>   var_label                                    npct_0        
#>   <chr>                                        <chr>         
#> 1 Participants in Population                   "  86"        
#> 2 Participants included in ITT population      "  86 (100.0)"
#> 3 Participants included in efficacy population "  79 ( 91.9)"
#> 4 Participants included in safety population   "  86 (100.0)"

We define the format of the output using code below.

rel_width <- c(2, rep(1, 3))
colheader <- " | Placebo | Xanomeline line Low Dose| Xanomeline line High Dose"
tbl_pop %>%
  # Table title
  rtf_title(
    "Summary of Analysis Sets", ## "Participants Accounting in Analysis Population" is a weird title for me. Merck uses this table title for CSR? 
    "(All Participants Randomized)"
  ) %>%
  # First row of column header
  rtf_colheader(colheader,
    col_rel_width = rel_width
  ) %>%
  # Second row of column header
  rtf_colheader(" | n (%) | n (%) | n (%)",
    border_top = "",
    col_rel_width = rel_width
  ) %>%
  # Table body
  rtf_body(
    col_rel_width = rel_width,
    text_justification = c("l", rep("c", 3))
  ) %>%
  # Encoding RTF syntax
  rtf_encode() %>%
  # Save to a file
  write_rtf("tlf/tbl_pop.rtf")

The procedure to generate an analysis population table can be summarized as follows:

  • Step 1: Read data (i.e., adsl) into R.
  • Step 2: Bind the counts/percentages of the ITT population, the efficacy population, and the safety population by row using the count_by() function.
  • Step 3: Format the output from Step 2 using r2rtf.