There are many approaches for analyzing data that tracks multiple events ocurring over time, either multiple types of events or repeated ocurrences of the same type of event. The great news is that most of these approaches are implemented in R packages. The not-as-great news is that different packages expect your data to be input in different formats. The aim of this vignette is to show you how to use dplyr to get your data into the format needed for common approaches and packages for multievent data.

Raw data

We will start from “wide” data. In this data, each row is a patient. ID represents a unique patient ID. Each patient had up to 8 clinic visits, and t1 through t8 represent the time from enrollment until each clinic visit (in days). x1 through x8 represent the patient’s status at time of the correpsonding visit, with 1 indicating that the patient was progression free and 2 representing the patient had experienced progression. (For patients that had fewer that 8 visits, the t. and x. varaibles are filled in with NA.) dtime represents the time (in days) from enrollment until the patient died or was lost to followup, and dstatus is an indicator for death. Patients with dstatus=1 diead at their dtime time. Patients with dstatus=0 were followed until their dtime time and were still alive at that time, but we don’t know what happened to them after that.

datfile <- system.file("messydata", "multievent.csv", package="untidydata2")
multievent<-as.tibble(read.csv(datfile))
#> Warning: `as.tibble()` is deprecated, use `as_tibble()` (but mind the new semantics).
#> This warning is displayed once per session.

ID	t1	t2	t3	t4	t5	t6	t7	t8	x1	x2	x3	x4	x5	x6	x7	x8	dtime	dstatus
1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	160.4364	1
2	191.0156	367.3284	547.7683	727.9825	928.0502	1095.994	1268.574	NA	1	1	1	1	1	2	2	NA	1465.1233	0
3	202.4294	355.7819	NA	NA	NA	NA	NA	NA	2	2	NA	NA	NA	NA	NA	NA	401.7873	1
4	191.7655	373.0439	535.9324	717.8133	NA	NA	NA	NA	1	1	1	1	NA	NA	NA	NA	764.9412	1
5	166.2427	349.1972	NA	NA	NA	NA	NA	NA	1	2	NA	NA	NA	NA	NA	NA	375.4677	1
6	176.6494	370.1799	NA	NA	NA	NA	NA	NA	1	1	NA	NA	NA	NA	NA	NA	455.2306	1

Tidying the data

survival package - overall survival endpoint

The survival package is most often used to make Kaplan-Meier curves and fit Cox models for a single event per person. The input dataset should have one row per person, which is the format our data is currently in. To analyze time to death, ignoring progression, we already have the variables we need, dtime and dstatus. These will play the roles of time and event in the creation of a survival object via the function Surv.

ID	dtime	dstatus
1	160.4364	1
2	1465.1233	0
3	401.7873	1
4	764.9412	1
5	375.4677	1
6	455.2306	1

survival package - observed progression-free survival endpoint

Composite endpoints are often used to analyze data with multiple event types. In this dataset, we may be interested in defining a progression/death composite endpoint. To analyze this endpoint, we still need a dataset with one row per patient; all we need to do is define two new variables, which we will call opfstime and opfsevent. opfstime represents, for each patient, the earliest of either death or observed progression, and opfsevent represents an indicator variable that is equal to 1 if a patient progression or died, and equal to 0 if they were alive and progression free at the end of followup.

multievent %>%
# Make two variables. The one called "key" stores the variable names "t1","t2", up to "x8" . The one called "value" stores the corresponding values, number of days or status (1/2). 
  gather(key, value, t1:x8) %>%
# Define the variable "visit" by pulling the number from the string in key
  mutate(visit = parse_number(key),
# Define the varialbe "type" by pulling the letter from the string in key
         type = str_sub(key, 1, 1),
# Delete the key variable
         key = NULL) %>%
# Make a new column for each unique value of type, x or t, and make the value of the new variable equal to value 
  spread(type, value) %>%
# Replace NAs in x (indicating no visit) with 0
  mutate(x = ifelse(is.na(x), 0, x)) %>%
# Group by ID and status at visit
  group_by(ID, x) %>%
# Make a counter variable which counts up from 1 within each ID/status group
  mutate(count = 1:n()) %>%
# Only keep the first row within each ID/status group
  filter(count == 1) %>%
# Redefine grouping to be by ID only
  group_by(ID) %>%
# Keep only rows for each person's max status. If they ever had 2, we will have that row. Otherwise, we will have have a single row for each person with status 0 or 1. 
  filter(x == max(x)) %>%
# Take out the x and count variables. 
  select(ID, dtime, dstatus, t) %>%
# Define the opfstime and opfsevent variables.   
  mutate(opfstime = ifelse(is.na(t), dtime, t), 
         opfsevent = ifelse(!is.na(t), 1, dstatus)) -> composite

ID	opfstime	opfsevent
1	160.4364	1
2	1095.9936	1
3	202.4294	1
4	191.7655	1
5	349.1972	1
6	176.6494	1

Preparing multiple event data for analysis

Anne Eaton

2019-09-13

Raw data

Tidying the data

survival package - overall survival endpoint

survival package - observed progression-free survival endpoint

Contents