multievent.Rmd
There are many approaches for analyzing data that tracks multiple events ocurring over time, either multiple types of events or repeated ocurrences of the same type of event. The great news is that most of these approaches are implemented in R packages. The not-as-great news is that different packages expect your data to be input in different formats. The aim of this vignette is to show you how to use dplyr
to get your data into the format needed for common approaches and packages for multievent data.
We will start from “wide” data. In this data, each row is a patient. ID
represents a unique patient ID. Each patient had up to 8 clinic visits, and t1
through t8
represent the time from enrollment until each clinic visit (in days). x1
through x8
represent the patient’s status at time of the correpsonding visit, with 1
indicating that the patient was progression free and 2
representing the patient had experienced progression. (For patients that had fewer that 8 visits, the t.
and x.
varaibles are filled in with NA
.) dtime
represents the time (in days) from enrollment until the patient died or was lost to followup, and dstatus
is an indicator for death. Patients with dstatus
=1 diead at their dtime
time. Patients with dstatus
=0 were followed until their dtime
time and were still alive at that time, but we don’t know what happened to them after that.
datfile <- system.file("messydata", "multievent.csv", package="untidydata2")
multievent<-as.tibble(read.csv(datfile))
#> Warning: `as.tibble()` is deprecated, use `as_tibble()` (but mind the new semantics).
#> This warning is displayed once per session.
ID | t1 | t2 | t3 | t4 | t5 | t6 | t7 | t8 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | dtime | dstatus |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 160.4364 | 1 |
2 | 191.0156 | 367.3284 | 547.7683 | 727.9825 | 928.0502 | 1095.994 | 1268.574 | NA | 1 | 1 | 1 | 1 | 1 | 2 | 2 | NA | 1465.1233 | 0 |
3 | 202.4294 | 355.7819 | NA | NA | NA | NA | NA | NA | 2 | 2 | NA | NA | NA | NA | NA | NA | 401.7873 | 1 |
4 | 191.7655 | 373.0439 | 535.9324 | 717.8133 | NA | NA | NA | NA | 1 | 1 | 1 | 1 | NA | NA | NA | NA | 764.9412 | 1 |
5 | 166.2427 | 349.1972 | NA | NA | NA | NA | NA | NA | 1 | 2 | NA | NA | NA | NA | NA | NA | 375.4677 | 1 |
6 | 176.6494 | 370.1799 | NA | NA | NA | NA | NA | NA | 1 | 1 | NA | NA | NA | NA | NA | NA | 455.2306 | 1 |
The survival
package is most often used to make Kaplan-Meier curves and fit Cox models for a single event per person. The input dataset should have one row per person, which is the format our data is currently in. To analyze time to death, ignoring progression, we already have the variables we need, dtime
and dstatus
. These will play the roles of time
and event
in the creation of a survival object via the function Surv
.
ID | dtime | dstatus |
---|---|---|
1 | 160.4364 | 1 |
2 | 1465.1233 | 0 |
3 | 401.7873 | 1 |
4 | 764.9412 | 1 |
5 | 375.4677 | 1 |
6 | 455.2306 | 1 |
Composite endpoints are often used to analyze data with multiple event types. In this dataset, we may be interested in defining a progression/death composite endpoint. To analyze this endpoint, we still need a dataset with one row per patient; all we need to do is define two new variables, which we will call opfstime
and opfsevent
. opfstime
represents, for each patient, the earliest of either death or observed progression, and opfsevent
represents an indicator variable that is equal to 1 if a patient progression or died, and equal to 0 if they were alive and progression free at the end of followup.
multievent %>%
# Make two variables. The one called "key" stores the variable names "t1","t2", up to "x8" . The one called "value" stores the corresponding values, number of days or status (1/2).
gather(key, value, t1:x8) %>%
# Define the variable "visit" by pulling the number from the string in key
mutate(visit = parse_number(key),
# Define the varialbe "type" by pulling the letter from the string in key
type = str_sub(key, 1, 1),
# Delete the key variable
key = NULL) %>%
# Make a new column for each unique value of type, x or t, and make the value of the new variable equal to value
spread(type, value) %>%
# Replace NAs in x (indicating no visit) with 0
mutate(x = ifelse(is.na(x), 0, x)) %>%
# Group by ID and status at visit
group_by(ID, x) %>%
# Make a counter variable which counts up from 1 within each ID/status group
mutate(count = 1:n()) %>%
# Only keep the first row within each ID/status group
filter(count == 1) %>%
# Redefine grouping to be by ID only
group_by(ID) %>%
# Keep only rows for each person's max status. If they ever had 2, we will have that row. Otherwise, we will have have a single row for each person with status 0 or 1.
filter(x == max(x)) %>%
# Take out the x and count variables.
select(ID, dtime, dstatus, t) %>%
# Define the opfstime and opfsevent variables.
mutate(opfstime = ifelse(is.na(t), dtime, t),
opfsevent = ifelse(!is.na(t), 1, dstatus)) -> composite
ID | opfstime | opfsevent |
---|---|---|
1 | 160.4364 | 1 |
2 | 1095.9936 | 1 |
3 | 202.4294 | 1 |
4 | 191.7655 | 1 |
5 | 349.1972 | 1 |
6 | 176.6494 | 1 |