There are many approaches for analyzing data that track multiple events occurring over time, whether multiple types of events or repeated occurrences of the same type of event. The great news is that most of these approaches are implemented in R packages. The not-as-great news is that different packages expect your data in different formats. The aim of this vignette is to show you how to use dplyr to get your data into the format needed by common approaches and packages for multievent data.

Raw data

We will start from “wide” data. In this data, each row is a patient. ID represents a unique patient ID. Each patient had up to 8 clinic visits, and t1 through t8 represent the time from enrollment until each clinic visit (in days). x1 through x8 represent the patient’s status at the time of the corresponding visit, with 1 indicating that the patient was progression free and 2 indicating that the patient had experienced progression. (For patients who had fewer than 8 visits, the remaining t and x variables are filled in with NA.) dtime represents the time (in days) from enrollment until the patient died or was lost to follow-up, and dstatus is an indicator for death. Patients with dstatus=1 died at their dtime. Patients with dstatus=0 were followed until their dtime and were still alive at that time, but we don’t know what happened to them after that.

ID t1 t2 t3 t4 t5 t6 t7 t8 x1 x2 x3 x4 x5 x6 x7 x8 dtime dstatus
1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 160.4364 1
2 191.0156 367.3284 547.7683 727.9825 928.0502 1095.994 1268.574 NA 1 1 1 1 1 2 2 NA 1465.1233 0
3 202.4294 355.7819 NA NA NA NA NA NA 2 2 NA NA NA NA NA NA 401.7873 1
4 191.7655 373.0439 535.9324 717.8133 NA NA NA NA 1 1 1 1 NA NA NA NA 764.9412 1
5 166.2427 349.1972 NA NA NA NA NA NA 1 2 NA NA NA NA NA NA 375.4677 1
6 176.6494 370.1799 NA NA NA NA NA NA 1 1 NA NA NA NA NA NA 455.2306 1

Tidying the data

survival package - overall survival endpoint

The survival package is most often used to make Kaplan-Meier curves and fit Cox models for a single event per person. The input dataset should have one row per person, which is the format our data is currently in. To analyze time to death, ignoring progression, we already have the variables we need, dtime and dstatus. These will play the roles of time and event in the creation of a survival object via the function Surv.

ID dtime dstatus
1 160.4364 1
2 1465.1233 0
3 401.7873 1
4 764.9412 1
5 375.4677 1
6 455.2306 1
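As a rough sketch of this step, the overall survival dataset and survival object might be built as follows. This assumes the wide data live in a data frame called `wide`, which is a hypothetical name; only a few of its columns are recreated here, with values copied from the table above. The names `os`, `os_surv`, and `km` are likewise illustrative.

```r
library(dplyr)
library(survival)

# Stand-in for the wide data: only the columns needed here plus the first
# visit, with values copied from the raw data table.
wide <- data.frame(
  ID      = 1:6,
  t1      = c(NA, 191.0156, 202.4294, 191.7655, 166.2427, 176.6494),
  x1      = c(NA, 1, 2, 1, 1, 1),
  dtime   = c(160.4364, 1465.1233, 401.7873, 764.9412, 375.4677, 455.2306),
  dstatus = c(1, 0, 1, 1, 1, 1)
)

# The data already have one row per patient, so we only keep the
# columns that describe time to death.
os <- wide %>% select(ID, dtime, dstatus)

# dtime plays the role of time and dstatus the role of event.
os_surv <- with(os, Surv(time = dtime, event = dstatus))

# Kaplan-Meier estimate of overall survival.
km <- survfit(Surv(dtime, dstatus) ~ 1, data = os)
```

Printing `os_surv` marks censored times with a trailing `+`, which is a quick way to check that the event indicator was coded as intended.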

survival package - observed progression-free survival endpoint

Composite endpoints are often used to analyze data with multiple event types. In this dataset, we may be interested in defining a progression/death composite endpoint. To analyze this endpoint, we still need a dataset with one row per patient; all we need to do is define two new variables, which we will call opfstime and opfsevent. opfstime represents, for each patient, the time of the earliest of death or observed progression, or the end of follow-up if neither occurred. opfsevent is an indicator variable equal to 1 if the patient progressed or died, and equal to 0 if they were alive and progression free at the end of follow-up.

ID opfstime opfsevent
1 160.4364 1
2 1095.9936 1
3 202.4294 1
4 764.9412 1
5 349.1972 1
6 455.2306 1
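One way to derive these two variables with dplyr is sketched below. It assumes that progression is first observed at the earliest visit where the status equals 2, and that a patient with no observed progression contributes their dtime, so a patient who dies without observed progression gets opfstime equal to dtime. The names `wide`, `first_progression`, `progtime`, and `opfs` are hypothetical and not part of any package.

```r
library(dplyr)

# Recreate the six example patients (values copied from the raw data table).
wide <- read.table(header = TRUE, text = "
ID t1 t2 t3 t4 t5 t6 t7 t8 x1 x2 x3 x4 x5 x6 x7 x8 dtime dstatus
1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 160.4364 1
2 191.0156 367.3284 547.7683 727.9825 928.0502 1095.994 1268.574 NA 1 1 1 1 1 2 2 NA 1465.1233 0
3 202.4294 355.7819 NA NA NA NA NA NA 2 2 NA NA NA NA NA NA 401.7873 1
4 191.7655 373.0439 535.9324 717.8133 NA NA NA NA 1 1 1 1 NA NA NA NA 764.9412 1
5 166.2427 349.1972 NA NA NA NA NA NA 1 2 NA NA NA NA NA NA 375.4677 1
6 176.6494 370.1799 NA NA NA NA NA NA 1 1 NA NA NA NA NA NA 455.2306 1
")

t_cols <- paste0("t", 1:8)
x_cols <- paste0("x", 1:8)

# Hypothetical helper: time of the first visit with status 2 for each row,
# or NA if progression was never observed.
first_progression <- function(times, status) {
  t_mat <- as.matrix(times)
  x_mat <- as.matrix(status)
  # Blank out visit times at which progression was not observed.
  t_mat[is.na(x_mat) | x_mat != 2] <- NA
  unname(apply(t_mat, 1, function(v) {
    if (all(is.na(v))) NA_real_ else min(v, na.rm = TRUE)
  }))
}

opfs <- wide %>%
  mutate(
    progtime  = first_progression(wide[t_cols], wide[x_cols]),
    # Earliest of observed progression and dtime (death or end of follow-up).
    opfstime  = pmin(progtime, dtime, na.rm = TRUE),
    # Event if progression was observed or the patient died.
    opfsevent = as.integer(!is.na(progtime) | dstatus == 1)
  ) %>%
  select(ID, opfstime, opfsevent)
```

The resulting `opfs` data frame has one row per patient and can be passed to Surv in the same way as dtime and dstatus, with opfstime as the time and opfsevent as the event.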