In this vignette we demonstrate how to use rprev to generate predictions of disease prevalence from a registry data set. Prevalence is defined as the number of people affected with a disease at a specified index date, while this package is concerned with n-year prevalence: the number of affected individuals who were diagnosed in the preceding n years. The package is designed to work with disease registry data sets containing individual level information, rather than averaged population tables. If n is less than or equal to the number of years of registry data, then the prevalence estimate is made by simply counting those remaining in the prevalent pool at the index date. Frequently, disease registries do not contain sufficient longitudinal information to accurately measure disease prevalence, and the desired n is greater than the number of registry years for which we have real patient data.
Following the methodology of Crouch et al. (2014), which has been employed in published work Smith et al. (2015), prevalence contributions from patients incident prior to the registry beginning are estimated using Monte Carlo Simulation. This is illustrated in the diagram below:
Modelling prevalence therefore involves two stochastic processes:
incidence, and survival. rprev
provides an object-oriented
way of specifying each of these two processes, along with appropriate
user-friendly defaults that work well in general situations. In the
following sections we aim to provide a reference manual for using
rprev
to generate accurate estimates of disease prevalence.
Using the default models is covered, followed by a guide on providing
custom incidence and survival objects for fine-grained control. It is
important that both of these processes are accurately modelled to
generate reliable prevalence estimates; the Diagnostics vignette goes into depth on
evaluating the assumptions behind the default models.
rprev
provides a simulated data set for testing
purposes, called prevsim. It has been synthesized to resemble
disease registry data. Incident cases are recorded from 2003-01-01 to
2013-01-30, and events occur between 2003-01-09 and 2015-03-17. It has 6
columns and is organised in a fashion typical to that found in real
registry data sets. Patient data includes the date of both entry into
the registry and last follow-up, survival time (time) and a
death indicator (status) along with both age and sex.
## time status age sex entrydate
## Min. : 1.0 Min. :0.000 Min. :36.43 M:508 Min. :2003-01-07
## 1st Qu.: 235.0 1st Qu.:0.000 1st Qu.:58.16 F:492 1st Qu.:2005-07-22
## Median : 893.5 Median :1.000 Median :64.56 Median :2008-04-08
## Mean :1236.0 Mean :0.558 Mean :64.95 Mean :2008-02-26
## 3rd Qu.:1868.2 3rd Qu.:1.000 3rd Qu.:71.88 3rd Qu.:2010-10-11
## Max. :4444.0 Max. :1.000 Max. :95.32 Max. :2013-01-30
## eventdate
## Min. :2003-01-09
## 1st Qu.:2008-09-20
## Median :2012-03-03
## Mean :2011-07-16
## 3rd Qu.:2015-03-17
## Max. :2015-03-17
The following Kaplan-Meier plot shows that survival in prevsim is typical of many diseases, whereby males have poorer survival outcomes than females. It also highlights that survival starts to level off after 2000 days.
The primary function in rprev
is
prevalence
, which performs all the data pre-processing and
simulation required for estimating prevalence at an index date, given
registry data and the specification of the incidence and survival
processes. The function is designed to be flexible and modular, it does
not make any assumptions on the nature of the two processes but only
requires that they have specified behaviours (described later). We have
provided default incidence and survival models with the package that are
flexible enough to cover the majority of data sets. This section details
how to get up-and-running using these default models to obtain
prevalence estimations.
The default incidence model assumes a Poisson homogeneous process,
i.e. that the incidence rate is constant. This may be a reasonable
assumption for diseases that don’t have a seasonal component in a
population of stable size Of course, it is important to check whether
your data meets this assumption; diagnostics are covered in a separate vignette. A Poisson homogeneous
process relies on a single parameter, the incidence rate. In
rprev
this is calculated within the prevalence
function from incidence dates into the registry. An additional
functionality that rprev
provides is allowing for
stratification of incidence by a categorical variable, for example,
sex.
The homogeneous Poisson process model is specified by an argument to
prevalence
called inc_formula
, which accepts a
formula with the LHS as name of the column that holds the incident
dates, and the RHS naming the variables to stratify by (or 1 if none).
For example, in the prevsim
data set, the
entrydate column describes the date the patient was entered
into the registry, and so the formula for a non-stratified incidence
model is inc_formula = entrydate ~ 1
. For example, if we
have reason to believe that males and females have significantly
different incidence rates then we can stratify by sex:
inc_formula = entrydate ~ sex
.
The default survival model assumes that event times follow a standard
parametric distribution. The default implementation in
rprev
is an optimized interface to the well-known
survival::survreg
function. There are two arguments to
prevalence
that control the default survival model,
surv_formula
and dist
.
surv_formula
is a formula formatted in the same way as the
argument to survival::survreg
, i.e. where the LHS is a
Surv
object specifying survival time and event indicators,
and the RHS details any covariates to include. The dist
argument accepts a string specifying the distribution to use. Currently,
it accepts the following values: weibull, lognormal,
and exponential for the optimized implementation. If other
distributions are required then a flexsurv
object can be
used, see below for details.
The function call for estimating prevalence in the
prevsim
data set using the default incidence and survival
models is shown below. Aside from the arguments specifying these two
processes, there are a number of prevalence-specific parameters.
index_date
specifies the date at which to estimate point
prevalence with num_years_to_estimate
detailing the
required number of years preceding the index date that contribute
incident cases. If any values are larger than the number of available
complete years of registry data then incident cases over the remaining
time are simulated. By passing a vector to
num_years_to_estimate
, multiple estimates of prevalence at
the index date can be calculated with their own confidence intervals.
The death_column
parameter accepts the name of the column
in the registry data set that holds date of death information. Its
presence is required to count prevalence over the registry duration, if
it isn’t provided then the entire prevalence estimate is calculated by
simulation. The optional population_size
argument is used
to provide relative rates estimates.
prevalence_total <- prevalence(index='2013-01-30',
num_years_to_estimate=c(3, 5, 10, 20),
data=prevsim,
inc_formula = entrydate ~ sex,
surv_formula = Surv(time, status) ~ age + sex,
dist='weibull',
population_size = 1e6,
death_column = 'eventdate')
Printing the returned prevalence
object displays the
point estimates of prevalence at the index date using the specified
years of data, increasing with n:
## Estimated prevalence at 2013-01-30:
## 3 years: 208 (20.8 per 1e+05)
## 5 years: 304 (30.4 per 1e+05)
## 10 years: 514 (51.4 per 1e+05)
## 20 years: 767.65 (76.76 per 1e+05)
More detail from the prevalence
object can be extracted
using summary
, including the p-value from a hypothesis test
(Poisson) of the difference between the predicted and counted prevalence
for the available years of registry data.
## Prevalence
## ~~~~~~~~~~
## Estimated prevalence at 2013-01-30:
## 3 years: 208 (20.8 per 1e+05)
## 5 years: 304 (30.4 per 1e+05)
## 10 years: 514 (51.4 per 1e+05)
## 20 years: 767.65 (76.76 per 1e+05)
##
## Registry Data
## ~~~~~~~~~~~~~
## Index date: 2013-01-30
## Start date: 2003-01-07
## Overall incidence rate: 0.272
## Counted prevalent cases: 516
##
## Simulation
## ~~~~~~~~~~
## Iterations: 1000
## Average incidence rate: 0.273
## P-value: 0.4885847
The prevalence object’s estimates
attribute holds the
point prevalence estimate along with relative rates and confidence
intervals.
## $y3
## $y3$absolute.prevalence
## [1] 208
##
## $y3$per100K
## [1] 20.8
##
## $y3$per100K.upper
## [1] 23.63
##
## $y3$per100K.lower
## [1] 17.97
##
##
## $y5
## $y5$absolute.prevalence
## [1] 304
##
## $y5$per100K
## [1] 30.4
##
## $y5$per100K.upper
## [1] 33.82
##
## $y5$per100K.lower
## [1] 26.98
##
##
## $y10
## $y10$absolute.prevalence
## [1] 514
##
## $y10$per100K
## [1] 51.4
##
## $y10$per100K.upper
## [1] 55.84
##
## $y10$per100K.lower
## [1] 46.96
##
##
## $y20
## $y20$absolute.prevalence
## [1] 767.65
##
## $y20$per100K
## [1] 76.76
##
## $y20$per100K.upper
## [1] 85.74
##
## $y20$per100K.lower
## [1] 67.79
Additional information about the simulation can be found in the
simulated
object, which contains a data.table
containing information about the simulated incident population. Each row
corresponds to a simulated incident individual, with the sim
column specifying the simulation number. Simulated covariate values are
also included, which for this example is just sex and
age. alive_at_index is a binary value of whether this
individual was still alive at the index date, with the subsequent
columns indicating if they contributed to any n-year prevalence.
prev_registry measures whether the person was contributing to
prevalence after being incident at the same time the registry was
collecting data, allowing for a direct comparison between the known
prevalence for that time-frame and the simulated prevalence.
## sim sex age incident_date alive_at_index prev_3yr prev_5yr
## <int> <char> <num> <Date> <int> <num> <num>
## 1: 1 M 73.72316 1993-02-01 0 0 0
## 2: 1 M 58.52194 1993-02-10 1 0 0
## 3: 1 M 62.35457 1993-02-12 0 0 0
## 4: 1 M 75.18218 1993-02-17 0 0 0
## 5: 1 M 80.15367 1993-02-25 0 0 0
## 6: 1 M 74.06112 1993-03-05 0 0 0
## prev_10yr prev_20yr prev_registry
## <num> <num> <num>
## 1: 0 0 0
## 2: 0 1 0
## 3: 0 0 0
## 4: 0 0 0
## 5: 0 0 0
## 6: 0 0 0
flexsurv
objectsThe default survival models are based on the
survival::survreg
function and are optimized to improve
runtime. A more flexible alternative is to provide
flexsurvreg
objects from the flexsurv
package.
This is an easily extensible framework that comes with implementations
of a large number of standard parametric families in addition to other
models such as Royston and Parmar’s Flexible Parametric Models.
To use a flexsurvreg
object with
prevalence
, simply fit a model to the registry data and
then pass it in through the surv_model
argument. For
example, the log-logistic distribution isn’t currently supported by the
default survival model in rprev
, but it can be used in the
flexsurv
implementation. Firstly the survival model is
fitted, allowing for appropriate diagnostics to be performed first.
## Call:
## flexsurv::flexsurvreg(formula = Surv(time, status) ~ age + sex,
## data = prevsim, dist = "llogis")
##
## Estimates:
## data mean est L95% U95% se exp(est)
## shape NA 6.36e-01 5.92e-01 6.83e-01 2.32e-02 NA
## scale NA 2.49e+04 7.34e+03 8.44e+04 1.55e+04 NA
## age 6.49e+01 -4.82e-02 -6.64e-02 -3.00e-02 9.29e-03 9.53e-01
## sexF 4.92e-01 4.49e-01 9.01e-02 8.08e-01 1.83e-01 1.57e+00
## L95% U95%
## shape NA NA
## scale NA NA
## age 9.36e-01 9.70e-01
## sexF 1.09e+00 2.24e+00
##
## N = 1000, Events: 558, Censored: 442
## Total time at risk: 1235976
## Log-likelihood = -4633.701, df = 4
## AIC = 9275.403
Now, the surv_model
argument is used to pass in the
survival model directly, rather than specifying
surv_formula
and dist
as before. It must be
emphasized that the runtime significantly increases when using a
flexsurv
object as they have not been optimized for use in
rprev
, however, they provide greater flexibility in the
survival modelling. For example, the user can compare different survival
models in the familiar flexsurv
framework before using the
final object in estimating prevalence.
prev_llog <- prevalence(index='2013-01-30',
num_years_to_estimate=c(3, 5, 10, 20),
data=prevsim,
inc_formula = entrydate ~ sex,
surv_model=llog,
population_size = 1e6,
death_column = 'eventdate',
N_boot = 100)
As can be seen, the prevalence estimates from different survival distributions can vary largely so it is important to use as accurate a model as possible. The diagnostics vignette discusses strategies on how to identify well-fitting models.
## Estimated prevalence at 2013-01-30:
## 3 years: 208 (20.8 per 1e+05)
## 5 years: 304 (30.4 per 1e+05)
## 10 years: 514 (51.4 per 1e+05)
## 20 years: 808.58 (80.86 per 1e+05)
An additional consideration when predicting long-term disease survivability is the issue of whether a patient has been cured. For some diseases, a subset of patients can be identified who have been cured of the disease, and thereby have different survival characteristics compared to the non-cured subset. Cure models are extensions of standard survival models to model this behaviour. We refer to Lambert et al. (2006) for an overview of cure models in the literature.
We have included a cure model implementation with rprev
in the function fixed_cure
, which specifies a cure
time for the disease, beyond which a patient’s survival is
determined to have returned to population survival rates. In the example
below, we are imagining that we have reason to believe that after 5
years with the simulated disease, a patient is cured. Note that the
cure_time needs to be in the same time-scale as that used in the
Surv
object, so we convert 5 years into days.
By default population survival estimates is drawn from the UK
lifetable that is provided with the package in UKmortality
.
Please refer to the documentation for fixed_cure
if you
wish to use alternative population survival estimates.
fix_cure_mod <- fixed_cure(Surv(time, status) ~ age + sex, data=prevsim, cure_time=5*365.25,
dist='weibull')
Use of a fixed cure model here has increased the prevalence as patients are reverting back to population levels of survival. However, note that incorporating the population survival rates adds considerable computational expense, in the example below only 30 simulations are being run.
prevalence(index='2013-01-30',
num_years_to_estimate=20,
data=prevsim,
inc_formula = entrydate ~ sex,
surv_model=fix_cure_mod, # Pass in the cure model that was built above
population_size = 1e6,
death_column = 'eventdate',
N_boot = 30)
## Estimated prevalence at 2013-01-30:
## 20 years: 806.77 (80.68 per 1e+05)
The estimates from the full model are displayed below for comparison.
## Estimated prevalence at 2013-01-30:
## 3 years: 208 (20.8 per 1e+05)
## 5 years: 304 (30.4 per 1e+05)
## 10 years: 514 (51.4 per 1e+05)
## 20 years: 767.65 (76.76 per 1e+05)
An additional cure model implementation that can also be used are the
mixture and non-mixture cure models as implemented by
flexsurvcure
. These are fitted in the same way as standard
flexsurv
object as described above. These are also
considerably slower to run than non-cure survival models.
mixture_cure <- flexsurvcure::flexsurvcure(Surv(time, status) ~ age + sex, data=prevsim, dist='weibull', link='logistic', mixture=TRUE)
prevalence(index='2013-01-30',
num_years_to_estimate=20,
data=prevsim,
inc_formula = entrydate ~ sex,
surv_model=mixture_cure,
population_size = 1e6,
death_column = 'eventdate',
N_boot = 30)
## Estimated prevalence at 2013-01-30:
## 20 years: 807.37 (80.74 per 1e+05)
The object-oriented manner in which the prevalence
function is designed allows for custom survival and incidence objects to
be used rather than relying on the default implementations. The previous
section described both how to use the default models and also how to
provide flexsurv
survival objects. The latter works because
the appropriate interface for flexsurv
has been supplied
with rprev
, but the same mechanism can be used to provide
custom objects for both the incidence and survival processes.
During each bootstrap iteration, new incidence and survival models are fitted to the bootstrapped registry data, generating new parameters. Based on these parameters, each model is then used for prediction, either predicting an incident population or predicting the survival of this population.
This section provides details on how to provide custom models for both of these 2 roles.
Both incidence and survival objects must contain a call
object that holds the initial function call used to build the model;
this is obtained through match.call()
. This call must
contain an argument (name not important) which is passed the value
data
, as it is this argument which is changed to provide
the bootstrapped data during simulation.
For example:
build_my_survival_object <- function(formula, input_data) {
# Build a survival model using the specified formula and input data
model <- ...
object <- list(model=model,
call=match.call()) # the function call must be included as an item 'call'
class(object) <- "myobj"
object
}
It is crucial that the parameter passing in the data to fit the model to is labelled data, as below.
data <- data.frame(...)
myobj <- build_my_survival_object(Surv(time, status) ~ sex, data)
prevalence(...
surv_model=myobj, # This will work
...)
The example below will not work.
some_data <- data.frame(...)
myobj <- build_my_survival_object(Surv(time, status) ~ sex, some_data)
prevalence(...
surv_model=myobj, # This WON'T work, since the data parameter was called 'some_data' instead
...)
Once the models have been built, generic methods are used to generate the estimated incident cases and survival probabilities, which is achieved by specifying an appropriate S3 class method. The following sections describes these methods and their parameterisation. See Hadley Wickham’s guide to S3 objects for further support on object-oriented programming in R.
An additional source of support is the source code for the existing
objects that have been provided with the package which is freely
available on CRAN and the development code is hosted on GitHub. For
example, homogeneous_poisson.R contains the necessary methods
for the default incidence model, and survregmin.R and
flexsurv.R provide survival objects for the default survival
implementation and flexsurv
objects respectively.
In this example a homogeneous Poisson process will still be used to demonstrate how to provide custom incidence objects. This process is parameterised by a single value: the rate λ, which will need to be saved in the model object along with the function call.
The function below builds an object that contains both the process parameter (λ) and the function call.
build_poisson <- function(input_data) {
rate <- nrow(input_data) / as.numeric(difftime(max(input_data$entrydate), min(input_data$eventdate)))
# Build a survival model using the specified formula and input data
object <- list(rate=rate,
call=match.call()) # the function call must be included as an item 'call'
class(object) <- "pois"
object
}
When building the object, remember that the input data frame
needs to be called data
.
Printing the object shows that the requirements are met:
data
## $rate
## [1] 0.2721829
##
## $call
## build_poisson(input_data = data)
##
## attr(,"class")
## [1] "pois"
Incidence objects need to implement a method called
draw_incident_population
that will generate the incident
population specified by their incident times and any covariates that are
required for the survival modelling. The incident times are relative to
the origin, which in the simulation is the index minus N years, where N
is max(num_years_to_estimate)
. The required
parameterisation of draw_incident_population
is shown
below.
# object: The incidence model that will have been created on the bootstrapped data
# data: The data available to fit the model on, will be the registry data set provided to prevalence as this acts as the attribute prior distribution.
# timeframe: A single number specifying how long to generate incident cases for.
# covars: A character vector specifying the names of individual covariates that must be included in the returned data.table (or data frame)
# Returns a data.table (or data frame but data.table is preferred) where each row represents an incident case with:
# - The first column being the time since the origin, i.e. index date - N year prevalence
# - Any subsequent columns holding covariates that must be provided as specified in the 'covars' argument
draw_incident_population <- function(object, data, timeframe, covars, ...)
For this example using a homogeneous Poisson process, inter-arrival times are exponentially distributed, so simulating an incident population simply requires sampling inter-arrival times, turning these into arrival times, and then sampling individual attributes from the prior information (the registry data that is passed into the function).
draw_incident_population.pois <- function(object, data, timeframe, covars, ...) {
# Firstly draw inter-arrival times in the period [0, timeframe].
# The expected number is simply timeframe * rate so we'll take this amount + a margin for error.
expected <- 1.5 * (timeframe * object$rate)
# Now draw inter-arrival times
inter_arrival <- rexp(expected, object$rate)
# Determine absolute incident times
incident_times <- cumsum(inter_arrival)
# Truncate to those within the timeframe
incident_times <- incident_times[incident_times < timeframe]
num_incident <- length(incident_times)
# Sample individual attributes into a matrix. The required attributes are given by 'covars' argument
attrs <- do.call('cbind', lapply(covars, function(x) sample(data[[x]], num_incident, replace=T)))
# Now add the incident times as the first column
attrs <- cbind(incident_times, attrs)
# Convert to data frame and add column names
attrs <- data.frame(attrs)
colnames(attrs) <- c('incident_time', covars)
# Return this data frame
attrs
}
To validate that an incidence model has been correctly specified, the
validate_incidence_model
function has been provided. It
accepts the incidence model itself and the registry data that it has
been designed to work with. It will verify that the required attributes
and methods are available, and that
draw_incident_population
successfully simulates
individuals.
If any issues are found then the function stops execution and displays an error message detailing the fault, otherwise it returns a list of simulated arrival times, allowing for further diagnostics to be performed.
## Incidence model passed all tests.
## incident_time
## 1 9.077914
## 2 10.178961
## 3 11.192854
## 4 13.678644
## 5 14.097884
## 6 17.656732
Once the object has been validated, it can be used in
prevalence
through the inc_model
argument.
Note that an additional argument is required to provide the name of the
column in the data set that provides the incident dates, since this is
no longer provided by the unused inc_formula
option.
prevalence(index='2013-01-30',
num_years_to_estimate=c(3, 5, 10, 20),
data=prevsim,
inc_model = pois_mod,
surv_formula = Surv(time, status) ~ age + sex,
dist='weibull',
population_size = 1e6,
incident_column = 'entrydate',
death_column = 'eventdate')
## Warning in `[.data.table`(results, , `:=`(c("time_to_index", "time_to_entry"),
## : Tried to assign NULL to column 'time_to_entry', but this column does not
## exist to remove
## Estimated prevalence at 2013-01-30:
## 3 years: 208 (20.8 per 1e+05)
## 5 years: 304 (30.4 per 1e+05)
## 10 years: 514 (51.4 per 1e+05)
## 20 years: 837.57 (83.76 per 1e+05)
For this example a Weibull model with age
as the sole
covariate will be used. The survival model will be fitted using the
flexsurv
package, this functionality is already implemented
as discussed earlier, but it is an appropriate demonstration.
A Weibull survival model is parameterised by its coefficients for
each covariate and the distribution specific parameters. The function
below builds an object of class mysurv
that contains these
coefficients, as well as the function call (which is also saved by the
flexsurv
object).
library(flexsurv)
build_wei <- function(data) {
mod <- flexsurvreg(Surv(time, status) ~ age, data=data, dist='weibull')
obj <- list(coefs=coef(mod),
call=match.call())
class(obj) <- 'mysurv'
obj
}
With just these two attributes, a fully specified survival model has been generated. It has the required saved information:
data
## $coefs
## shape scale age
## -0.64951710 10.86692915 -0.04472098
##
## $call
## build_wei(data = data)
##
## attr(,"class")
## [1] "mysurv"
Survival objects have two methods that need to be implemented:
extract_covars
predict_survival_probability
extract_covars
simply returns a character string
detailing which of the covariates passed into prevalence
through the data
argument are used in the survival model.
This allows the simulation to know how to generate the incident
population as described above in draw_incident_population
.
In fact, the output of extract_covars
is passed directly
into draw_incident_population
through the
covars
parameter.
# object: The survival model
# Returns a character vector detailing the covariates required to fit this model. All of
# these values should be columns in the data that is passed in the main 'prevalence' function.
extract_covars <- function(object)
For this survival model age is the only patient-level covariate being used.
predict_survival_probability
estimates survival
probability at the index date. It is specified as follows:
# object: The survival object
# newdata: A data frame (or data.table) with the incident population stored with their
# required covariates for the model.
# times: A vector of times to estimate survival probability at for individuals in
# corresponding rows of 'newdata'. This should be the same length as there are
# rows in 'newdata' since each individual has their survival probability estimated once.
# Returns:
# A vector of survival probabilities of length equal to the length of 'times' and the
# number of rows in 'newdata'.
predict_survival_probability <- function(object, newdata, times)
For this simple Weibull model these estimates are simply provided by
1 − F(x), with the
CDF already implemented in base R as pweibull
.
predict_survival_probability.mysurv <- function(object, newdata, times) {
# Calculate linear predictor, this will form the shape parameter
shape <- exp(object$coefs[1] + newdata$age*object$coefs[3])
scale <- exp(object$coefs[2])
1 - pweibull(times, shape, scale)
}
While more in-depth testing would be required to validate the
predictions output by predict_survival_probability
, from a
programming perspective at least it is outputting numbers.
## [1] 0.4941064 0.4202817
There is also a function validate_survival_model
that
checks the survival model contains the required attributes and methods,
and that predict_survival_probability
outputs sensible
probabilities that are monotonically decreasing with time. If the test
passes, it returns survival probabilities taken from random individuals
in the supplied registry data at random time-points.
## Survival model passed all tests.
## [1] 0.3959881 0.3896183 0.3863363 0.3964004 0.4278924 0.4107967
Plugging this model into the prevalence
function now
works.
prevalence(index='2013-01-30',
num_years_to_estimate=c(3, 5, 10, 20),
data=prevsim,
inc_formula = entrydate ~ 1,
surv_model = survobj,
population_size = 1e6,
death_column = 'eventdate',
N_boot = 100)
## Estimated prevalence at 2013-01-30:
## 3 years: 208 (20.8 per 1e+05)
## 5 years: 304 (30.4 per 1e+05)
## 10 years: 514 (51.4 per 1e+05)
## 20 years: 898.91 (89.89 per 1e+05)