The user guide highlighted the basic
use of the prevalence simulation methodology, demonstrating that the
prevalence process is a result of two distinct processes: incidence and
survival. rprev
allows for custom specification of both of
these systems, in addition to providing default implementations that
work well in many situations. This vignette acts as a guide to assessing
the modelling assumptions made when using these default models,
including the use of diagnostic functions provided with
rprev
.
First, we will inspect the consistency of incidence data across the range of the registry and determine if it can be appropriately modelled as an homogeneous Poisson process. Second, we will look at verifying assumptions made when using a parametric model of survival. We emphasize that the user should check that their registry data does meet the required assumptions and if not, should understand that estimates of prevalence made using simulation based on that data may not be correct.
The simulation of incident cases in the years for which registry data
is not available currently assumes that the incidence process is
homogeneous Poisson. test_homogeneity
calculates several
summary statistics of the disease incidence to identify the suitability
of this model, including fitting a smoothed function to the annual
incidence data.
It requires a vector of entry dates into the registry, with other
information about the registry being passed in to optional arguments
(see ?test_homogeneity
for details). For example, the
population_size
parameter below allows for relative
incidence rates to be calculated, while truncate_end
removes diagnoses from the last calendar year of the registry (2013) as
it only has 1 month of incidence data.
The summary
method displays the most useful information
about the incidence process of the disease of interest.
## Number of years of registry data: 10
##
## Incidence
## ~~~~~~~~~
## Known incidence by year: 91 107 96 98 88 88 100 107 108 111
## Diagnoses (time since registry began):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 921.5 1894.0 1866.1 2824.8 3640.0
## p-values for over/under dispersion: 0.66509 0.33491
##
## Fitted smooth:
## ~~~~~~~~~~~~~
## Call:
## smooth.spline(x = diags, y = seq_along(diags), df = df)
##
## Smoothing Parameter spar= 1.127494 lambda= 0.1929953 (13 iterations)
## Equivalent Degrees of Freedom (Df): 4.00047
## Penalized Criterion (RSS): 24898.51
## GCV: 25.33735
The known yearly incidence rates are displayed starting at the the
day and month specified in year_start
, which defaults to
“01-01”, in the first registry year. This is 2013-01-01 for
prevsim
, so there were 91 incident cases between 2003-01-01
and 2004-01-01. It can be concluded that the disease has around 100 new
cases each year.
Below that is a summary of the cumulative diagnosis times (in days),
useful as a quick check of the distribution of incident cases. The
p-values of a simulation test indicating whether the yearly incidence
estimates are under or over dispersed relative to a homogeneous Poisson
process are also displayed. Inspection of the smoothed incidence
function should reveal whether the problem is one of non-homogeneity
(which may lead to inaccurate prevalence estimates) or of failure of the
Poisson assumption (which may lead to inaccurate estimates of confidence
intervals). Another useful incidence statistic is the mean annual
incidence rate per 100,000 within the study population, which is
obtained from the mean
attribute. Confidence intervals are
provided at the specified level (default is 95%):
## $absolute
## [1] 99.4
##
## $per100K
## [1] 3.11
##
## $per100K.lower
## [1] 3.05
##
## $per100K.upper
## [1] 3.17
Alongside the p-values from the over/under dispersion test,
rprev
provides visual tools to assess the consistency of
the incidence data with an homogeneous Poisson process.
The plot
method displays a plot of the actual incidence
rate as recorded in the registry, with the smooth overlaid. If incidence
is an homogeneous Poisson process, both the smooth (green) and incidence
process (red) should remain within the 95% confidence interval (dashed
blue) and be evenly distributed about the mean (blue line).
Another useful diagnostic is to look at the age distribution of the incident cases.
ggplot(prevsim, aes(age, y=..count..)) +
geom_line(stat='density') +
xlim(0, 100) +
labs(x='Age (years)', y='Number incident cases') +
theme_bw()
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
In the default implementation, survival is modelled using a
parametric fit with a specified distribution. It is
highly recommended that the user inspects this model to
assess its suitability. In this example, we’ll use the
prevsim
data set and test the assumptions of a Weibull
model acting on the sex and age covariates, since these are common
demographic factors that are likely to be used in prevalence
estimation.
First, it is always useful to plot the Kaplan-Meier estimator of the data, both as a whole and stratified by age to visually inspect for any inconsistencies:
km <- survfit(Surv(time, status) ~ 1, data=prevsim)
plot(km, lwd=2, col="blue", main="Overall Survival", xlab="Days",
ylab="Survival probability")
ages = c(55, 65, 75, 85, 100)
km2 <- survfit(Surv(time, status) ~ cut(age, breaks=ages), data=prevsim)
plot(km2, lwd=2, col=1:length(ages), main="Survival stratified by age", xlab="Days",
ylab="Survival probability")
legend("topright", legend=substring(names(km2$strata), 25, 32), lty = 1,
col=1:length(ages))
It is also a useful diagnostic aid to plot the survival curve for each year of the registry to determine whether there is any inhomogeneity:
plot(km, lwd=2, col="blue", mark.time=F, conf.int=T, xlab="Days",
ylab="Survival probability")
num_reg_years <- 9
registry_years <- sapply(0:9, function(x) as.Date(paste0(2004+x, "-01-30")))
sapply(seq(num_reg_years),
function(i) lines(survfit(Surv(time, status) ~ 1,
data=prevsim[prevsim$entrydate >=
registry_years[i] &
prevsim$entrydate <
registry_years[i + 1], ]),
mark.time = F, conf.int = F))
The effect of age on hazard can be visualized to determine if there
are any non-proportional effects, by inspecting Schoenfeld residuals
from a Cox model. This is easily done using the cox.zph
function from the survival
package.
cx <- coxph(Surv(time, status) ~ age, data=prevsim)
cxp <- survfit(cx,
newdata=data.frame(age=sapply(seq(length(ages) - 1),
function(i) mean(c(ages[i], ages[i + 1])))))
plot(cox.zph(cx))
lines(cxp, lwd=2, col=1:length(ages), lty=2, mark.time=F)
An overall test of the proportional hazards assumption may be made.
## chisq df p
## age 0.0661 1 0.8
## GLOBAL 0.0661 1 0.8
The standard standard assumption in a survival model is that
continuous variables have a linear relationship with log-hazard. When
looking at the impact of age on disease survival, we highly recommend
investigating whether a non-linear functional form provides a more
appropriate fit. In particular, the rms
package provides
the cph
function to fit a Cox model, to which restricted
cubic splines can be placed on covariates using rcs
as
demonstrated below. Plotting log hazard as a function of age provides a
visual means of assessing model fit, to be used alongside inspection of
the model coefficients.
library(rms)
library(dplyr)
mod_spline <- cph(Surv(time, status) ~ rcs(age, df=4), prevsim, x=TRUE, y=TRUE, surv=TRUE, time.inc=1)
# Calculates log hazard linear predictor at 100 linearly separated ages between the limits
# in the registr data
age_range <- seq(min(prevsim$age), max(prevsim$age), length.out=100)
preds <- predict(mod_spline, newdata=age_range, se.fit=T)
preds_df <- as.data.frame(preds) %>%
rename(lp=linear.predictors, se=se.fit) %>%
mutate(age = age_range,
upper = lp + 2 * se,
lower = lp - 2 * se)
ggplot(preds_df, aes(x=age, y=lp)) +
geom_ribbon(aes(ymin=lower, ymax=upper),
colour='#d2d2d2', alpha=0.30) +
geom_line(colour='#0080ff', size=1.2) +
theme_bw() +
labs(x='Age', y='Log relative hazard')
For example, in this situation there isn’t enough evidence to suggest a non-linear effect provides a better model fit over a linear relationship.
## Cox Proportional Hazards Model
##
## cph(formula = Surv(time, status) ~ rcs(age, df = 4), data = prevsim,
## x = TRUE, y = TRUE, surv = TRUE, time.inc = 1)
##
## Model Tests Discrimination
## Indexes
## Obs 1000 LR chi2 31.45 R2 0.031
## Events 558 d.f. 4 R2(4,1000)0.027
## Center 2.8919 Pr(> chi2) 0.0000 R2(4,558)0.048
## Score chi2 32.52 Dxy 0.145
## Pr(> chi2) 0.0000
##
## Coef S.E. Wald Z Pr(>|Z|)
## age 0.0513 0.0246 2.09 0.0370
## age' -0.1745 0.1128 -1.55 0.1221
## age'' 0.9259 0.6024 1.54 0.1242
## age''' -1.2260 0.8784 -1.40 0.1628
As a reminder, prevalence is estimated using incidence and survival data from n years. However, registry data (and thus known incidence and survival) data may only be known for r years, where r <= n. If r < n, the remaining n-r years of incidence and survival are simulated using Monte Carlo techniques. If the incidence and survival models are well specified then the prevalence estimates should be reliable, however, it is beneficial to check the performance of these bootstrapped models (their variance in particular) before drawing any conclusions from the results.
prevalence_total <- prevalence(index='2013-01-30',
num_years_to_estimate=c(3, 5, 10, 20),
data=prevsim,
inc_formula = entrydate ~ sex,
surv_formula = Surv(time, status) ~ age + sex,
dist='weibull',
population_size = 1e6,
death_column = 'eventdate')
As a test of whether the model is predicting reasonable values of
prevalence, we can use the fact that we can directly measure the
discrepancy between the predicted and actual prevalence for the
available registry years. This difference can be formally tested with an
exact Poisson test of the counted prevalence from both the simulated
estimate and the known registry value; the resulting p-value resulting
is returned as an attribute of a prevalence
object, called
pval
.
## [1] 0.5087715
For this model, there is no evidence to reject the null hypothesis.
This can also be calculated manually with the
test_prevalence_fit
function.
## [1] 0.5087715
We do not provide any functions for diagnosing the bootstrapped
incidence models, however, all the objects are available in the
inc_models
attribute of the returned
prevalence
object and can be used to check for any errors
in fitting. If the default homogeneous Poisson process model is being
used then the techniques described earlier can be applied.
To inspect the distribution of the bootstrapped survival models (see Crouch et al. 2014 for details), a
survfit.prev
object can be constructed using the usual
survfit
call, accepting a data frame of new data,
defaulting to the average covariate values in the registry data. In the
example below, survival probability is estimated for a 60 year old
male:
prevsurv <- survfit(prevalence_total, newdata=data.frame(age=60, sex='M', stringsAsFactors = TRUE))
prevsurv
## Survival probability calculated at 4444 timepoints, across 1000 bootstraps.
The summary
method provides N-year survival
probabilities, with N specified as an argument vector:
## Survival probability estimated using 1000 bootstrap survival curves:
## 1 year survival: 0.715 (0.684 - 0.746)
## 3 year survival: 0.551 (0.511 - 0.592)
## 5 year survival: 0.458 (0.415 - 0.503)
## 10 year survival: 0.326 (0.281 - 0.373)
Plotting this object displays the survival curve of a Weibull model
using the original data set (orange), along with a 95% confidence band
derived using the bootstrapped models. This plot is useful to assess the
variability of the survival models. Further manual inspection can be
carried out by looking at the objects themselves, saved in the
surv_models
attribute of the prevalence
object.
## Warning: `arrange_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `arrange()` instead.
## ℹ See vignette('programming') for more help
## ℹ The deprecated feature was likely used in the rprev package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `mutate_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `mutate()` instead.
## ℹ See vignette('programming') for more help
## ℹ The deprecated feature was likely used in the rprev package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `summarise_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `summarise()` instead.
## ℹ The deprecated feature was likely used in the rprev package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `group_by_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `group_by()` instead.
## ℹ See vignette('programming') for more help
## ℹ The deprecated feature was likely used in the rprev package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The simulated
attribute of the prevalence
object holds a data.table
with the incident population from
every simulation, along with derived fields indicating whether they
contributed to prevalence at the index date of any years of interest. It
can be used as an overall check of the appropriateness of the generated
incidence population, as well as identifying any discrepancies in the
survival modelling.
sim | sex | age | incident_date | alive_at_index | prev_3yr | prev_5yr | prev_10yr | prev_20yr | prev_registry |
---|---|---|---|---|---|---|---|---|---|
1 | M | 67.18160 | 1993-01-30 | 1 | 0 | 0 | 0 | 1 | 0 |
1 | M | 73.72316 | 1993-02-03 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | M | 55.23388 | 1993-02-10 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | M | 62.83518 | 1993-02-18 | 1 | 0 | 0 | 0 | 1 | 0 |
1 | M | 48.25562 | 1993-03-06 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | M | 56.03014 | 1993-03-09 | 0 | 0 | 0 | 0 | 0 | 0 |