In this vignette, we demonstrate the usage of the aftsrr() function from the aftgee package (Chiou, Kang, and Yan 2014a, 2014b). The vignette is based on aftgee version 1.1.7.

> library(aftgee)
> packageVersion("aftgee")
[1] '1.1.7'

AFT model

The univariate semiparametric semiparametric accelerated failure time (AFT) model is specified as \[\log(T_i) = X_i^\top\beta + \epsilon_i, i = 1, \ldots, n, \] where \(T_i\) is the failure time, \(X_i\) is the \(p\times1\) covariate vector, \(\beta\) is the \(p\times1\) regression coefficient, and \(\epsilon_i\)’s are independent and identically distributed random variables with an unspecified distribution. In the presence of right censoring, the observed data are independent copies of \((Y_i, \Delta_i, X_i), i = 1, \ldots, n\), where \(Y_i = \min(T_i, C_i), \Delta_i = I(T_i < C_i)\), and \(I(\cdot)\) is the indicator function.

Rank-based estimator

The regression parameters, \(\beta\), can be estimated by solving the rank-based weighted estimating equation, \[U_{n, \varphi}(\beta) = \sum_{i = 1}^nw_i\varphi_i(\beta)\Delta_i \left[X_i - \frac{\sum_{j = 1}^nw_jX_jI\{e_j(\beta)\ge e_i(\beta)\}}{\sum_{j = 1}^nw_jI\{e_j(\beta)\ge e_i(\beta)\}}\right] = 0,\] where \(e_i(\beta) = Y_i - X_i^\top\beta\), \(w_i\) is a sampling weight, and \(\varphi(\beta)\) is a possibly data-dependent non-negative weight function with values between 0 and 1. Let \(\widehat{F}_{e_i(\beta)}(t)\) be the estimated cumulative distribution based on the residuals \(e_i(\beta)\)’s. Common choices of \(\varphi_i(\beta)\) implemented in the aftgee package are.

  • Logrank weight (Prentice 1978); \(\varphi_i(\beta) = 1\)
  • Gehan weight (Gehan 1965); \(\varphi_i(\beta) = n^{-1}\sum_{j = 1}^nI\{e_j(\beta)\ge e_i(\beta)\}\)
  • Prentice-Wilcoxon weight (Prentice 1978); \(\varphi_i(\beta) = 1 - \widehat{F}_{e_i(\beta)}(t)\)
  • General \(G^p\) class weight (Harrington and Fleming 1982); \(\varphi_i(\beta) = \left[1 - \widehat{F}_{e_i(\beta)}(t)\right]^p, p\ge0\)

The solution to \(U_{n, \varphi}(\beta) = 0\), denoted by \(\widehat\beta_{n, \varphi}\), is consistent to the true parameter, \(\beta_0\), and is asymptotically normal (e.g., Tsiatis 1990). We used the Kaplan-Meier estimator to obtain \(\widehat{F}_{e_i(\beta)}(t)\). The Barzilai-Borwein spectral method implemented in package BB package (Varadhan and Gilbert 2009) is the used to solve \(U_{n, \varphi}(\beta) = 0\) directly.

Smoothed rank-based estimator

A computationally more efficient approach is the induced smoothing procedure (Brown and Wang 2005, 2007). The idea is to replace the nonsmooth estimating equations with a smoothed version, whose solutions are asymptotically equivalent to those of the former. Define an independent \(p\times1\) standard normal random vector \(Z\) and a \(p\times p\) matrix \(\Gamma_n\) such that \(\Gamma_n^2 = \Sigma_n\), where \(\Sigma_n\) is a symmetric positive definite matrix. The induced smoothing procedure replaces \(U_{n, \varphi}(\beta)\) with \(E_Z[U_{n, \varphi}(\beta + n^{-1/2}\Gamma_nZ)]\), where the expectation is taken with respect to \(Z\). With the Gehan’s weight, the induced smooth estimating equation is \[\widetilde{U}_{n, G}(\beta) = \sum_{i = 1}^n\sum_{j = 1}^nw_iw_j\Delta_i(X_i - X_j) \Phi\left[\frac{e_j(\beta) - e_i(\beta)}{r_{ij}}\right] = 0, \] where \(r_{ij}^2 = (X_i - X_j)^\top\Sigma_n(X_i - X_j)\) and \(\Phi(\cdot)\) is the standard normal cumulative distribution function. The smooth Gehan estimating equation is monotone and continuously differentiable with respect to \(\beta\), hence, its root can be found with standard numerical methods such as the Barzilai-Borwein spectral method.

On the other hand, deriving the smoothed estimating equations with general weights is challenging because \(E_Z[U_{n, \varphi}(\beta + n^{-1/2}\Gamma_nZ)]\) involves the expectation of the ratio of two random quantities. The aftgee package implements the iterative induced smoothing procedure proposed in Chiou, Kang, and Yan (2015). For general weights, the regression parameters, \(\beta\), can be estimated with the following steps:

  1. Obtain an initial estimate \(\widetilde\beta_{n, \varphi}^{(0)} = b_n\) of \(\beta\) and initialize with \(m = 1\).
  2. Update \(\widetilde\beta_{n, \varphi}^{(m)}\) by solving \(\widetilde{U}_{n, \varphi}(\widetilde\beta_{n, \varphi}^{(m-1)} , \widetilde\beta_{n, \varphi}^{(m)}) = 0\).
  3. Increase \(m\) by one and repeat Step 2 until \(\left|\widetilde\beta_{n, \varphi, q}^{(m-1)} - \widetilde\beta_{n, \varphi, q}^{(m)}\right| < t\) for all \(q = 1, \ldots, p\), where \(\widetilde\beta_{n, \varphi, q}^{(m)}\) is the \(q\)th component of \(\widetilde\beta_{n, \varphi}^{(m)}\) and \(t\) is a prefixed tolerance.

The smoothed estimating equation, \(\widetilde{U}_{n, \varphi}(b, \beta)\), has the form \[\widetilde{U}_{n, \varphi}(b, \beta) = \sum_{i = 1}^n\sum_{j = 1}^nw_iw_j\Delta_i\phi_i(b)(X_i - X_j)\Phi\left[\frac{e_j(\beta) - e_i(\beta)}{r_{ij}}\right] = 0.\] A simple choice of the initial estimator is the easy-to-compute Gehan’s estimator, \(\widetilde\beta_{n, G}\).

The aftsrr() interface

The aftsrr() is used to fit a semiparametric AFT model with rank-based approaches. The function interface of aftsrr() is

> args(aftsrr)
function (formula, data, subset, id = NULL, contrasts = NULL, 
    weights = NULL, B = 100, rankWeights = c("gehan", "logrank", 
        "PW", "GP", "userdefined"), eqType = c("is", "ns", "mis", 
        "mns"), se = c("NULL", "bootstrap", "MB", "ZLCF", "ZLMB", 
        "sHCF", "sHMB", "ISCF", "ISMB"), control = list()) 
NULL

The arguments are:

  • formula: a formula expression, of the form response ~ predictors. The response is a Surv object object with right censoring.
  • data: an optional data.frame in which to interpret the variables occurring in the formula.
  • subset: an optional vector specifying a subset of observations to be used in the fitting process.
  • id: an optional vector used to identify the clusters. If missing, each individual row of data is presumed to represent a distinct subject.
  • contrasts: an optional list.
  • weights: an optional vector of observation weights.
  • B: a numeric value specifies the resampling number. When B = 0 or se = NULL, only the point estimator will be displayed.
  • rankWeights: a character string specifying the type of general weights. The following are permitted:
    • logrank: logrank weight.
    • gehan: Gehan’s weight.
    • PW: Prentice-Wilcoxon weight.
    • GP: GP class weight.
    • userdefined: a user defined weight provided as a vector with length equal to the number of subject. This argument is still under-development.
  • eqType: a character string specifying the type of the estimating equation used to obtain the regression parameters. The following are permitted:
    • is: Regression parameters are estimated by directly solving the induced-smoothing estimating equations. This is the default and recommended method.
    • ns: Regression parameters are estimated by directly solving the nonsmooth estimating equations.
    • mis: Regression parameters are estimated by iterating the monotonic induced smoothing Gehan-based estimating equations. This is typical when rankWeights = "PW" and rankWeights = "GP".
    • mns: Regression parameters are estimated by iterating the monotonic non-smoothed Gehan-based estimating equations. This is typical when rankWeights = "PW" and rankWeights = "GP".
  • se: a character string, or a list of character strings, specifying the estimating method for the variance-covariance matrix. The following are permitted:
    • NULL: The variance-covariance matrix will not be computed.
    • bootstrap: nonparametric bootstrap.
    • MB: multiplier resampling.
    • ZLCF: Zeng and Lin’s approach with closed form \(V\).
    • ZLMB: Zeng and Lin’s approach with empirical \(V\).
    • sHCF: Huang’s approach with closed form \(V\).
    • sHMB: Huang’s approach with empirical \(V\).
    • ISCF: Johnson and Strawderman’s sandwich variance estimates with closed form \(V\).
    • ISMB: Johnson and Strawderman’s sandwich variance estimates with empirical \(V\).
  • control: a list of control parameters.

The equation types

The arguments eqType = "is"estimates the regression parameters by solving the non-smoothed estimating equation, \(U_{n, \varphi}(\beta)\) directly. The arguments eqType = "is"estimates the regression parameters by solving the approximated induced smoothing estimating equation \[U_{n, \varphi}(\beta) = \sum_{i = 1}^nw_i\varphi_i(\beta)\Delta_i \left[X_i - \frac{\sum_{j = 1}^nw_jX_j\Phi\{e_j(\beta)\ge e_i(\beta)\}}{\sum_{j = 1}^nw_j\Phi\{e_j(\beta)\ge e_i(\beta)\}}\right] = 0.\] The arguments eqType = "mis"estimates the regression parameters using the above mentioned iterative procedure with \[\widetilde{U}_{n, \varphi}(b, \beta) = \sum_{i = 1}^n\sum_{j = 1}^nw_iw_j\Delta_i\phi_i(b)(X_i - X_j)\Phi\left[\frac{e_j(\beta) - e_i(\beta)}{r_{ij}}\right] = 0.\] The arguments eqType = "mis"estimates the regression parameters using the same algorithm except with the induced smoothing \(\widetilde{U}_{n, \varphi}(b, \beta)\) replaced with \[\widehat{U}_{n, \varphi}(b, \beta) = \sum_{i = 1}^n\sum_{j = 1}^nw_iw_j\Delta_i\phi_i(b)(X_i - X_j)I\left[\frac{e_j(\beta) - e_i(\beta)}{r_{ij}}\right] = 0.\]

Variance estimation

The aftsrr() allows users to estiamte the variance-covariance matrix of \(\beta\) through the nonparametric bootstrap or the multiplier resampling procedure. Bootstrap samples that failed to converge are removed when computing the empirical variance matrix. When bootstrap is not called, we assume the variance-covariance matrix has a sandwich form \[\Sigma = A^{-1}V(A^{-1})^\top,\] where \(V\) is the asymptotic variance of the estimating function and \(A\) is the slope matrix. In aftgee, we provide seveal methods to estimate the variance-covariance matrix via this sandwich form, depending on how \(V\) and \(A\) are estimated. Specifically, the asymptotic variance, \(V\), can be estimated by either a closed-form formulation (CF) or through bootstrap the estimating equations (MB). On the other hand, the methods to estimate the slope matrix \(A\) are the induced smoothing approach (IS), Zeng and Lin’s approach (ZL) (Zeng and Lin 2008), and the smoothed Huang’s approach (sH) (Huang 2002).

The control list

The control list controls equation solver, maxiter, tolerance, and resampling variance estimation. The available equation solvers are BBsolve and dfsane of the BB package. The default algorithm control parameters are used when these functions are called. However, the monotonicity parameter, M, can be specified by users via the control list. When M is specified, the merit parameter, noimp, is set at 10 * M. The readers are refered to the BB package for details. Instead of searching for the zero crossing, options including BBoptim and optim will return solution from maximizing the corresponding objective function. When se = "bootstrap" or se = "MB", an additional argument parallel = TRUE can be specified to enable parallel computation. The number of CPU cores can be specified with parCl, the default number of CPU cores is the integer value of detectCores() / 2.

An simulated example

The following function generates simulated data following an AFT model \[\log(T) = 2 + X_1 + X_2 + \epsilon, \] where \(X_1\) is a Bernoulli random variable with \(p = 0.5\), \(X_2\) is a standard normal random variable, and \(\epsilon\) is an independent standard normal random variable.

> datgen <- function(n = 100) {
+   x1 <- rbinom(n, 1, 0.5)
+   x2 <- rnorm(n)
+   e <- rnorm(n)
+   tt <- exp(2 + x1 + x2 + e)
+   cen <- runif(n, 0, 100)
+   data.frame(Time = pmin(tt, cen), status = 1 * (tt < cen),
+              x1 = x1, x2 = x2, id = 1:n)
+ }
> set.seed(1); head(dat <- datgen(n = 50))
       Time status x1          x2 id
1  9.349450      1  0 -0.05612874  1
2  4.058903      1  0 -0.15579551  2
3  4.619807      1  1 -1.47075238  3
4 13.412556      1  1 -0.47815006  4
5  6.224049      1  0  0.41794156  5
6 10.880632      0  1  1.35867955  6

The observed data, \((Y, \Delta, X)\), correspond to Time, status, x1, and x2. The following gives the induced smoothing Gehan estimate (default) of \(\beta\) with the variance matrix estimated by the induced smoothing approach and Zeng and Lin’s approach.

> system.time(fit1 <- aftsrr(Surv(Time, status) ~ x1 + x2, data = dat, se = c("ISMB", "ZLMB")))
   user  system elapsed 
  0.084   0.000   0.084 
> summary(fit1)
Call:
aftsrr(formula = Surv(Time, status) ~ x1 + x2, data = dat, se = c("ISMB", 
    "ZLMB"))

Variance Estimator: ISMB
   Estimate StdErr z.value   p.value    
x1    1.012  0.189   5.363 < 2.2e-16 ***
x2    0.817  0.098   8.324 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variance Estimator: ZLMB
   Estimate StdErr z.value   p.value    
x1    1.012  0.208   4.862 < 2.2e-16 ***
x2    0.817  0.098   8.361 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The non-smooth counterpart presented below yield similar results.

> system.time(fit2 <- aftsrr(Surv(Time, status) ~ x1 + x2, data = dat, 
+                            eqType = "ns", se = c("ISMB", "ZLMB")))
   user  system elapsed 
  0.092   0.000   0.093 
> summary(fit2)
Call:
aftsrr(formula = Surv(Time, status) ~ x1 + x2, data = dat, eqType = "ns", 
    se = c("ISMB", "ZLMB"))

Variance Estimator: ISMB
   Estimate StdErr z.value   p.value    
x1    1.023  0.188   5.446 < 2.2e-16 ***
x2    0.820  0.086   9.575 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variance Estimator: ZLMB
   Estimate StdErr z.value   p.value    
x1    1.023  0.219   4.678 < 2.2e-16 ***
x2    0.820  0.090   9.157 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Prentice-Wilcoxon estimator obtained by the iterative procedure also yields similar results.

> system.time(fit3 <- aftsrr(Surv(Time, status) ~ x1 + x2, data = dat, 
+                            rankWeights = "PW", se = c("ISMB", "ZLMB")))
   user  system elapsed 
  1.608   0.004   1.610 
> summary(fit3)
Call:
aftsrr(formula = Surv(Time, status) ~ x1 + x2, data = dat, rankWeights = "PW", 
    se = c("ISMB", "ZLMB"))

Variance Estimator: ISMB
   Estimate StdErr z.value   p.value    
x1    1.022  0.171   5.978 < 2.2e-16 ***
x2    0.818  0.100   8.147 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variance Estimator: ZLMB
   Estimate StdErr z.value   p.value    
x1    1.022  0.206   4.952 < 2.2e-16 ***
x2    0.818  0.111   7.363 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The following provides an example with the variance estimate obtained by bootstrap while taking advantage of parallel computing. The bootstrap variance estimator is compatible with the sandwich variance estimator.

> system.time(fit4 <- aftsrr(Surv(Time, status) ~ x1 + x2, data = dat, se = "bootstrap", 
+                            control = list(parallel = TRUE)))
   user  system elapsed 
  0.024   0.016   1.649 
> summary(fit4)
Call:
aftsrr(formula = Surv(Time, status) ~ x1 + x2, data = dat, se = "bootstrap", 
    control = list(parallel = TRUE))

Variance Estimator: bootstrap
   Estimate StdErr z.value   p.value    
x1    1.012  0.188   5.377 < 2.2e-16 ***
x2    0.817  0.104   7.855 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

National Wilm’s Tumor Study

We demonstrate the performance of the weighted approach with an application to the cohort study conducted by the National Wilm’s Tumor Study Group (NWTSG) (Breslow and Chatterjee 1999). The interest of the study was to assess the relationship between the tumor histology and the outcome, time to tumor relapse. The data set can be loaded with the following code.

> data(nwtco, package = "survival")
> head(nwtco)
  seqno instit histol stage study rel edrel age in.subcohort
1     1      2      2     1     3   0  6075  25        FALSE
2     2      1      1     2     3   0  4121  50        FALSE
3     3      2      2     1     3   0  6069   9        FALSE
4     4      2      1     4     3   0  6200  28         TRUE
5     5      2      2     2     3   0  1244  55        FALSE
6     6      1      1     2     3   0  2932  32        FALSE

The variables of interest are the following.

  • seqno: Patient id
  • edrel: Time to relapse
  • red: Indicator for relapse
  • histol: Histology from central lab; (1 = favorable, 0 = unfavorable)
  • age: Patient’s age in month
  • study: Study group; (3 = NWTSG-3 and 4 = NWTSG-4)

There were a total of 4028 subjects in the full cohort. Among them, 571 were cases who experienced the relapse of tumor; a censoring rate of about 86%. The following codes converts the variables, histol, stage, and study into a factor and transform age to year.

> nwtco$age <- nwtco$age / 12
> nwtco$histol <- factor(nwtco$histol)
> nwtco$stage <- factor(nwtco$stage)
> nwtco$study <- factor(nwtco$study)

We first fit the full cohort data.

> system.time(fit.full <- aftsrr(Surv(edrel, rel) ~ histol + age + study, 
+                                data = nwtco, se = "ISMB"))
   user  system elapsed 
 31.296   0.004  31.304 
> summary(fit.full)
Call:
aftsrr(formula = Surv(edrel, rel) ~ histol + age + study, data = nwtco, 
    se = "ISMB")

Variance Estimator: ISMB
        Estimate StdErr z.value p.value    
histol2   -3.220  0.149 -21.680  <2e-16 ***
age       -0.231  0.026  -8.794  <2e-16 ***
study4    -0.024  0.193  -0.124   0.901    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We now create an artificial case-cohort data to evaluate the performance of the weighted approach. The following creates a case0cohort with 668 patients selected as sub-cohort sample to form the total case-cohort sample size of 1154. We also created the case-cohort weight accordingly.

> set.seed(1)
> subinx <- sample(1:nrow(nwtco), 668, replace = FALSE)
> nwtco$subcohort <- 0
> nwtco$subcohort[subinx] <- 1
> pn <- mean(nwtco$subcohort)
> nwtco$hi <- nwtco$rel + ( 1 - nwtco$rel) * nwtco$subcohort / pn

The case-cohort analysis is provided below. Point estimates are close to these in case-cohort analysis, with their standard errors taken into consideration. All the standard errors decrease compared to the case-cohort analyses, which is expected as full information became available for all covariates.

> system.time(fit.case <- aftsrr(Surv(edrel, rel) ~ histol + age + study, weights = hi, 
+                                data = nwtco, subset = subcohort > 0, se = "ISMB"))
   user  system elapsed 
  0.864   0.000   0.864 
> summary(fit.case)
Call:
aftsrr(formula = Surv(edrel, rel) ~ histol + age + study, data = nwtco, 
    subset = subcohort > 0, weights = hi, se = "ISMB")

Variance Estimator: ISMB
        Estimate StdErr z.value p.value    
histol2   -4.231  0.450  -9.412  <2e-16 ***
age       -0.239  0.088  -2.724   0.006 ** 
study4     0.645  0.555   1.163   0.245    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Reference

Breslow, Norman E, and Nilanjan Chatterjee. 1999. “Design and Analysis of Two-Phase Studies with Binary Outcome Applied to Wilms Tumour Prognosis.” Journal of the Royal Statistical Society: Series C (Applied Statistics) 48 (4): 457–68.

Brown, Bruce M, and You-Gan Wang. 2005. “Standard Errors and Covariance Matrices for Smoothed Rank Estimators.” Biometrika 92 (1): 149–58.

———. 2007. “Induced Smoothing for Rank Regression with Censored Survival Times.” Statistics in Medicine 26 (4): 828–36.

Chiou, Sy Han, Sangwook Kang, and Jun Yan. 2014a. “Fitting Accelerated Failure Time Models in Routine Survival Analysis with R Package aftgee.” Journal of Statistical Software 61 (11): 1–23. https://doi.org/10.18637/jss.v061.i11.

———. 2014b. “Fitting Accelerated Failure Time Models in Routine Survival Analysis with R Package Aftgee.” Journal of Statistical Software 61 (11): 1–23.

———. 2015. “Rank-Based Estimating Equations with General Weight for Accelerated Failure Time Models: An Induced Smoothing Approach.” Statistics in Medicine 34 (9): 1495–1510.

Gehan, Edmund A. 1965. “A Generalized Wilcoxon Test for Comparing Arbitrarily Singly-Censored Samples.” Biometrika 52: 203–23.

Harrington, David P, and Thomas R Fleming. 1982. “A Class of Rank Test Procedures for Censored Survival Data.” Biometrika 69 (3): 553–66.

Huang, Yijian. 2002. “Calibration Regression of Censored Lifetime Medical Cost.” Journal of the American Statistical Association 97 (457): 318–27.

Prentice, R. L. 1978. “Linear Rank Tests with Right Censored Data (Corr: V70 P304).” Biometrika 65: 167–80.

Tsiatis, Anastasios A. 1990. “Estimating Regression Parameters Using Linear Rank Tests for Censored Data.” The Annals of Statistics 18: 354–72.

Varadhan, Ravi, and Paul Gilbert. 2009. “BB: An R Package for Solving a Large System of Nonlinear Equations and for Optimizing a High-Dimensional Nonlinear Objective Function.” Journal of Statistical Software 32 (4): 1–26. http://www.jstatsoft.org/v32/i04/.

Zeng, Donglin, and DY Lin. 2008. “Efficient Resampling Methods for Nonsmooth Estimating Functions.” Biostatistics 9 (2): 355–63.