Least-Squares Estimators for Accelerated Failure Time Model

In this vignette, we demonstrate the usage of the aftgee() function from the aftgee package(Chiou, Kang, and Yan 2014a, 2014b). The vignette is based on aftgee version 1.1.7.

> library(aftgee)
> packageVersion("aftgee")

[1] '1.1.7'

AFT model

The univariate semiparametric semiparametric accelerated failure time (AFT) model is specified as \[\log(T_i) = X_i^\top\beta + \epsilon_i, i = 1, \ldots, n, \] where \(T_i\) is the failure time, \(X_i\) is the \(p\times1\) covariate vector, \(\beta\) is the \(p\times1\) regression coefficient, and \(\epsilon_i\)’s are independent and identically distributed random variables with an unspecified distribution. In the presence of right censoring, the observed data are independent copies of \((Y_i, \Delta_i, X_i), i = 1, \ldots, n\), where \(Y_i = \min(T_i, C_i), \Delta_i = I(T_i < C_i)\), and \(I(\cdot)\) is the indicator function.

Least-squares estimator

With survival data from right censoring, Buckley and James (1979) proposed a least-squares approach that replaces each response \(T_i\) with the conditional expectation \(Y_i(\beta) = E_\beta(T_i|Y_i, \Delta_i, X_i)\). The theoretical properties of the Buckley and James estimator have been studied by Ritov (1990) and Lai and Ying (1991). However, the Buckley and James estimator is rarely used in practice due to numerical challenges. Jin, Lin, and Ying (2006) proposed an iterative procedure to generalize the Buckley and James estimator, making the latter more practical. Recently, Chiou et al. (2014) embedded the generalized estimating equation (GEE) procedure to generalized the iterative procedure to accommodates within-cluster dependence in cluster survival data. The GEE embedded approach accounts for multivariate dependence through working correlation structures to improve efficiency. To illustrate the idea, we expand the notations to multivariate AFT model, \[T_{ik} + X_{ik}^\top\beta + \epsilon_{ik}, i = 1, \ldots, n, k = 1, \ldots, K_i, \] where \(K_i\) is the \(i\) cluster size, \(\beta\) is an unknown \(p\times1\) vector of regression parameters and the error terms, \(\epsilon_{i}= \{\epsilon_{i1}, \ldots, \epsilon_{iK_i}\}\) are independent and identically distributed random variables with an unspecified distribution throughout clusters. In the presence of censoring, the observed data consists of copies of \(\{Y_{ik}, \Delta_{ik}, X_{ik}\}\) for \(i = 1, \ldots, n\), and \(k = 1, \ldots, K_i\), where \(Y_{ik} = \min(T_{ik}, C_{ik})\) and \(\Delta_{ik} = I(T_{ik} < C_{ik})\).

Define \(\Omega_{i}^{-1}\big(\alpha(b)\big)\) to be an \(K_i\times K_i\) nonsingular working weight matrix that may involve additional working parameters \(\alpha\) that depends on an initial value \(b\) of \(\beta_0\). For \(i = 1, \ldots, n\) and \(k = 1, \ldots, K_i\), let \(\widehat{\mathbf{Y}}_i(b)\), \(\mathbf{Y}_i\), \(\mathbf{T}_i\), \(\mathbf{C}_i\) and \(\mathbf{\Delta}_i\) be \(K_i\times 1\) vector formed by stacking \(\widehat{Y}_{ik}(b)\), \(Y_{ik}\), \(C_{ik}\) and \(\Delta_{ik}\), where \[\begin{equation*} \widehat{Y}_{ik}(b) = \Delta_{ik}Y_{ik} + (1 - \Delta_{ik}) \left[\frac{\int_{e_{ik}(b)}^\infty u \mathrm{d}\widehat{F}_{e_{ik}(\beta)}}{1 - \widehat{F}_{e_{ik}(\beta)}\{e_{ik}(b)\}} + X_{ik}^\top b \right]. \end{equation*}\] A generalization of iterative least-square estimating equation is \[\begin{equation} \label{equ:gee} U_{n, GEE}(\beta, b, \alpha) = \sum_{i=1}^n(\mathbf{X}_i - \bar{\mathbf{X}})^\top\Omega_i^{-1}\left\{\alpha(b)\right\}\{\widehat{\mathbf{Y}}_i(b) - \mathbf{X}_i\beta\} = 0, \end{equation}\] where \(\bar{\mathbf{X}}_i = \sum_{i=1}^n X_i / n\). Given \(\alpha\) and \(b\), the solution to \(U_{n, GEE}(\beta, b, \alpha) = 0\) has a closed form \[\begin{equation*} L_n(b, \alpha) = \left [ \sum_{i=1}^{n}(\mathbf{X}_{i}-\bar{\mathbf{X}})^{\top} \Omega_{i}^{-1}\big(\alpha(b)\big) (\mathbf{X}_{i}-\bar{\mathbf{X}}) \right ]^{-1} \left [ \sum_{i=1}^{n}(\mathbf{X}_{i}-\bar{X})^\top \Omega_{i}^{-1}\big(\alpha(b)\big) \left(\widehat{\mathbf{Y}}_{i}(b)-\bar{\mathbf{Y}}(b)\right)\right ], \end{equation*}\] where \(\bar{\mathbf{Y}}(b) = \sum_i^n\widehat{\mathbf{Y}}_i(b)/n\).

The GEE estimator, denoted by \(\widehat{\beta}_{n, GEE}\), can be obtained from an iterative procedure:

Obtain an initial estimate \(\widehat{\beta}^{(0)}_{n, GEE} = b_n\) of \(\beta\) and initialize with \(m = 1\).
Obtain an estimate \(\widehat\alpha_n\) of \(\alpha\) given \(\widehat\beta^{(m-1)}_{n, GEE}\), \(\widehat\alpha_n(\widehat{\beta}_{n, GEE}^{(m-1)})\).
Update with \(\widehat\beta^{(m)}_{n, GEE} = L_n\big(\widehat\beta^{(m-1)}_{n, GEE}, \widehat\alpha_n(\widehat\beta^{(m-1)}_{n, GEE})\big)\).
Increase \(m\) by one and repeat 2 and 3 until convergence.

The iteration proceeds with the aid of function in . The estimator reduces to the least squares estimator of when the working weight matrix \(\Omega_i\)’s are the identity matrix. We refer to for more details. The working parameter estimate \(\widehat{\alpha}_n\) does not affect the consistency of the GEE estimator, but may affect its efficiency. Higher efficiency can be achieved if \(\Omega_i\) is closer to the covariance matrix of \(\widehat{\mathbf{Y}}_i(b)\) and even an imperfect working weight still improves the efficiency . The variance of the estimator can again be estimated by resampling procedures.

The `aftgee()` interface

The aftgee() is used to fit a semiparametric AFT model with rank-based approaches. The function interface of aftgee() is

> args(aftgee)

function (formula, data, subset, id = NULL, contrasts = NULL, 
    weights = NULL, margin = NULL, corstr = "independence", binit = "srrgehan", 
    B = 100, control = aftgee.control()) 
NULL

The arguments are:

formula: a formula expression, of the form response ~ predictors. The response is a Surv object object with right censoring.
data: an optional data.frame in which to interpret the variables occurring in the formula.
subset: an optional vector specifying a subset of observations to be used in the fitting process.
id: an optional vector used to identify the clusters. If missing, each individual row of data is presumed to represent a distinct subject.
contrasts: an optional list.
weights: an optional vector of observation weights.(This is still under development).
margin: an optional vector to specify the margins; default at 1.
corstr: a character string specifying the correlation structure. The following are permitted: "independence", "exchangeable", "ar", "unstructured", "fixed".
binit: an optional vector can be either a numeric vector or a character string specifying the initial slope estimator. When binit is a vector, its length should be the same as the dimension of covariates. When binit is a character string, it should be either lm for simple linear regression, or srrgehan for smoothed Gehan weight estimator. The default value is “srrgehan”.
B: a numeric value specifies the resampling number. When B = 0, only the beta estimate will be displayed.
control: controls maxiter and tolerance.

In the uncensored case, aftgee() with independent working correlation structure will return an ordinary least squares estimate. In the multivariate case, efficiency can be improved in aftgee when the working correlation structure is close to the true correlation even in the absent of censoring.

An simulated example

The following function generates simulated data following an AFT model \[\log(T) = 2 + X_1 + X_2 + \epsilon, \] where \(X_1\) is a Bernoulli random variable with \(p = 0.5\), \(X_2\) is a standard normal random variable, and \(\epsilon\) is an independent standard normal random variable.

> datgen <- function(n = 100, tau = 0.3, dim = 2) {
+   x1 <- rbinom(dim * n, 1, 0.5)
+   x2 <- rnorm(dim * n)
+   e <- c(t(exp(MASS::mvrnorm(n = n, mu = rep(0, dim), Sigma = tau + (1 - tau) * diag(dim)))))
+   tt <- exp(2 + x1 + x2 + e)
+   cen <- runif(n, 0, 100)
+   data.frame(Time = pmin(tt, cen), status = 1 * (tt < cen),
+              x1 = x1, x2 = x2, id = rep(1:n, each = dim))
+ }

In datgen(), the error term \(\epsilon_i\) follows a multivariate normal distribution with mean 0 and variance covariance matrix \((1 - \tau) I_{K_i} + \tau J_{K_i}\), where \(I_d\) is the \(d\)-dimensional identity matrix and \(J_d\) is a \(d\times d\) matrix of ones. The arguments dim controls the cluster size, \(K_i\). The observed data, \((Y, \Delta, X)\), correspond to Time, status, x1, and x2. The following gives an example a simulated data from an AFT model generated by datgen().

> set.seed(1); head(dat <- datgen(n = 100, dim = 2))

       Time status x1          x2 id
1 11.801097      1  0 -0.62036668  1
2 68.486090      0  0  0.04211587  1
3  9.291650      1  1 -0.91092165  2
4 88.002129      1  1  0.15802877  2
5 11.835658      0  0 -0.65458464  3
6  3.910006      0  1  1.76728727  3

The following fits the GEE embedded with independence and exchangeable correlation structures.

> system.time(fit.ind <- aftgee(Surv(Time, status) ~ x1 + x2, data = dat, id = id, corstr = "ind"))

   user  system elapsed 
  3.972   0.012   3.987

> summary(fit.ind)

Call:
aftgee(formula = Surv(Time, status) ~ x1 + x2, data = dat, id = id, 
    corstr = "ind")

AFTGEE Estimator
            Estimate StdErr z.value   p.value    
(Intercept)    3.369  0.139  24.152 < 2.2e-16 ***
x1             1.054  0.156   6.751 < 2.2e-16 ***
x2             0.958  0.087  11.064 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> system.time(fit.ex <- update(fit.ind, corstr = "ex"))

   user  system elapsed 
  6.464   0.024   6.489

> summary(fit.ex)

Call:
aftgee(formula = Surv(Time, status) ~ x1 + x2, data = dat, id = id, 
    corstr = "ex")

AFTGEE Estimator
            Estimate StdErr z.value   p.value    
(Intercept)    3.369  0.123  27.396 < 2.2e-16 ***
x1             1.055  0.141   7.497 < 2.2e-16 ***
x2             0.960  0.082  11.774 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The standard errors decrease slightly when the exchangeable correlation structure is specified over the independent correlation structure.

Reference

Buckley, Jonathan, and Ian James. 1979. “Linear Regression with Censored Data.” Biometrika 66 (3): 429–36.

Chiou, Sy Han, Sangwook Kang, Junghi Kim, and Jun Yan. 2014. “Marginal Semiparametric Multivariate Accelerated Failure Time Model with Generalized Estimating Equations.” Lifetime Data Analysis 20 (4): 599–618.

Chiou, Sy Han, Sangwook Kang, and Jun Yan. 2014a. “Fitting Accelerated Failure Time Models in Routine Survival Analysis with R Package aftgee.” Journal of Statistical Software 61 (11): 1–23. https://doi.org/10.18637/jss.v061.i11.

———. 2014b. “Fitting Accelerated Failure Time Models in Routine Survival Analysis with R Package Aftgee.” Journal of Statistical Software 61 (11): 1–23.

Jin, Zhezhen, DY Lin, and Zhiliang Ying. 2006. “On Least-Squares Regression with Censored Data.” Biometrika 93 (1): 147–61.

Lai, Tze Leung, and Zhiliang Ying. 1991. “Large Sample Theory of a Modified Buckley-James Estimator for Regression Analysis with Censored Data.” The Annals of Statistics, 1370–1402.

Ritov, Yaacov. 1990. “Estimation in a Linear Regression Model with Censored Data.” The Annals of Statistics, 303–28.

Sy Han (Steven) Chiou

2022-05-11

AFT model

Least-squares estimator

The `aftgee()` interface

An simulated example

Reference

Least-Squares Estimators for Accelerated Failure Time Model

Sy Han (Steven) Chiou

2022-05-11

AFT model

Least-squares estimator

The aftgee() interface

An simulated example

Reference

The `aftgee()` interface