This document contains some discussion points from class.
Some of you (especially those with Windows machines) ran into error messages along the lines of:
*** arch - i386
ERROR: compilation failed for package 'survMisc'
Similar issues have been reported here, here, and elsewhere. A workaround is to run the following code before the command that produces the error message:
> library(devtools)
> assignInNamespace("version_info", c(devtools:::version_info, list("3.5" = list(version_min = "3.3.0", version_max = "99.99.99", path = "bin"))), "devtools")
See here for the original discussion. This enables the Rtools toolchain to be used alongside devtools. However, this is only a temporary fix, and the user might need to run these lines every time devtools is loaded.
A possible reason for this issue is that Rtools is not installed or not being detected properly. In that case, the user can try installing Rtools for Windows from this link. Make sure to check “Add rtools to system PATH”:
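A quick way to see whether R can find the build toolchain is sketched below. The base-R check is safe to run anywhere; the commented line assumes the pkgbuild package (a devtools dependency) is installed.

```r
## Quick check (base R): is a build toolchain visible on the PATH?
## On Windows, an empty result suggests Rtools is missing or not on the PATH.
Sys.which("make")

## With the pkgbuild package installed, a fuller diagnostic is available:
## pkgbuild::has_build_tools(debug = TRUE)
```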
In one of the class examples, we plotted the empirical survival curve with:
> whas100 %>% filter(fstat > 0) %>% mutate(surv = 1 - ecdf(lenfol)(lenfol)) %>%
+ ggplot(aes(lenfol, surv)) + geom_step()
The code can be modified as follows so that the empirical survival curve starts from \(\hat{S}_e(0)= 1\):
> whas100 %>% filter(fstat > 0) %>% mutate(surv = 1 - ecdf(lenfol)(lenfol)) %>%
+ add_row(lenfol = 0, surv = 1,.before = 1) %>% ggplot(aes(lenfol, surv)) +
+ geom_step() + labs(title = "Codes provided by Tian Jiang")
The following example produces a similar plot without the need to store \(\hat{S}_e(t)\) in surv.
> whas100 %>% filter(fstat > 0) %>%
+ ggplot(aes(x = lenfol)) + geom_step(aes(y = 1 - ..y..), stat = "ecdf", show.legend = TRUE) +
+ ylab("surv")
Here we have \(\hat{S}_e(t) = 1\) for \(t < 6\) and \(\hat{S}_e(t) = 0\) for \(t \ge 2710\).
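The `1 - ecdf(lenfol)(lenfol)` construction can be sanity-checked on a small toy vector (illustrative times, not the whas100 data): \(\hat{S}_e(t)\) is just the fraction of observations strictly greater than \(t\).

```r
## Sanity check of the 1 - ecdf(t)(t) construction on toy data.
times <- c(2, 5, 5, 9)
surv_hat <- 1 - ecdf(times)(times)
## By definition, S_e(t) = (# observations strictly greater than t) / n:
by_hand <- sapply(times, function(t) mean(times > t))
all.equal(surv_hat, by_hand)  # TRUE
```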
Suppose we want to generate the sequence \(\{1, -1, \ldots, -1, 1\}\) with 10 1’s and 9 -1’s (the vector then has length = 19). Here are some possible approaches:
> (-1)^(0:18)
[1] 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1
> diff(0:19 %% 2)
[1] 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1
> diff(rep(0:1, 10))
[1] 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1
> rep(c(1, -1), 10)[1:19]
[1] 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1
> ifelse(1:19 %% 2, 1, -1)
[1] 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1
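One more variation, not among those shown in class: a cumulative product flips the sign at every \(-1\), giving the same alternating pattern.

```r
## A cumulative product flips the sign at each -1, alternating 1, -1, ...
cumprod(c(1, rep(-1, 18)))
```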
lm
Suppose we want to fit a linear regression with y as the outcome variable and \(\{\)X, Z\(\}\) as the covariates, where X is a family of covariates \(\{\)x1, x2, …, x100\(\}\) generated from standard uniform and Z is a family of covariates \(\{\)z1, z2, …, z100\(\}\) generated from standard normal. The data frame is observed in a way that the X’s and Z’s are placed in no particular order. For example, consider the following dat.
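A data set with this structure can be simulated along the following lines. This is only a sketch: the seed, sample size, and true coefficients are illustrative assumptions, since the code that generated the actual dat is not shown.

```r
## Sketch of simulating `dat` (base R; seed, n, and coefficients are
## illustrative assumptions, not the values used for the actual data).
set.seed(123)
n <- 500
X <- matrix(runif(n * 100), n, 100, dimnames = list(NULL, paste0("x", 1:100)))
Z <- matrix(rnorm(n * 100), n, 100, dimnames = list(NULL, paste0("z", 1:100)))
dat <- as.data.frame(cbind(X, Z)[, sample(200)])  # columns in random order
## Illustrative outcome model; the fitted coefficients above suggest a few
## x's with effects near +/- 1.
dat$y <- -0.25 + X[, "x88"] + X[, "x56"] - X[, "x77"] + rnorm(n, sd = 0.1)
dim(dat)  # 500 rows, 201 columns
```

Wrap the result in as_tibble() if a tibble is preferred, as in the display above.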
> ## fake data
> datatable(dat, options = list(pageLength = 10, scrollX=TRUE)) %>% formatRound(1:201, 3)
This fits all 200 covariates:
> head(coef(lm(y ~ ., dat)), 10)
(Intercept) x88 z17 x56 x77
-0.154390393 0.846932399 -0.001144469 1.066026142 -0.991538457
z86 z14 z31 x52 x22
-0.005226447 0.051806282 0.039848434 1.032491798 0.987676788
> length(coef(lm(y ~ ., dat)))
[1] 201
This fits a linear regression with only the X’s:
> head(coef(lm(y ~ ., dat[,grep("x|y", names(dat))])))
(Intercept) x88 x56 x77 x52 x22
-0.2496609 0.8704349 1.0865774 -0.9767312 0.9717011 1.0111935
> length(coef(lm(y ~ ., dat[,grep("x|y", names(dat))])))
[1] 101
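The grep("x|y", …) subsetting works here only because none of the z names happen to contain an “x” or a “y”. A sketch of a more robust alternative builds the formula explicitly with base R’s reformulate(), using an anchored pattern (toy data below, not the dat above):

```r
## Alternative to subsetting columns: build the formula explicitly.
## grep("^x") anchors the match, so only names starting with "x" qualify.
set.seed(1)
toy <- data.frame(y = rnorm(20),
                  x1 = runif(20), x2 = runif(20),
                  z1 = rnorm(20), z2 = rnorm(20))
f <- reformulate(grep("^x", names(toy), value = TRUE), response = "y")
f                         # y ~ x1 + x2
length(coef(lm(f, toy)))  # 3: intercept plus x1 and x2
```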
where, as in most regression models, the residuals sum to zero: \(\sum_{i=1}^n M_i(t_i)=0\). Consider a simulated dataset where the true survival time is generated from a standard exponential and the censoring time from a uniform distribution.
> n <- 100
> set.seed(1)
> sim <- tibble(cenTime = runif(n, 0, 5), time = pmin(rexp(n), cenTime),
+ status = 1 * (time < cenTime)) %>%
+ select(-cenTime)
The following code can be used to create a new dataset containing variables like \(d_i\), \(n_i\), etc. (the shortcut ni = 100:1 works here because all 100 simulated event times happen to be distinct; the aml example further down uses the more general rev(cumsum(rev(ni)))):
> dat <- sim %>% arrange(time) %>% group_by(time) %>%
+ summarize(di = sum(status), ni = length(status)) %>%
+ mutate(Nt = cumsum(di), ni = 100:1)
> dat
# A tibble: 100 x 4
time di ni Nt
<dbl> <dbl> <int> <dbl>
1 0.0203 1 100 1
2 0.0372 1 99 2
3 0.0521 1 98 3
4 0.0593 1 97 4
5 0.0694 1 96 5
6 0.0897 1 95 6
7 0.0911 1 94 7
8 0.0988 1 93 8
9 0.110 1 92 9
10 0.117 0 91 9
# … with 90 more rows
The individual martingales can be computed with
> dat <- dat %>% mutate(hi = di / ni, Hi = cumsum(hi), Mi = di - Hi)
> dat
# A tibble: 100 x 7
time di ni Nt hi Hi Mi
<dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.0203 1 100 1 0.01 0.01 0.99
2 0.0372 1 99 2 0.0101 0.0201 0.980
3 0.0521 1 98 3 0.0102 0.0303 0.970
4 0.0593 1 97 4 0.0103 0.0406 0.959
5 0.0694 1 96 5 0.0104 0.0510 0.949
6 0.0897 1 95 6 0.0105 0.0616 0.938
7 0.0911 1 94 7 0.0106 0.0722 0.928
8 0.0988 1 93 8 0.0108 0.0829 0.917
9 0.110 1 92 9 0.0109 0.0938 0.906
10 0.117 0 91 9 0 0.0938 -0.0938
# … with 90 more rows
Checking the Nelson–Aalen estimator:
> qplot(time, Hi, data = dat, geom = "step") +
+ geom_abline(intercept = 0, slope = 1, col = 2) + ylab("H(t)")
In this case, the sum of the martingale residuals is sum(dat$Mi) = \(4.302114 \times 10^{-16}\), close to 0 as expected.
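The same check can be repeated on freshly simulated data. The sketch below mirrors the construction above in base R (no dplyr), with an assumed seed and sample size; the residuals again sum to zero up to floating-point error.

```r
## Base-R re-check: martingale residuals of the Nelson-Aalen fit sum to zero.
set.seed(2)
n <- 250
cen    <- runif(n, 0, 5)
time   <- pmin(rexp(n), cen)
status <- as.numeric(time < cen)
status <- status[order(time)]
ni <- n:1                  # risk-set sizes (times are distinct here)
Hi <- cumsum(status / ni)  # Nelson-Aalen cumulative hazard at each time
Mi <- status - Hi          # individual martingale residuals
sum(Mi)                    # numerically zero
```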
In the aml dataset, the sum of the martingale residuals is
> dat <- aml %>% arrange(time) %>% select(-x) %>% group_by(time) %>%
+ summarize(di = sum(status), ni = length(status)) %>%
+ mutate(Nt = cumsum(di), ni = rev(cumsum(rev(ni))), Na = cumsum(di/ni)) %>% right_join(aml)
> with(dat, sum(status - Na))
[1] 3.330669e-16
“Cox described the proportional hazards model in Cox (1972), in what is now one of the most quoted statistical papers in history. He also outlined in this paper the method for estimation which he referred to as using conditional likelihood. It was pointed out to him in the literature that what he proposed was not a conditional likelihood and that there may be some flaws in his logic. Cox (1975) was able to recast his method of estimation through what he called ‘partial likelihood’ and published this in Biometrika. This approach seemed to be based on sound inferential principles. Rigorous proofs showing the consistency and asymptotic normality were not published until 1981 when Tsiatis and others (1981) demonstrated these large sample properties. In 1982, Andersen and Gill (1982) simplified and generalized these results through the use of counting processes.”
The origin of this historical note can be found here.
Andersen, Per Kragh, and Richard David Gill. 1982. “Cox’s Regression Model for Counting Processes: A Large Sample Study.” The Annals of Statistics 10 (4): 1100–1120.
Cox, David R. 1972. “Regression Models and Life-Tables.” Journal of the Royal Statistical Society. Series B (Methodological) 34 (2): 187–220.
———. 1975. “Partial Likelihood.” Biometrika 62 (2): 269–76.
Tsiatis, Anastasios A., and others. 1981. “A Large Sample Study of Cox’s Regression Model.” The Annals of Statistics 9 (1): 93–108.